Monday, September 26, 2011

Data Warehousing Questions

Hi All,

We would like to share some data warehousing questions. This post covers the topic from start to finish at a high level.


Need for a Data Warehouse
To analyze data and maintain history.
Companies require strategic information to face competition in the market.

Operational systems are not designed to provide strategic information.
A data warehouse maintains the history of data for the whole organization and provides a single place where all of that data is stored.


What is data warehousing? Explain the approaches.

Many companies follow the characteristics defined by either W. H. Inmon or Sean Kelly.
Inmon definition
Subject Oriented, Integrated, Non Volatile, Time Variant.

Sean Kelly definition
Separate, Available, Integrated, Time Stamped, Subject Oriented, Non Volatile, Accessible.

DWH Approaches
There are two approaches:
1. Top Down by Inmon

2. Bottom Up by Ralph Kimball

Inmon approach --> The enterprise data warehouse is structured first, and the data marts are created from it next (Top Down).
Ralph Kimball approach --> The data marts are designed first, and the data warehouse is later built from those data marts (Bottom Up).

What are the responsibilities of a data warehouse consultant/professional?

The basic responsibility of a data warehouse consultant is to ‘publish the right data’.
Some of the other responsibilities of a data warehouse consultant are:

1. Understand the end users by their business area, job responsibilities, and computer tolerance.

2. Find out the decisions the end users want to make with the help of the data warehouse.

3. Identify the ‘best’ users who will make effective decisions using the data warehouse.

4. Find the potential new users and make them aware of the data warehouse.

5. Determine the grain of the data.

6. Make the end user screens and applications much simpler and more template driven.

What are fundamental stages of Data Warehousing?

Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.

Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.


Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery, or a booking).

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.


What is a Datamart? Explain the types.

It covers a specific subject area, functionality, or task.

It is designed to facilitate end-user analysis.

Wrong answer: "It is a subset of the warehouse." Please do not use this answer.
Types of Datamarts
Dependent, Independent, Logical.
Dependent ---> The warehouse is created first and the datamart is created from it next.
Independent ---> The datamart is created directly from the source systems without depending on the warehouse.
Logical ---> It is a backup or replica of another datamart.


How to create a Datawarehouse and a Datamart?

DWH -----> By applying a data warehouse approach to any database.

DM -----> It is created using either views or complex tables.

What is Dimensional Modeling?

It defines the relationship between dimension tables and fact tables using a particular model (Star, Snowflake, etc.).

What do you mean by a Dimension table? Explain the Dimension types.

A dimension table is a collection of attributes that describes a functionality or task.

Features:
1. It contains textual or descriptive information.
2. It does not contain any measurable information.
3. It answers what, where, when, and why questions.
4. These tables are master tables and also maintain history.

Types of Dimension
a. Conformed
b. Degenerate
c. Junk
d. Role Playing
e. SCD
f. Dirty

What is a Fact table? Explain the types of Measures.

The fact table is the main table in a dimensional model. It contains two sections:
a. Foreign keys to dimensions
b. Measures or facts

Features
1. A fact table contains measurable or numerical information.
2. It answers "how many" and "how much" questions.
3. These tables are child (transactional) tables and also contain history.

Types of Measures

Additive Measure, Semi Additive Measure, Non Additive Measure.

What is a Factless Fact Table?

A fact table which does not contain any meaningful or additive measures. It holds only foreign keys to dimensions and records that an event occurred.
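
As a rough illustration of a factless fact table, the sketch below uses Python's built-in sqlite3 module. The table name fact_attendance and its key columns are made up for this example. The table stores only foreign keys, and the useful number comes from counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical factless fact table: only dimension keys, no measure columns.
conn.executescript("""
    CREATE TABLE fact_attendance (
        date_key    INTEGER,
        student_key INTEGER,
        class_key   INTEGER
    );
    INSERT INTO fact_attendance VALUES
        (20110926, 1, 10),
        (20110926, 2, 10),
        (20110927, 1, 10);
""")

# The "measure" is derived by counting the event rows per day.
attendance = conn.execute(
    "SELECT date_key, COUNT(*) FROM fact_attendance "
    "GROUP BY date_key ORDER BY date_key"
).fetchall()
print(attendance)
```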


What is a Surrogate key? How do we generate it?

It is a key that contains unique values, like a primary key.
A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key.
It is just a unique identifier or number for each row that can be used as the primary key of the table.

We may generate this key in 2 ways:

System generated
Manual sequence
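
Both ways of generating a surrogate key can be sketched with Python's built-in sqlite3 module. This is only an illustration: dim_customer, customer_key, and customer_id are assumed names, and the "manual sequence" here is simply MAX(key) + 1, which is one of several possible approaches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# customer_key is the surrogate key; customer_id is the natural key
# that arrives from the source system (both names are hypothetical).
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id  TEXT NOT NULL,
        name         TEXT NOT NULL
    )
""")

def add_customer(customer_id, name):
    """System generated: let the database assign the next surrogate key."""
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id, name) VALUES (?, ?)",
        (customer_id, name),
    )
    return cur.lastrowid

key_a = add_customer("CUST-0042", "Asha")
key_b = add_customer("CUST-0043", "Ravi")

# Manual sequence: compute the next key ourselves from the current maximum.
next_manual_key = conn.execute(
    "SELECT COALESCE(MAX(customer_key), 0) + 1 FROM dim_customer"
).fetchone()[0]
print(key_a, key_b, next_manual_key)
```

Note that the surrogate key carries no business meaning; the natural key stays in the row as an ordinary attribute.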

What is the necessity of having surrogate keys?

1. Production may reuse keys that it has purged but that you are still maintaining.

2. Production might legitimately overwrite some part of a product description or a
customer description with new values but not change the product key or the customer
key to a new value. We might be left wondering what to do about the revised attribute
values (the slowly changing dimension crisis).

3. Production may generalize its key format to handle some new situation in the
transaction system.
E.g. the production keys may change from integers to alphanumeric,
or the 12-byte keys you are used to may become 20-byte keys.

4. Acquisition of companies.

What are the advantages of using Surrogate Keys?

1. We can save substantial storage space with integer valued surrogate keys.

2. Eliminate administrative surprises coming from production.

3. Potentially adapt to big surprises like a merger or an acquisition.

4. Have a flexible mechanism for handling slowly changing dimensions.

What is SCD? Explain the SCD types.

SCD ---> Slowly Changing Dimension
A dimension maintains the history of its data. Changes arrive in a dimension in low volume, so we call such a dimension a Slowly Changing Dimension. The process we follow to handle these changes is called the SCD process.

SCD Types
Type 1 ---> No History
The new record replaces the original record. Only one record exists in the database: the current data.

Type 2 ---> History Maintained
Methods: 1. Current/Expired Method
2. Effective Date Range Method
A new record is added to the dimension table (for example, the customer dimension).
Two records exist in the database: the current data and the previous (history) data.

Type 3 ---> Limited History Maintained
The original record is modified to include the new data. One record exists in the database: the new information is stored alongside the old information in the same row.
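
The difference between Type 1 and Type 2 can be sketched with Python's built-in sqlite3 module. The table layout below is an assumption made for illustration: it combines a current flag with an effective date range, and names such as dim_customer, is_current, start_date, and end_date are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id  TEXT,
        city         TEXT,
        start_date   TEXT,
        end_date     TEXT,     -- NULL while the row is current
        is_current   INTEGER
    )
""")
conn.execute(
    "INSERT INTO dim_customer "
    "(customer_id, city, start_date, end_date, is_current) "
    "VALUES ('C1', 'Chennai', '2010-01-01', NULL, 1)"
)

def scd_type1(customer_id, new_city):
    """Type 1: overwrite the current row in place; no history is kept."""
    conn.execute(
        "UPDATE dim_customer SET city = ? "
        "WHERE customer_id = ? AND is_current = 1",
        (new_city, customer_id),
    )

def scd_type2(customer_id, new_city, change_date):
    """Type 2: expire the current row, then add a new current row."""
    conn.execute(
        "UPDATE dim_customer SET end_date = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, start_date, end_date, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

scd_type2("C1", "Mumbai", "2011-09-26")
rows = conn.execute(
    "SELECT city, is_current FROM dim_customer "
    "WHERE customer_id = 'C1' ORDER BY customer_key"
).fetchall()
print(rows)
```

After the Type 2 change the customer has two rows, each with its own surrogate key, so old facts still point at the old city.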

What are the techniques for handling SCDs?

Overwriting
Creating another dimension record
Creating a current value field

What are the Different methods of loading Dimension tables?

There are two different ways to load data in dimension tables.

Conventional (Slow):
All the constraints and keys are validated against the data before it is
loaded; this way data integrity is maintained.

Direct (Fast):
All the constraints and keys are disabled before the data is loaded.
Once the data is loaded, it is validated against all the constraints and keys.
If data is found to be invalid or dirty, it is not included in the index and all future
processes skip this data.

What is OLTP?

OLTP is the abbreviation of On-Line Transaction Processing. Such a system is
an application that modifies data the instant it receives it and has a
large number of concurrent users.

What is OLAP?

OLAP is the abbreviation of Online Analytical Processing. Such a system is an
application that collects, manages, processes, and presents
multidimensional data for analysis and management purposes.

What is the difference between OLTP and OLAP?

Data Source
OLTP: Operational data; the OLTP system is the original source of the data.

OLAP: Consolidated data; it comes from various sources.

Process Goal
OLTP: Snapshot of ongoing business processes; it runs the fundamental business tasks.

OLAP: Multi-dimensional views of business activities for planning and decision making.

Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.

OLAP: Complex, long-running queries run by the system to update the aggregated data.

Database Design
OLTP: Normalized, small database. Speed is not an issue because the database is smaller, and normalization does not degrade performance.
It adopts the entity relationship (ER) model and an application-oriented database design.

OLAP: De-normalized, large database. Speed is an issue because the database is larger; de-normalizing improves performance because there are fewer tables to scan while performing tasks.
It adopts a star, snowflake, or fact constellation model and a subject-oriented database design.

Backup and System Administration

OLTP: Regular database backup and system administration can do the job.

OLAP: Reloading the OLTP data is considered a good backup option.


Describe the foreign key columns in the fact table and dimension table.

Foreign keys of dimension tables are primary keys of entity tables.
Foreign keys of fact tables are primary keys of dimension tables.
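
A minimal sketch of the fact-to-dimension key relationship, using Python's built-in sqlite3 module. The table and column names (dim_product, fact_sales, product_key) are assumptions for illustration. SQLite enforces foreign keys only when the foreign_keys pragma is switched on, so the sketch enables it first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT
    );
    -- The fact table's foreign key is the dimension table's primary key.
    CREATE TABLE fact_sales (
        product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
        sales_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'pen');
    INSERT INTO fact_sales VALUES (1, 9.5);
""")

# A fact row that points at a missing dimension row is rejected.
try:
    conn.execute("INSERT INTO fact_sales VALUES (99, 1.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```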

What is Data Mining?

Data Mining is the process of analyzing data from different perspectives and summarizing
it into useful information.

What is the difference between view and materialized view?

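A view takes the output of a query and makes it appear like a virtual
table; it can be used in place of tables. The query is run again each time the view is used.

A materialized view provides indirect access to table data by storing
the results of a query in a separate schema object. The stored results have to be refreshed when the underlying tables change.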
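
The contrast can be sketched with Python's built-in sqlite3 module. SQLite has ordinary views but no materialized views, so the stored copy below is emulated with CREATE TABLE ... AS SELECT; that emulation, and the names sales, v_sales, and mv_sales, are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, amount REAL);
    INSERT INTO sales VALUES ('pen', 10.0), ('pen', 5.0);

    -- Ordinary view: only the query text is stored.
    CREATE VIEW v_sales AS
        SELECT product, SUM(amount) AS total FROM sales GROUP BY product;

    -- Emulated materialized view: the query results are stored.
    CREATE TABLE mv_sales AS
        SELECT product, SUM(amount) AS total FROM sales GROUP BY product;
""")

# New data arrives after both objects were created.
conn.execute("INSERT INTO sales VALUES ('pen', 20.0)")

view_total = conn.execute("SELECT total FROM v_sales").fetchone()[0]
stored_total = conn.execute("SELECT total FROM mv_sales").fetchone()[0]
print(view_total, stored_total)
```

The view reflects the new row immediately, while the stored copy stays at its old value until it is rebuilt. That is the trade-off: faster reads against the need for a refresh.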


What is ODS?

ODS is the abbreviation of Operational Data Store. It is a database structure that is a repository
for near real-time operational data rather than long-term trend data.
The ODS may further become the enterprise shared operational database,
allowing operational systems that are being reengineered to use the ODS as their operational databases.

What is VLDB?

VLDB is abbreviation of Very Large DataBase. A one terabyte database would normally be considered to be a VLDB. Typically, these are decision support systems or transaction processing applications serving large numbers of users.

Is the OLTP database design optimal for a Data Warehouse?

No. OLTP database tables are normalized, and the extra joins add time to analytical queries. Additionally, an OLTP database is smaller and does not contain the longer period (many years) of data that needs to be analyzed.

An OLTP system is basically an ER model and not a Dimensional Model.
If a complex query is executed on an OLTP system, it may cause heavy overhead on the OLTP server that will affect normal business processes.

If de-normalization improves data warehouse processes, why is the fact table in normal form?

Foreign keys of fact tables are primary keys of dimension tables. The fact table consists of columns that are primary keys of other tables, plus the measures, and that by itself makes it a normal form table.


What are lookup tables?

A lookup table is a table that is checked against the target table, based on the primary key of the target.
It lets the load update the target with only the modified (new or updated) records, as determined by the lookup condition.

What are Aggregate tables?

An aggregate table contains a summary of existing warehouse data, grouped to certain levels of the dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table, which may have millions of records.
Aggregate tables reduce the load on the database server, improve query performance, and return results quickly.



What is real time data-warehousing?

Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.

What are conformed dimensions?

Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. They are common to the cubes and can be used across multiple data marts in combination with multiple fact tables.


What is a conformed fact?

A conformed fact is a measure that has the same definition, name, and units in every fact table or data mart in which it appears, so that its values can be compared and combined across them.

How do you load the time dimension?

Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. 100 years may be represented in a time dimension, with one row per day.
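
The loading loop described above might look like the following Python sketch. The column names and the YYYYMMDD integer key are assumptions chosen for illustration; fiscal periods, seasons, and holidays are left out because they depend on the business calendar.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def build_time_dimension(start, end):
    """Return one row (as a dict) per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        next_day = day + timedelta(days=1)
        rows.append({
            "time_key": int(day.strftime("%Y%m%d")),   # e.g. 20110926
            "day_of_week": DAY_NAMES[day.weekday()],
            "day_number_in_month": day.day,
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "weekday_flag": day.weekday() < 5,
            "last_day_in_month_flag": next_day.month != day.month,
        })
        day = next_day
    return rows

time_rows = build_time_dimension(date(2011, 1, 1), date(2011, 12, 31))
print(len(time_rows), time_rows[0])
```

For a 100-year dimension the same function can be called with a wider date range and the rows bulk-inserted into the dimension table.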

What is the level of granularity of a fact table?

Level of granularity means the level of detail that you put into the fact table in a data warehouse, that is, how much detail you are willing to store for each transactional fact.

What are non-additive facts?

Non-additive facts are facts that cannot be summed up across any of the dimensions present in the fact table (for example, ratios and percentages). However, they are not useless: if the dimensions change, the same facts can become useful.

What are Additive Facts? Or what is meant by Additive Fact?

Additive facts are facts that can be summed up across all of the dimensions in the fact table.
Fact tables are mostly very huge, and we almost never fetch a single record into our answer set.
We fetch a very large number of records on which we then do adding, counting, averaging, or
taking the min or max. The most common of them is adding. Applications are simpler if they store facts in an additive format as often as possible.
Thus, in the grocery example, we don’t need to store the unit price.
We compute the unit price by dividing the dollar sales by the unit sales whenever necessary.
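
A small Python sketch of the grocery idea, with made-up figures: the additive facts (dollar sales and unit sales) are summed first, and the unit price is derived afterwards. Averaging per-row unit prices gives a different, misleading number.

```python
# Hypothetical fact rows for one product; the numbers are invented.
fact_rows = [
    {"product": "milk", "dollar_sales": 30.0, "unit_sales": 20},
    {"product": "milk", "dollar_sales": 60.0, "unit_sales": 30},
]

# Additive facts can simply be summed across rows.
total_dollars = sum(r["dollar_sales"] for r in fact_rows)
total_units = sum(r["unit_sales"] for r in fact_rows)

# Unit price is non-additive, so it is computed from the summed facts.
unit_price = total_dollars / total_units

# Averaging the per-row prices ignores how many units each row represents.
naive_average = sum(
    r["dollar_sales"] / r["unit_sales"] for r in fact_rows
) / len(fact_rows)

print(total_dollars, total_units, unit_price, naive_average)
```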


What are the 3 important fundamental themes in a data warehouse?

The 3 most important fundamental themes are:
1. Drilling Down
2. Drilling Across and
3. Handling Time

What is meant by Drilling Down?

Drilling down means nothing more than “give me more detail”.
Drilling Down in a relational database means “adding a row header” to an existing SELECT
statement.

For instance, if you are analyzing the sales of products at a manufacturer level, the
select list of the query reads:

SELECT MANUFACTURER, SUM(SALES).

If you wish to drill down on the list of manufacturers to show the brand sold, you add the BRAND row header:

SELECT MANUFACTURER, BRAND, SUM(SALES).

Now each manufacturer row expands into multiple rows listing all the brands sold. This is the
essence of drilling down.

We often call a row header a “grouping column” because everything in the list that’s not
aggregated with an operator such as SUM must be mentioned in the SQL GROUP BY clause.
So the GROUP BY clause in the second query reads, GROUP BY MANUFACTURER, BRAND.
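
The two queries can be tried end to end with Python's built-in sqlite3 module. The sales_fact table and its rows are invented for this sketch; only the SELECT and GROUP BY shapes come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (manufacturer TEXT, brand TEXT, sales REAL);
    INSERT INTO sales_fact VALUES
        ('Acme',   'Alpha', 100.0),
        ('Acme',   'Beta',   50.0),
        ('Acme',   'Alpha',  25.0),
        ('Zenith', 'Zed',    70.0);
""")

# Summary level: one row per manufacturer.
by_manufacturer = conn.execute(
    "SELECT manufacturer, SUM(sales) FROM sales_fact "
    "GROUP BY manufacturer ORDER BY manufacturer"
).fetchall()

# Drill down: add BRAND as a row header and as a grouping column.
by_brand = conn.execute(
    "SELECT manufacturer, brand, SUM(sales) FROM sales_fact "
    "GROUP BY manufacturer, brand ORDER BY manufacturer, brand"
).fetchall()

print(by_manufacturer)
print(by_brand)
```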


What is meant by Drilling Across?

Drilling Across adds more data to an existing row. If drilling down is requesting ever finer-grained
data from the same fact table, then drilling across is the process of linking two or more
fact tables at the same granularity, or, in other words, tables with the same set of grouping
columns and dimensional constraints.

A drill across report can be created by using grouping columns that apply to all the fact tables
used in the report.

The new fact table called for in the drill-across operation must share certain dimensions with the
fact table in the original query. All fact tables in a drill-across query must use conformed
dimensions.
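
One possible way to write a drill-across query is sketched below with Python's built-in sqlite3 module. All table and column names are assumptions. Each fact table is first aggregated to the shared grain (brand, taken from the conformed product dimension), and only then are the two result sets joined, which avoids double counting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
    CREATE TABLE fact_shipments (product_key INTEGER, units_shipped INTEGER);

    INSERT INTO dim_product VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO fact_sales VALUES (1, 5), (1, 7), (2, 3);
    INSERT INTO fact_shipments VALUES (1, 20), (2, 4), (2, 6);
""")

drill_across = conn.execute("""
    SELECT s.brand, s.units_sold, h.units_shipped
    FROM (SELECT p.brand, SUM(f.units_sold) AS units_sold
          FROM fact_sales f JOIN dim_product p USING (product_key)
          GROUP BY p.brand) AS s
    JOIN (SELECT p.brand, SUM(f.units_shipped) AS units_shipped
          FROM fact_shipments f JOIN dim_product p USING (product_key)
          GROUP BY p.brand) AS h
      ON s.brand = h.brand
    ORDER BY s.brand
""").fetchall()

print(drill_across)
```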

What is the significance of handling time?

For example, when a customer moves out of a property, we might want to know:

1. who the new customer is
2. when did the old customer move out
3. when did the new customer move in
4. how long was the property empty etc
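
If move-in and move-out dates are kept for every occupancy, these questions reduce to simple date arithmetic. The Python sketch below uses invented records and field names purely for illustration.

```python
from datetime import date

# Hypothetical occupancy history for one property, oldest first.
occupancy = [
    {"customer": "old tenant", "move_in": date(2010, 1, 1),
     "move_out": date(2011, 8, 31)},
    {"customer": "new tenant", "move_in": date(2011, 9, 26),
     "move_out": None},  # None means still living there
]

old, new = occupancy[0], occupancy[1]

new_customer = new["customer"]                    # who the new customer is
moved_out_on = old["move_out"]                    # when the old customer left
moved_in_on = new["move_in"]                      # when the new customer arrived
days_empty = (moved_in_on - moved_out_on).days    # how long the property was empty

print(new_customer, moved_out_on, moved_in_on, days_empty)
```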


What are the important fields in a recommended Time dimension table?

Time_key
Day_of_week
Day_number_in_month
Day_number_overall
Month
Month_number_overall
Quarter
Fiscal_period
Season
Holiday_flag
Weekday_flag
Last_day_in_month_flag

What is the main difference between Data Warehousing and Business Intelligence?


The differences are:

DW - is a way of storing data and creating information through leveraging data marts.
DMs are segments or categories of information and/or data that are grouped together to provide 'information' about that segment or category.
DW does not require BI to work. Reporting tools can generate reports from the DW.


BI - is the leveraging of the DW to help make business decisions and recommendations.
Information and data rules engines are leveraged here to help make these decisions, along with statistical analysis tools and data mining tools.

What is a Physical data model?

During the physical design process, you convert the data gathered during the logical design
phase into a description of the physical database, including tables and constraints.


What is a Logical data model?

A logical design is a conceptual and abstract design. We do not deal with the physical
implementation details yet; we deal only with defining the types of information that we need.
The process of logical design involves arranging data into a series of logical relationships called
entities and attributes.


What are an Entity, Attribute and Relationship?

An entity represents a chunk of information. In relational databases, an entity often maps to a
table.
An attribute is a component of an entity and helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.
The entities are linked together using relationships.


What is junk dimension?

A number of very small dimensions might be lumped together to form a single dimension,
a junk dimension, whose attributes are not closely related.
Grouping random flags and text attributes and moving them to a separate sub-dimension is known as creating a junk dimension.
