Business Intelligence: Data warehouse concepts

Monday, April 29, 2013

Data Warehouse Design Approaches

Data warehouse design is one of the key techniques in building a data warehouse. Choosing the right design approach can save the project time and cost. Two data warehouse design approaches are popular: bottom-up and top-down.

Bottom-Up Design:

In the bottom-up design approach, the data marts are created first to provide reporting capability. A data mart addresses a single business area, such as sales or finance. These data marts are then integrated to build a complete data warehouse. The integration is implemented using the data warehouse bus architecture, in which a dimension is shared between facts in two or more data marts. Such shared dimensions are called conformed dimensions. The data marts are integrated through their conformed dimensions, and the data warehouse is built from the result.
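
As a rough illustration of a conformed dimension, the sketch below uses Python's built-in sqlite3 module with two small fact tables, standing in for a sales mart and a finance mart, that share one customer dimension. All table names, column names, and data are hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical schema: two data marts share one conformed dimension, dim_customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales   (customer_id INTEGER REFERENCES dim_customer, amount REAL);
    CREATE TABLE fact_invoice (customer_id INTEGER REFERENCES dim_customer, amount_due REAL);

    INSERT INTO dim_customer VALUES (101, 'Ramesh'), (102, 'Anita');
    INSERT INTO fact_sales   VALUES (101, 50.0), (101, 25.0), (102, 40.0);
    INSERT INTO fact_invoice VALUES (101, 30.0), (102, 40.0);
""")

# Aggregate each fact table on its own, then combine the results through the
# shared dimension key. Aggregating before joining avoids double counting.
drill_across = conn.execute("""
    SELECT c.name, s.total_sales, i.total_due
    FROM dim_customer AS c
    JOIN (SELECT customer_id, SUM(amount) AS total_sales
          FROM fact_sales GROUP BY customer_id) AS s
      ON s.customer_id = c.customer_id
    JOIN (SELECT customer_id, SUM(amount_due) AS total_due
          FROM fact_invoice GROUP BY customer_id) AS i
      ON i.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(drill_across)  # [('Ramesh', 75.0, 30.0), ('Anita', 40.0, 40.0)]
```

Because both marts use the same customer keys and attributes, their numbers can be reported side by side without any translation step.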

Advantages of bottom-up design are:

This model contains consistent data marts and these data marts can be delivered quickly.
As the data marts are created first, reports can be generated quickly.
The data warehouse can be extended easily to accommodate new business units: it is just a matter of creating new data marts and integrating them with the existing ones.

Disadvantages of bottom-up design are:

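The positions of the data warehouse and the data marts are reversed in the bottom-up approach: the data warehouse is assembled from the data marts, rather than the data marts being loaded from the data warehouse.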

Top-Down Design:

In the top-down design approach, the data warehouse is built first. The data marts are then created from the data warehouse.
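
A minimal sketch of this direction of flow, again using Python's sqlite3 module: the warehouse tables exist first, and a sales data mart is then loaded from them with a single query. The table names (wh_customer, wh_transaction, mart_sales_by_region) and the data are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical enterprise warehouse tables, built first.
    CREATE TABLE wh_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE wh_transaction (
        txn_id        INTEGER PRIMARY KEY,
        customer_id   INTEGER REFERENCES wh_customer,
        business_area TEXT,
        amount        REAL
    );
    INSERT INTO wh_customer VALUES (101, 'Ramesh', 'South'), (102, 'Anita', 'North');
    INSERT INTO wh_transaction VALUES
        (1, 101, 'sales',   50.0),
        (2, 102, 'sales',   40.0),
        (3, 101, 'finance', 30.0);

    -- The sales data mart is loaded from the warehouse, not from source systems.
    CREATE TABLE mart_sales_by_region AS
        SELECT c.region, SUM(t.amount) AS total_sales
        FROM wh_transaction AS t
        JOIN wh_customer AS c ON c.customer_id = t.customer_id
        WHERE t.business_area = 'sales'
        GROUP BY c.region;
""")

mart_rows = conn.execute(
    "SELECT region, total_sales FROM mart_sales_by_region ORDER BY region"
).fetchall()
print(mart_rows)  # [('North', 40.0), ('South', 50.0)]
```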

Advantages of top-down design are:

Provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse.
This approach is robust against business changes. Creating a new data mart from the data warehouse is very easy.

Disadvantages of top-down design are:

This methodology is inflexible to changing departmental needs during the implementation phase.
It represents a very large project, and the cost of implementing it is significant.

Thursday, April 25, 2013

Degenerate dimension

A degenerate dimension is a dimension that exists within the fact table alone, like a foreign key that has no corresponding primary key in a dimension table.

Usually when a dimension table is growing at roughly the same rate as the fact table, there is a degenerate dimension lurking that has been missed in the initial design.

A degenerate dimension (DD) acts as a dimension key in the fact table, but it does not join to a corresponding dimension table because all its interesting attributes have already been placed in other analytic dimensions. Sometimes people want to refer to degenerate dimensions as textual facts, but they are not facts: the fact table's primary key often consists of the DD combined with one or more additional dimension foreign keys.

The easiest way to understand what we are talking about is via an example.
For instance, we might have a FACT table for sales - tblSales - which contains the following data:

CustomerID  ProductID  CreditCardType
101         5          Visa
102         6          Mastercard
103         7          Visa

In this case, CustomerID and ProductID are conventional dimensional references: they are foreign keys, each with a corresponding dimension table that holds the matching primary key.
However, CreditCardType is also a piece of dimensional information, something we might want to analyze by, but there is no tblCreditCardType. CreditCardType is a degenerate dimension.
Degenerate dimensions arise because there are often columns that we want to keep in the fact table but that are not measures and have no table of attributes to join to. A purchase order number is a typical example. The column stores something we want to have, but creating a dimension table with no attributes for it would only slow things down. So, rather than break the rule that a fact table contains only measures and dimension keys, we simply call such columns dimensions, even without the table.
Any attribute of the purchase order that was of interest, and whose values carry further attributes of their own, would already have been turned into a dimension with its own dimension table.
But creating a dimension table with a row for every purchase order would produce a very large dimension with nothing in it, since there are lots of purchase orders (possibly as many as there are facts, if the grain of your fact table is one row per purchase order). Still, our users would not be happy if they could not get a list of the purchase orders included in a given total, or drill down to the bottom level of detail that we have gone to all the trouble to include.
So, when we create transaction-level fact tables, it is normal, and in fact necessary, to include some degenerate dimensions: columns that hold useful information (very often a reference back to the source system) but do not join to any dimension table. Plus, you can impress everyone with your dimensional modelling knowledge when you say "degenerate dimension".
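
The purchase order case can be sketched with Python's sqlite3 module as follows. The schema is hypothetical: fact_sales, po_number, and the sample rows are invented for this example. The point is only that po_number sits in the fact table with no dimension table behind it, yet can still be used for grouping and drill-down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);

    -- po_number is a degenerate dimension: it lives only in the fact table
    -- and there is no purchase-order dimension table to join to.
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer,
        product_id  INTEGER REFERENCES dim_product,
        po_number   TEXT,
        quantity    INTEGER
    );

    INSERT INTO dim_customer VALUES (101, 'Ramesh'), (102, 'Anita');
    INSERT INTO dim_product  VALUES (5, 'Rice'), (6, 'Wheat');
    INSERT INTO fact_sales VALUES
        (101, 5, 'PO-1001', 20),
        (101, 6, 'PO-1001', 10),
        (102, 5, 'PO-1002', 15);
""")

# Group by the degenerate dimension directly; no join is needed for it.
totals_by_po = conn.execute("""
    SELECT po_number, SUM(quantity)
    FROM fact_sales
    GROUP BY po_number
    ORDER BY po_number
""").fetchall()

# Drill down from one total to the detail rows behind it.
po_1001_lines = conn.execute("""
    SELECT p.name, f.quantity
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    WHERE f.po_number = 'PO-1001'
    ORDER BY p.product_id
""").fetchall()

print(totals_by_po)   # [('PO-1001', 30), ('PO-1002', 15)]
print(po_1001_lines)  # [('Rice', 20), ('Wheat', 10)]
```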

Dimension Table

A dimension table is a table that stores details about the facts. Dimension tables contain attributes that describe fact records in the fact table.

A dimension table typically has two types of columns: a primary key, which fact tables reference through their foreign keys, and textual/descriptive attributes. Commonly used dimensions are customers, products, and time.
For example, consider this: if I just say "20kg", it does not mean anything. But if I say "20kg of Rice (product) is sold to Ramesh (customer) on 5th April (date)", that makes sense. The product, customer, and date are the dimensions that qualify the measure, 20kg.
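
The rice example can be written out as a tiny star schema. This is only a sketch using Python's sqlite3 module; the table and column names (fact_sales, dim_product, quantity_kg, and so on) are assumptions, not part of any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, date_label    TEXT);

    -- The fact row holds only keys and the measure; on its own it is just "20".
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product,
        customer_id INTEGER REFERENCES dim_customer,
        date_id     INTEGER REFERENCES dim_date,
        quantity_kg REAL
    );

    INSERT INTO dim_product  VALUES (1, 'Rice');
    INSERT INTO dim_customer VALUES (1, 'Ramesh');
    INSERT INTO dim_date     VALUES (1, '5th April');
    INSERT INTO fact_sales   VALUES (1, 1, 1, 20);
""")

# Joining to the dimension tables gives the measure its meaning.
row = conn.execute("""
    SELECT f.quantity_kg, p.product_name, c.customer_name, d.date_label
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    JOIN dim_date     AS d ON d.date_id     = f.date_id
""").fetchone()

sentence = f"{row[0]:g}kg of {row[1]} is sold to {row[2]} on {row[3]}"
print(sentence)  # 20kg of Rice is sold to Ramesh on 5th April
```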

OLTP VS OLAP

	OLTP System: Online Transaction Processing (Operational System)	OLAP System: Online Analytical Processing (Data Warehouse)
Source of data	Operational data; OLTPs are the original source of the data.	Consolidated data; OLAP data comes from the various OLTP databases.
Purpose of data	To control and run fundamental business tasks	To help with planning, problem solving, and decision support
What the data reveals	A snapshot of ongoing business processes	Multi-dimensional views of various kinds of business activities
Inserts and updates	Short and fast inserts and updates initiated by end users	Periodic long-running batch jobs refresh the data
Queries	Relatively standardized and simple queries returning relatively few records	Often complex queries involving aggregations
Processing speed	Typically very fast	Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes
Space requirements	Can be relatively small if historical data is archived	Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP
Database design	Highly normalized with many tables	Typically de-normalized with fewer tables; use of star and/or snowflake schemas
Backup and recovery	Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability	Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method
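
The difference in query shape from the "Inserts and updates" and "Queries" rows can be sketched in a few lines of Python with sqlite3. This is a simplification: both styles run against one hypothetical orders table here, whereas in practice the analytical query would run against a separate warehouse loaded from the operational system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        region      TEXT,
        amount      REAL
    )
""")

# OLTP-style work: short transactions that each write a single row.
new_orders = [(1, 101, "South", 50.0), (2, 102, "North", 40.0), (3, 101, "South", 25.0)]
for order in new_orders:
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", order)

# OLTP-style read: a simple lookup by key that returns very few records.
single_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style read: scan many rows and aggregate them along a dimension.
revenue_by_region = conn.execute("""
    SELECT region, SUM(amount), COUNT(*)
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()

print(single_order)       # (40.0,)
print(revenue_by_region)  # [('North', 40.0, 1), ('South', 75.0, 2)]
```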