What is a data warehouse?

A data warehouse is a repository of an organization’s electronically stored data. Data warehouses are designed to facilitate reporting and analysis.

 


Data mart

A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, which may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.

What is a data warehouse?

In one line, it is a repository of several years of data from which reports can be created for business decision making.

It is a database specifically modeled and fine-tuned for analysis and decision making.

Data is extracted from heterogeneous sources as it is generated and stored centrally. This makes it much easier and more efficient to run queries over data that originally came from different sources.

Why do you need a DWH? Can’t the reports be created directly from daily transactional (operational) data?

Hmmm... you can... but it is not easy. Here is why.

Consider a huge car rental company with a fleet of 150,000 cars. They have a daily rent-out ratio of 50%, meaning 75,000 cars are rented out per day. The challenge is to have fleet availability information ready at hand, which will enable them to have the right fleet available at the right time.

For this, the operational data will have 75,000 transactions per day on average. This system will feed data in for things like (only a few listed below):
1. reserve a car
2. process a request
3. generate an invoice
4. receive cash
5. cancellation of booking

Now, this system is designed for feeding data in. What we need is to use this data intelligently to analyze the business. Such an analysis may need answers to questions like:
1. Which model of car is rented the most?
2. Which region is most profitable?
3. Which area is the most problematic?
4. Can I get the entire sales data aggregated by region, by year, by car model?
5. Can I drill down to quarter-wise data for the same report above?
6. What is the operating profit margin per model? Or per region?
7. Can I flag regions that are not yet problematic, but are on the border?

Now, my daily data is not optimized to answer the above queries. This is because the data is stored per application and not by business area (subject, like customer, product info, garage or warehouse, etc.). If we have an invoicing application, it will hold only invoicing data, and the same goes for the booking & reservation application. And there might be a third application, like one that tracks loans given to large corporate clients, which may live in a different environment that is incompatible with the other two.

Now how do I get the data for the loans given and also make it interact with the other applications? How do I see the loan payments of the top 100 big clients, see their past sales information, and maybe also their preference over models? These live in different systems which may or may not be compatible. Even if they are compatible, making them interact is tough.

In most cases, data is not meaningfully represented and is dispersed. This will cause several issues like duplication of effort, poor productivity, quality and integrity issues, etc. Also the range of analysis available is limited.

What is the solution then?

To have an integrated system (covering all applications) that is grouped business-wise and not application-wise. This data should be non-volatile (unchanging, because it is history data, unless you made a wrong entry) and growing every day! And it is separated from the transactional data and is the single point of contact for all your decision-making information.

Wow, welcome to data warehousing!!

Does that solve all my problems?

Probably not! We still have the issues of integrating the data, making the systems interact without problems, sorting out discrepancies, cleansing data, modelling the data for easier understanding and querying, devising methods of loading the tables, etc.

How is the data fed into the warehouse?

Data from several transactional sources is merged and fed into warehouse tables that have been modeled using concepts like the star schema, snowflake schema, etc. This data is extracted from the source transaction data and appended to the warehouse. We now have a history database of the business.

Okay, let me talk about the star schema briefly. It has a central table called the fact table and peripheral tables called dimensions, which makes the diagram look like a star.

Fact tables hold the numeric data; the descriptions are contained in the dimensions.

Now, dimensions hold subject-wise data: a customer, a product, a warehouse, etc. Such a structure is called a data mart modeled on a star schema. This data mart is far easier to comprehend than the operational system, and we can now use it for easier querying. See these questions:
1. Which product sold the most? Available here.
2. Which customer group had which product preference? Available here.
3. What were the sales last year? Available.

And the user will understand this business driven subject wise grouped modeling better.

To create this data mart, we need to get all the customer, product, and supplier data from all the (different) systems and load them into the respective dimension tables. Then we get the transaction data into the fact table and join it with the respective dimensions using a foreign key (we call this a surrogate key; more on that below). The transaction data can be at the individual transaction level or grouped as per need.
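To make this concrete, here is a minimal SQL sketch of such a star schema. All table and column names below are illustrative assumptions, not from any particular system:

-- Dimension tables: one surrogate key plus descriptive attributes each.
CREATE TABLE date_dim (
    date_key     INTEGER PRIMARY KEY,  -- surrogate key
    calendar_dt  DATE,
    month_no     INTEGER,
    year_no      INTEGER
);

CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,  -- surrogate key
    product_id   VARCHAR(20),          -- natural key from the source system
    name         VARCHAR(50),
    model        VARCHAR(30)
);

CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_id  VARCHAR(20),          -- natural key from the source system
    name         VARCHAR(50),
    city         VARCHAR(30)
);

-- Fact table: surrogate keys pointing at the dimensions, plus the measure.
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim,
    product_key  INTEGER REFERENCES product_dim,
    customer_key INTEGER REFERENCES customer_dim,
    sales_amount DECIMAL(12,2)
);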

Now, what is the dimension exactly?
In my report, when I say "cost per X", X can be the dimension: cost per employee, cost per product, sales per month (time dimension), etc. A dimension is an attribute across which you review data.

And what is the fact table?
Numerical data (the sales amount, the cost amount, the inventory quantity, etc.) forms the fact table. There will be separate fact tables for inventory, sales revenue, sales orders, accounts receivable, accounts payable, etc.

So?
Now, you can get the data from fact tables grouped across dimensions to get the data you need. What are the accounts payable for last month (grouped across the time dimension) for this region (grouped across the region dimension)? What are the sales for product A (grouped across the product dimension, filtered for A) in 2006 (grouped across the time dimension, filtered for 2006)?
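Against the sketch tables above (names still assumed), the "sales for product A in 2006" question becomes a simple join-and-group query:

SELECT d.year_no, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key  -- group across the product dimension
JOIN date_dim d    ON f.date_key    = d.date_key     -- group across the time dimension
WHERE p.name = 'Product A'    -- filter for A
AND   d.year_no = 2006        -- filter for 2006
GROUP BY d.year_no;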

Why surrogate keys?

Here is why.

  1. To separate the DWH from the operational environment
    Think of this. You have a dimension loading from transactional tables. All of a sudden, the OLTP people decide to re-use all closed / inactive account IDs. For them, it won't matter a bit. For the DWH team, it will.
    The incoming record will have an account ID that is already there in the DWH, and it will overwrite the old record. This is unacceptable.

Consider another case where you have a customer dimension loaded from OLTP tables. And your company purchases another company. And now, the acquired company has similar type of primary key logic for its customers. It clashes with your own primary keys! Oops, how do we integrate them?

Surrogate keys!! Use surrogate keys!! This will shield the DWH from operational changes.

  2. Performance
    The fact table will have a composite key. If surrogate keys are used, then the fact table will have integers for its keys.
    In most cases, all we need is 4 bytes for each surrogate key column. So, if you have 5 dimensions, each fact row needs only 20 bytes of key space!
    If you have 10 million rows, that is only about 200 MB.
    If you carried the operational primary keys instead, each key might need several bytes more. Remember, at 10 million rows, every extra byte per row adds 10 MB; at a billion rows, it adds 1 GB. For every single byte increase!!

Also, queries run faster when you join on integers rather than varchars, so surrogate keys enable faster querying.

And add to it the option of being able to partition on the time_key or any other suitable key. Since a surrogate key is a sequence, you can easily partition on it.

  3. History preservation
    Consider this. Today, in July, I make a purchase. My current customer record will say

Prem 101 Bangalore Male

Next month, I move to Hyderabad. Make a purchase there.

Prem 101 Hyderabad Male

Now, you want to query for all purchases made by customers in Bangalore in July. How can I get that? My record says I am in Hyderabad! The way to handle this is by using SCDs (slowly changing dimensions), and that requires surrogate keys: the dimension now has to hold two rows for the same customer ID 101, which an OLTP primary key cannot allow.

Key — Name — Cust ID — City — Sex — Status
1 — Prem — 101 — Bangalore — Male — Closed
2 — Prem — 101 — Hyderabad — Male — Current

This will enable you to preserve history. Note the additional column to indicate the current record.

  4. To enable special situations that can't be handled by OLTP primary keys
    How can you have an OLTP primary key for a ‘not found’ record in a dimension? My fact table has a record that doesn't have a match yet in the dimension table (a late-arriving dimension, say). How do I handle it? I need to say it maps to a ‘not found’ record in the dimension table.

Surrogate key usage enables such options!
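One common way to set this up, sketched below with assumed names, is to reserve a special surrogate key (say 0) for the ‘not found’ member and point unmatched fact records at it during the load:

-- Reserve surrogate key 0 as the 'not found' member of the dimension.
INSERT INTO customer_dim (customer_key, customer_id, name, city)
VALUES (0, 'N/A', 'Not Found', 'N/A');

-- During the fact load, unmatched records fall back to key 0.
-- (Assumes the staging table already carries resolved date/product keys.)
INSERT INTO sales_fact (date_key, product_key, customer_key, sales_amount)
SELECT s.date_key,
       s.product_key,
       COALESCE(c.customer_key, 0),   -- no dimension match yet: map to 0
       s.sales_amount
FROM staging_sales s
LEFT JOIN customer_dim c ON c.customer_id = s.customer_id;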

Types of Facts

There are three types of facts:

  • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
  • Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
  • Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

Date
Store
Product
Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table (date, store, and product). For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
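As a small sketch, reusing the star-schema tables from earlier (week_no is an assumed attribute of the date dimension), rolling the daily fact up to weeks is just a sum along the date dimension:

-- Additive fact: Sales_Amount can be summed along any dimension.
SELECT d.week_no, SUM(f.sales_amount) AS weekly_sales
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.week_no;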

Say we are a bank with the following fact table:

Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all the current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at the account level or the day level.
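A quick sketch (table and column names assumed): the sum across accounts for a single day is meaningful, so the query fixes the date and aggregates over accounts only:

-- Semi-additive fact: sum Current_Balance across accounts for ONE date.
SELECT SUM(f.current_balance) AS bank_total
FROM balance_fact f
WHERE f.balance_date = DATE '2003-01-31';   -- a single point in time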

What are semi-additive facts?
Well, facts that can be aggregated across a few dimensions, but not all.

Take the case of an inventory fact table.

We track inventory every day for every product at every warehouse.

To get the inventory on a given day for all products, I aggregate the inventory fact measures across the product dimension filtered for that particular day. The following query will give the inventory for every product for today.

select product_dim.name, sum(inv_fact.inventory_qty)
from product_dim, inv_fact, date_dim
where date_dim.date_key = inv_fact.date_key
and date_dim.calendar_dt = trunc(sysdate)   -- truncate sysdate to midnight to match the date column
and product_dim.product_key = inv_fact.product_key
group by product_dim.name   -- group by the selected column, not the whole table

However, I cannot add up the inventory quantities for a product across every day of, say, a month and get anything meaningful: aggregation across time is meaningless here. (Of course, you can use the daily figures to calculate the average inventory per day, as sketched below.)

Such facts, which can be aggregated across certain dimensions but not across all of them, are called semi-additive or partial facts.
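As a sketch of that averaging idea (month_no is an assumed attribute of the date dimension), replace the sum over time with an average:

-- Across time, average instead of sum: average daily inventory in July.
SELECT p.name, AVG(f.inventory_qty) AS avg_daily_inventory
FROM inv_fact f
JOIN product_dim p ON p.product_key = f.product_key
JOIN date_dim d    ON d.date_key    = f.date_key
WHERE d.month_no = 7
GROUP BY p.name;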

 

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

  • Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
  • Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Snowflake schema

A snowflake schema is a way of arranging tables in a relational database such that the entity relationship diagram resembles a snowflake in shape. At the center of the schema are fact tables which are connected to multiple dimensions. When the dimensions consist of only single tables, you have the simpler star schema. When the dimensions are more elaborate, having multiple levels of tables, and where child tables have multiple parent tables (“forks in the road”), a complex snowflake starts to take shape. Generally, whether a snowflake or a star schema is used only affects the dimensional tables. The fact table is unchanged.

Using snowflake schema

A snowflake schema is a variation on the star schema, in which very large dimension tables are normalized into multiple tables. Dimensions with hierarchies can be decomposed into a snowflake structure when you want to avoid joins to big dimension tables while using an aggregate of the fact table. For example, if you have brand information that you want to separate out from a product dimension table, you can create a brand snowflake that consists of a single row for each brand and contains significantly fewer rows than the product dimension table. The figure below shows a snowflake structure for the brand and product-line elements and the brand_agg aggregate table.

[Figure: An Example of a Snowflake Schema]

If you create an aggregate, brand_agg, that consists of the brand code and the total revenue per brand, you can use the snowflake schema to avoid the join to the much larger sales table, as the following query on the brand and brand_agg tables shows:

SELECT brand.brand_name, brand_agg.total_revenue
FROM brand, brand_agg
WHERE brand.brand_code = brand_agg.brand_code
AND brand.brand_name = 'Anza'

Without a snowflaked dimension table, you use a SELECT UNIQUE or SELECT DISTINCT statement on the entire product table (potentially, a very large dimension table that includes all the brand and product-line attributes) to eliminate duplicate rows.
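That alternative would look something like this sketch (column names assumed), scanning the whole product table just to deduplicate brands:

-- Without the brand snowflake, deduplicate brands from the big table.
SELECT DISTINCT brand_code, brand_name
FROM product;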

While snowflake schemas are unnecessary when the dimension tables are relatively small, a retail or mail-order business that has customer or product dimension tables that contain millions of rows can use snowflake schemas to significantly improve performance.

If an aggregate table is not available, any joins to a dimension element that was normalized with a snowflake schema must now be a three-way join, as the following query shows. A three-way join reduces some of the performance advantages of a dimensional database.

SELECT brand.brand_name, SUM(sales.revenue)
FROM product, brand, sales
WHERE product.brand_code = brand.brand_code
AND sales.product_code = product.product_code   -- join column assumed; without this predicate the query would be a cartesian product
AND brand.brand_name = 'Alltemp'
GROUP BY brand.brand_name

[Figures: a sample star schema and its snowflaked version]

What is a dimension table?

A dimension table is a table that describes the measures in a fact table; in other words, it contains the dimensions or characteristics of a measure in a given fact table. For example, if I have a sales measure in the fact table, it has characteristics like the salesperson and their description, the date and its description, the store and its location, and so on.

Slowly Changing Dimensions

The “Slowly Changing Dimension” problem is a common one in data warehousing. In a nutshell, it applies to cases where an attribute of a record varies over time. We give an example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key Name State
1001 Christina Illinois

At a later date, in January 2003, she moved to Los Angeles, California. How should ABC Inc. now modify its customer table to reflect this change? This is the “Slowly Changing Dimension” problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.

Type 3: The original record is modified by adding a new column that keeps the previous value alongside the current one, so a limited amount of history is preserved.
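Here is a minimal SQL sketch of the three types applied to Christina's record. The table and column names (customer_dim, customer_key, state, old_state, current_flag) and the new surrogate key value are assumptions for illustration:

-- Type 1: overwrite in place; no trace of Illinois remains.
UPDATE customer_dim
SET state = 'California'
WHERE customer_key = 1001;

-- Type 2: expire the old row, then insert a new row with a new surrogate key.
UPDATE customer_dim
SET current_flag = 'N'
WHERE customer_key = 1001;

INSERT INTO customer_dim (customer_key, name, state, current_flag)
VALUES (1002, 'Christina', 'California', 'Y');

-- Type 3: keep the previous value in an extra column on the same row.
UPDATE customer_dim
SET old_state = state,
    state = 'California'
WHERE customer_key = 1001;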

Rapidly changing dimensions

Dimensions can be loaded in different ways. Slowly changing dimension (SCD) loading is one, and even within it there are three types:
1. Type 1 – any change in existing data is updated in place. Used for handling corrections.

  2. Type 2 – the change is not updated in place; a new record is inserted with the change, and the old record is kept as-is and flagged as the old version. For example, Prem is a record in the customer dimension. Today Prem lives in Hyderabad, so the current record in the customer dimension says
    #1 — Prem — Hyderabad — Male

Then I move to Bangalore. Now how do I update this? So I say

# no — Name — Place — Sex — Current version
#1 — Prem — Hyd — M — No
#2 — Prem — Blr — M — Yes

Why should I have the old record?
There might be a case where you want to know all the sales that were made by customers in Hyderabad last year, and Prem made a purchase in Hyderabad last year. I want this to be tracked. I don't want that sale to point to Prem living in Bangalore, but to Prem in Hyderabad.

This way, if Prem moves 10 times, we will have 10 records in customer each pointing to a different Prem as per his location. This will keep the data accurate.

  3. Type 3
    Here, instead of keeping all the history, we keep selected history data, like 3 versions. The last 3 location moves of Prem will be tracked, not more!

Okay, what are rapidly changing dimensions?
Any dimension whose attributes change too rapidly. For example, the credit rating of Prem. Assume that your rating agency rates people every 3 months (just for the example); then every 3 months Prem can have a different rating!

# no — Name — Place — Sex — Cr rate — Current version
#1 — Prem — Hyd — M — AAA — No
#2 — Prem — Blr — M — AA — No
#3 — Prem — Blr — M — A — No
#4 — Prem — Blr — M — B — Yes

So if you have around 3,000,000 customer records and they keep changing every quarter like this, the customer dimension will balloon. That defeats one purpose of a warehouse: effective querying.

What do we do?
Remove the rapidly changing attributes and make a new dimension out of them. Pull the credit rating out, make a dimension of it, and call it the credit rating behavior dimension.

So the customer dimension will not grow explosively; the rating dimension will. But the rating dimension will be used only by queries that need rating info. Any query that doesn't need rating info and needs only other customer info will go to the original customer dimension.
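A sketch of that split (all names assumed): the rating becomes its own small dimension, and the fact table carries both keys, so only rating queries ever touch it.

-- The volatile attribute moves into its own dimension.
CREATE TABLE credit_rating_dim (
    rating_key INTEGER PRIMARY KEY,
    cr_rate    VARCHAR(3)      -- 'AAA', 'AA', 'A', 'B', ...
);

-- The fact row records the rating in effect at transaction time,
-- so the customer dimension no longer needs a new version per rating change.
CREATE TABLE purchase_fact (
    date_key     INTEGER,
    customer_key INTEGER,   -- stable customer dimension
    rating_key   INTEGER,   -- points at credit_rating_dim
    sales_amount DECIMAL(12,2)
);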

Conformed dimensions

These are dimensions that are shared between multiple fact tables; for example, a date dimension can be shared between an inventory fact and a sales fact.
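Because both facts share the same date dimension and keys, you can line up their results side by side. A sketch (all names assumed):

-- Drill across two facts through the conformed date dimension.
SELECT s.month_no, s.total_sales, i.avg_inventory
FROM (SELECT d.month_no, SUM(f.sales_amount) AS total_sales
      FROM sales_fact f JOIN date_dim d ON f.date_key = d.date_key
      GROUP BY d.month_no) s
JOIN (SELECT d.month_no, AVG(f.inventory_qty) AS avg_inventory
      FROM inv_fact f JOIN date_dim d ON f.date_key = d.date_key
      GROUP BY d.month_no) i
  ON s.month_no = i.month_no;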

ETL

ETL stands for extract, transform, and load. It is the process of moving data from point A to point B.

An ETL tool enables developers to integrate their disparate data while moving it from place to place, and it doesn't matter that the data is in different forms or formats. The data can come from any source; ETL is powerful enough to handle such disparities.

For example, a financial institution might have information on a customer in several departments and each department might have that customer’s information listed in a different way. The membership department might list the customer by name, whereas the account department might list the customer by number. ETL can bundle all this data and consolidate it into a uniform presentation, such as for storing in a database or data warehouse.

Another way that companies use ETL is to move information to another application permanently. For instance, word-processing data might be translated into numbers and letters, which are easier to track in a spreadsheet or database program. This is particularly useful when backing up information as companies transition to new software altogether.

One important function of ETL is "cleansing" data. The ETL consolidation protocols also include the elimination of duplicate or fragmentary data, so that what passes from the E portion of the process to the L portion is easier to assimilate and/or store. Such cleansing operations can also include eliminating certain kinds of data from the process: if you don't want to include certain information, you can customize your ETL to drop that kind of data during the transformation.
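As a tiny sketch of the cleansing idea (all table and column names assumed), here the transform standardizes names and drops duplicates while consolidating two source systems:

-- E: read from two source extracts; T: trim/standardize and deduplicate;
-- L: load one consolidated staging table for the warehouse.
INSERT INTO stg_customer (customer_name, city)
SELECT DISTINCT UPPER(TRIM(cust_name)), UPPER(TRIM(city))
FROM src_membership_customers
WHERE cust_name IS NOT NULL
UNION            -- UNION also removes duplicates across the two sources
SELECT DISTINCT UPPER(TRIM(name)), UPPER(TRIM(city_name))
FROM src_accounts_customers
WHERE name IS NOT NULL;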

The T portion of the equation, of course, is the most powerful. ETL can transform not only data from different departments but also data from different sources altogether. For example, data in an email program such as Microsoft Outlook could be transformed right along with data from an SAP manufacturing application, with the result being data of a common thread in the end.

Microsoft, IBM and Informatica all provide ETL tools.
