T-SQL/MSBI Knowledge Share Videos

Understanding Data Warehouse


What is Data Warehousing?
                                         
A data warehouse is the main repository of an organization's historical data. It contains the raw material for management's decision support system. It is a repository of integrated information, available for queries and analysis.
In computing, a data warehouse (DW) is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.

A data warehouse maintains its functions in three layers: staging, integration, and access. Staging is used to store raw data for use by developers. The integration layer is used to integrate data and to have a level of abstraction from users. The access layer is for getting data out for users.

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject. Suppose if we take any retail industry, sales will be one subject area where all sales related details will be stored, Product will me another subject area where all product details will be stored, customer will be another subject area where all customer details will be stored and so on. These subject area can be built over the data model (E-R modeling/Dimensional modeling).

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product. Normally data for data warehouse comes from different source system, may be through files, different databases etc. All these data will be integrated based on the business logic.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

NOTE : A Data warehouse is nothing but a database (any RDBMS) where the Subject-Oriented, Integrated, Time-variant, Non-volatile data resides. Business is analyzed from Data warehouse by taking reports on the Data Warehouse data.

Why Data Warehouse Subject Oriented?
The Subject areas are what the business wants to talk “about” or the “nouns” of the business, e.g. finance, manufacturing, marketing, sales, HR, etc. So that it will be easy to analyze the business as the data related to each department will be in respective subject area. With this can find, what each department is doing and what is the performance in each department, which together defines the company performance. Also each department can have access only to their respective subject area and run the business easily without any confusion or complexity.

Why Data Warehouse Integrated?
The data in data warehouse is not only from one source system or one database; it is from heterogeneous source system. Data may come from files or different databases, on which a business logic is applied to integrate the data and loaded it into data warehouse.


Why Data Warehouse Time-varient?
The data in data warehouse is not only meant for current time or current period. To analyze the business we need data over time period, meaning, over many years of data. So that we can find the things like, what was the business 3 years ago and in what position it is now, which month every year improves the turnover of the business and so on.

Why Data Warehouse Non-volatile?
The data in data warehouse is non-volatile because once the data comes in data warehouse it is never changes over written, the changes are stored as new record, meaning, it maintains the history of the data. Suppose if you need any customer’s address which changes very often and need all the address in which the customer resided, then we can get it from data warehouse, thing we need to define constraints on which we need historical data.

Below is the very high level diagram about Data Warehouse, but its not the exact Data Warehouse Architecture.


A data warehouse is the main repository of an organization's historical data. It contains the raw material for management's decision support system. It is a repository of integrated information, available for queries and analysis.

In computing, a data warehouse(DW) is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.

A data warehouse maintains its functions in three layers: staging, integration, and access. Staging is used to store raw data for use by developers. The integration layer is used to integrate data and to have a level of abstraction from users. The access layer is for getting data out for users.





Why do we need Data Warehouse?
Data warehouse is mainly required for making business decision based on the historical data. DWH keeps track of all the changes w.r.t current record. We need data in an organized manner to make the decision over business and that is done in the DWH. Since the data may come from various sources, all these are integrated and organized in the DWH based on different subject areas, so that it will be helpful for making business decisions. Also we can view the information in multi dimension, i.e. profit based on year, month, location etc. can be analyzed. Here we convert raw data to useful and meaningful information.


Difference between Raw data and Information?
Raw data: It is the unprocessed or unorganized, in the sense, for example consider the shopping mart. Customers purchase different items and are tracked by entering the details like item number, quantity, price etc.
Information: It is a processed data, which is meaningful information i.e. how many items sold, what quantity of items sold, what is the percentage of profit in sales etc.

Data Warehouse Architecture differs based on the requirements. Below is one such common and mostly used DWH Architecture.



What is ETL?
ETL is Extract Transfer Load. It means Extract data from the source system, apply some transformations on the source data like data type conversion, business logic etc and load it to Staging/ODS/DWH.

What is ODS and why do we need it?
ODS(Operational Data Store) is nothing but a OLTP system, where the data is non-static. ODS is just like a DWH, where in it stores the data for fewer days as per the business requirement and later on data gets changed as per the source system. Here the history is not preserved like DWH, it has only transactional data over which the business logic is applied.

ODS is designed to support the detailed day-to-day activities of the organization at the transaction level. It is mainly used for operational reporting. In operational reporting, detail is much more important than the summary.Examples of operational reporting include bank tellers end-of-day window balancing reports, daily account audits and adjustments, daily production records, etc. The kind of decisions made from these reports are very detailed, immediate decisions made at the line-manager level.

Now consider the informational reporting, which are used strategically as opposed to operational reports that are tactically by the organization. These informational reports are fetched from the DWH. Middle management uses informational reports in order to make longer- term decisions. In informational report details are almost irrelevant, but the summarizations are everything. Example of informational reporting include monthly sales trends, annual revenue, and regional sales by product line for the quarter, industry production figures for the year, number of employees by quarter and weekly shipping cost by carrier, etc.

With respect to performance, if every report (daily,monthly,yearly etc.) points to DWH then the performance will be slow. So if reports like daily or monthly is pointed to ODS and the yearly report points to DWH, performance will be good.

Its upto the organization if it needs the ODS.

What is Staging area?
Some times ODS is also called as Staging area. Actually Staging area is where the data from all source system is bought at one place by freezing up the data. In the sense if we use the data directly from the source system, since its a transactional system and changes happens frequently the data wont be in sync with the different source system and the reports shows the invalid values. If we freez the data for specific time for all the source system then the reports generated will be upto the mark and correct for that freezed time.

What is Data Mart?
Data mart is a subset of data warehouse. A Data Warehouse contains enterprise wide data with all or many subject area, where as Data mart is related to a specific subject area and it is accessible to specific set of people who are related to that specific subject area. Example a Data warehouse contains all Sales, Marketing, Finance subject area, but Sales is one Data mart, Marketing is another data mart and so on.