Data Warehouse referat




3.2.1 Definition

A data warehouse is a central repository that stores the data collected in the enterprise\'s business operations. Data in the warehouse is extracted and organized from various OLTP applications and other sources. The data is used by OLAP systems and user queries to aid in decision-making analysis over a long historical time perspective. The warehouse is updated by snapshots of the corporate business data at certain moments of time, and once stored, data does not change anymore. The data warehouse may also store data generated by the OLAP reports to provide quick retrieval for later use.



3.2.2 Data Warehouse vs. Traditional Operational Data Stores

One of the main differences between these two types of systems is the data they contain. In operational systems (OLTP systems), the data is called operational data and is constantly in a state of flux, while in the data warehouse, the data is usually referred to as decision support data and doesn't change over time. The table below summarizes the main differences between operational data found in OLTP systems and decision support data found in a data warehouse.

Operational Data Decision Support Data
Application-oriented: data serves a particular business process or functionality, usually normalized data. Subject-oriented: data serves a certain subject of the business, such as customer information. De-normalized data where one record may cover the whole business process.



Detailed data (ex. all transactions).
Summarized or refined data with complex calculations over time periods.
Static structure. Dynamic structure, as new data cubes can be created upon need. Existing cubes can also be extended, by adding either new dimensions or dimension levels.
Oriented for data-entry people. Targets decision makers at all levels. Data structures are designed so that analysis can be made at different levels and different areas within a business.
Volatile (can be changed). Non-volatile (unchanged after insertion).
Performance is important because of the large number of concurrent users that may access the data. Performance issues are more relaxed, since a far
smaller number of people are expected to access
the data simultaneously.
Must provide high availabillity (backup and recovery plans are very well planned). Does not have the same degree of high availability requirement and plans for back-up and recovery are more relaxed.
Usually retrieved in units or records at any given time. This leads to processing only a small amount of data in a transaction. Usually retrieved in sets of records. This leads to
processing large amounts of data in one single process (such as finding out a particular trend based on data collected over several years).
Reflects current situation. Reflects values over time (historical).
Usually managed (administered) as a whole. Usually managed in pieces, or smaller sets.

Table 3-1. Operational data vs decision support data.

3.2.3 Data Warehouse and OLAP

The data warehouse is the storage area that OLAP relies on for its analyses. The data warehouse is populated over time from operational systems techniques of data scrubbing and transformation. This process is done because the raw data in the operational system may not be in the format and shape that can lead to efficient storage in the data warehouse, or efficient analyses by the OLAP tools. Furthermore, if the data was collected from several sources before storing it into the data warehouse, data inconsistencies have to be resolved. OLAP systems aggregate and summarize the data in the warehouse, organizing it in OLAP cubes that allow for fast retrieval of the information expected from a query or report.

3.2.4 Data volume

OLTP systems are designed to be small in size to allow for efficient updates, inserts, and deletions. Once some data is recognized to have become static and not likely to change, the data is archived from such systems or moved into the data warehouses after some transformation. The archived data can then be deleted from the OLTP system leaving room for new data to be collected.
On the contrary, the data warehouse is designed to hold large amounts of data. While an OLTP system can rarely exceed the 100 GB limit, a data warehouse is likely to be measured in the TB ranges. The data stored in the data warehouse is organized in a way that optimizes query performance.
This rule is not always the case. There may be corporations that use their OLTP system as the backend for the analysis tools. In this case, the data warehouse and the OLTP have the same size. This practice is not recommended, however, because optimizing the database for transaction processing will degrade its performance in analytical processing, and vice versa.

3.2.5 Data Marts

Today\'s corporations strive to conduct their business on a national and global basis. Business expansion creates the need to access corporate data from different geographical locations in a timely manner. The solution to this problem is to create smaller versions of the data warehouse based on certain criteria, called data marts.
Some corporations completely replace the concept of having a data warehouse with that of having smaller data marts. Other companies use data marts to complement their data warehouses by moving data into them from the data warehouse allowing for more efficient analysis based on criteria such as geographical area, time period, etc. Finally, some corporations use their data marts as the first stage of storing operational data. The data is then replicated from all of the corporate data marts into a central \"enterprise\" data warehouse. Such a data warehouse then evolves over time as the data marts continue to be built to cover as many aspects of the company business as needed.

3.2.6 Data Mining

With data mining, managers find out aspects and facts about their business that may not be evident otherwise. Data mining establishes relationships between business elements that may seem completely unrelated. In general data mining leads to the following results:
. Discovering unknown associations. Such associations can be found when one event can be correlated to another event that seems completely unrelated.
. Sequences, where one event leads to another later event.
. Recognizing patterns that lead to classification, or new organization of data (for example, certain profiles are established for customers based on what they purchase).
. Finding groups of facts not previously known. This process is known as event clustering.
. Forecasting, or simply discovering patterns in the data that can lead to predictions about the future.

Data mining is not discussed any further in this paper, but is included here as an important issue related to data warehousing.