SQL Core Concepts (The Data Life Cycle)

In SQL Server database data is actually similar to a product in an assembly line moving from one environment to another and undergoing transformations along the way. Figure-1 illustrates the data life cycle.

data-life-cycle

OnLine Transactional Processing

Data is entered initially into an OnLine Transactional Processing (OLTP) system. The focus of an OLTP system is data entry and not reporting—transactions mainly insert, update, and delete data. The relational model is targeted mainly at OLTP systems, where a normalized model provides both good performance for data entry and data consistency. In a normalized environment each table represents a single entity and redundancy is at a minimum. When you need to modify a fact you need to modify it in only one place, resulting in optimized performance for modifications and little chance for error.

An OLTP environment is not suitable for reporting purposes because a normalized model usually involves many tables (one for each entity) with complex relationships. Even simple reports require joining many tables, resulting in complex and poorly performing queries.

You can implement an OLTP database in SQL Server and both manage it and query it with T-SQL.

 

Data Warehouse

A data warehouse (DW) is an environment designed for data retrieval/reporting purposes. When serving an entire organization such an environment is called a data warehouse; when serving part of the organization (such as a specific department) it is called a data mart. The model has intentional redundancy, which allows fewer tables and simpler relationships, ultimately resulting in simpler and more efficient queries compared to an OLTP environment.

The simplest design of a data warehouse is called a star schema, which includes several dimension tables and a fact table. Each dimension table represents a subject by which data is analyzed. For example, in a system that deals with orders and sales, you will probably want to analyze data by customers, products, employees, time, and so on. In a star schema each dimension is implemented as a single table with redundant data. For example, a product dimension could be implemented as a single ProductDim table instead of three normalized tables: Products, ProductSubCategories, and ProductCategories. If you normalize a dimension table, resulting in multiple tables representing the dimension, you get what’s known as a snowflake dimension. A schema that contains snowflake dimensions is known as a snowflake schema (as opposed to a star schema).

The fact table holds the facts and measures such as quantity and value for each relevant combination of dimension keys. For example, for each relevant combination of customer, product, employee, and day there will be a row with the quantity and value in the fact table. Note that data in a data warehouse is typically pre-aggregated to a certain level of granularity (such as a day), unlike data in an OLTP environment, which is usually recorded at the transaction level.

You can implement a data warehouse as a SQL Server database and manage and query it with T-SQL. The process that pulls data from source systems (OLTP and others), manipulates it, and loads it into the data warehouse is called Extract Transform and Load, or ETL. SQL Server provides a tool called Microsoft SQL Server Integration Services (SSIS) to handle ETL needs.

 

OnLine Analytical Processing

OnLine Analytical Processing (OLAP) systems support dynamic, online analysis of aggregated data.

Consider a data warehouse that you implemented as a relational database in SQL Server. Whenever a user makes a request for aggregated data, the application submits a query to the database, typically scanning and aggregating large amounts of base data. Even though it is more efficient to handle such requests against a relational data warehouse compared to an OLTP environment, that approach might not be efficient enough. Online dynamic analysis of aggregated data usually involves frequent requests for different levels of aggregations, which require slicing and dicing the data. Each such request might end up being very expensive if it needs to scan and aggregate large amounts of data, and chances are the response time will not be satisfactory.

To handle such needs you can pre-calculate different levels of aggregations. For example, you can pre-calculate yearly, monthly, and daily with the time dimension; category, subcategory, and product with the product dimension, and so on. When you pre-calculate aggregates, requests for aggregated data can be satisfied more quickly.

One option to implement this idea is to calculate and store the different levels of aggregations in the relational data warehouse. This involves writing a sophisticated process to handle the initial processing of the aggregates and the incremental updates. Another option is to use a special product designed for OLAP needs—Microsoft SQL Server Analysis Services (SSAS or AS). Note that SSAS is a separate service/engine from the SQL Server service. SSAS supports calculating different levels of aggregations and storing them in optimized multidimensional structures known as cubes. The source data for SSAS cubes can—and usually is—a relational data warehouse. Besides supporting large volumes of aggregated data, SSAS also provides many rich and sophisticated data analysis capabilities. The language used to manage and query SSAS cubes is called Multidimensional Expressions (MDX).

 

Data Mining

OLAP systems provide the user with answers to all possible questions, but the user’s task is to ask the right questions—to sift anomalies, trends, and other useful information from the sea of data. In the dynamic analysis process the user navigates from one view of aggregates to another—again, slicing and dicing the data—to find useful information.

Data mining (DM) is the next step; instead of letting the user look for useful information in the sea of data, data mining models can do this for the user. That is, data mining algorithms comb the data and sift the useful information from it. Data mining has enormous business value for organizations, helping to identify trends, figure out which products are purchased together, predict customer choices based on given parameters, and so on.

SSAS supports data mining algorithms—including clustering, decision trees, and others—to address such needs. The language used to manage and query data mining models is Data Mining Extensions (DMX).

 

Reference : Book – Microsoft_SQL_Server_2008_T_SQL_Fundamentals by Itzik Ben-Gan.

I Hope, you will like this post. Please share your thoughts on this topic or write comments in the comment section.

Advertisements

About Ashish Jain

I am Ashish Jain, a software engineer by profession. My goal of creating this blog is to share my knowledge of SQL server with all other SQL enthusiasts and also to learn from them.
This entry was posted in SQL Fundamentals. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s