CO6002 Advance Database Management

March 30

2011

The AWISDM project can be found by utilizing the short cut on the desktop or by following the path C:\Users\0817804\Documents\Visual Studio 2008\Projects\aj_AWISDM\aj_AWISDM the user name for the remote desktop is 0817804 and the password is P@55word

[Type the document subtitle]

 


 

Contents

Table of Figures. 2

Table of Tables. 2

Pilot InternetSales Data Mart Design. 3

Starting with dimension tables. 3

Database design as a star schema model 3

Date. 5

ProductSubcategory. 5

Territory. 5

Customer. 5

Surrogate Keys. 6

Date Attributes. 6

Normalisation and de-normalisation. 6

Comparing the terms star schema and snow-flake schema. 7

Schemas in Data Warehouses. 8

Slowly Changing Dimensions. 10

Design the fact table. 10

Fact Table considerations and Granularity. 11

Estimate the size of the Data Mart. 11

Design, develop and save an SSIS Package. 12

Integrate data from the four data sources into a staging database. 12

Load data to the data mart. 14

The key features of the ETL process. 14

Business Intelligence Applications. 15

Slicing and Dicing the Cube. 16

Reports. 17

Ad hoc Queries. 18

OLAP. 18

Data Mining. 19

Decision Support. 20

 


 

Table of Figures

Figure 1 Four types of dimension tables. 3

Figure 2 Star Schema structure for database design. 3

Figure 3 Entity relationship model of fact and dimension tables. 4

Figure 4 Starter Dimensions. 4

Figure 5 Date table with the addition of surrogate keys. 5

Figure 6 ProductSubCategory dimension with additional surrogate keys. 5

Figure 7 The Territory Dimension. 5

Figure 8 The Customer dimension. 5

Figure 9 Star schema model 7

Figure 10 AWISDM Data Mart as a Star Schema. 8

Figure 11 The benefits of a Snowflake Schema. 9

Figure 12 Snowflake Schema. 9

Figure 13 AWISDM as Snowflake Schema. 9

Figure 14 Fact Table. 10

Figure 15 Entity Relationship Diagram for AWISDM... 11

Figure 16 Concept design for integrating the four data sources. 12

Figure 17 Data is imported from the four sources. 13

Figure 18 the ETL Process. 13

Figure 19 Loading the data into the Data Mart. 14

Figure 20 A flow diagram to deal with error checking. 14

Figure 21 Data flow to check for and correct errors. 15

Figure 22 Output to Fact Table. 15

Figure 23 BI Platform.. 15

Figure 24 Microsoft Office Integration. 16

Figure 25 Single Data Point. 16

Figure 26 one Dimension of Data. 16

Figure 27 Two Dimensions of Data. 17

Figure 28 Three Dimensional Data. 17

Figure 29 Decision Support. 20

 

Table of Tables

Table 1 Snowflake verses Star Schema. 7

Table 2 AWISDM Dimension Tables Sizes. 12

 


 

Pilot InternetSales Data Mart Design

In this section we will be exploring the dimensional modelling of the pilot data mart for AdventureWorks Ltd. The broad agreement in the fields of data warehousing and business intelligence is that the dimensional model is the preferred structure. Data warehousing centres on facts and the fact table. Facts are usually numeric and are a measurement of an event.

Starting with dimension tables

Text Box: Figure 1 Four types of dimension tablesDimensions are the nouns of the business world; they describe the surrounding measurement events.

A single dimension that is shared across all of these facts is called a conformed dimension. A junk dimension is a convenient way of grouping flags and indicators. A degenerate dimension are common when the grain is of the fact table is at a unit level. Role-playing dimensions are recycled from multiple applications within the same database.

Database design as a star schema model

Figure 2 Star Schema structure for database design

In its initial state, the database consists of four tables with no relationships, shown in Figure 2 Star Schema structure for database design.

Figure 3 Entity relationship model of fact and dimension tables

The first steps in developing our dimension tables are to add surrogate keys to them.

Figure 4 Starter Dimensions

This basic starting point requires us to add surrogate keys to the dimension tables. The addition of surrogate keys


 

Date

Figure 5 Date table with the addition of surrogate keys

We can see from the data table shown in Figure 5 Date table with the addition of surrogate keys, we have added a number of surrogate, or non-natural keys to the data dimension.

ProductSubcategory

Figure 6 ProductSubCategory dimension with additional surrogate keys

The ProductSubCategory dimension with additional keys, as show in Figure 6 ProductSubCategory dimension with additional surrogate keys.

Territory

The territory dimension with additional surrogate keys, shown in Figure 7 The Territory Dimension.

 

Figure 7 The Territory Dimension

Customer

Figure 8 The Customer dimension

Finally we have the customer dimension with its surrogate keys, as shown in Figure 8 The Customer dimension.


 

Surrogate Keys

A surrogate key is usually created to simplify the key structure. That is to say it is an artificial substitution for a natural primary key, data held within this key may change with time, known as a slowly changing dimension. These surrogate keys are neither intelligent nor business specific and most commonly system-generated. Their purpose is to ensure that rows are unique within a database table, typically they are numeric fields.

Date Attributes

Data mart solutions often include a role-playing dimension to record the passage of time; the “Date” dimension. It is necessary to provide additional attributes to this dimension other than just the date field, which would only allow users to view sales for a specific date, or specify a range of dates, this provides a limiting scope for data analysis.

Adding fields such as: the day of the week, month name, fiscal calendar quarter or a public holiday flag fields enables much more flexibility. For example a user would be able to compare sales data for a particular day in a particular month to the same day in the previous year, or compare sales totals from one week to the next. This additional functionality allows business decision makers to generate considerably more useful conclusions from the data.

Normalisation and de-normalisation

The dimension tables of a data mart contain de-normalised data, whilst the fact tables will contain fully normalised data. Whilst this does not optimise the databases’ file size, it does increase the performance that queries on the data will execute at, often seen as the If the database was fully normalised – for example, if the customer table was normalised, additional slow lookups for marital status, profession and education would have to be performed.

The fact tables are fully normalised; they contain only time-period data and the relevant surrogate key for each dimension.

The dimensional tables are de-normalised, usually to the second normal form, and contain surrogate keys to facilitate query speeds.


 

Comparing the terms star schema and snow-flake schema

“Star and snowflake schema designs are mechanisms to separate facts and dimensions into separate tables. Snowflake schemas further separate the different levels of a hierarchy into separate tables.” (IBM, 2011)

Snowflake Schema

Star Schema

Joins

Higher number of Joins

Fewer Joins

 

Ease of Use

More complex queries and hence less easy to understand

Less complex queries and easy to understand

 

Query Performance

More foreign keys-and hence more query execution time

Less no. of foreign keys and hence lesser query execution time

 

Ease of maintenance or change

No redundancy and hence more easy to maintain and change

Has redundant data and hence less easy to maintain/change

 

Type of data warehouse

Good to use for small data warehouses/data marts

Has redundant data and hence less easy to maintain/change

 

Dimensional table

It may have more than one dimension table for each dimension

 

Contains only single dimension table for each dimension

 

Dimension Table Normalisation

3 Normal Form

2 Normal De-normalised Form

 

Table 1 Snowflake verses Star Schema

Primary keys

Foreign keys

Fact tables

Dimension Tables

Star schemas

Snowflake schemas

Figure 9 Star schema model

A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels, collectively makes up a dimension. In a relational database, the different levels of a hierarchy can be stored in a single table (as in Figure 9 Star schema model) or in separate tables (as in a snowflake schema).

·          Many-to-one relationships

·          Balanced and unbalanced hierarchies

·         Ragged hierarchies

Schemas in Data Warehouses

Figure 10 AWISDM Data Mart as a Star Schema

A star schema model consists of one or more fully normalised fact tables,(3NF),  that reference any number of de-normalised dimension tables. The dimension tables feature redundant data in the most granular form and are in second normal form, (2NF). This increases the simplicity of the database and reduces the complexity of queries made upon it. A snow-flake schema is a variation on this and features fully normalised dimension tables, (3NF). This reduces the amount of file space needed to store the database and reduces the number of places data would need to be altered if an update is required; but this come at a cost which is a reduction of query performance.

The facts that the data warehouse aids in analyse are classified along different dimensions. The fact table holds the main data. It includes a large amount of aggregated data, such as price and units sold. There may be multiple fact tables in a star schema.

Dimension tables, which are usually smaller than fact tables, include the attributes that describe the facts. Most often we see a segregation of these dimensions into a separate table for each dimension. Dimension tables are be joined to the fact table(s) as needed.

Dimension tables have a simple primary key, while fact tables have a set of foreign keys which make up a compound primary key consisting of a combination of relevant dimension keys.

It is common for dimension tables to consolidate redundant data in the most granular column, and thus rendered in second normal form. Fact tables are usually in third normal form because all data depends on either one dimension or all of them, not on combinations of a few dimensions.

Figure 11 The benefits of a Snowflake Schema

The benefit of using the snowflake schema as shown in Figure 11 The benefits of a Snowflake Schema, is that the storage requirements are lower since the snowflake schema eliminates many duplicate values from the dimensions themselves.

Figure 12 Snowflake Schema

The main design scheme is star but the ProductCategory can be granulated by adding more depth through using a snowflake schema for this element of the design. Extending the granularity of the DimProductSubCategory dimension

Figure 13 AWISDM as Snowflake Schema

Slowly Changing Dimensions

This is a dimensional problem that occurs with time, the attributes for a record changes over time. This can be seen in particular in the Customer table of the AWISDM.

For example, if the customer number was used as a natural key and a customer changes their address it would cause numerous problems. Firstly, if the record was simply updated it would not be possible to track where previous items had been delivered; it would be as though the customer had only a single address and had always lived there.

By using a surrogate key, a new record for the customer can be added with the new address in the table and all new queries will use this new record; the old data remains intact. It is possible by creating a composite natural key (customer number and address), if not all parts of this natural key exist in the fact table it may not be possible to do a join on this new enlarged key – therefore requiring another ID field; the surrogate key.

Although there is only one customer, they would be treated dimensionally as two separate customers. The problem of slowly changing dimensions is one that needs addressing to avoid the deletion of data that may be required for statistical analysis. For example: if the company wanted to analyse sales by a certain region and the customer’s address has changed, all sales made before the alteration would become assigned to the new region.

Design the fact table

Figure 14 Fact Table

Figure 15 Entity Relationship Diagram for AWISDM

Fact Table considerations and Granularity

The fact table in the AWISDM requires a daily sales total for a customer in each product sub category and region. The ETL process is currently unaffected by the source granularity as the incoming data is all of the same level of detail. If one of the data sources listed its sales on an individual order basis rather than a daily total for example, this data would have to be aggregated by date to enable a courser level detail in the other data.

All subsequent sales will need to at least match this granularity. If one of the data sources begins to produce a finer level of detail of data; for example grouping sales data by product category instead of sub-category, all of the data being imported to the data mart would have to be brought to the finer level of granularity to ensure consistency.

Estimate the size of the Data Mart

The fact table in the AWISDM currently has seven fields with an average field length of 38.9 bytes (1 field of 4 byte, 5 fields with a size of 52 bytes, and one of 8bytes). Assuming that there are 500,000 rows in the table this gives a total table size of

7 fields x 38.9 bytes/field x 500,000 rows = 136,150,000 bytes (129.84 Megabytes, or 132,959 Kilobytes)

The dimension tables are much smaller and typically contain a fraction of the number of rows found in the fact table. The dimension tables in the AWISDM are of the following sizes, shown in Table 2 AWISDM Dimension Tables Sizes.

Table Name

Average Field Size

Field Count

Row Count

File Size

DimCustomer

41.4 bytes

14

18,484

10.22 Megabytes

DimDate

14.4 bytes

27

1,158

0.43 Megabytes

DimTerritory

52 bytes

5

10

≥ 2.54 Kilobytes

DimProductSubCategory

52 bytes

7

37

13.15 Kilobytes

Table 2 AWISDM Dimension Tables Sizes

As shown above, the Data mart fact table is exponentially larger than its dimension tables and will continue to grow at a much faster rate; as more and more sales data is added each day the dimension tables will remain relatively unchanged.

Design, develop and save an SSIS Package

Design an ETL Process to:

Integrate data from the four data sources into a staging database

Figure 16 Concept design for integrating the four data sources

The four data sources are uploaded into the transact process, for the UK sales a replacement takes place, replacing tyre with tire. All four data sources are then formatted to ensure that the data types are consistent throughout the process; the output from these conversions is then joined into a union all function to align the data being imported. See Figure 17 Data is imported from the four sources.

Figure 17 Data is imported from the four sources

Figure 18 the ETL Process

Load data to the data mart

Figure 19 Loading the data into the Data Mart

The key features of the ETL process

The key features of the ETL process are in the acronym, we firstly extract the data from the disparate data sources, we transact this data by converting the data types and then error check the data, and finally we load the data to the data mart.

Tasks designed to validate the data

Figure 20 A flow diagram to deal with error checking

Figure 21 Data flow to check for and correct errors

Figure 22 Output to Fact Table

Business Intelligence Applications.

Figure 23 BI Platform

As we can see from Figure 23 BI Platform, the Business Integration system is a collection of Microsoft SQL Server Services. The BI model is used to draw data from disparate sources and present a single unified view of the data in the form of a data mart. The BI toolkit allows for complex and demanding data process, such as data conversions and error validation to be undertaken simply. The output data mart would then be interrogated by the user utilising the Microsoft Office suite of applications, some of which are shown in Figure 24 Microsoft Office Integration.

Figure 24 Microsoft Office Integration

The use and application of the Office suite provides business decision makers with a familiar interface from which to build questions and queries of the data mart.

Slicing and Dicing the Cube

We start with a data point D1

Description: C:\Users\Andrew\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5\QRYTE20F\MC900197530[1].wmf

Figure 25 Single Data Point

We then build a

Description: C:\Users\Andrew\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5\T9H7E4RL\MP900443794[1].jpg

Figure 26 one Dimension of Data

Extension of data point

Description: C:\Users\Andrew\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5\T9H7E4RL\MP900438684[1].jpg

Figure 27 Two Dimensions of Data

Then we constrict a Cube, an example of this would be:

Cubic on Date

X             Date

Y              Customer

Z              Product

 

Cubic on Amount

X             Customer

Y              Product

Z              Amount

Description: C:\Users\Andrew\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5\QRYTE20F\MC900438451[1].jpg

Figure 28 Three Dimensional Data

Finally we build our cube

Reports

Description: C:\Users\Andrew\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5\QF6OTVZ9\MC900058959[1].wmfReports are simply to produce utilising Excel, Access or even Word.

Ad hoc Queries

Description: C:\Users\Andrew\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5\T9H7E4RL\MC900058872[1].wmfAd-hoc queries become simple to create as the star schema data mart is specifically designed with these in mind.

OLAP

As already stated a number of times (Saeed, 2011), “in a Data Warehouse decision support environment we are interested in the big picture”, we want to look at the data from a macro level instead of the micro level. For a macroscopic view aggregates are used.

In this example we look at the sales volume

·         i.e. number of items sold as a function of:

o   product

o   time

o   geography

Note that all three of them are dimensions.

The proceed with the analysis, a cube structure will be first created such that each dimension of the cube will correspond to each identified dimension, and within each dimension will be the corresponding hierarchy. The example further shows how the dimensions are “rolled-out” i.e. Province into divisions, then division into district, then district into city and finally cities into zones. Note that weeks could be rolled into a year and at the same time months can be rolled into quarters and quarters rolled into years. Based on these three dimensions a cube is created and shown in

Cube Operations

·         Rollup: summarize data

e.g., given sales data, summarize sales for last year by product category and region

·         Drill down: get more details

e.g., given summarized sales as above, find breakup of sales by city within each region, or within province or county

·         Slice and dice: select and project

e.g.: Sales of soft-bicycles in Chester during last quarter

·         Pivot: change the view of data

How we view the data stored in a cube.

There are four fundamental cubes operations which are:

(i)                  rollup

(ii)                drill down

(iii)               slice and dice

(iv)              pivoting

Data Mining

Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering, etc.

Data Mining commonly involves four classes of task:

(i)                  Classification

Arranges the data into predefined groups. For example an email program might attempt to classify an email as legitimate or spam. Common algorithms include Nearest neighbor, Naive Bayes classifier and Neural network.

(ii)                Clustering

Is like classification but the groups are not predefined, so the algorithm will try to group similar items together.

(iii)               Regression

Attempts to find a function which models the data with the least error. A common method is to use Genetic Programming.

(iv)              Association rule learning

Searches for relationships between variables. For example a store might gather data of what each customer buys. Using association rule learning, AdventureWorks Ltd can work out what products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as “market basket analysis”.

Decision Support

Figure 29 Decision Support