Andrew Jones

0817804 J01209

 

CO7107 Data Warehousing and Multi-dimensional Databases

Assessment


 

Table of Contents

Table of Figures. 1

Introduction. 2

The evolution of databases. 2

The information environment. 3

The alternatives to SQL. 3

The data warehouse. 4

Star schema. 4

Snowflake schema. 4

Fact constellation. 5

Data modelling. 5

System design. 6

Data mining. 7

Conclusion. 8

Bibliography. 9

 

Table of Figures

Figure 1 the types of noSQL databases. 3

Figure 2 the database environment as idealised by the noSQL fan brigade. 4

Figure 3 Traditional process leading towards a Business Intelligence system.. 4

Figure 4 Enterprise Information Systems Architecture. 5

Figure 5 the redundancy of schema less design. 5

Figure 6 Transitioning from a data model to a database. 6

Figure 7 Data Mining Algorithms. 7


 

Introduction

The evolution of the modern Relational Database Management System has been shaped by the database itself, but the problem remains the “distance” between the decision maker and the data. The complexity arises when the decision maker requires information from the data in the form of answers to queries. Now it is about the toolkit that allows for the searching, indexing and querying of vast amounts of disparate data sources that exist within the enterprise; enter the data warehouse: a read only set of data and query results all shaped up and fit for mining. But for today’s management the time taken to run the warehousing routine and subsequently mine it is too long! They, management, want answers to unstructured questions and general grammatical queries in milliseconds utilising all of the available warehoused information and up to the moment transaction data.

Is it that senior managers, or managers in general, lack the basic information retrieval skills necessary to seek and utilise the toolkit resources available to them within the enterprise data structure?

The evolution of databases

If you can cast your mind back as far as the Halcion days of the 1960’s then this is where there the computer database began; 1968 saw the arrival of the file based database in which data is stored and maintained in a flat file. Things really took off with the next generation Hierarchical data model databases. They were enormously complex in implementation and often difficult to manage. There were no standards and the systems databases could not efficiently manage multiple many to many relationships. On the plus side they were efficient at searching, together with there being less redundant data. Data was independent together with the implementation of security and referential integrities. The next iteration of database development came with the advent of the schema and was brought about by the network database: the advantages of which were its’ ability to handle many more relationship types, combined with the simpler access of data. On the downside the systems were complex, difficult to design and maintain. In 1970 E.F. Codd created the relational database model, it is this 3rd generation model that has been so massively successful that it forms the vast majority of data stored and processed is utilising these RDBMs to the present day and into the foreseeable future. The advantages of these RDBMs: data management, redundancy, consistency, abstraction, integrity, standards, accessibility and maintenance far outweigh the complexity that comes with size and the higher impact of failure. The fourth generation databases are the object-orientated database, they can efficiently manage large numbers of differing data types, although switching to an OODBMS would require stating fresh and re-doing the entire database. Objects with complex behaviours are managed without difficulty utilising inheritance and polymorphism; because they are typically tied to an API OODBMS their flexibility is reduced. Querying is another feature that does not implement well, you are unable to join two classes as you would be able to join two tables. Principally queries rely on the effective design of the system. There are also problems with the bulk deletion of data. A hybrid emerges: in the form of object relational databases, looking to extend over both the relation and object orientated concepts. They had large storage capacities although the architecture is not web friendly, but it is fast with high access speeds. Fifth generation databases brings about a divergence in the way that data is managed and stored with the database. Until the emergence of noSQL database to gain in performance you scaled up, with noSQL you scale out. This cost effective way of improving database performance is attractive to modern business enterprises. Although they con not replace the ACID compliant SQL based OLTP with the generally BASE compliant noSQL databases. Now with noSQL you have four different thematic approaches to managing you data. (Haadi, 2010)

The information environment

Most commonly identified as “the information environment is the aggregate of individuals, organizations, and systems that collect, process, disseminate or act on information.” In today’s world this leads to a more complex information environment. There has been an exponential explosion in data and data capture, as the internet of things starts to gain a foothold. The growth of “things” has led to an enormous growth in sparse data in which schema less data structures are become the vogue allowing the storage of data in the database without prior knowledge of its definition. There is a downside to schema less data model in that: at some point it is necessary to derive or imply a schema.

Bringing together the disparate data stores and systems management processes within the enterprise into a homogenous whole is the goal of the information systems manager and is primarily considered as the enterprise information environment. One of the primary reasons for designing and developing a database is for its use as an integration tool.

Ask the right questions of your data!

The alternatives to SQL

Figure 1 the types of noSQL databases

In a nutshell if you are serious about your enterprise information system then there is not a viable alternative to SQL RDBMs. It’s super-fast, reliable, scalable, resilient and cost effective. noSQL is simply muddying the waters for the media savvy. To start with it is not just one alternative technology, but they share a scale out philosophy and are predominantly open source which makes choosing one of them a difficult choice as vendor lock in becomes critical to the success of its implementation within the enterprise. So thematically there are four database types within the noSQL camp each having a competitive advantage in the data environment. However, within each theme there are a plethora of alternates as there are over 250 different offerings: all of them different, all of them proprietary in nature; and, all of them demand more work, effort, resources, time and money than originally planned for and advertised. Object database management systems still exist in an extremely although in an extremely thin niche.

Figure 2 the database environment as idealised by the noSQL fan brigade

SQL relational databases do not have scalability issues and problems, Facebook recently extended its SQL systems to cover databases in the petabyte range of size. noSQL is cluster friendly so scales horizontally efficiently, which is on a per unit basis more cost attractive than up scaling an SQL server. A point which truly expresses the lack of thought within the noSQL camp is that the very “popular” description noSQL came from a meeting hashtag on twitter. noSQL lacks a standard language SQL has been ANSI almost since the beginning. Because of the factional nature of the noSQL market there are a severe and stark lack of skills (Cruz, 2012).

·         RDBMS == ACID

·         noSQL == BASE

For the foreseeable future SQL is going to continue to attract the majority of investment, with noSQL utilised and applied in specific project arenas that aid and abet in the construction and management of the data warehouse as well as being implemented in the OLAP and BI processes.

The data warehouse

Everything ploughed into the data warehouse is a solid philosophy. A data warehouse is not a technology but an architecture. The advantage is gained in recognising the strategic value of a data warehouse. Having identified the need the architectural requirements mean that the organisation benefits from simply planning and designing for a data warehouse.

Figure 3 Traditional process leading towards a Business Intelligence system

Star schema

This is often considered as the simplest of the data warehouse schemas, as such is utilised pervasively. A single fact table is linked to multiple dimension tables radially: by use of a primary and foreign key relationship. The fact table is third normal form, whereas the dimension tables are de-normalised.

Snowflake schema

This schema is again often considered as more complex. As with the star schema the fact table is third normal form, but this time round the dimension tables are also third normal form. Although the schema saves on redundant data and space within the database, there is an increase in the number of joins required to build the dimension tables.

Fact constellation

This schema requires the deconstruction of the star schema into to multiple fact tables with many shared dimensional links. Fact constellations are difficult to design due to the aggregation of variants.

 

The language of relationships is changing and evolving with trends.

Figure 4 Enterprise Information Systems Architecture

Data modelling

Not believing in schema less ideas at the on-set, because a schema will always have to be derived, so why not start with one? It is the constant tick of time and “management” demands for it now! Without understanding that it takes time to learn the skills, design the applications and deploy the solutions. Systems analysis has developed in types to the point of ridicule; systems are deployed as solutions from the first moment, without thought or planning, then fixed along the way, meanwhile the project teams have churned so many times that there is no one left who actually understands the code, let alone the project. Start with a plan: add to it! Then refine and tinker along the way. That way the project will understand who and how, and the: what, where when and why?: are clearly evident to those involved.

Figure 5 the redundancy of schema less design

Feature

Conceptual

Logical

Physical

Entity Names

Entity Relationships

Attributes

Primary Keys

Foreign Keys

Table Names

Column Names

Column Data Types

Table 1 Data Modelling model types and their features

Methodologies and techniques utilised by the data modelling process provide a: standard, consistent and predictable: aspect with which to administer data as a viable resource within the enterprise. By managing data as a resource we are able to ensure close-fitting synergies for the integration of the enterprises information systems as a complete all-inclusive entity. It is especially advantageous in the functional design and development of databases and the data warehouse, the data repositories of the enterprise IS.

System design

There is an enterprise requirement for the organisational structure for storing data. Lists carry redundant data in multiple themes or concepts. These lists are broken apart by theme and concept to provide relational tables. Third normal form relational databases are more complex requiring effort, and as such are hard work; not an attractive feature for the masses, who demand things are easy and effortless.

SQL is an international standard language, recognised and implemented by the biggest names in the software industry, the trendy new noSQL brigade have not got a common standard, each of the more than 250 noSQL databases requires that you learn their own proprietary query language.

There is a great deal of form without function. It seems today that the ethos and very fundamentals of design are wrong.

Normalisation. When is de-normalisation appropriate? From relation to tables and entities that have been derived by data modelling.

Figure 6 Transitioning from a data model to a database

noSQL is vulnerable to modification anomalies.

There comes a point when the computational expense of multiple queries with complex joins means a denormalisation of the data is necessary.

Denormalisation exposes problems that are not in normal form databases, in the end it is all for nothing.

 

Costs of denormalisation

Benefits of denormalisation

·         Modification anomalies

·         Redundant data

·         More storage required

·         Simplicity

·         Query performance improved

Table 2 Cost benefit analysis of denormalisation

Data mining

For this section we will be recreating the first chapter of the book Mining the Social Web, 2nd Edition

In chapter 1: Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More: we will work through a number of exercises. These cover a number of distinct problems when utilising the programming language Python to interface with and interrogate the Twitter API.

Figure 7 Data Mining Algorithms

The twitter exercise was tedious and complicated, in addition to being completely uninteresting as a scientific exercise. From the exercise we learned that the Python programming language is the province of esoteric code writers: “anyone can hack a piece of code” has been said, and that is what programming in Python is like. The language construct is minimal by design, it lacks IDE support and interfaces, and it is predominantly made up of third party libraries and is so incredibly whitespace unfriendly.

The exercise takes us through the authentication and creation of a Twitter object which is then integrated in the iPython interface. First just a print to screen and then some formatted output utilising some imported library functions. It starts to get interesting towards the end of the exercises, where matplotlib is imported and some of the results are graphed and charted. Python is a powerful programming language and some of the libraries available make it data analytics system. It would be and has tremendous value and applications at the Extraction and Transaction phases in the preparation and implementation of the data warehouse.

The introduction of the Pandas library really enhances the performance of Python, it makes the slicing and dicing of data sets remarkably straight forwards. The problem is the subject: Twitter: Understanding that a tweet is 140 characters and the data set for a tweet is approximately 8K bytes, the amount of augmented data and meta data is enormous, but this is where it apparently falls down is the analysis model. Yes we can analyse what’s trending, what devices, where the tweets come from, how many times it’s been retweeted and a plethora of other analytics: but what’s the point? What is twitter all about? What contribution does it actually make? How is it benefitting anything? Put simply it makes no valid scientific contribution it is merely for the chattering non-cognisant. Celebrity status does not make us wise!

Conclusion

Although it is all change, everything still remains the same. It is a question of the big picture; the bigger the picture the more often the same patterns will remerge. In the beginning of database history there was no SQL, then SQL got invented, adopted and grew to dominance in the database industry; now comes no SQL in a resurgent attempt to gain market share, as it claims: a viable alternative to SQL.

It is the 80:20 rule. 80% of the database market will remain SQL and the rest comprising of the remaining 20% will battle for dominance among the market and data niches. As SQL scales into the petabyte range, it will not be long before the database sizes supported by SQL database systems reaches and exceeds the Exabyte range.

The information environment is going to continue to evolve in complexity, challenging the systems and database designers of now and tomorrow with difficult and multifaceted data sources, operational systems and reporting mechanisms.


 

Bibliography

Alexios Delis, N. R. (1993). Performance Comparison of Three Modern DBMS Architectures. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 19(2). Retrieved 07 31, 2014

Ali, A. (2013, 06 24). Comparing Data Warehouse Design Methodologies for Microsoft SQL Server. Retrieved 08 07, 2014, from MSSQLTips.com: http://www.mssqltips.com/sqlservertip/2976/comparing-data-warehouse-design-methodologies-for-microsoft-sql-server/

Ambler, S. (2014, 07 03). Data Modeling 101. Retrieved 07 03, 2014, from Agile Data: http://www.agiledata.org/essays/dataModeling101.html

Babcock, C. (2010, 09 10). What's So Great About NoSQL? InformationWeek(1279), 26-31. Retrieved 07 31, 2014, from http://search.proquest.com.voyager.chester.ac.uk/docview/1416192825/abstract?accountid=14620

Cruz, X. (2012, June 4). The Database Battle: MongoDB vs MySQL. Retrieved 08 30, 2014, from CloudTimes: http://cloudtimes.org/2012/06/04/the-database-battle-mongodb-vs-mysql/

Delis, A., & Roussopoulos, N. (1991, 09). Modern Client-Server DBMS Architectures. SIGMOD RECORD, 52-61. Retrieved 08 01, 2014

Evans, P. (2011, 01 11). Business Intelligence - SQL or NoSQL - Which is better. Retrieved 07 28, 2014, from Database Journal: http://www.databasejournal.com/sqletc/article.php/3919371/Business-Intelligence---SQL-or-NoSQL---Which-is-better.htm

George, S. (2012, April 01). Inmon vs. Kimball: Which approach is suitable for your data warehouse? Retrieved 08 07, 2014, from TechTarget.in: http://searchbusinessintelligence.techtarget.in/tip/Inmon-vs-Kimball-Which-approach-is-suitable-for-your-data-warehouse

Haadi, M. (2010, October 118). The Evolution of Database. Retrieved September 03, 2014, from All About Databases: http://mhaadi.wordpress.com/2010/10/18/the-evolution-of-database/

Inmon, W. H. (2002). Building the Data Warehouse (3rd ed.). (R. Elliott, Ed.) New York, USA: Robert Ipsen. Retrieved September 16, 2014

Kajeepeta, S. (2012, 05 14). NoSQL Everywhere? Nope. InformationWeek(1333), 31-32. Retrieved 07 31, 2014, from http://search.proquest.com.voyager.chester.ac.uk/docview/1021724923?pq-origsite=summon

Kavis Technology Consulting. (2010, 03 01). NoSQL vs. RDBMS: Apples and Oranges? Retrieved 07 28, 2014, from Kavis Technology Consulting: http://www.kavistechnology.com/blog/nosql-vs-rdbms-apples-and-oranges/

Loshin, D. (2013, 09 04). An Introduction to NoSQL Data Management for Big Data. Retrieved 08 07, 2014, from DataInformed: http://data-informed.com/introduction-nosql-data-management-big-data/

M.Thenmozhi, K. (2014, 06). AN ONTOLOGICAL APPROACH TO HANDLE MULTIDIMENSIONAL SCHEMA EVOLUTION FOR DATA WAREHOUSE. International Journal of Database Management Systems, 6(3), 33-52. doi:10.5121/ijdms.2014.6303

Microsoft Corporation. (2014, 07 03). Data Warehouse Design Considerations. Retrieved 07 03, 2014, from SQL Server: http://technet.microsoft.com/en-us/library/aa902672(v=sql.80).aspx

Stonebraker, M. (2010, 04). SQL Databases v. NoSQL Databases. communications of the acm, 53(4). doi:10.1145/1721654.1721659

Vasco Santos, R. S. (2014, April). TOWARDS A LOW COST ETL SYSTEM. International Journal of Database Management Systems ( IJDMS ), 6(2), 67-79. doi:10.5121/ijdms.2014.6205

Veronika Abramova, J. B. (2014, June). EXPERIMENTAL EVALUATION OF NOSQL DATABASES. International Journal of Database Management Systems, 6(3), 16. doi:10.5121/ijdms.2014.6301