Sunday, 25 December 2016

LESSON-2: DATA WAREHOUSE – A DEEP DIVE

Data Warehouse Architecture

There are various ways in which a data warehouse can be built and implemented. We will discuss about the following different architectures with special reference to the first one:
  •        Generic Two-Level Architecture
  •        Independent Data Mart Model
  •        Dependent Data Mart Model and Operational Data Store
  •        Logical Data Mart and Active Data Warehouse
  •        Three Layer Architecture

All of these architectures involve some form of Extract, Transform and Load (ETL).

Generic Two Level Architecture


Source Data Systems:

Source data coming into the data warehouse may be categorised into 4 broad categories:

Production Data:

                                                               i.      This category of data comes from the various operational systems of the enterprise.
                                                             ii.      While dealing with this data, we come across many variants of the data formats.
                                                            iii.      There is no conformance of the data among the various operational systems of the enterprise. For example, a term like Account may have different meanings in different systems.
                                                           iv.      Because of this data disparity, a greater challenge is to standardise and transform the data from different operational systems, convert the data and integrate the pieces into useful data for storage in data warehouse.

 Internal Data:

                                                               i.      This is the data internal to the organisation and generally is in form of “private” spreadsheets, documents, customer profiles and sometimes the departmental databases.
                                                             ii.      This add complexity to the process of transforming and integrating the data in terms of determining the strategies for collecting data from spreadsheets, finding ways of taking data from textual documents.
                                                            iii.      Again we need to schedule the acquisition of internal data. It is a collective judgement call on how much internal data should be included in the data warehouse.

Archived Data:

                                                               i.      These are the old operational data which are stored in archived files (in flat files on disk storage).
                                                             ii.      These are used for getting historical information which is a major characteristics of data warehouse.
                                                            iii.      This type of data is useful for discerning patterns and analysing trends.

 External Data:

                                                               i.      These are the data from external sources (sources outside of the organisation).
                                                             ii.      These are used to serve industry trends and compare performance against other organisations.
                                                            iii.      These data don’t conform to our organisational formats and hence we have to design conversions of the data into our internal formats and data types.
                                                           iv.      We need to organise data transmissions from external sources depending on when the data is received. In some cases, we may get data on request.

Data Staging Area:

  •           After the data is extracted from various operational and external systems, we need to prepare the data for storing it in the data warehouse.
  •           Data staging component consists of workbench which provides the function of data extraction, transformation and preparation for loading to the data warehouse.
  •           This provides a place where the data is cleaned, changed, combined, converted, de-duplicated for use in the data warehouse.

Q: Why a Separate Staging Area?

A: Following are the reasons/benefits of having a separate staging area in a data warehouse implementation:
  •          Various Source Systems have different allotted timing for data extraction.
  •          Source systems are available for extraction during a specific time slot which is generally lesser than the overall data loading time. Hence it is good to have data in staging before we lose connection to the source system.
  •          We may need to extract data based on some conditions which requires us to join two or more different systems together. We will not be able to perform a SQL query joining two tables from two physically different databases.
  •          Data Warehouse’s data loading frequency doesn’t match with the refresh frequency of the source systems.
  •          Extracted data from the same set of source systems are going to be used in multiple places.
  •          ETL process involves more complex data transformation that require extra space to temporarily stage the data.
  •          There is specific data reconciliation/ debugging requirements which warrants the use of staging area for pre, during or post load data validations.

Data Storage Component:

-          Data storage for the data warehouse is a separate repository. This is because:
o   Data repository for a data warehouse must contain historical data for analysis which is not the case with the operational systems.
o   Data should be in structures suitable for analysis and not for quick retrieval of individual pieces of information.
o   Data in operational data base could change from moment to moment. However analysts using the data from the data warehouse require stable data which represents snapshots at specific periods. Hence data warehouses are “read-only” repositories.

End-User Presentation Layer/Information Delivery Component:

  •           This is the layer where the data in the data warehouse is utilised by wide range of different users for analysis and reporting purpose.
  •           This layer involves different tools for presenting data to different users.




  •           Ad Hoc Reports are pre-defined reports primarily for novice and casual users.
  •           Provision for complex queries, Multi-Dimensional (MD) analysis and Statistical analysis cater to the needs of Business Analyst and Power Users.
  •           Executive Information System (EIS) is meant for senior-executives and high level managers.
  •           Data mining applications are knowledge discovery systems where the mining algorithm helps discover trends and pattern from the usage of the data.

Independent Data Mart Model

Data Mart à These are mini-warehouses, limited in scope and pertains to specific line of business.

Independent Data Mart à A data mart filled with the data extracted from the operational environment without the benefit of the data warehouse as these contain only the data from the specific operational systems.

  •          Here there is separate ETL for Each Independent Data mart.
  •          Data Access complexity increases because of multiple data marts.

Dependent Data Mart Model With Operational Data Store

Dependent Data Mart à These are filled exclusively from the enterprise data warehouse and its reconciled data.

Operational Data Store (ODS) à Integrated, Subject Oriented, updatable, current-valued, enterprise-wide and detailed database designed to serve operational users as they do decision support processing.

  • Here there is a single ETL for Enterprise Data Warehouse (EDW).
  • Dependent Data arts loaded from EDW.
  • ODS provides option for obtaining current data.

Logical Data Mart and Active Data Warehouse

Active Data Warehouse à An EDW which has the following features:

  •           Accepts near-real-time feed of the transactional data
  •           Records, analyses warehouse data and in near-real-time relays business rules to the data warehouse and systems of record so that immediate actions can be taken in response to business events.


  •          ETL here is near-real-time
  •          Data Marts are not separate databases and are logical views of the data warehouse.

Three Layer Architecture

  •          Reconciled Data: Detailed, current data intended to be the single, authoritative source for all decision support.
  •          Derived Data: Data that have been selected, formatted, and aggregated for end-user decision support application.
  •          Metadata: Technical and business data that describes the properties or characteristics of other data.

Introduction To DSS

  •           DSS stands for Decision Support System
  •           Data Warehouse plays a major role in decisions making of the organisations by making the historical data available for trend analysis.
  •           Basically there are two different types of decisions which are taken from different stages of the data warehouse which are below:

o   Tactical Decisions: These decisions are taken for the day to day functioning of the organisations. These are generally derived from the Operational Data Store (ODS).

o   Strategic Decisions: These decisions are taken for the long run. These decisions are derived from analysing the data in the data warehouse for trends and also are affected by the analysis of the external data. These are generally derived from the Data Warehouse/Data marts.

3 comments:

  1. Thanks for Sharing a Very Informative Post & I read Your Article & I must say that is very helpful post for us.
    Data Science

    Python

    Selenium

    ETL Testing

    AWS

    ReplyDelete
  2. Thanks for sharing this. Here are a few links for (ETL testing / software testing) job seekers,
    ETL Testing Jobs
    Job Portal For Software Testers

    ReplyDelete