Complete Guide To ETL Testing

Sunday, 25 December 2016

LESSON-2: DATA WAREHOUSE – A DEEP DIVE

Data Warehouse Architecture

There are various ways in which a data warehouse can be built and implemented. We will discuss about the following different architectures with special reference to the first one:

Generic Two-Level Architecture
Independent Data Mart Model
Dependent Data Mart Model and Operational Data Store
Logical Data Mart and Active Data Warehouse
Three Layer Architecture

All of these architectures involve some form of Extract, Transform and Load (ETL).

Generic Two Level Architecture

Source Data Systems:

Source data coming into the data warehouse may be categorised into 4 broad categories:

Production Data:

i. This category of data comes from the various operational systems of the enterprise.

ii. While dealing with this data, we come across many variants of the data formats.

iii. There is no conformance of the data among the various operational systems of the enterprise. For example, a term like Account may have different meanings in different systems.

iv. Because of this data disparity, a greater challenge is to standardise and transform the data from different operational systems, convert the data and integrate the pieces into useful data for storage in data warehouse.

Internal Data:

i. This is the data internal to the organisation and generally is in form of “private” spreadsheets, documents, customer profiles and sometimes the departmental databases.

ii. This add complexity to the process of transforming and integrating the data in terms of determining the strategies for collecting data from spreadsheets, finding ways of taking data from textual documents.

iii. Again we need to schedule the acquisition of internal data. It is a collective judgement call on how much internal data should be included in the data warehouse.

Archived Data:

i. These are the old operational data which are stored in archived files (in flat files on disk storage).

ii. These are used for getting historical information which is a major characteristics of data warehouse.

iii. This type of data is useful for discerning patterns and analysing trends.

External Data:

i. These are the data from external sources (sources outside of the organisation).

ii. These are used to serve industry trends and compare performance against other organisations.

iii. These data don’t conform to our organisational formats and hence we have to design conversions of the data into our internal formats and data types.

iv. We need to organise data transmissions from external sources depending on when the data is received. In some cases, we may get data on request.

Data Staging Area:

After the data is extracted from various operational and external systems, we need to prepare the data for storing it in the data warehouse.
Data staging component consists of workbench which provides the function of data extraction, transformation and preparation for loading to the data warehouse.
This provides a place where the data is cleaned, changed, combined, converted, de-duplicated for use in the data warehouse.

Q: Why a Separate Staging Area?

A: Following are the reasons/benefits of having a separate staging area in a data warehouse implementation:

Various Source Systems have different allotted timing for data extraction.
Source systems are available for extraction during a specific time slot which is generally lesser than the overall data loading time. Hence it is good to have data in staging before we lose connection to the source system.
We may need to extract data based on some conditions which requires us to join two or more different systems together. We will not be able to perform a SQL query joining two tables from two physically different databases.
Data Warehouse’s data loading frequency doesn’t match with the refresh frequency of the source systems.
Extracted data from the same set of source systems are going to be used in multiple places.
ETL process involves more complex data transformation that require extra space to temporarily stage the data.
There is specific data reconciliation/ debugging requirements which warrants the use of staging area for pre, during or post load data validations.

Data Storage Component:

- Data storage for the data warehouse is a separate repository. This is because:

o Data repository for a data warehouse must contain historical data for analysis which is not the case with the operational systems.

o Data should be in structures suitable for analysis and not for quick retrieval of individual pieces of information.

o Data in operational data base could change from moment to moment. However analysts using the data from the data warehouse require stable data which represents snapshots at specific periods. Hence data warehouses are “read-only” repositories.

End-User Presentation Layer/Information Delivery Component:

This is the layer where the data in the data warehouse is utilised by wide range of different users for analysis and reporting purpose.
This layer involves different tools for presenting data to different users.

Ad Hoc Reports are pre-defined reports primarily for novice and casual users.
Provision for complex queries, Multi-Dimensional (MD) analysis and Statistical analysis cater to the needs of Business Analyst and Power Users.
Executive Information System (EIS) is meant for senior-executives and high level managers.
Data mining applications are knowledge discovery systems where the mining algorithm helps discover trends and pattern from the usage of the data.

Independent Data Mart Model

Data Mart à These are mini-warehouses, limited in scope and pertains to specific line of business.

Independent Data Mart à A data mart filled with the data extracted from the operational environment without the benefit of the data warehouse as these contain only the data from the specific operational systems.

Here there is separate ETL for Each Independent Data mart.
Data Access complexity increases because of multiple data marts.

Dependent Data Mart Model With Operational Data Store

Dependent Data Mart à These are filled exclusively from the enterprise data warehouse and its reconciled data.

Operational Data Store (ODS) à Integrated, Subject Oriented, updatable, current-valued, enterprise-wide and detailed database designed to serve operational users as they do decision support processing.

Here there is a single ETL for Enterprise Data Warehouse (EDW).
Dependent Data arts loaded from EDW.
ODS provides option for obtaining current data.

Logical Data Mart and Active Data Warehouse

Active Data Warehouse à An EDW which has the following features:

Accepts near-real-time feed of the transactional data
Records, analyses warehouse data and in near-real-time relays business rules to the data warehouse and systems of record so that immediate actions can be taken in response to business events.

ETL here is near-real-time
Data Marts are not separate databases and are logical views of the data warehouse.

Three Layer Architecture

Reconciled Data: Detailed, current data intended to be the single, authoritative source for all decision support.
Derived Data: Data that have been selected, formatted, and aggregated for end-user decision support application.
Metadata: Technical and business data that describes the properties or characteristics of other data.

Introduction To DSS

DSS stands for Decision Support System
Data Warehouse plays a major role in decisions making of the organisations by making the historical data available for trend analysis.
Basically there are two different types of decisions which are taken from different stages of the data warehouse which are below:

o Tactical Decisions: These decisions are taken for the day to day functioning of the organisations. These are generally derived from the Operational Data Store (ODS).

o Strategic Decisions: These decisions are taken for the long run. These decisions are derived from analysing the data in the data warehouse for trends and also are affected by the analysis of the external data. These are generally derived from the Data Warehouse/Data marts.

Sunday, 18 December 2016

LESSON-1: INTRODUCTION TO DATA WAREHOUSE

What is a Data Warehouse?

A data warehouse is an Integrated, Subject Oriented, Time Variant and Non-Volatile repository which provides support to the organisation/business for Decision making.

Integrated: Consistent naming conventions, format, encoding structures; from multiple data sources.

Subject Oriented: Data is arranged and optimised to answer business, questions specific to their interest. This means, data is organised and summarized by topic or SUBJECT AREAs (e.g., Sales, Marketing, and Finance etc.)

Time Variant: Data warehouse represents the behavioural changes of the data over a time period (e.g. 5-10 years of history data) which is significantly more compared to the operational systems, which in almost all the cases holds current data.

Non-Volatile: Data warehouse doesn’t require transaction processing, recovery and concurrency control mechanisms. As such it requires only two operations for data access: Data Load and Data Access. It is read-only and periodically refreshed.

What is Data Warehousing?

Data Warehousing is nothing but the process of building and using the data warehouse.
It is the science of storing data for the purpose of meaningful future analysis.
It is the process of transforming the data into information.

What is Business Intelligence?

This is the science of presenting historical data in a meaningful way (often by using different data visualisation techniques).
Raw data stored in databases turns into valuable information through the implementation of Business Intelligence process.
This generally involves the use of different analysis and reporting tools.

Characteristics of Data Warehouse

The key characteristics of the data warehouse includes the following:

Some data is de-normalised for simplification and to improve performance.
Large amounts of historical data are used.
Queries often retrieve large amounts of data.
Both planned and ad-hoc queries are common.
The data load is controlled.

In general, high data throughput is the key to a successful data warehouse.

Operational Systems (OLTP)

Operational Systems are designed to support operational monitoring.
Operational systems generally hold the current data and the inventory data is generally purged after 3 months.
Online Transaction Processing (OLTP) Systems handles the operational daily transaction. These systems handles the core business of an organisation.
Examples:

o Application Processing Systems

o Account Transaction Systems

OLTP Systems were built to automate the business transactions.
Historical data is typically not needed or retained.
Following are the key characteristics of OLTP Systems:

o Supports large number of short real time transactions (INSERT, UPDATE and DELETES) and hence result in large number of small size tables

o Faster query processing

o Concurrent operations by a large number of users who use the database in their daily work for regular business operations

o Contains detailed and current data

o Schema used to store transactional databases is the entity model (usually 3NF)

Analytical Systems (OLAP)

Informational system with current and historical data
Historical data is used for Analytical Purposes
Primarily supports Decision making
Data required for decision making is collected from the operational systems and moved to the Data Warehouse.
Online Analytical Processing (OLAP) refers to workloads where large amounts of historical data are processed to generate reports and perform data analysis.
OLAP Databases are fed from OLTP databases.
OLAP database store a large volume of the same transactional data as an OLTP database, but this data has been transformed through an ETL process to enable best performance for easy report generation and analytics.
Following are the key characteristics of the OLAP systems:

o Relatively low volume of transactions

o Queries are very complex and often involve aggregations

o Aggregated and Historical data stored in multidimensional schemas (usually STAR Schemas)

o Response time is an effectiveness measure. (Quick read response time is desired)

Comparing OLTP and OLAP

Differentiating Factor	OLTP/Operational Systems	OLAP/Data Warehouse
Business Need and Usage	Critical for controlling and running the fundamental business task of an organization.	Needed to generate reports, run analytics useful for decision making in a multiple decision support environment.
Source Of Data	Operational Data. These are the original source of data and are the first data entry point for any organisation.	Consolidation data. Data comes from various OLTP systems.
Nature of Data Stored	Data represents the current snapshot of the ongoing business processes.	Data represents the business trends of the organisation. Multidimensional view of the various kind of business activities.
Database Tuning	Database is tuned for extremely fast Insert, Update and Deletes	Tuned only for quick reads.
Data Lifespan	Deals with data of short lifespan.	Deals with data f very large lifespan (historic)
Data Size	Data is of small size stored in large number of tables.	Data size is huge stored in comparatively less number of tables.
Database Design	Data is highly normalized. (Usually 3NF)	Data is somewhat de-normalized with fewer tables. (2NF/Star/Snowflake Schema)
Data Backup and Recovery	Backup religiously as this is very critical to run the business.	Generally don’t require high availability and data may be archived in external storage such as tapes and hence involves reloading of the OLTP data as a recovery method.
Examples	Banking Applications, Online Reservation systems, e-commerce systems etc.	Reporting for Sales, Marketing, Management Reporting and Financial Forecasting

Advantages and Dis-Advantages of Data Warehouse

Advantages:

Clean, Consistent Data
Integrated data from multiple sources and hence providing a centralised repository of data
Performing new types of analysis
Reduces the costs of accessing historical data
Standardised data across the organisation
Improving turnaround time for analysis and reporting
Support ad hoc reporting and enquiry
Removes informational processing load from transaction oriented database
Shared data for easy accessibility

Disadvantages:

Major schema transformation from each of the data sources to one schema in the data warehouse can represent more than 50% of the total data warehouse effort.
Long initial implementation time and associated high cost
Adding new data sources takes time and associated costs
Typically data is static and dated
Difficult to accommodate changes in the data type and ranges, data source schema, indexes and queries
Typically can’t actively monitor changes in the data.

Database vs. Data Warehouse

Feature	Database	Data Warehouse
Characteristics	Operational Processing (OLTP)	Analytical Processing (OLAP)
Data	Current data	Historical Data
Database Design	Highly Normalized	Somewhat de-normalized
Database modelling	E-R Modelling	Dimensional Modelling
Unit Of Work	Short and Simple Transaction	Complex queries
Focus	“Data IN”	“Information OUT”
Orientation	Transaction	Analysis
Summarization	Highly Detailed	Summarized and Consolidated form
View	Flat Relational	Multidimensional
Access	Read/Write	Read
Number of Records Accessed	Small. Tens of Records	High, Millions of Records
Number Of Users	High. In thousands	Small. Hundreds
DB Size	100 MB to GB	100GB to TB
Priority	High performance and High Availability	High Flexibility
Efficiency Metric	Transaction throughput	Query throughput and response time