Sunday, 18 December 2016

LESSON-1: INTRODUCTION TO DATA WAREHOUSE

What is a Data Warehouse?

A data warehouse is an Integrated, Subject Oriented, Time Variant and Non-Volatile repository which provides support to the organisation/business for Decision making.

Integrated: Consistent naming conventions, format, encoding structures; from multiple data sources.



Subject Oriented: Data is arranged and optimised to answer business, questions specific to their interest. This means, data is organised and summarized by topic or SUBJECT AREAs (e.g., Sales, Marketing, and Finance etc.)
Time Variant: Data warehouse represents the behavioural changes of the data over a time period (e.g. 5-10 years of history data) which is significantly more compared to the operational systems, which in almost all the cases holds current data.
Non-Volatile: Data warehouse doesn’t require transaction processing, recovery and concurrency control mechanisms. As such it requires only two operations for data access: Data Load and Data Access. It is read-only and periodically refreshed.


What is Data Warehousing?

  •           Data Warehousing is nothing but the process of building and using the data warehouse.
  •           It is the science of storing data for the purpose of meaningful future analysis.
  •           It is the process of transforming the data into information.

What is Business Intelligence?

  •           This is the science of presenting historical data in a meaningful way (often by using different data visualisation techniques).
  •           Raw data stored in databases turns into valuable information through the implementation of Business Intelligence process.
  •           This generally involves the use of different analysis and reporting tools.

Characteristics of Data Warehouse

The key characteristics of the data warehouse includes the following:
  •           Some data is de-normalised for simplification and to improve performance.
  •           Large amounts of historical data are used.
  •           Queries often retrieve large amounts of data.
  •           Both planned and ad-hoc queries are common.
  •           The data load is controlled.

In general, high data throughput is the key to a successful data warehouse.

Operational Systems (OLTP)

  •           Operational Systems are designed to support operational monitoring.
  •           Operational systems generally hold the current data and the inventory data is generally purged after 3 months.
  •           Online Transaction Processing (OLTP) Systems handles the operational daily transaction. These systems handles the core business of an organisation.
  •           Examples:

o   Application Processing Systems
o   Account Transaction Systems
  •           OLTP Systems were built to automate the business transactions.
  •           Historical data is typically not needed or retained.
  •           Following are the key characteristics of OLTP Systems:

o   Supports large number of short real time transactions (INSERT, UPDATE and DELETES) and hence result in large number of small size tables
o   Faster query processing
o   Concurrent operations by a large number of users who use the database in their daily work for regular business operations
o   Contains detailed and current data
o   Schema used to store transactional databases is the entity model (usually 3NF)

Analytical Systems (OLAP)

  •           Informational system with current and historical data
  •           Historical data is used for Analytical Purposes
  •           Primarily supports Decision making
  •           Data required for decision making is collected from the operational systems and moved to the Data Warehouse.
  •           Online Analytical Processing (OLAP) refers to workloads where large amounts of historical data are processed to generate reports and perform data analysis.
  •           OLAP Databases are fed from OLTP databases.
  •           OLAP database store a large volume of the same transactional data as an OLTP database, but this data has been transformed through an ETL process to enable best performance for easy report generation and analytics.
  •           Following are the key characteristics of the OLAP systems:

o   Relatively low volume of transactions
o   Queries are very complex and often involve aggregations
o   Aggregated and Historical data stored in multidimensional schemas (usually STAR Schemas)
o   Response time is an effectiveness measure. (Quick read response time is desired)


Comparing OLTP and OLAP

Differentiating Factor
OLTP/Operational Systems
OLAP/Data Warehouse
Business Need and Usage
Critical for controlling and running the fundamental business task of an organization.
Needed to generate reports, run analytics useful for decision making in a multiple decision support environment.
Source Of Data
Operational Data. These are the original source of data and are the first data entry point for any organisation.
Consolidation data. Data comes from various OLTP systems.
Nature of Data Stored
Data represents the current snapshot of the ongoing business processes.
Data represents the business trends of the organisation. Multidimensional view of the various kind of business activities.
Database Tuning
Database is tuned for extremely fast Insert, Update and Deletes
Tuned only for quick reads.
Data Lifespan
Deals with data of short lifespan.
Deals with data f very large lifespan (historic)
Data Size
Data is of small size stored in large number of tables.
Data size is huge stored in comparatively less number of tables.
Database Design
Data is highly normalized. (Usually 3NF)
Data is somewhat de-normalized with fewer tables. (2NF/Star/Snowflake Schema)
Data Backup and Recovery
Backup religiously as this is very critical to run the business.
Generally don’t require high availability and data may be archived in external storage such as tapes and hence involves reloading of the OLTP data as a recovery method.
Examples
Banking Applications, Online Reservation systems, e-commerce systems etc.
Reporting for Sales, Marketing, Management Reporting and Financial Forecasting

Advantages and Dis-Advantages of Data Warehouse

Advantages:

  •           Clean, Consistent Data
  •           Integrated data from multiple sources and hence providing a centralised repository of data
  •           Performing new types of analysis
  •           Reduces the costs of accessing historical data
  •           Standardised data across the organisation
  •           Improving turnaround time for analysis and reporting
  •           Support ad hoc reporting and enquiry
  •           Removes informational processing load from transaction oriented database
  •           Shared data for easy accessibility

Disadvantages:

  •           Major schema transformation from each of the data sources to one schema in the data warehouse can represent more than 50% of the total data warehouse effort.
  •           Long initial implementation time and associated high cost
  •           Adding new data sources takes time and associated costs
  •           Typically data is static and dated
  •           Difficult to accommodate changes in the data type and ranges, data source schema, indexes and queries
  •           Typically can’t actively monitor changes in the data.

Database vs. Data Warehouse


Feature
Database
Data Warehouse
Characteristics
Operational Processing (OLTP)
Analytical Processing (OLAP)
Data
Current data
Historical Data
Database Design
Highly Normalized
Somewhat de-normalized
Database modelling
E-R Modelling
Dimensional Modelling
Unit Of Work
Short and Simple Transaction
Complex queries
Focus
“Data IN”
“Information OUT”
Orientation
Transaction
Analysis
Summarization
Highly Detailed
Summarized and Consolidated form
View
Flat Relational
Multidimensional
Access
Read/Write
Read
Number of Records Accessed
Small. Tens of Records
High, Millions of Records
Number Of Users
High. In thousands
Small. Hundreds
DB Size
100 MB to GB
100GB to TB
Priority
High performance and High Availability
High Flexibility
Efficiency Metric
Transaction throughput
Query throughput and response time

6 comments: