You've reached the end of your free preview.
Want to read all 659 pages?
Unformatted text preview: CSI ZG515/ SS ZG515
Data Warehousing
BITS Pilani Utpal Mukhopadhyay Pilani|Dubai|Goa|Hyderabad 1 BITS Pilani
Pilani|Dubai|Goa|Hyderabad Data Warehousing
2 Background
1980’s to early 1990’s
– Focus on computerizing business processes
– To gain competitive advantage By early 1990’s
– All companies had operational systems
– It no longer offered any advantage How to get competitive advantage?? 3
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLTP Systems:
Primary Purpose
Run the operations of the business
For example: Banks, Railway reservation etc.
Based on ER Data Modeling
Transaction based system
Data is always current valued
Little history is available
Data is highly volatile
Has “Intelligent keys”
4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLTP Systems
Has relational normalized design
Redundant data is undesirable
Consists of many tables
High volume retrieval is inefficient
Optimized for repetitive “narrow” queries
Common data in many applications 5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Problem: Heterogeneous Information
Sources
“Heterogeneities are everywhere”
Personal
Databases World
Wide
Web Scientific Databases
Digital Libraries Different interfaces Different data representations Duplicate and inconsistent information 6 CS 336
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Need for
Data Warehousing
Companies, over the years, gathered huge volumes of data
“Hidden Treasure”
Can this data be used in any way?
Can we analyze this data to get any competitive
advantage?
If yes, what kind of advantage? 7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 The Warehousing Approach Information integrated
in advance Stored in wh for direct
querying and analysis Clients Data
Warehouse Integration System Metadata ...
Extractor/
Monitor Source Extractor/
Monitor Source Extractor/
Monitor ... Source
8 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Benefits of
Data Warehousing
Allows “efficient” analysis of data
Competitive Advantage
Analysis aids strategic decision making
Increased productivity of decision makers
Potential high ROI
Classic example: Diaper and Beer
More recently: Polo shirts & Barbie dolls
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Benefits of
Data Warehousing
High query performance
– But not necessarily most current information
Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
Has caught on in industry
10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 What is a Data Warehouse?
A Practitioners Viewpoint
“A data warehouse is simply a single, complete, and
consistent store of data obtained from a variety of sources
and made available to end users in a way they can
understand and use it in a business context.”
-- Barry Devlin, IBM Consultant 11
11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 What is a Data Warehouse?
An Alternative Viewpoint
“A DW is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile
collection of data that is used primarily in organizational decision
making.”
-- W.H. Inmon, Building the Data Warehouse, 1992 12
12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executive 13
13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 A Data Warehouse is...
• Large volume of data (Gb, Tb)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
Examples
– All transactions ever at Sainsbury’s
– Complete client histories at insurance firm
– LSE financial information and portfolios
14
14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Decision Support Systems,
DW, & OLAP
Information technology to help the knowledge worker
(executive, manager, analyst) make faster and better
decisions. Data Warehouse is a DSS
A data warewhouse is an architectural construct of an
information system that provides users with current and
historical decision support information that is hard to
access or present in traditional operational systems. Data Warehouse is not an Intelligent system
On-Line Analytical Processing (OLAP) is an element of
DSS
15
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Why Separate Data Warehouse?
Performance
– Op dbs designed & tuned for known txs &
workloads.
– Complex OLAP queries would degrade
performance for op txs.
– Special data organization, access &
implementation methods needed for
multidimensional views & queries.
– Current and historical decision support
information that is hard to access or present in
traditional operational systems.
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Why Separate Data Warehouse?
Function
– Missing data: Decision support requires historical
data, which op dbs do not typically maintain.
– Data consolidation: Decision support requires
consolidation (aggregation, summarization) of
data from many heterogeneous sources: op dbs,
external sources.
– Data quality: Different sources typically use
inconsistent data representations, codes, and
formats which have to be reconciled. 17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse: Characteristics
Analysis driven
Ad-hoc queries
Complex queries
Used by top managers
Based on Dimensional Modeling
Denormalized structures 18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse:
Major Players
SAS institute
IBM
Oracle
Sybase
Microsoft
HP
Cognos
Business Objects
19
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse
A decision support database that is maintained
separately from the organization’s operational
databases.
A data warehouse is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile
collection of data that is used primarily in
organizational decision making
20
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Subject Oriented
Data Warehouse is designed around
“subjects” rather than processes
A company may have
– Retail Sales System
– Outlet Sales System
– Catalog Sales System Problems Galore!!!
DW will have a Sales Subject Area 21
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Subject Oriented Retail Sales System Catalog Sales
System Outlet Sales
System OLTP Systems Sales Subject Area Data Warehouse Subject-Oriented Sales Information 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Integrated
Heterogeneous Source Systems
Little or no control
Need to Integrate source data
For Example: Product codes could be
different in different systems
Arrive at common code in DW
“Surrogate keys”
23
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Non-Volatile(Read-Mostly)
Write
USER
Read OLTP Read
USER DW
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Time Variant
Most business analysis
has a time component Sales Trend Analysis
(historical data is required)
2001 2002 2003 2004 25
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehousing Architecture
Monitoring & Administration OLAP servers Metadata
Repository External
Sources
Operational
dbs Extract
Transform
Load
Refresh Analysis
Serve Query/
Reporting Data
Mining Data Marts 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLTP Systems:
Characteristics
Run the operations of the business
For example: Banks, Railway reservation etc.
Based on ER Data Modeling
Transaction based system
Data is always current valued
Little history is available
Data is highly volatile
Has “Intelligent keys” 27
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse: Characteristics
Analysis driven
Ad-hoc queries
Complex queries
Used by top managers
Based on Dimensional Modeling
Denormalized structures 28
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Populating & Refreshing the
Warehouse Data Extraction Data Cleaning Data Transformation Convert from legacy/host format to warehouse
format Load Sort, summarize, consolidate, compute views,
check integrity, build indexes, partition Refresh Bring new data from source systems
29
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ETL Process
Issues & Challenges Consumes 70-80% of project time Heterogeneous Source Systems Little or no control over source systems Source systems scattered Source systems operating in different time zones Different currencies Different measurement units Data not captured by OLTP systems Ensuring data quality
30
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Staging Area A storage area where extracted data is Cleaned Transformed Deduplicated Initial storage for data Need not be based on Relational model Spread over a number of machines Mainly sorting and Sequential processing COBOL or C code running against flat files Does not provide data access to users Analogy – kitchen of a restaurant 31 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Presentation Servers A target physical machine on which DW data is
organized for Direct querying by end users using OLAP Report writers Data Visualization tools Data mining tools Data stored in Dimensional framework Analogy – Sitting area of a restaurant 32
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Cleaning Why? Data warehouse contains data that is analyzed for
business decisions More data and multiple sources could mean more
errors in the data and harder to trace such errors Results in incorrect analysis Detecting data anomalies and rectifying them early
has huge payoffs Long Term Solution Change business practices and data entry tools Repository for meta-data 33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Soundex Algorithms Misspelled terms
For example NAMES
Phonetic algorithms – can find similar
sounding names
Based on the six phonetic classifications
of human speech sounds 34
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse Design
OLTP Systems are Data Capture Systems
“DATA IN” systems
DW are “DATA OUT” systems OLTP DW
35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Analyzing the DATA
Active Analysis – User Queries
– User-guided data analysis
– Show me how X varies with Y
– OLAP
Automated Analysis – Data Mining
– What’s in there?
– Set the computer FREE on your data
– Supervised Learning (classification)
– Unsupervised Learning (clustering)
36
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLAP Queries
How much of product P1 was sold in 1999 state
wise?
Top 5 selling products in 2002
Total Sales in Q1 of FY 2002-03?
Color wise sales figure of cars from 2000 to 2003
Model wise sales of cars for the month of Jan from
2000 to 2003 37
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Mining Investigations
Which type of customers are more likely to spend
most with us in the coming year?
What additional products are most likely to be sold
to customers who buy sportswear?
In which area should we open a new store in the
next year?
What are the characteristics of customers most
likely to default on their loans before the year is
out? 38
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Continuum of Analysis
OLTP Specialized
Algorithms SQL OLAP
Primitive &
Canned
Analysis Complex
Ad-hoc
Analysis Data Mining
Automated
Analysis 39
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 World’s Largest
Data Warehouse
SAP in conjunction with NetApp and several other partners
@ SAP/Intel data center in Santa Clara, California
12 petabytes (PB) of addressable storage had been created
Guinness World Record
Based on the SAP® HANA in-memory data platform, SAP IQ (formerly
Sybase IQ), and BMMsoft Federated EDMT.
NetApp® SAN storage
Contains more than 221 trillion transactional records
more than 100 billion unstructured documents, including emails, SMS,
and images
It also contains data from 30 billion sources, including users, smart
sensors, and mobile devices. Source:
An Insider’s View into the World’s Largest Data Warehouse
by Larry Freeman, NetApp 40 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Q&A BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Thank You BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 CSI ZG515/ SS ZG515
Data Warehousing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad UtpalMukhopadhyay
Session 2
5th August 2018
1 BITS Pilani
Pilani|Dubai|Goa|Hyderabad Introduction to Dimensional Modelling 2 Data Warehouse Design
OLTP Systems are Data Capture Systems
“DATA IN” systems
DW are “DATA OUT” systems OLTP DW BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Design Requirements
Design of the DW must directly reflect the way the
managers look at the business Should capture the measurements of
importance along with parameters by
which these parameters are viewed It must facilitate data analysis, i.e.,
answering business questions BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ER Modeling
A logical design technique that seeks to eliminate data
redundancy
Illuminates the microscopic relationships among data
elements
Perfect for OLTP systems
Responsible for success of transaction processing in
Relational Databases BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Problems with ER Model
ER models are NOT suitable for DW?
End user cannot understand or remember an
ER Model
Many DWs have failed because of overly
complex ER designs
Not optimized for complex, ad-hoc queries
Data retrieval becomes difficult due to
normalization
Browsing becomes difficult BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ER vs Dimensional Modeling
Dimensional modeling (DM) is the name of a
logical design technique often used for data
warehouses. It is different from, and contrasts with,
entity-relation modeling (ER).
DM is the only viable technique for databases that
are designed to support end-user queries in a data
warehouse. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ER vs Dimensional Modeling
ER models are constituted to
– Remove redundant data (normalization)
– Facilitate retrieval of individual records having certain
critical identifiers
– Thereby optimizing OLTP performance Dimensional model supports the reporting and
analytical needs of a data warehouse system. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimensional Modeling:
Salient Features
Represents data in a standard framework
Framework is easily understandable by end
users
Contains same information as ER model
Packages data in symmetric format
Resilient to change
Facilitates data retrieval/analysis BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimensional Modeling:
Vocabulary
Measures or facts
Facts are “numeric” & “additive”
For example; Sale Amount, Sale Units
Factors or dimensions
Star Schemas
Snowflake & Starflake Schemas Sales Amt = f (Product,Location,Time)
Fact Dimensions
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimensional Modeling
Facts are stored in FACT Tables
Dimensions are stored in DIMENSION tables
Dimension tables contains textual descriptors
of business
Fact and dimension tables form a Star Schema
“BIG” fact table in center surrounded by
“SMALL” dimension tables BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Star Schema FK FK Product
Dimension Location
Dimension Sales Fact
Table FK
Time
Dimension FK
Promotion
Dimension
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 The “Classic” Star Schema
Store Dimension
STORE KEY
Store Description
City
State
District ID
District Desc.
Region_ID
Region Desc.
Regional Mgr. Fact Table
STORE KEY
PRODUCT KEY
PERIOD KEY
Dollars_sold
Units
Dollars_cost Product Dimension Time Dimension
PERIOD KEY
Period Desc
Year
Quarter
Month
Day PRODUCT KEY
Product Desc.
Brand
Color
Size
Manufacturer
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Fact Tables
Contains numerical measurements of the
business
Each measurement is taken at the intersection
of all dimensions
Intersection is the composite key
Represents Many-to-many relationships
between dimensions
Examples of facts
Sale_amt, Units_sold, Cost, Customer_count BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Fact Table Granularity
Granularity
The first step in designing a fact table is to determine
the granularity of the fact table. By granularity, we
mean the lowest level of information that will
be stored in the fact table. This constitutes two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each
dimension the information will be kept.
The determining factors usually goes back to the
requirements.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimension Tables
Contains attributes for dimensions
50 to 100 attributes common
Best attributes are textual and descriptive
DW is only as good as the dimension attributes
Contains hierarchal information albeit redundantly
Entry points into the fact table BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Which Dimensions To Include
Determining which dimensions to include is usually a
straightforward process, because business processes will
often dictate clearly what are the relevant dimensions.
For example, in an offline retail world, the dimensions for a
sales fact table are usually time, geography, and product.
This list, however, is by no means a complete list for all offline
retailers. A supermarket with a Rewards Card program, where
customers provide some personal information in exchange for
a rewards card, and the supermarket would offer lower prices
for certain items for customers who present a rewards card at
checkout, will also have the ability to track the customer
dimension. Whether the data warehousing system includes
the customer dimension will then be a decision that needs to
be made.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Types of Facts
Fully-additive-all dimensions
– Units_sold, Sales_amt Semi-additive-some dimensions
– Account_balance, Customer_count
28/3,tissue paper,store1, 25, 250,20
28/3,paper towel,store1, 35, 350,30
Is no. of customers who bought either tissue paper or paper
towel is 50? NO. Non-additive-none
– Gross margin=Gross profit/amount
– Note that GP and Amount are fully additive
– Ratio of the sums and not sum of the ratios BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse: Design Steps
Step 1: Identify the Business Process
Step 2: Declare the Grain
Step 3: Identify the Dimensions
Step 4: Identify the Facts BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Retail Store
Most organizations have an underlying value chain
consisting of their key business processes. The value chain
identifies the natural, logical flow of an organization’s
primary activities. For example, in the case of a retailer, the
company may issue a purchase order to a product
manufacturer. The products are delivered to the retailer’s
warehouse,...
View
Full Document
- Summer '19
- Data Management, Data Mining, Convert