CSI_ZG515_Consolidated_Slides_Compressed.pdf - CSI ZG515 SS ZG515 Data Warehousing BITS Pilani Utpal Mukhopadhyay Pilani|Dubai|Goa|Hyderabad 1 BITS

CSI_ZG515_Consolidated_Slides_Compressed.pdf - CSI ZG515 SS...

This preview shows page 1 out of 659 pages.

You've reached the end of your free preview.

Want to read all 659 pages?

Unformatted text preview: CSI ZG515/ SS ZG515 Data Warehousing BITS Pilani Utpal Mukhopadhyay Pilani|Dubai|Goa|Hyderabad 1 BITS Pilani Pilani|Dubai|Goa|Hyderabad Data Warehousing 2 Background 1980’s to early 1990’s – Focus on computerizing business processes – To gain competitive advantage By early 1990’s – All companies had operational systems – It no longer offered any advantage How to get competitive advantage?? 3 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLTP Systems: Primary Purpose Run the operations of the business For example: Banks, Railway reservation etc. Based on ER Data Modeling Transaction based system Data is always current valued Little history is available Data is highly volatile Has “Intelligent keys” 4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLTP Systems Has relational normalized design Redundant data is undesirable Consists of many tables High volume retrieval is inefficient Optimized for repetitive “narrow” queries Common data in many applications 5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Problem: Heterogeneous Information Sources “Heterogeneities are everywhere” Personal Databases World Wide Web Scientific Databases Digital Libraries Different interfaces Different data representations Duplicate and inconsistent information 6 CS 336 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Need for Data Warehousing Companies, over the years, gathered huge volumes of data “Hidden Treasure” Can this data be used in any way? Can we analyze this data to get any competitive advantage? If yes, what kind of advantage? 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 The Warehousing Approach Information integrated in advance Stored in wh for direct querying and analysis Clients Data Warehouse Integration System Metadata ... Extractor/ Monitor Source Extractor/ Monitor Source Extractor/ Monitor ... Source 8 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Benefits of Data Warehousing Allows “efficient” analysis of data Competitive Advantage Analysis aids strategic decision making Increased productivity of decision makers Potential high ROI Classic example: Diaper and Beer More recently: Polo shirts & Barbie dolls 9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Benefits of Data Warehousing High query performance – But not necessarily most current information Doesn’t interfere with local processing at sources – Complex queries at warehouse – OLTP at information sources Information copied at warehouse – Can modify, annotate, summarize, restructure, etc. – Can store historical information – Security, no auditing Has caught on in industry 10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 What is a Data Warehouse? A Practitioners Viewpoint “A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.” -- Barry Devlin, IBM Consultant 11 11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 What is a Data Warehouse? An Alternative Viewpoint “A DW is a – subject-oriented, – integrated, – time-varying, – non-volatile collection of data that is used primarily in organizational decision making.” -- W.H. Inmon, Building the Data Warehouse, 1992 12 12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 A Data Warehouse is... • Stored collection of diverse data – A solution to data integration problem – Single repository of information • Subject-oriented – Organized by subject, not by application – Used for analysis, data mining, etc. • Optimized differently from transaction-oriented db • User interface aimed at executive 13 13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 A Data Warehouse is... • Large volume of data (Gb, Tb) • Non-volatile – Historical – Time attributes are important • Updates infrequent • May be append-only Examples – All transactions ever at Sainsbury’s – Complete client histories at insurance firm – LSE financial information and portfolios 14 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Decision Support Systems, DW, & OLAP Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions. Data Warehouse is a DSS A data warewhouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational systems. Data Warehouse is not an Intelligent system On-Line Analytical Processing (OLAP) is an element of DSS 15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Why Separate Data Warehouse? Performance – Op dbs designed & tuned for known txs & workloads. – Complex OLAP queries would degrade performance for op txs. – Special data organization, access & implementation methods needed for multidimensional views & queries. – Current and historical decision support information that is hard to access or present in traditional operational systems. 16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Why Separate Data Warehouse? Function – Missing data: Decision support requires historical data, which op dbs do not typically maintain. – Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. – Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled. 17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse: Characteristics Analysis driven Ad-hoc queries Complex queries Used by top managers Based on Dimensional Modeling Denormalized structures 18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse: Major Players SAS institute IBM Oracle Sybase Microsoft HP Cognos Business Objects 19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse A decision support database that is maintained separately from the organization’s operational databases. A data warehouse is a – subject-oriented, – integrated, – time-varying, – non-volatile collection of data that is used primarily in organizational decision making 20 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Subject Oriented Data Warehouse is designed around “subjects” rather than processes A company may have – Retail Sales System – Outlet Sales System – Catalog Sales System Problems Galore!!! DW will have a Sales Subject Area 21 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Subject Oriented Retail Sales System Catalog Sales System Outlet Sales System OLTP Systems Sales Subject Area Data Warehouse Subject-Oriented Sales Information 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Integrated Heterogeneous Source Systems Little or no control Need to Integrate source data For Example: Product codes could be different in different systems Arrive at common code in DW “Surrogate keys” 23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Non-Volatile(Read-Mostly) Write USER Read OLTP Read USER DW 24 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Time Variant Most business analysis has a time component Sales Trend Analysis (historical data is required) 2001 2002 2003 2004 25 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehousing Architecture Monitoring & Administration OLAP servers Metadata Repository External Sources Operational dbs Extract Transform Load Refresh Analysis Serve Query/ Reporting Data Mining Data Marts 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLTP Systems: Characteristics Run the operations of the business For example: Banks, Railway reservation etc. Based on ER Data Modeling Transaction based system Data is always current valued Little history is available Data is highly volatile Has “Intelligent keys” 27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse: Characteristics Analysis driven Ad-hoc queries Complex queries Used by top managers Based on Dimensional Modeling Denormalized structures 28 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Populating & Refreshing the Warehouse Data Extraction Data Cleaning Data Transformation Convert from legacy/host format to warehouse format Load Sort, summarize, consolidate, compute views, check integrity, build indexes, partition Refresh Bring new data from source systems 29 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ETL Process Issues & Challenges Consumes 70-80% of project time Heterogeneous Source Systems Little or no control over source systems Source systems scattered Source systems operating in different time zones Different currencies Different measurement units Data not captured by OLTP systems Ensuring data quality 30 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Staging Area A storage area where extracted data is Cleaned Transformed Deduplicated Initial storage for data Need not be based on Relational model Spread over a number of machines Mainly sorting and Sequential processing COBOL or C code running against flat files Does not provide data access to users Analogy – kitchen of a restaurant 31 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Presentation Servers A target physical machine on which DW data is organized for Direct querying by end users using OLAP Report writers Data Visualization tools Data mining tools Data stored in Dimensional framework Analogy – Sitting area of a restaurant 32 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Cleaning Why? Data warehouse contains data that is analyzed for business decisions More data and multiple sources could mean more errors in the data and harder to trace such errors Results in incorrect analysis Detecting data anomalies and rectifying them early has huge payoffs Long Term Solution Change business practices and data entry tools Repository for meta-data 33 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Soundex Algorithms Misspelled terms For example NAMES Phonetic algorithms – can find similar sounding names Based on the six phonetic classifications of human speech sounds 34 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse Design OLTP Systems are Data Capture Systems “DATA IN” systems DW are “DATA OUT” systems OLTP DW 35 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Analyzing the DATA Active Analysis – User Queries – User-guided data analysis – Show me how X varies with Y – OLAP Automated Analysis – Data Mining – What’s in there? – Set the computer FREE on your data – Supervised Learning (classification) – Unsupervised Learning (clustering) 36 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 OLAP Queries How much of product P1 was sold in 1999 state wise? Top 5 selling products in 2002 Total Sales in Q1 of FY 2002-03? Color wise sales figure of cars from 2000 to 2003 Model wise sales of cars for the month of Jan from 2000 to 2003 37 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Mining Investigations Which type of customers are more likely to spend most with us in the coming year? What additional products are most likely to be sold to customers who buy sportswear? In which area should we open a new store in the next year? What are the characteristics of customers most likely to default on their loans before the year is out? 38 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Continuum of Analysis OLTP Specialized Algorithms SQL OLAP Primitive & Canned Analysis Complex Ad-hoc Analysis Data Mining Automated Analysis 39 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 World’s Largest Data Warehouse SAP in conjunction with NetApp and several other partners @ SAP/Intel data center in Santa Clara, California 12 petabytes (PB) of addressable storage had been created Guinness World Record Based on the SAP® HANA in-memory data platform, SAP IQ (formerly Sybase IQ), and BMMsoft Federated EDMT. NetApp® SAN storage Contains more than 221 trillion transactional records more than 100 billion unstructured documents, including emails, SMS, and images It also contains data from 30 billion sources, including users, smart sensors, and mobile devices. Source: An Insider’s View into the World’s Largest Data Warehouse by Larry Freeman, NetApp 40 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Q&A BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Thank You BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 CSI ZG515/ SS ZG515 Data Warehousing BITS Pilani Pilani|Dubai|Goa|Hyderabad UtpalMukhopadhyay Session 2 5th August 2018 1 BITS Pilani Pilani|Dubai|Goa|Hyderabad Introduction to Dimensional Modelling 2 Data Warehouse Design OLTP Systems are Data Capture Systems “DATA IN” systems DW are “DATA OUT” systems OLTP DW BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Design Requirements Design of the DW must directly reflect the way the managers look at the business Should capture the measurements of importance along with parameters by which these parameters are viewed It must facilitate data analysis, i.e., answering business questions BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ER Modeling A logical design technique that seeks to eliminate data redundancy Illuminates the microscopic relationships among data elements Perfect for OLTP systems Responsible for success of transaction processing in Relational Databases BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Problems with ER Model ER models are NOT suitable for DW? End user cannot understand or remember an ER Model Many DWs have failed because of overly complex ER designs Not optimized for complex, ad-hoc queries Data retrieval becomes difficult due to normalization Browsing becomes difficult BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ER vs Dimensional Modeling Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. It is different from, and contrasts with, entity-relation modeling (ER). DM is the only viable technique for databases that are designed to support end-user queries in a data warehouse. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 ER vs Dimensional Modeling ER models are constituted to – Remove redundant data (normalization) – Facilitate retrieval of individual records having certain critical identifiers – Thereby optimizing OLTP performance Dimensional model supports the reporting and analytical needs of a data warehouse system. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimensional Modeling: Salient Features Represents data in a standard framework Framework is easily understandable by end users Contains same information as ER model Packages data in symmetric format Resilient to change Facilitates data retrieval/analysis BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimensional Modeling: Vocabulary Measures or facts Facts are “numeric” & “additive” For example; Sale Amount, Sale Units Factors or dimensions Star Schemas Snowflake & Starflake Schemas Sales Amt = f (Product,Location,Time) Fact Dimensions BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimensional Modeling Facts are stored in FACT Tables Dimensions are stored in DIMENSION tables Dimension tables contains textual descriptors of business Fact and dimension tables form a Star Schema “BIG” fact table in center surrounded by “SMALL” dimension tables BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Star Schema FK FK Product Dimension Location Dimension Sales Fact Table FK Time Dimension FK Promotion Dimension BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 The “Classic” Star Schema Store Dimension STORE KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Fact Table STORE KEY PRODUCT KEY PERIOD KEY Dollars_sold Units Dollars_cost Product Dimension Time Dimension PERIOD KEY Period Desc Year Quarter Month Day PRODUCT KEY Product Desc. Brand Color Size Manufacturer BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Fact Tables Contains numerical measurements of the business Each measurement is taken at the intersection of all dimensions Intersection is the composite key Represents Many-to-many relationships between dimensions Examples of facts Sale_amt, Units_sold, Cost, Customer_count BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Fact Table Granularity Granularity The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps: 1. Determine which dimensions will be included. 2. Determine where along the hierarchy of each dimension the information will be kept. The determining factors usually goes back to the requirements. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Dimension Tables Contains attributes for dimensions 50 to 100 attributes common Best attributes are textual and descriptive DW is only as good as the dimension attributes Contains hierarchal information albeit redundantly Entry points into the fact table BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Which Dimensions To Include Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what are the relevant dimensions. For example, in an offline retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all offline retailers. A supermarket with a Rewards Card program, where customers provide some personal information in exchange for a rewards card, and the supermarket would offer lower prices for certain items for customers who present a rewards card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Types of Facts Fully-additive-all dimensions – Units_sold, Sales_amt Semi-additive-some dimensions – Account_balance, Customer_count 28/3,tissue paper,store1, 25, 250,20 28/3,paper towel,store1, 35, 350,30 Is no. of customers who bought either tissue paper or paper towel is 50? NO. Non-additive-none – Gross margin=Gross profit/amount – Note that GP and Amount are fully additive – Ratio of the sums and not sum of the ratios BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Data Warehouse: Design Steps Step 1: Identify the Business Process Step 2: Declare the Grain Step 3: Identify the Dimensions Step 4: Identify the Facts BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Retail Store Most organizations have an underlying value chain consisting of their key business processes. The value chain identifies the natural, logical flow of an organization’s primary activities. For example, in the case of a retailer, the company may issue a purchase order to a product manufacturer. The products are delivered to the retailer’s warehouse,...
View Full Document

  • Left Quote Icon

    Student Picture

  • Left Quote Icon

    Student Picture

  • Left Quote Icon

    Student Picture