Business Intelligence and Tools
Unit 6: Data Extraction

Structure
6.1 Introduction
    Objectives
6.2 ETL Overview
    6.2.1 Significance of ETL Processes
    6.2.2 ETL Requirements and Steps
    Self Assessment Questions (SAQs)
6.3 Overview of the Data Extraction Process
    Self Assessment Questions (SAQs)
6.4 Types of Data in Operational Systems
    6.4.1 Current Value
    6.4.2 Periodic Status
    Self Assessment Questions (SAQs)
6.5 Source Identification
    Self Assessment Questions (SAQs)
6.6 Data Extraction Techniques
    6.6.1 Immediate Data Extraction
    6.6.2 Deferred Data Extraction
    Self Assessment Questions (SAQs)
6.7 Evaluation of the Techniques
    Self Assessment Questions (SAQs)
6.8 Summary
6.9 Terminal Questions (TQs)
6.10 Multiple Choice Questions (MCQs)
6.11 Answers to SAQs, TQs, and MCQs
    6.11.1 Answers to Self Assessment Questions (SAQs)
    6.11.2 Answers to Terminal Questions (TQs)
    6.11.3 Answers to Multiple Choice Questions (MCQs)

6.1 Introduction
In this unit, we discuss the data extraction process that is carried out from the source systems into data warehouses or data marts. As we have already discussed, data extraction is the first step in the execution of the ETL (Extraction, Transformation, and Loading) functions to build a data warehouse. Extraction can be done from an OLTP database as well as from non-OLTP systems such as text files, legacy systems, and spreadsheets. The data extraction process is complex in nature because of the tremendous diversity that exists among source systems in practice.

Objectives: The objectives of this unit are to make you understand:
- Source identification for extraction of the data
- Various methods used for data extraction
- Evaluation of the extraction techniques
- Exception handling in case some data has not been extracted properly

6.2 ETL Overview
Most of the information contained in a data warehouse comes from the operational systems. But, as we know, the operational systems cannot be used to provide strategic information. So you need to carefully understand what constitutes the difference between the data in the source operational systems and the information in the data warehouse. It is the ETL functions that reshape the relevant data from the source systems into useful information to be stored in the data warehouse. There would be no strategic information in a data warehouse in the absence of these functions.

6.2.1 Significance of ETL Processes
The ETL functions act as the back-end processes that cover the extraction of data from the source systems. They also include all the functions and procedures for changing the source data into the exact formats and structures appropriate for storage in the data warehouse database. After the transformation of the data, the processes include all processes that physically move the data into the data warehouse repository. Having captured the data, you cannot simply dump it into the data warehouse; you have to carefully subject the extracted data to all manner of transformations so that it becomes fit to be converted into information.
Data Extraction: Extraction from heterogeneous internal and external source systems
Data Transformation: Conversion and restructuring of data as per the transformation rules
Data Integration: Combining the entire data on the basis of source-to-target mapping
Data Cleansing: Scrubbing and enriching as per the cleansing rules
Data Summarization: Creating aggregate datasets on the basis of predefined procedures
Initial Data Loading: Applying initial data in large volumes to the warehouse
Metadata Updating: Maintaining and using metadata for the ETL functions
Ongoing Data Loading: Applying ongoing incremental loads and periodic refreshes to the warehouse
Fig 6.1: ETL Processes Summary

Let us try to understand the significance of the ETL functions by taking an example. Suppose you want to analyze and compare sales by store, product, and month, but the sales data is spread over various applications of your organization. You therefore have to bring the entire sales detail into the data warehouse database. You can do this by providing the sales and price in a fact table, the products in a product dimension table, the stores in a store dimension table, and the months in a time dimension table. To do this, you need to extract the data from the respective source systems, reconcile the variations in data representation among the source systems, transform the entire sales detail, and load the sales into the fact and dimension tables.

Thus the execution of ETL functions is challenging because of the nature of the source systems. Also, the amount of time spent on performing the ETL functions can be as much as 50-70% of the total effort put into building a data warehouse. To extract the data, you have to know the time window during each day in which data can be extracted from a specific source system without impacting the usage of that system. You also need to determine the mechanism for capturing the changes to the data in each of the relevant systems. Apart from the extraction, transformation, and loading functions themselves, building a data warehouse includes functions like data integration, data summarization, and metadata updating. Figure 6.1 summarizes the processes involved in executing the ETL functions in building a data warehouse.

6.2.2 ETL Requirements and Steps
Ideally, you are required to go through the following steps, shown in Fig. 6.2, for the execution of the ETL functions.

1. Determine the target data of the data warehouse
2. Identify the internal and external data sources
3. Map the sources to the target data elements
4. Establish comprehensive data extraction rules
5. Prepare data transformation and cleansing rules
6. Plan for aggregate tables
7. Organize the data staging area and test tools
8. Write the procedures for all data loads
9. Execute ETL functions for dimension tables
10. Execute ETL functions for fact tables
Fig 6.2: Steps involved in executing the ETL functions

Self Assessment Questions (SAQs) for Section 6.2
1. Mention the ETL functions and explain how they are important in building a data warehouse.
2. The implementation of the ETL functions includes various steps, from determining the source systems to preparing the fact and dimension tables. List the major steps involved in the ETL process in sequence.
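To tie the processes of Fig 6.1 and the steps of Fig 6.2 to the sales example above, the following is a minimal Python sketch of the extract-transform-load flow. The source records, field names, and month mapping are invented purely for illustration and do not belong to any real source system; a production load would read from the actual applications and use a proper database loader.

    # A minimal, hypothetical sketch of the ETL flow for the sales example.
    # Step 1: Extract - sales records arrive from two source applications that
    # represent the same facts differently.
    source_a = [{"store": "S01", "sku": "P100", "month": "2009-01", "amount": 1200.0}]
    source_b = [{"outlet": "S02", "product_code": "P100", "period": "JAN-2009", "sales": 800.0}]

    def transform_b(record):
        """Reconcile source B's representation with the warehouse conventions."""
        month_map = {"JAN": "01", "FEB": "02", "MAR": "03"}  # abbreviated for the sketch
        mon, year = record["period"].split("-")
        return {
            "store": record["outlet"],
            "sku": record["product_code"],
            "month": f"{year}-{month_map[mon]}",
            "amount": record["sales"],
        }

    # Step 2: Transform and integrate - bring every source to one target layout.
    staged = source_a + [transform_b(r) for r in source_b]

    # Step 3: Load - populate the dimension tables first, then the fact table,
    # in line with steps 9 and 10 of Fig 6.2.
    store_dim = {r["store"] for r in staged}
    product_dim = {r["sku"] for r in staged}
    time_dim = {r["month"] for r in staged}
    sales_fact = [(r["store"], r["sku"], r["month"], r["amount"]) for r in staged]

    print(sales_fact)  # [('S01', 'P100', '2009-01', 1200.0), ('S02', 'P100', '2009-01', 800.0)]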
6.3 Overview of the Data Extraction Process
In general, data extraction is performed within the source system itself, especially if it is a relational database to which the extraction procedures can easily be added. It is also possible for the extraction logic to sit in the data warehouse staging area and query the source system for data using ODBC (Open Database Connectivity), OLE DB (Object Linking and Embedding, Database), or other APIs (Application Programming Interfaces). The most common method of data extraction for legacy systems is to produce text files, although many newer systems offer direct query APIs or accommodate access through ODBC or OLE DB.

Data extraction in the context of a data warehouse is different from extraction for building an operational system. For a data warehouse, you have to extract data from many disparate sources, and you have to extract data on the changes for ongoing incremental loads as well as for a one-time initial full load. For operational systems, all you need is one-time extractions and data conversions. These two factors increase the complexity of data extraction for a data warehouse, so effective data extraction is a key to the success of your data warehouse. You therefore need to pay special attention to the issues involved and formulate a dedicated data extraction strategy. Some of the important issues you need to take care of while extracting data from the various source systems are as follows:
- Source identification, which involves identification of the source applications and source structures
- Method of data extraction from each of the source systems, and whether the extraction is to be carried out manually or by using a tool
- Preparation of a time window for each data source to denote the time for data extraction
- Determination of the extraction frequency for each data source, in order to establish how often the extraction must be done (daily, weekly, quarterly, and so on)
- Job sequencing, to determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully
- Exception handling, to determine an approach for handling input records that cannot be extracted

Self Assessment Questions (SAQs) for Section 6.3
1. What do you mean by 'data extraction'? Discuss the various issues involved while extracting data for building a data warehouse.
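As an illustration of extraction logic placed in the staging area, the hedged sketch below queries a relational source over ODBC and writes the result to a delimited staging file. It assumes the pyodbc package is installed and an ODBC data source is configured; the DSN, the orders table, and its column names are hypothetical.

    # A hedged sketch of pulling changed records from a source system into the
    # staging area over ODBC. The DSN, table, columns, and file layout are
    # assumptions for illustration only.
    import csv
    import pyodbc

    def extract_orders_to_staging(dsn: str, since: str, out_path: str) -> int:
        """Query the source system for records changed since the last extract
        and write them to a delimited staging file."""
        conn = pyodbc.connect(dsn)
        cursor = conn.cursor()
        # Hypothetical source table with a change-date column.
        cursor.execute(
            "SELECT order_id, customer_id, order_amount, last_update "
            "FROM orders WHERE last_update >= ?", since)
        rows = cursor.fetchall()
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="|")
            writer.writerow(["order_id", "customer_id", "order_amount", "last_update"])
            writer.writerows(rows)
        conn.close()
        return len(rows)

    # Typical use: run inside the agreed time window, at the agreed frequency, e.g.
    # extract_orders_to_staging("DSN=orders_src", "2009-01-31", "orders_extract.dat")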
6.4 Types of Data in Operational Systems
As the data in the data warehouse is extracted from the operational systems, we now discuss the characteristics of the data in an operational system. The data in the source operational systems is said to be time-dependent or temporal, as the source data changes with time. As business transactions keep changing the data in the source systems, the value of an attribute in a source system is its value at the current time. For instance, when a customer moves to another location, the customer address changes in the customer table of the source system. Similarly, the product data in the source system changes when a new feature is added to the existing product line. In the case of a change of address, only the current address of the customer matters in the operational system, but in the data warehouse the entire history is to be maintained. If you want to make a sales analysis based on location for a given time interval, the sales of the customer prior to the change of address are taken for the previous period, and the current address is considered for the current period.

Now we discuss how the data is stored in the source systems. There are two ways of storing the data, and the type of data extraction method you have to use depends on the nature of each of these two categories.

6.4.1 Current Value
The current value is the stored value of an attribute at that moment of time. These values are transient and change as and when business transactions happen; you cannot forecast when a value will change next. The current balance of a bank account and the stock price of a company are examples of this category. The value of an attribute remains constant only until a business transaction changes it, so data extraction that preserves the history of the changes in the data warehouse gets quite involved.

6.4.2 Periodic Status
Here, the value of the attribute is stored as a status every time a change occurs; that is, the status value is stored with reference to the time. For example, the data about an insurance policy is stored as the status of the policy at each point of time. The history of the changes is thus preserved in the source systems themselves, and the data extraction becomes relatively easier.

Self Assessment Questions (SAQs) for Section 6.4
1. What are the two ways of storing the data in the source systems?

6.5 Source Identification
Source identification is a critical part of the data extraction process. For instance, if you intend to design a data warehouse to provide strategic information on the fulfillment of orders, you need to store information on fulfilled and pending orders. If you deliver the orders through multiple channels, you also need to capture data about the delivery channels. If you give customers an option to check the status of an order, you are required to extract data on the status of each of these orders. To achieve this, you need to maintain attributes like total order amount, discounts, and expected delivery time in the fact table for order fulfillment. You also need to prepare dimension tables for product, customer, order disposition, delivery channel, and so on. You may take up the following step-wise approach for source identification:
- Prepare a list of all data items or facts needed in the fact tables
- Prepare a list of each dimension attribute from all dimensions
- Find the source system and source data item for every target data item
- Identify the preferred source where there are multiple sources for a specific data item
- Identify any single source field that feeds multiple target fields and establish splitting rules
- Ascertain the default values and inspect the source data for missing values

Self Assessment Questions (SAQs) for Section 6.5
1. Discuss the various steps involved in the source identification part of data extraction.
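One lightweight way to record the outcome of these steps is a simple source-to-target map, as in the sketch below. Every system name, field name, and rule shown is hypothetical and serves only to illustrate the idea; in practice this information lives in the project's metadata repository or ETL tool.

    # A hypothetical source-to-target map for the order-fulfillment example above.
    source_to_target = {
        # target item      : (preferred source system, source field, rule)
        "order_amount"     : ("order_entry",  "ORD_TOT_AMT",  "copy"),
        "discount"         : ("order_entry",  "ORD_DISC",     "default 0 when missing"),
        "delivery_channel" : ("delivery_app", "CHANNEL_CD",   "decode via lookup table"),
        "customer_name"    : ("crm_system",   "CUST_FULL_NM", "split into first/last name"),
    }

    for target, (system, field, rule) in source_to_target.items():
        print(f"{target:<18} <- {system}.{field:<14} [{rule}]")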
6.6 Data Extraction Techniques
The data that you move into the data warehouse to get it started is called the initial load. After the initial load, you need to keep updating the data warehouse so as to maintain the history of the changes.

Broadly, there are two types of data extraction from the source operational systems: "as is" (or static) data and data of revisions. 'Static data' is the capture of data at a specific point in time. For current-value data, this capture includes all transient data identified for extraction. In the case of data stored as periodic status, the capture has to include every status at each point of time as available in the source systems. Generally, static data capture is used for the initial load of the data warehouse, and you may also want a full refresh of a dimension table in certain cases. For instance, you may do a full refresh of the product dimension table if the product master of the source application is completely revamped.

'Data of revisions' (also known as incremental data capture) includes the revisions since the last time data was captured. If the source data is transient, capturing the revisions is a difficult exercise. For periodic status data, the incremental data capture includes the values of attributes at specific times; here, you need to extract the statuses recorded since the last date of extraction. Incremental data capture can be divided into immediate data extraction and deferred data extraction.

6.6.1 Immediate Data Extraction
Immediate data extraction is real-time data extraction. There are three methods of immediate data extraction, as discussed below.

6.6.1.1 Capture through Transaction Logs
In this method, the transaction logs of the DBMSs are used to capture the data. Whenever there is an update to a database table, the DBMS writes entries to the log file. The method reads the transaction log and selects all the committed transactions. Since logging is already a part of transaction processing, there is no additional overhead in this method. However, you need to ensure extraction of all the transactions before the log file gets refreshed. This method works well if all of the source systems are database applications. In the case of non-database applications, you need to use some other data extraction method, as there are no log files for these applications.

6.6.1.2 Capture through Database Triggers
Database triggers are specialized stored procedures or programs stored in the database that are fired when a pre-defined event occurs. Here, you create trigger programs for all events for which you need to capture the data. The output of these trigger programs is written to a separate file and is used to extract data for the data warehouse. For instance, you can write a trigger program to capture all updates and deletes on a specific table if you intend to capture all changes that happen to the records of that table. This method, too, is applicable only if the source systems are database applications. It is quite reliable, as the data capture occurs right at the source and you can capture both before and after images. However, you need to put in additional effort up front to build and maintain the trigger programs, and the execution of trigger procedures during transaction processing leads to additional overhead on the source systems.
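As a deliberately simplified illustration of trigger-based capture, the sketch below uses SQLite (chosen only because it ships with Python) to fire a trigger on every update of a customer table and record both the before and after images in a separate change table. The table and trigger names are invented, and a real implementation would use the trigger syntax of the actual source DBMS and write the captured changes to its own extract file or table.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, address TEXT);
        CREATE TABLE customer_changes (cust_id INTEGER, old_address TEXT,
                                       new_address TEXT, changed_at TEXT);

        -- Fired on every update; records both the before and after image.
        CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
        BEGIN
            INSERT INTO customer_changes
            VALUES (OLD.cust_id, OLD.address, NEW.address, datetime('now'));
        END;
    """)

    conn.execute("INSERT INTO customer VALUES (1, '12 Park Street')")
    conn.execute("UPDATE customer SET address = '7 Lake Road' WHERE cust_id = 1")

    # The extract program later reads customer_changes instead of scanning customer.
    print(conn.execute("SELECT * FROM customer_changes").fetchall())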
6.6.1.3 Capture in Source Applications
In this method, the source application itself is made to assist in the data capture for the data warehouse. You need to modify the relevant application programs that write to the source files and databases so that, besides writing to the source files and database tables, they also record all additions and deletions in a separate file; the extract programs can then use this file of changes to the source data. The advantage of this method is that it can be used for all types of source data, irrespective of the type of source system: databases, indexed files, or other flat files. The difficulty lies in revising the programs in the source operational systems and maintaining them properly. Also, the method may reduce the performance of the source applications, as additional processing is required to capture the changes to separate files. This method is also referred to as 'application-assisted data capture'.

6.6.2 Deferred Data Extraction
All the methods of immediate data extraction involve real-time data capture. In contrast, the deferred data extraction methods do not capture the changes in real time but do so at a later time.

6.6.2.1 Capture based on Date and Time Stamp
In this method, every source record that is created or updated is marked with a stamp showing the date and time. The data capture occurs at a later time, after the creation or updating of the source record, and the time stamp provides the basis for selecting records for extraction. The limitation of the method is that it presumes that all the relevant source records carry date and time stamps. Also, the technique works well only when the number of revised records is small.

6.6.2.2 Capture by Comparing Files
If none of the above techniques is feasible in your environment, you may consider this method. Here you capture the changes to, say, your product data by comparing the current source data with the previously captured copy. Though the method seems simple, the comparison becomes difficult for large files. This method is really only an option for dealing with legacy data sources that have no transaction logs and no time stamps on the source records. It is also called the 'snapshot differential' method, as it compares two snapshots of the source data.

Self Assessment Questions (SAQs) for Section 6.6
1. Differentiate between 'static data capture' and 'incremental data capture' for extracting data from the source systems.
2. Discuss the various methods involved in 'immediate data extraction'.
3. What is 'deferred data extraction'? Describe the methods of deferred data extraction.
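To round off the two deferred methods, here is a small Python sketch of each; the record layouts and field names are assumptions made only for illustration.

    # Hedged sketches of the two deferred-extraction methods.
    def extract_by_timestamp(records, last_extract_time):
        """Capture based on date and time stamp: pick up every record created or
        updated after the previous extract ran."""
        return [r for r in records if r["last_update"] > last_extract_time]

    def snapshot_differential(previous, current):
        """Capture by comparing files: diff two snapshots keyed on the primary key
        and report inserts, updates, and deletes."""
        prev = {r["id"]: r for r in previous}
        curr = {r["id"]: r for r in current}
        inserts = [curr[k] for k in curr.keys() - prev.keys()]
        deletes = [prev[k] for k in prev.keys() - curr.keys()]
        updates = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
        return inserts, updates, deletes

    previous = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}]
    current  = [{"id": 1, "city": "Mumbai"}, {"id": 3, "city": "Chennai"}]
    print(snapshot_differential(previous, current))
    # ([{'id': 3, 'city': 'Chennai'}], [{'id': 1, 'city': 'Mumbai'}], [{'id': 2, 'city': 'Delhi'}])

Note that the snapshot comparison has to hold or sort both copies of the data, which is why it becomes expensive for large files, as the section above points out.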
6.7 Evaluation of the Techniques
As discussed in the section above, you may use any of the following methods, depending on their applicability in your environment:
A. Capture of static data
B. Capture through transaction logs
C. Capture through database triggers
D. Capture in source applications
E. Capture based on date and time stamp
F. Capture by comparing files

Among these, the methods using transaction logs and database triggers are readily available in most database products; they are also comparatively easy to implement and economical. The technique based on transaction logs is perhaps the least expensive. The method of data capture in source applications may be the most expensive in terms of deployment and maintenance, as it requires substantial revisions to the existing source systems. The method of file comparison is time-consuming, and so it is used only when none of the other methods is applicable. The features of the various data capture methods, and their advantages and limitations in implementation, are as follows:

A. Capture of static data
- Flexible for capture specifications
- Does not affect the performance of source systems
- No revisions required to existing applications
- Can be used on legacy systems and file-oriented systems

B. Capture through transaction logs
- Not flexible for capture specifications
- Does not affect the performance of source systems
- No revisions to existing applications
- Can be used on most legacy systems, but cannot be used on file-oriented systems

C. Capture through database triggers
- Not flexible for capture specifications
- Does not seriously affect the performance of source systems
- No revisions to existing applications
- Cannot be used on most legacy systems or on file-oriented systems

D. Capture in source applications
- Flexible for capture specifications
- May affect the performance of source systems because of the additional processing
- Major revisions to existing applications
- Can be used on most legacy systems and file-oriented systems
- High internal costs because of in-house work

E. Capture based on date and time stamp
- Flexible for capture specifications
- Does not affect the performance of source systems
- Likely revisions to existing applications
- Cannot be used on legacy systems; can be used on file-oriented systems

F. Capture by comparing files
- Flexible for capture specifications
- Does not affect the performance of source systems
- No revisions to existing applications
- May be used on legacy systems and file-oriented systems

Self Assessment Questions (SAQs) for Section 6.7
1. List the advantages and limitations of the various data extraction methods in practice.

6.8 Summary
ETL (Extraction, Transformation, and Loading) functions are crucial in building a data warehouse. These ETL functions act as the back-end processes, and the amount of time spent on performing them can be as much as 50-70% of the total effort put into building a data warehouse. The functions include the following processes: data extraction, data transformation, data integration, data cleansing, data summarization, initial data loading, metadata updating, and ongoing data loading. Here, data extraction is the first and a crucial process. The following are some of the challenges one may face while implementing the data extraction process: identification of all source systems, choosing an appropriate method of data extraction, preparation of the time window, determination of the extraction frequency, job sequencing, and exception handling.

There are two ways of storing data in the source systems: 'current value' and 'periodic status'. The current value is the stored value of an attribute at that moment of time. Periodic status stores the status every time a change occurs, so the status value is stored with reference to a specific time frame. The type of data extraction method to use depends on the nature of each of these two categories. Broadly, there are two types of data extraction: "as is" (or static) data and data of revisions.
'Static data' is the capture of data at a specific point in time; for current-value data, this capture includes all transient data identified for extraction. 'Data of revisions' (also known as incremental data capture) includes the revisions since the last time data was captured. There are three methods of immediate data extraction: capture through transaction logs, capture through database triggers, and capture in source applications. All the methods of immediate data extraction involve real-time data capture. There are also two methods in practice under deferred data extraction, where the data capture is done at a later time rather than in real time: 'capture based on date and time stamp' and 'capture by comparing files'. One has to be careful in choosing an appropriate method for the data extraction process, and the choice has to be made after understanding the advantages and limitations of these methods.

6.9 Terminal Questions (TQs)
1. Give reasons why you think the ETL functions are the most challenging part of a data warehouse environment.
2. Briefly describe the major issues you may face in the process of data extraction and explain how to deal with them.
3. Suppose you have chosen an ETL tool to perform the extraction, transformation, and loading functions. Discuss the activities carried out by the ETL tool in executing these functions.

6.10 Multiple Choice Questions (MCQs)
1. What is meant by 'exception handling' in extracting the data?
   a. To determine the ways to handle input records that cannot be extracted
   b. To identify the source systems for extracting the data into a warehouse
   c. To arrange an alternative system for the data warehouse to work in contingency situations
   d. To re-design the architecture of a data warehouse if the performance of the warehouse is not up to the mark
2. The time period set for repetition of the data extraction from a specific source system is called ______.
   a. Extraction period
   b. Extraction level
   c. Extraction frequency
   d. None of the above
3. The process of data extraction for a data warehouse is complex compared to that for an operational system, because ________.
   a. For a data warehouse, you have to extract data from many disparate sources
   b. For a data warehouse, you have to extract data on the changes for ongoing incremental loads as well as for a one-time initial full load
   c. Both (a) and (b)
   d. None of the above
4. Why is the data in a source system said to be time-dependent or temporal?
   a. The data in the source system changes with time
   b. The data in the source system does not change with time
   c. The attributes of the data in the source system change with time
   d. The data in the source system gets erased or deleted after a specific period of time
5. Which of the following statements is false?
   a. The history of data records is ignored in OLTP systems
   b. Operational data in the source systems can be categorized into 'current value' and 'periodic status'
   c. The information required for OLTP systems is drawn from the data in the data warehouse
   d. None of the above
6. If the data capture is not done immediately, or not done in real time, the type of extraction is called _________.
   a. Delayed extraction
   b. Deferred extraction
   c. Delimit extraction
   d. Deteriorated extraction
7. Which of the following is not an option for immediate data extraction?
   a. Capture through transaction logs
   b. Capture in source applications
   c. Capture through the source code
   d. Capture through database triggers
8. The ETL functions in a data warehouse environment are challenging in nature because ________.
   a. Source systems are very diverse and disparate
   b. Many source systems are older legacy applications running on obsolete database technologies
   c. Generally, historical data on changes in values is not preserved in source operational systems, while historical information is critical in a data warehouse
   d. All of the above
9. Which of the following represents the duration in which the data can be extracted from a specific source system so that the system does not get impacted by the extraction process?
   a. Source window
   b. Extraction window
   c. Time window
   d. Impact window
10. The creation of aggregate datasets on the basis of pre-determined procedures is called ___________.
   a. Data cleansing
   b. Data integration
   c. Data setting
   d. Data summarization
11. Which of the following is a type of immediate data extraction method for capturing data from the source systems?
   a. Capture through transaction logs
   b. Capture through database triggers
   c. Capture in source applications
   d. All of the above
12. Which of the following is a type of deferred data extraction method for the data capture?
   a. Capture based on date and time stamp
   b. Capture by comparing files
   c. Both (a) and (b)
   d. None of the above
13. Which of the following is the least preferred method, implemented only when none of the other methods can be implemented?
   a. Capture based on date and time stamp
   b. Capture by comparing files
   c. Capture in source applications
   d. Capture through database triggers

6.11 Answers to SAQs, TQs, and MCQs

6.11.1 Answers to Self Assessment Questions (SAQs)
Section 6.2
1. ETL stands for Extraction, Transformation, and Loading. To implement these functions, you may be required to go through the following processes: data extraction, data transformation, data integration, data cleansing, data summarization, initial data loading, metadata updating, and ongoing data loading. Also write about the significance of these ETL processes as provided in Section 6.2.1.
2. The steps to be followed in the implementation of the ETL functions are listed in Figure 6.2. You may list these steps with a small description of each.

Section 6.3
1. Data extraction is the process of retrieving and collecting the data from the appropriate source systems. The various issues one may face in the data extraction process include identification of all source systems, choosing an appropriate method of data extraction, preparation of the time window, determination of the extraction frequency, job sequencing, and exception handling. You may describe these as discussed in Section 6.3.

Section 6.4
1. There are two ways of storing the data in the source systems: 'current value' and 'periodic status'. The current value is the stored value of an attribute at that moment of time. Periodic status stores the status every time a change occurs, so the status value is stored with reference to a specific time frame.
You may describe these methods as discussed in Sections 6.4.1 and 6.4.2.

Section 6.5
1. The source identification process includes the following steps: preparing a list of all data items or facts needed in the fact tables, preparing a list of each dimension attribute from all dimensions, identifying the source system and source data item for every target data item, identifying the preferred source where there are multiple sources for a specific data item, identifying any single source field that feeds multiple target fields and establishing splitting rules, and ascertaining the default values and inspecting the source data for missing values.

Section 6.6
1. 'Static data' is the capture of data at a specific point in time; for current-value data, this capture includes all transient data identified for extraction. 'Data of revisions' (also known as incremental data capture) includes the revisions since the last time the data was captured. You may include the other points as discussed in Section 6.6.
2. There are three methods of immediate data extraction: capture through transaction logs, capture through database triggers, and capture in source applications. You can describe each of these methods as discussed in Section 6.6.1.
3. The deferred data extraction methods do not capture the changes in real time; they capture the data at a later time. The methods used in deferred data extraction are 'capture based on date and time stamp' and 'capture by comparing files'. You can describe each of these methods as discussed in Section 6.6.2.

Section 6.7
1. One can adopt any of the following methods for data extraction: capture of static data, capture through transaction logs, capture through database triggers, capture in source applications, capture based on date and time stamp, and capture by comparing files. The advantages and limitations of each of these methods are discussed in Section 6.7.

6.11.2 Answers to Terminal Questions (TQs)
1. You can say that the ETL functions are challenging in the context of building a data warehouse for the following reasons:
- Source systems are very diverse and disparate, and you may have to deal with source systems on multiple platforms and different operating systems.
- Many source systems are legacy applications running on obsolete database technologies, and the quality of data is dubious in many of these old source systems.
- Source system structures keep changing over time because of new business conditions, so the ETL functions need to be modified accordingly.
- Historical data on changes in values is not preserved in source operational systems, whereas historical information is critical in a data warehouse.
- Even when inconsistent data is detected among disparate source systems, the lack of a means for resolving the mismatches escalates the problem of inconsistency.
- Many of the source systems do not represent the data in types or formats that are meaningful to the users.
2. The major issues involved in the data extraction process are discussed in Section 6.3.
Accordingly, you may discuss the issues, viz., source identification, selection of an appropriate method of data extraction, preparation of a time window for each data source system, determination of the extraction frequency, sequencing of the jobs, and exception handling for records that cannot be extracted.
3. Before selecting an ETL tool, you need to understand whether the tool has the following capabilities:
- Data extraction from the relational databases of leading vendors, as well as from old legacy databases, indexed files, and flat files
- Data transformation from one format to another, with variations in source and target fields
- Performing standard conversions, key reformatting, and structural changes
- Provision of audit trails from source to target
- Application of business rules for extraction and transformation
- Combining several records from the source systems into one integrated target record
- Recording and management of metadata

6.11.3 Answers to Multiple Choice Questions (MCQs)
1. Ans: a
2. Ans: c
3. Ans: c
4. Ans: a
5. Ans: c
6. Ans: b
7. Ans: c
8. Ans: d
9. Ans: c
10. Ans: d
11. Ans: d
12. Ans: c
13. Ans: b