Business Intelligence and Tools
Sikkim Manipal University

Unit 3: Architecting the Data Structure

3.1 Introduction
    Objectives
3.2 Types of Data
    3.2.1 Real-time Data
    3.2.2 Derived Data
    3.2.3 Reconciled Data
    Self Assessment Question(s) (SAQs)
3.3 Enterprise Data Model
    3.3.1 Phased Enterprise Data Modeling
    3.3.2 A Simple Enterprise Data Model
    3.3.3 Advantages and Limitations of EDM
    Self Assessment Question(s) (SAQs)
3.4 Granularity of the Data
    Self Assessment Question(s) (SAQs)
3.5 Data Partitioning
    3.5.1 Objectives of Partitioning
    3.5.2 Criteria of Partitioning
    3.5.3 Logical Data Partitioning
3.6 Metadata
    3.6.1 Data Warehouse Metadata
    3.6.2 Use of Metadata
    3.6.3 Managing the Metadata
    Self Assessment Question(s) (SAQs)
3.7 Total Data Quality Management (TDQM)
    3.7.1 Data Quality Types
    3.7.2 Concept of TDQM
    Self Assessment Question(s) (SAQs)
3.8 Summary
3.9 Terminal Questions (TQs)
3.10 Multiple Choice Questions (MCQs)
3.11 Answers to SAQs, TQs, and MCQs
    3.11.1 Answers to Self Assessment Questions (SAQs)
    3.11.2 Answers to Terminal Questions (TQs)
    3.11.3 Answers to Multiple Choice Questions (MCQs)

3.1 Introduction
The basic concept of data warehousing is to clean, filter, transform, summarize, and aggregate data, and then store it in a structured manner so that it is easily accessible for business decision making. Defining an appropriate structure for the data is therefore crucial. By architecting the data, you can structure and locate it so that the data loaded into the warehouse best supports your decision-making process. In this unit, we discuss the various types of data used in data warehousing and some key points for architecting that data.
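The clean, filter, transform, summarize, and aggregate flow described above can be sketched in a few lines of Python. All records and field names here are hypothetical, chosen only to illustrate the pipeline, not taken from any particular system:

```python
from collections import defaultdict

# Hypothetical operational records; one has a missing amount and must be cleaned out.
records = [
    {"region": "east", "amount": 120.0},
    {"region": "east", "amount": None},   # dirty record
    {"region": "west", "amount": 75.5},
]

# Clean and filter: drop records with missing values.
clean = [r for r in records if r["amount"] is not None]

# Transform (normalize region names), then summarize and aggregate (total per region).
totals = defaultdict(float)
for r in clean:
    totals[r["region"].upper()] += r["amount"]

# The structured result is the kind of row a warehouse table would hold.
print(dict(totals))  # {'EAST': 120.0, 'WEST': 75.5}
```

A real warehouse performs these steps with ETL tooling and SQL rather than a script, but the shape of the work is the same.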
Objectives: After studying this unit, you should understand:
- The various types of data used in an organization
- The Enterprise Data Model (EDM) approach to developing a data model
- Granularity of the data
- Physical and logical partitioning of the data
- The significance and use of metadata
- The practice of Total Data Quality Management (TDQM)

3.2 Types of Data
Before structuring the data in a data warehouse, you can divide it into three categories according to usage, scope, and currency:
- Real-time data
- Derived data
- Reconciled data

3.2.1 Real-time Data
This data represents the current status of the business. Typically, real-time data is used by operational applications to run the business, and it changes constantly as operational transactions are processed. It is held at a detailed level and is usually accessed in read/write mode by operational transactions. Real-time data is extracted and distributed to informational systems (data warehouses) throughout the organization for tactical decision making; in the banking industry, for instance, real-time data plays an important role in tactical decisions. The real-time data is delivered to the data warehouse from the operational systems through an independent system, called a deferred or delayed system.

Before using real-time data in a data warehouse, you have to cleanse it to ensure that its quality is appropriate; that is, it has to be summarized and transformed into a format that is easily understood. This is because real-time data consists of the individual, transactional, detailed values specific to the respective operational systems. In addition, the data may not be consistent in representation and meaning when it comes from multiple systems; for example, the units of measure, currency, and exchange rates may differ among systems.
Therefore, these anomalies must be reconciled before the real-time data is loaded into the data warehouse.

3.2.2 Derived Data
This is data that has been derived or created, perhaps by aggregating or averaging real-time data through a defined process. It can represent a view of the business at a specific point in time, or a historical record of the business over a period of time. Depending on the requirements, derived data can be either detailed or summarized.

Traditionally, derived data is used by data analysts for analysis and decision making, as they do not require large volumes of detailed data; summarized data is far more useful to work with and manipulate. To improve query performance, an efficient approach followed by business analysts is to predict the derived data elements and summarize the detailed data to meet specific business requirements.

3.2.3 Reconciled Data
Reconciled data is real-time data that has been cleansed, modified, or enhanced. It provides an integrated source of quality data for data analysts to use in analysis. Because historical data can be created and maintained during reconciliation, reconciled data is also regarded as a type of derived data.

You can configure an appropriate warehouse on the basis of these data types, considering the requirements of the particular implementation effort. Depending on the nature of the operational systems, the type of business, the number of users accessing the warehouse, and the frequency of access, you can combine these types of data to develop the most appropriate warehouse architecture.

Self Assessment Question(s) (SAQs) For Section 3.2
1. What is 'reconciled data', and how does it differ from the other types of data, 'real-time data' and 'derived data'?
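As a minimal sketch of how the three data types relate (all records, accounts, and the exchange rate are hypothetical), raw real-time transactions can be reconciled, here by normalizing every amount to a common currency, and then derived by aggregation:

```python
from collections import defaultdict

# Hypothetical real-time transactions from two operational systems.
# One system reports amounts in USD, the other in EUR: an inconsistency
# that must be reconciled before loading into the warehouse.
real_time = [
    {"account": "A-1", "amount": 100.0, "currency": "USD"},
    {"account": "A-1", "amount": 50.0,  "currency": "EUR"},
    {"account": "B-7", "amount": 200.0, "currency": "USD"},
]

EUR_TO_USD = 1.1  # assumed exchange rate, for illustration only

def reconcile(rows):
    """Cleanse real-time data into a consistent representation (all USD)."""
    out = []
    for r in rows:
        amount = r["amount"] * EUR_TO_USD if r["currency"] == "EUR" else r["amount"]
        out.append({"account": r["account"], "amount_usd": round(amount, 2)})
    return out

def derive_summary(rows):
    """Derive summarized data: total per account, aggregated from the detail."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["account"]] += r["amount_usd"]
    return dict(totals)

reconciled = reconcile(real_time)
derived = derive_summary(reconciled)
print(derived)  # {'A-1': 155.0, 'B-7': 200.0}
```

The reconciled rows remain at the detailed level; the derived result is the summarized view an analyst would typically query.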
3.3 Enterprise Data Model
The Enterprise Data Model (EDM) is an approach to developing a data warehouse data model. An EDM is a consistent definition of all the data elements common to the business, from a high-level business view down to a generic logical data design. Using this model, you can derive the general scope and understanding of the business requirements; the model also includes links to the physical data designs of the individual applications.

A typical EDM can be of either of the following types:
- Phased Enterprise Data Modeling
- A simple Enterprise Data Model

3.3.1 Phased Enterprise Data Modeling
Several methodologies for enterprise data modeling are in practice. Some authors propose a three-tiered methodology (conceptual, logical, and physical data models) or a five-tiered model. According to the Worldwide Solution Design and Delivery Method proposed by IBM, a phased enterprise data modeling effort includes the following tiers:
- ISP: Information System Planning
- BAA: Business Area Analysis
- BSD: Business System Design
- BSI: Business System Implementation
- BSM: Business System Maintenance

Whatever methodology is used, it follows a phased, or layered, approach, and the phases include the following tasks:
- Information Planning
- Business Analyzing
- Logical data modeling
- Physical data design

Figure 3.1 represents the tasks or phases in a phased enterprise data model. The size of each phase represents the amount of information included in it: the amount of information is minimal in the Information Planning phase and increases in the later phases.

The Information Planning phase provides a highly consolidated view of the business in which you can view the business concepts.
These business concepts can be categorized into business entities, super entities, or subject areas, and each of these items maintains related data elements. For example, the business concepts in an organization can be customer, product, policy, schedule, and so on. This phase provides the scope and architecture of a data warehouse and acts as a single, comprehensive view for the remaining phases.

The Business Analyzing phase further defines the concepts provided by the previous phase. It is described in business terms so that business people can understand the data details without any special training. The purpose of this phase is to gather and arrange the business requirements and to define the business terms.

The Logical Data Modeling phase is organization-wide in scope and generic to all the applications addressed in the later phases. This model typically consists of several hundred entities and contains the identification and definition of all entities, relationships, and attributes. The entities of the logical data model can be further partitioned into views by subject area or by application. This phase can be divided into two types: a 'generic logical data model' at the organizational level and a 'logical application model' at the application level of the data view.

The Physical Data Design phase designs for the actual physical implementation and applies physical constraints such as space, performance, and the physical distribution of the data.

3.3.2 A Simple Enterprise Data Model
A simple Enterprise Data Model is one in which some of the core components required for data warehouse modeling are extracted and grouped for use in a phased approach. For a simple EDM, you can make a list of around 20 subject areas and then draw a subject-area model of the organization by defining the business relationships among the subject areas.
You are then required to define the contents of each of these subject areas. For example, when you define a customer, you can also specify whether the customer is a prospective customer or an ex-customer.

3.3.3 Advantages and Limitations of EDM
Compared with an application or department model, an EDM offers the following advantages:
- It stands as a single development base for applications and thereby promotes integration of the existing applications
- It provides a single set of consistent data definitions
- It supports sharing of data among the various business areas

But organizations face the following challenges while implementing an EDM:
- All business areas of the organization need to be involved concurrently to deliver the best possible value of the EDM to the business, which is an unrealistic expectation
- As the scope of an EDM tends to cover the entire organization, the project becomes so large that it seldom produces appropriate results within a reasonable time period
- The people involved in an EDM project are expected to have both a broad understanding of the business and detailed knowledge of a specific business area

Self Assessment Question(s) (SAQs) For Section 3.3
1. Explain the various phases involved in an Enterprise Data Model (EDM) and discuss the type of data stored in each of these phases.

3.4 Granularity of the Data
'Grain' is a term that originates in photography, where it refers to the degree of detail and precision contained in an image (the pixels or dots per inch) or preserved and communicated by a given medium. Thus 'granularity' refers to the level of detail. A typical data warehouse has some tables that hold a lot of detail and other tables that are summarized or aggregated, meaning less detail. The more detailed the data, the lower the level of granularity.
Conversely, the less detailed the data, the higher the level of granularity. The lowest level of granularity corresponds to data at or near the transaction level (data at the transaction level is usually referred to as atomic-level data).

The concept of granularity is very important because it forces a trade-off between important issues in data warehousing; for example, between the ability to access data at a very detailed level on the one hand, and performance and the cost of storing and accessing larger amounts of data on the other. Selecting an appropriate level of granularity significantly affects the volume of data in the warehouse, and it determines the warehouse's capability to answer different types of queries.

Even when granularity does not affect the ability to answer a specific query, the resources required to execute that query can still differ considerably. For example, suppose you have two tables with different levels of granularity: a monthly account summary and the transaction details. To answer a query about monthly channel utilization, you can use either table, regardless of its granularity level. However, using the detailed transaction table requires a significantly higher volume of disk activity to scan all the data, as well as additional processing power to calculate the results; the monthly account summary table requires far fewer resources. Therefore, when deciding the level of granularity, you should always consider the trade-off between the cost of the data volume and the ability to answer queries. To manage this trade-off, organizations that deal with large amounts of data can opt for multiple levels of granularity.
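The trade-off just described can be sketched as follows (all rows and channel names are hypothetical): the same monthly total can be answered either by scanning every detailed transaction or by a single lookup in a pre-aggregated summary table.

```python
# Hypothetical detailed transaction table: one row per transaction.
transactions = [
    ("2024-01", "ATM",    120.0),
    ("2024-01", "ATM",     80.0),
    ("2024-01", "ONLINE",  40.0),
    ("2024-02", "ATM",     60.0),
]

# Pre-aggregated monthly summary table (a higher level of granularity).
monthly_summary = {
    ("2024-01", "ATM"): 200.0,
    ("2024-01", "ONLINE"): 40.0,
    ("2024-02", "ATM"): 60.0,
}

def total_from_detail(month, channel):
    """Answer the query by scanning every detailed row (costly)."""
    rows_scanned = len(transactions)
    total = sum(amt for m, ch, amt in transactions
                if m == month and ch == channel)
    return total, rows_scanned

def total_from_summary(month, channel):
    """Answer the same query from the summary with one lookup."""
    return monthly_summary[(month, channel)], 1

detail_total, detail_cost = total_from_detail("2024-01", "ATM")
summary_total, summary_cost = total_from_summary("2024-01", "ATM")
assert detail_total == summary_total == 200.0  # same answer either way
assert summary_cost < detail_cost              # at a fraction of the scan cost
```

With millions of transactions, the difference in rows scanned is what dominates query cost; the summary can answer this query, but only queries at or above its level of detail.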
For example, the data in a data warehouse can be divided into 'detailed raw data' and 'summarized data'. Detailed raw data is the lowest level of transaction detail, without any aggregation or summarization. Because the volume of data at this level can be extremely large, you can move it to a separate storage medium such as magnetic tape or an optical disk device when it is not being used; this keeps access to the actively used data easy and fast, while the archived detail can still be retrieved as and when required. Summarized data is transaction data aggregated to the level required by the most frequently used queries, so the volume of data required for the summarized sources is much lower than that of the detailed raw data. But there is a limit to the number of queries, and the level of detail, that can be answered from summarized data.

Thus you can manage the trade-off between data volume and query capability by creating two levels of granularity in the data warehouse: the detailed raw data supports the limited number of queries that require a detailed level of data, and the summarized data supports all other queries with a reduced amount of resources.

Self Assessment Question(s) (SAQs) For Section 3.4
1. Define the terms 'granularity of the data', 'detailed raw data', and 'summarized data'. How can a project manager choose an appropriate level of granularity for a new BI project?

3.5 Data Partitioning
The term 'partition' refers to the physical state of a data structure that has been divided into two or more separate structures. Logical partitioning of the data is also required in order to understand and use the data better; in such a case, the logical partitioning overlaps with the physical partitioning.
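As a minimal sketch (the fact rows are hypothetical), physical partitioning can be pictured as routing each row of a fact table into a separate structure keyed by the partitioning criterion, here the time period:

```python
from collections import defaultdict

# Hypothetical fact rows: (date, product, amount).
facts = [
    ("2024-01-05", "loans",    100.0),
    ("2024-01-17", "deposits",  50.0),
    ("2024-02-02", "loans",     75.0),
]

def partition_by_month(rows):
    """Physically partition the data: one separate structure per month."""
    partitions = defaultdict(list)
    for date, product, amount in rows:
        month = date[:7]  # 'YYYY-MM' serves as the partitioning key
        partitions[month].append((date, product, amount))
    return dict(partitions)

parts = partition_by_month(facts)
# Each partition can now be indexed, scanned, archived, or recovered independently.
print(sorted(parts))  # ['2024-01', '2024-02']
```

In a real DBMS this routing is declared in the table definition (for example, range partitioning on a date column) rather than done in application code, but the effect is the same: one logical table, several independent physical structures.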
3.5.1 Objectives of Partitioning
Partitioning the data in a data warehouse helps accomplish the following objectives:
- Flexible access to the data
- Easy and efficient data management services
- Ease of restructuring, indexing, sequential scanning, recovery, reorganization, and monitoring
- Scalability of the data warehouse
- Portability of the data warehouse elements

3.5.2 Criteria of Partitioning
A number of important criteria can be considered for partitioning the data, including:
- Product (according to the line of business)
- Time period (date, week, or month)
- Geography (location)
- Department or organizational unit
- A combination of the above

The choice of criteria is based on the business requirements and physical database constraints, but 'time period' should always be considered when partitioning the data. Every database management system has its own way of implementing physical partitioning, and an important consideration when selecting the database management system on which the data will reside is its support for partition indexing. Instead of system-level partitioning, you can also consider partitioning by application, as this provides flexibility in defining data over time and portability in moving the data to other data warehouses. Understand, however, that the issue of partitioning is closely related to data granularity modeling, multidimensional modeling, and the ability of the specific database management system to support data warehousing.

3.5.3 Logical Data Partitioning
Both the logical and the physical partitioning of the data play an important role in understanding, maintaining, and navigating the data warehouse. The physical partitioning is designed according to the physical implementation requirements and constraints.
But the logical partitioning of the data is even more important, as it affects the physical partitioning, both in the overall structure and in the detailed table partitioning.

'Subject area' is considered the most common criterion for determining the overall logical data partitioning; it can be defined as 'a portion of a data warehouse that is classified by a specific, consistent perspective'. This perspective is usually based on characteristics of the data such as product, customer, time period, account, or geographical location.

Partitioning data in a data warehouse environment differs from partitioning in an operational environment. In an operational environment, partitioning follows an application or function, because the operational environment is built around transaction-oriented applications that perform a specific set of functions, and the objective is to perform those functions as quickly as possible. The queries performed in an operational environment are more tactical in nature and are concerned with a specific point in time. The queries in a data warehouse environment, by contrast, are strategic in nature and are concerned with a larger scope. Therefore the data in a data warehouse should be structured and oriented towards subject areas such as product or organizational unit. As you will remember, a data warehouse is by definition subject-oriented, so its data is oriented to specific subject areas.

In data warehouse modeling, the identification of subject areas is useful in defining the following:
- The unit of the data model
- The unit of an executing project
- The unit of management of the data
- The basis for integration of multiple implementations

3.6 Metadata
In simple terms, 'metadata' refers to "data about data": information that describes or supplements the main data.
For example, the metadata of a digital photograph includes the camera settings used for the picture, such as the exposure value or flash intensity. Here the metadata acts as additional information and is not critical to the function of the main data. In other cases, such as a Zip disk, metadata might record the write-protected status of the disk; there, the metadata is essential for the proper functioning of the product. So the value of metadata depends on the context in which it is provided and on the ways that contextual information can be used.

When data is made available, the potential user (human or computer) must fit the data into an existing model of knowledge, and may ask questions to do so. In the case of an image, for example, metadata provides answers to questions like "When was the image taken?" and "Who is in the image?" In sophisticated data systems, the metadata includes the contextual information surrounding the data and is itself very sophisticated, capable of answering many questions that help in understanding the data.

To sum up, metadata can be defined as "the structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities."

In general, there are two distinct classes of metadata: structural (or control) metadata and guide metadata. Structural metadata describes the structure of computer systems: tables, columns, indexes, and so on. Guide metadata helps people find specific items and is usually expressed as a set of keywords in a natural language.

It is difficult to distinguish metadata from data because:
- Some items can be data and metadata at the same time. For example, the headline of an article is both its title (metadata) and part of its text (data).
- In some cases, data and metadata can exchange roles. For example, a poem as such would be regarded as data; if the same poem is attached as the lyrics of an audio file of a song, it becomes metadata.

Metadata can be categorized on the basis of the following characteristics:
- Content: Metadata can describe the resource itself (the name and size of a file) or the content of the resource (for example, "This video file shows a boy playing cricket").
- Mutability: With respect to the whole resource, metadata can be either immutable (for example, the video title does not change while the video is playing) or mutable (a "scene description" does change).
- Logical function: There are three layers of logical function: at the bottom, the sub-symbolic layer containing the raw data; on top of it, the symbolic layer with metadata describing the raw data; and on top of that, the logical layer with metadata that allows logical reasoning over the symbolic layer.

3.6.1 Data Warehouse Metadata
In general, data warehouse metadata systems are divided into two sections:
- back room metadata, used by the extract, transform, load functions to get OLTP data into the data warehouse
- front room metadata, used to label screens and create reports

Kimball, a renowned author in the areas of business intelligence and data warehousing, lists the following types of metadata in a typical data warehouse:

Source system metadata
- Source specifications, such as repositories and source schemas
- Source descriptive information, such as ownership descriptions, update frequencies, legal limitations, and access methods
- Process information, such as job schedules and extraction code

Data staging metadata and DBMS metadata, such as:
- DBMS system table contents
- Data acquisition information, such as data transmission scheduling and results, and file usage
- Dimension table management, such as definitions of dimensions and surrogate key assignments
- Transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions
- Audit, job logs, and documentation, such as data lineage records and data transform logs

3.6.2 Use of Metadata
The applications of metadata include the following:
- Metadata provides additional information, descriptive or algorithmic, to users of the data it describes.
- Metadata speeds up and enriches searching for resources. Search queries that use metadata save users from performing more complex filter operations manually. Web browsers, P2P applications, and media management software automatically download and locally cache metadata to improve the speed at which files can be accessed and searched.
- Metadata plays a vital role on the World Wide Web in finding useful information within the large amount of information available.
- Metadata is an important part of electronic discovery. Application and file system metadata derived from electronic documents and files can act as important evidence.
- Some metadata is intended to enable variable content presentation. For example, if a picture's metadata indicates its most important region, the user can narrow the picture to that region and obtain the detail required.
- Metadata can also be used to automate workflows. For example, if a software tool knows the content and structure of data, it can convert the data automatically and pass it to another tool as input, so that users need not perform copy-and-paste operations.
- Metadata helps to bridge the semantic gap by telling a computer how data items are related and how those relations can be evaluated automatically. For example, if a search engine understands that "Aditya Kaashyap" was an "Indian engineer", it can answer a search query on "Indian engineers" with a link to a web page about Aditya Kaashyap, even though the exact words "Indian engineers" never occur on that page. This approach, called knowledge representation, is of special interest to the semantic web and artificial intelligence.

3.6.3 Managing the Metadata
To develop and use metadata successfully, you need to understand the following issues, which must be treated with care:
- You need to keep track of all the metadata created, even in the early phases of planning and design. It is not economical to start attaching metadata once the production process has been completed.
- Metadata must adapt when the resource it describes changes, and should be merged when two resources are merged. It can be useful to keep metadata even after the resource it describes has been removed.
- Metadata can be stored either internally (in the same file as the data) or externally (in a separate file). Internal storage lets the metadata travel together with the data it describes, so the metadata is always at hand and can be easily manipulated; but this method creates high redundancy and does not allow the metadata to be held together. External storage allows bundling the metadata, for example in a database, for more efficient searching; there is no redundancy, and the metadata can be transferred simultaneously with the data when streaming.
- Storing metadata in a human-readable format such as XML can be useful, because users can understand and edit it without specialized tools, but such formats are not optimized for storage capacity.
It may instead be useful to store metadata in a binary, non-human-readable format to speed up transfer and save space.

Although the majority of computer professionals see metadata as an opportunity for better interoperability, there are some objections:
- Metadata is expensive, time-consuming to produce, and complicated.
- Metadata is subjective and depends on context. Two people will attach different metadata to the same resource because of their different points of view, and metadata can be misinterpreted because of its dependency on context.
- There is no end to metadata. For example, when annotating a soccer match, one person might describe only the players and their actions, while others might also describe the advertisements in the background and the clothes the players wear. Even for a simple resource, the amount of possible metadata can be gigantic.
- Some argue there is no real need for metadata, since most of today's search engines already find text very efficiently.

Self Assessment Question(s) (SAQs) For Section 3.6
1. Discuss the features of metadata in the context of a data warehouse.

3.7 Total Data Quality Management (TDQM)
One of the main reasons for the failure of a data warehouse deployment is the poor quality of the data loaded into the warehouse. Managers therefore need to take the precautions required to ensure that the quality of the data loaded into the warehouse is appropriate.

3.7.1 Data Quality Types
There are two significant dimensions to understanding the quality of data: intrinsic quality and realistic quality. 'Intrinsic data quality' is the correctness or accuracy of the data, and 'realistic data quality' is the value that correct data has in supporting the work of the business or organization. Stated simply, intrinsic data quality is the accuracy of the data.
It is the degree to which data accurately reflects the real-world object that the data represents. If all the facts an organization needs to know about an entity are accurate, then that data has intrinsic quality.

Data that does not enable the organization to accomplish its mission has no quality, no matter how accurate it is; this is where 'realistic data quality' comes into the picture. Realistic data quality is the degree of utility and value the data has in supporting the organizational processes that accomplish the organizational objectives. Fundamentally, realistic data quality is the degree of satisfaction that knowledge workers derive from using the data.

3.7.2 Concept of TDQM
Many business intelligence projects fail to deliver their full potential for one reason: people tend to treat data quality as a one-time undertaking, part of user acceptance testing (UAT). It is very important that data quality management instead be undertaken as a continuous improvement process. You can use an iterative approach, as detailed below, to achieve data quality.

Step 1: Establish the Data Quality Management environment
Commitment to the Data Quality Management process is accomplished by establishing the data quality management environment among the information system managers and by establishing conditions that encourage coordination between functional and information system development professionals. Functional users of legacy information systems know the data quality problems of the existing systems, but hardly know how to improve the quality of the existing data systematically. Information system developers know how to identify data quality problems, but hardly know how to change the functional requirements that drive the systematic improvement of the data.
Given these barriers to communication, establishing the data quality environment requires the participation of both the functional users and the information system administrators.

Step 2: Draft the project scope
For each data quality analysis project selected, draft an initial plan that addresses the following items:
- Task summary
- Task description
- Project approach
- Schedule
- Resources

Step 3: Implement the data quality projects
A data quality project consists of four activities:
- Define
- Measure
- Analyze
- Improve

The data quality project manager performs these activities with input from the functional users of the data, the system developers, and the database administrators of the legacy and target database systems.

Step 4: Evaluate the Data Quality Management methods
The objective of this step is to evaluate and assess the progress made in implementing the data quality initiatives. All stakeholders in the Data Quality Management process (functional users, program managers, developers, and the Office of Data Management) are required to review the progress to determine whether the data quality projects have helped achieve the intended goals and benefits.

Self Assessment Question(s) (SAQs) For Section 3.7
1. Drafting a project plan plays a crucial role in Total Data Quality Management (TDQM). In your view, what details does a typical project plan consist of?

3.8 Summary
The data to be stored in a data warehouse can be divided into three categories: real-time data, derived data, and reconciled data. Real-time data represents the current status of the business. Derived data is created by aggregating real-time data through a defined process. Reconciled data is real-time data that has been cleansed, modified, or enhanced. The Enterprise Data Model (EDM) is an approach to developing a data warehouse data model.
A typical EDM can be of either of two types: Phased Enterprise Data Modeling or a simple Enterprise Data Model. 'Granularity' refers to the level of detail of the data provided in a warehouse. Selection of an appropriate level of granularity determines the capability of the data warehouse to enable answers to different types of queries. Also, logical partitioning of the data is required to better understand and use the data. The criteria for partitioning the data include: product as per the line of business, time period, geographical location, department, or a combination of the above. In simple terms, 'metadata' refers to "data about data." A data warehouse metadata system is divided into two sections: back room metadata, which supports the extract, transform, and load functions that get OLTP data into the data warehouse, and front room metadata, which is used to label screens and create the business reports. There are two important dimensions in understanding data quality: intrinsic quality and realistic quality. 'Intrinsic data quality' is the correctness or accuracy of data, and 'realistic data quality' is the value that the correct data has in supporting the work of the business or organization. The concept of Total Data Quality Management (TDQM) aims at taking the precautions required to ensure that the quality of the data loaded into the warehouse is appropriate.

3.9 Terminal Questions (TQs)

1. Before architecting the data, how can you divide the data into various categories according to its characteristics?
2. Explain the concepts of 'physical partitioning' and 'logical partitioning' of the data.
3. Elaborate the activities involved in a data quality project and explain how the project manager can ensure the quality of the project.
4.
Discuss the 'intrinsic' and 'realistic' types of data quality, and analyze the areas you need to concentrate on to improve the quality of the data being used in a data warehouse.

3.10 Multiple Choice Questions (MCQs)

1. Which of the following represents the data that has been prepared by aggregating or averaging the real-time data through a defined process?
a. Reconciled data
b. Granular data
c. Derived data
d. Non-real-time data

2. Which of the following statements is False?
a. Real-time data is detailed-level data and is usually accessed in read/write mode in operational transactions.
b. Reconciled data is the real-time data that has been cleansed, modified, or enhanced.
c. Derived data is used for data analysis and decision-making, as it is summarized data that is easy to use and manipulate.
d. Real-time data can be directly loaded into a data warehouse as it does not require any reconciliation of anomalies.

3. Which of the following phases of an Enterprise Data Model (EDM) further defines the concepts provided in the Information Planning phase and describes the business terms so that business people can understand the data details without any special training?
a. Business Designing
b. Business Analyzing
c. Business Decision-making
d. Business Implementing

4. The data that is at the transaction level is usually referred to as __________.
a. Meta data
b. atomic level data
c. derived data
d. reconciled data

5. The partitioning of the data in a data warehouse is oriented towards __________.
a. Subject areas
b. Units
c. Models
d. Data types

6. The back room metadata in a data warehouse is used to _______________.
a. extract, transform, and load functions to get OLTP data into a data warehouse
b. label screens and create reports
c. plan and implement the data warehouse project
d.
determine the resources required for the data warehouse project implementation

7. TDQM stands for _____________.
a. Total Data Quantity Management
b. Tentative Data Quality Management
c. Total Data Quality Management
d. Tentative Design Quality Management

8. If all the observations that an organization collected about an entity are accurate, then the ________ quality of the data is achieved.
a. Intrinsic
b. Extrinsic
c. Realistic
d. Experimental

9. Which of the following statements is False with regard to metadata?
a. Metadata can be stored internally (in the same file) or externally (in a separate file).
b. Metadata speeds up and enriches searching for resources.
c. Some items of data act as 'data' and 'metadata' at the same time.
d. None of the above.

10. Which of the following best describes 'metadata'?
a. Quality of the data
b. Data about the data
c. Volume of data
d. Design of the data

11. Which of the following statements is not correct?
a. Realistic data quality is the degree of utility and value the data has to support the organizational processes to accomplish the organizational objectives.
b. Front room metadata in a data warehouse metadata system is used to label screens and create the business reports.
c. The implementation of a data quality project consists of four activities: Define, Measure, Analyze, and Improve.
d. None of the above

12. Data that does not enable the organization to accomplish its mission has no quality, no matter how accurate it is. Therefore, the ___________ quality of data plays a vital role in business decision making.
a. Intrinsic
b. Extrinsic
c. Realistic
d. Experimental

3.11 Answers to SAQs, TQs, and MCQs

3.11.1 Answers to Self Assessment Questions (SAQs)

Section 3.2
1. Reconciled data is the real-time data that has been cleansed, modified, or enhanced.
Real-time data is used by operational applications to run the business, and this data constantly changes as operational transactions are processed. Derived data is the data that has been derived by aggregating the real-time data through a defined process; thus, this data can represent a view of the business at a specific point in time. Since reconciled data also maintains historical data while reconciling the data, it is also termed a type of derived data. Thus, reconciled data exhibits the characteristics of both real-time data and derived data.

Section 3.3
1. An Enterprise Data Model (EDM) provides all the data elements, from a high-level business view down to a generic logical data design. Generally, phased enterprise data modeling follows a 3-tiered methodology: conceptual, logical, and physical data models. IBM proposed a 5-tiered model with the following tiers: Information System Planning (ISP), Business Area Analysis (BAA), Business System Design (BSD), Business System Implementation (BSI), and Business System Maintenance (BSM). Most organizations today follow a methodology with the following phases: Information Planning, Business Analyzing, Logical data modeling, and Physical data design.

Section 3.4
1. 'Granularity' refers to the level of detail of the data. The more detailed the data that is provided, the lower the level of granularity; data at or near the transaction level therefore represents the lowest level of granularity. 'Detailed raw data' is the lowest level of detailed transaction data, without any aggregation or summarization. 'Summarized data' is the transaction data that has been aggregated at the level required for the most-used queries. The project manager has to choose an appropriate granularity level by assessing the decision-making requirements.
This is because the 'detailed raw data' supports the limited number of queries that require a detailed level of data, while the 'summarized' level of data supports the most-used queries with a reduced amount of resources.

Section 3.6
1. 'Metadata' refers to "data about data." Here, 'structural metadata' describes the structure of computer systems, such as tables, columns, indexes, etc., and 'guide metadata' helps people find specific items. There are two types of metadata: 'back room metadata', which is used to extract, transform, and load the OLTP data into a data warehouse, and 'front room metadata', which is used to label screens and create reports. Metadata provides descriptive or algorithmic additional information to its users. It speeds up and enriches the searching functionality. However, maintaining the metadata is complex, expensive, and time-consuming.

Section 3.7
5. The details of the items to be provided while drafting the initial plan of a data quality analysis project are as follows:
- Document the project goals, scope, and potential benefits under 'Task Summary'.
- Describe the data quality analysis tasks under 'Task Description'.
- Summarize the tasks and tools used to provide a baseline of the existing data quality under 'Project Approach'.
- Identify the start and completion dates of the tasks and the project milestones under 'Schedule'.
- Identify the resources required to complete the data quality assessment, including the costs connected with tools acquisition, labor hours (by labor category), training, travel, and other direct and indirect costs, under 'Resources'.

3.11.2 Answers to Terminal Questions (TQs)

1. Before structuring the data in a data warehouse, the data can be categorized into three areas according to usage, scope, and currency: real-time data, derived data, and reconciled data.
Real-time data represents the current status of the business and is used by operational applications to run the business, so this data constantly changes. Derived data is the data that has been developed by aggregating the real-time data through a defined process, so this data represents a view of the business at a specific point in time. Reconciled data is the real-time data that has been cleansed, modified, or enhanced. As this data enables the creation and maintenance of historical data, it can also be termed a type of derived data.

2. Physical partitioning is done according to the physical implementation requirements and constraints, whereas logical partitioning of the data is done according to 'subjects'. These subject areas act as an important criterion in determining the overall logical data partitioning, and a subject area can be defined as 'a portion of a data warehouse that is classified by a specific, consistent perspective'. This perspective is usually based on the characteristics of the data, such as product, customer, time period, account, geographical location, etc. The logical partitioning of the data is the more important, as it affects the physical partitioning, not only of the overall structure but also of the detailed table partitioning. Together, the logical and physical partitioning of the data play an important role in understanding, maintaining, and navigating the data warehouse.

3. The activities involved in a data quality project are as follows:
- Define: Identifying the functional users' data quality requirements and establishing the data quality metrics.
- Measure: Measuring conformance to the current business rules and developing the exception reports.
- Analyze: Verifying, validating, and assessing the causes of poor data quality and defining the improvement opportunities.
- Improve: Selecting and prioritizing the data quality improvement opportunities.
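The Define and Measure activities described above can be sketched as a minimal data quality check. This is an illustrative sketch only: the rule names, the record layout, and the conformance formula are assumptions made for the example, not part of the unit's material.

```python
# Hypothetical sketch of the Define and Measure activities of a data
# quality project; the business rules and sample records are invented.

# Define: establish the data quality metrics as named business rules.
RULES = {
    "customer_id is present": lambda r: r.get("customer_id") is not None,
    "amount is non-negative": lambda r: r.get("amount", 0) >= 0,
    "country is a known code": lambda r: r.get("country") in {"IN", "US", "GB"},
}

def measure(records):
    """Measure: check each record against each rule and build an exception report."""
    exceptions = []
    for row, record in enumerate(records):
        for name, check in RULES.items():
            if not check(record):
                exceptions.append((row, name))
    # Conformance = share of (record, rule) checks that passed.
    conformance = 1 - len(exceptions) / (len(records) * len(RULES))
    return conformance, exceptions

if __name__ == "__main__":
    sample = [
        {"customer_id": 1, "amount": 250.0, "country": "IN"},
        {"customer_id": None, "amount": -10.0, "country": "XX"},  # fails all 3 rules
    ]
    score, report = measure(sample)
    print(f"conformance: {score:.0%}")   # -> conformance: 50%
    for row, rule in report:             # Analyze: inspect the exceptions
        print(f"row {row}: violates '{rule}'")
```

The Improve activity then works through the exception report, fixing the highest-impact violations first and re-running the measurement, which makes the continuous, iterative nature of TDQM concrete.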
Improving data quality may lead to changing data entry procedures, updating data validation rules, and/or using company data standards to prescribe a uniform representation of data throughout the company.

4. 'Intrinsic data quality' is the accuracy of the data; it is the degree to which the data accurately reflects the real-world object that the data represents. 'Realistic data quality' is the degree of utility and value the data has in supporting the organizational processes that accomplish the organization's objectives.

The important areas that one has to concentrate on for achieving data quality improvement are as follows:

a. Process Improvement: This area covers improvement of the functional processes that are used to create, manage, access, and use data. The functional process changes encourage centralized data entry, elimination of non-value-added activities, and placing the data quality responsibilities where the data is entered into the data set (e.g., certification of data).

b. System Improvement: This area covers software, hardware, and telecommunication changes for the improvement of data quality. For instance, security software can minimize the damage done by malicious updates to databases by unauthorized users. Hardware improvements may make batch loads faster and thereby make it unnecessary to turn off edit and validation constraints while loading the data into a database. Telecommunications improvements (e.g., increasing the bandwidth) may provide easier access to data and provide both accurate and timely data. Other system improvements may include updating end-user, operational, and maintenance manuals, provision of the required user training, etc.

c. Policy & Procedure Improvement: This area covers resolving conflicts in existing policies and procedures to institutionalize the behaviors that promote good data quality.
One has to develop standard operating procedures for the information system to document the data quality rule sets/filters and measure the data quality accordingly. Also, periodic data quality checks need to be performed as part of the standard operating procedures to improve the data quality.

d. Data Design Improvement: This area covers improving the overall data design and the use of data standards. Activities in this area, such as adding primary key constraints, indexes, unique key constraints, triggers, stored functions and procedures, and referential integrity constraints, controlling the administration of user privileges, and enforcing security features, can improve the database design.

3.11.3 Answers to Multiple Choice Questions (MCQs)

1. Ans: c
2. Ans: d
3. Ans: b
4. Ans: b
5. Ans: a
6. Ans: a
7. Ans: c
8. Ans: a
9. Ans: d
10. Ans: b
11. Ans: d
12. Ans: c
This note was uploaded on 04/15/2010 for the course MBA mba taught by Professor Smu during the Spring '10 term at Manipal University.
