To make the best use of data mining you must make a


ults

Let’s go through these steps to better understand the knowledge discovery process.

1. Define the business problem. First and foremost, the prerequisite to knowledge discovery is understanding your data and your business. Without this understanding, no algorithm, regardless of sophistication, is going to provide you with a result in which you should have confidence. Without this background you will not be able to identify the problems you’re trying to solve, prepare the data for mining, or correctly interpret the results.

To make the best use of data mining you must make a clear statement of your objectives. It may be that you wish to increase the response to a direct mail campaign. Depending on your specific goal, such as “increasing the response rate” or “increasing the value of a response,” you will build a very different model. An effective statement of the problem will include a way of measuring the results of your knowledge discovery project. It may also include a cost justification.

© 1999 Two Crows Corporation

2. Build a data mining database. This step, along with the next two, constitutes the core of the data preparation. Together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps as you learn something from the model that suggests you modify the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire knowledge discovery process!

The data to be mined should be collected in a database. Note that this does not necessarily imply that a database management system must be used. Depending on the amount of the data, the complexity of the data, and the uses to which it is to be put, a flat file or even a spreadsheet may be adequate.

In general, it’s not a good idea to use your corporate data warehouse for this. You will be better off creating a separate data mart.
Mining the data will make you a very active user of the data warehouse, possibly causing resource allocation problems. You will often be joining many tables together and accessing substantial portions of the warehouse. A single trial model may require many passes through much of the warehouse.

Almost certainly you will be modifying the data from the data warehouse. In addition you may want to bring in data from outside your company to overlay on the data warehouse data or you may want to add new fields computed from existing fields. You may need to gather additional data through surveys. Other people building different models from the data warehouse (some of whom will use the same data as you) may want to make similar alterations to the warehouse. However, data warehouse administrators do not look kindly on having data changed in what is unquestionably a corporate resource.

One more reason for a separate database is that the structure of the corporate data warehouse may not easily support the kinds of exploration you need to do to understand this data. This includes queries summarizing the data, multi-dimensional reports (sometimes called pivot tables), and many different kinds of graphs or visualizations.

Lastly, you may want to store this data in a different DBMS with a different physical design than the one you use for your corporate data warehouse. Increasingly, people are selecting special-purpose DBMSs which support these data mining requirements quite well. If, however, your corporate data warehouse allows you to create logical data marts and if it can handle the resource demands of data mining, then it may also serve as a good data mining database.

The tasks in building a data mining database are:

a. Data collection
b. Data description
c. Selection
d. Data quality assessment and data cleansing
e. Consolidation and integration
f. Metadata construction
g. Load the data mining database
h. Maintain the data mining database

You must remember that these tasks are not performed in strict sequence, but as the need arises. For example, you will start constructing the metadata infrastructure as you collect the data, and modify it continuously. What you learn in consolidation or data quality assessment may change your initial selection decision.

a. Data collection. Identify the sources of the data you will be mining. A data-gathering phase may be necessary because some of the data you need may never have been collected. You may need to acquire external data from public databases (such as census or weather data) or proprietary databases (such as credit bureau data).

A Data Collection Report lists the properties of the different source data sets. Some of the elements in this report should include:

• Source of data (internal application or outside vendor)
• Owner
• Person/organization responsible for maintaining data
• DBA
• Cost (if purchased)
• Storage organization (e.g., Oracle database, VSAM file, etc.)
• Size in tables, rows, records, etc.
• Size in bytes
• Physical storage (CD-ROM, tape, server, etc.)
• Security requirements
• Restrictions on use
• Privacy requirements

Be sure to make note of special security and privacy issues that your data mining database will inherit from the source data....
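
As a rough illustration of the Data Collection Report described above, the sketch below models one report entry as a Python dataclass and collects the security, use, and privacy constraints that a data mining database would inherit from its sources. The class name, field names, helper function, and all sample values are hypothetical; the text lists only the report elements, not a format for recording them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class SourceDataSet:
    """One entry in a Data Collection Report (field names are illustrative)."""
    name: str
    source: str                   # internal application or outside vendor
    owner: str
    maintainer: str               # person/organization responsible for the data
    dba: str
    storage_organization: str     # e.g. "Oracle database", "VSAM file"
    size_rows: int
    size_bytes: int
    physical_storage: str         # e.g. "CD-ROM", "tape", "server"
    cost: Optional[float] = None  # filled in only if the data was purchased
    security_requirements: Set[str] = field(default_factory=set)
    restrictions_on_use: Set[str] = field(default_factory=set)
    privacy_requirements: Set[str] = field(default_factory=set)


def inherited_requirements(sources: Iterable[SourceDataSet]) -> Dict[str, Set[str]]:
    """Union the constraints of every source the mining database draws on."""
    inherited: Dict[str, Set[str]] = {
        "security": set(), "restrictions": set(), "privacy": set(),
    }
    for src in sources:
        inherited["security"] |= src.security_requirements
        inherited["restrictions"] |= src.restrictions_on_use
        inherited["privacy"] |= src.privacy_requirements
    return inherited


# Hypothetical entries; every value below is made up for illustration.
census = SourceDataSet(
    name="census_extract", source="outside vendor", owner="public agency",
    maintainer="public agency", dba="n/a", storage_organization="flat file",
    size_rows=50_000, size_bytes=12_000_000, physical_storage="CD-ROM",
)
credit = SourceDataSet(
    name="credit_bureau", source="outside vendor", owner="credit bureau",
    maintainer="credit bureau", dba="n/a",
    storage_organization="Oracle database",
    size_rows=200_000, size_bytes=90_000_000, physical_storage="server",
    cost=25_000.0,
    security_requirements={"encrypted at rest"},
    restrictions_on_use={"marketing models only"},
    privacy_requirements={"no re-identification"},
)

report = [census, credit]
requirements = inherited_requirements(report)
total_bytes = sum(src.size_bytes for src in report)
```

Taking the union of the constraints is one simple reading of "inherit": the combined database is treated as subject to every requirement attached to any of its sources. A real project might need a stricter or more nuanced rule.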

This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.
