Let’s go through these steps to better understand the knowledge discovery process.
1. Define the business problem. First and foremost, the prerequisite to knowledge discovery is
understanding your data and your business. Without this understanding, no algorithm, regardless
of sophistication, is going to provide you with a result in which you should have confidence.
Without this background you will not be able to identify the problems you’re trying to solve,
prepare the data for mining, or correctly interpret the results. To make the best use of data mining
you must make a clear statement of your objectives. It may be that you wish to increase the
response to a direct mail campaign. Depending on your specific goal, such as “increasing the
response rate” or “increasing the value of a response,” you will build a very different model. An
effective statement of the problem will include a way of measuring the results of your knowledge
discovery project. It may also include a cost justification.
© 1999 Two Crows Corporation
2. Build a data mining database. This step, along with the next two, constitutes the core of data
preparation. Together, they take more time and effort than all the other steps combined. There
may be repeated iterations of the data preparation and model building steps as you learn
something from the model that suggests you modify the data. These data preparation steps may
take anywhere from 50% to 90% of the time and effort of the entire knowledge discovery process!
The data to be mined should be collected in a database. Note that this does not necessarily imply a
database management system must be used. Depending on the amount of the data, the complexity
of the data, and the uses to which it is to be put, a flat file or even a spreadsheet may be adequate.
In general, it’s not a good idea to use your corporate data warehouse for this. You will be better
off creating a separate data mart. Mining the data will make you a very active user of the data
warehouse, possibly causing resource allocation problems. You will often be joining many tables
together and accessing substantial portions of the warehouse. A single trial model may require
many passes through much of the warehouse.
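As a rough illustration of the separate-data-mart idea, the sketch below runs the multi-table join once and copies the result into a separate database, so that repeated modeling passes read the copy rather than the warehouse. The schema (`customers`, `orders`, `mining_extract`), the helper names, and the choice of SQLite are all assumptions made for the example; the text does not prescribe any particular DBMS.

```python
import sqlite3

def make_demo_warehouse():
    """Build a tiny in-memory stand-in for a warehouse (hypothetical schema)."""
    wh = sqlite3.connect(":memory:")
    wh.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'east'), (2, 'west');
        INSERT INTO orders VALUES (1, 20.0), (1, 5.0), (2, 12.5);
    """)
    return wh

def build_mart(warehouse, mart):
    """Run the join once against the warehouse and store the result in the mart."""
    rows = warehouse.execute("""
        SELECT c.id, c.region, COUNT(o.amount), COALESCE(SUM(o.amount), 0.0)
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.region
        ORDER BY c.id
    """).fetchall()
    mart.execute("""
        CREATE TABLE mining_extract (
            customer_id INTEGER PRIMARY KEY,
            region TEXT,
            n_orders INTEGER,
            total_amount REAL
        )
    """)
    mart.executemany("INSERT INTO mining_extract VALUES (?, ?, ?, ?)", rows)
    mart.commit()
    return len(rows)

warehouse = make_demo_warehouse()
mart = sqlite3.connect(":memory:")  # in practice, a separate file or server
copied = build_mart(warehouse, mart)

# Every later pass reads the mart, not the warehouse.
extract = mart.execute(
    "SELECT * FROM mining_extract ORDER BY customer_id").fetchall()
```

Once the extract exists, trial models can scan `mining_extract` as often as needed without adding load to the warehouse.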
Almost certainly you will be modifying the data from the data warehouse. In addition, you may
want to bring in data from outside your company to overlay on the data warehouse data or you
may want to add new fields computed from existing fields. You may need to gather additional
data through surveys. Other people building different models from the data warehouse (some of
whom will use the same data as you) may want to make similar alterations to the warehouse.
However, data warehouse administrators do not look kindly on having data changed in what is
unquestionably a corporate resource.
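The kinds of modification described above might look like the following sketch, which adds a computed field and overlays an external field on copies of the extracted rows while leaving the source rows untouched. The field names (`avg_order_value`, `median_income`), the ZIP-code key, and the sample values are hypothetical and chosen only for illustration.

```python
import copy

# Hypothetical rows extracted from the warehouse.
warehouse_rows = [
    {"customer_id": 1, "zip": "02139", "total_spend": 250.0, "n_orders": 5},
    {"customer_id": 2, "zip": "94105", "total_spend": 0.0, "n_orders": 0},
]

# Hypothetical external overlay keyed by ZIP code (e.g., purchased demographics).
external_by_zip = {"02139": {"median_income": 72000}}

def enrich(rows, overlay):
    """Return enriched copies of the rows; the input rows are left unchanged."""
    enriched = []
    for row in rows:
        new_row = copy.deepcopy(row)
        # Computed field: average order value, None when there are no orders.
        n = new_row["n_orders"]
        new_row["avg_order_value"] = new_row["total_spend"] / n if n else None
        # Overlay field: None when the external source has no matching key.
        match = overlay.get(new_row["zip"], {})
        new_row["median_income"] = match.get("median_income")
        enriched.append(new_row)
    return enriched

mart_rows = enrich(warehouse_rows, external_by_zip)
```

Because the changes are made to copies, other users of the warehouse continue to see the original data.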
One more reason for a separate database is that the structure of the corporate data warehouse may
not easily support the kinds of exploration you need to do to understand this data. This includes
queries summarizing the data, multi-dimensional reports (sometimes called pivot tables), and
many different kinds of graphs or visualizations.
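To make the idea of a multi-dimensional report concrete, here is a minimal sketch of a two-way summary (a pivot table) built with the Python standard library alone. The record layout (region, product, sales amount) and the function names are assumptions for the example; a real data mining database would typically offer this through its own query or reporting tools.

```python
from collections import defaultdict

# Hypothetical mart records: (region, product, sales amount).
records = [
    ("east", "widgets", 100.0),
    ("east", "gadgets", 40.0),
    ("west", "widgets", 70.0),
    ("east", "widgets", 30.0),
]

def pivot_sum(rows):
    """Sum the amount for each (row key, column key) pair."""
    table = defaultdict(lambda: defaultdict(float))
    for row_key, col_key, amount in rows:
        table[row_key][col_key] += amount
    return {r: dict(cols) for r, cols in table.items()}

def print_pivot(table):
    """Print the pivot as a plain-text grid, showing 0.0 for empty cells."""
    columns = sorted({c for cols in table.values() for c in cols})
    print("region".ljust(8) + "".join(c.rjust(10) for c in columns))
    for row_key in sorted(table):
        cells = "".join(
            f"{table[row_key].get(c, 0.0):10.1f}" for c in columns)
        print(row_key.ljust(8) + cells)

pivot = pivot_sum(records)
print_pivot(pivot)
```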
Lastly, you may want to store this data in a different DBMS with a different physical design than
the one you use for your corporate data warehouse. Increasingly, people are selecting
special-purpose DBMSs which support these data mining requirements quite well. If, however, your
corporate data warehouse allows you to create logical data marts and if it can handle the resource
demands of data mining, then it may also serve as a good data mining database.
The tasks in building a data mining database are:
a. Data collection
b. Data description
c. Selection
d. Data quality assessment and data cleansing
e. Consolidation and integration
f. Metadata construction
g. Load the data mining database
h. Maintain the data mining database
You must remember that these tasks are not performed in strict sequence, but as the need arises.
For example, you will start constructing the metadata infrastructure as you collect the data, and
modify it continuously. What you learn in consolidation or data quality assessment may change
your initial selection decision.
a. Data collection. Identify the sources of the data you will be mining. A data-gathering phase
may be necessary because some of the data you need may never have been collected. You
may need to acquire external data from public databases (such as census or weather data) or
proprietary databases (such as credit bureau data).
A Data Collection Report lists the properties of the different source data sets. Some of the
elements in this report should include:
• Source of data (internal application or outside vendor)
• Person/organization responsible for maintaining data
• Cost (if purchased)
• Storage organization (e.g., Oracle database, VSAM file, etc.)
• Size in tables, rows, records, etc.
• Size in bytes
• Physical storage (CD-ROM, tape, server, etc.)
• Security requirements
• Restrictions on use
• Privacy requirements
Be sure to make note of special security and privacy issues that your data mining database
will inherit from the source data....
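One possible way to keep a Data Collection Report in machine-readable form is sketched below, with one record per source data set and a helper that gathers the security and privacy notes the data mining database would inherit. The class name, field names, and sample entries are illustrative assumptions that mirror the bullet list above; they are not a format defined by the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SourceDataSet:
    """One entry in a Data Collection Report (field names are illustrative)."""
    name: str
    source: str                    # internal application or outside vendor
    maintainer: str                # person/organization responsible
    storage_organization: str      # e.g., "Oracle database", "VSAM file"
    size_rows: int
    size_bytes: int
    physical_storage: str          # e.g., "CD-ROM", "tape", "server"
    cost: Optional[float] = None   # None if the data was not purchased
    security_requirements: List[str] = field(default_factory=list)
    use_restrictions: List[str] = field(default_factory=list)
    privacy_requirements: List[str] = field(default_factory=list)

def inherited_constraints(report):
    """Collect the security and privacy notes from every source data set."""
    notes = set()
    for entry in report:
        notes.update(entry.security_requirements)
        notes.update(entry.privacy_requirements)
    return sorted(notes)

# Hypothetical report with one internal and one purchased source.
report = [
    SourceDataSet(
        name="orders", source="internal application", maintainer="Sales IT",
        storage_organization="Oracle database", size_rows=1_000_000,
        size_bytes=250_000_000, physical_storage="server",
        security_requirements=["restricted to analysts"],
    ),
    SourceDataSet(
        name="credit_scores", source="outside vendor", maintainer="Vendor X",
        storage_organization="flat file", size_rows=200_000,
        size_bytes=20_000_000, physical_storage="CD-ROM", cost=5000.0,
        use_restrictions=["marketing use only"],
        privacy_requirements=["no individual-level export"],
    ),
]
constraints = inherited_constraints(report)
```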
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.