...the workings of the tools you choose and the algorithms on which they are based.
The choices you make in setting up your data mining tool and the optimizations you choose will
affect the accuracy and speed of your models.
Data mining does not replace skilled business analysts or managers, but rather gives them a powerful
new tool to improve the job they are doing. Any company that knows its business and its customers is
already aware of many important, high-payoff patterns that its employees have observed over the
years. What data mining can do is confirm such empirical observations and find new, subtle patterns
that yield steady incremental improvement (plus the occasional breakthrough insight).
Data mining and data warehousing
Frequently, the data to be mined is first extracted from an enterprise data warehouse into a data
mining database or data mart (Figure 1). There is some real benefit if your data is already part of a
data warehouse. As we shall see later on, the problems of cleansing data for a data warehouse and for
data mining are very similar. If the data has already been cleansed for a data warehouse, then it most
likely will not need further cleaning in order to be mined. Furthermore, you will have already
addressed many of the problems of data consolidation and put in place maintenance procedures.
The data mining database may be a logical rather than a physical subset of your data warehouse,
provided that the data warehouse DBMS can support the additional resource demands of data mining.
If it cannot, then you will be better off with a separate data mining database.
Figure 1. Data mining data mart extracted from a data warehouse. (Diagram labels: Data Sources, Data Warehouse, Geographic Data Mart, Analysis Data Mart, Data Mining Data Mart.)
© 1999 Two Crows Corporation
A data warehouse is not a requirement for data mining. Setting up a large data warehouse that
consolidates data from multiple sources, resolves data integrity problems, and loads the data into a
query database can be an enormous task, sometimes taking years and costing millions of dollars. You
could, however, mine data from one or more operational or transactional databases by simply
extracting it into a read-only database (Figure 2). This new database functions as a type of data mart.
Figure 2. Data mining data mart extracted from operational databases. (Diagram labels: Data Sources, Data Mining Data Mart.)
Data mining and OLAP
One of the most common questions from data processing professionals is about the difference
between data mining and OLAP (On-Line Analytical Processing). As we shall see, they are very
different tools that can complement each other.
OLAP is part of the spectrum of decision support tools. Traditional query and report tools describe
what is in a database. OLAP goes further; it’s used to answer why certain things are true. The user
forms a hypothesis about a relationship and verifies it with a series of queries against the data. For
example, an analyst might want to determine the factors that lead to loan defaults. He or she might
initially hypothesize that people with low incomes are bad credit risks and analyze the database with
OLAP to verify (or disprove) this assumption. If that hypothesis were not borne out by the data, the
analyst might then look at high debt as the determinant of risk. If the data did not support this guess
either, he or she might then try debt and income together as the best predictor of bad credit risks.
In other words, the OLAP analyst generates a series of hypothetical patterns and relationships and
uses queries against the database to verify them or disprove them. OLAP analysis is essentially a
deductive process. But what happens when the number of variables being analyzed is in the dozens or
even hundreds? It becomes much more difficult and time-consuming to find a good hypothesis and to
verify or disprove it against the database with OLAP, let alone to be confident that there is no
better explanation than the one found.
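The hypothesis-by-hypothesis workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the interface of any OLAP product: the loan records, the FIELDS mapping, and the default_rate_by helper are all invented for the example. Each call plays the role of one query the analyst issues to check one hypothesis.

```python
from collections import defaultdict

# Hypothetical loan records: (income_band, debt_band, defaulted).
# All values are invented for illustration only.
LOANS = [
    ("low", "high", True),
    ("low", "high", True),
    ("low", "low", False),
    ("low", "low", False),
    ("high", "high", False),
    ("high", "high", True),
    ("high", "low", False),
    ("high", "low", False),
]

# Position of each dimension inside a record (an assumption of this sketch).
FIELDS = {"income": 0, "debt": 1}


def default_rate_by(records, dimensions):
    """Group records by the named dimensions and return the default rate per group."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for rec in records:
        key = tuple(rec[FIELDS[d]] for d in dimensions)
        totals[key] += 1
        defaults[key] += rec[2]  # True counts as 1, False as 0
    return {key: defaults[key] / totals[key] for key in totals}


# Hypothesis 1: low income alone is the determinant of risk.
by_income = default_rate_by(LOANS, ["income"])
# Hypothesis 2: high debt alone is the determinant of risk.
by_debt = default_rate_by(LOANS, ["debt"])
# Hypothesis 3: debt and income together are the best predictor.
by_both = default_rate_by(LOANS, ["income", "debt"])

for name, result in [("income", by_income), ("debt", by_debt), ("both", by_both)]:
    print(name, result)
```

Note that the analyst has to choose every grouping by hand. With two variables there are only three groupings to try; with dozens of variables the number of candidate combinations grows far too quickly to explore this way.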
Data mining is different from OLAP because, rather than verifying hypothetical patterns, it uses the data
itself to uncover such patterns. It is essentially an inductive process. For example, suppose the analyst
who wanted to identify the risk factors for loan default were to use a data mining tool. The data
mining tool might discover that people with high debt and low incomes were bad credit risks (as
above), but it might go further and also discover a pattern the analyst did not think to try, such as that
age is also a determinant of risk.
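As a contrast with hand-picked queries, the following minimal sketch lets the data rank the candidate variables itself. It is a stand-in for a real data mining tool, not the method of any particular product: it scores each variable by the weighted Gini impurity of the groups it produces (the measure used in CART-style decision trees) and sorts the variables from most to least informative. The records, the field names, and the helper functions are all hypothetical.

```python
# Hypothetical loan records; every value is invented for illustration only.
RECORDS = [
    {"income": "low", "debt": "high", "age": "young", "defaulted": True},
    {"income": "low", "debt": "high", "age": "young", "defaulted": True},
    {"income": "low", "debt": "low", "age": "young", "defaulted": True},
    {"income": "low", "debt": "low", "age": "old", "defaulted": False},
    {"income": "high", "debt": "high", "age": "young", "defaulted": True},
    {"income": "high", "debt": "high", "age": "old", "defaulted": False},
    {"income": "high", "debt": "low", "age": "old", "defaulted": False},
    {"income": "high", "debt": "low", "age": "old", "defaulted": False},
]


def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def weighted_gini(records, variable, target="defaulted"):
    """Size-weighted average impurity of the groups formed by splitting on one variable."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[variable], []).append(rec[target])
    n = len(records)
    return sum(len(g) / n * gini(g) for g in groups.values())


def rank_variables(records, target="defaulted"):
    """Return all non-target variables, most informative (lowest impurity) first."""
    variables = [name for name in records[0] if name != target]
    return sorted(variables, key=lambda v: weighted_gini(records, v, target))


ranking = rank_variables(RECORDS)
for variable in ranking:
    print(variable, weighted_gini(RECORDS, variable))
```

In this made-up data set, age separates defaulters from non-defaulters better than income or debt does, so it is ranked first even though nobody hypothesized it in advance. A real tool would search combinations of variables and guard against patterns that arise by chance, which this sketch does not attempt.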
Here is where data mining and OLAP can complement each other. Before acting on the pattern, the
analyst needs to know what the financial implications would be of using the discovered pattern to
govern who gets credit. The OLAP tool can allow the analyst to answer those kinds of questions.
Furthermore, OLAP is also complementary in the early stages of the knowledge discovery process
because it can help you explore your data, for instance by focusing attention on important variables,