identifying exceptions, or finding interactions. This is important because the better you understand
your data, the more effective the knowledge discovery process will be.
Data mining, machine learning and statistics
Data mining takes advantage of advances in the fields of artificial intelligence (AI) and statistics.
Both disciplines have been working on problems of pattern recognition and classification. Both
communities have made great contributions to the understanding and application of neural nets and
decision trees.
Data mining does not replace traditional statistical techniques. Rather, it is an extension of statistical
methods that is in part the result of a major change in the statistics community. The development of
most statistical techniques was, until recently, based on elegant theory and analytical methods that
worked quite well on the modest amounts of data being analyzed. The increased power of computers
and their lower cost, coupled with the need to analyze enormous data sets with millions of rows, have
allowed the development of new techniques based on a brute-force exploration of possible solutions.
New techniques include relatively recent algorithms like neural nets and decision trees, and new
approaches to older algorithms such as discriminant analysis. By virtue of bringing to bear the
increased computer power on the huge volumes of available data, these techniques can approximate
almost any functional form or interaction on their own. Traditional statistical techniques rely on the
modeler to specify the functional form and interactions.
The key point is that data mining is the application of these and other AI and statistical techniques to
common business problems in a fashion that makes these techniques available to the skilled
knowledge worker as well as the trained statistics professional. Data mining is a tool for increasing
the productivity of people trying to build predictive models.
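The contrast drawn above, between a functional form fixed by the modeler and a brute-force search over possible solutions, can be illustrated with a small sketch. It is not taken from the original text: the data are synthetic, the function names (fit_line, fit_tree and so on) are hypothetical, and the tree is a deliberately minimal single-input regression tree rather than any particular product's algorithm.

```python
def fit_line(points):
    """Least-squares fit of y = a + b*x: the modeler has fixed the form in advance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def sum_squared_error(values):
    """Squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(points, depth):
    """Greedy regression tree on one input; points must be sorted by x."""
    ys = [y for _, y in points]
    mean_y = sum(ys) / len(ys)
    if depth == 0 or len(points) < 2:
        return lambda x: mean_y
    # Brute force: try every split position and keep the one with the lowest error.
    best_i = min(
        range(1, len(points)),
        key=lambda i: sum_squared_error(ys[:i]) + sum_squared_error(ys[i:]),
    )
    threshold = (points[best_i - 1][0] + points[best_i][0]) / 2
    left = fit_tree(points[:best_i], depth - 1)
    right = fit_tree(points[best_i:], depth - 1)
    return lambda x: left(x) if x < threshold else right(x)


def mean_squared_error(model, points):
    """Average squared prediction error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Synthetic data (an assumption for illustration): y = x^2 on [-1, 1],
# a curved relationship that a straight line cannot represent.
points = [(i / 50 - 1, (i / 50 - 1) ** 2) for i in range(101)]
line_mse = mean_squared_error(fit_line(points), points)
tree_mse = mean_squared_error(fit_tree(points, depth=3), points)
print(f"straight line MSE: {line_mse:.4f}")
print(f"depth-3 tree MSE:  {tree_mse:.4f}")
```

The straight line can only fit what its form allows, so a statistician would have to add a squared term by hand. The tree finds the curvature by searching split points, at the cost of far more computation per fit.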
Data mining and hardware/software trends
A key enabler of data mining is the major progress in hardware price and performance. The dramatic
99% drop in the price of computer disk storage in just the last few years has radically changed the
economics of collecting and storing massive amounts of data. At $10/megabyte, one terabyte of data
costs $10,000,000 to store. At 10¢/megabyte, one terabyte of data costs only $100,000 to store! This
doesn’t even include the savings in real estate from greater storage capacities.
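The storage arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation: the function name is hypothetical, and it assumes the decimal convention of 1,000,000 megabytes per terabyte, which is the one the figures in the text imply.

```python
MEGABYTES_PER_TERABYTE = 1_000_000  # decimal convention, matching the figures in the text


def storage_cost_dollars(terabytes, cents_per_megabyte):
    """Cost in dollars of storing the given number of terabytes."""
    total_cents = terabytes * MEGABYTES_PER_TERABYTE * cents_per_megabyte
    return total_cents / 100


old_cost = storage_cost_dollars(1, 1000)  # $10 per megabyte
new_cost = storage_cost_dollars(1, 10)    # 10 cents per megabyte
print(f"old: ${old_cost:,.0f}  new: ${new_cost:,.0f}")
print(f"price drop: {1 - new_cost / old_cost:.0%}")
```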
The drop in the cost of computer processing has been equally dramatic. Each generation of chips
greatly increases the power of the CPU, while allowing further drops on the cost curve. This is also
reflected in the price of RAM (random access memory), where the cost of a megabyte has dropped
from hundreds of dollars to around a dollar in just a few years. PCs routinely have 64 megabytes or
more of RAM, and workstations may have 256 megabytes or more, while servers with gigabytes of
main memory are not a rarity.
While the power of the individual CPU has greatly increased, the real advances in scalability stem
from parallel computer architectures. Virtually all servers today support multiple CPUs using
symmetric multi-processing, and clusters of these SMP servers can be created that allow hundreds of
CPUs to work on finding patterns in the data.
© 1999 Two Crows Corporation
Advances in database management systems to take advantage of this hardware parallelism also
benefit data mining. If you have a large or complex data mining problem requiring a great deal of
access to an existing database, native DBMS access typically provides the best performance.
The result of these trends is that many of the performance barriers to finding patterns in large amounts
of data are being eliminated.
Data mining applications
Data mining is increasingly popular because of the substantial contribution it can make: it can be used
to control costs as well as to contribute to revenue increases.
Many organizations are using data mining to help manage all phases of the customer life cycle,
including acquiring new customers, increasing revenue from existing customers, and retaining good
customers. By determining characteristics of good customers (profiling), a company can target
prospects with similar characteristics. By profiling customers who have bought a particular product, it
can focus attention on similar customers who have not bought that product (cross-selling). By
profiling customers who have left, a company can act to retain customers who are at risk of leaving
(reducing churn or attrition), because it is usually far less expensive to retain a customer than to acquire
a new one.
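One simple way to picture the cross-selling idea is to average the attributes of customers who bought a product and then rank non-buyers by how close they are to that average. The sketch below does only that. The customer records, field names and function names are all hypothetical, the features are assumed to be pre-scaled to a common 0 to 1 range, and real profiling would normally use a fitted model rather than a plain distance to the mean.

```python
from math import dist


def mean_profile(feature_rows):
    """Average each feature across the given customers."""
    n = len(feature_rows)
    return tuple(sum(column) / n for column in zip(*feature_rows))


def cross_sell_candidates(customers, top_n):
    """Return ids of the non-buyers closest to the average buyer profile."""
    buyers = [c["features"] for c in customers if c["bought"]]
    profile = mean_profile(buyers)
    non_buyers = [c for c in customers if not c["bought"]]
    # Euclidean distance to the buyer profile; smaller means more similar.
    ranked = sorted(non_buyers, key=lambda c: dist(c["features"], profile))
    return [c["id"] for c in ranked[:top_n]]


# Hypothetical customers; features are (scaled age, scaled monthly spend).
customers = [
    {"id": "A", "features": (0.80, 0.90), "bought": True},
    {"id": "B", "features": (0.70, 0.80), "bought": True},
    {"id": "C", "features": (0.75, 0.85), "bought": False},
    {"id": "D", "features": (0.10, 0.20), "bought": False},
    {"id": "E", "features": (0.60, 0.70), "bought": False},
]
print(cross_sell_candidates(customers, top_n=2))
```

The same ranking applied to the profile of customers who have left would give a rough list of customers at risk of leaving.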
Data mining offers value across a broad spectrum of industries. Telecommunications and credit card
companies are two of the leaders in applying data mining to detect fraudulent use of their services.
Insurance companies and stock exchanges are also interested in applying this technology to reduce
fraud. Medical applications are another fruitful area: data mining can be used to predi...
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.