Complexity of Search Space
x Machine learning can be considered as a search problem. We wish to find the
correct hypothesis from among many.
y If there are only a few hypotheses we could try them all, but if there are
an infinite number we need a better st
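
The "try them all" case above can be sketched for a small, finite hypothesis space. This is only an illustration under stated assumptions: the toy data, the threshold-style hypotheses, and the names `DATA`, `CANDIDATES`, `errors` and `exhaustive_search` are hypothetical and do not come from the slides.

```python
# Hypothetical toy data: (feature value, label) pairs.
DATA = [(1, False), (3, False), (4, False), (7, True), (9, True)]


def make_hypothesis(threshold):
    """Hypothesis: predict True when the value is at least `threshold`."""
    return lambda x: x >= threshold


def errors(threshold, data):
    """Count the examples that the threshold hypothesis misclassifies."""
    hypothesis = make_hypothesis(threshold)
    return sum(1 for x, label in data if hypothesis(x) != label)


def exhaustive_search(thresholds, data):
    """Try every hypothesis in a finite space; return the first with the fewest errors."""
    return min(thresholds, key=lambda t: errors(t, data))


# A finite hypothesis space of 11 candidate thresholds (an assumption).
CANDIDATES = range(0, 11)

best = exhaustive_search(CANDIDATES, DATA)
print(best, errors(best, DATA))
```

Enumeration works here only because the space has 11 members. With real-valued thresholds the space is infinite and cannot be enumerated this way.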

Data integration issues
x Multi-source
y Oracle, Excel, Informix, DB2, MySQL, etc.
Standardized database drivers help (e.g. ODBC)
Data Warehousing helps
x Multi-format
y relational databases, hierarchical structures, XML, HTML, free text, etc.
x Multi-pla

Noise and Redundancy
x The distortion or mutation of a message is the number of bits that are
corrupted
x Making the message longer by including redundant information can ensure
that a message is received correctly even in the presence of noise
x Some pat
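
The two ideas above (distortion counted as corrupted bits, and redundancy protecting a message) can be illustrated with a simple repetition code. The slides do not name a particular scheme, so the choice of a three-fold repetition code with majority voting, and the names `encode`, `decode` and `corrupted_bits`, are assumptions made for this sketch.

```python
def encode(bits, repeat=3):
    """Add redundancy by sending every bit `repeat` times."""
    return [b for b in bits for _ in range(repeat)]


def decode(received, repeat=3):
    """Recover each original bit by majority vote over its block of copies."""
    decoded = []
    for i in range(0, len(received), repeat):
        block = received[i:i + repeat]
        decoded.append(1 if sum(block) * 2 > len(block) else 0)
    return decoded


def corrupted_bits(sent, received):
    """Distortion: the number of positions where the received bit differs."""
    return sum(s != r for s, r in zip(sent, received))


message = [1, 0, 1, 1]
sent = encode(message)

# Simulate noise by flipping one bit in each of the first two blocks.
received = list(sent)
received[1] ^= 1
received[5] ^= 1

print(corrupted_bits(sent, received))  # number of corrupted bits
print(decode(received) == message)     # whether the message survived
```

This code tolerates at most one flipped bit per block of three. Two flips in the same block would outvote the correct value, so the protection is not unconditional.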

x Abstraction
y It can sometimes be useful to reduce the information in a field to simple
yes/no values, e.g. flag people as having a criminal record rather than
having a separate category for each possible crime
x Unit conversion
y choose a standard uni
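
The abstraction step described above (collapsing a detailed field to a yes/no value) might look like the following. The record layout, the field names `offences` and `criminal_record`, and the helper `has_criminal_record` are hypothetical, chosen only to mirror the criminal-record example.

```python
def has_criminal_record(offences):
    """Reduce a detailed list of offences to a single yes/no flag."""
    return len(offences) > 0


# Hypothetical records with one detailed category per offence.
records = [
    {"name": "A", "offences": ["theft", "fraud"]},
    {"name": "B", "offences": []},
]

# Abstracted view: the detailed categories are replaced by one boolean field.
flagged = [
    {"name": r["name"], "criminal_record": has_criminal_record(r["offences"])}
    for r in records
]

for row in flagged:
    print(row)
```

The abstraction is lossy by design: the individual offence categories cannot be recovered from the flag.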

Data characterization
x After obtaining all the data streams, the nature of each data stream must be
characterized
y This is not the same as the data format (i.e. field names and lengths)
x Detail/Aggregation Level (Granularity)
y all variables fall somew

Exploratory Data Analysis (EDA)
x Classical statistics has a dogma that the data may not be viewed prior to
modeling
y The aim is to avoid choosing biased hypotheses
x During the 1970s the term Exploratory Data Analysis (EDA) was used to
express the notion th

Machine Learning
x "A general law can never be verified by a finite number of observations. It can,
however, be falsified by only one observation."
Karl Popper
x Many algorithms now used in data mining were developed by researchers
working in machine learni

The Link between Pattern and Approach
x Data mining aims to reveal knowledge about the data under consideration
x This knowledge takes the form of patterns within the data which embody our
understanding of the data
y Patterns are also referred to as struc

MONASH UNIVERSITY
Faculty of Information Technology
CSE5230 Data Mining
Semester 2, 2004
Preliminary Quiz - Basic Mathematics
Solutions
1. Let be the number of items bought by a shopper during a visit to a supermarket. Ten shoppers visit the supermarket

Ten Golden Rules for Building Models
1. Select clearly defined problems that will yield tangible benefits
2. Specify the required solution
3. Define how the solution is going to be used
4. Understand as much as possible about the problem and the data set

Discovery-driven Data Mining Techniques
x Discovery-driven data mining techniques can also be broken down into two
broad areas:
y those techniques which are considered predictive, sometimes termed
supervised techniques
y those techniques which are termed

Extracting part of the available data
x In most cases original data sets would be too large to handle as a single
entity. There are two ways of handling this problem:
y Limit the scope of the problem
concentrate on particular products, regions, time f
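
Limiting the scope, as described above, amounts to keeping only the rows that match chosen criteria. The following is a minimal sketch: the `sales` rows, their field names, and the `limit_scope` helper are hypothetical and not taken from the slides.

```python
# Hypothetical transaction rows; field names are assumptions.
sales = [
    {"product": "milk", "region": "VIC", "amount": 3.0},
    {"product": "milk", "region": "NSW", "amount": 2.5},
    {"product": "bread", "region": "VIC", "amount": 4.0},
    {"product": "milk", "region": "VIC", "amount": 3.2},
]


def limit_scope(rows, **criteria):
    """Keep only the rows whose fields equal every given criterion."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Concentrate on one product in one region.
subset = limit_scope(sales, product="milk", region="VIC")
print(len(subset), "of", len(sales), "rows kept")
```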