CS345 Data Mining

Introductions
What Is It?
Cultures of Data Mining

Course Staff
- Instructors: Anand Rajaraman, Jeff Ullman
- TA: Robbie Yan

Requirements
- Homework (Gradiance and other): 20%
- Project: 40%
- Final Exam: 40%
- Gradiance class code: BB8F698B

Project
- Software implementation related to course subject matter.
- Should involve an original component or experiment.
- We will provide some databases to mine; others are OK.

Team Projects
- Working in pairs OK, but ...
  1. We will expect more from a pair than from an individual.
  2. The effort should be roughly evenly distributed.

What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
  - Data cleansing: detection of bogus data. E.g., age = 150.
  - Visualization: something better than megabyte files of output.
  - Warehousing of data (for retrieval).

Typical Kinds of Patterns
1. Decision trees: succinct ways to classify by testing properties.
2. Clusters: another succinct classification by similarity of properties.
3. Bayes, hidden-Markov, and other statistical models, frequent itemsets: expose important associations within data.

Example: Clusters
[Figure: scatter plot of points grouped into clusters]

Example: Frequent Itemsets
- A common marketing problem: examine what people buy together to discover patterns.
  1. What pairs of items are unusually often found together at Safeway checkout? Answer: diapers and beer.
  2. What books are likely to be bought by the same Amazon customer?
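
The "what do people buy together" question can be sketched as a co-occurrence count over market baskets. This is only a minimal illustration: the baskets below are made-up data, `count_pairs` is a hypothetical helper name, and the code counts raw co-occurrence rather than testing whether a pair is *unusually* frequent.

```python
from collections import Counter
from itertools import combinations


def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical (alphabetical) form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical checkout data, invented for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk", "bread", "beer"},
]

pair_counts = count_pairs(baskets)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

On real data the number of candidate pairs grows quadratically with the number of distinct items, which is why counting pairs at scale needs more care than this sketch takes.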

Applications (Among Many)
- Intelligence-gathering: Total Information Awareness.
- Web analysis: PageRank.
- Marketing: run a sale on diapers; raise the price of beer.

Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine learning): concentrate on complex methods, small data.
- Statistics: concentrate on inferring models.

Models vs. Analytic Processing
- To a database person, data mining is a powerful form of analytic processing: queries that examine large amounts of data.
  - Result is the data that answers the query.
- To a statistician, data mining is the inference of models.
  - Result is the parameters of the model.
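
The difference between "data that answers the query" and "parameters of the model" can be sketched in a few lines. This is an illustration under stated assumptions: the data is synthetic, the function names are hypothetical, and the model is a Gaussian fitted by maximum likelihood (sample mean and population standard deviation).

```python
import random
import statistics


def query_average(values):
    """Database-style answer: an aggregate computed directly from the data."""
    return sum(values) / len(values)


def fit_gaussian(values):
    """Statistician-style answer: the parameters (mu, sigma) of a fitted model."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values, mu)  # population std dev = Gaussian MLE
    return mu, sigma


# Synthetic data, assumed for illustration: 100,000 draws from N(50, 10^2).
rng = random.Random(0)
values = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

avg = query_average(values)
mu, sigma = fit_gaussian(values)
print(f"query result: average = {avg:.2f}")
print(f"model parameters: mean = {mu:.2f}, std dev = {sigma:.2f}")
```

The two answers share a number (the mean), but the fitted model also makes a claim about how the data is distributed, which the plain query does not.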

(Way too Simple) Example
- Given a billion numbers, a DB person might compute their average.
- A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.

Meaningfulness of Answers
- A big risk when data mining is that you will "discover" patterns that are meaningless.
- Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
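
A small simulation can make the principle concrete. The sketch below assumes purely random data: a random 10-bit "target" and many random 10-bit "candidate patterns", none of which has any real relationship to the target. The function name and parameters are hypothetical.

```python
import random


def count_chance_matches(n_observations, n_candidates, seed=0):
    """Count random candidates that agree with a random target on every observation."""
    rng = random.Random(seed)
    target = rng.getrandbits(n_observations)
    return sum(
        1
        for _ in range(n_candidates)
        if rng.getrandbits(n_observations) == target
    )


n_observations = 10    # amount of data behind each candidate pattern
n_candidates = 10_000  # number of "places" we look

matches = count_chance_matches(n_observations, n_candidates)
expected = n_candidates / 2 ** n_observations  # chance matches expected on average

print(f"perfect matches found: {matches}")
print(f"expected by chance alone: {expected:.1f}")
```

With only 10 observations per candidate, a perfect match has probability 1/1024, so looking at 10,000 candidates is expected to turn up several "perfect" patterns that mean nothing.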

Examples
- A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
- The Rhine Paradox: a great example of how not to conduct scientific research.

Rhine Paradox (1)
- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP).
- He devised an experiment where subjects were asked to guess whether each of 10 hidden cards was red or blue.
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
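
The "almost 1 in 1000" figure is what pure guessing predicts. A minimal sketch, assuming each card is an independent 50/50 guess (the function names are hypothetical):

```python
def p_all_correct(n_cards, p_guess=0.5):
    """Probability of guessing every one of n_cards correctly by chance."""
    return p_guess ** n_cards


def expected_lucky_subjects(n_subjects, n_cards=10):
    """Expected number of subjects who get all cards right by chance."""
    return n_subjects * p_all_correct(n_cards)


def p_at_least_one_lucky(n_subjects, n_cards=10):
    """Probability that at least one subject gets all cards right by chance."""
    return 1.0 - (1.0 - p_all_correct(n_cards)) ** n_subjects


print(p_all_correct(10))              # 1/1024
print(expected_lucky_subjects(1000))  # a little under 1
print(p_at_least_one_lucky(1000))
```

So among roughly a thousand subjects, about one perfect score is expected even if nobody has ESP.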

Rhine Paradox (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude? Answer on next slide.

Rhine Paradox (3)
- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.

A Concrete Example
- This example illustrates a problem with intelligence-gathering.
- Suppose we believe that certain groups of evildoers are meeting occasionally in hotels to plot doing evil.
- We want to find people who at least twice have stayed at the same hotel on the same day.

The Details
- $10^9$ people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so $10^5$ hotels).
- If everyone behaves randomly (i.e., no evildoers), will the data mining detect anything suspicious?

Calculations (1)
- Probability that persons p and q will be at the same hotel on day d: $1/100 \times 1/100 \times 10^{-5} = 10^{-9}$.
- Probability that p and q will be at the same hotel on two given days: $10^{-9} \times 10^{-9} = 10^{-18}$.
- Pairs of days: $5 \times 10^{5}$.

Calculations (2)
- Probability that p and q will be at the same hotel on some two days: $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$.
- Pairs of people: $5 \times 10^{17}$.
- Expected number of suspicious pairs of people: $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$.
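
The arithmetic above can be checked with a short script. This is a sketch under the slides' stated assumptions (random, independent behavior); the function and parameter names are hypothetical. It uses exact pair counts via `math.comb` instead of the rounded values $5 \times 10^{5}$ and $5 \times 10^{17}$.

```python
from math import comb


def expected_suspicious_pairs(n_people, n_days, p_in_hotel, n_hotels):
    """Expected number of pairs of people who share a hotel on two different days."""
    # Both people are in some hotel that day, and it is the same hotel.
    p_same_hotel_one_day = p_in_hotel * p_in_hotel / n_hotels
    # The same coincidence on two specific days.
    p_two_given_days = p_same_hotel_one_day ** 2
    # Approximate probability of a coincidence on some pair of days
    # (the per-pair probabilities are summed, as on the slide).
    p_some_two_days = comb(n_days, 2) * p_two_given_days
    return comb(n_people, 2) * p_some_two_days


expected = expected_suspicious_pairs(
    n_people=10**9, n_days=1000, p_in_hotel=0.01, n_hotels=10**5
)
rounded = (5 * 10**17) * (5 * 10**5) * 1e-18  # the slide's rounded arithmetic

print(f"exact pair counts:   {expected:,.0f}")
print(f"rounded pair counts: {rounded:,.0f}")
```

With exact pair counts the result comes out slightly below 250,000 (about 249,750), so the rounding on the slide does not change the conclusion.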

Conclusion
- Suppose there are (say) 10 pairs of evildoers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases. Not gonna happen.
- But how can we improve the scheme?
