Problem Set 1
September 14, 2009
Due date:
Monday, September 28, 2009, at 4pm; before class.
Exercise 1: (20 points) Some years ago, the Greek video-club chain Seven had the following offer for its customers: every time a customer rented a DVD, he was given a random coupon…
Solutions to Problem Set 1
October 7, 2009
Exercise 1: (20 points) Some years ago, the Greek video-club chain Seven had the following offer for its customers: every time a customer rented a DVD, he was given a random coupon with the title of the Academy Awards…
Boston University Department of Computer Science CS 565 Data Mining
Midterm Exam Solutions
Date: Oct 14, 2009
Write Your University Number Here:
Answer all questions. Good luck!
Problem 1 [25 points] True or False:
1. Maximal frequent itemsets are sufficient…
Time-series data analysis
Why deal with sequential data?
Because all data is sequential: all data items arrive in the data store in some order.
Examples: transaction data; documents and words.
In some (or many) cases the order does not matter; in many cases it does.
Lecture outline
Classification
Decision-tree classification
What is classification?
Classification is the task of learning a target function f that maps attribute set x to one of the predefined class labels y
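The definition above can be made concrete with a toy sketch: a "model" is just a function f from an attribute set x to a class label y. The attributes and labels below are illustrative only (they echo the classic mammal/non-mammal teaching example, not anything from this course's data).

```python
# A toy illustration of classification: a target function f maps an
# attribute vector x to one of the predefined class labels y.
# (Hypothetical attributes and labels, chosen only for illustration.)

def f(x):
    """A hand-built 'model': classify an animal from two attributes."""
    body_temp, gives_birth = x
    if body_temp == "warm" and gives_birth:
        return "mammal"
    return "non-mammal"

print(f(("warm", True)))   # mammal
print(f(("cold", False)))  # non-mammal
```

In practice f is learned from labeled training records rather than written by hand; the classifiers in the following lectures differ in how they represent and fit f.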
Lecture outline
Classification
Naïve Bayes classifier
Nearest-neighbor classifier
Eager vs Lazy learners
Eager learners: learn the model as soon as the training data becomes available.
Lazy learners: delay model-building until the testing data needs to be classified.
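The lazy strategy can be sketched with a minimal 1-nearest-neighbor learner: "training" only memorizes the data, and all the work happens per query. (Toy 1-D data; a sketch, not a full classifier.)

```python
# A lazy learner: fit() just stores the training data; the model is
# effectively built at prediction time, once per query point.

class LazyOneNN:
    def fit(self, X, y):
        self.X, self.y = X, y          # "training" = memorize
        return self

    def predict(self, x):
        # all computation deferred to query time: scan for nearest point
        i = min(range(len(self.X)), key=lambda j: abs(self.X[j] - x))
        return self.y[i]

clf = LazyOneNN().fit([1.0, 2.0, 8.0, 9.0], ["a", "a", "b", "b"])
print(clf.predict(1.5))  # a
print(clf.predict(8.5))  # b
```

An eager learner (e.g. a decision tree) would instead spend its effort inside fit() and make predict() cheap.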
Lecture outline
Support vector machines
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
(Figures: two candidate separating hyperplanes, B1 (one possible solution) and B2 (another possible solution).)
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
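The standard metrics are all derived from the confusion matrix: counts of true/false positives and negatives. A sketch on toy counts (the numbers are invented for illustration):

```python
# Performance metrics from a binary confusion matrix.

def metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are real
    recall    = tp / (tp + fn)          # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=40, fp=10, fn=5, tn=45)
print(acc)   # 0.85
print(prec)  # 0.8
```

Accuracy alone can mislead on imbalanced classes, which is why precision/recall/F1 appear alongside it.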
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance of different models?
Link Analysis Ranking
How do search engines decide how to rank your query results?
Guess why Google ranks the query results the way it does
How would you do it?
Naïve ranking of query results
Given query q, rank the web pages p in the index based on sim(q, p), the similarity between the query and the page.
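A common instantiation of sim(q, p) is cosine similarity between bag-of-words term vectors; the sketch below assumes that choice (toy pages, no tf-idf weighting).

```python
# Naive ranking sketch: score each page by the cosine similarity of
# its term-frequency vector with the query's, highest first.

from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(pages, query):
    q = Counter(query.split())
    scored = [(cosine(q, Counter(p.split())), p) for p in pages]
    return [p for s, p in sorted(scored, reverse=True)]

pages = ["data mining lecture", "cooking recipes", "mining frequent itemsets"]
print(rank(pages, "data mining")[0])  # data mining lecture
```

This ranking is purely content-based, which is exactly the weakness link-analysis ranking addresses.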
More on Rankings
Query-independent LAR
Have an a-priori ordering of the web pages. Q: set of pages that contain the keywords in the query q. Present the pages in Q ordered according to the a-priori ordering. What are the advantages of such an approach?
InDegree algorithm
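The InDegree heuristic is the simplest query-independent ordering: rank candidate pages by how many incoming links they have in the web graph. A sketch on a toy link graph:

```python
# InDegree ranking sketch: count incoming links per page, then order
# the query's candidate pages by that count (ties keep input order).

from collections import defaultdict

def indegree_rank(edges, candidates):
    indeg = defaultdict(int)
    for src, dst in edges:       # each link src -> dst
        indeg[dst] += 1
    return sorted(candidates, key=lambda p: indeg[p], reverse=True)

edges = [("a", "c"), ("b", "c"), ("b", "d"), ("a", "d"), ("e", "d")]
print(indegree_rank(edges, ["c", "d", "e"]))  # ['d', 'c', 'e']
```

Counting only direct in-links makes the ordering easy to manipulate, which motivates the more robust link-analysis algorithms.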
Graph Clustering
Why is graph clustering useful?
Distance matrices are graphs, so graph clustering is as useful as any other clustering.
Identification of communities in social networks.
Webpage clustering for better data management of web data.
Outline
Min s-t cut problem
Min cut problem
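By the max-flow min-cut theorem, the value of a minimum s-t cut equals the maximum flow from s to t, so a min s-t cut can be computed with any max-flow routine. A compact Edmonds-Karp sketch on a small made-up capacity graph:

```python
# Min s-t cut value via max-flow (Edmonds-Karp: BFS augmenting paths
# in the residual graph). cap[u][v] is the edge capacity u -> v.

from collections import deque

def max_flow(cap, s, t):
    n, flow = len(cap), 0
    while True:
        parent = [-1] * n            # BFS for an augmenting path
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return flow              # no augmenting path: flow = min cut value
        path, v = [], t              # recover path, find bottleneck
        while v != s:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:            # update residual capacities
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck

cap = [[0, 3, 2, 0],   # node 0 = s
       [0, 0, 1, 2],
       [0, 0, 0, 2],
       [0, 0, 0, 0]]   # node 3 = t
mf = max_flow(cap, 0, 3)
print(mf)  # 4
```

The cut itself can be read off afterwards: the source side is the set of nodes reachable from s in the final residual graph.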
CS 535 - Fall 2008 - MIDTERM with Answers
DIRECTIONS: Do any 4 of the following 5 problems. Each problem is worth 10 points. Write all of your answers in your blue book. The test is open book and you can use one page of notes.
1. a. Let f be a total computable…
Problem Set 3
December 3, 2009
Due date:
Wednesday, December 9, 2009, at 4pm; submit by email.
Exercise 1: (30 points) You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled…
Mining Association Rules in Large Databases
Association rules
Given a set of transactions D, find rules that will predict the occurrence of an item (or a set of items) based on the occurrences of other items in the transaction
Market-Basket transactions
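The two measures behind these rules can be shown directly on toy market-basket data: support of an itemset is the fraction of transactions containing it, and confidence of X → Y is support(X ∪ Y) / support(X). The transactions below are invented teaching data.

```python
# Support and confidence of a candidate rule X -> Y over toy
# market-basket transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

# rule {milk, diapers} -> {beer}
print(support({"milk", "diapers", "beer"}))       # 0.4
print(confidence({"milk", "diapers"}, {"beer"}))  # 0.666...
```

Mining then amounts to finding all rules whose support and confidence clear user-chosen thresholds.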
Recap: Mining association rules from large datasets
Recap
Task 1: Methods for finding all frequent itemsets efficiently
Task 2: Methods for finding association rules efficiently
Recap
Frequent itemsets (measure: support)
Apriori principle
Apriori algorithm
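The Apriori algorithm can be sketched compactly: generate candidates level by level, and prune any size-k candidate with an infrequent (k-1)-subset, which is exactly the Apriori principle. This is an illustrative mini-implementation on toy data, not an efficient one (real implementations use hash trees or prefix structures).

```python
# Level-wise Apriori sketch: every subset of a frequent itemset must
# itself be frequent, so candidates with an infrequent subset are pruned
# before their support is ever counted.

from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    def frequent(cands):
        return {c for c in cands
                if sum(c <= t for t in transactions) / n >= minsup}
    items = {frozenset([i]) for t in transactions for i in t}
    level, result = frequent(items), set()
    while level:
        result |= level
        k = len(next(iter(level))) + 1
        cands = {a | b for a in level for b in level if len(a | b) == k}
        cands = {c for c in cands                      # Apriori pruning
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = frequent(cands)
    return result

T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = apriori(T, minsup=0.6)
print(sorted("".join(sorted(s)) for s in freq))  # ['a', 'ab', 'ac', 'b', 'bc', 'c']
```

Here {a,b,c} has support 2/5 < 0.6, so it is correctly excluded even though all its 2-subsets are frequent.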
Clustering
Lecture outline
Distance/Similarity between data objects
Data objects as geometric data points
Clustering problems and algorithms
K-means K-median K-center
What is clustering?
A grouping of data objects such that the objects within a group are more similar to each other than to objects in other groups.
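K-means, the first algorithm in the outline, can be sketched in a few lines (Lloyd's algorithm): alternate assigning each point to its nearest centroid and recomputing each centroid as its cluster's mean. Toy 1-D data and a fixed iteration count keep the sketch minimal.

```python
# Bare-bones k-means (Lloyd's algorithm) on 1-D points: alternate the
# assignment step and the centroid-update step.

def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                              # assignment step
            i = min(range(len(centroids)),
                    key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return centroids

centers = kmeans([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], centroids=[0.0, 5.0])
print([round(c, 3) for c in centers])  # [1.0, 8.0]
```

K-median replaces the mean with the median, and k-center minimizes the maximum point-to-center distance; both change only the update/objective, not the alternating structure.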
Clustering II
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
A tree-like diagram that records the sequences of merges or splits
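The merge sequence a dendrogram records can be produced by a small agglomerative sketch: start with singleton clusters and repeatedly merge the two closest ones, here under the single-link distance (minimum pairwise distance between clusters). Toy 1-D points, quadratic-time for clarity.

```python
# Agglomerative single-link clustering sketch that records each merge
# (the two clusters joined and the distance at which they join) --
# exactly the information a dendrogram draws.

def single_link(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None                       # closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return merges

merges = single_link([1.0, 1.2, 4.0, 4.1])
for a, b, d in merges:
    print(a, b, round(d, 3))
```

Merge heights increase monotonically here, which is why cutting the dendrogram at a height yields a flat clustering.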
(Figure: an example dendrogram over six points, with merge heights between 0.05 and 0.2 on the vertical axis.)
Clustering III
Lecture outline
Soft (model-based) clustering and the EM algorithm
Clustering aggregation [A. Gionis, H. Mannila, P. Tsaparas: Clustering aggregation, ICDE 2004]
Impossibility theorem for clustering [Jon Kleinberg: An impossibility theorem for clustering, NIPS 2002]
Clustering V
Outline
Validating clustering results
Randomization tests
Cluster Validity
All clustering algorithms, provided with a set of points, output a clustering. How to evaluate the goodness of the resulting clusters? Tricky, because clusters are in the eye of the beholder!
Dimensionality reduction
Outline
From distances to points:
MultiDimensional Scaling (MDS)
FastMap
Dimensionality reductions or data projections:
Random projections
Principal Component Analysis (PCA)
Multi-Dimensional Scaling (MDS)
So far we assumed that the data objects are given as points; MDS starts from their pairwise distances instead.
Lecture outline
Dimensionality reduction
SVD/PCA
CUR decompositions
Nearest-neighbor search in low dimensions
kd-trees
Datasets in the form of matrices
We are given n objects and d features describing the objects. (Each object has d numeric values describing it.)
Boston University Department of Computer Science CS 565 Data Mining
Midterm Exam
Date: Oct 14, 2009
Write Your University Number Here:
Answer all questions. Good luck!
Problem 1 [25 points] True or False:
1. Maximal frequent itemsets are sufficient to determine…
Problem Set 2
September 28, 2009
Due date:
Wednesday, October 14, 2009, at 4pm; before class.
Exercise 1: (20 points) Assume two d-dimensional real vectors x and y, and denote by xi (yi) the value in the i-th coordinate of x (y). Prove or disprove the following…
Lecture outline
Nearest-neighbor search in low dimensions
kd-trees
Nearest-neighbor search in high dimensions
LSH
Applications to data mining
Definition
Given: a set X of n points in R^d. Nearest neighbor: for any query point q ∈ R^d, return the point x ∈ X minimizing the distance to q.
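The definition above has an obvious baseline solution: a linear scan over X keeping the point with the smallest Euclidean distance to q. This is the O(n)-per-query method that kd-trees (low dimensions) and LSH (high dimensions) aim to beat.

```python
# Brute-force nearest neighbor: scan all of X, keep the closest point.

from math import dist  # Euclidean distance, Python 3.8+

def nearest_neighbor(X, q):
    return min(X, key=lambda x: dist(x, q))

X = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (5.0, 5.0)]
print(nearest_neighbor(X, (1.2, 0.8)))  # (1.0, 1.0)
```

A kd-tree answers the same query by recursively partitioning space along coordinate axes and pruning subtrees that cannot contain a closer point.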