Boston University, Department of Computer Science

CS 565 Data Mining: Midterm Exam Solutions

Date: Oct 14, 2009. Time: 4:00 p.m. - 5:30 p.m.

Write Your University Number Here:

Answer all questions. Good luck!

**Problem 1 [25 points]** True or False:

1. Maximal frequent itemsets are sufficient to determine all frequent itemsets with their supports.
2. The maximal frequent itemsets (and only those) constitute the positive border of a frequent-set collection.
3. Let $D$ be the Euclidean distance between multidimensional points. Assume a set of $n$ points $X = \{x_1, \ldots, x_n\}$ in a $d$-dimensional space and project them into a lower-dimensional space, $k = O(\log n)$. If $Y = \{y_1, \ldots, y_n\}$ is the new set of $k$-dimensional points, then the Johnson-Lindenstrauss lemma states that for all pairs $(i, j)$ it holds that $S(x_i, x_j) = D(y_i, y_j)$. (All points $x_i$ and $y_i$ are normalized to have length 1.)
4. Computing the mean and the variance of a stream of numbers can be done using a single pass over the data and constant ($O(1)$) space.
5. The disagreement distance between two clusterings is a metric.

Answers: false, true, false, true, true

**Problem 2 [10 points]** Consider a dictionary of $n$ terms (words) $T = \{t_1, \ldots, t_n\}$. Each term $t_i$ is associated with its importance $w(t_i)$ (a positive real value). Additionally, assume a collection of $m$ documents $D = \{d_1, \ldots, d_m\}$, such that each document $d_i$ uses a subset of terms in the dictionary (i.e., $d_i \subseteq T$). You are asked to give a polynomial-time algorithm that finds a collection of ...
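
Returning to statement 4 of Problem 1: one common way to see why a single pass and $O(1)$ space suffice is Welford's online update, which keeps only a count, a running mean, and a running sum of squared deviations. The sketch below is an illustration and is not taken from the exam solutions; the names `RunningStats` and `stream_mean_variance` are hypothetical, and it assumes the population variance (dividing by $n$) is the quantity wanted.

```python
class RunningStats:
    """Single-pass mean and variance via Welford's update (assumed technique).

    Only three numbers are stored, so memory use is O(1) regardless of
    how long the stream is.
    """

    def __init__(self):
        self.count = 0    # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the old deviation (delta) and the new one (x - self.mean).
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance; 0.0 for an empty stream (an assumption)."""
        return self.m2 / self.count if self.count else 0.0


def stream_mean_variance(stream):
    """Consume any iterable once and return (mean, population variance)."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats.mean, stats.variance()
```

Calling `stream_mean_variance(iter(values))` reads each value exactly once and never stores the stream. Dividing `m2` by `count - 1` instead would give the sample variance.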
