CSE5120 Fall 2009

Subspace Clustering
Agrawal, Gehrke, et al., SIGMOD 98

Data: points in a multidimensional space. Each dimension is partitioned into ω intervals.

Unit: the intersection of one interval from each attribute (dimension).
[Figure: a unit in the Age-Income plane]

Dense unit: a unit is dense if the fraction of the total data points contained in it is greater than a threshold τ.
[Figure: a dense unit in the Age-Income plane, threshold = 20%]

Cluster: a maximal set of connected dense units in k dimensions.
[Figure: units u1 to u5 in the Age-Income plane; u1 and u3 are connected, u4 and u5 are not connected]

Region: an axis-parallel rectangular k-dimensional set that can be expressed as a union of units.
[Figure: a region in the Age-Income plane]

Maximal region R in a cluster: no proper superset of R is a region in the cluster.

The problem: Given a set of data points and the input parameters ω and τ, find the clusters in all subspaces of the original data space and present a minimal description of each cluster in the form of a DNF expression.

• ω: the number of equal-length intervals into which every dimension is partitioned.
• τ: a unit is dense if the fraction of data points in it is greater than τ.
• Subspace: given a set of dimensions D = {D_1, ..., D_k}, a subset of D forms a subspace of D, e.g. {D_1, D_3} forms a subspace of D.
[Figure: projection from 2D (Age, Income) to 1D (Age)]

Minimal description of a cluster: a nonredundant covering of the cluster with maximal regions.

• A region can be expressed as a conjunction of intervals of the domains A_i, e.g.
  (30 ≤ age < 40) ∧ (1 ≤ salary < 3)
• A cluster is a union of regions. The minimal description of a cluster can be expressed in DNF (disjunctive normal form), e.g.
  ((30 ≤ age < 50) ∧ (1 ≤ salary < 3)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 4))
[Figure: the example cluster, with age ticks at 30, 40, 50, 60 and salary ticks at 1, 2, 3, 4]

Proposed algorithm: CLIQUE
1. Identify subspaces that contain clusters (find dense units in different subspaces).
2. Identify clusters.
3. Generate a minimal description for the clusters.

Useful property: If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of the space.
[Figure: the Age-Income plane and its projection onto Age]

Algorithm: proceeds level by level.
• First determine the 1-dimensional dense units by making a pass over the data: D(1)
• In the k-th pass (k > 1):
  1. Generate candidates:
     (k-1)-dimensional dense units → Candidate Generation Procedure → candidate k-dimensional units
  2. Make a pass over the data to find those candidate units that are dense: D(k)
     candidate k-dimensional dense units → Counting Procedure → k-dimensional dense units
• The algorithm terminates when no more candidates are generated.
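
The level-by-level search for dense units can be sketched in Python as below. This is only an illustrative sketch, not the authors' implementation. The notes name a Candidate Generation Procedure without spelling it out, so the sketch assumes an Apriori-style join (merge two (k-1)-dimensional dense units that share k-2 cells) followed by pruning based on the useful property above. All function names, and the choice to take each dimension's range from the observed minimum and maximum, are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def unit_of(point, dims, omega, lo, hi):
    """Map a point to its unit in the subspace `dims`.

    A unit is a tuple of (dimension, interval index) pairs, sorted by dimension.
    """
    cells = []
    for d in dims:
        width = (hi[d] - lo[d]) / omega
        # Guard against a constant dimension; clamp the maximum value into the last interval.
        idx = 0 if width == 0 else min(int((point[d] - lo[d]) / width), omega - 1)
        cells.append((d, idx))
    return tuple(cells)


def dense_units(points, candidates, omega, tau, lo, hi):
    """One pass over the data: keep candidates holding more than a fraction tau of the points."""
    by_dims = {}
    for u in candidates:
        by_dims.setdefault(tuple(d for d, _ in u), set()).add(u)
    counts = Counter()
    for p in points:
        for dims, units in by_dims.items():
            u = unit_of(p, dims, omega, lo, hi)
            if u in units:
                counts[u] += 1
    return {u for u, c in counts.items() if c / len(points) > tau}


def generate_candidates(dense_prev):
    """Assumed Apriori-style join of (k-1)-dimensional dense units into k-dimensional candidates."""
    candidates = set()
    for a, b in combinations(dense_prev, 2):
        merged = tuple(sorted(set(a) | set(b)))
        dims = [d for d, _ in merged]
        # Keep only merges that add exactly one cell in a new dimension.
        if len(merged) == len(a) + 1 and len(set(dims)) == len(dims):
            # Prune: every (k-1)-dimensional projection must itself be dense.
            if all(tuple(c for c in merged if c != drop) in dense_prev for drop in merged):
                candidates.add(merged)
    return candidates


def clique_dense_units(points, omega, tau):
    """Return [D(1), D(2), ...], the dense units found at each level."""
    n_dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(n_dims)]
    hi = [max(p[d] for p in points) for d in range(n_dims)]
    first = {((d, i),) for d in range(n_dims) for i in range(omega)}
    level = dense_units(points, first, omega, tau, lo, hi)
    levels = []
    while level:
        levels.append(level)
        candidates = generate_candidates(level)
        # Stop when no more candidates are generated.
        level = dense_units(points, candidates, omega, tau, lo, hi) if candidates else set()
    return levels
```

Calling `clique_dense_units(points, omega, tau)` on a list of equal-length numeric tuples returns one set of dense units per level. Connecting dense units into clusters and producing the minimal DNF description (steps 2 and 3 of CLIQUE) are not covered by this sketch.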
 Fall '09
 AdaFu
