12-knn_perceptron

# Simplest spatial structure on earth


2/14/2011, Jure Leskovec, Stanford C246: Mining Massive Datasets

[...]
- only one point left
- only a few points left

Variants:
- split only one dimension at a time
- Kd-trees (in a moment)

Range search:
- Put root node on the stack
- Repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack

Nearest neighbor:
- Start range search with r = ∞
- Whenever a point is found, update r
- Only investigate nodes with respect to current r

Quadtrees work great for 2 to 3 dimensions

Problems:
- Empty spaces: if the points form sparse clouds, it takes a while to reach them
- Space exponential in dimension
- Time exponential in dimension, e.g., points on the hypercube

Main ideas [Bentley ’75]:
- Only one-dimensional splits
- Choose the split “carefully”, e.g.:
  - Pick dimension of largest variance and split at median (balanced split)
  - Do SVD or CUR, project and split
- Queries: as for quadtrees
- Advantages:
  - no (or less) empty spaces
  - only linear space
- Query time at most: Min[dn, exponential(d)]

Range search:
- Put root node on the stack
- Repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack
- In what order do we search the children?
  - Best-Bin-First (BBF), Last-Bin-First (LBF)

Performance of a single Kd-tree is low

Randomized Kd-trees:
- Build several trees:
  - Find top few dimensions of largest variance
  - Randomly select one of these dimensions; split on median
  - Construct many complete (i.e., one point per leaf) trees
- Drawbacks:
  - More memory
  - Additional parameter to tune: number of trees
- Search:
  - Descend through each tree until leaf is reached
  - Maintain a single priority queue for all the trees
  - For approximate search, stop after a certain number of nodes have been examined

[Figure: d=128, n=100k; Muja-Lowe, 2010]

Overlapped partitioning reduces boundary errors
- no backtracking necessary

Spilling:
- Increases tree depth
  - more memory
  - slower to build
- Better when split passes through sparse regions
- Lower nodes may spill too much
  - hybrid of spill and non-spill nodes
- Designing a good spill factor is hard

For high dim. data, use randomized projections (CUR) or SVD

Use Best-Bin-First (BBF):
- Make a priority queue of all unexplored nodes
- Visit them in order of their “closene...
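To make the kd-tree build and the stack-based searches above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the lecture's implementation: the names `Node`, `build`, `box_dist`, `range_search`, `nearest`, and the `leaf_size` parameter are hypothetical, and storing a bounding box in every node is just one possible way to test whether a node intersects the ball of radius r around q. Nodes are examined when popped rather than when their parent is expanded, which visits the same leaves.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    lo: Point              # lower corner of the bounding box of points below
    hi: Point              # upper corner of the bounding box of points below
    points: List[Point]    # non-empty only for leaves
    children: List["Node"]  # empty for leaves


def variance(values: List[float]) -> float:
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build(points: List[Point], leaf_size: int = 1) -> Node:
    """One-dimensional splits: largest-variance dimension, cut at the median."""
    d = len(points[0])
    lo = tuple(min(p[i] for p in points) for i in range(d))
    hi = tuple(max(p[i] for p in points) for i in range(d))
    if len(points) <= leaf_size or lo == hi:  # few points, or all identical
        return Node(lo, hi, list(points), [])
    dim = max(range(d), key=lambda i: variance([p[i] for p in points]))
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2  # balanced split
    halves = [ordered[:mid], ordered[mid:]]
    return Node(lo, hi, [], [build(h, leaf_size) for h in halves])


def box_dist(node: Node, q: Point) -> float:
    """Distance from q to the node's bounding box (0 if q is inside)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for l, h, x in zip(node.lo, node.hi, q)))


def range_search(root: Node, q: Point, r: float) -> List[Point]:
    found, stack = [], [root]
    while stack:
        t = stack.pop()
        if not t.children:  # leaf: examine its point(s)
            found.extend(p for p in t.points if math.dist(p, q) <= r)
            continue
        for c in t.children:
            if box_dist(c, q) <= r:  # child intersects the ball around q
                stack.append(c)
    return found


def nearest(root: Node, q: Point) -> Tuple[Point, float]:
    """Range search starting with r = infinity, shrinking r as points are found."""
    best, r = None, math.inf
    stack = [root]
    while stack:
        t = stack.pop()
        if box_dist(t, q) > r:  # r may have shrunk since t was pushed
            continue
        if not t.children:
            for p in t.points:
                dist = math.dist(p, q)
                if dist < r:
                    best, r = p, dist
            continue
        # Assumption: push the farther child first so the closer one is
        # popped next. This is one simple child ordering, not necessarily
        # the BBF scheme named in the slides.
        for c in sorted(t.children, key=lambda c: box_dist(c, q), reverse=True):
            stack.append(c)
    return best, r
```

A possible usage: `root = build([(2.0, 3.0), (5.0, 4.0), (9.0, 6.0)])`, then `nearest(root, (9.0, 2.0))` returns the closest stored point together with its distance, and `range_search(root, (9.0, 2.0), 2.5)` returns every stored point within distance 2.5 of the query. Visiting the closer child first tends to shrink r early, so more of the remaining nodes fail the intersection test and are skipped.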