12-knn_perceptron



...ss” to the query
Closeness is defined by distance to a cell boundary
Space permitting: Keep extra statistics on lower and upper bound for each cell and use the triangle inequality to prune space
Use spilling to avoid backtracking
Use lookup tables for fast distance computation

2/14/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

“Bottom-up” approach [Guttman 84]
Start with a set of points/rectangles
Partition the set into groups of small cardinality
For each group, find the minimum rectangle containing the objects from this group (MBR)
Repeat
Advantages:
Supports near(est) neighbor search (similar to before)
Works for points and rectangles
Avoids empty spaces

R-trees with fan-out 4: group nearby rectangles to parent MBRs
[Figure: rectangles A to J]

R-trees with fan-out 4: every parent node completely covers its ‘children’
[Figure: parent MBRs P1 to P4 over rectangles A to J, with the corresponding tree]

Example of a range search query
[Figure: the same R-tree, parent MBRs P1 to P4 over rectangles A to J]

Insertion of point x:
Find the MBR intersecting with x and insert
If a node is full, then split it:
Linear – choose far apart nodes as ends. Randomly choose nodes and assign them so that they require the smallest MBR enlargement
Quadratic – choose two nodes so the dead space between them is maximized. Insert nodes so area enlargement is minimized
[Figure: the same R-tree, parent MBRs P1 to P4 over rectangles A to J]

Approach [Weber, Schek, Blott’98]
In high-dimensional spaces, all tree-based indexing structures examine a large fraction of the leaves
If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether
1 seek = transfer of a few hundred KB

Natural question: How to speed up the linear scan?
Answer: Use approximation
Use only i bits per dimension (and speed up the scan by a factor of 32/i)
Identify all points which could be returned as an answer
Verify the points using the original data set

Example: Spam filtering
Instance space X: Binary feature vectors...
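The approximation-based linear scan described above (i bits per dimension, identify possible answers, then verify against the original data) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a nearest-neighbor query, Euclidean distance, and data lying in the unit cube [0, 1] on a uniform grid; the names `quantize`, `cell_bounds`, and `approx_nn` are hypothetical. Pure Python will not show the 32/i speed-up, which comes from reading i-bit codes instead of 32-bit values; the sketch only shows the filter-and-verify logic.

```python
import math

def quantize(points, bits):
    """Replace each coordinate in [0, 1] by its cell index in [0, 2**bits - 1]."""
    cells = 1 << bits
    return [[min(cells - 1, int(x * cells)) for x in p] for p in points]

def cell_bounds(query, code, bits):
    """Lower and upper bound on the distance from query to any point in the cell."""
    width = 1.0 / (1 << bits)
    lb2 = ub2 = 0.0
    for q, c in zip(query, code):
        lo, hi = c * width, (c + 1) * width
        gap = max(lo - q, 0.0, q - hi)        # 0 if q falls inside the cell's interval
        far = max(abs(q - lo), abs(q - hi))   # farthest edge of the interval
        lb2 += gap * gap
        ub2 += far * far
    return math.sqrt(lb2), math.sqrt(ub2)

def approx_nn(points, query, bits=4):
    """Nearest neighbor via filter (on the coarse codes) and verify (on the real data)."""
    codes = quantize(points, bits)            # built once, offline, in practice
    bounds = [cell_bounds(query, c, bits) for c in codes]
    # Some point is certainly within the smallest upper bound, so any point
    # whose lower bound exceeds it cannot be the nearest neighbor.
    best_ub = min(ub for _, ub in bounds)
    candidates = [i for i, (lb, _) in enumerate(bounds) if lb <= best_ub]
    # Verification step: exact distances only for the surviving candidates.
    best = min(candidates, key=lambda i: math.dist(points[i], query))
    return best, candidates
```

The filter never discards the true nearest neighbor, because its lower bound cannot exceed its true distance, which in turn cannot exceed the smallest upper bound. A larger `bits` gives tighter bounds and fewer candidates to verify, at the cost of a larger approximation file to scan.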

This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
