Lecture 11: Vector Space Classification (handout, 6 slides per page)

Introduction to Information Retrieval: Nearest Neighbor


…erated data]

- In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
- Assume the query point coincides with a training point: both the query point and the training point contribute error → 2 times the Bayes rate.
- 1NN is subject to errors due to a single atypical example, or noise (i.e., an error) in the category label of a single training example.
- A more robust alternative is to find the k most similar examples and return the majority category of these k examples (see the kNN sketch after these slides).
- The value of k is typically odd to avoid ties; 3 and 5 are most common. (slide 19)

kNN decision boundaries (Sec. 14.3, slide 20)
- Boundaries are in principle arbitrary surfaces, but usually polyhedra.
- [Figure: kNN decision boundaries between the classes Government, Science, and Arts]
- kNN gives locally defined decision boundaries between classes: far-away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.).

Similarity Metrics (Sec. 14.3, slide 21)
- The nearest neighbor method depends on a similarity (or distance) metric.
- The simplest metric for a continuous m-dimensional instance space is Euclidean distance.
- The simplest metric for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
- For text, cosine similarity of tf-idf weighted vectors is typically most effective. (Sketches of these metrics follow below.)

Illustration of 3 Nearest Neighbor for Text Vector Space (Sec. 14.3, slide 22)

3 Nearest Neighbor vs. Rocchio
- Nearest Neighbor tends to handle polymorphic categories better…
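To make the majority-vote rule concrete, here is a minimal Python sketch of kNN classification as the slides describe it: rank the training examples by similarity to the query and return the majority category among the top k. The function name, the (vector, label) data layout, and the pluggable `sim` argument are illustrative assumptions, not from the lecture.

```python
from collections import Counter

def knn_classify(query, examples, k=3, sim=None):
    """Majority-vote kNN. `examples` is a list of (vector, label) pairs;
    `sim` is any similarity function where higher means more similar.
    (Illustrative sketch; names are not from the lecture.)"""
    # Rank all training examples by similarity to the query point.
    ranked = sorted(examples, key=lambda ex: sim(query, ex[0]), reverse=True)
    # Majority vote over the labels of the k most similar examples.
    # An odd k (e.g., 3 or 5) avoids ties in two-class problems,
    # as the slide notes.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```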
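The three metrics from the Similarity Metrics slide can be sketched the same way, assuming instances are plain Python lists of feature values (tf-idf weights in the text case). Note the design distinction: Euclidean and Hamming are distances (lower means closer) while cosine is a similarity (higher means closer), so a distance would need to be negated before being passed as `sim` to `knn_classify` above.

```python
import math

def euclidean_distance(x, y):
    # For continuous m-dimensional instance spaces (a distance).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming_distance(x, y):
    # For binary instance spaces: number of feature values that differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def cosine_similarity(x, y):
    # For text: cosine of tf-idf weighted vectors (a similarity).
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Toy 3NN run over made-up tf-idf-like vectors (illustrative values only),
# echoing the Government/Science/Arts classes from the decision-boundary slide.
train = [([0.9, 0.1, 0.0], "Government"),
         ([0.8, 0.2, 0.1], "Government"),
         ([0.1, 0.9, 0.2], "Science"),
         ([0.0, 0.2, 0.9], "Arts"),
         ([0.1, 0.1, 0.8], "Arts")]
query = [0.7, 0.2, 0.1]
print(knn_classify(query, train, k=3, sim=cosine_similarity))  # -> Government
```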

