Reducing Data Dimension
Machine Learning 10-701, November 2005
Tom M. Mitchell, Carnegie Mellon University

Required reading:
• Bishop, chapters 3.6, 8.6
Recommended reading:
• Wall et al., 2003

Outline
• Feature selection
  – Single-feature scoring criteria
  – Search strategies
• Unsupervised dimension reduction using all features
  – Principal Components Analysis
  – Singular Value Decomposition
  – Independent Components Analysis
• Supervised dimension reduction
  – Fisher Linear Discriminant
  – Hidden layers of neural networks

Dimensionality Reduction: Why?
• Learning a target function from data where some features are irrelevant → reduce variance, improve accuracy
• Wish to visualize high-dimensional data
• Sometimes have data whose "intrinsic" dimensionality is smaller than the number of features used to describe it → recover the intrinsic dimension

Supervised Feature Selection
Problem: Wish to learn f: X → Y, where X = <X1, ..., XN>, but suspect not all Xi are relevant.
Approach: Preprocess the data to select only a subset of the Xi
• Score each feature, or subsets of features
  – How?
• Search for a useful subset of features to represent the data
  – How?
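The per-feature scoring idea can be made concrete with the mutual-information score covered in the next section. Below is a minimal NumPy sketch for discrete features; the function name and the toy data are illustrative, not from the lecture:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X;Y), in bits, for discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))   # joint probability
            px = np.mean(x == xv)                  # marginal of X
            py = np.mean(y == yv)                  # marginal of Y
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

# Toy data: X1 determines Y exactly, X2 is independent noise.
y  = np.array([0, 0, 1, 1, 0, 0, 1, 1])
x1 = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # identical to y
x2 = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # independent of y
print(mutual_information(x1, y))  # → 1.0 (one full bit of information)
print(mutual_information(x2, y))  # → 0.0 (no information about Y)
```

A feature-selection preprocessor would compute this score for every candidate feature Xi and keep the highest scorers.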
Scoring Individual Features Xi
Common scoring methods:
• Training or cross-validated accuracy of single-feature classifiers fi: Xi → Y
• Estimated mutual information between Xi and Y:
  I(Xi; Y) = Σ_xi Σ_y P(xi, y) log [ P(xi, y) / (P(xi) P(y)) ]
• χ² statistic to measure independence between Xi and Y
• Domain-specific criteria
  – Text: score "stop" words ("the", "of", ...) as zero
  – fMRI: score each voxel by a t-test for activation versus rest condition
  – ...

Choosing a Set of Features to learn F: X → Y
Common methods:
Forward1: Choose the n features with the highest scores
Forward2:
– Choose the single highest-scoring feature Xk
– Rescore all features, conditioned on the set of already-selected features
  • E.g., Score(Xi | Xk) = I(Xi; Y | Xk)
  • E.g., Score(Xi | Xk) = Accuracy(predicting Y from Xi and Xk)
– Repeat, calculating new scores on each iteration, conditioning on the set of selected features

Choosing a Set of Features
Common methods:
Backward1: Start with all features, delete the n with the lowest scores
Backward2: Start with all features, score each feature conditioned on the assumption that all others are included.
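The greedy Forward2 search can be sketched in a few lines of NumPy. Here the conditional score is the training accuracy of a nearest-class-centroid classifier built on the already-selected features plus one candidate; the classifier choice and the toy data are my own, not from the lecture:

```python
import numpy as np

def nearest_centroid_acc(X, y):
    """Training accuracy of a nearest-class-centroid classifier (binary y)."""
    c0 = X[y == 0].mean(axis=0)
    c1 = X[y == 1].mean(axis=0)
    d0 = ((X - c0) ** 2).sum(axis=1)
    d1 = ((X - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return (pred == y).mean()

def forward_select(X, y, n_features):
    """Forward2-style greedy selection: each round, score every remaining
    feature together with those already chosen, and keep the best."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_features):
        best = max(remaining,
                   key=lambda j: nearest_centroid_acc(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # only features 0 and 2 carry signal
print(sorted(forward_select(X, y, 2)))    # typically recovers [0, 2]
```

Note how the score of a candidate is recomputed relative to the current selected set on every iteration, which is what distinguishes Forward2 from the one-shot Forward1 ranking.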
This note was uploaded on 02/28/2008 for the course CSCI 6360, taught by Professor Wu during the Spring '08 term at the University of Texas at Dallas, Richardson.