Reducing Data Dimension
Machine Learning 10-701, November 2005
Tom M. Mitchell, Carnegie Mellon University

Required reading:
• Bishop, chapters 3.6 and 8.6
Recommended reading:
• Wall et al., 2003

Outline
• Feature selection
  – Single-feature scoring criteria
  – Search strategies
• Unsupervised dimension reduction using all features
  – Principal Components Analysis
  – Singular Value Decomposition
  – Independent Components Analysis
• Supervised dimension reduction
  – Fisher Linear Discriminant
  – Hidden layers of neural networks

Dimensionality Reduction: Why?
• Learning a target function from data where some features are irrelevant: reduce variance, improve accuracy
• Wish to visualize high-dimensional data
• Sometimes have data whose "intrinsic" dimensionality is smaller than the number of features used to describe it: recover the intrinsic dimension

Supervised Feature Selection
Problem: Wish to learn f: X → Y, where X = <X_1, …, X_N>, but suspect not all X_i are relevant.
Approach: Preprocess the data to select only a subset of the X_i:
• Score each feature, or subsets of features. How?
• Search for a useful subset of features to represent the data. How?

Scoring Individual Features X_i
Common scoring methods:
• Training or cross-validated accuracy of single-feature classifiers f_i: X_i → Y
• Estimated mutual information between X_i and Y:
  I(X_i; Y) = Σ_{x,y} P(X_i = x, Y = y) log [ P(X_i = x, Y = y) / (P(X_i = x) P(Y = y)) ]
• χ² statistic to measure independence between X_i and Y
• Domain-specific criteria
  – Text: score "stop" words ("the", "of", …) as zero
  – fMRI: score each voxel by a t-test for activation versus the rest condition
  – …
(A code sketch of mutual-information scoring appears at the end of these notes.)

Choosing a Set of Features to Learn F: X → Y
Common methods:
• Forward1: choose the n features with the highest scores.
• Forward2:
  – Choose the single highest-scoring feature X_k
  – Rescore all remaining features, conditioned on the set of already-selected features
    • E.g., Score(X_i | X_k) = I(X_i; Y | X_k)
    • E.g., Score(X_i | X_k) = Accuracy(predicting Y from X_i and X_k)
  – Repeat, calculating new scores on each iteration, conditioning on the set of selected features
(A forward-selection sketch appears at the end of these notes.)

Choosing a Set of Features (continued)
Common methods:
• Backward1: start with all features, delete the n with the lowest scores.
• Backward2: start with all features; score each feature conditioned on the assumption that all others are included…
(A backward-elimination sketch appears at the end of these notes.)
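The following is a minimal sketch of the mutual-information scoring criterion above, for discrete features and labels, using the empirical probability estimates in the formula. The helper names (`mutual_information`, `score_features`) and the toy data are illustrative, not from the lecture.

```python
import numpy as np

def mutual_information(x, y):
    """Estimate I(X; Y) from paired discrete samples.

    Uses empirical probabilities:
    I(X; Y) = sum_{x,y} P(x, y) * log[ P(x, y) / (P(x) P(y)) ]
    """
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy == 0:
                continue  # a zero-probability cell contributes nothing
            p_x = np.mean(x == xv)
            p_y = np.mean(y == yv)
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def score_features(X, y):
    """Score each column X_i of X by estimated I(X_i; Y)."""
    return [mutual_information(X[:, i], y) for i in range(X.shape[1])]

# Toy usage: both features perfectly predict y, so each scores log 2.
X = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])
y = np.array([0, 0, 1, 1])
print(score_features(X, y))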
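A sketch of the Forward2 strategy, using cross-validated accuracy as the conditional score, i.e., Score(X_i | selected) = Accuracy(predicting Y from selected ∪ {X_i}). It assumes scikit-learn is available; the choice of logistic regression as the classifier and the function name are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features, cv=5):
    """Greedy forward selection (Forward2): at each step, add the
    feature whose inclusion alongside the already-selected set
    gives the best cross-validated accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best_i, best_acc = None, -np.inf
        for i in remaining:
            cols = selected + [i]
            acc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, cols], y, cv=cv).mean()
            if acc > best_acc:
                best_i, best_acc = i, acc
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

Note that the scores are recomputed on every iteration, conditioned on the current selected set, exactly as the Forward2 description requires; Forward1 corresponds to scoring each feature once in isolation and keeping the top n.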
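By symmetry, a Backward2-style sketch: start with all features, score each one conditioned on all the others being included (here, by the accuracy of the model with that feature removed), and repeatedly drop the least useful feature. Same assumptions as the forward-selection sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_eliminate(X, y, n_features, cv=5):
    """Greedy backward elimination: repeatedly remove the feature
    whose deletion hurts cross-validated accuracy the least."""
    selected = list(range(X.shape[1]))
    while len(selected) > n_features:
        worst_i, best_acc = None, -np.inf
        for i in selected:
            cols = [j for j in selected if j != i]
            acc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, cols], y, cv=cv).mean()
            if acc > best_acc:  # removing feature i hurts least
                worst_i, best_acc = i, acc
        selected.remove(worst_i)
    return selected
```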