An Introduction to Principal Component Analysis with Examples in R

Thomas Phan
first.last @ acm.org

Technical Report*
September 1, 2016

1 Introduction

Principal component analysis (PCA) is a series of mathematical steps for reducing the dimensionality of data. In practical terms, it can reduce the number of features in a data set by a large factor (for example, from thousands of features to tens of features) when the features are correlated. This kind of "feature compression" is commonly used for two purposes. First, if high-dimensional data is to be visualized by plotting it on a 2-D surface (such as a computer monitor or a piece of paper), PCA can reduce the data to two or three dimensions; in this context, PCA can be considered a complete, standalone unsupervised machine learning algorithm. Second, if a different machine learning training algorithm is taking too long to run, PCA can reduce the number of features, which in turn reduces the amount of training data and the time needed to train a model; here, PCA serves as a pre-processing step in a larger workflow. In this paper we discuss PCA largely for the first purpose: visualizing and exploring patterns in data.

It is important to note that PCA does not reduce features by selecting a subset of the original features (as is done by wrapper feature selection algorithms that perform feature-by-feature forward or backward search). Instead, PCA creates new, uncorrelated features that are linear combinations of the original features. For a given data instance, its features are transformed via a dot product with a numeric vector to create a new feature; this vector is a principal component that serves as the direction of an axis onto which the data instance is projected. The new features are thus the projections of the original features into a new coordinate space defined by the principal components.
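This projection can be made concrete with a short sketch in R (using the built-in `prcomp` function and Fisher's `iris` data, which this paper revisits later): each new feature is the dot product of a centered and scaled data row with a principal-component vector, so multiplying the data matrix by the rotation matrix reproduces the scores that `prcomp` computes.

```r
# PCA on the four numeric columns of Fisher's iris data
x   <- as.matrix(iris[, 1:4])
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# pca$rotation holds the principal components, one per column.
# Projecting each centered/scaled row onto the components is a
# matrix of dot products, and it matches the scores in pca$x.
z      <- scale(x, center = pca$center, scale = pca$scale)
manual <- z %*% pca$rotation
stopifnot(isTRUE(all.equal(manual, pca$x, check.attributes = FALSE)))
```

The check at the end confirms that the "new features" are nothing more than these dot products with the principal components.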
To perform the actual dimensionality reduction, the user can follow a well-defined methodology to select the fewest new features that explain a desired amount of data variance.

This paper is organized in the following manner. In Section 2 we explain how PCA is applied to data sets and how it creates new features from existing features. Importantly, we explain various tips for how to use PCA effectively with the R programming language in order to achieve good feature compression. In Section 3 we use PCA to explore three different data sets: Fisher's Iris data, Kobe Bryant's shots, and car class fuel economy. In Section 4 we show R code examples that run PCA on data sets, and in Section 5 we provide references for further reading.

* This document serves as a readable tutorial on PCA using only basic concepts from statistics and linear algebra.
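The selection methodology mentioned above, keeping the fewest components that explain a desired fraction of the variance, can be sketched in R as follows (a minimal illustration on the `iris` data; the 95% threshold is an arbitrary choice for this example, not a value prescribed by the paper):

```r
# Choose the fewest principal components explaining >= 95% of the variance
pca    <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
varexp <- pca$sdev^2 / sum(pca$sdev^2)     # proportion of variance per component
k      <- which(cumsum(varexp) >= 0.95)[1] # smallest k reaching the threshold
reduced <- pca$x[, 1:k, drop = FALSE]      # the compressed data set
```

Here `pca$sdev` holds the standard deviations of the components, so their squares are the variances; the cumulative sum tells us how much of the total variance the first k components capture.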