Data preparation

- Data imputation
- Data transformation
- Data sampling

Data Imputation

- Data imputation replaces missing values with nonmissing values.
- Missing values are identified during data cleaning. Because of the nature of some data analysis models, observations with missing values in one or more variables might not be usable.
  - A multiple regression model requires that no explanatory variable has missing values.
  - A principal component analysis requires no missing values in the data.

Treating missing values

- Ignore the observations with missing values.
  - Not very effective: the effective number of observations is reduced.
  - Especially poor for data analysis with many variables. Consider 20 variables, each with 2% missing data. If missingness is independent across variables, only (1 - 0.02)^20 = 66.8% of observations have no missing values.
- Use some value to fill in the missing value (imputation).
  - A global constant: replace all missing values by a single constant, e.g. 'missing', 'unknown', or a specific numeric code.
  - The variable mean: replace all missing values by the mean of the variable (or another summary statistic, e.g. median, mode, etc.).
  - The class mean: replace each missing value by the variable mean over all samples belonging to the same class as the observation with the missing value, i.e. divide the population into subpopulations before replacing missing values by the mean.
- Use the most probable value to fill in the missing value.
  - This is a more advanced method.
  - The value is determined by statistical modeling tools such as regression, inference-based tools, decision trees, etc.
- Do nothing.
  - Some data analysis methods, e.g. decision trees, allow missing values. Observations with missing data may be used in these methods.
  - This may be the most natural way to deal with missing data for these methods.

- Advantages of imputation
  - Larger effective number of observations
  - Some data analysis methods do not allow missing values
- Disadvantages of imputation
  - Introduces more uncertainty into the data
  - Will the imputed values be biased?
  - Will they distort the original data?

Example

- Data set LOAN contains information about customers who apply for a bank loan, including each customer's income.
- 10 of the 700 customers have a missing value for income.
- A simple way to impute the income for these customers is to use the sample mean.

    * example 11.1  impute missing value by sample mean;
    proc means data=ch11.loan noprint;
      var income;
      output out=sumfile mean=meanincome;
    run;

    data newloan;
      set ch11.loan;
      * read the single SUMFILE observation once, its value is retained;
      if _n_=1 then set sumfile (keep=meanincome);
      if income = . then income = meanincome;
      drop meanincome;
    run;

    proc print data=newloan;
    run;

- The sample mean of INCOME is output to the summary data set SUMFILE.
- Merge the summary data set SUMFILE (contains one observation only) with the original LOAN data set....
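
The class-mean option listed under the imputation methods can be handled in much the same way as the sample-mean example. The following is a minimal sketch, not part of the original example. It assumes that the LOAN data set contains a class variable, given here the hypothetical name SEGMENT. The data set names LOANSORT, CLASSMEAN, and NEWLOAN2 are likewise made up for illustration.

```sas
* sketch: impute missing income by the mean of its class;
* SEGMENT is a hypothetical class variable, not taken from the LOAN data set;
proc sort data=ch11.loan out=loansort;
  by segment;
run;

* one mean per class, written to the summary data set CLASSMEAN;
proc means data=loansort noprint;
  by segment;
  var income;
  output out=classmean (keep=segment meanincome) mean=meanincome;
run;

* match each observation with the mean of its own class;
data newloan2;
  merge loansort classmean;
  by segment;
  if income = . then income = meanincome;
  drop meanincome;
run;

* check how many missing values remain after imputation;
proc means data=newloan2 n nmiss;
  var income;
run;
```

If every income in a class is missing, the class mean is itself missing and those observations stay missing, so the final PROC MEANS step is a useful check. Falling back to the overall sample mean for such classes would be one possible remedy.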
 Spring '08
 SMSLee
 Statistics
