

Final Exam - Question 3

Van Thu Nguyen

2020-08-11

Loading Data

data = read.csv('/Users/vnguy/Documents/Harrisburg University/ANLY 500/iris_exams.csv')
df = data.frame(data)
summary(df)

##       id              Species           Sepal.Length    Sepal.Width   
##  Length:300         Length:300         Min.   :4.417   Min.   :1.796  
##  Class :character   Class :character   1st Qu.:5.209   1st Qu.:2.720  
##  Mode  :character   Mode  :character   Median :5.844   Median :2.992  
##                                        Mean   :5.857   Mean   :3.064  
##                                        3rd Qu.:6.448   3rd Qu.:3.375  
##                                        Max.   :8.478   Max.   :4.810  
##   Petal.Length    Petal.Width      
##  Min.   :1.135   Min.   :-0.03371  
##  1st Qu.:1.566   1st Qu.: 0.30278  
##  Median :4.228   Median : 1.28776  
##  Mean   :3.738   Mean   : 1.19830  
##  3rd Qu.:5.205   3rd Qu.: 1.87452  
##  Max.   :6.955   Max.   : 2.62487

Cleaning up data: check for accuracy, missing values, and outliers

library(mice)

## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

df$Species = as.factor(df$Species)
# A negative petal width is impossible, so treat it as missing
df[, 6][df[, 6] < 0] = NA
# Percent of missing values in a vector
percentmiss = function(x){ sum(is.na(x)) / length(x) * 100 }
apply(df, 2, percentmiss)

##           id      Species Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.0000000    0.0000000    0.0000000    0.0000000    0.0000000    0.3333333

# Keep rows with at most 20% of their values missing
missing = apply(df, 1, percentmiss)
replace = subset(df, missing <= 20)
missing1 = apply(replace, 1, percentmiss)
# Impute only the numeric columns; id and Species are set aside
replace_col = replace[, -c(1, 2)]
dont_col = replace[, c(1, 2)]
library(mice)
replace_value = mice(replace_col)

## 
##  iter imp variable
##   1   1  Petal.Width
##   1   2  Petal.Width
##   1   3  Petal.Width
##   1   4  Petal.Width
##   1   5  Petal.Width
##   2   1  Petal.Width
##   2   2  Petal.Width
##   2   3  Petal.Width
##   2   4  Petal.Width
##   2   5  Petal.Width
##   3   1  Petal.Width
##   3   2  Petal.Width
##   3   3  Petal.Width
##   3   4  Petal.Width
##   3   5  Petal.Width
##   4   1  Petal.Width
##   4   2  Petal.Width
##   4   3  Petal.Width
##   4   4  Petal.Width
##   4   5  Petal.Width
##   5   1  Petal.Width
##   5   2  Petal.Width
##   5   3  Petal.Width
##   5   4  Petal.Width
##   5   5  Petal.Width

# Use the first imputed data set and reattach id and Species
nomiss = complete(replace_value, 1)
all_col = cbind(dont_col, nomiss)
# Multivariate outliers: Mahalanobis distance against a chi-square cutoff at p < .001
maha = mahalanobis(nomiss,
                   colMeans(nomiss, na.rm = TRUE),
                   cov(nomiss, use = "pairwise.complete.obs"))
cutoff = qchisq(1 - .001, ncol(nomiss))
noout = subset(all_col, maha < cutoff)
summary(noout)

##       id                  Species     Sepal.Length    Sepal.Width   
##  Length:298         setosa    :100   Min.   :4.417   Min.   :1.796  
##  Class :character   versicolor:100   1st Qu.:5.205   1st Qu.:2.718  
##  Mode  :character   virginica : 98   Median :5.844   Median :2.990  
##                                      Mean   :5.859   Mean   :3.062  
##                                      3rd Qu.:6.457   3rd Qu.:3.371  
##                                      Max.
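
The screening steps above can be tried without the exam file. The following is a minimal sketch that assumes R's built-in iris data stands in for iris_exams.csv, plants one hypothetical negative petal width, and drops the incomplete row instead of imputing it with mice. It is an illustration of the missing-value and Mahalanobis checks, not a reproduction of the results reported here.

```r
# Illustrative sketch on R's built-in iris data (an assumption; not iris_exams.csv).
num <- iris[, 1:4]              # the four numeric measurements

# Plant one impossible value so the missing-data check has something to find.
num$Petal.Width[1] <- -0.5      # hypothetical bad entry, not from the exam data
num$Petal.Width[num$Petal.Width < 0] <- NA

# Percent missing per column; same quantity as sum(is.na(x)) / length(x) * 100.
pct_missing <- colMeans(is.na(num)) * 100

# Simplification: drop incomplete rows here rather than imputing with mice.
complete_rows <- num[complete.cases(num), ]

# Squared Mahalanobis distance of each row from the column means.
d2 <- mahalanobis(complete_rows, colMeans(complete_rows), cov(complete_rows))

# Chi-square cutoff at p < .001 with df = number of variables.
cutoff <- qchisq(1 - 0.001, df = ncol(complete_rows))

# Keep only rows whose distance falls below the cutoff.
kept <- complete_rows[d2 < cutoff, ]
c(rows_in = nrow(complete_rows), rows_kept = nrow(kept))
```

With four variables the cutoff is about 18.47, which is also the value that qchisq(1 - .001, ncol(nomiss)) yields in the analysis, since nomiss has four columns.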
