Load the data file census.RData, which can be found on Canvas. Check that the resulting data frame has 74020 rows and 31 columns. Each row represents a census tract, and each column a variable measured on that tract. Load the plyr library. (Hint: it will be very useful!)
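
As a rough illustration of the loading step, the sketch below saves a small made-up data frame to a temporary .RData file, restores it with load(), and checks its dimensions. The object name census_toy, its columns, and the temporary path are all hypothetical stand-ins. With the real file you would presumably call load("census.RData") from the folder where you saved it and compare dim() against 74020 and 31.

```r
# Toy stand-in for the real file: a tiny data frame written to a temporary
# .RData file. Names and values here are made up for illustration.
toy_path <- tempfile(fileext = ".RData")
census_toy <- data.frame(
  state  = c("A", "B", "B"),
  county = c("X County", "Y County", "Y County"),
  pop    = c(1948, 5211, 3302)
)
save(census_toy, file = toy_path)
rm(census_toy)

# load() restores the saved objects into the workspace and (invisibly)
# returns their names, which is handy when you do not know what the file holds.
loaded_names <- load(toy_path)
print(loaded_names)

# dim() returns c(number of rows, number of columns).
toy_dim <- dim(census_toy)
print(toy_dim)

# library(plyr) attaches plyr once it has been installed,
# e.g. with install.packages("plyr").
```

Capturing the return value of load() is a design choice rather than a requirement: it simply confirms which objects appeared in the workspace.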

1. How many states are represented among the 74020 census tracts? How many counties?

2. Columns 8 through 31 of the census data frame represent numeric variables, but columns 8 and 9 are not stored as such. These two are measured in US dollars: median household income (Med_HHD_Inc_ACS_09_13) and median house value (Med_House_value_ACS_09_13). What are the classes of these columns?

3. Convert columns 8 and 9 into numbers (in whole US dollars). For example, $63,030 should be converted into the integer 63030. (Hint: you may first convert them into strings, then remove any non-numeric characters using substr() or gsub(), then convert into numbers.) Check your answer by printing out the summary() of these two new columns. Make sure that empty entries ("") are properly converted to NA.

4. Several entries are missing in this data set, including the ones you discovered in the previous question. Compute the number of missing entries in each row, and save the vector as num.na.row. Then, obtain the indices of rows containing any missing values and save them in a vector named contains.na. What is the average number of missing values among the rows that contain at least one missing entry?

5. Are there any states with no missing values? If so, print out the names of all such states.

6. Redefine the census data frame by removing rows that have missing values, as per the contains.na vector computed in Part 1 Question 4. Check that the new census data frame now has 70877 rows. How many states and counties are represented in this new data frame? Which states (if any) have been thrown out compared to the original data frame?
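
The conversion and missing-value steps in Questions 3 through 6 can be sketched on a toy data frame. Everything below is illustrative: the data frame toy, its column names, its values, and the helper to_dollars() are made up and are not part of the census data. The sketch assumes the dollar columns arrive as factors, which is one common reason numeric-looking columns are not stored as numbers; check the actual classes before relying on that.

```r
# Made-up data: column names and values are illustrative only.
toy <- data.frame(
  state  = c("A", "A", "B", "C"),
  income = factor(c("$63,030", "", "$41,500", "$120,000")),
  value  = factor(c("$210,000", "$98,700", "", "$1,000,000+")),
  pop    = c(1200, NA, 850, 400)
)

# Hypothetical helper: go factor -> character first, because as.numeric()
# applied directly to a factor returns the internal level codes.
to_dollars <- function(x) {
  digits <- gsub("[^0-9]", "", as.character(x))  # drop "$", ",", "+", ...
  digits[!nzchar(digits)] <- NA_character_       # empty strings become NA
  as.integer(digits)
}
toy$income <- to_dollars(toy$income)
toy$value  <- to_dollars(toy$value)
print(summary(toy[, c("income", "value")]))

# Number of missing entries in each row, and the rows with at least one.
na_per_row   <- rowSums(is.na(toy))
rows_with_na <- which(na_per_row > 0)
mean_na      <- mean(na_per_row[rows_with_na])

# States whose rows contain no missing values at all.
na_by_state  <- tapply(na_per_row, toy$state, sum)
clean_states <- names(na_by_state)[na_by_state == 0]

# Keep complete rows with a logical index; toy[-rows_with_na, ] would
# return zero rows if rows_with_na happened to be empty.
toy_clean <- toy[na_per_row == 0, ]
dropped   <- setdiff(unique(as.character(toy$state)),
                     unique(as.character(toy_clean$state)))
```

The logical index na_per_row == 0 is used for the final subset instead of negative indices because it behaves sensibly even when no row has missing values.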

