Lab 11: Clean Your Data

In this lab you will clean your group project data, and create descriptive statistics, conduct data analysis, and create charts or graphs illustrating your results. This is an individual lab assignment, not a group assignment.

1. Open the data you will use for your group project in Excel. Examine the codebook for your data. Label each column with a name that describes the data provided in the column. Delete columns that will not be used for your analysis. Keep any variables you think you may possibly use, since it is much easier to drop them later than to add them back later.

2. Calculate descriptive statistics for your data. Look at each variable’s output, and notice where the maximum and minimum values are unreasonable. Use this and your codebook to recognize which variables need cleaning…in the next step.

3. Look at the data and see if any of the numeric columns have non-numeric data (below the top label row). Sort the data to get all non-numeric values at the top or bottom of the columns. Delete rows with non-numeric data, or change the non-numeric values to corresponding numbers, if they exist.

4. Look to see if there are any rows without data or with missing observations. Use the codebook to determine values that represent N/A (not applicable) or Does Not Know. Sort the data by each suspect variable, being sure to sort the whole data set every time, so each row still corresponds to the correct observation. Delete rows with missing information IF the variables they correspond to are variables you plan to use in your regression.

5. Transform the raw data into variables you can analyze. Create new columns with titles that reflect the data contained in them. Create dummy variables for all categorical variables using IF statements. The form for an if statement is =IF(A1 = 2, value if true, value if false), so if A1 = 2, then set the cell equal to ‘value if true’, otherwise set the cell equal to ‘value if false’. So if you want to change this cell to 1 if A1= 2, and set it equal to 0 otherwise, your if statement should be =IF(A1=2,1,0).

6. Copy and paste all the data you want to use for your analysis to a new worksheet. When you paste the data, choose Paste Special, Values, so that the formulas do not depend on the original data.

7. In your new worksheet, arrange your data so that all the X variables are lined up next to one another, and all the possible Y variables are aligned in the same section of your Excel spreadsheet. You can do this by inserting columns and copying and pasting columns to the correct place. You can delete source columns now, if you don’t need them.

8. Create Descriptive Statistics for all of your variables. In the Data tab, use Data Analysis—Descriptive Statistics, and highlight all the numeric rows to create your Input range.

9. Examine the descriptive statistics to make sure they make sense. Are there any obvious recording errors, or outliers. For example, is average age 20,000? If so, go back to your original data and see what is causing this problem.

10. Create a graph or chart that illustrates an interesting result from your analysis.

11. Perform one regression that you believe makes sense. Be sure to think about what your dependent variable should be, and what regressors you want to include.

In this lab you will clean your group project data, and create descriptive statistics, conduct data analysis, and create charts or graphs illustrating your results. This is an individual lab assignment, not a group assignment.

1. Open the data you will use for your group project in Excel. Examine the codebook for your data. Label each column with a name that describes the data provided in the column. Delete columns that will not be used for your analysis. Keep any variables you think you may possibly use, since it is much easier to drop them later than to add them back later.

2. Calculate descriptive statistics for your data. Look at each variable’s output, and notice where the maximum and minimum values are unreasonable. Use this and your codebook to recognize which variables need cleaning…in the next step.

3. Look at the data and see if any of the numeric columns have non-numeric data (below the top label row). Sort the data to get all non-numeric values at the top or bottom of the columns. Delete rows with non-numeric data, or change the non-numeric values to corresponding numbers, if they exist.

4. Look to see if there are any rows without data or with missing observations. Use the codebook to determine values that represent N/A (not applicable) or Does Not Know. Sort the data by each suspect variable, being sure to sort the whole data set every time, so each row still corresponds to the correct observation. Delete rows with missing information IF the variables they correspond to are variables you plan to use in your regression.

5. Transform the raw data into variables you can analyze. Create new columns with titles that reflect the data contained in them. Create dummy variables for all categorical variables using IF statements. The form for an if statement is =IF(A1 = 2, value if true, value if false), so if A1 = 2, then set the cell equal to ‘value if true’, otherwise set the cell equal to ‘value if false’. So if you want to change this cell to 1 if A1= 2, and set it equal to 0 otherwise, your if statement should be =IF(A1=2,1,0).

6. Copy and paste all the data you want to use for your analysis to a new worksheet. When you paste the data, choose Paste Special, Values, so that the formulas do not depend on the original data.

7. In your new worksheet, arrange your data so that all the X variables are lined up next to one another, and all the possible Y variables are aligned in the same section of your Excel spreadsheet. You can do this by inserting columns and copying and pasting columns to the correct place. You can delete source columns now, if you don’t need them.

8. Create Descriptive Statistics for all of your variables. In the Data tab, use Data Analysis—Descriptive Statistics, and highlight all the numeric rows to create your Input range.

9. Examine the descriptive statistics to make sure they make sense. Are there any obvious recording errors, or outliers. For example, is average age 20,000? If so, go back to your original data and see what is causing this problem.

10. Create a graph or chart that illustrates an interesting result from your analysis.

11. Perform one regression that you believe makes sense. Be sure to think about what your dependent variable should be, and what regressors you want to include.

### Recently Asked Questions

- Between teh years 2000 and 2010, the number of births in the town of Daneville increased from 1432 to 2219. Which of the following is the approximate percent

- Consider a Cl- channel protein. Which of the following best describes the transport activity of this protein? Question 1 options: Passive transport Primary

- Uriah Heep celebrated his 18th birthday by opening a savings account at the Thames River Bank and depositing $ 4700 He continued to deposit the same amount