Student Name:

Submission Date:

DBST 667 - Data Mining

Dr. Irene Tsapara

__Week 7 Individual Exercise__

**Deliverables:** Two files: (1) Submit this lab report with answers to all questions, including output screenshots, to the 'Individual Exercises Week 7' assignment folder. (2) Submit an R script that contains all commands, with comments that briefly describe each command's purpose.

**Grading: This exercise is worth 2% of the course grade.** All questions must be answered in your own words with any paraphrased references properly cited using in-text citations and a reference list as needed. In addition, grammatical and spelling errors may affect the grade.

**Part 2** - **Run an exercise on the imports-85 dataset from imports-85.csv (note again that we are NOT using the credit approval or the vertebral column dataset this week), completing this report and providing the commands, output screenshots, and discussion/interpretation as requested. Ensure that all commands are saved in this report AND in an R script.**

** **

**For Reference: UCI Machine Learning Repository: Imports 85**

** **

**Introduction:** **Identify the dependent variable and independent variables in the imports-85 data set.**

** **

** **

** **

** **

**Based on what you have learned this week about multiple linear regression, provide a one-paragraph, master's-level response describing what you anticipate the lm algorithm will accomplish for the imports-85 data. Be specific about the behavior and structure of the multiple linear regression model.**

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**Data Pre-Processing:** **Load the imports-85 data into RStudio using the read.csv command** (*do not use File > Import Dataset > From CSV in the RStudio GUI, as this uses read_csv(), resulting in significantly different variable types!*). **Run the commands to remove the following variables: engine_type, make, num_of_cylinders, fuel_system. Include the commands and output screenshot.**
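As a rough illustration of the loading and column-removal step, the sketch below writes a tiny, made-up stand-in for imports-85.csv to a temporary file, reads it back with read.csv, and drops the four named variables. The four rows and their values are hypothetical; only the column names to remove come from the prompt. `stringsAsFactors = TRUE` is set explicitly because read.csv no longer converts strings to factors by default in R 4.0 and later.

```r
# Hypothetical four-row stand-in for imports-85.csv (values invented for illustration)
toy <- data.frame(
  make             = c("audi", "bmw", "mazda", "toyota"),
  engine_type      = c("ohc", "ohc", "rotor", "ohc"),
  num_of_cylinders = c("four", "six", "two", "four"),
  fuel_system      = c("mpfi", "mpfi", "4bbl", "2bbl"),
  horsepower       = c(102, 121, 101, 62),
  price            = c(13950, 16430, 10945, 5348),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Load with base R's read.csv (not readr's read_csv)
cars <- read.csv(csv_path, stringsAsFactors = TRUE)

# Remove the four variables named in the exercise
drop_cols <- c("engine_type", "make", "num_of_cylinders", "fuel_system")
cars <- cars[, !(names(cars) %in% drop_cols), drop = FALSE]

# Inspect what is left
str(cars)
```

With the real file, only the path passed to read.csv would change; the column-removal lines stay the same.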

**Command(s): > **

** **

** **

** **

** **

** **

**Output: **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**What additional data pre-processing (if any) does the lm() method require for the imports-85 data? Include the commands you ran and the output screenshot.**
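One pre-processing step that is often needed before lm() is handling missing values. The sketch below assumes that missing entries are stored as the string "?" (as in the original UCI file; the course CSV may differ) and uses a hypothetical four-row data frame. It converts the markers to NA, restores numeric types, and keeps only complete rows.

```r
# Hypothetical data: "?" marks a missing value, so both columns load as text
toy <- data.frame(
  horsepower = c("111", "?", "154", "102"),
  price      = c("13495", "16500", "?", "13950"),
  stringsAsFactors = FALSE
)

# Replace the "?" markers with NA
toy[toy == "?"] <- NA

# Convert the text columns back to numeric
toy$horsepower <- as.numeric(toy$horsepower)
toy$price      <- as.numeric(toy$price)

# Count missing values per column, then keep complete rows only
colSums(is.na(toy))
clean <- na.omit(toy)
str(clean)
```

lm() would drop incomplete rows on its own under the default `na.action`, but removing them up front keeps the training and test sets consistent.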

** **

**Command(s): > **

** **

** **

** **

** **

** **

**Output: **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**Multiple Linear Regression - Running the Method with Training Data:** **Run 'set.seed(12345)' and then split the data into a training set consisting of 70% of the instances and a test set containing the remaining 30% of the instances. Include the commands below.**
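A minimal sketch of a seeded 70/30 split, using R's built-in `mtcars` data as a stand-in for the imports-85 data frame (the variable names `train_idx`, `train`, and `test` are illustrative choices, not names required by the exercise):

```r
# Fix the random number generator so the split is reproducible
set.seed(12345)

n <- nrow(mtcars)

# Draw 70% of the row indices without replacement
train_idx <- sample(n, size = round(0.7 * n))

# Rows drawn go to training; all remaining rows go to test
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

dim(train)
dim(test)
```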

** **

**Commands: >**

** **

** **

** **

** **

** **

** **

** **

**Run the lm() function to build the multiple linear regression model, storing the results in a variable called 'mlr_model'.** Include the command you ran and a brief discussion about the default input parameters used.
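The following sketch fits a model with lm() on a training split of the built-in `mtcars` data, with `mpg` standing in for the dependent variable. Only `formula` and `data` are supplied, so every other argument keeps its default: for example `method = "qr"`, no `weights`, and an `na.action` taken from `options("na.action")`, which is `na.omit` unless changed.

```r
set.seed(12345)
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]

# Regress mpg on every other column; all other arguments stay at their defaults
mlr_model <- lm(mpg ~ ., data = train)

# The stored call shows which arguments were set explicitly
mlr_model$call

# The function signature lists the defaults that were left untouched
args(lm)
```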

**Command: >**

** **

**Discussion:**

** **

** **

** **

** **

** **

** **

** **

** **

** **

**Run the command 'summary(mlr_model)'. Include the output screenshot and answer the following questions:**

**Output:**

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**How does the model represent the relationship between dependent and independent variables in the imports-85 dataset?**

** **

** **

** **

** **

** **

**How does the method handle categorical variables?**
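To make the handling of categorical variables concrete, the sketch below converts one column of the built-in `mtcars` data to a factor and fits a small model. Under R's default treatment contrasts, lm() expands a factor with k levels into k - 1 indicator (dummy) columns, with the first level serving as the baseline absorbed into the intercept. The choice of `cyl` as the factor is purely illustrative.

```r
cars <- mtcars
cars$cyl <- factor(cars$cyl)  # three levels: "4", "6", "8"

fit <- lm(mpg ~ wt + cyl, data = cars)

# One coefficient per non-baseline level: cyl6 and cyl8 (cyl4 is the baseline)
names(coef(fit))

# The design matrix shows the 0/1 indicator columns lm() actually used
head(model.matrix(fit))
```

Each dummy coefficient is read as the expected difference in the response between that level and the baseline level, holding the other predictors fixed.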

** **

** **

** **

** **

** **

** **

**What does the residuals section of the output mean?**

** **

** **

** **

** **

** **

** **

**What are the coefficients and what do they mean?**

** **

** **

** **

** **

** **

**What is an intercept and what does it mean?**

** **

** **

** **

** **

** **

**What do the p-values tell about the significance of each variable?**

** **

** **

** **

** **

** **

**What is the overall accuracy of the model?**
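The quantities discussed in the questions above can also be pulled out of the summary object directly rather than read off the printed output. This sketch uses a small model on the built-in `mtcars` data as a stand-in; the component names (`coefficients`, `r.squared`, `adj.r.squared`, `sigma`) are those of `summary.lm`.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value, and p-value per term
s$coefficients

# Five-number view of the residuals, as in the "Residuals" section of the printout
quantile(residuals(fit))

# Overall fit: R-squared, adjusted R-squared, and residual standard error
s$r.squared
s$adj.r.squared
s$sigma
```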

** **

** **

** **

** **

**Multiple Linear Regression - Evaluate the Model with Test Data:** **Run the command to evaluate the 'mlr_model' on the imports-85 *test* data. Include the command below.**

**Command: >**

**Run the command to build the predicted vs. actual (observed) value scatter plot. Add a diagonal line to this plot. Include the commands and the final plot with the diagonal line below.**
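A sketch of both steps, again on the built-in `mtcars` data as a stand-in: predict() scores the held-out rows, and the scatter plot compares predicted against actual values with a 45-degree reference line drawn by `abline(a = 0, b = 1)`. The three-predictor formula is an arbitrary illustrative choice, and the root mean squared error is added as one possible numeric summary.

```r
pdf(NULL)  # null graphics device so the sketch also runs in batch mode; omit in RStudio

set.seed(12345)
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
mlr_model <- lm(mpg ~ wt + hp + qsec, data = train)

# Predict the response for the held-out test rows
predicted <- predict(mlr_model, newdata = test)

# Root mean squared error on the test set
rmse <- sqrt(mean((test$mpg - predicted)^2))
rmse

# Predicted vs. actual scatter plot with the y = x diagonal
plot(test$mpg, predicted,
     xlab = "Actual mpg", ylab = "Predicted mpg",
     main = "Predicted vs. actual (test set)")
abline(a = 0, b = 1, col = "red")

invisible(dev.off())
```

A point lying exactly on the diagonal is a perfect prediction; the vertical distance from a point to the line is that observation's prediction error.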

** **

**Commands: >**

** **

** **

** **

**Output:**

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**What does the distance between points and the diagonal line tell us about the accuracy of the prediction?**

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**Multiple Linear Regression - Residual Plots:** **Run the 'plot(mlr_model)' command to build the residual plots. Interpret at least one of the plots. Include the command, the plot, and the interpretation of that plot below.**
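Calling plot() on an lm object produces four diagnostic plots by default: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The sketch below, which uses a stand-in model on the built-in `mtcars` data, arranges them on a single 2 x 2 page so they can be captured in one screenshot.

```r
pdf(NULL)  # null graphics device so the sketch also runs in batch mode; omit in RStudio

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Show all four default diagnostic plots on one page
par(mfrow = c(2, 2))
plot(fit)

invisible(dev.off())
```

In the Residuals vs Fitted panel, for example, a roughly horizontal band of points scattered around zero is consistent with a linear relationship and constant error variance, while a curve or funnel shape suggests otherwise.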

** **

**Command: >**

** **

** **

** **

**Output:**

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**Interpretation:**

** **

** **

** **

** **

** **

** **

** **

** **

** **

**Multiple Linear Regression - Minimum Adequate Model:** **What is the minimum adequate model? Why do we build it? Provide a one-paragraph, master's-level response.**

** **

** **

** **

** **

** **

** **

** **

** **

** **

** **

**Run the command to build the minimum adequate model and store the model in a variable named 'mlr_model_min'. Include the command and output screenshot.**
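One common way to arrive at a reduced model is backward elimination with step(), which repeatedly drops the term whose removal lowers the AIC the most and stops when no removal helps. Whether the course intends step() or a manual, p-value-based procedure is an assumption here, and the sketch uses the built-in `mtcars` data as a stand-in.

```r
# Full model with every available predictor
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
mlr_model_min <- step(full_model, direction = "backward", trace = 0)

# Terms that survived the elimination
formula(mlr_model_min)

# Terms that were removed along the way
setdiff(names(coef(full_model)), names(coef(mlr_model_min)))
```

Leaving `trace` at its default prints each elimination step, which is useful for the output screenshot.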

** **

**Command: >**

** **

** **

** **

**Output:**

** **

** **

** **

** **

** **

** **

** **

** **

**Run the 'summary(mlr_model_min)' command. Include the command, output screenshot, and answers to the following questions:**

**Command: >**

**Output:**

** **

** **

** **

**Which variables were eliminated and which variables remain?**

** **

** **

** **

** **

** **

**What are the coefficients and the intercept? What do the coefficients and intercept mean?**

** **

** **

** **

** **

** **

**Compare the prediction accuracy of the minimum adequate model with the prediction accuracy of the original model. Provide a one-paragraph, master's-level response.**
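A comparison of this kind can be sketched by scoring both models on the same test rows. In the example below, `mtcars` stands in for the data, and the reduced formula is hand-specified purely for illustration; in the exercise the second model would be `mlr_model_min`. The helper `rmse()` is a hypothetical convenience function, not part of base R.

```r
set.seed(12345)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

full_model    <- lm(mpg ~ ., data = train)
reduced_model <- lm(mpg ~ wt + qsec + am, data = train)  # illustrative reduced formula

# Hypothetical helper: root mean squared error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, newdata = data))^2))
}

# Out-of-sample error for both models
test_rmse <- c(full = rmse(full_model, test), reduced = rmse(reduced_model, test))
test_rmse

# In-sample fit, penalized for the number of predictors
adj_r2 <- c(full    = summary(full_model)$adj.r.squared,
            reduced = summary(reduced_model)$adj.r.squared)
adj_r2
```

Reporting both the test-set error and the adjusted R-squared helps separate how well each model fits the training data from how well it generalizes.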

** **

** **

** **

** **

** **

** **

** **

**New Instance:** **Suppose that we have a new car added to the imports-85 data set. We know the values of the independent variables. How would you use the model to predict the value of the dependent variable for the new car? (Hint: Use the lessons learned and hints from the prior week to complete this exercise). Include the command you would run below:**
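Scoring a new instance amounts to placing its predictor values in a one-row data frame whose column names match those used to fit the model, then passing it to predict() as `newdata`. The sketch below uses a stand-in model on the built-in `mtcars` data; the values in `new_car` are hypothetical.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new car: column names must match the model's predictors
new_car <- data.frame(wt = 3.0, hp = 120)

# Point prediction for the new instance
point_pred <- predict(fit, newdata = new_car)
point_pred

# The same prediction with a 95% prediction interval
interval_pred <- predict(fit, newdata = new_car, interval = "prediction")
interval_pred
```

If the model includes factor predictors, the new row's factor values must use levels that were present in the training data.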

** **

**Command: >**

** **

** **

** **

** **

** **

**Summary:** **Is the multiple linear regression method appropriate for predicting the value of the dependent variable in the imports-85 dataset? Explain why or why not. Provide a one-paragraph, master's-level response.**

** **

** **

** **

** **

** **

** **

** **

**(Not graded) Which part of this exercise did you find the most challenging and what steps did you take to resolve the challenge?**

** **

** **

** **

** **

References

