Student Name:           

Submission Date:       

DBST 667 - Data Mining

Dr. Irene Tsapara

Week 7 Individual Exercise

 

Deliverables: Two files: (1) Submit this lab report with answers to all questions, including output screenshots, to the 'Individual Exercises Week 7' assignment folder. (2) Submit an R script that contains all commands, with comments that briefly describe each command's purpose.

 

Grading: This exercise is worth 2% of the course grade. All questions must be answered in your own words with any paraphrased references properly cited using in-text citations and a reference list as needed. In addition, grammatical and spelling errors may affect the grade.

 

Part 2 - Run an exercise on the imports-85 dataset from imports-85.csv (note again that we are using neither the credit approval nor the vertebral column dataset this week), completing this report and providing the commands, output screenshots, and discussion/interpretation as requested. Ensure that all commands are saved in this report AND in an R script.

 

For Reference: UCI Machine Learning Repository: Imports 85

 

  1. Introduction:
  2. Identify the dependent variable and independent variables in the imports-85 dataset.

 

 

 

 

  1. Based on what you have learned this week about multiple linear regression, provide a one-paragraph, master's-level response describing what you anticipate the lm algorithm will accomplish for the imports-85 data. Be specific about the behavior and structure of the multiple linear regression model.

 

 

 

 

 

 

 

 

 

 

  1. Data Pre-Processing: Load the imports-85 data into RStudio using the read.csv command (do not use File > Import Dataset > From CSV in the RStudio GUI, as this uses read_csv(), resulting in significantly different variable types).
  2. Run the commands to remove the following variables: engine_type, make, num_of_cylinders, fuel_system. Include the commands and output screenshot.
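For reference, a minimal sketch of the load-and-drop step. It does not use the real file: a small temporary CSV stands in for imports-85.csv, and the column names fuel_type, horsepower, and price are assumptions made for illustration. Only the four dropped variable names come from the exercise.

```r
# Hypothetical stand-in for imports-85.csv so the sketch runs on its own;
# the real file has more rows and columns, and its column names may differ.
demo_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "make,fuel_type,engine_type,num_of_cylinders,fuel_system,horsepower,price",
  "audi,gas,ohc,four,mpfi,102,13950",
  "bmw,gas,ohc,six,mpfi,121,20970",
  "mazda,diesel,ohc,four,idi,64,10795"
), demo_csv)

# Base R read.csv(), as the exercise requires (not readr::read_csv()).
# stringsAsFactors = TRUE is an assumption: it turns text columns into factors,
# which was the default behaviour before R 4.0.0.
imports <- read.csv(demo_csv, stringsAsFactors = TRUE)

# Drop the four variables named in the exercise by keeping every other column.
drop_cols <- c("engine_type", "make", "num_of_cylinders", "fuel_system")
imports <- imports[, !(names(imports) %in% drop_cols)]

# Inspect the remaining columns and their types.
str(imports)
```

With the real data, only the file path passed to read.csv() would change; str() or names() gives output suitable for the screenshot.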

Command(s): >

 

 

 

 

 

Output:          

 

 

 

 

 

 

 

 

 

  1. What additional data pre-processing (if any) does the lm() method require for the imports-85 data? Include the commands you ran and the output screenshot.
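One pre-processing check that is often worth running before lm() is a scan for missing values. The sketch below uses a small hypothetical data frame and assumes that missing entries are coded as "?", as in the raw UCI file; the course CSV may already encode them differently, so treat the recoding step as an assumption.

```r
# Hypothetical data frame; column names and values are illustrative only.
raw <- data.frame(
  horsepower = c("102", "?", "64", "110"),
  fuel_type  = c("gas", "gas", "diesel", "gas"),
  price      = c(13950, 20970, 10795, NA),
  stringsAsFactors = FALSE
)

# Assumption: "?" marks a missing value. Recode it as NA, then make the
# column numeric so lm() treats it as a continuous predictor.
raw$horsepower[raw$horsepower == "?"] <- NA
raw$horsepower <- as.numeric(raw$horsepower)

# Categorical predictors should be factors so lm() can dummy-code them.
raw$fuel_type <- factor(raw$fuel_type)

# Count missing values per column, then keep only complete rows.
colSums(is.na(raw))
clean <- na.omit(raw)
```

By default lm() drops incomplete rows on its own, so removing them explicitly mainly makes the training and test row counts easier to reason about.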

 

Command(s): >

 

 

 

 

 

Output:          

 

 

 

 

 

 

 

 

 

 

 

  1. Multiple Linear Regression - Running the Method with Training Data:
  2. Run 'set.seed(12345)' and then split the data into a training set consisting of 70% of the instances and a test set containing the remaining 30% of the instances. Include the commands below.
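A minimal sketch of one way to make the 70/30 split with base R's sample(). The data frame below is a hypothetical stand-in for the pre-processed imports-85 data, and the names train_idx, train, and test are illustrative choices.

```r
set.seed(12345)  # fixes the random draw so the split is reproducible

# Hypothetical stand-in for the pre-processed data (100 rows).
imports <- data.frame(
  horsepower = seq(50, 248, by = 2),
  price      = seq(5000, 24800, by = 200)
)

n <- nrow(imports)

# Draw 70% of the row indices without replacement for training.
train_idx <- sample(n, size = round(0.7 * n))

train <- imports[train_idx, ]   # 70% of the instances
test  <- imports[-train_idx, ]  # the remaining 30%

c(train = nrow(train), test = nrow(test))
```

Negative indexing guarantees that the two sets are disjoint and together cover every row.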

 

Commands:  >

 

 

 

 

 

 

 

  1. Run the lm() function to build the multiple linear regression model, storing the results in a variable called 'mlr_model'. Include the command you ran and a brief discussion about the default input parameters used.
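A minimal sketch of the lm() call on simulated training data. The choice of price as the dependent variable, the predictor names, and the simulated relationship are all assumptions for illustration; only the variable name mlr_model comes from the exercise.

```r
set.seed(12345)

# Simulated training data: one numeric and one categorical predictor.
n <- 60
train <- data.frame(
  horsepower = runif(n, 50, 200),
  fuel_type  = factor(sample(c("diesel", "gas"), n, replace = TRUE))
)
train$price <- 2000 + 120 * train$horsepower +
  1500 * (train$fuel_type == "diesel") + rnorm(n, sd = 1000)

# 'price ~ .' regresses price on every other column in 'data'.
# All other arguments are left at their defaults (for example, rows with
# missing values are dropped and the fit uses a QR decomposition).
mlr_model <- lm(price ~ ., data = train)

# The factor appears as a dummy variable ("fuel_typegas"); the first level
# ("diesel") is absorbed into the intercept as the reference category.
coef(mlr_model)
```

The coefficient names show how a categorical predictor is expanded into indicator columns, which is relevant to the later question on categorical variables.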

Command: >

 

Discussion:

 

 

 

 

 

 

 

 

 

  1. Run the command 'summary(mlr_model)'. Include the output screenshot and answer the following questions:

Output:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

How does the model represent the relationship between the dependent and independent variables in the imports-85 dataset?

 

 

 

 

 

How does the method handle categorical variables?

 

 

 

 

 

 

What does the residuals section of the output mean?

 

 

 

 

 

 

What are the coefficients and what do they mean?

 

 

 

 

 

What is an intercept and what does it mean?

 

 

 

 

 

What do the p-values tell about the significance of each variable?

 

 

 

 

 

What is the overall accuracy of the model?

 

 

 

 

  1. Multiple Linear Regression - Evaluate the Model with Test Data:
  2. Run the command to evaluate the 'mlr_model' on the imports-85 test data. Include the command below.

Command: >

 

  1. Run the command to build the predicted vs. actual (observed) value scatter plot. Add a diagonal line to this plot. Include the commands and the final plot with the diagonal line below.
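A minimal sketch of both steps on simulated data: predict() scores the test set, and plot() with abline() draws the predicted vs. actual scatter plot with its diagonal. The data frame autos, the column names, and the RMSE line are assumptions added for illustration.

```r
set.seed(12345)

# Simulated data standing in for the pre-processed imports-85 data.
n <- 100
autos <- data.frame(horsepower = runif(n, 50, 200))
autos$price <- 2000 + 120 * autos$horsepower + rnorm(n, sd = 1500)

train_idx <- sample(n, size = 70)
train <- autos[train_idx, ]
test  <- autos[-train_idx, ]

mlr_model <- lm(price ~ ., data = train)

# Evaluate the model: predicted price for every test instance.
pred <- predict(mlr_model, newdata = test)

pdf(NULL)  # discards the plot when run as a script; omit this line in RStudio
plot(test$price, pred,
     xlab = "Actual price", ylab = "Predicted price",
     main = "Predicted vs. actual")
abline(a = 0, b = 1, col = "red")  # diagonal: predicted equals actual
invisible(dev.off())

# One possible numeric summary of test error (root mean squared error).
rmse <- sqrt(mean((test$price - pred)^2))
rmse
```

Points on the diagonal are predicted exactly; the vertical distance of a point from the line is the size of that instance's prediction error.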

 

Commands: >

 

 

 

Output:

 

 

 

 

 

 

 

 

 

 

 

 

 

  1. What does the distance between points and the diagonal line tell us about the accuracy of the prediction?

 

 

 

 

 

 

 

 

 

 

 

  1. Multiple Linear Regression - Residual Plots:
  2. Run the 'plot(mlr_model)' command to build the residuals plots. Interpret at least one of the plots. Include the command, the plot, and the interpretation of that plot below.
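A minimal sketch of the diagnostic plots on a simulated model. Calling plot() on an lm object steps through several diagnostic plots; the 'which' argument selects one of them, and which = 1 is the Residuals vs Fitted plot. The data here are simulated and the column names are assumptions.

```r
set.seed(12345)

# Simulated training data with a truly linear relationship.
train <- data.frame(horsepower = runif(50, 50, 200))
train$price <- 2000 + 120 * train$horsepower + rnorm(50, sd = 1000)

mlr_model <- lm(price ~ ., data = train)

pdf(NULL)  # discards the plot when run as a script; omit this line in RStudio
plot(mlr_model, which = 1)  # Residuals vs Fitted only
invisible(dev.off())

# The residuals being plotted can also be inspected directly.
summary(residuals(mlr_model))
```

In a Residuals vs Fitted plot, points scattered evenly around zero with no curve or funnel shape are generally read as consistent with linearity and constant variance.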

 

Command:  >

 

 

 

Output:

 

 

 

 

 

 

 

 

 

 

Interpretation:

 

 

 

 

 

 

 

 

 

  1. Multiple Linear Regression - Minimum Adequate Model:
  2. What is the minimal adequate model? Why do we build it? Provide a one-paragraph, master's-level response.

 

 

 

 

 

 

 

 

 

 

  1. Run the command to build the minimum adequate model and store the model in a variable named 'mlr_model_min'. Include the command and output screenshot.
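One common way to arrive at a reduced model is backward elimination with base R's step(), which removes terms while doing so lowers the AIC. Whether the course intends step() or a different procedure is an assumption here, as are the simulated data and column names; only the name mlr_model_min comes from the exercise.

```r
set.seed(12345)

# Simulated training data: price depends on horsepower only; the other
# two predictors are unrelated to it by construction.
n <- 80
train <- data.frame(
  horsepower = runif(n, 50, 200),
  width      = runif(n, 60, 72),
  noise_var  = rnorm(n)
)
train$price <- 2000 + 120 * train$horsepower + rnorm(n, sd = 1000)

# Full model with every predictor.
mlr_model <- lm(price ~ ., data = train)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
mlr_model_min <- step(mlr_model, direction = "backward", trace = 0)

# Show which terms survived.
formula(mlr_model_min)
```

Leaving trace at its default prints each elimination step, which may be more useful for the output screenshot.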

 

Command:  >

 

 

 

Output:

 

 

 

 

 

 

 

 

  1. Run the 'summary(mlr_model_min)' command. Include the command, output screenshot, and answers to the following questions:

Command:  >

 

Output:

 

 

 

Which variables were eliminated and which variables remain?

 

 

 

 

 

What are the coefficients and the intercept? What do the coefficients and intercept mean?

 

 

 

 

 

Compare the prediction accuracy of the minimum adequate model with the prediction accuracy of the original model. Provide a one-paragraph, master's-level response.

 

 

 

 

 

 

 

  1. New Instance:
  2. Suppose that we have a new car added to the imports-85 dataset. We know the values of the independent variables. How would you use the model to predict the value of the dependent variable for the new car? (Hint: Use the lessons learned and hints from the prior week to complete this exercise). Include the command you would run below:
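A minimal sketch of scoring a single new instance with predict(). The training data, the column names, and the values for the new car are all hypothetical; the key point is that the new data frame must have the same column names as the predictors used to fit the model, and any factor must use the same levels.

```r
# Small hypothetical training set.
train <- data.frame(
  horsepower = c(60, 70, 90, 100, 120, 150, 160, 200),
  fuel_type  = factor(c("gas", "diesel", "gas", "gas",
                        "diesel", "gas", "diesel", "gas")),
  price      = c(7000, 9500, 11000, 12500, 17000, 19500, 22500, 26000)
)

mlr_model <- lm(price ~ ., data = train)

# The new car: one row, same predictor names, factor levels taken from
# the training data so the dummy coding lines up.
new_car <- data.frame(
  horsepower = 110,
  fuel_type  = factor("gas", levels = levels(train$fuel_type))
)

new_price <- predict(mlr_model, newdata = new_car)
new_price
```

The same call works with the reduced model by passing mlr_model_min in place of mlr_model.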

 

Command:  >

 

 

 

 

 

  1. Summary:
  2. Is the multiple linear regression method appropriate for predicting the values of dependent variables in the imports-85 dataset? Explain why or why not. Provide a one-paragraph, master's-level response.

 

 

 

 

 

 

 

  1. (Not graded) Which part of this exercise did you find the most challenging and what steps did you take to resolve the challenge?

 

 

 

 

References
