Can you please help with this homework assignment?
R Project: For this homework you will use a data set from the UCI repository. The data contains 1,000 lines, one for each credit applicant at a German bank. The data is from the 1990s, so you will see references to the Deutsche Mark (DM). It has several numerical and non-numerical attributes of the applicants, and one attribute that indicates whether they were approved for a loan or not. Here is more information about this data set:
• The top web site is here.
• To narrow the search for the data we're looking for, on the left panel under Default Task choose "Classification", under Attribute Type choose "Mixed", and under Area choose "Business". By now you have narrowed down the data sets to four. The one for this homework is Statlog (German Credit Data). Follow the link.
• Read the information about the data carefully. After that, go to the top of the page and follow the "Data Folder" link.
• Now copy the link german.data. You will need to use this link to read the data from your R script.
an example R script called HW3Q4.r
For each question below, write comments that clearly mark where the code for that question begins.
3a) Read the data file german.data from the web site. Do not download the file onto your computer. Read the file directly off the Internet into a data frame called credit. You don't need to create headers; you can use the ones created by R (V1, V2, ..., V21) or something similar. The description of each column, and the meaning of each class for each categorical variable, is given in the description of the data. Note that the last column is the response variable: 1 means credit accepted, and 2 means rejected. Change all the 2's into 0 and transform this column into a factor. Next, choose a random subset comprising 80% of the data and store it in the data frame creditTrain; store the rest in another data frame called creditTest.
3b) Build a tree model with the default settings using the training set. Then test the model on the test set, calculate the confusion matrix, and print the error rate for the test data. Also draw the tree with labels, print the tree in text format, and print a summary as well.
3c) Answer the following questions by examining the summary and the text output of the tree (as an output of your R script):
• In the leftmost leaf of the tree: How many elements from the training set are in this leaf? How are the items falling in it classified? What is the empirical probability of error at this leaf? What is the Gini index (the tree software uses the Gini index, not cross entropy)?
• At the top (root level), on what feature did the split occur (answer with the actual name of this feature, not V1, V3, etc.)? Was this a numerical or a categorical feature? What are the two sides of the split?
3d) Now build a second tree model, but this time set the parameters mindev and minsize in a way that makes the training error zero. Answer all the questions in parts 3b) and 3c) for this model, and compare the performance of this overfitted model with the default model of Question 3b).
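Not a full answer, but parts 3a), 3b) and 3d) could be sketched roughly as below. This assumes the tree package is installed. All helper names (load_credit, recode_response, split_train_test, evaluate, fit_trees) are made up for illustration, and data_url is a placeholder for the german.data link you copy from the Data Folder page. The mindev = 0 and minsize = 2 values are one plausible choice for the overfitted tree; whether they drive the training error all the way to zero on your split is something to check.

```r
# Sketch for parts 3a), 3b) and 3d). Helper names are hypothetical.

# data_url is a placeholder for the german.data link; it is not filled in here.
load_credit <- function(data_url) {
  # read.table accepts a URL, so nothing is saved to disk.
  # With header = FALSE the columns are named V1, V2, ..., V21.
  read.table(data_url, header = FALSE, stringsAsFactors = TRUE)
}

# Recode the last column (2 becomes 0) and turn it into a factor.
recode_response <- function(credit) {
  last <- ncol(credit)
  y <- credit[[last]]
  y[y == 2] <- 0
  credit[[last]] <- factor(y, levels = c(0, 1))
  credit
}

# Random split into training and test rows (80/20 by default).
# The seed is an assumption, only there to make the split repeatable.
split_train_test <- function(credit, train_frac = 0.8, seed = 1) {
  set.seed(seed)
  n <- nrow(credit)
  train_idx <- sample(n, size = round(train_frac * n))
  list(train = credit[train_idx, , drop = FALSE],
       test = credit[-train_idx, , drop = FALSE])
}

# Confusion matrix and error rate for a vector of predictions.
evaluate <- function(actual, predicted) {
  actual <- as.character(actual)
  predicted <- as.character(predicted)
  list(confusion = table(actual = actual, predicted = predicted),
       error = mean(actual != predicted))
}

# Default tree (3b) and a deliberately overfitted tree (3d).
fit_trees <- function(train, response = "V21") {
  f <- as.formula(paste(response, "~ ."))
  default_fit <- tree::tree(f, data = train)
  overfit_fit <- tree::tree(
    f, data = train,
    control = tree::tree.control(nobs = nrow(train), mindev = 0, minsize = 2)
  )
  list(default = default_fit, overfit = overfit_fit)
}
```

With the real link in data_url, usage would look something like credit <- recode_response(load_credit(data_url)), then parts <- split_train_test(credit), fits <- fit_trees(parts$train), and evaluate(parts$test$V21, predict(fits$default, parts$test, type = "class")). For the drawing and text output, plot(fits$default) followed by text(fits$default) draws the labelled tree, while print(fits$default) and summary(fits$default) give the text form and the summary.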
3e) For comparison, build a model using the training set and the naive Bayes method, assuming numerical features follow the normal distribution. Repeat 3b) and compare the error rate with the tree models.
3f) Repeat part 3e), but use the kernel method for numerical variables instead of assuming a normal distribution.
3g) Finally, consider a simple-minded model where the classification is based only on the response variable values in the training set (so all features are ignored). What is the error rate of this prediction on the test set? How does the error rate compare to the results of the four models (the two tree models and the two naive Bayes ones) you built earlier? How do you assess the quality of these four models compared to the simple-minded model?
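For parts 3e) to 3g), a rough sketch might look like the following. It assumes the klaR package is installed and uses its NaiveBayes function, whose usekernel argument switches between normal densities and kernel density estimates for numerical features. The assignment does not say which package to use, so that choice is an assumption. The helper names are hypothetical, and reading the simple-minded model of 3g) as "always predict the most frequent class in the training set" is one interpretation.

```r
# Sketch for parts 3e), 3f) and 3g). Helper names are hypothetical.

# Naive Bayes via klaR (an assumed package choice).
# use_kernel = FALSE: numerical features modelled as normal (3e).
# use_kernel = TRUE: kernel density estimates instead (3f).
fit_naive_bayes <- function(train, response = "V21", use_kernel = FALSE) {
  f <- as.formula(paste(response, "~ ."))
  klaR::NaiveBayes(f, data = train, usekernel = use_kernel)
}

# Majority-class baseline (3g): ignore every feature and always predict
# the class that occurs most often in the training response.
majority_class <- function(train_y) {
  counts <- table(train_y)
  names(counts)[which.max(counts)]
}

# Test error of the baseline: the share of test cases whose class
# differs from the training majority class.
baseline_error <- function(train_y, test_y) {
  mean(as.character(test_y) != majority_class(train_y))
}
```

Given a fitted model nb <- fit_naive_bayes(creditTrain), the predicted classes would come from predict(nb, newdata = creditTest)$class, and the confusion matrix from table(creditTest$V21, that vector). The baseline error from baseline_error(creditTrain$V21, creditTest$V21) is the number the four models need to beat to be worth anything.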