# Predictive Analysis - Problem 1 Linear Regression


Problem 1: Linear Regression

You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits on different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between more profitable and less profitable stones and improve its profit share. Also, provide the company with the 5 most important attributes.

1.1. Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape, EDA). Perform univariate and bivariate analysis.

Import Dataset: The imported data set consists of 10 columns: Carat, Cut, Color, Clarity, Depth, Table, X (length of the cubic zirconia), Y (width of the cubic zirconia), Z (height of the cubic zirconia) and price. There is also a column 'Unnamed: 0', which appears to be an ID column, so we can drop it.
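The import step could look roughly like the sketch below, assuming pandas. The report does not give the file name, so the inline CSV is a made-up three-row sample that only borrows the column names; in practice the `io.StringIO` object would be replaced by the path to the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real data set has 26,967 rows.
csv_text = """Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
1,0.30,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499
2,0.33,Premium,G,IF,60.8,58.0,4.42,4.46,2.70,984
3,0.90,Very Good,E,VVS2,62.2,60.0,6.04,6.12,3.78,6289
"""

# Stand-in for reading the real file from disk.
df = pd.read_csv(io.StringIO(csv_text))

# 'Unnamed: 0' only identifies the row, so it is dropped before analysis.
df = df.drop(columns=["Unnamed: 0"])

print(df.shape)
print(df.columns.tolist())
```

Dropping the ID column up front keeps it out of the summary statistics and out of the regression features.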
Shape: (26967, 10)

Information of the dataframe: 7 columns were found to be of float/integer type. 3 columns ('cut', 'color' and 'clarity') were found to be of object type; we will convert these into dummy variables for model building.
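As an illustration of the type check and the dummy-variable conversion, here is a minimal sketch assuming pandas. The four rows are invented, and `drop_first=True` is an assumption on our part (a common choice to avoid perfectly collinear dummies in a linear model); the report does not say which encoding options were used.

```python
import pandas as pd

# Hypothetical sample that reuses the report's column names.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
    "color": ["E", "G", "E", "F"],
    "clarity": ["SI1", "IF", "VVS2", "VS1"],
    "price": [499, 984, 6289, 1082],
})

print(df.shape)
print(df.dtypes)

# Every non-numeric column is treated as categorical.
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Assumption: drop one level per column so the dummies are not collinear.
encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)
print(encoded.columns.tolist())
```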
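Summary of the dataframe:

Skewness:

| Variable | Skewness |
|---|---|
| carat | 1.116481 |
| depth | -0.028618 |
| table | 0.765758 |
| x | 0.387986 |
| y | 3.850189 |
| z | 2.568257 |
| price | 1.618550 |

For carat and price the mean and median differ, which indicates some skewness in the data; both have positive skewness values (1.12 and 1.62), i.e. a right skew. x, y and z appear to have almost the same mean and median, although the high skewness values for y (3.85) and z (2.57) suggest that a few extreme values are present. Depth is very slightly negatively skewed (-0.03), so its distribution is close to symmetric.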
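The mean-versus-median comparison and the skewness figure can be reproduced with a few pandas calls. The sketch below uses an invented, deliberately right-skewed series rather than the report's data, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical right-skewed sample: a few large values pull the mean upward.
price = pd.Series([400, 500, 600, 700, 900, 1200, 5000, 12000], name="price")

print(price.describe())

mean, median = price.mean(), price.median()
skewness = price.skew()  # pandas' bias-adjusted sample skewness

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.3f}")
```

On a whole dataframe, `df.skew(numeric_only=True)` returns one value per numeric column, which is the shape of the skewness output reported in this section.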
Checking for missing values: the depth column has 697 missing values; every other column (carat, cut, color, clarity, table, x, y, z, price) has 0.

Duplicates: There are 34 observations which are duplicates.

Univariate Analysis: box plot and dist plot

A box plot gives a good indication of how the values in the data are spread out and also shows whether any outliers are present. A dist plot shows the distribution of a single variable (a univariate set of observations).
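The missing-value and duplicate checks can be sketched as follows, assuming pandas and NumPy. The data is invented, and the treatment at the end (dropping duplicates, filling depth with its median) is only one possible choice; this section of the report does not state how the 697 missing values and 34 duplicates were handled.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing depth and one duplicated row.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.30],
    "depth": [62.1, np.nan, 62.2, 62.1],
    "price": [499, 984, 6289, 499],
})

missing = df.isnull().sum()           # missing values per column
n_dupes = int(df.duplicated().sum())  # rows identical to an earlier row

print(missing)
print("duplicate rows:", n_dupes)

# Assumed treatment: drop duplicates, then impute depth with its median.
clean = df.drop_duplicates().copy()
clean["depth"] = clean["depth"].fillna(clean["depth"].median())
print(clean)
```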
Carat: Most of the data is concentrated between 0 and 1. The plot suggests that outliers are present.

Depth: Most of the data is concentrated between 60 and 65. Very few outliers are present.

Table: Most of the data is concentrated between 55 and 60.

X: Most of the data is concentrated between 4 and 7.

Y: Most of the data is concentrated between 5 and 8.

Z: Most of the data is concentrated between 2 and 5.

Price: Most of the data is concentrated between 0 and 5000.
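Statements such as "most of the data is concentrated between 0 and 1" are read off the plots, but they can also be quantified. The sketch below, assuming pandas and an invented set of carat values, computes the share of observations that fall inside a given range.

```python
import pandas as pd

# Hypothetical carat values, not taken from the report's data set.
carat = pd.Series([0.23, 0.31, 0.41, 0.52, 0.70, 0.90, 1.01, 1.20, 1.51, 2.01])

# between() includes both end points by default; the mean of the
# boolean mask is the fraction of rows inside the range.
share = carat.between(0, 1).mean()

print(f"share of carat values in [0, 1]: {share:.0%}")
```

Repeating this for each variable and range gives a number to put next to every visual observation.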
Boxplot: The box plots above show that almost all the variables have outliers.

Checking the box plots for all the continuous variables, we conclude that outliers are present in almost all of them. These values need to be treated before we proceed with model building and analysis, because they can introduce errors and pull the results away from the true relationship. Linear regression models are sensitive to outliers.
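The report does not say here which outlier treatment was applied. One common option, shown below purely as an assumption, is to cap values at the box-plot fences (1.5 times the interquartile range beyond the quartiles). The helper names `iqr_bounds` and `cap_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences used by a standard box plot."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside the fences to the nearest fence."""
    lower, upper = iqr_bounds(s, k)
    return s.clip(lower=lower, upper=upper)

# Hypothetical sample with one extreme value.
y = pd.Series([3.9, 4.3, 4.4, 5.1, 5.7, 6.1, 6.5, 58.9])

lower, upper = iqr_bounds(y)
n_outliers = int(((y < lower) | (y > upper)).sum())
capped = cap_outliers(y)

print("fences:", lower, upper)
print("outliers flagged:", n_outliers)
print("max after capping:", capped.max())
```

Capping keeps every row in the data set, which matters when outliers appear in almost all variables; dropping the affected rows instead would be an alternative with a different trade-off.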