Because this model assumes that each independent variable is contributing separately. Later we club all the probabilities to predict the final Target Class. Attributes values are conditionally independent given the target class. Hence Naïve Baye’s combine prior probability, posterior and conditional probability to predict its class label. Before proceeding to decide whether multi collinearity will affect Naïve Baye’s or not. Let’s discuss about Multi collinearity. What is Multi collinearity? Multi collinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be INDEPENDENT. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results. There are certain reasons why multi collinearity occurs: It is caused by an inaccurate use of dummy variables. It is caused by the inclusion of a variable which is computed from other variables in the data set. Multi collinearity can also result from the repetition of the same kind of variable. 9
Assignment – Machine Learning Basics Generally occurs when the variables are highly correlated to each other. Result: Therefore if we use Naïve Baye’s on Multi collinear dataset then the algorithm will fail. Let’s understand this with an simple example given below; Example: Imagine a person visits a eye glass store to purchase eyeglasses basing on his eye sight prescription. The glass maker will provide the eyeglass basing on the prescription the customer provided. The prescription consists of many variables such as x- axis, y- axis and other eye related measures. If any one of the data/ observation is missing or damaged by a human cause. How the glassmaker deliver the desired eyeglass to the customer? Generally if the data is not present or by mistake if the data is increased or decreased, this results in drastic change in the vision of eyeglasses. Hence if multi collinearity exists in the data it will surely affects the Naïve baye’s assumes that all the variables in the data are independent. Note : Multi collinearity can also be detected with the help of tolerance and its reciprocal, called variance inflation factor (VIF). If the value of tolerance is less than 0.2 or 0.1 and, simultaneously, the value of VIF 10 and above, then the multi collinearity is problematic. 5. If we do not define number of trees to be built in random forest then how many trees random forest internally creates? Solution: Random forest is an Ensemble/group that consists of many decision trees algorithm. It is a supervised classification algorithm. Basically Random forest is a combination of weak data to produce a strong data.