2. How would you deal with a target class imbalance problem?

Consider the example of rare diseases: a machine learning model may suffer from the accuracy paradox, which makes it difficult to control false positives (Type I errors) and false negatives (Type II errors). A patient may suffer from a rare disease, yet the model will not predict so, since the majority of the data will come from patients without the disease. In fraud detection, the goal is to identify whether a transaction is fraudulent. Because most transactions are legitimate, the model tends to predict fraudulent transactions as valid. To overcome these challenges, several approaches have been developed that can be applied during the pre-processing stage. One commonly used strategy is resampling, which includes undersampling and oversampling techniques. Balancing the dataset by removing instances from the overrepresented class is called undersampling. Oversampling balances the skewed class ratio by adding similar instances of the underrepresented class. Resampling can be done with or without replacement. The first two approaches are depicted in the image below and are explained in the following sections in detail.

Re-sampling: Resampling is the process of reconstructing a data sample from the actual data set by either non-statistical or statistical estimation. In non-statistical estimation, we randomly draw samples from the actual population, hoping that the sample's distribution is similar to that of the population. Statistical estimation, however, involves estimating the parameters of the actual population and then drawing subsamples. In this way, we extract data samples that carry most of the information in the actual population. These resampling techniques help us draw samples when the data is highly imbalanced.
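As a minimal sketch of the oversampling idea described above, the following function duplicates minority-class rows by drawing them with replacement until the classes are balanced. The function name, the list-based inputs `X` and `y`, and the fixed seed are illustrative assumptions, not part of the original text.

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Balance a binary dataset by resampling minority rows with replacement."""
    rng = random.Random(seed)
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi == minority_label]
    majority = [(xi, yi) for xi, yi in zip(X, y) if yi != minority_label]
    # Draw with replacement until the minority class matches the majority count.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    X_bal = [xi for xi, _ in combined]
    y_bal = [yi for _, yi in combined]
    return X_bal, y_bal

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2                 # 8 negatives, 2 positives: a 4:1 imbalance
Xb, yb = random_oversample(X, y, minority_label=1)
print(sum(yb), len(yb) - sum(yb))     # both classes now have 8 rows
```

Because the extra rows are exact duplicates, oversampling this way never invents information; it only reweights the minority class so the learner sees it as often as the majority class.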

Under sampling: Random undersampling is a method in which we randomly select samples from the majority class and discard the remaining ones. This is a naïve approach, because it assumes that any random sample accurately reflects the distribution of the data. It is a classical method whose goal is to balance class distributions through the random elimination of majority-class examples. This leads to discarding potentially useful data that could be important for learning.
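The random elimination step can be sketched as follows. This is a hypothetical helper (the name, list-based inputs, and seed are illustrative): it keeps only as many randomly chosen majority-class rows as there are minority-class rows, discarding the rest.

```python
import random

def random_undersample(X, y, majority_label, seed=0):
    """Balance a binary dataset by discarding random majority-class rows."""
    rng = random.Random(seed)
    majority = [(xi, yi) for xi, yi in zip(X, y) if yi == majority_label]
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi != majority_label]
    # Keep a random subset of the majority class, sized to match the minority.
    kept = rng.sample(majority, len(minority))
    combined = kept + minority
    rng.shuffle(combined)
    return [xi for xi, _ in combined], [yi for _, yi in combined]

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2                 # 8 majority rows, 2 minority rows
Xb, yb = random_undersample(X, y, majority_label=0)
print(len(yb), sum(yb))               # 4 rows remain, 2 from each class
```

Note how six of the eight majority rows are thrown away: this is exactly the loss of potentially useful data that the paragraph above warns about, and the reason undersampling is usually reserved for datasets where the majority class is very large.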
