of majority class examples. This discards potentially useful data that could be important for classifiers.

The most commonly used approaches are based on k-nearest neighbors (k-NN). These approaches select a sample set, exhaustively search the entire dataset for the k nearest neighbors of those samples, and discard the remaining data. This method assumes that the k nearest neighbors carry all the information we need about those classes.

Several other undersampling techniques are available, based on two different noise model hypotheses. In the first noise model, samples near the class boundary are assumed to be noise and are discarded in order to maximize accuracy. In the second noise model, majority class samples concentrated in the same region as minority class samples are assumed to be noise. Discarding these samples creates a clearer boundary that can assist in classification.

Oversampling:

While undersampling aims to achieve an equal distribution by eliminating majority class samples, oversampling does so by replicating minority samples until the distribution is balanced. Naive oversampling has a few shortcomings, however. It increases the probability of overfitting, because it makes exact copies of minority samples rather than drawing new points from the distribution of the minority class. Another problem is that as the number of samples increases, training becomes more expensive, which increases the running time of the models.

One commonly used oversampling method that mitigates the overfitting problem is SMOTE (Synthetic Minority Over-sampling Technique). It creates new synthetic samples by interpolating between a minority sample and its nearest minority class neighbors. Because the new points lie between existing minority samples instead of duplicating them, the classifier tends to learn a broader, less specific decision region for the minority class, which helps avoid the overfitting caused by exact replication.

Imbalanced data is one of the potential problems in the field of data mining and machine learning. This problem can be approached by properly analyzing the data. A few approaches that help in tackling the problem at the data level are undersampling, oversampling, and feature selection. Moving forward, a lot of research is still required to handle the data imbalance problem more efficiently.
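The difference between naive oversampling and SMOTE-style interpolation can be illustrated with a short sketch. This is not a reference implementation of SMOTE; it is a minimal NumPy example under stated assumptions: the function names `naive_oversample` and `smote_like_oversample`, the Euclidean distance metric, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def naive_oversample(X_min, n_new, rng):
    """Replicate randomly chosen minority rows (exact copies)."""
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx]


def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class (assumed metric).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # which sample to start from
    pick = rng.integers(0, k, size=n_new)   # which of its neighbors to use
    nb = neighbors[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])


# Toy minority class: 10 points in 2 dimensions (made-up data).
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
copies = naive_oversample(X_min, 20, rng)
synthetic = smote_like_oversample(X_min, 20, k=3, rng=rng)
```

Here `copies` contains only rows that already exist in `X_min`, while `synthetic` contains new points that fall between existing minority samples. In practice a maintained library implementation would usually be preferable to a hand-written version like this one.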