Most real-world classification problems exhibit some degree of class imbalance, which occurs when the classes are not equally represented in your data set. It is critical to adapt your metrics and methods to your objectives. If you do not, you may wind up optimizing for a metric that is meaningless in the context of your use case.
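To make the point about meaningless metrics concrete, here is a minimal sketch using made-up labels (990 negatives and 10 positives, an assumption chosen purely for illustration). A "classifier" that always predicts the majority class looks excellent on accuracy while finding none of the positives.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return hits / actual if actual else 0.0


# Hypothetical data set: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A trivial "classifier" that always predicts the majority class.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

If the goal is to find the minority class, the 99% accuracy here says nothing useful, while the recall of zero shows the real problem.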
I’ll now go over a few ways of dealing with class imbalance. Some approaches suit most classification problems, while others are better suited to specific levels of imbalance. For the sake of this article, I will explain them in terms of binary classification, although the same principles apply to multi-class classification as well. I’ll also assume the aim is to identify the minority class, because these approaches aren’t really necessary otherwise.
The most straightforward way to deal with an imbalanced data set is simply to balance it, either by oversampling the minority class or by undersampling the majority class. This gives us a balanced data set, which, in principle, should not result in classifiers that are biased toward one class or the other. In practice, however, these basic sampling techniques have limitations. Oversampling the minority class can lead to overfitting, since it introduces duplicate examples drawn from an already limited pool of instances. Similarly, undersampling the majority class can discard important examples that capture significant differences between the two classes.
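As a rough sketch of the two basic strategies, the snippet below balances a toy data set both ways. The function names, the 95/5 split, and the tuple-based examples are all assumptions made for illustration, and the code assumes the majority class is at least as large as the minority class.

```python
import random


def random_oversample(majority, minority, rng):
    """Duplicate minority examples (drawn with replacement) until the classes match in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority (drawn without replacement) the size of the minority."""
    return rng.sample(majority, len(minority)), minority


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: 95 majority examples and 5 minority examples.
majority = [("neg", i) for i in range(95)]
minority = [("pos", i) for i in range(5)]

over_maj, over_min = random_oversample(majority, minority, rng)
under_maj, under_min = random_undersample(majority, minority, rng)

print(len(over_maj), len(over_min))    # 95 95
print(len(under_maj), len(under_min))  # 5 5
print(len(set(over_min)))              # 5: still only five distinct minority examples
```

The last line shows the overfitting risk directly: after oversampling there are 95 minority rows but still only five distinct examples. Likewise, undersampling throws away 90 of the 95 majority examples.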
There are also more sophisticated sampling methods that go beyond basic oversampling or undersampling. The best-known example is SMOTE (Synthetic Minority Over-sampling Technique), which generates new instances of the minority class by constructing convex combinations of neighboring examples. In effect, it draws line segments between nearby minority points in the feature space and samples along those segments. Because we produce new synthetic instances rather than duplicates, we can balance our data set with less overfitting. This does not prevent all overfitting, however, because the synthetic points are still generated from existing data points.
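The interpolation idea can be sketched in a few lines. This is a simplified illustration of the convex-combination step, not a faithful reimplementation of the published algorithm: the function name `smote_like`, the parameters, and the four toy points are assumptions, and it needs at least two minority points.

```python
import math
import random


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each on a segment between a minority
    point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base, excluding base itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        # lam in [0, 1) picks a position along the segment from base to partner.
        lam = rng.random()
        synthetic.append(tuple(b + lam * (q - b) for b, q in zip(base, partner)))
    return synthetic


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical minority points in a two-dimensional feature space.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

new_points = smote_like(minority, n_new=6, k=2, rng=rng)
print(len(new_points))  # 6
for point in new_points:
    print(point)
```

Every synthetic point lies between two existing minority points, which is why the method adds variety without leaving the region the minority class already occupies. For real work, a maintained implementation such as the SMOTE class in the imbalanced-learn library is likely a better choice than a hand-rolled version.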
Anomaly Detection
In more extreme cases, it may be more appropriate to frame classification as anomaly detection. In anomaly detection, we assume that there is a “normal” distribution (or distributions) of data points, and anything that deviates significantly from it is considered an anomaly. When we recast our classification problem as an anomaly detection task, we treat the majority class as the “normal” distribution of points and the minority class as the anomalies.
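One very simple way to put this framing into practice is to describe the majority class with per-feature means and standard deviations, then flag any point that sits too many standard deviations away. The sketch below does this with z-scores. The function names, the threshold of 3.0, and the randomly generated majority data are all assumptions for illustration, and treating features independently is a strong simplification.

```python
import random
import statistics


def fit_normal_profile(majority):
    """Estimate a (mean, standard deviation) pair per feature from majority-class points only."""
    columns = list(zip(*majority))
    return [(statistics.fmean(c), statistics.stdev(c)) for c in columns]


def anomaly_score(point, profile):
    """Largest absolute z-score across features; higher means further from 'normal'."""
    return max(abs(x - mu) / sigma for x, (mu, sigma) in zip(point, profile))


def is_anomaly(point, profile, threshold=3.0):
    """Flag a point whose score exceeds the (assumed) threshold."""
    return anomaly_score(point, profile) > threshold


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical majority class: 200 points drawn from two independent Gaussians.
majority = [(rng.gauss(0.0, 1.0), rng.gauss(5.0, 2.0)) for _ in range(200)]
profile = fit_normal_profile(majority)

print(is_anomaly((0.1, 5.5), profile))   # a point close to the centre of the majority
print(is_anomaly((0.0, 20.0), profile))  # a point far outside the majority's range
```

Note that the model is fitted on the majority class alone. The minority class is never modelled directly; it is simply whatever falls outside the "normal" profile.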
Detecting a minority class generally comes down to a trade-off between precision and recall. When we wish to identify instances of a minority class, we are usually more concerned with recall than precision, because in a detection setting it is usually more costly to miss a positive instance than to mislabel a negative one. For example, if we are trying to detect abusive content, it is easy for a human reviewer to confirm that a flagged item is not abusive, but it is far harder to find abusive content that was never flagged at all.
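The trade-off is easiest to see by moving the decision threshold of a scoring model. In the sketch below, the labels and scores are invented for illustration, and returning a precision of 1.0 when nothing is predicted positive is just a convention adopted here.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = minority/positive) and model scores, highest score first.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.70, 0.65, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.8, 0.5, 0.3):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.8: precision=1.00, recall=0.50
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.3: precision=0.57, recall=1.00
```

Lowering the threshold catches every positive at the cost of more false alarms. In the abusive-content example, that corresponds to sending more items to human reviewers so that fewer abusive items slip through unflagged.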