“Rare events” are low-frequency, high-severity problems that can have far-reaching consequences. Examples include insurance fraud, major stock market crashes, and disease epidemics.
Predicting and simulating such events is difficult but can be extremely valuable. Key challenges are typically the lack of historical data and the inapplicability of common statistical techniques. Two basic requirements of most analytics endeavors are the availability of known events and an understanding of their characteristics. In the case of rare events, we often have neither. Such issues have forced decision scientists to get creative and explore unconventional analysis methods.
Consider insurance fraud: insurance companies are prolific users of analytics, and classifying the severity of a claim has become a rather straightforward analytics exercise. That’s because there is plenty of data available on claim severity. Identifying a fraudulent claim, however, is a different story, since the prevalence of fraud is much lower. Fraudulent claims make up only a small fraction of all claims, so an algorithm predicting fraud has far fewer positive examples to learn from than an algorithm predicting severity. In other words, the dataset is imbalanced.
How can we fix this? There are three approaches to handling imbalanced datasets:
Data level: The data-level approach involves resampling to reduce class imbalance. The two most common sampling techniques are over-sampling and under-sampling. Over-sampling randomly duplicates minority class samples, while under-sampling randomly discards majority class samples, in both cases to modify the class distribution. Both techniques have disadvantages: over-sampling may lead to over-fitting because it makes exact copies of the minority samples, while under-sampling may discard potentially useful majority samples. As a result, the data-level approach tends to be the least effective option, delivering only minor changes in performance compared to the algorithmic and ensemble methods.
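The two resampling techniques described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names, the toy data (95 legitimate claims and 5 fraudulent ones), and the choice to balance the classes exactly are all assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_count = sum(1 for y in labels if y != minority)
    n_extra = majority_count - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return rows + extra, labels + [minority] * len(extra)


def random_undersample(rows, labels, minority, seed=0):
    """Discard randomly chosen majority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    kept = sorted(minority_idx + rng.sample(majority_idx, len(minority_idx)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 95 legitimate claims (label 0), 5 fraudulent (label 1).
rows = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

over_rows, over_labels = random_oversample(rows, labels, minority=1)
under_rows, under_labels = random_undersample(rows, labels, minority=1)

print(Counter(over_labels))
print(Counter(under_labels))
```

After over-sampling, each class has 95 rows, but the 90 added fraud rows are exact copies of the original 5, which is where the over-fitting risk comes from. After under-sampling, each class has 5 rows, and 90 legitimate claims have been thrown away.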
Algorithmic level: The algorithmic-level approach leverages machine-learning algorithms modified to accommodate imbalanced data. It compensates for the skew by assigning weights to the classes or by introducing biases and penalty constants. Examples of algorithmic methods for handling imbalance are one-class learning, cost-sensitive learning, recognition-based approaches, and kernel-based learning, such as support vector machines (SVMs). Applying an algorithmic approach alone is not preferred because both the size of the data and the imbalance between events and non-events are often large. Thus, we recommend focusing on techniques that combine sampling and algorithmic approaches.
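As a rough illustration of class weighting, the sketch below computes inverse-frequency class weights and plugs them into a weighted log loss. The weighting formula (number of samples divided by number of classes times class count) is one common convention and is an assumption here, as are the function names and the toy data.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Binary log loss where each sample's error is scaled by its class weight."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical toy data: 95 legitimate claims (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# A "lazy" model that always says fraud is very unlikely.
lazy_probs = [0.01] * 100
unweighted = weighted_log_loss(labels, lazy_probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, lazy_probs, weights)

print(weights)
print(unweighted, weighted)
```

With these weights, each fraud case counts as much as 19 legitimate ones, so the lazy model that ignores fraud is penalized far more heavily under the weighted loss than under the unweighted one.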
Ensemble methods: Ensemble methods involve a mixture-of-experts approach. These methods combine algorithmic and data approaches to incorporate different misclassification costs for each class in the learning phase. The two most popular ensemble learning algorithms are Boosting and Bagging. Boosting algorithms seek to improve classifier accuracy by reweighting misclassified samples. Bagging, which stands for Bootstrap Aggregating, is a process in which bootstrap samples are drawn randomly with replacement, a model is trained on each sample, and the models’ predictions are combined by voting or averaging.
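To make bagging concrete, here is a minimal sketch that trains one-threshold classifiers ("decision stumps") on bootstrap samples and combines them by majority vote. The stump learner, the function names, the number of models, and the one-dimensional toy data are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the 1-D threshold and orientation with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for above in (0, 1):  # label predicted when x >= t
            errors = sum(
                (above if x >= t else 1 - above) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x >= t else 1 - above


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the event (label 1) occurs only for x >= 15.
xs = [float(i) for i in range(20)]
ys = [1 if x >= 15 else 0 for x in xs]

models = bagging_fit(xs, ys)
print(bagging_predict(models, 2.0), bagging_predict(models, 19.0))
```

Because each stump sees a slightly different sample, individual thresholds vary, and the vote smooths out that variation. With very rare events, some bootstrap samples may contain few or no positive cases, which is one reason bagging is often paired with the resampling or weighting ideas above.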
Ensemble approaches rely on combining a large number of relatively weak and simple models to obtain a stronger ensemble prediction. The most prominent examples of such machine-learning ensemble techniques are random forests, neural network ensembles, and Gradient Boosting Machines (GBMs), which have found many successful applications in different domains. Ensemble techniques like random forests and neural network ensembles rely on simple averaging of the models in the ensemble, whereas GBMs are based on a constructive strategy of ensemble formation. The main idea of boosting is to add new models to the ensemble sequentially, with each new model trained to correct the errors of the ensemble built so far. GBMs apply to a wide range of practical problems and often deliver strong results in terms of accuracy and generalization.
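The sequential idea behind gradient boosting can be sketched as follows. This is a simplified illustration under stated assumptions: it uses squared-error loss (so each new model is fit to the current residuals), regression stumps as the weak learners, and a fixed learning rate. The function names and toy data are hypothetical, and real GBM libraries add many refinements not shown here.

```python
def fit_regression_stump(xs, residuals):
    """Find the split minimizing squared error, predicting the mean on each side."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the smallest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fit to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: the event (target 1.0) occurs only for x >= 15.
xs = [float(i) for i in range(20)]
ys = [1.0 if x >= 15 else 0.0 for x in xs]

model_5 = gradient_boost(xs, ys, n_rounds=5)
model_50 = gradient_boost(xs, ys, n_rounds=50)
print(mse(model_5, xs, ys), mse(model_50, xs, ys))
```

The training error shrinks as rounds are added, because every new stump targets what the previous ones got wrong. This contrasts with bagging, where the models are trained independently and then averaged.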
There are two schools of thought on big data analytics approaches for rare events. The first holds that businesses should develop highly evolved models that help predict and prevent these events; the second favors developing systemic mechanisms to mitigate their effects. As examples of the second approach, an airline could benefit from reorganizing its fleet in response to natural calamities, and a pharmaceutical organization could maintain a high safety stock to deal with epidemic outbreaks.
In a world of great uncertainty, having the right analytical models for rare events is essential to building a more responsive, resilient, and profitable business. Act now.