Rare-event modeling: the law of small numbers
Blog Posts:Mu Sigma
Published On: 06 March 2015
“Rare events” are low-frequency, high-severity problems that can have far-reaching consequences. Examples include insurance fraud, major stock market crashes, and disease epidemics.
Predicting and simulating such events is difficult but can be extremely valuable. The key challenges are typically the lack of historical data and the inapplicability of common statistical techniques. Two basic requirements of most analytics endeavors are the availability of known events and an understanding of their characteristics. In the case of rare events, we often have neither. Such issues have forced decision scientists to get creative and explore unconventional analysis methods.
Consider insurance fraud. Insurance companies are prolific users of analytics, and classifying the severity of a claim has become a rather straightforward exercise because plenty of data is available on claim severity. Identifying a fraudulent claim, however, is a different story, since the prevalence of fraud is much lower. An algorithm predicting fraud has far fewer fraudulent examples to learn from than legitimate ones, so the data set it is trained on is imbalanced.
How can we fix this? There are three approaches to handling imbalanced datasets:
Data level: The data-level approach involves resampling to reduce class imbalance. The two most commonly used sampling techniques are over-sampling and under-sampling. Over-sampling randomly duplicates minority-class samples, while under-sampling randomly discards majority-class samples, in both cases to modify the class distribution. Both techniques have disadvantages: over-sampling may lead to over-fitting because it makes exact copies of the minority samples, while under-sampling may discard potentially useful majority samples. As a result, the data-level approach tends to be the least effective option, delivering only minor changes in performance compared to the algorithmic and ensemble methods.
Algorithmic level: The algorithmic-level approach uses machine-learning algorithms that have been modified to accommodate imbalanced data. It compensates for the skew by assigning weights to the classes and by introducing biases and penalty constants. Examples of algorithmic methods for handling imbalance are one-class learning, cost-sensitive learning, recognition-based approaches and kernel-based learning, such as support vector machines (SVMs). Applying an algorithmic approach alone is not preferred, because the data sets are often large and the ratio of non-events to events is often high. We therefore recommend focusing on newer techniques that combine sampling methods with algorithmic approaches.
Ensemble methods: Ensemble methods involve a mixture-of-experts approach. These methods combine algorithmic and data approaches to incorporate different misclassification costs for each class in the learning phase. The two most popular ensemble-learning algorithms are Boosting and Bagging. Boosting algorithms seek to improve classifier accuracy by reweighting misclassified samples so that later models concentrate on them. Bagging, which stands for Bootstrap Aggregating, is a process in which bootstrap samples are drawn randomly with replacement, a model is trained on each sample, and the models' predictions are aggregated.
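The sampling and weighting ideas behind the three approaches above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production method: the function names, the toy labels (5 fraudulent claims in 100) and the fixed random seed are all assumptions made for the example. The class-weight formula is the common inverse-frequency heuristic, n_samples / (n_classes * class_count), chosen here only for illustration.

```python
import random
from collections import Counter


def split_indices(labels, minority):
    """Return (minority indices, majority indices) for a binary label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(labels, minority, rng):
    """Row indices after duplicating random minority rows until classes match."""
    minority_idx, majority_idx = split_indices(labels, minority)
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return majority_idx + minority_idx + extra


def random_undersample(labels, minority, rng):
    """Row indices after discarding random majority rows until classes match."""
    minority_idx, majority_idx = split_indices(labels, minority)
    n_keep = min(len(minority_idx), len(majority_idx))
    return rng.sample(majority_idx, n_keep) + minority_idx


def bootstrap_sample(n_rows, rng):
    """Row indices of one bootstrap sample: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def inverse_frequency_weights(labels):
    """Per-class weight that grows as the class becomes rarer."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


# Hypothetical toy data: 1 = fraudulent claim, 0 = legitimate claim.
labels = [1] * 5 + [0] * 95
rng = random.Random(0)  # fixed seed so the sketch is repeatable

over_idx = random_oversample(labels, minority=1, rng=rng)
under_idx = random_undersample(labels, minority=1, rng=rng)
boot_idx = bootstrap_sample(len(labels), rng)
weights = inverse_frequency_weights(labels)

print(Counter(labels[i] for i in over_idx))
print(Counter(labels[i] for i in under_idx))
print(weights)
```

On this toy data, over-sampling yields 95 rows of each class and under-sampling yields 5 of each, which makes the trade-off concrete: the first repeats the same 5 fraud rows many times, the second throws away 90 legitimate rows. The class weights reach a similar rebalancing without touching the rows, by making each fraud example count 10 times in the loss.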
Ensemble approaches rely on combining a large number of relatively weak and simple models to obtain a stronger ensemble prediction. The most prominent examples of such machine-learning ensemble techniques are random forests, neural network ensembles and Gradient Boosting Machines (GBMs), which have found many successful applications in different domains. Ensemble techniques like random forests and neural network ensembles rely on simple averaging of the models in the ensemble, whereas GBMs are based on a constructive strategy of ensemble formation. The main idea of boosting is to add new models to the ensemble sequentially, each one trained to correct the errors of the models already in it. GBMs apply to a wide range of practical problems and can deliver excellent accuracy and generalization.
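To make the sequential idea concrete, here is a minimal sketch of gradient boosting in plain Python. It is an illustration under stated assumptions, not how any particular GBM library is implemented: the weak learner is a one-split "stump" on a single numeric feature, the loss is squared error (classification GBMs typically use a log-loss instead), and the function names, toy data and parameter values are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals.

    Assumes xs holds at least two distinct values.
    """
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the overall event rate
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def gbm_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Hypothetical toy data: the event (1.0) occurs only when x >= 36.
xs = list(range(40))
ys = [1.0 if x >= 36 else 0.0 for x in xs]

model = fit_gbm(xs, ys)
print(gbm_predict(model, 38), gbm_predict(model, 3))
```

Each round shrinks the remaining error by only a fraction (the learning rate), which is why many weak models are needed. That is the "constructive" contrast with averaging: a random forest builds its trees independently, while here every new stump depends on what the previous ones got wrong.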
There are two schools of thought on big data analytics approaches for rare events: one is that businesses should develop highly evolved models that help predict and prevent these events, and the other is to develop systemic mechanisms to negate the effect of these events. An airline, for instance, could benefit from reorganizing its fleet in response to natural calamities whereas a pharmaceutical organization could choose to maintain a high safety stock to deal with epidemic outbreaks.
While responses to rare events may vary, businesses can’t ignore the need to develop mechanisms to address them. What has your organization done to prepare for rare events?