
Abstract: The Association of Certified Fraud Examiners (ACFE) consistently estimates that organizations lose approximately 5% of their revenues to fraud. Based on world GDP estimates, this would be roughly $4-5 trillion annually. Fraud is one of the most interesting problems to try to solve. Data science techniques are now at the forefront of this industry, helping fight the battle against criminals in banking, cybersecurity, and more.
This course outlines the typical fraud framework at an organization and where data science can play a role, and lays out how to build an analytically advanced fraud system. It then covers statistical and machine learning approaches to anomaly detection. Moving beyond anomaly detection, these supervised and unsupervised approaches to fraud modeling will help an organization combat the ever-present problem of fraud. These approaches can also be used in other industries to help find unique customers or problems that exist.
Session Outline:
1 - Introduction to Fraud:
Section-1: The Problem of Fraud - How can we analytically define fraud? There are important characteristics of fraud that put the modeling and identification of fraud in better perspective.
Section-2: Detection and Prevention - The two biggest pieces that any holistic fraud solution should have are detection of previous instances of fraud and prevention of new instances. This section also defines the typical fraud identification process in organizations.
Section-3: Analytical Solution - Now that we know what fraud is, as well as the organizational structure for dealing with it, we need to introduce the analytical approaches that make an organization mature at detecting and preventing fraud.
2 - Data Preparation:
Section-1: Feature Engineering - The best way to glean information from data is to develop good features that help detect and identify fraud. We discuss strategies for building good features for anomaly detection.
Section-2: Anomaly Detection with Statistical Techniques - This section goes into detail about how to detect anomalies with more classical techniques like Benford’s Law, z-scores, and Mahalanobis distances.
Section-3: Anomaly Detection with Machine Learning Techniques - This is where the biggest improvements in anomaly detection have happened over the past decade. We will start with k-Nearest Neighbors (k-NN) and the Local Outlier Factor (LOF). Then we will move into more advanced machine learning approaches to anomaly detection like isolation forests, classifier-adjusted density estimation (CADE), and one-class support vector machines (SVMs).
Section-4: Sampling Concerns - Fraud is (hopefully) a rare event in your data. This makes modeling a little harder, as models may tend to predict that no one will commit fraud. We need to learn how to adjust our data beforehand to better aid the model.
3 - Supervised Modeling:
Section-1: Interpretable Models - This section covers basic interpretable models like decision trees and logistic regression for predicting whether a claim is fraudulent.
Section-2: Naïve Bayes – Classifying fraudulent claims may be better understood once we account for the fact that fraud is a rare event. This is where Bayesian modeling, specifically the Naïve Bayes classifier, can help.
Section-3: More Advanced Models – Tree-based models offer good predictive power in classification problems. This section will cover two common tree-based models: random forests and extreme gradient boosting.
Section-4: Model Evaluation – With all the models that have been built so far, a comparison is needed to know which is best. This section of the course covers multiple ways to compare models to see which is best for our data.
Section-5: NOT-Fraud Model – The previous sections created models to predict previous instances of fraud, but to catch new types of fraud, the not-fraud model is needed.
4 - Implementation / Deployment:
Section-1: Clustering Revisited – Once the first fraud model has been built, implementation takes place to get better at catching fraud. However, most of the claims aren’t investigated for fraud since there is limited evidence of it. Clustering helps identify anomalous claims in this group of non-fraud claims as a way of checking if we have built things correctly.
Section-2: Interpretability – Sharing fraud model results in a meaningful and interpretable way is the best way to get investigators to use your model. This section details how to do this with LIME and scorecards, even for the most advanced modeling techniques.
Section-3: Long-Term Fraud Strategy – This section of the course covers how to think about fraud in the long-term and set up your organization in the best way to use the models you have built and get the most value out of your analytical fraud framework.
Section-4: Chance & Loss Models – Most models in fraud deal with the probability of committing fraud, but what about the potential loss? Models of loss amounts add value by clarifying the impact of a specific fraud risk.
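As a rough illustration of the classical techniques named in the Data Preparation part of the outline (z-scores and Benford's Law), here is a minimal Python sketch. It is not taken from the course materials: the claim amounts, the function names, and the z-score cutoff of 2 are all assumptions chosen for the example, and it uses only the standard library.

```python
import math
from collections import Counter
from statistics import mean, stdev


def z_scores(values):
    """Standardize each value using the sample mean and sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def benford_expected(d):
    """Expected share of leading digit d (1-9) under Benford's Law: log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)


def first_digit(x):
    """Leading non-zero digit of a non-zero number, read from scientific notation."""
    return int(f"{abs(x):.15e}"[0])


def first_digit_frequencies(amounts):
    """Observed share of each leading digit 1-9, ignoring zero amounts."""
    digits = [first_digit(a) for a in amounts if a != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Hypothetical claim amounts (illustrative only, not real data).
claims = [120.0, 135.5, 110.0, 142.3, 128.9, 131.0, 9800.0, 125.4]

# Flag claims whose absolute z-score exceeds an assumed cutoff of 2.
flagged = [c for c, z in zip(claims, z_scores(claims)) if abs(z) > 2]

# Compare observed leading-digit shares with the Benford expectation.
observed = first_digit_frequencies(claims)
gaps = {d: observed[d] - benford_expected(d) for d in range(1, 10)}
```

With only a handful of records the Benford comparison is not meaningful on its own; in practice it is applied to large sets of amounts, and large gaps between observed and expected digit shares are treated as a prompt for further review rather than proof of fraud. The z-score cutoff is likewise a judgment call: in a very small sample a single extreme value inflates the standard deviation, which limits how large any z-score can get.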
Background Knowledge:
- Introductory R/Python
- Basic introduction to supervised modeling
- Basic introduction to classification models like logistic regression, decision trees, etc. (this isn't required, but helpful for understanding)
Bio: A Teaching Associate Professor in the Institute for Advanced Analytics, Dr. Aric LaBarr is passionate about helping people solve challenges using their data. There he helps design the innovative program to prepare a modern workforce to wisely communicate and handle a data-driven future at the nation's first Master of Science in Analytics degree program. He teaches courses in predictive modeling, forecasting, simulation, financial analytics, and risk management. Previously, he was Director and Senior Scientist at Elder Research, where he mentored and led a team of data scientists and software engineers. As director of the Raleigh, NC office he worked closely with clients and partners to solve problems in the fields of banking, consumer product goods, healthcare, and government. Dr. LaBarr holds a B.S. in economics, as well as a B.S., M.S., and Ph.D. in statistics — all from NC State University.

Aric LaBarr, PhD
Associate Professor of Analytics | Institute for Advanced Analytics at NC State University
