How to Build a High-Performing Weighted XGBoost ML Model for a Real-Life Imbalanced Dataset

Abstract: This code pattern creates an end-to-end ML flow that predicts financial product purchases from imbalanced financial data using weighted XGBoost. It is for anyone interested in using XGBoost and in creating a Scikit-learn-based end-to-end machine learning pipeline for real datasets, where class imbalance is very common.

1. Introduction and Background.
Imbalanced datasets, in which one class has far more samples than the other, are very common. Here we set up the premise of the whole code pattern.
2. Data Set Description.
The data set comes from Portuguese bank marketing data, in which a bank associate calls the bank's clients to sell a financial product, i.e., a CD. The data set contains 17 columns, which are explained here.
3. Statement of Classification Problem.
Before we start building our model, we should clearly define our objective and high-level problem statement.
4. Software and Tools (XGBoost and Scikit-learn).
We will mostly use Python-based libraries, namely XGBoost, Scikit-learn, Matplotlib, Seaborn, and Pandas. In this section we will load and explain each of the packages and its sub-packages.
5. Visual Data Exploration to understand data (Using seaborn and matplotlib).
To get better insight into the data, data scientists usually perform data exploration. We will explore the inputs for their distributions, correlations, and outliers.

We will also explore the output and note its class imbalance.

6. Create Scikit-learn ML Pipelines for Data Processing.
Split the data into train and test sets.
Create an ML pipeline for data preparation. In a typical machine learning application, one usually creates an ML pipeline so that all the steps performed on the training set can be easily applied to the test set.
7. Model Training.
Model training is an iterative process, and we will do several iterations to improve our model's performance.
7.1 What and Why of XGBoost.
We will explain why we chose XGBoost as our tool of choice.
7.2 Discuss Metrics for Model Performance.
We explain in detail various classification performance metrics, such as the ROC curve, the Precision-Recall curve, and the Confusion Matrix, and our choice for this application.
7.3 First Attempt at Model Training and its performance, evaluation and analysis.
We will build an XGBoost model using cross-validation and examine its performance via various stats and visualizations. We will note that performance is not good for the positive class, i.e., recall is bad.
7.4 Strategy for a Better Classifier for Imbalanced Data.
To improve recall, we will highlight a few tricks.
7.5 Second Attempt at Model Training using Weighted Samples and its performance, evaluation and analysis.
Next, we will use one of these tricks, weighted samples, to improve performance.
7.6 Third Attempt at Model Training using Weighted Samples and Feature Selection and its performance analysis.
Lastly, we will build a model with weighted samples and feature selection.
8. Inference Discussion (Generalization and Prediction).
Now our model is ready to be used, and we run it on held-out data to see its performance on the test set.
9. Summary of what we learned about the various techniques.
10. Pointers to Other Advanced Techniques like oversampling, undersampling and the SMOTE algorithm.

Bio: Alok Singh is a Principal Engineer at IBM CODAIT (Center for Open-Source Data and AI Technologies). He has built and architected multiple analytical frameworks and implemented machine learning algorithms for various data science use cases. His interest is in creating big data and scalable machine learning software and algorithms. He has also created many data science applications.

Open Data Science Conference