Navigating the Pitfalls of Applying Machine Learning in Practice


As the amount and complexity of data rapidly increase, machine learning tools are being used for a wide array of analytical tasks. These tasks include supervised and unsupervised prediction and forecasting as well as sophisticated normalization and integration of heterogeneous data sets. Although machine learning has shown great promise in almost every area to which it has been applied, mistaken assumptions about the data used to train such models can lead to erroneous evaluations and to models that work worse in practice than expected, or do not work at all. In this session, we will talk concretely about five interrelated pitfalls that one might encounter when using supervised machine learning and how to avoid them. Importantly, these pitfalls are not domain specific: they can, and do, occur in every industry, and failing to appreciate their significance can cause projects to fail that would otherwise succeed.

Session Outline:

This session will cover five statistical pitfalls:

1. Distributional differences
2. Dependency structure
3. Confounding variables
4. Information leakage
5. Unbalanced data

Each pitfall will be illustrated with an example, although the first and fourth pitfalls will be discussed in the most depth. By the end, the audience should have a conceptual understanding of what each of these pitfalls is and how to avoid it.
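As a small illustration of the fifth pitfall, the sketch below shows how plain accuracy can look strong on unbalanced data even when a model has learned nothing. The labels, the 95/5 class ratio, and the helper functions are made up for illustration and are not taken from the session; balanced accuracy is used here as one common alternative metric, not necessarily the one the speaker recommends.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical test set: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}")
```

The constant predictor gets most examples right only because most examples are negative; averaging recall over the two classes exposes that it never identifies a positive.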

Background Knowledge:

The audience should understand how machine learning models are trained, i.e., using a training set for training and a separate test set for evaluation, but does not need to know the mathematics behind how any models work. Attendees may get more out of the talk if they have trained a model themselves, but that is not a requirement.
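For readers who want a concrete picture of that workflow, here is a minimal sketch of a train/test split using only the Python standard library. The function name, the 80/20 ratio, the seed, and the toy data are all assumptions made for illustration. The preprocessing statistics are estimated on the training set alone, which is one simple habit that helps keep test-set information out of training.

```python
import random
import statistics


def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into two disjoint lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# Hypothetical one-feature data set; the values are made up for illustration.
values = [float(v) for v in range(20)]

train_idx, test_idx = train_test_split(len(values))
train = [values[i] for i in train_idx]
test = [values[i] for i in test_idx]

# Estimate the preprocessing statistics on the training set only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)

train_scaled = [(v - mean) / std for v in train]
# The test set reuses the training statistics instead of computing its own.
test_scaled = [(v - mean) / std for v in test]
```

A random split like this one assumes the examples are independent; when they are not, the split itself usually needs to respect the grouping in the data.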


Jacob Schreiber is a post-doctoral researcher at the Stanford School of Medicine. As a researcher, he has developed machine learning approaches to integrate thousands of genomics data sets and to design biological sequences with desired characteristics, and he has described how statistical pitfalls can be encountered and accounted for in genomics data sets. As an engineer, he has contributed to the community as a core contributor to scikit-learn and as the developer of several machine learning toolkits, including pomegranate for probabilistic modeling and apricot for submodular optimization.

Open Data Science




One Broadway
Cambridge, MA 02142
