Scalable Machine Learning Using Python and a Distributed Analytical Database


Python is a leading programming language for machine learning today thanks to its flexibility, portability, and libraries. It also works well with other analytics tools and frameworks, which is a major benefit for data scientists. However, Python's scalability limitations can make it a challenge to get machine learning models into production, and many machine learning projects stall at the leap to high-scale production.

Financial institutions have huge amounts of structured data that usually reside in distributed data stores. Instead of using Python to extract sample data from those stores to build machine learning models, Vertica can execute Python computations inside the database, where the full dataset resides. This both simplifies model training and boosts accuracy by removing the need to downsample. It also greatly speeds model deployment into full-scale production: you can get proven models deployed in minutes, not months.
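To make the contrast concrete, here is a minimal sketch of the difference between extracting a sample into Python and pushing the computation to where the data lives. It is not Vertica's API: Python's built-in sqlite3 module stands in for the database, and the table name, column names, row counts, and helper functions are all hypothetical choices made for illustration.

```python
import sqlite3


def build_demo_db(n_rows=10_000, fraud_every=500):
    """Create an in-memory table of synthetic transactions (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, amount REAL, is_fraud INTEGER)"
    )
    # Every `fraud_every`-th row is labeled as fraud, so fraud is a rare class.
    rows = [
        (i, float(i % 97), 1 if i % fraud_every == 0 else 0)
        for i in range(1, n_rows + 1)
    ]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def fraud_rate_in_database(conn):
    """Aggregate over the full table inside the engine; one number comes back."""
    (rate,) = conn.execute("SELECT AVG(is_fraud) FROM transactions").fetchone()
    return rate


def fraud_rate_from_sample(conn, sample_size=100):
    """Pull a small, naive extract (the first rows) into Python and compute there."""
    rows = conn.execute(
        "SELECT is_fraud FROM transactions ORDER BY id LIMIT ?", (sample_size,)
    ).fetchall()
    return sum(r[0] for r in rows) / len(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print("in-database, full data:", fraud_rate_in_database(conn))
    print("client-side, small extract:", fraud_rate_from_sample(conn))
```

In this toy setup the small extract contains no fraud rows at all, while the in-database aggregate sees every row. A real fraud dataset and a properly randomized sample would behave less dramatically, but rare classes are where downsampling tends to hurt most.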

In this session, we will use a credit card fraud detection example to demonstrate how Python can be combined with Vertica, a distributed analytical database, to parallelize and simplify machine learning model training and deployment.


Bard is Head of Data Science at Vertica. He has an engineering degree in IT and a master's degree in data science. He is also the creator of the Vertica ML Python API (Scalable as Vertica, Flexible as Python) and a speaker at various data science events.