Abstract: Moving data, transforming data types, taking small samples so they’ll fit in your sandbox – these are all things every data scientist puts up with as routine. And when you’re finished, a data engineer has to build full production pipelines to reproduce all that work at scale. But what if you could leave all the data where it is and analyze it in place?
What if you could jump straight to the meat of the work, and when you’re finished, a single line of code would push it all into production?
In this tutorial, you’ll use familiar pandas and scikit-learn code to build a churn reduction model without ever moving data.
What you'll learn:
• Modern in-database machine learning
• How to use Python code and a Jupyter notebook inside a database
• How to manage, train, and evaluate models inside a database
• What makes your model ready for production and how to get it there
Lesson 1: Set up environment and load data
Familiarize yourself with how data is stored and accessed in a Vertica analytical database. Set up your environment with Matplotlib for visualization. Get a quick tour of what is possible in a VerticaPy notebook.
Lesson 2: Prepare Data
Load data. Explore and visualize correlations, outliers, distributions, and summary statistics. Convert categorical variables to Boolean variables, and identify the variables likely to contribute most to accuracy. Modify the data set as needed for algorithm compatibility.
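The two preparation steps named above, converting categorical variables to Boolean columns and ranking variables by how strongly they relate to churn, can be sketched in plain Python. This is only an illustration under stated assumptions: the toy records, the column names (`contract`, `tenure_months`, `churn`), and the helper functions are all hypothetical, and Pearson correlation stands in for whatever feature-importance measure the tutorial actually uses. In the tutorial itself this work would be done in place in the database rather than on in-memory lists.

```python
from math import sqrt

# Hypothetical toy customer records; column names are illustrative only.
rows = [
    {"contract": "monthly", "tenure_months": 2,  "churn": 1},
    {"contract": "monthly", "tenure_months": 5,  "churn": 1},
    {"contract": "yearly",  "tenure_months": 30, "churn": 0},
    {"contract": "yearly",  "tenure_months": 48, "churn": 0},
    {"contract": "monthly", "tenure_months": 40, "churn": 0},
    {"contract": "yearly",  "tenure_months": 3,  "churn": 1},
]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}_{c}"] = int(r[column] == c)
        encoded.append(new)
    return encoded


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(rows, target):
    """Sort features by absolute correlation with the target, strongest first."""
    features = [k for k in rows[0] if k != target]
    ys = [r[target] for r in rows]
    scores = {f: pearson([r[f] for r in rows], ys) for f in features}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


encoded = one_hot(rows, "contract")
ranking = rank_features(encoded, "churn")

for name, score in ranking:
    print(f"{name:20s} {score:+.3f}")
```

On this toy data, tenure is far more strongly (and negatively) correlated with churn than either contract column, which is the kind of signal that would guide which variables to keep.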
Lesson 3: Train and Evaluate Model
Train and validate a couple of different churn probability models. Evaluate each model and compare results. Save the model in the database, and apply it to new data. Practice model comparison, retraining, and versioning.
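To make the evaluate-and-compare step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the hold-out records, the two toy scoring functions standing in for trained models, and the 0.5 decision threshold. It does not show VerticaPy's API or in-database training; it only shows the comparison logic of scoring each candidate on the same held-out data and picking the better one.

```python
# Hypothetical hold-out set: (tenure_months, is_monthly_contract, churned)
holdout = [
    (2, 1, 1), (4, 1, 1), (36, 0, 0), (50, 0, 0),
    (40, 1, 0), (3, 0, 1), (8, 1, 1), (24, 0, 0),
]


def model_tenure(tenure, monthly):
    """Toy model A: churn probability falls as tenure grows."""
    return 1.0 / (1.0 + tenure / 12.0)


def model_contract(tenure, monthly):
    """Toy model B: churn probability depends only on contract type."""
    return 0.7 if monthly else 0.2


def evaluate(model, data, threshold=0.5):
    """Return accuracy, precision, and recall for a probability model."""
    tp = fp = tn = fn = 0
    for tenure, monthly, churned in data:
        predicted = int(model(tenure, monthly) >= threshold)
        if predicted and churned:
            tp += 1
        elif predicted and not churned:
            fp += 1
        elif not predicted and churned:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(data),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Score every candidate on the same hold-out data, then pick the best.
candidates = {"tenure": model_tenure, "contract": model_contract}
results = {name: evaluate(fn, holdout) for name, fn in candidates.items()}
best = max(results, key=lambda name: results[name]["accuracy"])

for name, metrics in results.items():
    print(name, metrics)
print("best model:", best)
```

Accuracy is used as the selection metric only for simplicity; for churn, where the classes are often imbalanced, precision and recall (also returned here) are usually worth weighing as well.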
Prerequisites:
• Basic Python; familiarity with pandas and scikit-learn is helpful
• Basic understanding of Jupyter notebooks
Bio: In two decades in the data management industry, Paige Roberts has worked as an engineer, a trainer, a support technician, a technical writer, a marketer, a product manager, and a consultant.
She has built data engineering pipelines and architectures, documented and tested large scale open source analytics implementations, spun up Hadoop clusters from bare metal, picked the brains of some of the stars in the data analytics and engineering industry, championed data quality when that was supposedly passé, worked with a lot of companies in a lot of different industries, and questioned a lot of people's assumptions.
Now, she promotes understanding of Vertica, MPP data processing, open source, high scale data engineering, and how the analytics revolution is changing the world.
Open Source Relations Manager | Vertica