In-Database Machine Learning in Jupyter

Abstract: 

Moving data, transforming data types, taking small samples so they’ll fit in your sandbox – these are all things every data scientist puts up with as routine. And when you’re finished, a data engineer has to build full production pipelines to reproduce all that work at scale. But what if you could leave all the data where it is and analyze it in place?
What if you could jump straight to the meat of the work, and when you’re finished, a single line of code would push it all into production?
In this tutorial, you’ll use familiar Pandas- and scikit-learn-style code to build a churn reduction model without ever moving data.
Learn:
• Modern in-database machine learning
• How to use Python code and a Jupyter notebook inside a database
• How to manage, train, and evaluate models inside a database
• What makes your model ready for production and how to get it there

Session Outline
Lesson 1: Set up environment and load data
Familiarize yourself with how data is stored and accessed in a Vertica analytical database. Set up your environment with Matplotlib for visualization. Get a quick tour of what is possible in a VerticaPy notebook.
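
To make "analyze it in place" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a connection to an analytical database. It is not Vertica or VerticaPy code. The table name `customers`, its columns, and the sample rows are all hypothetical. The point is only that the aggregation runs inside the database and just the small summary crosses into the notebook.

```python
import sqlite3

# Stand-in for a real analytical database; the table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER, contract TEXT, monthly_charges REAL, churn INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "monthly", 70.0, 1),
        (2, "monthly", 80.0, 1),
        (3, "monthly", 50.0, 0),
        (4, "yearly", 40.0, 0),
        (5, "yearly", 60.0, 0),
        (6, "yearly", 90.0, 1),
    ],
)

# The database does the grouping and averaging; Python receives one
# summary row per contract type instead of every customer row.
summary = conn.execute(
    "SELECT contract, COUNT(*), AVG(churn), AVG(monthly_charges) "
    "FROM customers GROUP BY contract ORDER BY contract"
).fetchall()

for contract, n, churn_rate, avg_charge in summary:
    print(f"{contract}: n={n}, churn_rate={churn_rate:.2f}, avg_charge={avg_charge:.2f}")
```

With six rows the difference is invisible, but the same pattern is what lets a notebook work against tables far larger than its own memory.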

Lesson 2: Prepare Data
Load data. Explore and visualize correlations, outliers, distributions, and summary statistics. Convert categorical variables to Boolean (one-hot) variables, and determine which variables are likely to contribute most to accuracy. Modify the data set as needed for algorithm compatibility.
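
The two preparation steps named above, converting a categorical variable to Boolean columns and checking which columns relate most strongly to churn, can be sketched as follows. This again uses sqlite3 as a stand-in database rather than VerticaPy, and the `customers` table, the `contract` categories, and the helper `correlation_with_churn` are hypothetical names chosen for illustration. The encoding is a view, so no data is copied, and the Pearson correlation is assembled from sums that the database computes.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, contract TEXT, churn INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "monthly", 1), (2, "monthly", 1), (3, "monthly", 0),
     (4, "yearly", 0), (5, "yearly", 0), (6, "two_year", 0)],
)

# One 0/1 column per category, defined as a view over the original table.
conn.execute("""
    CREATE VIEW customers_encoded AS
    SELECT id, churn,
           CASE WHEN contract = 'monthly'  THEN 1 ELSE 0 END AS contract_monthly,
           CASE WHEN contract = 'yearly'   THEN 1 ELSE 0 END AS contract_yearly,
           CASE WHEN contract = 'two_year' THEN 1 ELSE 0 END AS contract_two_year
    FROM customers
""")

def correlation_with_churn(column):
    """Pearson correlation between one encoded column and churn.

    `column` must come from the fixed list below, never from user input,
    because it is interpolated into the SQL text.
    """
    n, sx, sy, sxx, syy, sxy = conn.execute(
        f"SELECT COUNT(*), SUM({column}), SUM(churn), "
        f"SUM({column} * {column}), SUM(churn * churn), SUM({column} * churn) "
        "FROM customers_encoded"
    ).fetchone()
    covariance = n * sxy - sx * sy
    return covariance / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

correlations = {
    column: correlation_with_churn(column)
    for column in ("contract_monthly", "contract_yearly", "contract_two_year")
}

# Strongest relationships first, regardless of sign.
for column, r in sorted(correlations.items(), key=lambda item: -abs(item[1])):
    print(f"{column}: r = {r:+.3f}")
```

On this toy data the monthly-contract flag has the strongest relationship with churn, which is the kind of signal the exploration step is meant to surface before any model is trained.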

Lesson 3: Train and Evaluate Model
Train and validate a couple of different churn probability models. Evaluate each model and compare results. Save the model in the database, and apply it to new data. Practice model comparison, retraining, and versioning.
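
As a rough illustration of that workflow, the sketch below trains two churn probability models, compares them on held-out rows, saves the better one in a versioned table, and applies it to a new customer. Everything here is an assumption made for the example: sqlite3 stands in for the analytical database, the `churn` and `models` tables and the data are invented, and the hand-written gradient descent is a simple substitute for whatever algorithms the session actually uses. The `sigmoid` function is registered with the connection so that gradients, accuracy, and scoring are all computed by SQL queries.

```python
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a scalar function so the database can evaluate the model itself.
conn.create_function("sigmoid", 1, lambda z: 1.0 / (1.0 + math.exp(-z)))

conn.execute("CREATE TABLE churn (tenure_years REAL, churned INTEGER, split TEXT)")
conn.executemany(
    "INSERT INTO churn VALUES (?, ?, ?)",
    [(0.5, 1, "train"), (1.0, 1, "train"), (1.5, 1, "train"),
     (2.5, 0, "train"), (3.0, 0, "train"), (4.0, 0, "train"),
     (0.8, 1, "test"), (1.2, 1, "test"), (2.8, 0, "test"), (3.5, 0, "test")],
)
conn.execute(
    "CREATE TABLE models (name TEXT, version INTEGER, weights TEXT, test_accuracy REAL)"
)

def train_logistic(steps=5000, learning_rate=0.5):
    """Batch gradient descent; each step asks the database for the mean gradient."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0, g1 = conn.execute(
            "SELECT AVG(sigmoid(? + ? * tenure_years) - churned), "
            "AVG((sigmoid(? + ? * tenure_years) - churned) * tenure_years) "
            "FROM churn WHERE split = 'train'",
            (w0, w1, w0, w1),
        ).fetchone()
        w0 -= learning_rate * g0
        w1 -= learning_rate * g1
    return w0, w1

def train_baseline():
    """Intercept-only model: every customer gets the training churn rate.

    Assumes the training churn rate is strictly between 0 and 1.
    """
    (rate,) = conn.execute(
        "SELECT AVG(churned) FROM churn WHERE split = 'train'"
    ).fetchone()
    return math.log(rate / (1.0 - rate)), 0.0

def accuracy(weights, split="test"):
    # A score above 0 corresponds to a churn probability above 0.5.
    (value,) = conn.execute(
        "SELECT AVG((? + ? * tenure_years > 0) = churned) FROM churn WHERE split = ?",
        (weights[0], weights[1], split),
    ).fetchone()
    return value

def save_model(name, weights):
    """Store the weights as JSON under the next version number for this name."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM models WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO models VALUES (?, ?, ?, ?)",
        (name, latest + 1, json.dumps(weights), accuracy(weights)),
    )
    return latest + 1

candidates = {"baseline": train_baseline(), "logistic": train_logistic()}
scores = {name: accuracy(weights) for name, weights in candidates.items()}
best = max(scores, key=scores.get)
version = save_model("churn", candidates[best])

# Apply the stored model to a new customer with one year of tenure.
(stored,) = conn.execute(
    "SELECT weights FROM models WHERE name = 'churn' AND version = ?", (version,)
).fetchone()
w0, w1 = json.loads(stored)
(new_probability,) = conn.execute("SELECT sigmoid(? + ? * ?)", (w0, w1, 1.0)).fetchone()

print(scores, best, version, round(new_probability, 3))
```

Retraining would call `save_model` again under the same name, which writes a new row with the next version number and leaves earlier versions available for comparison.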

Background Knowledge
Basic Python; familiarity with Pandas and scikit-learn is helpful
Basic understanding of Jupyter notebooks

Bio: 

In two decades in the data management industry, Paige Roberts has worked as an engineer, a trainer, a support technician, a technical writer, a marketer, a product manager, and a consultant.
She has built data engineering pipelines and architectures, documented and tested large scale open source analytics implementations, spun up Hadoop clusters from bare metal, picked the brains of some of the stars in the data analytics and engineering industry, championed data quality when that was supposedly passé, worked with a lot of companies in a lot of different industries, and questioned a lot of people's assumptions.
Now, she promotes understanding of Vertica, MPP data processing, open source, high scale data engineering, and how the analytics revolution is changing the world.

Open Data Science
