
Abstract: Join us for an immersive session on optimizing PySpark and applying machine learning with Spark MLlib. In this hands-on session, we will cover a wide range of topics, from the project overview and core machine learning concepts to the implementation of various classification algorithms. Through practical examples and demonstrations, you will gain a solid understanding of PySpark MLlib and its capabilities. We will explore unsupervised learning techniques, such as K-Means clustering, alongside different types of classification algorithms, including decision tree and random forest classifiers.
Additionally, we will delve into essential data preprocessing techniques, such as changing column data types and handling missing values, ensuring your data is ready for analysis. You will also learn how to effectively split your data into training and testing datasets and validate your machine learning models using PySpark. Don't miss this opportunity to enhance your PySpark skills and unlock the full potential of Spark MLlib in your machine learning projects.
Session Outline:
Getting started with Machine Learning on Big Data
Why Spark on the cloud for machine learning
Familiarize yourself with different types of classification algorithms
Code walkthrough from data preparation to model deployment
Explore methods to validate machine learning models in PySpark
Learning objectives:
Introduction to PySpark MLlib
Understanding unsupervised learning
Different types of classification algorithms
Implementation of one of the algorithms (K-Means, random forest, etc.)
Data processing using PySpark
Model building with PySpark on AWS
Bio: Suman Debnath is a Principal Developer Advocate (Data Engineering) at Amazon Web Services, primarily focusing on data engineering, data analysis, and machine learning. He is passionate about large-scale distributed systems and is an avid fan of Python. His background is in storage performance and tool development, where he has developed various performance benchmarking and monitoring tools.

Suman Debnath
Principal Developer Advocate, Data Engineering | Amazon Web Services
