Abstract: The amount of data being generated today is staggering and growing. Over the last few years, Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. This workshop will introduce you to the fundamentals of PySpark, Spark's Python API, and to best practices in Spark programming. The world of distributed analytics and machine learning is vast and exciting, and this session is intended to act as a gateway to it.
- Module 1: Basics of PySpark and the DataFrame API
Our goal will be to set up PySpark and get familiar with it. The focus will be on the DataFrame API. We will also discuss which use cases PySpark is a good fit for.
- Module 2: Techniques for working with real-world datasets
We will parse, preprocess, and analyze a couple of large datasets. In addition to the DataFrame API, we will work with Spark SQL, one of PySpark's best features. To illustrate the versatility of the PySpark ecosystem, we will also work with textual data using the Spark NLP library.
- Module 3: Building an end-to-end data analytics pipeline
We will use the knowledge gained in the previous modules to analyze and model a real-world dataset. You will also be introduced to PySpark's machine learning API, SparkML, and work with it hands-on.
- Required: Python programming - syntax and basics of package installation
- Nice-to-have: Familiarity with Jupyter notebooks, data science techniques such as aggregation, and the fundamentals of machine learning
Bio: Akash Tandon is co-founder and CTO of Looppanel, where he builds software to help product teams record, store, and analyze user research data. He is a co-author of Advanced Analytics with PySpark, published by O'Reilly. Previously, Akash worked as a senior data engineer at Atlan, SocialCops, and RedCarpet, where he built data infrastructure for enterprise, government, and finance use cases. He has also been a participant and mentor in the Google Summer of Code program with the R Project for Statistical Computing.