From Big Data to NLP insights: Getting started with PySpark and Spark NLP


The amount of data being generated today is staggering and still growing. Over the last few years, Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Text data has also become increasingly common as new techniques for working with it have gained popularity.

This workshop will introduce you to the fundamentals of PySpark (Spark's Python API), the Spark NLP library, and best practices for Spark programming when working with textual or natural language data.

Session Outline:

- Module 1: Basics of PySpark and the DataFrame API
Our goal will be to set up PySpark and get familiar with its API. The focus will be on the DataFrame API and basic data operations such as filtering, aggregating and grouping. We will also look at which use cases PySpark is a good fit for.

- Module 2: PySpark for NLP
In this module, we'll discuss using PySpark for NLP tasks such as entity recognition and sentiment analysis. We'll cover how to load, preprocess, and analyze text data using PySpark. We'll also discuss when to use PySpark for NLP tasks and when to consider other Python NLP libraries.
We'll introduce Spark NLP, a popular NLP library built on top of PySpark. The hands-on exercise will demonstrate how to perform text preprocessing and feature extraction with Spark NLP.

- Module 3: Advanced NLP with Spark NLP
We'll discuss Spark NLP's capabilities, advantages, and integration with PySpark. We'll also demonstrate how to use Spark NLP for a task such as entity recognition or sentiment analysis.

Background Knowledge:

Required: Python programming - syntax and basics of package installation

Nice-to-have: Familiarity with Jupyter notebooks; basics of natural language processing and data science techniques such as aggregation.


Akash Tandon is co-founder and CTO of Looppanel, where he builds software to help product teams record, store and analyze user research data. He is a co-author of Advanced Analytics with PySpark, published by O'Reilly. Previously, Akash worked as a senior data engineer at Atlan, SocialCops and RedCarpet, where he built data infrastructure for enterprise, government and finance use cases. He has also been a participant and mentor in the Google Summer of Code program with the R Project for Statistical Computing.

Open Data Science




One Broadway
Cambridge, MA 02142
