Spark and R with sparklyr


One of the frustrations in data science comes when a problem grows from being manageable on a laptop or a single server to being too big to fit in memory or too slow to process. Crossing that threshold often means switching to a completely different environment and even a different language.

Apache Spark is a leading platform for distributed in-memory data analysis. It comes with advanced machine-learning modules and has interfaces for Scala, Python, and R. The SparkR project brings many of Spark’s capabilities to R but is still missing many of the machine-learning tools available with Python or Scala.

In late 2016, RStudio released the sparklyr package to provide tighter integration between the RStudio IDE and Spark. The package provides a backend for the commonly used dplyr package, so R users who are familiar with dplyr can continue using that interface, and it offers much more in terms of machine learning and feature transformations through Spark's MLlib.

This workshop will offer an overview of Apache Spark and the types of problems it can solve before walking you through hands-on examples covering the basics of working with distributed data, data manipulation, and machine learning. You’ll leave with everything you need to scale your R data analysis to a distributed environment, without learning an entirely new language.

The workshop will be delivered using RStudio Server in the cloud, with Spark infrastructure provided by us. Attendees need only a laptop with a modern browser and Wi-Fi connectivity. Basic R knowledge is required.


Doug is a Senior Data Scientist at Mango Solutions. A statistical physicist by training, he now practices and teaches a wide range of data science disciplines from machine learning pipelines to graph theory.

