General Training Session: Modeling big data with R, sparklyr, and Apache Spark

Abstract: In this workshop I will show you, hands-on, how to work with big data using R and Apache Spark.

I’ll use sparklyr, developed by RStudio in conjunction with IBM, Cloudera, and H2O, as my main tool. The sparklyr package provides an R interface to Spark’s distributed machine-learning algorithms and much more, making practical machine learning scalable and easy. With sparklyr, you can interactively manipulate Spark data using both dplyr and SQL (via DBI); filter and aggregate Spark datasets, then bring them into R for analysis and visualization; orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water; create extensions that call the full Spark API and provide interfaces to Spark packages; and establish Spark connections and browse Spark data frames within the RStudio IDE.
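
As a rough illustration of the kind of workflow described above (a minimal sketch, not the workshop's actual material), the following R snippet connects to Spark, manipulates data with dplyr and SQL, collects a summary into R, and fits an MLlib model. It assumes the sparklyr, dplyr, and DBI packages are installed and that a local Spark installation is available (for example via `sparklyr::spark_install()`); the table name `mtcars_spark` and the choice of the built-in `mtcars` data are placeholders for illustration only.

```r
# Sketch only: assumes sparklyr, dplyr, and DBI are installed and that a
# local Spark installation exists (e.g. from sparklyr::spark_install()).
library(sparklyr)
library(dplyr)

# Establish a connection to a local Spark instance.
sc <- spark_connect(master = "local")

# Copy a small built-in R data frame into Spark (illustrative data only;
# "mtcars_spark" is a placeholder table name).
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in Spark;
# collect() brings the (small) aggregated result back into R.
mpg_by_cyl <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  collect()

# The same Spark table can be queried with SQL through DBI.
sql_counts <- DBI::dbGetQuery(
  sc,
  "SELECT cyl, COUNT(*) AS n FROM mtcars_spark GROUP BY cyl"
)

# Fit a distributed linear regression with Spark MLlib from R.
fit <- ml_linear_regression(cars_tbl, mpg ~ wt + cyl)

# Close the Spark connection when finished.
spark_disconnect(sc)
```

Only the small aggregated results are pulled into R here; the filtering, grouping, and model fitting run inside Spark, which is the pattern that lets the same code scale to data too large for R's memory.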

Bio: Dr. John Mount is a principal consultant at Win-Vector LLC, a San Francisco data science consultancy. John has worked as a computational scientist in biotechnology and as a stock-trading algorithm designer, and has managed a research team for (now an eBay company). John is the coauthor of Practical Data Science with R (Manning Publications, 2014). John started his advanced education in mathematics at UC Berkeley and holds a Ph.D. in computer science from Carnegie Mellon.

Open Data Science Conference