Democratizing Distributed Compute and Machine Learning: A Tour of Three Frameworks


"Democratizing" has become a buzzword, and why not? Institutions of all types are discovering that almost every job role touches a bit of large-scale data analysis or data science, and sometimes more than just a bit! In this talk we'll look at the patterns, strengths, and weaknesses of three different open-source tools, which all claim to make large-scale computation simpler, easier, and more accessible to more people.

Our exploration will reveal not only major differences at the technical level, but also differences in culture, documentation, usability, open-source governance, and other areas. How easy are they to use, for real people in real organizations?

We'll look at:
-Apache Spark, a well-established cluster computing tool suited to many kinds of work. Among the languages it supports, Apache Spark boasts Spark SQL, which allows a huge number of SQL-capable folks to work on big data.
-Ray, a newer, multi-language framework from UC Berkeley's RISELab. Ray focuses on simplifying the scaffolding beneath distributed task graphs and actor sets so that users can focus on simple distributed training, tuning, reinforcement learning, and more.
-Dask, a Python-native library and part of the SciPy ecosystem dedicated to scaling popular tools like Pandas and NumPy. Dask lets users apply their existing Python knowledge by supporting elements of the Pandas, NumPy, and scikit-learn APIs, and it also extends to scheduling custom task graphs.

All of these projects focus in some way on ease of use, and all have expanded the abilities of normal humans to work with data at scale. But they are also each quite different. This workshop will feature hands-on coding (with supplied notebooks), to help you think about what's easy, what's hard, what life is like with these tools, and which ones may be right for your organization.


Adam Breindel consults and teaches widely on Apache Spark and other technologies. Adam’s experience includes work with banks on neural-net fraud detection, streaming analytics, cluster management code, and web apps, as well as development at a variety of startup and established companies in the travel, productivity, and entertainment industries. He is excited by the way that Spark and other modern big-data tech remove so many old obstacles to system design and make it possible to explore new categories of interesting, fun, hard problems.

Open Data Science
One Broadway
Cambridge, MA 02142
