Training Sessions

– Taught by World-Class Data Scientists –

Learn the latest data science concepts, tools and techniques from the best. Forge a connection with these rock stars from industry and academia, who are passionate about molding the next generation of data scientists.

Highly Experienced Instructors

Our instructors are highly regarded in data science, coming from both academia and notable companies.

Real World Applications

Gain the skills and knowledge to use data science in your career and business, without breaking the bank.

Cutting Edge Subject Matter

Find training sessions offered on a wide variety of data science topics from machine learning to data visualization.

ODSC Training

Form a working relationship with some of the world’s top data scientists for follow-up questions and advice.

Additionally, your ticket includes access to 50+ talks and workshops.

Recordings of workshop and talk sessions are available for later review.

Equivalent training at other conferences costs much more.

Professionally prepared learning materials, custom tailored to each course.

Opportunities to connect with other ambitious like-minded data scientists.

10+ reasons people are attending ODSC East 2019


A Few of Our 2019 Training and Workshop Session Speakers

More sessions to be added soon!

Training Sessions


Training: Apache Spark for Fast Data Science (and Fast Python Integration!) at Scale

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.
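
As a taste of that workflow, here is a minimal sketch of feature engineering and model fitting with Spark's Python ML pipeline API (the data file and column names are illustrative, not part of the course materials):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("odsc-example").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical dataset

# Assemble raw columns into a feature vector, then fit a model in one pipeline
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)
predictions = model.transform(df)
```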

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with your favorite Python tooling. We’ll discuss distributed scheduling for popular libraries like TensorFlow, as well as fast model inference, traditionally a challenge with Spark. We’ll even see how you can integrate Spark with Python+GPU computation on arrays (PyTorch) or dataframes (RAPIDS).

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.

Instructor's Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s 20 years of engineering experience include streaming analytics, machine learning systems, and cluster management schedulers for some of the world’s largest banks, along with web, mobile, and embedded device apps for startups. His first full-time job in tech was on a neural-net-based fraud detection system for debit transactions, back in the bad old days when some neural nets were patented (!), and he’s much happier living in the age of amazing open-source data and ML tools today.

Adam Breindel

Apache Spark Expert, Data Science Instructor and Consultant

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
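
To make the setup concrete, here is a minimal, self-contained sketch (using a synthetic dataset, not course material) of learning a function from input-output pairs with scikit-learn and evaluating it on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic input-output pairs (X, y) standing in for real training data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training pairs, then check how well the learned function generalizes
clf = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```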

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Lecturer, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Intermediate Machine Learning with scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity and how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
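
For orientation, here is a minimal sketch of the pipeline-plus-grid-search pattern covered in the session (the estimator and parameter grid are illustrative choices, not the course's exact examples):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Chain preprocessing and the model so that parameter search tunes the whole pipeline
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```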

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Lecturer, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Hands-on introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a text generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore various related architectures, like GRUs and bidirectional LSTMs, to see how well they work.
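
As a rough picture of what the first exercise looks like, here is a minimal LSTM text classifier in Keras (the vocabulary size, sequence length, and layer sizes are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Integer-encoded token sequences go in; a binary class probability comes out
model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64, input_length=100),  # 10k-word vocab, length-100 sequences (assumed)
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```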

Instructor's Bio

Lukas Biewald is a co-founder and CEO of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. Lukas also founded Figure Eight (formerly CrowdFlower) — a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized high-quality training data — which he co-founded in December 2007 with Chris Van Pelt. Prior to co-founding Weights and Biases and CrowdFlower, Biewald was a Senior Scientist and Manager within the Ranking and Management Team at Powerset, a natural language search technology company later acquired by Microsoft. From 2005 to 2006, Lukas also led the Search Relevance Team for Yahoo! Japan.

Lukas Biewald

Founder at Weights & Biases

Training: Introduction to Reinforcement Learning

Reinforcement Learning has recently made great progress in industry as one of the best techniques for sequential decision making and control policies.

DeepMind used RL to greatly reduce energy consumption in Google’s data centre. It has been used for text summarisation, autonomous driving, dialog systems, media advertising, and in finance by JPMorgan Chase. We are at the very beginning of the adoption of these algorithms, as systems are required to operate more and more autonomously.
In this workshop we will explore Reinforcement Learning, starting from its fundamentals and ending with creating our own algorithms.

We will use OpenAI Gym to try out our RL algorithms. OpenAI is a non-profit organisation committed to open-sourcing its research on Artificial Intelligence. To foster innovation, OpenAI created a virtual environment, OpenAI Gym, where it’s easy to test Reinforcement Learning algorithms.

In particular, we will start with some popular techniques like Multi-Armed Bandits, going through Markov Decision Processes and Dynamic Programming.
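
To show the kind of interface we will be working with, here is a minimal OpenAI Gym interaction loop using a random policy (CartPole-v1 is just an illustrative environment):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()            # random policy as a placeholder
    obs, reward, done, info = env.step(action)    # environment returns next state and reward
    total_reward += reward

print("episode return:", total_reward)
```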

Instructor's Bio

Leonardo De Marchi holds a Master’s in Artificial Intelligence and has worked as a Data Scientist in the sports world, with clients such as the New York Knicks and Manchester United, and with large social networks like Justgiving.
He now works as Lead Data Scientist at Badoo, the largest dating site with over 360 million users. He is also the lead instructor at ideai.io, a company specializing in Deep Learning and Machine Learning training, and a contractor for the European Commission.

Leonardo De Marchi

Head of Data Science and Analytics at Badoo

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together, starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate the practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation, and so on. This class is very hands-on and practical and requires no math or experience with deep learning.
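
As one illustration of where the class ends up, here is a minimal transfer-learning sketch in Keras (the pretrained backbone, input size, and class count are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Reuse a pretrained convolutional base and freeze its weights
base = keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                      input_shape=(160, 160, 3), pooling="avg")
base.trainable = False

# Train only a small classification head on top (5 classes assumed)
model = keras.Sequential([base, layers.Dense(5, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```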

Instructor's Bio

Lukas Biewald is a co-founder and CEO of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. Lukas also founded Figure Eight (formerly CrowdFlower) — a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized high-quality training data — which he co-founded in December 2007 with Chris Van Pelt. Prior to co-founding Weights and Biases and CrowdFlower, Biewald was a Senior Scientist and Manager within the Ranking and Management Team at Powerset, a natural language search technology company later acquired by Microsoft. From 2005 to 2006, Lukas also led the Search Relevance Team for Yahoo! Japan.

Lukas Biewald

Founder at Weights & Biases

Training: Machine Learning in R Part I

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s from Columbia University in statistics and a bachelor’s from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Author, R Programming Expert, Statistics Professor at Columbia University

Training: Machine Learning in R Part II

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s from Columbia University in statistics and a bachelor’s from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Author, R Programming Expert, Statistics Professor at Columbia University

Training: Human-Centered Data Science - When the left brain meets the right brain

We will present two different dimensions of the practice of data science, specifically data storytelling (including data visualization) and data literacy. There will be short presentations, integrated with interactive sessions, group activities, and brief moments of brain and body exercise. The combination of these various activities is aimed at demonstrating and practicing the concepts being presented.

The Data Literacy theme will include a section on “data profiling – having a first date with your data”, focusing on getting acquainted with all the facets, characteristics, features (good and bad), and types of your data. This theme will also include a section on matching models to algorithms to data types to the questions being asked.

The Data Storytelling theme will include sections on the neuroscience of visual displays of evidence (visual analytics) for decision-making and a component on user-centered design in data science. Design thinking, empathy, consultative practice, and the BI Dashboard Formula (BIDF) methodology will be emphasized.

The combination of the two themes (data literacy and data storytelling) will be made more concrete through exercises in small breakout groups. Each group will be given a sample problem, then asked to take a data science approach (modeling, visualization, storytelling) to address the three fundamental questions that we should always consider in our projects: What? So what? Now what?

Workshop participants will come away with design tips, tricks, and tools for better human-centered data science. The goal is for your next data science project and presentation to be your best ever. As Maya Angelou said so eloquently, “people will forget what you said, people will forget what you did, but people will never forget how you made them feel.” Make your data science matter by demonstrating why and how it matters.

Instructor's Bio

Dr. Kirk Borne is the Principal Data Scientist and an Executive Advisor at global technology and consulting firm Booz Allen Hamilton. In those roles, he focuses on applications of data science, data management, machine learning, A.I., and modeling across a wide variety of disciplines. He also provides training and mentoring to executives and data scientists within numerous external organizations, industries, agencies, and partners in the use of large data repositories and machine learning for discovery, decision support, and innovation. Previously, he was Professor of Astrophysics and Computational Science at George Mason University for 12 years where he did research, taught, and advised students in data science. Prior to that, Kirk spent nearly 20 years supporting data systems activities on NASA space science programs, which included a period as NASA’s Data Archive Project Scientist for the Hubble Space Telescope. Dr. Borne has a B.S. degree in Physics from LSU, and a Ph.D. in Astronomy from Caltech. In 2016 he was elected Fellow of the International Astrostatistics Association for his lifelong contributions to big data research in astronomy. As a global speaker, he has given hundreds of invited talks worldwide, including conference keynote presentations at many dozens of data science, A.I. and big data analytics events globally. He is an active contributor on social media, where he has been named consistently among the top worldwide influencers in big data and data science since 2013. He was recently identified as the #1 digital influencer worldwide for 2018-2019. You can follow him on Twitter at @KirkDBorne.

Dr. Kirk Borne

Principal Data Scientist at Booz Allen Hamilton

Training: Human-Centered Data Science - When the left brain meets the right brain

We will present two different dimensions of the practice of data science, specifically data storytelling (including data visualization) and data literacy. There will be short presentations, integrated with interactive sessions, group activities, and brief moments of brain and body exercise. The combination of these various activities is aimed at demonstrating and practicing the concepts being presented.

The Data Literacy theme will include a section on “data profiling – having a first date with your data”, focusing on getting acquainted with all the facets, characteristics, features (good and bad), and types of your data. This theme will also include a section on matching models to algorithms to data types to the questions being asked.

The Data Storytelling theme will include sections on the neuroscience of visual displays of evidence (visual analytics) for decision-making and a component on user-centered design in data science. Design thinking, empathy, consultative practice, and the BI Dashboard Formula (BIDF) methodology will be emphasized.

The combination of the two themes (data literacy and data storytelling) will be made more concrete through exercises in small breakout groups. Each group will be given a sample problem, then asked to take a data science approach (modeling, visualization, storytelling) to address the three fundamental questions that we should always consider in our projects: What? So what? Now what?

Workshop participants will come away with design tips, tricks, and tools for better human-centered data science. The goal is for your next data science project and presentation to be your best ever. As Maya Angelou said so eloquently, “people will forget what you said, people will forget what you did, but people will never forget how you made them feel.” Make your data science matter by demonstrating why and how it matters.

Instructor's Bio

Mico Yuk (@micoyuk) is the founder of BI Brainz and the BI Dashboard Formula (BIDF) methodology, through which she has trained thousands of people globally on how to strategically use the power of data visualization to enhance the decision-making process. Her inventive approach fuses Enterprise Visual Storytelling with her proprietary BI Dashboard Formula (BIDF) methodology. Mico’s ability to strategically use data visualization to enhance decision making, develop analytics portfolios that business users love, and help companies gain ROI from their Business Intelligence investments has been sought out by several high-profile Fortune 500 companies: Shell, FedEx, Nestle, Qatargas, Ericsson, Procter & Gamble, Kimberly-Clark, and more. She also authored Data Visualization For Dummies (Wiley, 2014).
Her ‘blunt’ Twitter comments and blogs have been mentioned on tech websites and blogs. Since 2010 she has been a sought-after global keynote speaker and trainer, and was named one of the Top 50 Analytics Bloggers to follow by SAP. Some of her featured keynotes include the Microsoft PASS Business Analytics Conference, MasteringSAP BI, Saloon BI, BOAK, and Big Data World in London, to name a few. This year she is a much-anticipated keynote speaker at Facebook’s first Women in Analytics Conference at its headquarters in Menlo Park. In June she will return to the Real Business Intelligence Conference at MIT as a featured speaker for the second year in a row.

Mico Yuk

CEO and Founder at BI Brainz and the BI Dashboard Formula

Training: TensorFlow 2.0 and Keras: what's new, what's shared, what's different

TensorFlow 2.0 makes Keras the default API for model definition. This is a big change: it makes TensorFlow more accessible to beginners and newcomers, and it also disrupts consolidated patterns and habits for experienced TensorFlow programmers. This workshop is aimed at both audiences and covers how to define models in TensorFlow 2.0 using the tf.keras API. It also covers the commonalities and differences between the open source Keras package and tf.keras, explaining the pros and cons of each of the two. If you are getting started with TensorFlow or you’re puzzled by the changes in TensorFlow 2.0, come and learn how easy it is to design models using Keras.
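
For a sense of what model definition looks like through tf.keras, here is a minimal sketch using the functional API (the layer sizes and input shape are illustrative):

```python
import tensorflow as tf

# The familiar Keras model-definition API, accessed through tf.keras in TensorFlow 2.0
inputs = tf.keras.Input(shape=(20,))
hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```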

Instructor's Bio

Francesco Mosconi holds a Ph.D. in Physics and is CEO & Chief Data Scientist at Catalit Data Science. At Catalit, Francesco helps Fortune 500 companies up-skill in Machine Learning and Deep Learning through intensive training programs and strategic advisory. Author of the Zero to Deep Learning book and bootcamp, he is also an instructor at Udemy and Cloud Academy. He was formerly co-founder and Chief Data Officer at Spire, a YC-backed company that invented the first consumer wearable device capable of continuously tracking respiration and physical activity. He is a Machine Learning and Python expert, and has also served as Data Science lead instructor at General Assembly and The Data Incubator.

Francesco Mosconi, PhD

Data Scientist at Catalit

Training: Programming with Data: Python and Pandas

Whether in R, MATLAB, Stata, or Python, modern data analysis, for many researchers, requires some kind of programming. The preponderance of tools and specialized languages for data analysis suggests that general purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.

In this workshop, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality, specifically, loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.
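
As a preview of that core functionality, here is a minimal sketch of loading, filtering, grouping, and transforming data with Pandas (the file name and column names are illustrative):

```python
import pandas as pd

# Load a hypothetical sales file into a DataFrame
sales = pd.read_csv("sales.csv")             # columns assumed: region, product, amount

# Filter rows, then group and aggregate
east = sales[sales["region"] == "East"]
by_product = east.groupby("product")["amount"].sum()

# Transform: derive a new column from existing data
sales["amount_k"] = sales["amount"] / 1000
print(by_product.head())
```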

Instructor's Bio

Daniel Gerlanc has worked as a data scientist for more than a decade and written software professionally for 15 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors. At Enplus, he works with clients on data science and custom software development with a particular focus on projects requiring expertise in both areas. He teaches data science and software development at introductory through advanced levels. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups.

Daniel Gerlanc

President at Enplus Advisors Inc.

Training: Modern and Old Reinforcement Learning

Reinforcement Learning has recently made great progress in industry as one of the best techniques for sequential decision making and control policies.

DeepMind used RL to greatly reduce energy consumption in Google’s data centre. It has been used for text summarisation, autonomous driving, dialog systems, media advertising, and in finance by JPMorgan Chase. We are at the very beginning of the adoption of these algorithms, as systems are required to operate more and more autonomously.
In this workshop we will explore Reinforcement Learning, starting from its fundamentals and ending with creating our own algorithms.

We will use OpenAI Gym to try out our RL algorithms. OpenAI is a non-profit organisation committed to open-sourcing its research on Artificial Intelligence. To foster innovation, OpenAI created a virtual environment, OpenAI Gym, where it’s easy to test Reinforcement Learning algorithms.

In particular, we will start with some popular techniques like Multi-Armed Bandits, going through Markov Decision Processes and Dynamic Programming.
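
To illustrate the starting point, here is a minimal epsilon-greedy Multi-Armed Bandit sketch in plain NumPy (the arm probabilities and epsilon are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.2, 0.5, 0.7]           # hidden payout probability of each arm (assumed)
counts = np.zeros(3)
values = np.zeros(3)                   # running estimate of each arm's value
epsilon = 0.1

for _ in range(1000):
    # Explore with probability epsilon, otherwise exploit the current best estimate
    arm = rng.integers(3) if rng.random() < epsilon else int(np.argmax(values))
    reward = float(rng.random() < true_probs[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print("estimated arm values:", values.round(2))
```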

We will then also explore other RL frameworks and more complex concepts like policy gradient methods and Deep Reinforcement Learning, which recently changed the field of Reinforcement Learning. In particular, we will see Actor-Critic models and Proximal Policy Optimization, which allowed OpenAI to beat some of the best Dota players.

We will also provide the necessary Deep Learning concepts for the course.

Instructor's Bio

Leonardo De Marchi holds a Master’s in Artificial Intelligence and has worked as a Data Scientist in the sports world, with clients such as the New York Knicks and Manchester United, and with large social networks like Justgiving.
He now works as Lead Data Scientist at Badoo, the largest dating site with over 360 million users. He is also the lead instructor at ideai.io, a company specializing in Deep Learning and Machine Learning training, and a contractor for the European Commission.

Leonardo De Marchi

Head of Data Science and Analytics at Badoo

Training: AI for Executives

Gain insight into how to drive success in data science. Identify key points in the machine learning life cycle where executive oversight really matters. Learn effective methods to help your team deliver better predictive models, faster. You’ll leave this seminar able to identify business challenges well suited for machine learning, with fully defined predictive analytics projects your team can implement now to improve operational results.

Instructor's Bio

John Boersma is Director of Education for DataRobot. In this role he oversees the company’s client training operations and relations with academic institutions using DataRobot in analytics courses. Previously, John founded and led Adapt Courseware, an adaptive online college curriculum venture. John holds a PhD in computational particle physics and an MBA in general management.

John Boersma, PhD

Director of Education at DataRobot

Training: Engineering For Data Science Part I

Practicing data scientists typically spend the bulk of their time developing models for a particular inference or prediction application, likely giving substantially less time to the equally complex problems stemming from system infrastructure. We might trivially think of these two often orthogonal concerns as the modeling problem and the engineering problem. The typical data scientist is trained to solve the former, often in an extremely rigorous manner, but can often wind up developing a series of ad hoc solutions to the latter.

This talk will discuss Docker as a tool for the data scientist, in particular in conjunction with the popular interactive programming platform, Jupyter, and the cloud computing platform, Amazon Web Services (AWS). Using Docker, Jupyter, and AWS, the data scientist can take control of their environment configuration, prototype scalable data architectures, and trivially clone their work toward replicability and communication. This talk will work toward developing a set of best practices for Engineering for Data Science.

Instructor's Bio

Joshua Cook is a mathematician. He writes code in Bash, C, and Python and has done pure and applied computational work in geospatial predictive modeling, quantum mechanics, semantic search, and artificial intelligence. He also has ten years of experience teaching mathematics at the secondary and post-secondary level. His research interests lie in high-performance computing, interactive computing, feature extraction, and reinforcement learning. He is always willing to discuss orthogonality or to explain why Fortran is the language of the future over a warm or cold beverage.

Joshua Cook

Curriculum Designer at Databricks

Training: Engineering For Data Science Part II

Practicing data scientists typically spend the bulk of their time developing models for a particular inference or prediction application, likely giving substantially less time to the equally complex problems stemming from system infrastructure. We might trivially think of these two often orthogonal concerns as the modeling problem and the engineering problem. The typical data scientist is trained to solve the former, often in an extremely rigorous manner, but can often wind up developing a series of ad hoc solutions to the latter.

This talk will discuss Docker as a tool for the data scientist, in particular in conjunction with the popular interactive programming platform, Jupyter, and the cloud computing platform, Amazon Web Services (AWS). Using Docker, Jupyter, and AWS, the data scientist can take control of their environment configuration, prototype scalable data architectures, and trivially clone their work toward replicability and communication. This talk will work toward developing a set of best practices for Engineering for Data Science.

Instructor's Bio

Joshua Cook is a mathematician. He writes code in Bash, C, and Python and has done pure and applied computational work in geospatial predictive modeling, quantum mechanics, semantic search, and artificial intelligence. He also has ten years of experience teaching mathematics at the secondary and post-secondary level. His research interests lie in high-performance computing, interactive computing, feature extraction, and reinforcement learning. He is always willing to discuss orthogonality or to explain why Fortran is the language of the future over a warm or cold beverage.

Joshua Cook

Curriculum Designer at Databricks

Training: Introduction to Data Science

Curious about Data Science? Self-taught on some aspects, but missing the big picture? Well, you’ve got to start somewhere, and this session is the place to do it. This session will cover, at a layman’s level, some of the basic concepts of data science. In a conversational format, we will discuss:
– What are the differences between Big Data and Data Science – and why aren’t they the same thing?
– What distinguishes descriptive, predictive, and prescriptive analytics?
– What purpose do predictive models serve in a practical context?
– What kinds of models are there and what do they tell us?
– What is the difference between supervised and unsupervised learning?
– What are some common pitfalls that turn good ideas into bad science?
During this session, attendees will learn the difference between k-nearest neighbors and k-means clustering, understand the reasons why we normalize and don’t overfit, and grasp the meaning of No Free Lunch; a brief illustration of the first distinction follows below.
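
As a minimal illustration of that distinction (synthetic data, purely for orientation): k-nearest neighbors is supervised and needs labels, while k-means is unsupervised and only sees the features:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: k-nearest neighbors learns from labeled examples (X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Unsupervised: k-means only sees X and discovers groupings on its own
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

print("kNN prediction for first point:", knn.predict(X[:1]))
print("k-means cluster for first point:", kmeans.predict(X[:1]))
```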

Instructor's Bio

Todd is a Data Science Evangelist at Data Robot. For more than 20 years, Todd has been highly respected as both a technologist and a trainer. As a tech, he has seen that world from many perspectives: “data guy” and developer; architect, analyst and consultant. As a trainer, he has designed and covered subject matter from operating systems to end-user applications, with an emphasis on data and programming. As a strong advocate for knowledge sharing, he combines his experience in technology and education to impart real-world use cases to students and users of analytics solutions across multiple industries. He is a regular contributor to the community of analytics and technology user groups in the Boston area, writes and teaches on many topics, and looks forward to the next time he can strap on a dive mask and get wet.

Todd Cioffi

Data Science Evangelist at DataRobot

Training

Instructor's Bio

Jeffrey is the Chief Data Scientist at AllianceBernstein, a global investment firm managing over $500 billion. He is responsible for building and leading the data science group, partnering with investment professionals to create investment signals using data science, and collaborating with sales and marketing teams to analyze clients. He graduated with a Ph.D. in economics from the University of Pennsylvania and has taught statistics, econometrics, and machine learning courses at UC Berkeley, Cornell, NYU, the University of Pennsylvania, and Virginia Tech. Previously, Jeffrey held advanced analytic positions at Silicon Valley Data Science, Charles Schwab Corporation, KPMG, and Moody’s Analytics.

Jeffrey Yau, PhD

Chief Data Scientist at AllianceBernstein

Training: Building AI-Based Emotion Detectors in Images and Text - A Hands-on Approach

Deep Learning has become ubiquitous in everyday software applications and services. A solid understanding of DL foundational principles is necessary for researchers and modern-day engineers alike to successfully adapt the state of the art research in DL to business applications.

In this workshop, we will cover the basics of Deep Learning and what deep learning can and cannot do. We will learn about applications where Deep Learning has achieved state-of-the-art results, namely in images and text.
The session will be a hands-on lab where attendees will use Apache MXNet to build an emotion detector for images; we will cover the basics of Convolutional Neural Networks applied to Computer Vision problems as we build the model.

Attendees will also build a model that detects emotions (sentiments) in text data; we will cover the basics of Recurrent Neural Networks, which are widely used to solve Natural Language Processing problems.

Attendees will learn how to apply state-of-the-art research to their own applications, along with best practices, tips, and tricks used by practitioners.
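
To give a flavor of the image half of the lab, here is a minimal convolutional network sketch in MXNet Gluon (the input size and the seven emotion classes are illustrative assumptions, not the lab's exact model):

```python
from mxnet import nd
from mxnet.gluon import nn

# Small CNN: convolution -> pooling -> dense classifier over emotion classes
net = nn.Sequential()
net.add(nn.Conv2D(channels=32, kernel_size=3, activation="relu"),
        nn.MaxPool2D(pool_size=2),
        nn.Flatten(),
        nn.Dense(7))                      # 7 emotion classes (assumed)
net.initialize()

x = nd.random.uniform(shape=(1, 1, 48, 48))   # dummy 48x48 grayscale face image
print(net(x).shape)                            # -> (1, 7) class scores
```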

Instructor's Bio

Naveen is a Senior Software Engineer and a member of Amazon AI at AWS, where he works on Apache MXNet. He began his career building large-scale distributed systems and has spent the last 10+ years designing and developing them. He has delivered various tech talks at AMLC, Spark Summit, and ApacheCon, and loves to share knowledge. His current focus is to make Deep Learning easily accessible to software developers without the need for a steep learning curve. In his spare time, he loves to read books, spend time with his family, and watch his little girl grow.

Naveen Swamy

Software Developer at Amazon AI – AWS

Training: A Deeper Stack For Deep Learning: Adding Visualisations And Data Abstractions to Your Workflow

In this training session I introduce a new layer of Python software, called ConX, which sits on top of Keras, which in turn sits on a backend (like TensorFlow). Do we really need a deeper stack of software for deep learning? Backends, like TensorFlow, can be thought of as “assembly language” for deep learning. Keras helps, but is more like “C++” for deep learning. ConX is designed to be “Python” for deep learning. So, yes, this layer is needed.

ConX is a carefully designed library that includes tools for network, weight, and activation visualizations; data and network abstractions; and an intuitive interactive and programming interface. Especially developed for the Jupyter notebook, ConX enhances the workflow of designing and training artificial neural networks by providing interactive visual feedback early in the process, and reducing cognitive load in developing complex networks.

This session will start small and move to advanced recurrent networks for images, text, and other data. Participants are encouraged to have samples of their own data so that they can explore a real and meaningful project.

A basic understanding of Python and a laptop are all that is required. Many example deep learning models will be provided in the form of Jupyter notebooks.

Documentation: https://conx.readthedocs.io/en/latest/

Instructor's Bio

Doug Blank is now a Senior Software Engineer at Comet.ML, a start-up in New York City. Comet.ML helps data scientists and engineers track, manage, replicate, and analyze machine learning experiments.

Doug was a professor of Computer Science for 18 years at Bryn Mawr College, a small, all-women’s liberal arts college outside of Philadelphia. He has been working on artificial neural networks for almost 30 years. His focus has been on creating models to make analogies, and for use with robot control systems. He is one of the core developers of ConX.

Douglas Blank, PhD

Senior Software Engineer | Comet.ML

Training: Integrating Pandas with Scikit-Learn, an Exciting New Workflow

For Python data scientists, a typical workflow consists of using Pandas for exploratory data analysis before turning to Scikit-Learn for machine learning. Pandas and Scikit-Learn arose independently, each focusing on their specific tasks, and were never specifically designed to be integrated together. There was never a clearly defined and standardized process for transitioning between the two libraries. This lack of a concrete handoff led to practitioners creating a variety of markedly different workflows to make this transition.

One of the main hurdles facing the Pandas to Scikit-Learn transition was the handling of string columns. Inputs to Scikit-Learn’s machine learning models only allow for numeric arrays. The common scenario of taking a Pandas DataFrame with string columns and converting it to an array of only numeric values was quite painful. Yet another hurdle was processing separate groupings of columns with separate functions.

With the recent release of Scikit-Learn version 0.20, many workflows will start looking similar. The brand new ColumnTransformer allows for direct Pandas integration with Scikit-Learn. It applies separate transformations to specific subsets of columns. The upgraded OneHotEncoder standardizes the encoding of string columns; before, it only encoded columns containing numeric categorical data.

In this hands-on tutorial, we will use these new additions to Scikit-Learn to build a modern, robust, and efficient workflow for those starting from a Pandas DataFrame. There will be ample practice problems and detailed notes available so that you can use it immediately upon completion.
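
As a preview of the workflow, here is a minimal sketch of the ColumnTransformer plus OneHotEncoder pattern described above (the tiny DataFrame and column names are purely illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "city": ["NYC", "Boston", "NYC", "Austin"],   # string column
    "age": [25, 32, 47, 51],                      # numeric column
    "bought": [1, 0, 1, 0],                       # target
})

# Apply separate transformations to separate subsets of columns
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("numeric", StandardScaler(), ["age"]),
])

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(df[["city", "age"]], df["bought"])
```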

Instructor's Bio

Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.

Ted Petrou

Founder at Dunder Data

Training: Data Visualization with R Shiny

Shiny—an innovative package for R users to develop web applications—makes it easier for R users to share results from their analyses visually with those not familiar with R.
I will offer an overview of the key ideas that will help you build simple yet robust Shiny applications and walk you through building data visualizations using the R Shiny web framework. You’ll learn how to use R to prepare data, run simple analyses, and display the results in Shiny web applications as you get hands-on experience creating effective and efficient data visualizations. Along the way, I will share best practices to make these applications suitable for production deployment.
Topics include:
– R basics for data preparation, analysis, and visualization
– The structure of a Shiny app
– Interactive elements and reactivity
– Customizing the user interface with HTML and CSS
– Best practices for Shiny app production deployment
– Shiny dashboards, R Markdown, and Shiny app sharing

Instructor's Bio

Alyssa Columbus is a Data Scientist at Pacific Life and a member of the Spring 2018 class of NASA Datanauts. Previously, she was a computational statistics and machine learning researcher at the Athena Breast Health Network and has built robust predictive models and applications for a diverse set of industries spanning retail to biologics. Alyssa is a strong proponent of reproducible methods, open source technologies, and diversity in tech. In her free time, she leads R-Ladies Irvine and Girl Scout STEM workshops.

Alyssa Columbus

Data Scientist at Pacific Life

Training: Good, Fast, Cheap: How to do Data Science with Missing Data

If you’ve never heard of the “good, fast, cheap” dilemma, it goes something like this: You can have something good and fast, but it won’t be cheap. You can have something good and cheap, but it won’t be fast. You can have something fast and cheap, but it won’t be good. In short, you can pick two of the three but you can’t have all three.

If you’ve done a data science problem before, I can all but guarantee that you’ve run into missing data. How do we handle it? Well, we can avoid, ignore, or try to account for missing data. The problem is, none of these strategies are good, fast, *and* cheap.

We’ll start by visualizing missing data and identify the three different types of missing data, which will allow us to see how they affect whether we should avoid, ignore, or account for the missing data. We will walk through the advantages and disadvantages of each approach as well as how to visualize and implement each approach. We’ll wrap up with practical tips for working with missing data and recommendations for integrating it with your workflow!
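
As a small taste of the "account for it" option, here is a minimal sketch using pandas to inspect missingness and scikit-learn's SimpleImputer for mean imputation (the toy DataFrame is purely illustrative; choosing an imputation strategy is exactly what the session digs into):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 47, 51],
                   "income": [40, 52, np.nan, 61]})

# Inspect the extent of missingness before deciding how to handle it
print(df.isnull().mean())          # fraction missing per column

# One way to "account for" missing data: simple mean imputation
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                       columns=df.columns)
print(imputed)
```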

Instructor's Bio

Matt currently leads instruction for GA’s Data Science Immersive in Washington, D.C. and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his Master’s degree in statistics from The Ohio State University. Matt is passionate about making data science more accessible and putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

Matt Brems

Global Lead Data Science Instructor at General Assembly

Training: Data Visualization: From Square One to Interactivity

As data scientists, we are expected to be experts in machine learning, programming, and statistics. However, our audiences might not be! Whether we’re working with peers in the office, trying to convince our bosses to take some sort of action, or communicating results to clients, there’s nothing more clear or compelling than an effective visual to make our point. Let’s leverage the Python libraries Matplotlib and Bokeh along with visual design principles to make our point as clearly and as compellingly as possible!

This talk is designed for a wide audience. If you haven’t worked with Matplotlib or Bokeh before or if you (like me!) don’t have a natural eye for visual design, that’s OK! This will be a hands-on training designed to make visualizations that best communicate what you want to communicate. We’ll cover different types of visualizations, how to generate them in Matplotlib, how to reduce clutter and guide your user’s eye, and how (and when!) to add interactivity with Bokeh.
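
To ground the Matplotlib portion, here is a minimal sketch of a chart that leads with its takeaway and strips non-data ink (the quarterly numbers are made up for illustration):

```python
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [1.2, 1.5, 1.4, 1.9]        # hypothetical values in $M

fig, ax = plt.subplots()
ax.plot(quarters, revenue, marker="o")
ax.set_title("Revenue grew steadily through the year")   # lead with the message
ax.set_ylabel("Revenue ($M)")
ax.spines["top"].set_visible(False)    # reduce clutter: drop unneeded borders
ax.spines["right"].set_visible(False)
plt.show()
```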

Instructor's Bio

Matt currently leads instruction for GA’s Data Science Immersive in Washington, D.C. and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his Master’s degree in statistics from The Ohio State University. Matt is passionate about making data science more accessible and putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

Matt Brems

Global Lead Data Science Instructor at General Assembly

Training: Building Better Machine Learning Models with R

With machine learning, it is often difficult to make the leap from the classroom to the real world. Practical applications present challenges that require more advanced approaches for preparing, exploring, modeling, and evaluating the data. Furthermore, in a business setting, the stakes are much higher: an unsuccessful model may allow a competitor to drink your milkshake.

The goal of this workshop is to introduce beginning data scientists to the practical machine learning techniques that are not often found in textbooks, but rather acquired only through hands-on experience. To approximate a cutthroat business environment, we will apply these practical techniques to a real-world case study and simulate a machine learning competition like those found on Kaggle (https://www.kaggle.com/). Example code will be provided in R, but the methods discussed are applicable to any data science toolkit.

In exploring the case study, attendees will learn how to properly evaluate a model, especially in an environment in which models are repeatedly tested to identify the best performer. Participants will consider the impact of noise and outliers, learn tips and tricks for feature engineering, see how to handle missing data, consider approaches for feature selection and dimensionality reduction, and learn more about the problem of imbalanced data. We will also learn from some of Kaggle’s top competitors and discover the ensemble methods and automated pipelines that allow them to produce the highest-performing models.

Instructor's Bio

Brett Lantz is a data scientist at the University of Michigan and the author of Machine Learning with R. While training as a sociologist, Brett was first enchanted by machine learning while studying a large database of teenagers’ social media profiles. Since then, he has applied his passion for data to projects that involve understanding human behavior, such as cell phone calling patterns, medical interventions, and philanthropic activity, among others. Brett shares his enthusiasm by teaching on DataCamp and by presenting at conferences and workshops around the world.

Brett Lantz

Senior Associate Director of Analytics at University of Michigan

Training: Introduction to RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Formatting Text, Creating Ordered and Unordered Lists, Making Links, Numbering Sections, Including a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard Layout, Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Build Shiny Apps: Shiny Inputs (Drop Downs, Text, Radio Buttons, Checkboxes), Shiny Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets (Interactive Plots, Interactive Maps, Interactive Tables), Shiny Layouts, UI and Server Files, User Interface

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s from Columbia University in statistics and a bachelor’s from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Author, R Programming Expert, Statistics Professor at Columbia University

Training: Intermediate RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Formatting Text, Creating Ordered and Unordered Lists, Making Links, Numbering Sections, Including a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard Layout, Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Build Shiny Apps: Shiny Inputs (Drop Downs, Text, Radio Buttons, Checkboxes), Shiny Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets (Interactive Plots, Interactive Maps, Interactive Tables), Shiny Layouts, UI and Server Files, User Interface

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s from Columbia University in statistics and a bachelor’s from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Author, R Programming Expert, Statistics Professor at Columbia University

Training: Advanced Machine Learning with scikit-learn Part I

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, advanced model evaluation, feature engineering, and working with imbalanced datasets. We will also work with text data using the bag-of-words method for classification.
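
For orientation, here is a minimal sketch of the bag-of-words text-classification pattern mentioned above (the tiny corpus is illustrative only):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

docs = ["great product", "terrible service", "loved it", "awful, would not buy"]
labels = [1, 0, 1, 0]

# Bag-of-words features feeding a linear classifier, wrapped in one pipeline
text_clf = make_pipeline(CountVectorizer(), LogisticRegression())
print(cross_val_score(text_clf, docs, labels, cv=2))
```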

This workshop assumes familiarity with Jupyter notebooks and basics of pandas, matplotlib and numpy. It also assumes some familiarity with the API of scikit-learn and how to do cross-validations and grid-search with scikit-learn.

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Lecturer, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Advanced Machine Learning with scikit-learn Part II

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some advanced topics in using scikit-learn, such as how to perform out-of-core learning with scikit-learn and how to speed up parameter search. We’ll also cover how to build your own models or feature extraction methods that are compatible with scikit-learn, which is important for feature extraction in many domains. We will see how we can customize scikit-learn even further, using custom methods for cross-validation or model evaluation.
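
As a hint of what "compatible with scikit-learn" means in practice, here is a minimal custom transformer sketch that plugs into a standard pipeline (the log-feature idea is purely illustrative):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

class LogFeatures(BaseEstimator, TransformerMixin):
    """Toy transformer: append log(1 + x) columns to the input."""
    def fit(self, X, y=None):
        return self                          # stateless: nothing to learn
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.hstack([X, np.log1p(X)])

X = np.random.rand(100, 3)
y = X.sum(axis=1)
model = make_pipeline(LogFeatures(), Ridge()).fit(X, y)   # works like any estimator
```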

This workshop assumes familiarity with Jupyter notebooks and basics of pandas, matplotlib and numpy. It also assumes experience using scikit-learn and familiarity with the API.

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Lecturer, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Artificial Intelligence in Finance

Artificial Intelligence (AI) is about to reshape finance and the financial industry. Many decisions in the industry are already made by algorithms, such as in stock trading, credit scoring, etc. However, most of these applications do not harness the capabilities of recent advances in the field of AI.

Today’s programmatic availability of basically all historical and real-time financial data, in combination with ever more powerful compute infrastructures, facilitates the application of even the most advanced and compute intensive algorithms from AI to financial problems. In that sense, finance already is data-driven to a large extent these days. And it will become an AI-first discipline in the near future.

The workshop provides some introductory background to AI in Finance. It then proceeds with the introduction to and application of different machine and deep learning algorithms to financial problems. The focus here lies on classification algorithms applied to the algorithmic trading of financial instruments. More specifically, the AI algorithms are used to create directional predictions about the future movements of financial prices.

The workshop uses Python and standard packages such as NumPy, pandas, scikit-learn, Keras/TensorFlow and matplotlib. Most of the coding will be presented based on Jupyter Notebooks.
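
To illustrate the kind of directional-prediction setup described above, here is a minimal sketch using synthetic prices, lagged log returns as features, and a logistic regression classifier (everything here is illustrative, not the workshop's actual strategy):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic price series standing in for real market data
rng = np.random.default_rng(0)
prices = pd.Series(np.cumprod(1 + rng.normal(0, 0.01, 500)))
returns = np.log(prices / prices.shift(1))

# Features: the five previous periods' returns; label: current direction (up = 1)
features = pd.concat([returns.shift(i) for i in range(1, 6)], axis=1).dropna()
direction = (returns.loc[features.index] > 0).astype(int)

clf = LogisticRegression().fit(features, direction)
print("in-sample directional accuracy:", clf.score(features, direction))
```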

Instructor's Bio

Dr. Yves J. Hilpisch is founder and managing partner of The Python Quants, a group focusing on the use of open source technologies for financial data science, artificial intelligence, algorithmic trading, and computational finance. He is also founder and CEO of The AI Machine, a company focused on harnessing the power of artificial intelligence for algorithmic trading via a proprietary strategy execution platform. He is the author of Python for Finance (2nd ed., O’Reilly) and of two other books: Derivatives Analytics with Python (Wiley, 2015) as well as Listed Volatility and Variance Derivatives (Wiley, 2017). Yves lectures on computational finance at the CQF Program and on algorithmic trading at the EPAT Program. He is also the director of the first online training program leading to a University Certificate in Python for Algorithmic Trading. Yves wrote the financial analytics library DX Analytics and organizes meetups, conferences, and bootcamps about Python for quantitative finance and algorithmic trading in London, Frankfurt, Berlin, Paris, and New York. He has given keynote speeches at technology conferences in the United States, Europe, and Asia.

Yves Hilpisch, PhD

Founder and Managing Partner at The Python Quants

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS

Deep learning is the newest area of machine learning and has become ubiquitous in predictive modeling. The complex, brainlike structure of deep learning models is used to find intricate patterns in large volumes of data. These models have heavily improved the performance of general supervised models, time series, speech recognition, object detection and classification, and sentiment analysis.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. For example, factorization machines can be used to recommend your next Netflix binge based on how you and other streamers rate content.

In this session, participants will use recurrent neural networks to analyze sequential data and improve the forecast performance of time series data, and use convolutional neural networks for image classification. Participants will also use a genetic algorithm to efficiently tune the hyperparameters of both deep learning models. Finally, students will use factorization machines to model the relationship between movies and viewers to make recommendations.
Demonstrations are provided in both R and Python, and will be administered from a Jupyter notebook. Students will use the open source SWAT package (SAS Wrapper for Analytics Transfer) to access SAS CAS (Cloud Analytic Services) in order to take advantage of the in-memory distributed environment. CAS provides a fast and scalable environment to build complex models and analyze big data by using algorithms designed for parallel processing.
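
Since the workshop's own code runs through SAS CAS via the SWAT package, the snippet below only illustrates the factorization machine idea itself in framework-free NumPy; the feature layout, latent dimension, and weights are made-up placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, k = 6, 3                   # e.g. one-hot user and item indicators; k latent factors

    w0 = 0.1                               # global bias
    w = rng.normal(size=n_features)        # per-feature linear weights
    V = rng.normal(size=(n_features, k))   # latent factor matrix

    def fm_predict(x):
        # Second-order factorization machine: bias + linear term + pairwise latent interactions.
        linear = w0 + w @ x
        interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
        return linear + interactions

    x = np.array([1, 0, 0, 0, 1, 0], dtype=float)   # one user and one movie, both one-hot encoded
    print(fm_predict(x))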

Instructor's Bio

Coming Soon

Robert Blanchard

Sr. Analytical Training Consultant at SAS

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS

Deep learning is the newest area of machine learning and has become ubiquitous in predictive modeling. The complex, brainlike structure of deep learning models is used to find intricate patterns in large volumes of data. These models have heavily improved the performance of general supervised models, time series, speech recognition, object detection and classification, and sentiment analysis.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. For example, factorization machines can be used to recommend your next Netflix binge based on how you and other streamers rate content.

In this session, participants will use recurrent neural networks to analyze sequential data and improve the forecast performance of time series data, and use convolutional neural networks for image classification. Participants will also use a genetic algorithm to efficiently tune the hyperparameters of both deep learning models. Finally, students will use factorization machines to model the relationship between movies and viewers to make recommendations.
Demonstrations are provided in both R and Python, and will be administered from a Jupyter notebook. Students will use the open source SWAT package (SAS Wrapper for Analytics Transfer) to access SAS CAS (Cloud Analytic Services) in order to take advantage of the in-memory distributed environment. CAS provides a fast and scalable environment to build complex models and analyze big data by using algorithms designed for parallel processing.

Instructor's Bio

Coming Soon

Jordan Bakerman, PhD

Analytical Training Consultant at SAS

Training: Introduction to building a distributed neural network on Apache Spark with BigDL and Analytics Zoo

In this training session you will get hands-on experience developing neural networks with Intel BigDL and Analytics Zoo on Apache Spark. You will learn how to use Spark DataFrames and build deep learning pipelines by implementing practical examples.
Target audience: AI developers and aspiring data scientists experienced in Python and Spark, as well as big data and analytics professionals interested in neural networks.

Prerequisites:
• Experience in Python programming
• Entry-level knowledge of Apache Spark
• Basic knowledge of deep learning and techniques in deep learning

Training outline:

Introduction to Deep Learning on Spark, BigDL and Analytics Zoo – 25 minutes
We will begin with a brief introduction to Apache Spark and the Machine Learning/Deep Learning ecosystem around Spark. Then we will introduce Intel BigDL and Analytics Zoo, two deep learning libraries for Apache Spark. We will go into the architectural details of how distributed training happens in BigDL. We will cover the model training process, including how the model, weights and gradients are distributed, calculated, updated and shared with Apache Spark.

Setting Up Sample Environment – 10 minutes
The instructors will highlight the major components of our demonstration environment, including the dataset, docker container and example code along with the public location of these resources and how to set them up.

Exercise 1 – Quick and simple image recognition use case with BigDL – 45 minutes
We will work through a simple image recognition use case that trains a CNN. The goal of this exercise is a simple introduction to using BigDL with image datasets; a minimal PySpark sketch of the data-loading side follows the list below. Participants will get exposure to:
• How to read images into Spark DataFrames
• Building transformation pipelines for images with Spark
• How to train a deep learning model using estimators
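
As an assumption about the setup rather than the instructors' BigDL/Analytics Zoo code, the first two bullets can be sketched with plain PySpark; the input path and the label rule below are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("image-ingest").getOrCreate()

    # Spark 2.4+ ships an image data source; the path is a placeholder.
    images = spark.read.format("image").load("/data/images/")

    # Each row carries origin, dimensions, and raw bytes in a nested 'image' struct.
    prepared = (images
                .select("image.origin", "image.height", "image.width", "image.data")
                .withColumn("label", F.when(F.col("origin").contains("cat"), 1).otherwise(0)))

    prepared.show(5, truncate=False)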

Exercise 2 – Transfer Learning for Image Classification Models – 45 minutes
Participants will get exposure to:
• How to build a pipeline in Spark to preprocess images
• How to import a trained model from other frameworks like TensorFlow
• How to implement transfer learning on the imported model with the preprocessed images

Quick break: Answer questions or help out anyone who is having trouble – 10 minutes

Exercise 3 – Anomaly Detection or Recommendation system with Intel Analytics Zoo – 30 minutes
In this exercise we will show participants:
• How to build an initial pipeline for feature transformation
• How to build a recommendation model in BigDL/Analytics Zoo
• How to perform training and inference for this use case

Exercise 4 – Model Serving – 15 minutes
In this exercise we will show participants how to build an end-to-end pipeline and put their model into production. They will get exposure to:
• Model serving using POJO API
• Integration into web services and streaming services like Kafka for model inference
• Distributed model inference

Practical Knowledge – Discussion of practical experience using Spark and Hadoop for machine learning and deep learning projects – 15 minutes
We will have a discussion on the following topics:
• Spark parameters and how to set them: how to allocate the right number of executors, cores, and memory
• Performance Monitoring
• Tensorboard with BigDL
• Collaboration and reproducing experiments with a data science workbench tool.

Wrapping up / Questions – 15 minutes

Instructor's Bio

Bala Chandrasekaran is a Technical Staff Engineer at Dell Technologies, where he is responsible for building machine learning and deep learning infrastructure solutions. He has over 15 years of experience in the areas of high performance computing, virtualization infrastructure, cloud computing and big data.

Bala Chandrasekaran

Technical Staff at Dell Technologies

Training: Building Generative Adversarial Networks in Tensorflow and Keras

Coming soon

Instructor's Bio

Coming soon

Sophie Searcy

Sr. Data Scientist at Metis

Training: Modeling Volatility Trading Using Econometrics and Machine Learning in Python

How can market volatility be predicted, and what are the differences between heuristic models, econometric models, and data science/machine learning models? This workshop distills lessons learned from econometric modeling in finance into a training course, with an example project that compares the performance of turbulence, GARCH, and blender algorithms. Particular focus is placed on framing the problem and using the right tools for volatility modeling. It is aimed at entry-level finance quants who want a refresher on Python techniques and at non-finance quants looking to make the leap into financial modeling.
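
For readers who want a feel for the econometric side in advance, a minimal GARCH(1,1) fit with the Python arch package (one reasonable tool choice, not necessarily the one used in class; the data file is a placeholder) looks roughly like this:

    import numpy as np
    import pandas as pd
    from arch import arch_model

    # Placeholder data: any series of daily prices will do.
    prices = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")
    returns = 100 * np.log(prices["close"]).diff().dropna()

    # GARCH(1,1) with a constant mean; the workshop compares this family
    # against turbulence and blended ("blender") approaches.
    model = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1)
    result = model.fit(disp="off")

    print(result.summary())
    forecast = result.forecast(horizon=5)
    print(forecast.variance.iloc[-1])   # conditional variance forecast for the next 5 days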

Instructor's Bio

Stephen Lawrence is the Head of Investment Management Fintech Data Science at The Vanguard Group. He oversees the integration of new structured and unstructured data sources into the investment process, leveraging a blend of NLP and predictive analytics. Prior to joining Vanguard, Dr. Lawrence was Head of Quantextual Research at State Street Bank, where he led a machine learning product team. Prior to that he led FX and macro flow research for State Street Global Markets. Stephen holds a B.A. in Mathematics from the University of Cambridge and a Ph.D. in Finance from Boston College. He is also a TED speaker with a 2015 talk titled “The future of reading: it’s fast”.

Stephen Lawrence, PhD

Head of Investment Management Fintech Data Science at Vanguard

Training: Modeling Volatility Trading Using Econometrics and Machine Learning in Python

How can market volatility be predicted, and what are the differences between heuristic models, econometric models, and data science/machine learning models? This workshop distills lessons learned from econometric modeling in finance into a training course, with an example project that compares the performance of turbulence, GARCH, and blender algorithms. Particular focus is placed on framing the problem and using the right tools for volatility modeling. It is aimed at entry-level finance quants who want a refresher on Python techniques and at non-finance quants looking to make the leap into financial modeling.

Instructor's Bio

Coming soon

Eunice Hameyie-Sanon

Sr. Data Scientist – Investment Management Fintech Strategies at Vanguard

Training: Engineering a Performant Machine Learning Pipeline And Kubeflow

The lifecycle of any machine learning model, regular or deep, consists of (a) the pre-processing/transformation/augmenting of data, (b) the training of the model with different hyper-parameter values/learning rates, and (c) the computing of results on new data/test sets. Whether you are using transfer learning or a from-scratch model, this process requires a large amount of computation, management of your experimental process, and the quick perusal of results from your experiments. In this workshop, we will learn how to combine off-the-shelf clustering software such as Kubernetes and Dask with learning systems such as TensorFlow/PyTorch/scikit-learn, on cloud infrastructure such as AWS/Google Cloud/Azure, to construct a machine-learning system for your data science team. We’ll start with an understanding of Kubernetes, move on to analysis pipelines in sklearn and Dask, finally arriving at Kubeflow. Participants should install minikube on their laptops (https://kubernetes.io/docs/tasks/tools/install-minikube/), and create accounts on Google Cloud.
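
As one small, self-contained piece of that stack, scikit-learn work can be fanned out over a Dask cluster through the joblib backend. This sketch assumes dask.distributed and joblib are installed; the scheduler address and search space are placeholders, and the Kubernetes/Kubeflow wiring is what the workshop itself covers.

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # A local cluster is created by default; point Client(...) at a scheduler
    # address such as "tcp://scheduler:8786" to use a remote Dask cluster.
    client = Client()

    X, y = load_digits(return_X_y=True)
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-4]}, cv=3)

    # Dispatch the cross-validation fits to Dask workers instead of local processes.
    with joblib.parallel_backend("dask"):
        search.fit(X, y)

    print(search.best_params_, search.best_score_)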

Instructor's Bio

Rahul Dave is a lecturer in Bayesian Statistics and Machine Learning at Harvard University, and consults on the same topics at LxPrior. He holds a Ph.D. from the University of Pennsylvania in Computational Astrophysics, and has programmed device drivers for telescopes, bespoke databases for astrophysical data, and machine learning systems in various fields. His new startup, univ.ai, helps students and companies upgrade the skills and understanding of both their developers and managers for this new AI-driven world, by providing both corporate training and consulting.

Dr. Rahul Dave

Chief Scientist at univ.ai, lxprior.com and Harvard University

Workshop Sessions


Workshop: Real-time Anomaly Detection in Surveillance Feeds

Rapid advances in surveillance infrastructure, coupled with tremendous progress in computer vision and pattern recognition, have enabled us to capture normal and anomalous events at scale. However, the issue of timely response to potentially threatening situations is still a problem at large. Various challenges such as low-quality feeds, occlusion, clutter, lack of training data, and adversarial attacks make it extremely hard for a network to achieve the desired and timely accuracy and performance, leading to hazardous situations that could potentially have been avoided. In this workshop, we study state-of-the-art approaches to tackle this problem and examine their capabilities and limitations. Furthermore, we present the results of several experiments conducted to tackle this challenge from supervised, unsupervised, generative, and reinforcement perspectives. We hope to present these results as an enabler for future work in this area.

Instructor's Bio

Utkarsh Contractor is the Director of AI at Aisera, where he leads the data science team working on machine learning and artificial intelligence applications in the fields of Natural Language Processing and Vision. He is also pursuing his graduate degree at Stanford University, focusing his research and experiments on computer vision, using CNNs to analyze surveillance scene imagery and footage. Utkarsh has a decade of industry experience in Information Retrieval and Machine Learning, working at companies such as LinkedIn and AT&T Labs.

Utkarsh Contractor

ML and AI Director at Aisera Inc.

Workshop: Introduction to Natural Language Processing in Healthcare

Healthcare is an industry that is greatly benefiting from data science and machine learning. To successfully build predictive models, healthcare data scientists must extract and combine data of various types (numerical, categorical, text, and/or images) from electronic medical records. Unfortunately, many clinical signs and symptoms (e.g. coughing, vomiting, or diarrhea) are often not captured with numerical data and are usually only present in the clinical notes of physicians and nurses.

In this workshop, the audience will build a machine learning model to predict unplanned hospital readmission with discharge summaries using the MIMIC III data set. Throughout the tutorial, the audience will have the opportunity to prepare data for a machine learning project, preprocess unstructured notes using a bag-of-words approach, build a simple predictive model, assess the quality of the model and strategize how to improve the model. Note to the audience: the MIMIC III data set requires requesting access in advance, so please request access as early as possible.
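
A minimal sketch of the bag-of-words approach described above is shown below; the file name and column names are placeholders, and MIMIC-III access, cohort construction, and the full feature set are handled in the workshop itself.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Placeholder file: one row per admission, with a discharge summary and a
    # binary label for unplanned readmission within 30 days.
    notes = pd.read_csv("discharge_summaries.csv")

    X_train, X_test, y_train, y_test = train_test_split(
        notes["text"], notes["readmitted"], test_size=0.2,
        stratify=notes["readmitted"], random_state=0)

    vectorizer = CountVectorizer(max_features=5000, stop_words="english")
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")

    clf.fit(vectorizer.fit_transform(X_train), y_train)
    probs = clf.predict_proba(vectorizer.transform(X_test))[:, 1]
    print("AUC:", roc_auc_score(y_test, probs))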

Instructor's Bio

Andrew Long is a Data Scientist at Fresenius Medical Care North America (FMCNA). Andrew holds a PhD in biomedical engineering from Johns Hopkins University and a Master’s degree in mechanical engineering from Northwestern University. Andrew joined FMCNA last year after participating in the Insight Health Data Fellows Program. At FMCNA, he is responsible for building predictive models using machine learning to improve the quality of life of every patient who receives dialysis from FMCNA. He is currently creating a model to predict which patients are at the highest risk of imminent hospitalization.

Andrew Long, PhD

Data Scientist at Fresenius Medical Care

Workshop: Making Data Science: AIG, Amazon, Albertsons

Developing an internal data science capability requires a cultural shift, a strategic mapping process that aligns with existing business objectives, a technical infrastructure that can host new processes, and an organizational structure that can alter business practice to create measurable impact on business functions. This workshop will take you through ways to consider the vast opportunities for data science, to identify and prioritize what will add the most value to your organization, and then to budget and hire to meet those commitments. Learn the most effective ways to establish data science objectives from a business perspective, including recruiting, retention, goaling, and improving the business.

Instructor's Bio

Haftan Eckholdt, PhD, is Chief Data Science Officer at Plated. His career began with research professorships in Neuroscience, Neurology, and Psychiatry, followed by industrial research appointments at companies like Amazon and AIG. He holds graduate degrees in Biostatistics and Developmental Psychology from Columbia and Cornell Universities. In his spare time he thinks about things like chess and cooking and cross country skiing and jogging and reading. When things get really really busy, he actually plays chess and cooks delicious meals and jogs a lot. Born and raised in Baltimore, Haftan has been a resident of Kings County, New York since the late 1900’s.

Haftan Eckholdt, PhD

Chief Data Science & Chief Science Officer at Understood.org

Workshop: Pomegranate: fast and flexible probabilistic modeling in Python

Instructor's Bio

Jacob Schreiber is a fifth-year Ph.D. student and NSF IGERT big data fellow in the Computer Science and Engineering department at the University of Washington. His primary research focus is the application of machine learning methods, primarily deep learning ones, to the massive amount of data being generated in the field of genome science. His research projects have involved using convolutional neural networks to predict the three-dimensional structure of the genome and using deep tensor factorization to learn a latent representation of the human epigenome. He routinely contributes to the Python open source community, currently as the core developer of the pomegranate package for flexible probabilistic modeling, and in the past as a developer for the scikit-learn project. Future projects include graduating.

Jacob Schreiber

PhD Candidate at University of Washington

Instructor's Bio

Laura Norén is a data science ethicist and researcher currently working in cybersecurity at Obsidian Security in Newport Beach. She holds undergraduate degrees from MIT and a PhD from NYU, where she recently completed a postdoc in the Center for Data Science. Her work has been covered in The New York Times, Canada’s Globe and Mail, and American Public Media’s Marketplace program, as well as in numerous academic journals and international conferences. Dr. Norén is a champion of open source software and those who write it.

Laura Norén, PhD

Director of Research, Professor at Obsidian Security, NYU Stern School of Business

Workshop: Real-ish Time Predictive Analytics with Spark Structured Streaming

In this workshop we will dive deep into what it takes to build and deliver an always-on “real-ish time” predictive analytics pipeline with Spark Structured Streaming.

The core focus of the workshop material will be on how to solve a common complex problem in which we have no labeled data in an unbounded time series dataset and need to understand the substructure of said chaos in order to apply common supervised and statistical modeling techniques to our data in a streaming fashion.

The example problem for the workshop will come from the telecommunications space but the skills you will leave with can be applied to almost any domain as long as you sprinkle in a little creativity and inject a bit of domain knowledge.
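
To ground the "real-ish time" idea, here is a tiny Structured Streaming skeleton; it uses Spark's built-in rate source as a stand-in for the telecom feed, which is our assumption rather than part of the workshop materials.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("realish-time").getOrCreate()

    # The rate source emits (timestamp, value) rows; swap in Kafka for real feeds.
    events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

    # Windowed aggregation: the same place a fitted model's scoring logic would sit.
    counts = (events
              .withWatermark("timestamp", "30 seconds")
              .groupBy(F.window("timestamp", "10 seconds"))
              .agg(F.avg("value").alias("avg_value"), F.count("*").alias("n")))

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .option("truncate", False)
             .start())

    query.awaitTermination()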

Skills Acquired:
1. Structured Streaming experience with Apache Spark.
2. Understand how to use supervised modeling techniques on unsupervised data (caveat: requires some domain knowledge and the good ol’ human touch).
3. Have fun for 90 minutes.

Instructor's Bio

Scott Haines is a Principal Software Engineer / Tech Lead on the Voice Insights team at Twilio. His focus has been on the architecture and development of a real-time (sub-250ms), highly available, trustworthy analytics system. His team provides near real-time analytics that processes, aggregates, and analyzes multiple terabytes of global sensor data daily. Scott helped drive Apache Spark adoption at Twilio and actively teaches and consults with teams internally. Previously, at Yahoo!, Scott built a real-time recommendation engine and targeted ranking/ratings analytics that helped serve personalized page content for millions of customers of Yahoo Games, and he built a real-time click/install tracking system that helped deliver customized push marketing and ad attribution for Yahoo Sports. He finished his tenure at Yahoo working for Flurry Analytics, where he wrote an auto-regressive smart alerting and notification system that was integrated into the Flurry mobile app for iOS/Android.

Scott Haines

Principal Software Engineer at Twilio

Workshop: Mastering Gradient Boosting with CatBoost

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results
in a variety of practical tasks. For a number of years, it has remained the primary method for
learning problems with heterogeneous features, noisy data, and complex dependencies: web search,
recommendation systems, weather forecasting, and many others.

CatBoost (http://catboost.yandex) is a popular open-source gradient boosting library with a whole set of advantages:
1. CatBoost is able to incorporate categorical features in your data (like music genre or city) with no additional preprocessing.
2. CatBoost has the fastest GPU and multi-GPU training implementations of all the openly available gradient boosting libraries.
3. CatBoost predictions are 20-60 times faster than in other open-source gradient boosting libraries, which makes it possible to use CatBoost for latency-critical tasks.
4. CatBoost has a variety of tools to analyze your model.

This workshop will feature a comprehensive tutorial on using the CatBoost library; a minimal code sketch follows the topic list below.
We will walk you through all the steps of building a good predictive model.
We will cover such topics as:
– Working with different types of features, numerical and categorical
– Working with imbalanced datasets
– Using cross-validation
– Understanding feature importances and explaining model predictions
– Tuning parameters of the model
– Speeding up the training.
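
The sketch below shows the basic shape of a CatBoost fit with mixed feature types; the dataset, column names, and hyperparameter values are placeholders rather than the workshop's materials.

    import pandas as pd
    from catboost import CatBoostClassifier, Pool
    from sklearn.model_selection import train_test_split

    # Placeholder dataset with a mix of numerical and categorical columns.
    df = pd.read_csv("train.csv")
    cat_features = ["city", "music_genre"]        # categorical columns, used as-is
    X, y = df.drop(columns=["target"]), df["target"]

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

    model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=100)
    model.fit(
        Pool(X_train, y_train, cat_features=cat_features),
        eval_set=Pool(X_valid, y_valid, cat_features=cat_features),
        use_best_model=True,
    )

    # Quick look at per-feature importances from the trained model.
    print(dict(zip(X.columns, model.get_feature_importance())))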

Instructor's Bio

Anna Veronika Dorogush graduated from the Faculty of Computational Mathematics and Cybernetics of Lomonosov Moscow State University and from the Yandex School of Data Analysis. She previously worked at ABBYY, Microsoft, Bing and Google, and has been working at Yandex since 2015, where she currently heads the Machine Learning Systems group and leads the development of the CatBoost library.

Anna Veronika Dorogush

ML Lead at Yandex

Instructor's Bio

Yunus Genes completed his Master’s in Computer Science and is continuing his part-time PhD at the University of Central Florida. His research is focused on applied machine learning, social media behavior, and misinformation detection/diffusion. He has been working in this field for over 4 years. He is currently working for Royal Caribbean Cruise Line LTD. He has previously held data science positions in Silicon Valley as well as in the Orlando, Florida area.

Yunus Genes, PhD

Data Scientist at Royal Caribbean

Workshop: Deciphering the Black Box: Latest Tools and Techniques for Interpretability

This workshop shows how interpretability tools can give you not only more confidence in a model, but also help to improve model performance. Through this interactive workshop, you will learn how to better understand the models you build, along with the latest techniques and many tricks of the trade around interpretability. The workshop will largely focus on interpretability techniques, such as feature importance and partial dependence, and explanation approaches, such as LIME and SHAP.
The workshop will demonstrate interpretability techniques with notebooks, some in R and some in Python. Along the way, the workshop will consider spurious correlation, random effects, multicollinearity, reproducibility, and other issues that may affect model interpretation and performance. To illustrate the points, the workshop will use easy-to-understand examples and references to open source tools.
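
As a small taste of the explanation tooling mentioned above, SHAP values for a tree-based model can be computed in a few lines; the dataset and model here are illustrative stand-ins, not the workshop's notebooks.

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

    # TreeExplainer computes exact SHAP values efficiently for tree ensembles.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Global importance view: mean |SHAP| per feature across the test set.
    shap.summary_plot(shap_values, X_test, plot_type="bar")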

Instructor's Bio

Rajiv Shah is a data scientist at DataRobot, where his primary focus is helping customers improve their ability to make and implement predictions. Previously, Rajiv has been part of data science teams at Caterpillar and State Farm. He has worked on a variety of projects from a wide-ranging set of areas including supply chain, sensor data, actuarial ratings, and security. He has a PhD from the University of Illinois at Urbana-Champaign.

Rajiv Shah, PhD

Data Scientist at DataRobot

Workshop: Building an Open Source Streaming Analytics Solution with Kafka and Druid

The maturation and development of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this talk, we will cover how data analytic stacks have evolved from data warehouses, to data lakes, and to more modern streaming analytics stack. We will also discuss building such a stack using Apache Kafka and Apache Druid.

Analytics pipelines running purely on Hadoop can suffer from hours of data lag. Initial attempts to solve this problem often lead to inflexible solutions, where the queries must be known ahead of time, or fragile solutions where the integrity of the data cannot be assured. Combining Hadoop with Kafka and Druid can guarantee system availability, maintain data integrity, and support fast and flexible queries.

In the described system, Kafka provides a fast message bus and is the delivery point for machine-generated event streams. Kafka Streams can be used to manipulate data before loading it into Druid. Druid provides flexible, highly available, low-latency queries.

This talk is based on our real-world experience building out such a stack for many use cases across many industries.

Instructor's Bio

Fangjin is a co-author of the open source Druid project and a co-founder of Imply, a San Francisco based technology company. Fangjin previously held senior engineering positions at Metamarkets and Cisco. He holds a BASc in Electrical Engineering and a MASc in Computer Engineering from the University of Waterloo, Canada.

Fangjin Yang

Core Contributor to Druid | CEO at Imply.io

Workshop: Reproducible Data Science Using Orbyter

Artificial Intelligence is already helping many businesses become more responsive and competitive, but how do you move machine learning models efficiently from research to deployment at enterprise scale? It is imperative to plan for deployment from day one, both in tool selection and in the feedback and development process. Additionally, just as DevOps is about people working at the intersection of development and operations, there are now people working at the intersection of data science and software engineering who need to be integrated into the team with tools and support.

At Manifold, we’ve developed the Lean AI process to streamline machine learning projects and the open-source Orbyter package for Docker-first data science to help your engineers work as an integrated part of your development and production teams. In this workshop, Sourav and Alex will focus heavily on the DevOps side of things, demonstrating how to use Orbyter to spin up data science containers and discussing experiment management as part of the Lean AI process.

Instructor's Bio

As CTO for Manifold, Sourav is responsible for the overall delivery of data science and data product services to make clients successful. Before Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google / Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He earned his PhD, MS, and BS degrees from MIT in Electrical Engineering and Computer Science.

Sourav Dey, PhD

CTO at Manifold

Workshop: Reproducible Data Science Using Orbyter

Artificial Intelligence is already helping many businesses become more responsive and competitive, but how do you move machine learning models efficiently from research to deployment at enterprise scale? It is imperative to plan for deployment from day one, both in tool selection and in the feedback and development process. Additionally, just as DevOps is about people working at the intersection of development and operations, there are now people working at the intersection of data science and software engineering who need to be integrated into the team with tools and support.

At Manifold, we’ve developed the Lean AI process to streamline machine learning projects and the open-source Orbyter package for Docker-first data science to help your engineers work as an integrated part of your development and production teams. In this workshop, Sourav and Alex will focus heavily on the DevOps side of things, demonstrating how to use Orbyter to spin up data science containers and discussing experiment management as part of the Lean AI process.

Instructor's Bio

Alexander Ng is a Senior Data Engineer at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Alex served as both a Sales Engineering Tech Lead and a DevOps Tech Lead for Kyruus, a startup that built SaaS products for enterprise healthcare organizations. Alex got his start as a Software Systems Engineer at the MITRE Corporation and the Naval Undersea Warfare Center in Newport, RI. His recent projects at the intersection of systems and machine learning continue to combine a deep understanding of the entire development lifecycle with cutting-edge tools and techniques. Alex earned his Bachelor of Science degree in Electrical Engineering from Boston University, and is an AWS Certified Solutions Architect.

Alex Ng

Senior Data Engineer at Manifold

Workshop: Mapping Geographic Data in R

Our customers, store locations, constituents, research subjects, patients, crime locations, traffic accidents, and events don’t exist in a vacuum – their geographic locations provide important information that can best be displayed visually. Often, we have datasets that give location data in various ways: zip code, census tract or block, street address, latitude / longitude, congressional district, etc. Combining these data and getting critical insight into the relationships between them requires a bit of data munging skills that go beyond the basic data analysis we use for traditional tabular data.

There are many free, publicly available datasets about environmental exposures, socioeconomic status, climate, public safety events, and more that are linked to a geographic point or area, and this wealth of information gives us the opportunity to enrich our proprietary data with greater insight.

Whether it’s comparing the number of asthma-related visits to the hospital with air quality data or looking at socioeconomic data and correlating it to commuting patterns, the presentation of geographic data in maps helps accelerate the transformation of raw data to actionable information. Maps are a well-known data idiom that are ideal for presenting complex data in an approachable way for non-technical stakeholders like policymakers, executives, and the press.

In this hands-on workshop, we will use R to take public data from various sources and combine them to find statistically interesting patterns and display them in static and dynamic, web-ready maps. This session will cover topics including geojson and shapefiles, how to munge Census Bureau data, geocoding street addresses, transforming latitude and longitude to the containing polygon, and data visualization principles.

Participants will leave this workshop with a publication-quality data product and the skills to apply what they’ve learned to data in their field or area of interest. Participants should have R and RStudio installed and have a basic understanding of how to use R and R Markdown for basic data ingestion and analysis. Ideally, participants will install the following packages prior to the workshop: tidyverse, leaflet, jsonlite, ggplot2, maptools, sp, rgdal, rgeos, scales, tmap.

Instructor's Bio

Joy Payton is a data scientist and data educator at the Children’s Hospital of Philadelphia (CHOP), where she helps biomedical researchers learn the reproducible computational methods that will speed time to science and improve the quality and quantity of research conducted at CHOP. A longtime open source evangelist, Joy develops and delivers data science instruction on topics related to R, Python, and git to an audience that includes physicians, nurses, researchers, analysts, developers, and other staff. Her personal research interests include using natural language processing to identify linguistic differences in a neurodiverse population as well as the use of government open data portals to conduct citizen science that draws attention to issues affecting vulnerable groups. Joy holds a degree in philosophy and math from Agnes Scott College, a divinity degree from the Universidad Pontificia de Comillas (Madrid), and a data science Masters from the City University of New York (CUNY).

Joy Payton

Supervisor, Data Education at Children’s Hospital of Philadelphia

Workshop: Synthesizing Data Visualization and User Experience

The wealth of data available offers unprecedented opportunities for discovery and insight. How do we design a more intuitive and useful data experience? This workshop focuses on approaches to turn data into actionable insights by combining principles from data visualization and user experience design. Participants will be asked to think holistically about data visualizations and the people they serve. Through presentations and hands-on exercises, participants will learn how to choose and create data visualizations driven by user-oriented objectives.

Instructor's Bio

Bang Wong is the creative director of the Broad Institute of MIT and Harvard and an adjunct assistant professor in the Department of Art as Applied to Medicine at the Johns Hopkins University School of Medicine. His work focuses on developing strategies to meet the analytical challenges posed by the unprecedented volume, resolution, and variety of data in biomedical research.

Bang Wong

Creative Director at Broad Institute of MIT and Harvard

Workshop: Synthesizing Data Visualization and User Experience

The wealth of data available offers unprecedented opportunities for discovery and insight. How do we design a more intuitive and useful data experience? This workshop focuses on approaches to turn data into actionable insights by combining principles from data visualization and user experience design. Participants will be asked to think holistically about data visualizations and the people they serve. Through presentations and hands-on exercises, participants will learn how to choose and create data visualizations driven by user-oriented objectives.

Instructor's Bio

Mark Schindler is co-founder and Managing Director of GroupVisual.io. For over 15 years, he has designed user-interfaces for analytic software products and mobile apps for clients ranging from Fortune 50 companies to early-stage startups. In addition to design services, Mark and his team mentor startup companies and conduct workshops on data visualization, analytics and user-experience design.

Mark Schindler

Co-founder, Managing Director at GroupVisual.io

Workshop: Scaling AI Applications with Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from running interactively to production clusters. In this talk we introduce Ray, a high-performance distributed execution engine, and its libraries for AI workloads. We cover each Ray library in turn, and also show how the Ray API allows these traditionally separate workflows to be composed and run together as one distributed application.

Ray is an open source project being developed at the RISE Lab in UC Berkeley for scalable hyperparameter optimization, distributed deep learning, and reinforcement learning. We focus on the following libraries in this tutorial:

TUNE: Tune is a scalable hyperparameter optimization framework for reinforcement learning and deep learning. Go from running one experiment on a single machine to running on a large cluster with efficient search algorithms without changing your code. Unlike existing hyperparameter search frameworks, Tune targets long-running, compute-intensive training jobs that may take many hours or days to complete, and includes many resource-efficient algorithms designed for this setting.
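
A toy Tune run shows the shape of the API; Ray's interface has shifted across versions, so treat this as a sketch for a recent 1.x release rather than the tutorial's exact code.

    from ray import tune

    def objective(config):
        # Stand-in for a real training loop; report a metric back to Tune.
        score = (config["x"] - 3) ** 2
        tune.report(loss=score)

    analysis = tune.run(
        objective,
        config={"x": tune.grid_search([0, 1, 2, 3, 4])},
    )
    print(analysis.get_best_config(metric="loss", mode="min"))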

RLLIB: RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones. In this tutorial we discuss using RLlib to tackle both classic benchmark and applied problems, RLlib’s primitives for scalable RL, and how RL workflows can be integrated with data processing and hyperparameter optimization.
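
And the corresponding RLlib flavor, launching a PPO run through Tune on a standard Gym benchmark; again this is a version-dependent sketch and the environment, worker count, and stopping rule are illustrative.

    from ray import tune

    # "PPO" resolves to RLlib's PPO trainer; stop after a fixed number of iterations.
    tune.run(
        "PPO",
        config={"env": "CartPole-v0", "num_workers": 2, "framework": "torch"},
        stop={"training_iteration": 20},
    )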

Instructor's Bio

Richard Liaw is a PhD student in BAIR/RISELab at UC Berkeley working with Joseph Gonzalez, Ion Stoica, and Ken Goldberg. He has worked on a variety of different areas, ranging from robotics to reinforcement learning to distributed systems. He is currently actively working on Ray, a distributed execution engine for AI applications; RLlib, a scalable reinforcement learning library; and Tune, a distributed framework for model training.

Richard Liaw

AI Researcher, RISELab at UC Berkeley

Workshop: Scaling AI Applications with Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from running interactively to production clusters. In this talk we introduce Ray, a high-performance distributed execution engine, and its libraries for AI workloads. We cover each Ray library in turn, and also show how the Ray API allows these traditionally separate workflows to be composed and run together as one distributed application.

Ray is an open source project being developed at the RISE Lab in UC Berkeley for scalable hyperparameter optimization, distributed deep learning, and reinforcement learning. We focus on the following libraries in this tutorial:

TUNE: Tune is a scalable hyperparameter optimization framework for reinforcement learning and deep learning. Go from running one experiment on a single machine to running on a large cluster with efficient search algorithms without changing your code. Unlike existing hyperparameter search frameworks, Tune targets long-running, compute-intensive training jobs that may take many hours or days to complete, and includes many resource-efficient algorithms designed for this setting.

RLLIB: RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones. In this tutorial we discuss using RLlib to tackle both classic benchmark and applied problems, RLlib’s primitives for scalable RL, and how RL workflows can be integrated with data processing and hyperparameter optimization.

Instructor's Bio

Eric Liang is a PhD student at UC Berkeley working with Ion Stoica on distributed systems and applications of reinforcement learning. He is currently leading the RLlib project (rllib.io). Before grad school, he spent 4 years working in industry on storage infrastructure at Google and Apache Spark at Databricks.

Eric Liang

Project Lead, RISELab at UC Berkeley

Workshop: Modeling in the tidyverse

The tidyverse in R has traditionally been focused on data ingestion, manipulation, and visualization. The tidymodels packages apply the same design principles to modeling to create packages with high usability that produce results in predictable formats and structures. This workshop is a concise overview of the system and is illustrated with examples. Remote servers are available for users who cannot install software locally. Materials and preparation instructions can be found at https://github.com/topepo/odsc_2019

Instructor's Bio

Coming soon

Max Kuhn, PhD

Software Engineer, Author & Creator of Caret at RStudio

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology have evolved over the years. However, identity still remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, there still are several unsolved problems. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes, which affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity/attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline, which is composed of four steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, and (4) building a model to compute a match confidence using similarity and uniqueness.
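
A crude illustration of steps (3) and (4) of that pipeline, using only the Python standard library; production systems use far richer similarity features and a learned confidence model, and the weights below are arbitrary placeholders.

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Fuzzy string similarity in [0, 1] after simple canonicalization.
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def match_confidence(profile_a: dict, profile_b: dict, weights: dict) -> float:
        # Weighted combination of per-attribute similarities, standing in for a trained model.
        score = sum(weights[attr] * similarity(profile_a[attr], profile_b[attr])
                    for attr in weights)
        return score / sum(weights.values())

    a = {"name": "Jonathan Smith", "city": "New York"}
    b = {"name": "Jon Smith", "city": "new york"}
    print(match_confidence(a, b, weights={"name": 0.7, "city": 0.3}))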

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity and then applying coherence across various identity attributes to verify all other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required.

Instructor's Bio

Liren Peng is a Software Engineer on the Trust team at Airbnb. He is responsible for the architecture and development of user identity verification systems. He also works on the utilization of third-party data and vendor integration. Prior to Airbnb, Liren worked at Trooly, a startup that built machine learning based trust models using both social media data and proprietary data to assess the trustworthiness of individuals. He received a B.S. from Carnegie Mellon University and an M.Sc. from Stanford University, focusing on data analytics.

Liren Peng

Software Engineer at Airbnb

Workshop: Open Data Hub workshop on OpenShift

The past few years have seen growth and adoption of container technology for cloud-native applications, DevOps, and agility. Kubernetes has emerged as the de facto hybrid cloud container platform.

There is considerable interest in bringing data science workloads and workflows to OpenShift, Red Hat’s Kubernetes distro. Data scientists benefit by having a choice of public and private clouds and the capabilities and technologies they bring for their experiments. Data and ML engineers benefit by being able to scale and bring data science workloads and workflows to production.

We propose a hands-on workshop where we show attendees how to deploy open source technologies for data science on Kubernetes – technologies such as Jupyter, Kafka, Spark, TensorFlow, Ceph, etc. This workshop will be based on our experiences doing this for Open Data Hub.

Instructor's Bio

Coming Soon

Steven Huels

Director of Engineering at Red Hat

Workshop: Open Data Hub workshop on OpenShift

The past few years have seen growth and adoption of container technology for cloud-native applications, DevOps, and agility. Kubernetes has emerged as the de facto hybrid cloud container platform.

There is considerable interest in bringing data science workloads and workflows to OpenShift, Red Hat’s Kubernetes distro. Data scientists benefit by having a choice of public and private clouds and the capabilities and technologies they bring for their experiments. Data and ML engineers benefit by being able to scale and bring data science workloads and workflows to production.

We propose a hands-on workshop where we show attendees how to deploy open source technologies for data science on Kubernetes – technologies such as Jupyter, Kafka, Spark, TensorFlow, Ceph, etc. This workshop will be based on our experiences doing this for Open Data Hub.

Instructor's Bio

Coming Soon

Sherard Griffin

Senior Principal Engineer at Red Hat

Workshop: Open Data Hub workshop on OpenShift

The past few years have seen growth and adoption of container technology for cloud-native applications, DevOps, and agility. Kubernetes has emerged as the de facto hybrid cloud container platform.

There is considerable interest in bringing data science workloads and workflows to OpenShift, Red Hat’s Kubernetes distro. Data scientists benefit by having a choice of public and private clouds and the capabilities and technologies they bring for their experiments. Data and ML engineers benefit by being able to scale and bring data science workloads and workflows to production.

We propose a hands-on workshop where we show attendees how to deploy open source technologies for data science on Kubernetes – technologies such as Jupyter, Kafka, Spark, TensorFlow, Ceph, etc. This workshop will be based on our experiences doing this for Open Data Hub.

Instructor's Bio

Coming Soon

Tushar Katarki

Sr. Principal Product Manager – OpenShift at Red Hat

Workshop: Intro to Technical Financial Evaluation with R

In this entry level workshop you will learn how to download and evaluate equities with the TTR (technical trading rules) package. We will evaluate an equity according to three basic indicators and introduce you to backtesting for more sophisticated analyses on your own. Next we will model a financial market’s risk versus reward to identify the best possible individual investments in the market. Lastly, we will explore a non-traditional market, simulate the reward in the market and put our findings to an actual test in a highly speculative environment.

Instructor's Bio

Coming Soon

Ted Kwartler

Director, Data Scientist, Adjunct Professor at Liberty Mutual, Harvard Extension School

Workshop: Get started with Deep Learning and the Internet of Things!

This workshop engages intermediate participants with Deep Learning and the Internet of Things (IoT). The hands-on exercises focus on an image recognition application that feeds into real-time IoT analytics; participants will use a webcam and a neural network to recognize images, aggregate the resulting data, and feed it into real-time IoT analytics.
We will start by introducing the fundamentals of deep learning and IoT, including concepts like “convolutional networks”, “smart assets”, and “edge devices”, and fundamental techniques such as transfer learning and inference. The hands-on group exercise guides you through using a pre-trained convolutional neural network to recognize objects seen through a webcam and then classify images from the live webcam feed. Finally, the data will be aggregated using an Internet of Things platform.
Participants will learn to:
1. Access and explore popular pretrained models like AlexNet or GoogLeNet
2. Use transfer learning to build an image classifier
3. Graphically modify deep neural networks
4. Improve the accuracy of deep networks

The workshop also covers common challenges in building commercial deep learning applications that go beyond just training the deep network, including ground-truth labeling of input data and exchanging models with other popular deep learning frameworks like TensorFlow using importers and the ONNX model format.

Our goal is to get participants excited about Deep Learning and IoT, and to empower them to continue with more advanced projects after the conference. Access to MATLAB will be provided, along with training data and scripts, via a browser.

Instructor's Bio

Jianghao Wang is a Data Scientist at MathWorks. In her role, Jianghao supports deep learning research and teaching in academia. Before joining MathWorks, Jianghao obtained her Ph.D. in Statistical Climatology from the University of Southern California and B.S. in Applied Mathematics from Nankai University.

Jianghao Wang, PhD

Data Scientist at MathWorks

Workshop: Get started with Deep Learning and the Internet of Things!

This workshop engages intermediate participants with Deep Learning and the Internet of Things (IoT). The hands-on exercises focus on an image recognition application that feeds into real-time IoT analytics; participants will use a webcam and a neural network to recognize images, aggregate the resulting data, and feed it into real-time IoT analytics.
We will start by introducing the fundamentals of deep learning and IoT, including concepts like “convolutional networks”, “smart assets”, and “edge devices”, and fundamental techniques such as transfer learning and inference. The hands-on group exercise guides you through using a pre-trained convolutional neural network to recognize objects seen through a webcam and then classify images from the live webcam feed. Finally, the data will be aggregated using an Internet of Things platform.
Participants will learn to:
1. Access and explore popular pretrained models like AlexNet or GoogLeNet
2. Use transfer learning to build an image classifier
3. Graphically modify deep neural networks
4. Improve the accuracy of deep networks

The workshop also covers common challenges in building commercial deep learning applications that go beyond just training the deep network, including ground-truth labeling of input data and exchanging models with other popular deep learning frameworks like TensorFlow using importers and the ONNX model format.

Our goal is to get participants excited about Deep Learning and IoT, and to empower them to continue with more advanced projects after the conference. Access to MATLAB will be provided, along with training data and scripts, via a browser.

Instructor's Bio

Pitambar Dayal works on deep learning and computer vision applications in technical marketing. Prior to joining MathWorks, he worked on creating technological healthcare solutions for developing countries and researching the diagnosis and treatment of ischemic stroke patients. Pitambar holds a B.S. in biomedical engineering from the New Jersey Institute of Technology.

Pitambar Dayal

Application Support Engineer at MathWorks

Workshop: Analyzing Legislative Burden Upon Businesses Using NLP and ML

As legislation develops over time, the burden upon businesses can change drastically. Data scientists from Bardess have collaborated with a research team within the Government of Ontario to investigate the use of advanced natural language processing (NLP) and machine learning (ML) techniques to analyze legal documents, including statutes and regulations. Using the Accessibility for Ontarians with Disabilities Act (AODA) as a starting point, we developed a multi-stage analysis. At the higher level, the goal was to identify and automatically detect parts of the legislation that indicate legislative burden and categorize them as being primarily burdens upon businesses or government departments. The second level of analysis aims at understanding patterns of similarities and differences between different classes of burden using data mining and clustering techniques. Finally, the objective of the analysis is expanded to include other legislative texts, using ML algorithms to detect burdens which have been duplicated across multiple statutes and acts. This latter work supports the Government of Ontario in developing leaner legislation more efficiently. Overall this work indicates how NLP and ML techniques can be brought to bear on complex legislative problems, further emphasizing the increasing utility of these techniques in government and industry.

In this hands-on workshop, we’ll first describe the legislative/business context for the initiative, then walk attendees through the technical implementation. The work will be conducted by combining various techniques from the NLP toolbox, such as entity recognition, part-of-speech tagging, automatic summarization, and topic modeling. Work will be conducted in Python, making use of libraries for NLP such as spacy and nltk, and the ML library scikit-learn. We will also showcase interactive dashboards which have been created using the BI tool Qlik to allow exploration of the results of the analysis.
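
A small example of the kind of NLP building block used in the workshop: running spaCy's pretrained pipeline over a sentence of legislative-style text. The model name is spaCy's standard small English model and the sentence is an illustrative placeholder, not material from the analysis itself.

    import spacy

    # Requires the model to be downloaded first: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    text = ("Every obligated organization shall file an accessibility report "
            "with a director annually, in the form approved by the Minister.")
    doc = nlp(text)

    # Part-of-speech tags and dependencies feed downstream burden-detection features.
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    # Named entities (organizations, dates, etc.) help link clauses across statutes.
    for ent in doc.ents:
        print(ent.text, ent.label_)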

Instructor's Bio

Dr. Daniel Parton leads the data science practice at the analytics consultancy, Bardess. He has a background in academia, including a PhD in computational biophysics from University of Oxford, and previously worked in marketing analytics at Omnicom. He brings both technical and management experience to his role of leading cross-functional data analytics teams, and has led successful and impactful projects for companies in finance, retail, tech, media, manufacturing, pharma and sports/entertainment industries.

Daniel Parton, PhD

Lead Data Scientist at Bardess Group

Workshop: Analyzing Legislative Burden Upon Businesses Using NLP and ML

As legislation develops over time, the burden upon businesses can change drastically. Data scientists from Bardess have collaborated with a research team within the Government of Ontario to investigate the use of advanced natural language processing (NLP) and machine learning (ML) techniques to analyze legal documents, including statutes and regulations. Using the Accessibility for Ontarians with Disabilities Act (AODA) as a starting point, we developed a multi-stage analysis. At the higher level, the goal was to identify and automatically detect parts of the legislation that indicate legislative burden and categorize them as being primarily burdens upon businesses or government departments. The second level of analysis aims at understanding patterns of similarities and differences between different classes of burden using data mining and clustering techniques. Finally, the objective of the analysis is expanded to include other legislative texts, using ML algorithms to detect burdens which have been duplicated across multiple statutes and acts. This latter work supports the Government of Ontario in developing leaner legislation more efficiently. Overall this work indicates how NLP and ML techniques can be brought to bear on complex legislative problems, further emphasizing the increasing utility of these techniques in government and industry.

In this hands-on workshop, we’ll first describe the legislative/business context for the initiative, then walk attendees through the technical implementation. The work will be conducted by combining various techniques from the NLP toolbox, such as entity recognition, part-of-speech tagging, automatic summarization, and topic modeling. Work will be conducted in Python, making use of libraries for NLP such as spacy and nltk, and the ML library scikit-learn. We will also showcase interactive dashboards which have been created using the BI tool Qlik to allow exploration of the results of the analysis.

Instructor's Bio

Serena Peruzzo is a senior data scientist at the analytics consultancy Bardess. Her formal background is in statistics, with experience working in both industry and academia. She has worked as a consultant in the Australian, British, and Canadian markets, delivering data science solutions across a broad range of industries, and has led several startups through the process of bootstrapping their data science capabilities.

Serena Peruzzo

Sr. Data Scientist at Bardess Group

Workshop: Causal Inference for Data Scientists

Causal inference is an increasingly necessary skill set for data scientists and analysts. It is no longer enough to predict what happens given a set of environmental conditions; internal business partners need to know how the decisions they make influence outcomes. For example, marketers not only need to know that spending more money drives more revenue, they also need to know how much revenue to expect at various levels of marketing spend. Understanding the causal relationship between spend and revenue empowers decision makers to optimize their decisions more accurately and quickly around crucial business goals such as ROI targets or revenue maximization. At DraftKings, we are always thinking about how we draw accurate conclusions from all of our tests. Our efforts include utilizing modern techniques as well as exploring new ideas and methods to improve our ability to learn.

Managers often assume that causal inference is a simple exercise for data scientists. Unfortunately, causal inference is not as simple as running A/B experiments. The purpose of this talk is to establish that causal inference is as much a philosophical exercise as it is a data exercise. Developing expertise in causal inference requires a deep understanding of the accepted framework, an ability to identify when data doesn’t adhere to the assumptions of this framework, and expertise with tools and techniques that can solve many of the significant challenges with estimating unbiased effects of treatments on critical outcomes.

This session serves as an introduction to the practice of causal inference. We start with an overview of the Rubin Causal Model (RCM), the leading framework for establishing causality. Once users are comfortable with the philosophy, we explore how the commonly used A/B testing framework maps to the more robust RCM framework from both a mathematical and philosophical perspective. In the final portion of the talk, we discuss several techniques developed by researchers that can be used to establish causality for a compromised A/B test or cases where tests are not feasible to implement. Throughout the talk, we use a general set of challenges faced by businesses to illustrate when issues arise and how these techniques mitigate the challenges.
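As a minimal sketch of the potential-outcomes idea behind the RCM, here is a fully randomized simulated experiment in which a simple difference in group means recovers the average treatment effect (the data and effect size are invented for illustration, not DraftKings' code):

```python
# Minimal potential-outcomes sketch: with randomized treatment assignment,
# the difference in group means is an unbiased estimate of the average
# treatment effect (ATE). All data here is simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
treated = rng.integers(0, 2, size=n)           # randomized assignment
baseline = rng.normal(100, 20, size=n)          # potential outcome under control
true_effect = 5.0
revenue = baseline + true_effect * treated + rng.normal(0, 5, size=n)

ate_hat = revenue[treated == 1].mean() - revenue[treated == 0].mean()
print(f"Estimated ATE: {ate_hat:.2f} (true effect: {true_effect})")
```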

Instructor's Bio

Coming soon

Bradley Fay, PhD

Senior Manager, Analytics at DraftKings

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology has evolved over the years. However, identity remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, several problems are still unsolved. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes, which affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity or attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.
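As a toy illustration of the canonicalize-then-lookup pattern (the normalization rules and reference set below are invented for the example, not production techniques):

```python
# Toy illustration of attribute validation: canonicalize a name, then look it
# up against a constructed reference set. All data here is invented.
import unicodedata

KNOWN_GIVEN_NAMES = {"maria", "jose", "wei", "fatima", "john"}

def canonicalize(value: str) -> str:
    # Strip accents, collapse whitespace, lowercase
    value = unicodedata.normalize("NFKD", value)
    value = "".join(c for c in value if not unicodedata.combining(c))
    return " ".join(value.lower().split())

def is_plausible_given_name(value: str) -> bool:
    return canonicalize(value) in KNOWN_GIVEN_NAMES

print(is_plausible_given_name("  José "))   # True
print(is_plausible_given_name("Xx123"))     # False
```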

Identity matching is fundamental for two main applications: detecting duplicates, and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline, which is composed of four steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, and (4) building a model to compute a match confidence using similarity and uniqueness.
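A rough sketch of step (3), fuzzy matching of attribute pairs, using only the Python standard library (the fields, profiles, and scoring are assumptions for illustration):

```python
# Rough sketch of fuzzy attribute matching with the standard library.
# Real systems combine several similarity features with a learned model
# to produce a match confidence; the simple average here is arbitrary.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

profile_a = {"name": "Jon A. Smith", "city": "Boston"}
profile_b = {"name": "John Smith", "city": "Boston, MA"}

features = {field: similarity(profile_a[field], profile_b[field])
            for field in ("name", "city")}
match_score = sum(features.values()) / len(features)
print(features, "->", round(match_score, 2))
```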

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity, and then applying coherence across various identity attributes to verify all the other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required.

Instructor's Bio

Sukhada Palkar is a software engineer at Airbnb working on the various challenges of trusting digital identities. She enjoys working at the intersection of open-ended problem solving, software engineering, and machine learning. She has a background in applying machine learning to text and speech systems and, more recently, to identity and risk analytics.

Before Airbnb, Sukhada was an early member of the Amazon Alexa core natural language team and part of Trooly, a startup in the digital identity verification space that was acquired by Airbnb. Sukhada has an M.S. in speech and language technologies from Carnegie Mellon.

Sukhada Palkar

Software Engineer at Airbnb

Workshop: Mapping the Global Supply Chain Graph

Panjiva maps the network of global trade using over one billion shipping records sourced from 15 governments around the world. We perform large-scale entity extraction and entity resolution from this raw data, identifying over 8 million companies involved in international trade, located across every country in the world. Moreover, we track detailed information on the 25 million+ relationships between them, yielding a map of the global trade network with unprecedented scope and granularity. We have developed a powerful platform facilitating search, analysis, and visualization of this network as well as a data feed integrated into S&P Global’s Xpressfeed platform.

We can explore the global supply chain graph at many levels of granularity. At the micro level, we can surface the close relationships around a given company to, for example, identify overseas suppliers shared with a competitor. At the macro level, we can track patterns such as the flow of products among geographic areas or industries. By linking to S&P Global’s financial and corporate data, we are able to understand how supply chains flow within or between multinational corporate structures, and correlate trade volumes and anomalies to financial metrics and events.
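Panjiva's platform and data are proprietary; purely to illustrate the micro-level exploration described above, here is a sketch over a tiny invented supplier graph using networkx:

```python
# Illustration only: exploring a tiny, invented supplier graph with networkx.
# Panjiva's actual graph covers 8M+ companies and 25M+ relationships.
import networkx as nx

G = nx.DiGraph()  # edge u -> v means "u supplies v"
G.add_edges_from([
    ("Acme Plastics", "Widget Corp"),
    ("Acme Plastics", "Rival Widgets"),
    ("Shenzhen Metals", "Widget Corp"),
])

# Overseas suppliers shared between a company and its competitor
shared = set(G.predecessors("Widget Corp")) & set(G.predecessors("Rival Widgets"))
print(shared)  # {'Acme Plastics'}
```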

Instructor's Bio

Jason Prentice leads the data team at Panjiva, where he focuses on developing the fundamental machine learning technologies that power our data collection. Before joining Panjiva as a data scientist, he researched computational neuroscience as a C.V. Starr fellow at Princeton University and earned a Ph.D. in Physics from the University of Pennsylvania.

Jason Prentice, PhD

Senior Manager, Data Science at S&P Global Market Intelligence

Workshop: Democratizing & Accelerating AI through Automated Machine Learning

Intelligent experiences powered by AI can seem like magic to users. Developing them, however, is cumbersome: it involves a series of sequential and interconnected decisions that are time-consuming to make. What if there were an automated service that identifies the best machine learning pipelines for a given problem and dataset? Automated ML does exactly that!

Automated ML is based on a breakthrough from our Microsoft Research division. The approach combines ideas from collaborative filtering and Bayesian optimization to search an enormous space of possible machine learning pipelines intelligently and efficiently. It’s essentially a recommender system for machine learning pipelines. Similar to how streaming services recommend movies for users, automated ML recommends machine learning pipelines for data sets.

Just as important, automated ML accomplishes all this without having to see the customer’s data, preserving privacy. Automated ML is designed to not look at the customer’s data. Customer data and execution of the machine learning pipeline both live in the customer’s cloud subscription (or their local machine), which they have complete control of. Only the results of each pipeline run are sent back to the automated ML service, which then makes an intelligent, probabilistic choice of which pipelines should be tried next.

By making automated ML available through the Azure Machine Learning service (via its Python SDK), we’re empowering data scientists with a powerful productivity tool. We’re also working on making automated ML accessible through Power BI, so that business analysts and BI professionals can take advantage of machine learning. And stay tuned as we continue to incorporate it into other product channels to bring the power of automated ML to everyone.

This session will provide an overview of automated machine learning, key customer use cases, how it works, and how you can get started!
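As a hedged sketch of what getting started with the Python SDK can look like (the class and parameter names below follow the v1-style azureml SDK and may differ across versions; the workspace, dataset, and column names are hypothetical):

```python
# Illustrative only: submitting an automated ML experiment via the Azure ML
# Python SDK (v1-style API). Exact parameters vary by SDK version.
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()                          # assumes a config.json is present
train_data = Dataset.get_by_name(ws, "churn_train")   # hypothetical registered dataset

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_data,
    label_column_name="churned",                      # hypothetical label column
    primary_metric="AUC_weighted",
    iterations=20,
)

run = Experiment(ws, "automl-demo").submit(automl_config)
run.wait_for_completion(show_output=True)
best_run, fitted_model = run.get_output()             # best pipeline found by the service
```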

Instructor's Bio

Coming soon

Deepak Babu Mukunthu

Principal Program Manager, Azure AI Platform at Microsoft

Workshop: Deep Learning like a Viking: Building Convolutional Neural Networks with Keras

The Vikings came from the land of ice and snow, from the midnight sun, where the hot springs flow. In addition to longships and bad attitudes, they had a system of writing that we, in modern times, have dubbed the Younger Futhark (or ᚠᚢᚦᚬᚱᚴ if you’re a Viking). These sigils are more commonly called runes and have been mimicked in fantasy literature and role-playing games for decades.

Of course, having an alphabet, runic or otherwise, solves lots of problems. But, it also introduces others. The Vikings had the same problem we do today. How were they to get their automated software systems to recognize the hand-carved input of a typical boatman? Of course, they were never able to solve this problem and were instead forced into a life of burning and pillaging. Today, we have deep learning and neural networks and can, fortunately, avoid such a fate.

In this session, we are going to build a Convolutional Neural Network to recognize hand-written runes from the Younger Futhark. We’ll be using Keras to write easy-to-understand Python code that creates and trains the neural network to do this. We’ll wire this up to a web application using Flask and some client-side JavaScript, so you can write some runes yourself and see if it recognizes them.

When we’re done, you’ll understand how Convolutional Neural Networks work, how to build your own using Python and Keras, and how to make one part of an application using Flask. Maybe you’ll even try seeing what it thinks of the Bluetooth logo?
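As a hedged preview of the kind of model we'll build (input size and training details are assumptions; the Younger Futhark has 16 runes, hence 16 output classes):

```python
# Minimal sketch of a CNN like the one built in the session, assuming
# 28x28 grayscale rune images and the 16 classes of the Younger Futhark.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(16, activation="softmax"),   # one output per rune
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, validation_split=0.1)  # with your rune data
```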

Instructor's Bio

Guy works for DataRobot in Columbus, Ohio as a Developer Evangelist. Combining his decades of experience in writing software with a passion for sharing what he has learned, Guy goes out into developer communities and helps others build great software.
Teaching and community have long been a focus for Guy. He is President of the Columbus JavaScript Users Group, an organizer for the Columbus Machine Learners, and has even helped teach programming at a prison in central Ohio.
In past lives, Guy has worked as a consultant in a broad range of industries including healthcare, retail, and utilities. He has also spent several years working for a major insurance company in central Ohio. This has given him a broad view of how technology can be applied to business problems.

Guy Royse

Developer Evangelist at DataRobot

Tutorials:


Tutorial: How should we (correctly) compare graphs?

Graph representations of real-world phenomena are ubiquitous – from social and information networks, to technological, biological, chemical, and brain networks. Many graph mining tasks require a distance (or, conversely, a similarity) measure between graphs. Examples include clustering of graphs and anomaly detection, nearest neighbor and similarity search, pattern recognition, and transfer learning. Such tasks find applications in diverse areas including image processing, chemistry, and social network analysis, to name a few.

Intuitively, given two graphs, their distance is a score quantifying their structural differences. A highly desirable property for such a score is that it is a metric, i.e., it is non-negative, symmetric, positive-definite, and, crucially, satisfies the triangle inequality. Metrics exhibit significant computational advantages over non-metrics. For example, operations such as nearest-neighbor search, clustering, outlier detection, and diameter computation have known fast algorithms precisely when performed over objects embedded in a metric space.
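Spelled out, a graph distance d is a metric when, for all graphs G, H, and K, it satisfies:

```latex
% Metric axioms for a graph distance d
\begin{align*}
& d(G, H) \ge 0 && \text{(non-negativity)} \\
& d(G, H) = 0 \iff G \cong H && \text{(positive definiteness)} \\
& d(G, H) = d(H, G) && \text{(symmetry)} \\
& d(G, K) \le d(G, H) + d(H, K) && \text{(triangle inequality)}
\end{align*}
```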

Unfortunately, algorithms to compute several classic distances between graphs do not scale to large graphs; other distances do not satisfy all of the metric properties: non-negativity, positive definiteness, symmetry, and triangle inequality.

The purpose of this tutorial is to go over the recent and expanding literature on graph metric spaces, focusing specifically on tractable metrics. We also explain how to compute the distance between n graphs in a way that the resulting distance satisfies a generalization of the triangle inequality to n elements and remains tractable.

Instructor's Bio

José Bento completed his Ph.D. in Electrical Engineering at Stanford University, where he worked with Professor Andrea Montanari on statistical inference and structural learning of graphical models. After his Ph.D., he moved to Disney Research's Boston lab, where he worked with Dr. Jonathan Yedidia on algorithms for distributed optimization, robotics, and computer vision. He is now with the Computer Science department at Boston College. His current research lies at the intersection of distributed algorithms and machine learning. In 2014 he received a Disney Inventor Award for his work on distributed optimization, which recently led to an approved patent. In 2016 he was awarded a $10M NIH joint grant to study the emergence of antibiotic resistance, and in 2017 a $2M NSF joint grant to study measures of distance between large graphs.

Jose Bento, PhD

Assistant Professor at Boston College

Tutorial: Deep Learning on Mobile

Over the last few years, convolutional neural networks (CNN) have risen in popularity, especially in the area of computer vision. Many mobile applications running on smartphones and wearable devices would potentially benefit from the new opportunities enabled by deep learning techniques. However, CNNs are by nature computationally and memory intensive, making them challenging to deploy on a mobile device.

This workshop explains how to practically bring the power of convolutional neural networks and deep learning to memory- and power-constrained devices like smartphones. You will learn various strategies to circumvent obstacles and build mobile-friendly, shallow CNN architectures that significantly reduce the memory footprint and are therefore easier to store on a smartphone. The workshop also dives into how to use a family of model compression techniques to prune the network size for live image processing, enabling you to build a CNN version optimized for inference on mobile devices. Along the way, you will learn practical strategies to preprocess your data in a manner that makes the models more efficient in the real world.

Following a step-by-step example of building an iOS deep learning app, we will discuss tips and tricks, speed and accuracy trade-offs, and benchmarks on different hardware to demonstrate how to get started developing your own deep learning application suitable for deployment on storage- and power-constrained mobile devices. You can also apply similar techniques to make deep neural nets more efficient when deploying in a regular cloud-based production environment, thus reducing the number of GPUs required and optimizing cost.
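As one hedged example of the general compression idea (the session's exact toolchain for iOS may differ, e.g. Core ML), post-training quantization with TensorFlow Lite shrinks a trained Keras model for on-device inference:

```python
# One common route to a smaller on-device model: post-training quantization
# with TensorFlow Lite. The session may use a different toolchain.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")  # example network

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]           # weight quantization
tflite_model = converter.convert()

with open("mobilenet_v2_quant.tflite", "wb") as f:
    f.write(tflite_model)   # typically several times smaller than the float model
```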

Instructor's Bio

Anirudh is the Head of AI & Research at Aira (a visual interpreter for the blind), and was previously at Microsoft AI & Research, where he founded Seeing AI, a talking camera app for the blind community. He is also the co-author of the upcoming book ‘Practical Deep Learning for Cloud and Mobile’. He brings over a decade of production-oriented applied research experience on petabyte-scale datasets, with features shipped to about a billion people. He has been prototyping ideas using computer vision and deep learning techniques for augmented reality, speech, productivity, and accessibility. Some of his recent work, which IEEE has called ‘life changing’, has been honored by CES, FCC, Cannes Lions, and the American Council of the Blind, showcased at events by the White House, the House of Lords, and the World Economic Forum, featured on Netflix and National Geographic, and applauded by world leaders including Justin Trudeau and Theresa May.

Anirudh Koul

Head of AI & Research at Aira

Tutorial: When the Bootstrap Breaks

Resampling methods like the bootstrap are becoming increasingly common in modern data science, and for good reason: the bootstrap is incredibly powerful. Unlike t-statistics, it doesn’t depend on a normality assumption or require any arcane formulas, and you’re no longer limited to working with well-understood metrics like means. One can easily build tools that compute confidence intervals for an arbitrary metric. What’s the standard error of a median? Who cares! I used the bootstrap.
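For instance, a percentile-bootstrap confidence interval for a median takes only a few lines (a sketch with simulated data, not the talk's code):

```python
# Sketch of a percentile-bootstrap confidence interval for the median.
# Simulated data; the talk uses its own simulations and real Firefox data.
import numpy as np

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=3.0, sigma=1.0, size=500)   # skewed data, like many metrics

boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(5_000)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(sample):.1f}, 95% CI ~ ({lo:.1f}, {hi:.1f})")
```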

With all of these benefits the bootstrap begins to look a little magical. That’s dangerous. To understand your tool you need to understand how it fails, how to spot the failure, and what to do when it does. As it turns out, methods like the bootstrap and the t-test struggle with very similar types of data. We’ll explore how these two methods compare on troublesome data sets and discuss when to use one over the other.

In this talk we’ll explore what types of data the bootstrap has trouble with. Then we’ll discuss how to identify these problems in the wild and how to deal with the problematic data. We will explore simulated data and share the code so you can conduct the simulations yourself. However, this isn’t just a theoretical problem: we’ll also look at real Firefox data and discuss how Firefox’s data science team handles it when analyzing experiments.

At the end of this session you’ll leave with a firm understanding of the bootstrap. Even better, you’ll understand how to spot potential issues in your data and avoid false confidence in your results.

Instructor's Bio

Ryan Harter is a Senior Staff Data Scientist at Mozilla working on Firefox. He has years of experience solving business problems in the technology and energy industries, both as a data scientist and as a data engineer. Ryan shares practical advice for applying data science as a mentor and on his blog.

Ryan Harter

Senior Staff Data Scientist at Mozilla

A Sample of Previous East Workshops


  • Reducing Model Risk with Automated Machine Learning

  • How to Visualize Your Data: Beyond the Eye into the Brain

  • Matrix Math at Scale with Apache Mahout and Spark

  • Tutorial on Anomaly Detection at Scale: Data Engineering Challenges meet Data Science Difficulties

  • Crunching your Data with CatBoost – New Gradient Boosting Library

  • Deep Learning in Finance: An experiment and a reflection

  • Real-Time Machine Learning on the Mainframe

  • Power up your Computer Vision skills with TensorFlow-Keras

  • Bayesian Networks with pgmpy

  • Bayesian Hierarchical Model for Predictive Analytics

  • Standardized Data Science: The Team Data Science Process – with a practical example in Python

  • Interpretable Representation Learning for Visual Intelligence

  • Henosis – a generalizable, cloud-native Python form recommender framework for Data Scientists

  • Bayesian Statistics Made Simple

  • CNNs for Scene Classification in Videos

  • Accelerated mapping from the Sky: object detection with high resolution remote sensing images

  • Applications of Deep Learning in Aerospace and Building Systems

  • Democratise Conversational AI – Scaling Academic Research to Industrial Applications

  • Latest Developments in GANs

  • Multivariate Time Series Forecasting Using Statistical and Machine Learning Models

  • Networks and Large Scale Optimization

  • Blockchain and Data Governance – Validating Information for Data Science

  • Why Machine Learning needs its own language, and why Julia is the one

  • Machine Learning in Chainer Python

  • Buying Happiness – Using LSTMs to Turn Feelings into Trades

  • Multi-Paradigm Data Science

  • Agile Data Science 2.0

  • Keras for R

  • R Packages as Collaboration Tools

  • Uplift Modeling and Uplift Prescriptive Analytics: Introduction and Advanced Topics

  • Using AWS SageMaker, Kubernetes, and PipelineAI for High Performance, Hybrid-Cloud Distributed TensorFlow Model Training and Serving with GPUs

  • Deep Learning Methods for Text Classification

  • Applying Deep Learning to Article Embedding for Fake News Evaluation

  • Experimental Reproducibility in Data Science with Sacred

  • Visual Analytics for High Dimensional Data

  • Running Data Science Projects and integration within the Organizational Ecosystem

  • Data Science Learnathon. From Raw Data to Deployment: The Data Science Cycle with Knime

  • Salted Graphs – A (Delicious) Approach to Repeatable Data Science

  • A Primer on Neural Network Models for Natural Language Processing

  • Help! I have missing data. How do I fix it (the right way)?

  • Applying Color to Visual Analytics in Data Science

  • Under The Hood: Creating Your Own Spark Datasources

  • #NOBLACKBOXES: How To Solve Real Data Science Problems with Automation, Without Losing Transparency

  • Solving Real World Problems in Machine Learning and Data Science

  • The Power of Monotonicity to Make ML Make Sense

Sign Up for ODSC East 2019 | April 30-May 3

Register Now