Training Sessions

– Taught by World-Class Data Scientists –

Learn the latest data science concepts, tools and techniques from the best. Forge a connection with these rock stars from industry and academia, who are passionate about molding the next generation of data scientists.

Highly Experienced Instructors

Our instructors are highly regarded in data science, coming from both academia and notable companies.

Real World Applications

Gain the skills and knowledge to use data science in your career and business, without breaking the bank.

Cutting Edge Subject Matter

Find training sessions offered on a wide variety of data science topics from machine learning to data visualization.

ODSC Training

Form a working relationship with some of the world’s top data scientists for follow up questions and advice.

Additionally, your ticket includes access to 50+ talks and workshops.

Recordings of workshop and talk sessions for later review.

Equivalent training at other conferences costs much more.

Professionally prepared learning materials, custom tailored to each course.

Opportunities to connect with other ambitious like-minded data scientists.


A Few of Our 2019 Training and Workshop Session Speakers

More sessions to be added soon!

Training Sessions


Training: Apache Spark for Fast Data Science (and Fast Python Integration!) at Scale

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.

We will then look at some of the newest features in Spark that allow elegant, high performance integration with your favorite Python tooling. We’ll discuss distributed scheduling for popular libraries like TensorFlow, as well as fast model inference, traditionally a challenge with Spark. We’ll even see how you can integrate Spark with Python+GPU computation on arrays (PyTorch) or dataframes (RapidsAI).

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user friendly ways of applying Apache Spark in your job and/or research.
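
To make that concrete, here is a minimal sketch of a Spark ML pipeline in Python, assuming a running SparkSession and a dataset with columns x1, x2, and label; the file name and column names are illustrative assumptions, not course material.

```python
# Minimal PySpark ML sketch (illustrative file and column names, not course material)
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("spark-ds-training").getOrCreate()
df = spark.read.parquet("events.parquet")  # hypothetical dataset with columns x1, x2, label

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(df)                       # model tuning runs distributed across the cluster
predictions = model.transform(df)
```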

Instructor's Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s 20 years of engineering experience include streaming analytics, machine learning systems, and cluster management schedulers for some of the world’s largest banks, along with web, mobile, and embedded device apps for startups. His first full-time job in tech was on a neural-net-based fraud detection system for debit transactions, back in the bad old days when some neural nets were patented (!) and he’s much happier living in the age of amazing open-source data and ML tools today.

Adam Breindel

Apache Spark Expert, Data Science Instructor and Consultant

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, by using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
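
As a hedged illustration of that supervised-learning workflow (a collection of input-output pairs, a fitted model, an evaluation on held-out data), here is a minimal scikit-learn sketch on one of its built-in datasets; it is not taken from the session materials.

```python
# Minimal supervised-learning sketch with scikit-learn (not from the talk itself)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)            # inputs x and outputs y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)              # "learn" a function mapping x to y
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))   # evaluate on held-out data
```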

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, by using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.

Instructor's Bio

Thomas Fan is a Software Developer at Columbia University’s Data Science Institute. He collaborates with the scikit-learn community to develop features, review code, and resolve issues. In his free time, Thomas contributes to skorch, a scikit-learn-compatible neural network library that wraps PyTorch.

Thomas Fan

Software Developer – Machine Learning at Columbia Data Science Institute

Training: Hands-on introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore various related architectures (GRUs, Bidirectional LSTMs, etc.) to see how well they work.
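
For orientation, a minimal Keras LSTM classifier might look like the sketch below; the vocabulary size, sequence length, and random stand-in data are illustrative assumptions rather than the workshop's actual code.

```python
# Minimal Keras LSTM text-classifier sketch (illustrative sizes, stand-in data)
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, seq_len = 5000, 40                            # assumed vocabulary and sequence length
X = np.random.randint(0, vocab_size, (256, seq_len))      # stand-in for tokenized text
y = np.random.randint(0, 2, 256)                          # stand-in binary labels

model = Sequential([
    Embedding(vocab_size, 64, input_length=seq_len),
    LSTM(64),                                             # swap in GRU or Bidirectional(LSTM(64)) to compare
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32)
```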

Instructor's Bio

Lukas Biewald is a co-founder and CEO of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. In December 2007, Lukas co-founded Figure Eight (formerly CrowdFlower) with Chris Van Pelt; the company provides a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized, high-quality training data. Prior to co-founding Weights and Biases and CrowdFlower, Biewald was a Senior Scientist and Manager within the Ranking and Management Team at Powerset, a natural language search technology company later acquired by Microsoft. From 2005 to 2006, Lukas also led the Search Relevance Team for Yahoo! Japan.

Lukas Biewald

Founder at Weights & Biases

Training: Hands-on introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore various related architectures (GRUs, Bidirectional LSTMs, etc.) to see how well they work.

Instructor's Bio

Chris Van Pelt is a co-founder of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. In December 2007, Chris co-founded Figure Eight (formerly CrowdFlower) with Lukas Biewald; the company provides a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized, high-quality training data.

Chris Van Pelt

Co-founder at Weights & Biases

Training: Hands-on introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore various related architectures (GRUs, Bidirectional LSTMs, etc.) to see how well they work.

Instructor's Bio

Stacey Svetlichnaya is a deep learning engineer at Weights & Biases in San Francisco, CA, helping develop effective tools and patterns for deep learning. She was previously a senior research engineer with Yahoo Vision & Machine Learning, working on image aesthetic quality and style classification, object recognition, photo caption generation, and emoji modeling. She has worked extensively on Flickr image search and data pipelines, as well as automating content discovery and recommendation. Prior to Flickr, she helped build a visual similarity search engine with LookFlow, which Yahoo acquired in 2013. Stacey holds a BS ‘11 and MS ’12 in Symbolic Systems from Stanford University.

Stacey Svetlichnaya

Deep Learning Engineer at Weights & Biases

Training: Intermediate Machine Learning with scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity, how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
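
A small, hedged sketch of the pipeline-plus-grid-search pattern described above; the parameter values and estimator choice are illustrative, not the session's materials.

```python
# Sketch of a scikit-learn pipeline with grid search (illustrative parameters)
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5)   # RandomizedSearchCV trades exhaustiveness for speed
search.fit(X, y)
print(search.best_params_, search.best_score_)
```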

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Intermediate Machine Learning with scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity, how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.

Instructor's Bio

Thomas Fan is a Software Developer at Columbia University’s Data Science Institute. He collaborates with the scikit-learn community to develop features, review code, and resolve issues. In his free time, Thomas contributes to skorch, a scikit-learn-compatible neural network library that wraps PyTorch.

Thomas Fan

Software Developer – Machine Learning at Columbia Data Science Institute

Training: Introduction to Reinforcement Learning

Reinforcement Learning recently progressed greatly in industry as one of the best techniques for sequential decision making and control policies.

DeepMind used RL to greatly reduce energy consumption in Google’s data centre. It has been used for text summarisation, autonomous driving, dialog systems, media advertisement, and in finance by JPMorgan Chase. We are at the very beginning of the adoption of these algorithms, as systems are required to operate more and more autonomously.
In this workshop we will explore Reinforcement Learning, starting from its fundamentals and ending with creating our own algorithms.

We will use OpenAI Gym to try out our RL algorithms. OpenAI is a non-profit organisation committed to open-sourcing all of its research on Artificial Intelligence. To foster innovation, OpenAI created a virtual environment, OpenAI Gym, where it is easy to test Reinforcement Learning algorithms.

In particular, we will start with some popular techniques like the Multi-Armed Bandit, going through Markov Decision Processes and Dynamic Programming.
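
As a hedged illustration of the multi-armed bandit setting mentioned above, here is a tiny epsilon-greedy agent in plain NumPy; the reward probabilities and epsilon value are made up for the example and are not workshop code.

```python
# Epsilon-greedy multi-armed bandit sketch in plain NumPy (illustrative values)
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]             # hidden reward probabilities of 3 arms
counts = np.zeros(3)
values = np.zeros(3)                      # running estimate of each arm's reward
epsilon = 0.1

for t in range(1000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))        # explore
    else:
        arm = int(np.argmax(values))      # exploit the current best estimate
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print("estimated arm values:", values)
```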

Instructor's Bio

Leonardo De Marchi holds a Master’s in Artificial Intelligence and has worked as a Data Scientist in the sports world, with clients such as the New York Knicks and Manchester United, and with large social networks like JustGiving.
He now works as Lead Data Scientist at Badoo, the largest dating site with over 360 million users. He is also the lead instructor at ideai.io, a company specialized in Deep Learning and Machine Learning training, and a contractor for the European Commission.

Leonardo De Marchi

Head of Data Science and Analytics at Badoo

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation and so on. This class is super hands on and practical and requires no math or experience with deep learning.
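
For a sense of scale, a minimal convolutional image classifier of the kind built up in this class might look like the following hedged Keras sketch; MNIST is used only as a stand-in dataset, and the layer sizes are illustrative.

```python
# Small convolutional image-classifier sketch in Keras (illustrative, not class materials)
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), _ = mnist.load_data()
x_train = x_train[..., None] / 255.0       # normalize pixel values to [0, 1]
y_train = to_categorical(y_train)

model = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)
```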

Instructor's Bio

Lukas Biewald is a co-founder and CEO of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. In December 2007, Lukas co-founded Figure Eight (formerly CrowdFlower) with Chris Van Pelt; the company provides a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized, high-quality training data. Prior to co-founding Weights and Biases and CrowdFlower, Biewald was a Senior Scientist and Manager within the Ranking and Management Team at Powerset, a natural language search technology company later acquired by Microsoft. From 2005 to 2006, Lukas also led the Search Relevance Team for Yahoo! Japan.

Lukas Biewald

Founder at Weights & Biases

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation and so on. This class is super hands on and practical and requires no math or experience with deep learning.

Instructor's Bio

Chris Van Pelt is a co-founder of Weights and Biases, which builds performance and visualization tools for machine learning teams and practitioners. In December 2007, Chris co-founded Figure Eight (formerly CrowdFlower) with Lukas Biewald; the company provides a human-in-the-loop platform that transforms unstructured text, image, audio, and video data into customized, high-quality training data.

Chris Van Pelt

Co-founder at Weights & Biases

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation and so on. This class is super hands on and practical and requires no math or experience with deep learning.

Instructor's Bio

Stacey Svetlichnaya is a deep learning engineer at Weights & Biases in San Francisco, CA, helping develop effective tools and patterns for deep learning. She was previously a senior research engineer with Yahoo Vision & Machine Learning, working on image aesthetic quality and style classification, object recognition, photo caption generation, and emoji modeling. She has worked extensively on Flickr image search and data pipelines, as well as automating content discovery and recommendation. Prior to Flickr, she helped build a visual similarity search engine with LookFlow, which Yahoo acquired in 2013. Stacey holds a BS ‘11 and MS ’12 in Symbolic Systems from Stanford University.

Stacey Svetlichnaya

Deep Learning Engineer at Weights & Biases

TRAINING: MACHINE LEARNING IN R PART I

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation and visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Professor at Columbia Business School

TRAINING: MACHINE LEARNING IN R PART II

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation and visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Professor at Columbia Business School

Training: Human-Centered Data Science - When the left brain meets the right brain

We will present two different dimensions of the practice of data science, specifically data storytelling (including data visualization) and data literacy. There will be short presentations, integrated with interactive sessions, group activities, and brief moments of brain and body exercise. The combination of these various activities is aimed at demonstrating and practicing the concepts being presented. The Data Literacy theme component will include a section on “data profiling – having a first date with your data”, focusing on getting acquainted with all the facets, characteristics, features (good and bad), and types of your data. This theme will also include a section on matching models to algorithms to data types to the questions being asked. The Data Storytelling theme component will include sections on the neuroscience of visual displays of evidence (visual analytics) for decision-making and include a component on user-centered design in data science. Design thinking, empathy, consultative practice, and the BI Dashboard Formula (BIDF) methodology will be emphasized. The combination of the two themes (data literacy and data storytelling) will be made more concrete through exercises in small breakout groups. Each group will be given a sample problem, then asked to take a data science approach (modeling, visualization, storytelling) to address the three fundamental questions that we should always consider in our projects: What? So what? Now what? The workshop participant will come away with design tips, tricks, and tools for better human-centered data science. The goal is for your next data science project and presentation to be your best ever. As Maya Angelou said so eloquently, “people will forget what you said, people will forget what you did, but people will never forget how you made them feel.” Make your data science matter by demonstrating why and how it matters.

Instructor's Bio

Dr. Kirk Borne is the Principal Data Scientist and an Executive Advisor at global technology and consulting firm Booz Allen Hamilton. In those roles, he focuses on applications of data science, data management, machine learning, A.I., and modeling across a wide variety of disciplines. He also provides training and mentoring to executives and data scientists within numerous external organizations, industries, agencies, and partners in the use of large data repositories and machine learning for discovery, decision support, and innovation. Previously, he was Professor of Astrophysics and Computational Science at George Mason University for 12 years where he did research, taught, and advised students in data science. Prior to that, Kirk spent nearly 20 years supporting data systems activities on NASA space science programs, which included a period as NASA’s Data Archive Project Scientist for the Hubble Space Telescope. Dr. Borne has a B.S. degree in Physics from LSU, and a Ph.D. in Astronomy from Caltech. In 2016 he was elected Fellow of the International Astrostatistics Association for his lifelong contributions to big data research in astronomy. As a global speaker, he has given hundreds of invited talks worldwide, including conference keynote presentations at many dozens of data science, A.I. and big data analytics events globally. He is an active contributor on social media, where he has been named consistently among the top worldwide influencers in big data and data science since 2013. He was recently identified as the #1 digital influencer worldwide for 2018-2019. You can follow him on Twitter at @KirkDBorne.

Dr. Kirk Borne

Principal Data Scientist at Booz Allen Hamilton

Training: Human-Centered Data Science - When the left brain meets the right brain

We will present two different dimensions of the practice of data science, specifically data storytelling (including data visualization) and data literacy. There will be short presentations, integrated with interactive sessions, group activities, and brief moments of brain and body exercise. The combination of these various activities is aimed at demonstrating and practicing the concepts being presented. The Data Literacy theme component will include a section on “data profiling – having a first date with your data”, focusing on getting acquainted with all the facets, characteristics, features (good and bad), and types of your data. This theme will also include a section on matching models to algorithms to data types to the questions being asked. The Data Storytelling theme component will include sections on the neuroscience of visual displays of evidence (visual analytics) for decision-making and include a component on user-centered design in data science. Design thinking, empathy, consultative practice, and the BI Dashboard Formula (BIDF) methodology will be emphasized. The combination of the two themes (data literacy and data storytelling) will be made more concrete through exercises in small breakout groups. Each group will be given a sample problem, then asked to take a data science approach (modeling, visualization, storytelling) to address the three fundamental questions that we should always consider in our projects: What? So what? Now what? The workshop participant will come away with design tips, tricks, and tools for better human-centered data science. The goal is for your next data science project and presentation to be your best ever. As Maya Angelou said so eloquently, “people will forget what you said, people will forget what you did, but people will never forget how you made them feel.” Make your data science matter by demonstrating why and how it matters.

Instructor's Bio

Mico Yuk (@micoyuk) is the founder of BI Brainz and the BI Dashboard Formula (BIDF) methodology, through which she has trained thousands globally on how to strategically use the power of data visualization to enhance the decision-making process. Her inventive approach fuses Enterprise Visual Storytelling with her proprietary BI Dashboard Formula methodology. Her ability to use data visualization strategically, develop analytics portfolios that business users love, and help organizations gain ROI from their Business Intelligence investments has been sought out by several high-profile Fortune 500 companies, including Shell, FedEx, Nestle, Qatargas, Ericsson, Procter & Gamble, and Kimberly-Clark. She has also authored Data Visualization for Dummies (Wiley, 2014).
Her ‘blunt’ Twitter comments and blogs have been mentioned on tech websites. Since 2010 she has been a sought-after global keynote speaker and trainer, and she was named one of the Top 50 Analytics Bloggers to follow by SAP. Her featured keynotes include the Microsoft PASS Business Analytics Conference, MasteringSAP BI, Saloon BI, BOAK, and Big Data World in London, to name a few. This year she is a much-anticipated keynote speaker at the first Facebook Women in Analytics Conference at the company’s headquarters in Menlo Park. In June she will return to the Real Business Intelligence Conference at MIT as a featured speaker for the second year in a row.

Mico Yuk

CEO and Founder at BI Brainz and the BI Dashboard Formula

Training: Tensorflow 2.0 and Keras: what's new, what's shared, what's different

TensorFlow 2.0 makes Keras the default API for model definition. This is a big change: it makes TensorFlow more accessible to beginners and newcomers, and it also disrupts consolidated patterns and habits for experienced TensorFlow programmers. This workshop is aimed at both audiences and covers how to define models in TensorFlow 2.0 using the tf.keras API. It also covers the commonalities and differences between the open source Keras package and tf.keras, explaining the pros and cons of each of the two. If you are getting started with TensorFlow or you’re puzzled by the changes in TensorFlow 2.0, come and learn how easy it is to design models using Keras.
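
As a minimal, hedged sketch of what model definition looks like with tf.keras in TensorFlow 2.0 (layer sizes and input shape are illustrative):

```python
# Defining a model with tf.keras in TensorFlow 2.0 (minimal illustrative sketch)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# With the standalone Keras package the definition looks the same,
# except the imports come from `keras` rather than `tensorflow.keras`.
```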

Instructor's Bio

Francesco Mosconi holds a Ph.D. in Physics and is CEO & Chief Data Scientist at Catalit Data Science. With Catalit, Francesco helps Fortune 500 companies up-skill in Machine Learning and Deep Learning through intensive training programs and strategic advisory. Author of the Zero to Deep Learning book and bootcamp, he is also an instructor at Udemy and Cloud Academy. He was formerly co-founder and Chief Data Officer at Spire, a YC-backed company that invented the first consumer wearable device capable of continuously tracking respiration and physical activity. A Machine Learning and Python expert, he has also served as Data Science lead instructor at General Assembly and The Data Incubator.

Francesco Mosconi, PhD

Data Scientist at Catalit

Training: Programming with Data: Python and Pandas

Whether in R, MATLAB, Stata, or Python, modern data analysis, for many researchers, requires some kind of programming. The preponderance of tools and specialized languages for data analysis suggests that general purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.

In this workshop, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality, specifically, loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.
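
As a hedged preview of that core functionality, the sketch below loads, filters, groups, and transforms a hypothetical sales.csv with made-up column names; it is illustrative rather than workshop material.

```python
# Core Pandas operations sketch: load, filter, group, transform (hypothetical file and columns)
import pandas as pd

df = pd.read_csv("sales.csv")                                # load
recent = df[df["year"] >= 2018]                              # filter rows
by_region = recent.groupby("region")["revenue"].sum()        # group and aggregate
df["revenue_zscore"] = (
    df.groupby("region")["revenue"].transform(lambda s: (s - s.mean()) / s.std())
)                                                            # transform within groups
print(by_region.head())
```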

Instructor's Bio

Daniel Gerlanc has worked as a data scientist for more than a decade and written software professionally for 15 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors. At Enplus, he works with clients on data science and custom software development, with a particular focus on projects requiring expertise in both areas. He teaches data science and software development at introductory through advanced levels. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups.

Daniel Gerlanc

President at Enplus Advisors Inc.

Training: Modern and Old Reinforcement Learning

Reinforcement Learning recently progressed greatly in industry as one of the best techniques for sequential decision making and control policies.

DeepMind used RL to greatly reduce energy consumption in Google’s data centre. It has been used for text summarisation, autonomous driving, dialog systems, media advertisement, and in finance by JPMorgan Chase. We are at the very beginning of the adoption of these algorithms, as systems are required to operate more and more autonomously.
In this workshop we will explore Reinforcement Learning, starting from its fundamentals and ending with creating our own algorithms.

We will use OpenAI Gym to try out our RL algorithms. OpenAI is a non-profit organisation committed to open-sourcing all of its research on Artificial Intelligence. To foster innovation, OpenAI created a virtual environment, OpenAI Gym, where it is easy to test Reinforcement Learning algorithms.

In particular, we will start with some popular techniques like the Multi-Armed Bandit, going through Markov Decision Processes and Dynamic Programming.

We will then also explore other RL frameworks and more complex concepts like policy gradient methods and Deep Reinforcement Learning, which recently changed the field of Reinforcement Learning. In particular, we will see Actor-Critic models and Proximal Policy Optimization, which allowed OpenAI to beat some of the best Dota players.

We will also provide the necessary Deep Learning concepts for the course.
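
To make the policy-gradient idea concrete, here is a toy REINFORCE-style update for a two-armed bandit in plain NumPy; it is an illustrative sketch under made-up reward probabilities and learning rate, not the workshop's Gym-based code.

```python
# Toy policy-gradient (REINFORCE-style) update on a 2-armed bandit in plain NumPy (illustrative)
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.8])        # hidden reward probabilities of the two arms
theta = np.zeros(2)                      # policy parameters: one logit per arm
alpha = 0.1                              # learning rate

for t in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy over arms
    arm = rng.choice(2, p=probs)
    reward = float(rng.random() < true_means[arm])
    grad = -probs
    grad[arm] += 1.0                     # d log pi(arm) / d theta for a softmax policy
    theta += alpha * reward * grad       # REINFORCE update (no baseline)

print("learned action probabilities:", np.exp(theta) / np.exp(theta).sum())
```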

Instructor's Bio

Leonardo De Marchi holds a Master’s in Artificial Intelligence and has worked as a Data Scientist in the sports world, with clients such as the New York Knicks and Manchester United, and with large social networks like JustGiving.
He now works as Lead Data Scientist at Badoo, the largest dating site with over 360 million users. He is also the lead instructor at ideai.io, a company specialized in Deep Learning and Machine Learning training, and a contractor for the European Commission.

Leonardo De Marchi

Head of Data Science and Analytics at Badoo

Training: AI for Executives

Gain insight into how to drive success in data science. Identify key points in the machine learning life cycle where executive oversight really matters. Learn effective methods to help your team deliver better predictive models, faster. You’ll leave this seminar able to identify business challenges well suited for machine learning, with fully defined predictive analytics projects your team can implement now to improve operational results.

Instructor's Bio

John Boersma is Director of Education for DataRobot. In this role he oversees the company’s client training operations and relations with academic institutions using DataRobot in analytics courses. Previously, John founded and led Adapt Courseware, an adaptive online college curriculum venture. John holds a PhD in computational particle physics and an MBA in general management.

John Boersma, PhD

Director of Education at DataRobot

Training: Introduction to Data Science

Curious about Data Science? Self-taught on some aspects, but missing the big picture? Well you’ve got to start somewhere and this session is the place to do it. This session will cover, at a layman’s level, some of the basic concepts of data science. In a conversational format, we will discuss: What are the differences between Big Data and Data Science – and why aren’t they the same thing? What distinguishes descriptive, predictive, and prescriptive analytics? What purpose do predictive models serve in a practical context? What kinds of models are there and what do they tell us? What is the difference between supervised and unsupervised learning? What are some common pitfalls that turn good ideas into bad science? During this session, attendees will learn the difference between k-nearest neighbor and k-means clustering, understand the reasons why we do normalize and don’t overfit, and grasp the meaning of No Free Lunch.

Instructor's Bio

Todd is a Data Science Evangelist at DataRobot. For more than 20 years, Todd has been highly respected as both a technologist and a trainer. As a tech, he has seen that world from many perspectives: “data guy” and developer, architect, analyst, and consultant. As a trainer, he has designed and covered subject matter from operating systems to end-user applications, with an emphasis on data and programming. As a strong advocate for knowledge sharing, he combines his experience in technology and education to impart real-world use cases to students and users of analytics solutions across multiple industries. He is a regular contributor to the community of analytics and technology user groups in the Boston area, writes and teaches on many topics, and looks forward to the next time he can strap on a dive mask and get wet.

Todd Cioffi

Data Science Evangelist at DataRobot

Training: Time Series Analysis: From Introduction to Advanced Topics

Time series analysis is both a fascinating subject to study and an important set of techniques that enjoy a wide range of applications in industry, government, and academic settings. Use cases include inventory management, capacity planning, marketing strategy design, capital budgeting, pricing, macroeconomic forecasting, and supply chain forecasting.

A common aspect to all of these applications is the use of forecasting, and time series forecasting requires time series data that is ubiquitous nowadays: weekly initial unemployment claims, product-level hourly sales, tick-level stock prices, daily term structure of interest rates, quarterly company earnings, daily number of steps taken recorded by a wearable, machine performance measurements recorded by sensors, and key performance indicators of business functions, just to name a few.

Time series data differs from cross-sectional data in that time series data has temporal dependence, which can be leveraged to forecast future values of the series. Some of the most important and commonly used data science techniques for analyzing time series data and making forecasts based on them are those developed in the fields of statistics and machine learning. For this reason, time series statistical and machine learning models should be included in any data scientist’s toolkit.

This workshop teaches the application of two important classes of time series statistical models (the Autoregressive Integrated Moving Average model and the Vector Autoregressive model) and an important set of neural network-based algorithms (recurrent neural networks) to time series forecasting. The attendees will learn the mathematical formulation, Python implementation, and the advantages and disadvantages of these techniques in time series analysis. Jupyter notebooks with examples and sample code will be provided for attendees to follow along and experiment with these techniques.

This workshop is divided into four parts, and a 15-minute break will be given between Part II and Part III.

Part I starts with the introduction to time series analysis, which includes the formulation of the time series problem, basic terminology, essential concepts, and the steps to analyze time series data.

Part II discusses the class of Autoregressive Integrated Moving Average (ARIMA) models: mathematical formulation, lag operator representation, model estimation, model diagnostics, model identification, model selection, assumption testing, statistical inference, and forecasting for stationary series, non-stationary series, and series with seasonality.

Part III studies the Vector Autoregressive (VAR) model, an important class of multivariate time series models. As in Part II, we will cover mathematical formulation, lag operator representation, model estimation, model diagnostics, model identification, model selection, assumption testing, statistical inference, and forecasting for stationary series, non-stationary series, and series with seasonality.

Part IV introduces the application of recurrent neural networks to time series forecasting, covering the issues of using the basic feedforward network for modeling time series data, the various forms of recurrent neural networks, and the implementation in Keras.

The workshop concludes with a comparison of the various methods for time series analysis.

To fully appreciate the topics covered in this workshop and follow along with the examples, the attendees should have the following background:

1. Strong understanding of classical linear regression modeling
2. Working knowledge of Python
3. Basic understanding of neural network-based modeling
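
As a hedged illustration of the ARIMA modeling covered in Part II, the sketch below fits an ARIMA(1,1,1) to a synthetic series and produces a 12-step forecast. It assumes a recent version of statsmodels (which exposes ARIMA under statsmodels.tsa.arima.model); the data and model order are invented for the example.

```python
# ARIMA fitting-and-forecasting sketch with statsmodels on a synthetic series (illustrative)
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.1, 1.0, 200)))   # random-walk-with-drift stand-in series

model = ARIMA(y, order=(1, 1, 1))        # p=1, d=1, q=1 chosen only for illustration
result = model.fit()
print(result.summary())                  # estimates, diagnostics, information criteria
forecast = result.forecast(steps=12)     # 12-step-ahead forecast
print(forecast)
```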

Instructor's Bio

Jeffrey is the Chief Data Scientist at AllianceBernstein, a global investment firm managing over $500 billion. He is responsible for building and leading the data science group, partnering with investment professionals to create investment signals using data science, and collaborating with sales and marketing teams to analyze clients. He holds a Ph.D. in economics from the University of Pennsylvania and has taught statistics, econometrics, and machine learning courses at UC Berkeley, Cornell, NYU, the University of Pennsylvania, and Virginia Tech. Previously, Jeffrey held advanced analytic positions at Silicon Valley Data Science, Charles Schwab Corporation, KPMG, and Moody’s Analytics.

Jeffrey Yau, PhD

Chief Data Scientist at AllianceBernstein

Training: Building AI-based Emotion Detectors in Images and Text - A Hands-on Approach

Deep Learning has become ubiquitous in everyday software applications and services. A solid understanding of DL foundational principles is necessary for researchers and modern-day engineers alike to successfully adapt the state of the art research in DL to business applications.

In this workshop, we will cover the basics of Deep Learning and what it can and cannot do. We will look at the applications where Deep Learning has achieved state-of-the-art results, namely images and text.
The session will be a hands-on lab where attendees will use Apache MXNet to build an emotion detector for images; we will cover the basics of Convolutional Neural Networks applied to Computer Vision problems as we build the model.

The attendees will also build a model that detects emotions (sentiments) in text data; we will cover the basics of Recurrent Neural Networks, which are widely used to solve Natural Language Processing problems.

The attendees will learn how to apply state-of-the-art research to their own applications, along with best practices, tips, and tricks used by practitioners.

Instructor's Bio

Naveen is a Senior Software Engineer and a member of Amazon AI at AWS, where he works on Apache MXNet. He began his career building large-scale distributed systems and has spent the last 10+ years designing and developing them. He has delivered various tech talks at AMLC, Spark Summit, and ApacheCon, and loves to share knowledge. His current focus is to make Deep Learning easily accessible to software developers without the need for a steep learning curve. In his spare time, he loves to read books, spend time with his family, and watch his little girl grow.

Naveen Swamy

Software Developer at Amazon AI – AWS

Training: A Deeper Stack For Deep Learning: Adding Visualisations And Data Abstractions to Your Workflow

In this training session I introduce a new layer of Python software, called ConX, which sits on top of Keras, which in turn sits on a backend (like TensorFlow). Do we really need a deeper stack of software for deep learning? Backends, like TensorFlow, can be thought of as “assembly language” for deep learning. Keras helps, but is more like “C++” for deep learning. ConX is designed to be “Python” for deep learning. So, yes, this layer is needed.

ConX is a carefully designed library that includes tools for network, weight, and activation visualizations; data and network abstractions; and an intuitive interactive and programming interface. Especially developed for the Jupyter notebook, ConX enhances the workflow of designing and training artificial neural networks by providing interactive visual feedback early in the process, and reducing cognitive load in developing complex networks.

This session will start small and move to advanced recurrent networks for images, text, and other data. Participants are encouraged to have samples of their own data so that they can explore a real and meaningful project.

A basic understanding of Python and a laptop is all that is required. Many example deep learning models will be provided in the form of Jupyter notebooks.

Documentation: https://conx.readthedocs.io/en/latest/

Instructor's Bio

Doug Blank is now a Senior Software Engineer at Comet.ML, a start-up in New York City. Comet.ML helps data scientists and engineers track, manage, replicate, and analyze machine learning experiments.

Doug was a professor of Computer Science for 18 years at Bryn Mawr College, a small, all-women’s liberal arts college outside of Philadelphia. He has been working on artificial neural networks for almost 30 years. His focus has been on creating models to make analogies, and for use with robot control systems. He is one of the core developers of ConX.

Douglas Blank, PhD

Senior Software Engineer | Comet.ML

Training: Integrating Pandas with Scikit-Learn, an Exciting New Workflow

For Python data scientists, a typical workflow consists of using Pandas for exploratory data analysis before turning to Scikit-Learn for machine learning. Pandas and Scikit-Learn arose independently, each focusing on their specific tasks, and were never specifically designed to be integrated together. There was never a clearly defined and standardized process for transitioning between the two libraries. This lack of a concrete handoff led practitioners to create a variety of markedly different workflows to make this transition.

One of the main hurdles facing the Pandas-to-Scikit-Learn transition was the handling of string columns: Scikit-Learn’s machine learning models only accept numeric arrays as input. The common scenario of taking a Pandas DataFrame with string columns and converting it to an array of only numeric values was quite painful. Yet another hurdle was processing separate groupings of columns with separate functions.

With the recent release of Scikit-Learn version 0.20, many workflows will start looking similar. The brand new ColumnTransformer allows for direct Pandas integration to Scikit-Learn. It applies separate transformations to specific subsets of columns. The upgraded OneHotEncoder standardizes the encoding of string columns. Before, it only encoded columns containing numeric categorical data.

In this hands-on tutorial, we will use these new additions to Scikit-Learn to build a modern, robust, and efficient workflow for those starting from a Pandas DataFrame. There will be ample practice problems and detailed notes available so that you can use it immediately upon completion.
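
A minimal, hedged sketch of the workflow this tutorial targets, using ColumnTransformer and OneHotEncoder from scikit-learn 0.20+; the DataFrame and column names are invented for illustration.

```python
# ColumnTransformer workflow sketch (scikit-learn >= 0.20; invented column names)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "city": ["Houston", "Austin", "Houston", "Dallas"],
    "income": [52_000, 61_000, 47_000, 75_000],
    "bought": [0, 1, 0, 1],
})

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),   # encode string column
    ("numeric", StandardScaler(), ["income"]),                           # scale numeric column
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[["city", "income"]], df["bought"])
```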

Instructor's Bio

Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.

Ted Petrou

Founder at Dunder Data

Training: Data Visualization with R Shiny

Shiny—an innovative package for developing web applications in R—makes it easier for R users to share results from their analyses visually with those not familiar with R.
I will offer an overview of the key ideas that will help you build simple yet robust Shiny applications and walk you through building data visualizations using the R Shiny web framework. You’ll learn how to use R to prepare data, run simple analyses, and display the results in Shiny web applications as you get hands-on experience creating effective and efficient data visualizations. Along the way, I will share best practices to make these applications suitable for production deployment.
Topics include:
– R basics for data preparation, analysis, and visualization
– The structure of a Shiny app
– Interactive elements and reactivity
– Customizing the user interface with HTML and CSS
– Best practices for Shiny app production deployment
– Shiny dashboards, R Markdown, and Shiny app sharing

Instructor's Bio

Alyssa Columbus is a Data Scientist at Pacific Life and a member of the Spring 2018 class of NASA Datanauts. Previously, she was a computational statistics and machine learning researcher at the Athena Breast Health Network and has built robust predictive models and applications for a diverse set of industries spanning retail to biologics. Alyssa is a strong proponent of reproducible methods, open source technologies, and diversity in tech. In her free time, she leads R-Ladies Irvine and Girl Scout STEM workshops.

Alyssa Columbus

Data Scientist at Pacific Life

Training: Good, Fast, Cheap: How to do Data Science with Missing Data

If you’ve never heard of the “good, fast, cheap” dilemma, it goes something like this: You can have something good and fast, but it won’t be cheap. You can have something good and cheap, but it won’t be fast. You can have something fast and cheap, but it won’t be good. In short, you can pick two of the three but you can’t have all three.

If you’ve done a data science problem before, I can all but guarantee that you’ve run into missing data. How do we handle it? Well, we can avoid, ignore, or try to account for missing data. The problem is, none of these strategies are good, fast, *and* cheap.

We’ll start by visualizing missing data and identifying the three different types of missing data, which will allow us to see how they affect whether we should avoid, ignore, or account for the missing data. We will walk through the advantages and disadvantages of each approach as well as how to visualize and implement each approach. We’ll wrap up with practical tips for working with missing data and recommendations for integrating it with your workflow!
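
As a small, hedged illustration of the "avoid vs. account for" choice, the sketch below inspects missingness in a made-up DataFrame and compares dropping rows with simple mean imputation; it is not the workshop's code.

```python
# Missing-data sketch: inspect missingness, then compare strategies (invented data)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [40_000, 52_000, np.nan, 61_000, 45_000],
})

print(df.isna().sum())                  # how much is missing per column

dropped = df.dropna()                   # "avoid": discard incomplete rows
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)                                       # "account for": fill with a simple estimate
print(dropped)
print(mean_imputed)
```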

Instructor's Bio

Matt currently leads instruction for GA’s Data Science Immersive in Washington, D.C. and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his Master’s degree in statistics from The Ohio State University. Matt is passionate about making data science more accessible and putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

Matt Brems

Global Lead Data Science Instructor at General Assembly

Training: Data Visualization: From Square One to Interactivity

As data scientists, we are expected to be experts in machine learning, programming, and statistics. However, our audiences might not be! Whether we’re working with peers in the office, trying to convince our bosses to take some sort of action, or communicating results to clients, there’s nothing more clear or compelling than an effective visual to make our point. Let’s leverage the Python libraries Matplotlib and Bokeh along with visual design principles to make our point as clearly and as compellingly as possible!

This talk is designed for a wide audience. If you haven’t worked with Matplotlib or Bokeh before or if you (like me!) don’t have a natural eye for visual design, that’s OK! This will be a hands-on training designed to make visualizations that best communicate what you want to communicate. We’ll cover different types of visualizations, how to generate them in Matplotlib, how to reduce clutter and guide your user’s eye, and how (and when!) to add interactivity with Bokeh.
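
As a hedged taste of the Matplotlib side, the sketch below draws a simple bar chart and strips some chart junk to guide the eye; the data and styling choices are invented for illustration.

```python
# Minimal Matplotlib sketch: a bar chart with some clutter removed (invented data)
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [23, 45, 12, 36]

fig, ax = plt.subplots()
ax.bar(categories, values, color="#4c72b0")
ax.set_title("Example results by category")
ax.spines["top"].set_visible(False)      # remove chart junk to guide the reader's eye
ax.spines["right"].set_visible(False)
plt.show()
```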

Instructor's Bio

Matt currently leads instruction for GA’s Data Science Immersive in Washington, D.C. and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his Master’s degree in statistics from The Ohio State University. Matt is passionate about making data science more accessible and putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

Matt Brems

Global Lead Data Science Instructor at General Assembly

TRAINING: Introduction to RMarkdown in Shiny

– Markdown Primer (45 minutes): structure documents with sections and subsections, format text, create ordered and unordered lists, make links, number sections, include a table of contents
– Integrate R Code (30 minutes): insert code chunks, hide code, set chunk options, draw plots, speed up code with caching
– Build RMarkdown Slideshows (20 minutes): understand slide structure, create sections, set background images, include speaker notes, open slides in speaker mode
– Develop Flexdashboards (30 minutes): start with the flexdashboard layout, design columns and rows, use multiple pages, create social sharing, include code
– Shiny: inputs (drop-downs, text, radio buttons, checkboxes), outputs (text, tables, plots), reactive expressions, HTML widgets, interactive plots, interactive maps, interactive tables, Shiny layouts, UI and server files, user interface

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Professor at Columbia Business School

TRAINING: Intermediate RMarkdown in Shiny

– Markdown Primer (45 minutes): structure documents with sections and subsections, format text, create ordered and unordered lists, make links, number sections, include a table of contents
– Integrate R Code (30 minutes): insert code chunks, hide code, set chunk options, draw plots, speed up code with caching
– Build RMarkdown Slideshows (20 minutes): understand slide structure, create sections, set background images, include speaker notes, open slides in speaker mode
– Develop Flexdashboards (30 minutes): start with the flexdashboard layout, design columns and rows, use multiple pages, create social sharing, include code
– Shiny: inputs (drop-downs, text, radio buttons, checkboxes), outputs (text, tables, plots), reactive expressions, HTML widgets, interactive plots, interactive maps, interactive tables, Shiny layouts, UI and server files, user interface

Instructor's Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Professor at Columbia Business School

Training: Advanced Machine Learning with scikit-learn Part I

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, advanced model evaluation, feature engineering, and working with imbalanced datasets. We will also work with text data using the bag-of-words method for classification.

This workshop assumes familiarity with Jupyter notebooks and the basics of pandas, matplotlib, and numpy. It also assumes some familiarity with the scikit-learn API and with how to do cross-validation and grid search in scikit-learn.
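
To make the prerequisites concrete, here is a minimal sketch (an illustration only, not the course materials) of the kind of pipeline the session builds on: bag-of-words features, a class-weighted classifier for imbalanced labels, and a small grid search, all on a toy corpus.

# Minimal sketch (not the course materials): a text-classification pipeline
# with bag-of-words features, class weighting for imbalance, and grid search.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

texts = ["great product", "terrible service", "loved it", "awful, never again"]
labels = [1, 0, 1, 0]  # toy labels; a real imbalanced set would have far more of one class

pipe = Pipeline([
    ("vect", CountVectorizer()),                                     # bag-of-words features
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

grid = GridSearchCV(pipe, {"vect__ngram_range": [(1, 1), (1, 2)]}, cv=2)
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)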

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Advanced Machine Learning with scikit-learn Part I

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, advanced model evaluation, feature engineering, and working with imbalanced datasets. We will also work with text data using the bag-of-words method for classification.

This workshop assumes familiarity with Jupyter notebooks and the basics of pandas, matplotlib, and numpy. It also assumes some familiarity with the scikit-learn API and with how to do cross-validation and grid search in scikit-learn.

Instructor's Bio

Thomas Fan is a Software Developer at Columbia University’s Data Science Institute. He collaborates with the scikit-learn community to develop features, review code, and resolve issues. In his free time, Thomas contributes to skorch, a scikit-learn compatible neural network library that wraps PyTorch.

Thomas Fan

Software Developer – Machine Learning at Columbia Data Science Institute

Training: Advanced Machine Learning with scikit-learn Part II

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some advanced topics in using scikit-learn, such as how to perform out-of-core learning with scikit-learn and how to speed up parameter search. We’ll also cover how to build your own models or feature extraction methods that are compatible with scikit-learn, which is important for feature extraction in many domains. We will see how we can customize scikit-learn even further, using custom methods for cross-validation or model evaluation.

This workshop assumes familiarity with Jupyter notebooks and basics of pandas, matplotlib and numpy. It also assumes experience using scikit-learn and familiarity with the API.
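
As a flavor of the customization topics, below is a minimal sketch (an illustration, not course code) of a custom transformer that follows the scikit-learn estimator API so it can sit inside a Pipeline.

# A custom, scikit-learn-compatible transformer: implement fit/transform and
# inherit from BaseEstimator and TransformerMixin so it plays well in Pipelines.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

class LogPlusOne(BaseEstimator, TransformerMixin):
    """Apply log(1 + x) feature-wise; stateless, so fit just returns self."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.log1p(X)

X = np.abs(np.random.RandomState(0).randn(20, 3))
y = np.random.RandomState(1).randn(20)
model = Pipeline([("log", LogPlusOne()), ("ridge", Ridge())]).fit(X, y)
print(model.predict(X[:3]))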

Instructor's Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Andreas Mueller, PhD

Author, Research Scientist, Core Contributor of scikit-learn at Columbia Data Science Institute

Training: Advanced Machine Learning with scikit-learn Part II

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This training will cover some advanced topics in using scikit-learn, such as how to perform out-of-core learning with scikit-learn and how to speed up parameter search. We’ll also cover how to build your own models or feature extraction methods that are compatible with scikit-learn, which is important for feature extraction in many domains. We will see how we can customize scikit-learn even further, using custom methods for cross-validation or model evaluation.

This workshop assumes familiarity with Jupyter notebooks and basics of pandas, matplotlib and numpy. It also assumes experience using scikit-learn and familiarity with the API.

Instructor's Bio

Thomas Fan is a Software Developer at Columbia University’s Data Science Institute. He collaborates with the scikit-learn community to develop features, review code, and resolve issues. In his free time, Thomas contributes to skorch, a scikit-learn compatible neural network library that wraps PyTorch.

Thomas Fan

Software Developer – Machine Learning at Columbia Data Science Institute

Training: Artificial Intelligence in Finance

Artificial Intelligence (AI) is about to reshape finance and the financial industry. Many decisions in the industry are already made by algorithms, such as in stock trading, credit scoring, etc. However, most of these applications do not harness the capabilities of recent advances in the field of AI.

Today’s programmatic availability of essentially all historical and real-time financial data, in combination with ever more powerful compute infrastructures, facilitates the application of even the most advanced and compute-intensive AI algorithms to financial problems. In that sense, finance is already data-driven to a large extent, and it will become an AI-first discipline in the near future.

The workshop provides some introductory background on AI in finance. It then proceeds to introduce and apply different machine learning and deep learning algorithms to financial problems. The focus lies on classification algorithms applied to the algorithmic trading of financial instruments. More specifically, the AI algorithms are used to create directional predictions about the future movements of financial prices.

The workshop uses Python and standard packages such as NumPy, pandas, scikit-learn, Keras/TensorFlow and matplotlib. Most of the coding will be presented based on Jupyter Notebooks.
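
For orientation, here is an illustrative sketch (simulated prices, not the workshop notebooks) of the core idea: turning lagged returns into features and classifying the direction of the next move.

# Illustrative only: classify the direction of the next return from lagged
# returns, using made-up price data and a plain logistic regression.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))  # simulated price path
returns = np.log(prices).diff()

lags = 5
data = pd.concat({f"lag_{i}": returns.shift(i) for i in range(1, lags + 1)}, axis=1)
data["direction"] = (returns > 0).astype(int)   # 1 if the day's return is positive
data = data.dropna()

split = int(len(data) * 0.7)
train, test = data.iloc[:split], data.iloc[split:]
features = [c for c in data.columns if c.startswith("lag_")]

clf = LogisticRegression().fit(train[features], train["direction"])
print("out-of-sample hit rate:", clf.score(test[features], test["direction"]))

A real workshop example would add proper validation methodology, transaction costs, and richer features; this is only the skeleton of the approach.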

Instructor's Bio

Dr. Yves J. Hilpisch is founder and managing partner of The Python Quants, a group focusing on the use of open source technologies for financial data science, artificial intelligence, algorithmic trading, and computational finance. He is also founder and CEO of The AI Machine, a company focused on harnessing the power of artificial intelligence for algorithmic trading via a proprietary strategy execution platform. He is the author of Python for Finance (2nd ed., O’Reilly) and of two other books: Derivatives Analytics with Python (Wiley, 2015) as well as Listed Volatility and Variance Derivatives (Wiley, 2017). Yves lectures on computational finance at the CQF Program and on algorithmic trading at the EPAT Program. He is also the director of the first online training program leading to a University Certificate in Python for Algorithmic Trading. Yves wrote the financial analytics library DX Analytics and organizes meetups, conferences, and bootcamps about Python for quantitative finance and algorithmic trading in London, Frankfurt, Berlin, Paris, and New York. He has given keynote speeches at technology conferences in the United States, Europe, and Asia.

Yves Hilpisch, PhD

Founder and Managing Partner at Python Quants

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS

Deep learning is the newest area of machine learning and has become ubiquitous in predictive modeling. The complex, brain-like structure of deep learning models is used to find intricate patterns in large volumes of data. These models have greatly improved performance on general supervised learning tasks, time series forecasting, speech recognition, object detection and classification, and sentiment analysis.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. For example, factorization machines can be used to recommend your next Netflix binge based on how you and other streamers rate content.

In this session, participants will use recurrent neural networks to analyze sequential data and improve the forecast performance of time series data, and use convolutional neural networks for image classification. Participants will also use a genetic algorithm to efficiently tune the hyperparameters of both deep learning models. Finally, students will use factorization machines to model the relationship between movies and viewers to make recommendations.
Demonstrations are provided in both R and Python and will be run from a Jupyter notebook. Students will use the open source SWAT package (SAS Wrapper for Analytics Transfer) to access SAS CAS (Cloud Analytic Services) and take advantage of its in-memory distributed environment. CAS provides a fast and scalable environment for building complex models and analyzing big data, using algorithms designed for parallel processing.
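
For reference, a minimal connection sketch with the SWAT package is shown below; the host, port, credentials, and file name are placeholders, and loading the deep learning action set assumes a SAS Viya deployment where it is licensed.

# Sketch only: connect to CAS with SWAT, load an action set, and upload data.
# All connection details below are placeholders, not real endpoints.
import swat

conn = swat.CAS("cas-server.example.com", 5570, "username", "password")
conn.loadactionset("deepLearn")          # deep learning actions (assumes they are licensed)
tbl = conn.upload_file("train.csv", casout={"name": "train", "replace": True})
print(tbl.summary())                     # run a simple server-side summary on the CAS table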

Instructor's Bio

Ari holds bachelor’s degrees in both physics and mathematics from UNC-Chapel Hill. His research focused on collecting and analyzing low energy physics data to better understand the neutrino. Ari taught introductory and advanced physics and scientific programming courses at UC-Berkeley while working on a master’s in physics with a focus on nonlinear dynamics. While at SAS, Ari has worked to develop courses that teach how to use Python code to control SAS analytical procedures.

Ari Zitin

Sr. Analytical Training Consultant at SAS

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS

Deep learning is the newest area of machine learning and has become ubiquitous in predictive modeling. The complex, brain-like structure of deep learning models is used to find intricate patterns in large volumes of data. These models have greatly improved performance on general supervised learning tasks, time series forecasting, speech recognition, object detection and classification, and sentiment analysis.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. For example, factorization machines can be used to recommend your next Netflix binge based on how you and other streamers rate content.

In this session, participants will use recurrent neural networks to analyze sequential data and improve the forecast performance of time series data, and use convolutional neural networks for image classification. Participants will also use a genetic algorithm to efficiently tune the hyperparameters of both deep learning models. Finally, students will use factorization machines to model the relationship between movies and viewers to make recommendations.
Demonstrations are provided in both R and Python and will be run from a Jupyter notebook. Students will use the open source SWAT package (SAS Wrapper for Analytics Transfer) to access SAS CAS (Cloud Analytic Services) and take advantage of its in-memory distributed environment. CAS provides a fast and scalable environment for building complex models and analyzing big data, using algorithms designed for parallel processing.

Instructor's Bio

Coming Soon

Jordan Bakerman, PhD

Analytical Training Consultant at SAS

Training: Introduction to building a distributed neural network on Apache Spark with BigDL and Analytics Zoo

In this training session you will get hands-on experience developing neural networks using Intel BigDL and Analytics Zoo on Apache Spark. You will learn how to use Spark DataFrames and build deep learning pipelines by implementing practical examples.
Target Audience: AI developers and aspiring data scientists who are experienced in Python and Spark, as well as big data and analytics professionals interested in neural networks.

Prerequisites:
• Experience in Python programming
• Entry level knowledge of Apache Spark
• Basic knowledge of deep learning and techniques in deep learning

Training outline:

Introduction to Deep Learning on Spark, BigDL and Analytics Zoo – 25 minutes
We will begin with a brief introduction to Apache Spark and the Machine Learning/Deep Learning ecosystem around Spark. Then we will introduce Intel BigDL and Analytics Zoo, two deep learning libraries for Apache Spark. We will go into the architectural details of how distributed training happens in BigDL. We will cover the model training process, including how the model, weights and gradients are distributed, calculated, updated and shared with Apache Spark.

Setting Up Sample Environment – 10 minutes
The instructors will highlight the major components of our demonstration environment, including the dataset, docker container and example code along with the public location of these resources and how to set them up.

Exercise 1 – Quick and simple image recognition use case with BigDL – 45 minutes
We will work through a simple image recognition use case that trains a CNN. The goal of this exercise is to provide a simple introduction to using BigDL with image datasets (a short Spark sketch appears after this outline). Participants will get exposure to:
• How to read images into Spark DataFrames
• Building transformation pipelines for images with Spark
• How to train a deep learning model using estimators

Exercise 2 – Transfer Learning for Image Classification Models – 45 minutes
Participants will get exposure to:
• How to build a pipeline in Spark to preprocess images
• How to import a trained model from other frameworks such as TensorFlow
• How to implement transfer learning on the imported model with the preprocessed images

Quick break: Answer questions or help out anyone who is having trouble – 10 minutes

Exercise 3 – Anomaly Detection or Recommendation system with Intel Analytics Zoo – 30 minutes
In this exercise we will show participants:
• How to build an initial pipeline for feature transformation
• How to build a recommendation model in BigDL/Analytics Zoo
• How to perform training and inference for this use case

Exercise 4 – Model Serving – 15 minutes
In this exercise we will show participants how to build an end-to-end pipeline and put their model into production. They will get exposure to:
• Model serving using POJO API
• Integration into web services and streaming services like Kafka for model inference
• Distributed model inference

Practical Knowledge – Discussion of practical experience using Spark and Hadoop for machine learning and deep learning projects – 15 minutes
We will have a discussion on the following topics:
• Spark parameters and how to set them: how to allocate the right number of executors, cores, and memory
• Performance Monitoring
• TensorBoard with BigDL
• Collaboration and reproducing experiments with a data science workbench tool.

Wrapping up / Questions – 15 minutes
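
As a rough preview of Exercise 1 (a sketch under assumptions, not the session’s BigDL code), the snippet below reads an image folder into a Spark DataFrame, which BigDL/Analytics Zoo pipelines can then consume; the directory path is a placeholder.

# Spark 2.4+ ships an "image" data source that yields a struct column with
# origin, height, width, nChannels, mode, and raw data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-demo").getOrCreate()

images = spark.read.format("image").load("/data/images/")   # placeholder path
images.select("image.origin", "image.height", "image.width").show(5, truncate=False)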

Instructor's Bio

Bala Chandrasekaran is a Technical Staff Engineer at Dell Technologies, where he is responsible for building machine learning and deep learning infrastructure solutions. He has over 15 years of experience in the areas of high performance computing, virtualization infrastructure, cloud computing and big data.

Bala Chandrasekaran

Technical Staff at Dell Technologies

Training: Introduction to building a distributed neural network on Apache Spark with BigDL and Analytics Zoo

In this training session you will get hands-on experience developing neural networks using Intel BigDL and Analytics Zoo on Apache Spark. You will learn how to use Spark DataFrames and build deep learning pipelines by implementing practical examples.
Target Audience: AI developers and aspiring data scientists who are experienced in Python and Spark, as well as big data and analytics professionals interested in neural networks.

Prerequisites:
• Experience in Python programming
• Entry level knowledge of Apache Spark
• Basic knowledge of deep learning and techniques in deep learning

Training outline:

Introduction to Deep Learning on Spark, BigDL and Analytics Zoo – 25 minutes
We will begin with a brief introduction to Apache Spark and the Machine Learning/Deep Learning ecosystem around Spark. Then we will introduce Intel BigDL and Analytics Zoo, two deep learning libraries for Apache Spark. We will go into the architectural details of how distributed training happens in BigDL. We will cover the model training process, including how the model, weights and gradients are distributed, calculated, updated and shared with Apache Spark.

Setting Up Sample Environment – 10 minutes
The instructors will highlight the major components of our demonstration environment, including the dataset, docker container and example code along with the public location of these resources and how to set them up.

Exercise 1 – Quick and simple image recognition use case with BigDL – 45 minutes
We will work through a simple image recognition use case that trains a CNN. The goal of this exercise is to provide a simple introduction to using BigDL with image datasets. Participants will get exposure to:
• How to read images into Spark DataFrames
• Building transformation pipelines for images with Spark
• How to train a deep learning model using estimators

Exercise 2 – Transfer Learning for Image Classification Models – 45 minutes
Participants will get exposure to:
• How to build a pipeline in Spark to preprocess images
• How to import a trained model from other frameworks such as TensorFlow
• How to implement transfer learning on the imported model with the preprocessed images

Quick break: Answer questions or help out anyone who is having trouble – 10 minutes

Exercise 3 – Anomaly Detection or Recommendation system with Intel Analytics Zoo – 30 minutes
In this exercise we will show participants:
• How to build an initial pipeline for feature transformation
• How to build a recommendation model in BigDL/Analytics Zoo
• How to perform training and inference for this use case

Exercise 4 – Model Serving – 15 minutes
In this exercise we will show participants how to build an end-to-end pipeline and put their model into production. They will get exposure to:
• Model serving using POJO API
• Integration into web services and streaming services like Kafka for model inference
• Distributed model inference

Practical Knowledge – Discussion of practical experience using Spark and Hadoop for machine learning and deep learning projects – 15 minutes
We will have a discussion on the following topics:
• Spark parameters and how to set them: how to allocate the right number of executors, cores, and memory
• Performance Monitoring
• TensorBoard with BigDL
• Collaboration and reproducing experiments with a data science workbench tool.

Wrapping up / Questions – 15 minutes

Instructor's Bio

Andrew is a data scientist at Dell where he explores how machine learning and deep learning techniques are used in Spark. His experience includes time series analysis and prediction of pharmaceutical drug sales and usage, real estate valuation using machine learning, and medical data classification using deep learning. Andrew’s interests involve applying machine learning and deep learning to solve new problems and improve old solutions.

Andrew Kipp

Data Scientist at Dell Technologies

Training: Introduction to building a distributed neural network on Apache Spark with BigDL and Analytics Zoo

In this training session you will get hands-on experience developing neural networks using Intel BigDL and Analytics Zoo on Apache Spark. You will learn how to use Spark DataFrames and build deep learning pipelines by implementing practical examples.
Target Audience: AI developers and aspiring data scientists who are experienced in Python and Spark, as well as big data and analytics professionals interested in neural networks.

Prerequisites:
• Experience in Python programming
• Entry level knowledge of Apache Spark
• Basic knowledge of deep learning and techniques in deep learning

Training outline:

Introduction to Deep Learning on Spark, BigDL and Analytics Zoo – 25 minutes
We will begin with a brief introduction to Apache Spark and the Machine Learning/Deep Learning ecosystem around Spark. Then we will introduce Intel BigDL and Analytics Zoo, two deep learning libraries for Apache Spark. We will go into the architectural details of how distributed training happens in BigDL. We will cover the model training process, including how the model, weights and gradients are distributed, calculated, updated and shared with Apache Spark.

Setting Up Sample Environment – 10 minutes
The instructors will highlight the major components of our demonstration environment, including the dataset, docker container and example code along with the public location of these resources and how to set them up.

Exercise 1 – Quick and simple image recognition use case with BigDL – 45 minutes
We will work through a simple image recognition use case that trains a CNN. The goal of this exercise is to provide a simple introduction to using BigDL with image datasets. Participants will get exposure to:
• How to read images into Spark DataFrames
• Building transformation pipelines for images with Spark
• How to train a deep learning model using estimators

Exercise 2 – Transfer Learning for Image Classification Models – 45 minutes
Participants will get exposure to:
• How to build a pipeline in Spark to preprocess images
• How to import a trained model from other frameworks such as TensorFlow
• How to implement transfer learning on the imported model with the preprocessed images

Quick break: Answer questions or help out anyone who is having trouble – 10 minutes

Exercise 3 – Anomaly Detection or Recommendation system with Intel Analytics Zoo – 30 minutes
In this exercise we will show participants:
• How to build an initial pipeline for feature transformation
• How to build a recommendation model in BigDL/Analytics Zoo
• How to perform training and inference for this use case

Exercise 4 – Model Serving – 15 minutes
In this exercise we will show participants how to build an end-to-end pipeline and put their model into production. They will get exposure to:
• Model serving using POJO API
• Integration into web services and streaming services like Kafka for model inference
• Distributed model inference

Practical Knowledge – Discussion of practical experience using Spark and Hadoop for machine learning and deep learning projects – 15 minutes
We will have a discussion on the following topics:
• Spark parameters and how to set them: how to allocate the right number of executors, cores, and memory
• Performance Monitoring
• TensorBoard with BigDL
• Collaboration and reproducing experiments with a data science workbench tool.

Wrapping up / Questions – 15 minutes

Instructor's Bio

Yuhao Yang is a senior software engineer on the Intel Big Data team, focusing on deep learning algorithms and applications. His area of focus is distributed deep learning and machine learning, and he has accumulated rich solution experience in fraud detection, recommendation, speech recognition, visual perception, and more. He is also an active contributor to Apache Spark MLlib (GitHub: hhbyyh).

Yuhao Yang

Senior Software Engineer at Intel

Training: Building Generative Adversarial Networks in Tensorflow and Keras

Generative Adversarial Networks are a promising modern application of Deep Learning that allows models to *generate* examples. However, GANs are complex, difficult to tune, and limited to small examples. We will explore recent GAN progress with a model that generates faces conditional on desired features, like ‘smiling’ and ‘bangs’.

This workshop is designed for data scientists, researchers, and software developers familiar with Keras, TensorFlow, or similar recent deep learning tools. It is expected that most in the audience will be able to build models and begin to train them on a local machine; such students will not leave the tutorial with fully trained models. While students are not expected to have remote access to a machine configured with CUDA and tensorflow-gpu, the instructor will.
After attending, students in the target audience should be able to:
– Identify and explain the essential components of Generative Adversarial Networks, including Deep Convolutional versions
– Modify existing GAN implementations
– Design a GAN for a novel application
– Understand and explain recent improvements in GAN loss functions
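
For a sense of the moving parts, here is a compact sketch (toy data and toy networks, not the workshop’s conditional face model) of a generator, a discriminator, and one adversarial training step in Keras.

# Minimal GAN sketch: the generator maps noise to "images", the discriminator
# scores real vs. fake, and the combined model trains the generator to fool a
# frozen discriminator. Random arrays stand in for real images.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 32

generator = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(28 * 28, activation="sigmoid"),
    layers.Reshape((28, 28, 1)),
])

discriminator = keras.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

discriminator.trainable = False          # freeze it inside the combined model
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

real = np.random.rand(16, 28, 28, 1)                      # placeholder "real" images
noise = np.random.normal(size=(16, latent_dim))
fake = generator.predict(noise, verbose=0)

d_loss_real = discriminator.train_on_batch(real, np.ones((16, 1)))
d_loss_fake = discriminator.train_on_batch(fake, np.zeros((16, 1)))
g_loss = gan.train_on_batch(noise, np.ones((16, 1)))      # generator wants "real" labels
print(d_loss_real, d_loss_fake, g_loss)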

Instructor's Bio

Sophie is a Senior Data Scientist at Metis where she is a bootcamp instructor and leads curriculum development. Sophie works in deep learning and data science ethics. Through t4tech she helps provide free trans-centered classes in programming and data science. She holds master’s degrees in Electrical and Computer Engineering and Psychology, and her writing has appeared in Information Week. Sophie is passionate about teaching, both in theory and in practice, and about making sure that data science is primarily a tool that is used to improve people’s lives.

Sophie Searcy

Sr. Data Scientist at Metis

Training: Modeling Volatility Trading Using Econometrics and Machine Learning in Python

How can market volatility be predicted, and what are the differences between heuristic models, econometric models, and data science/machine learning models? This workshop distills lessons learned from econometric modeling in finance into a training course, with an example project that compares the performance of turbulence, GARCH, and blender algorithms. Particular focus is placed on framing the problem and using the right tools for volatility modeling. It is aimed at entry-level finance quants who want a refresher on Python techniques and at non-finance quants looking to make the leap into financial modeling.
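
As a taste of the econometric side, the sketch below fits a GARCH(1,1) model to simulated returns; note that the arch package is an assumption here and is not named in the session description.

# Fit a GARCH(1,1) model to placeholder returns and inspect fitted volatility.
import numpy as np
import pandas as pd
from arch import arch_model

rng = np.random.RandomState(0)
returns = pd.Series(rng.normal(0, 1, 1000))   # placeholder for real percentage returns

model = arch_model(returns, mean="Constant", vol="Garch", p=1, q=1)
result = model.fit(disp="off")
print(result.summary())
print(result.conditional_volatility.tail())   # the fitted volatility series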

Instructor's Bio

Stephen Lawrence is the Head of Investment Management Fintech Data Science at The Vanguard Group. He oversees the integration of new structured and unstructured data sources into the investment process, leveraging a blend of NLP and predictive analytics. Prior to joining Vanguard, Dr. Lawrence was Head of Quantextual Research at State Street Bank, where he led a machine learning product team. Before that, he led FX and macro flow research for State Street Global Markets. Stephen holds a B.A. in Mathematics from the University of Cambridge and a Ph.D. in Finance from Boston College. He is also a TED speaker with a 2015 talk titled “The future of reading: it’s fast”.

Stephen Lawrence, PhD

Head of Investment Management Fintech Data Science at Vanguard

Training: Modeling Volatility Trading Using Econometrics and Machine Learning in Python

How can market volatility be predicted, and what are the differences between heuristic models, econometric models, and data science/machine learning models? This workshop distills lessons learned from econometric modeling in finance into a training course, with an example project that compares the performance of turbulence, GARCH, and blender algorithms. Particular focus is placed on framing the problem and using the right tools for volatility modeling. It is aimed at entry-level finance quants who want a refresher on Python techniques and at non-finance quants looking to make the leap into financial modeling.

Instructor's Bio

Coming soon

Eunice Hameyie-Sanon

Sr. Data Scientist – Investment Management Fintech Strategies at Vanguard

Training: Engineering a Performant Machine Learning Pipeline: From Dask to Kubeflow

The lifecycle of any machine learning model, regular or deep, consists of (a) pre-processing/transforming/augmenting the data, (b) training the model with different hyperparameter values/learning rates, and (c) computing results on new data/test sets. Whether you are using transfer learning or a from-scratch model, this process requires a large amount of computation, management of your experimental process, and the quick perusal of results from your experiments. In this workshop, we will learn how to combine off-the-shelf clustering software such as Kubernetes and Dask with learning systems such as TensorFlow/PyTorch/scikit-learn, on cloud infrastructure such as AWS/Google Cloud/Azure, to construct a machine-learning system for your data science team. We’ll start with an understanding of Kubernetes, move on to analysis pipelines in sklearn and Dask, and finally arrive at Kubeflow. Participants should install minikube on their laptops (https://kubernetes.io/docs/tasks/tools/install-minikube/) and create accounts on Google Cloud.
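
As a small taste of one piece of that stack, the sketch below (assumptions only, not the workshop’s exact setup) runs a scikit-learn grid search on a Dask cluster through the joblib backend; on Kubernetes the Client would point at a deployed scheduler instead of a local cluster.

# Fan a scikit-learn hyperparameter search out to Dask workers via joblib.
import joblib
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()                        # local cluster; swap in the k8s scheduler address
X, y = load_digits(return_X_y=True)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.001]}, cv=3)
with joblib.parallel_backend("dask"):    # the CV fits run on the Dask workers
    grid.fit(X, y)
print(grid.best_params_)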

Instructor's Bio

Rahul Dave is a lecturer in Bayesian Statistics and Machine Learning at Harvard University and consults on the same topics at LxPrior. He holds a Ph.D. in Computational Astrophysics from the University of Pennsylvania and has programmed device drivers for telescopes, bespoke databases for astrophysical data, and machine learning systems in various fields. His new startup, univ.ai, helps students and companies upgrade the skills and understanding of both their developers and managers for this new AI-driven world, by providing both corporate training and consulting.

Dr. Rahul Dave

Chief Scientist at univ.ai, lxprior.com and Harvard University

Training: TFX: Production ML Pipelines with TensorFlow

Putting machine learning models into production is now mission critical for every business – no matter what size.

TensorFlow is the industry-leading platform for developing, modeling, and serving deep learning solutions. But putting together a complete pipeline for deploying and maintaining a production application of AI and deep learning is much more than training a model. Google has taken years of experience in developing production ML pipelines and offered the open source community TensorFlow Extended (TFX), an open source version of tools and libraries that Google uses internally.

Learn what’s involved in creating a production pipeline, and walk through working code in an example pipeline with experts from Google. You’ll be able to take what you learn and get started on creating your own pipelines for your applications.

Instructor's Bio

Coming soon!

Robert Crowe

Developer Advocate, TensorFlow at Google

Training: Beyond Word Embeddings: BERT, ELMo and ULMFiT NLP Models – A New Era in Neural Natural Language Processing

Big changes are underway in the world of Natural Language Processing (NLP). The long reign of word vectors as NLP’s core representation technique has seen an exciting new line of challengers emerge: ELMo, the OpenAI transformer, ULMFiT, Facebook’s PyText, and Google’s BERT.
In this talk, the audience will get a detailed understanding of the past, present, and future of deep learning in NLP. In addition, attendees will learn some of the current best practices for applying deep learning in NLP. Topics include: the rise of distributed representations (e.g., word2vec); convolutional, recurrent, and recursive neural networks; recent developments in unsupervised sentence representation learning; and combining deep learning models with memory-augmenting strategies. Our conceptual understanding of how best to represent words and sentences in a way that captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines; this talk will introduce them to the audience.
These works made headlines by demonstrating that pretrained language models can be used to achieve state-of-the-art results on a wide range of NLP tasks. Such methods herald a watershed moment: they may have the same wide-ranging impact on NLP as pretrained ImageNet models had on computer vision.
Language understanding is a challenge for computers. Subtle nuances of communication that human toddlers can understand still confuse the most powerful machines. Even though advanced techniques like deep learning can detect and replicate complex language patterns, machine learning models still lack fundamental conceptual understanding of what our words really mean.
Understanding context has broken down barriers that had prevented NLP techniques making headway before.

Tools: NLTK, spaCy, Google Colab, Pandas, Gensim, PolyGlot, scikit-learn, GloVe, Word2Vec, Word Embedding, WEVI, Google TensorFlow Projector, TensorFlow Keras

Languages: Python, R, Jupyter Notebook

Learning Outcomes:
Text mining and ways of extracting and reading data from some common file types, including NLTK corpora
Understand some ways of extracting and cleaning text using NLTK.
Analyze sentence structure, using groups of words to create phrases and sentences with NLP and the rules of English grammar
Explore text classification, vectorization techniques and processing using scikit-learn
Build a Machine Learning classifier for text classification
Word Embedding
Deep Learning Concepts
Language Modeling
New Era in Pretrained Natural Language Processing language models like Google BERT, Facebook PyText, ELMo etc.
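
Picking up the word-embedding outcome above, here is a tiny Gensim sketch (a made-up corpus; the vector_size argument assumes Gensim 4.x) of training word2vec and querying similar words.

# Train a toy word2vec model and look up nearest neighbors in embedding space.
from gensim.models import Word2Vec

corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "embeddings", "capture", "meaning"],
    ["language", "models", "predict", "the", "next", "word"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("language", topn=3))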

Instructor's Bio

Bhairav Mehta is a Senior Data Scientist with extensive professional experience and a strong academic background. He works for Apple Inc. as a Sr. Data Scientist.

Bhairav Mehta is an experienced engineer, business professional, and seasoned statistician/programmer with 19 years of combined progressive experience: data science in the consumer electronics industry (7 years at Apple Inc.), yield engineering in semiconductor manufacturing (6 years at Qualcomm and an MIT startup), and quality engineering in the automotive industry (OEMs, Tier 2 suppliers, and Ford Motor Company; 3 years). In 2014 Bhairav founded DataInquest Inc., a startup specializing in training and consulting in artificial intelligence, machine learning, blockchain, and data science.

Bhairav Mehta has an MBA from the Johnson School of Management at Cornell University, a master’s in computer science from Georgia Tech (expected 2018), a master’s in statistics from Cornell University, a master’s in industrial systems engineering from Rochester Institute of Technology, and a BS in production engineering from Mumbai University.

Bhairav Mehta

Data Science Manager at Apple

Training: From Stored Data to Data Stories: Building Data Narratives with Open-Source Tools

Literate computing weaves a narrative directly into an interactive computation. Text, code, and results are combined into a narrative that relies equally on textual explanations and computational components. Insights are extracted from data using computational tools and communicated in the form of a narrative that resonates with the audience. Literate computing lends itself to the practice of reproducible research: one may re-run the analyses, run them with new data sets, or modify the code for other purposes.
This workshop will take one through the steps associated with literate computing: data retrieval; data curation; model construction, evaluation, and selection; and reporting. Particular attention will be paid to reporting, i.e., building a narrative. Examples will be presented demonstrating how one might generate multiple output formats (e.g., HTML pages, presentation slides, PDF documents) starting with the same code base.
As a specific example, a data narrative will be built showing how one might build predictive models for the solubility of organic molecules. Reports will be presented as (1) an HTML file, (2) a PDF document (in a format acceptable for journal submission), and (3) a slide presentation.
While the workshop’s example comes from the field of cheminformatics, the computational tools used and the exercises presented are applicable to any field where an investigator is interested in building predictive models, and describing these models to colleagues and associates.
At the workshop’s conclusion attendees will have worked through exercises that may serve as templates to be used with their data as they build their data narratives.
The R and Python ecosystems will be used throughout. All data, code, and text will be made available.

Instructor's Bio

Paul Kowalczyk is a Senior Data Scientist at Solvay. There, Paul uses a variety of toolchains and machine learning workflows to visualize, analyze, mine, and report data; to generate actionable insights from data. Paul is particularly interested in democratizing data science, working to put data products into the hands of his colleagues. His experience includes using computational chemistry, cheminformatics, and data science in the biopharmaceutical and agrochemical industries. Paul received his PhD from Rensselaer Polytechnic Institute, and was a Postdoctoral Research Fellow with IBM’s Data Systems Division.

Paul J Kowalczyk, PhD

Senior Data Scientist at Solvay

Training: Understanding the PyTorch Framework with Applications to Deep Learning

Over the past couple of years, PyTorch has been increasing in popularity in the Deep Learning community. What was initially a tool for Deep Learning researchers has been making headway in industry settings.

In this session, we will cover how to create Deep Neural Networks using the PyTorch framework through a variety of examples. The material will range from beginner – understanding what is going on “under the hood”, coding the layers of our networks, and implementing backpropagation – to more advanced material on RNNs, CNNs, LSTMs, and GANs.

Attendees will leave with a better understanding of the PyTorch framework, in particular how it differs from Keras and TensorFlow. Furthermore, a link to a clean, documented GitHub repo with solutions to the examples covered will be provided.
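
In the spirit of the beginner portion of the session (an illustrative sketch, not the instructor’s repo), the snippet below defines a small network and runs an explicit training loop in which autograd handles backpropagation.

# A tiny regression network trained on random data with an explicit loop.
import torch
from torch import nn

X = torch.randn(256, 10)
y = torch.randn(256, 1)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass
    loss.backward()              # backpropagation via autograd
    optimizer.step()             # gradient descent update
print(float(loss))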

Instructor's Bio

Robert loves to break deep technical concepts down to be as simple as possible, but no simpler.

Robert has data science experience in companies both large and small. He is currently Head of Data Science for Podium Education, where he builds models to improve student outcomes, and an Adjunct Professor at Santa Clara University’s Leavey School of Business. Prior to Podium Education, he was a Senior Data Scientist at Metis teaching Data Science and Machine Learning. At Intel, he tackled problems in data center optimization using cluster analysis, enriched market sizing models by implementing sentiment analysis from social media feeds, and improved data-driven decision making in one of the top 5 global supply chains. At Tamr, he built models to unify large amounts of messy data across multiple silos for some of the largest corporations in the world. He earned a PhD in Applied Mathematics from Arizona State University where his research spanned image reconstruction, dynamical systems, mathematical epidemiology and oncology.

Robert Alvarez, PhD

Head of Data Science at Podium Education

Training: Hierarchical and Mixed-Effect Models in R

This course begins by explaining what a mixed-effect model is. Next, the course covers linear mixed-effect regressions. These powerful models will allow you to explore data with a more complicated structure than a standard linear regression. The course then teaches generalized linear mixed-effect regressions. Generalized linear mixed-effects models allow you to model more kinds of data, including binary responses and count data. Lastly, the course goes over repeated-measures analysis as a special case of mixed-effect modeling. This kind of data appears when subjects are followed over time and measurements are collected at intervals. Throughout the course you’ll work with real data to answer interesting questions using mixed-effects models.

Course outline (for a 3.5-hour course including a 15 minute break):
– Why and when to use mixed-effect models? 0.5 hr
– Linear mixed-effect regression with lmer from lme4 in R 1 hr
– Visualizing mixed-effects models with 0.5 hr
– Generalized mixed effect regression with glmer from lme4 in R 0.75 hr
– Where to from here? (conclusion and additional resources) 0.5 hr
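
The course itself works in R with lmer and glmer; purely for orientation, a loose Python analogue (an assumption, not course material) of a random-intercept model with statsmodels looks like this:

# Fit a mixed-effects model with a random intercept per subject on toy data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(0)
groups = np.repeat(np.arange(10), 20)                    # 10 subjects, 20 observations each
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=10)[groups] + rng.normal(size=200)
df = pd.DataFrame({"y": y, "x": x, "subject": groups})

model = smf.mixedlm("y ~ x", df, groups=df["subject"]).fit()
print(model.summary())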

Instructor's Bio

Coming soon!

Richard Erickson, PhD

Instructor, Research Quantitative Ecologist at DataCamp, U.S. Geological Survey (USGS)

Training: Data Driven Websites: Building Interactive Webpages using Bokeh & Flask

Our analyses are only as useful as they are seen and understood, which is why so many good data scientists talk about telling a story with data. You may find yourself in a position where you need to share your work with others publicly without the benefit of expensive dashboarding packages or a glitzy corporate website. With moderate Python development skills, you can turn your analyses into impressive public dashboards using Flask and Bokeh.

This hands-on workshop will take you from data to website, with multiple interactive charts and graphs on a two-page website built 100% in Python. Templates for the base HTML & CSS will be provided, so you can focus on learning how to build dynamic and interactive visualizations to tell your story. You should have a solid base understanding of Python (data types, control flow, and functions), but extensive data science experience is not necessary.
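
The overall pattern looks roughly like the sketch below (a hypothetical route and inline template, not the workshop’s provided materials): build a Bokeh figure, embed it with components(), and render it from a Flask view.

# Embed a Bokeh plot in a Flask page using bokeh.embed.components.
from bokeh.embed import components
from bokeh.plotting import figure
from bokeh.resources import CDN
from flask import Flask, render_template_string

app = Flask(__name__)

PAGE = """<html><head>{{ resources|safe }}{{ script|safe }}</head>
<body><h1>My dashboard</h1>{{ div|safe }}</body></html>"""

@app.route("/")
def index():
    p = figure(title="Demo", height=300, width=400)
    p.line([1, 2, 3, 4], [3, 1, 4, 2], line_width=2)
    script, div = components(p)                      # JS + HTML for this figure
    return render_template_string(PAGE, resources=CDN.render(), script=script, div=div)

if __name__ == "__main__":
    app.run(debug=True)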

Instructor's Bio

Bethany is a data scientist, an instructor, and a passionate experiential learner. Having started her career as an artist and educator, she is committed to data-driven storytelling and the appropriate use of graphs.

Bethany Poulin

Data Science Instructor at General Assembly

Workshop Sessions


Workshop: Real-time Anomaly Detection in Surveillance Feeds

Rapid advances in surveillance infrastructure, coupled with tremendous progress in computer vision and pattern recognition, have enabled us to capture normal and anomalous events at scale. However, the issue of timely response to potentially threatening situations is still a problem at large. Various challenges such as low-quality feeds, occlusion, clutter, lack of training data, and adversarial attacks make it extremely hard for a network to achieve the desired accuracy and performance in time, leading to hazardous situations that could potentially have been avoided. In this workshop, we study state-of-the-art approaches to this problem and examine their capabilities and limitations. Furthermore, we present the results of several experiments conducted to tackle this challenge from supervised, unsupervised, generative, and reinforcement learning perspectives. We hope to present these results as an enabler for future work in this area.

Instructor's Bio

Utkarsh Contractor is the Director of AI at Aisera, where he leads the data science team working on machine learning and artificial intelligence applications in the fields of natural language processing and vision. He is also pursuing his graduate degree at Stanford University, focusing his research and experiments on computer vision, using CNNs to analyze surveillance scene imagery and footage. Utkarsh has a decade of industry experience in information retrieval and machine learning, working at companies such as LinkedIn and AT&T Labs.

Utkarsh Contractor

ML and AI Director at Aisera Inc.

Workshop: Introduction to Natural Language Processing in Healthcare

Healthcare is an industry that is greatly benefiting from data science and machine learning. To successfully build predictive models, healthcare data scientists must extract and combine data of various types (numerical, categorical, text, and/or images) from electronic medical records. Unfortunately, many clinical signs and symptoms (e.g. coughing, vomiting, or diarrhea) are often not captured with numerical data and are usually only present in the clinical notes of physicians and nurses.

In this workshop, the audience will build a machine learning model to predict unplanned hospital readmission with discharge summaries using the MIMIC III data set. Throughout the tutorial, the audience will have the opportunity to prepare data for a machine learning project, preprocess unstructured notes using a bag-of-words approach, build a simple predictive model, assess the quality of the model and strategize how to improve the model. Note to the audience: the MIMIC III data set requires requesting access in advance, so please request access as early as possible.
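
As a toy illustration of the modeling pattern (made-up notes rather than MIMIC III data, which requires credentialed access), bag-of-words features from discharge summaries can feed a simple readmission classifier:

# TF-IDF features from short, invented discharge notes feeding a classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "patient discharged in stable condition, follow up in two weeks",
    "persistent shortness of breath and cough at discharge",
    "wound healing well, no fever, tolerating diet",
    "recurrent vomiting and diarrhea, poor oral intake",
]
readmitted = [0, 1, 0, 1]   # toy labels for unplanned readmission

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(notes, readmitted)
print(model.predict_proba(["new onset cough and vomiting"])[:, 1])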

Instructor's Bio

Andrew Long is a Data Scientist at Fresenius Medical Care North America (FMCNA). Andrew holds a PhD in biomedical engineering from Johns Hopkins University and a Master’s degree in mechanical engineering from Northwestern University. Andrew joined FMCNA last year after participating in the Insight Health Data Fellows Program. At FMCNA, he is responsible for building predictive models using machine learning to improve the quality of life of every patient who receives dialysis from FMCNA. He is currently creating a model to predict which patients are at the highest risk of imminent hospitalization.

Andrew Long, PhD

Data Scientist at Fresenius Medical Care

Workshop: Pomegranate: Fast and Flexible Probabilistic Modeling in Python

Instructor's Bio

Jacob Schreiber is a fifth-year Ph.D. student and NSF IGERT big data fellow in the Computer Science and Engineering department at the University of Washington. His primary research focus is the application of machine learning methods, primarily deep learning ones, to the massive amount of data being generated in the field of genome science. His research projects have involved using convolutional neural networks to predict the three-dimensional structure of the genome and using deep tensor factorization to learn a latent representation of the human epigenome. He routinely contributes to the Python open source community, currently as the core developer of the pomegranate package for flexible probabilistic modeling, and in the past as a developer for the scikit-learn project. Future projects include graduating.

Jacob Schreiber

PhD Candidate at University of Washington

Instructor's Bio

Laura Norén is a data science ethicist and researcher currently working in cybersecurity at Obsidian Security in Newport Beach. She holds undergraduate degrees from MIT and a PhD from NYU, where she recently completed a postdoc in the Center for Data Science. Her work has been covered in The New York Times, Canada’s Globe and Mail, American Public Media’s Marketplace program, and numerous academic journals and international conferences. Dr. Norén is a champion of open source software and those who write it.

Laura Norén, PhD

Director of Research, Professor at Obsidian Security, NYU Stern School of Business

Workshop: Real-ish Time Predictive Analytics with Spark Structured Streaming

In this workshop we will dive deep into what it takes to build and deliver an always-on “real-ish time” predictive analytics pipeline with Spark Structured Streaming.

The core focus of the workshop material will be on how to solve a common complex problem in which we have no labeled data in an unbounded timeseries dataset and need to understand the substructure of said chaos in order to apply common supervised and statistical modeling techniques to our data in a streaming fashion.

The example problem for the workshop will come from the telecommunications space but the skills you will leave with can be applied to almost any domain as long as you sprinkle in a little creativity and inject a bit of domain knowledge.

Skills Acquired:
1. Structured Streaming experience with Apache Spark.
2. Understand how to use supervised modeling techniques on unlabeled data (caveat: requires some domain knowledge and the good ol’ human touch).
3. Have fun for 90 minutes.
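
The shape of such a pipeline is roughly as sketched below (the broker address, topic name, and aggregation are placeholders, not the workshop’s telecom dataset); running it also requires the spark-sql-kafka package on the classpath.

# Read events from Kafka, aggregate over event-time windows, stream results out.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realish-time").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "call-metrics")
          .load())

parsed = events.selectExpr("CAST(value AS STRING) AS value", "timestamp")

counts = (parsed
          .withWatermark("timestamp", "5 minutes")          # bound late data
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()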

Instructor's Bio

Scott Haines is a Principal Software Engineer / Tech Lead on the Voice Insights team at Twilio. His focus has been on the architecture and development of a real-time (sub-250ms), highly available, trustworthy analytics system. His team provides near real-time analytics that processes, aggregates, and analyzes multiple terabytes of global sensor data daily. Scott helped drive Apache Spark adoption at Twilio and actively teaches and consults with teams internally. Previously, Scott worked at Yahoo!, where he built a real-time recommendation engine and targeted ranking/ratings analytics that helped serve personalized page content for millions of customers of Yahoo Games. He also built a real-time click/install tracking system that helped deliver customized push marketing and ad attribution for Yahoo Sports. Scott finished his tenure at Yahoo working for Flurry Analytics, where he wrote an auto-regressive smart alerting and notification system that was integrated into the Flurry mobile app for iOS/Android.

Scott Haines

Principal Software Engineer at Twilio

Workshop: Mastering Gradient Boosting with CatBoost

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For a number of years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others.

CatBoost (http://catboost.yandex) is a popular open-source gradient boosting library with a whole set of advantages:
1. CatBoost is able to incorporate categorical features in your data (like music genre or city) with no additional preprocessing.
2. CatBoost has the fastest GPU and multi GPU training implementations of all the openly available gradient boosting libraries.
3. CatBoost predictions are 20-60 times faster than in other open-source gradient boosting libraries, which makes it possible to use CatBoost for latency-critical tasks.
4. CatBoost has a variety of tools to analyze your model.

This workshop will feature a comprehensive tutorial on using the CatBoost library. We will walk you through all the steps of building a good predictive model. We will cover such topics as (a short usage sketch follows this list):
– Working with different types of features, numerical and categorical
– Working with imbalanced datasets
– Using cross-validation
– Understanding feature importances and explaining model predictions
– Tuning parameters of the model
– Speeding up the training.
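
A short usage sketch (toy data, not the tutorial’s dataset) of how CatBoost consumes categorical columns directly via cat_features:

# CatBoost handles categorical features without manual encoding.
import pandas as pd
from catboost import CatBoostClassifier

train = pd.DataFrame({
    "genre": ["rock", "jazz", "rock", "pop", "jazz", "pop"],
    "city": ["nyc", "boston", "nyc", "nyc", "boston", "boston"],
    "minutes_listened": [120, 45, 200, 30, 80, 60],
    "liked": [1, 0, 1, 0, 1, 0],
})

model = CatBoostClassifier(iterations=50, learning_rate=0.1, verbose=0)
model.fit(train[["genre", "city", "minutes_listened"]], train["liked"],
          cat_features=["genre", "city"])
print(model.get_feature_importance(prettified=True))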

Instructor's Bio

Anna Veronika Dorogush graduated from the Faculty of Computational Mathematics and Cybernetics of Lomonosov Moscow State University and from the Yandex School of Data Analysis. She previously worked at ABBYY, Microsoft, Bing, and Google, and has been working at Yandex since 2015, where she currently heads the Machine Learning Systems group and is leading the development of the CatBoost library.

Anna Veronika Dorogush

ML Lead at Yandex

Workshop: Deciphering the Black Box: Latest Tools and Techniques for Interpretability

This workshop shows how interpretability tools can not only give you more confidence in a model, but also help to improve model performance. Through this interactive workshop, you will learn how to better understand the models you build, along with the latest techniques and many tricks of the trade around interpretability. The workshop will largely focus on interpretability techniques such as feature importance and partial dependence, and on explanation approaches such as LIME and SHAP.
The workshop will demonstrate interpretability techniques with notebooks, some in R and some in Python. Along the way, the workshop will consider issues like spurious correlation, random effects, multicollinearity, reproducibility, and other issues that may affect model interpretation and performance. To illustrate the points, the workshop will use easy-to-understand examples and references to open source tools.
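
For a quick look at two of the techniques named above (toy data, not the workshop notebooks), scikit-learn exposes permutation importance and partial dependence directly:

# Permutation importance and partial dependence on a small built-in dataset.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance, partial_dependence
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1])[:3])

pd_result = partial_dependence(model, X_test, features=[X.columns.get_loc("bmi")])
print(pd_result["average"].shape)   # partial dependence curve for the "bmi" feature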

Instructor's Bio

Rajiv Shah is a data scientist at DataRobot, where his primary focus is helping customers improve their ability to make and implement predictions. Previously, Rajiv was part of data science teams at Caterpillar and State Farm. He has worked on a variety of projects in a wide-ranging set of areas including supply chain, sensor data, actuarial ratings, and security. He has a PhD from the University of Illinois at Urbana-Champaign.

Rajiv Shah, PhD

Data Scientist at DataRobot

Workshop: Building an Open Source Streaming Analytics Solution with Kafka and Druid

The maturation and development of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this talk, we will cover how data analytics stacks have evolved from data warehouses, to data lakes, to the more modern streaming analytics stack. We will also discuss building such a stack using Apache Kafka and Apache Druid.

Analytics pipelines running purely on Hadoop can suffer from hours of data lag. Initial attempts to solve this problem often lead to inflexible solutions, where the queries must be known ahead of time, or fragile solutions where the integrity of the data cannot be assured. Combining Hadoop with Kafka and Druid can guarantee system availability, maintain data integrity, and support fast and flexible queries.

In the described system, Kafka provides a fast message bus and is the delivery point for machine-generated event streams. Kafka Streams can be used to manipulate data and load it into Druid. Druid provides flexible, highly available, low-latency queries.

This talk is based on our real-world experience building out such a stack for many use cases across many industries.
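
As a small illustration of the ingestion side (the kafka-python client, broker address, and topic are assumptions, not details from the talk), machine-generated events can be pushed onto a Kafka topic from which Druid ingests them:

# Publish JSON events to a Kafka topic with the kafka-python client.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",                          # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(10):
    event = {"timestamp": int(time.time() * 1000), "user_id": i % 3, "clicks": 1}
    producer.send("clickstream", value=event)                 # placeholder topic
producer.flush()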

Instructor's Bio

Fangjin is a co-author of the open source Druid project and a co-founder of Imply, a San Francisco based technology company. Fangjin previously held senior engineering positions at Metamarkets and Cisco. He holds a BASc in Electrical Engineering and a MASc in Computer Engineering from the University of Waterloo, Canada.

Fangjin Yang

Core Contributor to Druid | CEO at Imply.io

Workshop: Reproducible Data Science Using Orbyter

Artificial Intelligence is already helping many businesses become more responsive and competitive, but how do you move machine learning models efficiently from research to deployment at enterprise scale? It is imperative to plan for deployment from day one, both in tool selection and in the feedback and development process. Additionally, just as DevOps is about people working at the intersection of development and operations, there are now people working at the intersection of data science and software engineering who need to be integrated into the team with tools and support.

At Manifold, we’ve developed the Lean AI process to streamline machine learning projects and the open-source Orbyter package for Docker-first data science to help your engineers work as an integrated part of your development and production teams. In this workshop, Sourav and Alex will focus heavily on the DevOps side of things, demonstrating how to use Orbyter to spin up data science containers and discussing experiment management as part of the Lean AI process.

Instructor's Bio

As CTO for Manifold, Sourav is responsible for the overall delivery of data science and data product services to make clients successful. Before Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google / Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He earned his PhD, MS, and BS degrees from MIT in Electrical Engineering and Computer Science.

Sourav Dey, PhD

CTO at Manifold

Workshop: Reproducible Data Science Using Orbyter

Artificial Intelligence is already helping many businesses become more responsive and competitive, but how do you move machine learning models efficiently from research to deployment at enterprise scale? It is imperative to plan for deployment from day one, both in tool selection and in the feedback and development process. Additionally, just as DevOps is about people working at the intersection of development and operations, there are now people working at the intersection of data science and software engineering who need to be integrated into the team with tools and support.

At Manifold, we’ve developed the Lean AI process to streamline machine learning projects and the open-source Orbyter package for Docker-first data science to help your engineers work as an integrated part of your development and production teams. In this workshop, Sourav and Alex will focus heavily on the DevOps side of things, demonstrating how to use Orbyter to spin up data science containers and discussing experiment management as part of the Lean AI process.

Instructor's Bio

Alexander Ng is a Senior Data Engineer at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Alex served as both a Sales Engineering Tech Lead and a DevOps Tech Lead for Kyruus, a startup that built SaaS products for enterprise healthcare organizations. Alex got his start as a Software Systems Engineer at the MITRE Corporation and the Naval Undersea Warfare Center in Newport, RI. His recent projects at the intersection of systems and machine learning continue to combine a deep understanding of the entire development lifecycle with cutting-edge tools and techniques. Alex earned his Bachelor of Science degree in Electrical Engineering from Boston University, and is an AWS Certified Solutions Architect.

Alex Ng

Senior Data Engineer at Manifold

Workshop: Mapping Geographic Data in R

Our customers, store locations, constituents, research subjects, patients, crime locations, traffic accidents, and events don’t exist in a vacuum – their geographic locations provide important information that can best be displayed visually. Often, we have datasets that give location data in various ways: zip code, census tract or block, street address, latitude / longitude, congressional district, etc. Combining these data and getting critical insight into the relationships between them requires a bit of data munging skills that go beyond the basic data analysis we use for traditional tabular data.

There are many free, publicly available datasets about environmental exposures, socioeconomic status, climate, public safety events, and more that are linked to a geographic point or area, and this wealth of information gives us the opportunity to enrich our proprietary data with greater insight.

Whether it’s comparing the number of asthma-related visits to the hospital with air quality data or looking at socioeconomic data and correlating it to commuting patterns, the presentation of geographic data in maps helps accelerate the transformation of raw data into actionable information. Maps are a well-known data idiom that is ideal for presenting complex data in an approachable way to non-technical stakeholders like policymakers, executives, and the press.

In this hands-on workshop, we will use R to take public data from various sources and combine them to find statistically interesting patterns and display them in static and dynamic, web-ready maps. This session will cover topics including geojson and shapefiles, how to munge Census Bureau data, geocoding street addresses, transforming latitude and longitude to the containing polygon, and data visualization principles.

Participants will leave this workshop with a publication-quality data product and the skills to apply what they’ve learned to data in their field or area of interest. Participants should have R and RStudio installed and have a basic understanding of how to use R and R Markdown for basic data ingestion and analysis. Ideally, participants will install the following packages prior to the workshop: tidyverse, leaflet, jsonlite, ggplot2, maptools, sp, rgdal, rgeos, scales, tmap.

Instructor's Bio

Joy Payton is a data scientist and data educator at the Children’s Hospital of Philadelphia (CHOP), where she helps biomedical researchers learn the reproducible computational methods that will speed time to science and improve the quality and quantity of research conducted at CHOP. A longtime open source evangelist, Joy develops and delivers data science instruction on topics related to R, Python, and git to an audience that includes physicians, nurses, researchers, analysts, developers, and other staff. Her personal research interests include using natural language processing to identify linguistic differences in a neurodiverse population as well as the use of government open data portals to conduct citizen science that draws attention to issues affecting vulnerable groups. Joy holds a degree in philosophy and math from Agnes Scott College, a divinity degree from the Universidad Pontificia de Comillas (Madrid), and a data science Masters from the City University of New York (CUNY).

Joy Payton

Supervisor, Data Education at Children’s Hospital of Philadelphia

Workshop: Synthesizing Data Visualization and User Experience

The wealth of data available offers unprecedented opportunities for discovery and insight. How do we design a more intuitive and useful data experience? This workshop focuses on approaches to turn data into actionable insights by combining principles from data visualization and user experience design. Participants will be asked to think holistically about data visualizations and the people they serve. Through presentations and hands-on exercises, participants will learn how to choose and create data visualizations driven by user-oriented objectives.

Instructor's Bio

Bang Wong is the creative director of the Broad Institute of MIT and Harvard and an adjunct assistant professor in the Department of Art as Applied to Medicine at the Johns Hopkins University School of Medicine. His work focuses on developing strategies to meet the analytical challenges posed by the unprecedented volume, resolution, and variety of data in biomedical research.

Bang Wong

Creative Director at Broad Institute of MIT-Harvard

Workshop: Synthesizing Data Visualization and User Experience

The wealth of data available offers unprecedented opportunities for discovery and insight. How do we design a more intuitive and useful data experience? This workshop focuses on approaches to turn data into actionable insights by combining principles from data visualization and user experience design. Participants will be asked to think holistically about data visualizations and the people they serve. Through presentations and hands-on exercises, participants will learn how to choose and create data visualizations driven by user-oriented objectives.

Instructor's Bio

Mark Schindler is co-founder and Managing Director of GroupVisual.io. For over 15 years, he has designed user-interfaces for analytic software products and mobile apps for clients ranging from Fortune 50 companies to early-stage startups. In addition to design services, Mark and his team mentor startup companies and conduct workshops on data visualization, analytics and user-experience design.

Mark Schindler

Co-founder, Managing Director at GroupVisual.io

Workshop: Scaling AI Applications with Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from running interactively to production clusters. In this talk we introduce Ray, a high-performance distributed execution engine, and its libraries for AI workloads. We cover each Ray library in turn, and also show how the Ray API allows these traditionally separate workflows to be composed and run together as one distributed application.
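To give a flavor of how such composition looks in code, here is a minimal sketch using Ray’s core task API; the `preprocess` and `train_model` functions and the toy data are purely illustrative and are not part of the workshop materials.

```python
# Minimal sketch of composing parallel work with Ray's core task API.
# Only ray.init, @ray.remote, .remote(), and ray.get are real Ray primitives;
# the functions and data below are illustrative placeholders.
import ray

ray.init()  # start (or connect to) a local Ray runtime

@ray.remote
def preprocess(shard):
    # stand-in for feature engineering on one shard of data
    return [x * 2 for x in shard]

@ray.remote
def train_model(processed_shards):
    # stand-in for a training step that consumes all preprocessed shards
    return sum(sum(s) for s in processed_shards)

shards = [[1, 2, 3], [4, 5, 6]]
futures = [preprocess.remote(s) for s in shards]          # launched in parallel
result = ray.get(train_model.remote(ray.get(futures)))    # composed into one application
print(result)
```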

Ray is an open source project being developed in the RISELab at UC Berkeley for scalable hyperparameter optimization, distributed deep learning, and reinforcement learning. We focus on the following libraries in this tutorial:

TUNE: Tune is a scalable hyperparameter optimization framework for reinforcement learning and deep learning. Go from running one experiment on a single machine to running on a large cluster with efficient search algorithms without changing your code. Unlike existing hyperparameter search frameworks, Tune targets long-running, compute-intensive training jobs that may take many hours or days to complete, and includes many resource-efficient algorithms designed for this setting.

RLLIB: RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones. In this tutorial we discuss using RLlib to tackle both classic benchmark and applied problems, RLlib’s primitives for scalable RL, and how RL workflows can be integrated with data processing and hyperparameter optimization.

Instructor's Bio

Richard Liaw is a PhD student in BAIR/RISELab at UC Berkeley working with Joseph Gonzalez, Ion Stoica, and Ken Goldberg. He has worked on a variety of different areas, ranging from robotics to reinforcement learning to distributed systems. He is currently actively working on Ray, a distributed execution engine for AI applications; RLlib, a scalable reinforcement learning library; and Tune, a distributed framework for model training.

Richard Liaw

AI Researcher, RISELab at UC Berkeley

Workshop: Scaling AI Applications with Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from running interactively to production clusters. In this talk we introduce Ray, a high-performance distributed execution engine, and its libraries for AI workloads. We cover each Ray library in turn, and also show how the Ray API allows these traditionally separate workflows to be composed and run together as one distributed application.

Ray is an open source project being developed in the RISELab at UC Berkeley for scalable hyperparameter optimization, distributed deep learning, and reinforcement learning. We focus on the following libraries in this tutorial:

TUNE: Tune is a scalable hyperparameter optimization framework for reinforcement learning and deep learning. Go from running one experiment on a single machine to running on a large cluster with efficient search algorithms without changing your code. Unlike existing hyperparameter search frameworks, Tune targets long-running, compute-intensive training jobs that may take many hours or days to complete, and includes many resource-efficient algorithms designed for this setting.

RLLIB: RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones. In this tutorial we discuss using RLlib to tackle both classic benchmark and applied problems, RLlib’s primitives for scalable RL, and how RL workflows can be integrated with data processing and hyperparameter optimization.

Instructor's Bio

Eric Liang is a PhD student at UC Berkeley working with Ion Stoica on distributed systems and applications of reinforcement learning. He is currently leading the RLlib project (rllib.io). Before grad school, he spent 4 years working in industry on storage infrastructure at Google and Apache Spark at Databricks.

Eric Liang

Project Lead, RISELab at UC Berkeley

Workshop: Modeling in the tidyverse

The tidyverse in R has traditionally been focused on data ingestion, manipulation, and visualization. The tidymodels packages apply the same design principles to modeling to create packages with high usability that produce results in predictable formats and structures. This workshop is a concise overview of the system and is illustrated with examples. Remote servers are available for users who cannot install software locally. Materials and preparation instructions can be found at https://github.com/topepo/odsc_2019

Instructor's Bio

Coming soon

Max Kuhn, PhD

Software Engineer, Author & Creator of Caret at RStudio

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology has evolved over the years. However, identity remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, several problems remain unsolved. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes that affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity/attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.
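As a toy illustration of steps (3) and (4), the sketch below computes a weighted match confidence from per-attribute string similarities using only the Python standard library; the records, attributes, and weights are invented for illustration and do not represent any production identity-matching system.

```python
# Toy sketch of fuzzy matching of attribute pairs with a weighted match
# confidence. Records and weights below are invented examples.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity between two attribute values."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_confidence(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted combination of per-attribute similarities."""
    total = sum(weights.values())
    return sum(w * similarity(rec_a[k], rec_b[k]) for k, w in weights.items()) / total

a = {"name": "Jonathan A. Smith", "email": "jon.smith@example.com"}
b = {"name": "Jon Smith", "email": "jon.smith@example.com"}

score = match_confidence(a, b, {"name": 0.4, "email": 0.6})
print(f"match confidence: {score:.2f}")  # scores above a tuned threshold count as a match
```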

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity, and then applying coherence across various identity attributes to verify all of its other attributes.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required.

Instructor's Bio

Liren Peng is a Software Engineer on the Trust team at Airbnb. He is responsible for the architecture and development of user identity verification systems. He also works on the utilization of third-party data and vendor integration. Prior to Airbnb, Liren worked at Trooly, a startup that built machine learning based trust models using both social media data and proprietary data to assess the trustworthiness of individuals. He received a B.S. from Carnegie Mellon University and an M.Sc. from Stanford University, focusing on data analytics.

Liren Peng

Software Engineer at Airbnb

Workshop: Open Data Hub workshop on OpenShift

The past few years have seen the growth and adoption of container technology for cloud-native applications, DevOps, and agility. Kubernetes has emerged as the de facto hybrid cloud container platform.

There is considerable interest in bringing data science workloads and workflows to OpenShift – Red Hat’s Kubernetes distro. Data scientists benefit from having a choice of public and private clouds and the capabilities and technologies they bring to their experiments. Data and ML engineers benefit from being able to scale data science workloads and workflows and bring them to production.

We propose a hands-on workshop where we show attendees how to deploy open source technologies for data science on Kubernetes – technologies such as Jupyter, Kafka, Spark, TensorFlow, and Ceph. This workshop will be based on our experiences doing this for Open Data Hub.

Instructor's Bio

Coming Soon

Steven Huels

Director of Engineering at Red Hat

Workshop: Open Data Hub workshop on OpenShift

The past few years have seen the growth and adoption of container technology for cloud-native applications, DevOps, and agility. Kubernetes has emerged as the de facto hybrid cloud container platform.

There is considerable interest in bringing data science workloads and workflows to OpenShift – Red Hat’s Kubernetes distro. Data scientists benefit from having a choice of public and private clouds and the capabilities and technologies they bring to their experiments. Data and ML engineers benefit from being able to scale data science workloads and workflows and bring them to production.

We propose a hands-on workshop where we show attendees how to deploy open source technologies for data science on Kubernetes – technologies such as Jupyter, Kafka, Spark, TensorFlow, and Ceph. This workshop will be based on our experiences doing this for Open Data Hub.

Instructor's Bio

Coming Soon

Sherard Griffin

Senior Principal Engineer at Red Hat

Workshop: Open Data Hub workshop on OpenShift

The past few years have seen the growth and adoption of container technology for cloud-native applications, DevOps, and agility. Kubernetes has emerged as the de facto hybrid cloud container platform.

There is considerable interest in bringing data science workloads and workflows to OpenShift – Red Hat’s Kubernetes distro. Data scientists benefit from having a choice of public and private clouds and the capabilities and technologies they bring to their experiments. Data and ML engineers benefit from being able to scale data science workloads and workflows and bring them to production.

We propose a hands-on workshop where we show attendees how to deploy open source technologies for data science on Kubernetes – technologies such as Jupyter, Kafka, Spark, TensorFlow, and Ceph. This workshop will be based on our experiences doing this for Open Data Hub.

Instructor's Bio

Coming Soon

Tushar Katarki

Sr. Principal Product Manager – OpenShift at Red Hat

Workshop: Intro to Technical Financial Evaluation with R

In this entry-level workshop you will learn how to download and evaluate equities with the TTR (technical trading rules) package. We will evaluate an equity according to three basic indicators and introduce you to backtesting for more sophisticated analyses on your own. Next we will model a financial market’s risk versus reward to identify the best possible individual investments in the market. Lastly, we will explore a non-traditional market, simulate the reward in that market, and put our findings to an actual test in a highly speculative environment.

Instructor's Bio

Ted started his text mining journey at Amazon when he launched the social media customer service team. Since then, he has held analytical leadership roles at startups and Fortune 100 companies. He is the author of “Text Mining in Practice with R,” available on Amazon.

Ted Kwartler

Director, Data Scientist, Adjunct Professor at Liberty Mutual, Harvard Extension School

Workshop: Get started with Deep Learning and the Internet of Things!

In this hands-on MATLAB workshop, you will explore deep learning techniques for data types such as images, text, and time-series data. You’ll use an online MATLAB instance to perform the following tasks:

1. Train deep neural networks on GPUs in the cloud
2. Access and explore pretrained models
3. Build a CNN to solve an image classification problem
4. Use LSTM networks to solve a time-series and text analytics problem.

Instructor's Bio

Jianghao Wang is a Data Scientist at MathWorks. In her role, Jianghao supports deep learning research and teaching in academia. Before joining MathWorks, Jianghao obtained her Ph.D. in Statistical Climatology from the University of Southern California and a B.S. in Applied Mathematics from Nankai University.

Jianghao Wang, PhD

Data Scientist at MathWorks

Workshop: Get started with Deep Learning and the Internet of Things!

In this hands-on MATLAB workshop, you will explore deep learning techniques for data types such as images, text, and time-series data. You’ll use an online MATLAB instance to perform the following tasks:

1. Train deep neural networks on GPUs in the cloud
2. Access and explore pretrained models
3. Build a CNN to solve an image classification problem
4. Use LSTM networks to solve a time-series and text analytics problem.

Instructor's Bio

Pitambar Dayal works on deep learning and computer vision applications in technical marketing. Prior to joining MathWorks, he worked on creating technological healthcare solutions for developing countries and researching the diagnosis and treatment of ischemic stroke patients. Pitambar holds a B.S. in biomedical engineering from the New Jersey Institute of Technology.

Pitambar Dayal

Application Support Engineer at MathWorks

Workshop: Analyzing Legislative Burden Upon Businesses Using NLP and ML

As legislation develops over time, the burden upon businesses can change drastically. Data scientists from Bardess have collaborated with a research team within the Government of Ontario to investigate the use of advanced natural language processing (NLP) and machine learning (ML) techniques to analyze legal documents including statutes and regulations. Using the Accessibility for Ontarians with Disabilities Act (AODA) as a starting point, we developed a multi-stage analysis. At the higher level, the goal was simply to identify and automatically detect parts of the legislation that indicate legislative burden and categorize them as being primarily burdens upon businesses or government departments. The second level of analysis aims at understanding patterns of similarities and differences between different classes of burden using data mining and clustering techniques. Finally, the objective of the analysis is expanded to include other legislative texts, using ML algorithms to detect burdens which have been duplicated across multiple statutes and acts. This latter work supports the Government of Ontario in developing leaner legislation more efficiently. Overall, this work indicates how NLP and ML techniques can be brought to bear on complex legislative problems, further emphasizing the increasing utility of these techniques in government and industry.

In this hands-on workshop, we’ll first describe the legislative/business context for the initiative, then walk attendees through the technical implementation. The work will be conducted by combining various techniques from the NLP toolbox, such as entity recognition, part-of-speech tagging, automatic summarization, and topic modeling. Work will be conducted in Python, making use of libraries for NLP such as spacy and nltk, and the ML library scikit-learn. We will also showcase interactive dashboards which have been created using the BI tool Qlik to allow exploration of the results of the analysis.
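As a rough sketch of the kind of steps involved, the snippet below runs spaCy entity recognition and part-of-speech tagging on a few invented legislative-style sentences and clusters them with scikit-learn; it assumes the `en_core_web_sm` model has been downloaded and is not the workshop’s actual code or data.

```python
# Rough sketch: spaCy NER / POS tagging plus TF-IDF clustering with scikit-learn.
# Passages are invented; assumes `python -m spacy download en_core_web_sm` was run.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_sm")

passages = [
    "Every obligated organization shall file an accessibility report annually.",
    "The operator of a facility shall submit a compliance report to the ministry.",
    "Nothing in this section limits the rights of persons with disabilities.",
]

doc = nlp(passages[0])
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities
print([(tok.text, tok.pos_) for tok in doc][:6])      # part-of-speech tags

# Cluster passages by TF-IDF similarity to group similar burden language.
X = TfidfVectorizer(stop_words="english").fit_transform(passages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```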

Instructor's Bio

Dr. Daniel Parton leads the data science practice at the analytics consultancy, Bardess. He has a background in academia, including a PhD in computational biophysics from University of Oxford, and previously worked in marketing analytics at Omnicom. He brings both technical and management experience to his role of leading cross-functional data analytics teams, and has led successful and impactful projects for companies in finance, retail, tech, media, manufacturing, pharma and sports/entertainment industries.

Daniel Parton, PhD

Lead Data Scientist at Bardess Group

Workshop: Analyzing Legislative Burden Upon Businesses Using NLP and ML

As legislation develops over time, the burden upon businesses can change drastically. Data scientists from Bardess have collaborated with a research team within the Government of Ontario to investigate the use of advanced natural language processing (NLP) and machine learning (ML) techniques to analyze legal documents including statutes and regulations. Using the Accessibility for Ontarians with Disabilities Act (AODA) as a starting point, we developed a multi-stage analysis. At the higher level, the goal was simply to identify and automatically detect parts of the legislation that indicate legislative burden and categorize them as being primarily burdens upon businesses or government departments. The second level of analysis aims at understanding patterns of similarities and differences between different classes of burden using data mining and clustering techniques. Finally, the objective of the analysis is expanded to include other legislative texts, using ML algorithms to detect burdens which have been duplicated across multiple statutes and acts. This latter work supports the Government of Ontario in developing leaner legislation more efficiently. Overall, this work indicates how NLP and ML techniques can be brought to bear on complex legislative problems, further emphasizing the increasing utility of these techniques in government and industry.

In this hands-on workshop, we’ll first describe the legislative/business context for the initiative, then walk attendees through the technical implementation. The work will be conducted by combining various techniques from the NLP toolbox, such as entity recognition, part-of-speech tagging, automatic summarization, and topic modeling. Work will be conducted in Python, making use of libraries for NLP such as spacy and nltk, and the ML library scikit-learn. We will also showcase interactive dashboards which have been created using the BI tool Qlik to allow exploration of the results of the analysis.

Instructor's Bio

Serena Peruzzo is a senior data scientist at the analytics consultancy Bardess. Her formal background is in statistics, with experience in both industry and academia. She has worked as a consultant in the Australian, British, and Canadian markets, delivering data science solutions across a broad range of industries, and has led several startups through the process of bootstrapping their data science capabilities.

Serena Peruzzo

Sr. Data Scientist at Bardess Group

Workshop: Causal Inference for Data Scientists

Causal inference is an increasingly necessary skill set for data scientists and analysts. No longer is it enough only to predict what happens given a set of environmental conditions, but rather internal business partners need to know how the decisions they are making influence outcomes. For example, marketers not only need to know that spending more money drives more revenue, but they also need to know how much revenue they can expect to observe at various levels of marketing spend. Understanding the causal relationship between spend and revenue empowers decision makers to optimize their decisions more accurately and quickly around crucial business goals such as ROI targets or revenue maximization. At DraftKings, we are always thinking about how we draw accurate conclusions from all of our tests. Our efforts include utilizing modern techniques as well as exploring new ideas and methods to improve our ability to learn.

Managers often assume that causal inference is a simple exercise for data scientists. Unfortunately, causal inference is not as simple as running A/B experiments. The purpose of this talk is to establish that causal inference is as much a philosophical exercise as it is a data exercise. Developing expertise in causal inference requires a deep understanding of the accepted framework, an ability to identify when data doesn’t adhere to the assumptions of this framework, and expertise with tools and techniques that can solve many of the significant challenges with estimating unbiased effects of treatments on critical outcomes.

This session serves as an introduction to the practice of causal inference. We start with an overview of the Rubin Causal Model (RCM), the leading framework for establishing causality. Once users are comfortable with the philosophy, we explore how the commonly used A/B testing framework maps to the more robust RCM framework from both a mathematical and philosophical perspective. In the final portion of the talk, we discuss several techniques developed by researchers that can be used to establish causality for a compromised A/B test or cases where tests are not feasible to implement. Throughout the talk, we use a general set of challenges faced by businesses to illustrate when issues arise and how these techniques mitigate the challenges.
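For readers new to the topic, the toy example below shows the simplest case: with a properly randomized A/B test, the average treatment effect is just a difference in means with a normal-approximation confidence interval. The numbers are simulated and are not DraftKings data.

```python
# Toy illustration: under the Rubin Causal Model, a randomized A/B test lets us
# estimate the average treatment effect as a simple difference in means.
# All data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=3.0, size=5000)   # e.g. revenue per user, no promotion
treated = rng.normal(loc=10.5, scale=3.0, size=5000)   # e.g. revenue per user, with promotion

ate = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
lo, hi = ate - 1.96 * se, ate + 1.96 * se               # normal-approximation 95% CI

print(f"estimated ATE = {ate:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```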

Instructor's Bio

Coming soon

Bradley Fay, PhD

Senior Manager, Analytics at DraftKings

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology has evolved over the years. However, identity remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, several problems remain unsolved. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes that affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity/attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity, and then applying coherence across various identity attributes to verify all of its other attributes.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required.

Instructor's Bio

Sukhada Palkar is a software engineer at Airbnb working on the various challenges of trusting digital identities. She enjoys working at the intersection of open ended problem solving, software engineering and machine learning. She has a background in applying machine learning for text and speech systems, and more recently identity and risk analytics.

Before Airbnb, Sukhada was an early member of the Amazon Alexa core natural language team and part of Trooly, a startup in the digital identity verification space that was acquired by Airbnb. Sukhada has an M.S. in speech and language technologies from Carnegie Mellon.

Sukhada Palkar

Software Engineer at Airbnb

Workshop: Mapping the Global Supply Chain Graph

Panjiva maps the network of global trade using over one billion shipping records sourced from 15 governments around the world. We perform large-scale entity extraction and entity resolution from this raw data, identifying over 8 million companies involved in international trade, located across every country in the world. Moreover, we track detailed information on the 25 million+ relationships between them, yielding a map of the global trade network with unprecedented scope and granularity. We have developed a powerful platform facilitating search, analysis, and visualization of this network as well as a data feed integrated into S&P Global’s Xpressfeed platform.

We can explore the global supply chain graph at many levels of granularity. At the micro level, we can surface the close relationships around a given company to, for example, identify overseas suppliers shared with a competitor. At the macro level, we can track patterns such as the flow of products among geographic areas or industries. By linking to S&P Global’s financial and corporate data, we are able to understand how supply chains flow within or between multinational corporate structures, and correlate trade volumes and anomalies to financial metrics and events.

Instructor's Bio

Robert Christie is a Front End Engineer at Panjiva, a division of S&P Global Market Intelligence. He specializes in interactive data visualization and cartography for the web and has a background in statistics and spatial analysis. Much of Robert’s work has been in the domain of transportation, mobility, and logistics. He is passionate about the role of visualization in increasing the comprehensibility and observability of machine learning driven decision making. Robert received a B.A. from the McGill School of Environment and a Masters from the University of Toronto School of Information.

Robert Christie

Front End Engineer at S&P Global Market Intelligence

Workshop: Democratizing & Accelerating AI through Automated Machine Learning

Intelligent experiences powered by AI can seem like magic to users. Developing them, however, is cumbersome, involving a series of sequential and interconnected decisions along the way that are time-consuming. What if there were an automated service that identifies the best machine learning pipelines for a given problem and dataset? Automated ML does exactly that!

Automated ML is based on a breakthrough from our Microsoft Research division. The approach combines ideas from collaborative filtering and Bayesian optimization to search an enormous space of possible machine learning pipelines intelligently and efficiently. It’s essentially a recommender system for machine learning pipelines. Similar to how streaming services recommend movies for users, automated ML recommends machine learning pipelines for data sets.

Just as important, automated ML accomplishes all this without having to see the customer’s data, preserving privacy. Automated ML is designed to not look at the customer’s data. Customer data and execution of the machine learning pipeline both live in the customer’s cloud subscription (or their local machine), which they have complete control of. Only the results of each pipeline run are sent back to the automated ML service, which then makes an intelligent, probabilistic choice of which pipelines should be tried next.

By making automated ML available through the Azure Machine Learning service (Python based SDK), we’re empowering data scientists with a powerful productivity tool. We’re working on making automated ML accessible through PowerBI, so that business analysts and BI professionals can also take advantage of machine learning. And stay tuned as we continue to incorporate it into other product channels to bring the power of automated ML to everyone.

This session will provide an overview of Automated machine learning, key customer use-cases, how it works and how you can get started!

Instructor's Bio

Coming soon

Deepak Babu Mukunthu

Principal Program Manager, Azure AI Platform at Microsoft

Workshop: Democratizing & Accelerating AI through Automated Machine Learning

Intelligent experiences powered by AI can seem like magic to users. Developing them, however, is cumbersome, involving a series of sequential and interconnected decisions along the way that are time-consuming. What if there were an automated service that identifies the best machine learning pipelines for a given problem and dataset? Automated ML does exactly that!

Automated ML is based on a breakthrough from our Microsoft Research division. The approach combines ideas from collaborative filtering and Bayesian optimization to search an enormous space of possible machine learning pipelines intelligently and efficiently. It’s essentially a recommender system for machine learning pipelines. Similar to how streaming services recommend movies for users, automated ML recommends machine learning pipelines for data sets.

Just as important, automated ML accomplishes all this without having to see the customer’s data, preserving privacy. Automated ML is designed to not look at the customer’s data. Customer data and execution of the machine learning pipeline both live in the customer’s cloud subscription (or their local machine), which they have complete control of. Only the results of each pipeline run are sent back to the automated ML service, which then makes an intelligent, probabilistic choice of which pipelines should be tried next.

By making automated ML available through the Azure Machine Learning service (Python based SDK), we’re empowering data scientists with a powerful productivity tool. We’re working on making automated ML accessible through PowerBI, so that business analysts and BI professionals can also take advantage of machine learning. And stay tuned as we continue to incorporate it into other product channels to bring the power of automated ML to everyone.

This session will provide an overview of Automated machine learning, key customer use-cases, how it works and how you can get started!

Instructor's Bio

Eric Clausen-Brown is a Data Scientist who works on Automated ML in the AI Platform team at Microsoft. His past data science experience includes training and deploying machine learning models for search, ads, and personalization. In another life he was an astrophysicist who focused on understanding stuff going on around black holes.

Eric Clausen-Brown, PhD

Data Scientist at Microsoft

Workshop: Deep Learning like a Viking: Building Convolutional Neural Networks with Keras

The Vikings came from the land of ice and snow, from the midnight sun, where the hot springs flow. In addition to longships and bad attitudes, they had a system of writing that we, in modern times, have dubbed the Younger Futhark (or ᚠᚢᚦᚬᚱᚴ if you’re a Viking). These sigils are more commonly called runes and have been mimicked in fantasy literature and role-playing games for decades.

Of course, having an alphabet, runic or otherwise, solves lots of problems. But, it also introduces others. The Vikings had the same problem we do today. How were they to get their automated software systems to recognize the hand-carved input of a typical boatman? Of course, they were never able to solve this problem and were instead forced into a life of burning and pillaging. Today, we have deep learning and neural networks and can, fortunately, avoid such a fate.

In this session, we are going to build a Convolutional Neural Network to recognize hand-written runes from the Younger Futhark. We’ll be using Keras to write easy-to-understand Python code that creates and trains the neural network to do this. We’ll wire this up to a web application using Flask and some client-side JavaScript so you can write some runes yourself and see if it recognizes them.
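As a taste of what that looks like, here is a minimal Keras sketch of such a network; it assumes 28x28 grayscale rune images and 16 classes (the Younger Futhark has 16 runes), and the data loading, training loop, and Flask wiring from the session are omitted.

```python
# Minimal sketch of a CNN for rune classification with Keras.
# Assumes 28x28 grayscale images and 16 classes; data loading is omitted.
from tensorflow.keras import layers, models

num_classes = 16  # the Younger Futhark has 16 runes

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=10, validation_split=0.1)  # with your rune images
```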

When we’re done, you’ll understand how Convolutional Neural Networks work, how to build your own using Python and Keras, and how to make it part of an application using Flask. Maybe you’ll even try seeing what it thinks of the Bluetooth logo?

Instructor's Bio

Guy works for DataRobot in Columbus, Ohio as a Developer Evangelist. Combining his decades of experience in writing software with a passion for sharing what he has learned, Guy goes out into developer communities and helps others build great software.
Teaching and community have long been a focus for Guy. He is President of the Columbus JavaScript Users Group, an organizer for the Columbus Machine Learners, and has even helped teach programming at a prison in central Ohio.
In past lives, Guy has worked as a consultant in a broad range of industries including healthcare, retail, and utilities. He also spent several years working for a major insurance company in central Ohio. This has given him a broad view of how technology can be applied to business problems.

Guy Royse

Developer Evangelist at DataRobot

Workshop: Explaining XGBoost Models - Tools and Methods

There is a widespread belief that the twin modeling goals of prediction and explanation are in conflict. That is, if one desires superior predictive power, then by definition one must pay a price of having little insight into how the model made its predictions. Conversely, if one desires explanations then one must only use “highly interpretable” methods like linear and logistic regression. However, in reality, this tradeoff is by no means a given. In fact, methods with high predictive power, when examined properly with sophisticated tooling, can yield practical insights that could never be realized by high-bias methods like linear and logistic regression. Furthermore, the insights gained by carefully examining a model can be used to suggest better features, thereby improving model performance. Thus, the twin goals of prediction and understanding can instead form a virtuous cycle rather than remaining in conflict.

In this workshop, we will work hands-on using XGBoost with real-world data sets to demonstrate how to approach data sets with the twin goals of prediction and understanding in a manner such that improvements in one area yield improvements in the other. Using modern tooling such as Individual Conditional Expectation (ICE) plots and SHAP, as well as a sense of curiosity, we will extract powerful insights that could not be gained from simpler methods. In particular, attention will be placed on how to approach a data set with the goal of understanding as well as prediction.
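A hedged sketch of that workflow is below, substituting a built-in scikit-learn dataset for the workshop’s real-world data; the hyperparameters are arbitrary.

```python
# Sketch of the XGBoost + SHAP workflow on a stand-in dataset.
import xgboost as xgb
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# SHAP attributes each prediction to individual features, giving per-row
# explanations on top of the model's predictive power.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```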

Instructor's Bio

Brian Lucena is Principal at Lucena Consulting and a consulting Data Scientist at Agentero. An applied mathematician in every sense, he is passionate about applying modern machine learning techniques to understand the world and act upon it. In previous roles he has served as SVP of Analytics at PCCI, Principal Data Scientist at Clover Health, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC-Berkeley, Brown, USF, and the Metis Data Science Bootcamp.

Brian Lucena, PhD

Consulting Data Scientist at Agentero

Workshop: Visual Analytics of Population Trajectory Data in Urban Environments

With the prevalence of GPS, Wi-Fi, cellular, and RFID devices, population mobility information is collected as the moving trajectories of taxis, fleets, public transit, and mobile phones by a variety of practitioners, such as transportation administrations, taxi companies, bike sharing companies, fleets, mobile service providers, and other researchers. The data is widely utilized by researchers and practitioners in urban systems, the environment, and the economy, as well as by citizens, to optimize urban planning, improve quality of life and the environment, and improve city operations.

To extract profound insights from the data, domain users must conduct iterative, evolving information foraging and sense-making, guiding the process with their domain knowledge. Iterative visual exploration is a key component of this process, and it should be supported by efficient data management and visualization tools. Visual analytics techniques and systems are needed that integrate scalable data management and interactive visualization with powerful computational capabilities.

In this workshop, attendees will gain knowledge about the processing and visualization of urban trajectory data, including (1) big data representation, processing, indexing, and data queries; and (2) trajectory data visualization techniques. Moreover, they will learn how to implement a web-based interactive visualization system, with source code and case studies from our open source TrajAnalytics software.

Instructor's Bio

Ye Zhao is a professor in the Department of Computer Science at Kent State University, Ohio, USA. He has been working on computer graphics and visualization for more than 20 years. His current research interests include visual analytics of urban transportation data, multidimensional, text, and animated data visualization. He has published numerous refereed technical papers and served in many program committees of data visualization conferences. His work has been actively supported by NSF, including his recent work which develops several open source software for urban data processing, management, and visualization. Ye Zhao received his PhD degree in computer science from Stony Brook University and B.S./M.S. degrees from Tsinghua University.

Ye Zhao, PhD

Professor at Department of Computer Science at Kent State University

Workshop: Target Leakage in Machine Learning

Target leakage is one of the most difficult problems in developing real-world machine learning models. Leakage occurs when the training data gets contaminated with information that will not be known at prediction time. Additionally, there can be multiple sources of leakage, from data collection and feature engineering to partitioning and model validation. As a result, even experienced data scientists can inadvertently introduce leaks and become overly optimistic about the performance of the models they deploy. In this talk, we will look through real-life examples of data leakage at different stages of the data science project lifecycle, and discuss various countermeasures and best practices for model validation.
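As one concrete illustration of the preprocessing-stage leak discussed above, the sketch below contrasts fitting a scaler on the full dataset before cross-validation with refitting it inside each fold via a Pipeline; on this small stand-in dataset the numeric gap is modest, but the pattern is the point.

```python
# Illustration of a preprocessing leak: scaling statistics computed from the
# full dataset (including validation folds) versus refit inside each fold.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler sees data the model will later be validated on.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=5000), X_scaled, y, cv=5)

# Safe: the scaler is refit on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
safe = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV accuracy: {leaky.mean():.3f}")
print(f"safe  CV accuracy: {safe.mean():.3f}")
```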

Instructor's Bio

Yuriy Guts is a Machine Learning Engineer at DataRobot with over 10 years of industry experience in data science and software architecture. His primary interests are productionalizing data science, automated machine learning, time series forecasting, and processing spoken and written language. He teaches AI and ML at UCU, competes on Kaggle, and has led multiple international data science and engineering teams.

Yuriy Guts

Machine Learning Engineer at DataRobot

Workshop: Predictions in Excel through Estimating Missing Values

In this workshop, we introduce a new data analysis tool that enables predictions in an Excel-like environment **without** any prior knowledge of Machine Learning, Statistics or Data Science. This seemingly magical ability is a direct consequence of viewing the question of prediction as estimating missing values or correcting errors within observations. More precisely, this boils down to estimating a structured “tensor” from its noisy, missing observations. We will show an intuitive, simple, and scalable approach for estimating such a tensor, as well as provide a collection of case studies using an actual tool.
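To make the “estimating a structured tensor” idea concrete, here is a generic iterative-SVD sketch for completing a low-rank matrix from noisy partial observations; it is illustrative only and is not the tool demonstrated in the workshop.

```python
# Generic sketch of the underlying idea: recover missing entries of a low-rank
# matrix (a 2-D "tensor") from noisy partial observations via truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
rank = 3
true = rng.normal(size=(50, rank)) @ rng.normal(size=(rank, 40))   # low-rank ground truth
mask = rng.random(true.shape) < 0.6                                 # 60% of entries observed
observed = true + 0.01 * rng.normal(size=true.shape)                # noisy observations

est = np.where(mask, observed, 0.0)                                 # start missing entries at 0
for _ in range(50):
    U, s, Vt = np.linalg.svd(est, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]              # project onto rank-3 matrices
    est = np.where(mask, observed, low_rank)                        # keep observed entries fixed

rmse = np.sqrt(np.mean((est[~mask] - true[~mask]) ** 2))
print(f"RMSE on the missing entries: {rmse:.4f}")
```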

Instructor's Bio

Christina Lee Yu is an Assistant Professor at Cornell University in Operations Research and Information Engineering. Prior to Cornell, she was a postdoc at Microsoft Research New England. She received her PhD in 2017 and MS in 2013 in Electrical Engineering and Computer Science from Massachusetts Institute of Technology in the Laboratory for Information and Decision Systems. She received her BS in Computer Science from California Institute of Technology in 2011. She received honorable mention for the 2018 INFORMS Dantzig Dissertation Award. Her research focuses on designing and analyzing scalable algorithms for processing social data based on principles from statistical inference.

Christina Lee Yu, PhD

Assistant Professor at Cornell University

Workshop: Predictions in Excel through Estimating Missing Values

In this workshop, we introduce a new data analysis tool that enables predictions in an Excel-like environment **without** any prior knowledge of Machine Learning, Statistics or Data Science. This seemingly magical ability is a direct consequence of viewing the question of prediction as estimating missing values or correcting errors within observations. More precisely, this boils down to estimating a structured “tensor” from its noisy, missing observations. We will show an intuitive, simple, and scalable approach for estimating such a tensor, as well as provide a collection of case studies using an actual tool.

Instructor's Bio

Devavrat Shah is a Professor with the department of Electrical Engineering and Computer Science at Massachusetts Institute of Technology. His current research interests are at the interface of Statistical Inference and Social Data Processing. His work has been recognized through prize paper awards in Machine Learning, Operations Research and Computer Science, as well as career prizes including 2010 Erlang prize from the INFORMS Applied Probability Society and 2008 ACM Sigmetrics Rising Star Award. He is a distinguished young alumni of his alma mater IIT Bombay.

Devavrat Shah, PhD

Professor, Co-Founder & Chief Scientist at MIT, Celect

Workshop: All The Cool Things You Can Do With Postgresql To Next Level Your Data Analysis

The intention of this VERY hands-on workshop is to get you introduced to, and playing with, some of the great features you never knew about in PostgreSQL. You know, and probably already love, PostgreSQL as your relational database. We will show you how you can forget about using ElasticSearch, MongoDB, and Redis for a broad array of use cases. We will add in some nice statistical work with R embedded in PostgreSQL. Finally, we will bring this all together using the gold standard in spatial databases, PostGIS. Unless you have a specialized use case, PostgreSQL is the answer. The session will be very hands-on with plenty of interactive exercises.
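As a small taste of the document-store use case, the sketch below stores and queries JSONB documents in PostgreSQL from Python with psycopg2; the connection string and table are hypothetical.

```python
# Hedged sketch: using PostgreSQL's JSONB type as a document store via psycopg2.
# The connection string and table name are hypothetical.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=demo user=demo")   # adjust for your environment
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body jsonb)")
cur.execute("INSERT INTO docs (body) VALUES (%s)",
            (Json({"type": "sensor", "city": "Boston", "reading": 42}),))

# JSONB containment query: find documents whose body contains the given object.
cur.execute("SELECT body->>'city', body->>'reading' FROM docs WHERE body @> %s::jsonb",
            (Json({"type": "sensor"}),))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```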

Instructor's Bio

Steve is the Developer Relations lead for DigitalGlobe. He goes around and shows off all the great work the DigitalGlobe engineers do. Steve has a Ph.D. in Ecology from the University of Connecticut.

Steven Pousty, PhD

Director of Developer Relations at Crunchy Data

Workshop: Guided Analytics Learnathon: Building Applications for Automated Machine Learning

This event will focus on automated machine learning and data visualization, and we’ll work in groups to build a simple application for automated machine learning (ML).

We will build an input page to explore the data and to insert the settings for data preparation, model training, and hyper-parameter optimization and we will build an output dashboard to visualize model insights and performance. At the conclusion of the event, the application is deployed on a KNIME Server and run from a web browser.

The tool of choice for this Learnathon is open source KNIME Analytics Platform, which also offers great integrations with R, Python, SQL, and Spark.

After an initial introduction to KNIME Analytics Platform and automated ML, we will split into two groups, with each group building one of the following:

Group 1: the start dashboard for visual data exploration and automated ML settings
Group 2: the final dashboard to visualize model accuracy and speed performance
We will provide the dataset, jump-start workflows, and final solutions, and of course data visualization and ML experts.

Please bring your own laptop with KNIME Analytics Platform pre-installed.
To install KNIME Analytics Platform, follow the instructions provided in these YouTube videos: Windows, Mac, Linux.
If you would like to get familiar with KNIME Analytics Platform, you can explore KNIME E-learning course.
Before the event, we will share the link to download all workshop material (jump-start workflows, slides, and instructions).

Instructor's Bio

Paolo Tamagnini currently works as a data scientist at KNIME.
Paolo holds a master’s degree in data science and has research experience in data visualization techniques for machine learning interpretability.

Paolo Tamagnini

Data Scientist at KNIME, Inc.

Workshop: Guided Analytics Learnathon: Building Applications for Automated Machine Learning

This event will focus on automated machine learning and data visualization, and we’ll work in groups to build a simple application for automated machine learning (ML).

We will build an input page to explore the data and to insert the settings for data preparation, model training, and hyper-parameter optimization and we will build an output dashboard to visualize model insights and performance. At the conclusion of the event, the application is deployed on a KNIME Server and run from a web browser.

The tool of choice for this Learnathon is open source KNIME Analytics Platform, which also offers great integrations with R, Python, SQL, and Spark.

After an initial introduction to KNIME Analytics Platform and automated ML, we will split into two groups, with each group building one of the following:

Group 1: the start dashboard for visual data exploration and automated ML settings
Group 2: the final dashboard to visualize model accuracy and speed performance
We will provide the dataset, jump-start workflows, and final solutions, and of course data visualization and ML experts.

Please bring your own laptop with KNIME Analytics Platform pre-installed.
To install KNIME Analytics Platform, follow the instructions provided in these YouTube videos: Windows, Mac, Linux.
If you would like to get familiar with KNIME Analytics Platform, you can explore KNIME E-learning course.
Before the event, we will share the link to download all workshop material (jump-start workflows, slides, and instructions).

Instructor's Bio

Scott Fincher works for KNIME, Inc. as a Data Scientist. He has presented several talks on KNIME’s open source Analytics Platform and enjoys assisting other data scientists with optimizing and deploying their models. Prior to his work at KNIME, he worked for almost 20 years as an environmental consultant, with a focus on numerical modeling of atmospheric pollutants. He holds an MS in Statistics and a BS in Meteorology, both from Texas A&M University.

Scott Fincher

Data Scientist at KNIME, Inc.

Workshop: Machine Learning Estimation of Heterogeneous Treatment Effects: the Microsoft EconML Library

One of the biggest promises of machine learning is the automation of decision making in a multitude of application domains. A core problem that arises in most data-driven personalized decision scenarios is the estimation of heterogeneous treatment effects: what is the effect of an intervention on an outcome of interest as a function of a set of observable characteristics of the treated sample? For instance, this problem arises in personalized pricing, where the goal is to estimate the effect of a price discount on the demand as a function of characteristics of the consumer. Similarly it arises in medical trials where the goal is to estimate the effect of a drug treatment on the clinical response of a patient as a function of patient characteristics. In many such settings we have an abundance of observational data, where the intervention was chosen via some unknown policy and the ability to run control A/B tests is limited.

We will present recent research advances in the area of machine learning based estimation of heterogeneous treatment effects. These novel methods offer large flexibility in modeling the effect heterogeneity (via techniques such as random forests, boosting, lasso and neural nets), while at the same time leverage techniques from causal inference and econometrics to preserve the causal interpretation of the learned model and many times also offer statistical validity via the construction of valid confidence intervals. We will also present and demo the Microsoft EconML library, an open source package developed by the ALICE project of Microsoft Research, New England, which implements several recent estimation algorithms in a common python API.
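For orientation, the sketch below shows the simplest flavor of heterogeneous treatment effect estimation, a “two-model” (T-learner) approach on simulated data with scikit-learn; it illustrates the idea only and is not the EconML API, which provides more principled estimators and valid confidence intervals.

```python
# Generic "two-model" (T-learner) sketch of heterogeneous treatment effect
# estimation on simulated data. Illustrative only; not the EconML API.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-1, 1, size=(n, 3))            # observable characteristics (e.g. customer features)
T = rng.integers(0, 2, size=n)                 # binary treatment (e.g. discount), randomized here
true_cate = 1.0 + 2.0 * X[:, 0]                # effect varies with the first characteristic
Y = X[:, 1] + true_cate * T + rng.normal(scale=0.5, size=n)   # outcome (e.g. demand)

# Fit one outcome model per treatment arm; the estimated effect is the
# difference of their predictions as a function of X.
m1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])
m0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
cate_hat = m1.predict(X) - m0.predict(X)

print(f"mean absolute error of the effect estimate: {np.mean(np.abs(cate_hat - true_cate)):.3f}")
```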

Instructor's Bio

Vasilis Syrgkanis is a Researcher at Microsoft Research, New England. He received his Ph.D. in Computer Science from Cornell University in 2014, under the supervision of Prof. Eva Tardos, and subsequently spent two years at Microsoft Research, New York as a postdoctoral researcher in the Machine Learning and Algorithmic Economics groups. His research addresses problems at the intersection of theoretical computer science, machine learning and economics. His work received best paper awards at the 2015 ACM Conference on Economics and Computation (EC’15) and at the 2015 Annual Conference on Neural Information Processing Systems (NIPS’15), and he was the recipient of the Simons Fellowship for graduate students in theoretical computer science 2012-2014.

Vasilis Syrgkanis, PhD

Researcher at Microsoft Research

Workshop: Machine Learning Estimation of Heterogeneous Treatment Effects: the Microsoft EconML Library

One of the biggest promises of machine learning is the automation of decision making in a multitude of application domains. A core problem that arises in most data-driven personalized decision scenarios is the estimation of heterogeneous treatment effects: what is the effect of an intervention on an outcome of interest as a function of a set of observable characteristics of the treated sample? For instance, this problem arises in personalized pricing, where the goal is to estimate the effect of a price discount on the demand as a function of characteristics of the consumer. Similarly it arises in medical trials where the goal is to estimate the effect of a drug treatment on the clinical response of a patient as a function of patient characteristics. In many such settings we have an abundance of observational data, where the intervention was chosen via some unknown policy and the ability to run control A/B tests is limited.

We will present recent research advances in the area of machine learning based estimation of heterogeneous treatment effects. These novel methods offer great flexibility in modeling the effect heterogeneity (via techniques such as random forests, boosting, lasso, and neural nets), while at the same time leveraging techniques from causal inference and econometrics to preserve the causal interpretation of the learned model, and often also offering statistical validity via the construction of valid confidence intervals. We will also present and demo the Microsoft EconML library, an open source package developed by the ALICE project of Microsoft Research, New England, which implements several recent estimation algorithms in a common Python API.

Instructor's Bio

Miruna Oprescu is a Data and Applied Scientist at Microsoft Research New England. In her current role, Miruna works alongside researchers and software engineers to build the next generation of machine learning tools for interdisciplinary applications.
Miruna splits her time between two projects: project ALICE, a Microsoft Research initiative aimed at applying artificial intelligence concepts to economic decision making, and the Machine Learning for Cancer Immunotherapies initiative, a collaboration with doctors and cancer researchers with the goal of applying machine learning techniques to improve cancer therapies.
Prior to her current position, Miruna was a software engineer at Microsoft building MMLSpark, an open source distributed machine learning library powered by Apache Spark.

Miruna Oprescu

Data and Applied Scientist II at Microsoft Research

Workshop: Machine Learning Estimation of Heterogeneous Treatment Effects: the Microsoft EconML Library

One of the biggest promises of machine learning is the automation of decision making in a multitude of application domains. A core problem that arises in most data-driven personalized decision scenarios is the estimation of heterogeneous treatment effects: what is the effect of an intervention on an outcome of interest as a function of a set of observable characteristics of the treated sample? For instance, this problem arises in personalized pricing, where the goal is to estimate the effect of a price discount on the demand as a function of characteristics of the consumer. Similarly it arises in medical trials where the goal is to estimate the effect of a drug treatment on the clinical response of a patient as a function of patient characteristics. In many such settings we have an abundance of observational data, where the intervention was chosen via some unknown policy and the ability to run control A/B tests is limited.

We will present recent research advances in the area of machine learning based estimation of heterogeneous treatment effects. These novel methods offer great flexibility in modeling the effect heterogeneity (via techniques such as random forests, boosting, lasso, and neural nets), while at the same time leveraging techniques from causal inference and econometrics to preserve the causal interpretation of the learned model, and often also offering statistical validity via the construction of valid confidence intervals. We will also present and demo the Microsoft EconML library, an open source package developed by the ALICE project of Microsoft Research, New England, which implements several recent estimation algorithms in a common Python API.

Instructor's Bio

Keith Battocchi is a software engineer at Microsoft Research New England, where he is currently working on software for applying machine learning algorithms to economic problems. Over the past decade, he has worked in a variety of areas including programming language research, building query classifiers for Bing, and building a system to assess television advertising effectiveness.

Keith Battocchi

Software Engineer at Microsoft Research

Workshop: Introduction To Face Processing With Computer Vision

Ever wonder how Facebook’s facial recognition or Snapchat’s filters work?

Faces are a fundamental piece of photography, and building applications around them has never been easier with open-source libraries and pre-trained models.

In this talk, we’ll help you understand some of the computer vision and machine learning techniques behind these applications. Then, we’ll use this knowledge to develop our own prototypes to tackle tasks such as face detection (e.g. digital cameras), recognition (e.g. Facebook Photos), classification (e.g. identifying emotions), manipulation (e.g. Snapchat filters), and more.
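As a small taste of how little code a first prototype takes, here is a hedged sketch of face detection using OpenCV’s bundled Haar cascade model; the input image path is a placeholder, and the workshop may use different libraries or pre-trained models.

import cv2

# Haar cascade for frontal faces shipped with the opencv-python package.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("photo.jpg")                     # hypothetical local image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around each detected face and save the result.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", image)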

Instructor's Bio

Gabriel is the founder of Scalar Research, a full-service artificial intelligence & data science consulting firm. Scalar helps companies tackle complex business challenges with data-driven solutions leveraging cutting-edge machine learning and advanced analytics.

Previously, Gabriel was a B.S. & M.S. student in computer science at Stanford, where he conducted research on computer vision, deep learning, and quantum computing. He’s also spent time at Google, Facebook, startups, and investment firms.

Gabriel Bianconi

Founder at Scalar Research

Workshop: Building a Scalable REST Endpoint for Machine Learning

Real-time model serving is a crucial capability to deliver value from data science projects. Unfortunately, many existing REST endpoint implementations cannot scale for large-volume and low-latency applications. In particular, existing ML platforms with REST serving capabilities can fail in production because the real-time serving infrastructure was not designed to scale while maintaining performance SLAs. In this session, you will learn about:
– The problems with existing REST endpoints and why they are not production grade;
– The technologies you can use to build out a REST endpoint, and their pros and cons;
– How to build a scalable REST endpoint with Flask, uWSGI, and NGINX;
– What it looks like to deploy a REST endpoint using this technology stack in the real world;
– What kind of performance you can expect when using this type of infrastructure.
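To make the Flask piece concrete, here is a minimal sketch of a prediction endpoint intended to sit behind uWSGI and NGINX rather than Flask’s development server; the model file and request format are hypothetical.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")        # hypothetical model file; load once at startup

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = [payload["features"]]       # expects {"features": [x1, x2, ...]}
    prediction = model.predict(features)[0]
    return jsonify({"prediction": float(prediction)})

# uWSGI would then point at this module, for example:
#   uwsgi --http :8000 --module app:app --processes 4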

Instructor's Bio

Lior Amar is the Principal Engineer at ParallelM, where he is responsible for the MCenter platform. He is an expert with 20 years’ experience in distributed systems development, low-level system programming, and HPC cluster management / Linux systems. Before joining ParallelM, Lior was a government researcher working on high-performance computing (HPC). Before that, he was the Founder and CTO of Cluster Logic, a distributed systems consulting company. He holds a Ph.D. and a Master’s degree in Computer Science, focused on distributed systems.

Lior Amar, PhD

Senior Engineer at ParallelM

Workshop: Deploying Data Science Applications

Bridging the gap from research and prototypes to production software continues to be a major challenge for most maturing data science teams. In this workshop we will discuss and demonstrate how various Data Science workflows and tasks are deployed into production.

We will center the discussion around three different use cases and workflows – machine learning models and pipelines, ETL/batch processing jobs, and real-time Analytics. We will start by demonstrating how to quickly deploy a simple machine learning model using Flask and Docker and then extend this into a more holistic end-to-end machine learning pipeline with AWS Sagemaker. Next, we will use Apache Airflow and Kubernetes to deploy and monitor an automated batch processing pipeline. Finally, we will demonstrate how to capture, transform and analyze streaming data in real time using AWS Kinesis and AWS Lambda.
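As a hedged sketch of the real-time piece, the following shows the shape of an AWS Lambda handler consuming records from a Kinesis stream; the payload fields are placeholders, not the workshop’s actual schema.

import base64
import json

def handler(event, context):
    """Lambda entry point invoked with a batch of Kinesis records."""
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Placeholder processing: inspect fields, compute a metric, or forward downstream.
        print(payload.get("event_type"), payload.get("value"))
    return {"records_processed": len(event["Records"])}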

For each of these workflows, we will introduce a specific use case and follow with an exploration of the technologies used to deploy these at production scale. We will discuss the reasons for these choices – ultimately leading to some higher-level takeaways on matching system requirements to product requirements.

Instructor's Bio

Garrett Hoffman is the Director of Data Science at StockTwits. In this role, Garrett captures, creates, and shapes data from countless endpoints to create systems both visible and invisible on the StockTwits platform. There are times when he’s implementing new machine learning algorithms to improve the member experience. Other times, he uses StockTwits data to shape future products. Outside of StockTwits, Garrett plays and watches basketball, studies machine learning in graduate school, and keeps up on the latest in artificial intelligence advancements.

Garrett Hoffman

Director of Data Science at StockTwits

Workshop: A Master Class in AI and Machine Learning for Finance

The use of data science and machine learning in the investment industry is increasing, and investment professionals, both fundamental and quantitative, are taking notice. Financial firms are taking AI and machine learning seriously to augment traditional investment decision making. Alternative datasets, text analytics, cloud computing, and algorithmic trading are game changers for many firms, which are adopting technology at a rapid pace. As more and more technologies penetrate enterprises, financial professionals are enthusiastic about the upcoming revolution and are looking for direction and education on data science and machine learning topics.

In this workshop, we aim to bring clarity to how AI and machine learning are revolutionizing financial services. We will introduce key concepts and, through examples and case studies, illustrate the role of machine learning, data science techniques, and AI in the investment industry. At the end of this workshop, participants will have a concrete picture of how machine learning and AI techniques are fueling the Fintech wave!

Instructor's Bio

Sri Krishnamurthy, CFA, CAP is the founder of QuantUniversity.com, a data and quantitative analysis company, and the creator of the Analytics Certificate program and Fintech Certificate program. Sri has more than 15 years of experience in analytics, quantitative analysis, statistical modeling, and designing large-scale applications. Prior to starting QuantUniversity, Sri worked at Citigroup, Endeca, MathWorks and with more than 25 customers in the financial services and energy industries. He has trained more than 1,000 students in quantitative methods, analytics, and big data in the industry and at Babson College, Northeastern University, and Hult International Business School. Sri is leading development efforts in creating a platform called QuSandbox for adopting open source and analytics solutions within regulated industries.

Sri Krishnamurthy

Chief Data Scientist and President at QuantUniversity.com

Workshop: Introduction To Meta-Kaggle And Human-AI Teams

Meta-Kaggle is a dataset released by Kaggle, an online machine learning competition platform. It includes a range of information on the competitions they have run, including participants, scoreboard results, code written, and more, all compiled into an easy-to-access SQL database. In this training, participants will be introduced to the dataset, accessing it with SQL commands, and using the dataset to train a model with the Python toolbox scikit-learn. I will also describe some research performed as a portion of Project Alloy, in which the dataset was used to predict when code will fail to run.
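As a hedged sketch of that workflow, the snippet below loads a table from a local copy of the dataset and fits a scikit-learn model; the database file name, table, and column names are placeholders rather than the exact Meta-Kaggle schema.

import sqlite3
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder file, table, and column names used to illustrate the pattern only.
conn = sqlite3.connect("meta_kaggle.sqlite")
df = pd.read_sql("SELECT * FROM Submissions", conn)

X = df[["feature_one", "feature_two"]].fillna(0)   # hypothetical numeric columns
y = df["target_column"]                            # hypothetical binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))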

Project Alloy (funded by DARPA’s Agile Teams program) aimed to develop and implement intelligent machine agents that team with humans (creating hybrid teams) in meaningful and supportive ways. Hosting challenges on a citizen science competition platform provided a unique environment to develop and test hybrid team hypotheses. By dynamically sensing each team’s state and progress towards a solution, we enabled testing of hypotheses formulated about hybrid team management. The combination of team performance monitoring and leaderboard scoring transformed the contest platform into a machine intelligence laboratory. Agents with the ability to formulate compelling teams, predict code failures, and provide feedback on model vulnerabilities were created to serve as AI teammates during the competition. By providing machine agents that augment or substitute human roles, we explored a tighter synthesis of human and machine, leading to greater resilience and agility under changing project goals and constraints.

Instructor's Bio

Coming soon!

Laura A. Seaman, PhD

Machine Intelligence Scientist at Draper

Workshop: Learning from Large-Scale Spatiotemporal Data

Applications such as climate science, intelligent transportation, aerospace control, and sports analytics apply machine learning for large-scale spatiotemporal data. This data is often nonlinear, high-dimensional, and demonstrates complex spatial and temporal correlation. Existing machine learning models cannot handle complex spatiotemporal dependency structures. We’ll explain how to design machine learning models to learn from large-scale spatiotemporal data, especially for dealing with non-Euclidean geometry, long-term dependencies, and logical and physical constraints. We’ll showcase the application of these models to problems such as long-term forecasting for transportation, long-range trajectories synthesis for sports analytics, and combating ground effect in quadcopter landing for aerospace control.

Instructor's Bio

Dr. Rose Yu is an Assistant Professor at Northeastern University Khoury College of Computer Sciences. Previously, she was a postdoctoral researcher in the Department of Computing and Mathematical Sciences at Caltech. She earned her Ph.D. in Computer Science at the University of Southern California and was a visiting researcher at Stanford University.

Her research focuses on machine learning for large-scale spatiotemporal data and its applications, especially in the emerging field of computational sustainability. She has over a dozen publications in leading machine learning and data mining conferences and several patents. She is the recipient of the USC Best Dissertation Award, “MIT Rising Stars in EECS”, and the Annenberg Fellowship.

Rose Yu, PhD

Assistant Professor at Northeastern University College of Computer and Information Science

Workshop: Soss: Lightweight Probabilistic Programming in Julia

Probabilistic programming is sometimes referred to as “modeling for hackers”, and has recently been picking up steam with a flurry of releases including Stan, PyMC3, Edward, Pyro, and TensorFlow Probability.
As these and similar systems have improved in performance and usability, they have unfortunately also become more complex and difficult to contribute to. This is related to a more general phenomenon of the “two language problem”, in which a performance-critical domain like scientific computing involves both a high-level language for users and a high-performance language for developers to implement algorithms. This establishes a kind of wall between the two groups and has a harmful effect on performance, productivity, and pedagogy.
In probabilistic programming, this effect is even stronger, and it’s increasingly common to see three languages: one for writing models, a second for data manipulation, model assessment, etc., and a third for implementation of inference algorithms.
In this workshop, we’ll see how the Julia programming language can help to solve this problem, and we’ll explore the basic ideas in Soss, a new probabilistic programming language written entirely in Julia. Soss allows a high-level representation of the kinds of models often written in PyMC3 or Stan, and offers a way to programmatically specify and apply model transformations like approximations or reparameterizations.

Instructor's Bio

Dr. Chad Scherrer has been actively developing and using probabilistic programming systems since 2010 and served as technical lead for the language evaluation team in DARPA’s Probabilistic Programming for Advancing Machine Learning (“PPAML”) program. Much of his blog is devoted to describing Bayesian concepts using PyMC3, while his current Soss.jl project aims to improve execution performance by directly manipulating source code for models expressed in the Julia Programming Language.
Chad is a Senior Data Scientist at Metis Seattle, where he teaches the Data Science Bootcamp.

Chad Scherrer, PhD

Senior Data Scientist at Metis

Workshop: AI/ML Algorithmic Based Recommendations for Cost and Time Effective Hiring Practices

Despite the advent of big data, predictive analytics, and artificial intelligence, the $200 billion worldwide recruitment market is driven predominantly by a human/manual process that is prone to inefficiency and inaccuracy.

Bad hires cost employers nearly 30 percent of an employee’s annual earnings, while companies spend millions on recruitment advertising annually, using strategies based on past performance and little more than gut instinct.

There is stiff competition for talent in today’s job market amidst the tight labor market and increasing expectations of job seekers. Employers are challenged to fill headcount in a time-efficient and cost-effective manner. They need much better predictive recommendations to improve their recruiting marketing spend and to hire cheaper and faster.

iCIMS, the world’s leading best-in-class recruitment software provider, applies data science practices to analyze the hiring activities of 75 million applicants and 288 million visitors to the career sites of more than 4,000 companies hosted on its proprietary database in 2018 alone.

Join these sessions to discuss and explore how to:

• Apply artificial intelligence/deep learning and machine learning methods to develop a recommendation engine for the best hiring practices. A variety of artificial intelligence techniques ranging from natural language processing, classification machine learning models and deep learning will be examined.
• Solve for the problems of recruiters and HR professionals using artificial intelligence and machine learning without inheriting human bias and error
• Cleanse, normalize, analyze and predict the data behind massive amounts of hiring activity
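As a hedged illustration of the deep learning point above, here is a minimal Keras text-classification sketch of the general shape such a recommendation component might take; the vocabulary size, labels, and architecture are invented for illustration and do not reflect iCIMS’ actual models.

import numpy as np
import tensorflow as tf

num_words, seq_len, num_classes = 20000, 100, 10   # illustrative sizes only

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 64),       # token embeddings
    tf.keras.layers.GlobalAveragePooling1D(),       # average-pool over the sequence
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X would hold integer-encoded token sequences (e.g. job descriptions), y the labels.
X = np.random.randint(0, num_words, size=(256, seq_len))
y = np.random.randint(0, num_classes, size=(256,))
model.fit(X, y, epochs=1, batch_size=32, verbose=0)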

Instructor's Bio

Dr. Dastgeer Shaikh is a senior data scientist at iCIMS, a leading provider of recruitment software solutions for global enterprise companies. At iCIMS, he has been actively engaged in artificial intelligence (AI) and machine learning (ML) algorithm development aimed at producing predictive insights into recruiting and job market data to help employers make smarter, more informed hiring decisions.
Using cutting-edge technology such as TensorFlow with the Keras Python framework, Dr. Shaikh has developed an AI model that predicts candidates for open jobs that employers post online and suggests relevant open jobs for candidates. He has extensive experience with numerous state-of-the-art data science techniques such as natural language processing, Bayesian models, deep learning, neural networks, ensemble modeling, linear and nonlinear modeling, data cleansing, building Python APIs for automation, time series analysis, statistics, and mathematical models.
Dr. Shaikh’s interests include financial, aerospace, and social-related AI/ML modeling. He has built many AI and machine learning driven models to detect transaction risks, predict space weather, social behavior, etc. Dr. Shaikh published extensively during his tenure as a PhD student and post-doctoral researcher in computational physics.

Dastgeer Shaikh, PhD

Senior Data Scientist at iCIMS

Workshop: AI/ML Algorithmic Based Recommendations for Cost and Time Effective Hiring Practices

Despite the advent of big data, predictive analytics, and artificial intelligence, the $200 billion worldwide recruitment market is driven predominantly by a human/manual process that is prone to inefficiency and inaccuracy.

Bad hires cost employers nearly 30 percent of an employee’s annual earnings, while companies spend millions on recruitment advertising annually, using strategies based on past performance and little more than gut instinct.

There is stiff competition for talent in today’s job market amidst the tight labor market and increasing expectations of job seekers. Employers are challenged to fill headcount in a time-efficient and cost-effective manner. They need much better predictive recommendations to improve their recruiting marketing spend and to hire cheaper and faster.

iCIMS, the world’s leading best-in-class recruitment software provider, applies data science practices to analyze the hiring activities of 75 million applicants and 288 million visitors to the career sites of more than 4,000 companies hosted on its proprietary database in 2018 alone.

Join these sessions to discuss and explore how to:

• Apply artificial intelligence/deep learning and machine learning methods to develop a recommendation engine for the best hiring practices. A variety of artificial intelligence techniques ranging from natural language processing, classification machine learning models and deep learning will be examined.
• Solve for the problems of recruiters and HR professionals using artificial intelligence and machine learning without inheriting human bias and error
• Cleanse, normalize, analyze and predict the data behind massive amounts of hiring activity

Instructor's Bio

Christopher Maier is a data scientist at iCIMS, a leading provider of recruitment software solutions for global enterprise companies. Maier plays an instrumental role in producing data insights for thought leadership content for iCIMS, including the development of the iCIMS Monthly Hiring Indicator, which measures job openings and hires. He built the indicator, which provides an early and all-encompassing view of the U.S. labor market, drawing from iCIMS’ database of more than 75 million applications and 3 million jobs a year.
Maier has additional experience in the medical device and pharmaceutical industries, solving business problems as a statistician/statistical modeler at companies including Roche Molecular Systems and The Janssen Pharmaceutical Companies of Johnson & Johnson. He holds a master’s degree in Applied Statistics from the New Jersey Institute of Technology.

Christopher Maier

Data Scientist at iCIMS

Workshop: Automating Machine Learning Lifecycle with Kubeflow

During the workshop we are going to build and automate the consecutive stages of the machine learning lifecycle, from data preparation through model maintenance in production, using Kubeflow, a machine learning toolkit for Kubernetes, and Hydrosphere.io, an open source machine learning model management platform.

You will learn how to execute and automate the following steps:  
– Data preparation — perform data gathering and transformation to use further for training;
– Model training — train a new model on the prepared data, using a predefined model architecture, and save the model’s artifacts;
– Model cataloguing — extract the model’s metadata from the trained model artifacts and upload them to Hydrosphere.io;
– Model deployment — deploy an uploaded model to production and expose REST, gRPC, and Kafka endpoints;
– Integration testing — perform integration tests on your model on recent production traffic as well as on gathered edge cases;
– Model monitoring — supply the deployed model with monitoring services to watch its behaviour in the production environment;
– Model maintenance — repeatedly fine-tune your model with the most relevant production data.
Configuring ML pipelines properly will save a lot of time and effort in ML engineers’ daily routines. Come and see the best practices we’ve gained working with our customers.

We will provide sandboxes within our AWS infrastructure so participants can follow along with the workshop.
Prerequisites for taking a productive part:
Docker (https://docs.docker.com/)
Kubectl (https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl)
pip install kfp==0.1.11 Pillow==5.2.0 numpy==1.16.2 hs>=2.0.0rc6 requests==2.21.0 scikit-learn==0.20.2
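For orientation, here is a minimal, hedged sketch of what a Kubeflow Pipelines definition wiring two of the stages above together can look like; the container images and commands are placeholders, and the kfp API shown is believed to match the pinned version but may differ slightly.

import kfp
import kfp.dsl as dsl

@dsl.pipeline(name="ml-lifecycle", description="Prepare data, then train a model")
def ml_lifecycle_pipeline():
    prepare = dsl.ContainerOp(
        name="prepare-data",
        image="example.registry/prepare:latest",     # hypothetical image
        command=["python", "prepare.py"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="example.registry/train:latest",       # hypothetical image
        command=["python", "train.py"],
    )
    train.after(prepare)                              # run training once data is ready

if __name__ == "__main__":
    # Compile to an archive that can be uploaded to a Kubeflow Pipelines deployment.
    kfp.compiler.Compiler().compile(ml_lifecycle_pipeline, "ml_lifecycle.tar.gz")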

Instructor's Bio

Stepan Pushkarev is the CTO at Hydrosphere.io. His background is in engineering data and AI/ML platforms. He has spent the last couple of years building continuous delivery and monitoring tools for machine learning applications, as well as designing streaming data platforms. He works closely with data scientists to make them productive in their daily operations and efficient in delivering value.

Stepan Pushkarev

CTO at Hydrosphere.io

Workshop: Beyond Deep Learning – Differentiable Programming with Flux

Deep learning is a rapidly evolving field, and models are increasingly complex. Recently, researchers have begun to explore “differentiable programming”, a powerful way to combine neural networks with traditional programming. Differentiable programs may include control flow, functions, and data structures, and can even incorporate ray tracers, simulations, and scientific models, giving us unprecedented power to find subtle patterns in our data.

This workshop will show you how this technique, and particularly Flux – a state-of-the-art deep learning library – is impacting the machine learning world. We will show you how Flux makes it easy to create traditional deep learning models, and explain how the flexibility of the Julia language allows complex physical models to be optimised by the same architecture. We’ll outline important recent work and show how Flux allows us to easily combine neural networks with tools like differential equation solvers.

Instructor's Bio

Jeff is one of the creators of Julia, co-founding the project at MIT in 2009 and eventually receiving a Ph.D. related to the language in 2015. He continues to work on the compiler and system internals, while also working to expand Julia’s commercial reach as a co-founder of Julia Computing, Inc.

Jeff Bezanson, PhD

Co-creator of the Julia language, Co-founder & CTO at Julia Computing, Inc.

Workshop: Building and Managing World Class Data Science Teams (Easier Said Than Done)

Despite the promise and opportunities of data science, many organizations are failing to see a return on their investment. The key issue holding organizations back is a lack of good data science management. This manifests in failure to effectively build and manage teams. In this workshop, we will go through a methodological approach for helping executives and managers identify the needs of their organization and build the appropriate team.

We will learn how to:
1 – Assess your organizational readiness
2 – Get a brief overview of the foundational elements
3 – Select and recruit the right team for your organization
4 – Develop and manage that team to deliver business value
5 – Create internal pipelines of great data science managers and technical rock stars

Instructor's Bio

Conor Jensen is an experienced Data Science executive with over 15 years working in the analytics space across multiple industries as both a consumer and developer of analytics solutions. He is the founder of Renegade Science, a Data Science strategy and coaching consultancy and works as a Customer Success Team Lead at Dataiku, helping customers make the most of their Data Science platform and guiding them through building teams and processes to be successful. He has worked at multiple Data Science platform startups and has successfully built out analytics functions at two multinational insurance companies. This includes building out data and analytics platforms, Business Intelligence capabilities, and Data Science teams serving both internal and external customers.
Before moving to insurance, Conor was a Weather Forecaster in the US Air Force supporting operations in Southwest Asia. After leaving the military, Conor spent a number of years in store management at Starbucks Coffee while serving as an Emergency Management Technician in the Illinois Air National Guard.
Conor earned his Bachelor of Science degree in Mathematics from the University of Illinois at Chicago.

Conor Jensen

Customer Success Team Lead at Dataiku

Tutorials:


Tutorial: Making Data Science: AIG, Amazon, Albertsons

Developing an internal data science capability requires a cultural shift, a strategic mapping process that aligns with existing business objectives, a technical infrastructure that can host new processes, and an organizational structure that can alter business practice to create measurable impact on business functions. This workshop will take you through ways to consider the vast opportunities for data science, to identify and prioritize what will add the most value to your organization, and then to budget and hire into those commitments. Learn the most effective ways to establish data science objectives from a business perspective, including recruiting, retention, goal setting, and improving the business.

Instructor's Bio

Haftan Eckholdt, PhD, is Chief Data Science Officer at Plated. His career began with research professorships in Neuroscience, Neurology, and Psychiatry, followed by industrial research appointments at companies like Amazon and AIG. He holds graduate degrees in Biostatistics and Developmental Psychology from Columbia and Cornell Universities. In his spare time he thinks about things like chess and cooking and cross country skiing and jogging and reading. When things get really really busy, he actually plays chess and cooks delicious meals and jogs a lot. Born and raised in Baltimore, Haftan has been a resident of Kings County, New York since the late 1900’s.

Haftan Eckholdt, PhD

Chief Data Science & Chief Science Officer at Understood.org

Tutorial: How should we (correctly) compare graphs?

Graph representations of real-world phenomena are ubiquitous – from social and information networks, to technological, biological, chemical, and brain networks. Many graph mining tasks require a distance (or, conversely, a similarity) measure between graphs. Examples include clustering of graphs and anomaly detection, nearest neighbor and similarity search, pattern recognition, and transfer learning. Such tasks find applications in diverse areas including image processing, chemistry, and social network analysis, to name a few.

Intuitively, given two graphs, their distance is a score quantifying their structural differences. A highly desirable property for such a score is that it is a metric, i.e., it is non-negative, symmetric, positive-definite, and, crucially, satisfies the triangle inequality. Metrics exhibit significant computational advantages over non-metrics. For example, operations such as nearest-neighbor search, clustering, outlier detection, and diameter computation have known fast algorithms precisely when performed over objects embedded in a metric space.

Unfortunately, algorithms to compute several classic distances between graphs do not scale to large graphs; other distances do not satisfy all of the metric properties: non-negativity, positive definiteness, symmetry, and triangle inequality.

The purpose of this tutorial is to go over the recent and expanding literature on graph metric spaces, focusing specifically on tractable metrics. Furthermore, we also explain how to compute the distance between n graphs in a way that the resulting distance satisfies a generalization of the triangle inequality to n elements, while remaining tractable.

Instructor's Bio

José Bento completed his Ph.D. in Electrical Engineering at Stanford University, where he worked with Professor Andrea Montanari on statistical inference and structural learning of graphical models. After his Ph.D., he moved to Disney Research’s Boston lab, where he worked with Dr. Jonathan Yedidia on algorithms for distributed optimization, robotics, and computer vision. He is now with the Computer Science department at Boston College. His current research lies at the intersection of distributed algorithms and machine learning. In 2014 he received a Disney Inventor Award for his work on distributed optimization, which recently led to an approved patent. In 2016 he was awarded a $10M NIH joint grant to study the emergence of antibiotic resistance, and in 2017 a $2M NSF joint grant to study measures of distance between large graphs.

Jose Bento, PhD

Assistant Professor at Boston College

Tutorial: Deep Learning on Mobile

Over the last few years, convolutional neural networks (CNN) have risen in popularity, especially in the area of computer vision. Many mobile applications running on smartphones and wearable devices would potentially benefit from the new opportunities enabled by deep learning techniques. However, CNNs are by nature computationally and memory intensive, making them challenging to deploy on a mobile device.

This workshop explains how to practically bring the power of convolutional neural networks and deep learning to memory- and power-constrained devices like smartphones. You will learn various strategies to circumvent obstacles and build mobile-friendly, shallow CNN architectures that significantly reduce the memory footprint and are therefore easier to store on a smartphone. The workshop also dives into how to use a family of model compression techniques to prune the network size for live image processing, enabling you to build a CNN version optimized for inference on mobile devices. Along the way, you will learn practical strategies to preprocess your data in a manner that makes the models more efficient in the real world.

Following a step-by-step example of building an iOS deep learning app, we will discuss tips and tricks, speed and accuracy trade-offs, and benchmarks on different hardware to demonstrate how to get started developing your own deep learning application suitable for deployment on storage- and power-constrained mobile devices. You can also apply similar techniques to make deep neural nets more efficient when deploying in a regular cloud-based production environment, thus reducing the number of GPUs required and optimizing on cost.
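As a small, hedged illustration of the architecture point, the snippet below compares the parameter counts of a compact, mobile-oriented network against a larger server-class one using the Keras reference implementations; exact counts depend on the library version.

import tensorflow as tf

mobile = tf.keras.applications.MobileNetV2(weights=None)   # compact, mobile-oriented CNN
large = tf.keras.applications.ResNet50(weights=None)       # larger server-class CNN

print("MobileNetV2 parameters:", mobile.count_params())
print("ResNet50 parameters:   ", large.count_params())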

Instructor's Bio

Anirudh is the Head of AI & Research at Aira (visual interpreter for the blind), and was previously at Microsoft AI & Research, where he founded Seeing AI, a talking camera app for the blind community. He is also the co-author of the upcoming book ‘Practical Deep Learning for Cloud and Mobile’. He brings over a decade of production-oriented applied research experience on petabyte-scale datasets, with features shipped to about a billion people. He has been prototyping ideas using computer vision and deep learning techniques for augmented reality, speech, productivity, and accessibility. Some of his recent work, which IEEE has called ‘life changing’, has been honored by CES, FCC, Cannes Lions, and the American Council of the Blind; showcased at events by the White House, House of Lords, and World Economic Forum, as well as on Netflix and National Geographic; and applauded by world leaders including Justin Trudeau and Theresa May.

Anirudh Koul

Head of AI & Research at Aira

Tutorial: When the Bootstrap Breaks

Resampling methods like the bootstrap are becoming increasingly common in modern data science. For good reason too; the bootstrap is incredibly powerful. Unlike t-statistics, the bootstrap doesn’t depend on a normality assumption nor require any arcane formulas. You’re no longer limited to working with well understood metrics like means. One can easily build tools that compute confidence for an arbitrary metric. What’s the standard error of a Median? Who cares! I used the bootstrap.
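To make that concrete, here is a minimal NumPy sketch of a bootstrap confidence interval for a median on simulated, skewed data; no formula required.

import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=500)          # skewed sample, no normality assumption

# Resample with replacement many times and recompute the median each time.
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5000)
])
lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(data):.2f}, 95% bootstrap CI = ({lower:.2f}, {upper:.2f})")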

With all of these benefits the bootstrap begins to look a little magical. That’s dangerous. To understand your tool you need to understand how it fails, how to spot the failure, and what to do when it does. As it turns out, methods like the bootstrap and the t-test struggle with very similar types of data. We’ll explore how these two methods compare on troublesome data sets and discuss when to use one over the other.

In this talk we’ll explore what types of data the bootstrap has trouble with. Then we’ll discuss how to identify these problems in the wild and how to deal with the problematic data. We will explore simulated data and share the code to conduct the simulations yourself. However, this isn’t just a theoretical problem. We’ll also explore real Firefox data and discuss how Firefox’s data science team handles this data when analyzing experiments.

At the end of this session you’ll leave with a firm understanding of the bootstrap. Even better, you’ll understand how to spot potential issues in your data and avoid false confidence in your results.

Instructor's Bio

Ryan Harter is a Senior-Staff Data Scientist with Mozilla working on Firefox. He has years of experience solving business problems in the technology and energy industries both as a data scientist and data engineer. Ryan shares practical advice for applying data science as a mentor and in his blog.

Ryan Harter

Senior Staff Data Scientist at Mozilla

Tutorial: When the Bootstrap Breaks

Resampling methods like the bootstrap are becoming increasingly common in modern data science. For good reason too; the bootstrap is incredibly powerful. Unlike t-statistics, the bootstrap doesn’t depend on a normality assumption nor require any arcane formulas. You’re no longer limited to working with well understood metrics like means. One can easily build tools that compute confidence for an arbitrary metric. What’s the standard error of a Median? Who cares! I used the bootstrap.

With all of these benefits the bootstrap begins to look a little magical. That’s dangerous. To understand your tool you need to understand how it fails, how to spot the failure, and what to do when it does. As it turns out, methods like the bootstrap and the t-test struggle with very similar types of data. We’ll explore how these two methods compare on troublesome data sets and discuss when to use one over the other.

In this talk we’ll explore what types of data the bootstrap has trouble with. Then we’ll discuss how to identify these problems in the wild and how to deal with the problematic data. We will explore simulated data and share the code to conduct the simulations yourself. However, this isn’t just a theoretical problem. We’ll also explore real Firefox data and discuss how Firefox’s data science team handles this data when analyzing experiments.

At the end of this session you’ll leave with a firm understanding of the bootstrap. Even better, you’ll understand how to spot potential issues in your data and avoid false confidence in your results.

Instructor's Bio

Saptarshi Guha is a Senior Staff Data Scientist with Mozilla working across domains at Firefox from marketing and software quality to product development. He has been at Firefox for seven years and witnessed the data team grow from ‘two guys and a dog’ to a sophisticated collaboration between product, data engineering and data science.

Saptarshi Guha, PhD

Senior Staff Data Scientist at Mozilla

Tutorial: Recommendations in a Marketplace: Personalizing Explainable Recommendations with Multi-objective Contextual Bandits

In recent years, two-sided marketplaces have emerged as viable business models in many real world applications (e.g. Amazon, Airbnb, Spotify, YouTube), wherein the platforms have customers not only on the demand side (e.g. users), but also on the supply side (e.g. retailers, artists). Such multi-sided marketplaces involve interactions between multiple stakeholders, among which there are different individuals with assorted needs. While traditional recommender systems focused specifically on increasing consumer satisfaction by providing relevant content to consumers, two-sided marketplaces face an interesting problem of optimizing their models for supplier preferences and visibility.

In this tutorial, we begin by describing a contextual bandit model developed for serving explainable music recommendations to users and showcase the need for explicitly considering supplier-centric objectives during optimization. To jointly optimize the objectives of the different marketplace constituents, we present a multi-objective contextual bandit model aimed at maximizing long-term vectorial rewards across different competing objectives. Finally, we discuss theoretical performance guarantees as well as experimental results with historical log data and tests with live production traffic in a large-scale music recommendation service.
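For intuition only, here is a toy, hedged sketch of an epsilon-greedy contextual bandit with a scalarized multi-objective reward (a user-side and a supplier-side term); the simulator and weights are entirely synthetic and are not the production models discussed in the tutorial.

import numpy as np

rng = np.random.default_rng(0)
n_arms, dim, alpha, eps = 5, 8, 0.7, 0.1            # alpha trades off user vs. supplier reward
theta_user = rng.normal(size=(n_arms, dim))          # hidden per-arm user-satisfaction models
theta_supp = rng.normal(size=(n_arms, dim))          # hidden per-arm supplier-side models

# Per-arm ridge-regression state for estimating the scalarized reward.
A = [np.eye(dim) for _ in range(n_arms)]
b = [np.zeros(dim) for _ in range(n_arms)]

for t in range(2000):
    context = rng.normal(size=dim)
    if rng.random() < eps:                            # explore
        arm = int(rng.integers(n_arms))
    else:                                             # exploit current reward estimates
        estimates = [np.linalg.solve(A[a], b[a]) @ context for a in range(n_arms)]
        arm = int(np.argmax(estimates))
    # Scalarized multi-objective reward observed for the chosen arm.
    reward = (alpha * (theta_user[arm] @ context)
              + (1 - alpha) * (theta_supp[arm] @ context)
              + rng.normal(scale=0.1))
    A[arm] += np.outer(context, context)              # ridge-style online update
    b[arm] += reward * context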

Instructor's Bio

Rishabh Mehrotra is a Research Scientist at Spotify Research in London. He obtained his PhD in the field of Machine Learning and Information Retrieval from University College London where he was partially supported by a Google Research Award. His PhD research focused on inference of search tasks from query logs. His current research focuses on marketplaces, bandit based recommendations, counterfactual analysis and experimentation. Some of his recent work has been published at top conferences including WWW, SIGIR, NAACL, CIKM, RecSys and WSDM. He has co-taught a number of tutorials at leading conferences (WWW & CIKM) & summer schools.

Rishabh Mehrotra, PhD

Research Scientist at Spotify Research

Tutorial: Interpretability Toolkit for Azure Machine Learning

With the recent popularity of machine learning algorithms such as neural networks and ensemble methods, machine learning models have become more of a ‘black box’, harder to understand and interpret. To gain users’ trust, there is a strong need to develop tools and methodologies that help the user understand and explain how predictions are made. Data scientists also need the insights to learn how the model can be improved. Much research has gone into model interpretability, and recently several open source tools, including LIME, SHAP, and GAMs, have been published on GitHub. In this tutorial, we present the Azure Machine Learning (AML) SDK’s brand new Machine Learning Interpretability (MLI) toolkit, which incorporates cutting-edge technologies developed by Microsoft and leverages proven third-party libraries. It creates a common API and data structure across the integrated libraries and integrates with AML services. Using the MLI toolkit, data scientists can explain machine learning models using state-of-the-art technologies in an easy-to-use and scalable fashion.
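To avoid guessing at the AML SDK surface, here is a minimal sketch using the underlying open source shap library mentioned above; the MLI toolkit wraps ideas like these behind its own API, so treat this as illustrative rather than the toolkit’s interface.

import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Per-sample, per-feature contributions to each prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])
print(shap_values.shape)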

Instructor's Bio

Mehrnoosh Sameki is a technical program manager at Microsoft, responsible for leading the product efforts on machine learning transparency within the Azure Machine Learning platform. Prior to Microsoft, she was a data scientist at an eCommerce company, Rue Gilt Groupe, incorporating data science and machine learning in the retail space to drive revenue and enhance customers’ personalized shopping experiences. Before that, she completed a PhD in computer science at Boston University. In her spare time, she enjoys trying new food recipes, watching classic movies and documentaries, and reading about interior design and house decoration.

Mehrnoosh Sameki, PhD

Technical Program Manager at Microsoft

Tutorial: Declarative Data Visualization with Vega-Lite & Altair

In this tutorial, we will introduce the concepts of declarative data visualization, which are widely used by the Jupyter and Observable data science communities, and companies such as Airbnb, Apple, Elastic, Google, Microsoft, Netflix, Twitter, and Uber. You will learn the basic vocabulary for describing data visualizations and learn how to use this vocabulary to author interactive plots via declarative visualization libraries including Vega-Lite (in JavaScript) and Altair (in Python). With these libraries, users can rapidly and concisely create rich interactive visualizations. For example, brushing & linking among scatterplots and interactive cross-filtering require only a few lines of code in Vega-Lite, versus hundreds in D3.

We will first introduce the basics of Vega-Lite and Altair including how to create simple single-view plots, how to combine them into layered plots and multi-view dashboards, and how to make them interactive. We will also describe how to use them for various use cases such as exploring data, customizing plots for sharing and publication, building web applications, as well as automatically generating charts on a server. Finally, we will illustrate how Vega-Lite and Altair fit into a larger ecosystem of data visualization tools.
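As a hedged taste of the Altair side, the snippet below links two scatterplots with an interval brush in a few lines; it assumes the vega_datasets package for the example cars data and a recent Altair release.

import altair as alt
from vega_datasets import data

df = data.cars()
brush = alt.selection_interval()

# Points outside the brushed region fall back to light gray.
base = alt.Chart(df).mark_point().encode(
    color=alt.condition(brush, "Origin:N", alt.value("lightgray"))
).add_selection(brush)

# Two linked scatterplots: brushing in one highlights the same cars in the other.
chart = base.encode(x="Horsepower:Q", y="Miles_per_Gallon:Q") | \
        base.encode(x="Acceleration:Q", y="Miles_per_Gallon:Q")
chart.save("cars_brush.html")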

There are no formal requirements to attend this tutorial, besides excitement about creating visualizations—though some previous experience building visualizations will help you get the most out of the experience.

Instructor's Bio

Coming soon!

Kanit Wongsuphasawat, PhD

Research Scientist at Apple

Tutorial: Declarative Data Visualization with Vega-Lite & Altair

In this tutorial, we will introduce the concepts of declarative data visualization, which are widely used by the Jupyter and Observable data science communities, and companies such as Airbnb, Apple, Elastic, Google, Microsoft, Netflix, Twitter, and Uber. You will learn the basic vocabulary for describing data visualizations and learn how to use this vocabulary to author interactive plots via declarative visualization libraries including Vega-Lite (in JavaScript) and Altair (in Python). With these libraries, users can rapidly and concisely create rich interactive visualizations. For example, brushing & linking among scatterplots and interactive cross-filtering require only a few lines of code in Vega-Lite, versus hundreds in D3.

We will first introduce the basics of Vega-Lite and Altair including how to create simple single-view plots, how to combine them into layered plots and multi-view dashboards, and how to make them interactive. We will also describe how to use them for various use cases such as exploring data, customizing plots for sharing and publication, building web applications, as well as automatically generating charts on a server. Finally, we will illustrate how Vega-Lite and Altair fit into a larger ecosystem of data visualization tools.

There are no formal requirements to attend this tutorial, besides excitement about creating visualizations—though some previous experience building visualizations will help you get the most out of the experience.

Instructor's Bio

Coming soon!

Arvind Satyanarayan, PhD

Assistant Professor at MIT CSAIL

Tutorial: Making government forms easier using NLP + Deep Learning

A common task when filling out government forms is to provide a code that describes what category something belongs to. For example, when shipping goods internationally, businesses need to provide product codes from the Harmonized System for each product that is crossing the border. The cost of an incorrect code is fines levied by US Customs, which can range upward into the millions of dollars. What if there were a way, using product descriptions and other contextual information, to save countless hours of manual time spent looking up these obscure codes, and to do so with better than human accuracy? In government, we call them “autocoders,” but as it turns out, some cleverly applied classification and Natural Language Processing technology can do just the trick.

First, we’ll discuss a couple of different areas across government where this sort of work is being used to make forms easier and save taxpayers millions of dollars. Then, we’ll work our way through an example of classifying products, discussing different approaches, techniques, and tradeoffs. We’ll start simple (bag-of-words with logistic regression), and advance to two state-of-the-art approaches with a discussion of their relative benefits and trade-offs: LSTM-based deep learning, and fastText.
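Here is a minimal sketch of that “start simple” baseline: bag-of-words features feeding a logistic regression. The toy descriptions and codes are invented for illustration and are not real customs filings.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "mens cotton t-shirt", "stainless steel kitchen knife",
    "womens leather boots", "ceramic coffee mug",
]
codes = ["6109", "8211", "6403", "6912"]   # illustrative, HS-style category labels

# Bag-of-words (with bigrams) into a multiclass logistic regression.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(descriptions, codes)
print(clf.predict(["insulated steel travel mug"]))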

Instructor's Bio

Christian has extensive data / technology experience – currently helping innovate data practices at the US Census Bureau, and previously leading the tech team at The Data Incubator, a data science training+consulting firm. He holds a Master’s from NYU’s Center for Urban Science and Progress.

Christian Leonard Moscardi

Data Scientist at U.S. Census Bureau

Tutorial: The Lifecycle Of A Machine Learning Project: Sea Turtle Conservation And Predictive Modeling

In this tutorial, I will walk through all the important steps of a machine learning project, from problem definition and data collection to an interpretable predictive model and scientific/actionable insights. I will use an academic project to illustrate important concepts on:
– how to incorporate external datasets,
– feature generation from time series data,
– data exploration and visualization,
– the importance of proper cross-validation approaches,
– how to improve the interpretability of supervised machine learning models using XGBoost and SHAP values.

We will analyze a rich dataset on the basking behavior of green sea turtles. This is a collaboration between data scientists at the Center for Computation and Visualization at Brown University and the Hawaii Wildlife Fund, a non-profit wildlife conservation organization. Green sea turtles are endangered marine animals that bask or rest on beaches. Maui’s Ho’okipa beach hosts one of the largest and densest basking aggregations in the state of Hawaii. Volunteers from the Hawaii Wildlife Fund are stationed at the beach from approximately 2:30 – 7:30pm every day of the year with the exception of severe weather conditions (hurricane force winds, etc.). They have been recording human visitor and basking turtle counts on the beach for years.

The goals of the collaboration are to better understand the basking behavior of the turtles and to inform the Hawaii Wildlife Fund’s management and policy decisions with the use of predictive modeling.
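As a hedged sketch of two of those concepts, the snippet below uses time-aware cross-validation with an XGBoost model and inspects it with SHAP values on simulated daily counts; the features are placeholders, not the actual turtle dataset.

import numpy as np
import shap
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
n_days = 400
X = np.column_stack([
    rng.normal(size=n_days),               # e.g. a weather covariate (placeholder)
    rng.integers(0, 200, size=n_days),     # e.g. human visitor count (placeholder)
])
y = rng.poisson(lam=5, size=n_days)        # daily basking turtle counts (simulated)

model = xgb.XGBRegressor(n_estimators=200, max_depth=3)
# Time-ordered splits: each fold trains on the past and tests on the future.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    print("fold MAE:", mean_absolute_error(y[test_idx], preds))

shap_values = shap.TreeExplainer(model).shap_values(X)   # per-feature contributions
print(shap_values.shape)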

Instructor's Bio

Andras Zsom is a Lead Data Scientist at the Center for Computation and Visualization at Brown University. He manages a small but dedicated team of data scientists whose mission is to help high-level university administrators make better data-driven decisions through data analysis and predictive modeling. The team also collaborates with faculty members on various data-intensive academic projects and trains data science interns.

Andras is passionate about using machine learning and predictive modeling for good. He is an astrophysicist by training and he has been fascinated with all fields of the natural and life sciences since childhood. He was a postdoctoral researcher at MIT for 3.5 years before coming to Brown. He obtained his PhD from the Max Planck Institute of Astronomy at Heidelberg, Germany; and he was born and raised in Hungary.

Andras Zsom, PhD

Lead Data Scientist at Advanced Research Computing at CCV, Brown University

Tutorial: Putting the “Data” In Data Scientist

One of the byproducts of our digitally transformed world is the accumulation of large quantities of data. As a data scientist, the challenge is how to effectively manage, clone, prepare, and extract value from huge datasets for deep learning training.
In this workshop, we will examine how to handle large amounts of data at scale with the help of deep learning ops, or DeepOps as we call it. DeepOps is a set of methodologies, tools, and culture where data engineers and scientists collaborate to build a faster and more reliable deep learning pipeline.
Together, we’ll look at a publicly available dataset, ChestXray14, for reference and learn how to:
– Organize one of the largest publicly available chest x-ray datasets;
– Correctly correlate and tag medical images to corresponding metadata;
– Discuss strategies for storing this data;
– Prepare and stream data for deep learning training;
– View and version data with MissingLink.ai’s query tool.
Afterward, you’ll walk away with the knowledge of how to automate data management, exploration, and versioning in your deep learning projects. Attendees will get access to the training material presented during the talk to continue experimenting with ChestXray14’s data on their own.

Instructor's Bio

As MissingLink’s Chief Evangelist, Jesse Freeman focuses on teaching DeepOps techniques that speed up AI-first companies using computer vision and deep learning. One of the ways Jesse does this is by approaching deep learning from an engineering standpoint. With over 20 years of enterprise development experience at companies like Amazon, Microsoft, MLB, HBO, the New York Jets, Bloomberg, and more, Jesse is an expert in his field. In addition to his development background, Jesse has a master’s in interactive computer art from the School of Visual Arts. He can be found on Twitter at @jessefreeman.

Jesse Freeman

Chief Evangelist at MissingLink

Tutorial: mlpack: Or, How I Learned To Stop Worrying And Love C++

mlpack is a general-purpose, flexible, fast machine learning library written in C++. The library aims to provide fast, extensible implementations of cutting-edge machine learning algorithms. These algorithms are provided as simple command-line programs, Python bindings (and bindings to other languages), and also C++ classes which can then be integrated into larger-scale machine learning solutions. In this tutorial, I will introduce mlpack and discuss how it achieves its fast implementations via template metaprogramming and by implementing more asymptotically efficient algorithms. Even though C++ is fairly unpopular for machine learning, I will show that it is possible to have easy, understandable, production-quality C++ machine learning code with mlpack. I will also give some examples of usage, including how we use mlpack inside of RelationalAI, and also talk about the future goals and development of the library.

Instructor's Bio

Ryan Curtin is a Computer Scientist at RelationalAI. His Ph.D. work focused on fast algorithms for core machine learning tasks via the use of indexing structures such as kd-trees and other related structures. These accelerated algorithms enabled fast implementations of standard machine learning algorithms like nearest neighbor search, k-means clustering, and kernel density estimation; these implementations then formed the basis of the mlpack C++ machine learning library, which Ryan has maintained for nearly a decade. In his free time, he likes to restore broken pinball machines—and if they are ever restored to a working state, actually play them (thus helping them return to a non-working state).

Ryan Curtin, PhD

Computer Scientist at RelationalAI

Tutorial: Deep Learning From Scratch

There are many good tutorials on neural networks out there. While some of them dive deep into the code and show how to implement things, and others explain what is going on via diagrams or math, very few bring all the concepts needed to understand neural networks together, showing diagrams, code, and math side by side. In this tutorial, I’ll present a clear, step-by-step explanation of neural networks, implementing them from scratch in Numpy, while showing both diagrams that explain how they work and the math that explains why they work. We’ll cover normal, feedforward neural networks, convolutional neural networks (also from scratch) as well as recurrent neural networks (time permitting). Finally, we’ll be sure to leave time to translate what we learn into performant, flexible PyTorch code so you can apply what you’ve learned to real-world problems.
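As a small, hedged taste of the approach, here is a one-hidden-layer network trained with plain NumPy on a toy regression problem, with the forward and backward passes written out explicitly.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=256)).reshape(-1, 1)

W1, b1 = rng.normal(scale=0.1, size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)
lr = 0.01

for step in range(500):
    # Forward pass: ReLU hidden layer, linear output, mean squared error.
    h = np.maximum(0, X @ W1 + b1)
    pred = h @ W2 + b2
    loss = np.mean((pred - y) ** 2)
    # Backward pass: chain rule written out by hand.
    d_pred = 2 * (pred - y) / len(X)
    dW2, db2 = h.T @ d_pred, d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (h > 0)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", loss)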

No background in neural networks is required, but a familiarity with the terminology of supervised learning (e.g. training set vs. testing set, features vs. target) will be helpful.

Instructor's Bio

Seth Weidman is a data scientist at Facebook, working on machine learning problems related to their data center operations. Prior to this role, Seth was a Senior Data Scientist at Metis, where he first taught two data science bootcamps in Chicago and then taught for one year as part of Metis’ Corporate Training business. Prior to that, Seth was the first data scientist at Trunk Club in Chicago, where he built their first lead scoring model from scratch and worked on their recommendation systems.

In addition to solving real-world ML problems, he loves demystifying concepts at the cutting edge of machine learning, from neural networks to GANs. He is the author of a forthcoming O’Reilly book on neural networks and has spoken on these topics at multiple conferences and Meetups all over the country.

Seth Weidman

Senior Data Scientist at Facebook

Tutorial: An Introduction to PyTorch Fundamentals

This tutorial begins with a 30-minute overview of PyTorch, followed by a one-hour hands-on exercise that you will complete with the help of the instructor. PyTorch is a Python library for numerical computing, automatic differentiation, and deep learning, with a fast GPU backend for high performance. PyTorch is often used to build neural networks and for gradient-based methods in machine learning.

The tutorial will take you through performing operations on PyTorch tensors, building your own neural networks, training them on small datasets, and interpreting the results. We will use Google Colab ( https://colab.research.google.com/ ), a free cloud notebook service, to run the exercises.
Familiarity with Python is required, and familiarity with NumPy helps.
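As a taste of what the hands-on portion covers, here is a minimal sketch of defining and training a small network with PyTorch's autograd on toy data (an illustrative example, not the tutorial's actual notebook):

```python
# Minimal sketch: a small network trained with autograd on toy data.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy data: learn y = 3x - 1 with a bit of noise.
X = torch.linspace(-1, 1, 200, device=device).unsqueeze(1)
y = 3 * X - 1 + 0.1 * torch.randn_like(X)

model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(X), y)    # forward pass
    loss.backward()                # autograd computes gradients
    optimizer.step()               # gradient descent update

print(f"final loss: {loss.item():.4f}")
```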

Instructor's Bio

Soumith Chintala is a researcher at Facebook AI Research, where he works on deep learning, reinforcement learning, generative image models, agents for video games, and large-scale, high-performance deep learning, with major contributions to the Torch deep learning framework, which is used by major players in the AI industry. His research on generative models has been cited as one of the major advances in AI in 2015. His work in the AI research and systems community on benchmarking and consolidating systems is well recognized and is often quoted by Intel, Nervana Systems, NVIDIA, and other hardware and systems companies as an independent metric. He holds a master's in computer science from NYU and spent time in Yann LeCun’s NYU lab building deep learning models for pedestrian detection, natural-image OCR, and depth images, among other applications.

Soumith Chintala

Creator of PyTorch at Facebook AI Research

Tutorial: Predictive Models with Explanatory Concepts: A Framework for Extracting Reason Codes from Machine Learning Credit Risk Models While Simultaneously Increasing Predictive Power

Lenders are required to transmit relatively few raw consumer credit behavior data values to credit reporting agencies. From the raw data, credit bureaus have derived thousands of predictive attributes. By construction, these derived attributes are highly correlated. Multicollinearity is known to hamper the ability to explain statistical and machine learning models. In general, it is desirable if not required to be able to explain the inner workings of credit risk models. When a modeler attempts to incorporate numerous highly collinear attributes in a statistical model and maximize prediction, the impact of multicollinearity works against explainability. Traditional solutions to this problem include omitting variables with the wrong sign or using factor analysis to collapse the original variables into a new subset of variables prior to estimating model parameters. Both solutions increase the ability to explain the model at the expense of decreasing its predictive power.
This paper describes a novel method of utilizing the multicollinearity in the data to increase the predictive power of the credit risk model while simultaneously allowing reason codes to be extracted from it. We make use of the original attributes and develop a factor analysis after building the predictive model, which allows identification of the concepts that describe the reasons credit may be denied. The model is constructed to be as predictive as possible using readily available data. Reason codes are extracted from the factors; eight ways to accomplish this are described. The method can be applied to any credit scoring system, including traditional logistic regression models and machine learning models.
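As a rough illustration of the general idea only (not the paper's method; its eight extraction approaches are presented in the session), the sketch below fits a model on correlated attributes, fits a factor analysis on those attributes afterward, and rolls per-attribute contributions up into factor-level "concepts." The data, the allocation rule, and the ordering heuristic here are all hypothetical.

```python
# Illustrative sketch: model on collinear attributes + post-hoc factor analysis.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for highly correlated bureau attributes.
X, y = make_classification(n_samples=2000, n_features=12, n_informative=4,
                           n_redundant=6, random_state=0)

# 1) Build the predictive model on the original (collinear) attributes.
model = LogisticRegression(max_iter=1000).fit(X, y)

# 2) Fit a factor analysis on the attributes *after* modeling, so the factors
#    describe concepts without constraining the model itself.
fa = FactorAnalysis(n_components=3, random_state=0).fit(X)
loadings = fa.components_                          # shape: (n_factors, n_features)

# 3) For one applicant, allocate the model's score to factors by combining
#    coefficient * value contributions through the loadings (a simple,
#    hypothetical allocation rule).
applicant = X[0]
contrib = model.coef_.ravel() * applicant          # per-attribute contribution
factor_contrib = loadings @ contrib                # roll up to concepts
reason_order = np.argsort(factor_contrib)          # most adverse concepts first
print("factor contributions:", np.round(factor_contrib, 3))
print("candidate reason-code order (concept indices):", reason_order)
```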

Instructor's Bio

Michael McBurnett is a Distinguished Scientist in the Equifax Data Science Lab. He has 30 years of experience building, deploying, or monetizing mathematical models of human behavior in the credit risk, banking, combination utility, telecommunications, direct marketing, counterinsurgency warfare, intelligence, political, and academic arenas. His professional career has focused on mathematical and statistical modeling, data collection, the invention or identification of new data sources appropriate for particular problems, and data analysis. He is a co-inventor of NeuroDecision®, a regulatory compliant method of producing actionable risk scores with appropriate adverse action codes using neural networks.

Michael McBurnett, PhD

Distinguished Scientist at Equifax Corporation

Panel: The Next Frontier Of AI Will Be In Space

AI is everywhere: in our phones, in our homes, where we work, where we play. The “new industrial revolution” in AI and machine learning, fueled by the “new oil” – massive amounts of data – has spread globally with alarming speed. But AI has rarely left Earth’s orbit, and it is non-existent in deep-space craft. The most recent Mars Rover runs on 15-year-old CPU technology. Recently launched spacecraft lack the ability to navigate themselves, requiring constant communication with ground control. Meanwhile, back on Earth, humans are already riding in self-driving vehicles, and earthlings routinely run billion-parameter neural networks at real-time speeds.

Disruptive change is on its way, and it’s happening at the intersection of space and artificial intelligence. Our panelists are ushering in a new golden age of space exploration, one that embraces cutting-edge AI at every layer of the technology stack. In this session, you will hear experts from industry, government, and academia speak on a broad range of topics – from new AI hardware that can survive the harsh conditions of space, to new deep learning algorithms that can locate new galaxies and dark matter, to robotic virtual assistants that monitor astronauts and keep them company on deep space missions.

Get ready. AI is soon going where no AI has gone before!

Instructor's Bio

Roberto Carlino is an Aerospace Engineer at NASA Ames, currently working as a hardware/flight software engineer on the Astrobee free-flyers project.
Before that, he worked at the Science Processing Operations Center (SPOC) for the Transiting Exoplanet Survey Satellite (TESS) mission (the follow-on to the Kepler Space Telescope), which is searching for exoplanets around the closest and brightest stars to our Sun.
Roberto started at NASA Ames around four years ago, working on small flight projects and mission proposals.
He earned his bachelor's degree and Master of Science in Aerospace Engineering at the University of Naples Federico II, in Italy, and Delft University of Technology, in the Netherlands. He later earned a second master's degree in Space Systems and Services from the University of Rome, La Sapienza.

Roberto Carlino

Flight Engineer at NASA Ames Research Center

Panel: The Next Frontier Of AI Will Be In Space

Instructor's Bio

Jiwei Liu is a data scientist at NVIDIA working on NVIDIA's AI infrastructure, including the RAPIDS data science framework. Jiwei received a PhD in electrical and computer engineering from the University of Pittsburgh. He has five years of experience in data science, predictive modeling, machine learning, and GPU programming. Jiwei is a Kaggle Grandmaster, ranked in the top 40 worldwide.

Jiwei Liu, PhD

Senior Data Scientist at NVIDIA

Panel: The Next Frontier Of AI Will Be In Space

Instructor's Bio

Coming soon!

Paul Armijo

Head of Government and Aerospace Sector at GSI Technology

Panel: The Next Frontier Of AI Will Be In Space

Instructor's Bio

Coming soon!

Arthur Edwards, PhD

Senior Researcher at Air Force Research Laboratory

Panel: The Next Frontier Of AI Will Be In Space

Instructor's Bio

Coming soon!

Jamal Madni

Director of Technology Strategy at Boeing

Panel: The Next Frontier Of AI Will Be In Space

Instructor's Bio

Dr. Ian Troxel is the Founder and CEO of Troxel Aerospace Industries, Inc., a firm developing autonomous fault mitigation software and performing radiation testing services for migrating commercial technology to space systems. He earned a master's and a doctorate in Electrical and Computer Engineering from the University of Florida with a focus on aircraft and spacecraft onboard processing, mission assurance for big-data movement and computation (before it was called cloud computing), processor-in-memory architectures, and optical networking technologies. Dr. Troxel served as the Principal Engineer for Processor and Memory systems at SEAKR Engineering, where he developed system-level fault mitigation strategies and designed next-generation processor and data management systems for several high-end space missions. He was selected for the NRO Technology Fellowship program in 2009-2010 to shape the development of new technologies for image processing. Dr. Troxel formed Troxel Aerospace in 2015 to focus on product development; the company has seen significant growth since 2017 through a NASA SBIR Phase 2 award and several commercial engagements.

Ian Troxel, PhD

CEO at Troxel Aerospace

A Sample of Previous East Workshops


  • Reducing Model Risk with Automated Machine Learning

  • How to Visualize Your Data: Beyond the Eye into the Brain

  • Matrix Math at Scale with Apache Mahout and Spark

  • Tutorial on Anomaly Detection at Scale: Data Engineering Challenges meet Data Science Difficulties

  • Crunching your Data with CatBoost – New Gradient Boosting Library

  • Deep Learning in Finance: An experiment and a reflection

  • Real-Time Machine Learning on the Mainframe

  • Power up your Computer Vision skills with TensorFlow-Keras

  • Bayesian Networks with pgmpy

  • Bayesian Hierarchical Model for Predictive Analytics

  • Standardized Data Science: The Team Data Science Process – with a practical example in Python

  • Interpretable Representation Learning for Visual Intelligence

  • Henosis – a generalizable, cloud-native Python form recommender framework for Data Scientists

  • Bayesian Statistics Made Simple

  • CNNs for Scene Classification in Videos

  • Accelerated mapping from the Sky: object detection with high resolution remote sensing images

  • Applications of Deep Learning in Aerospace and Building Systems

  • Democratise Conversational AI – Scaling Academic Research to Industrial Applications

  • Latest Developments in GANs

  • Multivariate Time Series Forecasting Using Statistical and Machine Learning Models

  • Networks and Large Scale Optimization

  • Blockchain and Data Governance – Validating Information for Data Science

  • Why Machine Learning needs its own language, and why Julia is the one

  • Machine Learning in Chainer Python

  • Buying Happiness – Using LSTMs to Turn Feelings into Trades

  • Multi-Paradigm Data Science

  • Agile Data Science 2.0

  • Keras for R

  • R Packages as Collaboration Tools

  • Uplift Modeling and Uplift Prescriptive Analytics: Introduction and Advanced Topics

  • Using AWS SageMaker, Kubernetes, and PipelineAI for High Performance, Hybrid-Cloud Distributed TensorFlow Model Training and Serving with GPUs

  • Deep Learning Methods for Text Classification

  • Applying Deep Learning to Article Embedding for Fake News Evaluation

  • Experimental Reproducibility in Data Science with Sacred

  • Visual Analytics for High Dimensional Data

  • Running Data Science Projects and integration within the Organizational Ecosystem

  • Data Science Learnathon. From Raw Data to Deployment: The Data Science Cycle with KNIME

  • Salted Graphs – A (Delicious) Approach to Repeatable Data Science

  • A Primer on Neural Network Models for Natural Language Processing

  • Help! I have missing data. How do I fix it (the right way)?

  • Applying Color to Visual Analytics in Data Science

  • Under The Hood: Creating Your Own Spark Datasources

  • #NOBLACKBOXES: How To Solve Real Data Science Problems with Automation, Without Losing Transparency

  • Solving Real World Problems in Machine Learning and Data Science

  • The Power of Monotonicity to Make ML Make Sense

Sign Up for ODSC East 2019 | April 30-May 3

Register Now