Training & Workshop Sessions

– Taught by World-Class Data Scientists –

Learn the latest data science concepts, tools, and techniques from the best. Forge a connection with these rockstars from industry and academia, who are passionate about molding the next generation of data scientists.

Highly Experienced Instructors

Our instructors are highly regarded in data science, coming from both academia and notable companies.

Real World Applications

Gain the skills and knowledge to use data science in your career and business, without breaking the bank.

Cutting Edge Subject Matter

Find training sessions offered on a wide variety of data science topics from machine learning to data visualization.

ODSC Training Includes

Form a working relationship with some of the world’s top data scientists for follow up questions and advice.

Additionally, your ticket includes access to 50+ talks and workshops.

High quality recordings of each session, exclusively available to premium training attendees.

Equivalent training at other conferences costs much more.

Professionally prepared learning materials, custom tailored to each course.

Opportunities to connect with other ambitious like-minded data scientists.

2018 Training Instructors

We have some of the top names in data science signed up to host training and workshop sessions. More instructors will be added weekly.

Workshop: Multivariate Time Series Forecasting Using Statistical and Machine Learning Models

Time series data is ubiquitous: weekly initial unemployment claims, the daily term structure of interest rates, tick-level stock prices, weekly company sales, daily foot traffic recorded by mobile devices, and the daily number of steps recorded by a wearable, just to name a few.

Some of the most important and commonly used data science techniques in time series forecasting are those developed in the field of machine learning and statistics. Data scientists should have at least a few basic time series statistical and machine learning modeling techniques in their toolkit.

This lecture discusses the formulation of Vector Autoregressive (VAR) models, one of the most important classes of multivariate time series statistical models, as well as neural network-based techniques, which have received a lot of attention in the data science community in the past few years. It demonstrates how these techniques are implemented in practice and compares their advantages and disadvantages. Real-world applications, demonstrated using Python, are used throughout the lecture to illustrate these techniques. While not the focus of this lecture, exploratory time series data analysis using histograms, kernel density plots, time-series plots, scatterplot matrices, plots of autocorrelation (i.e. correlograms), plots of partial autocorrelation, and plots of cross-correlations will also be included in the demo.
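As a minimal sketch of the VAR formulation covered here, a two-variable VAR(1) produces a one-step forecast as a linear function of the previous observation. The intercepts and coefficient matrix below are hand-picked for illustration; in the lecture they would be estimated from data:

```python
# One-step forecast from a (hypothetical, pre-estimated) two-variable VAR(1):
# y_t = c + A y_{t-1} + noise, with hand-picked coefficients for illustration.
c = [0.1, 0.0]          # intercept vector
A = [[0.5, 0.2],        # coefficient matrix: each series depends on
     [0.1, 0.6]]        # lagged values of *both* series

def var1_forecast(y_prev):
    # Compute c + A @ y_prev without external libraries.
    return [c[i] + sum(A[i][j] * y_prev[j] for j in range(2)) for i in range(2)]

print([round(v, 4) for v in var1_forecast([1.0, 2.0])])  # → [1.0, 1.3]
```

The cross-terms in `A` are exactly what makes the model multivariate: a shock to one series propagates into forecasts of the other.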

Instructor Bio

Jeffrey is the Chief Data Scientist at AllianceBernstein, a global investment firm managing over $500 billion. He is responsible for building and leading the data science group, partnering with investment professionals to create investment signals using data science, and collaborating with sales and marketing teams to analyze clients. He holds a Ph.D. in economics from the University of Pennsylvania and has taught statistics, econometrics, and machine learning courses at UC Berkeley, Cornell, NYU, the University of Pennsylvania, and Virginia Tech. Previously, Jeffrey held advanced analytic positions at Silicon Valley Data Science, Charles Schwab Corporation, KPMG, and Moody’s Analytics.

Jeffrey Yau, PhD

Chief Data Scientist, AllianceBernstein

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, by using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
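The input-output idea above can be made concrete with a deliberately tiny example. The data is made up, and a plain 1-nearest-neighbor rule stands in for the algorithms the session covers:

```python
# Supervised learning in miniature: learn a mapping x -> y from labelled
# (input, output) pairs, then evaluate on held-out test pairs.
def predict(train, x):
    # 1-nearest-neighbor rule: return the label of the closest training input.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test = [(1.5, "low"), (8.5, "high")]

accuracy = sum(predict(train, x) == y for x, y in test) / len(test)
print(accuracy)  # → 1.0
```

The same train/predict/evaluate loop applies unchanged when the data and model grow more complex.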

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools that lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Intermediate Machine Learning with scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity, how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
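The grid-search vs. randomized-search trade-off mentioned above can be sketched in plain Python. The `score` function here is a hypothetical stand-in for cross-validated model performance (in the session itself this would come from scikit-learn):

```python
import itertools
import random

# Hypothetical "model score" as a function of two hyperparameters; in practice
# this would be cross-validated accuracy from a fitted model.
def score(depth, lr):
    return -((depth - 4) ** 2) - (lr - 0.1) ** 2

depths = [2, 3, 4, 5]
lrs = [0.01, 0.1, 1.0]

# Exhaustive grid search: evaluates every combination (cost grows multiplicatively).
best_grid = max(itertools.product(depths, lrs), key=lambda p: score(*p))

# Randomized search: evaluates only a fixed budget of sampled combinations.
random.seed(0)
candidates = [(random.choice(depths), random.choice(lrs)) for _ in range(5)]
best_random = max(candidates, key=lambda p: score(*p))

print(best_grid)  # → (4, 0.1)
print(best_random)
```

Grid search is exhaustive but expensive; randomized search trades a chance of missing the optimum for a fixed, controllable budget.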

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools that lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Deep Learning Research to Production - A Hands-on Approach Using Apache MXNet

Deep Learning (DL) has become ubiquitous in everyday software applications and services. A solid understanding of DL foundational principles is necessary for researchers and modern-day engineers alike to successfully adapt state-of-the-art DL research to business applications.

Researchers require a DL framework to quickly prototype and transform their ideas into models, and engineers need a framework that allows them to efficiently deploy these models to production without losing performance. We will show how to use the Gluon APIs in Apache MXNet to quickly prototype models and deploy them to production, without losing performance, using MXNet Model Server (MMS).

In this workshop, you will learn to apply Convolutional Neural Networks (CNNs), a class of DL techniques, to Computer Vision (CV) tasks and Recurrent Neural Networks (RNNs) to Natural Language Processing (NLP) tasks using Apache MXNet – two fields in which deep learning has achieved state-of-the-art results.

To learn how to apply DL to CV problems, we will get hands-on by building a Facial Emotion Recognition (FER) model using advances of deep learning in CV. We will also build a sentiment analysis model to understand the application of DL in Natural Language Processing (NLP). As we build the models, we will learn about common practical limitations, pitfalls, best practices, and tips and tricks used by practitioners. Finally, we will conclude the workshop by showing how to deploy with MMS for online/real-time inference and with Apache Spark + MXNet for offline batch inference on large datasets.

We will provide Jupyter notebooks to get hands-on and solidify the concepts.

Instructor Bio

Naveen is a Senior Software Engineer and a member of Amazon AI at AWS, where he works on Apache MXNet. He began his career building large-scale distributed systems and has spent the last 10+ years designing and developing them. He has delivered tech talks at AMLC, Spark Summit, and ApacheCon, and loves to share knowledge. His current focus is to make deep learning easily accessible to software developers without the need for a steep learning curve.

Naveen Swamy

Software Developer, Amazon AI

Training: Feature Engineering for Time Series Data

Most machine learning algorithms today are not time-aware and are not easily applied to time series and forecasting problems. Leveraging algorithms like XGBoost, or even linear models, typically requires substantial data preparation and feature engineering – for example, creating lagged features, detrending the target, and detecting periodicity. The preprocessing required becomes more difficult in the common case where the problem requires predicting a window of multiple future time points. As a result, most practitioners fall back on classical methods, such as ARIMA or trend analysis, which are time-aware but often less expressive. This talk covers practices for solving this challenge and explores the potential to automate this process in order to apply advanced machine learning algorithms to time series problems.
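The lagged-feature construction mentioned above can be sketched with toy numbers: each row of the derived dataset pairs the previous `n_lags` observations with the next value as the prediction target, turning a time series into a tabular supervised problem:

```python
# Turn a time series into (features, target) rows for one-step-ahead
# forecasting: the target at time t is predicted from the n_lags prior values.
def make_lagged(series, n_lags):
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]  # the lag window
        target = series[t]               # the value to predict
        rows.append((features, target))
    return rows

series = [10, 12, 13, 12, 15, 16]
for features, target in make_lagged(series, 2):
    print(features, "->", target)
# first row: [10, 12] -> 13
```

Any ordinary regressor can then be trained on these rows, which is exactly why this preprocessing step unlocks non-time-aware algorithms for forecasting.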

Instructor Bio

Michael Schmidt is the Chief Scientist at DataRobot and has been featured in the Forbes list of the world’s top 7 data scientists and MIT’s list of the most innovative 35-under-35. He has authored AI research in the journal Science and has appeared in media outlets such as the New York Times, NPR’s RadioLab, the Science Channel, and Communications of the ACM. In 2011, Michael founded Nutonian and led the development of Eureqa, a machine learning application and service used by over 80,000 users and later acquired by DataRobot in 2017. Most recently, his work has focused on automated machine learning, feature engineering, and time series prediction.

Michael Schmidt, PhD

Chief Scientist, DataRobot

Training: Advanced Machine Learning with scikit-learn Part I

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools that lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Advanced Machine Learning with scikit-learn Part II

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools that lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Machine Learning in R Part I

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model, and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Machine Learning in R Part II

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model, and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Learn D3 essentials to get started with web-based data visualization

One of the most popular frameworks today for web-based data visualizations is D3. In this workshop you will learn how to leverage the most common parts of the D3 framework to create data visualizations. This will include making selections, working with data, creating shapes, and more. Although the emphasis is on learning D3, some general visualization design principles will be touched upon briefly as well, to help you make better-informed design decisions.

Instructor Bio

Jan Willem Tulp is an independent Data Experience Designer from the Netherlands. With his company TULP interactive he creates custom data visualizations for a variety of clients. He has helped clients such as Google, the European Space Agency, Scientific American, Nature, and the World Economic Forum by creating visualizations, both interactive and in print. His work has appeared in several books and magazines and he speaks regularly at international conferences.

Jan Willem Tulp

Data Experience Designer, TULP Interactive

Training: The New Science of Big Data Analytics, Based on the Geometry and the Topology of Complex, Hierarchic Systems

These foundations of Data Science are solidly based on mathematics and computational science. The hierarchical nature of complex reality is part and parcel of this new, mathematically well-founded way of observing and interacting with (physical, social and all) realities.

These lectures include pattern recognition and knowledge discovery, machine learning, and statistics. Addressed is how geometry and topology can uncover and empower the semantics of data. Key themes include: text mining; computational linear-time hierarchical clustering, search and retrieval; and the Correspondence Analysis platform that performs latent semantic factor space mapping, with accompanying hierarchical clustering.

Various application domains are covered in the case studies. These include text mining of literary text and social media (Twitter), and clustering in astronomy, chemistry, and psychoanalysis. The final discussion addresses the increasingly important domains of smart environments, the Internet of Things, health analytics, and the further general scope of Big Data.

Instructor Bio

Fionn Murtagh is Professor of Data Science and was previously Professor of Computer Science, including Department Head, at many universities. Following his primary degrees in Mathematics and Engineering Science, and an MSc in Computer Science (in Information Retrieval) at Trinity College Dublin, his first position as Statistician/Programmer was in national-level (first- and second-level) education research. His PhD at Université P&M Curie (Paris 6), with Prof. Jean-Paul Benzécri, was in conjunction with the national geological research centre, BRGM. After an initial 4 years as a lecturer in computer science, he spent a period working on atomic reactor safety at the European Joint Research Centre in Ispra (VA), Italy. As a European Space Agency Senior Scientist working on the Hubble Space Telescope, Fionn was based at the European Southern Observatory in Garching, Munich for 12 years. For 5 years, Fionn was a Director in Science Foundation Ireland, managing mathematics and computing, nanotechnology, and introducing and growing all that is related to environmental science and renewable energy.

Fionn was Editor-in-Chief of the Computer Journal (British Computer Society) for more than 10 years, and is an Editorial Board member of many journals. With over 300 refereed articles and 30 books authored or edited, his fellowships and scholarly academies include: Fellow of: British Computer Society (FBCS), Institute of Mathematics and Its Applications (FIMA), International Association for Pattern Recognition (FIAPR), Royal Statistical Society (FRSS), Royal Society of Arts (FRSA). Elected Member: Royal Irish Academy (MRIA), Academia Europaea (MAE). Senior Member IEEE.

Fionn Murtagh, PhD

Director, Centre of Mathematics and Data Science, University of Huddersfield

Training: Getting to Grips with the Tidyverse (R)

The tidyverse is essential for any statistician or data scientist who deals with data on a day-to-day basis. By focusing on small key tasks, the tidyverse suite of packages removes the pain of data manipulation. In this tutorial, we’ll cover some of the core features of the tidyverse, such as dplyr (the workhorse of the tidyverse), string manipulation, linking directly to databases and the concept of tidy data.

Instructor Bio

Dr. Colin Gillespie is a Senior Lecturer (Associate Professor) at Newcastle University, UK. His research interests are high-performance statistical computing and Bayesian statistics. He is regularly employed as a consultant by Jumping Rivers and has been teaching R since 2005 at a variety of levels, ranging from beginners to advanced programming.

Dr. Colin Gillespie

Author of Efficient R Programming, R Trainer and Consultant, and Senior Lecturer at Newcastle University

Training: The Path to Deep Learning with TensorFlow + Keras

Agenda
• Why do we need to create our own models?
• Introduction to Deep Learning
• Lab: The “Hello World” of TensorFlow + Keras: Logistic Regression
• Convolutional Neural Networks: At last! Real Deep Learning
• Lab: Computer Vision with CNNs
• Beyond Computer Vision

Target Audience
Developers interested in building deep learning models, and researchers interested in comparing the specific implementation with other frameworks. No previous experience is required, as all concepts will be introduced in the theory modules of the workshop; however, a minimum knowledge of Machine Learning concepts and practices (such as understanding the train / test / validation cycle, etc…) would be beneficial.
Practice

The following labs will be done during the course of the workshop:
• Environment set-up
• Basic Logistic Regression
• MNIST classifier (guided)
  o Logistic Regression
  o CNN
• Playing with the hyperparameters:
  o Minibatch sizes
  o Learning rates
• MNIST classifier challenge (your turn!)

Requirements
We will perform the installation of the required wheels for using TensorFlow as part of the labs, but having the following prerequisites installed will save time and potential issues during the workshop:
• Anaconda distribution with a Python 3.5 environment
• Python IDE (VSCode recommended)
• Git client
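As a conceptual warm-up for the logistic-regression lab, here is a framework-free sketch of the same idea on made-up 1-D data (the lab itself uses TensorFlow + Keras; this is plain Python with hand-rolled gradient descent):

```python
import math

# Logistic regression "Hello World": learn weight w and bias b so that
# sigmoid(w*x + b) separates two classes, via gradient descent on
# the cross-entropy loss.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: class 1 when x > 0.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, lr = 0.0, 0.0, 0.1

for _ in range(500):          # training epochs
    for x, y in data:         # stochastic gradient descent, one point at a time
        p = sigmoid(w * x + b)
        w -= lr * (p - y) * x  # gradient of cross-entropy w.r.t. w
        b -= lr * (p - y)      # gradient of cross-entropy w.r.t. b

preds = [round(sigmoid(w * x + b)) for x, _ in data]
print(preds)  # → [0, 0, 1, 1]
```

In the lab, the same model is expressed as a one-layer Keras network, with the framework handling the gradients automatically.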

Instructor Bio

Juliet is a Technical Evangelist at Microsoft, helping ISVs get the most out of the cloud. During the last 8 years at Microsoft she has focused on business apps while also delivering talks about projects that combine IoT with Cognitive Services and other Azure services.

Juliet Moreiro Bockhop

Technical Evangelist, Microsoft

Training: The Path to Deep Learning with TensorFlow + Keras

Agenda
• Why do we need to create our own models?
• Introduction to Deep Learning
• Lab: The “Hello World” of TensorFlow + Keras: Logistic Regression
• Convolutional Neural Networks: At last! Real Deep Learning
• Lab: Computer Vision with CNNs
• Beyond Computer Vision

Target Audience
Developers interested in building deep learning models, and researchers interested in comparing the specific implementation with other frameworks. No previous experience is required, as all concepts will be introduced in the theory modules of the workshop; however, a minimum knowledge of Machine Learning concepts and practices (such as understanding the train / test / validation cycle, etc…) would be beneficial.
Practice

The following labs will be done during the course of the workshop:
• Environment set-up
• Basic Logistic Regression
• MNIST classifier (guided)
  o Logistic Regression
  o CNN
• Playing with the hyperparameters:
  o Minibatch sizes
  o Learning rates
• MNIST classifier challenge (your turn!)

Requirements
We will perform the installation of the required wheels for using TensorFlow as part of the labs, but having the following prerequisites installed will save time and potential issues during the workshop:
• Anaconda distribution with a Python 3.5 environment
• Python IDE (VSCode recommended)
• Git client

Instructor Bio

Pablo Doval is Principal Data Architect and the General Manager of Plain Concepts in the UK. With a background in relational databases, data warehousing, and traditional BI projects, he has spent recent years architecting and building Big Data and Machine Learning projects for customers in sectors such as Healthcare, Digital Media, Retail, and Industry.

Pablo Alvarez

Data Science Architect, Plain Concepts

Training: Data science applications in quantitative finance

In this training session I will give an introduction to quantitative finance applications for data scientists. I will start by discussing a few fundamental ideas such as the efficient market hypothesis and the capital asset pricing model. I will also introduce some concepts on investment strategies including portfolio theory and smart beta investing. Many such strategies are based on fundamental factors revealed by academic research, such as value or profitability, which can yield excess returns when suitably applied. There is currently a lot of interest in using alternative data sets and machine learning tools to uncover further factors, making the field exciting for data scientists.

Throughout the training session we will work on some simplified hands-on examples to illustrate the main ideas.

We will discuss the setup and subtleties of backtesting for investment strategies and define relevant metrics. Typically, a major challenge is to avoid overfitting, which we will address as well.
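The basic backtest metrics discussed above can be sketched on hypothetical daily strategy returns (annualisation here assumes 252 trading days per year; real backtests would also account for costs, slippage, and look-ahead bias):

```python
import math
import statistics

# Hypothetical daily strategy returns from a backtest (fractions, not %).
returns = [0.01, -0.005, 0.007, 0.002, -0.003, 0.004]

# Cumulative (compounded) return over the period.
cumulative = 1.0
for r in returns:
    cumulative *= 1 + r

# A Sharpe-style risk-adjusted metric: mean over volatility, annualised.
mean = statistics.mean(returns)
vol = statistics.stdev(returns)
sharpe = mean / vol * math.sqrt(252)

print(round(cumulative - 1, 4))  # total return over the period
print(round(sharpe, 2))
```

Metrics like these are what a backtest optimises against, which is precisely why overfitting them is such a central danger.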

Although the focus of the session is investment strategies, I will also illustrate a few other machine learning applications in finance from my experience.

Participants don’t need to have prior experience in finance. However, familiarity with python and the numpy, pandas, scikit-learn ecosystem would be very beneficial.

Instructor Bio

Johannes is Data Analytics Associate Director at IHS Markit, a global information and intelligence provider. He technically manages multiple data science projects across various business lines including finance, the automotive industry and the energy sector. He has a keen interest in the full data science spectrum including mathematical statistics, machine learning, databases, distributed computing, and dynamic visualizations.

He holds a PhD in theoretical condensed matter physics from Imperial College and has been active in quantitative research for more than 10 years, including research positions at Harvard University and the Max-Planck Institute in Germany. He has disseminated his research in more than 30 peer-reviewed publications and over 50 talks.

Johannes Bauer, PhD

Data Analytics Associate Director, IHS Markit

Training: From prediction to prescription: multi-objective optimization with PyGMO

With the wealth of machine learning and big data software from the open source community, it is now easier than ever to build powerful predictive models to help people make decisions. On the other hand, even if one can predict the outcome accurately, it is not always trivial to make real-life decisions, especially when there are a large number of choices and trade-offs between multiple objectives. Bio-inspired optimization algorithms can be a great complement to predictive modelling techniques in such scenarios, allowing efficient exploration of the multi-dimensional choice space to find the optimal frontier of the key objectives. Decision makers can then focus on these objectives rather than worrying about choices at the execution level.

In this session we are going to walk through an example of multi-objective optimization problem in the context of a promotion campaign, using the open source package PyGMO (the Python Parallel Global Multiobjective Optimizer) from ESA. We will first briefly touch upon how to build a propensity model for such marketing activities. Then we will see how to optimize our promotion strategy with PyGMO, based on the prediction of the propensity model. We will also go a bit into the details of various algorithms available in PyGMO, as well as how to handle constraints.
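The Pareto frontier that PyGMO's algorithms search for efficiently can be illustrated by brute force on a handful of hypothetical promotion candidates, trading off campaign cost against expected uplift (PyGMO itself handles large choice spaces and constraints):

```python
# Brute-force Pareto frontier for two objectives: minimise cost, maximise uplift.
# Candidate promotions with hypothetical, hand-made numbers.
candidates = [
    {"name": "A", "cost": 10, "uplift": 5},
    {"name": "B", "cost": 20, "uplift": 12},
    {"name": "C", "cost": 15, "uplift": 4},   # dominated by A: costlier, less uplift
    {"name": "D", "cost": 30, "uplift": 13},
]

def dominates(p, q):
    # p dominates q if it is no worse on both objectives and strictly better on one.
    return (p["cost"] <= q["cost"] and p["uplift"] >= q["uplift"]
            and (p["cost"] < q["cost"] or p["uplift"] > q["uplift"]))

# The Pareto frontier: candidates not dominated by any other candidate.
frontier = [p for p in candidates
            if not any(dominates(q, p) for q in candidates)]
print([p["name"] for p in frontier])  # → ['A', 'B', 'D']
```

Exhaustive pairwise comparison is fine for four candidates; for realistic choice spaces, evolutionary algorithms like those in PyGMO approximate this frontier without enumerating everything.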

Instructor Bio

Dr Jiahang Zhong is the leader of the data science team at Zopa, one of the UK’s earliest fintech companies. He has broad experience of data science projects in credit risk, operational optimization, and marketing, with keen interests in machine learning, optimization algorithms, and big data technologies. Prior to Zopa, he worked as a PhD and Postdoctoral researcher on the Large Hadron Collider project at CERN, with focus on data analysis, statistics, and distributed computing.

Jiahang Zhong, PhD

Head of Data Science, Zopa Ltd

Workshop: Intelligent Price Optimization in Retail

Although price optimization is a well-studied economic problem, it becomes increasingly complex in practice, especially when dealing with large enterprises holding significant market share. You have to account for complementary goods. If your competitor raises prices for cars and you sell gasoline, people will buy fewer cars and need less gasoline, so it is bad news for you. Alternatively, if your competitor raises prices for beef and you sell chicken, it is good news for you, as beef and chicken are substitute goods. As beef becomes more expensive, some people will switch to chicken.

Changes in prices among the goods you sell affect buyers’ decisions in the same way. A good way of estimating demand in such a complex and interdependent environment is machine learning. First, you need to collect information about the items your competitors sell and their prices. Machine learning helps you identify those that are similar to the items you sell.

Next, you train a model that predicts how a change in prices for one good affects another. After that, you build another machine learning model to predict the demand for every good.

Finally, you build an elasticity model, which takes into account additional business rules and generates price recommendations. This talk is going to be about tips and pitfalls of designing such systems. We are going to discuss what should be taken into account when training ML models for price recommendations, how to match different products, how to predict demand, and how to integrate the solution with big data technologies.
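The final recommendation step can be sketched as a search over an allowed price band, with a hypothetical linear demand curve standing in for the trained demand model and the band acting as a simple business rule:

```python
# Toy sketch of price optimization: choose the price that maximises revenue
# given a fitted demand model and a business-rule price band.
def demand(price):
    # Hypothetical stand-in for a trained demand model: linear elasticity.
    return max(0.0, 100 - 7 * price)

min_price, max_price = 4.0, 12.0  # business rule: allowed price band

# Evaluate revenue = price * demand(price) over a 0.1-step grid in the band.
grid = [round(min_price + 0.1 * i, 2) for i in range(81)]  # 4.0 ... 12.0
best_price = max(grid, key=lambda p: p * demand(p))
print(best_price)  # → 7.1
```

In a production system, the grid search would be replaced by a proper optimizer, and the demand function by the ML models described above, including cross-price effects between goods.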

Instructor Bio

Sergii is an artificial intelligence and data science enthusiast who built the very first data science group in Ukraine. Previously, he managed data science departments and business directions at the largest Ukrainian software development services companies. Currently, he lectures on machine learning and artificial intelligence at the Ukrainian Catholic University.

Sergii Shelpuk

Head of Data Science, Eleks

Training: Interactive data visualisation in Python

When creating complex visualisations, interactivity can help communicate your core concepts. It allows the audience to familiarise themselves with the data, and makes understanding the data a step along the journey of understanding your visualisation.

However, creating interactive visualisations adds a layer of complexity to the data science workflow: during modelling and data exploration, interactivity is effectively achieved by re-running chunks of code with different parameters. Giving the reader the ability to achieve the same interactivity without having to change and re-run code therefore requires extra development from the data scientist.

In this workshop, we will go through python libraries that make this extra development as frictionless as possible, and produce interactive visualisations with as little code as possible. We will also go through options for producing interactivity for the wider public, and what steps need to be taken to achieve resilient interactive graphs.
Libraries to be used include ipywidgets, plotly, and plotly dash.

Instructor Bio

Dr. Jan Freyberg is a data scientist at ASI. He has worked on data science projects in the private and public sector, and his experience ranges from geospatial to unstructured language data. He is an expert in building interactive tools for communicating complex models, and is active in developing open-source data science software.

Jan completed a PhD and a fellowship studying brain activity, vision and consciousness in autism at the University of Cambridge and King’s College London, where he taught statistics and programming at undergraduate and postgraduate level.

Dr. Jan Freyberg

Data Scientist, ASI

Training: Data Science 101

Curious about Data Science? Self-taught on some aspects, but missing the big picture? Well, you've got to start somewhere, and this session is the place to do it. This session will cover, at a layman's level, some of the basic concepts of data science. In a conversational format, we will discuss: What are the differences between Big Data and Data Science, and why aren't they the same thing? What distinguishes descriptive, predictive, and prescriptive analytics? What purpose do predictive models serve in a practical context? What kinds of models are there, and what do they tell us? What is the difference between supervised and unsupervised learning? What are some common pitfalls that turn good ideas into bad science? During this session, attendees will learn the difference between k-nearest neighbors and k-means clustering, understand why we normalize data and avoid overfitting, and grasp the meaning of No Free Lunch.
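The k-nearest-neighbors vs. k-means distinction mentioned above can be shown in a few lines. This is a hedged toy example with synthetic data: k-NN is supervised (it needs labels), while k-means is unsupervised (it groups unlabeled points).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two toy blobs of points in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# k-NN *classifies* a new point using labeled examples (supervised).
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # needs labels y
print(knn.predict([[3.1, 2.9]]))                     # -> [1]

# k-means *clusters* the points without ever seeing labels (unsupervised).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(len(set(kmeans.labels_)))                      # -> 2
```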

Instructor Bio

For more than 20 years, Todd has been highly respected as both a technologist and a trainer. As a tech, he has seen that world from many perspectives: “data guy” and developer; architect, analyst and consultant. As a trainer, he has designed and covered subject matter from operating systems to end-user applications, with an emphasis on data and programming. As a strong advocate for knowledge sharing, he combines his experience in technology and education to impart real-world use cases to students and users of analytics solutions across multiple industries. He is a regular contributor to the community of analytics and technology user groups in the Boston area, writes and teaches on many topics, and looks forward to the next time he can strap on a dive mask and get wet. Todd is a Data Science Evangelist at DataRobot.

Todd Cioffi

Data Science Evangelist, DataRobot

Workshop: Algorithmic Trading with Machine Learning and Deep Learning

This workshop illustrates the use of machine and deep learning algorithms for classification in the context of predicting stock market movements. The workshop shows that there are parallels between building self-driving cars and deploying automated algorithmic trading strategies.
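The classification setup the workshop uses can be sketched as follows. This is a deliberately simple, hedged illustration on random returns (not the workshop's actual strategy): predict the sign of tomorrow's return from lagged returns, and note that on pure noise the hit rate should hover near 50%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: predict the direction of the next return
# from the last three lagged returns (a classification problem).
rng = np.random.default_rng(42)
returns = rng.normal(0, 0.01, 500)  # synthetic "market" returns

lags = 3
X = np.column_stack([returns[i:len(returns) - lags + i] for i in range(lags)])
y = (returns[lags:] > 0).astype(int)  # 1 = market up, 0 = down

split = 400  # respect time order: train on the past, test on the future
model = LogisticRegression().fit(X[:split], y[:split])
hit_rate = model.score(X[split:], y[split:])
print(round(hit_rate, 2))  # ~0.5 on random data, as expected
```

Real strategies add richer features and, crucially, rigorous out-of-sample validation.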

Instructor Bio

Yves has a Ph.D. in Mathematical Finance and is the founder and managing partner of The Python Quants GmbH. He is also the author of the books Python for Finance, Derivatives Analytics with Python and Listed Volatility & Variance Derivatives. He lectures for Data Science at htw saar University of Applied Sciences and for Computational Finance at the CQF Program and is the organizer of the Python for Quant Finance Meetup in London.

Yves Hilpisch, PhD

Founder, Quant University, Lecturer, and Author of Derivatives Analytics with Python and Python for Finance

Workshop: Probabilistic Graphical Models using PGMPY

This will be a hands-on workshop on Probabilistic Graphical Models using the PGMPY library. Attendees will learn the basics of PGMs with the open-source library pgmpy, to which we are contributors. PGMs are generative models that are extremely useful for modeling various hierarchical and non-hierarchical models as well as stochastic processes. We will discuss how fraud models and credit risk models can be built using Bayesian networks. We will also cover Hidden Markov Models and show how thermostat control can be modeled. Generative models are also useful for measuring causality and are great alternatives to deep neural networks, which cannot solve such problems. This workshop will teach the basics needed to learn about Bayesian networks, Markov models, and HMMs, including the advanced probability and other mathematical foundations needed to understand the topic. Students will learn by example throughout this workshop.
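To give a flavor of the Bayesian network reasoning above, here is a two-node network (Fraud → Alert) computed by hand; pgmpy automates exactly this kind of inference for larger graphs. All probabilities below are made up for illustration.

```python
# A two-node Bayesian network: Fraud -> Alert.
# Bayes' rule inverts the edge: given an alert, how likely is fraud?
p_fraud = 0.01                 # prior probability of fraud
p_alert_given_fraud = 0.90     # alert fires when fraud is present
p_alert_given_ok = 0.05        # false-alarm rate

# Marginal probability of an alert (law of total probability).
p_alert = p_alert_given_fraud * p_fraud + p_alert_given_ok * (1 - p_fraud)

# Posterior: most alerts are still false alarms because fraud is rare.
p_fraud_given_alert = p_alert_given_fraud * p_fraud / p_alert
print(round(p_fraud_given_alert, 3))  # -> 0.154
```

With pgmpy, the same query is expressed by declaring the graph and CPDs and running variable elimination, which scales this reasoning to many variables.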

Instructor Bio

Harish Kashyap has a Master's degree in Electrical Engineering from Northeastern University, Boston. He received the Graduate Student Award, a research scholarship as part of which he worked at BBN Technologies on Bayesian machine learning algorithms applied to speech recognition. He has more than 15 years of experience in artificial intelligence (AI), digital signal processing, and software development. He has several publications and patents filed in the areas of ML. He has led various data analytics projects across the US and Europe that led to organizational savings of $8M+. He built the curriculum for the machine learning training platform refactored.ai, which is now part of SUNY Buffalo's graduate ML course. He is currently the founder of Mysuru Consulting Group (MCG.ai) and Diagram.AI, where he works on ML algorithms.

Harish Kashyap

ML Architect | Voyagenius Labs LLP

Workshop: Probabilistic Graphical Models using PGMPY

This will be a hands-on workshop on Probabilistic Graphical Models using the PGMPY library. Attendees will learn the basics of PGMs with the open-source library pgmpy, to which we are contributors. PGMs are generative models that are extremely useful for modeling various hierarchical and non-hierarchical models as well as stochastic processes. We will discuss how fraud models and credit risk models can be built using Bayesian networks. We will also cover Hidden Markov Models and show how thermostat control can be modeled. Generative models are also useful for measuring causality and are great alternatives to deep neural networks, which cannot solve such problems. This workshop will teach the basics needed to learn about Bayesian networks, Markov models, and HMMs, including the advanced probability and other mathematical foundations needed to understand the topic. Students will learn by example throughout this workshop.
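The Hidden Markov Model thermostat example mentioned above can be sketched with the forward algorithm in plain Python; pgmpy provides this machinery for real models. All states, observations, and probabilities here are illustrative toy numbers.

```python
import numpy as np

# Toy HMM for thermostat control: hidden states (Heating, Idle),
# observed temperature trend (Rising, Falling).
start = np.array([0.5, 0.5])          # initial state distribution
trans = np.array([[0.7, 0.3],         # Heating -> Heating / Idle
                  [0.4, 0.6]])        # Idle    -> Heating / Idle
emit = np.array([[0.9, 0.1],          # Heating emits Rising / Falling
                 [0.2, 0.8]])         # Idle    emits Rising / Falling

obs = [0, 0, 1]  # observed: Rising, Rising, Falling

# Forward algorithm: likelihood of the observation sequence.
alpha = start * emit[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ trans) * emit[:, o]
likelihood = alpha.sum()
print(round(likelihood, 4))  # -> 0.1193
```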

Instructor Bio

Ria Aggarwal is an experienced engineer with a demonstrated history of working in the wireless industry, now working in the field of machine learning and artificial intelligence. She graduated from the Indian Institute of Technology Roorkee. She has always been passionate about maths and algorithms; machine learning, being a union of both, was a natural progression for her. She is currently working at an AI-based R&D startup, Voyagenius Labs. Her work involves architecting machine learning solutions for real-life problems, with a primary focus on research in the fields of natural language processing, reinforcement learning, and Bayesian statistics. She truly believes that the probabilistic approach can give superpowers to many machine learning algorithms and is going to revolutionise the paradigm of predictive modelling.

Ria Aggarwal

Machine Learning Engineer, Mysuru Consulting Group

Workshop: Telling human stories with data

Robust data analysis underpins every business decision, public sector project and non-profit initiative. But data in its raw form often fails to convince crucial lay audiences – either due to its complexity, or due to suspicion and mistrust. And you can't help guide the world in the right direction if you alienate key decision-makers.
Visualisation and narrative storytelling offer a path to bringing the numbers to life and persuading your audience – without losing the crucial truth that's in the data.

This workshop, delivered by journalist and data visualisation specialist Alan Rutter, will cover an audience-centred approach to visualising data. It will introduce tried-and-tested techniques for communicating data-driven stories effectively to people from a broad range of backgrounds, and deal with some of the common problems that practitioners encounter.
We will discuss how to avoid landing at either extreme of the data visualisation spectrum: bullet-proof academic analysis that fails to communicate a message; or fantastic aesthetic creations that actually say nothing at all.
We will look at the ethics of data visualisation, and how to be transparent about underlying data and methodologies. And we will discuss the considerations of different media, and how to tailor your visual approach to both your channels and audience. We will also look at how human beings interpret data, and how we can use design and psychology to our advantage.

There is no focus on a specific tool or language, as it is far more important to have a clear idea of the intention and desired action than to obsess over a given language or piece of software. This workshop is suited to anyone who wants to create impact with the data they work with by turning it into compelling stories for other audiences – whether through printed materials, presentations, social media, or websites and apps. 

Instructor Bio

Alan Rutter is the co-founder of consultancy Clever Boxer. He first worked with infographics as a magazine journalist (Time Out, WIRED), before moving into technology roles (Condé Nast, Net-A-Porter) and then training and development (The Guardian, General Assembly). He has taught data visualisation techniques to thousands of students, and for organisations including the Home Office, Department of Health, Biotechnology and Biosciences Research Council, Capita, Novartis and Kings College London.

Alan Rutter

Data Visualization Consultant, Trainer at General Assembly and Co-founder at Clever Boxer

Workshop: Understanding unstructured data with Language Models

As data scientists, we've seen a rapid improvement over the last decades in the tools available for working with structured data (be it tabular data, graph data, sensor data, etc.). Yet the vast majority of our data (Merrill Lynch puts the figure at roughly 90%) is *unstructured*, and lives in the form of documents, emails, reviews, reports, chat logs and so on.

Many of us are far less familiar with how to analyse and understand this trove of unstructured data. This talk focuses on language models, one of the most fundamental tools for working with unstructured data. Language models are all around us (although we’re probably unaware of them), underpinning everything from Word’s spellchecker to home assistants like Alexa.

While plenty of “out of the box” language modelling libraries exist, the first part of the talk focuses on getting a thorough understanding of what a language model is and how it works. We’ll touch on key ideas from statistics and information theory, and see how Alan Turing, in developing techniques to break Nazi codes at Bletchley Park, created the smoothing techniques which remain widely used in language models today. We’ll then proceed to the present day, looking at how techniques like word vectors and transfer learning have yielded an improved generation of tools.

In the second half of the talk, we’ll look at how we can practically use language models to understand unstructured data. Specifically we’ll explore:

– Classification – the canonical application of language models, they can help us identify spam, analyse sentiment or perform unsupervised clustering. We’ll look at a famous case where language models were able to successfully identify a Shakespeare forgery.
– Predictive modelling – if I were to look at your Tweets (and nothing else), could I guess your gender? It turns out state-of-the-art techniques can successfully predict it with an 80%+ success rate. We’ll look at how language models can enrich your datasets with additional demographic or contextual data.
– Information retrieval – finally, we’ll see how language models have been used extensively (for example in the legal sector), to extract targeted insights from enormous data sets.
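To make the core idea concrete, here is a hedged, minimal unigram language model with add-one (Laplace) smoothing — the same family of smoothing techniques the talk traces back to Turing's work. The tiny corpus is invented for illustration.

```python
from collections import Counter

# A minimal unigram language model with add-one (Laplace) smoothing.
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
vocab = set(corpus) | {"dog"}  # include a word never seen in training

def prob(word):
    # Add-one smoothing: every word, seen or unseen, gets nonzero mass.
    return (counts[word] + 1) / (len(corpus) + len(vocab))

print(round(prob("the"), 3))  # frequent word -> higher probability
print(round(prob("dog"), 3))  # unseen word   -> small but nonzero
```

Real language models use n-grams, neural networks, and more refined smoothing, but the principle of redistributing probability mass to unseen events is the same.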

 

Instructor Bio

Alex Peattie is the co-founder and CTO of Peg, a technology platform helping multinational brands and agencies to find and work with top YouTubers. Peg is used by over 2000 organisations worldwide including Coca-Cola, L’Oreal and Google.

An experienced digital entrepreneur, Alex spent six years as a developer and consultant for the likes of Grubwithus, Huckberry, UNICEF and Nike, before joining coding bootcamp Makers Academy as senior coach, where he trained hundreds of junior developers. Alex was also a technical judge at the 2017 TechCrunch Disrupt conference.

Alex Peattie

CTO, Peg

Workshop: Training Gradient Boosting Models to Solve Classification Problems with CatBoost

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For a number of years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others.

CatBoost (http://catboost.ai) is an open-source gradient boosting library that outperforms existing publicly available implementations of gradient boosting in terms of quality. It also has a set of additional advantages:

1. CatBoost is able to incorporate categorical features in your data (like music genre, device id, URL, etc.) in predictive models with no additional preprocessing. For more details on our approach please refer to our NIPS 2017 ML Systems Workshop paper (http://learningsys.org/nips17/assets/papers/paper_11.pdf).
2. CatBoost requires almost no hyperparameter tuning in order to get a model with good quality.
3. CatBoost has the fastest GPU and multi-GPU training implementations of all the openly available gradient boosting libraries.

This tutorial will explain the main features of the library using the example of solving a classification problem.
We will walk you through all the steps of building a good predictive model, covering topics such as:

– Choosing suitable loss functions and metrics to optimize
– Training classification model
– Visualizing the process of training and cross-validation
– CatBoost built-in overfitting detector and means of reducing overfitting of gradient boosting models
– Selection of an optimal decision boundary
– Feature selection and explaining model predictions
– Testing a trained CatBoost model on unseen data

Instructor Bio

Anna Veronika Dorogush graduated from the Faculty of Computational Mathematics and Cybernetics of Lomonosov Moscow State University and from the Yandex School of Data Analysis. She previously worked at ABBYY, Microsoft, Bing and Google, and has been working at Yandex since 2015, where she currently heads the Machine Learning Systems group and leads the development of the CatBoost library.

Anna Veronika Dorogush

ML Lead, Yandex

Workshop: Introduction to Automatic and Interpretable Machine Learning with H2O and LIME

General Data Protection Regulation became enforceable on 25 May 2018. Are you and your organization ready to explain your models?

This is a hands-on tutorial for R beginners. I will demonstrate the use of two R packages, h2o and lime, for automatic and interpretable machine learning. Participants will be able to follow along and build regression and classification models quickly with H2O’s AutoML. They will then be able to explain the model outcomes with a framework called Local Interpretable Model-Agnostic Explanations (LIME).

Instructor Bio

Jo-fai (Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab as a data science evangelist promoting products via blogging and giving talks at external events. 

Joe has a background in water engineering. Before his data science journey, he was an EngD researcher at the STREAM Industrial Doctorate Centre working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specialized in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.

Long before Joe immersed himself in the world of open-source R and Python, he learned his trade as an avid MATLAB user. When he was a kid, his parents taught him one of the famous old Chinese sayings – when one drinks water, one must not forget where it comes from. So when Twitter asked Joe to be creative, he simply put down @matlabulous as his handle.

In the summer of 2014, his data visualization side project ‘CrimeMap’ led him to a poster presentation at useR! 2014 where he heard about H2O for the very first time. He has been using H2O for various data science projects ever since.

Jo-Fai Chow

Data Science Evangelist, H2O.ai

Training: AI for Executives

Gain insight into how to drive success in data science. Identify key points in the machine learning life cycle where executive oversight really matters. Learn effective methods to help your team deliver better predictive models, faster. You’ll leave this seminar able to identify business challenges well suited for machine learning, with fully defined predictive analytics projects your team can implement now to improve operational results.

 

Instructor Bio

John Boersma is Director of Education for DataRobot. In this role he oversees the company’s client training operations and relations with academic institutions using DataRobot in analytics courses. Previously, John founded and led Adapt Courseware, an adaptive online college curriculum venture. John holds a PhD in computational particle physics and an MBA in general management.

John Boersma

Director of Education, DataRobot

Workshop: Predictive Modeling with R

R is a standard tool for predictive modeling. It gives you access to hundreds of predictive models and lets you build really complex workflows.

The workshop is a guided tour through the most important R packages. It is illustrated with working R examples. You will learn how to use R for predictive modeling including feature selection, model building, validation, and deployment.

You will learn the correct process of building predictive models and how to:

– use R for predictive modeling, including feature selection, model building, validation, and deployment
– work with the universal and powerful package `caret`
– use a couple of specialized packages like `xgboost` and `caretEnsemble`
– use supporting packages like `PMML`
– use H2O as a helper

Instructor Bio

Artur has over twenty years of experience in deep business analytics, data science, and machine learning projects. He has worked for various companies, from start-ups to international corporations, and in various roles: as an employee, a consultant, and a business owner. He spent over ten years working as a statistician in a commercial bank; at the same time he received a Ph.D. in Mathematics and wrote several scientific papers. He currently runs his company QuantUp (http://quantup.pl), focused on giving value to companies using data science, machine learning, software development, and commercial training. He has led nearly one hundred real-world data science projects and several thousand hours of commercial training in this field. He is a co-owner, Vice CEO, and CSO of the Swedish bioinformatics company MedicWave. Artur has long-time experience working with open-source software and promotes its use in business applications at numerous conferences. He is a fan of the R language and a co-author of a book on forecasting in R.

Artur Suchwalko, PhD

Data Scientist / Owner, QuantUp

Workshop: Competitive Model Stacking: An Introduction to Stacknet Meta Modelling Framework

StackNet is a computational, scalable and analytical framework mainly implemented in Java that resembles a feedforward neural network and uses Wolpert’s stacked generalization in multiple levels to improve the accuracy of predictions. StackNet will be demonstrated through practical examples.
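Wolpert's stacked generalization, which StackNet scales to many levels, can be sketched in miniature with scikit-learn: base learners' out-of-fold predictions become the features for a meta-learner. This is a hedged conceptual sketch, not StackNet itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level-0 models feed a level-1 meta-learner; cv=5 means the meta-learner
# trains on out-of-fold predictions, guarding against leakage.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 2))  # held-out accuracy
```

StackNet generalizes this pattern to arbitrary numbers of levels and models, in a feedforward-network-like topology.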

 

Instructor Bio

Marios Michailidis is a research data scientist at H2O.ai. He holds a BSc in Accounting and Finance from the University of Macedonia in Greece and an MSc in Risk Management from the University of Southampton. He has also nearly finished his PhD in machine learning at University College London (UCL), with a focus on ensemble modelling. He has worked in both the marketing and credit sectors in the UK market and has led many analytics projects on various themes, including acquisition, retention, recommenders, uplift, fraud detection, portfolio optimization, and more.

He is the creator of KazAnova (http://www.kazanovaforanalytics.com/), a freeware GUI for credit scoring and data mining made 100% in Java, as well as the creator of the StackNet Meta-Modelling Framework (https://github.com/kaz-Anova/StackNet). In his spare time he loves competing in data science challenges and was ranked 1st out of 500,000 members on the popular data competition platform Kaggle.com. Here (http://blog.kaggle.com/2016/02/10/profiling-top-kagglers-kazanova-new-1-in-the-world/) is a blog post about Marios being ranked at the top on Kaggle and sharing his knowledge, tricks, and ideas.

Marios Michailidis, PhD

Ranked #1 on Kaggle.com, Data Scientist at H2O.ai

Workshop: Coming Soon

Coming soon.

Instructor Bio

Peter is a pioneer in the field of applied news analytics, bringing alternative data to banks and hedge funds. He has more than 15 years of experience in quantitative finance with companies such as Standard & Poor’s, Credit Suisse First Boston, and Saxo Bank.

Peter Hafez

Chief Scientist, RavenPack

 

Workshop: Network Analysis using Python

Certainly, some of the most exciting research going on right now is in the area of deep learning. But how do we get started with hands-on practice, and how do we gain a basic understanding of what is going on within all of those deep learning layers? This session will help the beginner deep learner navigate this new landscape. I will explain both the design theory and the Keras implementation of some of today’s most widely used deep learning algorithms, including convolutional neural nets and recurrent neural nets. I will also discuss some of my own recent explorations with Keras, including a spin-off of style transfer.
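To demystify what happens inside one of those layers, here is what a convolutional layer computes, written out in plain NumPy rather than Keras: slide a small kernel over an image and take dot products. The image and kernel are toy examples; a vertical-edge kernel lights up exactly where pixel values change left-to-right.

```python
import numpy as np

# A 5x5 "image": left half dark (0), right half bright (1).
image = np.zeros((5, 5))
image[:, 2:] = 1.0

# A 1x2 kernel that responds to horizontal changes (a vertical edge).
kernel = np.array([[1, -1]])

# Convolution: slide the kernel across each row and take dot products.
out = np.zeros((5, 4))
for i in range(5):
    for j in range(4):
        out[i, j] = (image[i:i+1, j:j+2] * kernel).sum()

print(out[0])  # nonzero only at the column where the edge sits
```

A convolutional layer in Keras learns many such kernels from data instead of hand-designing them, then stacks nonlinearities and further layers on top.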

 

Instructor Bio

Julia Lintern is a senior data scientist at Metis, where she co-teaches the data science bootcamp, develops curricula, and focuses on other special projects. Previously, Julia worked as a data scientist at JetBlue, where she used quantitative analysis and machine learning methods to provide continuous assessment of the aircraft fleet. Julia began her career as a structures engineer designing repairs for damaged aircraft. Julia holds an MA in applied math from Hunter College, where she focused on visualizations of various numerical methods including collocation and finite element methods and discovered a deep appreciation for the combination of mathematics and visualizations, leading her to data science as a natural extension of these ideas. She continues to collaborate on various projects; including her current work with the NYTimes data science team. During certain seasons of her career, she has also worked on creative side projects such as Lia Lintern, her own fashion label.

Julia Lintern

Senior Data Scientist, Metis

Workshop: Personal Price-Aware Recommender System: Evidence from eBay

With the ever-growing volume, complexity and dynamicity of online information, recommender systems have been an effective solution to overcome such information overload. In recent years, deep learning’s revolutionary advances in speech recognition, image analysis and natural language processing have gained significant attention. The same is true for recommender systems: deep learning techniques effectively capture the non-linear user-item relationships as well as the intricate relationships with contextual information, enabling higher recommendation quality than traditional algorithms.

In this hands-on workshop, I’m going to present some useful examples where deep learning techniques are superior to traditional recommender system algorithms: learning item embeddings, deep collaborative filtering and session-based recommendations. For each use case, a detailed description of the data pre-processing, neural network architecture, tuning best practices and experiment results will be presented, followed by step-by-step exercises using IPython notebooks with some public datasets to provide a starting point for further experimentation.
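The item-embedding idea above can be shown in miniature: factorize a tiny user-item ratings matrix by gradient descent so each user and item gets a small learned vector. This is a hedged toy sketch with made-up ratings, not the workshop's eBay-scale models.

```python
import numpy as np

# Tiny user-item ratings matrix; 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0  # only observed ratings contribute to the loss

rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (4, 2))  # user embeddings (2 latent factors)
V = rng.normal(0, 0.1, (4, 2))  # item embeddings

lr = 0.01
for _ in range(3000):
    err = (R - U @ V.T) * mask     # error on observed cells only
    U += lr * err @ V              # gradient steps on both factors
    V += lr * err.T @ U

rmse = np.sqrt((err ** 2).sum() / mask.sum())
print(round(rmse, 2))  # small reconstruction error on observed cells
```

Deep collaborative filtering replaces the dot product `U @ V.T` with a neural network over the same learned embeddings.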

Instructor Bio

Bio Coming Soon

Asi Messica, PhD

Lecturer, Ben Gurion University

Workshop: Target leakage in machine learning

Target leakage is one of the most difficult problems in developing real-world machine learning models. Leakage occurs when the training data gets contaminated with information that will not be known at prediction time. Additionally, there can be multiple sources of leakage, from data collection and feature engineering to partitioning and model validation. As a result, even experienced data scientists can inadvertently introduce leaks and become overly optimistic about the performance of the models they deploy. In this talk, we will look through real-life examples of data leakage at different stages of the data science project lifecycle, and discuss various countermeasures and best practices for model validation.
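One classic leak from the feature-engineering/validation stage can be reproduced in a few lines: on pure-noise data, selecting features using all rows before cross-validation produces a wildly optimistic score, while doing the selection inside each training fold (via a pipeline) gives an honest one. The data here is synthetic random noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: any honest validation score should hover near 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, 100)

# Leaky: features selected on ALL rows, including future validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Correct: selection happens inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(round(leaky, 2), round(honest, 2))  # leaky looks great; honest ~0.5
```

The same principle applies to scaling, encoding, imputation: any step that looks at the data must live inside the cross-validation loop.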

 

Instructor Bio

Yuriy is a professional with 10 years of experience in the industry. He has extensive experience in data science, ML, R&D, and software architecture. Yuriy is a developer of the ML platform DataRobot. He has also worked on spoken and written language processing, and teaches at CS@UCU, Kyivstar Big Data School, and LITS.

Yuriy Guts

Machine Learning Engineer, DataRobot

Workshop: Sentiment analysis with deep learning, machine learning or lexicon based? You choose!

Do you want to know what your customers, users, contacts, or relatives really think? Find out by building your own sentiment analysis application.

In this workshop you will build a sentiment analysis application, step by step, using KNIME Analytics Platform. After an introduction to the most common techniques used for sentiment analysis and text mining we will work in three groups, each one focusing on a different technique.

Deep Learning. This group will work with the visual Keras deep learning integration available in KNIME (completely code free)
Machine Learning. This group will use other machine learning techniques, based on native KNIME nodes
Lexicon Based. This group will focus on a lexicon based approach for sentiment analysis
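As a preview of the lexicon-based group's approach, here is the idea in a few lines of Python (KNIME expresses the same logic with nodes instead of code). The tiny word lists are illustrative only; real lexicons contain thousands of scored terms.

```python
# A bare-bones lexicon-based sentiment scorer.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(text):
    words = text.lower().split()
    # Count positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))    # -> positive
print(sentiment("terrible support I hate it"))   # -> negative
```

The deep learning and machine learning groups replace the fixed lexicon with models learned from labeled examples.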

Workshop Requirements:

Your own laptop preinstalled with KNIME Analytics Platform, which you can download from the KNIME website
KNIME Textprocessing extension. See the video linked below about installing KNIME extensions: https://www.youtube.com/watch?v=8HMx3mjJXiw

Help Installing KNIME Analytics Platform:

Here are some links to YouTube videos to help you install KNIME Analytics Platform:
Windows https://www.youtube.com/watch?v=yeHblDxakLk&feature=youtu.be
Mac https://www.youtube.com/watch?v=1jvRWryJ220&feature=youtu.be
Linux https://www.youtube.com/watch?v=wibggQYr4ZA&feature=youtu.be

Instructor Bio

Rosaria Silipo has been mining data, big and small, since her master’s degree in 1992. She kept mining data throughout her doctoral program, her postdoctoral program, and most of her subsequent job positions. So many years of experience and passion for data analytics, data visualization, data manipulation, reporting, business intelligence, and KNIME tools naturally led her to become a principal data scientist and an evangelist for data science at KNIME.

Rosaria Silipo, PhD

Principal Data Scientist, KNIME

Workshop: Sentiment analysis with deep learning, machine learning or lexicon based? You choose!

Do you want to know what your customers, users, contacts, or relatives really think? Find out by building your own sentiment analysis application.

In this workshop you will build a sentiment analysis application, step by step, using KNIME Analytics Platform. After an introduction to the most common techniques used for sentiment analysis and text mining we will work in three groups, each one focusing on a different technique.

Deep Learning. This group will work with the visual Keras deep learning integration available in KNIME (completely code free)
Machine Learning. This group will use other machine learning techniques, based on native KNIME nodes
Lexicon Based. This group will focus on a lexicon based approach for sentiment analysis

Workshop Requirements:

Your own laptop preinstalled with KNIME Analytics Platform, which you can download from the KNIME website
KNIME Textprocessing extension. See the video linked below about installing KNIME extensions: https://www.youtube.com/watch?v=8HMx3mjJXiw

Help Installing KNIME Analytics Platform:

Here are some links to YouTube videos to help you install KNIME Analytics Platform:
Windows https://www.youtube.com/watch?v=yeHblDxakLk&feature=youtu.be
Mac https://www.youtube.com/watch?v=1jvRWryJ220&feature=youtu.be
Linux https://www.youtube.com/watch?v=wibggQYr4ZA&feature=youtu.be

Instructor Bio

Kathrin Melcher is a Data Scientist at KNIME. She holds a Master Degree in Mathematics obtained at the University of Konstanz, Germany. She joined the Evangelism team at KNIME in 2017. She has a strong interest in data science, machine learning and algorithms, and enjoys teaching and sharing her knowledge about it.

Kathrin Melcher

Data Scientist, KNIME

Training: Introduction to Reinforcement Learning

Reinforcement Learning has recently made great progress in industry as one of the best techniques for sequential decision making and control policies.
DeepMind used RL to greatly reduce energy consumption in Google’s data centres. It has been used for text summarisation, autonomous driving, dialog systems, and media advertising, and in finance by JPMorgan Chase. We are at the very beginning of the adoption of these algorithms, as systems are required to operate more and more autonomously.

In this workshop we will explore Reinforcement Learning, starting from its fundamentals and ending with creating our own algorithms.
We will use OpenAI Gym to try out our RL algorithms. OpenAI is a non-profit organisation committed to open-sourcing all of its research on artificial intelligence. To foster innovation, OpenAI created a virtual environment, OpenAI Gym, where it’s easy to test Reinforcement Learning algorithms.
In particular, we will start with some popular techniques like the multi-armed bandit, going on through Markov Decision Processes and Dynamic Programming.
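The multi-armed bandit starting point can be sketched with the epsilon-greedy strategy in plain Python. The payout probabilities are made up; the agent explores at random with probability eps and otherwise exploits its best estimate so far.

```python
import random

random.seed(0)
true_p = [0.3, 0.5, 0.7]   # payout probability per arm, unknown to the agent
counts = [0, 0, 0]         # pulls per arm
values = [0.0, 0.0, 0.0]   # running mean reward per arm
eps = 0.1

for _ in range(5000):
    if random.random() < eps:
        arm = random.randrange(3)          # explore a random arm
    else:
        arm = values.index(max(values))    # exploit the best estimate
    reward = 1 if random.random() < true_p[arm] else 0
    counts[arm] += 1
    # Incremental mean update: no need to store past rewards.
    values[arm] += (reward - values[arm]) / counts[arm]

print(values.index(max(values)))  # almost always identifies the best arm
```

Markov Decision Processes generalize this by adding states and transitions, which is where Dynamic Programming methods come in.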

Instructor Bio

Leonardo De Marchi holds a Master in Artificial intelligence and has worked as a Data Scientist in the sport world, with clients such as New York Knicks and Manchester United, and with large social networks, like Justgiving. He now works as Lead Data Scientist in Badoo, the largest dating site with over 340 million users.

Leonardo De Marchi

Lead Data Scientist at Badoo

Workshop: A Gentle Introduction to Survival Models with Applications in Python and R

Survival/duration models are common ways to model the probability of failing or surviving at each period in your data set. Though they are common in certain fields of economics, econometrics and biology, they are less commonly applied in data science, despite often being the most appropriate approach to a problem.

This workshop will start with a theoretical introduction to basic non-parametric, semi-parametric and parametric models such as Kaplan-Meier, the Cox Proportional Hazards model (with and without time-varying covariates), the Aalen additive model, and random survival forests. In the second part of the workshop, we will look at how we can apply these models in Python and R.
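
To make the Kaplan-Meier idea concrete ahead of the tooling (common library choices include lifelines in Python and the survival package in R), here is a hand-rolled sketch of the estimator on made-up data. The survival curve is a product over observed event times, S(t) = Π (1 − d_i/n_i), where d_i is the number of events at time t_i and n_i the number still at risk:

```python
import numpy as np

# Toy data: follow-up time and whether the event (failure) was observed.
durations = np.array([3, 5, 5, 8, 10, 12])
observed = np.array([1, 1, 0, 1, 0, 1])   # 0 = censored

def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) via the product-limit estimator."""
    times = np.sort(np.unique(durations[observed == 1]))
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                      # n_i
        events = np.sum((durations == t) & (observed == 1))   # d_i
        s *= 1.0 - events / at_risk
        surv.append(s)
    return times, np.array(surv)

times, surv = kaplan_meier(durations, observed)
```

Note how the censored subjects (times 5 and 10) still count towards the at-risk set without contributing an event, which is precisely what distinguishes survival models from naive failure-rate calculations.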

Instructor Bio

Violeta has been working as a data scientist in the Data Innovation and Analytics department of ABN AMRO bank in Amsterdam, the Netherlands. In her daily job, she works on projects with different business lines, applying the latest machine learning and advanced analytics technologies and algorithms. Before that, she worked for about 1.5 years as a data science consultant at Accenture, the Netherlands. Violeta enjoyed helping clients solve their problems with data and data science, but wanted to be able to develop more sophisticated tools, hence the switch.

Before her position at Accenture, she worked on her PhD, which she obtained from Erasmus University Rotterdam in the area of Applied Microeconometrics. In her research she used data to investigate the causal effects of negative experiences on human capital, education, problematic behavior and crime.

Violeta Misheva, PhD

Data Scientist, ABN AMRO Bank N.V.

Workshop: How to learn many, many labels with Machine Learning

Classification is one of the most common machine learning tasks. In this workshop we will discuss the unusual challenges that arise when dealing with hundreds or even many thousands of distinct classes, focusing particularly on text classification. We'll cover several aspects of the problem, from taxonomy development and data exploration through to nifty linear algebra and optimisation tricks you can leverage to help your machine learning algorithms cope. We'll walk through a Jupyter notebook applying this knowledge to a new, never-before-seen highly multiclass NLP dataset (soon to be made publicly available).

This talk is aimed at those with knowledge of basic machine learning and Python. Some neural network experience is preferable but not required.
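
One widely used scaling trick for text features in highly multiclass settings (offered as background; not necessarily among the exact tricks covered in the workshop) is feature hashing, which keeps the design matrix at a fixed width no matter how large the vocabulary grows; a minimal sketch, with the dimension 1024 chosen arbitrarily:

```python
import numpy as np

def hash_features(tokens, dim=1024):
    """Project a bag of tokens into a fixed-size vector via the hashing trick.

    The vocabulary never needs to be enumerated: each token is hashed
    straight to a column index, so memory stays bounded however many
    distinct words the corpus contains.
    NB: Python's built-in hash is randomised per process; real systems use
    a stable hash (e.g. MurmurHash, as in scikit-learn's HashingVectorizer).
    """
    v = np.zeros(dim)
    for tok in tokens:
        v[hash(tok) % dim] += 1.0
    return v

x = hash_features("the cat sat on the mat".split())
```

Occasional hash collisions merge two features into one column, but in practice this costs little accuracy relative to the memory saved.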

Instructor Bio

Mike is a senior machine learning engineer at Evolution AI, working on Evolution AI's NLP platform. He is probably most widely known in the machine learning community for a popular blog post about his escapades teaching a neural network to freestyle rap (https://bit.ly/2fsePbZ). He has been working in data science and machine learning for the last 5 years, for the likes of Ocado and Qubit Technology. His primary areas of expertise are NLP, probabilistic graphical models and recommender systems.

Dr. Michael Swarbrick Jones

Senior Machine Learning Engineer, Evolution AI

Workshop: How to Build a High-Performing Weighted XGBoost ML Model for a Real-Life Imbalanced Dataset

This code pattern, which creates an end-to-end ML flow and predicts financial purchases for imbalanced financial data using weighted XGBoost, is for anyone interested in using XGBoost and building a Scikit-Learn based end-to-end machine learning pipeline for real datasets, where class imbalance is very common.

1. Introduction and Background.
2. Data Set Description.
3. Statement of Classification Problem.
4. Software and Tools (XGBoost and Scikit-Learn).
5. Visual Data Exploration to understand data (Using seaborn and matplotlib).

We will also explore the output variable and note the class imbalance issue.

6. Create Scikit learn ML Pipelines for Data Processing.
7. Model Training.
7.1 What and Why of XGBoost.
7.2 Discuss Metrics for Model Performance.
7.3 First Attempt at Model Training and its performance, evaluation and analysis.
7.4 Strategy For Better Classifier for the Imbalance Data.
7.5 Second Attempt at Model Training using Weighted Samples and its performance, evaluation and analysis.
7.6 Third Attempt at Model Training using Weighted Samples and Feature Selection and its performance analysis.
8. Inference Discussion
9. Summary 
10. Pointers to Other Advanced Techniques
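
As a preview of the weighting idea in step 7.4: XGBoost exposes a `scale_pos_weight` parameter, and a common heuristic (noted in the XGBoost parameter documentation) is to set it to the ratio of negative to positive samples. A sketch with made-up labels; the model call itself is shown commented since the point here is the weight computation:

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)

# A common heuristic for XGBoost's scale_pos_weight parameter is the
# ratio of negative to positive samples.
neg, pos = int(np.sum(y == 0)), int(np.sum(y == 1))
scale_pos_weight = neg / pos

# With xgboost installed, this would typically be passed as:
# model = xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)
```

The effect is that each positive example contributes proportionally more to the gradient, pushing the model to stop ignoring the minority class.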

Instructor Bio

Alok Singh is a Principal Engineer at the IBM CODAIT (Center for Open-Source Data and AI Technologies). He has architected and built multiple analytical frameworks and implemented machine learning algorithms across various data science use cases. His interest lies in creating big data and scalable machine learning software and algorithms. He has also created many data-science-based applications.

Alok Singh

Principal Engineer, IBM

Workshop: Introduction to Data Science - A Practical Viewpoint

In this talk I will address some of the most important aspects in and around doing data science. We will cover the data science workflow through applications in different industries and explore the data science landscape including tools and methodologies, roles, challenges and opportunities.

Instructor Bio

Jesús (aka J.) is a lead data scientist with a strong background and interest in generating insight based on simulation, hypothesis generation and predictive analytics. He has substantial practical experience with statistical analysis, machine learning and optimisation tools in product development and innovation, finance, media and other fields.

Jesús is the Principal Data Scientist at AKQA. He has a background in physics and has held positions in both academia and industry, including at Imperial College, IBM Data Science Studio, Prudential and Dow Jones, to name a few.

Jesús is the author of “Essential MATLAB and Octave,” a book for students in physics, engineering, and other disciplines. He also authored the upcoming data science book entitled “Data Science and Analytics with Python.”

Dr. Jesús Rogel-Salazar

Data Science Instructor, General Assembly

Workshop: Elegant Machine Learning with Julia and Flux

Flux (http://fluxml.ai) is a new machine learning library that’s easy and intuitive to use, but scales to handle the most difficult research challenges. As machine learning models grow increasingly complex, we suggest that neural networks are best viewed as an emerging, differentiable programming paradigm, and ask what decades of research into programming languages and compilers have to offer to the machine learning world.

Flux is written entirely in Julia, an easy but high-performance programming language similar to Python. You can train models using high-level Keras-like interfaces, or drop down to the mathematics, allowing complete customisation even down to the CUDA kernels. Meanwhile, Julia’s advanced compiler technology allows us to provide cutting edge performance.

This workshop will introduce Flux and its approach to building differentiable, trainable algorithms and show simple but practical examples in image recognition, reinforcement learning and natural language processing. We’ll also cover Flux’s ecosystem of existing ready-made models, and how these can be used to get a head start on real-world problems.

Instructor Bio

Avik has spent many years helping investment banks leverage technology in risk and capital markets. He’s worked on bringing AI-powered solutions to investment research, and is currently the VP of Engineering at Julia Computing.

Avik Sengupta

VP of Engineering, Julia Computing

Workshop: Elegant Machine Learning with Julia and Flux

Flux (http://fluxml.ai) is a new machine learning library that’s easy and intuitive to use, but scales to handle the most difficult research challenges. As machine learning models grow increasingly complex, we suggest that neural networks are best viewed as an emerging, differentiable programming paradigm, and ask what decades of research into programming languages and compilers have to offer to the machine learning world.

Flux is written entirely in Julia, an easy but high-performance programming language similar to Python. You can train models using high-level Keras-like interfaces, or drop down to the mathematics, allowing complete customisation even down to the CUDA kernels. Meanwhile, Julia’s advanced compiler technology allows us to provide cutting edge performance.

This workshop will introduce Flux and its approach to building differentiable, trainable algorithms and show simple but practical examples in image recognition, reinforcement learning and natural language processing. We’ll also cover Flux’s ecosystem of existing ready-made models, and how these can be used to get a head start on real-world problems.

Instructor Bio

Mike Innes is a software engineer at Julia Computing, where he works on among other things the Juno IDE and the machine learning ecosystem. He is the creator of the Flux machine learning library.

Mike Innes

Software Engineer, Julia Computing

Workshop: Handling Missing Data in Python/Pandas

This code pattern, which creates an end-to-end ML flow and predicts financial purchases for imbalanced financial data using weighted XGBoost, is for anyone interested in using XGBoost and building a Scikit-Learn based end-to-end machine learning pipeline for real datasets, where class imbalance is very common.

1. Introduction and Background.
Imbalanced datasets, where the number of samples in one class is much larger than in the other, are very common; we set up the premise of the whole code pattern here.

2. Data Set Description.
The data set is from the Portuguese Bank Marketing data, in which a bank associate calls a client to sell a financial product, i.e. a CD. The data set contains 17 columns and is explained here.

3. Statement of Classification Problem.
Before we start building our model, we clearly define our objective and the high-level problem statement.

4. Software and Tools (XGBoost and Scikit-Learn).
We will mostly use Python-based libraries, i.e. XGBoost, Scikit-learn, Matplotlib, Seaborn and Pandas. In this section we load and explain each of the packages and their subpackages.

5. Visual Data Exploration to understand data (Using seaborn and matplotlib).
To get better insight into the data, data scientists usually perform data exploration. We will explore the input features' distributions, correlations and outliers.

We will also explore the output variable and note the class imbalance issue.

6. Create Scikit learn ML Pipelines for Data Processing.
Split the data into train and test sets.
Create an ML pipeline for data preparation. In a typical machine learning application, one usually creates an ML pipeline so that all the steps applied to the training data set can easily be applied to the test set.
7. Model Training.
Model training is an iterative process, and we will do several iterations to improve our model's performance.

7.1 What and Why of XGBoost.
We explain why we chose XGBoost as our tool of choice.

7.2 Discuss Metrics for Model Performance.
We explain in detail various classification performance metrics, such as the ROC curve, the Precision-Recall curve and the Confusion Matrix, and our choice for this application.

7.3 First Attempt at Model Training and its performance, evaluation and analysis.
We build an XGBoost model using cross-validation and compare its performance via various statistics and visualisations. We note that performance is not good for the positive class, i.e. recall is bad.

7.4 Strategy for a Better Classifier for the Imbalanced Data.
To improve recall, we highlight a few tricks.

7.5 Second Attempt at Model Training using Weighted Samples and its performance, evaluation and analysis.
Next, we use one of those tricks, weighted samples, to improve performance.

7.6 Third Attempt at Model Training using Weighted Samples and Feature Selection and its performance analysis.
Lastly, we build a model with weighted samples and feature selection.

8. Inference Discussion (Generalization and Prediction).
Now our model is ready to be used; we run it on the held-out data to see its performance on the test set.

9. Summary of what we learned about the various techniques.
10. Pointers to Other Advanced Techniques, like oversampling, undersampling and the SMOTE algorithm.
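
The oversampling mentioned in item 10 can be previewed with its simplest variant, random oversampling, which duplicates minority-class rows until the classes balance (SMOTE instead synthesises new minority points by interpolating between neighbours). A sketch on toy data:

```python
import numpy as np

# Toy imbalanced data: 90 negatives, 10 positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Random oversampling: resample minority rows (with replacement) until
# the classes balance. SMOTE would instead create synthetic neighbours.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=int((y == 0).sum()) - minority.size)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])
```

In practice one would reach for a dedicated library (e.g. imbalanced-learn) and resample only the training split, never the test set, to avoid leaking duplicated rows into evaluation.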

Instructor Bio

Alexandru Agachi is a co-founder of Empiric Capital, an algorithmic, data-driven asset management firm headquartered in London. He is also a guest lecturer in big data and machine learning at Pierre et Marie Curie University in Paris, and is involved in neuro-oncogenetic research, in particular applications of machine learning. After initial studies at LSE, he completed four graduate and postgraduate degrees and diplomas in technology and science, focusing on the thorium nuclear fuel cycle, surgical robotics, neuroanatomy and imagery, and biomedical innovation. He previously worked at UBP in hedge fund research, at Deutsche Bank, and at the Kyoto University Research Reactor Institute, and conducted an investment consulting project for the CIO office at Investec. He was nominated as one of Forbes’ 30 Under 30 in Finance in 2018.

Alexandru Agachi

Co Founder & COO, Empiric Capital

ODSC EUROPE 2018 | September 19-22

Register Now

What To Expect

As we prepare our 2018 schedule, take a look at some of the previous training and workshop sessions we have hosted at ODSC Europe for an idea of what to expect.

  • High Performance, Distributed Spark ML, Tensorflow AI, and GPU

  • Machine Learning with R

  • Deep Learning with Tensorflow for Absolute Beginners

  • Algorithmic Trading with Machine and Deep Learning

  • Deep Learning in Keras

  • Deep Learning – Beyond the Basics

  • Running Intelligent Applications inside a Database: Deep Learning with Python Stored Procedures in SQL

  • Distributed Deep Learning on Hops

  • R and Spark with Sparklyr

  • Towards Biologically Plausible Deep Learning

  • Machine Learning with R

  • Deep Learning Ensembles in Toupee

  • A Gentle Introduction to Predictive Analytics with R

  • Deep Learning with Tensorflow for Absolute Beginners

  • Graph Data – Modelling and Querying with Neo4j and Cypher

  • Introduction to Data Science with R

  • Introduction to Python in Data Science

  • Machine Learning with R

  • High Performance, Distributed Spark ML, Tensorflow AI, and GPU

  • The Magic of Dimensionality Reduction

  • Analyze Data, Build a UI and Deploy on the Cloud with Apache Spark, Notebooks and PixieDust

  • Data Science for Executives

  • Data Science Learnathon. From Raw Data to Deployment: the Data Science Cycle with KNIME

  • Distributed Deep Learning on Hops

  • Distributed Deep Learning on Hops

  • Drug Discovery with KNIME

  • Interactive Visualisation with R (and just R)

  • Introduction to Algorithmic Trading

  • Introduction to Data Science – A Practical Viewpoint

  • Julia for Data Scientists

  • Machine Learning with R

  • R and Spark with Sparklyr

  • Running Intelligent Applications inside a Database: Deep Learning with Python Stored Procedures in SQL

  • Telling Stories with Data

  • Towards Biologically Plausible Deep Learning

  • Win Kaggle Competitions Using StackNet Meta Modelling Framework

Open Data Science Conference