Introduction to Machine Learning with Andreas Mueller, core developer of scikit-learn

The resurging interest in machine learning is due to multiple factors, including growing volumes and varieties of data and cheaper computational processing. These make it possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results on a very large scale. scikit-learn (http://scikit-learn.org/) has emerged as one of the most popular open source machine learning toolkits, now widely used in academia and industry.

scikit-learn provides easy-to-use interfaces in Python to perform advanced analysis and build powerful predictive models.

Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Curriculum

This workshop will cover basic concepts of machine learning, such as supervised and unsupervised learning, cross-validation and model selection, and how they map to programming concepts in scikit-learn. Andreas will demonstrate how to prepare data for machine learning, and go from applying a single algorithm to building a machine learning pipeline. We will cover the trade-offs of learning on large datasets, and describe some techniques to handle larger-than-RAM and streaming data on a single machine.
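
These concepts map directly onto scikit-learn’s programming interface. As a rough illustration (a sketch on the classic iris dataset, not the workshop’s actual materials), a preprocessing step and a model can be chained into a pipeline and evaluated with cross-validation:

```python
# A minimal sketch of the concepts above: a preprocessing step and an
# estimator chained into a Pipeline, scored with cross-validation.
# Illustrative only; not the workshop's materials.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Chaining scaler and model ensures both are re-fit inside each CV fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```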

Prerequisites

https://github.com/amueller/odsc-masterclass-2017-morning


Introduction to Deep Learning with TensorFlow Contributor and Kaggle Winner Dan Becker

Want to make the jump to deep learning? This workshop covers everything you need to get started. You will use the Keras library to build deep learning models on both numeric and image data.

This workshop will introduce participants to the basic concepts of deep learning and its most promising applications. Participants will get a taste of this exciting field at the intersection of AI and machine learning. The workshop will utilize popular open source deep learning tools combined with practical exercises and programming assignments.

Bio

Dan is the Technical Product Director at DataRobot. He has broad data science expertise, with consulting experience for 6 companies from the Fortune 100, a 2nd-place finish in Kaggle’s $3 million Heritage Health Prize, and contributions to the Keras and TensorFlow libraries for deep learning. Dan has a PhD in Econometrics from the University of Virginia.

Curriculum

– Basic model set-up in Keras (see the sketch after this list)
– Multi-class classification
– Convolutional neural networks
– Debugging deep learning models
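
To give a flavor of the first item, here is a minimal Keras model set-up, sketched with random stand-in data rather than the workshop’s datasets:

```python
# Minimal Keras model set-up on random stand-in data (illustrative
# only; the workshop's exercises use real numeric and image data).
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(100, 20)               # 100 samples, 20 features
y = np.random.randint(0, 3, size=(100,))  # integer labels for 3 classes

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(20,)))
model.add(Dense(3, activation='softmax'))  # one output unit per class

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=16)
```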

Prerequisites

TBD

Developing and Deploying Intelligent Chat Bots with Micheleen Harris

This tutorial will quickly ramp attendees up on the Microsoft Bot Framework, providing sample code upon which to base a bot experience. Attendees will build their own intelligent bot, and will come away able to decide whether to make a bot that solves a repetitive task they have encountered or one they know might be useful to others. User experience will be heavily emphasized to create the best bot experiences, and the necessary components will be laid out for attendees.

Bio

Micheleen is a Data Scientist and trainer at Microsoft, where she shares her Python, R, and advanced analytics experience internally and externally. She has led or co-led workshops around data science and analytics concepts in Python and R, often utilizing Jupyter notebooks for interactive coding. Micheleen has developed a “Python for the Data Scientist” course delivered on Jupyter notebooks, has delivered it at Microsoft several times, and looks forward to its external release. She has also delivered courses utilizing Microsoft Azure and covering DocumentDB, Cognitive Services, the Bot Framework, and other components of the Cortana Intelligence Suite. She enjoys teaching and finding the most effective ways to teach data science and advanced analytics on any size dataset.

Curriculum

  1. Cognitive services overview
    1. What are Cognitive APIs?
    2. Demos
  2. Introduction to the Bot Framework portion
    1. Syllabus
    2. Learning objectives
  3. Bot Framework Overview
    1. What a bot is and is not
    2. The major components of the Bot Framework
    3. Deploying and working with channels
    4. Your arsenal or toolbox
  4. Developer’s Introduction: Building an intelligent bot with the Bot Builder Node.js SDK
    1. Toolbox – Go over prereqs
    2. Setup project in VSCode (and set up debugger)
    3. Get code from course website with Git
    4. Update with Vision API key from Cognitive Services “My Account”
    5. Test with emulator
  5. Create more bots! Follow along or create your own
  6. Summary

Prerequisites

There are a few things you will need in order to take full advantage of the course:

Please bring a laptop with internet connectivity.

  1. Node.js with npm installed locally – get the latest at:
  2. Visual Studio Code [recommended] or equivalent code editing and debugging environment with IntelliSense.
  3. Bot Framework Emulator (Windows and Unix-compatible) installed locally – information and links at
  4. GitHub Account – a code repository and collaboration tool we’ll use
  5. Git Bash – included in git download
  6. [Recommended] Azure account – use the one you have, sign up for a free trial at https://azure.microsoft.com/en-us/free/, or, if you have an MSDN account and Azure as a benefit, link your Microsoft Account or Work/School Account to MSDN and activate the Azure benefit by following this guide

We will assume you already have the following background:

  1. Basic knowledge of using and navigating a Unix-style command line or terminal (for using Git Bash); a good basic guide is at http://linuxcommand.org/lc3_learning_the_shell.php
  2. Familiarity with Git and GitHub as tools for software development, versioning, and collaboration; a great book on Git is at https://git-scm.com/book/en/v2
  3. Familiarity with debugging bots in VS Code, covered in the docs at https://docs.botframework.com/en-us/node/builder/guides/debug-locally-with-vscode/
  4. If you are new to Node, here’s a good video tutorial series at https://www.youtube.com/playlist?list=PL6gx4Cwl9DGBMdkKFn3HasZnnAqVjzHn_

The Full Cycle of Model Development in R and Python Using Shiny and Bokeh with Ali Marami

In this workshop, I explain the dual-time dynamic model, which is widely used in fields such as banking and epidemiology. The workshop includes model implementation, data pre-processing, and validation in R and Python, along with the development of web applications in Shiny and Bokeh. We compare the entire process in both languages and discuss the advantages and disadvantages of each. Participants will learn about the model and the full development cycle, and gain insights for choosing R or Python (or both). The workshop can also serve as a fast track for those familiar with one language who would like to learn the other.

Bio

Ali has a PhD in Finance and a BS in Electrical Engineering. He has extensive experience in financial and quantitative modeling and financial risk management in the banking industry. Currently, he is a Data Science advisor at R-Brain Inc.

Curriculum

In this workshop, I explain the dual-time dynamic model, which is widely used in fields such as banking and epidemiology. The workshop includes model implementation, data pre-processing, and validation in R and Python, along with the development of web applications in Shiny and Bokeh. We compare the entire process in both languages and discuss the advantages and disadvantages of each. Participants will learn about the model and the full development cycle, and gain insights for choosing R or Python (or both). The workshop can also serve as a fast track for those familiar with one language who would like to learn the other.
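
To give a flavor of the Bokeh half of the web-application work (Shiny plays the analogous role on the R side), here is a minimal sketch with made-up numbers; it is illustrative only, not the workshop’s model code:

```python
# Minimal Bokeh sketch with made-up numbers, showing the kind of chart
# a Bokeh web application builds on. Not the workshop's model code.
from bokeh.plotting import figure, output_file, show

months = [1, 2, 3, 4, 5, 6]
default_rate = [0.021, 0.019, 0.024, 0.028, 0.026, 0.031]  # hypothetical

p = figure(title="Hypothetical default rate by month on book",
           x_axis_label="Month on book", y_axis_label="Default rate")
p.line(months, default_rate, line_width=2)

output_file("default_rate.html")  # writes a standalone HTML page
show(p)
```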

Prerequisites

TBD

Modeling in R with Jared Lander

At one point the open source language R was considered the lingua franca for data science programming. As more languages compete for that title, R still has a very passionate following. In fact, one of the main strengths of R is its huge community, which provides open source user-contributed packages (CRAN), documentation, and a very active user support group. R packages are collections of R functions and data that make it easy to get immediate access to the latest techniques and functionality without needing to develop everything from scratch yourself.

Bio

Jared Lander is the Chief Data Scientist at Lander Analytics, a Columbia professor, author of R for Everyone, and organizer of the world’s largest R meetup.

Curriculum

The linear model, and its extensions, forms the backbone of statistical analysis. In this course we cover linear regression using `lm`, generalized linear models using `glm`, and model assessment using `AIC`, `BIC` and other measures. The focus will be mainly on applied programming, though theoretical properties and derivations will be taught where appropriate. Attendees should already have a basic knowledge of linear models and have R and RStudio installed, along with the `UsingR`, `ggplot2` and `coefplot` packages.

Linear Models: learn about the best-fit line, understand the formula interface in R and the design matrix, fit models with `lm`, visualize the coefficients with `coefplot`, and make predictions on new data.

Generalized Linear Models: learn about logistic regression for classification and Poisson regression for count data, fit models with `glm`, and visualize the coefficients with `coefplot`.

Model Assessment: compare models using `AIC` and `BIC`.

Prerequisites

TBD

Getting Started with Deep Learning with Charlie Killam

In this four-session workshop series, you will start with the basic concepts of deep learning and quickly move to learning how to solve real-world problems using deep learning. NVIDIA Deep Learning Institute Certified Instructors will blend lecture and hands-on, real-world exercises to explore how to solve the most challenging problems with deep learning.


Bio

Charles (Charlie) Killam, LP.D. is a Senior Deep Learning Institute Instructor at NVIDIA. Charlie works across all verticals, focusing primarily on the application of deep neural networks (DNNs) in the healthcare space. Prior to NVIDIA, Charlie’s experience includes delivering a data analytics bootcamp for Northeastern University, a geospatial Tableau project for Stanford University, and working with MADlib, an open-source machine learning algorithm library, while at Pivotal.

Curriculum

  • Learn how to leverage deep neural networks (DNN) within the deep learning workflow
  • Solve a real-world image classification problem using NVIDIA DIGITS
  • Walk through the process of data preparation, model definition, model training and troubleshooting
  • Use validation data to test and try different strategies for improving model performance using GPUs
  • Use NVIDIA DIGITS to train a DNN on your own image classification application

Prerequisites

TBD

Approaches to Object Detection with Charlie Killam

In this four-session workshop series, you will start with the basic concepts of deep learning and quickly move to learning how to solve real-world problems using deep learning. NVIDIA Deep Learning Institute Certified Instructors will blend lecture and hands-on, real-world exercises to explore how to solve the most challenging problems with deep learning.


Bio

Charles (Charlie) Killam, LP.D. is a Senior Deep Learning Institute Instructor at NVIDIA. Charlie works across all verticals, focusing primarily on the application of deep neural networks (DNNs) in the healthcare space. Prior to NVIDIA, Charlie’s experience includes delivering a data analytics bootcamp for Northeastern University, a geospatial Tableau project for Stanford University, and working with MADlib, an open-source machine learning algorithm library, while at Pivotal.

Curriculum

  • Learn three approaches to identify a specific feature within an image

  • Compare each in relation to: model training time, model accuracy and speed of detection during deployment

  • Understand the merits of each approach

  • Learn how to detect objects using trained neural networks

Prerequisites

TBD

Deep Learning for Image Segmentation with Charlie Killam

In this four-session workshop series, you will start with the basic concepts of deep learning and quickly move to learning how to solve real-world problems using deep learning. NVIDIA Deep Learning Institute Certified Instructors will blend lecture and hands-on, real-world exercises to explore how to solve the most challenging problems with deep learning.


Bio

Charles (Charlie) Killam, LP.D. is a Senior Deep Learning Institute Instructor at NVIDIA. Charlie works across all verticals, focusing primarily on the application of deep neural networks (DNNs) in the healthcare space. Prior to NVIDIA, Charlie’s experience includes delivering a data analytics bootcamp for Northeastern University, a geospatial Tableau project for Stanford University, and working with MADlib, an open-source machine learning algorithm library, while at Pivotal.

Curriculum

  • Learn how to train and evaluate an image segmentation network

Prerequisites

TBD

Neural Network Deployment with Charlie Killam

In this four-session workshop series, you will start with the basic concepts of deep learning and quickly move to learning how to solve real-world problems using deep learning. NVIDIA Deep Learning Institute Certified Instructors will blend lecture and hands-on, real-world exercises to explore how to solve the most challenging problems with deep learning.


Bio

Charles (Charlie) Killam, LP.D. is a Senior Deep Learning Institute Instructor at NVIDIA. Charlie works across all verticals, focusing primarily on the application of deep neural networks (DNNs) in the healthcare space. Prior to NVIDIA, Charlie’s experience includes delivering a data analytics bootcamp for Northeastern University, a geospatial Tableau project for Stanford University, and working with MADlib, an open-source machine learning algorithm library, while at Pivotal.

Curriculum

  • Learn three approaches for deployment: directly using inference functionality within a deep learning framework, integrating inference within a custom application, and using NVIDIA TensorRT™

  • Learn about the role of batch size in inference performance

  • Learn about various optimizations that can be made in the inference process

  • Explore inference for a variety of different DNN architectures

Prerequisites

TBD

Deploying Python Models As an API with Jed Dougherty

While constructing data pipelines and building models is a core part of the Data Scientist’s job, an often-forgotten facet of the toolkit is how to actually move models into production. In this course we will build a simple model for predicting spam in hotel reviews. We’ll then take that model and expose it as an API using several different tools.

Bio

Jed Dougherty is a Data Scientist working to build the world’s best collaborative Data Science platform at Dataiku. Before coming to Dataiku, he worked on event detection, recommendation systems, and survival analysis in the fields of breaking news and child welfare.

Curriculum

This session covers various deployment strategies for serving a Python machine learning model as an API. Many business applications can make good use of real-time scoring with machine learning, and one of the most approachable languages for building these models is Python. The goal is to show the audience how to actually take a trained Python model and turn it into an API. We’ll start very simple and cover increasingly complex deployment strategies. Throughout, we will consider API throughput and resource tradeoffs, and benchmark our solutions.
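
As a preview of the simplest of those strategies, the sketch below wraps a scikit-learn model in a Flask endpoint. It assumes a model already trained and saved as `model.pkl`; the session’s actual code may differ:

```python
# Sketch of the simplest deployment strategy: a trained scikit-learn
# model behind a Flask endpoint. Assumes a model saved as model.pkl.
from flask import Flask, request, jsonify
import joblib  # also available as sklearn.externals.joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # load once at startup, not per request

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)  # in production, run under gunicorn instead
```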

Prerequisites

  • Basic knowledge of Python and Scikit-Learn
  • We will be using Flask, Celery, and Docker to serve predictions from an API. Basic knowledge of these tools is helpful but not required.
  • The following python packages installed:
    • numpy
    • scipy
    • scikit-learn
    • joblib
    • Flask
    • gunicorn
    • celery[redis]
    • gevent

Create a Codeless Data Pipeline using Dataiku with Jed Dougherty

Data Science, at least in its current state, will always require coding abilities. However, new tools are entering the market that allow teams of Data Scientists and analysts to mix code and clicking with surprising effectiveness. In this session we will see just how far we can push the clicking capabilities of one such tool: Dataiku DSS. We will construct a data pipeline consisting of data from multiple public APIs and flat files, join together and clean the data, perform descriptive analysis, and generate predictive models, all without writing code (blasphemy!).

Bio

Jed Dougherty is a Data Scientist working to build the world’s best collaborative Data Science platform at Dataiku. Before coming to Dataiku, he worked on event detection, recommendation systems, and survival analysis in the fields of breaking news and child welfare.

Curriculum

We will take 2 years of domestic US flight data, then ingest weather data and information about the planes on the flights to create a model that uses the first year of data to predict the second. We will perform the ingestion, data cleaning, and model building processes in Dataiku DSS.

Prerequisites

  • Attendees should have the free version of DSS installed on their local machines.
  • We will be discussing feature selection, model optimization, and model choice so some background in machine learning is recommended.
  • We will be using models from the Scikit-learn package and the XGBoost package.

Data Science with Spark: Beyond the Basics with Adam Breindel

In 2016 Spark firmly established itself as one of data scientists’ favorite scalable machine learning platforms. Spark provides a general machine learning library, MLlib, that is designed for simplicity, scalability, and easy integration with other tools. Spark can also process your data on a single machine in local standalone mode, and can even build models when the input data set is larger than the amount of memory your computer has. Spark provides data scientists with a powerful, unified engine that is both fast (up to 100x faster than Hadoop for large-scale data processing) and easy to use. This allows data practitioners to solve their machine learning problems (as well as graph computation, streaming, and real-time interactive query processing) interactively and at much greater scale. Spark also provides many language choices, including Scala, Java, Python, and R.

Bio

Adam Breindel consults and teaches widely on Apache Spark and other technologies. Adam’s experience includes work with banks on neural-net fraud detection, streaming analytics, cluster management code, and web apps, as well as development at a variety of startup and established companies in the travel, productivity, and entertainment industries. He is excited by the way that Spark and other modern big-data tech remove so many old obstacles to system design and make it possible to explore new categories of interesting, fun, hard problems.

Curriculum

This class is aimed at practitioners who are already familiar with the basics of Apache Spark and have tried the machine learning samples in the Spark docs or some of the ML tutorial examples online. We’ll start from there and work to advance our knowledge of Spark ML. After briefly reviewing some fundamentals of Spark, DataFrames and the Spark ML APIs, the class will then explore the topics below (a short PySpark refresher sketch follows the list):

– Performing feature preparation/transformation beyond the Spark built-in tools
– “Borrowing” functionality from scikit-learn to help us pre-process features in Spark
– Converting DataFrame data to access legacy (RDD) MLlib features that are not yet exposed in the Spark ML DataFrame API
– Implementing data prep operations as reusable components by writing new Transformers and Estimators
– Adding a reusable parallel machine learning algorithm to Spark by creating our own Estimator and Model classes
– Sharing our reusable components with our Python data science colleagues by creating Python wrappers like those built into Spark
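
For orientation, the Pipeline API those topics build on looks roughly like this in PySpark (a sketch with a tiny made-up DataFrame, not the class materials):

```python
# Rough PySpark sketch of the Spark ML Pipeline API: Transformers and
# Estimators composed into a single Pipeline. Made-up data.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0)],
    ["f1", "f2", "label"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)  # yields a PipelineModel

model.transform(df).select("label", "prediction").show()
```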

Prerequisites

TBD

Advanced scikit-learn with Andreas Mueller

The resurging interest in machine learning is due to multiple factors, including growing volumes and varieties of data and cheaper computational processing. These make it possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results on a very large scale. scikit-learn (http://scikit-learn.org/) has emerged as one of the most popular open source machine learning toolkits, now widely used in academia and industry.

scikit-learn provides easy-to-use interfaces in Python to perform advanced analysis and build powerful predictive models.

Bio

Andreas Mueller received his MS degree in Mathematics (Dipl.-Math.) in 2008 from the Department of Mathematics at the University of Bonn. In 2013, he finalized his PhD thesis at the Institute for Computer Science at the University of Bonn. After working as a machine learning scientist at the Amazon Development Center Germany in Berlin for a year, he joined the Center for Data Science at New York University at the end of 2014. In his current position as assistant research engineer at the Center for Data Science, he works on open source tools for machine learning and data science. He has been one of the core contributors to scikit-learn, a machine learning toolkit widely used in industry and academia, for several years, and has authored and contributed to a number of open source projects related to machine learning.

Curriculum

Coming Soon

Prerequisites

https://github.com/amueller/odsc-masterclass-2017-afternoon

Machine Learning in R with Jared Lander

R is often considered the original lingua franca of data science programming. As more languages compete for the title, R maintains a large and passionate following. In fact, one of the main strengths of R is its huge community, which provides open source user-contributed packages (CRAN) for a wide range of data science models and tools. Combined with a very active user support group, R promises to remain popular for the foreseeable future.

Bio

Jared Lander is the Chief Data Scientist at Lander Analytics, a Columbia professor, author of R for Everyone, and organizer of the world’s largest R meetup.

Curriculum

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain, covering the Elastic Net, decision trees, and cross-validation. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `rpart`, `rpart.plot`, `boot`, `ggplot2` and `coefplot` packages.

Elastic Net: learn about penalized regression with the Lasso and Ridge, fit models with `glmnet`, understand the coefficient path, and view coefficients with `coefplot`.

Decision Trees: learn how to make classifications (and regressions) using recursive partitioning, fit models with `rpart`, and make compelling visualizations with `rpart.plot`.

Cross-Validation: learn the reasoning and process behind cross-validation, and cross-validate glm models with `cv.glm`.

Prerequisites

TBD

Open Geospatial Machine Learning with Kevin Stofan

This workshop will guide attendees through the entire geospatial machine learning workflow. Attendees will be exposed to a variety of open source tools used to process, model, and visualize geospatial data, including PySAL, GDAL, and QGIS. We will work through a supervised machine learning problem to predict the sale price of single-family homes in Pinellas County, Florida, using a large number of property, structural, and socioeconomic features. This workshop will focus on concepts unique to handling geospatial data, such as spatial autocorrelation, coordinate transformations, and edge effects.

Bio

Kevin is a Customer Facing Data Scientist at DataRobot and an Adjunct Professor at Pennsylvania State University, where he teaches a graduate-level Geographic Information Systems (GIS) course. He has over 16 years of experience using GIS and geospatial analysis to solve real-world business problems. His experience modeling geospatial phenomena includes geostatistical, spatial econometric, and point pattern analysis.

Curriculum

This workshop will guide attendees through the entire geospatial machine learning workflow. Attendees will be exposed to a variety of open source tools used to process, model, and visualize geospatial data, including PySAL, GDAL, and QGIS. We will work through a supervised machine learning problem to predict the sale price of single-family homes in Pinellas County, Florida, using a large number of property, structural, and socioeconomic features. This workshop will focus on concepts unique to handling geospatial data, such as spatial autocorrelation, coordinate transformations, and edge effects.

Prerequisites

TBD

Natural Language Processing and Text Mining in Python with Michael Galvin

We’ll cover NLP and text mining using Python and give several examples of real-world applications. The workshop will start by introducing various text processing techniques and move into text classification, clustering, and topic modeling. Finally, we’ll cover techniques for natural language processing based on deep learning. By the end, participants will be able to use Python to explore and build their own models on text data.

Bio

Michael is the Executive Director of Data Science at Metis. He came to Metis from General Electric where he worked to establish their data science strategy and capabilities for field services and to build solutions supporting Global operations, risk, engineering, sales, and marketing. Prior to GE, Michael spent several years as a data scientist working on problems in credit modeling at Kabbage and corporate travel and procurement at TRX. Michael holds a Bachelor’s degree in Mathematics and a Master’s degree in Computational Science and Engineering from the Georgia Institute of Technology where he also spent 3 years working on machine learning research problems related to computational biology and bioinformatics. Additionally, Michael spent 12 years in the United States Marine Corps where he held various leadership roles within aviation, logistics, and training units.

Curriculum

In this session we’ll cover NLP and text mining using Python and give several examples of real-world applications. The session will start by introducing various text processing techniques and move into text classification, clustering, and topic modeling. Finally, we’ll cover techniques for natural language processing based on deep learning. By the end, participants will be able to use Python to explore and build their own models on text data.
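
As a small taste of the text classification portion, a bag-of-words classifier in scikit-learn can be as short as the sketch below (toy data, not the session’s corpus):

```python
# Tiny text-classification sketch: TF-IDF features plus a linear
# classifier, on toy data. Illustrative of the approach only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, works well", "terrible, broke in a day",
         "love it", "waste of money"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["really great value"]))
```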

Prerequisites

TBD

Machine Learning with H2O Open Platform (Morning Session) with Jo-fai (Joe) Chow

H2O is a fast, scalable, open-source machine learning and deep learning platform. Using in-memory compression techniques, H2O can handle billions of data rows in memory, even on small computer clusters. The platform includes interfaces for R, Python, Scala, Java, JS, and JSON, along with an interactive graphical Flow interface that makes it easier for non-engineers to stitch together complete analytic workflows. H2O was built alongside (and on top of) both Hadoop and Spark clusters and can be deployed within minutes. It is a math and machine learning engine that brings distribution and parallelism to powerful algorithms, enabling you to make better predictions and build more accurate models faster.

Bio

Jo-fai (Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab as a data science evangelist promoting products via blogging and giving talks at meetups.

Curriculum

  • Intro to the H2O platform and distributed computing
  • Supervised H2O algorithms (see the Python sketch after this list)
  • Unsupervised H2O algorithms
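
A rough sketch of that supervised workflow from Python; the file name `train.csv` and its `response` column are made-up assumptions, not workshop materials:

```python
# Rough sketch of a supervised H2O workflow from Python. The file
# train.csv and its "response" column are made-up assumptions.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()  # starts (or connects to) a local H2O cluster

frame = h2o.import_file("train.csv")              # data lives in the cluster
frame["response"] = frame["response"].asfactor()  # mark as classification
predictors = [c for c in frame.columns if c != "response"]

gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(x=predictors, y="response", training_frame=frame)
print(gbm.auc())  # training metrics; use validation frames in practice
```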

Prerequisites

TBD

Deploying and Scaling Spark ML and TensorFlow AI Models with Chris Fregly, Research Scientist, Contributor, Author and Trainer

Apache Spark is becoming increasingly popular for big data science projects. Spark can handle large volumes of data significantly faster and more easily than other platforms, and it includes tools for real-time processing, machine learning, and interactive SQL. It is quickly being adopted by industry to achieve business objectives that need data and data science at scale.

Bio

Chris Fregly is a Research Scientist at Pipeline.IO. Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix. When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.

Curriculum

In this workshop, we will train, deploy, and scale Spark ML and TensorFlow AI models in a distributed, hybrid-cloud and on-premise production environment. We will use 100% open source tools including TensorFlow, Spark ML, Jupyter Notebook, Docker, Kubernetes, and NetflixOSS microservices. The workshop will discuss the trade-offs of mutable vs. immutable model deployments, on-the-fly JVM byte-code generation, global request batching, microservice circuit breakers, and dynamic cluster scaling, all from within a Jupyter notebook. All code and Docker images are 100% open source and available from GitHub and DockerHub at http://pipeline.io.
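
One small but concrete piece of that workflow: a fitted Spark ML pipeline can be persisted so a separate serving process can reload and score with it. A minimal sketch, with made-up data and paths (the workshop’s actual stack goes well beyond this):

```python
# Minimal sketch of persisting a fitted Spark ML pipeline so another
# process can reload and serve it. Data and paths are made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("deploy-sketch").getOrCreate()
train = spark.createDataFrame([(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)], ["x", "y"])

pipe = Pipeline(stages=[
    VectorAssembler(inputCols=["x"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="y"),
])
model = pipe.fit(train)

model.write().overwrite().save("/tmp/lr-pipeline")  # training side
served = PipelineModel.load("/tmp/lr-pipeline")     # serving side
served.transform(train).select("x", "prediction").show()
```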

Prerequisites

TBD

Deep Learning with H2O Open Platform with Jo-fai (Joe) Chow

H2O is a fast, scalable, open-source machine learning and deep learning platform. Using in-memory compression techniques, H2O can handle billions of data rows in memory, even on small computer clusters. The platform includes interfaces for R, Python, Scala, Java, JS, and JSON, along with an interactive graphical Flow interface that makes it easier for non-engineers to stitch together complete analytic workflows. H2O was built alongside (and on top of) both Hadoop and Spark clusters and can be deployed within minutes. It is a math and machine learning engine that brings distribution and parallelism to powerful algorithms, enabling you to make better predictions and build more accurate models faster.

Bio

Jo-fai (Joe) Chow is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab as a data science evangelist promoting products via blogging and giving talks at meetups.

Curriculum

  • Gridlines
  • Pipe Search
  • Deep Water

Prerequisites

TBD

Intro to Text Analytics with Ted Kwartler

Much of the incredibly diversified data now being collected is of the unstructured variety: text, speech, email, social media, and so on. Text analytics, along with sentiment analysis and social analytics, helps you understand, at scale, the voice of the public or your customers. Text analytics is applied in an array of industries, such as healthcare, finance, media, and consumer markets, to successfully gain customer insights. Combining text analytics with machine learning, semantic analysis, and deep linguistic parsing can lead to many useful applications, making this a popular topic for data scientists to master.

Bio

Ted Kwartler is the Director of Customer Success at DataRobot, where he manages the end-to-end customer journey. He advocates for and integrates customer innovation into everyday culture and work, and helps define and organize all customer service functions and key performance indicators, balancing data-driven customer analytics with qualitative feedback to continuously improve the customer experience. Specialties: statistical forecasting and data mining, IT service management, customer service process improvement and project management, and business analytics.

Curriculum

Attendees will learn the foundations of text mining approaches, in addition to basic text mining scripting functions used in R. The audience will learn what text mining is, then perform primary text mining tasks such as keyword scanning, dendrogram and word cloud creation. Later, participants will move on to more sophisticated analysis, including polarity, topic modeling, and named entity recognition.

Prerequisites

TBD

Apache Drill with Charles Givre

Aside from needing nothing more than ANSI SQL to run queries and using conventional ODBC/JDBC connectors for access to the data, Drill doesn’t require schemas to be defined for the data before querying. This means less involvement from IT to prepare data for analysis; anyone with a suitable toolset and the proper permissions can plug in and begin querying. Apache Drill works well with self-describing data as well as complex data types. It offers low-latency queries and supports large datasets.
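
Because Drill speaks plain SQL over standard interfaces, a client can be as simple as an HTTP call to Drill’s REST API. A hedged sketch, assuming a default local Drill install with its bundled sample data:

```python
# Sketch of querying Apache Drill over its REST API with plain SQL.
# Assumes a default Drill install on localhost with its bundled
# sample data; endpoint and paths may differ in your environment.
import requests

query = {
    "queryType": "SQL",
    # Drill queries the file directly; no schema definition needed first
    "query": "SELECT employee_id, full_name FROM cp.`employee.json` LIMIT 5",
}
resp = requests.post("http://localhost:8047/query.json", json=query)
for row in resp.json()["rows"]:
    print(row)
```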

Bio

Coming Soon

Curriculum

Coming Soon

Prerequisites

TBD

Programming with Data: Python and Pandas with Accomplished Data Scientist Daniel Gerlanc

The great debate around “can you be a data scientist without being a coder” still goes on. However, the consensus among employers is “no,” and one of the most popular answers to that question is Python. As a financial quant and highly sought-after data science consultant, Daniel Gerlanc has the experience to quickly make you a practical and productive data scientist with Python.
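
To set expectations for the level, the kind of load-filter-group-aggregate work pandas makes concise looks like the toy sketch below (made-up data, not the workshop’s curriculum):

```python
# Toy pandas sketch of the kind of data manipulation the workshop
# covers: build a frame, filter rows, group, and aggregate. Made-up data.
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT", "MSFT", "GOOG"],
    "ret": [0.01, -0.02, 0.015, 0.003, -0.01],
})

positive = df[df["ret"] > 0]  # keep only positive-return rows
print(positive.groupby("ticker")["ret"].mean())  # mean return per ticker
```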

Bio

Daniel Gerlanc is a highly respected former hedge fund quant and much sought-after data scientist, with a well-earned reputation for helping companies improve their modeling techniques and unblock critical issues. His workshop is a shortened version of one he has delivered internally to top hedge funds and Fortune 100 companies.

Daniel has worked as a data scientist for over 10 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors Inc, a predictive analytics consultancy, in 2011. At Enplus, he works with clients in different industries to improve existing analytic processes and develop new ones. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups. He is a graduate of Williams College.

Curriculum

Coming Soon

Prerequisites

TBD

Sign Up for Masterclass Summit 2017 | March 1-2
