ODSC’s free webinar series serves to educate our community on the languages, tools, and topics of AI and Data Science.
Jason Prentice, Senior Manager, Data Science at S&P Global Market Intelligence
“Mapping the Global Supply Chain Graph”
Nov 15th, 1-2 PM EST
Mapping the Global Supply Chain Graph
Panjiva maps the network of global trade using over one billion shipping records sourced from 15 governments around the world. We perform large-scale entity extraction and entity resolution from this raw data, identifying over 8 million companies involved in international trade, located across every country in the world. Moreover, we track detailed information on the 25 million+ relationships between them, yielding a map of the global trade network with unprecedented scope and granularity. We have developed a powerful platform facilitating search, analysis, and visualization of this network as well as a data feed integrated into S&P Global’s Xpressfeed platform.
We can explore the global supply chain graph at many levels of granularity. At the micro level, we can surface the close relationships around a given company to, for example, identify overseas suppliers shared with a competitor. At the macro level, we can track patterns such as the flow of products among geographic areas or industries. By linking to S&P Global’s financial and corporate data, we can understand how supply chains flow within or between multinational corporate structures and correlate trade volumes and anomalies to financial metrics and events.
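One micro-level query described above, surfacing overseas suppliers shared with a competitor, can be sketched as a simple graph intersection. The companies and supplier sets below are invented for illustration; Panjiva's actual platform operates on 8 million+ companies and 25 million+ relationships.

```python
# Hypothetical sketch: finding suppliers shared between two companies.
# The adjacency map below is toy data, not real Panjiva records.

# buyer -> set of supplier names
supply_graph = {
    "Acme Corp": {"Shenzhen Widgets", "Hanoi Textiles", "Mumbai Metals"},
    "Rival Inc": {"Shenzhen Widgets", "Busan Plastics", "Mumbai Metals"},
}

def shared_suppliers(graph, company_a, company_b):
    """Return suppliers that appear in both companies' supplier sets."""
    return graph.get(company_a, set()) & graph.get(company_b, set())

overlap = shared_suppliers(supply_graph, "Acme Corp", "Rival Inc")
```

At Panjiva's scale the same intersection would run against a graph store rather than an in-memory dict, but the query shape is the same.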
Presenter bio - Jason Prentice, Senior Manager, Data Science at S&P Global Market Intelligence
Jason Prentice leads the data team at Panjiva, where he focuses on developing the fundamental machine learning technologies that power our data collection. Before joining Panjiva as a data scientist, he researched computational neuroscience as a C.V. Starr fellow at Princeton University and earned a Ph.D. in Physics from the University of Pennsylvania.
“ODSC West Warm-Ups” Six 30-min Sessions
7 speakers from ODSC West conference
Matthew Rubashkin, Ph.D. AI Program Director at Insight Data Science
We are bringing a workshop on how to build your own representations for both image and text data and efficiently perform similarity search. By the end of this workshop, you should be able to build a quick semantic search model from scratch, no matter the size of your dataset.
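The core of a semantic search system is straightforward: represent each item as a dense vector (an embedding) and answer queries by nearest-neighbor search under a similarity measure such as cosine similarity. A minimal brute-force sketch, with made-up 3-d vectors standing in for embeddings from a trained image or text model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "corpus": item -> embedding (vectors invented for illustration).
corpus = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.3, 0.1],
    "tax form":  [0.0, 0.2, 0.9],
}

def search(query_vec, corpus, k=1):
    """Return the k corpus keys most similar to the query vector."""
    ranked = sorted(corpus, key=lambda key: cosine(query_vec, corpus[key]),
                    reverse=True)
    return ranked[:k]
```

Brute force scales linearly in corpus size; for large datasets one would swap in an approximate nearest-neighbor index, which is the kind of efficiency question the workshop addresses.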
Presenter bio - Matthew Rubashkin, Ph.D. AI Program Director at Insight Data Science
Michael Mahoney, PhD, Professor at UC Berkeley
“Matrix Algorithms at Scale: Randomization and using Alchemist to bridge the Spark-MPI gap”
Matrix Algorithms at Scale: Randomization and using Alchemist to bridge the Spark-MPI gap
In this talk, we will describe some of the underlying randomized linear algebra techniques. Finally, we’ll describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address the Spark–MPI performance gap. The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets. We describe use cases from scientific data analysis that motivated the development of Alchemist and that benefit from this system. We’ll also describe related work on communication-avoiding machine learning, optimization-based methods that can call these algorithms, and extending Alchemist to provide an IPython notebook <=> MPI interface.
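To give a flavor of randomized linear algebra, here is a toy sketch of one classic idea: approximating a matrix product by sampling a few column/row pairs and rescaling so the estimate is unbiased. This is an illustration of the general technique, not the specific algorithms from the talk, and it uses uniform sampling; norm-based sampling probabilities reduce the variance substantially.

```python
import random

def mat_mult(A, B):
    """Exact product of row-major list-of-lists matrices."""
    n = len(B)  # inner dimension
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(len(B[0]))] for i in range(len(A))]

def sampled_mult(A, B, c, rng):
    """Unbiased estimate of A @ B from c uniformly sampled column/row pairs."""
    n = len(B)
    est = [[0.0] * len(B[0]) for _ in range(len(A))]
    for _ in range(c):
        k = rng.randrange(n)        # pick a column of A / row of B
        scale = n / c               # rescaling keeps the estimate unbiased
        for i in range(len(A)):
            for j in range(len(B[0])):
                est[i][j] += scale * A[i][k] * B[k][j]
    return est
```

The payoff at scale is that c can be much smaller than the inner dimension n, trading a controlled amount of error for a large reduction in computation and communication.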
Presenter Bio - Michael Mahoney, PhD, Professor at UC Berkeley
Michael Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he spent fall 2013 at UC Berkeley co-organizing the Simons Foundation’s program on the Theoretical Foundations of Big Data Analysis.
This talk will discuss Docker as a tool for the data scientist, in particular in conjunction with the popular interactive programming platform, Jupyter, and the cloud computing platform, Amazon Web Services (AWS). Using Docker, Jupyter, and AWS, the data scientist can take control of their environment configuration, prototype scalable data architectures, and trivially clone their work toward replicability and communication. This talk will work toward developing a set of best practices for Engineering for Data Science.
Presenter Bio - Joshua Cook, Curriculum Developer at Databricks
Joshua Cook is a mathematician. He writes code in Bash, C, and Python and has done pure and applied computational work in geospatial predictive modeling, quantum mechanics, semantic search, and artificial intelligence. He also has ten years’ experience teaching mathematics at the secondary and post-secondary levels. His research interests lie in high-performance computing, interactive computing, feature extraction, and reinforcement learning. He is always willing to discuss orthogonality or to explain why Fortran is the language of the future over a warm or cold beverage.
Nisha Talagala, CTO/VP of Engineering at ParallelM
“Bringing Your Machine Learning and Deep Learning Algorithms to Life: From Experiments to Production Use”
Bringing Your Machine Learning and Deep Learning Algorithms to Life: From Experiments to Production Use
In this hands-on workshop, attendees will learn how to take Machine Learning and Deep Learning programs into a production use case and manage the full production lifecycle. This workshop is targeted at data scientists with some basic knowledge of Machine Learning and/or Deep Learning algorithms who would like to learn how to bring their promising experimental results on ML and DL algorithms into production success. In the first half of the workshop, attendees will learn how to develop an ML algorithm in a Jupyter notebook and transition this algorithm into an automated production scoring environment using Apache Spark. The audience will then learn how to diagnose production scenarios for their application (for example, data and model drift) and optimize their ML performance further using retraining. In the second half of the workshop, users will perform a similar exercise for Deep Learning. They will learn how to experiment with Convolutional Neural Network algorithms in TensorFlow and then deploy their chosen algorithm into production use. They will learn how to monitor the behavior of Deep Learning algorithms in production and approaches to optimizing production DL behavior via retraining and transfer learning.
Attendees should have basic knowledge of ML and DL algorithm types. Deep mathematical knowledge of algorithm internals is not required. All experiments will use Python. Environments will be provided in Azure for hands-on use by all attendees. Each attendee will receive an account for use during the workshop and access to the notebook environments, Spark and TensorFlow engines, as well as an ML lifecycle management environment. For the ML experiments, sample algorithms and public data sets will be provided for Anomaly Detection and Classification. For the DL experiments, sample algorithms and public data sets will be provided for Image Classification and Text Recognition.
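Data drift, one of the production scenarios mentioned above, can be illustrated with a very simple check: compare a feature's live distribution against its training baseline and flag a large standardized shift. This is a hedged, minimal sketch; the workshop's actual lifecycle tooling uses richer statistics, but the idea is the same.

```python
# Minimal data-drift check: flag when a feature's production mean drifts
# too far (in baseline standard deviations) from its training mean.
# Thresholds and data below are illustrative assumptions.

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def drift_score(baseline, live):
    """Standardized shift of the live mean relative to the baseline."""
    s = std(baseline)
    return abs(mean(live) - mean(baseline)) / s if s else float("inf")

def has_drifted(baseline, live, threshold=1.0):
    """True when the live distribution has shifted beyond the threshold."""
    return drift_score(baseline, live) > threshold
```

In production, a check like this would run continuously on scored data and trigger retraining when it fires.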
Presenter Bio - Nisha Talagala, CTO/VP of Engineering at ParallelM
Nisha Talagala is Co-Founder, CTO/VP of Engineering at ParallelM, a startup focused on Production Machine Learning. As a Fellow at SanDisk and Fellow/Lead Architect at Fusion-io, she led advanced technology development in Non-Volatile Memory and applications. Nisha has more than 15 years of expertise in software, distributed systems, machine learning, persistent memory, and flash. Nisha was also technology lead for server flash at Intel and the CTO of Gear6. Nisha earned her PhD at UC Berkeley on distributed systems research. Nisha holds 54 patents, is a frequent speaker at both industry and academic conferences, and serves on multiple technical conference program committees.
Kirk Borne, PhD, Principal Data Scientist, Executive Advisor Booz Allen Hamilton
“Solving the Data Scientist’s Dilemma – The Cold Start Problem”
Solving the Data Scientist's Dilemma - The Cold Start Problem
Supervised machine learning is a great tool when you have labeled training data and known classes that you are trying to predict for new, previously unseen data. But those assumptions of labeled data and known classes generally do not hold in unsupervised machine learning. So, how can you maximize data science outcomes, benefits, and applications when faced with the cold start problem? We will discuss this challenge and some solutions with several illustrative examples.
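One common cold-start tactic, offered here as an illustrative assumption rather than the speaker's prescribed solution, is to cluster the unlabeled data first and treat the clusters as provisional classes to inspect and label. A toy 1-D k-means with deterministic initialization:

```python
# Toy k-means on 1-D points: cluster unlabeled data, then use cluster
# membership as provisional labels. Real work would use a proper library
# and higher-dimensional features; the data here is invented.

def kmeans_1d(points, k=2, iters=10):
    """Cluster 1-D points into k groups; returns (centers, labels)."""
    step = max(1, len(points) // k)
    centers = sorted(points)[::step][:k]  # spread-out deterministic init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # recompute each center as the mean of its assigned points
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
    return centers, labels
```

Once clusters stabilize, a domain expert can name them, turning an unsupervised start into supervised training data.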
Presenter bio - Kirk Borne, PhD. Principal Data Scientist, Executive Advisor Booz Allen Hamilton
Kirk Borne is a data scientist and an astrophysicist who has used his talents at Booz Allen since 2015. He was professor of astrophysics and computational science at George Mason University (GMU) for 12 years. Kirk spent nearly 20 years supporting NASA projects.
Sean Patrick Gorman, PhD, Head of Technical Product Management, DigitalGlobe
Steven Pousty, Director of Developer Relations, DigitalGlobe
“How to use Satellite Imagery to be a Machine Learning Mantis Shrimp”
How to use Satellite Imagery to be a Machine Learning Mantis Shrimp
In this session we are going to start by showing you how satellite imagery actually allows you to “see” in more bands of color than the mantis shrimp (how about 26 bands); each band is a massive amount of data about the earth. We will show you how you can work with this data in Jupyter notebooks to extract all sorts of information about the world. Lastly, we will wrap up with how to make ML models using this data, extract features we care about, and then run them through a cloud-based processing model.
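A small taste of the band math involved: NDVI, a standard vegetation index computed per pixel from the red and near-infrared (NIR) bands. The pixel values below are invented for illustration; real imagery would supply one array per spectral band.

```python
# NDVI = (NIR - red) / (NIR + red), a standard per-pixel vegetation index.
# Values near +1 suggest dense vegetation; values near 0 or below suggest
# bare ground, water, or built surfaces. Pixel data here is made up.

def ndvi(red, nir):
    """Normalized Difference Vegetation Index for one pixel."""
    return (nir - red) / (nir + red) if (nir + red) else 0.0

# A tiny 2x2 "image": each pixel is a (red, nir) reflectance pair.
pixels = [[(0.1, 0.6), (0.2, 0.2)],
          [(0.5, 0.1), (0.1, 0.8)]]

ndvi_map = [[ndvi(r, n) for (r, n) in row] for row in pixels]
```

With 26 bands available, the same pattern extends to many other indices, and the resulting per-pixel features feed directly into the ML models discussed in the session.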
Presenter Bio - Sean Patrick Gorman, PhD, Head of Technical Product Management, DigitalGlobe
Sean Patrick Gorman, PhD. Sean is the Head of Technical Product Management at DigitalGlobe, helping build GBDX and next-generation machine learning tools for satellite imagery. Sean received his PhD from George Mason University as the Provost’s High Potential Research Candidate, Fisher Prize winner, and an INFORMS Dissertation Prize recipient.
Steven Pousty. Steve is the Developer Relations lead for DigitalGlobe. He goes around and shows off all the great work the DigitalGlobe engineers do. Steve has a Ph.D. in Ecology from the University of Connecticut.