Solving the Data Scientist’s Dilemma: the Cold-start Problem with 10+ Machine Learning Examples

Abstract: 

Many unsupervised learning applications (including the discovery of trends, correlations, principal components, clusters, and interesting associations in the data) can converge more readily to a useful and valuable model if we know in advance which parameterizations of the model are best to choose. Feature engineering can help, but sometimes the best choice of features remains within a "black box". If we cannot know the best parameterizations in advance (i.e., because this truly is unsupervised learning), then we would at least like to know that our final model is optimal (in some way) in explaining the important characteristics and signals in the data. Similarly, in supervised machine learning, the availability and application of labeled data (things past) are fundamental for the accurate labeling of previously unseen data (things future). Without labels (diagnoses, classes, known outcomes) in past data, how do we make progress in labeling (explaining) future data? This would be a problem. In both these applications (supervised and unsupervised machine learning), if we don’t have these initial insights and validation metrics, then how does such model-building get started and get moving towards the optimal solution? This challenge (or, data scientist's dilemma) is known as the cold-start problem! The solution to the problem is easy (sort of): we make a guess, an initial guess! Usually, that would be a totally random guess. That sounds so random, so wrong, so bad! But there is an orderly and productive way forward from such a start, which we will describe in this workshop. The theme of the workshop at this point could be stated in this way: "Machine Learning is the set of mathematical algorithms that learn from experience. Good judgment comes from experience. And experience comes from bad judgment."
We will present at least 10 examples of cold-start problems, with suggested solutions (i.e., solutions that move from a bad initial random guess to a good, perhaps optimal, result), covering a variety of different algorithms and applications, focused primarily on unsupervised learning, but with some supervised learning examples also. We will also introduce related concepts and their importance, including the objective function, genetic algorithms, backpropagation, gradient descent, and meta-learning. Those concepts represent the true keys that unlock performance in a cold-start challenge. Those are the magic ingredients in most of the examples that we will present. At the end of the workshop, you should be empowered, enabled, and emboldened to tackle similar machine learning challenges in other domains. After all, data are data, math is math, and good experience is transferable!
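As a rough illustration of the idea (not taken from the workshop materials), the sketch below starts from a random guess for the two parameters of a straight-line model and uses gradient descent on a mean-squared-error objective function to move toward a good fit. The function name fit_line, the toy data, the learning rate, and the step count are all assumptions chosen for this example.

```python
import random


def fit_line(xs, ys, lr=0.05, steps=2000, seed=0):
    """Fit y = w*x + b by gradient descent from a random initial guess.

    Hypothetical helper for illustration only; lr, steps, and seed are
    arbitrary choices that happen to work for the toy data below.
    """
    rng = random.Random(seed)
    # Cold start: we know nothing, so we guess the parameters at random.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    history = []  # value of the objective function at each step

    for _ in range(steps):
        # Objective function: mean squared error of the current guess.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)

        # Gradient of the objective with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n

        # Learn from the "bad judgment": step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b

    return w, b, history


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, history = fit_line(xs, ys)
print(f"slope={w:.3f} intercept={b:.3f}")
print(f"first loss={history[0]:.3f} last loss={history[-1]:.3e}")
```

The initial guess is poor by construction, yet the recorded objective values shrink as the updates accumulate. The same pattern of guessing, scoring, and adjusting is one way to read the cold-start examples the abstract describes.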

Bio: 

Dr. Kirk Borne is the Principal Data Scientist and an Executive Advisor at global technology and consulting firm Booz Allen Hamilton. In those roles, he focuses on applications of data science, data management, machine learning, A.I., and modeling across a wide variety of disciplines. He also provides training and mentoring to executives and data scientists within numerous external organizations, industries, agencies, and partners in the use of large data repositories and machine learning for discovery, decision support, and innovation. Previously, he was Professor of Astrophysics and Computational Science at George Mason University for 12 years where he did research, taught, and advised students in data science. Prior to that, Kirk spent nearly 20 years supporting data systems activities on NASA space science programs, which included a period as NASA's Data Archive Project Scientist for the Hubble Space Telescope. Dr. Borne has a B.S. degree in Physics from LSU, and a Ph.D. in Astronomy from Caltech. In 2016 he was elected Fellow of the International Astrostatistics Association for his lifelong contributions to big data research in astronomy. As a global speaker, he has given hundreds of invited talks worldwide, including conference keynote presentations at many dozens of data science, A.I. and big data analytics events globally. He is an active contributor on social media, where he has been named consistently among the top worldwide influencers in big data and data science since 2013. He was recently identified as the #1 digital influencer worldwide for 2018-2019. You can follow him on Twitter at @KirkDBorne.