Abstract: What can we learn about data science by watching data science competitions?
During a data science competition like the ones hosted by DrivenData and Kaggle, the leaderboard lists the teams that have submitted models and the scores the top models have achieved.
As the competition proceeds, the scores often improve quickly as teams explore a variety of models and then more slowly as they approach the limits of what’s possible.
Using 170,000 scores from more than 50 competitions hosted by DrivenData, we explore the aggregated behavior of the competing teams.
What patterns can we see?
Based on early returns, can we predict the limits?
What factors influence the time and number of submissions it takes to reach the performance plateau?
Do models tend to overfit the data as the contest progresses?
And what guidance can we provide for deciding when to stop searching?
In this talk, we will answer these questions and share additional observations from the other side of the leaderboard.
Bio: Allen Downey is a Professor of Computer Science at Olin College of Engineering in Needham, MA. He is the author of several books related to computer science and data science, including Think Python, Think Stats, Think Bayes, and Think Complexity. Prof. Downey has taught at Colby College and Wellesley College, and in 2009 he was a Visiting Scientist at Google. He received his Ph.D. in Computer Science from U.C. Berkeley, and M.S. and B.S. degrees from MIT.