Data Excellence: Better Data for Better AI


Human-annotated data plays a crucial role in the current ML/AI climate, where human judgements are treated as the ultimate source of truth. As such, human-annotated data is a kind of compass for AI, and research on Human Computation has a multiplicative effect on the AI field. Optimizing the cost, scale, and speed of data collection has been at the center of Human Computation research. What is less known is that such optimization is sometimes done at the cost of quality [Riezler, 2014]. Quality is evidently important but unfortunately poorly defined and rarely measured. A decade later, problems inherent to data are beginning to surface: fairness and bias [Goel and Faltings, 2019], quality issues [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility in ML research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], and lack of documentation [Katsuno et al., 2019]. Finally, in the rush to be first to market, aspects of data quality such as maintainability, reliability, validity, and fidelity are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Currently, to the extent that it happens at all, data excellence arises organically by virtue of individual expertise, diligence, commitment, pride, etc. This could be dangerous: as we grow increasingly dependent on automation technologies, we don't want to be at the mercy of individual exemplars. Instead, we should codify data excellence in a systematic manner and raise standards across the entire industry.


Lora Aroyo is a Research Scientist at Google, NY, currently working on human-labeled data quality. She is best known for her work on the CrowdTruth crowdsourcing methodology. Throughout her career, Lora has been a principal investigator on a large number of research projects bringing together methods and tools from human computation, linked (open) data, data science, and human-computer interaction, with the goal of building hybrid human-AI systems for understanding text, images, and videos with humans in the loop. Her research projects focusing on personalized access to online multimedia have had a major impact and established her as a recognized leader in human computation techniques for digital humanities, cultural heritage, and interactive TV. Prior to joining Google, she worked at the VU University Amsterdam as Full Professor in Computer Science and was Chief Scientist at the NY-based startup Tagasauris. She is a four-time recipient of the IBM Faculty Award for her work on CrowdTruth, which was used in adapting the IBM Watson system to the medical domain and in capturing ambiguity in understanding misinformation. She is currently president of User Modeling Inc., which acts as a steering committee for the UMAP conference series.

Open Data Science




One Broadway
Cambridge, MA 02142
