How to Get Quality Data Labels

Abstract: Crowdsourcing training data for AI provides unprecedented capacity for gathering human-generated data labels. However, crowdsourced data can vary significantly in quality. Variations in worker skill, training, and experience can affect quality, as can the complexity of the labeling user experience (UX) and natural human error or fatigue. Bad actors can also pollute collected data with fraudulent results. All of these variations are potential sources of error for a trained model. Data scientists should understand how the way they design and constrain their labeling tasks can affect the quality of their data. This presentation focuses on getting high-quality results from a crowd, with specific insights on how improving the worker experience can produce better labels while also benefiting the workers.

Bio: Dr. Cheryl Martin joined Alegion in 2018. She previously directed the Center for Content Understanding and the Cyber Information Assurance and Decision Support Group at the Applied Research Laboratories, The University of Texas at Austin (ARL:UT). Dr. Martin's areas of expertise include distributed artificial intelligence, machine learning, rule-based systems, and dynamically adaptive software. Through her work at ARL:UT, she has applied data mining, detection, and inference techniques to information assurance problems and cyber security challenges such as intrusion detection and insider threat detection. Her work in combining semantic knowledge models, natural language processing, expert systems, and machine learning to categorize and label text has been successful in automatically determining whether documents contain sensitive information that must be protected with respect to classification decisions and review and release.

Open Data Science Conference