Abstract: Most businesses rely on traditional analytics over structured data for day-to-day business insights, even though they own very rich text data. To leverage text data, companies typically need an in-house machine learning team with natural language processing expertise, along with a large investment in problem-specific feature engineering and manual curation of training data by subject matter experts (SMEs). This process becomes too expensive in an agile business environment where problem definitions change frequently.
Chegg has multiple student-centric products: online tutoring, help with answering study questions, ACT/SAT preparation, writing help, and others. The answers to many business questions are hidden in the chats and questions submitted by students.
Many students come to the Chegg Tutors platform and ask tutors to do their graded assignments or quizzes for them, which violates Chegg’s honor code policies. We use text data (questions submitted by students and chats) and apply Snorkel, a dark data extraction tool developed at Stanford, to create an honor code violation detector (HCVD). This process takes input from SMEs and business partners and converts it into noisy heuristic rules, which are combined using a generative model to produce high-quality training data. Once training data is available, HCVD detects key phrases (for example, "do my online quiz") that indicate an honor code violation and indicates the action the system needs to take, such as warnings, advising tutors, or blocking.
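The idea of turning SME input into noisy heuristic rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HCVD implementation: the rule names, regular expressions, and label values are hypothetical, and a simple majority vote stands in for the generative model that Snorkel fits over the rules' agreements and disagreements.

```python
import re

# Hypothetical label convention for this sketch.
VIOLATION, OK, ABSTAIN = 1, 0, -1


def lf_do_my_work(text):
    """Flag requests to complete graded work, e.g. 'do my online quiz'."""
    pattern = (r"\b(do|take|finish|complete)\s+my\s+(\w+\s+)?"
               r"(quiz|exam|test|assignment|homework)\b")
    return VIOLATION if re.search(pattern, text.lower()) else ABSTAIN


def lf_graded_urgency(text):
    """Flag mentions of graded work combined with a deadline cue."""
    t = text.lower()
    graded = re.search(r"\b(graded|quiz|exam)\b", t)
    urgent = re.search(r"\b(due|deadline|timed|right now)\b", t)
    return VIOLATION if graded and urgent else ABSTAIN


def lf_wants_explanation(text):
    """Treat requests for understanding as legitimate tutoring."""
    pattern = r"\b(explain|understand|help me learn|walk me through)\b"
    return OK if re.search(pattern, text.lower()) else ABSTAIN


# Each rule is noisy on its own; the rules are combined to label a message.
LABELING_FUNCTIONS = [lf_do_my_work, lf_graded_urgency, lf_wants_explanation]


def majority_vote(text):
    """Combine rule votes; a stand-in for Snorkel's generative label model."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no rule fired, so the message stays unlabeled
    positives = sum(1 for v in votes if v == VIOLATION)
    negatives = len(votes) - positives
    if positives == negatives:
        return ABSTAIN  # rules disagree evenly, so leave it unlabeled
    return VIOLATION if positives > negatives else OK


if __name__ == "__main__":
    for message in [
        "Can you do my online quiz? It is due tonight.",
        "Please explain how integration by parts works.",
    ]:
        print(majority_vote(message), message)
```

Messages labeled this way can serve as training data for a downstream classifier. A generative model improves on the majority vote by estimating how accurate each rule is and weighting its votes accordingly.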
Bio: Sanghamitra Deb is a Senior Data Scientist at Chegg Inc. At Chegg, she works on a wide range of projects, including developing a recommendation system for Chegg online tutoring and detecting student and tutor intents using natural language processing, and she is heavily involved in A/B testing machine learning models. Previously, she worked at Accenture Tech Labs, developing algorithmic solutions to business problems. Before becoming a data scientist, she completed a PhD in astrophysics, studying the formation and evolution of the universe by analyzing gravitational lensing by galaxy clusters.