Abstract: SentiLink builds models in areas of fraud where no dataset containing ground truth exists. Our models are therefore built from a combination of the massive amount of unlabeled data we receive from our partners and the labels that an internal team of fraud experts generates by manually reviewing cases escalated to them by both the data science team and external partners. This presents us with a number of unique challenges. How do we use unlabeled data to improve our models by supplementing these labeled cases during model training? How do we select incremental cases for these risk analysts to review, given what we’ve already labeled (a problem very similar to active learning, but with some special caveats for our domain)? How do we determine how much of each kind of fraud is hitting our partners, given that we can’t label everything; that is, how do we know there aren’t significant fraud trends that we’re missing? In this talk, SentiLink Data Scientist Seth Weidman will tackle these questions and more. Attendees will come away with a better understanding of how to operate in the context of partially labeled datasets.
Bio: Seth Weidman is a Data Scientist at SentiLink, where he works on the core synthetic fraud and identity theft models that power SentiLink’s API-based solution to stopping fraud, as well as on new product development. Immediately before SentiLink, he was at Facebook. He is the author of Deep Learning from Scratch, published by O’Reilly in 2019, and has degrees in mathematics and economics from the University of Chicago.