Building an Industry Classifier With The Latest Scraping, NLP and Deployment Tools

Abstract: 

For BlueVine, as for any fintech company, identifying a client’s industry is a critical factor in making precise financial decisions. Traditional data sources are often expensive, inaccurate or simply unavailable, leaving an opening for an ML-based solution. We met that challenge by building a service that predicts a business’s industry from its publicly available web data. By combining the latest innovations in NLP (BERT) with some of the most powerful scraping and deployment tools available (Scrapy and Amazon SageMaker), we were able to dramatically surpass the performance of other such tools in the space.

This presentation will cover the entire development pipeline hands-on: crowdsourcing a tagged sample, building a smart and scalable web scraper, prepping and feeding the resulting raw data into BERT, fine-tuning the model, and finally deploying it as a cloud-based service behind an API. Both model training and deployment are handled through Amazon SageMaker.

Session Outline
Part 1: Gathering websites and generating a tagged set of industry values. Learn how to manage a complex crowdsourcing task and how to ensure a quality dataset.
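One standard quality-control technique for a crowdsourcing task like this is to collect several independent annotations per website and keep only labels that reach sufficient agreement, sending the rest back for re-tagging. The sketch below illustrates the idea; the function name, vote counts, and agreement threshold are illustrative assumptions, not the actual BlueVine pipeline.

```python
from collections import Counter

def aggregate_labels(annotations, min_votes=3, min_agreement=0.6):
    """Majority-vote the crowdsourced industry labels for one website.

    Returns the winning label, or None if the site needs re-tagging
    (too few annotations, or the annotators disagree too much).
    """
    if len(annotations) < min_votes:
        return None  # not enough annotations collected yet
    label, count = Counter(annotations).most_common(1)[0]
    return label if count / len(annotations) >= min_agreement else None

# Two of three annotators agree -> label accepted
print(aggregate_labels(["restaurants", "restaurants", "retail"]))  # restaurants
# Three-way disagreement -> site goes back into the tagging queue
print(aggregate_labels(["restaurants", "retail", "legal"]))        # None
```

Filtering low-agreement sites this way trades some dataset size for label quality, which generally matters more for a fine-tuned classifier than raw volume.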

Part 2: Building the crawler. Overview of how to build a crawler using Scrapy and how to customize it for your specific needs. Will cover specific tweaks that add functionality beyond what’s provided out of the box.
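The talk’s crawler itself is built with Scrapy; as a stand-in, here is a stdlib-only sketch of two of the customizations such a spider typically needs inside its parse callback: keeping the crawl inside the business’s own domain, and stripping markup down to the visible text that will be fed to the model. All names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class TextExtractor(HTMLParser):
    """Collects visible page text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    """Return the human-visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def same_domain(seed_url, candidate_url):
    """Only follow links that stay on the business's own site."""
    return urlparse(seed_url).netloc == urlparse(candidate_url).netloc

print(visible_text("<p>Hello <script>var x = 1;</script>world</p>"))  # Hello world
```

In a real Scrapy spider, `same_domain` is roughly what the built-in `allowed_domains` setting gives you, while text extraction is where custom tweaks beyond the out-of-the-box behavior come in.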

Part 3: Design, train and deploy the model: Learn the basics of BERT and PyTorch, and how to build a transfer learning model using those two frameworks. Will cover the entire cycle with the example of an industry classification model. Will conclude with the basics of how to deploy such a model using Amazon SageMaker.
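The core of a transfer-learning setup like the one described above is a small trainable classification head on top of a frozen pretrained encoder. The following is a minimal PyTorch sketch of that pattern; the class names are illustrative, and a toy mean-pooling encoder stands in for BERT so the snippet runs without downloading pretrained weights. In the real pipeline the encoder would be a pretrained BERT whose pooled output feeds the head.

```python
import torch
import torch.nn as nn

class IndustryClassifier(nn.Module):
    """Frozen pretrained encoder + trainable linear head over industry classes.

    `encoder` is any module mapping token ids (batch, seq_len) to a
    (batch, hidden_size) feature vector.
    """
    def __init__(self, encoder, hidden_size, num_industries):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # freeze pretrained weights
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_industries),
        )

    def forward(self, token_ids):
        with torch.no_grad():           # encoder stays fixed during fine-tuning
            features = self.encoder(token_ids)
        return self.head(features)      # logits over industry classes

class ToyEncoder(nn.Module):
    """Stand-in for BERT: mean-pooled token embeddings."""
    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_size)

    def forward(self, token_ids):
        return self.emb(token_ids).mean(dim=1)

model = IndustryClassifier(ToyEncoder(), hidden_size=32, num_industries=20)
logits = model(torch.randint(0, 100, (4, 16)))  # batch of 4 "documents"
print(logits.shape)  # torch.Size([4, 20])
```

Freezing the encoder is the simplest variant; in practice one often unfreezes some or all BERT layers with a small learning rate, which is the fine-tuning regime the session discusses.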

Background Knowledge
- BERT;
- PyTorch;
- Pandas;
- Scrapy;
- Amazon SageMaker;
- Jupyter;
- Docker.

Bio: 

Ido Shlomo is the head of BlueVine’s data science team in the US, where he works on applying machine learning and other automation solutions for risk management, fraud detection and marketing purposes. His recent work is focused on implementing complex NLP tasks in production systems, and specifically on the challenge of consuming unstructured data. Previously Ido worked in the Economics department at Tel Aviv University as a researcher in structural macroeconomic modeling. Ido holds a dual BA in mathematics and philosophy and an MA in economics, both from Tel Aviv University.