Industry Classification at Scale: Leveraging Web Data, Crowdsourcing and Distributed Computing for Model Generation

Abstract: BlueVine is a leading provider of funding for small businesses with a primary focus on speed, simplicity and transparency. One of the key considerations when making funding decisions is the industry of the prospective client, since it interacts directly with risk management considerations, policy constraints and portfolio management. Indeed, the risk management implications cannot be overstated: underwriting practices vary widely across industries, and many costly mistakes can be avoided by this distinction alone. This presentation will describe how we at BlueVine automatically classify prospective clients’ industries at scale and with high precision. For that purpose we have developed a unique system that employs multiple tools at the forefront of model development: (i) scraping libraries for harvesting web data, (ii) crowdsourcing platforms for labelling uncategorized observations, and (iii) machine learning libraries and distributed computing services for model generation.

First off, there is no agreed-upon set of industries to target. We had to generate a set large enough to describe our entire client base but small enough to be predicted effectively with a limited budget and resources. Our eventual solution turned out to be a chain of tools in which each link solves one aspect of the problem. The most basic challenge was finding a fitting data source, and we chose business websites due to their prevalence and free availability. To turn websites into data ingestible by models, we used a web harvesting library (Scrapy) to gather all of the textual information and NLP techniques (spaCy and scikit-learn) for feature generation. Given this type of data, even a moderately sized set of industries presents a significant labelling challenge: tens of thousands of observations need to be labelled, and every decision is itself non-trivial since a company’s industry is not always obvious. Often a company will do a number of different things (e.g. provide a service and sell a product) or work in a number of different areas (e.g. construction and engineering). Through crowdsourcing (partnering with CrowdFlower) we managed to complete this task by deploying an interface specifically tailored to this decision problem. Finally, speedy and exhaustive development requires significant computing resources, and we made use of Amazon Web Services (EC2, including P2 GPU instances, and S3), which allow for both flexibility and cost control.
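
Because a company’s industry is often ambiguous, crowdsourced labels usually need some form of reconciliation before they can be used for training. The abstract does not say how this was done, so the following is only a minimal sketch of one common approach: majority voting with an agreement threshold. It assumes that each website was judged by several contributors; the function name `aggregate_labels`, the `min_agreement` parameter, its default value and the sample domains are all hypothetical.

```python
from collections import Counter


def aggregate_labels(judgments, min_agreement=0.6):
    """Resolve crowd judgments per website by majority vote.

    judgments: dict mapping a website to the list of industry labels
               given by individual contributors (assumed input format).
    Returns a dict mapping each website to (label, agreement), where
    label is None when the top label's share is below min_agreement.
    """
    resolved = {}
    for site, labels in judgments.items():
        if not labels:
            # No judgments collected for this site: nothing to resolve.
            resolved[site] = (None, 0.0)
            continue
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        # Keep the label only if enough contributors agreed on it.
        label = top_label if agreement >= min_agreement else None
        resolved[site] = (label, agreement)
    return resolved


# Hypothetical judgments: three contributors per website.
judgments = {
    "acme-builders.example": ["construction", "construction", "engineering"],
    "bolt-and-byte.example": ["retail", "software", "services"],
}
resolved = aggregate_labels(judgments)
```

Websites that come back with `None` are the ambiguous cases described above (for example, a company that both provides a service and sells a product); in a sketch like this they could be sent back for more judgments or left out of the training set.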

Bio: Ido is a Data Science Manager at BlueVine and heads the Data Science team at BlueVine’s US offices. He is an expert in designing end-to-end automated solutions for financial decisions and risk mitigation. Ido’s main interest is turning difficult, unstructured datasets into actionable insights. Ido holds a BSc in Mathematics and Philosophy and an MA in Economics, both from Tel Aviv University, Israel.