Abstract: Extracting key fields from a variety of document types remains a challenging problem. Cloud providers such as AWS and Google Cloud offer text extraction services to digitize images or PDFs. These services use OCR techniques and return phrases, words, and characters with their corresponding coordinate locations. Working with these outputs is difficult and does not scale, because different document types require different heuristics and new types are uploaded daily. Additionally, OCR doesn’t attempt to understand the document; for example, dollar amounts need to be numerical, yet OCR may read a “1” as a lowercase “l.” Furthermore, accuracy hits a ceiling even when the parsing algorithms work perfectly: while third-party OCR is excellent, it isn’t perfectly accurate.
We propose an end-to-end scalable solution using a deep learning architecture that consists of a computer vision component connected to a sequence generation component. Through training on millions of documents, the model learns document trends and characteristics and extracts important fields directly from raw documents. The model shows a marked improvement in accuracy compared to third-party OCR services. Additional benefits include character-level probabilities that can serve as confidence scores, and the ability to use explainability algorithms such as LIME to determine which “hot pixels” in the document are responsible for the predictions.
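As an illustration of how character-level probabilities might be turned into a field-level confidence score, here is a minimal Python sketch. It is not the model described above: the function name `field_confidence`, the example probabilities, and the 0.90 review threshold are all hypothetical, and multiplying the per-character probabilities is just one simple aggregation choice.

```python
import math
from typing import List, Tuple


def field_confidence(char_probs: List[float]) -> Tuple[float, float]:
    """Aggregate per-character probabilities for one predicted field.

    Returns (joint, weakest): the product of all character probabilities
    and the single lowest character probability.
    """
    if not char_probs:
        raise ValueError("char_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in char_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    # Sum logs instead of multiplying directly to stay stable on long fields.
    joint = math.exp(sum(math.log(p) for p in char_probs))
    weakest = min(char_probs)
    return joint, weakest


# Hypothetical probabilities for the six characters of a predicted amount "120.50".
probs = [0.99, 0.98, 0.97, 0.99, 0.95, 0.99]
joint, weakest = field_confidence(probs)

# Hypothetical policy: send the field to manual review below a chosen threshold.
REVIEW_THRESHOLD = 0.90
needs_review = joint < REVIEW_THRESHOLD
```

The joint score penalizes long fields, since every extra character can only lower the product, while the weakest-character score points at the single position most likely to be wrong. Which one to threshold on is a design choice that depends on how the scores are used.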
Bio: I'm the Chief Data Scientist at Bill.com and have many years of experience as a scientist and researcher. My recent focus is on machine learning, deep learning, applied statistics, and engineering. Previously, I was a Postdoctoral Scholar at Lawrence Berkeley National Lab. I received my PhD in Physics from Boston University and my B.S. in Astrophysics from the University of California, Santa Cruz. I have 2 patents and 11 publications to date and have spoken about data at various conferences around the world.