Applying State-of-the-art Natural Language Processing for Personalized Healthcare


Accelerating progress in personalized healthcare requires learning the causal relationships between diseases, genes, treatments, medications, labs, and other clinical information, at scale, across a large population and a long time range. More than half of the clinically relevant data in oncology is found only in free-text pathology reports, radiology reports, sequencing reports, and progress notes.

Extracting and normalizing these facts requires training oncology-specific models that work accurately across a variety of clinical documents. This talk describes results and lessons learned from a real-world project doing this at scale, in three areas:

1. Applying state-of-the-art deep-learning-based NLP models for entity recognition, entity resolution, negation detection, and document segmentation. This is one of the first projects outside a research setting to apply BioBERT; we’ll compare it with “vanilla” BERT, share tricks for improving embeddings using vocabularies, and discuss the impact of this form of transfer learning on the ability to learn from small labeled datasets.
2. Using Spark NLP for training and inference of these NLP pipelines, to unify processing from document loading to generating final results in a way that runs well locally and scales natively on a Spark cluster. We’ll share benchmarks from optimized builds on Intel and Nvidia hardware.
3. Considerations for architecting an AI platform that can process protected health information (PHI) in a secure and compliant way, for both training and inference. This involves operating the whole process (data integration, experimentation, scaling to a cluster, versioning, reproducibility, and model deployment) within an air-gapped environment without Internet access.
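To make the first area concrete, here is a minimal sketch of what entity recognition, entity resolution, and negation detection each produce for a single sentence. It is a toy, rule-based stand-in and not the deep-learning models or the Spark NLP pipeline the talk covers. The vocabulary, the concept IDs, the negation cues, and the names `Fact` and `extract_facts` are all hypothetical and exist only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical mini-vocabulary: surface form -> made-up concept ID.
# A real system would resolve against a clinical terminology instead.
VOCAB = {
    "adenocarcinoma": "CONCEPT:0001",
    "metastasis": "CONCEPT:0002",
    "metastases": "CONCEPT:0002",
    "egfr mutation": "CONCEPT:0003",
}

# Hypothetical negation cues; assumed here purely for the example.
NEGATION_CUES = ("no evidence of", "negative for", "without", "no")


@dataclass
class Fact:
    text: str        # mention as written in the document
    concept_id: str  # normalized identifier (entity resolution)
    negated: bool    # True if a negation cue precedes the mention
    start: int       # character offset of the mention


def extract_facts(sentence: str, window: int = 4) -> List[Fact]:
    """Find vocabulary terms, map them to IDs, and flag nearby negation."""
    lowered = sentence.lower()
    facts = []
    for term, concept_id in VOCAB.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            # Only inspect the few words right before the mention.
            preceding = " ".join(lowered[: match.start()].split()[-window:])
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                for cue in NEGATION_CUES
            )
            facts.append(
                Fact(
                    text=sentence[match.start(): match.end()],
                    concept_id=concept_id,
                    negated=negated,
                    start=match.start(),
                )
            )
    # Return mentions in document order.
    return sorted(facts, key=lambda fact: fact.start)


if __name__ == "__main__":
    for fact in extract_facts("Adenocarcinoma present, no evidence of metastases."):
        print(fact)
```

Fixed word lists like these break down quickly on real clinical text, which is the gap that trained, domain-specific models are meant to close.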


David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe. He also worked at Amazon in both Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a Ph.D. in computer science and master’s degrees in both computer science and business administration.

Open Data Science




