Creating a Benchmark for a Large-Scale Image Captioning Pipeline


The motivation for this project came from a Kaggle competition hosted by Bristol-Myers Squibb (BMS), the Bristol Myers Squibb Chemistry Competition. The International Chemical Identifier (InChI) was standardized in 2005. InChI labels are standardized strings that make it possible to identify chemicals in databases. Because some of the chemical literature predates 2005, BMS wanted an automated method for matching images of chemicals with their InChI labels. In this project, we treated the problem as one of captioning chemical images. Our process involved natural language processing, transfer learning, and natural language generation.
The IUPAC International Chemical Identifier (InChI) is a textual identifier designed to standardize how chemical substances are encoded. Such identifiers make it easier to search for chemical information in databases and on the web. The labeling system, developed by IUPAC (the International Union of Pure and Applied Chemistry) and NIST (the National Institute of Standards and Technology), was in use by 2005. The identifiers themselves are long strings of letters and symbols.
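To make "long strings of letters and symbols" concrete, here is a minimal sketch that splits an InChI string into its slash-separated layers. The function name split_inchi is hypothetical and not part of the project code. The sketch assumes a single-component identifier in which each layer after the formula starts with a one-letter prefix, and it uses the standard InChI for ethanol as the example.

```python
def split_inchi(inchi):
    """Split an InChI string into a dict of layers (simplified sketch)."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("expected a string starting with 'InChI='")
    # Layers are separated by '/'; drop empty segments.
    parts = [p for p in inchi[len(prefix):].split("/") if p]
    if not parts:
        raise ValueError("InChI string has no version segment")
    layers = {"version": parts[0]}
    rest = parts[1:]
    # The formula layer has no prefix letter; prefixed layers start lowercase.
    if rest and not rest[0][0].islower():
        layers["formula"] = rest.pop(0)
    for part in rest:
        # Assumption: one layer per prefix letter (later repeats overwrite).
        layers[part[0]] = part[1:]
    return layers


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

For ethanol this yields a version segment, the formula C2H6O, a connectivity layer (prefix c), and a hydrogen layer (prefix h). Real identifiers can carry further layers, such as stereochemistry, that this sketch only stores by prefix letter.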
The original code for this project was inspired by work on image captioning with the COCO (Common Objects in Context) dataset, and it repurposed those techniques for labeling chemical images. The BMS dataset contains images that were artificially speckled and blurred to mimic how they might look in older chemistry publications. The dataset has about 2.5 million training images and another 2.4 million test images. We used several computer vision algorithms to despeckle and clean the images before feeding them into our pipeline, which consisted of transfer learning and natural language generation.
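The abstract does not say which despeckling algorithms were used. As one illustration only, the sketch below applies a median filter, a common way to remove isolated speckles, to a small grayscale image stored as a list of rows. The function name median_filter and the radius parameter are assumptions made for this example, and the pure-Python loop is for clarity rather than speed.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Replace each pixel with the median of its neighborhood.

    image is a list of rows of integer gray levels (0 = black, 255 = white).
    Windows are clipped at the image border instead of padded.
    """
    height, width = len(image), len(image[0])
    cleaned = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns a value that occurs in the window,
            # so pixel values stay integers even for even-sized windows.
            row.append(median_low(window))
        cleaned.append(row)
    return cleaned


# A white 5x5 image with a single black speck in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter(noisy)
print(cleaned[2][2])
```

A median filter can also thin or erase strokes that are only one pixel wide, so the radius would need checking against real structure drawings before use at scale.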
Our process was as follows:
Natural language processing to tokenize the training labels
Computer vision and transfer learning to ingest the images into a convolutional neural network encoder based on ResNet50
Natural language generation to produce the image captions
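The first step above can be sketched as follows. The abstract does not describe the tokenization scheme, so this example assumes a simple regex tokenizer that splits a label into element symbols, numbers, and single characters, with hypothetical special tokens for padding and for the start and end of a sequence. The encoder and decoder models are left out.

```python
import re

PREFIX = "InChI=1S/"
# Element symbol (e.g. C, Cl), run of digits, or any other single character.
TOKEN_PATTERN = re.compile(r"[A-Z][a-z]?|\d+|.")
SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>"]  # hypothetical special tokens


def tokenize(inchi):
    """Split an InChI label into tokens, dropping the constant prefix."""
    body = inchi[len(PREFIX):] if inchi.startswith(PREFIX) else inchi
    return TOKEN_PATTERN.findall(body)


def build_vocab(labels):
    """Map every token seen in the labels to an integer id."""
    tokens = sorted({tok for label in labels for tok in tokenize(label)})
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + tokens)}


def encode(inchi, vocab):
    """Turn a label into ids wrapped in start and end markers."""
    ids = [vocab[tok] for tok in tokenize(inchi)]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


def decode(ids, vocab):
    """Turn ids back into an InChI label, skipping special tokens."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    body = "".join(inverse[i] for i in ids if inverse[i] not in SPECIAL_TOKENS)
    return PREFIX + body


labels = ["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "InChI=1S/CH4/h1H4"]
vocab = build_vocab(labels)
ids = encode(labels[0], vocab)
print(tokenize(labels[1]))
print(decode(ids, vocab) == labels[0])
```

In a captioning setup of this kind, the integer sequences would be the targets the decoder learns to generate from the image features, and decode would turn predicted ids back into a label.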
Using the distributed computing framework Ray, we created a benchmark and cut the training time from the one week per epoch of the competition's second-place solution to about a day and a half per epoch. We believe our approach and methodology will inform future work on scaling big-data deep learning with distributed systems.


Jennifer Davis, Ph.D., is a Staff Field Data Scientist at Domino Data Lab, where she empowers clients on complex data science projects. She has completed two postdocs in computational and systems biology, trained at a supercomputing center at the University of Texas, Austin, and worked on hundreds of consulting projects with companies ranging from start-ups to the Fortune 100. Jennifer has previously presented on LSTMs and natural language generation at Association for Computing Machinery conferences, as well as at conferences across the US and in Italy. She was also part of a panel discussion at an IEEE conference on artificial intelligence in biology and medicine. She has practical experience teaching both corporate classes and college-level courses. Jennifer enjoys working with clients and helping them achieve their goals.

Open Data Science




