Sell Cron, Buy Airflow: Modern Data Pipelines In Finance

Abstract: Quantopian's data pipelines ingest financial data from vendors and repackage it into high-performance formats, producing a unified view of market history. In 2018 we entered into a partnership with FactSet and began dramatically expanding the data available on our platform. We selected Apache Airflow as our workflow engine to cope with the additional complexity and maintain high availability for our production data systems.

Our community of algorithm authors becomes more productive with every dataset we add to the platform, so we knew that we’d need to deploy new algorithms faster than ever before. In the latter half of 2018, we started building a new production system, the Quantopian Alpha Model (QAM), to allow our investment team to incorporate ideas from many more authors in our community.

We decided to build QAM by applying proven DevOps methodologies like containerization and continuous deployment to data science fundamentals. QAM’s nightly pipeline leverages scientific Python for data processing, Kubernetes for execution, and Apache Airflow for orchestration. QAM is also entirely code-defined and shipped via pull request, allowing developers to perform mid-day code deployments with no involvement from operations.
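Airflow expresses pipelines like QAM's as Python-defined graphs of tasks, which is what makes "entirely code-defined" deployment via pull request possible. As a rough stdlib-only sketch of that idea (the task names here are hypothetical, not QAM's actual steps, and a real Airflow DAG would use operators rather than plain functions):

```python
from graphlib import TopologicalSorter

# Hypothetical nightly-pipeline steps. In Airflow each of these would be
# an operator (e.g. a containerized task on Kubernetes); here they are
# plain functions to illustrate the pipeline-as-code concept.
def ingest():    return "vendor data ingested"
def transform(): return "data repackaged"
def score():     return "alpha scores computed"
def trade():     return "orders generated"

# The DAG: each task maps to the set of tasks it depends on.
dag = {
    "ingest":    set(),
    "transform": {"ingest"},
    "score":     {"transform"},
    "trade":     {"score"},
}

tasks = {"ingest": ingest, "transform": transform,
         "score": score, "trade": trade}

# Run tasks in dependency order, as a scheduler would.
order = list(TopologicalSorter(dag).static_order())
results = {name: tasks[name]() for name in order}
print(order)  # ['ingest', 'transform', 'score', 'trade']
```

Because the whole graph lives in version-controlled code, changing the pipeline is just a pull request, which is what lets developers ship mid-day without operations involvement.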

This development methodology has been a resounding success: our team of four went from creating a repo to placing live trades in three and a half months. Now that we’ve shipped QAM, we’d like to help you ship your data science projects faster, too.

Bio: James Meickle is a site reliability engineer at Quantopian, a Boston startup making algorithmic trading accessible to everyone. His current areas of interest include data pipelines, containerization platforms, and continuous delivery. In past roles, he’s been responsible for processing MRI scans at the Center for Brain Science at Harvard University, sales engineering and developer evangelism at AppNeta, and release engineering on a presidential campaign.