
Abstract: Python's versatility, efficiency, reliability, and speed have rightfully established it as the default language for big data processing, exploratory data analysis, machine learning and cloud computing. An inevitable consequence of this is a rich ecosystem of open-source, general-purpose packages that ML researchers can readily draw on to build complex systems. In an era when MLOps was not even a thing, data scientists could quickly bundle code together to generate their ML models or sophisticated ETL pipelines, leveraging many different packages through simple notebooks, and even deploy them in production. A repercussion of this is thousands of lines of glue code written just to get data into and out of these general-purpose packages. However, as organisations mature, anti-patterns like these need to be dealt with effectively, or they can freeze systems to the peculiarities of the packages they wrap.
This talk is the story of how we dealt with this problem, particularly in how we used pandas across many of our systems, by actively combating glue code: we wrapped pandas I/O operations (which indirectly rely on libraries such as `pyarrow`, `fastparquet`, `s3fs`, `boto3` and `aws-cli`) behind a common API named dynamic(i/o). By packaging these libraries up we were able to promote good practices, write reusable code that was easy to read, structure our repos better around a consistent template of work, define data expectations that would be validated by our wrapper, generate structured metrics picked up by our monitoring systems, and even define interfaces between our developer teams, enabling smoother communication of both data and knowledge.
(We intend to open-source this library on the day of the talk).
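To make the pattern concrete, here is a minimal sketch of what such a wrapper around pandas I/O might look like. It is an illustration only, not the dynamic(i/o) API: the names `read_dataframe` and `DataContract`, the dtype-based validation, and the logging-based metrics are hypothetical stand-ins for the behaviour described above.

```python
import logging
import time
from dataclasses import dataclass

import pandas as pd

logger = logging.getLogger("io_wrapper_sketch")


@dataclass
class DataContract:
    """Hypothetical data expectations: column name -> expected pandas dtype."""
    required_columns: dict

    def validate(self, df: pd.DataFrame) -> None:
        # Fail fast if the dataset drifts away from what downstream code expects.
        missing = set(self.required_columns) - set(df.columns)
        if missing:
            raise ValueError(f"missing expected columns: {sorted(missing)}")
        for col, dtype in self.required_columns.items():
            if str(df[col].dtype) != dtype:
                raise TypeError(f"column {col!r}: expected {dtype}, got {df[col].dtype}")


def read_dataframe(path: str, contract: DataContract, **read_kwargs) -> pd.DataFrame:
    """One entry point for all reads: dispatch on format, validate, emit metrics."""
    start = time.monotonic()
    if path.endswith(".parquet"):
        # pandas delegates to pyarrow/fastparquet, and to s3fs for s3:// paths
        df = pd.read_parquet(path, **read_kwargs)
    elif path.endswith(".csv"):
        df = pd.read_csv(path, **read_kwargs)
    else:
        raise ValueError(f"unsupported file type: {path}")
    contract.validate(df)
    # Structured log record that a monitoring system could pick up
    logger.info("read ok", extra={"path": path, "rows": len(df),
                                  "seconds": round(time.monotonic() - start, 3)})
    return df


# Example usage (hypothetical dataset and schema):
# df = read_dataframe("s3://bucket/cargoes.parquet",
#                     DataContract({"vessel_id": "int64", "tonnes": "float64"}))
```

Funnelling every read through a single function is what lets one wrapper enforce data expectations and emit consistent metrics across otherwise unrelated pipelines.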
Bio: Tyler works as a Data Scientist at Vortexa, where he focuses on building machine learning models that capture the dynamics of the energy markets. Prior to Vortexa, Tyler did research in clinical machine learning and published work in sports injury analytics and mathematical optimisation. He previously worked as a software engineer for startups and clients in finance, and draws on this experience to contribute to the full lifecycle of building machine learning pipelines.