What it Takes to Stabilize a GenAI-first, Modern Data Lake in a Big Company: Provision 20,000 Ephemeral Data Lakes Annually

Abstract: 

LinkedIn, having joined the exabyte-scale data lake club in 2021, has been at the forefront of data and AI innovations. The year 2023 brought significant challenges and milestones, including the introduction of GenAI LLMs, completion of the Iceberg migration, initiation of the object storage journey, and a renewed focus on data privacy and security. This session delves into the strategies and lessons learned during this transformative period, with a specific focus on stabilizing platforms without compromising advancements in AI, security, and unified SQL.

Overview:
Challenges Faced:
Connectivity issues that led to 11 days of lost GenAI training.
Live production failures in interactive Darwin notebook queries.
Infrastructure development hesitations and trust issues in staging environments.
Approaches and Learnings:
Development of a high-throughput system for auto-building lightweight, production data lakes on Kubernetes (K8s) for every code commit and pull request (PR).
Scaling flow failure insights using Prometheus, OpenTelemetry, and the Java Virtual Machine (JVM).
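
As a rough illustration of the per-commit, per-PR provisioning idea above, the sketch below derives a Kubernetes namespace for one ephemeral data lake. It is not LinkedIn's implementation: the naming scheme, the `lake-` prefix, the `example.com/...` label and annotation keys, and the `ttl_hours` parameter are all assumptions made for this example. The only external fact it relies on is that Kubernetes namespace names must be DNS-1123 labels (lowercase alphanumerics and `-`, at most 63 characters).

```python
import re

MAX_LABEL_LEN = 63  # Kubernetes DNS-1123 label limit for namespace names
PREFIX = "lake-"    # hypothetical prefix, not from the talk


def ephemeral_namespace(repo: str, pr_number: int, commit_sha: str) -> str:
    """Build a DNS-1123-safe namespace name for one PR commit (hypothetical scheme)."""
    suffix = f"-pr{pr_number}-{commit_sha[:7].lower()}"
    # Lowercase, replace runs of characters outside [a-z0-9-] with '-', trim stray dashes.
    slug = re.sub(r"[^a-z0-9-]+", "-", repo.lower()).strip("-")
    # Truncate the repo slug so prefix + slug + suffix fits in 63 characters.
    slug = slug[: MAX_LABEL_LEN - len(PREFIX) - len(suffix)].rstrip("-")
    return f"{PREFIX}{slug}{suffix}"


def namespace_manifest(repo: str, pr_number: int, commit_sha: str, ttl_hours: int = 4) -> dict:
    """Return a Namespace manifest as a plain dict; label and annotation keys are made up."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": ephemeral_namespace(repo, pr_number, commit_sha),
            "labels": {
                "example.com/ephemeral": "true",
                "example.com/pr": str(pr_number),
            },
            # A separate cleanup job (not shown) could read this to delete expired lakes.
            "annotations": {"example.com/ttl-hours": str(ttl_hours)},
        },
    }
```

Deriving the name from the PR number and commit means a re-run of the same commit maps to the same namespace, and a time-to-live annotation gives a cleanup process something to act on. Both are design choices assumed here for illustration.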

Key Discussion Points:
Recognizing symptoms indicating the need for reinvestment in foundational infrastructure.
Strategies for stabilizing platforms while accommodating rapid innovation in AI, security, and unified SQL.
User experience enhancements and architectural iterations implemented during the stabilization process.
The journey of productionizing OpenTelemetry at LinkedIn and its impact on observability.
Unanticipated challenges faced along the way and how they were resolved.

Results:
Currently, the system supports over 20,000 ephemeral data lakes annually.
Detection and resolution of 2.1K platform issues each year.

By sharing LinkedIn's experiences and solutions, this session aims to provide valuable insights into managing large-scale data lakes, ensuring stability, and fostering continuous innovation. The discussion will be particularly relevant for data scientists, machine learning engineers, and infrastructure developers seeking to strike a balance between technological advancements and a robust foundation.

Bio: 

Moses Lee is a software engineer who found his interest in data at RISELab, home to AI-focused platforms such as Ray, and in industry at a Berkeley SkyDeck incubator. Since then, Moses has joined LinkedIn as a software engineer and helped bring multiple platforms to general availability (GA) across cloud, AI compute, and now data lake foundations. He is interested in bringing simple interfaces to end users of fairly complex data infrastructure.

Open Data Science

 

 

 
