
Abstract: This workshop will cover the basics of Apache Arrow and Apache Parquet, how to load data to/from pyarrow arrays, CSV files, and Parquet files, and how to use pyarrow to quickly perform analytic operations such as filtering, aggregation, joining, and sorting. In addition, you will experience the benefits of the open Arrow ecosystem and see how Arrow enables fast, efficient interoperability with pandas, Polars, DataFusion, DuckDB, and other technologies that support the Arrow memory format.
Session Outline:
"Apache Arrow https://arrow.apache.org/ is a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.
After completing this workshop, you will understand the basics of Apache Arrow and Apache Parquet, how to load data to/from pyarrow arrays, CSV files, and Parquet files, and how to use pyarrow to quickly perform analytic operations such as filtering, aggregation, joining, and sorting. In addition, you will experience the benefits of the open Arrow ecosystem and see how Arrow enables fast, efficient interoperability with pandas, Polars, DataFusion, DuckDB, and other technologies that support the Arrow memory format.
Lesson 1: Apache Arrow and Parquet
Review the basics of Apache Arrow and Apache Parquet, as well as the rationale and use cases for both. At the end of this lesson, you will understand both formats, know when each is appropriate, and have a Python environment set up that can load data using pyarrow.
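For a sense of the kind of setup this lesson covers, here is a minimal sketch using pyarrow's CSV and Parquet readers (the file names are hypothetical placeholders):

    # Install with: pip install pyarrow
    import pyarrow.csv as csv
    import pyarrow.parquet as pq

    # Read a CSV file into an Arrow Table (columnar, in-memory)
    table = csv.read_csv("data.csv")
    print(table.schema)

    # Write the same data to a Parquet file (columnar, on-disk) and read it back
    pq.write_table(table, "data.parquet")
    table2 = pq.read_table("data.parquet")
    print(table2.num_rows)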
Lesson 2: Transform Your Data for Analysis
In this lesson, we will practice loading data to/from Arrow and Parquet and highlight some common gotchas. At the end of this lesson, you will be able to use pyarrow to quickly load, filter, aggregate, join, sort, and save large datasets to Parquet files.
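As a taste of these operations, here is a minimal sketch using pyarrow's compute, group-by, join, and sort APIs (the table contents and column names are made up for illustration):

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    table = pa.table({
        "city": ["NYC", "NYC", "SF", "SF"],
        "amount": [10.0, 20.0, 5.0, 15.0],
    })

    # Filter rows where amount > 8
    filtered = table.filter(pc.greater(table["amount"], 8.0))

    # Aggregate: total amount per city (output column is named "amount_sum")
    totals = filtered.group_by("city").aggregate([("amount", "sum")])

    # Sort by the aggregated total, largest first
    sorted_totals = totals.sort_by([("amount_sum", "descending")])

    # Join against another table on the "city" key
    cities = pa.table({"city": ["NYC", "SF"], "state": ["NY", "CA"]})
    joined = sorted_totals.join(cities, keys="city")

    # Save the result to a Parquet file
    pq.write_table(joined, "totals.parquet")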
Lesson 3: Interoperate with the Ecosystem
Not only is pyarrow a useful tool on its own, it also interoperates efficiently with a wide variety of other tools that use the Arrow format. At the end of this lesson, you will be able to efficiently move data from pyarrow to other common tools such as pandas, Polars, DataFusion, and DuckDB, as well as quickly send Arrow data over the network using Arrow Flight.
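A minimal sketch of this kind of interoperability, assuming pandas, Polars, and DuckDB are installed (DataFusion and Arrow Flight follow similar patterns but are not shown here):

    import pyarrow as pa
    import pandas as pd
    import polars as pl
    import duckdb

    tbl = pa.table({"id": [1, 2, 3], "value": [10.5, 20.0, 7.25]})

    # pandas: convert an Arrow Table to a DataFrame and back
    df = tbl.to_pandas()
    tbl_again = pa.Table.from_pandas(df)

    # Polars: build a DataFrame directly from Arrow memory
    pl_df = pl.from_arrow(tbl)

    # DuckDB: run SQL directly against the Arrow table (referenced by variable name)
    result = duckdb.sql("SELECT id, value * 2 AS doubled FROM tbl").arrow()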
Background Knowledge:
You need to know Python and have basic familiarity with the relational data model (i.e., spreadsheets / tables).
We will be using Python extensively.
Bio: Andrew Lamb is the chair of the Apache Arrow Project Management Committee (PMC) and a Staff Software Engineer at InfluxData. He works on InfluxDB IOx, a time series database engine written in Rust that heavily uses the Apache Arrow ecosystem. He actively contributes to many open source software projects, including the Apache Arrow Rust implementation and the Apache Arrow DataFusion query engine.

Andrew Lamb
Chair of the Apache Arrow Project Management Committee | Staff Software Engineer | InfluxData
