Tutorial: Introduction to Apache Arrow and Apache Parquet, using Python and PyArrow

Abstract: 

This workshop covers the basics of Apache Arrow and Apache Parquet, how to move data between pyarrow arrays, CSV files, and Parquet files, and how to use pyarrow to quickly perform analytic operations such as filtering, aggregation, joining, and sorting. You will also experience the benefits of the open Arrow ecosystem and see how Arrow enables fast, efficient interoperability with pandas, Polars, DataFusion, DuckDB, and other technologies that support the Arrow memory format.

Apache Arrow (https://arrow.apache.org/) is a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware such as CPUs and GPUs. The Arrow memory format supports zero-copy reads, giving fast data access without serialization overhead.

Session Outline

Lesson 1: Apache Arrow and Parquet
Review the basics of Apache Arrow and Apache Parquet, along with the rationale and use cases for each. At the end of this lesson, you will understand both formats, know when each is appropriate, and have a Python environment set up that can load data using pyarrow.

Lesson 2: Transform Your Data for Analysis
In this lesson we will practice loading data to and from Arrow and Parquet, and highlight some common gotchas. At the end of this lesson, you will be able to use pyarrow to quickly load, filter, aggregate, join, sort, and save large datasets to Parquet files.

Lesson 3: Interoperate with the Ecosystem
Not only is pyarrow a useful tool on its own, it also interoperates efficiently with a wide variety of other tools that use the Arrow format. At the end of this lesson, you will be able to efficiently move data from pyarrow to other common tools such as pandas, Polars, DataFusion, and DuckDB, and to quickly send Arrow data over the network using Arrow Flight.

Background Knowledge:

Attendees need to know Python and have basic familiarity with the relational data model (that is, data organized as spreadsheets or tables).

We will be using Python extensively.

Bio: 

Andrew Lamb is the chair of the Apache Arrow Program Management Committee (PMC) and a Staff Software Engineer at InfluxData. He works on InfluxDB IOx, a time series database engine written in Rust that makes heavy use of the Apache Arrow ecosystem. He actively contributes to many open source software projects, including the Apache Arrow Rust implementation and the Apache Arrow DataFusion query engine.

Open Data Science