
Unlocking Data Insights with Python Pandas & Apache Iceberg

  • Writer: Claude Paugh
  • May 11
  • 3 min read

Updated: Aug 7

Introduction to Pandas


As far as Python packages go, Pandas has been one of the most popular since its initial release. Its "sweet spot" lies in providing labeled, two-dimensional data structures to represent data, a.k.a. tables: the DataFrame. It supports many basic-to-advanced calculations and aggregations commonly used by those who have "graduated" from, or simply never preferred, spreadsheets. As Data Science has developed over the years, Pandas' popularity has grown along with it.


Pandas relies heavily on the NumPy package for its arrays, dtypes, and math functions. You can filter, slice, subset, and extend DataFrames using a variety of techniques, and with some planning, several DataFrames together can mimic a multi-dimensional cube. While not necessarily cohesive, the approach can be functionally effective.
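As a quick illustration of those operations (the data here is made up purely for the example):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "sector": ["Tech", "Energy", "Tech"],
    "revenue": [10.0, 7.5, 12.3],
})

tech = df[df["sector"] == "Tech"]        # filter: boolean mask on a column
first_two = df.iloc[:2]                  # slice: positional row selection
subset = df.loc[:, ["revenue"]]          # subset: select specific columns
extended = df.assign(                    # extend: add a derived column
    revenue_m=df["revenue"] * 1_000_000
)
```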


Pandas Reading Files

Reading Data with Pandas


Pandas offers various "read_*" methods that take an input and create a DataFrame from that input data. For example, it can read from CSV, JSON, Parquet, delimited files, SQL statements/result sets, Apache ORC, HDF5, buffers, pickle files, and HTML tables, plus lists or dictionaries in Python (Apache Avro needs a third-party reader), to name several sources from which Pandas can digest data.
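A minimal example of one such reader; the same pattern applies to `read_json`, `read_parquet`, and the rest (the CSV content here is invented for illustration):

```python
import io
import pandas as pd

# An in-memory buffer stands in for a file on disk
csv_buffer = io.StringIO("Ticker,Price\nAAPL,190.5\nMSFT,410.2\n")
df = pd.read_csv(csv_buffer)  # one call: input -> DataFrame
```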


Pandas Named Aggregates


After defining and loading the Iceberg tables, as discussed in my previous post, going through an analytics exercise with Pandas seemed like a good test of something that could represent an actual use case. I started with the basics of retrieving data.
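The scan itself might look roughly like the sketch below. The catalog name, namespace, table, and column names are all hypothetical; only the PyIceberg calls come from the library.

```python
def scan_technology_usd():
    """Sketch: scan an Iceberg table with pyiceberg expression filters."""
    # Imports kept local so the sketch stands alone; requires pyiceberg
    from pyiceberg.catalog import load_catalog
    from pyiceberg.expressions import EqualTo

    catalog = load_catalog("local")                   # hypothetical catalog name
    table = catalog.load_table("financials.company")  # hypothetical namespace.table

    # Apply filters: the first row_filter is passed to Table.scan(),
    # the second is applied to the resulting DataScan object
    scan = table.scan(row_filter=EqualTo("sector", "Technology"))
    scan = scan.filter(EqualTo("currency", "USD"))

    return scan.to_pandas()  # materialize the filtered rows as a DataFrame
```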


The "Apply filters..." comment line above highlights an important aspect of PyIceberg. You can see from the import of pyiceberg.expressions that several expression types have been imported for use. There are essentially four expressions used for both numeric and string evaluation. The first row_filter targets only the technology sector by using the "EqualTo" expression from the import. The second filter limits the rows to only USD currency. The syntax differs slightly because the objects used are different. The first object is a table being scanned and filtered, while the second is a DataScan object being filtered.


Pandas Calculating Totals

The next section performs type conversion for Pandas, and DataFrames are merged to create new objects. I used a Pandas filter on the company_df object to contrast with the filter above using the pyiceberg.expressions methods. Every column in the DataFrame is typed as an object. To perform numeric calculations, I need to convert the columns to a numeric type (float). I also perform another scan on the Iceberg "forecasts" table to filter for 'Dividend_Per_Share_Growth' and then convert the output to Pandas.
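A sketch of the Pandas-side filter and the object-to-float conversion, with invented column names and values standing in for the real Iceberg output:

```python
import pandas as pd

# Hypothetical stand-in for company_df as it arrives from to_pandas():
# every column is object-typed
company_df = pd.DataFrame(
    {
        "Ticker": ["AAPL", "MSFT"],
        "Sector": ["Technology", "Technology"],
        "Free_Cash_Flow": ["99.8", "65.1"],
    },
    dtype=object,
)

# Pandas-side filter, for contrast with the pyiceberg.expressions approach
tech_df = company_df[company_df["Sector"] == "Technology"]

# Convert object columns to float before doing any numeric calculations
company_df["Free_Cash_Flow"] = company_df["Free_Cash_Flow"].astype(float)
```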


The last section in the code merges DataFrames using the 'Ticker' column and drops duplicates using the same column as a subset. I reset the index; otherwise, my next operations do not work. For SQL users, the merge on a common column using "how='inner'" is functionally similar to an inner join between two tables in an RDBMS. The results of these operations will provide an extended dataset for further use.
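The merge, de-duplication, and index reset described above can be sketched like this (DataFrames and column values invented for the example, with 'Ticker' as the common column):

```python
import pandas as pd

company_df = pd.DataFrame(
    {"Ticker": ["AAPL", "MSFT", "AAPL"], "cash": [1.0, 2.0, 1.0]}
)
forecast_df = pd.DataFrame(
    {"Ticker": ["AAPL", "MSFT"], "dps_growth": [0.05, 0.07]}
)

# how="inner" behaves like a SQL inner join on the 'Ticker' column
merged = company_df.merge(forecast_df, on="Ticker", how="inner")

# Drop duplicates on the same column, then reset the index
# so later positional operations work on a clean 0..n-1 index
merged = merged.drop_duplicates(subset=["Ticker"]).reset_index(drop=True)
```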


The next section involves more column typing to float, so I will skip that and move to the Pandas "Named Aggregates." As you can see below, using the Pandas alias from the import, "pd," named aggregates are created.


The results for each aggregate are calculated using a DataFrame containing the column(s) and adding a group by. The named results, e.g., "mean_10yr_cash," will be the column names for the resulting aggregate DataFrame. You will notice below that it's possible to get multiple named aggregate results from a single DataFrame group by; essentially, one statement.
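In sketch form, with hypothetical column names echoing the "mean_10yr_cash" style above, multiple named aggregates from one groupby look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["Tech", "Tech", "Energy"],
    "cash_10yr": [10.0, 14.0, 6.0],
})

# Each keyword becomes a column name in the resulting aggregate DataFrame;
# several named aggregates come out of a single groupby statement
agg = df.groupby("sector").agg(
    mean_10yr_cash=pd.NamedAgg(column="cash_10yr", aggfunc="mean"),
    max_10yr_cash=pd.NamedAgg(column="cash_10yr", aggfunc="max"),
)
```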


Results Overview


Company Cash Aggregates
Company Metrics Aggregates
Dividend Per Share Growth Aggregates
Debt Growth Percent Aggregates

Graphing Pandas Results


To better illustrate the example aggregate output, I decided to graph the results using matplotlib and seaborn. Sometimes, a visual provides contextual information that is only implied when looking at the output calculations.


Pandas Creating Graphs

The code block below contains comments that give you context on which result sets are plotted. It's reasonably straightforward if you have some knowledge of matplotlib; otherwise, you may need to look up some syntax. Even so, it's fairly intuitive if you know how to create graphs, since both libraries use familiar graphing terminology.
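A minimal matplotlib sketch of the bar and line plots described here, using invented aggregate values in place of the real results (seaborn's `barplot`/`heatmap` offer similar one-liners for the correlation matrix):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical aggregate output standing in for the groupby results
agg = pd.DataFrame(
    {"mean_10yr_cash": [12.0, 6.0], "mean_dps_growth": [0.06, 0.03]},
    index=pd.Index(["Tech", "Energy"], name="sector"),
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar graph of the cash aggregates, one bar per sector
agg["mean_10yr_cash"].plot.bar(ax=ax1, title="Company Cash Aggregates")

# Line graph of dividend-per-share growth across sectors
agg["mean_dps_growth"].plot.line(ax=ax2, marker="o",
                                 title="Dividend Per Share Growth")

fig.tight_layout()
fig.savefig("aggregates.png")
```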


Correlation Matrix

Bar Graphs

Line Graphs

Conclusion


Hopefully, I have provided some insight into a potential analytics use case with Pandas and Apache Iceberg. There are other variations that I will likely explore in the near future, starting with Apache Spark and Iceberg. In this scenario, source data is read from Iceberg, aggregates and calculations occur in Spark, and the resulting output is written back to Iceberg. Thank you for reading.


Pandas Waving Bye

