
Unlocking Data Insights with Python Pandas & Apache Iceberg

  • Writer: Claude Paugh
  • May 11
  • 3 min read

Updated: Aug 7

Introduction to Pandas


As far as Python packages go, Pandas has been one of the most popular since its initial release. Its "sweet spot" lies in providing labeled, two-dimensional data structures to represent data, a.k.a. tables: the DataFrame. It supports many basic-to-advanced calculations and aggregations commonly used by those who have "graduated" from, or simply never preferred, spreadsheets. As Data Science has developed over the years, Pandas' popularity has grown along with it.


Pandas relies heavily on the NumPy package for its arrays, dtypes, and math functions. You can filter, slice, subset, and extend DataFrames using a variety of techniques, and with some planning, several DataFrames together can mimic a multi-dimensional cube. While not necessarily cohesive, the approach can be functionally effective.
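As a quick illustration of those operations (the data here is made up purely for the example):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "sector": ["Tech", "Energy", "Tech"],
    "revenue": [10.0, 7.5, 12.3],
})

tech = df[df["sector"] == "Tech"]        # filter: boolean mask on a column
first_two = df.iloc[:2]                  # slice: positional row selection
subset = df.loc[:, ["revenue"]]          # subset: select specific columns
extended = df.assign(                    # extend: add a derived column
    revenue_m=df["revenue"] * 1_000_000
)
```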


Pandas Reading Files

Reading Data with Pandas


Pandas offers various "read_*" methods that take an input and create a DataFrame from that input data. For example, it can read from CSV, JSON, Parquet, delimited files, SQL statements/result sets, Apache ORC, HDF5, buffers, pickle files, and HTML tables, plus lists or dictionaries in Python (Apache Avro needs a third-party reader), to name several sources from which Pandas can digest data.
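A minimal example of one such reader; the same pattern applies to `read_json`, `read_parquet`, and the rest (the CSV content here is invented for illustration):

```python
import io
import pandas as pd

# An in-memory buffer stands in for a file on disk
csv_buffer = io.StringIO("Ticker,Price\nAAPL,190.5\nMSFT,410.2\n")
df = pd.read_csv(csv_buffer)  # one call: input -> DataFrame
```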


Pandas Named Aggregates


After defining and loading the Iceberg tables, as discussed in my previous post, going through an analytics exercise with Pandas seemed like a good test of something that could represent an actual use case. I started with the basics of retrieving data.
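The scan itself might look roughly like the sketch below. The catalog name, namespace, table, and column names are all hypothetical; only the PyIceberg calls come from the library.

```python
def scan_technology_usd():
    """Sketch: scan an Iceberg table with pyiceberg expression filters."""
    # Imports kept local so the sketch stands alone; requires pyiceberg
    from pyiceberg.catalog import load_catalog
    from pyiceberg.expressions import EqualTo

    catalog = load_catalog("local")                   # hypothetical catalog name
    table = catalog.load_table("financials.company")  # hypothetical namespace.table

    # Apply filters: the first row_filter is passed to Table.scan(),
    # the second is applied to the resulting DataScan object
    scan = table.scan(row_filter=EqualTo("sector", "Technology"))
    scan = scan.filter(EqualTo("currency", "USD"))

    return scan.to_pandas()  # materialize the filtered rows as a DataFrame
```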


The "Apply filters..." comment line above highlights an important aspect of PyIceberg. You can see from the import of pyiceberg.expressions that several expression types have been imported for use. There are essentially four expressions used for both numeric and string evaluation. The first row_filter targets only the technology sector by using the "EqualTo" expression from the import. The second filter limits the rows to only USD currency. The syntax differs slightly because the objects used are different. The first object is a table being scanned and filtered, while the second is a DataScan object being filtered.


Pandas Calculating Totals

The next section performs type conversion for Pandas, and DataFrames are merged to create new objects. I used a Pandas filter on the company_df object to contrast with the filter above using the pyiceberg.expressions methods. Every column in the DataFrame is typed as an object. To perform numeric calculations, I need to convert the columns to a numeric type (float). I also perform another scan on the Iceberg "forecasts" table to filter for 'Dividend_Per_Share_Growth' and then convert the output to Pandas.
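A sketch of the Pandas-side filter and the object-to-float conversion, with invented column names and values standing in for the real Iceberg output:

```python
import pandas as pd

# Hypothetical stand-in for company_df as it arrives from to_pandas():
# every column is object-typed
company_df = pd.DataFrame(
    {
        "Ticker": ["AAPL", "MSFT"],
        "Sector": ["Technology", "Technology"],
        "Free_Cash_Flow": ["99.8", "65.1"],
    },
    dtype=object,
)

# Pandas-side filter, for contrast with the pyiceberg.expressions approach
tech_df = company_df[company_df["Sector"] == "Technology"]

# Convert object columns to float before doing any numeric calculations
company_df["Free_Cash_Flow"] = company_df["Free_Cash_Flow"].astype(float)
```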


The last section in the code merges DataFrames using the 'Ticker' column and drops duplicates using the same column as a subset. I reset the index; otherwise, my next operations do not work. For SQL users, the merge on a common column using "how='inner'" is functionally similar to an inner join between two tables in an RDBMS. The results of these operations will provide an extended dataset for further use.
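The merge, de-duplication, and index reset described above can be sketched like this (DataFrames and column values invented for the example, with 'Ticker' as the common column):

```python
import pandas as pd

company_df = pd.DataFrame(
    {"Ticker": ["AAPL", "MSFT", "AAPL"], "cash": [1.0, 2.0, 1.0]}
)
forecast_df = pd.DataFrame(
    {"Ticker": ["AAPL", "MSFT"], "dps_growth": [0.05, 0.07]}
)

# how="inner" behaves like a SQL inner join on the 'Ticker' column
merged = company_df.merge(forecast_df, on="Ticker", how="inner")

# Drop duplicates on the same column, then reset the index
# so later positional operations work on a clean 0..n-1 index
merged = merged.drop_duplicates(subset=["Ticker"]).reset_index(drop=True)
```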


The next section involves more column typing to float, so I will skip that and move to the Pandas "Named Aggregates." As you can see below, using the Pandas alias from the import, "pd," named aggregates are created.


The results for each aggregate are calculated using a DataFrame containing the column(s) and adding a group by. The named results, e.g., "mean_10yr_cash," will be the column names for the resulting aggregate DataFrame. You will notice below that it's possible to get multiple named aggregate results from a single DataFrame group by; essentially, one statement.
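In sketch form, with hypothetical column names echoing the "mean_10yr_cash" style above, multiple named aggregates from one groupby look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["Tech", "Tech", "Energy"],
    "cash_10yr": [10.0, 14.0, 6.0],
})

# Each keyword becomes a column name in the resulting aggregate DataFrame;
# several named aggregates come out of a single groupby statement
agg = df.groupby("sector").agg(
    mean_10yr_cash=pd.NamedAgg(column="cash_10yr", aggfunc="mean"),
    max_10yr_cash=pd.NamedAgg(column="cash_10yr", aggfunc="max"),
)
```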


Results Overview


Company Cash Aggregates
Company Metrics Aggregates
Dividend Per Share Growth Aggregates
Debt Growth Percent Aggregates

Graphing Pandas Results


To better illustrate the example aggregate output, I decided to graph the results using matplotlib and seaborn. Sometimes, a visual provides contextual information that is only implied when looking at the output calculations.


Pandas Creating Graphs

The code block below contains comments that give you context on which result sets are plotted. It's reasonably straightforward if you have some knowledge of matplotlib; otherwise, you may need to look up some syntax. Even so, it's fairly intuitive if you know how to create graphs, since both libraries use familiar graphing terminology.
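A minimal matplotlib sketch of the bar and line plots described here, using invented aggregate values in place of the real results (seaborn's `barplot`/`heatmap` offer similar one-liners for the correlation matrix):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical aggregate output standing in for the groupby results
agg = pd.DataFrame(
    {"mean_10yr_cash": [12.0, 6.0], "mean_dps_growth": [0.06, 0.03]},
    index=pd.Index(["Tech", "Energy"], name="sector"),
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar graph of the cash aggregates, one bar per sector
agg["mean_10yr_cash"].plot.bar(ax=ax1, title="Company Cash Aggregates")

# Line graph of dividend-per-share growth across sectors
agg["mean_dps_growth"].plot.line(ax=ax2, marker="o",
                                 title="Dividend Per Share Growth")

fig.tight_layout()
fig.savefig("aggregates.png")
```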


Correlation Matrix

Bar Graphs

Line Graphs

Conclusion


Hopefully, I have provided some insight into a potential analytics use case with Pandas and Apache Iceberg. There are other variations that I will likely explore in the near future, starting with Apache Spark and Iceberg. In this scenario, source data is read from Iceberg, aggregates and calculations occur in Spark, and the resulting output is written back to Iceberg. Thank you for reading.


Pandas Waving Bye

