Apache Iceberg and Pandas Analytics: Part I
- Claude Paugh
- May 7
- 6 min read
Updated: Jun 24
I generally like to try new things, and technology is no different. So I decided to do some more in-depth research on the mechanics under the covers of Apache Iceberg, and specifically its Python implementation, PyIceberg. I will add an important caveat before going any further: the most feature-rich and robust implementation of Apache Iceberg is currently delivered by Apache Spark and its Iceberg extensions. If you're looking for "best of breed", I would start there.

I was specifically looking at some key items that are usually part of data management practices, regardless of the technology:
Data Quality Controls: can I validate schemas, at minimum with data typing, but preferably with value inspection?
Data Integrity: can I prevent duplicates, enforce keys, and maintain some level of referential integrity when data is linked across data structures?
Data Access: can I manage security with row-level access, and are retrieval and load times robust enough to meet average expectations for users, services, and applications?
Metadata Management: can I have schemas that evolve through multiple versions, with the ability to roll back or undo changes?
If you're brand new to PyIceberg, it's a Python implementation of Apache Iceberg without the need for a JVM. It's open source, so the GitHub repo for PyIceberg is here and community support can be found here. The Iceberg project started as an extension for Apache Spark first, then Flink and Hive.
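If you want to follow along, a minimal getting-started sketch is below; the package extras named are assumptions about which optional dependencies you might want (Arrow and Pandas I/O, plus a SQLite-backed catalog), so adjust them to your environment.

```python
# Install (shell): pip install "pyiceberg[pyarrow,pandas,sql-sqlite]"
# The extras are optional; pick the ones that match your catalog and I/O needs.

import pyiceberg

# Confirm the library imports and check which version you are running
print(pyiceberg.__version__)
```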
Apache Arrow Foundation
Its data I/O mechanics are based on Apache Arrow, a project launched to define a specification for tabular data and its interchange between systems, specifically in the area of in-memory analytics. Arrow was initially written in C++, and then ported to several other languages (see Implementations from the home page link above).
At the bottom of the home page link, you will see additional links to cookbooks for C++, Java, Python, and R to get a jump-start on those language implementations. Ports to popular languages and toolkits such as C#, Rust, Ruby, and MATLAB are available as well.
If you currently use, or plan to use, the Apache Parquet or ORC formats, you will probably find that libraries from cloud vendors and query engines source and wrap the Arrow libraries as a foundation for each provider's specific extensions.
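To make the Arrow foundation concrete, here is a minimal sketch of building an in-memory Arrow table and round-tripping it through Parquet with pyarrow; the column names and file path are just illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory, columnar Arrow table (column names are illustrative)
arrow_table = pa.table({
    "ticker": ["AAPL", "MSFT", "AAPL"],
    "close": [189.84, 415.10, 190.12],
})

# Persist it as Parquet and read it back; these are the same Arrow/Parquet
# building blocks that PyIceberg wraps for its I/O
pq.write_table(arrow_table, "prices.parquet")
print(pq.read_table("prices.parquet").schema)
```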
AI/ML Extensions
There are CUDA extensions for the AI/ML community, but I would hope you stay away from performing data transformation and manipulation on expensive GPU resources; keeping those tasks on CPUs is more cost-effective and processing-efficient.
I think the lack of separation of concerns for these data-focused tasks is one of the demand drivers for GPUs, and one reason CPUs go under-used, in training and running AI/ML models. It's not so much a lack of recognition that the separation is required, but the time it takes to engineer and deliver that separation; i.e. it does not happen.
GPUs are not general-purpose processors like CPUs, and are more suited (at least up to this point) to calculations for algorithms and visualization; they were invented to compute points on a triangle in parallel for graphical display. NVIDIA offers some introduction to the topic of GPU performance. A basic overview of cell-based processors (of which GPUs are one form) from the Ohio Supercomputing Lab (2007) is below:
The specific use cases for cell-based processors are the key take-away, since, development-wise, most engineers have careers focused on a single processor type rather than multiple. The skills for separating workloads onto specific compute resources need to be developed and sharpened (end of interlude).
Apache Iceberg Under the Covers
The overall approach to metadata and data storage is to ensure immutability, which is definitely a plus. Why? In a nutshell, data access concurrency and performance scale better with immutability implementations (when done well), in addition to providing traceability as schema and data evolve. If you're a user of RDBMS products, the familiar data (and metadata) and index fragmentation issue can require vigilance and maintenance cycles. It's especially impactful in analytics (ROLAP) or hybrid (OLTP plus long history) cases, but it disappears with immutable implementations.
As I indicated or implied above, Iceberg wraps and uses several libraries from other Apache projects: Arrow, Parquet, ORC, and Avro. These mainly focus on I/O operations, whether in-memory or file-based persistence. Iceberg also makes considerable use of Pydantic for the metadata implementation in Python.
The "models" for schema structures are built and validated using Pydantic, and metadata objects stored in a catalog using a persistent store. The persistent store can be accessed via REST, a SQL DBMS, In-Memory store, Apache Hive, AWS Glue, or customized. The overview, sourced from https://iceberg.apache.org/spec/ is below to give you a visual on how it's organized. Some additional details to point out in the figure below: The metadata file is an Apache AVRO file, and manifests lists are in JSON format. The schema validation properties in conjunction with Pydantic is probably why that combination was chosen.

Metadata and Data Quality
The metadata organization starts at the namespace level, which is a logical or contextual separation of data. You can do cross-namespace references when accessing data if you need to. Table objects are created "under" a namespace and provide the implementation types for schemas, which you can find here. If you need to store geo-spatial data, Iceberg supports geometry and geography types, offering support for a CRS (coordinate reference system) and edge interpolation in the metadata layer. Data partitioning and row lineage are also available, and the number of records kept to detail lineage can be set when creating the catalog.
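To tie the namespace, table, and type concepts together, here is a minimal sketch that declares a schema with required fields and creates a table under a namespace; it reuses the `catalog` from the earlier sketch, and while the names (`docs`, `forecast`) mirror the Part II example, the columns themselves are placeholders.

```python
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType, DoubleType

# "required=True" gives NOT NULL-style enforcement at the metadata level
schema = Schema(
    NestedField(field_id=1, name="ticker", field_type=StringType(), required=True),
    NestedField(field_id=2, name="as_of", field_type=TimestampType(), required=True),
    NestedField(field_id=3, name="forecast_eps", field_type=DoubleType(), required=False),
)

catalog.create_namespace("docs")                      # logical separation of data
table = catalog.create_table("docs.forecast", schema=schema)
print(table.schema())
```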
To better illustrate the diagram above, I have included the console listing of files for the Iceberg tables I created for the example implementation in Part II. I created a "warehouse" folder on my device, and the listing starts under the "docs" namespace and drills down through the "forecast" data and metadata.


Pydantic's implementation, especially when using "strict" mode (for metadata and data), covers the "Metadata" and "Data Quality" bullet points at the beginning of the article. You also get schema versioning controls in the metadata, down to the column level, and it has very good mechanisms to help you with schema evolution. Iceberg metadata does provide many of the variations an RDBMS implements as "constraints" (not the literal table DDL). You get type validation, required-field designation, and inspection for size and precision, though it lacks the ability to apply a custom list of values to specific fields.
In my experience, list-of-values constraints tend to be more common in OLTP design patterns than in OLAP (analytics) cases for an RDBMS, and Apache Iceberg is clearly designed for analytics use cases, not transaction processing. Overall it's a good start on data quality and metadata. The default-value attributes do provide a way to implement a surrogate "key" equivalent, as does the use of "identifier" field IDs.
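As a small illustration of the schema-evolution mechanics, the sketch below adds a column and renames another in a single metadata commit; it assumes the `table` created in the previous sketch, and the column names are again placeholders.

```python
from pyiceberg.types import DoubleType

# Schema evolution is a metadata-only change; prior snapshots keep the old schema version
with table.update_schema() as update:
    update.add_column("forecast_revenue", DoubleType(), doc="analyst revenue estimate")
    update.rename_column("forecast_eps", "eps_estimate")

print(table.schema())  # the evolved schema, tracked down to the column level
```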
It does not offer an equivalent of a primary or unique key, which is a database constraint type that ensures uniqueness regardless of whether an associated index exists. There is no user-managed indexing either, but if you're an existing user of any of the file formats mentioned previously, that's nothing new, and it's not expected to be there. Ensuring uniqueness is up to the "user". You can enforce uniqueness within a struct, for example, but that's up to you to deliver; Iceberg will not do it for you.
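Since uniqueness is left to the caller, one pattern is to de-duplicate on your logical key before appending. A sketch, under the assumption that the `table` from the earlier sketches exists and that you are on a PyIceberg version with `Table.append` (0.6+):

```python
import pandas as pd
import pyarrow as pa

# Incoming batch with a duplicate row on the logical key (ticker, as_of)
incoming = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "as_of": pd.to_datetime(["2024-06-30", "2024-06-30", "2024-06-30"]),
    "eps_estimate": [6.1, 6.1, 11.2],
    "forecast_revenue": [390.0, 390.0, 245.0],
})

# Iceberg will not reject duplicates, so enforce the "key" ourselves before writing
deduped = incoming.drop_duplicates(subset=["ticker", "as_of"])

# Cast to an Arrow schema that lines up with the Iceberg table (non-nullable
# required fields, microsecond timestamps) before appending
arrow_schema = pa.schema([
    pa.field("ticker", pa.string(), nullable=False),
    pa.field("as_of", pa.timestamp("us"), nullable=False),
    pa.field("eps_estimate", pa.float64()),
    pa.field("forecast_revenue", pa.float64()),
])
table.append(pa.Table.from_pandas(deduped, schema=arrow_schema, preserve_index=False))
```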
Row-Level Security
The challenging areas for Apache Iceberg are RWA (row-level access) security and data integrity. I am evaluating what is "built in" to the libraries, not how you could construct a solution to those problems using additional tools.
RWA security is not built in to the libraries. There are some very nice row and metadata lineage capabilities that can help you diagnose data-quality issues and provide change-over-time views; the latter could be very useful in a time-series use case. But RWA security tends to be a corporate security requirement; hopefully LDAP integration shows up at some point to enable it.
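While row-level security itself is not there, the snapshot history behind those lineage and change-over-time capabilities is easy to inspect; a minimal sketch, again assuming the `table` from the earlier sketches has had at least one append.

```python
# Every commit produces an immutable snapshot; list them with their summaries
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.summary)

# Time travel: read the table as of its first recorded snapshot
first = table.snapshots()[0]
df_then = table.scan(snapshot_id=first.snapshot_id).to_pandas()
print(df_then.shape)
```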
Data Integrity
Data integrity capabilities are not really a focus for Iceberg, perhaps because the trend in AI/ML data access has been to de-normalize tables so that foreign keys are not required, i.e. one big table. The need does come up in more traditional analytics use cases where multiple tables require relationship links (physical or virtual), so it would be a helpful addition for those cases, even if only tracked informationally in the metadata to give validation a boost. It's an overlooked area in Iceberg.
Data Access
Users of any of the Arrow-based formats mentioned previously know what they are getting with each one. For large amounts of data storage, Parquet tends to be used for large-to-very-large datasets, especially if partitioning is needed to keep query response times reasonable. I did a small implementation, which I will detail in Part II of the article, where simple response times are recorded. But as many know, Parquet, especially if/when partitioned well, is very responsive to query requests. Its columnar format is especially well suited to analytics use cases.
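For a feel of the read path, the sketch below runs a filtered, column-pruned scan and materializes it into pandas; the filter and column names are placeholders, and when the filter lines up with the table's partition spec the file pruning happens against metadata before any Parquet is read.

```python
# Filtered, column-pruned read; PyIceberg prunes manifests/files using metadata first
scan = table.scan(
    row_filter="ticker == 'AAPL'",
    selected_fields=("ticker", "as_of", "eps_estimate"),
)

df = scan.to_pandas()   # or scan.to_arrow() to stay in Arrow for further processing
print(df.head())
```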
For the next part of the article, I am going to walk through a simple implementation using Apache Iceberg and financial data with some Python aggregations.