Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!

Tabular
6 min read · May 11, 2022

If you’re interested in Iceberg because you heard it solves a problem, such as schema evolution or row-level updates, and you want an easy way to test it out, then you’re in the right place! This post will get you up and running with Spark and Iceberg locally in minutes. It also highlights many of the incredible Iceberg features that solve data warehousing problems you’ve seen before.

Iceberg provides libraries for interacting directly with tables, but those are too low level for most people. Most of the time, you’ll interact with Iceberg through a compute engine, like Spark, Trino, or Flink.

Let’s start by running a local Spark instance with Iceberg integration using Docker. In this environment you’ll be able to test out Iceberg’s many features like time travel, safe schema evolution, and hidden partitioning.

First, install Docker and Docker Compose if you don’t already have them. Next, create a docker-compose.yaml file with the following content.

docker-compose.yaml
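The exact contents of this file have evolved along with the tabulario/spark-iceberg image, so treat the sketch below as illustrative: it wires a Spark-plus-notebook container to a Postgres container that backs the JDBC catalog, and the image tags, credentials, and volume paths may differ from what the image currently expects.

    version: "3"

    services:
      spark-iceberg:
        image: tabulario/spark-iceberg
        container_name: spark-iceberg
        depends_on:
          - postgres
        volumes:
          - ./warehouse:/home/iceberg/warehouse
          - ./notebooks:/home/iceberg/notebooks/notebooks
        ports:
          - "8888:8888"    # Jupyter notebook server
          - "8080:8080"    # Spark driver UI
          - "18080:18080"  # Spark history server
      postgres:
        image: postgres:13.4-bullseye
        container_name: postgres
        environment:
          - POSTGRES_USER=admin
          - POSTGRES_PASSWORD=password
          - POSTGRES_DB=demo_catalog
        volumes:
          - ./postgres/data:/var/lib/postgresql/data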

In the same directory as the docker-compose.yaml file, run the following commands to start the runtime and launch an Iceberg-enabled Spark notebook server.
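Assuming the compose file above is saved in the current directory, a single command should be enough:

    # pull the images (if needed) and start everything in the background
    docker-compose up -d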

That’s it! You should now have a fully functional notebook server running at http://localhost:8888, a Spark driver UI running at http://localhost:8080, and a Spark history UI running at http://localhost:18080.

Note: As the tabulario/spark-iceberg image evolves, be sure to refresh your cached image to pick up the latest changes by running docker-compose pull.

A Minimal Runtime

The runtime provided by the docker-compose file is far from a large-scale, production-grade warehouse, but it does let you demo Iceberg’s wide range of features. Let’s quickly cover this minimal runtime.

  • Spark 3.1.2 in local mode (the Engine)
  • A JDBC catalog backed by a Postgres container (the Catalog)
  • A docker-compose setup to wire everything together
  • A %%sql magic command to easily run SQL in a notebook cell

The docker environment comes with an “Iceberg — Getting Started” notebook which demos many of the features that are covered in greater depth in this post. The examples you see here are included in the notebook, so that’s also a good place to start learning!
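For example, a first notebook cell might create a small table to experiment with. The catalog name demo and the nyc.taxis schema below mirror the getting-started notebook, but treat them as illustrative; the later SQL snippets in this post reuse this table.

    %%sql

    CREATE TABLE demo.nyc.taxis (
      vendor_id bigint,
      trip_id bigint,
      trip_distance float,
      fare_amount double,
      tpep_pickup_datetime timestamp,
      store_and_fwd_flag string
    )
    USING iceberg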

Schema Evolution

Anyone who’s worked with Hive tables or other big data table formats knows how tricky evolving a schema can be. Adding a column that was previously dropped can resurrect old data from the dead. Even when you’re lucky enough to catch it quickly, undoing the damage is time-consuming and often ends with restoring the table to its original state by hand. Even renaming a column is a potentially dangerous operation.

You shouldn’t need to worry about which changes work and which ones break your table. In Iceberg, schema operations such as renaming or adding columns are safe operations with no surprising side-effects.
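For instance, renaming and adding columns on the illustrative demo.nyc.taxis table are single DDL statements (the column names here are examples only):

    -- Rename a column; readers and writers pick up the new name immediately
    ALTER TABLE demo.nyc.taxis RENAME COLUMN fare_amount TO fare;

    -- Add a new column in a specific position
    ALTER TABLE demo.nyc.taxis ADD COLUMN fare_per_distance_unit float AFTER trip_distance;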

Other schema changes like changing a column’s type, adding a comment, or moving its position are similarly straightforward.
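Continuing with the same illustrative table, those changes might look like:

    -- Widen a float column to double
    ALTER TABLE demo.nyc.taxis ALTER COLUMN trip_distance TYPE double;

    -- Attach documentation to a column
    ALTER TABLE demo.nyc.taxis ALTER COLUMN trip_distance COMMENT 'The elapsed trip distance in miles';

    -- Move a column to the front of the schema
    ALTER TABLE demo.nyc.taxis ALTER COLUMN fare_per_distance_unit FIRST;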

Furthermore, schema evolution changes are metadata-only operations, which is why they execute so quickly. Gone are the days of rewriting entire tables just to change the name of a single column!

Partitioning

Changing the partition scheme of a production table is always a big deal. Unless you’re using Iceberg, the process is arduous and requires migrating to an entirely new table with a different partition scheme. Historical data in the table must be rewritten to the new partition scheme, even if you’re only interested in changing the partitioning of newly written data. The hardest part, arguably, is chasing down the table’s consumers to remind them to update the partition clause in all of their queries! Iceberg handles this seamlessly: a table’s partitioning can be updated in place and applied only to newly written data.
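For example, the illustrative table could start partitioning new data by day of the pickup timestamp with a single statement; no existing files are rewritten:

    -- Apply a new partition spec to data written from now on
    ALTER TABLE demo.nyc.taxis ADD PARTITION FIELD days(tpep_pickup_datetime);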

Query plans are then split, using the old partition scheme for data written before the partition scheme was changed, and using the new partition scheme for data written after. People querying the table don’t even have to be aware of this split. Simple predicates in WHERE clauses are automatically converted to partition filters that prune out files with no matches. This is what’s referred to in Iceberg as Hidden Partitioning.
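In practice, that means an ordinary query like the sketch below is pruned down to the matching partitions without ever mentioning a partition column:

    -- The timestamp predicate is converted into partition filters automatically
    SELECT count(*)
    FROM demo.nyc.taxis
    WHERE tpep_pickup_datetime BETWEEN TIMESTAMP '2022-05-01 00:00:00'
                                   AND TIMESTAMP '2022-05-02 00:00:00';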

Time Travel and Rollback

As you change your table, Iceberg tracks each version as a “snapshot” in time. You can time travel to any snapshot or point in time. This is useful when you want to reproduce results of previous queries, for example to reproduce a downstream product such as a report.
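Each table exposes metadata tables you can query to find a snapshot or timestamp to travel to; the snippet below is illustrative, and the exact time-travel syntax depends on your Spark and Iceberg versions:

    -- List the snapshots and history recorded for the table
    SELECT committed_at, snapshot_id, operation FROM demo.nyc.taxis.snapshots;
    SELECT * FROM demo.nyc.taxis.history;

    -- On newer Spark/Iceberg versions, SQL time travel looks like this
    -- (on older versions, use the DataFrame reader options 'snapshot-id'
    -- or 'as-of-timestamp' instead):
    -- SELECT * FROM demo.nyc.taxis TIMESTAMP AS OF '2022-05-11 10:00:00';
    -- SELECT * FROM demo.nyc.taxis VERSION AS OF 1234567890123456789;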

When complicated logic depends on many upstream tables that are frequently updated, it’s sometimes ambiguous whether a change in query results comes from the code change you’re testing or from updates to the upstream tables. Using time travel, you can ensure that you’re querying all of the tables as of a specific time. This makes creating a controlled environment much easier.

If instead of time traveling for a single query at runtime, you want to actually roll back the table to that point in time or to a specific snapshot ID, you can easily achieve that using the rollback procedure!
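For example, rolling back to a snapshot ID found in the snapshots metadata table might look like this (the ID below is made up):

    -- Restore the table to the state it had at the given snapshot
    CALL demo.system.rollback_to_snapshot('nyc.taxis', 1234567890123456789);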

Expressive SQL for Row-Level Changes

Iceberg’s SQL extensions allow for very expressive queries that perform row-level operations. For example, you can delete all records in a table that match a specific predicate.
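A sketch against the illustrative table (the predicate is arbitrary):

    -- Remove all rows with a suspiciously large fare
    DELETE FROM demo.nyc.taxis WHERE fare > 100.0;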

Also, the MERGE INTO command makes the common task of merging two tables very intuitive.
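For example, upserting changes from a staging table (the demo.nyc.taxis_staging name and join key are hypothetical):

    MERGE INTO demo.nyc.taxis t
    USING (SELECT * FROM demo.nyc.taxis_staging) s
    ON t.trip_id = s.trip_id
    WHEN MATCHED THEN UPDATE SET *     -- update existing trips in place
    WHEN NOT MATCHED THEN INSERT *;    -- insert brand new trips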

Atomic CTAS and RTAS Statements

How often have you needed to create a table that exactly matched the results of a query? Iceberg ensures that CTAS (short for “CREATE TABLE AS SELECT”) or RTAS (short for “REPLACE TABLE AS SELECT”) statements are atomic. If the SELECT query fails, you aren’t left with a partial table that you have to quickly drop before someone queries it!
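A sketch of both, using a hypothetical demo.nyc.taxis_sample table:

    -- Create a new Iceberg table from a query, atomically
    CREATE TABLE demo.nyc.taxis_sample
    USING iceberg
    AS SELECT * FROM demo.nyc.taxis WHERE vendor_id = 1;

    -- Atomically replace its contents (and schema) with a new query result
    REPLACE TABLE demo.nyc.taxis_sample
    USING iceberg
    AS SELECT * FROM demo.nyc.taxis WHERE vendor_id = 2;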

Spark Procedures

We covered the rollback_to_snapshot procedure, but Iceberg includes many more procedures that allow you to perform all sorts of table maintenance. You can expire snapshots, rewrite manifest files, or remove orphan files, all using intuitively named Spark procedures. To read more about all of the available procedures, check out the Spark Procedures section in the Iceberg documentation.
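A few illustrative maintenance calls (the argument values are examples only):

    -- Clean up snapshots older than a given timestamp, keeping at least 5
    CALL demo.system.expire_snapshots(table => 'nyc.taxis', older_than => TIMESTAMP '2022-05-01 00:00:00', retain_last => 5);

    -- Compact and rewrite the table's manifest files
    CALL demo.system.rewrite_manifests('nyc.taxis');

    -- Remove files that are no longer referenced by any table metadata
    CALL demo.system.remove_orphan_files(table => 'nyc.taxis');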

Support for Iceberg is continuing to grow rapidly, with Amazon recently announcing built-in support for Iceberg in EMR and Snowflake announcing support for connecting to Iceberg tables in response to significant customer demand. Likewise, the community is growing bigger and stronger every day, with contributors from many different industries bringing exciting and unique use cases, perspectives, and solutions. If you have thoughts on any of the features discussed or just want to say hello, check out our community page for all of the ways you can join in on the conversation. I hope to see you there!

About the Authors

Samuel Redai (LinkedIn, GitHub)

Sam has worked with data in a wide range of environments, from medical research labs and hospital supply chain planning systems to corporate insurance companies and video streaming platforms.

Kyle Bendickson (LinkedIn, GitHub)

Kyle is passionate about delivering simple solutions to complex problems. He’s particularly passionate about distributed systems and real-time streaming data that can serve both application and analytics purposes. Having always been “SRE-Adjacent”, he’s also often focused on Kubernetes and (re)architecting applications for the cloud. He’s also obsessed with cost savings, but more obsessed with his dog, Hank. He got his start in software development in the online dating space, working on search, recommendations, and designing features with user safety in mind.

Originally published at https://tabular.io.

