Iceberg in Modern Data Architecture

Tabular
6 min readJul 6, 2023

--

by Ryan Blue

The last two weeks have seen several major announcements about Apache Iceberg:

  1. Google announced that Apache Iceberg support in BigLake and BigQuery is now generally available (GA), calling Iceberg “the standard for organizations building open-format lakehouses.”
  2. Snowflake announced major performance improvements for Iceberg tables, along with future REST catalog support for working in other frameworks or engines.
  3. Databricks announced future support for Iceberg as a compatibility layer — tacitly acknowledging the momentum Iceberg is gaining in the larger marketplace.

These are radical developments. Commercial warehouse engines seldom add support for new data formats, and it’s all the more noteworthy that these vendors are doing it at the same time. People are getting interested in Apache Iceberg, fast.

With so many new people hearing about Iceberg and trying to learn more, I think it’s a good time to revisit the Iceberg primer I wrote in August of 2022 on Fivetran’s blog for people using data warehouses and the modern data stack. In this post, I’ll take a deeper look at what Iceberg is and why it is being built into the foundation of the modern data architecture.

What is Apache Iceberg?

Apache Iceberg is a modern table format for analytic tables, created by my team at Netflix and later adopted at companies like Apple, LinkedIn, Stripe, Airbnb, Pinterest, and Expedia. At Netflix, we used Iceberg to transform our data lake into a cloud-native data warehouse by building the guarantees of SQL into data lake tables.

If you’re looking at Iceberg from a data lake background, its features are impressive: queries can time travel, transactions are safe so queries never lie, partitioning (data layout) is automatic and can be updated, schema evolution is reliable — no more zombie data! — and a lot more.

On the other hand, if you’re coming from a data warehouse or modern data stack background, that probably sounds horrible! Data lakes can’t rename columns? Queries might just give wrong answers? Partitioning is a manual process that people mess up all the time? Sigh. Yes.

The flaws can be managed, and data lakes have benefits that make them attractive. The main benefit is flexibility: there is a wealth of projects and processing frameworks. For example, Spark and Trino come from the data lake world and are commonly used in the same architecture because they’re good at different things. Trino prioritizes speed and is great for ad hoc SQL queries across multiple sources. Spark prioritizes reliability and can mix Python, Scala, and SQL together to handle any workload. And there are more examples, such as Flink for streaming or Python DataFrames that run on GPUs.

Iceberg’s appeal is that you aren’t forced to choose between data lakes and data warehouses. It brings together the best capabilities of both: you can be productive and rely on the guarantees of a data warehouse using any processing tool or query engine, including any built for data lakes.

Why add Iceberg to data warehouses?

It’s easy to see how data lakes are improved by Iceberg’s warehouse features — nobody likes unreliable queries — but warehouses already have mature tables with standard SQL behavior that deliver great performance. So what’s the compelling reason to build Iceberg into data warehouses?

I think the answer is the incredible demand for different ways to work with data. People want the right tool for the job and to use what they’re comfortable and productive with. They expect to be able to load data into hundreds of Python containers in parallel to test model parameters, or use a streaming framework to easily sessionize events, or query the same tables from a BI tool as they are consumed by Spark. These tasks are not easy to do in a data warehouse.

Warehouses are built to accept SQL queries and quickly produce manageable-sized results. They do this really, really well. But a crucial assumption is baked into this model: that the query layer handles all access to data. This is how access controls are enforced, how queries are isolated from concurrent changes, and how features such as result set caching and background data maintenance are enabled.

What you gain when doing away with a query layer

Having a single way to access data is a useful assumption when building a data warehouse, but it is also an Achilles heel. With Iceberg, warehouses no longer need a runtime query layer to deliver safe and correct queries.

To understand the trade-off, think of a data warehouse like a combined bakery/sandwich shop, where the sandwich shop (query layer) is the only way to get bread from the bakery (storage layer). Customers wait a little while for someone to take their order, make the food, and check them out. This works great when a steady stream of people want lunch. The sandwich shop employees make sandwiches and ensure that each customer is checked out.

But the sandwich line breaks down quickly in other situations. What if a person in line orders 50 sandwiches? And what if a deli opens next door and wants bread from the bakery? It would make no sense to send deli employees to stand in the sandwich line to buy loaves of bread as needed; the bakery needs a different way to handle catering and wholesale.

A query layer limits flexibility in similar ways:

If you spun up 1,000 Python jobs at once to read the same warehouse table, chances of success are low. This is the “thundering herd” problem. You can fix it by scaling up query capacity (more employees working the counter), but making it work is not simple or cheap.

Stream processing is at odds with a query-centric view of data, so it must depend on custom integration and protocols or on messy polling and deduplication.

Stacking full query engines on one another duplicates work and compounds their flaws. If you used Trino to load data from a Spark query, it would be slower and no more reliable! This is like sending a deli employee to buy bread in the sandwich line while customers are waiting.

By building support for Iceberg, data warehouses can skip the query layer and share data directly.

Iceberg was built on the assumption that there is no single query layer. Instead, many different processes all use the same underlying data and coordinate through the table format along with a very lightweight catalog. Iceberg enables direct data access needed by all of these use cases and, uniquely, does it without compromising the SQL behavior of data warehouses.

How does modern data architecture change?

What does it mean to build a warehouse that is independent of the query layer, and how does that change data architecture?

To start with the most obvious, modern data platforms get a lot more options. An independent warehouse lets you use the processing pattern or query layer that is best for your task and for your team. If you want to stream data through Flink, train models in Spark, and run BI on top of Snowflake, you can! All the projects from the data lake space can now operate reliably on the same warehouse and can be brought into a unified data architecture, without maintaining pipelines to copy data in or out for them.

This means you can use your tool of choice — Pandas, Trino, Snowflake, Spark, and others — with data cleansed, normalized, and delivered to your warehouse by other technologies of your choosing. Integrating all of your data sources and getting insights from them is easier and faster than ever before.

Next, I think the role that data warehouses play today will evolve into two separate functions: a query layer and a storage layer. Returning to the bakery/sandwich shop analogy, say the bakery now supplies a deli and a bánh mì place. Now it needs to be good at making rye and baguettes! Similarly, there’s a lot more for the storage layer to do, not least of which are unsolved challenges like consistent security controls or data maintenance and automation. (Quick note: an independent storage platform with built-in role-based access controls and automatic optimization is what we’re building at Tabular.)

On top of needing new capabilities, does it still make sense for our hypothetical bakery to be coupled with its sandwich shop? Probably not; the owner could choose to sell only stale leftovers to sandwich shop competitors to drive more sweet PB&J business. I think there are conflicts of interest that will naturally lead to separating not only query and storage responsibilities but also query and storage businesses.

This highlights another critical reason for Iceberg’s adoption in the modern data stack: it is an open standard governed by the Apache Software Foundation. Iceberg is the result of collaboration from a broad and vibrant community of companies such as Netflix and Apple, and this community is trusted by vendors such as Google, AWS, Snowflake, Fivetran, Tabular, Starburst, Dremio, Cloudera, and more. That trust is incredibly important because it means the companies that make up the modern data stack can confidently invest time and R&D into Iceberg, as well as build support for it in their products.

--

--

Tabular

Tabular is building an independent cloud-native data platform powered by the open source standard for huge analytic datasets, Apache Iceberg.