The Case for Independent Storage

Tabular
Sep 21, 2023

by Ryan Blue

Analytic databases are quietly going through an unprecedented transformation — one that will fundamentally change the industry for the better, by freeing data that’s being held hostage. This change is coming, but when is an open question.

Let me explain.

Databases are made up of two equally important parts: a storage layer and a query layer (compute). These two layers have always been tightly coupled because the capabilities and performance of the combined product rely on both working together.

The storage layer is a system for organizing and tracking data — its responsibility is to create shortcuts and opportunities for fast operations. The query layer satisfies requests for data — it finds the most clever way to fetch, transform, and summarize and then executes that plan. The query layer relies on all the shortcuts the storage layer can provide to avoid needless work and return results quickly.
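
One common kind of shortcut is per-file statistics that let the query layer skip files that cannot match a filter. The sketch below is only an illustration of that idea, not Iceberg's or any engine's actual API; the DataFile class, the plan_scan function, the timestamp column, and the file paths are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """One data file plus min/max statistics tracked by the storage layer."""
    path: str
    min_ts: int  # smallest value of a hypothetical timestamp column in the file
    max_ts: int  # largest value of that column in the file


def plan_scan(files: list[DataFile], lo: int, hi: int) -> list[DataFile]:
    """Keep only files whose [min_ts, max_ts] range overlaps the filter [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]


# Hypothetical table metadata: three files with non-overlapping ranges.
files = [
    DataFile("data/part-0.parquet", 0, 99),
    DataFile("data/part-1.parquet", 100, 199),
    DataFile("data/part-2.parquet", 200, 299),
]

# A filter on timestamps 120..150 only needs the second file.
to_read = plan_scan(files, 120, 150)
print([f.path for f in to_read])  # ['data/part-1.parquet']
```

The query layer never opens the other two files. The better the metadata the storage layer keeps, the more work the query layer can avoid.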

Innovations in the storage layer create new shortcuts and opportunities that can kick off new eras in the industry. Take, for example, column-oriented layouts, which led to columnar and vectorized engines, or the technical separation of storage and compute, which led to the latest generation of cloud-native warehouses that scale each independently.

Open table formats are the next foundational innovation. These make it possible to build universal storage that is shared in place (zero-copy) between as many different compute frameworks as you need — from high-performance SQL engines to stream-processing applications and Python scripts. My previous post on Iceberg in Modern Data Architecture outlines why everyone is racing to add support for open table formats. This post is about how the industry will change, with all the usual long-term optimism interspersed with some words of warning.

Building toward modular data architecture

Let’s start with optimism.

Shared storage is a fundamentally better model than a tightly-coupled data stack, primarily because it delivers the flexibility to use multiple compute frameworks. Today, people are using Apache Iceberg to solve longstanding pain points with their current data infrastructure — such as:

  • Data is siloed — to use the same dataset across engines, it must be copied. Copies take up space, require sync jobs and maintenance, and — worst of all — can easily become stale. No one likes copies, especially when they can’t be trusted.
  • Testing is costly — siloed data sets a high bar to use or even test alternative query layers. Maybe a different engine would be faster, but is it worth the effort of copying tables just to try?
  • Migration is risky — if another engine does work better, is it practical to move the data? When other pipelines use the same tables, the choice is either to move everything or attempt to keep copies in sync.

The pull of data gravity is strong and creates a natural lock-in that can be paralyzing. Creating a universal storage layer breaks down the silos and enables you to use the right tool for the job, and adopt new tools more easily.

This flexibility is a major step forward, but it isn’t enough. You can build architecture that is flexible, but it’s hard. You research, choose components, wire them up, test, deploy, and then repeat the whole process for the next use case.

This is not sustainable. Flexibility isn’t enough — open data architecture needs to be modular. High setup cost was fine when there were just two components, but with many compute options you need to be able to add to and change the architecture without excessive friction. The components should all fit together like Legos. Storage and compute products should be largely interchangeable and easily swapped, by using open standards like Iceberg.

Access controls are a great example of an area where architecture needs to become modular. The status quo is to configure and enforce access policy in the query layer, but in a world of shared storage this makes no sense. It requires managing multiple copies of roles and permissions in query engines that have different models and capabilities. In Python and Spark, there’s no native enforcement at all.

Moving access controls to the storage layer is the path to uniform policy configuration and enforcement. Next, open standards like OAuth2 can make it easy to authorize and connect new components. Both changes make the architecture modular: it’s easy to plug in new pieces with confidence.
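
To make the idea concrete, here is a minimal sketch of policy that lives with storage rather than in each engine. It is an assumption-laden toy, not Tabular's or Iceberg's implementation: the Catalog class, its grant and load_table methods, the "SELECT" privilege string, and the bucket path are all hypothetical. In a real deployment the principal would come from an authenticated token, for example one issued through OAuth2, rather than a plain string.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical storage-layer catalog that owns both tables and access policy."""
    # table name -> location of the table's data
    tables: dict[str, str] = field(default_factory=dict)
    # (principal, table name) -> privileges granted to that principal
    grants: dict[tuple[str, str], set[str]] = field(default_factory=dict)

    def grant(self, principal: str, table: str, privilege: str) -> None:
        """Record a privilege once, in one place, for every engine."""
        self.grants.setdefault((principal, table), set()).add(privilege)

    def load_table(self, principal: str, table: str) -> str:
        """Return the table location only if the principal may read the table."""
        if "SELECT" not in self.grants.get((principal, table), set()):
            raise PermissionError(f"{principal} may not read {table}")
        return self.tables[table]


catalog = Catalog(tables={"sales.orders": "s3://example-bucket/sales/orders"})
catalog.grant("analyst", "sales.orders", "SELECT")

# Every engine goes through the same check; the engine name plays no part in it.
for engine in ("sql-engine", "spark-job", "python-script"):
    print(engine, "->", catalog.load_table("analyst", "sales.orders"))
```

Because the decision is made before any engine receives the table location, a Python script and a SQL warehouse are subject to the same policy, and there is only one copy of the grants to maintain.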

The evolution towards a modular data architecture doesn’t end at access control. Metastore catalogs, governance systems, and even abstractions like views will require modification, with the goal of building a new, modular system that just works.

Like all foundational innovations, open table formats kick off a lot of change, but the end result will be an architecture that is more secure and more extensible. It will also give you more confidence in your data as well as the flexibility to innovate.

The need for neutral storage

As the industry rearranges around universal storage, one of the biggest imperatives for delivering modular components is to ensure that storage remains neutral.

Storage and compute have been inseparable since the beginning of time — at least Unix time — and this has always shaped how analytic databases are designed and sold: with storage and compute as a monolithic product. Moving data is required even to test an alternative product. And migrations to a new product are long, disruptive, and risky. This is the source of lock-in.

To established vendors, this natural lock-in is a feature, not a bug. It’s a cornerstone of the business model that has dominated the industry for decades; let’s call this the Oracle Trap. As you’d expect, now that adoption of open storage standards seems inevitable, there is a race by the major players to control it and cling to their advantage. Storage is the key to the shortcuts and opportunities that compute engines rely on for fast queries. Every data warehouse or data lake provider needs to control storage, either to create an advantage for their own engine or to prevent the competition from doing so.

There are already examples of bad behavior designed to mislead you and build structural advantages. None of this is in the best interest of the customer.

To best serve your interests, storage must be neutral — it needs to help every query layer perform at its best. Decisions at the storage layer must balance trade-offs without preference for a particular product or service. Decisions should maximize overall efficiency or align with your priorities.

Yet neutrality is hard to achieve in practice. Even if you trust a particular company not to tip the scale unfairly, familiarity bias will naturally pull it toward choices based on what its own engine is good at.

Demand independent storage

At the start of this post, I said that when you’ll benefit is an open question; you may not see a change any time soon. That’s because there are two potential future states.

In one, established compute vendors win dominance in open storage. This unlocks the ability to use datasets from other frameworks and even other vendors. This is a step forward, but only an incremental one. Data vendors aren’t neutral: somehow, the best compute engine is always the one operated by your storage vendor. It becomes difficult to leave. Security continues to be a patchwork of different models and gaps.

In the other, open table formats create a new category of independent storage companies, whose independence from compute vendors creates incentives that align closely with customer needs. In this case storage management is neutral and these new storage platforms become the foundation for a modular data architecture. Now it’s easy to test and integrate new engines without moving or copying tables, knowing that security isn’t at risk. Moving a workload to a cheaper engine requires a little testing, not an expensive migration.

This second future state is exactly what we’re building toward at Tabular. Tabular shows its value by saving you time and by creating shortcuts that reduce your cost or make your queries faster. Independent storage is the choice that aligns with customer needs, and the sooner people demand independent storage, the sooner everyone will benefit from this ongoing transformation.

Incumbents will try to convince you that all you need is to use open formats on their platform, and that separate storage won’t be as good, or is read-only. They may be right in the short term, but be wary of arguments that are only true because the vendor makes it so. If you think neutral storage is important, then insist on it.

I strongly believe in independent storage because I’ve seen the benefits first-hand. Our customers routinely see 30–60% savings from automatic tuning, and because we are neutral, these benefits apply across all compute environments. They’ve told us horror stories about tables that weren’t clustered, so every query was a full table scan and needlessly cost an extra $5 million per year. When you buy compute and storage from the same vendor, that company has little incentive to find and fix these types of problems. But that’s exactly what an independent storage vendor can and should do.
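
The clustering problem is easy to reproduce in miniature. The sketch below assumes an engine that skips files using per-file min/max values for the filtered column, and compares the same 10,000 rows written in an order unrelated to that column versus sorted by it. The function names, file size, and filter range are hypothetical, chosen only to illustrate the effect.

```python
def file_ranges(values, rows_per_file):
    """Split values into consecutive files and return each file's (min, max)."""
    chunks = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_scanned(ranges, lo, hi):
    """Count files whose (min, max) range overlaps the filter [lo, hi]."""
    return sum(1 for mn, mx in ranges if mx >= lo and mn <= hi)


# Unclustered: rows arrive interleaved, so every file spans almost the full value range.
unclustered_rows = [v for start in range(100) for v in range(start, 10_000, 100)]
# Clustered: the same rows, sorted by the filter column before being written.
clustered_rows = sorted(unclustered_rows)

unclustered = file_ranges(unclustered_rows, 100)
clustered = file_ranges(clustered_rows, 100)

# The same filter (values 500..599) against both layouts.
print(files_scanned(unclustered, 500, 599), "of", len(unclustered))  # 100 of 100
print(files_scanned(clustered, 500, 599), "of", len(clustered))      # 1 of 100
```

The data and the query are identical in both cases. Only the layout differs, and layout is a storage-layer decision.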

Tabular is the first independent storage company and the only one based on Iceberg. We’re building the foundation for modular architectures with new standards, like the Iceberg REST catalog protocol, that point toward a future in which every component speaks the same language. We’ve built enterprise-grade RBAC that allows you to standardize on one set of policies everywhere. Tabular enables you to quickly and easily set up the tools you want to use.

We have a free tier, so if you’d like to try Tabular out, you can sign up here.
