Zen and the art of CDC performance

Tabular
6 min read · Sep 21, 2023


Ryan Blue

I’ve moved around for work quite a bit in my career, mostly when I worked for the government and was lucky enough to have professional movers. By the time I moved last, the process was familiar, with everything packed up on day 1 and hauled off on day 2. But during the last move there was a new challenge after the packing day, posed by my young kids: How quickly can I find a Frozen II bandaid in all the boxes? And, of course, later: No, not that one! An Elsa one!

Find a box labeled bathroom. Unpack, looking for bandaids. Repack. Repeat. The similarity to write amplification was lost on me at the time, but it’s there. The larger the box, the more random other stuff to unpack and repack. The larger the data file, the more unchanged rows to copy into a new one. In modern architectures based on object stores, files are immutable. It isn’t possible to make small changes to an existing file; even the smallest modification requires rewriting an entire file. That additional work is write amplification.
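
To make that concrete, here’s a back-of-the-envelope sketch in Python, using the 256 MB file size this post comes back to later and an assumed ~100-byte row:

```python
# Rough write amplification for copy-on-write: change 3 rows in a
# 256 MB immutable file and the whole file must be rewritten.
file_size = 256 * 1024 * 1024   # bytes rewritten to apply the change
changed_bytes = 3 * 100         # bytes that actually changed (~100 B/row)

amplification = file_size / changed_bytes
print(f"write amplification: ~{amplification:,.0f}x")
# write amplification: ~894,785x
```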

The previous post on the CDC MERGE pattern skipped over write amplification to focus on trade-offs and problems with the pattern. This post covers optional improvements that defer and reduce work.

If you haven’t already, you may want to read the previous posts in this series: the first on the CDC change log pattern and the most recent on the MERGE pattern.

Write Amplification in CDC Pipelines

Write amplification is the extra work a writer does beyond the size of the actual change. It sits at the center of the trade-off between efficiency at write time and efficiency at read time, and it comes down to how you handle the last step: modifying or deleting a small amount of data sitting somewhere in a large data store.

Our moving box analogy is a good start for building a mental model of write amplification. It even makes some ways of handling the problem clearer:

  • Smaller data files (smaller boxes) reduce the amplification of a small change.
  • Metrics about data files (labels describing what’s in each box) help avoid checking whether a file needs to be rewritten at all. (Both are sketched below.)
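
Both tools map onto concrete Iceberg features: target file size is a table property, and per-file column stats (the labels on the boxes) are tracked automatically in manifest metadata so planning can skip files. A minimal sketch of the first knob, assuming Spark with a hypothetical `demo` catalog and `db.mirror` table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named `demo`

# Smaller target files mean fewer unrelated rows get rewritten when a
# handful of rows in one file change. 128 MB shown here; the best value
# depends on the table's size, update pattern, and readers.
spark.sql("""
    ALTER TABLE demo.db.mirror SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '134217728'
    )
""")
```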

These are valuable tools. But they’re limited because they don’t address the underlying problem that small changes still carry much larger costs. To address it, another tool is more powerful: delta files.

A delta file stores changes to some portion of a table that are applied at read time. It’s basically a diff against a data file, and obviates the need to rewrite an entire huge file just to modify 2 or 3 rows. The advantage of a delta file is that the cost to write one is (nominally) proportional to the size of data that is changing rather than the size of data files. This directly addresses write amplification and is clearly better than rewriting whole data files.
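
In Iceberg, one common delta form is a position delete file: a list of row positions to drop from a specific data file at read time. A toy sketch of the read-time semantics (names here are illustrative, not Iceberg’s API):

```python
# Illustrative only: a "delta" here is a set of row positions to drop
# plus replacement rows to add, applied while reading the base file.
def read_with_delta(base_rows, deleted_positions, new_rows):
    """Yield the logical contents of a data file after its delta is applied."""
    for pos, row in enumerate(base_rows):
        if pos not in deleted_positions:   # skip rows the delta deleted
            yield row
    yield from new_rows                    # rows the delta added

base = ["row-0", "row-1", "row-2", "row-3"]
# The delta's size tracks the change, not the size of the base file.
print(list(read_with_delta(base, deleted_positions={1}, new_rows=["row-1b"])))
# ['row-0', 'row-2', 'row-3', 'row-1b']
```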

Hooray, the problem is solved! Surely nothing will go wrong.

Deltas defer work to readers

Of course there’s always a trade-off. In this case it’s the added cost at read time of applying delta files. Applying one small delta instead of rewriting a 256 MB data file is clearly a win, but applying 10,000 deltas is a big problem. A table with too many deltas has the small files problem, where the latency of opening a large number of files makes a task take forever. CDC pipelines are notorious for this because updates are frequent and distributed across the whole table.

For use cases like CDC, it’s important to remember that deltas are a way to defer work; the work still must be done eventually. There’s no free lunch: work doesn’t dissipate, it gets redistributed, in this case from write time to read time. Rather than rewriting a data file at write time, a delta creates a little more work every time that file is read. (That’s why this strategy is called merge-on-read.) CDC write patterns stack up so many deltas that the overhead of reading delta files makes the table unusable, unless that deferred work is regularly reduced by compaction.
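
In Apache Iceberg this choice is a per-table, per-operation setting. A sketch with Spark SQL, assuming a hypothetical `demo` catalog, a `db.mirror` table, and a `db.cdc_changes` changelog with an `op` flag (the `write.*.mode` properties are Iceberg’s; merge-on-read requires format version 2):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg Spark extensions are enabled

# Write deltas for row-level operations instead of rewriting data files.
spark.sql("""
    ALTER TABLE demo.db.mirror SET TBLPROPERTIES (
        'format-version'    = '2',
        'write.merge.mode'  = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read'
    )
""")

# This MERGE now appends small delete and data files rather than
# rewriting every data file that contains a matched row.
spark.sql("""
    MERGE INTO demo.db.mirror AS t
    USING demo.db.cdc_changes AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```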

If reads are slower and every modified data file will be rewritten by compaction anyway, why use deltas? There are two benefits:

  1. Deltas allow the write job to complete more quickly, making data available sooner and reducing the chance of a write conflict.
  2. The overall write pattern is more efficient because deltas from multiple commits can be rewritten in batches. Deltas reduce work when combined with compaction, but can easily become a performance disaster when used alone.

Anti-patterns in CDC with delta files

Balancing the trade-off between work at write time and work at read time is hard. The work at write time is measured in volume — bytes read and written. The work at read time is measured mostly in files — how many round trips to S3 are needed — and its cost is paid on every read.

My first recommendation is to avoid common anti-patterns:

  1. Committing too often — when possible, avoid work instead of deferring work. That’s why the prior post called out the frequency of updates when using the MERGE pattern for CDC: the more frequently a mirror table is updated, the more overall work is created — whether that’s rewriting files immediately or creating a new set of deltas.
  2. Not compacting deltas — long-term read performance depends on eventually rewriting data files with deltas applied. Ignoring compaction allows delete files and metadata to grow until the table is unusable. If maintaining a compaction process is a problem, then using deltas is just not a good strategy.
  3. Over-tuning — the opposite of ignoring compaction is also a problem. You can spend a lot of time chasing efficiency, but the time it takes to measure, plan, and adjust pipelines often negates the benefits. Unless a table is huge or has exacting latency and performance requirements, don’t spend hours tuning it.

Best practices for CDC with a merge-on-read strategy

Now that it’s clear what not to do, what are the best practices for CDC with deltas?

  1. Pick a reasonable update frequency — the biggest impact comes from avoiding work, so pick a default commit frequency that works for most tables. This is usually between 5 and 15 minutes. Few use cases require more frequent updates, and picking a good default lets you spend your time tuning the tables that need it.
  2. Compact before tables get slow — schedule compactions based on the expected number of deltas created by the commit frequency. It’s okay to compact more than necessary: running compaction once every 10 MERGE commits already reduces write amplification by about 10x. You can push the limit further, but it usually isn’t worth getting too close to the point at which users complain (see the compaction sketch after this list).
  3. Don’t forget the basics — compactions still need to rewrite data files and incur some write amplification. As with copy-on-write, smaller files reduce amplification, and keeping data well indexed helps target deletes so that compaction can skip files that don’t require it.
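
Here is a sketch of that compaction schedule using Iceberg’s Spark procedures (the table name is hypothetical; the procedures and options are Iceberg’s, available in recent releases):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg Spark extensions are enabled

# Rewrite data files that have deltas attached, producing files with
# no deferred work left for readers to apply.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.mirror',
        options => map(
            'delete-file-threshold', '2',          -- only rewrite files that have deltas
            'target-file-size-bytes', '134217728'  -- smaller files, less amplification
        )
    )
""")

# Between full rewrites, the delete files themselves can be compacted
# to keep the per-read file count down.
spark.sql("CALL demo.system.rewrite_position_delete_files(table => 'db.mirror')")
```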

These best practices cover most tables so you can focus time on the tables with stricter requirements. For those tables, follow the same pattern: run MERGE to meet the latency requirement and then base compaction frequency on when reads start to slow down from too many deltas.

Remember that frequent commits can easily increase the amount of maintenance work by an order of magnitude. Committing every minute instead of every 10 produces 10x the number of deltas and requires 10x more maintenance.

Frequent commits also squeeze the amount of time in which compactions can run. To handle these use cases, Iceberg provides more advanced features, like compaction partial progress and commit deconfliction in REST catalogs. There’s also a simpler solution: change the MERGE strategy periodically to use copy-on-write rather than merge-on-read. That combines compaction and an update into one commit.
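
A sketch of that simpler approach, reusing the hypothetical tables from above: flip `write.merge.mode` so that, say, every tenth MERGE runs as copy-on-write and rewrites the files it touches, while the rest run as merge-on-read.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg Spark extensions are enabled

MERGE_SQL = """
    MERGE INTO demo.db.mirror AS t
    USING demo.db.cdc_changes AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""

def run_cdc_merge(mode: str) -> None:
    """Run the usual CDC MERGE, but with the given write mode."""
    # 'copy-on-write' rewrites the affected files, combining the update
    # and compaction into one commit; 'merge-on-read' writes deltas.
    spark.sql(f"ALTER TABLE demo.db.mirror SET TBLPROPERTIES ('write.merge.mode' = '{mode}')")
    spark.sql(MERGE_SQL)

# Every tenth scheduled run doubles as compaction for the files it touches.
for run in range(10):
    run_cdc_merge('copy-on-write' if run == 0 else 'merge-on-read')
```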

What’s next?

Another reason why it’s important to think of deltas as a way to defer work is that deltas aren’t the only clever solution. The first CDC post on the change log pattern had another example: using a windowed view of the change log instead of a mirror table, which deferred finding the latest copy of any given row entirely to read time.

The next post in this series is on the Hybrid Pattern, which builds on the idea of using the change log table to defer work, combined with the MERGE pattern to maintain a read-optimized mirror.
