Using dbt with Tabular

Tabular
5 min readMay 19, 2023

--

by Fokko Driesprong

Tabular + dbt-spark

dbt recently added Apache Iceberg support in their version 1.5 release. dbt is a popular framework for building and maintaining data pipelines. It is both a compiler and a runner:

  • Compiler: dbt will compile the templated sql files based on the current state of the database.
  • Runner: dbt will execute sql statements against your query engine, keeping track of dependencies and running queries in parallel when possible.

dbt implements the best practices that are common in the software industry, such as proper documentation, version control, and testing in the data warehousing world. For example, setting up a development and production environment is easy using dbt.

This blog post will walk you through the steps on how to run a dbt pipeline against Tabular.

Installing Spark

Neither Tabular nor dbt-spark provide a query engine; therefore, we need to install Spark locally. In production environments, it is recommended to use Spark deployment that is running near the data. First download Spark:

curl -OL https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar zxvf spark-3.3.2-bin-hadoop3.tgz

Next, download the open-source Iceberg runtime, and the AWS bundle:

curl https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.2.1/iceberg-spark-runtime-3.3_2.12-1.2.1.jar --output jars/iceberg-spark-runtime-3.3_2.12-1.2.1.jar       
curl https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.20.18/bundle-2.20.18.jar --output jars/bundle-2.20.18.jar

Now we start the Hive server using spark-submit:

./bin/spark-submit  \                                                                                                                                                    
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.defaultCatalog=sandbox \
--conf spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.sandbox.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
--conf spark.sql.catalog.sandbox.uri=https://api.tabular.io/ws/ \
--conf spark.sql.catalog.sandbox.credential=<CREDENTIAL> \
--conf spark.sql.catalog.sandbox.warehouse=sandbox \
--class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
--name "Thrift JDBC/ODBC Server"

You’ll notice the <CREDENTIAL> in the run command above. You will need to create one in the Tabular UI and paste it here. That credential unlocks a lot of access control power that is not available in Spark on its own, and we’ll cover that in a later blog.

Setting up dbt

Make sure that the latest version of dbt-spark is installed:

pip3 install dbt-spark==1.5.0

Next, set up the configuration ~/.dbt/profiles.yml to point at the local Spark instance:

dbt_tabular:
outputs:
dev:
host: 127.0.0.1
method: thrift
port: 10000
schema: dbt_tabular
threads: 1
type: spark
target: dev

Running the pipelines

Next we’ll clone a repository that has an example pipeline that takes data from the Taxi dataset that comes with Tabular in the examples schema.

git clone https://github.com/tabular-io/dbt-tabular.git
cd dbt-tabular

Now execute dbt run:

➜  dbt-tabular git:(main) ✗ dbt run
08:48:42 Running with dbt=1.5.0
08:48:42 Found 3 models, 3 tests, 0 snapshots, 0 analyses, 356 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
08:48:42
08:48:48 Concurrency: 1 threads (target='dev')
08:48:48
08:48:48 1 of 3 START sql table model dbt_tabular.locations ............................. [RUN]
08:48:57 1 of 3 OK created sql table model dbt_tabular.locations ........................ [OK in 8.85s]
08:48:57 2 of 3 START sql table model dbt_tabular.taxis ................................. [RUN]
08:49:27 2 of 3 OK created sql table model dbt_tabular.taxis ............................ [OK in 30.56s]
08:49:27 3 of 3 START sql table model dbt_tabular.rides ................................. [RUN]
08:49:50 3 of 3 OK created sql table model dbt_tabular.rides ............................ [OK in 22.55s]
08:49:51
08:49:51 Finished running 3 table models in 0 hours 1 minutes and 8.58 seconds (68.58s).
08:49:51
08:49:51 Completed successfully
08:49:51
08:49:51 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Open up a browser and look at your Tabular Warehouse; you’ll see the tables listed:

You’ll also note that all the documentation is being propagated:

On of the benefits of using Iceberg is that the table is always available:

➜  dbt-tabular git:(main) ✗ head target/run/dbt_tabular/models/consumption/rides.sql
create or replace table dbt_tabular.rides
using iceberg

As dbt functions as a compiler, all compiled sql files can be accessed from target/. Unlike when using plain parquet tables, which drop the table and then recreate it with the same name in the same location, Iceberg’s approach is to create or replace the table. This provides two benefits: firstly, the entire history is tracked within the Iceberg table, and secondly, the table remains available at all times. In contrast, the traditional method renders the table unavailable during the replacement period. With Iceberg, however, the table is always accessible, and its full history is retained.

Additionally, dbt supports incremental models, as exemplified in the locations_incremental.sql file within the repository. During the initial run, dbt detects the absence of the table and executes a simple create table statement.

create or replace table dbt_tabular.locations_incremental
using iceberg

And on the second run:

create temporary view locations_incremental__dbt_tmp as
SELECT *
FROM examples.nyc_taxi_locations;
merge into dbt_tabular.locations_incremental as DBT_INTERNAL_DEST
using locations_incremental__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.location_id = DBT_INTERNAL_DEST.location_id
when matched then update set *
when not matched then insert *

This way, you can leverage dbt to generate the syntax for you.

Another big advantage of using Iceberg is when running tests. Because Iceberg collects metadata when the data is being written, it can determine if a column is not-null without having to go through the actual data; this also applies to not-negative checks.

➜  dbt-tabular git:(main) ✗ dbt test
11:11:33 Running with dbt=1.5.0
11:11:33 Found 4 models, 3 tests, 0 snapshots, 0 analyses, 356 macros, 1 operation, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
11:11:33
11:11:36 Concurrency: 1 threads (target='dev')
11:11:36
11:11:36 1 of 3 START test not_null_rides_payment ....................................... [RUN]
11:11:39 1 of 3 PASS not_null_rides_payment ............................................. [PASS in 2.80s]
11:11:39 2 of 3 START test not_null_rides_ratecode ...................................... [RUN]
11:11:41 2 of 3 PASS not_null_rides_ratecode ............................................ [PASS in 1.96s]
11:11:41 3 of 3 START test not_null_rides_vendor ........................................ [RUN]
11:11:43 3 of 3 PASS not_null_rides_vendor .............................................. [PASS in 1.95s]
11:11:43
11:11:43 Finished running 3 tests, 1 hook in 0 hours 0 minutes and 9.93 seconds (9.93s).
11:11:43
11:11:43 Completed successfully
11:11:43
11:11:43 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Tabular and dbt

This blog not only showcases the simplicity of using dbt with Tabular managed Iceberg tables but also highlights the exceptional speed and convenience that come with utilizing intrinsic Iceberg features.

Future work

Both Iceberg and dbt are open-source and new features are added continuously.

  • Iceberg support for views. dbt relies heavily on views, and there is a lot of work being done on Iceberg views. Iceberg is query engine agnostic. In this blog, the data is written using Spark, but the table can also be written to or read by any other query engine that implements the Iceberg specification. Next to the Iceberg tables, Iceberg views are coming. This will allow dbt to define views that can be queried using any Iceberg compliant engine.
  • Simplified setup. The current setup with having to run the Spark Hive server is a bit clunky. There are two PRs open (#691, #782) that will remove the need for the Hive server, and just run everything in the same process as dbt.

--

--

Tabular
Tabular

Written by Tabular

Tabular is building an independent cloud-native data platform powered by the open source standard for huge analytic datasets, Apache Iceberg.

No responses yet