Reading Apache Iceberg from Python with PyIceberg

Tabular
4 min readNov 14, 2022

--

This is a walkthrough on how to use the PyIceberg CLI. For this walkthrough, we’re going to use the docker-spark-iceberg repository that contains a full stack of Spark with Iceberg support, MinIO as a storage backend and a REST catalog as a catalog. This is a companion piece to our PyIceberg video. It requires docker and docker-compose installed.

First, we’re going to clone the repository and start the stack:

git clone https://github.com/tabular-io/docker-spark-iceberg.git
cd docker-spark-iceberg
docker-compose up

It takes a couple of seconds to start all the services, and next we can open up a web browser to go to the Jupyter notebook UI that comes with the Spark images. This UI is available at http://localhost:8888/.

We can use the notebook Iceberg — Getting Started.ipynb to create a table that we then can query using PyIceberg. First, we need to create the database nyc:

%%sql
CREATE DATABASE nyc;

And next we’re going to read in a single parquet file, and write it to the table nyc.taxis:

df = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2021-04.parquet")
df.write.saveAsTable("nyc.taxis")

We don’t always want to use a full blown JVM to interact with the Iceberg tables. This is where PyIceberg comes in. Next, we’ll jump into the Spark container, and get PyIceberg installed.

docker exec -t -i `docker ps -q -f "expose=8888"` bash

Now we’re inside of the spark container, and we can install PyIceberg:

pip3 install "pyiceberg[pyarrow]"

In this instance we’re install PyIceberg with PyArrow support, but there are also other backends such as s3fs. For all the latest information, please check the docs.

And next, we can query the rest catalog:

pyiceberg --uri http://rest:8181 list
nyc

Now, we don’t want to provide the — uri all the time, instead we can also create a configuration file:

vi ~/.pyiceberg.yaml

We can set the following yaml:

catalog:
default:
uri: http://rest:8181
s3.endpoint: http://minio:9000

You can escape vi using by pressing ‘:’, and then do the command wq, which means write and quite. Now we’re able to query the catalog without having to provide a uri:

pyiceberg list
nyc

To list all the commands, we can just run pyiceberg:

pyiceberg

That gives an overview of all the commands:

sage: pyiceberg [OPTIONS] COMMAND [ARGS]...

Options:
--catalog TEXT
--verbose BOOLEAN
--output [text|json]
--uri TEXT
--credential TEXT
--help Show this message and exit.

Commands:
describe Describes a namespace xor table
drop Operations to drop a namespace or table
files Lists all the files of the table
list Lists tables or namespaces
location Returns the location of the table
properties Properties on tables/namespaces
rename Renames a table
schema Gets the schema of the table
spec Returns the partition spec of the table
uuid Returns the UUID of the table

Next we can describe the table that we created in the beginning:

pyiceberg describe nyc.taxis

Which returns:

Table format version  1
Metadata location s3a://warehouse/wh/nyc/taxis/metadata/00000-d8c5f2af-b445-4814-a7ae-133b07be7a85.metadata.json
Table UUID 265415d3-6d3d-47dd-86f6-ee4cfb6e4ce9
Last Updated 1668444293891
Partition spec []
Sort order []
Current schema Schema, id=0
├── 1: VendorID: optional long
├── 2: tpep_pickup_datetime: optional timestamptz
├── 3: tpep_dropoff_datetime: optional timestamptz
├── 4: passenger_count: optional double
├── 5: trip_distance: optional double
├── 6: RatecodeID: optional double
├── 7: store_and_fwd_flag: optional string
├── 8: PULocationID: optional long
├── 9: DOLocationID: optional long
├── 10: payment_type: optional long
├── 11: fare_amount: optional double
├── 12: extra: optional double
├── 13: mta_tax: optional double
├── 14: tip_amount: optional double
├── 15: tolls_amount: optional double
├── 16: improvement_surcharge: optional double
├── 17: total_amount: optional double
├── 18: congestion_surcharge: optional double
└── 19: airport_fee: optional double
Current snapshot Operation.APPEND: id=8152339288720205593, schema_id=0
Snapshots Snapshots
└── Snapshot 8152339288720205593, schema 0: s3a://warehouse/wh/nyc/taxis/metadata/snap-8152339288720205593-1-bcce7ed6-5396-4285-9892-a1e89ae286f8.avro
Properties owner root
created-at 2022-11-14T16:44:46.939799552Z
write.format.default parquet

In the case when you’re scripting things, it is often nice to have the output in json, which is also possible:

pyiceberg --output json describe nyc.taxis

Which returns all the information in json:

{
"identifier": ["default", "nyc", "taxis"],
"metadata_location": "s3a://warehouse/wh/nyc/taxis/metadata/00000-d8c5f2af-b445-4814-a7ae-133b07be7a85.metadata.json",
"metadata": {
"location": "s3a://warehouse/wh/nyc/taxis",
"table-uuid": "265415d3-6d3d-47dd-86f6-ee4cfb6e4ce9",
"last-updated-ms": 1668444293891,
"last-column-id": 19,
"schemas": [{
"type": "struct",
"fields": [{
"id": 1,
"name": "VendorID",
"type": "long",
"required": false
}, {
"id": 2,
"name": "tpep_pickup_datetime",
"type": "timestamptz",
"required": false
}, ...],
"schema-id": 0,
"identifier-field-ids": []
}],
"current-schema-id": 0,
"partition-specs": [{
"spec-id": 0,
"fields": []
}],
"default-spec-id": 0,
"last-partition-id": 999,
"properties": {
"owner": "root",
"created-at": "2022-11-14T16:44:46.939799552Z",
"write.format.default": "parquet"
},
"current-snapshot-id": 8152339288720205593,
"snapshots": [{
"snapshot-id": 8152339288720205593,
"timestamp-ms": 1668444293891,
"manifest-list": "s3a://warehouse/wh/nyc/taxis/metadata/snap-8152339288720205593-1-bcce7ed6-5396-4285-9892-a1e89ae286f8.avro",
"summary": {
"operation": "append",
"spark.app.id": "local-1668442408867",
"added-data-files": "1",
"added-records": "2171187",
"added-files-size": "34536512",
"changed-partition-count": "1",
"total-records": "2171187",
"total-files-size": "34536512",
"total-data-files": "1",
"total-delete-files": "0",
"total-position-deletes": "0",
"total-equality-deletes": "0"
},
"schema-id": 0
}],
"snapshot-log": [{
"snapshot-id": "8152339288720205593",
"timestamp-ms": 1668444293891
}],
"metadata-log": [],
"sort-orders": [{
"order-id": 0,
"fields": []
}],
"default-sort-order-id": 0,
"refs": {
"main": {
"snapshot-id": 8152339288720205593,
"type": "branch"
}
},
"format-version": 1,
"schema": {
"type": "struct",
"fields": [{
"id": 1,
"name": "VendorID",
"type": "long",
"required": false
}, {
"id": 2,
"name": "tpep_pickup_datetime",
"type": "timestamptz",
"required": false
} ... ],
"schema-id": 0,
"identifier-field-ids": []
},
"partition-spec": []
}
}

Also, it is very easy to accidentally drop the table:

pyiceberg drop table nyc.taxis

And if we try to query the table:

pyiceberg --output json describe nyc.taxis

We can see that it is not available anymore:

{
"type": "NoSuchTableError",
"message": "Table or namespace does not exist: nyc.taxis"
}

Hope you’ve learned something about PyIceberg today. If you run into anything, please don’t hesitate to reach out on the #python channel on Slack, or create an issue on Github.

--

--

Tabular
Tabular

Written by Tabular

Tabular is building an independent cloud-native data platform powered by the open source standard for huge analytic datasets, Apache Iceberg.