This week, I dove into Project Nessie - an open-source transactional data catalogue for Apache Iceberg tables. I’d heard about Nessie’s Git-like semantics and was curious how they could improve data versioning and auditability in my projects.

Docker Compose setup for Nessie server and CLI

To experiment locally, I used Docker, following a guide provided by the Nessie team. From their materials, I put together a straightforward Docker Compose file that places both the Nessie server and the CLI in the same Docker network, which greatly simplifies communication between the containers.

Here’s a quick excerpt from my docker-compose.yaml; the full repository can be found on GitHub:

services:  
  nessie-server:  
    image: ghcr.io/projectnessie/nessie:0.103.3  
    container_name: nessie-server  
    ports:  
      - "19120:19120"  
      - "9000:9000"  
  
  nessie-cli:  
    image: ghcr.io/projectnessie/nessie-cli:0.103.3  
    container_name: nessie-cli  
    stdin_open: true   # -i  
    tty: true          # -t  
    profiles:  
      - cli  
    depends_on:  
      - nessie-server

With everything configured, I spun up the server with docker compose up -d and then started an interactive CLI session with docker compose run nessie-cli. I particularly liked how Docker Compose’s profiles feature ensures only the services needed at runtime are started: the CLI sits behind the cli profile, so a plain docker compose up brings up just the server.

Inside the Nessie CLI, connecting to the server was straightforward:

CONNECT TO http://nessie-server:19120/api/v2

I explored some basic commands:

  • create namespace my_new_namespace: This felt similar to creating folders or database schemas.
  • create branch if not exists my_new_branch from main: Like git branches, this allows for isolated, version-controlled experiments with data.
  • use branch my_new_branch: Activates the new branch, making any subsequent operations exclusive to it. The REPL conveniently marks the active branch.
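Put together, a full session in the REPL looks roughly like this (the same commands as above, written out in order):

```
CONNECT TO http://nessie-server:19120/api/v2
create namespace my_new_namespace
create branch if not exists my_new_branch from main
use branch my_new_branch
```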

Checking the Nessie UI at http://127.0.0.1:19120, it was clear that namespaces created on the branch were entirely isolated from main. Exiting the CLI was as easy as typing exit.

PyIceberg and Marimo Notebooks

Next up was experimenting with PyIceberg to populate the data catalogue. After setting up a Python virtual environment and installing dependencies, I used Marimo, a fantastic interactive Python notebook environment, to test various operations, like creating Iceberg tables from PyIceberg schemas and even PyArrow schemas, which seem particularly useful for ETL scenarios.

I encountered an initial snag: when writing data to the catalogue, PyIceberg threw connection errors for the minio:9000 endpoint, a hostname that is only resolvable inside the Docker network. The fix involved tweaking my Docker Compose configuration to add an external-endpoint setting for the nessie-server service, so the server advertises an address that clients outside the network can actually reach.
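For reference, the relevant addition looked something like this. The exact property names are my best recollection of Nessie’s S3 catalog options, so treat them as an assumption and double-check against the Nessie configuration docs:

```yaml
services:
  nessie-server:
    environment:
      # endpoint the Nessie server itself uses to reach MinIO inside the network
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      # endpoint advertised to clients running outside the Docker network
      - nessie.catalog.service.s3.default-options.external-endpoint=http://127.0.0.1:9000/
```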

Nessie for ETL pipelines

Looking ahead, I’m particularly excited about Nessie’s potential for implementing a write-audit-publish pattern. I can already envision a process where Airflow tasks fetch new data, branch off for isolated changes, perform quality checks, and only then merge into the main data branch — maintaining robust control and auditability.
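A rough sketch of that flow, in pseudocode rather than real Airflow or Nessie syntax:

```
# write-audit-publish, once per pipeline run
branch = create branch etl_run_<id> from main   # isolate the run
write new data to tables on branch              # write
run quality checks against branch               # audit
if checks pass: merge branch into main          # publish
else: drop branch and alert                     # nothing bad reaches main
```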