DaskBERG!

written by Martin Durant on 2022-11-29

TL;DR

I hacked together a dask client for Apache Iceberg! It can work directly on the files stored by Iceberg, or talk to the REST service to get the locations of those files. There is no write support right now, but there could be quite soon.

https://github.com/martindurant/daskberg

Introduction

First there were database engines, and then came big data with its efficient binary columnar storage (parquet and ORC). But we wanted both together! Transactions and time-checkpoints along with throughput and efficiency! So the world has moved on again, and now there are various "Lake" technologies that layer on top of parquet to get the best of both.

When the dask team talked about this some time ago, one effort resulted in a reader for Delta Lake: dask-deltalake. The view of some of the team was that an Apache project, Iceberg, was the better-designed offering, but nothing was done at the time. Recently I happened to be thinking about Iceberg because of a particular dataset I came across and a long discussion about Iceberg-like ideas beyond tabular data.

So I got to work...

Outcome

Here is the repository of my code: https://github.com/martindurant/daskberg. This did not cost me a huge amount of effort and was fun and educational to dig into. The README example shows a full Iceberg workflow on some super simple test data, but it is already enough to show the power of the code. This is an open invitation to try it out and see how far you can push it!
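To give a flavour of what "working on the files stored by Iceberg" involves, here is a minimal sketch. It is not daskberg's actual code. It assumes the table metadata JSON has already been loaded into a dict, and it uses key names from the Iceberg table spec ("current-snapshot-id", "snapshots", "snapshot-id", "timestamp-ms", "manifest-list"). The function name pick_snapshot, the paths, and all the values are made up for illustration. Choosing by timestamp is also a simplification: it picks the most recent snapshot committed at or before the given time, which may differ from what the table pointed at if a rollback happened.

```python
# Made-up table metadata, trimmed to the fields used below. The key names
# follow the Iceberg table spec; every value here is invented.
EXAMPLE_METADATA = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 1000,
         "manifest-list": "s3://bucket/table/metadata/snap-1.avro"},
        {"snapshot-id": 2, "timestamp-ms": 2000,
         "manifest-list": "s3://bucket/table/metadata/snap-2.avro"},
    ],
}


def pick_snapshot(metadata, as_of_ms=None):
    """Return one snapshot dict from loaded table metadata (hypothetical helper).

    With as_of_ms=None, return the table's current snapshot. Otherwise return
    the most recent snapshot committed at or before as_of_ms.
    """
    snapshots = metadata.get("snapshots", [])
    if as_of_ms is None:
        wanted = metadata.get("current-snapshot-id")
        matches = [s for s in snapshots if s["snapshot-id"] == wanted]
    else:
        older = [s for s in snapshots if s["timestamp-ms"] <= as_of_ms]
        # Keep only the newest of the candidates (empty list stays empty).
        matches = sorted(older, key=lambda s: s["timestamp-ms"])[-1:]
    if not matches:
        raise ValueError("no matching snapshot")
    return matches[0]


latest = pick_snapshot(EXAMPLE_METADATA)
earlier = pick_snapshot(EXAMPLE_METADATA, as_of_ms=1500)
print(latest["manifest-list"])
print(earlier["snapshot-id"])
```

The "manifest-list" path of the chosen snapshot points to an Avro file, which in turn names the manifests that list the parquet data files. A real reader needs an Avro library for that step, plus handling for deletes and partition information, none of which is shown here.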

pyiceberg?

The core team behind Iceberg itself is also working on a Python client, pyiceberg. As of this writing, it is not yet ready for any read operations. Maybe eventually the work I did will get folded into their library, or they will code their own way to integrate with dask. Either option would be fine with me.