Mutable Kerchunk References

written by Martin Durant on 2023-01-09

summary

Kerchunk is all about making a virtual "reference" filesystem, where each chunk of data points to a byte range in some remote file. These references are usually saved in JSON files elsewhere. We can write new data through the reference filesystem, re-point the affected references at the new locations, and save the result as a new reference set, implementing a very rudimentary versioning system.

Introduction

This idea came about from experimenting with Apache Iceberg's implementation of versioned tabular (parquet) datasets using immutable files and "manifest" listings, and from the conversation started by @rabernat in this zarr issue.

In this post, I present a little hack project to make ReferenceFileSystem writable, so that we can make new versions of an array dataset using zarr. No server required.

You can install the fork that makes this work as follows.

pip install git+https://github.com/martindurant/filesystem_spec.git@icy

The code to allow mutating and saving a ReferenceFileSystem will eventually be merged into fsspec.

Example

We start here by running the first part of the kerchunk tutorial to produce some JSON reference sets. We'll only use one of them here. Let's open one of the datasets using zarr. It contains one big data array and smaller coordinate arrays. I do not have permission to change the original data in place.

import fsspec
import zarr
so = {"anon": True}
fs = fsspec.filesystem(
    "reference", fo="01_air_pressure_at_mean_sea_level.json", 
    remote_protocol="s3", remote_options=so
)
g = zarr.open(fs.get_mapper())
list(g)
    ['air_pressure_at_mean_sea_level', 'lat', 'lon', 'time0']

Some basic information: this array is about 2.9 GiB (uncompressed) in 3720 chunks:

(g.air_pressure_at_mean_sea_level.shape, 
 g.air_pressure_at_mean_sea_level.chunks, 
 g.air_pressure_at_mean_sea_level.nchunks,
 g.air_pressure_at_mean_sea_level.nbytes / 2**30)
    ((744, 721, 1440), (24, 100, 100), 3720, 2.8776025772094727)
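
As a sanity check, the chunk count and size can be reproduced by hand from the shape and chunk shape printed above. This is only arithmetic, with the 4-byte item size taken from the float32 dtype shown in the outputs; the variable names are mine.

```python
import math

shape = (744, 721, 1440)   # array shape, as printed above
chunks = (24, 100, 100)    # chunk shape, as printed above
itemsize = 4               # bytes per float32 value

# Partial chunks at the edges still count as chunks, hence the ceiling.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
nchunks = math.prod(chunks_per_dim)

# Uncompressed size of the whole array, expressed in GiB.
size_gib = math.prod(shape) * itemsize / 2**30

print(chunks_per_dim, nchunks, round(size_gib, 2))
# [31, 8, 15] 3720 2.88
```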

It contains data with values near 100000 (pressure in Pa, one assumes). Here are two very small sections.

(g.air_pressure_at_mean_sea_level[100, 700, 1000:1004],
 g.air_pressure_at_mean_sea_level[100, 690, 1000:1004])
    (array([99510.69, 99512.44, 99514.44, 99516.44], dtype=float32),
     array([99152.44, 99164.44, 99176.69, 99188.69], dtype=float32))

Let's modify some data! This writes a (temporary) local file and updates the appropriate reference to point to it.

g.air_pressure_at_mean_sea_level[100, 700, 1000:1004] /= 2
fs.save_json("air_modified.json")
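
The idea of "write a new file and re-point the reference" can be modelled in a few lines. To be clear, this is a toy sketch and not the code in the fork: `ToyReferenceStore` and its methods are hypothetical names, local files stand in for the remote store, and I am assuming references shaped like `[path, offset, length]` for a byte range and `[path]` for a whole file.

```python
import os
import tempfile


class ToyReferenceStore:
    """Toy model only: NOT the fsspec/kerchunk implementation."""

    def __init__(self, refs, scratch_dir):
        # key -> [path] (whole file) or [path, offset, length] (byte range)
        self.refs = dict(refs)
        self.scratch_dir = scratch_dir

    def read(self, key):
        ref = self.refs[key]
        with open(ref[0], "rb") as f:
            if len(ref) == 3:
                f.seek(ref[1])
                return f.read(ref[2])
            return f.read()

    def write(self, key, data):
        # New bytes go to a fresh file; the original target is never touched.
        new_path = os.path.join(self.scratch_dir, key.replace("/", "_"))
        with open(new_path, "wb") as f:
            f.write(data)
        # Re-point the reference at the whole new file.
        self.refs[key] = [new_path]


with tempfile.TemporaryDirectory() as tmp:
    # One "original" file holding two 4-byte chunks back to back.
    original = os.path.join(tmp, "original.bin")
    with open(original, "wb") as f:
        f.write(b"AAAABBBB")

    store = ToyReferenceStore(
        {"x/0": [original, 0, 4], "x/1": [original, 4, 4]}, scratch_dir=tmp
    )
    store.write("x/0", b"ZZZZ")

    edited = store.read("x/0")      # served from the new file
    untouched = store.read("x/1")   # still a byte range of the original
    with open(original, "rb") as f:
        original_bytes = f.read()   # the original file is unmodified
```

Saving `store.refs` at this point would give a second reference set that differs from the first in exactly one entry, which is the property the rest of this post relies on.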

Now we can load the new reference set, and indeed the values have changed. The unchanged chunks are still loaded from the remote store, while the changed chunk is loaded from the local file.

fs2 = fsspec.filesystem(
    "reference", fo="air_modified.json", 
    remote_protocol="s3", remote_options=so
)
g2 = zarr.open(fs2.get_mapper())
(g2.air_pressure_at_mean_sea_level[100, 700, 1000:1004],
 g2.air_pressure_at_mean_sea_level[100, 690, 1000:1004])
    (array([49755.344, 49756.22 , 49757.22 , 49758.22 ], dtype=float32),
     array([99152.44, 99164.44, 99176.69, 99188.69], dtype=float32))
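
To see which reference the edit touched, we can map an array index to its chunk key by integer-dividing each index by the chunk size. This is a small sketch: `chunk_key` is a helper of my own, and it assumes the usual zarr v2 key layout with "." between chunk indices.

```python
def chunk_key(array_name, index, chunks, sep="."):
    """Return the store key of the chunk that contains `index`."""
    chunk_index = [i // c for i, c in zip(index, chunks)]
    return array_name + "/" + sep.join(str(n) for n in chunk_index)


chunks = (24, 100, 100)
name = "air_pressure_at_mean_sea_level"

edited_key = chunk_key(name, (100, 700, 1000), chunks)  # the slice we halved
other_key = chunk_key(name, (100, 690, 1000), chunks)   # the slice we left alone

print(edited_key)  # air_pressure_at_mean_sea_level/4.7.10
print(other_key)   # air_pressure_at_mean_sea_level/4.6.10
```

The four edited elements (1000 to 1003 on the last axis) all fall in the same chunk, so under these assumptions only one key needs a new target.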

The originals, however, have not changed, and reads still go only to the remote versions:

fs = fsspec.filesystem(
    "reference", fo="01_air_pressure_at_mean_sea_level.json", 
    remote_protocol="s3", remote_options=so,
    skip_instance_cache=True
)
g = zarr.open(fs.get_mapper())
(g.air_pressure_at_mean_sea_level[100, 700, 1000:1004],
 g.air_pressure_at_mean_sea_level[100, 690, 1000:1004])
    (array([99510.69, 99512.44, 99514.44, 99516.44], dtype=float32),
     array([99152.44, 99164.44, 99176.69, 99188.69], dtype=float32))

So the two reference files are snapshots of the data with an edit between them, and we only saved new data for the one chunk where we actually made a change. One might imagine keeping the two dataset specs side by side in an Intake catalogue describing the two snapshots and provenance in metadata. This is clearly not a version control system for array data, but perhaps part of one.
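
Since each snapshot is just a mapping of keys to targets, finding what changed between two of them is a dictionary comparison. Here is a minimal sketch, assuming the references sit under a top-level "refs" key as in kerchunk's version 1 JSON layout. The two small reference sets are hand-written for illustration, with made-up paths and byte ranges; with real files you would `json.load` each snapshot instead.

```python
def changed_keys(old_refs, new_refs):
    """Return keys that were added, removed, or re-pointed."""
    keys = set(old_refs) | set(new_refs)
    return sorted(k for k in keys if old_refs.get(k) != new_refs.get(k))


# Hypothetical snapshots: the bucket, paths, offsets and lengths are invented.
old = {
    "version": 1,
    "refs": {
        "air/4.6.10": ["s3://example-bucket/air.nc", 1000, 500],
        "air/4.7.10": ["s3://example-bucket/air.nc", 1500, 500],
    },
}
new = {
    "version": 1,
    "refs": {
        "air/4.6.10": ["s3://example-bucket/air.nc", 1000, 500],
        "air/4.7.10": ["/tmp/example_local_chunk"],  # re-pointed to a local file
    },
}

changed = changed_keys(old["refs"], new["refs"])
print(changed)  # ['air/4.7.10']
```

A list like this is the kind of provenance one could record in catalogue metadata next to each snapshot.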