Untangling a 90s data format

written by Martin Durant on 2024-08-29

Summary

Here is a brief delve into the internals of "HDF4". Although this format has been superseded, archival data exists in it and is still being written, because why change a pipeline? For the purposes of kerchunk, I was nerd-sniped into seeing how hard it would be to find the binary buffer offsets/sizes in a file, and found out just how convoluted a file format can get. The result can be seen in a PR.


Kerchunk for 90s data

For those that don't know, kerchunk is a project for bringing archival data into the cloud era. It finds the binary buffers within array-oriented data formats and presents a virtual zarr dataset, so that you can read the data directly from remote storage, with concurrency and parallelism.
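
To make that concrete: what kerchunk produces is essentially a mapping from zarr chunk keys to (url, offset, size) triples. Here is a minimal hand-written sketch, where the file name, offset and size are invented for illustration, assuming the zarr v2 conventions that kerchunk emits:

```python
import fsspec
import zarr

# A tiny kerchunk reference set (version 1 format); the S3 path,
# offset (4096) and size (400 bytes) below are made up.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "data/.zarray": '{"zarr_format": 2, "shape": [100], "chunks": [100], '
                        '"dtype": "<f4", "compressor": null, "filters": null, '
                        '"fill_value": null, "order": "C"}',
        "data/0": ["s3://bucket/archive.hdf", 4096, 400],  # url, offset, size
    },
}

# The "reference" filesystem turns those byte ranges into a zarr store.
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3")
z = zarr.open(fs.get_mapper(""), mode="r")
```

Reading `z["data"]` then turns into byte-range requests against the original file; no copy of the data is ever made.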

Kerchunk already handled HDF5, TIFF, grib2, FITS, netCDF3 and zarr itself. But there are always other array storage formats that might be interesting, and it turns out that NASA has a good deal of HDF4 (see here). Now, there are some readers of this file format for python, particularly rasterio/GDAL, netCDF4 (recent versions) and satpy. However, they don't expose the internals of the file, which is the information kerchunk needs.
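
For the formats already covered, generating references is a one-pass scan of each file. A minimal sketch using the existing HDF5 driver, with a hypothetical input URL:

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://bucket/archive.h5"  # hypothetical input file
with fsspec.open(url, "rb") as f:
    # Scan the file once, recording where every chunk lives.
    refs = SingleHdf5ToZarr(f, url).translate()
```

The PR mentioned above adds an analogous scanner for HDF4.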

So I started to read the 222-page manual. Just for fun! I concluded that they were after two features common in the 90s:

These concerns only make sense if you are working exclusively with local disk, with seek() and low latency. None of this is useful for archival data.
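
By contrast, once the offsets and sizes are known, remote access needs no seeking at all: each chunk becomes a single ranged request. A sketch with fsspec, where the URL and byte range are invented for illustration:

```python
import fsspec

fs = fsspec.filesystem("https")
# One HTTP range request fetches exactly one chunk's bytes;
# the URL, offset and size here are made up.
chunk = fs.cat_file("https://example.org/archive.hdf", start=4096, end=4096 + 400)
```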

Untangling

Some interesting facets of HDF4:
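
To give a flavour of the lowest level: an HDF4 file begins with a 4-byte magic number, followed by a linked list of "data descriptor" (DD) blocks, where every object in the file is a (tag, ref, offset, length) entry. A minimal sketch of walking that list, based on my reading of the specification:

```python
import struct

HDF4_MAGIC = b"\x0e\x03\x13\x01"

def walk_dds(f):
    """Yield (tag, ref, offset, length) from an HDF4 file's DD blocks.

    Each block is headed by a 16-bit entry count and a 32-bit offset to
    the next block (0 terminates the list); all integers are big-endian.
    Unused slots carry a null tag.
    """
    if f.read(4) != HDF4_MAGIC:
        raise ValueError("not an HDF4 file")
    offset = f.tell()  # first DD block immediately follows the magic
    while offset:
        f.seek(offset)
        ndd, nxt = struct.unpack(">HI", f.read(6))
        for _ in range(ndd):
            yield struct.unpack(">HHII", f.read(12))
        offset = nxt
```

Everything else — chunk tables, linked-block storage, compression records — hangs off these tag/ref pairs, and chasing them from one to the next is where the convolution lives.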

Conclusion

Well, it turns out that the chunks, at least in the data I've seen, are tiny. That means that kerchunking thousands of files might be problematic. However, having put in considerable work already, maybe what I have will be useful to some people.