Untangling a 90s data format
written by Martin Durant on 2024-08-29
Summary
Here is a brief delve into the internals of "HDF4". Although this format has been
superseded, archival data exists in it and is still being written, because why change
a pipeline. For the purposes of kerchunk, I was nerd-sniped into seeing how hard it
would be to find the binary buffer offsets/sizes in a file, and found out just how
convoluted a file format can get. The result can be seen in a PR.
Kerchunk for 90s data
For those that don't know, kerchunk is a project for
bringing archival data into the cloud era. It finds the binary buffers within array-oriented
data formats and presents them as a virtual zarr dataset, so that you can read the data
directly from remote storage, with concurrency and parallelism.
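To make that concrete, here is a rough, hypothetical sketch of what kerchunk's output looks like: zarr metadata stored inline, and each chunk key mapped to [url, offset, size] within the original file. The file name, offsets and sizes below are made up; fsspec's "reference" filesystem is what turns such a mapping into something zarr can open.

```python
import fsspec
import zarr

# Hypothetical kerchunk-style references: zarr metadata inline, and each
# chunk key mapped to [url, offset, size] within the original remote file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temperature/.zarray": (
            '{"shape": [100, 100], "chunks": [100, 100], "dtype": "<f4",'
            ' "compressor": null, "filters": null, "fill_value": null,'
            ' "order": "C", "zarr_format": 2}'
        ),
        # one uncompressed 100x100 float32 chunk = 40000 bytes at some offset
        "temperature/0.0": ["s3://some-bucket/archive.hdf", 4242, 40000],
    },
}

# fsspec presents the references as a read-only mapping, so zarr fetches
# each chunk's byte range straight from the remote file.
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3")
ds = zarr.open(fs.get_mapper(""), mode="r")
print(ds["temperature"][:5, :5])
```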
Kerchunk already handles HDF5, TIFF, grib2, FITS, netCDF3 and zarr itself. But there are always
other array storage formats that might be interesting, and it turns out that NASA has a good deal
of HDF4 (see here). Now, there are some
readers of this file format for python, particularly rasterio/gdal,
netcdf4 (recent versions) and satpy. However, they don't expose the internals of the file, the
information kerchunk needs.
So I started to read the 222-page manual. Just for fun!
I conclude that the designers were after a few features common in the 90s:
- in-place editing of attributes and metadata
- appending new arrays or extending existing arrays without rewriting the file
- features were added over time, but none were removed
These concerns only make sense if you are working exclusively with local disk, with seek() and
low latency. None of this is useful for archival data.
Untangling
Some interesting facets of HDF4:
- the items in the file are enumerated in a centralized index, but actually it's a linked list,
  spread throughout the file. Each item is a combination of a class identifier and a reference,
  and this combination is unique. Each item refers to an offset and length in the file
  (the first sketch after this list walks exactly this index).
- some of the items are "extended", which means that the offset/length points to another item
  not listed in the central index. It contains a "special" identifier and, depending on
  the type, links to either a linked list or to a chunk manifest
- for the linked-list variant, you follow a chain of data items, each stating whether it is the
  last and giving a pointer to its data; the payload is the concatenation of those data sections.
- chunk manifests are stored as tables: the special item links to a table description which
  gives the number of rows and the number of dimension columns; a corresponding data item
  points to the actual binary of the table (it can be read as a numpy array), but as it happens,
  this is probably itself stored as a linked-list extended item.
- each row of the chunk manifest points to yet another item, which tells you where the actual data
  for that chunk is, and whether it is compressed (compression seems to be zlib in the example
  I am working from; see the second sketch after this list)
- special group items include other items (as a list), which can in turn be groups too, and this
  is how the dataset is built up
- we realise this must be archival data written in one shot, because the arrays are actually
contiguous, and the root group item is the last one
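As an illustration of the first point, here is a minimal sketch of walking that linked list of "data descriptor" blocks. It assumes the standard HDF4 layout (a 4-byte magic number, then blocks that each hold a 16-bit descriptor count, a 32-bit offset to the next block, and 12-byte big-endian tag/ref/offset/length records); the file name is made up.

```python
import struct

HDF4_MAGIC = b"\x0e\x03\x13\x01"

def walk_descriptors(f):
    """Yield (tag, ref, offset, length) for every data descriptor,
    following the linked list of DD blocks through the file."""
    if f.read(4) != HDF4_MAGIC:
        raise ValueError("not an HDF4 file")
    next_block = 4  # the first DD block follows the magic number
    while next_block:
        f.seek(next_block)
        # block header: number of descriptors, offset of the next block (0 = end)
        ndd, next_block = struct.unpack(">HI", f.read(6))
        for _ in range(ndd):
            # each descriptor: tag, reference number, offset, length
            yield struct.unpack(">HHII", f.read(12))

with open("MOD021KM.hdf", "rb") as f:  # hypothetical file
    index = {(tag, ref): (off, length)
             for tag, ref, off, length in walk_descriptors(f)}
```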
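And for the chunk-manifest points: once you have the table's raw bytes and know the number of rows and dimension columns from its description, the table can be viewed as a numpy record array, and each row's tag/ref looked up in the index above to fetch and decompress the chunk. The dtype here is an assumption about the layout (one unsigned origin value per dimension plus a tag/ref pair); the real layout comes from the table description stored in the file.

```python
import zlib
import numpy as np

def read_chunk_table(table_bytes, nrows, ndims):
    """Interpret the chunk-manifest bytes as a record array.
    Assumed layout: the chunk's origin (one value per dimension),
    then the tag and reference of the item holding its bytes."""
    dtype = np.dtype([("origin", ">u4", (ndims,)),
                      ("tag", ">u2"), ("ref", ">u2")])
    return np.frombuffer(table_bytes, dtype=dtype, count=nrows)

def load_chunk(f, index, tag, ref, compressed=True):
    """Fetch one chunk's bytes via the descriptor index and, in the
    files I have looked at, zlib-decompress them."""
    offset, length = index[(tag, ref)]
    f.seek(offset)
    data = f.read(length)
    return zlib.decompress(data) if compressed else data
```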
Conclusion
Well, it turns out that the chunks, in at least the data I've seen, are tiny. That means that
kerchunking thousands of files might be problematic. However, having already put in considerable
work, maybe what I have will be useful to some people.