fsspec Retrospective

written by Martin Durant on 2026-05-19

This is a description of how fsspec came to be, why it is like it is, and how the journey has been for me. Consider it a 10-year gift. It follows a period of surprising activity in python remote storage, not least the advances in gcsfs rapid buckets.

Origins

When I arrived at Anaconda (formerly Continuum Analytics) in 2015, I started as an instructor and curriculum developer for data science, but it wasn't long before I started to work with the dask team (almost all of whom were colleagues at the time).

This was the era of Hadoop and Spark, and dask wished to play in the same world, distributing jobs onto clusters of workers. That required two new features: packaging IO tasks so that they could be sent to workers and access bytes in storage from there; and integration with the parquet tabular data storage format. It also required ways to deploy to AWS, YARN clusters and others.

(The parquet library would become fastparquet, the pioneering and leading parquet library for python for several years. But that's another story. There is a similar story for other utilities like knit, dask-yarn, dask-gateway and more.)

The first non-standard storage of concern was Hadoop's own HDFS. Luckily, the C library libhdfs3 had recently appeared, and over that year's holidays, three of us had an informal contest to see who could write the better python wrapper. My version was simple, using ctypes, versus a raw C python extension library and a cython one. ctypes had the advantage of not needing a compilation step, and the API being wrapped was simple enough to work well. hdfs3 was my first filesystem.

Design

The initial code inside dask.bytes provided a uniform interface to different storage backends, so that the rest of dask could make use of them. Some important considerations:

dispatch based on the protocol of a URL ("s3://" is not the same as "hdfs://")
serializable filesystems and file-like instances, so they could be distributed to dask workers
tokens for filesystem instances, so they could be cached/reused
transparent text mode and compression (e.g., to read .csv.gz files)
glob handling (e.g., to read file*.csv, or to write file1.csv, file2.csv from a series of tasks)
buffering of bytes for read and write in file instances
utilities for finding chunks of files, e.g., newline delimiters (this is the only functionality still left inside dask.bytes: link).
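
Three of the considerations above (protocol dispatch, instance tokens, and serializability) can be sketched in a few lines. This is a toy model using only the standard library, not fsspec's actual code: every name in it (register, make_token, ToyFileSystem, filesystem, from_json, resolve) is hypothetical, and JSON stands in for whatever serialization a real cluster would use.

```python
import hashlib
import json

# All names here are hypothetical; this is a toy, not fsspec's actual code.
_REGISTRY = {}   # protocol string -> filesystem class
_INSTANCES = {}  # token -> cached filesystem instance


def register(protocol):
    """Class decorator recording which class serves a URL protocol."""
    def wrap(cls):
        cls.protocol = protocol
        _REGISTRY[protocol] = cls
        return cls
    return wrap


def make_token(protocol, kwargs):
    """Deterministic token derived from the protocol and constructor arguments."""
    text = json.dumps([protocol, kwargs], sort_keys=True)
    return hashlib.sha256(text.encode()).hexdigest()


class ToyFileSystem:
    protocol = "abstract"

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.token = make_token(self.protocol, kwargs)

    def to_json(self):
        # Serialize only the recipe (protocol + arguments), never live connections.
        return json.dumps({"protocol": self.protocol, "kwargs": self.kwargs})


@register("s3")
class ToyS3(ToyFileSystem):
    pass


@register("hdfs")
class ToyHDFS(ToyFileSystem):
    pass


@register("file")
class ToyLocal(ToyFileSystem):
    pass


def filesystem(protocol, **kwargs):
    """Return the cached instance for this token, creating it on first use."""
    token = make_token(protocol, kwargs)
    if token not in _INSTANCES:
        _INSTANCES[token] = _REGISTRY[protocol](**kwargs)
    return _INSTANCES[token]


def from_json(text):
    """Rebuild (or reuse) an instance from its serialized recipe, e.g. on a worker."""
    spec = json.loads(text)
    return filesystem(spec["protocol"], **spec["kwargs"])


def resolve(url, **kwargs):
    """Split 'protocol://path' and dispatch; bare paths fall back to 'file'."""
    protocol, sep, path = url.partition("://")
    if not sep:
        protocol, path = "file", url
    return filesystem(protocol, **kwargs), path


fs, path = resolve("s3://bucket/data/file1.csv", anon=True)
same = filesystem("s3", anon=True)     # same arguments: cached instance is reused
other = filesystem("s3", anon=False)   # different arguments: different token
clone = from_json(fs.to_json())        # round trip through the serialized form
print(type(fs).__name__, path, fs is same, fs is other, clone is fs)
```

The point of the token is that two requests with the same arguments land on the same object, so a worker that deserializes a filesystem many times does not create many connections.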

I made the solo decision to roughly follow posix naming and conventions for the filesystem interface, since I thought this would be most familiar to other users and developers. It has caused a lot of friction through the years with backends that are fundamentally not structured that way, but we do the best we can.

The only plausible other library around at the time that did some of this was pyfilesystem, aka fs. Its maintainers were not amenable to considering some of dask's requirements such as serializability, unfortunately.

The final list of main features is at https://filesystem-spec.readthedocs.io/en/latest/features.html .

Adoption

Anaconda at the time was a lovely place to launch this idea. Not only did dask adopt the newly-born fsspec (of course, it was designed for it), but we had contributors to pandas, xarray and other important python/data libraries. Since there was really no other way to dispatch IO to different storage systems at the backend, but most libraries supported writing to a generic file-like object (i.e., one that matches the python builtin file type), it was simple to extend their various read/write functions to take general URLs and call fsspec. Of course, I had a direct hand in zarr and parquet integration; I think it's fair to say that, together, we bootstrapped the "cloud native" era for python.

I certainly had no idea that it would take off; it was meant to be just a means to an end. Some of that must be down to luck and circumstance, but some also to providing something genuinely useful. Usage built slowly over time, and I wasn't really aware of it (we continue not to see many issues compared to other OSS projects) until relatively recently. Unlike other efforts I have led (good ideas that ought to have users), I have not taken fsspec to conferences or done anything particular to attract users, downstream libraries or contributors. Maybe this led me to falsely believe that users would flock to everything I made...

From there, it was a fairly straightforward process of adding features, more backends, and strengthening the code. Some of these new features ended up being more widely used than others! I think the hardest part throughout was probably around the caching of file listings.

One major switch along the way was implementing an asyncio-based backend for cloud storage (s3, gcs, azure, http). This produced an immense speedup for bulk operations (upload, download, copy, delete) on many small files; but it made very little difference to calling code using the file-like interface, at least until we can upstream the prefetcher from gcsfs to fsspec.
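
To see why concurrency helps bulk operations on many small files, here is a minimal sketch using only the standard library. It is an illustration under stated assumptions, not the code in any fsspec backend: fake_fetch is a hypothetical stand-in that sleeps instead of making a network request, and the batch_size limit is an arbitrary choice for the example.

```python
import asyncio
import time


async def fake_fetch(key, latency=0.01):
    """Hypothetical stand-in for one remote request; sleeps instead of doing IO."""
    await asyncio.sleep(latency)
    return key, f"contents of {key}".encode()


async def fetch_sequential(keys):
    # One request at a time: total time is roughly len(keys) * latency.
    return dict([await fake_fetch(k) for k in keys])


async def fetch_concurrent(keys, batch_size=8):
    # Many requests in flight at once, capped by a semaphore.
    sem = asyncio.Semaphore(batch_size)

    async def limited(key):
        async with sem:
            return await fake_fetch(key)

    return dict(await asyncio.gather(*(limited(k) for k in keys)))


def timed(coro_func, keys):
    """Run a coroutine function to completion and report the wall-clock time."""
    start = time.perf_counter()
    result = asyncio.run(coro_func(keys))
    return result, time.perf_counter() - start


keys = [f"bucket/file{i}.csv" for i in range(32)]
seq_result, seq_time = timed(fetch_sequential, keys)
con_result, con_time = timed(fetch_concurrent, keys)
print(f"sequential: {seq_time:.3f}s, concurrent: {con_time:.3f}s")
```

Because each small file is dominated by request latency rather than bandwidth, overlapping the waits is where the gain comes from. A single file-like read is one request after another, which is consistent with the file interface seeing little benefit.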

Status

fsspec gets about 700M downloads per month as of April 2026 (see here). The full set of known implementations is listed here (internal) and here (external); it's a long list indeed.

We have institutional buy-in from Microsoft, Google, HuggingFace and others. Essentially, they all came to me when they realised that users value fsspec as part of their work on their respective platforms. (Notice I've never heard anything from AWS.) Many contributors have weighed in over the years, and the packages are depended on by a large number of other libraries. I think we can say that fsspec does a thing well (or well enough), limits its scope to just that functionality, and provides a useful convenience to other developers.

Although I have continued to maintain the repos throughout (with occasional co-admins), I do also want to make new features and totally new things, so sometimes "just maintenance" can definitely be onerous. fsspec continues as a more-or-less solo operation. There are many contributors and a lot of activity, but still no formal governance structure (no council, no NumFOCUS affiliation). Anaconda continues to be the biggest financial supporter, by paying my salary.

It seems that storage, as a theme, is having a bit of a moment right now. Some of this is squarely pushing fsspec forward (such as the gcsfs work with Google, mentioned above), some of it is broadening the possibilities (such as s3files) and some is in between (such as huggingface buckets). I am at a loss as to why this is happening now, but it's fun to see!

Thoughts

Ten years is a long time to be running a project! I had not anticipated the reach that fsspec would have and perhaps I would have made different choices if I had. I am proud of the usage and community that fsspec fits into, and this is already a good run for a tool meant as a decent convenience for others.