Version 0.2.0 of filesystem-spec (fsspec for the remainder of this article) has been released, and everyone is invited to try it and make their thoughts known!
On github: https://github.com/martindurant/filesystem_spec
Documentation: https://filesystem-spec.readthedocs.io/
Installation
conda install -c conda-forge fsspec # or pip install fsspec
I have been previously involved with building interfaces to storage systems, which have become popular and widely used:
and a general mechanism for accessing files on these systems within Dask. For the purposes of Dask, it was critical to be able to resolve glob patterns on a given file-system, pass arguments (such as credentials) through to that system, and be able to read arbitrary sections of some file without downloading or reading the whole file. This is how functions such as dd.read_csv are distributed to the nodes of a Dask cluster, and each worker reads different parts of the data files from network or remote storage.
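The two requirements named above, resolving a glob pattern and reading only part of a file, can be sketched against the local file-system. This is a minimal illustration that assumes a recent fsspec release rather than 0.2.0 specifically; the temporary directory and the part-*.csv file names are hypothetical demo data.

```python
import os
import tempfile

import fsspec  # third-party: pip install fsspec

# Create two small CSV files in a scratch directory (hypothetical demo data).
tmpdir = tempfile.mkdtemp()
demo_files = [
    ("part-0.csv", b"a,b\n1,2\n3,4\n"),
    ("part-1.csv", b"a,b\n5,6\n7,8\n"),
]
for name, body in demo_files:
    with open(os.path.join(tmpdir, name), "wb") as f:
        f.write(body)

fs = fsspec.filesystem("file")  # the local file-system implementation

# Resolve a glob pattern on the file-system.
paths = sorted(fs.glob(tmpdir + "/part-*.csv"))
names = [os.path.basename(p) for p in paths]

# Read an arbitrary byte range without reading the whole file.
with fs.open(paths[0], "rb") as f:
    f.seek(4)          # skip the 4-byte header line b"a,b\n"
    chunk = f.read(4)  # read only the next 4 bytes

print(names, chunk)
```

On a remote backend the same seek and read calls would translate into a ranged request, which is what lets each worker fetch only its own slice of a file.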
fsspec defines an abstract file-system for other filesystem interfaces to be based on, with an interface similar to the builtin os module. It is meant to unify the above interfaces and codify all of the extra functionality - not to mention cutting out a lot of duplication between the projects. This means a unified API for accessing any file storage-like system (archive, cluster, remote), performing normal file operations (copy, delete, list) and obtaining file-like instances to pass to other libraries. It also makes it much easier to add higher-level functionality to all the implementations, such as transparent compressed and text-mode files, a key-value mapping, file search and transactional writing, all in one place.
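As a rough sketch of what a unified API buys you, the helper below is written once and run against two different backends, the local file-system and fsspec's in-memory one. It assumes a recent fsspec release; the helper name write_and_list and the directory name /unified_api_demo are made up for this illustration.

```python
import tempfile

import fsspec  # third-party: pip install fsspec


def write_and_list(fs, root):
    """Write two files under `root` and return {basename: size}.

    Only methods shared by every fsspec file-system are used here.
    """
    fs.makedirs(root, exist_ok=True)
    for name, data in [("a.txt", b"hello"), ("b.txt", b"fsspec!")]:
        with fs.open(root + "/" + name, "wb") as f:
            f.write(data)
    listing = fs.ls(root, detail=True)
    return {info["name"].rsplit("/", 1)[-1]: info["size"] for info in listing}


# The same function works unchanged on two different backends.
local_result = write_and_list(fsspec.filesystem("file"), tempfile.mkdtemp())
memory_result = write_and_list(fsspec.filesystem("memory"), "/unified_api_demo")

# The key-value mapping view: keys are paths relative to the root,
# values are the file contents as bytes.
store = fsspec.filesystem("memory").get_mapper("/unified_api_demo")
value = store["a.txt"]

print(local_result, memory_result, value)
```

Because the mapping is built on the abstract interface rather than on any one backend, it is available to every implementation without extra work.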
Whilst s3fs and gcsfs remain separate projects for now, both have integration PRs to show how they would be adapted to fsspec, without changing the user experience. These PRs mostly remove duplicated code. The following implementations are included in the repo, and we can use them for a demonstration:
Every file-system has the same API.
The completeness of each implementation varies: HTTP and zip-files are read-only, for example, and not all HTTP servers respect HEAD and byte-range requests.
An example use of the interface may be as follows. Start an FTP server in one process (this would normally be running elsewhere):
python -m pyftpdlib -d /Users/tmp/data -u user -P pass -w
and in python
>>> import fsspec
>>> fs = fsspec.filesystem('ftp', host='localhost', port=2121, username='user', password='pass')
>>> fs.ls('/')
['/text', '/text2', '/comp.zip']
>>> fs.cat('/text')
Hello fsspec
>>> fs.info('/text')
{'modify': '20190223200610', 'perm': 'radfwMT', 'size': 14, 'type': 'file', 'unique': '1000004g202c662ff', 'name': '/text'}
>>> fs.cat('/text')
b'Hello fsspec\n'
>>> with fs.open('/text2', 'wt') as f:
...     f.write('Fóreign téxtá')
>>> with fs.open('/text2', 'rt') as f:
...     f.seek(7)
...     print(f.read())
téxtá
or even chain file-systems:
>>> fs = fsspec.filesystem('sftp', host='localhost')  # now via SSH
>>> f = fs.open('/Users/mdurant/data/comp.zip', 'rb')  # contains previous file "text"
>>> fs2 = fsspec.filesystem('zip', fo=f)
>>> fs2.cat('text')
b'Hello fsspec\n\n'
Finally, making use of the more general functionality - this is typically where external libraries would start:
>>> files = fsspec.open_files('ftp:///text*', host='localhost', port=2121, username='user', password='pass', mode='rt')
>>> files
[<OpenFile '/text'>, <OpenFile '/text2'>]
>>> with files[0] as f:
...     print(f.readline())
Hello fsspec
where each OpenFile is a serializable instance which points to a text-mode file-like object which could be passed to any python function that expects a file. Being serializable means that the instance can be safely passed between processes and even machines (so long as they can access the same underlying resource) - actual access only happens in the with context blocks.
There is no compression in this case, but there could have been.
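Both points, serializable handles and optional compression, can be sketched together on a local file. This is only an illustration under stated assumptions: it uses a recent fsspec release, a gzip file created on the spot, and pickle standing in for whatever mechanism would ship the handle to another process.

```python
import gzip
import os
import pickle
import tempfile

import fsspec  # third-party: pip install fsspec

# Prepare a gzip-compressed text file (hypothetical demo data).
path = os.path.join(tempfile.mkdtemp(), "text.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("Hello fsspec\n")

# Creating the OpenFile does not open anything yet.
handle = fsspec.open(path, mode="rt", compression="gzip", encoding="utf-8")

# The handle survives a pickle round trip, as it would when sent to a worker.
restored = pickle.loads(pickle.dumps(handle))

# The file is opened, decompressed and decoded only inside the with block.
with restored as f:
    line = f.readline()

print(repr(line))
```

The receiving side needs access to the same underlying path or URL, since only the description of the file travels, not its contents.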
Now that fsspec is here, it is conceivable that Dask could be changed to depend upon it, so that all the file- and bytes-handling code can move out of Dask and become more widely available to people and other libraries. That is my hope. Similarly, it would make a lot of sense for s3fs, gcsfs and indeed azure-datalake to eventually depend on fsspec, avoiding the redundant code in the various projects.
It is also much easier now to add new implementations such as azure-blob-storage and a whole host of other storage media.
We need tests! We need more implementations! Perhaps more importantly, we need a conversation about the right way to approach the trickier issues, such as file-listing caching, read-ahead, and what to do on systems that don't have real hierarchical directories (e.g., s3).
All fsspec-compliant classes also subclass from Arrow's file-system, if installed, so any Arrow function that expects one of its own file-systems will happily work with any of the implementations.