Announcing filesystem-spec

written by Martin Durant on 2019-02-22

Version 0.2.0 of filesystem-spec (fsspec for the remainder of this article) has been released, and everyone is invited to try it and make their thoughts known!

On GitHub: https://github.com/martindurant/filesystem_spec

Documentation: https://filesystem-spec.readthedocs.io/

Installation

conda install -c conda-forge fsspec
# or
pip install fsspec

Overview

I have previously been involved in building interfaces to storage systems, several of which have become popular and widely used:

s3fs for AWS S3
gcsfs for Google Storage
Microsoft azure-datalake-store
hdfs3 for Hadoop

and a general mechanism for accessing files on these systems within Dask. For the purposes of Dask, it was critical to be able to resolve glob patterns on a given file-system, pass arguments (such as credentials) through to that system, and be able to read arbitrary sections of some file without downloading or reading the whole file. This is how functions such as dd.read_csv are distributed to the nodes of a Dask cluster, and each worker reads different parts of the data files from network or remote storage.
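
To illustrate the byte-range requirement, here is a minimal sketch using only the standard library. The helper names split_offsets and read_block are hypothetical and are not taken from Dask or fsspec, and an in-memory buffer stands in for a file on remote storage.

```python
import io

def split_offsets(size, blocksize):
    """Return (offset, length) pairs that cover `size` bytes in order."""
    return [(start, min(blocksize, size - start))
            for start in range(0, size, blocksize)]

def read_block(f, offset, length):
    """Read `length` bytes starting at `offset`, leaving the rest unread."""
    f.seek(offset)
    return f.read(length)

# An in-memory buffer stands in for a remote file (assumption for the demo).
data = b"a,b\n1,2\n3,4\n5,6\n"
remote = io.BytesIO(data)

# Each (offset, length) pair could be handed to a different worker.
blocks = [read_block(remote, off, n) for off, n in split_offsets(len(data), 6)]
```

A real CSV reader would also need to move block boundaries onto line delimiters so that no row is split between two workers; that step is left out here.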

fsspec defines an abstract file-system for other file-system interfaces to be based on, with an interface similar to the builtin os module. It is meant to unify the above interfaces and codify all of the extra functionality - not to mention cutting out a lot of duplication between the projects. This means a unified API to access any file storage-like system (archive, cluster, remote), perform normal file operations (copy, delete, list) and obtain file-like instances to pass to other libraries. It also makes it much easier to add higher-level functionality to all the implementations in one place, such as transparent compressed and text-mode files, a key-value mapping, file search and transactional writing.
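
The idea of writing higher-level functionality once, on top of a few backend primitives, can be sketched as follows. This is a toy illustration only: AbstractFS, MemoryFS and _MemoryWriter are hypothetical names invented for this example and do not reproduce fsspec's actual classes or method signatures.

```python
import io
from abc import ABC, abstractmethod

class AbstractFS(ABC):
    """Hypothetical base class: backends implement only the primitives."""

    @abstractmethod
    def ls(self, path):
        """List the paths stored under `path`."""

    @abstractmethod
    def open(self, path, mode="rb"):
        """Return a binary file-like object for `path`."""

    # Higher-level helpers are written once and shared by every backend.
    def cat(self, path):
        with self.open(path, "rb") as f:
            return f.read()

    def read_text(self, path, encoding="utf-8"):
        return self.cat(path).decode(encoding)


class _MemoryWriter(io.BytesIO):
    """Buffer that commits its contents to the store when closed."""

    def __init__(self, store, path):
        super().__init__()
        self._store, self._path = store, path

    def close(self):
        if not self.closed:
            self._store[self._path] = self.getvalue()
        super().close()


class MemoryFS(AbstractFS):
    """Toy backend that keeps file contents in a dict."""

    def __init__(self):
        self.store = {}

    def ls(self, path):
        return sorted(p for p in self.store if p.startswith(path))

    def open(self, path, mode="rb"):
        if mode == "rb":
            return io.BytesIO(self.store[path])
        if mode == "wb":
            return _MemoryWriter(self.store, path)
        raise ValueError(f"unsupported mode: {mode!r}")


fs = MemoryFS()
with fs.open("/data/a.txt", "wb") as f:
    f.write("Hello fsspec\n".encode("utf-8"))
```

Here cat and read_text never mention the backend, so a second backend that implements ls and open would get both for free.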

Usage

Whilst s3fs and gcsfs remain separate projects for now, both have integration PRs to show how they would be adapted to fsspec, without changing the user experience. These PRs mostly remove duplicated code. The following implementations are included in the repo, and we can use them for a demonstration:

SSH
Hadoop HDFS
HTTP
SFTP
FTP
memory
webHDFS
local file-system
ZIP file

Every file-system has the same API.

The completeness of each implementation varies: HTTP and zip-files are read-only, for example, and not all HTTP servers respect HEAD and byte-range requests.

An example use of the interface may be as follows. Start an FTP server in one process (this would normally be running elsewhere):

python -m pyftpdlib -d /Users/tmp/data -u user -P pass -w

and in Python:

>>> import fsspec
>>> fs = fsspec.filesystem('ftp', host='localhost', port=2121, username='user', password='pass')
>>> fs.ls('/')
['/text',
 '/text2',
 '/comp.zip']
>>> fs.info('/text')
{'modify': '20190223200610',
 'perm': 'radfwMT',
 'size': 14,
 'type': 'file',
 'unique': '1000004g202c662ff',
 'name': '/text'}
>>> fs.cat('/text')
b'Hello fsspec\n'
>>> with fs.open('/text2', 'wt') as f:
...     f.write('Fóreign téxtá')
>>> with fs.open('/text2', 'rt') as f:
...     f.seek(7)
...     print(f.read())
 téxtá

or even chain file-systems:

>>> fs = fsspec.filesystem('sftp', host='localhost')  # now via SSH
>>> f = fs.open('/Users/mdurant/data/comp.zip', 'rb')  # contains previous file "text"
>>> fs2 = fsspec.filesystem('zip', fo=f)
>>> fs2.cat('text')
b'Hello fsspec\n\n'
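
Chaining of this kind is possible because an archive reader only needs a seekable binary file-like object, wherever it comes from. The same principle can be shown with the standard library's zipfile module; this is an analogy rather than fsspec's own implementation, and the in-memory buffer stands in for the file opened over SFTP.

```python
import io
import zipfile

# Build a small archive in memory; it stands in for the remote comp.zip.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as z:
    z.writestr("text", "Hello fsspec\n")

# Any seekable binary file-like object can back the archive reader,
# so the outer layer only has to provide something that behaves like a file.
raw.seek(0)
with zipfile.ZipFile(raw) as z:
    names = z.namelist()
    content = z.read("text")
```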

Finally, making use of the more general functionality - this is typically where external libraries would start:

>>> files = fsspec.open_files('ftp:///text*', host='localhost', port=2121, username='user', password='pass', mode='rt')
>>> files
[<OpenFile '/text'>, <OpenFile '/text2'>]

>>> with files[0] as f:
...     print(f.readline())
Hello fsspec

where each OpenFile is a serializable instance which points to a text-mode file-like object which could be passed to any python function that expects a file. Being serializable means that the instance can be safely passed between processes and even machines (so long as they can access the same underlying resource) - actual access only happens in the with context blocks. There is no compression in this case, but there could have been.
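
The deferred-open behaviour can be sketched with a small class. LazyFile is a hypothetical stand-in written for this example, not fsspec's OpenFile: it stores only a path and an encoding, so it pickles cleanly, and it touches the file only inside the with block.

```python
import os
import pickle
import tempfile

class LazyFile:
    """Hypothetical stand-in for an OpenFile-like object (text mode only)."""

    def __init__(self, path, encoding="utf-8"):
        # Only plain, picklable values are stored; no file handle yet.
        self.path, self.encoding = path, encoding
        self._f = None

    def __enter__(self):
        # The real open happens here, inside the with block.
        self._f = open(self.path, "rt", encoding=self.encoding)
        return self._f

    def __exit__(self, *exc):
        self._f.close()
        self._f = None


# Create a small file to point at.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "text")
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello fsspec\n")

# Round-trip through pickle, as would happen when sending to a worker.
lazy = pickle.loads(pickle.dumps(LazyFile(path)))
with lazy as f:
    first_line = f.readline()
```

Because no handle exists outside the with block, the pickled object carries nothing tied to the process that created it.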

The future

Now that fsspec is here, it is conceivable that Dask could be changed to depend upon it. That would let us move all the file and bytes-handling code out of Dask, and so make it more widely available to people and other libraries. That is my hope. Similarly, it would make a lot of sense to eventually have s3fs, gcsfs and indeed azure-datalake depend on fsspec, and so avoid all the redundant code in the various projects.

It is also much easier now to add new implementations such as azure-blob-storage and a whole host of other storage media.

We need tests! We need more implementations! Perhaps more importantly, we need a conversation about the right way to approach the trickier issues, such as file listing caching, read-ahead, and what to do on systems that don't have real hierarchical directories (e.g., s3).
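
To make the last of these issues concrete, here is one possible way to emulate a directory listing over a flat key namespace. It is only a sketch of one option under discussion: the function name ls_flat and the convention of marking pseudo-directories with a trailing slash are assumptions made for this example.

```python
def ls_flat(keys, prefix):
    """List the immediate children of `prefix` in a flat key namespace."""
    prefix = prefix.rstrip("/") + "/" if prefix else ""
    children = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is the prefix itself
        head, sep, _ = rest.partition("/")
        # A remaining "/" means `head` acts as a pseudo-directory.
        children.add(prefix + head + ("/" if sep else ""))
    return sorted(children)

keys = ["data/2019/a.csv", "data/2019/b.csv", "data/readme.txt", "other.txt"]
top_level = ls_flat(keys, "")
in_data = ls_flat(keys, "data")
```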

Postscript

All fsspec-compliant classes also subclass from Arrow's file-system, if installed, so any Arrow function that expects one of its own file-systems will happily work with any of the implementations.