Python and Big Data storage

written by Martin Durant on 2016-12-21

The Python ecosystem is excellent for single-machine data processing and data science, using popular packages such as numpy, scipy, pandas, sklearn and many others.

Increasingly, the volume of data available for processing is such that single-machine, in-memory analysis is no longer an option. Tools such as dask/distributed allow the use of familiar Python-based workflows for larger-than-memory and distributed processing. Similarly, work has been progressing on pythonic access to Big Data file formats such as avro (cyavro, fastavro) and parquet (fastparquet), allowing Python to interoperate with other Big Data frameworks.
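As an illustration (not from the original post), here is a minimal sketch of how dask.dataframe can process a larger-than-memory parquet dataset, with fastparquet handling the file format; the file path and column names are hypothetical:

    # Minimal sketch: dask reads the parquet data in partitions, so the whole
    # dataset never has to fit in memory at once.
    import dask.dataframe as dd

    df = dd.read_parquet('data/mytable.parquet')          # hypothetical path
    result = df.groupby('key')['value'].mean().compute()  # hypothetical columns
    print(result)

The same groupby/aggregate code would also run against pandas; dask simply schedules it over many partitions, on one machine or on a cluster.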

Big Data stores

Pushing data through some CPU cores to produce an output is only one side of the story. The data must first reside in a storage system that is accessible to the processing nodes, and possibly also suitable for long-term archival.

We have built a number of Python libraries for interacting with Big Data storage. They offer a consistent Python interface for file-system-like operations (mkdir, du, rm, mv, etc.), and a standard Python file-like object which can be passed directly to other Python libraries that expect to read from files.

In chronological order:

Example:

>>> import gzip
>>> import pandas as pd
>>> import s3fs
>>> s3 = s3fs.S3FileSystem(anon=True)
>>> s3.ls('mybucket/data/file.csv.gz', detail=True)
[{'ETag': '"f7f330d515645996bb4b40c74fdf15c8-2"',
  'Key': 'data/file.csv.gz',
  'LastModified': datetime.datetime(2016, 7, 5, 18, 28, 13, tzinfo=tzutc()),
  'Size': 127162941,
  'StorageClass': 'STANDARD'}]
>>> with s3.open('mybucket/data/file.csv.gz', 'rb') as f:
...     df = pd.read_csv(gzip.GzipFile(fileobj=f, mode='rb'))
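The same object also exposes the file-system style methods mentioned above. A short sketch of how such a session might look, where the bucket and paths are hypothetical and the calls assume write access rather than anonymous read:

>>> s3 = s3fs.S3FileSystem()       # credentials taken from the environment
>>> s3.mkdir('mybucket/output')    # create a "directory" (key prefix)
>>> s3.du('mybucket/data')         # sizes of keys under a prefix
>>> s3.mv('mybucket/data/old.csv', 'mybucket/archive/old.csv')  # move/rename
>>> s3.rm('mybucket/archive/old.csv')                           # delete a key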

Other good things include:

Things that remain to be done:

The future:

Afterthoughts

All the file-system interfaces mentioned here are open-source, and were supported in their development by Continuum Analytics.