TL;DR: Intake now provides derived datasets with a great deal of flexibility, so you are no longer restricted to ingesting data as provided by a service or set of files. You can encode transforms and data cleaning as part of catalogues without having to write a custom driver each time.
Intake is a Python package for describing, loading and disseminating datasets in catalogues. Newcomers are encouraged to read the main documentation and the quickstart there to understand how Intake can "take the pain out of data access", whether for data users, curators, developers or IT professionals. It is one of my main work areas, so this is not the first blog post on the topic!
Here, we describe a new feature that lets you define new data sources in terms of other data sources. This is a long-requested feature, now available in beta for feedback.
Intake provides catalogues, which are collections of data sources. Catalogues can contain other catalogues to form a hierarchical tree, and you can have several different views on some data, whether it's the same data in different formats or with different loading parameters. All this has been around a while. The data loads into one of several standard container types such as dataframe or array, in memory or lazily, which the data scientist/analyst can then proceed to use with familiar methods, e.g., Pandas calls.
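To make this concrete, a minimal catalogue entry might look like the sketch below (the source name and file path are hypothetical; the `csv` driver and YAML layout follow Intake's documented catalogue format):

```yaml
sources:
  mydata:
    description: Example tabular dataset (hypothetical)
    driver: csv
    args:
      urlpath: "data/example.csv"
```

Opening this catalogue with `intake.open_catalog` and calling `cat.mydata.read()` would then return the data as a Pandas dataframe, one of the standard container types mentioned above.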
The idea of derived datasets has been around in Intake for a while, but we demurred on the basis that Intake was supposed to be an ingest-only package. However, the project lumen, released by my Anaconda colleagues earlier this year, showed how important "views" of a dataset are, at least for the case of visualisations in a dashboard. That project defines transformations (e.g., selecting columns from a dataframe) in a simple way that is reminiscent of Intake's YAML formalism. Note that parameters for visualisations, which can include those in the transform, are meant to be surfaced to the user as interactive GUI widgets.
lumen can pull datasets from Intake specifications, as well as via some custom APIs, for example for streaming datasets (see also: intake-streamz).
So the question becomes: at what point is a view useful only for the purposes of providing a visualisation, and when can that "view" be considered a new dataset in its own right?
We decided that it made sense to have both: that a datasource+transform is indeed another dataset, and one should be able to catalogue it.
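In catalogue terms, that looks roughly like the sketch below, assuming the `DataFrameTransform` driver described in the derived-datasets documentation (the source names, file path and the transform's dotted path are hypothetical):

```yaml
sources:
  input_data:
    driver: csv
    args:
      urlpath: "data/example.csv"
  cleaned:
    driver: intake.source.derived.DataFrameTransform
    args:
      targets:
        - input_data
      transform: "mypackage.cleaning.drop_bad_rows"
      transform_kwargs: {}
```

The derived source names its input via `targets` and applies an importable function to the loaded dataframe, so `cleaned` can be shared and read just like any other catalogue entry.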
We built the functionality and released it without much fanfare in Intake 0.6.2. Documentation is here. The main takeaways are:

- The transform executes when you call `to_dask`, `read` or any other load function on the input dataset; a transform class defines the execution platform for the transform. It is up to the user, however, to ensure that the implied compute resources (e.g., a Dask cluster) are available.
- Exact integration with lumen is still an ongoing conversation, particularly around the question of dynamic parameters (e.g., which columns can be picked, which depends on the current state of the input dataset) and presenting interactive widgets to the user alongside graphical output. We will doubtless converge on a clear separation of concerns. This would allow users, in the future, to experiment with data filters and transforms, and then save their derived dataset into a catalogue to be shared with others.
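Because catalogues are plain YAML, a transform is usually referenced by an importable dotted name rather than a live Python object. The following is a simplified sketch of how such a reference can be resolved with the standard library (`import_name` is a hypothetical helper for illustration, not Intake's actual internal function):

```python
import importlib


def import_name(dotted: str):
    """Resolve a dotted path such as "string.capwords" to the named object.

    A simplified sketch of the kind of lookup performed when a catalogue's
    transform field holds a string; the real helper lives inside Intake
    and handles more cases.
    """
    module_path, _, attr = dotted.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


# Any importable function can be referenced this way:
capwords = import_name("string.capwords")
print(capwords("derived datasets"))  # -> "Derived Datasets"
```

This is why the transform must live in an importable module: anyone reading the catalogue needs that package on their path for the derived dataset to load.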
What works:
What is planned:
What's missing:
Basically, we're waiting for feedback on this feature. I'll be fascinated to find out what kinds of uses people find for it.