Intake's derived datasets!

written by Martin Durant on 2021-05-28

Data graph

TL;DR: Intake now provides derived datasets with a great deal of flexibility, so you are no longer restricted to ingesting data as provided by a service or set of files. You can encode transforms and data cleaning as part of catalogues without having to write a custom driver each time.

Intake is...

Intake is a Python package for describing, loading and disseminating datasets in catalogues. Newcomers are encouraged to read the main documentation and the quickstart there to understand how Intake can "take the pain out of data access", whether for data users, curators, developers or IT professionals. It is one of my main work areas, so this is not the first blog post on the topic!

Here, we describe a new feature: the ability to define new data sources in terms of other data sources. This is a long-requested feature, now available in beta for feedback.

Recap

Intake provides catalogues, which are collections of data sources. Catalogues can contain other catalogues to form a hierarchical tree, and you could have several different views on some data, whether it's the same data in different formats, or different loading parameters. All this has been around a while. The data loads into one of several standard container types such as dataframe or array, in memory or lazy, which the data scientist/analyst can then proceed to use with familiar methods, e.g., Pandas calls.

Impetus

The idea for derived datasets has been around in Intake for a while - but we demurred on the basis that Intake was supposed to be an ingest-only package. However, the project lumen, released by my Anaconda colleagues earlier this year, showed how important "views" of a dataset are, at least for the case of visualisations in a dashboard. That project also showed a simple way to define transformations (e.g., selecting columns from a dataframe) that was reminiscent of Intake's YAML formalism. Note that parameters for visualisations - which can include those in the transform - are meant to be surfaced to the user in interactive GUI widgets.

lumen does pull datasets from Intake specifications, as well as from some custom APIs for, e.g., streaming datasets (see also: intake-streamz). So the question becomes: at what point is a view useful only for the purposes of providing a visualization, and when can that "view" actually be considered a new dataset?

We decided that it made sense to have both: that a datasource+transform is indeed another dataset, and one should be able to catalogue it.
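To make the idea concrete, here is a minimal sketch in plain Python of what "datasource + transform = another dataset" means. It deliberately does not use Intake's own classes: `ToySource`, `ToyDerivedSource`, `select_columns`, `drop_missing` and the dictionary standing in for a catalogue are all hypothetical names invented for this illustration, and the real implementation may differ in every detail.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ToySource:
    """Hypothetical stand-in for an ordinary catalogue entry holding rows."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def read(self) -> List[Row]:
        # Return copies so callers cannot mutate the stored data.
        return [dict(row) for row in self._rows]


class ToyDerivedSource:
    """Hypothetical derived entry: a named target plus a transform."""

    def __init__(
        self,
        catalogue: Dict[str, Any],
        target: str,
        transform: Callable[..., List[Row]],
        **transform_kwargs: Any,
    ) -> None:
        self._catalogue = catalogue
        self._target = target
        self._transform = transform
        self._kwargs = transform_kwargs

    def read(self) -> List[Row]:
        # Look the target up at read time, then apply the transform to it.
        upstream = self._catalogue[self._target].read()
        return self._transform(upstream, **self._kwargs)


def select_columns(rows: List[Row], columns: List[str]) -> List[Row]:
    """Keep only the named columns of each row."""
    return [{name: row[name] for name in columns} for row in rows]


def drop_missing(rows: List[Row], column: str) -> List[Row]:
    """Data cleaning step: discard rows where `column` is None."""
    return [row for row in rows if row[column] is not None]


# A plain dict plays the role of the catalogue in this sketch.
catalogue: Dict[str, Any] = {}
catalogue["raw"] = ToySource([
    {"city": "Oslo", "temp_c": 4.0, "note": "ok"},
    {"city": "Lima", "temp_c": None, "note": "sensor fault"},
])
# Derived entries sit in the same catalogue as the data they depend on.
catalogue["clean"] = ToyDerivedSource(catalogue, "raw", drop_missing, column="temp_c")
catalogue["temps"] = ToyDerivedSource(
    catalogue, "clean", select_columns, columns=["city", "temp_c"]
)

result = catalogue["temps"].read()
print(result)
```

Because a derived entry exposes the same `read()` method as an ordinary one, it can itself be the target of a further derived entry, which is what lets the cleaning step and the column selection be chained in this toy example.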

Implementation

We built the functionality and released it without much fanfare in Intake 0.6.2. Documentation is here. The main takeaways are:

Exact integration with lumen is still an ongoing conversation, particularly around the question of dynamic parameters (e.g., which columns can be picked, which depends on the current state of the input dataset) and presenting interactive widgets to the user alongside graphical output. We will doubtless converge on a clear separation of concerns. This would allow users, in the future, to experiment with data filters and transforms, and then save their derived dataset into a catalogue to be shared with others.
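The "dynamic parameters" problem can be pictured with a small plain-Python sketch: the set of valid choices for a column-picking transform is only known once the input has been inspected. Nothing here comes from Intake or lumen; `available_columns`, `pick_columns` and the sample rows are hypothetical and only illustrate the shape of the problem, not how either project solves it.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def available_columns(rows: List[Row]) -> List[str]:
    """Columns present in every row of the current input, in first-row order."""
    if not rows:
        return []
    common = set(rows[0])
    for row in rows[1:]:
        common &= set(row)
    return [name for name in rows[0] if name in common]


def pick_columns(rows: List[Row], columns: List[str]) -> List[Row]:
    """Select columns, rejecting any that the current input does not offer."""
    allowed = available_columns(rows)
    unknown = [name for name in columns if name not in allowed]
    if unknown:
        raise ValueError(f"columns not available in the input: {unknown}")
    return [{name: row[name] for name in columns} for row in rows]


current_input = [
    {"station": "A", "temp_c": 11.5, "humidity": 0.61},
    {"station": "B", "temp_c": 9.0, "humidity": 0.70},
]

# The options a GUI widget could offer depend on the data as it is right now.
choices = available_columns(current_input)
selected = pick_columns(current_input, ["station", "temp_c"])
```

If the upstream data later gains or loses a column, `choices` changes with it, which is why such parameters cannot simply be fixed once in a static catalogue entry.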

Summary

What works:

What is planned:

What's missing:

The future

Basically, we're waiting for feedback on this feature. I'll be fascinated to find out what kinds of uses people find for it.