TL;DR: Intake now provides derived datasets with a great deal of flexibility, so you are no longer restricted to ingesting data as provided by a service or set of files. You can encode transforms and data cleaning as part of catalogues without having to write a custom driver each time.
Intake is a Python package for describing, loading and disseminating datasets in catalogues. Newcomers are encouraged to read the main documentation and the quickstart there to understand how Intake can "take the pain out of data access", whether for data users, curators, developers or IT professionals. It is one of my main work areas, so this is not the first blog post on the topic!
Here, we describe a new feature that lets you define new data sources in terms of other data sources. This is a long-requested feature, now available in beta for feedback.
Intake provides catalogues, which are collections of data sources. Catalogues can contain other catalogues to form a hierarchical tree, and you can have several different views on some data, whether it's the same data in different formats or with different loading parameters. All this has been around a while. The data loads into one of several standard container types such as dataframe or array, in memory or lazily, which the data scientist/analyst can then proceed to use with familiar methods, e.g., Pandas calls.
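To make this concrete, a minimal catalogue entry might look like the sketch below (the source name and file path are hypothetical; the `csv` driver and YAML layout follow Intake's documented catalogue format):

```yaml
sources:
  mydata:
    description: Example tabular dataset (hypothetical)
    driver: csv
    args:
      urlpath: "data/example.csv"
```

Opening this catalogue with `intake.open_catalog` and calling `cat.mydata.read()` would then return the data as a Pandas dataframe, one of the standard container types mentioned above.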
The idea of derived datasets has been around in Intake for a while, but we demurred on the basis that Intake was supposed to be an ingest-only package. However, the project lumen, released by my Anaconda colleagues earlier this year, showed how important "views" of a dataset are, at least for the case of visualisations in a dashboard. That project defines transformations (e.g., selecting columns from a dataframe) in a simple way that is reminiscent of Intake's YAML formalism. Note that parameters for visualisations, which can include those in the transform, are meant to be surfaced to the user as interactive GUI widgets.
lumen can pull datasets from Intake specifications, as well as via some custom APIs, for example for streaming datasets (see also: intake-streamz).
So the question becomes: at what point is a view useful only for the purposes of providing a visualisation, and when can that "view" be considered a new dataset in its own right?
We decided that it made sense to have both: that a datasource+transform is indeed another dataset, and one should be able to catalogue it.
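In catalogue terms, that looks roughly like the sketch below, assuming the `DataFrameTransform` driver described in the derived-datasets documentation (the source names, file path and the transform's dotted path are hypothetical):

```yaml
sources:
  input_data:
    driver: csv
    args:
      urlpath: "data/example.csv"
  cleaned:
    driver: intake.source.derived.DataFrameTransform
    args:
      targets:
        - input_data
      transform: "mypackage.cleaning.drop_bad_rows"
      transform_kwargs: {}
```

The derived source names its input via `targets` and applies an importable function to the loaded dataframe, so `cleaned` can be shared and read just like any other catalogue entry.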
We built the functionality and released it without much fanfare in Intake 0.6.2. Documentation is here. The main takeaways are:

- The transform executes when you call `to_dask`, `read` or any other load function on the input dataset; a transform class defines the execution platform for the transform. It is up to the user, however, to ensure that the implied compute resources (e.g., a Dask cluster) are available.
- Exact integration with lumen is still an ongoing conversation, particularly around the question of dynamic parameters (e.g., which columns can be picked, which depends on the current state of the input dataset) and presenting interactive widgets to the user alongside graphical output. We will doubtless converge on a clear separation of concerns. This would allow users, in the future, to experiment with data filters and transforms, and then save their derived dataset into a catalogue to be shared with others.
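Because catalogues are plain YAML, a transform is usually referenced by an importable dotted name rather than a live Python object. The following is a simplified sketch of how such a reference can be resolved with the standard library (`import_name` is a hypothetical helper for illustration, not Intake's actual internal function):

```python
import importlib


def import_name(dotted: str):
    """Resolve a dotted path such as "string.capwords" to the named object.

    A simplified sketch of the kind of lookup performed when a catalogue's
    transform field holds a string; the real helper lives inside Intake
    and handles more cases.
    """
    module_path, _, attr = dotted.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


# Any importable function can be referenced this way:
capwords = import_name("string.capwords")
print(capwords("derived datasets"))  # -> "Derived Datasets"
```

This is why the transform must live in an importable module: anyone reading the catalogue needs that package on their path for the derived dataset to load.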
What works:
What is planned:
What's missing:
Basically, we're waiting for feedback on this feature. I'll be fascinated to find out what kinds of uses people find for it.