The Data Retriever

May 24, 2017 by Ethan White

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores the data in a variety of databases and file formats. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.

We originally built the Data Retriever starting in 2010 with a focus on ecological data. Over time, we realized that the common challenges with finding, downloading, and cleaning up ecological data applied to data in most other fields, so we rebranded and started integrating data from other fields as well.

The Data Retriever is primarily focused on tabular data, but we’re starting work on supporting spatial data as well.

Figure: The Data Retriever automatically installing the BBS (USGS North American Breeding Bird Survey) dataset

Data is often messy and needs cleaning and restructuring before it can be effectively used. It is often not feasible to modify and redistribute the data due to licensing and other limitations (Editor’s note: see our Open Power System Data case study for more on this).

We need to make it as easy as possible for contributors to add new datasets. For relatively clean datasets this means having a simple, easy-to-work-with metadata standard to describe existing data. The description for each dataset is written in a single file that is read by our plugin infrastructure.

To describe the structure of simple data, we originally created a YAML-like^[1] metadata structure. When the Data Package^[2] specs were created by Open Knowledge International, we decided to switch over to using this standard so that others could benefit from the metadata we were creating and so that we could benefit from the standards-based infrastructure being created around the specs.
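As a rough illustration of what such a single-file description can look like, here is a minimal sketch of a Data Package style descriptor built as a Python dictionary. The top-level `name`, `title`, `version`, and `resources` keys and the per-resource `schema` with `fields` follow the Data Package specs; the dataset itself, its file name, and the `field_names` helper are hypothetical and are not taken from an actual Data Retriever script.

```python
import json

# Hypothetical dataset used only for illustration; not a real Data Retriever script.
descriptor = {
    "name": "example-bird-counts",
    "title": "Example bird counts",
    "version": "1.0.0",
    "resources": [
        {
            "name": "counts",
            "path": "counts.csv",
            "schema": {
                "fields": [
                    {"name": "site_id", "type": "integer"},
                    {"name": "species", "type": "string"},
                    {"name": "count", "type": "integer"},
                ]
            },
        }
    ],
}


def field_names(package, resource_name):
    """Return the column names declared for one resource in the descriptor."""
    for resource in package["resources"]:
        if resource["name"] == resource_name:
            return [field["name"] for field in resource["schema"]["fields"]]
    raise KeyError(resource_name)


# Serialise to the JSON form in which a descriptor is normally stored.
descriptor_json = json.dumps(descriptor, indent=2)
print(field_names(descriptor, "counts"))
```

Because the descriptor is plain JSON, any tool that understands the specs can read the same file, which is the benefit of adopting a shared standard over a custom format.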

The transition to the Data Package specification was fairly smooth, as most of the fields we needed were already included in the specs. The only things we needed to add were fields for restructuring poorly formatted data, since the spec assumes the data is well structured to begin with. For example, we use custom fields for describing how to convert wide data to long data.

We first learned about Frictionless Data through the announcement of their funding by the Sloan Foundation. Going forward, we would love to see the Data Package spec expanded to include information about “imperfections” in data. It currently assumes that the person creating the metadata can modify the raw data files to comply with the standard rules of data structure. However, this doesn’t work if someone else is distributing the data, which is a very common use case.

The expansion of the standard would include a way to indicate wide versus long data, with enough information to uniquely describe how to translate from one to the other, as well as information on single tables that are composed from data in many separate files. We have already been adding new fields to the JSON to accomplish some of these things and would be happy to be part of a larger dialog about implementing them more widely. For the wide-data-to-long-data example mentioned above, we use ct_column and ct_names fields and a ct-type type to indicate how to transform the data into a properly normalized form.
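To make the wide-to-long idea concrete, here is a minimal sketch in plain Python. It assumes that `ct_column` names the new column that will hold the former column headers and that `ct_names` lists the wide columns to fold into it; that reading of the two fields, the `wide_to_long` function, and the sample rows are assumptions made for illustration, not the Data Retriever's actual implementation.

```python
def wide_to_long(rows, id_columns, ct_column, ct_names, value_column):
    """Fold the wide columns listed in ct_names into (ct_column, value_column) pairs.

    Assumed semantics (hypothetical): ct_column is the name of the new
    category column, ct_names are the wide columns being folded.
    """
    long_rows = []
    for row in rows:
        for name in ct_names:
            # Copy the identifying columns, then add one category/value pair.
            long_row = {col: row[col] for col in id_columns}
            long_row[ct_column] = name
            long_row[value_column] = row[name]
            long_rows.append(long_row)
    return long_rows


# Hypothetical wide table: one column per species.
wide_rows = [
    {"site": "A", "sparrow": 3, "robin": 1},
    {"site": "B", "sparrow": 0, "robin": 4},
]

long_rows = wide_to_long(
    wide_rows,
    id_columns=["site"],
    ct_column="species",
    ct_names=["sparrow", "robin"],
    value_column="count",
)
for row in long_rows:
    print(row)
```

Each wide row becomes one long row per folded column, so two sites with two species columns yield four rows with `site`, `species`, and `count` columns.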

The other thing we’ve come across is the need to develop a clear specification for semantic versioning of Data Packages. The specification includes an optional version field^[3] for keeping track of changes to the package. This version has a standard structure from semantic versioning in software that includes major, minor, and patch level changes. Unlike in software, there is no clearly established standard for what changes in the different version numbers indicate. Since we work with a lot of different datasets, we’ve been changing a lot of version numbers over the last year; this has led us to open a discussion with the OKFN team about developing a standard to apply to these changes.
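The sketch below shows the mechanics of a major.minor.patch version string and how a written-down policy could drive the bump. Since no established standard exists for data, the mapping in `BUMP_FOR_CHANGE` is purely hypothetical, invented here to illustrate the kind of rule such a standard would need to define; the change names and function names are likewise assumptions.

```python
def parse_version(version):
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


# Hypothetical policy, for illustration only: which kind of change to a
# data package bumps which part of the version number.
BUMP_FOR_CHANGE = {
    "structure_changed": "major",  # e.g. columns renamed or removed
    "data_added": "minor",         # e.g. new rows or a new table
    "metadata_fixed": "patch",     # e.g. a typo corrected in a description
}


def bump(version, change):
    """Return the next version string for the given kind of change."""
    major, minor, patch = parse_version(version)
    level = BUMP_FOR_CHANGE[change]
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(bump("1.2.3", "data_added"))
```

Keeping the policy in one table means the open question is confined to the mapping itself, which is the part a shared standard would settle.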

Our next big step is working on the challenge of simple data integration. One of the major challenges data analysts face after they have cleaned up and prepared individual data sources is combining them. General solutions to the data integration problem (e.g. linked data approaches) have proven too difficult, but we are approaching the problem by tackling a small number of common use cases and involving humans in developing the metadata that describes the linkages between datasets.

The major specification that is available for ecological data is the Ecological Metadata Language (EML). It is an XML-based^[4] spec that includes a lot of information specific to ecological datasets. The nice thing about EML, which is also its challenge, is that it is very comprehensive. This gives it a lot of strength in a linked data context, but also means that it is difficult to drive adoption by users.

The Frictionless Data specifications line up better with our approach to data^[5], which is to complement lightweight computational methods with human contributions to make data easier to work with quickly.

Community contributions to our work are welcome. We work hard to make all of our development efforts open and inclusive (see our Code of Conduct) and love it when new developers, data scientists, and domain specialists contribute. A contribution can be as easy as adding a new dataset by following a set of prompts to create a new JSON file and submitting a PR on GitHub, or even just opening an issue to tell us about a dataset that would be useful to you. So, open an issue, submit a PR, or stop by our Gitter chat channel and say “Hi”. We also participate in Google Summer of Code, which is a great opportunity for students interested in being directly supported to work on the project.

[1] YAML Ain’t Markup Language: https://en.wikipedia.org/wiki/YAML
[2] Data Package: https://specs.frictionlessdata.io/data-package
[3] Data Package version field: /specs/#version
[4] Extensible Markup Language: https://en.wikipedia.org/wiki/XML
[5] Design Philosophy: /specs/#design-philosophy