data.world

An Interview with Bryon Jacob

We’re curious to learn about some of the common issues users face when working with data. In our Case Study series, we are highlighting projects and organisations that are working with the Frictionless Data specifications and tooling in interesting and innovative ways.

How do you use the specs and what advantages did you find in using the Data Package approach?

We deal with a great diversity of data, both in terms of content and in terms of source format. Most people working with data are emailing each other spreadsheets or CSVs, and not formally defining schema or semantics for what’s contained in these data files.

When data.world ingests tabular data, we “virtualize” the tables away from their source format, and build layers of type and semantic information on top of the raw data. This allows us to produce a clean Tabular Data Package [1] for any dataset. Whether the input is CSV files, Excel spreadsheets, JSON data, or SQLite database files (any format that we know how to extract tabular information from), we can present it as cleaned-up CSV data with a datapackage.json that describes the schema and metadata of the contents.

Figure: Tabular Data Package structure on disk
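To make the shape of such a package concrete, here is a minimal sketch in Python that writes a table as CSV alongside a datapackage.json describing its schema. It is not data.world's implementation: the function names, the naive type inference, and the toy "measurements" table are all assumptions made for illustration. Only the descriptor keys (`name`, `profile`, `resources`, `path`, `schema`, `fields`) come from the Tabular Data Package specification.

```python
import csv
import json
import tempfile
from pathlib import Path


def infer_type(values):
    """Pick the narrowest of integer/number/string that fits every value.

    Deliberately naive: a real tool would also handle dates, booleans,
    missing values, and so on.
    """
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


def write_tabular_data_package(out_dir, package_name, tables):
    """Write each table as CSV plus one datapackage.json describing them all.

    `tables` maps a resource name to (header, rows); rows are lists of strings.
    """
    out_dir = Path(out_dir)
    (out_dir / "data").mkdir(parents=True, exist_ok=True)
    resources = []
    for name, (header, rows) in tables.items():
        rel_path = f"data/{name}.csv"
        with open(out_dir / rel_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
        fields = [
            {"name": col, "type": infer_type([row[i] for row in rows])}
            for i, col in enumerate(header)
        ]
        resources.append({
            "name": name,
            "path": rel_path,
            "profile": "tabular-data-resource",
            "schema": {"fields": fields},
        })
    descriptor = {
        "name": package_name,
        "profile": "tabular-data-package",
        "resources": resources,
    }
    (out_dir / "datapackage.json").write_text(
        json.dumps(descriptor, indent=2), encoding="utf-8"
    )
    return descriptor


# Toy, made-up data purely for illustration.
demo_dir = tempfile.mkdtemp()
demo_descriptor = write_tabular_data_package(
    demo_dir,
    "example-package",
    {"measurements": (["site", "count", "mean_depth_m"],
                      [["north", "12", "3.5"], ["south", "7", "4.25"]])},
)
print(json.dumps(demo_descriptor, indent=2))
```

Because `tables` can hold several entries, the same descriptor can list several related CSV files, which is what lets one package act as a container for multiple tables.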

What else would you like to see developed?

Graph data packages, or “Universal Data Packages” that can encapsulate both tabular and graph data. It would be great to be able to present tabular and graph data in the same package and develop tools that know how to use these things together.

To elaborate on this, it makes a lot of sense to normalize tabular data down to clean, well-formed CSVs. For data that is more graph-like, it would also make sense to normalize it to a standard format. RDF [2] is a well-established and standardized format, with many serialized forms that could be used interchangeably (RDF/XML, Turtle, N-Triples, or JSON-LD, for example). The metadata in the datapackage.json would be extremely minimal, since the schema for RDF data is encoded into the data file itself. It might be helpful to use the datapackage.json descriptor to catalog the standard taxonomies and ontologies that were in use. For example, it would be useful to know if a file contained SKOS [3] vocabularies or OWL [4] classes.
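As a rough illustration of what such a catalog might look like, the Python sketch below builds a descriptor entry for a Turtle file and records which well-known vocabularies it appears to use. This is speculative: the `vocabularies` property and the helper functions are hypothetical and not part of any published Data Package specification, and the detection is a simple substring check rather than real RDF parsing. The namespace IRIs are the ones published by the W3C for SKOS and OWL.

```python
import json

# Namespace IRIs published by the W3C for these vocabularies.
KNOWN_VOCABULARIES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def detect_vocabularies(rdf_text):
    """Naively list known vocabularies whose namespace IRI appears in the text."""
    return sorted(
        key for key, iri in KNOWN_VOCABULARIES.items() if iri in rdf_text
    )


def graph_resource(name, path, rdf_text):
    """Build a resource entry for an RDF file serialized as Turtle.

    "vocabularies" is a HYPOTHETICAL property invented for this sketch.
    """
    return {
        "name": name,
        "path": path,
        "format": "ttl",
        "mediatype": "text/turtle",
        "vocabularies": detect_vocabularies(rdf_text),
    }


# A tiny made-up Turtle document that uses SKOS.
turtle = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/> .

ex:fruit a skos:Concept ;
    skos:prefLabel "Fruit"@en .
"""

descriptor = {
    "name": "example-graph-package",
    "resources": [graph_resource("concepts", "data/concepts.ttl", turtle)],
}
print(json.dumps(descriptor, indent=2))
```

Note that the resource carries no `schema` block, reflecting the point that RDF data describes its own structure.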

What are the next things you are going to be working on yourself?

We want to continue to enrich the metadata we include in Tabular Data Packages exported from data.world, and we’re looking into using datapackage.json as an import format as well as an export option.

How do the Frictionless Data specifications compare to existing proprietary and nonproprietary specifications for the kind of data you work with?

data.world works with lots of data across many domains. What’s great about the Frictionless Data specs is that they are a lightweight content standard that can be a starting point for building domain-specific content standards. They really help with the “first mile” of standardizing data and making it interoperable.

Figure: Tabular datasets can be downloaded as Tabular Data Packages

What do you think are some other potential use cases?

In a certain sense, a Tabular Data Package is sort of like an open-source, cross-platform, accessible replacement for spreadsheets that can act as a “binder” for several related tables of data. I could easily imagine web or desktop-based tools that look and function much like a traditional spreadsheet, but use Data Packages as their serialization format.

Who else do you think we should speak to?

Producers of data science IDEs (Integrated Development Environments) such as RStudio, Rodeo (a Python IDE for data), Anaconda, and Jupyter. Anything that operates on Data Frames as a fundamental object type should provide first-class tool and API support for Tabular Data Packages.
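To give a sense of what such support could involve, here is a small Python sketch that reads one resource from a Tabular Data Package into typed columns, using the schema in datapackage.json to cast values. It uses only the standard library. The `load_resource` function, the `CASTS` table, and the toy package it reads are assumptions for illustration, not an existing API.

```python
import csv
import json
import tempfile
from pathlib import Path

# Only three Table Schema types are handled here; anything else stays a string.
CASTS = {"integer": int, "number": float, "string": str}


def load_resource(package_dir, resource_name):
    """Return {column name: list of typed values} for one resource."""
    package_dir = Path(package_dir)
    descriptor = json.loads(
        (package_dir / "datapackage.json").read_text(encoding="utf-8")
    )
    resource = next(
        r for r in descriptor["resources"] if r["name"] == resource_name
    )
    fields = resource["schema"]["fields"]
    columns = {field["name"]: [] for field in fields}
    with open(package_dir / resource["path"], newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for field in fields:
                cast = CASTS.get(field.get("type", "string"), str)
                columns[field["name"]].append(cast(row[field["name"]]))
    return columns


# Build a tiny made-up package on disk so the loader has something to read.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "data").mkdir()
(demo_dir / "data" / "measurements.csv").write_text(
    "site,count,mean_depth_m\nnorth,12,3.5\nsouth,7,4.25\n", encoding="utf-8"
)
(demo_dir / "datapackage.json").write_text(json.dumps({
    "name": "example-package",
    "profile": "tabular-data-package",
    "resources": [{
        "name": "measurements",
        "path": "data/measurements.csv",
        "schema": {"fields": [
            {"name": "site", "type": "string"},
            {"name": "count", "type": "integer"},
            {"name": "mean_depth_m", "type": "number"},
        ]},
    }],
}), encoding="utf-8")

columns = load_resource(demo_dir, "measurements")
print(columns)
```

A dict of equal-length columns like this is a shape most data frame libraries accept directly; with pandas installed, for instance, `pandas.DataFrame(columns)` would turn it into a DataFrame.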

What should the reader do after reading this Case Study?

To read more about Data Package integration at data.world, read our post: Try This: Frictionless data.world. Sign up, and start playing with data.

  1. Tabular Data Package specifications: http://specs.frictionlessdata.io/tabular-data-package/ 

  2. RDF: https://www.w3.org/RDF/ 

  3. SKOS Simple Knowledge Organization System: https://www.w3.org/2004/02/skos/ 

  4. OWL Web Ontology Language: https://www.w3.org/TR/owl-ref/ 


Have a question or comment? Let us know in the forum topic for this case study.