Open Data Quality, Standardization and Why We Need a Schema Catalog
Jailbreak is a French company founded by former employees of Etalab, the national open data agency, and Easter-eggs, GNU/Linux experts since 1997.
# Open Data Quality: The case of France
The issue of open data quality has been a prominent subject of discussion for years. These articles cover it in more depth: Why the Open Definition Matters for Open Data: Quality, Compatibility and Simplicity and, more recently, Open data quality – the next shift in open data?
Since 2017, OpenDataFrance has made it a top priority to help open data producers, mainly local governments, improve the quality of their open data.
Jailbreak joined the team tasked with finding solutions to that problem. We proposed starting with data validation as a way to point out quality problems in datasets, and chose Goodtables as a basis for that. We developed a new UI with adaptations and localization for French users, as well as some custom checks to tackle specific errors that are common in French datasets. This gave birth to a validator tool called Validata.
Like Goodtables, it checks tabular data sources for structural problems, such as blank rows, but it really shines when you give it a schema to validate your data against.
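To make this concrete, here is a minimal sketch of how such a validation can be run with the goodtables Python library (the engine Validata builds on). The file names are placeholders, and the report keys follow the goodtables-py report format at the time of writing, which may vary between versions.

```python
# Minimal sketch: validating a CSV against a Table Schema with goodtables.
# "data.csv" and "schema.json" are placeholder file names.
from goodtables import validate

report = validate("data.csv", schema="schema.json")

print("valid:", report["valid"])
for table in report["tables"]:
    for error in table["errors"]:
        # Each error carries a code (e.g. "blank-row") and a human-readable message.
        print(error["code"], "-", error["message"])
```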
# Schemas as Standards
A schema is a file describing how the data should be formatted. For example, if a column is supposed to contain dates, the schema is where that is specified. This way, the validator can automatically check that every cell in that column contains a date.
Table Schemas are perfect for open data, which is often just tabular data such as CSV or XLSX files. They’re also really easy to write and, if enough people use them, they can become de facto standards for datasets.
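As an illustration, here is a minimal sketch of such a schema, written from Python as the JSON file a validator would read. The field names and constraints are invented for the example, not taken from any published schema.

```python
# An invented, minimal Table Schema for the "dates column" example above:
# a JSON file declaring each column's name, type and constraints.
import json

schema = {
    "fields": [
        {"name": "event", "type": "string", "constraints": {"required": True}},
        {"name": "event_date", "type": "date"},  # cells must parse as ISO dates
    ],
    "missingValues": [""],  # treat empty strings as missing values
}

with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```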
Spearheaded by OpenDataFrance, the French open data community has created 8 common open data schemas as part of the so-called “Socle commun des données locales” (Common Ground of Local Data). These are now the standards for publishing datasets on subjects like grants given to non-profits, decisions voted in local assemblies, or charging stations for electric vehicles.
What we learned with Validata is that the schemas and tools we created to improve open data quality are only as good as their adoption. If only a few producers use a schema to publish their data, nobody else will follow it, and the validator immediately becomes less useful. Quality does not improve if the “standards” are not used. Most importantly, a standard cannot be self-proclaimed.
# Where are all the schemas? We need catalog(s)
A few months ago, Etalab launched schema.data.gouv.fr, an official open data schema catalog specific to France. The idea now is to go to the next level and start a community-run schema catalog that would be both inclusive and international: first to share Table Schemas, but it could also be open to other kinds of schemas, such as Data Package Schemas, or even others.
For schemas to become standards, they must be easy to find and to use. They must be shared. We propose to open a new chapter for Table Schemas with schemas.frictionlessdata.io as the place to catalog them.
Each schema page could link to tools, calling users to appropriate actions; for example, “Validate a file” with Goodtables or Validata, “Create a file” with CSV Good Generator developed by Etalab or tsfaker, or “Download a template” with table-schema-resource-template, etc.
The website for the catalog should have all the features needed, such as full-text search and filtered search (by country, etc.). It should also have an API so that other tools can make use of the catalog; for example, an open data portal could propose schemas when people upload a data package. The ODI has already experimented with this idea in Octopub.
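As a sketch of that idea, the snippet below shows how a portal might query such a catalog at upload time. The registry URL, the response fields and the suggest_schemas helper are all hypothetical, since no such API is published yet.

```python
# Hypothetical sketch of the catalog API idea: the endpoint URL and the
# response fields ("name", "description", "country", "schema_url") are
# assumptions for illustration, not a published API.
import requests

CATALOG_URL = "https://schemas.frictionlessdata.io/registry.json"  # assumed endpoint

def suggest_schemas(keyword, country=None):
    """Return catalog entries matching a keyword, optionally filtered by country."""
    entries = requests.get(CATALOG_URL, timeout=10).json()
    matches = []
    for entry in entries:
        text = "{} {}".format(entry.get("name", ""), entry.get("description", "")).lower()
        if keyword.lower() in text and country in (None, entry.get("country")):
            matches.append(entry)
    return matches

# A portal could call this when a user uploads a dataset about, say, subsidies:
for entry in suggest_schemas("subvention", country="FR"):
    print(entry["name"], "->", entry["schema_url"])
```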
Those are some ideas that need to be expanded. We have to give schemas their chance to shine!
- Situation: Poor quality of open data
- Question: How to improve the quality of open data?
- Problem: Standardization of common datasets
  - Solution: Table Schemas
  - Example: A schema for the names of babies born in a city in a given year.
- Problem: Checking the quality of datasets
  - Solution: Goodtables
  - Example: Validata, an adaptation of Goodtables for French open data.
- Problem: Sharing open data standards
  - Solution: Schema Catalog
  - Example: SCDL, Schema.data.gouv.fr, Schemas.frictionlessdata.io
There’s an ongoing conversation about this project on the Frictionless Data Forum, and it’s open to feedback and contribution.