Data Package version 2.0 is out!
We are very excited to announce the release of version 2.0 of the Data Package standard (previously known as Frictionless Specs). Thanks to the generous support of NLnet, starting from November last year we were able to focus on reviewing Data Package, incorporating features that had been requested over the years and improving extensibility for domain-specific implementations.
Data Package is a standard for data containerisation, which consists of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that enhances data FAIRness (findability, accessibility, interoperability, and reusability). Since its initial release in 2007, the community has suggested many features that could improve or extend the standard for use cases that weren’t initially envisioned. Those were sometimes adopted, but there wasn’t a versioning or governance process in place to truly evolve the standard.
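In practice, a dataset becomes a Data Package by placing a `datapackage.json` descriptor next to its data files. As a rough illustration (the names, paths, and values below are made up for this sketch, not taken from the specification), a minimal descriptor can be produced with nothing more than Python’s standard library:

```python
import json

# A minimal, illustrative datapackage.json descriptor: one package
# bundling a single tabular resource. All names and paths are examples.
descriptor = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "resources": [
        {
            "name": "observations",
            "path": "data/observations.csv",
            "format": "csv",
        }
    ],
}

# Write the descriptor next to the data files it describes.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```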
We started with the issues that had accumulated in the GitHub repository to build our Roadmap for v2. Many of the requested features are now adopted, making Data Package the answer for even more use cases.
In parallel, we assembled an outstanding Data Package Working Group composed of experts from the community. We carefully selected a diverse group of people who brought the different use cases, formats, and data types that we would need the Standard to support. Together with them, we crafted an explicit governance model, in order to create an environment that adequately supports new contributions and ensures the project’s sustainability.
We would like to thank each one of them for their remarkable contributions and for the incredibly insightful conversations we had during these months. Thank you to my colleague Evgeny Karev, Peter Desmet from the Research Institute for Nature and Forest (INBO), Phil Schumm from CTDS - University of Chicago, Kyle Husmann from Penn State University, Keith Hughitt from the National Institutes of Health, Jakob Voß from the Verbundzentrale des GBV (VZG), Ethan Welty from the World Glacier Monitoring Service, Paul Walsh from Link Digital, Pieter Huybrechts from the Research Institute for Nature and Forest (INBO), Martin Durant from Anaconda, Inc., Adam Kariv from The Public Knowledge Workshop, Johan Richer from Multi, and Stephen Diggs from the University of California Digital Library.
If you are curious about the conversations we had during the Standard review, they are all captured (and recorded) in the blog summaries of the community calls. Alternatively, you can check out the closed issues on GitHub.
# So what is new in version 2?
During these months we have been working on the core specifications that compose the Standard, namely: Data Package – a simple container format for describing a coherent collection of data in a single ‘package’, Data Resource to describe and package a single data resource, Table Dialect to describe how tabular data is stored in a file, and Table Schema to declare a schema for tabular data.
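To make the relationship between these four specs concrete, here is a hedged sketch (with invented field names and values) of how they nest inside a single descriptor: the package wraps resources, and a tabular resource can carry its own dialect and schema.

```python
import json

# Sketch of how the four core specs fit together in one descriptor.
# The package wraps a resource; the resource's "dialect" describes how
# the CSV is laid out, and its "schema" declares the table's fields.
# All names and values are illustrative.
descriptor = {
    "name": "glacier-lengths",                 # Data Package
    "resources": [
        {
            "name": "measurements",            # Data Resource
            "path": "measurements.csv",
            "dialect": {"delimiter": ";"},     # Table Dialect
            "schema": {                        # Table Schema
                "fields": [
                    {"name": "glacier_id", "type": "string"},
                    {"name": "year", "type": "integer"},
                    {"name": "length_m", "type": "number"},
                ]
            },
        }
    ],
}

print(json.dumps(descriptor, indent=2))
```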
During the update process, we tried to be as unobtrusive as possible, avoiding breaking changes wherever we could.
We put a lot of effort into removing ambiguity, cutting or clarifying under-defined features, and promoting some well-established recipes into the Standard itself. An example of a recipe (or pattern, as they were called in v1) that has been promoted to the Standard is Missing values per field. We also added a versioning mechanism, support for categorical data, and changes that make it easier to extend the Standard.
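As a rough sketch of how these additions can appear in a Table Schema (the property names and profile URL below reflect our reading of the 2.0 documents and should be checked against the published spec), the descriptor below declares its version via `$schema`, overrides missing values for a single field, and marks another field as categorical:

```python
import json

# Hedged sketch of three version 2 additions in a Table Schema:
# - "$schema" points at a versioned profile (versioning mechanism),
# - a field-level "missingValues" overrides the schema-wide default,
# - "categories" marks a field as categorical.
# Exact property names and the profile URL should be verified against
# the published 2.0 specification.
schema = {
    "$schema": "https://datapackage.org/profiles/2.0/tableschema.json",
    "fields": [
        {
            "name": "temperature",
            "type": "number",
            "missingValues": ["", "-999"],
        },
        {
            "name": "risk_level",
            "type": "string",
            "categories": ["low", "medium", "high"],
        },
    ],
}

print(json.dumps(schema, indent=2))
```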
If you would like to know the details about what has changed, see the Changelog we published.
To increase and facilitate adoption, we published a metadata mapper written in Python. We have also worked on Data Package integrations for the most notable open data portals out there. Many people from the community use Zenodo, so we definitely wanted to target that. They have recently migrated their infrastructure to Invenio RDM, and we proposed a Data Package serializer for better integration with the Standard (more info on this integration will be announced in an upcoming blog!). We also created a pull request that exposes `datapackage.json` as a metadata export target in the Open Science Framework system, and built an extension that adds a `datapackage.json` endpoint to every dataset in CKAN.
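As a purely hypothetical example of what such an endpoint enables (the portal URL and route below are assumptions made for illustration, not the extension’s documented path), a client could fetch a dataset’s descriptor and list its resources with the standard library alone:

```python
import json
from urllib.request import urlopen

# Hypothetical sketch of consuming a dataset's datapackage.json from a
# portal that exposes one. The base URL and route are assumptions made
# for illustration; check the relevant integration's documentation for
# the actual path.
PORTAL = "https://demo.ckan.org"
DATASET = "example-dataset"
url = f"{PORTAL}/dataset/{DATASET}/datapackage.json"

with urlopen(url) as response:
    descriptor = json.load(response)

# List the resources described by the package.
for resource in descriptor.get("resources", []):
    print(resource.get("name"), "->", resource.get("path"))
```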
If you want to know more about how to coordinate a standard update, we shared our main takeaways at FOSDEM 2024. The presentation was recorded, and you can watch it here.
# And what happens now?
While the work on Data Package 2.0 is done (for now!), we will keep working on the Data Package website and documentation together with the Working Group, to make it as clear and straightforward as possible for newcomers. In parallel, we will also start integrating the version 2 changes into the software implementations.
Would you like to contribute? We always welcome new people to the project! Have a look at our Contribution page to understand the general guidelines. Please get in touch with us by joining our community chat on Slack (also accessible via Matrix), or feel free to jump into any of the discussions on GitHub.
# Funding
This project was funded through NGI0 Entrust, a fund established by NLnet with financial support from the European Commission’s Next Generation Internet program. Learn more at the NLnet project page.