Have you ever been looking at a dataset and had no idea what the data values mean? What units are being used? What does that acronym in the first column mean? What is the license for this data?
These are all very common issues that make data hard to understand and use. At Frictionless Data, we work to solve these issues by packaging data with its metadata - aka the description of the data. To help you package your data, we have code in several languages and a browser tool, called Data Package Creator.
Our Reproducible Research Fellows recently learned all about packaging their data by using the Data Package Creator. To help others learn how they too can package their data, the Fellows wrote about packaging their data in blogs that you can read below!
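For a sense of what "packaging data with its metadata" produces, here is a minimal sketch in plain Python that builds a Data Package descriptor (the `datapackage.json` file). It is not the Data Package Creator's own code or any Fellow's dataset: the dataset name, file name, column names, and license below are all made up for illustration. It only shows how the questions above (units, acronyms, license) can be answered inside the package itself.

```python
import json


def build_descriptor():
    """Return an illustrative Data Package descriptor as a plain dict.

    Every name and value here is hypothetical and exists only to show
    where units, acronyms, and license information can live.
    """
    return {
        "name": "river-samples",
        "title": "River water samples (illustrative example)",
        # Answers "What is the license for this data?"
        "licenses": [
            {
                "name": "CC-BY-4.0",
                "title": "Creative Commons Attribution 4.0",
                "path": "https://creativecommons.org/licenses/by/4.0/",
            }
        ],
        "resources": [
            {
                "name": "samples",
                "path": "samples.csv",  # hypothetical CSV file
                "schema": {
                    "fields": [
                        {
                            "name": "site_id",
                            "type": "string",
                            "description": "Sampling site identifier",
                        },
                        {
                            # Answers "What units are being used?"
                            "name": "temp_c",
                            "type": "number",
                            "description": "Water temperature in degrees Celsius",
                        },
                        {
                            # Answers "What does that acronym mean?"
                            "name": "do_mg_l",
                            "type": "number",
                            "description": "Dissolved oxygen (DO) in milligrams per litre",
                        },
                    ]
                },
            }
        ],
    }


if __name__ == "__main__":
    # Write the descriptor next to the data file it describes.
    with open("datapackage.json", "w", encoding="utf-8") as f:
        json.dump(build_descriptor(), f, indent=2)
```

The descriptor is kept as an ordinary dictionary so it can be inspected or extended before it is written to disk.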
“To quality-check the integrity of your data package creation, you must validate it before downloading it for sharing, among many things. The best you can get from that process is “Data package is valid!”. What about before then?”
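As a rough idea of what can be checked before reaching that message, the sketch below runs a few basic structural checks on a descriptor. The function name `check_descriptor` is hypothetical, and these checks are an assumption-laden subset of what a real validator covers; this is not how the Data Package Creator validates.

```python
def check_descriptor(descriptor):
    """Return a list of problems found; an empty list means these basic checks passed.

    Only a few structural rules are covered: a non-empty 'resources' list,
    and each resource having a 'name' plus either 'path' or 'data'.
    """
    problems = []
    resources = descriptor.get("resources")
    if not isinstance(resources, list) or not resources:
        problems.append("descriptor needs a non-empty 'resources' list")
        return problems
    for i, resource in enumerate(resources):
        if not isinstance(resource, dict):
            problems.append(f"resource {i} is not an object")
            continue
        if not resource.get("name"):
            problems.append(f"resource {i} has no 'name'")
        if "path" not in resource and "data" not in resource:
            problems.append(f"resource {i} has neither 'path' nor 'data'")
    return problems


# Usage: a resource that is missing its file location is reported.
print(check_descriptor({"resources": [{"name": "samples"}]}))
```

Returning a list of problems, rather than stopping at the first one, lets you fix everything in one pass before validating again.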
“Follow the #otherpeoplesdata on Twitter and in it you will find a trove of data users trying to make sense of data they did not collect. While the data may be open, having no metadata or information about what variables mean, doesn’t make it very accessible….Without definitions and an explanation of the data, taking the data out of the context of my experiment and adding it to something like a meta-analysis is difficult. Enter Data packages.”
"When I started graduate school, I was shocked to learn that seafood is actually the most internationally traded food commodity in the world….However, for many developing countries being connected to the global seafood market can be a double-edged sword….Over the course of my master’s degree, I developed a passion for studying these issues, which is why I am excited to share with you my experience turning some of the data my collaborators into a packaged dataset using the Open Knowledge Foundation’s Datapackage tool.”
# ¿Cómo empaquetamos datos y por qué es importante organizar la bolsa del supermercado? (How do we package data, and why is it important to organize the grocery bag?) By Sele Yang (Cohort 1)
“Packaging abortion data from OpenStreetMap. This is a post to share the process and steps for creating datapackages. What is this? A datapackage is basically a form of packaging that streamlines the way we share and replicate data. It is like a container of data ready to be transported along the knowledge highway (geeky, right).”
# So you want to get your data package validated? By Katerina Drakoulaki (Cohort 2)
“Have you ever found any kind of dataset, (or been given one by your PI/collaborator) and had no idea what the data were about? During my PhD I’ve had my fair share of not knowing how code works, or how stimuli were supposed to be presented, or how data were supposed to be analysed….The datapackage tool tries to solve one of these issues, more specifically creating packages in which data make sense, and have all the explanations (metadata) necessary to understand and manipulate them.”
“As a machine learning researcher, I am constantly scraping, merging, reshaping, exploring, modeling, and generating data. Because I do most of my data management and analysis in Python, I find it convenient to package my data in Python as well. The screenshots below are a walk-through of basic data package construction in Python.”
# Sharing data from your own scientific publication by Dani Alcalá-López (Cohort 2)
“What better way to start working with open data than by sharing a Data Package from one of my own publications? In this tutorial, I will explain how to use the Frictionless Data tools to share tabular data from a scientific publication openly. This will make easier for anyone to reuse this data.”
“As a library science student with an interest in pursuing data librarianship, learning how to create, manage, and share frictionless data is important. These past few months I’ve been learning about Frictionless Data and how to use Frictionless Data Tools to support reproducible research….To learn how to use the Frictionless Data Tools, I decided to pursue an independent project and am working on creating a comprehensive dataset of OER (open educational resources) health science materials that can be filtered by material type, media format, topic, and more.”
“A few weeks ago I met data packages for the first time and I was intrigued since I had spent too much time in the past wrangling missing and inconsistent values. Packaging data therefore taught me that arranging and preserving data does not have to be tedious anymore. Here, I show how I packaged a bit of my data (unpublished) into a neat json document using the Data Package creator. I am excited to show you just how much I have come from knowing nothing to being able to package and extract the json output.”
# [Data]packaging human rights with the Universal Periodic Review by Anne Lee Steele (Cohort 2)
“All of the records for the Universal Periodic Review have been uploaded online, and are available for the public. However, it’s not likely that the everyday user would be able to make heads or tails of what it actually means….The way I think about it, the Data Package is a way of explaining the categories used within the data itself, in case someone besides an expert is using them. While sections like “Recommendation” and “Recommending State” may be somewhat self-explanatory, I can imagine that this will get way more complicated with purely numerical data.”
# Creating a datapackage for microbial community data (and a phyloseq object) by Kate Bowie (Cohort 2)
“I study bacteria, and lucky for me, bacteria are everywhere….My lab often tries many different ways to handle the mock [bacteria] community, so it’s important that the analysis be documented and reproducible. To address this, I decided to generate a data package using a tool created by the Open Knowledge Foundation. Here is my experience creating a data package of our data, the metadata, and associated software.”
“I am using a data resource from Telangana Open Data…it is an open source data repository commissioned by the state government here in India and basically it archives and stores Weather, Topological, Agriculture and Infrastructure data which then can be used by research students and stakeholders keen to study and make reports in it….CSV files are very versatile, but cannot handle the metadata with all the necessary context. We need to make sure that people can find our data and the information they need to understand our data. That’s where the Data Package comes in!”