FAQ on Publishing Data Packages

April 20, 2016 by Frictionless Data

FAQs and best practice patterns for publishing data packages.

Complete specifications are available at specs/data-package.

# Data Package Name

The Data Package name is used in the name field of the datapackage.json.

This name is also frequently used for the folder/directory in which the Data Package is stored.

As per the Data Package spec, the name SHOULD:

be lower-case
use ‘-’ for word separators
be reasonably concise (3-4 words)
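
The three rules above can be checked mechanically. The following is a minimal sketch, not part of any Frictionless Data tool: the helper name `check_package_name`, the regular expression, and the 4-word limit are assumptions chosen to mirror this FAQ's guidance.

```python
import re

# Hypothetical pattern: lower-case letters and digits, with '-' as the only separator.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def check_package_name(name: str, max_words: int = 4) -> list[str]:
    """Return a list of problems found in a candidate Data Package name."""
    problems = []
    if name != name.lower():
        problems.append("name should be lower-case")
    # Test the lower-cased form so the case rule is not reported twice.
    if not NAME_PATTERN.match(name.lower()):
        problems.append("name should use '-' as the only word separator")
    if len(name.split("-")) > max_words:
        problems.append(f"name should be concise ({max_words} words or fewer)")
    return problems
```

An empty list means the name follows the conventions, for instance for `gdp-us` or `corruption-perceptions-index`.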

Naming conventions

For country specific datasets:

{topic}                  # e.g. gdp
{topic}-{2-letter-iso}   # e.g. gdp-us

For time series data:

[...-]year
[...-]quarter
[...-]month
[...-]day
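
As an illustration of how these conventions combine, here is a small sketch that assembles a name from a topic, an optional country code, and an optional time period. The function `build_name` is hypothetical, and placing the country before the period is an assumption; the conventions above only state that the period comes last.

```python
from typing import Optional

# Time-series suffixes listed in the conventions above.
PERIODS = ("year", "quarter", "month", "day")


def build_name(topic: str, country: Optional[str] = None,
               period: Optional[str] = None) -> str:
    """Compose a name as {topic}[-{country}][-{period}]."""
    parts = [topic.lower()]
    if country:
        parts.append(country.lower())
    if period:
        if period not in PERIODS:
            raise ValueError(f"period must be one of {PERIODS}")
        parts.append(period)
    return "-".join(parts)
```

For example, `build_name("gdp", "US")` gives `gdp-us`.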

# Resource and File Names

As with Data Package names, resource and file names SHOULD:

be lower-case
use ‘-’ for word separators

Resource names SHOULD usually match the name of the associated file on disk, without the file extension, e.g.:

gdp-quarterly     # resource name
gdp-quarterly.csv # on disk

File names follow the same country and time series conventions as Data Package names.
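
A resource name can be derived from the file name rather than typed by hand. This is a minimal sketch using only the Python standard library; the helper name `resource_name` is an assumption, and it strips only the last extension (so `x.csv.gz` would become `x.csv`).

```python
from pathlib import PurePosixPath


def resource_name(file_path: str) -> str:
    """Derive a resource name from a file path by dropping the extension."""
    # .stem removes the directory part and the final suffix.
    return PurePosixPath(file_path).stem.lower()
```

With this helper, `data/gdp-quarterly.csv` yields the resource name `gdp-quarterly`.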

# Descriptor `datapackage.json`

# Alignment

In JSON, data is nested using curly and square brackets. The alignment of these structures does not matter to computer programs, but consistent indentation makes the file much easier for a human to read.

Good alignment:

{
  "name": "corruption-perceptions-index",
  "title": "Corruption Perceptions Index (CPI)",
  "sources": [
    {
      "name": "Transparency International",
      "web": "http://www.transparency.org/research/cpi/overview"
    }
  ],
...
}

Bad alignment:

{
  "name": "corruption-perceptions-index","title": "Corruption Perceptions Index (CPI)",
  "sources":
  [{
    "name": "Transparency International",
    "web": "http://www.transparency.org/research/cpi/overview"}]
    ,
...
}

Please keep your datapackage.json well structured so that the content of your Data Package is easy to understand. The Online DataPackage.json Creator can help you create the general structure.
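
If a descriptor is already badly aligned, it can be re-indented automatically. The sketch below uses Python's standard `json` module; the function name `format_descriptor` and the two-space indent are assumptions chosen to match the good example above.

```python
import json


def format_descriptor(text: str) -> str:
    """Re-serialize descriptor JSON with consistent two-space indentation."""
    descriptor = json.loads(text)
    # Key order is preserved; ensure_ascii=False keeps non-ASCII titles readable.
    return json.dumps(descriptor, indent=2, ensure_ascii=False) + "\n"
```

Reading datapackage.json, passing its text through this function, and writing the result back changes only the whitespace, not the content.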

# Contributors fields

Add the ‘contributors’ field (the original author of the package; see specs/data-package) if you wish to keep the credits for the package.

# Data Package Folder Names and Structure

It is standard practice to use the Data Package name (from the datapackage.json) for the name of the folder/directory in which the Data Package is kept.

If you store the package in, e.g., git (GitHub), this would also be the name of the repository.

If you include scripts that automate the data extraction process, store them in a script folder/directory.
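
The folder-name convention is easy to verify with a script. The following is a minimal sketch under the assumption that datapackage.json sits at the top level of the package folder; `folder_matches_name` is a hypothetical helper, not part of any Frictionless Data library.

```python
import json
from pathlib import Path


def folder_matches_name(package_dir: str) -> bool:
    """Check that the folder is named after the descriptor's 'name' field."""
    root = Path(package_dir).resolve()
    descriptor_path = root / "datapackage.json"
    descriptor = json.loads(descriptor_path.read_text(encoding="utf-8"))
    return root.name == descriptor.get("name")
```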

# README

A README is a text file giving (human-readable) information about your dataset.

Data Packages SHOULD have a README.

# Formatting

The README SHOULD be a plain text file (not Word, rich text, etc.) and SHOULD use markdown to allow for formatting.

# File Name

If markdown is used, the file SHOULD be named README.md; otherwise it SHOULD be named README.txt.
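
A quick way to confirm that a package carries a README under one of the two expected names is sketched below. The helper `find_readme` is hypothetical; it looks only at the top level of the package folder and prefers README.md when both files exist.

```python
from pathlib import Path
from typing import Optional


def find_readme(package_dir: str) -> Optional[Path]:
    """Return the path of README.md or README.txt, or None if neither exists."""
    for candidate in ("README.md", "README.txt"):
        path = Path(package_dir) / candidate
        if path.is_file():
            return path
    return None
```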

# Sections

You can include anything you like in your README. It is standard practice to include some (if possible all) of the following sections: Introduction, Data, Preparation, License.

The README SHOULD NOT include the title of the Data Package at the top.

Each section other than the introduction should be headed with its name as a level 2 heading in markdown, e.g. for the data section you would have the following markdown in your README:

## Data
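
To see which of the recommended sections a README still lacks, the level 2 headings can be collected with a regular expression. This is a rough sketch: `missing_sections` is a hypothetical helper, and it does not try to skip headings that appear inside code blocks.

```python
import re

# Sections that get a level 2 heading; the introduction has none.
EXPECTED_SECTIONS = ("Data", "Preparation", "License")


def missing_sections(readme_text: str) -> list[str]:
    """List the recommended sections that have no '## ' heading in the README."""
    headings = re.finditer(r"^##[ \t]+(.+?)[ \t]*$", readme_text, flags=re.MULTILINE)
    found = {match.group(1) for match in headings}
    return [section for section in EXPECTED_SECTIONS if section not in found]
```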

# Introduction

Start with a short description of the dataset (the first sentence and first paragraph should be extractable to provide short standalone descriptions).

Unlike other sections, this section SHOULD NOT have a heading, as it starts the README (i.e. you do not need the heading ## Introduction).
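
Because the first sentence and first paragraph are meant to be extractable, it can help to test that extraction yourself. The sketch below is only an illustration: `short_descriptions` is a hypothetical helper, it assumes paragraphs are separated by a blank line, and its sentence split is deliberately naive.

```python
import re


def short_descriptions(readme_text: str) -> tuple[str, str]:
    """Return (first sentence, first paragraph) of a README introduction."""
    text = readme_text.strip()
    if text.startswith("#"):
        raise ValueError("README should start with the introduction, not a heading")
    # Collapse line breaks inside the first paragraph into single spaces.
    first_paragraph = " ".join(text.split("\n\n", 1)[0].split())
    # Naive split: the sentence ends at the first '.', '!' or '?' followed by whitespace.
    first_sentence = re.split(r"(?<=[.!?])\s", first_paragraph, maxsplit=1)[0]
    return first_sentence, first_paragraph
```

If either returned string does not make sense on its own, the introduction probably needs rewording.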

# Data

Put specific information about the data in a Data section. This can include the source of the data, the specific structure of the data, missing values, etc.

# Preparation

Put information on preparing the data in a Preparation section. In particular, any instructions about how to run any preparation and processing scripts to generate the data should go here.

# License

Put additional information on the permissions and licensing of the data in the Data Package in the License section.

Since licensing information is often not clear from the data producers, the guideline here is to license the Data Package under the Public Domain Dedication and License, and then to add any relevant information or disclaimers regarding the source data.

See, for example:

# Validate and preview your Data Package

Use the Data Package Creator to check that your datapackage.json and Data Package are good to go. Paste the URL of your datapackage.json file into the input box, or upload the file from a local source, and press Validate. If everything is fine, Status: Valid is returned.

Then use the Online Data Package viewer app to preview your Data Package.
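
Before using the online validator, a quick local pre-check can catch the most basic problems. The following sketch is not a replacement for full validation against the spec: `precheck_descriptor` is a hypothetical helper, and the two checks it makes (a string `name` and a non-empty `resources` list) are a deliberately small subset chosen for illustration.

```python
import json


def precheck_descriptor(text: str) -> list[str]:
    """Return basic problems found in the text of a datapackage.json file."""
    try:
        descriptor = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(descriptor, dict):
        return ["descriptor must be a JSON object"]
    problems = []
    if not isinstance(descriptor.get("name"), str):
        problems.append("missing 'name'")
    resources = descriptor.get("resources")
    if not isinstance(resources, list) or not resources:
        problems.append("missing or empty 'resources'")
    return problems
```

An empty list means only that these basic checks passed; the validator remains the authority on whether the package is valid.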

# Examples

For examples of well-structured Data Packages, see:

For tabular data: http://datahub.io/core/corruption-perceptions-index
For geospatial data: http://datahub.io/core/geo-nuts-administrative-boundaries

Recommended reading: Find out how to use Frictionless Data software to improve your data publishing workflow in our new and comprehensive Frictionless Data Field Guide.