Data Package

A Data Package is a simple way of putting collections of data and their descriptions in one place so that they can be easily shared and used. The Data Package format is simple, web-friendly and extensible.

Creating a Data Package is very easy: all you need to do is put a datapackage.json "descriptor" file in the top-level directory of your set of data files.

Full Spec

Here is a full RFC-style specification of the Data Package format to complement this quick introduction.

Tabular Data

The Tabular Data Package format extends Data Packages for tabular data. It supports providing additional information such as data types of columns.

Software

There is a growing set of online and offline software for working with Data Packages, including tools for creating, viewing and validating them.

Getting Started

A minimal example Data Package would look like this on disk:

datapackage.json

# One or more data files (CSV in this case, but could be any type of data). Data files may go either in a data subdirectory or in the main directory
data
data/more-data.csv

# (Optional!) A README (in markdown format)
README.md

Any number of additional files such as more data files, scripts (for processing or analyzing the data) and other material may be provided but are not required.
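
As a rough illustration of the layout above, the following Python sketch checks whether a directory qualifies as a Data Package by looking for the one required file. The function name is_data_package and the throwaway directory contents are assumptions made for this example, not part of any Data Package tooling.

```python
import tempfile
from pathlib import Path


def is_data_package(directory):
    """Return True if `directory` has a datapackage.json at its top level."""
    return (Path(directory) / "datapackage.json").is_file()


# Demonstrate on a throwaway directory that mirrors the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = is_data_package(root)  # no descriptor yet

    # "{}" is only a placeholder descriptor for this demonstration.
    (root / "datapackage.json").write_text("{}", encoding="utf-8")
    (root / "data").mkdir()
    (root / "data" / "more-data.csv").write_text("a,b\n1,2\n", encoding="utf-8")
    after = is_data_package(root)  # descriptor now present

print(before, after)
```

The check only looks for the descriptor file; it does not validate its contents.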

datapackage.json

The datapackage.json file is the basic building block of a Data Package and is the only required file. It provides:

  • General metadata such as the name of the package, its license, its publisher and source, etc.
  • A "manifest" in the form of a list of the data resources (data files) included in this data package, along with information on those files (e.g. schema)

As its file extension indicates, it must be a JSON file. Here's a very minimal example of a datapackage.json file:

{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "title": "A nice title",
  "licenses" : [ ... ],
  "sources" : [...],
  "resources": [{
    // see below for what a resource descriptor looks like
  }]
}
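
A descriptor like this can also be generated programmatically. The sketch below builds the name, title and resources entries and serializes them as JSON. The helpers slugify and minimal_descriptor, the slug rule (lowercase letters and digits joined by hyphens) and the path data/gdp.csv are assumptions for illustration; licenses and sources are left out because the example above elides them.

```python
import json
import re


def slugify(title):
    """Hypothetical helper: lowercase the title and turn every run of
    non-alphanumeric characters into a single hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def minimal_descriptor(title, resource_paths):
    """Build a minimal descriptor dict with one resource per relative path."""
    return {
        "name": slugify(title),
        "title": title,
        "resources": [{"path": path} for path in resource_paths],
    }


descriptor = minimal_descriptor("World GDP", ["data/gdp.csv"])

# Serialize as it would be written to datapackage.json.
descriptor_text = json.dumps(descriptor, indent=2)
print(descriptor_text)
```

Writing descriptor_text to a file named datapackage.json in the top-level directory completes the package.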

Here is a much more extensive example of a datapackage JSON file:

Note: a complete list of potential attributes and their meaning can be found in the full Data Package spec.

Note: the Data Package format is extensible: publishers may add their own additional metadata as well as constraints on the format and type of data by adding their own attributes to the datapackage.json.

{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors":[{
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    // like contributors
  }],
  "publishers": [{
    // like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  },
  "resources": [
    {
      // ... see below ...
    }
  ],
  // extend your datapackage.json with attributes that are not
  // part of the data package spec
  // we add a views attribute to display Recline Dataset Graph Views
  // in our Data Package Viewer
  "views" : [
    {
      // ... see below ...
    }
  ],
  // you can add your own attributes to a datapackage.json, too
  "my-own-attribute": "data-packages-are-awesome"
}
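
Because publishers may add their own attributes, a consumer sometimes wants to tell them apart from the ones shown in the example. The following sketch does that in Python. The set EXAMPLE_ATTRIBUTES is an assumption built only from the attribute names used in the example above; it is not the complete list from the full Data Package spec, and custom_attributes is a hypothetical helper.

```python
# Attribute names taken from the example above (assumption: not exhaustive).
EXAMPLE_ATTRIBUTES = {
    "name", "datapackage_version", "title", "description", "version",
    "keywords", "licenses", "sources", "contributors", "maintainers",
    "publishers", "dependencies", "resources",
}


def custom_attributes(descriptor):
    """Return the entries of `descriptor` that are not in EXAMPLE_ATTRIBUTES."""
    return {
        key: value
        for key, value in descriptor.items()
        if key not in EXAMPLE_ATTRIBUTES
    }


descriptor = {
    "name": "a-unique-human-readable-and-url-usable-identifier",
    "title": "A nice title",
    "resources": [],
    "views": [],
    "my-own-attribute": "data-packages-are-awesome",
}

extras = custom_attributes(descriptor)
print(sorted(extras))
```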

Resources

You list data files in the resources entry of the datapackage.json.

  {
    // one of url or path should be present
    "path": "relative-path-to-file", // e.g. data/mydata.csv
    "url": "online url" // e.g. http://mysite.org/some-data.csv
  }
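
The rule that one of url or path should be present is easy to check in code. This Python sketch returns whichever location a resource declares and raises an error when neither is set. The helper name resource_location is hypothetical, and preferring path when both keys are present is an assumption made for the example rather than something the text above specifies.

```python
def resource_location(resource):
    """Return (kind, value) for the first of "path" or "url" that is set.

    Assumption: "path" wins when both keys are present.
    """
    for key in ("path", "url"):
        value = resource.get(key)
        if value:
            return key, value
    raise ValueError("resource needs one of 'path' or 'url'")


local = resource_location({"path": "data/mydata.csv"})
remote = resource_location({"url": "http://mysite.org/some-data.csv"})
print(local, remote)
```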

Views

The Data Package Viewer will display a Recline Dataset Graph View when a views entry is provided in the datapackage.json.

  • Include the resourceName property if you have more than one resource and want to display a graph for a resource other than the first resource

  • In the state property

    • the group property is the name of the resource field whose values will be used on the y axis in the bars graph type and the x axis in all other graph types
    • the series property is an array of one or more names of resource fields whose values will be used on the x axis in the bars graph type and the y axis in all other graph types
    • the graphType property may be one of lines-and-points, lines, points, bars, or columns

{
  "id": "graph",
  "label": "Graph",
  "resourceName": "a-resource-name",
  "type": "Graph",
  "state": {
    "group": "a-resource-field-name",
    "series": [
      "another-resource-field-name"
    ],
    "graphType": "lines-and-points"
  }
}
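
The axis rules for group and series can be summarized in a few lines of Python. The sketch below returns the field names plotted on the x and y axes for a given view entry, following the rules listed above. The function axes_for_view is a hypothetical helper written for this illustration; it is not part of the Data Package Viewer or Recline.

```python
GRAPH_TYPES = ("lines-and-points", "lines", "points", "bars", "columns")


def axes_for_view(view):
    """Return (x, y) for a view entry: one axis holds the group field name,
    the other holds the list of series field names."""
    state = view["state"]
    graph_type = state["graphType"]
    if graph_type not in GRAPH_TYPES:
        raise ValueError(f"unknown graphType: {graph_type!r}")
    if graph_type == "bars":
        # bars: series on the x axis, group on the y axis
        return state["series"], state["group"]
    # every other graph type: group on the x axis, series on the y axis
    return state["group"], state["series"]


view = {
    "id": "graph",
    "label": "Graph",
    "type": "Graph",
    "state": {
        "group": "a-resource-field-name",
        "series": ["another-resource-field-name"],
        "graphType": "lines-and-points",
    },
}

x_axis, y_axis = axes_for_view(view)
print(x_axis, y_axis)
```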

Software

There is a growing set of online and offline software for working with Data Packages including tools for creating, viewing, validating, publishing and managing Data Packages. See the Frictionless Data software page for more.

Examples

Many exemplar data packages can be found on DataHub. Specific examples:

World GDP

A Data Package which includes the data locally in the repo (data is CSV).

http://datahub.io/core/gdp

Here's the datapackage.json:

S&P 500 Companies Data

This is an example with more than one resource in the data package.

http://datahub.io/core/s-and-p-500-companies

Here's the datapackage.json:

GeoJSON and TopoJSON

You can see an example of how to package GeoJSON files here.

DataHub does not currently support the TopoJSON format. You can use “Vega Graph Spec” and display your TopoJSON data using the Vega specification. See an example here.
