Creating Data Packages in Python

This tutorial will show you how to install the Python library for working with Data Packages and Table Schema, load a CSV file, infer its schema, and write a Tabular Data Package.

Setup

For this tutorial, we will need the Data Package library, datapackage, which is available on PyPI.

pip install datapackage

Creating basic metadata

You can start using the library by importing datapackage.

import datapackage

The Package class allows you to work with Data Packages. Use it to create a blank Data Package called package like so:

package = datapackage.Package()

You can then add useful metadata by adding keys to the package's descriptor dict attribute. Below, we add the required name key as well as a human-readable title key. For the supported keys, please consult the full Data Package spec. Note that we will create the required resources key further down.

package.descriptor['name'] = 'periodic-table'
package.descriptor['title'] = 'Periodic Table'

To view your descriptor at any time, simply type

package.descriptor
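
Before moving on, it can help to sanity-check the metadata keys. The sketch below uses a plain dict as a stand-in for package.descriptor so it runs without the library. It assumes the spec's convention that name is lower-case and contains only alphanumeric characters plus ".", "_" and "-". The check_basic_metadata helper is hypothetical and is not part of datapackage.

```python
import json
import re

# Assumed rule for `name`: lower-case alphanumerics plus ".", "_" and "-".
NAME_PATTERN = re.compile(r"^[a-z0-9._-]+$")


def check_basic_metadata(descriptor):
    """Return a list of problems found in the name/title keys (hypothetical helper)."""
    problems = []
    name = descriptor.get("name")
    if not name:
        problems.append("missing required key: name")
    elif not NAME_PATTERN.match(name):
        problems.append("name should be lower-case and URL-friendly: %r" % name)
    if "title" in descriptor and not isinstance(descriptor["title"], str):
        problems.append("title should be a string")
    return problems


# A plain dict standing in for package.descriptor.
descriptor = {"name": "periodic-table", "title": "Periodic Table"}
problems = check_basic_metadata(descriptor)

# Pretty-print the descriptor, then list any problems found.
print(json.dumps(descriptor, indent=2))
print(problems)
```

An empty list means both keys look fine; a name such as 'Periodic Table' would be reported because of the capital letters and the space.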

Inferring a CSV Schema

Let's say we have a file called data.csv (download) in our working directory that looks like this:

atomic number,symbol,name,atomic mass,metal or nonmetal?
1,H,Hydrogen,1.00794,nonmetal
2,He,Helium,4.002602,noble gas
3,Li,Lithium,6.941,alkali metal
4,Be,Beryllium,9.012182,alkaline earth metal
5,B,Boron,10.811,metalloid

We can infer our CSV's schema by calling the package's infer method, which relies on the Table Schema library. It checks a small subset of the dataset and records the expected data type for each column. To infer a schema for our dataset and view it, we simply run

package.infer('data.csv')
package.descriptor
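
To build some intuition for what inference does, here is a deliberately simplified sketch using only the standard library. It is not the algorithm the Table Schema library uses: it only tries integer, then number, and falls back to string, and the guess_type and infer_fields helpers are hypothetical names chosen for illustration.

```python
import csv
import io

# The sample rows from data.csv, inlined so the sketch is self-contained.
SAMPLE = """atomic number,symbol,name,atomic mass,metal or nonmetal?
1,H,Hydrogen,1.00794,nonmetal
2,He,Helium,4.002602,noble gas
3,Li,Lithium,6.941,alkali metal
4,Be,Beryllium,9.012182,alkaline earth metal
5,B,Boron,10.811,metalloid
"""


def guess_type(values):
    """Pick the narrowest type that every sampled value can be cast to."""
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_fields(csv_text, sample_size=100):
    """Guess a type per column from the first sample_size data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, sample = rows[0], rows[1:1 + sample_size]
    columns = zip(*sample)  # transpose rows into columns
    return [{"name": name, "type": guess_type(column)}
            for name, column in zip(header, columns)]


fields = infer_fields(SAMPLE)
for field in fields:
    print(field)
```

Because only a sample is inspected, a value further down the file that does not fit the guessed type would go unnoticed, which is worth keeping in mind for large datasets.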

To infer schemas for more than one tabular data resource, pass the glob pattern **/*.csv instead:

package.infer('**/*.csv')
package.descriptor
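
If you are unsure which files a pattern like **/*.csv will pick up, you can preview the matches with Python's own glob module. This sketch assumes the library expands the pattern the same way a recursive glob does; the directory layout and file names below are made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a small, hypothetical directory tree to match against.
    os.makedirs(os.path.join(root, "tables", "archive"))
    for rel in ("data.csv", "tables/isotopes.csv",
                "tables/archive/old.csv", "tables/notes.txt"):
        with open(os.path.join(root, *rel.split("/")), "w") as handle:
            handle.write("placeholder\n")

    # "**" matches zero or more directory levels when recursive=True.
    pattern = os.path.join(root, "**", "*.csv")
    matches = sorted(
        os.path.relpath(path, root).replace(os.sep, "/")
        for path in glob.glob(pattern, recursive=True)
    )

print(matches)
```

Here the CSV files at every depth are matched, while notes.txt is skipped.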

We are now ready to save our datapackage.json file locally. The package.save() method makes this possible.

package.save('datapackage.json')

The datapackage.json (download) is inlined below. Note that atomic number has been correctly inferred as an integer and atomic mass as a number (float) while every other column is a string.

{
  'profile': 'tabular-data-package',
  'resources': [{
    'path': 'data.csv',
    'profile': 'tabular-data-resource',
    'name': 'data',
    'format': 'csv',
    'mediatype': 'text/csv',
    'encoding': 'UTF-8',
    'schema': {
      'fields': [{
          'name': 'atomic number',
          'type': 'integer',
          'format': 'default'
        },
        {
          'name': 'symbol',
          'type': 'string',
          'format': 'default'
        },
        {
          'name': 'name',
          'type': 'string',
          'format': 'default'
        },
        {
          'name': 'atomic mass',
          'type': 'number',
          'format': 'default'
        },
        {
          'name': 'metal or nonmetal?',
          'type': 'string',
          'format': 'default'
        }],
      'missingValues': ['']
    }
  }],
  'name': 'periodic-table',
  'title': 'Periodic Table'
}
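
The listing above is shown as a Python dict, with single quotes; a saved datapackage.json file is plain JSON, so you can read it back with the standard json module and check the inferred types yourself. The sketch below writes an abridged copy of the descriptor to a temporary folder first so that it is self-contained. With a real package you would simply open your own datapackage.json.

```python
import json
import os
import tempfile

# Abridged copy of the descriptor shown above.
descriptor = {
    "profile": "tabular-data-package",
    "name": "periodic-table",
    "title": "Periodic Table",
    "resources": [{
        "path": "data.csv",
        "name": "data",
        "schema": {"fields": [
            {"name": "atomic number", "type": "integer"},
            {"name": "symbol", "type": "string"},
            {"name": "name", "type": "string"},
            {"name": "atomic mass", "type": "number"},
            {"name": "metal or nonmetal?", "type": "string"},
        ]},
    }],
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "datapackage.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(descriptor, handle, indent=2)
    # Read the file back as any other consumer of the package would.
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)

# Map each field name to its inferred type.
fields = loaded["resources"][0]["schema"]["fields"]
field_types = {field["name"]: field["type"] for field in fields}
for name, type_name in field_types.items():
    print(name, "->", type_name)
```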

Publishing

Now that you have created your Data Package, you might want to publish your data online so that you can share it with others.
