Creating Data Packages in Python

July 21, 2016 by Frictionless Data

This tutorial will show you how to install the Python library for working with Data Packages and Table Schema, load a CSV file, infer its schema, and write a Tabular Data Package.

# Setup

For this tutorial, we will need the Data Package library (opens new window) (PyPI (opens new window)) library.

pip install datapackage

# Creating basic metadata

You can start using the library by importing datapackage.

import datapackage

The Package() class allows you to work with data packages. Use it to create a blank datapackage called package like so:

package = datapackage.Package()

You can then add useful metadata by adding keys to metadata dict attribute. Below, we are adding the required name key as well as a human-readable title key. For the keys supported, please consult the full Data Package spec (opens new window). Note, we will be creating the required resources key further down below.

package.descriptor['name'] = 'period-table'
package.descriptor['title'] = 'Periodic Table'

To view your descriptor file at any time, simply type

package.descriptor

# Inferring a CSV Schema

Let’s say we have a file called data.csv (download (opens new window)) in our working directory that looks like this:

atomic number	symbol	name	atomic mass	metal or nonmetal?
1	H	Hydrogen	1.00794	nonmetal
2	He	Helium	4.002602	noble gas
3	Li	Lithium	6.941	alkali metal
4	Be	Beryllium	9.012182	alkaline earth metal
5	B	Boron	10.811	metalloid

We can extrapolate our CSV’s schema by using infer from the Table Schema library. The infer function checks a small subset of your dataset and summarizes expected datatypes against each column, etc. To infer a schema for our dataset and view it, we will simply run

package.infer('periodic-table/data.csv')
package.descriptor

Where there’s need to infer a schema for more than one tabular data resource, use the glob pattern **/*.csv instead to infer a schema:

package.infer('**/*.csv')
package.descriptor

We are now ready to save our datapackage.json file locally. The dp.save() function makes this possible.

dp.save('datapackage.json')

The datapackage.json
(download (opens new window)) is inlined below. Note that atomic number has been correctly inferred as an integer and atomic mass as a number (float) while every other column is a string.

{
  'profile': 'tabular-data-package',
  'resources': [{
    'path': 'data.csv',
    'profile': 'tabular-data-resource',
    'name': 'data',
    'format': 'csv',
    'mediatype': 'text/csv',
    'encoding': 'UTF-8',
    'schema': {
      'fields': [{
          'name': 'atomic number',
          'type': 'integer',
          'format': 'default'
        },
        {
          'name': 'symbol',
          'type': 'string',
          'format': 'default'
        },
        {
          'name': 'name',
          'type': 'string',
          'format': 'default'
        },
        {
          'name': 'atomic mass',
          'type': 'number',
          'format': 'default'
        },
        {
          'name': 'metal or nonmetal?',
          'type': 'string',
          'format': 'default'
        }],
    'missingValues': ['']
    }
  }],
  'name': 'periodic-table',
  'title': 'Periodic Table'
}

# Publishing

Now that you have created your Data Package, you might want to publish your data online so that you can share it with others.