
          Creating Data Packages in Python

          July 21, 2016 by Frictionless Data

          This tutorial will show you how to install the Python library for working with Data Packages and Table Schema, load a CSV file, infer its schema, and write a Tabular Data Package.

          # Setup

          For this tutorial, we will need the Data Package library, available on PyPI as datapackage.

          pip install datapackage
          

          # Creating basic metadata

          You can start using the library by importing datapackage.

          import datapackage
          

          The Package() class allows you to work with data packages. Use it to create a blank datapackage called package like so:

          package = datapackage.Package()
          

          You can then add useful metadata by adding keys to the package's descriptor dict attribute. Below, we add the required name key as well as a human-readable title key. For the full list of supported keys, consult the Data Package spec. Note that we will create the required resources key further down.

          package.descriptor['name'] = 'periodic-table'
          package.descriptor['title'] = 'Periodic Table'
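
The name key is meant to be URL-usable. As a minimal sketch, the check below tests a candidate name before it goes into the descriptor. It assumes the rule from the Data Package spec that a name is lower-case and contains only alphanumeric characters plus ".", "_" and "-". The helper is_valid_package_name is hypothetical and is not part of the datapackage library.

```python
import re

# Assumption: this pattern paraphrases the spec's rule for `name`
# (lower-case alphanumerics plus ".", "_" and "-"). It is an
# illustration, not the validator used by the datapackage library.
NAME_PATTERN = re.compile(r"[a-z0-9._-]+")


def is_valid_package_name(name):
    """Return True if `name` looks like a URL-usable package name."""
    return NAME_PATTERN.fullmatch(name) is not None


# A plain dict standing in for package.descriptor.
descriptor = {"name": "periodic-table", "title": "Periodic Table"}

print(is_valid_package_name(descriptor["name"]))   # True
print(is_valid_package_name(descriptor["title"]))  # False: upper case and a space
```

The title key has no such restriction, which is why the human-readable form goes there.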
          

          To view your descriptor at any time, simply type:

          package.descriptor
          

          # Inferring a CSV Schema

          Let’s say we have a file called data.csv in our working directory that looks like this:

          atomic number,symbol,name,atomic mass,metal or nonmetal?
          1,H,Hydrogen,1.00794,nonmetal
          2,He,Helium,4.002602,noble gas
          3,Li,Lithium,6.941,alkali metal
          4,Be,Beryllium,9.012182,alkaline earth metal
          5,B,Boron,10.811,metalloid

          We can infer our CSV’s schema by using infer, which draws on the Table Schema library. It checks a small subset of the dataset and records the expected data type of each column. To infer a schema for our dataset and view it, we simply run:

          package.infer('data.csv')
          package.descriptor
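
To make the idea of inference concrete, here is a simplified sketch of sampling rows and guessing a type per column, using only the standard library. It is an illustration under stated assumptions, not the algorithm the Table Schema library actually uses: it only distinguishes integer, number and string, and the names guess_type and infer_fields are hypothetical.

```python
import csv
import io
from itertools import islice

SAMPLE = """atomic number,symbol,name,atomic mass,metal or nonmetal?
1,H,Hydrogen,1.00794,nonmetal
2,He,Helium,4.002602,noble gas
3,Li,Lithium,6.941,alkali metal
4,Be,Beryllium,9.012182,alkaline earth metal
5,B,Boron,10.811,metalloid
"""


def guess_type(values):
    """Guess 'integer', 'number' or 'string' for a column of text values."""
    # Try the strictest type first and fall back to looser ones.
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_fields(csv_text, sample_size=100):
    """Read up to `sample_size` rows and guess a type for each column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(islice(reader, sample_size))
    columns = zip(*rows)  # transpose rows into columns
    return [{"name": name, "type": guess_type(column)}
            for name, column in zip(header, columns)]


fields = infer_fields(SAMPLE)
for field in fields:
    print(field["name"], "->", field["type"])
```

Because only a sample is inspected, a value further down the file can contradict the guessed type, so it is worth reviewing an inferred schema before publishing.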
          

          To infer schemas for more than one tabular data resource, pass the glob pattern **/*.csv instead:

          package.infer('**/*.csv')
          package.descriptor
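
To preview which files a pattern like **/*.csv would pick up, the sketch below runs the same pattern through the standard library's pathlib. It assumes pathlib's glob semantics are close enough for a preview and does not reproduce how the datapackage library resolves paths. The helper find_csv_files and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_csv_files(base_dir):
    """Return CSV paths under `base_dir`, relative to it, in sorted order."""
    base = Path(base_dir)
    return sorted(p.relative_to(base).as_posix() for p in base.glob("**/*.csv"))


# Build a throwaway directory tree to run the pattern against.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "periodic-table").mkdir()
    (root / "periodic-table" / "data.csv").write_text("a,b\n1,2\n")
    (root / "other.csv").write_text("x\n1\n")
    (root / "notes.txt").write_text("not tabular\n")
    matches = find_csv_files(root)

print(matches)  # ['other.csv', 'periodic-table/data.csv']
```

The pattern matches CSV files at any depth, including the top-level directory, and skips files with other extensions.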
          

          We are now ready to save our datapackage.json file locally. The package.save() function makes this possible.

          package.save('datapackage.json')
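
Since a descriptor is plain JSON, the saved file can be inspected with the standard json module. The sketch below writes a small stand-in descriptor and reads it back. It does not call the library's save function, and the helpers save_descriptor and load_descriptor are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# A trimmed stand-in for package.descriptor.
descriptor = {
    "profile": "tabular-data-package",
    "name": "periodic-table",
    "title": "Periodic Table",
    "resources": [{"path": "data.csv", "name": "data", "format": "csv"}],
}


def save_descriptor(descriptor, path):
    """Write the descriptor dict to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(descriptor, handle, indent=2)


def load_descriptor(path):
    """Read a descriptor dict back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "datapackage.json"
    save_descriptor(descriptor, target)
    reloaded = load_descriptor(target)

print(reloaded == descriptor)  # True
```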
          

          The datapackage.json is inlined below. Note that atomic number has been correctly inferred as an integer and atomic mass as a number (float) while every other column is a string.

          {
            "profile": "tabular-data-package",
            "resources": [{
              "path": "data.csv",
              "profile": "tabular-data-resource",
              "name": "data",
              "format": "csv",
              "mediatype": "text/csv",
              "encoding": "UTF-8",
              "schema": {
                "fields": [{
                    "name": "atomic number",
                    "type": "integer",
                    "format": "default"
                  },
                  {
                    "name": "symbol",
                    "type": "string",
                    "format": "default"
                  },
                  {
                    "name": "name",
                    "type": "string",
                    "format": "default"
                  },
                  {
                    "name": "atomic mass",
                    "type": "number",
                    "format": "default"
                  },
                  {
                    "name": "metal or nonmetal?",
                    "type": "string",
                    "format": "default"
                  }],
                "missingValues": [""]
              }
            }],
            "name": "periodic-table",
            "title": "Periodic Table"
          }
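
One practical use of the inferred types is converting raw CSV strings into Python values. The sketch below applies the field types from the descriptor above to a single row. It is a simplified illustration that handles only the three types seen here and ignores format and missingValues. The CASTS table and cast_row helper are hypothetical, not the library's API.

```python
# Map the three schema types used in this tutorial to Python casts.
CASTS = {"integer": int, "number": float, "string": str}

# Field names and types copied from the inferred schema.
fields = [
    {"name": "atomic number", "type": "integer"},
    {"name": "symbol", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "atomic mass", "type": "number"},
    {"name": "metal or nonmetal?", "type": "string"},
]


def cast_row(row, fields):
    """Convert a list of CSV strings into a dict of typed values."""
    return {field["name"]: CASTS[field["type"]](value)
            for field, value in zip(fields, row)}


typed = cast_row(["3", "Li", "Lithium", "6.941", "alkali metal"], fields)
print(typed["atomic number"] + 1)  # 4, an int rather than the string "3"
print(typed["atomic mass"])        # 6.941
```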
          

          # Publishing

          Now that you have created your Data Package, you might want to publish your data online so that you can share it with others.
