guides

Using Data Packages in Python

This tutorial will show you how to install the Python libraries for working with Tabular Data Packages and demonstrate a very simple example of loading a Tabular Data Package from the web and pushing it directly into a local SQL database. Short examples of pushing your dataset to Google’s BigQuery and Amazon’s RedShift follow.

Setup

For this tutorial, we will need the main Python Data Package library:

https://github.com/frictionlessdata/datapackage-py

You can install it as follows:

pip install datapackage

Reading Basic Metadata

In this case, we are using an example Tabular Data Package containing the periodic table stored on GitHub (datapackage.json, data.csv). This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the first five rows:

atomic number symbol name atomic mass metal or nonmetal?
1 H Hydrogen 1.00794 nonmetal
2 He Helium 4.002602 noble gas
3 Li Lithium 6.941 alkali metal
4 Be Beryllium 9.012182 alkaline earth metal
5 B Boron 10.811 metalloid

You can start using the library by importing datapackage. Data Packages can be loaded either from a local path or directly from the web.

import datapackage
url = 'https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/periodic-table/datapackage.json'
dp = datapackage.DataPackage(url)

At the most basic level, Data Packages provide a standardized format for general metadata (for example, the dataset title, source, author, and/or description) about your dataset. Now that you have loaded this Data Package, you have access to this metadata using the metadata dict attribute. Note that these fields are optional and may not be specified for all Data Packages. For more information on which fields are supported, see the full Data Package standard.

print(dp.descriptor['title'])
> "Periodic Table" 

Reading Data

Now that you have loaded your Data Package, you can read its data. A Data Package can contain multiple files which are accessible via the resources attribute. The resources attribute is an array of objects containing information (e.g. path, schema, description) about each file in the package.

You can access the data in a given resource in the resources array by reading the data attribute. For example, using our our Periodic Table Data Package, we can return all elements with an atomic number of less than 10 by doing the following:

print([e['name'] for e in dp.resources[0].data if int(e['atomic number']) < 10])

> ['Hydrogen', 'Helium', 'Lithium', 'Beryllium', 'Boron', 'Carbon', 'Nitrogen', 'Oxygen', 'Fluorine']

If you don’t want to load all data in memory at once, you can lazily access the data using the iter() method on the resource:

rows = dp.resources[0].iter()
rows.next()

> {'metal or nonmetal?': 'nonmetal', 'symbol': 'H', 'name': 'Hydrogen', 'atomic mass': '1.00794', 'atomic number': '1'}

rows.next()

> {'metal or nonmetal?': 'noble gas', 'symbol': 'He', 'name': 'Helium', 'atomic mass': '4.002602', 'atomic number': '2'}

rows.next()

> {'metal or nonmetal?': 'alkali metal', 'symbol': 'Li', 'name': 'Lithium', 'atomic mass': '6.941', 'atomic number': '3'}

Loading into an SQL database

Tabular Data Packages contains schema information about its data using JSON Table Schema. This means you can easily import your Data Package into the SQL backend of your choice. In this case, we are creating an SQLite database in a new file named datapackage.db.

To load the data into SQL we will need the JSON Table Schema SQL Storage library:

https://github.com/frictionlessdata/jsontableschema-sql-py

You can install it by doing:

pip install jsontableschema-sql

Now you can load your data as follows:

# create the database connection (using SQLAlchemy)
from sqlalchemy import create_engine
engine = create_engine('sqlite:///periodic-table-datapackage.db')

# now push the data to the database
from datapackage import push_datapackage
push_datapackage(descriptor=url,backend='sql',engine=engine)

If you have sqlite3 installed, you can inspect and play with your newly created database. Note that column type information has been translated from the JSON Table Schema format to native SQLite types:

$ sqlite3 periodic-table-datapackage.db 
SQLite version 3.8.10.2 2015-05-20 18:17:19
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE ___data___data (
	"atomic number" INTEGER, 
	symbol TEXT, 
	name TEXT, 
	"atomic mass" FLOAT, 
	"metal or nonmetal?" TEXT
);

Loading into BigQuery

Loading into BigQuery requires some setup on Google’s infrastructure, but once that is completed, loading data can be just as frictionless. Here are the steps to follow:

  1. Create a new project - link
  2. Create a new service account key - link
  3. Download credentials as JSON and save as .credentials.json
  4. Create dataset for your project - link (e.g. “dataset”)

To load the data into BigQuery using Python, we will need the JSON Table Schema BigQuery Storage library:

https://github.com/frictionlessdata/jsontableschema-bigquery-py

You can install it as follows:

pip install jsontableschema-bigquery

The code snippet below should be enough to push your dataset into the cloud!

import io
import os
import json
from apiclient.discovery import build
from oauth2client.client import GoogleCredentials
from jsontableschema_bigquery import Storage

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '.credentials.json'
credentials = GoogleCredentials.get_application_default()
service = build('bigquery', 'v2', credentials=credentials)
project = json.load(io.open('.credentials.json', encoding='utf-8'))['project_id']
push_datapackage(descriptor=url,backend='bigquery',project=project,service=service,
    dataset='dataset')

If everything is in place, you should now be able to inspect your dataset on BigQuery.

BigQuery Schema

BigQuery Preview

Loading into Amazon RedShift

Similar to Google’s BigQuery, Amazon RedShift requires some setup on AWS. Once you’ve created your cluster, however, all you need to do is use your cluster endpoint to create a connection string for SQLAlchemy.

Note: using the sqlalchemy-redshift dialect is optional as the postgres:// dialect is sufficient to load your table into AWS RedShift.

AWS RedShift

# create the database connection (using SQLAlchemy)
REDSHIFT_URL = 'postgres://<user>:<pass>@<host>.redshift.amazonaws.com:5439/<database>'
from sqlalchemy import create_engine
engine = create_engine(REDSHIFT_URL)

# now push the data to the database
from datapackage import push_datapackage
push_datapackage(descriptor=url,backend='sql',engine=engine)