Joining Tabular Data
In a separate guide, I walked through joining a tabular dataset with one containing geographic information. In this guide, I will demonstrate an example of joining two tabular datasets.
There are, of course, various, more robust ways of joining tabular data. The example listed below is intended to demonstrate how the current libraries and specifications work together to perform this common task.
# Data
In this case, we want to join a dataset containing the nominal Gross Domestic Product(GDP) per country per year with the Consumer Price Index (CPI) per country per year. By adjusting a given GDP measure by the CPI, a measure of inflation, one can yield the real GDP, a measure of economic output adjusted for price changes over time. To do that, of course, we need to join these independent datasets on the common values “Country Code” and “Year”.
# GDP
Country Name | Country Code | Year | Value |
---|---|---|---|
Afghanistan | AFG | 2004 | 5285461999.33739 |
Afghanistan | AFG | 2005 | 6275076016.47174 |
Afghanistan | AFG | 2006 | 7057598406.61553 |
Afghanistan | AFG | 2007 | 9843842455.48323 |
Afghanistan | AFG | 2008 | 10190529882.4878 |
# CPI
Country Name | Country Code | Year | CPI |
---|---|---|---|
Afghanistan | AFG | 2004 | 63.1318927309 |
Afghanistan | AFG | 2005 | 71.1409742918 |
Afghanistan | AFG | 2006 | 76.3021776777 |
Afghanistan | AFG | 2007 | 82.7748069188 |
Afghanistan | AFG | 2008 | 108.0666000101 |
# Loading the Data
As usual, the first step is to load the Data Packages library datapackage
. We also need to import DictWriter
to write our merged rows to a new CSV.
import datapackage
from csv import DictWriter
cpi_dp = datapackage.DataPackage('https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/cpi/datapackage.json')
gdp_dp = datapackage.DataPackage('https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/gross-domestic-product-all/datapackage.json')
Given that our source data has already been packaged in Tabular Data Package (opens new window) format, we know that we have a schema (opens new window) for each CSV which specifies useful information for each column. We’d like to merge and preserve this schema information as we’ll need it for specifying the combined schema in our new Data Package. Note that we’re also adding a new derived column named ‘Real GDP’ and giving it a type of number
.
field_info = []
field_info.extend(cpi_dp.resources[0].descriptor['schema']['fields'])
field_info.extend(gdp_dp.resources[0].descriptor['schema']['fields'])
field_info.append({'name': 'Real GDP', 'type': 'number'})
Now that we have this information, we can generate a fieldnames
array containing only the names of the columns to eventually pass to DictWriter
when we’re ready write out our new CSV.
fieldnames = [f['name'] for f in field_info]
# Joining the Data
What follows is a fairly simple example of iterating through each row of each CSV and creating a new merged_row
when ‘Year’ and ‘Country Code’ match on the two datasets. We are also calculating our derived ‘Real GDP’ column based in the information found in the original columns.
with open('real_gdp.csv', 'w') as csvfile:
writer = DictWriter(csvfile,fieldnames=fieldnames)
writer.writeheader()
for gdp_row in gdp_dp.resources[0].data:
for cpi_row in cpi_dp.resources[0].data:
if gdp_row['Year'] == cpi_row['Year'] and gdp_row['Country Code'] == cpi_row['Country Code']:
merged_row = gdp_row.copy()
merged_row.update(cpi_row)
merged_row.update({'Real GDP': 100*(float(gdp_row['Value'])/float(cpi_row['CPI']))})
writer.writerow(merged_row)
# Creating a New Data Package
Now that we’ve created our new CSV real_gdp.csv
, we can use the Data Package library to package it up with some useful metadata. Note that we are passing the merged field_info
array into our schema
definition. Given that we are generating this Data Package “by hand”, we need to run the validate
method on the new Data Package object to make sure that we are, indeed, creating a valid Data Package. After validating the Data Package metadata, we can either write the Data Package directly or save the whole thing as a zip file using the save
method.
dp = datapackage.Package()
dp.descriptor['name'] = 'real-gdp'
{
'name': 'data',
'path': 'real_gdp.csv',
'format': 'csv',
'schema': {
'fields': field_info
}
}
]
resource = dp.resources[0]
resource.descriptor['path'] = 'real_gdp.csv'
dp.validate()
with open('datapackage.json', 'w') as f:
f.write(dp.to_json())
# dp.save("real_gdp.zip")