Joining Tabular Data
In a separate guide, I walked through joining a tabular dataset with one containing geographic information. In this guide, I will demonstrate an example of joining two tabular datasets.
Note: There are, of course, various, more robust ways of joining tabular data. The example listed below is intended to demonstrate how the current libraries and specifications work together to perform this common task.
In this case, we want to join a dataset containing the nominal Gross Domestic Product (GDP) per country per year with the Consumer Price Index (CPI) per country per year. By adjusting a given GDP measure by the CPI, a measure of inflation, one can yield the real GDP, a measure of economic output adjusted for price changes over time. To do that, of course, we need to join these independent datasets on the common values “Country Code” and “Year”.
|Country Name||Country Code||Year||Value|
|Country Name||Country Code||Year||CPI|
Loading the Data
As usual, the first step is to load the Data Packages library
datapackage. We also need to import
DictWriter to write our
merged rows to a new CSV.
Given that our source data has already been packaged in
Tabular Data Package format, we know
that we have a schema for each CSV
which specifies useful information for each column. We’d like to
merge and preserve this schema information as we’ll need it for
specifying the combined schema in our new Data Package. Note that
we’re also adding a new derived column named ‘Real GDP’ and giving it
a type of
Now that we have this information, we can generate a
array containing only the names of the columns to eventually pass to
DictWriter when we’re ready write out our new CSV.
Joining the Data
What follows is a fairly simple example of iterating through each row
of each CSV and creating a new
merged_row when ‘Year’ and ‘Country
Code’ match on the two datasets. We are also calculating our derived
‘Real GDP’ column based in the information found in the original
Creating a New Data Package
Now that we’ve created our new CSV
real_gdp.csv, we can use the Data
Package library to package it up with some useful metadata. Note that
we are passing the merged
field_info array into our
definition. Given that we are generating this Data Package “by hand”,
we need to run the
validate method on the new Data Package object to
make sure that we are, indeed, creating a valid Data Package. After
validating the Data Package metadata, we can either write the Data
Package directly or save the whole thing as a zip file using the