        Joining Tabular Data

        April 5, 2018 by Frictionless Data

        In a separate guide, I walked through joining a tabular dataset with one containing geographic information. In this guide, I will demonstrate an example of joining two tabular datasets.

There are, of course, more robust ways of joining tabular data. The example below is intended to demonstrate how the current libraries and specifications work together to perform this common task.

        # Data

In this case, we want to join a dataset containing the nominal Gross Domestic Product (GDP) per country per year with the Consumer Price Index (CPI) per country per year. By adjusting a given GDP measure by the CPI, a measure of inflation, one can derive the real GDP, a measure of economic output adjusted for price changes over time. To do that, of course, we need to join these independent datasets on the common values “Country Code” and “Year”.

        # GDP

| Country Name | Country Code | Year | Value |
| --- | --- | --- | --- |
| Afghanistan | AFG | 2004 | 5285461999.33739 |
| Afghanistan | AFG | 2005 | 6275076016.47174 |
| Afghanistan | AFG | 2006 | 7057598406.61553 |
| Afghanistan | AFG | 2007 | 9843842455.48323 |
| Afghanistan | AFG | 2008 | 10190529882.4878 |

        # CPI

| Country Name | Country Code | Year | CPI |
| --- | --- | --- | --- |
| Afghanistan | AFG | 2004 | 63.1318927309 |
| Afghanistan | AFG | 2005 | 71.1409742918 |
| Afghanistan | AFG | 2006 | 76.3021776777 |
| Afghanistan | AFG | 2007 | 82.7748069188 |
| Afghanistan | AFG | 2008 | 108.0666000101 |
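
To make the adjustment concrete, here is a small worked example (a sketch, using the Afghanistan 2004 values from the tables above) of the calculation we will perform for every matched row later in this guide:

# Worked example of the real-GDP formula used in the join below:
# real GDP = 100 * nominal GDP / CPI
nominal_gdp = 5285461999.33739  # GDP in current prices (from the GDP table)
cpi = 63.1318927309             # Consumer Price Index (from the CPI table)
real_gdp = 100 * nominal_gdp / cpi
print(real_gdp)                 # roughly 8.37e9: output expressed in base-period prices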

        # Loading the Data

        As usual, the first step is to load the Data Packages library datapackage. We also need to import DictWriter to write our merged rows to a new CSV.

import datapackage
from csv import DictWriter

# Load the CPI and GDP Tabular Data Packages from their published descriptors
cpi_dp = datapackage.DataPackage('https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/cpi/datapackage.json')
gdp_dp = datapackage.DataPackage('https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/gross-domestic-product-all/datapackage.json')
        

Given that our source data has already been packaged in Tabular Data Package format, we know that we have a schema for each CSV which specifies useful information for each column. We’d like to merge and preserve this schema information as we’ll need it for specifying the combined schema in our new Data Package. Note that we’re also adding a new derived column named ‘Real GDP’ and giving it a type of number.

# Merge the field descriptors from both source schemas, then add the derived column
field_info = []
field_info.extend(cpi_dp.resources[0].descriptor['schema']['fields'])
field_info.extend(gdp_dp.resources[0].descriptor['schema']['fields'])
field_info.append({'name': 'Real GDP', 'type': 'number'})
        

Now that we have this information, we can generate a fieldnames list containing only the names of the columns to eventually pass to DictWriter when we’re ready to write out our new CSV.

        fieldnames = [f['name'] for f in field_info]
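
For reference, the resulting list looks roughly like the sketch below (the exact order depends on the schemas, and fieldnames_example is just an illustrative name). Because both source schemas describe the shared key columns, those names appear once per source:

fieldnames_example = [
    'Country Name', 'Country Code', 'Year', 'CPI',    # from the CPI schema
    'Country Name', 'Country Code', 'Year', 'Value',  # from the GDP schema
    'Real GDP',                                        # the derived column
]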
        

        # Joining the Data

What follows is a fairly simple example of iterating through each row of each CSV and creating a new merged_row when ‘Year’ and ‘Country Code’ match across the two datasets. We also calculate our derived ‘Real GDP’ column based on the information found in the original columns.

with open('real_gdp.csv', 'w') as csvfile:
    writer = DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Nested-loop join: for each GDP row, scan the CPI rows for a matching key
    for gdp_row in gdp_dp.resources[0].data:
        for cpi_row in cpi_dp.resources[0].data:
            if gdp_row['Year'] == cpi_row['Year'] and gdp_row['Country Code'] == cpi_row['Country Code']:
                merged_row = gdp_row.copy()
                merged_row.update(cpi_row)
                # Real GDP = 100 * nominal GDP / CPI
                merged_row.update({'Real GDP': 100*(float(gdp_row['Value'])/float(cpi_row['CPI']))})
                writer.writerow(merged_row)
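
The nested loop above rescans the full CPI table for every GDP row. One of the more robust approaches mentioned at the start of this guide is to index one dataset by its key columns first and then look each row up directly; here is a sketch of that variant, using only the calls introduced above:

# Alternative join (a sketch): build an index of CPI rows keyed by
# (Country Code, Year), then look each GDP row up in that index.
cpi_index = {
    (row['Country Code'], row['Year']): row
    for row in cpi_dp.resources[0].data
}

with open('real_gdp.csv', 'w') as csvfile:
    writer = DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for gdp_row in gdp_dp.resources[0].data:
        cpi_row = cpi_index.get((gdp_row['Country Code'], gdp_row['Year']))
        if cpi_row is None:
            continue  # no CPI observation for this country/year
        merged_row = gdp_row.copy()
        merged_row.update(cpi_row)
        merged_row['Real GDP'] = 100 * float(gdp_row['Value']) / float(cpi_row['CPI'])
        writer.writerow(merged_row)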
        

        # Creating a New Data Package

Now that we’ve created our new CSV real_gdp.csv, we can use the Data Package library to package it up with some useful metadata. Note that we are passing the merged field_info list into our schema definition. Given that we are generating this Data Package “by hand”, we need to run the validate method on the new Data Package object to make sure that we are, indeed, creating a valid Data Package. After validating the Data Package metadata, we can either write out the datapackage.json descriptor directly or save the whole thing as a zip file using the save method.

# Build the new Data Package descriptor by hand, reusing the merged field_info
# for the resource schema
dp = datapackage.Package()
dp.descriptor['name'] = 'real-gdp'
dp.descriptor['resources'] = [
    {
        'name': 'data',
        'path': 'real_gdp.csv',
        'format': 'csv',
        'schema': {
            'fields': field_info
        }
    }
]
resource = dp.resources[0]
resource.descriptor['path'] = 'real_gdp.csv'

# Make sure the descriptor we assembled is a valid Data Package
dp.validate()

# Write the descriptor next to real_gdp.csv...
with open('datapackage.json', 'w') as f:
    f.write(dp.to_json())

# ...or save the descriptor and data together as a zip file
# dp.save("real_gdp.zip")
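
As a quick sanity check, we can load the freshly written descriptor back in and read a joined row, reusing the same loading pattern from the start of this guide:

# Round-trip check (a sketch): load the package we just wrote
# and inspect the first joined row
check_dp = datapackage.DataPackage('datapackage.json')
print(check_dp.resources[0].data[0])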
        