          Data Management for TEDDINET

December 19, 2017 by Julian Padget (DM4T), Dan Fowler (OKI), Evgeny Karev (OKI), Paul Walsh (OKI), Jo Barratt (OKI)
Price icons created by Pixel perfect - Flaticon

          # DM4T Pilot

          # Pilot Name

          Data Management for TEDDINET (DM4T)

          # Authors

          Julian Padget (DM4T), Dan Fowler (OKI), Evgeny Karev (OKI)

          # Field

          Energy Data

          # FD Tech Involved

• Frictionless Data specs: http://specs.frictionlessdata.io/
• Data Package Pipelines: https://github.com/frictionlessdata/datapackage-pipelines
• Goodtables: https://github.com/frictionlessdata/goodtables-py

Note: Packagist has now moved to create.frictionlessdata.io

          # Context

          # Problem We Were Trying To Solve

Open Knowledge International and the Data Management for TEDDINET project (DM4T) agreed to work together on a proof-of-concept pilot applying Frictionless Data approaches to some of the data legacy issues facing TEDDINET, a research network addressing the challenges of transforming energy demand in our buildings as a key component of the transition to an affordable, low-carbon energy system. The problem, as described on the DM4T website:

The Engineering and Physical Sciences Research Council (EPSRC), the UK’s main agency for funding research in engineering and the physical sciences, funded 22 projects over two calls in 2010 and 2012 to investigate ‘Transforming Energy Demand through Digital Innovation’ (TEDDI) as a means to find out how people use energy in homes and what can be done to reduce energy consumption. A lot of data is being collected at different levels of detail in a variety of housing throughout the UK, but the levels of detail are largely defined by the needs of each individual project. At the same time, the Research Councils UK (RCUK) are defining guidelines for what happens to data generated by projects they fund which require researchers to take concrete actions to store, preserve, and document their data for future reference.

          The problem, however, is that there is relatively little awareness, limited experience and only emerging practice of how to incorporate data management into much of physical science research. This is in contrast to established procedures for data formats and sharing in the biosciences, stemming from international collaboration on the Human Genome Project, and in the social sciences, where data from national surveys, including census data, have been centrally archived for many years. Consequently, current solutions may be able to meet a minimal interpretation of the requirements, but not effectively deliver the desired data legacy.

The DM4T group selected three suitable datasets on which to base this work and provided domain knowledge to ensure the pilot was applicable to real use cases.

Output was tracked here: https://github.com/frictionlessdata/pilot-dm4t/issues

          # The work

We will use the refit-cleaned dataset to showcase the capabilities of the Frictionless Data specs and software. For this work we limited the size of the dataset to keep the showcase reasonably quick. The Frictionless Data software is designed to scale well, so the process could be reproduced for the whole dataset; it is worth noting, however, that processing speed for such large datasets could be a bottleneck in research work.

          # REFIT: Electrical Load Measurements (Cleaned)

Link to the dataset: https://github.com/frictionlessdata/pilot-dm4t/tree/delivery/datasets/refit-cleaned

          For each house in the study, this dataset consists of granular readings of electrical load. There were 20 houses in total, and each house had a different mix of devices plugged into the electrical load sensor. The dataset was distributed as a zipped file (~500MB) containing 20 CSVs with a combined ~120 million rows.

          Time,Unix,Aggregate,Appliance1,Appliance2,Appliance3,Appliance4,Appliance5,Appliance6,Appliance7,Appliance8,Appliance9
          2013-10-09 13:06:17,1381323977,523,74,0,69,0,0,0,0,0,1
          2013-10-09 13:06:31,1381323991,526,75,0,69,0,0,0,0,0,1
          2013-10-09 13:06:46,1381324006,540,74,0,68,0,0,0,0,0,1
          2013-10-09 13:07:01,1381324021,532,74,0,68,0,0,0,0,0,1
          2013-10-09 13:07:15,1381324035,540,74,0,69,0,0,0,0,0,1
          2013-10-09 13:07:18,1381324038,539,74,0,69,0,0,0,0,0,1
          2013-10-09 13:07:30,1381324050,537,74,0,69,0,0,0,0,0,1
          2013-10-09 13:07:32,1381324052,537,74,0,69,0,0,0,0,0,1
          2013-10-09 13:07:44,1381324064,548,74,0,69,0,0,0,0,0,1
          

Given that these datasets were already provided in well-structured CSV files, it was straightforward to translate the data dictionary found in the dataset’s README into the relevant fields in the datapackage.json. We did not need to alter the CSVs that comprise the dataset.
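
As a side note, this mapping can also be bootstrapped programmatically with the datapackage Python library, which is used later in this pilot. The following is only a minimal sketch with illustrative paths: it infers a descriptor skeleton from the raw CSVs, leaving the README descriptions to be merged in by hand.

import json
from datapackage import Package

# Infer resources and Table Schemas directly from the raw CSVs (illustrative glob).
package = Package()
package.infer('datasets/refit-cleaned/*.csv')

# Each inferred resource now carries field names and types; README descriptions
# can be added to the matching fields afterwards.
for resource in package.resources:
    print(resource.name, resource.schema.field_names)

# Save the inferred descriptor for further editing.
with open('datapackage.json', 'w') as f:
    json.dump(package.descriptor, f, indent=2)
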

          # Creating a data package using Datapackage Pipelines

Link to the Datapackage Pipelines project: https://github.com/frictionlessdata/datapackage-pipelines

          Datapackage Pipelines is a framework for declarative stream-processing of tabular data. It is built upon the concepts and tooling of the Frictionless Data project. The basic concept in this framework is the pipeline. A pipeline has a list of processing steps, and it generates a single data package as its output. Pipelines are defined in a declarative way, not in code. One or more pipelines can be defined in a pipeline-spec.yaml file. This file specifies the list of processors (referenced by name) and their execution parameters.

One of the main purposes of the Frictionless Data project is data containerization: instead of having two separate sources of knowledge about the data (the data files and a textual README), we put both into a single container based on the Data Package specification. This allows us to:

          • Ensure that the dataset description is shipped with the data files
          • Provide column data type information to allow type validation
          • Use the Frictionless Data tooling for reading and validating datasets
          • Enable usage of other software which supports Frictionless Data specifications

First, we used the datapackage-pipelines library to create a data package from the raw dataset. We need a declarative file called pipeline-spec.yaml to describe the data transformation steps:

pipeline-spec.yaml

          refit-cleaned:
            pipeline:
              - run: add_metadata
                parameters:
                  name: refit-electrical-load-measurements
                  title: 'REFIT: Electrical Load Measurements'
                  license: CC-BY-4.0
                  description: Collection of this dataset was supported by the Engineering and Physical Sciences Research Council (EPSRC) via the project entitled Personalised Retrofit Decision Support Tools for UK Homes using Smart Home Technology (REFIT), which is a collaboration among the Universities of Strathclyde, Loughborough and East Anglia. The dataset includes data from 20 households from the Loughborough area over the period 2013 - 2015. Additional information about REFIT is available from www.refitsmarthomes.org.
                  sources:
                    -
                      title: 'REFIT: Electrical Load Measurements (Cleaned)'
                      web: 'https://pure.strath.ac.uk/portal/en/datasets/refit-electrical-load-measurements-cleaned(9ab14b0e-19ac-4279-938f-27f643078cec).html'
                      email: [email protected]
              - run: add_resource
                parameters:
                  name: 'house-1'
                  url: 'datasets/refit-cleaned/House_1.csv'
                  format: csv
              # Other resources are omitted
              - run: stream_remote_resources
              - run: set_types
                parameters:
                  resources: "house-[0-9]{1,2}"
                  types:
                    "Time":
                      type: datetime
                      format: "fmt:%Y-%m-%d %H:%M:%S"
                    Unix:
                      type: integer
                    Aggregate:
                      type: integer
                    "Appliance[1-9]":
                      type: integer
              - run: processors.modify_descriptions
                parameters:
                  resources: house-1
                  descriptions:
                    Appliance1:
                      description: Fridge
                    Appliance2:
                      description: Chest Freezer
                    Appliance3:
                      description: Upright Freezer
                    Appliance4:
                      description: Tumble Dryer
                    Appliance5:
                      description: Washing Machine
                    Appliance6:
                      description: Dishwasher
                    Appliance7:
                      description: Computer Site
                    Appliance8:
                      description: Television Site
                    Appliance9:
                      description: Electric Heater
              # Other resources are omitted
              - run: dump.to_path
                parameters:
                    out-path: packages/refit-cleaned
          

The pipeline contains the following steps:

• Create the data package metadata
• Add all data files from disk
• Stream the resources into the data package
• Update the resource descriptions using a custom processor (see the sketch below)
• Save the data package to disk
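
The descriptions step refers to processors.modify_descriptions, a small custom processor shipped alongside the pipeline. The code below is only an illustrative sketch of how such a processor can be written with the standard datapackage-pipelines wrapper API, not the pilot's exact implementation; it copies the descriptions given in the pipeline parameters onto the matching schema fields.

# processors/modify_descriptions.py -- illustrative sketch, not the pilot's exact code
from datapackage_pipelines.wrapper import process


def modify_datapackage(datapackage, parameters, stats):
    # Resources may be given as a single name or a list of names.
    targets = parameters['resources']
    if isinstance(targets, str):
        targets = [targets]
    descriptions = parameters.get('descriptions', {})
    for resource in datapackage['resources']:
        if resource['name'] not in targets:
            continue
        for field in resource['schema']['fields']:
            if field['name'] in descriptions:
                field['description'] = descriptions[field['name']]['description']
    return datapackage


process(modify_datapackage=modify_datapackage)
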

          Now we’re ready to run this pipeline:

          $ dpp run ./refit-cleaned
          

          After this step we have a data package containing a descriptor:

packages/refit-cleaned/datapackage.json

          {
            "bytes": 1121187,
            "count_of_rows": 19980,
            "description": "Collection of this dataset was supported by the Engineering and Physical Sciences Research Council (EPSRC) via the project entitled Personalised Retrofit Decision Support Tools for UK Homes using Smart Home Technology (REFIT), which is a collaboration among the Universities of Strathclyde, Loughborough and East Anglia. The dataset includes data from 20 households from the Loughborough area over the period 2013 - 2015. Additional information about REFIT is available from www.refitsmarthomes.org.",
            "hash": "433ff35135e0a43af6f00f04cb8e666d",
            "license": "CC-BY-4.0",
            "name": "refit-electrical-load-measurements",
            "resources": [
              {
                "bytes": 55251,
                "count_of_rows": 999,
                "dialect": {
                  "delimiter": ",",
                  "doubleQuote": true,
                  "lineTerminator": "\r\n",
                  "quoteChar": "\"",
                  "skipInitialSpace": false
                },
                "dpp:streamedFrom": "datasets/refit-cleaned/House_1.csv",
                "dpp:streaming": true,
                "encoding": "utf-8",
                "format": "csv",
                "hash": "ad42fbf1302cabe30e217ff105d5a7fd",
                "name": "house-1",
                "path": "data/house-1.csv",
                "schema": {
                  "fields": [
                    {
                      "format": "%Y-%m-%d %H:%M:%S",
                      "name": "Time",
                      "type": "datetime"
                    },
                    {
                      "name": "Unix",
                      "type": "integer"
                    },
                    {
                      "name": "Aggregate",
                      "type": "integer"
                    },
                    {
                      "description": "Fridge",
                      "name": "Appliance1",
                      "type": "integer"
                    },
                    {
                      "description": "Chest Freezer",
                      "name": "Appliance2",
                      "type": "integer"
                    },
                    {
                      "description": "Upright Freezer",
                      "name": "Appliance3",
                      "type": "integer"
                    },
                    {
                      "description": "Tumble Dryer",
                      "name": "Appliance4",
                      "type": "integer"
                    },
                    {
                      "descripion": "Washing Machine",
                      "name": "Appliance5",
                      "type": "integer"
                    },
                    {
                      "description": "Dishwasher",
                      "name": "Appliance6",
                      "type": "integer"
                    },
                    {
                      "description": "Computer Site",
                      "name": "Appliance7",
                      "type": "integer"
                    },
                    {
                      "description": "Television Site",
                      "name": "Appliance8",
                      "type": "integer"
                    },
                    {
                      "description": "Electric Heater",
                      "name": "Appliance9",
                      "type": "integer"
                    }
                  ]
                }
              },
              # Other resources are omitted
            ]
          }
          

          And a list of data files linked in the descriptor:

          $ ls packages/refit-cleaned/data
          house-10.csv  house-13.csv  house-17.csv  house-1.csv   house-2.csv  house-5.csv  house-8.csv
          house-11.csv  house-15.csv  house-18.csv  house-20.csv  house-3.csv  house-6.csv  house-9.csv
          house-12.csv  house-16.csv  house-19.csv  house-21.csv  house-4.csv  house-7.csv
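
As a quick check that the dataset description now travels with the data files, the package can be opened with the datapackage Python library; a minimal sketch:

from datapackage import Package

# Open the freshly built package and inspect what it contains.
package = Package('packages/refit-cleaned/datapackage.json')
print(package.resource_names)                              # ['house-1', 'house-2', ...]
print(package.get_resource('house-1').schema.field_names)  # ['Time', 'Unix', 'Aggregate', ...]
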
          

          # Validating a data package using Goodtables

Goodtables is a software family for tabular data validation. It’s available as a Python library, a command line tool, a web application and a continuous validation service.

          The main features of Goodtables are:

          • Structural checks: Ensure that there are no empty rows, no blank headers, etc.
          • Content checks: Ensure that the values have the correct types (“string”, “number”, “date”, etc.), that their format is valid (“string must be an e-mail”), and that they respect the constraints (“age must be a number greater than 18”).
          • Support for multiple tabular formats: CSV, Excel files, LibreOffice, Data Package, etc.
          • Parallelized validations for multi-table datasets

Because we provided data types for the columns at the packaging stage, we can validate both the data structure and compliance with those data types using the Goodtables command line interface:

          $ goodtables packages/refit-cleaned/datapackage.json
          DATASET
          =======
          {'error-count': 0,
           'preset': 'datapackage',
           'table-count': 20,
           'time': 4.694,
           'valid': True}
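
The same validation can also be run from Python; a minimal sketch using the goodtables library:

from goodtables import validate

# Validate the whole data package: structural checks plus schema/type checks.
report = validate('packages/refit-cleaned/datapackage.json', preset='datapackage')
print(report['valid'], report['table-count'], report['error-count'])
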
          

          # Modifying a data package using Packagist

If we need to modify our data package, we can use Packagist. It provides a straightforward UI for modifying and validating data package descriptors. With its easy-to-use interface we are able to:

          • Load/validate/save a data package
          • Update a data package metadata
          • Add/remove/modify data package resources
          • Add/remove/modify data resource fields
          • Set type/format for data values

screenshot of the Packagist UI

In the figure above, we have loaded the refit-cleaned data package into the Packagist UI to make changes to the data package as needed.

          # Publishing a data package to Amazon S3

Link to the published package: https://s3.eu-central-1.amazonaws.com/pilot-dm4t/pilot-dm4t/packages/refit-cleaned/datapackage.json

In this section we will show how data packages can be moved from one data storage system to another. This is possible because the data has been containerised.

One important feature of the datapackage-pipelines project is that it works like a conveyor: we can push our data package not only to the local disk but also to other destinations, for example Amazon S3:

pipeline-spec.yaml

refit-cleaned:
  pipeline:
    # Initial steps are omitted
    - run: aws.dump.to_s3
      parameters:
        bucket: pilot-dm4t
        path: pilot-dm4t/packages/refit-cleaned
          

          Running this command again:

          $ dpp run ./refit-cleaned
          

And now our data package is published to the Amazon S3 remote storage:

          screenshot of S3 storage
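
Because the package is now addressable by URL, it can be consumed directly from S3 with the same tooling; a short sketch:

from datapackage import Package

# Load the published descriptor straight from S3; resource data is fetched over HTTP.
URL = ('https://s3.eu-central-1.amazonaws.com/pilot-dm4t/'
       'pilot-dm4t/packages/refit-cleaned/datapackage.json')
package = Package(URL)
print(package.resource_names)
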

          # Getting insight from data using Python libraries

Link to the demonstration script: https://github.com/frictionlessdata/pilot-dm4t/blob/delivery/scripts/refit-cleaned.py

The Frictionless Data project provides Python libraries (along with libraries in eight other languages) for working with data packages programmatically. We used the datapackage library to analyse the refit-cleaned data package:

          import datetime
          import statistics
          from datapackage import Package
          # Get aggregates
          consumption = {}
          package = Package('packages/refit-cleaned/datapackage.json')
          for resource in package.resources:
              for row in resource.iter(keyed=True):
                  hour = row['Time'].hour
                  consumption.setdefault(hour, [])
                  consumption[hour].append(row['Aggregate'])
          # Get averages
          for hour in consumption:
              consumption[hour] = statistics.mean(consumption[hour])
          # Print results
          for hour in sorted(consumption):
              print('Average consumption at %02d hours: %.0f' % (hour, consumption[hour]))
          

Now we can run it from the command line:

$ python examples/refit-cleaned.py
          Average consumption at 00 hours: 232
          Average consumption at 01 hours: 213
          Average consumption at 02 hours: 247
          Average consumption at 03 hours: 335
          Average consumption at 04 hours: 215
          Average consumption at 05 hours: 690
          Average consumption at 06 hours: 722
          Average consumption at 07 hours: 648
          Average consumption at 08 hours: 506
          Average consumption at 09 hours: 464
          Average consumption at 10 hours: 364
          Average consumption at 11 hours: 569
          Average consumption at 12 hours: 520
          Average consumption at 13 hours: 497
          Average consumption at 14 hours: 380
          Average consumption at 15 hours: 383
          Average consumption at 16 hours: 459
          Average consumption at 17 hours: 945
          Average consumption at 18 hours: 733
          Average consumption at 19 hours: 732
          Average consumption at 20 hours: 471
          Average consumption at 21 hours: 478
          Average consumption at 22 hours: 325
          Average consumption at 23 hours: 231
          

Here we were able to get the average electricity consumption grouped by hour. We could have achieved this in different ways, but using the Frictionless Data specs and software provides some important advantages:

• Because the data is wrapped in a data package, we can validate it and read values already converted to their correct types (e.g. native Python datetime objects), with no string parsing required.
• The Frictionless Data software uses file streams under the hood. Only the current row is kept in memory, so we can handle datasets bigger than the available RAM.

          # Exporting data to an ElasticSearch cluster

Link to the export script: https://github.com/frictionlessdata/pilot-dm4t/blob/delivery/scripts/refit-cleaned.py

The Frictionless Data software provides plugins for exporting data to various backends such as SQL databases, BigQuery, and ElasticSearch. We will export the first resource from our data package for further analysis:

          from elasticsearch import Elasticsearch
          from datapackage import Package
          from tableschema_elasticsearch import Storage
          # Get resource
          package = Package('packages/refit-cleaned/datapackage.json')
          resource = package.get_resource('house-1')
          # Create storage
          engine = Elasticsearch()
          storage = Storage(engine)
          # Write data
          storage.create('refit-cleaned', [('house-1', resource.schema.descriptor)])
          list(storage.write('refit-cleaned', 'house-1', resource.read(keyed=True), ['Unix']))
          

          Now we are able to check that our documents are indexed:

          $ http http://localhost:9200/_cat/indices?v
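
Or, from Python, a quick count of the indexed documents; a sketch using the same elasticsearch client, assuming the bucket name created above maps directly to the index name:

from elasticsearch import Elasticsearch

# Count the documents indexed by the export step above.
engine = Elasticsearch()
print(engine.count(index='refit-cleaned')['count'])
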
          

          # Getting insight from data using Kibana

To demonstrate how the Frictionless Data specs and software empower the use of other analytics tools, we will use the ElasticSearch/Kibana stack. In the previous step we imported our data package into an ElasticSearch cluster, which lets us visualise the data through a simple UI:

          screenshot of elasticsearch cluster

In this screenshot we see the distribution of average electricity consumption. This is just one example of what you can do once datasets can be easily loaded into other analytical software.

          # Review

          # The results

In this pilot, we have been able to demonstrate the following:

          • Packaging the refit-cleaned dataset as a data package using the Data Package Pipelines library
          • Validating the data package using the Goodtables library
• Modifying data package metadata using the Packagist UI
• Uploading the dataset to Amazon S3 and an ElasticSearch cluster using Frictionless Data tools
• Reading and analysing the created data package in Python using the Frictionless Data libraries

          # Current limitations

The central challenge of working with these datasets is their size. Publishing the results of these research projects as flat files for immediate analysis is beneficial; however, the scale of each of these datasets (gigabytes of data, millions of rows) is a challenge to deal with no matter how you store them, and processing this data through Data Package Pipelines takes a long time.

          # Next Steps

          • Improve the speed of the data package creation step

          # Find Out More

• https://github.com/frictionlessdata/pilot-pnnl
