Data Management for TEDDINET
# DM4T Pilot
# Pilot Name
Data Management for TEDDINET (DM4T)
# Authors
Julian Padget (DM4T), Dan Fowler (OKI), Evgeny Karev (OKI)
# Field
Energy Data
# FD Tech Involved
- Frictionless Data specs: http://specs.frictionlessdata.io/
- Data Package Pipelines: https://github.com/frictionlessdata/datapackage-pipelines
- Goodtables: https://github.com/frictionlessdata/goodtables-py
- Packagist: has now moved to create.frictionlessdata.io
# Context
# Problem We Were Trying To Solve
Open Knowledge International and the Data Management for TEDDINET project (DM4T) agreed to work together on a proof-of-concept pilot applying Frictionless Data approaches to some of the data legacy issues facing the TEDDINET project, a research network addressing the challenge of transforming energy demand in our buildings as a key component of the transition to an affordable, low-carbon energy system. The problem, as described on the DM4T website:
The Engineering and Physical Sciences Research Council (EPSRC), the UK’s main agency for funding research in engineering and the physical sciences, funded 22 projects over two calls in 2010 and 2012 to investigate ‘Transforming Energy Demand through Digital Innovation’ (TEDDI) as a means to find out how people use energy in homes and what can be done to reduce energy consumption. A lot of data is being collected at different levels of detail in a variety of housing throughout the UK, but the levels of detail are largely defined by the needs of each individual project. At the same time, the Research Councils UK (RCUK) are defining guidelines for what happens to data generated by projects they fund, which require researchers to take concrete actions to store, preserve, and document their data for future reference.
The problem, however, is that there is relatively little awareness, limited experience and only emerging practice of how to incorporate data management into much of physical science research. This is in contrast to established procedures for data formats and sharing in the biosciences, stemming from international collaboration on the Human Genome Project, and in the social sciences, where data from national surveys, including census data, have been centrally archived for many years. Consequently, current solutions may be able to meet a minimal interpretation of the requirements, but not effectively deliver the desired data legacy.
The DM4T group selected three suitable datasets on which to base this work and provided domain knowledge to ensure the pilot is applicable to real use cases.
Output was tracked here: https://github.com/frictionlessdata/pilot-dm4t/issues
# The work
We will use the refit-cleaned dataset to show the capabilities of the Frictionless Data specs and software. For this work, we limited the size of the dataset in order to keep the showcase time reasonable. The Frictionless Data software is designed to scale well, so this process could be reproduced for the whole dataset, although it is worth noting that processing speed for such big datasets could be a bottleneck for research work.
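For reference, the truncated copies in the pilot repository keep roughly the first thousand rows of each house file (the generated descriptor below reports 999 data rows per resource). A purely illustrative sketch of producing such a sample follows; the refit-cleaned-full directory name is hypothetical, and this is not necessarily how the pilot prepared its copy:

import glob
import itertools
import os

# Illustrative only: keep the header plus the first 999 data rows of each house file.
os.makedirs('datasets/refit-cleaned', exist_ok=True)
for path in glob.glob('datasets/refit-cleaned-full/House_*.csv'):  # hypothetical source directory
    out_path = os.path.join('datasets/refit-cleaned', os.path.basename(path))
    with open(path) as src, open(out_path, 'w') as dst:
        dst.writelines(itertools.islice(src, 1000))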
# REFIT: Electrical Load Measurements (Cleaned)
Link to the dataset: https://github.com/frictionlessdata/pilot-dm4t/tree/delivery/datasets/refit-cleaned
For each house in the study, this dataset consists of granular readings of electrical load. There were 20 houses in total, and each house had a different mix of devices plugged into the electrical load sensor. The dataset was distributed as a zipped file (~500MB) containing 20 CSVs with a combined ~120 million rows.
Time,Unix,Aggregate,Appliance1,Appliance2,Appliance3,Appliance4,Appliance5,Appliance6,Appliance7,Appliance8,Appliance9
2013-10-09 13:06:17,1381323977,523,74,0,69,0,0,0,0,0,1
2013-10-09 13:06:31,1381323991,526,75,0,69,0,0,0,0,0,1
2013-10-09 13:06:46,1381324006,540,74,0,68,0,0,0,0,0,1
2013-10-09 13:07:01,1381324021,532,74,0,68,0,0,0,0,0,1
2013-10-09 13:07:15,1381324035,540,74,0,69,0,0,0,0,0,1
2013-10-09 13:07:18,1381324038,539,74,0,69,0,0,0,0,0,1
2013-10-09 13:07:30,1381324050,537,74,0,69,0,0,0,0,0,1
2013-10-09 13:07:32,1381324052,537,74,0,69,0,0,0,0,0,1
2013-10-09 13:07:44,1381324064,548,74,0,69,0,0,0,0,0,1
Given that these datasets were already provided in well structured CSV files, it was straightforward to translate the data dictionary found in the dataset’s README into the relevant fields in the datapackage.json. We did not need to alter the CSVs that comprise the dataset.
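The pilot itself used Data Package Pipelines for this step (described in the next section). As a rough illustration of the same idea with the datapackage Python library, a descriptor can be bootstrapped from one of the CSVs and then annotated with the README’s data dictionary; the snippet below is a sketch, not the pilot’s actual code:

from datapackage import Package

# Infer paths, formats and a typed schema from one of the CSVs.
package = Package(base_path='.')
package.infer('datasets/refit-cleaned/House_1.csv')

# Copy a description from the dataset's README data dictionary into the schema.
for field in package.descriptor['resources'][0]['schema']['fields']:
    if field['name'] == 'Appliance1':
        field['description'] = 'Fridge'
package.commit()

# Write the descriptor so it ships alongside the data files.
package.save('datapackage.json')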
# Creating a data package using Datapackage Pipelines
Link to the Datapackage Pipelines project: https://github.com/frictionlessdata/datapackage-pipelines
Datapackage Pipelines is a framework for declarative stream-processing of tabular data. It is built upon the concepts and tooling of the Frictionless Data project. The basic concept in this framework is the pipeline. A pipeline has a list of processing steps, and it generates a single data package as its output. Pipelines are defined in a declarative way, not in code. One or more pipelines can be defined in a pipeline-spec.yaml file. This file specifies the list of processors (referenced by name) and their execution parameters.
One of the main purposes of the Frictionless Data project is data containerization. Instead of keeping two separate sources of knowledge about the data (the data files and a free-text README), we put both of them into a container based on the Data Package specification. This allows us to:
- Ensure that the dataset description is shipped with the data files
- Provide column data type information to allow type validation
- Use the Frictionless Data tooling for reading and validating datasets
- Enable usage of other software which supports Frictionless Data specifications
First, we used the datapackage-pipelines library to create a data package from the raw dataset. We need a declarative file called pipeline-spec.yaml to describe the data transformation steps:
pipeline-spec.yaml
refit-cleaned:
  pipeline:
    - run: add_metadata
      parameters:
        name: refit-electrical-load-measurements
        title: 'REFIT: Electrical Load Measurements'
        license: CC-BY-4.0
        description: Collection of this dataset was supported by the Engineering and Physical Sciences Research Council (EPSRC) via the project entitled Personalised Retrofit Decision Support Tools for UK Homes using Smart Home Technology (REFIT), which is a collaboration among the Universities of Strathclyde, Loughborough and East Anglia. The dataset includes data from 20 households from the Loughborough area over the period 2013 - 2015. Additional information about REFIT is available from www.refitsmarthomes.org.
        sources:
          -
            title: 'REFIT: Electrical Load Measurements (Cleaned)'
            web: 'https://pure.strath.ac.uk/portal/en/datasets/refit-electrical-load-measurements-cleaned(9ab14b0e-19ac-4279-938f-27f643078cec).html'
            email: [email protected]
    - run: add_resource
      parameters:
        name: 'house-1'
        url: 'datasets/refit-cleaned/House_1.csv'
        format: csv
    # Other resources are omitted
    - run: stream_remote_resources
    - run: set_types
      parameters:
        resources: "house-[0-9]{1,2}"
        types:
          "Time":
            type: datetime
            format: "fmt:%Y-%m-%d %H:%M:%S"
          Unix:
            type: integer
          Aggregate:
            type: integer
          "Appliance[1-9]":
            type: integer
    - run: processors.modify_descriptions
      parameters:
        resources: house-1
        descriptions:
          Appliance1:
            description: Fridge
          Appliance2:
            description: Chest Freezer
          Appliance3:
            description: Upright Freezer
          Appliance4:
            description: Tumble Dryer
          Appliance5:
            description: Washing Machine
          Appliance6:
            description: Dishwasher
          Appliance7:
            description: Computer Site
          Appliance8:
            description: Television Site
          Appliance9:
            description: Electric Heater
    # Other resources are omitted
    - run: dump.to_path
      parameters:
        out-path: packages/refit-cleaned
The process contains these steps:
- Create the data package metadata
- Add all data files from the disk
- Start streaming the resources into the data package
- Update resource descriptions using a custom processor (see the sketch below)
- Save the data package to disk
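The custom modify_descriptions processor lives in the pilot repository and is not reproduced here. A minimal sketch of how such a processor can be written with the datapackage-pipelines wrapper, assuming the parameter layout used in the pipeline above, might look like this:

processors/modify_descriptions.py

# A sketch only; the processor actually used in the pilot may differ in detail.
from datapackage_pipelines.wrapper import process


def modify_datapackage(datapackage, parameters, stats):
    # Add the configured descriptions to matching fields of the selected resource.
    descriptions = parameters.get('descriptions', {})
    target = parameters.get('resources')
    for resource in datapackage['resources']:
        if target and resource['name'] != target:
            continue
        for field in resource.get('schema', {}).get('fields', []):
            if field['name'] in descriptions:
                field.update(descriptions[field['name']])
    return datapackage


process(modify_datapackage=modify_datapackage)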
Now we’re ready to run this pipeline:
$ dpp run ./refit-cleaned
After this step we have a data package containing a descriptor:
packages/refit-cleaned/datapackage.json
{
"bytes": 1121187,
"count_of_rows": 19980,
"description": "Collection of this dataset was supported by the Engineering and Physical Sciences Research Council (EPSRC) via the project entitled Personalised Retrofit Decision Support Tools for UK Homes using Smart Home Technology (REFIT), which is a collaboration among the Universities of Strathclyde, Loughborough and East Anglia. The dataset includes data from 20 households from the Loughborough area over the period 2013 - 2015. Additional information about REFIT is available from www.refitsmarthomes.org.",
"hash": "433ff35135e0a43af6f00f04cb8e666d",
"license": "CC-BY-4.0",
"name": "refit-electrical-load-measurements",
"resources": [
{
"bytes": 55251,
"count_of_rows": 999,
"dialect": {
"delimiter": ",",
"doubleQuote": true,
"lineTerminator": "\r\n",
"quoteChar": "\"",
"skipInitialSpace": false
},
"dpp:streamedFrom": "datasets/refit-cleaned/House_1.csv",
"dpp:streaming": true,
"encoding": "utf-8",
"format": "csv",
"hash": "ad42fbf1302cabe30e217ff105d5a7fd",
"name": "house-1",
"path": "data/house-1.csv",
"schema": {
"fields": [
{
"format": "%Y-%m-%d %H:%M:%S",
"name": "Time",
"type": "datetime"
},
{
"name": "Unix",
"type": "integer"
},
{
"name": "Aggregate",
"type": "integer"
},
{
"description": "Fridge",
"name": "Appliance1",
"type": "integer"
},
{
"description": "Chest Freezer",
"name": "Appliance2",
"type": "integer"
},
{
"description": "Upright Freezer",
"name": "Appliance3",
"type": "integer"
},
{
"description": "Tumble Dryer",
"name": "Appliance4",
"type": "integer"
},
{
"descripion": "Washing Machine",
"name": "Appliance5",
"type": "integer"
},
{
"description": "Dishwasher",
"name": "Appliance6",
"type": "integer"
},
{
"description": "Computer Site",
"name": "Appliance7",
"type": "integer"
},
{
"description": "Television Site",
"name": "Appliance8",
"type": "integer"
},
{
"description": "Electric Heater",
"name": "Appliance9",
"type": "integer"
}
]
}
},
# Other resources are omitted
]
}
And a list of data files linked in the descriptor:
$ ls packages/refit-cleaned/data
house-10.csv house-13.csv house-17.csv house-1.csv house-2.csv house-5.csv house-8.csv
house-11.csv house-15.csv house-18.csv house-20.csv house-3.csv house-6.csv house-9.csv
house-12.csv house-16.csv house-19.csv house-21.csv house-4.csv house-7.csv
# Validating a data package using Goodtables
Goodtables is a family of software for tabular data validation. It is available as a Python library, a command line tool, a web application, and a continuous validation service.
The main features of Goodtables are:
- Structural checks: Ensure that there are no empty rows, no blank headers, etc.
- Content checks: Ensure that the values have the correct types (“string”, “number”, “date”, etc.), that their format is valid (“string must be an e-mail”), and that they respect the constraints (“age must be a number greater than 18”).
- Support for multiple tabular formats: CSV, Excel files, LibreOffice, Data Package, etc.
- Parallelized validations for multi-table datasets
Because we provided data types for the columns at the packaging stage, here we validate both the data structure and compliance with the declared data types using the Goodtables command line interface:
$ goodtables packages/refit-cleaned/datapackage.json
DATASET
=======
{'error-count': 0,
'preset': 'datapackage',
'table-count': 20,
'time': 4.694,
'valid': True}
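The same validation can also be run programmatically; a short sketch using the goodtables Python library:

from goodtables import validate

# Validate the structure and schema conformance of every resource in the package.
report = validate('packages/refit-cleaned/datapackage.json', preset='datapackage')
print(report['valid'], report['table-count'], report['error-count'])

This is convenient when validation needs to be part of a larger processing script rather than a manual command line step.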
# Modifying a data package using Packagist
If we need to modify our data package, we can use Packagist. It provides a straightforward UI to modify and validate data package descriptors. With its easy-to-use interface we are able to:
- Load/validate/save a data package
- Update a data package metadata
- Add/remove/modify data package resources
- Add/remove/modify data resource fields
- Set type/format for data values
In the figure above, we have loaded the refit-cleaned data package into the Packagist UI to make changes to the data package as needed.
# Publishing a data package to Amazon S3
Link to the published package: https://s3.eu-central-1.amazonaws.com/pilot-dm4t/pilot-dm4t/packages/refit-cleaned/datapackage.json
In this section we will show how data packages can be moved from one data storage system to another. This is possible because the data has been containerized.
One important feature of the datapackage-pipelines project is that it works as a conveyor: we can push our data package not only to the local disk but also to other destinations, for example Amazon S3:
pipeline-spec.yaml
refit-cleaned:
  pipeline:
    # Initial steps are omitted
    - run: aws.dump.to_s3
      parameters:
        bucket: pilot-dm4t
        path: pilot-dm4t/packages/refit-cleaned
Running this command again:
$ dpp run ./refit-cleaned
And now our data package is published to the Amazon S3 remote storage.
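Because the published package is just a descriptor plus data files at stable URLs, anyone can consume it directly with the datapackage library. A small sketch, assuming the bucket is publicly readable:

from datapackage import Package

# Load the descriptor straight from S3; resource paths resolve relative to it.
url = ('https://s3.eu-central-1.amazonaws.com/pilot-dm4t/'
       'pilot-dm4t/packages/refit-cleaned/datapackage.json')
package = Package(url)
print(package.descriptor['name'])
print([resource.name for resource in package.resources])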
# Getting insight from data using Python libraries
Link to the demonstration script: https://github.com/frictionlessdata/pilot-dm4t/blob/delivery/scripts/refit-cleaned.py
The Frictionless Data project provides Python libraries (along with libraries for 8 other languages) to work with data packages programmatically. We used the datapackage library to analyse the refit-cleaned data package:
import datetime
import statistics
from datapackage import Package

# Get aggregates
consumption = {}
package = Package('packages/refit-cleaned/datapackage.json')
for resource in package.resources:
    for row in resource.iter(keyed=True):
        hour = row['Time'].hour
        consumption.setdefault(hour, [])
        consumption[hour].append(row['Aggregate'])

# Get averages
for hour in consumption:
    consumption[hour] = statistics.mean(consumption[hour])

# Print results
for hour in sorted(consumption):
    print('Average consumption at %02d hours: %.0f' % (hour, consumption[hour]))
Now we can run it from the command line:
$ python examples/refit-cleaned.py
Average consumption at 00 hours: 232
Average consumption at 01 hours: 213
Average consumption at 02 hours: 247
Average consumption at 03 hours: 335
Average consumption at 04 hours: 215
Average consumption at 05 hours: 690
Average consumption at 06 hours: 722
Average consumption at 07 hours: 648
Average consumption at 08 hours: 506
Average consumption at 09 hours: 464
Average consumption at 10 hours: 364
Average consumption at 11 hours: 569
Average consumption at 12 hours: 520
Average consumption at 13 hours: 497
Average consumption at 14 hours: 380
Average consumption at 15 hours: 383
Average consumption at 16 hours: 459
Average consumption at 17 hours: 945
Average consumption at 18 hours: 733
Average consumption at 19 hours: 732
Average consumption at 20 hours: 471
Average consumption at 21 hours: 478
Average consumption at 22 hours: 325
Average consumption at 23 hours: 231
Here we were able to get the averages for electricity consumption grouped by hour. We could have achieved this in different ways, but using the Frictionless Data specs and software provides some important advantages:
- Because the data is wrapped into a data package, we can validate it and read it already converted to its correct types (e.g. a native Python datetime object), with no need for any kind of string parsing.
- The Frictionless Data software uses file streams under the hood. This means that only the current row is kept in memory, so we are able to handle datasets bigger than the available RAM.
# Exporting data to an ElasticSearch cluster
Link to the export script: https://github.com/frictionlessdata/pilot-dm4t/blob/delivery/scripts/refit-cleaned.py
The Frictionless Data software provides plugins to export data to various backends such as SQL databases and BigQuery. We will export the first resource from our data package for further analysis:
from elasticsearch import Elasticsearch
from datapackage import Package
from tableschema_elasticsearch import Storage
# Get resource
package = Package('packages/refit-cleaned/datapackage.json')
resource = package.get_resource('house-1')
# Create storage
engine = Elasticsearch()
storage = Storage(engine)
# Write data
storage.create('refit-cleaned', [('house-1', resource.schema.descriptor)])
list(storage.write('refit-cleaned', 'house-1', resource.read(keyed=True), ['Unix']))
Now we are able to check that our documents are indexed:
$ http http://localhost:9200/_cat/indices?v
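The same check can be done from Python with the elasticsearch client used above; a sketch:

from elasticsearch import Elasticsearch

engine = Elasticsearch()
# Make the newly written documents visible to search, then count them.
engine.indices.refresh(index='refit-cleaned')
print(engine.count(index='refit-cleaned')['count'])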
# Getting insight from data using Kibana
To demonstrate how the Frictionless Data specs and software enable the use of other analytics tools, we will use the ElasticSearch/Kibana stack. In the previous step we imported our data package into an ElasticSearch cluster, which allows us to visualize the data using a simple UI.
In this screenshot we see the distribution of the average electricity consumption. This is just an example of what you can do once you are able to easily load datasets into other analytical software.
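The same kind of summary that Kibana visualises can also be requested directly from ElasticSearch. A sketch that asks for the mean of the indexed Aggregate readings, using the same client as above (the aggregation name mean_load is arbitrary):

from elasticsearch import Elasticsearch

engine = Elasticsearch()
# Ask ElasticSearch for the mean of the aggregate load across all indexed rows.
response = engine.search(index='refit-cleaned', body={
    'size': 0,
    'aggs': {'mean_load': {'avg': {'field': 'Aggregate'}}},
})
print(response['aggregations']['mean_load']['value'])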
# Review
# The results
In this pilot, we have been able to demonstrate the following:
- Packaging the refit-cleaned dataset as a data package using the Data Package Pipelines library
- Validating the data package using the Goodtables library
- Modifying data package metadata using the Packagist UI
- Uploading the dataset to Amazon S3 and an ElasticSearch cluster using Frictionless Data tools
- Reading and analysing the created data package in Python using the Frictionless Data libraries
# Current limitations
The central challenge of working with these datasets is their size. Publishing the results of these research projects as flat files for immediate analysis is beneficial; however, the scale of each of these datasets (gigabytes of data, millions of rows) is a challenge to deal with no matter how the data is stored. Processing this data through Data Package Pipelines takes a long time.
# Next Steps
- Improve the speed of the data package creation step