Open Data Blend
Open Data Blend is a set of open data services that aim to make large and complex UK open data easier to analyse. We source the raw open data, transform it into dimensional models (also referred to as ‘star schemas’), cleanse and enrich it, add metadata to support its reuse, and make this processed data openly available as compressed CSV, Apache ORC, and Apache Parquet data files. In summary, we provide analysis-ready open data with an emphasis on quality over quantity. We are excited to tell you more about Open Data Blend and how it uses Frictionless Data specifications to make this data easier to understand and use.
There are two core data services: Open Data Blend Datasets and Open Data Blend Analytics. Open Data Blend Datasets has a user interface (UI) called the Open Data Blend Dataset UI and a bulk data API called the Open Data Blend Dataset API. Open Data Blend Analytics is an interactive analytical query service that can be used from popular BI tools like Excel, Power BI Desktop, and Tableau Desktop.
# Why Open Data Blend Was Created
The idea behind Open Data Blend was born at Nimble Learn in 2014 after several pain points were experienced when working with large and complex UK open datasets. One of these pain points was that a significant effort, and access to large computational resources, was needed to prepare the data for analysis in a reasonable timeframe. Another pain point was that the lookups and data dictionaries would often be buried in unstructured sources like Word documents, PDF files, and web pages.
# Our Frictionless Data Journey
At Nimble Learn, we have over six years’ experience working with the Frictionless Data specifications. We have delivered two other Frictionless Data projects to date: Data Package M and Data Package Connector.
Data Package M is a Power Query M library that simplifies the loading of Tabular Data Packages into Excel or Power BI.
You can read the Frictionless Data case study for Data Package M here.
Data Package Connector is a Power BI custom connector that enables one or more tables from Data Packages that implement the Table Schema specification to be loaded directly into Power BI through the ‘Get Data’ experience.
The Frictionless Data case study for Data Package Connector can be read here.
# How Open Data Blend Uses Frictionless Data
Over more than six years of extensive research and development into open data publishing, we reviewed and evaluated several open standards that could be used as a base for our open data API. After carefully weighing the pros and cons of each, we chose to adopt the Frictionless Data specifications because they were lightweight, simple, robust, and highly scalable. We also wanted our users to benefit from the growing ecosystem of Frictionless Data tools that make Frictionless Data even more accessible.
The Open Data Blend Dataset UI and the Open Data Blend Dataset API are both powered by Frictionless Data. When you visit the Open Data Blend Datasets page, all of the information that you see nicely presented is coming from a data package that conforms to the Data Package Catalog pattern. Clicking on one of the datasets takes you to a dedicated dataset page that is driven by extended Data Package metadata. The ‘Get metadata’ button at the top of each dataset page reveals the contents of the underlying datapackage.json file.
So far, we have implemented and extended the following Frictionless Data specifications and patterns:
- Data Package
- Table Schema
- Data Catalogue pattern
- Compressed resources pattern
You can see how deeply ingrained the Frictionless Data specifications are just by skimming through the Open Data Blend Dataset API reference documentation.
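To make the role of the datapackage.json file more concrete, here is a minimal sketch of how a parsed descriptor could be inspected in Python. The property names (`name`, `title`, `resources`, `path`, `format`) come from the Data Package specification, but the sample descriptor is a made-up, heavily trimmed illustration rather than real Open Data Blend metadata, and the helper functions `find_resource` and `resources_by_format` are hypothetical names used only for this example.

```python
import json

# Hypothetical, heavily trimmed descriptor for illustration only.
# The property names follow the Data Package specification; the
# values (including the paths) are assumptions, not real metadata.
SAMPLE_DESCRIPTOR = json.loads("""
{
  "name": "example-dataset",
  "title": "Example Dataset",
  "resources": [
    {"name": "date-csv", "path": "data/date.csv.gz", "format": "csv"},
    {"name": "date-parquet", "path": "data/date.parquet", "format": "parquet"}
  ]
}
""")


def find_resource(descriptor, resource_name):
    """Return the resource entry whose 'name' matches, or None if absent."""
    for resource in descriptor.get("resources", []):
        if resource.get("name") == resource_name:
            return resource
    return None


def resources_by_format(descriptor):
    """Group resource names by their declared 'format'."""
    grouped = {}
    for resource in descriptor.get("resources", []):
        file_format = resource.get("format", "unknown")
        grouped.setdefault(file_format, []).append(resource["name"])
    return grouped


if __name__ == "__main__":
    print(find_resource(SAMPLE_DESCRIPTOR, "date-parquet"))
    print(resources_by_format(SAMPLE_DESCRIPTOR))
```

Because every dataset is described by the same kind of descriptor, a lookup like this should work the same way regardless of which dataset the metadata describes.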
# How Open Data Blend Helps
Each Open Data Blend dataset is presented with helpful metadata. The data is modelled and enriched to enable effective data analysis. The columns that contain descriptive values are carefully combined into dimension tables and those that contain measurable facts are grouped into fact tables. Modelling the data in this way makes it easier to understand and analyse. You can learn more about these dimensional modelling concepts here and here.
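As a rough illustration of why this split helps, the sketch below joins a tiny fact table to a date dimension and totals a measure by year, using only the Python standard library. The column names `drv_date_key`, `drv_year`, and `drv_month_name` match those of the date table used later in this article, but the rows, the `event_count` measure, and the `total_by_year` helper are assumptions invented for this example.

```python
from collections import defaultdict

# Hypothetical dimension table: one row per date, keyed by a surrogate key.
date_dimension = [
    {"drv_date_key": 20200101, "drv_year": 2020, "drv_month_name": "January"},
    {"drv_date_key": 20200215, "drv_year": 2020, "drv_month_name": "February"},
    {"drv_date_key": 20210101, "drv_year": 2021, "drv_month_name": "January"},
]

# Hypothetical fact table: each row holds a measure plus a key that
# points at a row in the dimension table. 'event_count' is made up.
fact_rows = [
    {"drv_date_key": 20200101, "event_count": 3},
    {"drv_date_key": 20200215, "event_count": 5},
    {"drv_date_key": 20210101, "event_count": 2},
    {"drv_date_key": 20210101, "event_count": 4},
]


def total_by_year(facts, dates, measure):
    """Join facts to the date dimension on the key and sum a measure per year."""
    # Index the dimension by its key so that each fact needs one lookup.
    date_lookup = {row["drv_date_key"]: row for row in dates}
    totals = defaultdict(int)
    for fact in facts:
        year = date_lookup[fact["drv_date_key"]]["drv_year"]
        totals[year] += fact[measure]
    return dict(totals)


if __name__ == "__main__":
    print(total_by_year(fact_rows, date_dimension, "event_count"))
```

The descriptive attributes live once in the dimension, so the same fact rows can be regrouped by month, quarter, or year without touching the measures.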
In addition to CSVs, we make the data available as Apache ORC and Apache Parquet files because they are two of the most popular and efficient open file formats for analytical workloads. Libraries available for Python, R, and other popular languages make it possible to query these files very quickly. If you are a data engineer, data analyst, or data scientist with access to data lake storage, such as Amazon S3 and Azure Data Lake Storage Gen2, the ORC or Parquet files can be ingested into your data lake. Once there, you can query them interactively using data lake engines like Apache Spark, Azure Synapse Analytics, Databricks, Dremio, and Trino.
To accelerate the data acquisition process when working with Open Data Blend datasets through code, we have developed a lightweight Python package called ‘opendatablend’. Once installed, this package allows you to effortlessly cache our data files locally with just a few lines of Python. Data engineers, data analysts, and data scientists can use the opendatablend package to get data and use it with whatever data tools they prefer. For example, a data scientist might start off doing some exploratory data analysis (EDA) in Pandas or Koalas using a Jupyter notebook, transition to feature engineering, and then train and score machine learning models using scikit-learn or Spark MLlib.
Below is a simple example that shows how easy the opendatablend package for Python is to use:
```python
import opendatablend as odb
import pandas as pd

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in .parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object
output = odb.get_data(dataset_path, resource_name)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)

# Read a subset of the columns into a dataframe
df_date = pd.read_parquet(output.data_file_name, columns=['drv_date_key', 'drv_date', 'drv_month_name', 'drv_month_number', 'drv_quarter_name', 'drv_quarter_number', 'drv_year'])

# Check the contents of the dataframe
df_date
```
You can learn more about the opendatablend package here.
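For readers curious about what caching data files locally can look like in general, here is a minimal sketch of the idea using only the Python standard library. This is not the opendatablend package's implementation. The function names `local_cache_path` and `fetch_if_missing`, the `odb_cache` folder name, and the choice to mirror the URL path on disk are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_cache_path(file_url, cache_root="odb_cache"):
    """Map a remote file URL to a path under cache_root that mirrors the URL.

    Hypothetical helper: the folder layout is an assumption for this sketch.
    """
    parsed = urlparse(file_url)
    return Path(cache_root) / parsed.netloc / parsed.path.lstrip("/")


def fetch_if_missing(file_url, cache_root="odb_cache"):
    """Download the file only if there is no local copy yet, then return its path."""
    target = local_cache_path(file_url, cache_root)
    if not target.exists():
        # Create the parent folders before downloading into them.
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(file_url, str(target))
    return target
```

A second call with the same URL returns the existing local path without touching the network, which is the property that makes repeated analysis runs fast. A production-grade cache would also need to decide when a local copy is stale, which this sketch deliberately leaves out.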
To further reduce the time to value and to make the open data insights more accessible, the Open Data Blend Analytics service can be used with business intelligence (BI) tools like Excel, Power BI Desktop, and Tableau Desktop to directly analyse the data over a live connection. Depending on the use case, this can remove the need to work with the data files altogether.
# Want to Learn More About Open Data Blend?
You can visit the Open Data Blend website here to learn more about the services. We also have comprehensive documentation available here, and the Frictionless Data-specific documentation can be found here. If you would like to contribute to the project, you can find out how here.
Follow us on Twitter @opendatablend to get our latest news, feature highlights, thoughts, and tips.