One-off validation of your tabular datasets can be hectic, especially where plenty of published data is maintained and updated fairly regularly.
Running continuous checks on data provides regular feedback and contributes to better data quality as errors can be flagged and fixed early on. This section introduces you to tools that continually check your data for errors and flag content and structural issues as they arise. By eliminating the need to run manual checks on tabular datasets every time they are updated, they make your data workflow more efficient.
In this section, you will learn how to setup automatic tabular data validation using goodtables, so your data is validated every time it’s updated. Although not strictly necessary, it’s useful to know about Data Packages and Table Schema before proceeding, as they allow you to describe your data in more detail, allowing more advanced validations.
We will show how to set up automated tabular data validations for data published on:
- CKAN, an open source data publishing platform;
- GitHub, a hosting service;
- Amazon S3, a data storage service.
If you don’t use any of these platforms, you can still setup the validation using goodtables-py, it will just require some technical knowledge
If you do use some of these platforms, the data validation report look like:
Figure 1: Goodtables.io tabular data validation report.
# Validate tabular data automatically on CKAN
CKAN is an open source platform for publishing data online. It is widely used across the planet, including by the federal governments of the USA, United Kingdom, Brazil, and others.
- Adds a badge next to each dataset showing the status of their validation (valid or invalid), and
- Allows users to access the validation report, making it possible for errors to be identified and fixed.
Figure 2: Annotated in red, automated validation checks on datasets in CKAN.
# Validate tabular data automatically on GitHub
If your data is hosted on GitHub, you can use goodtables web service to automatically validate it on every change.
For this section, you will first need to create a GitHub repository and add tabular data to it.
Once you have tabular data in your Github repository:
- Login on goodtables.io using your GitHub account and accept the permissions confirmation.
- Once we’ve synchronized your repository list, go to the Manage Sources page and enable the repository with the data you want to validate.
- If you can’t find the repository, try clicking on the Refresh button on the Manage Sources page
Goodtables will then validate all tabular data files (CSV, XLS, XLSX, ODS) and data packages in the repository. These validations will be executed on every change, including pull requests.
# Validate tabular data automatically on Amazon S3
If your data is hosted on Amazon S3, you can use goodtables.io to automatically validate it on every change.
It is a technical process to set up, as you need to know how to configure your Amazon S3 bucket. However, once it’s configured, the validations happen automatically on any tabular data created or updated. Find the detailed instructions here.
# Custom setup of automatic tabular data validation
If you don’t use any of the officially supported data publishing platforms, you can use goodtables-py directly to validate your data. This is the most flexible option, as you can configure exactly when, and how your tabular data is validated. For example, if your data come from an external source, you could validate it once before you process it (so you catch errors in the source data), and once after cleaning, just before you publish it, so you catch errors introduced by your data processing.
The instructions on how to do this are technical, and can be found on https://github.com/frictionlessdata/goodtables-py.