
Validating Data

Tabular data (e.g. data stored in CSV and Excel worksheets) is one of the most common forms of data available on the web. This guide will walk through validating tabular data using Frictionless Data tooling.

This guide shows how you can validate your tabular data and check both:

  • Structure: are there too many (or too few) values in some rows or columns?
  • Schema: does the data fit its schema? Are the values in the date column actually dates? Are all the numbers greater than zero?

We will walk through two methods of performing validation:

Good Tables

Good Tables is a free, open-source, hosted service for validating tabular data. It checks the structure of your data and, optionally, its adherence to a specified schema, giving quick and simple feedback on where your tabular data may not yet be quite perfect.

Good Tables screenshot

To start, all you need to do is upload or provide a link to a CSV file and hit the “Validate” button.

Good Tables Provide URL

Good Tables Validate button

If your data is structurally valid, you should receive the following result:

Good Tables Valid

If not…

Good Tables Invalid

The report should highlight the structural issues found in your data for correction. For instance, a poorly structured tabular dataset may have a header row with more (or fewer) columns than its data rows.

You can also provide a schema for your tabular data defined using JSON Table Schema.

Good Tables Provide Schema

Briefly, the format allows users to specify not only the types of information within each column in a tabular dataset, but also expected values. For more information, see the JSON Table Schema guide or the full standard.
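For illustration, a minimal schema for a dataset with a date column and an amount column might look like the following, written here as a Python dictionary (the field names and constraints are made up for this example; the same structure can be saved to a schema.json file):

schema = {
    'fields': [
        {'name': 'date', 'type': 'date'},
        {'name': 'amount', 'type': 'number',
         # every amount is expected to be zero or greater
         'constraints': {'minimum': 0}}
    ]
}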

Python + GoodTables

GoodTables is also available as a Python library. The following short snippets demonstrate loading and validating data from a file called data.csv.

Validating Structure

from goodtables import processors

datafile = './data.csv'

# Run the structural checks against the CSV file
processor = processors.StructureProcessor(format='csv')
valid, report, data = processor.run(datafile)

# Generate a plain-text report, leaving out some of the more verbose fields
output_format = 'txt'
exclude = ['result_context', 'processor', 'row_name', 'result_category',
           'column_index', 'column_name', 'result_level']
out = report.generate(output_format, exclude=exclude)

print(out)

Validating Schema

from goodtables import processors

datafile = './data.csv'
schemafile = './schema.json'

# Check the values in data.csv against the schema defined in schema.json
processor = processors.SchemaProcessor(format='csv',
                                       schema=schemafile)
valid, report, data = processor.run(datafile)

# Generate a plain-text report, leaving out some of the more verbose fields
output_format = 'txt'
exclude = ['result_context', 'processor', 'row_name', 'result_category',
           'column_index', 'column_name', 'result_level']
out = report.generate(output_format, exclude=exclude)

print(out)

Continuous Data Integration

We can build on the existing Good Tables service and Python tooling to create the equivalent of a "continuous integration" service for data. In this model, data is validated for its structure and for adherence to a schema on every update. Behind the scenes, this is just a normal Travis CI configuration.
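As a rough sketch, the validation step in such a setup can be a small script, built from the same GoodTables calls shown above, that exits with a non-zero status when the data is invalid so that the CI build fails (the data and schema file names here are assumptions):

import sys

from goodtables import processors

# Validate data.csv against schema.json; a non-zero exit code makes the CI build fail
processor = processors.SchemaProcessor(format='csv', schema='./schema.json')
valid, report, data = processor.run('./data.csv')
sys.exit(0 if valid else 1)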

Data Valid

We have an example running here:

https://github.com/frictionlessdata/ex-continuous-data-integration

See the README.md for more information.