GoodTables
A simple yet powerful tool to ensure the quality of tabular data, in Python and on the command line.
GoodTables is a managed service to validate tabular data. It can check the structure of your data (e.g. all rows have the same number of columns), and its contents (e.g. all dates are valid). Internally, it uses the Data Quality Spec for common tabular data errors. GoodTables also supports data described by Data Package and Table Schema.
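To make the idea of a content check concrete, here is a minimal sketch in plain Python of the "all dates are valid" example. It is not how GoodTables implements its checks. The function name `invalid_dates`, the sample data, and the `%Y-%m-%d` date format are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def invalid_dates(text, column, fmt="%Y-%m-%d"):
    """Return (row_number, value) pairs for cells in `column` that are not valid dates.

    Hypothetical helper for illustration only; not part of GoodTables.
    """
    bad = []
    reader = csv.DictReader(io.StringIO(text))
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        value = row.get(column) or ""  # a missing cell is treated as an empty string
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append((row_number, value))
    return bad


# Made-up sample data: February 30th does not exist.
SAMPLE = """name,joined
Ada,2021-03-14
Bob,2021-02-30
"""

print(invalid_dates(SAMPLE, "joined"))  # [(3, '2021-02-30')]
```

A content check like this one looks at the values inside cells, whereas a structural check only looks at the shape of the table.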
Let’s visit the GoodTables website and log in with GitHub to start validating our data.
Add a data source in the dashboard using GitHub (Amazon S3 is also supported, but we’re only covering GitHub here):
INFO
We need to create a GitHub repository to store our helloworld.csv file. Make sure you use the valid CSV from our example above.
Because we have valid and well-structured data in our helloworld.csv, the results will come back as valid, as seen in the image below:
Now, let’s change to invalid tabular data and see what the checks return:
Name,Email,,Age
Jill,[email protected]
Jack,[email protected],33
23,Jane,[email protected], 22, 33
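As expected, this build fails because GoodTables detects several structural errors: “Blank Header” (the third header cell is empty), “Missing Value” (the rows for Jill and Jack have fewer cells than the header), and “Extra Value” (the last row has more cells than the header).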
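To see why these three errors are reported, here is a minimal sketch of the same structural checks using only Python's standard library. It illustrates the idea and is not GoodTables' own implementation. The function name `structural_errors`, the short error codes, and the placeholder e-mail addresses are assumptions. The sketch also reports one error per short row, which may differ from how GoodTables counts them.

```python
import csv
import io

# Stand-in for the invalid file above; the e-mail addresses are placeholders.
INVALID_CSV = """Name,Email,,Age
Jill,jill@example.com
Jack,jack@example.com,33
23,Jane,jane@example.com, 22, 33
"""


def structural_errors(text):
    """Return (row_number, error_code) pairs for a CSV document given as a string.

    Hypothetical helper for illustration only; not part of GoodTables.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    headers, errors = rows[0], []
    # A header cell that is empty after stripping whitespace is a blank header.
    for name in headers:
        if not name.strip():
            errors.append((1, "blank-header"))
    # Compare each data row's cell count with the header's cell count.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) < len(headers):
            errors.append((row_number, "missing-value"))
        elif len(row) > len(headers):
            errors.append((row_number, "extra-value"))
    return errors


for row_number, code in structural_errors(INVALID_CSV):
    print(f"row {row_number}: {code}")
```

For the sample above, this flags row 1 for the blank header, rows 2 and 3 for missing values, and row 4 for an extra value.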
Additionally, here’s a video walkthrough of the content outlined above