      # Guide

      TIP

      This guide is still early-stage. We are currently consolidating our existing materials into one place here.

      # Introduction

      # What is Frictionless Data?

      Frictionless Data is a progressive framework for building data infrastructure – data management, integration, flows, etc.

      Unlike some other frameworks, Frictionless is designed from the ground up to be both incrementally adoptable and “progressive”. Its purpose is to work with, build on, and enhance your existing data and tooling (rather than replacing it). It is also extremely lightweight!

      The core of the framework is a suite of ultra-simple patterns to describe and organize data. This allows the data to flow fluidly between tools and across teams. The patterns have been refined to zen-like simplicity, and they can be picked up in minutes and immediately integrated with other libraries or existing projects.

      At the same time, Frictionless Data is also perfectly capable of powering sophisticated data workflows when used in combination with modern tooling and supporting libraries. This is possible because the framework follows an “atomic” approach to data[1] – you can keep things minimal for simple solutions or combine components for more complex ones.

      If you’d like to learn more about Frictionless before diving in, we created a video walking through the core principles.

      # Getting Started

      INFO

      The official guide assumes some basic knowledge about data. If you are completely new to working with data (for example, you haven’t heard of CSV or JSON, or have never used a spreadsheet), it may be a good idea to learn some of those basics first and then come back! A good starting point is the first module, “What is Data?”, at School of Data.

      # Declarative Data

      At the core of Frictionless is a system that enables us to declaratively describe data (and datasets) using a straightforward syntax.

      # Table

      A table is a collection of related data represented in rows and columns. In a table, the intersection between a row and a column is called a cell. Tables are widely used in different contexts and fields, ranging from data analysis to data research.

      Tables come in several common formats, such as CSV, JSON, and Excel. Here is the same small table in each of these representations.

      As CSV:

      Name,Email,Age
      Jill,[email protected],25
      Jack,[email protected],33
      

      As JSON (here showing the first row as an object):

      {
        "name": "Jill",
        "email": "[email protected]",
        "age": "25"
      }
      

      And as it appears when opened in a spreadsheet program such as Excel:

      Name Email Age
      Jill [email protected] 25
      Jack [email protected] 33
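      To make the idea concrete, here is a small sketch (plain Python, standard library only, no Frictionless tooling yet) that reads the CSV version of this table into rows. Notice that every value comes back as a string, which is exactly the kind of gap that Table Schema, introduced next, is designed to fill:

      import csv
      import io

      # The same table as CSV text
      data = "Name,Email,Age\nJill,[email protected],25\nJack,[email protected],33\n"

      # Each row comes back as a dict keyed by the header row;
      # note that every value, including Age, is still just a string
      for row in csv.DictReader(io.StringIO(data)):
          print(row)
      # {'Name': 'Jill', 'Email': '[email protected]', 'Age': '25'}
      # {'Name': 'Jack', 'Email': '[email protected]', 'Age': '33'}
      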

      # Table Schema

      Table Schema is a specification for providing a “schema” (similar to a database schema) for tabular data. This information includes the expected type of each value in a column (“string”, “number”, “date”, etc.), constraints on the value (“this string can only be at most 10 characters long”), and the expected format of the data (“this field should only contain strings that look like email addresses”). Table Schema can also specify relations between tables.

      Here’s our simple helloworld.csv CSV (you can paste this in a local file):

      Name,Email,Age
      Jill,[email protected],25
      Jack,[email protected],33
      

      As a table, it looks like this:

      Name Email Age
      Jill [email protected] 25
      Jack [email protected] 33

      And here’s a Table Schema (in JSON) to describe that file. Note that a minimum age of 18 is specified in the Age column, and a string that looks like an email address must be present in the Email column:

      {
        "fields": [
          {
            "name": "Name",
            "type": "string",
            "description": "User’s name"
          },
          {
            "name": "Email",
            "type": "string",
            "format": "email",
            "description": "User’s email"
          },
          {
            "name": "Age",
            "type": "integer",
            "description": "User’s age",
            "constraints": {
              "minimum": 18
            }
          }
        ]
      }
      

      Copy and paste this into a file called tableschema.json in the same directory as your CSV file.

      Well done! 👏 You have just created your very first Frictionless Data Table Schema!
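
      If you would like to work with this schema from code, here is a minimal sketch using the tableschema Python library (installed separately with pip install tableschema); the exact API can vary between versions:

      from tableschema import Schema

      # Load the descriptor we just wrote
      schema = Schema('tableschema.json')
      print(schema.valid)        # True if the descriptor itself is well-formed
      print(schema.field_names)  # ['Name', 'Email', 'Age']

      # Cast a raw CSV row against the declared types and constraints
      print(schema.cast_row(['Jill', '[email protected]', '25']))
      # ['Jill', '[email protected]', 25]  (Age is now an integer)
      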

      # Validation

      Having errors in your data is not uncommon. They also often get in the way of quick and timely data analysis for many data users. Validating data helps ease the process of collecting data by checking the quality and validity of a data source before publishing it.

      Let’s take a look at how to validate the tabular data we created in the previous section using the GoodTables Python and JavaScript libraries. GoodTables is a set of libraries and a command-line tool for validating tabular data. These libraries exist to identify structural and content errors in your tabular data so they can be fixed quickly. For example, a table schema contains information on fields and their assigned data types, making it possible to highlight misplaced data types (e.g. a string in an age column where an integer is expected, or an integer in an email column where a string is expected).

      Here’s an example of how to validate tabular data using the Python library. Using your terminal, install goodtables with the package manager pip:

      pip install goodtables
      

      You can see a list of options by using the --help argument:

      goodtables --help
      

      To validate our data, we need to run the goodtables command followed by the path to the file:

      goodtables helloworld.csv
      

      Goodtables supports CSV, XLS, XLSX, ODS, and JSON. After running the command above, we get the following validation report, which follows the GoodTables JSON report schema:

      DATASET
      =======
      {
       'error-count': 0,
       'preset': 'nested',
       'table-count': 1,
       'time': 0.104,
       'valid': True
      }
      TABLE [1]
      =========
      {
       'encoding': 'utf-8',
       'error-count': 0,
       'format': 'csv',
       'headers': ['name', 'email', 'age'],
       'row-count': 3,
       'scheme': 'file',
       'source': 'helloworld.csv',
       'time': 0.003,
       'valid': True
      }
      

      Now, consider the following CSV with invalid data. Let’s check for structural or content errors in the tabular data:

      id,name,age,
      1,John,24,[email protected]
      1,Jane,14,[email protected]
      1,Jane,14,[email protected]
      ,Jane,22,7
      

      Save this as invalid.csv and run goodtables invalid.csv. This time the validation report flags the problems:

      DATASET
      =======
      {
       'error-count': 2,
       'preset': 'nested',
       'table-count': 1,
       'time': 0.105,
       'valid': False
      }
      TABLE [1]
      =========
      {
       'encoding': 'utf-8',
       'error-count': 2,
       'format': 'csv',
       'headers': ['id', 'name', 'age', ''],
       'row-count': 5,
       'scheme': 'file',
       'source': 'invalid.csv',
       'time': 0.003,
       'valid': False
      }
      ---------
      [-,4] [blank-header] Header in column 4 is blank
      [4,-] [duplicate-row] Row 4 is duplicated to row(s) 3
      

      You might notice that this validation report looks a bit different. The two lines at the bottom of the report, blank-header and duplicate-row, are structural errors.
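
      The same report can be produced and inspected from Python rather than the command line. Here is a minimal sketch using the goodtables library installed earlier; the report keys follow the output shown above, though details can vary between versions:

      from goodtables import validate

      # Validate the file with the structural problems
      report = validate('invalid.csv')
      print(report['valid'])        # False
      print(report['error-count'])  # 2

      # Each table in the report carries a list of error dictionaries
      for error in report['tables'][0]['errors']:
          print(error['code'], error['message'])
      # blank-header Header in column 4 is blank
      # duplicate-row Row 4 is duplicated to row(s) 3
      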

      Now, let’s see how we can do the same with the Goodtables JavaScript library. First, install the goodtables package:

      npm install goodtables
      

      After installing the package, let’s create an example. Create an index.js file and add the following:

      const goodtables = require('goodtables');

      async function validate () {
        // Validate a CSV file
        const source = 'helloworld.csv'
        const report = await goodtables.validate(source)
        console.log(report)
      }

      validate();
      

      Running the script (for example with node index.js) prints the validation report. In this case the report comes back invalid:

      { 'error-count': 1,
        preset: 'nested',
        'table-count': 1,
        tables:
         [ { encoding: null,
             'error-count': 1,
             errors: [Array],
             format: null,
             headers: [],
             'row-count': 0,
             schema: null,
             scheme: 'http',
             source: 'helloworld.csv',
             time: 0.001,
             valid: false } ],
        time: 0.004,
        valid: false,
        warnings: []
       }
      

      Additionally, here’s a video walkthrough of the content outlined above.

      NOTE

      We can also use the Goodtables online tool to validate any tabular data.

      Let’s head over to the GoodTables website and log in with GitHub to start the process of validating our data.

      [Image: the goodtables dashboard]

      Add a data source in the dashboard using GitHub (Amazon S3 is also supported, but we’re only covering GitHub here):

      INFO

      We need to create a GitHub repository to store our helloworld.csv file. Make sure you use the valid CSV from our example above.

      [Image: adding a source to goodtables]

      Because we have valid and well-structured data in our helloworld.csv, the results will come back as valid, as seen in the image below.

      [Image: valid data]

      Now, let’s change to invalid tabular data and see what the checks return:

      Name,Email,,Age
      Jill,[email protected]
      Jack,[email protected],33
      23,Jane,[email protected], 22, 33
      

      [Image: invalid data]

      Of course, this build will fail because some structural errors were detected by GoodTables (“Blank Header”, “Missing value”, and “Extra Value”).

      Additionally, here’s a video walkthrough of the content outlined above.

      # Tabulator

      Tabulator is a consistent interface for reading and writing streams of tabular data, in Python and on the command line. It is designed to help you deal with data flowing in from diverse sources: imagine having to work with some data that is human-generated and some that is machine-generated, while handling a range of issues related to formatting, encoding, and markup. This is exactly what Tabulator was built to help with.

      Tabulator is a useful building block for data fetching and data processing. It provides a clean, structured stream of data covering Excel, CSV, SQL, Google Sheets, etc.

      Here’s an example of using the Tabulator library to read tabular data. First, we need the library installed:

      pip install tabulator
      

      Let’s run an example using Python:

      import tabulator

      # Stream the CSV, treating the first row as the header
      with tabulator.Stream('helloworld.csv', headers=1) as stream:
          print(stream.headers)  # ['Name', 'Email', 'Age']
          for row in stream:
              print(row)  # ['Jill', '[email protected]', '25'], ...
      
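
      Beyond iterating row by row, a stream can also hand back the whole table at once, with each row keyed by its header. A minimal sketch (the exact API can vary between tabulator versions):

      import tabulator

      # Read the whole table in one go, each row as a dict keyed by its header
      with tabulator.Stream('helloworld.csv', headers=1) as stream:
          rows = stream.read(keyed=True)
          print(rows[0])  # {'Name': 'Jill', 'Email': '[email protected]', 'Age': '25'}
      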

      Tabulator also ships with a CLI tool, which can be used to read tabular data directly:

      tabulator https://github.com/frictionlessdata/tabulator-py/raw/4c1b3943ac98be87b551d87a777d0f7ca4904701/data/table.csv.gz
      id, name
      1, english
      2,中国人
      

      # Ready for More?

      We’ve briefly introduced the core concepts of Frictionless Data - the rest of the guide will cover more concepts and tooling in much more detail. For the next section, let’s move on and learn about the Data Resource.


      1. We have borrowed the concept of Atomic Data from the web design field. For us, it means tools or specs are a) broken down into their minimum viable components, and b) these components are combinable into larger and more complex components and systems. The Atomic approach is what underpins the incremental adoptability and the ability to scale from the simplest situation to highly complex data engineering. ↩︎

