Frictionless Data
                                      Welcome Frictionless Framework (v5)

                                      August 29, 2022 by Evgeny Karev

                                      We’re releasing the first beta of Frictionless Framework (v5)!
                                      Since the initial Frictionless Framework release we have been collecting feedback and analyzing both high-level user needs and bug reports to identify shortcomings and areas that could be improved in the next version of the framework. Once that process was done, we started working on a new v5 with the goal of making the framework more bullet-proof, easier to maintain, and simpler to use. Today, this version is almost stable and ready to be published. Let’s go through the main improvements we have made:

                                      # Improved Metadata

                                      This year we started working on the Frictionless Application; at the same time, we were thinking about the next steps for the Frictionless Standards. For both we need a well-defined and easy-to-understand metadata model. Part of it is already published as standards like Table Schema, and part of it is going to be published as standards like File Dialect and possibly validation/transform metadata.

                                      # Dialect

                                      In v4 of the framework we had the Control/Dialect/Layout concepts to describe resource details related to different formats and schemes, as well as tabular details like header rows. In v5 they are merged into a single concept called Dialect, which is going to be standardised as a File Dialect spec. Here is an example:

                                      # YAML

                                      header: true
                                      headerRows: [2, 3]
                                      commentChar: '#'
                                      csv:
                                        delimiter: ';'
                                      

                                      A dialect descriptor can be saved and reused within a resource. Technically, it’s possible to provide settings for different schemes and formats within one Dialect (e.g. for CSV and Excel), so you can create, for example, one reusable dialect for a whole data package. The legacy CSV Dialect spec is supported and will be supported forever, so it’s also possible to provide CSV properties on the root level:

                                      # YAML

                                      header: true
                                      delimiter: ';'
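
To picture how the root-level (legacy) form relates to the nested one, here is a minimal sketch that moves legacy CSV keys under a `csv` group. The helper name `nest_legacy_csv_keys` and the precedence rule are assumptions made for illustration, and the key list is taken from the CSV Dialect spec; this is not the framework's own implementation.

```python
# Hypothetical helper for illustration only; not part of the frictionless API.
# Keys from the legacy CSV Dialect spec that describe CSV parsing itself.
LEGACY_CSV_KEYS = {
    "delimiter",
    "lineTerminator",
    "quoteChar",
    "doubleQuote",
    "escapeChar",
    "nullSequence",
    "skipInitialSpace",
}


def nest_legacy_csv_keys(descriptor: dict) -> dict:
    """Return a copy of a dialect descriptor with root-level CSV keys moved under 'csv'."""
    result = {}
    csv_group = dict(descriptor.get("csv", {}))
    for key, value in descriptor.items():
        if key == "csv":
            continue
        if key in LEGACY_CSV_KEYS:
            # Assumption: an explicit nested value wins over a legacy root-level one.
            csv_group.setdefault(key, value)
        else:
            result[key] = value
    if csv_group:
        result["csv"] = csv_group
    return result


legacy = {"header": True, "delimiter": ";"}
print(nest_legacy_csv_keys(legacy))  # {'header': True, 'csv': {'delimiter': ';'}}
```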
                                      

                                      For performance and codebase maintainability reasons, some marginal Layout features such as skip/pick/limit/offsetFields/etc. have been removed completely. It’s possible to achieve the same results using the Pipeline concept as part of the transformation workflow.

                                      Read an article about Dialect Class for more information.

                                      # Checklist

                                      Checklist is a new concept introduced in v5. It’s basically a collection of validation steps and a few other settings that makes “validation rules” shareable. For example:

                                      # YAML

                                      checks:
                                        - type: ascii-value
                                        - type: row-constraint
                                          formula: id > 1
                                      skipErrors:
                                        - duplicate-label
                                      

                                      By keeping and sharing this checklist, it’s possible to tune data quality requirements for a single data file or a set of data files. This concept makes it possible to create data quality “libraries” within projects or domains. We can use a checklist for validation:

                                      # CLI

                                      frictionless validate table1.csv --checklist checklist.yaml
                                      frictionless validate table2.csv --checklist checklist.yaml
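
To make the idea of a reusable checklist concrete, here is a toy model of applying one checklist to several tables. It is only a sketch: `check_row` and `validate_rows` are hypothetical names, rows are plain dicts, and the formula is evaluated with a bare `eval` for brevity, which a real tool should replace with a safe expression evaluator. It does not reproduce the framework's actual checks.

```python
# Toy model for illustration; these helpers are hypothetical, not the frictionless API.
CHECKLIST = {
    "checks": [
        {"type": "ascii-value"},
        {"type": "row-constraint", "formula": "id > 1"},
    ],
    "skipErrors": ["duplicate-label"],
}


def check_row(row: dict, check: dict):
    """Return an error type for a failed check, or None if the row passes."""
    if check["type"] == "ascii-value":
        passed = all(str(value).isascii() for value in row.values())
    elif check["type"] == "row-constraint":
        # Field names are exposed as variables; builtins are disabled.
        passed = bool(eval(check["formula"], {"__builtins__": {}}, dict(row)))
    else:
        raise ValueError(f"unknown check type: {check['type']}")
    return None if passed else check["type"]


def validate_rows(rows: list, checklist: dict) -> list:
    """Apply every check to every data row, honouring skipErrors."""
    skipped = set(checklist.get("skipErrors", []))
    errors = []
    for number, row in enumerate(rows, start=1):  # 1-based data row number
        for check in checklist["checks"]:
            error = check_row(row, check)
            if error and error not in skipped:
                errors.append((number, error))
    return errors


table1 = [{"id": 1, "name": "english"}, {"id": 2, "name": "中国人"}]
table2 = [{"id": 5, "name": "plain"}]
print(validate_rows(table1, CHECKLIST))  # [(1, 'row-constraint'), (2, 'ascii-value')]
print(validate_rows(table2, CHECKLIST))  # []
```

The same `CHECKLIST` object is passed to both calls, which mirrors how one checklist file is shared between the two CLI invocations.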
                                      

                                      Here is a list of other changes:

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | Check(descriptor) | Check.from_descriptor(descriptor) |
                                      | check.code | check.type |

                                      Read an article about Checklist Class for more information.

                                      # Pipeline

                                      In v4 Pipeline was a complex concept similar to the validation Inquiry. We reworked it for v5 to be a lightweight set of transform steps that can be applied to a data resource or a data package. For example:

                                      # YAML

                                      steps:
                                        - type: table-normalize
                                        - type: cell-set
                                          fieldName: version
                                          value: v5
                                      

                                      Similar to the Checklist concept, Pipeline is a reusable (data-abstract) object that can be saved to a descriptor and used in some complex data workflow:

                                      # CLI

                                      frictionless transform table1.csv --pipeline pipeline.yaml
                                      frictionless transform table2.csv --pipeline pipeline.yaml
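
As a rough picture of what "a set of steps applied to a resource" means, the sketch below runs a pipeline descriptor over in-memory rows using a small dispatch table. The handler functions and `apply_pipeline` are hypothetical, and `table-normalize` is reduced to a placeholder that only copies rows; the real steps have their own semantics.

```python
# Toy model for illustration; handlers are hypothetical, not the frictionless API.
def cell_set(rows: list, step: dict) -> list:
    """Set every cell of one field to a fixed value."""
    return [{**row, step["fieldName"]: step["value"]} for row in rows]


def table_normalize(rows: list, step: dict) -> list:
    """Placeholder: only copies the rows (the real step does more)."""
    return [dict(row) for row in rows]


STEP_HANDLERS = {
    "cell-set": cell_set,
    "table-normalize": table_normalize,
}


def apply_pipeline(rows: list, pipeline: dict) -> list:
    """Run each step in order, feeding the output of one into the next."""
    for step in pipeline["steps"]:
        handler = STEP_HANDLERS[step["type"]]
        rows = handler(rows, step)
    return rows


PIPELINE = {
    "steps": [
        {"type": "table-normalize"},
        {"type": "cell-set", "fieldName": "version", "value": "v5"},
    ]
}

rows = [{"id": 1, "version": "v4"}, {"id": 2, "version": "v4"}]
print(apply_pipeline(rows, PIPELINE))
```

Because the pipeline holds no reference to any particular table, the same descriptor can be applied to any number of inputs.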
                                      

                                      Here is another list of changes:

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | Step(descriptor) | Step.from_descriptor(descriptor) |
                                      | step.code | step.type |

                                      Read an article about Pipeline Class for more information.

                                      # Resource

                                      There are no changes in the Resource related to the standards, although by default the type property is now used instead of profile to mark a resource as a table. This can be changed using the --standards v1 flag.

                                      It’s now possible to set Checklist and Pipeline as Resource properties, similar to Dialect and Schema:

                                      # YAML

                                      path: table.csv
                                      # ...
                                      checklist:
                                        checks:
                                          - type: ascii-value
                                          - type: row-constraint
                                            formula: id > 1
                                      pipeline:
                                        steps:
                                          - type: table-normalize
                                          - type: cell-set
                                            fieldName: version
                                            value: v5
                                      

                                      Or using dereference:

                                      # YAML

                                      path: table.csv
                                      # ...
                                      checklist: checklist.yaml
                                      pipeline: pipeline.yaml
                                      

                                      In this case validation/transformation will use them by default, which makes it possible to ship validation rules and transformation pipelines within resources and packages. This is an important development for data publishers who want to define what they consider to be valid for their datasets, as well as to share raw data together with cleaning pipeline steps:

                                      # CLI

                                      frictionless validate resource.yaml  # will use the checklist above
                                      frictionless transform resource.yaml  # will use the pipeline above
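
The dereferenced form of a resource descriptor can be pictured as a simple substitution: a string value is treated as a path to another descriptor and replaced by its content. The sketch below shows that idea with JSON files, since YAML parsing is not in the Python standard library. The `dereference` helper and its parameters are hypothetical, not the framework's API.

```python
import json
from pathlib import Path


def dereference(descriptor: dict, keys=("checklist", "pipeline"), basepath=".") -> dict:
    """Return a copy where string references under `keys` are replaced by loaded descriptors.

    Hypothetical helper for illustration; inline (dict) values are left untouched.
    """
    resolved = dict(descriptor)
    for key in keys:
        value = resolved.get(key)
        if isinstance(value, str):
            text = (Path(basepath) / value).read_text(encoding="utf-8")
            resolved[key] = json.loads(text)
    return resolved


# Usage, assuming checklist.json sits next to the resource descriptor:
# resource = dereference({"path": "table.csv", "checklist": "checklist.json"})
```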
                                      

                                      There are minor changes in the stats property. It now uses named keys to simplify hash distinction (md5 and sha256 are calculated by default and, for performance reasons, this can no longer be changed as it could in v4):

                                      # Python

                                      from frictionless import describe
                                      resource = describe('table.csv', stats=True)
                                      print(resource.stats)
                                      
                                      {'md5': '6c2c61dd9b0e9c6876139a449ed87933',
                                       'sha256': 'a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8',
                                       'bytes': 30,
                                       'fields': 2,
                                       'rows': 2}
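
For readers who want to see where such numbers come from, here is a standard-library sketch that computes the same named keys for a small CSV file. `compute_stats` is a hypothetical helper; it assumes a UTF-8 file whose first row is the header, and it is not the framework's implementation.

```python
import csv
import hashlib
import io
from pathlib import Path


def compute_stats(path: str) -> dict:
    """Compute named stats for a UTF-8 CSV file whose first row is a header."""
    data = Path(path).read_bytes()
    # newline="" lets the csv module handle line endings itself.
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
        "fields": len(rows[0]) if rows else 0,
        "rows": max(len(rows) - 1, 0),  # data rows, header excluded
    }


# Usage: print(compute_stats("table.csv"))
```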
                                      

                                      Here is a list of other changes:

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | for row in resource: | for row in resource.row_stream: |

                                      Read an article about Resource Class for more information.

                                      # Package

                                      There are no changes in the Package related to the standards although it’s now possible to use resource dereference:

                                      # YAML

                                      name: package
                                      resources:
                                        - resource1.yaml
                                        - resource2.yaml
                                      

                                      Read an article about Package Class for more information.

                                      # Catalog

                                      Catalog is a new concept that is a collection of data packages that can be written inline or using dereference:

                                      # YAML

                                      name: catalog
                                      packages:
                                        - package1.yaml
                                        - package2.yaml
                                      

                                      Read an article about Catalog Class for more information.

                                      # Detector

                                      Detector is now a metadata class (it wasn’t in v4), so it can be saved and shared like other metadata classes:

                                      # Python

                                      from frictionless import Detector
                                      detector = Detector(sample_size=1000)
                                      print(detector)
                                      
                                      {'sampleSize': 1000}
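
The printed descriptor above uses a camelCase key for a snake_case argument. One way to picture such an export is the sketch below, where `ToyDetector` is a hypothetical stand-in with placeholder defaults, and where omitting unchanged defaults is an assumption suggested by the output shown; it is not the framework's Detector class.

```python
from dataclasses import dataclass, fields


def to_camel_case(name: str) -> str:
    """Convert a snake_case name to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


@dataclass
class ToyDetector:
    # Placeholder defaults chosen for illustration only.
    sample_size: int = 100
    buffer_size: int = 10000

    def to_descriptor(self) -> dict:
        """Export settings that differ from their defaults, using camelCase keys."""
        return {
            to_camel_case(f.name): getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) != f.default
        }


print(ToyDetector(sample_size=1000).to_descriptor())  # {'sampleSize': 1000}
```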
                                      

                                      Read an article about Detector Class for more information.

                                      # Inquiry

                                      There are a few changes in the Inquiry concept, which is known for its use in the Frictionless Repository project:

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | inquiryTask.source | inquiryTask.path |
                                      | inquiryTask.source | inquiryTask.resource |
                                      | inquiryTask.source | inquiryTask.package |

                                      Read an article about Inquiry Class for more information.

                                      # Report

                                      The Report concept has been significantly simplified by removing the resource property from reportTask. It has been replaced by the name/type/place/labels properties. Also, report.time is now report.stats.seconds. The report.warnings and reportTask.warnings properties (List[str]) have been added to provide non-error information such as reached limits:

                                      # CLI

                                      frictionless validate table.csv --yaml
                                      
                                      valid: true
                                      stats:
                                        tasks: 1
                                        warnings: 0
                                        errors: 0
                                        seconds: 0.091
                                      warnings: []
                                      errors: []
                                      tasks:
                                        - valid: true
                                          name: table
                                          type: table
                                          place: table.csv
                                          labels:
                                            - id
                                            - name
                                          stats:
                                            md5: 6c2c61dd9b0e9c6876139a449ed87933
                                            sha256: a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8
                                            bytes: 30
                                            fields: 2
                                            rows: 2
                                            warnings: 0
                                            errors: 0
                                            seconds: 0.091
                                          warnings: []
                                          errors: []
                                      

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | report.time | report.stats.seconds |
                                      | reportTask.time | reportTask.stats.seconds |
                                      | reportTask.resource.name | reportTask.name |
                                      | reportTask.resource.profile | reportTask.type |
                                      | reportTask.resource.path | reportTask.place |
                                      | reportTask.resource.schema | reportTask.labels |
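
Code that reads reports needs to follow the renamed properties. As a small sketch, the function below summarizes a v5-shaped report that has already been loaded into a dict (for example from the YAML output shown earlier). `summarize_report` is a hypothetical helper and only touches keys that appear in that output.

```python
from typing import List


def summarize_report(report: dict) -> List[str]:
    """Build one summary line per task from a v5-shaped report dict."""
    lines = []
    for task in report["tasks"]:
        status = "valid" if task["valid"] else "invalid"
        stats = task["stats"]
        lines.append(
            f"{task['place']} ({task['type']}): {status}, "
            f"{stats['rows']} rows, {stats['errors']} errors, {stats['seconds']}s"
        )
    return lines


report = {
    "valid": True,
    "stats": {"tasks": 1, "warnings": 0, "errors": 0, "seconds": 0.091},
    "tasks": [
        {
            "valid": True,
            "name": "table",
            "type": "table",
            "place": "table.csv",
            "labels": ["id", "name"],
            "stats": {"rows": 2, "errors": 0, "seconds": 0.091},
        }
    ],
}
print(summarize_report(report))  # ['table.csv (table): valid, 2 rows, 0 errors, 0.091s']
```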

                                      Read an article about Report Class for more information.

                                      # Schema

                                      Changes in the Schema class:

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | Schema(descriptor) | Schema.from_descriptor(descriptor) |

                                      # Error

                                      There are a few changes in the Error data structure:

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | error.code | error.type |
                                      | error.name | error.title |
                                      | error.rowPosition | error.rowNumber |
                                      | error.fieldPosition | error.fieldNumber |
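
If you have stored v4 error records, the renames listed above can be applied mechanically. The sketch below shows one way to do that for plain dicts; `upgrade_error` is a hypothetical helper, and it only renames keys without validating the result.

```python
# Key renames taken from the From (v4) / To (v5) list for errors.
ERROR_KEY_RENAMES = {
    "code": "type",
    "name": "title",
    "rowPosition": "rowNumber",
    "fieldPosition": "fieldNumber",
}


def upgrade_error(descriptor: dict) -> dict:
    """Return a copy of a v4-style error dict with v5 key names."""
    return {ERROR_KEY_RENAMES.get(key, key): value for key, value in descriptor.items()}


old = {"code": "type-error", "name": "Type Error", "rowPosition": 3, "fieldPosition": 2}
print(upgrade_error(old))
```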

                                      # Types

                                      Note that in v5 all the metadata entities that have multiple implementations are based on a unified type model. This means that they use the type property to provide type information:

                                      | From (v4) | To (v5) |
                                      | --- | --- |
                                      | resource.profile | resource.type |
                                      | check.code | check.type |
                                      | control.code | control.type |
                                      | error.code | error.type |
                                      | field.type | field.type |
                                      | step.code | step.type |

                                      The new v5 version still supports the old notation in descriptors for backward compatibility.

                                      # Improved Model

                                      For many years, and for historical reasons, Frictionless mixed declarative metadata and an object model. Since the first implementation of the datapackage library we have used different approaches to sync internal state in order to provide both interfaces, descriptor and object model. In Frictionless Framework v4 this technique was taken to a really sophisticated level with special observable dictionary classes. It was quite smart and nice to use for quick prototyping in a REPL, but it was really hard to maintain and error-prone.

                                      In Framework v5 we finally decided to follow the “right way” of handling this problem and split descriptors and the object model completely.

                                      # Descriptors

                                      In the Frictionless world we deal with a lot of declarative metadata descriptors such as packages, schemas, pipelines, etc. Nothing changes in v5 in this regard. For example, here is a Table Schema:

                                      # YAML

                                      fields:
                                        - name: id
                                          type: integer
                                        - name: name
                                          type: string
                                      

                                      # Object Model

                                      The difference appears when we create a metadata instance based on this descriptor. In v4 all the metadata classes were subclasses of the dict class, providing a mix between a descriptor and an object model for state management. In v5 there is a clear boundary between the descriptor and the object model. All the state is managed as it should be in a normal Python class, using class attributes:

                                      # Python

                                      from frictionless import Schema
                                      schema = Schema.from_descriptor('schema.yaml')
                                      # Here we deal with a proper object model
                                      descriptor = schema.to_descriptor()
                                      # Here we export it back to be a descriptor
                                      

                                      There are a few important traits of the new model:

                                        • it’s not possible to create a metadata instance from an invalid descriptor
                                        • it’s almost always guaranteed that a metadata instance is valid
                                        • it’s not possible to mix dicts and classes in methods like package.add_resource
                                        • it’s not possible to export an invalid descriptor

                                      This separation might require a few additional lines of code, but it gives us much less fragile programs in the end. It’s especially important for software integrators who want to be sure that they write working code. At the same time, for quick prototyping and discovery Frictionless still provides high-level actions, like the validate function, that are more forgiving regarding user input.
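
The separation described above can be illustrated with a toy class that validates on the way in and exports a plain dict on the way out. Everything here (`ToySchema`, `ToyField`, `DescriptorError`, and the reduced list of field types) is hypothetical and much simpler than the framework's real classes; it only demonstrates the boundary between descriptor and object model.

```python
from dataclasses import dataclass, field
from typing import List

# A subset of Table Schema field types, kept short for illustration.
KNOWN_TYPES = {"any", "string", "integer", "number", "boolean", "date"}


class DescriptorError(Exception):
    """Raised when a descriptor cannot be turned into an object."""


@dataclass
class ToyField:
    name: str
    type: str = "any"


@dataclass
class ToySchema:
    fields: List[ToyField] = field(default_factory=list)

    @classmethod
    def from_descriptor(cls, descriptor: dict) -> "ToySchema":
        """Validate a plain dict and build an object; invalid input never becomes an instance."""
        parsed = []
        for item in descriptor.get("fields", []):
            if not isinstance(item, dict) or "name" not in item:
                raise DescriptorError(f"invalid field descriptor: {item!r}")
            type_name = item.get("type", "any")
            if type_name not in KNOWN_TYPES:
                raise DescriptorError(f"unknown field type: {type_name!r}")
            parsed.append(ToyField(name=item["name"], type=type_name))
        return cls(fields=parsed)

    def to_descriptor(self) -> dict:
        """Export the object state back to a plain dict."""
        return {"fields": [{"name": f.name, "type": f.type} for f in self.fields]}


descriptor = {"fields": [{"name": "id", "type": "integer"}, {"name": "name", "type": "string"}]}
schema = ToySchema.from_descriptor(descriptor)
print(schema.to_descriptor() == descriptor)  # True
```

State lives in ordinary attributes (`schema.fields`), so there is no hidden synchronisation between a dict and an object to maintain.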

                                      # Static Typing

                                      One of the most important consequences of “fixing” state management in Frictionless is our new ability to provide static typing for the framework codebase. This work is in progress, but we have already added a lot of types and the codebase successfully passes pyright validation. We highly recommend enabling pyright in your IDE to see all the type problems in advance:

                                      [Image: type-error]

                                      # Livemark Docs

                                      We’re happy to announce that we’re finally ready to drop the JavaScript dependency for docs generation, as we have migrated to Livemark. Moreover, Livemark’s ability to execute scripts inside the documentation and other nifty features like simple Tabs or a reference generator will save us hours and hours when writing better docs.

                                      # Script Execution

                                      [Image: livemark-1]

                                      # Reference Generation

                                      [Image: livemark-2]

                                      # Happy Contributors

                                      We hope that the Livemark docs writing experience will make our contributors happier and allow us to grow our community of Frictionless Authors and Users. Let’s chat in our Slack if you have questions or just want to say hi.

                                      Read Livemark Docs for more information.
