Frictionless Data Frictionless Data
Introduction
Projects
Universe
Adoption
People
Fellows (opens new window)
  • Architecture
  • Roadmap
  • Process
  • Get Help
  • Contribute
  • Code of Conduct
  • Events Calendar
  • Forum (opens new window)
  • Chat (Slack) (opens new window)
  • Chat (Matrix) (opens new window)
Blog
Introduction
Projects
Universe
Adoption
People
Fellows (opens new window)
  • Architecture
  • Roadmap
  • Process
  • Get Help
  • Contribute
  • Code of Conduct
  • Events Calendar
  • Forum (opens new window)
  • Chat (Slack) (opens new window)
  • Chat (Matrix) (opens new window)
Blog
  • Using Data Packages in R

    • Setup
      • Load
        • Reading Basic Metadata
          • Reading Data
            • Loading into an SQL database

            Using Data Packages in R

            February 14, 2018 by Kleanthis Koupidis
            Price icons created by Pixel perfect - Flaticon R

            Open Knowledge Greece (opens new window) was one of 2017’s Frictionless Data Tool Fund (opens new window) grantees tasked with extending implementation of core Frictionless Data libraries in R programming language. You can read more about this in their grantee profile (opens new window). In this tutorial, Kleanthis Koupidis (opens new window), a Data Scientist and Statistician at Open Knowledge Greece, explains how to work with Data Packages in R.

            This tutorial will show you how to install the R libraries for working with Tabular Data Packages and demonstrate a very simple example of loading a Tabular Data Package from the web and pushing it directly into a local SQL database and send query to retrieve results.

            TIP

            For a comprehensive introduction to creating tabular data packages in R, start by going through this tutorial.

            # Setup

            For this tutorial, we will need the Data Package R library (datapackage.r (opens new window)). Devtools library (opens new window) is also required to install the datapackage.r library from github.

                # Install devtools package if not already
                install.packages("devtools")
            

            And then install the development version of datapackage.r (opens new window) from github.

                devtools::install_github("frictionlessdata/datapackage-r")
            

            # Load

            You can start using the library by loading datapackage.r.

                library(datapackage.r)
            

            # Reading Basic Metadata

            In this case, we are using an example Tabular Data Package containing the periodic table stored on GitHub (opens new window) (datapackage.json (opens new window), data.csv (opens new window)). This dataset includes the atomic number, symbol, element name, atomic mass, and the metallicity of the element. Here are the first five rows:

                url = 'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/example_data/data.csv'
                pt_data = read.csv2(url, sep = ',')
                knitr::kable(head(pt_data, 5), align = 'c')
            
            atomic.number symbol name atomic.mass metal.or.nonmetal.
            1 H Hydrogen 1.00794 nonmetal
            2 He Helium 4.002602 noble gas
            3 Li Lithium 6.941 alkali metal
            4 Be Beryllium 9.012182 alkaline earth metal
            5 B Boron 10.811 metalloid

            Data Packages can be loaded either from a local path or directly from the web.

                url = 'https://raw.githubusercontent.com/okgreece/datapackage-r/master/vignettes/exampledata/package.json'
                datapackage = Package.load(url)
                datapackage$resources[[1]]$descriptor$profile = 'tabular-data-resource' # tabular resource descriptor profile
                datapackage$resources[[1]]$commit() # commit changes
                ## [1] TRUE
            

            At the most basic level, Data Packages provide a standardized format for general metadata (for example, the dataset title, source, author, and/or description) about your dataset. Now that you have loaded this Data Package, you have access to this metadata using the metadata dict attribute. Note that these fields are optional and may not be specified for all Data Packages. For more information on which fields are supported, see the full Data Package standard (opens new window).

                datapackage$descriptor$title
                ## [1] "Periodic Table"
            

            # Reading Data

            Now that you have loaded your Data Package, you can read its data. A Data Package can contain multiple files which are accessible via the resources attribute. The resources attribute is an array of objects containing information (e.g. path, schema, description) about each file in the package.

            You can access the data in a given resource in the resources array by reading the data attribute.

                table = datapackage$resources[[1]]$table
                periodic_table_data = table$read()
            

            You can further manipulate list objects in R by using

             [purrr](https://cran.r-project.org/package=purrr), [rlist](https://cran.r-project.org/package=rlist) packages.
            

            # Loading into an SQL database

            Tabular Data Packages (opens new window) contains schema information about its data using Table Schema (opens new window). This means you can easily import your Data Package into the SQL backend of your choice. In this case, we are creating an SQLite (opens new window) database.

            To create a new SQLite database and load the data into SQL we will need DBI (opens new window) package and RSQLite (opens new window) package, which contains SQLite (opens new window) (no external software is needed).

            You can install and load them by using:

                install.packages(c("DBI","RSQLite"))
                library(DBI)
                library(RSQLite)
            

            To create a new SQLite database, you simply supply the filename to dbConnect():

                dp.database = dbConnect(RSQLite::SQLite(), "") # temporary database
            

            We will use data.table (opens new window) package to convert the list object with the data to a data frame object to copy them to database table.

                # install data.table package if not already
                # install.packages("data.table")
                periodic_table_sql = data.table::rbindlist(periodic_table_data)
                periodic_table_sql = setNames(periodic_table_sql,unlist(datapackage$resources[[1]]$headers))
            

            You can easily copy an R data frame into a SQLite database with dbWriteTable():

                dbWriteTable(dp.database, "periodic_table_sql", periodic_table_sql)
                # show remote tables accessible through this connection
                dbListTables(dp.database)
                ## [1] "periodic_table_sql"
            

            The data are already to the database.

            We can further issue queries to hte database and return first 5 elements:

            
                dbGetQuery(dp.database, 'SELECT * FROM periodic_table_sql LIMIT 5')
                ##   atomic number symbol      name atomic mass   metal or nonmetal?
                ## 1             1      H  Hydrogen    1.007940             nonmetal
                ## 2             2     He    Helium    4.002602            noble gas
                ## 3             3     Li   Lithium    6.941000         alkali metal
                ## 4             4     Be Beryllium    9.012182 alkaline earth metal
                ## 5             5      B     Boron   10.811000            metalloid
            

            Or return all elements with an atomic number of less than 10:

                dbGetQuery(dp.database, 'SELECT * FROM periodic_table_sql WHERE "atomic number" < 10')
                ##   atomic number symbol      name atomic mass   metal or nonmetal?
                ## 1             1      H  Hydrogen    1.007940             nonmetal
                ## 2             2     He    Helium    4.002602            noble gas
                ## 3             3     Li   Lithium    6.941000         alkali metal
                ## 4             4     Be Beryllium    9.012182 alkaline earth metal
                ## 5             5      B     Boron   10.811000            metalloid
                ## 6             6      C    Carbon   12.010700             nonmetal
                ## 7             7      N  Nitrogen   14.006700             nonmetal
                ## 8             8      O    Oxygen   15.999400             nonmetal
                ## 9             9      F  Fluorine   18.998403              halogen
            

            More about using databases, SQLite in R you can find in vignettes of DBI (opens new window) and RSQLite (opens new window) packages.

            We welcome your feedback and questions via our Frictionless Data Gitter chat (opens new window) or via Github issues (opens new window) on the datapackage-r (opens new window) repository.

            Blog Index