Using Data Packages in Clojure
Matt Thompson was one of the 2017 Frictionless Data Tool Fund grantees, tasked with extending the implementation of the core Frictionless Data data package and table schema libraries in the Clojure programming language. You can read more about this in his grantee profile. In this post, Thompson shows you how to set up and use the Clojure libraries for working with Tabular Data Packages.
This tutorial uses a worked example of downloading a data package from a remote location on the web, and using the Frictionless Data tools to read its contents and metadata into Clojure data structures.
# Setup
First, we need to set up the project structure using the Leiningen tool. If you don’t have Leiningen set up on your system, follow the link to download and install it. Once it is set up, run the following command from the command line to create the folders and files for a basic Clojure project:
lein new periodic-table
This will create the periodic-table folder. Inside the periodic-table/src/periodic_table folder (Leiningen replaces the hyphen with an underscore in directory names) should be a file named core.clj. This is the file you need to edit during this tutorial.
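Before the code later in this tutorial can be started with lein run, Leiningen needs to know which namespace holds the -main function. A minimal sketch of the project.clj file, assuming a recent Clojure release (clojure.spec.alpha ships with Clojure 1.9 and later); the Maven coordinates of the Frictionless Data libraries are not given in this post, so they are left as a comment rather than guessed:

```clojure
(defproject periodic-table "0.1.0-SNAPSHOT"
  :description "Reads the Periodic Table Data Package"
  ;; The version number is an assumption; any release from 1.9 onwards
  ;; includes clojure.spec.alpha.
  :dependencies [[org.clojure/clojure "1.10.1"]
                 ;; Add the Frictionless Data datapackage and tableschema
                 ;; libraries here, using the coordinates listed in each
                 ;; repository's README.
                 ]
  ;; Tells `lein run` which namespace contains -main.
  :main periodic-table.core)
```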
# The Data
For this tutorial, we will use a pre-created data package, the Periodic Table Data Package hosted by the Frictionless Data project. A Data Package is a simple container format used to describe and package a collection of data. It consists of two parts:
- Metadata that describes the structure and contents of the package
- Resources such as data files that form the contents of the package
Our Clojure code will download the data package and process it using the metadata information contained in the package. The data package can be found here on GitHub.
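To make the two parts concrete, here is a rough sketch of how a descriptor might look once it has been read into a Clojure map. The keys follow the Data Package specification (name, title, resources, schema, fields), but the values below are illustrative assumptions, not a copy of the real periodic-table descriptor; field-names is a hypothetical helper.

```clojure
;; Illustrative descriptor only: the real periodic-table descriptor may
;; contain more entries or different values.
(def example-descriptor
  {:name "periodic-table"
   :title "Periodic Table"
   ;; Each resource points at a data file and describes its columns.
   :resources
   [{:name "data"
     :path "data.csv"
     :schema {:fields [{:name "atomic number" :type "integer"}
                       {:name "symbol" :type "string"}
                       {:name "name" :type "string"}
                       {:name "atomic mass" :type "number"}
                       {:name "metal or nonmetal?" :type "string"}]}}]})

(defn field-names
  "Hypothetical helper: returns the column names of the first resource."
  [descriptor]
  (map :name (get-in descriptor [:resources 0 :schema :fields])))

(println (field-names example-descriptor))
```

The metadata (name, title, schema) sits next to the pointer to the data file (path), which is what lets a tool read the CSV and know what type each column should have.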
The data package contains data about elements in the periodic table, including each element’s name, atomic number, symbol and atomic weight. The table below shows a sample taken from the first three rows of the CSV file:
atomic number | symbol | name | atomic mass | metal or nonmetal? |
---|---|---|---|---|
1 | H | Hydrogen | 1.00794 | nonmetal |
2 | He | Helium | 4.002602 | noble gas |
3 | Li | Lithium | 6.941 | alkali metal |
# Loading the Data Package
The first step is to load the data package into a Clojure data structure (a map). To do this, we require the data package library in our code (giving it the alias dp) and then use its load function to load the data package into our project. Enter the following code into the core.clj file:
(ns periodic-table.core
  (:require [frictionlessdata.datapackage :as dp]
            [frictionlessdata.tableschema :as ts]
            [clojure.spec.alpha :as s]))
(def pkg
  (dp/load "https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"))
This pulls the data in from the remote GitHub location and converts the metadata into a Clojure map. We can access this metadata by using the descriptor function along with keys such as :name and :title to get the relevant information:
(println (str "Package name: " (dp/descriptor pkg :name)))
(println (str "Package title: " (dp/descriptor pkg :title)))
The package descriptor contains metadata that describes the contents of the data package. What about accessing the data itself? We can get to it using the get-resources function:
(def table (dp/get-resources pkg :data))
(doseq [row table]
  (println row))
The above code locates the data in the data package, then goes through it line by line and prints the contents.
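The exact shape of the rows returned by get-resources is not shown in this post. As a sketch under the assumption that each row arrives as a plain vector of column values, the standard zipmap function can pair those values with column keys to give one map per element. The header keys and sample-rows below are made up for illustration from the sample table earlier in this tutorial.

```clojure
;; Assumed column keys, in the same order as the CSV columns.
(def header [:atomic-number :symbol :name :atomic-mass :metal-or-nonmetal])

;; The first three data rows from the sample table, still as raw strings.
(def sample-rows
  [["1" "H" "Hydrogen" "1.00794" "nonmetal"]
   ["2" "He" "Helium" "4.002602" "noble gas"]
   ["3" "Li" "Lithium" "6.941" "alkali metal"]])

;; zipmap pairs each key in header with the value at the same position.
(def records (map #(zipmap header %) sample-rows))

(doseq [record records]
  (println (:symbol record) "is" (:name record)))
```

Note that every value is still a string at this point, which is why a casting step is useful before doing arithmetic or comparisons on the data.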
# Casting Types with clojure.spec
We can use Clojure’s spec library to define a schema for our data, which can then be used to cast the types of the data in the CSV file.
Below is a spec description of a periodic element type, consisting of an atomic number, atomic symbol, the element’s name, its mass, and its classification as a metal or nonmetal:
(s/def ::number int?)
(s/def ::symbol string?)
(s/def ::name string?)
(s/def ::mass float?)
(s/def ::metal string?)
(s/def ::element (s/keys :req [::number ::symbol ::name ::mass ::metal]))
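To see what this spec accepts and rejects, it can be tried out on its own with the standard s/valid? and s/explain-str functions from clojure.spec.alpha, without involving the data package libraries at all. This is a stand-alone sketch meant for a REPL; hydrogen and bad-hydrogen are sample values made up for illustration.

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::number int?)
(s/def ::symbol string?)
(s/def ::name string?)
(s/def ::mass float?)
(s/def ::metal string?)
(s/def ::element (s/keys :req [::number ::symbol ::name ::mass ::metal]))

;; A map whose values already have the right types.
(def hydrogen
  {::number 1 ::symbol "H" ::name "Hydrogen" ::mass 1.00794 ::metal "nonmetal"})

;; The same map with the atomic number left as an uncast string.
(def bad-hydrogen (assoc hydrogen ::number "1"))

(println (s/valid? ::element hydrogen))      ; the typed map conforms
(println (s/valid? ::element bad-hydrogen))  ; the string "1" fails int?
(println (s/explain-str ::element bad-hydrogen))
```

The second check fails because raw CSV values are strings, which is the gap the casting step is meant to close.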
The above spec can be used to cast values in our tabular data so that they match the specified schema. The example below shows our tabular data values being cast to fit the spec description. Then the -main function loops through the elements, printing only those with an atomic mass below 10.
(ns periodic-table.core
  (:require [frictionlessdata.datapackage :as dp]
            [frictionlessdata.tableschema :as ts]
            [clojure.spec.alpha :as s]))
;; Spec describing one row (element) of the periodic table
(s/def ::number int?)
(s/def ::symbol string?)
(s/def ::name string?)
(s/def ::mass float?)
(s/def ::metal string?)
(s/def ::element (s/keys :req [::number ::symbol ::name ::mass ::metal]))
(def pkg
  (dp/load "https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json"))
(def resources (dp/get-resources pkg :data))
;; The spec is registered under the namespaced keyword ::element
(def elements (dp/cast resources ::element))
(defn -main []
  (doseq [e elements]
    ;; The keys are namespaced, so look up ::mass rather than :mass
    (when (< (::mass e) 10)
      (println e))))
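How the library performs the cast internally is not covered in this post. As a rough, library-free sketch of what casting amounts to, the snippet below converts raw CSV strings into typed values by hand and checks the result against the same spec. row->element and sample-rows are hypothetical names introduced for illustration, and the snippet assumes each row is a vector of five strings in column order.

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::number int?)
(s/def ::symbol string?)
(s/def ::name string?)
(s/def ::mass float?)
(s/def ::metal string?)
(s/def ::element (s/keys :req [::number ::symbol ::name ::mass ::metal]))

(defn row->element
  "Hypothetical helper: casts one row of CSV strings to typed values."
  [[number sym element-name mass metal]]
  {::number (Long/parseLong number)     ; "1" becomes the integer 1
   ::symbol sym
   ::name element-name
   ::mass (Double/parseDouble mass)     ; "1.00794" becomes a double
   ::metal metal})

(def sample-rows
  [["1" "H" "Hydrogen" "1.00794" "nonmetal"]
   ["2" "He" "Helium" "4.002602" "noble gas"]
   ["3" "Li" "Lithium" "6.941" "alkali metal"]])

(def cast-elements (map row->element sample-rows))

;; Numeric comparisons work once the mass is a number, not a string.
(doseq [e (filter #(< (::mass %) 5) cast-elements)]
  (println (::name e)))
```

Keeping the cast in one small function means a malformed value fails early with a parse exception instead of surfacing later as a confusing comparison error.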
When run, the program produces the following output:
$ lein run
{::number 1 ::symbol "H" ::name "Hydrogen" ::mass 1.00794 ::metal "nonmetal"}
{::number 2 ::symbol "He" ::name "Helium" ::mass 4.002602 ::metal "noble gas"}
{::number 3 ::symbol "Li" ::name "Lithium" ::mass 6.941 ::metal "alkali metal"}
{::number 4 ::symbol "Be" ::name "Beryllium" ::mass 9.012182 ::metal "alkaline earth metal"}
This concludes our simple tutorial for using the Clojure libraries for Frictionless Data.
We welcome your feedback and questions via our Frictionless Data Gitter chat or via GitHub issues on the datapackage-clj repository.