Using Data Packages in Go
Daniel Fireman was one of 2017’s Frictionless Data Tool Fund grantees, tasked with extending the implementation of the core Frictionless Data libraries in the Go programming language. You can read more about this in his grantee profile. In this post, Fireman shows you how to install and use the Go libraries for working with Tabular Data Packages.
Our goal in this tutorial is to load a data package from the web and read its metadata and contents.
# Setup
For this tutorial, we will need the `datapackage-go` and `tableschema-go` packages, which provide all the functionality needed to work with a Data Package’s metadata and its contents.
We are going to use the dep tool to manage the dependencies of our new project:

```shell
$ cd $GOPATH/src/newdataproj
$ dep init
```
# The Periodic Table Data Package
A Data Package is a simple container format used to describe and package a collection of data. It consists of two parts:
- Metadata that describes the structure and contents of the package
- Resources such as data files that form the contents of the package
In this tutorial, we are using a Tabular Data Package containing the periodic table. The package descriptor (`datapackage.json`) and contents (`data.csv`) are stored on GitHub. For each element, the dataset includes the atomic number, symbol, name, atomic mass, and a metal or nonmetal classification. Here are the header and the first three rows:
atomic number | symbol | name | atomic mass | metal or nonmetal? |
---|---|---|---|---|
1 | H | Hydrogen | 1.00794 | nonmetal |
2 | He | Helium | 4.002602 | noble gas |
3 | Li | Lithium | 6.941 | alkali metal |
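To make the two-part structure concrete, here is a minimal sketch that decodes a descriptor using only the standard library. The `descriptor` and `resource` types, the `parseDescriptor` helper, and the trimmed-down `sampleDescriptor` are hypothetical and written for this illustration; the real `datapackage.json` carries more fields (such as the table schema), and this is not how datapackage-go itself is implemented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resource mirrors a small subset of one resource entry (hypothetical type).
type resource struct {
	Name string `json:"name"`
	Path string `json:"path"`
}

// descriptor mirrors a small subset of a datapackage.json file (hypothetical type).
type descriptor struct {
	Name      string     `json:"name"`
	Title     string     `json:"title"`
	Resources []resource `json:"resources"`
}

// sampleDescriptor is a trimmed-down descriptor written for this sketch.
const sampleDescriptor = `{
  "name": "period-table",
  "title": "Periodic Table",
  "resources": [{"name": "data", "path": "data.csv"}]
}`

// parseDescriptor decodes the metadata part of a package.
func parseDescriptor(raw string) (descriptor, error) {
	var d descriptor
	err := json.Unmarshal([]byte(raw), &d)
	return d, err
}

func main() {
	d, err := parseDescriptor(sampleDescriptor)
	if err != nil {
		panic(err)
	}
	fmt.Println("Package:", d.Name)
	// The metadata points at the resources that hold the actual data.
	for _, r := range d.Resources {
		fmt.Printf("Resource %q is stored at %s\n", r.Name, r.Path)
	}
}
```

The metadata names each resource and says where its data lives, which is why the library can later fetch a resource by name.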
# Inspecting Package Metadata
Let’s start by creating `main.go`, which loads the data package and inspects some of its metadata.

```go
package main

import (
	"fmt"

	"github.com/frictionlessdata/datapackage-go/datapackage"
)

func main() {
	// The package value is discarded for now: Go refuses to compile unused variables.
	_, err := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
	if err != nil {
		panic(err)
	}
	fmt.Println("Package loaded successfully.")
}
```
Before running the code, you need to tell the dep tool to update our project dependencies. Don’t worry; you won’t need to do it again in this tutorial.

```shell
$ dep ensure
$ go run main.go
Package loaded successfully.
```
Now that you have loaded the periodic table Data Package, you have access to its `title` and `name` fields through the `Package.Descriptor()` method. Let’s change our main function to print them (error handling is omitted for brevity, but it is very important in real code):

```go
func main() {
	pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
	fmt.Println("Name:", pkg.Descriptor()["name"])
	fmt.Println("Title:", pkg.Descriptor()["title"])
}
```
And rerun the program:

```shell
$ go run main.go
Name: period-table
Title: Periodic Table
```

As you can see, the printed fields match the package descriptor. For more information about the Data Package structure, please take a look at the specification.
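Because the descriptor is indexed by key, a missing or mistyped field only shows up at runtime. The sketch below shows one way to read such fields defensively. It assumes the descriptor behaves like a `map[string]interface{}`; the `stringField` helper and the stand-in map are hypothetical and not part of datapackage-go.

```go
package main

import "fmt"

// stringField returns descriptor[key] as a string. The boolean is false when
// the key is missing or holds a non-string value.
func stringField(descriptor map[string]interface{}, key string) (string, bool) {
	v, ok := descriptor[key].(string)
	return v, ok
}

func main() {
	// Stand-in for the map obtained from the loaded package (assumption).
	descriptor := map[string]interface{}{
		"name":  "period-table",
		"title": "Periodic Table",
	}
	for _, key := range []string{"name", "title", "license"} {
		if v, ok := stringField(descriptor, key); ok {
			fmt.Printf("%s: %s\n", key, v)
		} else {
			fmt.Printf("%s: (not set)\n", key)
		}
	}
}
```

The comma-ok type assertion avoids a panic when a field is absent, which matters for descriptors you do not control.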
# Quick Look At the Data
Now that you have loaded your Data Package, it is time to process its contents. The package content consists of one or more resources. You can access a resource via the `Package.GetResource()` method. Let’s print the contents of the periodic table’s `data` resource.

```go
func main() {
	pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
	res := pkg.GetResource("data")
	table, _ := res.ReadAll()
	for _, row := range table {
		fmt.Println(row)
	}
}
```

```shell
$ go run main.go
[atomic number symbol name atomic mass metal or nonmetal?]
[1 H Hydrogen 1.00794 nonmetal]
[2 He Helium 4.002602 noble gas]
[3 Li Lithium 6.941 alkali metal]
[4 Be Beryllium 9.012182 alkaline earth metal]
...
```
The `Resource.ReadAll()` method loads the whole table into memory as raw strings and returns it as a Go `[][]string`. This is quite useful for taking a quick look at the data or performing a visual sanity check.
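To illustrate what a `[][]string` table looks like, here is a minimal sketch built on the standard library's `encoding/csv` rather than on the Data Package libraries. The `readAll` helper is hypothetical, and `sampleCSV` assumes the file is plain comma-separated text containing the header and rows shown earlier in this post.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sampleCSV assumes a comma-separated layout of the rows shown in this post.
const sampleCSV = `atomic number,symbol,name,atomic mass,metal or nonmetal?
1,H,Hydrogen,1.00794,nonmetal
2,He,Helium,4.002602,noble gas
3,Li,Lithium,6.941,alkali metal
`

// readAll loads every record into memory as raw strings.
func readAll(data string) ([][]string, error) {
	return csv.NewReader(strings.NewReader(data)).ReadAll()
}

func main() {
	table, err := readAll(sampleCSV)
	if err != nil {
		panic(err)
	}
	fmt.Println("rows (including header):", len(table))
	// Every cell is still text: the atomic mass is a string, not a float64.
	fmt.Printf("%q\n", table[1])
}
```

Note that the header is just another row here, and numeric cells need an explicit conversion before you can compare or sum them.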
# Processing the Data Package’s Content
Even though the string representation can be useful for a quick sanity check, you probably want to use actual language types to process the data. Don’t worry, you won’t need to fight the casting battle yourself. The Data Package Go libraries provide a rich set of methods for loading data in an idiomatic way (very similar to `encoding/json`).
As an example, let’s change our `main` function to use actual types to store the periodic table and print the elements with an atomic mass smaller than 10.

```go
package main

import (
	"fmt"

	"github.com/frictionlessdata/datapackage-go/datapackage"
	"github.com/frictionlessdata/tableschema-go/csv"
)

// element maps each CSV column to a typed field via the tableheader tag.
type element struct {
	Number int     `tableheader:"atomic number"`
	Symbol string  `tableheader:"symbol"`
	Name   string  `tableheader:"name"`
	Mass   float64 `tableheader:"atomic mass"`
	Metal  string  `tableheader:"metal or nonmetal?"`
}

func main() {
	pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
	resource := pkg.GetResource("data")
	var elements []element
	// Cast parses every row into an element; its error is ignored here for brevity.
	resource.Cast(&elements, csv.LoadHeaders())
	for _, e := range elements {
		if e.Mass < 10 {
			fmt.Printf("%+v\n", e)
		}
	}
}
```

```shell
$ go run main.go
{Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal}
{Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas}
{Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal}
{Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal}
```
In the example above, all rows in the table are loaded into memory. Then every row is parsed into an `element` object and appended to the slice. The `resource.Cast` call returns an error if the whole table cannot be successfully parsed, so real code should check it.
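To show what casting a row involves, and why a single bad cell can fail the whole table, here is a minimal standard-library sketch. The `castRow` and `castAll` helpers are hypothetical and match columns by position; the library itself matches them through the `tableheader` struct tags, which this sketch does not reproduce.

```go
package main

import (
	"fmt"
	"strconv"
)

type element struct {
	Number int
	Symbol string
	Name   string
	Mass   float64
	Metal  string
}

// castRow converts one raw row into an element, reporting the first cell
// that fails to parse.
func castRow(row []string) (element, error) {
	if len(row) != 5 {
		return element{}, fmt.Errorf("expected 5 cells, got %d", len(row))
	}
	number, err := strconv.Atoi(row[0])
	if err != nil {
		return element{}, fmt.Errorf("atomic number %q: %v", row[0], err)
	}
	mass, err := strconv.ParseFloat(row[3], 64)
	if err != nil {
		return element{}, fmt.Errorf("atomic mass %q: %v", row[3], err)
	}
	return element{Number: number, Symbol: row[1], Name: row[2], Mass: mass, Metal: row[4]}, nil
}

// castAll converts every row and fails as soon as one row cannot be parsed.
func castAll(rows [][]string) ([]element, error) {
	elements := make([]element, 0, len(rows))
	for i, row := range rows {
		e, err := castRow(row)
		if err != nil {
			return nil, fmt.Errorf("row %d: %v", i, err)
		}
		elements = append(elements, e)
	}
	return elements, nil
}

func main() {
	rows := [][]string{
		{"1", "H", "Hydrogen", "1.00794", "nonmetal"},
		{"2", "He", "Helium", "4.002602", "noble gas"},
	}
	elements, err := castAll(rows)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", elements[0])

	// A malformed cell makes the whole cast fail.
	_, err = castAll([][]string{{"x", "H", "Hydrogen", "1.00794", "nonmetal"}})
	fmt.Println("error:", err)
}
```

Returning the row index in the error makes it much easier to locate the offending record in a large file.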
If you don’t want to load all the data into memory at once, you can lazily access each row using `Resource.Iter` and use `Schema.CastRow` to cast each row into an `element` object. That would change our main function to:

```go
func main() {
	pkg, _ := datapackage.Load("https://raw.githubusercontent.com/frictionlessdata/example-data-packages/62d47b454d95a95b6029214b9533de79401e953a/periodic-table/datapackage.json")
	resource := pkg.GetResource("data")
	iter, _ := resource.Iter(csv.LoadHeaders())
	sch, _ := resource.GetSchema()
	// A single element value is reused for every row.
	var e element
	for iter.Next() {
		sch.CastRow(iter.Row(), &e)
		if e.Mass < 10 {
			fmt.Printf("%+v\n", e)
		}
	}
}
```

```shell
$ go run main.go
{Number:1 Symbol:H Name:Hydrogen Mass:1.00794 Metal:nonmetal}
{Number:2 Symbol:He Name:Helium Mass:4.002602 Metal:noble gas}
{Number:3 Symbol:Li Name:Lithium Mass:6.941 Metal:alkali metal}
{Number:4 Symbol:Be Name:Beryllium Mass:9.012182 Metal:alkaline earth metal}
```
And our code is ready to deal with the growth of the periodic table in a very memory-efficient way 😃
We welcome your feedback and questions via our Frictionless Data Gitter chat or via GitHub issues on the datapackage-go repository.