Frictionless Data Frictionless Data
Introduction
Projects
Universe
Adoption
People
Fellows (opens new window)
  • Architecture
  • Roadmap
  • Process
  • Get Help
  • Contribute
  • Code of Conduct
  • Events Calendar
  • Forum (opens new window)
  • Chat (Slack) (opens new window)
  • Chat (Matrix) (opens new window)
Blog
Introduction
Projects
Universe
Adoption
People
Fellows (opens new window)
  • Architecture
  • Roadmap
  • Process
  • Get Help
  • Contribute
  • Code of Conduct
  • Events Calendar
  • Forum (opens new window)
  • Chat (Slack) (opens new window)
  • Chat (Matrix) (opens new window)
Blog
  • CSV - Comma Separated Values

    • The Format
      • What a CSV looks like
      • Dialects of CSVs
      • What is Missing in CSV?
      • Links
    • Tools
      • Desktop
      • View a CSV file in your Browser
      • Unix Command Line Manipulation
      • Power Tools
      • Libraries
    • Tips and Tricks
      • CSVs and Git

CSV - Comma Separated Values

July 9, 2018 by Frictionless Data
Price icons created by Pixel perfect - Flaticon csv

This page provides an overview CSV (Comma Separated Values) format for data.

CSV is a very old, very simple and very common “standard” for (tabular) data.
We say “standard” in quotes because there was never a formal standard for CSV,
though in 2005 someone did put together a RFC (opens new window) for it.

CSV is supported by a huge number of tools from spreadsheets like Excel,
OpenOffice and Google Docs to complex databases to almost all programming
languages. As such it is probably the most widely supported structured data
format in the world.


# The Format

Key points are:

  • CSV is probably the simplest possible structured format for data
  • CSV strikes a delicate balance, remaining readable by both machines & humans
  • CSV is a two dimensional structure consisting of rows of data, each row
    containing multiple cells. Rows are (usually) separated by line terminators
    so each row corresponds to one line. Cells within a row are separated by
    commas (hence the C(ommmas) part)
    • Note that strictly we’re really talking about DSV files in that we can
      allow ‘delimiters’ between cells other than a comma. However, many people
      and many programs still call such data CSV (since comma is so common as the
      delimiter)
  • CSV is a “text-based” format, i.e. a CSV file is a text file. This makes it
    amenable for processing with all kinds of text-oriented tools (from text
    editors to unix tools like sed, grep etc (opens new window))

# What a CSV looks like

If you open up a CSV file in a text editor it would look something like:

A,B,C
1,2,3
4,"5,3",6

Here there are 3 rows each of 3 columns. Notice how the second column in the last line is
“quoted” because the content of that value actually contains a “,” character. Without
the quotes this character would be interpreted as a column separator. To avoid this
confusion we put quotes around the whole value. The result is that we have 3 rows each
of 3 columns (Note a CSV file does not have to have
the same number of columns in each row).

# Dialects of CSVs

As mentioned above, CSV files can have quite a bit of variation in
structure. Key options are:

  • Field delimiter: rather than comma , people often use things like \t
    (tab), ; or |
  • Record terminator / line terminator: is \n (unix), \n\r (dos) or something else …
  • How do you quote records that contain your delimiter

You can read more in the CSV Dialect Description Format (opens new window) which defines
a small JSON-oriented structure for specifying what options a CSV uses.

# What is Missing in CSV?

  • CSV lacks any way to specify type information: that is, there is no way to
    distinguish “1” the string from 1 the number. This shortcoming can be
    addressed by adding some form of simple schema. For example Table
    Schema
    provides a very simple way to describe your schema externally
    whilst Linked CSV (opens new window) is an example of doing this “inline” (that
    is, in the CSV).
  • No support for relationships between different “tables”. This is similar to
    the previous point and again Table Schema provides a way to address
    this by providing additional schema information externally.
  • CSV is really only for tabular data – it is not so good for data with
    nesting or where structure is not especially tabular (though remember most
    data can be put into tabular form if you try hard enough!)

# Links

Specifications and overviews:

  • RFC specification of CSV (opens new window)
  • [CSV Dialect Description Format][csvddf]
  • CSV on Wikipedia (opens new window)

# Tools

The great thing about CSV is the huge level of tool support. The following is
not intended to be comprehensive but is more at the electic end of the spectrum.

# Desktop

All spreadsheet programs including Excel, OpenOffice, Google Docs
Spreadsheets supporting opening, editing and saving CSVs.

# View a CSV file in your Browser

You can view a CSV file (saving you the hassle of downloading it and opening
it). Options include:

  • You can use datapipes: http://datapipes.okfnlabs.org/csv/html (opens new window)

    Just paste your CSV file and away you go.

  • Install this Chrome Browser Extension (opens new window). This can be used both
    for online files and for files on your local disk (if you open them with your
    browser!)

# Unix Command Line Manipulation

See

  • Using unix command line tools on CSV (opens new window)
  • The wonderful csvkit (opens new window) (python)

# Power Tools

  • OpenRefine (opens new window) is a powerful tool for editing and manipulating data and works
    very well with CSV
  • Data Explorer (opens new window) supports importing CSVs and manipulating and changing
    them using javascript in the browser

# Libraries

This is heavily biased towards python!

# Python

  • Built in csv library is good
  • The wonderful csvkit (opens new window) (python)
  • messytables (opens new window) (python) - convert lots of badly structured data into CSV (or
    other formats)

# Node

Nothing in standard lib yet and best option seems to be:

  • https://github.com/wdavidw/node-csv (opens new window)

# Tips and Tricks

# CSVs and Git

Get git to handle CSV diffs in a sensible way (very useful if you are using
git or another version control system to store data (opens new window)
).

Make these changes to config files:

# ~/.config/git/attributes
*.csv diff=csv
# ~/.gitconfig
[diff "csv"]
  wordRegex = [^,\n]+[,\n]|[,]

Then do:

git diff --word-diff
# make it even nicer
git diff --word-diff --color-words

Credit for these fixups to contributors on this question on
StackExchange (opens new window)

and to James Smith (opens new window).

Blog Index