CSV - Comma Separated Values
This page provides an overview CSV (Comma Separated Values) format for data.
CSV is a very old, very simple and very common "standard" for (tabular) data. We say "standard" in quotes because there was never a formal standard for CSV, though in 2005 someone did put together a RFC for it.
CSV is supported by a huge number of tools from spreadsheets like Excel, OpenOffice and Google Docs to complex databases to almost all programming languages. As such it is probably the most widely supported structured data format in the world.
Key points are:
- CSV is probably the simplest possible structured format for data
- CSV strikes a delicate balance, remaining readable by both machines & humans
- CSV is a two dimensional structure consisting of rows of data, each row
containing multiple cells. Rows are (usually) separated by line terminators
so each row corresponds to one line. Cells within a row are separated by
commas (hence the C(ommmas) part)
- Note that strictly we're really talking about DSV files in that we can allow 'delimiters' between cells other than a comma. However, many people and many programs still call such data CSV (since comma is so common as the delimiter)
- CSV is a "text-based" format, i.e. a CSV file is a text file. This makes it amenable for processing with all kinds of text-oriented tools (from text editors to unix tools like sed, grep etc)
What a CSV looks like
If you open up a CSV file in a text editor it would look something like:
A,B,C 1,2,3 4,"5,3",6
Here there are 3 rows each of 3 columns. Notice how the second column in the last line is "quoted" because the content of that value actually contains a "," character. Without the quotes this character would be interpreted as a column separator. To avoid this confusion we put quotes around the whole value. The result is that we have 3 rows each of 3 columns (Note a CSV file does not have to have the same number of columns in each row).
Dialects of CSVs
As mentioned above, CSV files can have quite a bit of variation in structure. Key options are:
- Field delimiter: rather than comma
,people often use things like
- Record terminator / line terminator: is
\n\r(dos) or something else ...
- How do you quote records that contain your delimiter
You can read more in the CSV Dialect Description Format which defines a small JSON-oriented structure for specifying what options a CSV uses.
What is Missing in CSV?
- CSV lacks any way to specify type information: that is, there is no way to distinguish "1" the string from 1 the number. This shortcoming can be addressed by adding some form of simple schema. For example Table Schema provides a very simple way to describe your schema externally whilst Linked CSV is an example of doing this "inline" (that is, in the CSV).
- No support for relationships between different "tables". This is similar to the previous point and again Table Schema provides a way to address this by providing additional schema information externally.
- CSV is really only for tabular data -- it is not so good for data with nesting or where structure is not especially tabular (though remember most data can be put into tabular form if you try hard enough!)
Specifications and overviews:
The great thing about CSV is the huge level of tool support. The following is not intended to be comprehensive but is more at the electic end of the spectrum.
All spreadsheet programs including Excel, OpenOffice, Google Docs Spreadsheets supporting opening, editing and saving CSVs.
View a CSV file in your Browser
You can view a CSV file (saving you the hassle of downloading it and opening it). Options include:
You can use datapipes: http://datapipes.okfnlabs.org/csv/html
Just paste your CSV file and away you go.
Install this Chrome Browser Extension. This can be used both for online files and for files on your local disk (if you open them with your browser!)
Unix Command Line Manipulation
- OpenRefine is a powerful tool for editing and manipulating data and works very well with CSV
This is heavily biased towards python!
- Built in csv library is good
- The wonderful csvkit (python)
- messytables (python) - convert lots of badly structured data into CSV (or other formats)
Nothing in standard lib yet and best option seems to be:
Tips and Tricks
CSVs and Git
Get git to handle CSV diffs in a sensible way (very useful if you are using git or another version control system to store data).
Make these changes to config files:
# ~/.config/git/attributes *.csv diff=csv # ~/.gitconfig [diff "csv"] wordRegex = [^,\n]+[,\n]|[,]
git diff --word-diff # make it even nicer git diff --word-diff --color-words