In our Case Study series, we are highlighting projects and organisations who are working with the Frictionless Data specifications and software in interesting and innovative ways.
CMSO uses the Frictionless Data specifications to package cell migration data and load it into pandas for data analysis and visualization.
Researchers worldwide try to understand how cells move, a process extremely important for many physiological and pathological conditions. Cell migration is in fact involved in many processes, like wound healing, neuronal development and cancer invasion. The Cell Migration Standardization Organization (CMSO) is a community building standards for cell migration data, in order to enable data sharing in the field. The organization has three main working groups:
- Minimal reporting requirement (developing MIACME, i.e. the Minimum Information About a Cell Migration Experiment)
- Controlled Vocabularies
- Data Formats and APIs
In the last of these working groups, we discussed where the Data Package specifications could be used or extended to define a standard format, together with the libraries needed to interact with it. In particular, we have started to address the standardization of cell tracking data. These data are produced by tracking software that reconstructs cell movement over time from microscope images.
In pink, the ISA (Investigation Study Assay) model to annotate the experimental metadata; in blue, the OME (Open Microscopy Environment) model for the imaging data; in green, our biotracks format based on the Data Package specification for the analytics data (cell tracking, positions, features etc.); in purple, CV: Controlled Vocabulary; and in turquoise, MIACME: Minimum Information About a Cell Migration Experiment. CC BY-SA 4.0 Credit: Paola Masuzzo (text) and CMSO (diagram).
CMSO deals specifically with cell migration data (a subject of cell biology). Our main challenge lies in the heterogeneity of the data. This diversity has its origin in two factors:
- Experimentally: Cell migration data can be produced using many diverse techniques (imaging, non-imaging, dynamic, static, high-throughput/screening, etc.)
- Analytically: These data are produced using many diverse software packages, each of these writing data to specific (sometimes proprietary) file formats.
This diversity hampers (or at least makes very difficult) procedures like meta-analysis, data integration, data mining, and last but not least, data reproducibility.
CMSO has developed and is about to release the first specification of a Cell Tracking format. This specification is built on a tabular representation, i.e. data are stored in tables. The current version (v0.1) of this specification can be seen here.
CMSO uses the Frictionless Data specifications to:
- Create a Data Package representation where the data (in our case objects such as cells detected in microscopy images, links, and optionally tracks) are stored in CSV files, while metadata and schema information are stored in a JSON file.
- Load this Data Package into a pandas DataFrame, to aid quick inspection and visualization.
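The two steps above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the biotracks implementation: the column names (`object_id`, `frame`, `x`, `y`, `link_id`), the package name, and the sample values are hypothetical, and only three Table Schema field types are handled. The sketch writes two CSV resources plus a `datapackage.json` descriptor, reads them back using the schema to cast values, and groups object positions by link.

```python
import csv
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample data: one cell detected in three consecutive frames.
OBJECTS = [
    {"object_id": 1, "frame": 0, "x": 10.0, "y": 5.0},
    {"object_id": 2, "frame": 1, "x": 11.5, "y": 5.5},
    {"object_id": 3, "frame": 2, "x": 13.0, "y": 6.5},
]
# Each row assigns one object to a link (a chain of objects over time).
LINKS = [
    {"link_id": 1, "object_id": 1},
    {"link_id": 1, "object_id": 2},
    {"link_id": 1, "object_id": 3},
]

# Table Schema field types mapped to Python casts (only those used here).
CASTS = {"integer": int, "number": float, "string": str}


def write_csv(path, rows):
    """Write a list of row dicts to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def write_package(directory):
    """Write the CSV data and a datapackage.json descriptor into `directory`."""
    directory = Path(directory)
    write_csv(directory / "objects.csv", OBJECTS)
    write_csv(directory / "links.csv", LINKS)
    descriptor = {
        "name": "tracking-example",
        "resources": [
            {
                "name": "objects",
                "path": "objects.csv",
                "schema": {"fields": [
                    {"name": "object_id", "type": "integer"},
                    {"name": "frame", "type": "integer"},
                    {"name": "x", "type": "number"},
                    {"name": "y", "type": "number"},
                ]},
            },
            {
                "name": "links",
                "path": "links.csv",
                "schema": {"fields": [
                    {"name": "link_id", "type": "integer"},
                    {"name": "object_id", "type": "integer"},
                ]},
            },
        ],
    }
    descriptor_path = directory / "datapackage.json"
    descriptor_path.write_text(json.dumps(descriptor, indent=2))
    return descriptor_path


def read_package(descriptor_path):
    """Return {resource name: list of typed row dicts}, cast via the schema."""
    descriptor_path = Path(descriptor_path)
    descriptor = json.loads(descriptor_path.read_text())
    tables = {}
    for resource in descriptor["resources"]:
        casts = {f["name"]: CASTS[f["type"]] for f in resource["schema"]["fields"]}
        with open(descriptor_path.parent / resource["path"], newline="") as handle:
            tables[resource["name"]] = [
                {name: casts[name](value) for name, value in row.items()}
                for row in csv.DictReader(handle)
            ]
    return tables


def trajectories(tables):
    """Return {link_id: [(frame, x, y), ...]} with points sorted by frame."""
    objects = {row["object_id"]: row for row in tables["objects"]}
    paths = defaultdict(list)
    for link in tables["links"]:
        obj = objects[link["object_id"]]
        paths[link["link_id"]].append((obj["frame"], obj["x"], obj["y"]))
    return {link_id: sorted(points) for link_id, points in paths.items()}


with tempfile.TemporaryDirectory() as tmp:
    tables = read_package(write_package(tmp))
    paths = trajectories(tables)
```

Because the data stay in plain CSV, the same files could also be read with `pandas.read_csv` for inspection and plotting. Keeping the field types in the JSON descriptor is what lets a reader cast the values without guessing.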
You can see some examples here.
I am an Open Science fan and advocate, so I try to keep up to date with the initiatives of the Open Knowledge International teams. I think I first became aware of Frictionless Data when I saw a tweet and checked the specs out. Also, CMSO really wanted to keep any specification and file format light and simple. So different people on the team must have googled for 'CSV and JSON formats' or something like that, and Frictionless Data popped up :).
I have opened a couple of issues on the GitHub page of the spec, detailing what I would like to see developed in the Frictionless Data project. The CMSO is not sure yet if the Data Package representation will be the one we’ll go for in the very end, because we would first like to know how sustainable/sustained this spec will be in the future.
CMSO is looking into expanding the list of examples we have so far in terms of tracking software. Personally, I would like to choose a reference data set (a live-cell, time-lapse microscopy data set) and run different cell tracking algorithms and software packages on it. Then I want to put the results into a common, light and easy-to-interpret CSV+JSON format (the biotracks format), and show people how data containerization can be the way to go to enable research data exchange and knowledge discovery at large.
With most other specifications, cell tracking data are stored in tabular format, but metadata are never kept together with the data, which makes data interpretation and sharing very difficult. The Frictionless Data specifications take good care of this aspect. Some other formats are based on XML annotation, which certainly does the job, but they are perhaps heavier (even though perhaps more sustainable in the long term). I hate Excel formats, and unfortunately I need to parse those too. I love the integration with Python and the pandas ecosystem; this is a big plus when doing data science.
As a researcher, I mostly deal with research data. I am pretty sure if this could work for cell migration data, it could work for many cell biology disciplines as well. I recommend speaking to more researchers and data producers to determine additional use cases!
Tabular Data Package: http://frictionlessdata.io/specs/tabular-data-package/
Data Package-aware libraries in Python: https://github.com/frictionlessdata/datapackage-py, https://github.com/frictionlessdata/tableschema-py, https://github.com/frictionlessdata/goodtables-py