Is reproducing someone else’s research data a Frictionless experience? As we have seen with all the previous cohorts of Frictionless Fellows (you can read the blog here), more often than not it is sadly not the case.
To prove that the “reproducibility crisis” is a real problem in scientific research at the moment, we challenged the Fellows to exchange their data to see if they could reproduce each other’s Data Packages. Read about their experience:
We had an interesting task for our Frictionless Fellows activity that involved pairing up with a fellow colleague, exchanging our data sets, and trying to reproduce each other’s work. My partner for this assignment was Lindsay, who is a librarian.
In data science, replicability and reproducibility are among the keys to data integrity. They create more opportunities for new insights and reduce errors. In order to ensure reproducibility, one must first make sure that the raw data is available. In this regard, my partner Lindsay shared her data with me via her GitHub account to facilitate the process.
This activity was really useful and humbling. As I discussed our data sets with Lindsay, I realized key things such as the Tidy Data principles, which were the highlight of the whole process for me, along with the realization that it is not easy to understand someone else’s data without metadata to accompany the data set. Imagine the frustration researchers go through trying to understand and reproduce other people’s data without more information about it.
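The Tidy Data principles mentioned here boil down to: every variable is a column, every observation is a row. As a toy illustration (the soil-pH numbers and column names below are entirely invented, not anyone’s actual data set), here is how a “wide” table can be reshaped into tidy form with plain Python:

```python
# A hypothetical "wide" table: one row per site, one column per year.
wide = [
    {"site": "A", "2020": 6.1, "2021": 6.4},
    {"site": "B", "2020": 5.8, "2021": 5.9},
]

# Tidy ("long") form: each variable (site, year, ph) gets its own
# column, and each site-year measurement becomes its own row.
tidy = [
    {"site": row["site"], "year": int(year), "ph": value}
    for row in wide
    for year, value in row.items()
    if year != "site"
]

print(tidy[0])  # {'site': 'A', 'year': 2020, 'ph': 6.1}
```

In tidy form a reader no longer needs to guess that the column headers “2020” and “2021” are values of a hidden *year* variable, which is exactly the kind of implicit knowledge that gets lost when data changes hands.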
Read Melvin’s blog to see how she managed to reproduce her fellow’s Data Package.
My data package partner, Zarena, is an awesome social scientist in the human rights sphere. She has a background in mental health research and interests ranging from epistemic injustice to intersectionality - two terms I had to double check my understanding of. Poking around Zarena’s profile, I found her focus on mad studies particularly interesting: a young interdisciplinary field dealing with identity and the marginalisation of individuals with alternative mental states. This idea - broadly accepting a spectrum of human states instead of subjecting them to a black/white absolute interpretation - was completely new to me and fascinating! But being a social theory noob, I expected to encounter a barrier to understanding her data.
Zarena’s data was publicly available in her GitHub fellows repository. I clocked a couple of things off the bat: the repo contained a CSV file called “data-dp.csv”, as well as a README.md and several schema files. When in doubt about where to start, a good place to look is the README.
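Schema files and a CSV like this are typically tied together by a `datapackage.json` descriptor, which is what makes a bare repository a Data Package. As a minimal sketch (the descriptor below is hypothetical, modelled on the format Frictionless tools generate; only the file name `data-dp.csv` comes from the repo, and the fields are made up), here is how a descriptor links a resource to its schema:

```python
import json

# Hypothetical minimal descriptor, modelled on the datapackage.json
# format used by Frictionless tools (the field names are invented).
descriptor = json.loads("""
{
  "name": "example-package",
  "resources": [
    {
      "name": "data-dp",
      "path": "data-dp.csv",
      "schema": {
        "fields": [
          {"name": "id", "type": "integer"},
          {"name": "label", "type": "string"}
        ]
      }
    }
  ]
}
""")

# List each resource and the columns its schema declares.
for resource in descriptor["resources"]:
    fields = [f["name"] for f in resource["schema"]["fields"]]
    print(resource["path"], "->", fields)  # data-dp.csv -> ['id', 'label']
```

Because the descriptor is plain JSON, a reuser can discover the expected columns and types before ever opening the CSV itself.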
Read Victoria’s blog to see how she reproduced her fellow’s data packages.
Data reproducibility is when other researchers use the same data and the same methods to attain the same results. Research reproducibility allows other scientists to gain new insights from your data, as well as to improve the quality of research by checking the correctness of your findings. The aim of this assignment was to try to reproduce my colleague’s Data Package and validate the tabular data using the Frictionless browser tools, that is, the Data Package Creator and GoodTables, respectively.
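The kind of checks GoodTables runs during validation, comparing each cell against the types declared in a Table Schema, can be sketched in a few lines. The schema and rows below are invented for illustration, and this is a drastically simplified stand-in for the real validation logic:

```python
import csv
import io

# Hypothetical Table Schema and data; a simplified sketch of the
# type checks GoodTables performs when validating tabular data.
schema = [{"name": "patient_id", "type": "integer"},
          {"name": "outcome", "type": "string"}]

raw = "patient_id,outcome\n1,improved\ntwo,worsened\n"

errors = []
reader = csv.reader(io.StringIO(raw))

# Check 1: the header row must match the schema's field names.
header = next(reader)
if header != [f["name"] for f in schema]:
    errors.append("header does not match schema")

# Check 2: each cell must parse as its declared type.
for line_no, row in enumerate(reader, start=2):
    for value, field in zip(row, schema):
        if field["type"] == "integer" and not value.lstrip("-").isdigit():
            errors.append(f"row {line_no}: '{value}' is not an integer")

print(errors)  # ["row 3: 'two' is not an integer"]
```

The real tools report many more error classes (missing values, blank rows, duplicate headers, format constraints), but the principle is the same: the schema makes expectations explicit so violations can be caught mechanically.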
First, Guo-Qiang shared the links to his datasets and the Data Package with me, which I freely accessed from his GitHub repository. His data was a summary of clinical evidence of various health effects of menopausal hormone therapy in menopausal women.
Read Kevin’s blog to see how he managed to reproduce Guo-Qiang’s Data Packages.
Before joining the Frictionless Data Fellowship Programme, I did not realise the importance of research reproducibility. To tell the truth, I really did not have such a concept in my professional vocabulary, despite having an MSc degree in Social Science Research Methods and working on different social research projects. But maybe that was the reason why I did not know this concept and never practised it in my research. Like many of my social science colleagues, especially the ones working with qualitative, and often sensitive, data, it was important for me to ensure that the data I collected were safely stored on a password-protected platform and then, upon completion of a project, deleted. But now, working in the Frictionless Data Fellowship Programme and managing different sorts of data, including bibliometric metadata, I see that if we want the social sciences and humanities to progress, it is vital to integrate practices such as reproducing, replicating, and reusing data into our research.
Our most recent Frictionless Fellows project is to trade data and create a Data Package using another Fellow’s data. I traded data with the fabulous Melvin! Melvin is a pathologist and soil scientist.
While this seems like a fun project, I was frustrated at first. I had to find my partner’s data. After reading her Data Package Blog and poking around on GitHub, I could not find her data. I eventually realized: we are mimicking the process of reusing reproducible research data. The first hurdle any researcher must overcome is finding the data.
Read Lindsay’s blog to understand what happened while reproducing Melvin’s Data Packages.
# About the Frictionless Data Fellowship
With the Frictionless Data Reproducible Research Fellows Programme, supported by the Sloan Foundation and Open Knowledge Foundation, we are recruiting and training early career researchers to become champions of the Frictionless Data tools and approaches in their field. Fellows learn about Frictionless Data, including how to use Frictionless tools in their domains to improve reproducible research workflows, and how to advocate for open science. To learn more about the programme, visit the dedicated website.