Frictionless Data Community Call October 2024
On our last community call, on October 31st, Keith Hughitt, a postdoctoral fellow at the National Institutes of Health, shared some ideas for improving the project.
Keith had already shared some thoughts in this direction at an earlier community call, in April 2023, and this presentation was a continuation of the ideas he had shared then.
The first thing to focus on is motivation: identifying the future we are all imagining for the project. Keith’s vision is that we should encourage many more specialised Data Package extensions (like the Cameratrap DP developed by Peter Desmet). We need to think about an organisational structure that can support that grassroots approach while avoiding duplicated effort and approaches that diverge too much.
As he already shared in the call last year, Keith also believes we should start thinking about extending support to data types beyond tabular data. Would we encounter any issues if we tried to adapt the standard and libraries as they are now to work with other data types (e.g. image, audio, geospatial, spectral…)? Would there be any problem in using the Frictionless Python Framework with other types of data, for example? Keith tried the `describe` function on genomic datasets (data type: matrix) and it works, but the result is not optimal.
The key would therefore be to work on domain-specific extensions. We do have an extension mechanism, which works quite well, so how can we make this happen? Keith suggested that smaller groups convene monthly to discuss and plan domain-specific standards, while everyone meets collectively every quarter (or so) to share progress and coordinate on areas where there is overlap. This should address the coordination challenges that may arise as the project grows.
Here are a few things the domain-specific groups will need to consider:
- Pick a name for the collection of standards/specs related to a specific domain (e.g. bio)
- Decide what kinds of data types they want to cover
- Clearly define the data structure
- Take into consideration what is already being used (for example, in terms of ontologies and controlled vocabularies)
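To make the checklist above more concrete, here is a minimal sketch of how a domain group might write down its decisions and check a descriptor against them. Everything in it is an assumption made for illustration: the `bio:` property names, the list of data types, and the `check_bio_resource` helper are hypothetical and were not proposed in the call, nor are they part of the Data Package standard.

```python
import json

# Hypothetical requirements for an imagined "bio" extension. None of these
# property names come from the Data Package standard or from the call.
BIO_REQUIRED_RESOURCE_PROPS = {"bio:dataType", "bio:structure", "bio:ontologies"}
BIO_DATA_TYPES = {"expression-matrix", "sequence", "image"}


def check_bio_resource(resource: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    # Report every required domain property that the resource does not define.
    for prop in sorted(BIO_REQUIRED_RESOURCE_PROPS - set(resource)):
        problems.append(f"missing property: {prop}")
    # Report a declared data type that the (hypothetical) extension does not cover.
    data_type = resource.get("bio:dataType")
    if data_type is not None and data_type not in BIO_DATA_TYPES:
        problems.append(f"unsupported data type: {data_type}")
    return problems


# An illustrative descriptor: one resource follows the imagined extension,
# the other does not.
package = {
    "name": "example-expression-study",
    "resources": [
        {
            "name": "counts",
            "path": "counts.csv",
            "bio:dataType": "expression-matrix",
            "bio:structure": {"rows": "genes", "columns": "samples"},
            "bio:ontologies": ["hypothetical-gene-vocabulary"],
        },
        {"name": "notes", "path": "notes.csv", "bio:dataType": "free-text"},
    ],
}

report = {r["name"]: check_bio_resource(r) for r in package["resources"]}
print(json.dumps(report, indent=2))
```

In practice a group would more likely publish its rules as a profile that existing tooling can validate against; the plain dictionary check here only illustrates which decisions need to be written down.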
To help us think abstractly about how we want to handle data, Keith shared a paper by Sanborn et al. (Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures, https://doi.org/10.48550/arXiv.2407.09468), which describes different types of data from an abstract point of view (not exhaustive, but it still covers quite a wide range of data types).
Keith also mentioned it would be interesting to create a community repository, or an open source server, where people could host their Data Packages. The server could provide an API to query the deposited Data Packages. He also suggested creating a CLI to interact with people’s local data and to query and retrieve data from remote repositories (giving the user, of course, the possibility to enable or disable the search and/or specify the search order). If you are interested in building a common API, you can get in touch with Keith on the community chat.
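As a rough sketch of the lookup behaviour described above, the example below resolves a Data Package by name across several sources, honouring a user-defined search order and skipping sources the user has disabled. The `Source` class, the `resolve` function, and the source names are hypothetical; no such API or CLI exists yet, and in-memory dictionaries stand in for a local directory and a remote server.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Source:
    """A hypothetical place to look for Data Packages (local directory or remote server)."""
    name: str
    packages: dict = field(default_factory=dict)  # in-memory stand-in for real storage
    enabled: bool = True  # the user can switch a source off

    def find(self, package_name: str) -> Optional[dict]:
        return self.packages.get(package_name)


def resolve(package_name: str, sources: list, order: list) -> Optional[tuple]:
    """Search enabled sources in the given order; return (source name, descriptor) or None."""
    by_name = {s.name: s for s in sources}
    for source_name in order:
        source = by_name.get(source_name)
        if source is None or not source.enabled:
            continue  # unknown or disabled sources are skipped
        descriptor = source.find(package_name)
        if descriptor is not None:
            return source.name, descriptor
    return None


local = Source("local", {"counts": {"name": "counts", "title": "Local copy"}})
community = Source("community", {
    "counts": {"name": "counts", "title": "Published copy"},
    "images": {"name": "images", "title": "Published copy"},
})
sources = [local, community]

# The local copy wins when "local" comes first in the search order.
print(resolve("counts", sources, order=["local", "community"]))
# A package that only exists remotely is found on the community source.
print(resolve("images", sources, order=["local", "community"]))
# With remote search disabled, the same lookup finds nothing.
community.enabled = False
print(resolve("images", sources, order=["local", "community"]))
```

Keeping the search order as plain data, separate from the sources themselves, means a CLI could expose it as a simple configuration setting.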
# Join us in November!
The next community call is on November 28th. Product Owner Romina Colman will be presenting the Frictionless application Open Data Editor.
Do you have something you would like to present to the community at one of the upcoming calls? Let us know via this form, or come and tell us on our community chat on Slack (also accessible via a Matrix bridge if you prefer to use an open protocol).
You can already sign up for the call here. Do you want to share something with the community? Let us know when you sign up.
# Call Recording
Here is the recording of the full call: