Can Wageningen be FAIR?
Last March an article was published about the ‘FAIR principles’ for data management and stewardship[i]. Nowadays one can hear people say that Wageningen UR’s data should be ‘FAIR’. We had a cup of coffee with the co-author of the article from Wageningen, Richard Finkers, to discuss what it would take to achieve ‘FAIR’ data stewardship in Wageningen. This blog is based on ideas from that conversation. Thanks, Richard.
The FAIR principles are modular, and institutions can work on them one at a time. They resulted from a workshop in January 2014, and additional detail can be found in the output of that workshop[ii]. Let’s discuss one by one what they would mean for Wageningen.
F – Findable. From the description in the workshop output: “To be Findable any Data Object should be uniquely and persistently identifiable. / … / A Data Object should minimally contain basic machine actionable metadata that allows it to be distinguished from other Data Objects”.
The library has started registering datasets that Wageningen publishes at services like DANS, Dryad, Figshare etcetera, and those services assign persistent identifiers, usually a DOI. We have also started assigning DOIs to information objects, especially dissertations and reports, and in some instances datasets. But that only applies to data that is officially published, usually data that underlies publications. Richard confirmed that in his own field, plant breeding, there is much more data. He, as one of their data people, knows how to get to it. Wageningen probably has many such data collections. Potentially they can be made findable, and the research project information table offers a possible backbone to register them. We can also assign DOIs to them, but then we should make sure that a user who sends the identifier to a resolver service – like http://dx.doi.org for a DOI – gets useful information about the object in return. Persistent identifiers were developed to prevent broken links, so understandably there are heavy penalties for unresolvable identifiers.
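As a rough illustration of what ‘sending the identifier to a resolver’ means, here is a minimal Python sketch. The DOI `10.1234/example` is made up, and the function names are ours. The `Accept` header asks the resolver for machine-readable metadata instead of a web page; many DOI registration agencies support this, but whether a given DOI returns useful metadata is exactly the point that needs checking.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Resolver mentioned in the text; a DOI is appended to this base URL.
RESOLVER = "http://dx.doi.org/"


def resolver_url(doi, resolver=RESOLVER):
    """Build the URL that a user (or a machine) would send to the resolver."""
    return resolver + quote(doi.strip(), safe="/")


def fetch_metadata(doi, timeout=10):
    """Ask the resolver for machine-readable metadata about the object.

    Assumption: the registration agency behind the DOI supports content
    negotiation for this media type. If not, this call will fail.
    """
    request = Request(
        resolver_url(doi),
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical identifier, for illustration only.
print(resolver_url("10.1234/example"))
```

Calling `fetch_metadata` with a real DOI should return a dictionary with fields such as a title and authors. An identifier that returns nothing useful is not really ‘findable’.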
So, becoming ‘F’ is doable, but there is a lot of work to do.
A – Accessible. The workshop report: “Data is Accessible in that it can be always obtained by machines and humans / Upon appropriate authorization / …. / Thus, machines and humans alike will be able to judge the actual accessibility of each Data Object”.
For datasets that have been published through services like DANS or Dryad this is all in place; for any other data it will require some additional work. Part of it is technical and doable unless the cost is prohibitive. For some large datasets the sheer bandwidth may not be in place. And extra technical facilities may be required to give users from outside Wageningen UR access to all data. But the main hurdle is probably social: for each dataset an owner who can give access should be known, and should have the right to actually click the button and authorize access. Or to deny access: FAIR does not mean that all data should be usable by everyone.
A considerable part of the data has probably been collected in collaboration with third parties, and these third parties need to be consulted.
So, ‘A’ is doable but quite a bit more work.
I – Interoperable. The workshop report: “Data Objects can be Interoperable only if: / (Meta) data is machine-actionable / (Meta) data formats utilize shared vocabularies and/or ontologies / ……”
This is quite key: the purpose of the ‘FAIR’ movement is very much to enable data mining and computerized meta-analyses. The guidelines are careful not to mention specific technologies and protocols, but the only practical way in which the world is working on this is the Linked Open Data initiative, which the inventor of the World Wide Web – Tim Berners-Lee – is committed to getting off the ground[iii].
To simplify matters: Linked Open Data is the idea that all knowledge and all data is exposed as ‘triples’. For example: in a table there are rows, representing things. There are columns and they represent the properties of these things. Each cell has a value for a property of a thing. Thing, property and value make up a triple. All this is not yet machine actionable if everybody gives different names to things and properties. Scientific communities are now working to assign unique identifiers to the properties and things in the data that they handle. There are groups in Wageningen that are involved in that work[iv] and for example in the plant sciences there are community efforts that we can build upon[v].
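To make the table-to-triples idea a bit more concrete, here is a small Python sketch. The plants, the column names and the `http://example.org/...` identifiers are all invented for illustration; a real project would take its identifiers from a community vocabulary rather than make them up.

```python
# Hypothetical table: each row is a thing, each column a property.
rows = [
    {"id": "plant-001", "species": "Solanum lycopersicum", "height_cm": 142},
    {"id": "plant-002", "species": "Solanum tuberosum", "height_cm": 61},
]


def table_to_triples(rows, key="id"):
    """Turn every cell into a (thing, property, value) triple."""
    triples = []
    for row in rows:
        thing = row[key]
        for prop, value in row.items():
            if prop != key:  # the key column names the thing itself
                triples.append((thing, prop, value))
    return triples


# Hypothetical mapping from local column names to shared identifiers.
SHARED_PROPERTIES = {
    "species": "http://example.org/vocab/species",
    "height_cm": "http://example.org/vocab/plantHeightInCm",
}


def use_shared_names(triples, base="http://example.org/plants/",
                     mapping=SHARED_PROPERTIES):
    """Replace local names with unique identifiers that others can share."""
    return [
        (base + thing, mapping.get(prop, prop), value)
        for thing, prop, value in triples
    ]


triples = table_to_triples(rows)
linked = use_shared_names(triples)
for triple in linked:
    print(triple)
```

The first function only restates the table. It is the second step, swapping local column names for identifiers from a shared vocabulary, that lets a machine combine these triples with someone else’s.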
But these vocabularies, or ontologies as they are called, are certainly not commonplace in the work of researchers in Wageningen. On top of that, the software that we use day-to-day is still not “triple-enabled”. All this is a bit of a chicken-and-egg story: “why would we invest in this methodology if the benefits are still uncertain and will take some time to materialize?” And if everybody thinks that way nothing will happen. Wageningen may get started by raising awareness and making all relevant vocabularies findable. The library can play a role in this, as we have a tradition of working with vocabularies and thesauri.
Note that datasets published with Dryad, Figshare, DANS etcetera are not interoperable as it is meant here. They may come with descriptions and ‘readme’ files that are understandable to human readers but not actionable for machines.
So ‘I’ is still some way off and Wageningen cannot do it on its own: we need to work with others and depend on them. And that makes sense, as one cannot interoperate in isolation.
R – Re-usable. For this principle the wording in the article is quite different from the workshop report, so let’s take the article as it is more recent: “…./(meta)data are released with a clear and accessible data usage license / …..”[vi]
Data that is published in services like DANS, Figshare or Dryad already comes with a license, or there is a choice between different licenses. For other data that Wageningen makes findable and accessible there is a perceived need to be clearer about intellectual property rights. The Graduate Schools have asked for more guidelines on ownership, and these licenses can be part of those guidelines and recommendations.
So we trust there is work underway that makes ‘R’ achievable.
[i] Wilkinson, Mark D., et al. “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific Data 3 (2016).
[ii] Guiding Principles for Findable, Accessible, Interoperable and Re-usable Data Publishing version b1.0, January 2014. https://www.force11.org/group/fairgroup/fairprinciples
[iii] “The next Web of open, linked data.” Video of Tim Berners-Lee’s talk at TED 2009. http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html
[vi] This principle is quite mixed: its other criteria have to do with a proper use of linked open data methodologies.