Vocabularies and the ‘I’ in ‘FAIR Data principles’
There is a new experimental service, vest.agrisemantics.org, that brings together vocabularies that can be used as models for data in many of the subject fields that Wageningen works on. In this blog we explain why we think this is good news for Wageningen and why it will help to make our data more “FAIR”.
In the previous blog we discussed the ‘FAIR principles for data stewardship’. Since then they have gained momentum. For instance, the FAIR principles are used in the template for data management plans that are mandatory for projects that receive funding from EU Horizon 2020. Coordinators of H2020 programs, who have to deliver such a plan in the first six months, are sometimes overwhelmed by these requirements. The requirements and recommendations that derive from the ‘I’ – for ‘Interoperable’ – sound especially daunting. Hopefully coordinators will feel a bit less overwhelmed after reading this blog.
The idea behind the ‘I’ principles is that data should be machine-actionable and able to “talk” to other data. The ultimate goal is to make meta-analyses and data mining possible. If we want data to talk to data, they need a common language: different data collections should use the same terminology for the same things or properties of things. That is not trivial, and in the end it all comes down to the question of what things are. This question has been discussed ever since mankind invented language, and it has not been resolved yet. However, there are new attempts to resolve it for data in a way that can work at least in well-defined contexts.
Language is always ambiguous: words may have different meanings and we may use different words to refer to the same thing. So we need unique words – ‘strings of characters’ in computer-speak – and a way to define their meaning unambiguously.
The people who invented the World Wide Web – under the leadership of Sir Tim Berners-Lee – are now working on the Linked Open Data (LOD) initiative. To create unique strings they use the same mechanism that is used for the World Wide Web. URLs – Uniform Resource Locators – are unique. If they were not, the World Wide Web would not work. The LOD initiative uses Uniform Resource Identifiers (URIs) as unique strings; URLs are a subcategory of URIs. Facts in a data set can be described as triples of such strings. For example, each cell in a spreadsheet can be read as a triple: the row (the thing in the table), the column (a property of that thing) and the value in the cell. All of these can be described with URIs.
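The spreadsheet example can be made concrete with a few lines of Python. This is only a minimal sketch: the table contents and the example.org base URIs are invented for illustration and do not come from any real vocabulary. To keep it short, cell values are kept as plain strings instead of being turned into URIs.

```python
# Hypothetical base URIs, used only for illustration.
PLOT = "http://example.org/plot/"
PROP = "http://example.org/property/"

# A tiny "spreadsheet": one row per thing, one column per property.
header = ["plot", "crop", "yield_t_per_ha"]
rows = [
    ["A1", "wheat", "7.2"],
    ["B4", "maize", "9.8"],
]

def table_to_triples(header, rows):
    """Turn every data cell into a (subject, predicate, object) triple."""
    triples = []
    property_columns = header[1:]  # the first column identifies the row
    for row in rows:
        subject = PLOT + row[0]  # the row is the thing being described
        for column, value in zip(property_columns, row[1:]):
            predicate = PROP + column  # the column is the property
            triples.append((subject, predicate, value))  # the cell is the value
    return triples

triples = table_to_triples(header, rows)
for triple in triples:
    print(triple)
```

Each printed line is one fact about one plot. Because subject and predicate are full URIs and not local labels, facts from other tables that use the same URIs can be combined with these without guessing what a column name means.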
There are also mechanisms to define the meaning that is to be attached to these unique URI strings. The most advanced form of such definitions is called an ‘ontology’. An ontology defines concepts and their relationships, as well as the URIs that should be used to refer to them. The URIs are also ‘resolvable’, meaning that users who follow them as a link will get to a description that is readable for humans. The unique URI strings themselves make it possible for machines – i.e. software applications – to identify the concepts unambiguously.
All this may sound rather theoretical, so let us point to a number of examples from different research fields in which scientists use these methods to link data and resolve practical questions. Oellrich et al. (2015) give an example from plant breeding, linking genomic and phenotypic data. Lopresti et al. (2015) report on an effort to link agronomic data on wheat with remote sensing data in Argentina. L’Abate et al. (2015) write about data integration with different types of soil data.
For data to be considered ‘FAIR’ it is not always necessary to expose it as triples of URIs. We use a lot of bitmaps, for instance for spatial data or for phenotyping. It is not practical to convert the bitmaps themselves to linked open data, but if the data that describes the bitmaps has a LOD format, the data collection can still be considered FAIR. In theory a CSV file with spreadsheet data can also be considered FAIR, if there is a separate file somewhere that maps the rows and columns to URIs.
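A rough sketch of how such a separate mapping could work is given below. Everything in it is an assumption made for illustration: the two small CSV data sets, their column labels (one English, one Dutch) and the example.org URIs are all hypothetical. The point is only that the CSV files stay as they are, while a mapping per data set states which shared URI each column stands for.

```python
import csv
import io

# Hypothetical shared vocabulary URIs (not taken from a real vocabulary).
CROP = "http://example.org/vocab/crop"
YIELD = "http://example.org/vocab/grainYield"

# Two data sets that use different column labels for the same properties.
dataset_a = "plot,crop,yield\nA1,wheat,7.2\n"
dataset_b = "perceel,gewas,opbrengst\nN3,wheat,6.9\n"

# The separate mappings: column label -> URI, one mapping per data set.
mapping_a = {"crop": CROP, "yield": YIELD}
mapping_b = {"gewas": CROP, "opbrengst": YIELD}

def harmonise(csv_text, mapping):
    """Read CSV text and relabel the mapped columns with their URIs."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Columns without a mapping are left out of the harmonised record.
        records.append({mapping[k]: v for k, v in row.items() if k in mapping})
    return records

# After relabelling, records from both sources share the same keys.
combined = harmonise(dataset_a, mapping_a) + harmonise(dataset_b, mapping_b)
yields = [float(record[YIELD]) for record in combined]
print(yields)
```

Once both data sets are keyed by the same URIs, they can be analysed together without anyone having to know that ‘yield’ and ‘opbrengst’ mean the same thing.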
So to go one step further in making our data FAIR we need to know which vocabularies can be used to describe the model for our data. The VEST registry (vest.agrisemantics.org) attempts to bring all that together. It is the first tangible product from the Global Open Data for Agriculture and Nutrition (GODAN) initiative. Even if you do not have plans to do anything with LOD right now, it is worth looking at the registry for guidance on how to set up the data structures for your research. If you follow the same principles it will not be too difficult to expose the data as LOD when it is relevant to do so.
The registry is a work in progress. Vocabularies have been brought together and their subject domain and format have been described. But are these vocabularies really standards in a scientific community, are they scientifically sound, and are they used in datasets and databases? The registry is set up to give information on those questions as well. You can see the answers if you click on the “assessment” tab of each description, but in most cases the criteria have not been filled in yet. The group working on the registry is seeking the advice of domain experts. If you are in a position to help them with the assessment of a standard or standards in a certain area, please share your advice.
So how can all this help the coordinator of a Horizon 2020 program who feels overwhelmed by the questions about FAIR data? If data integration is the primary goal of the project, adopting LOD is probably the way to go. But if the nature of the project does not justify such an investment, it is still worthwhile looking at the VEST registry. The vocabularies can be used to guide the setup of the data structures that will be used in the project. At a later stage the data can then be exposed as LOD for meta-analyses if the need arises. If you indicate that in the data management plan, the Horizon 2020 assessors should be satisfied that the project does what it can to be ‘FAIR’.
Anika Oellrich et al., An ontology approach to comparative phenomics in plants, Plant Methods 11:10 (2015)
Mariano F. Lopresti, Carlos M. Di Bella, Américo J. Degioanni, Relationship between MODIS-NDVI data and wheat yield: A case study in Northern Buenos Aires province, Argentina, Information Processing in Agriculture, Volume 2, Issue 2, September 2015, Pages 73–84, ISSN 2214-3173, http://dx.doi.org/10.1016/j.inpa.2015.06.001
Giovanni L’Abate et al., Exposing vocabularies for soil as Linked Open Data, Information Processing in Agriculture, Volume 2, Issues 3–4, October–December 2015, Pages 208–216, http://dx.doi.org/10.1016/j.inpa.2015.10.002