28 November 2016 | Category: Data management

Vocabularies and the ‘I’ in ‘FAIR Data principles’

By Hugo Besemer

There is a new experimental service, vest.agrisemantics.org, that brings together vocabularies that can be used as models for data in many of the subject fields that Wageningen works on. In this blog we explain why, in our view, this is good news for Wageningen and how it will help to make our data more “FAIR”.

In the previous blog we discussed the ‘FAIR principles for data stewardship’. Since then they have gained momentum. For instance, the FAIR principles are used in the template for the data management plans that are mandatory for projects that receive funding from EU Horizon 2020. Coordinators of H2020 programs, who have to deliver such a plan in the first six months, are sometimes overwhelmed by these requirements. The requirements and recommendations that derive from the ‘I’, for ‘Interoperable’, sound especially impressive. Hopefully coordinators will feel a bit less overwhelmed after reading this blog.

The idea behind the ‘I’ principles is that data should be machine-actionable and able to “talk” to other data. The ultimate goal is to make meta-analyses and data mining possible. If we want data to talk to data, it needs a common language: different data collections should use the same terminology for the same things or properties of things. That is not trivial, and in the end it all comes down to the question of what things are. This question has been discussed ever since mankind invented language, and it has not been resolved yet. However, there are new attempts to resolve the question for data in a way that can work at least in well-defined contexts.

Language is always ambiguous: words may have different meanings and we may use different words to refer to the same thing. So we need unique words – or, in computer-speak, ‘strings’ of characters – and a way to define their meaning unambiguously.

The people who invented the World Wide Web – under the leadership of Sir Tim Berners-Lee – are now working on the Linked Open Data (LOD) initiative. To create unique strings they use the same mechanism that is used for the World Wide Web. URLs – Uniform Resource Locators – are unique. If they were not, the World Wide Web would not work. The LOD initiative uses Uniform Resource Identifiers (URIs) as unique strings; URLs are a subcategory of URIs. Facts in a data set can be described as triples of such strings. For example, the information in a spreadsheet can be read as triples: the row (the thing in the table), the column (a property of that thing) and the value in the cell. All of these can be described as URIs.
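To make the spreadsheet analogy concrete, here is a minimal sketch in Python of how the cells of a small table could be turned into triples. The table, the column names and the `example.org` URIs are all invented for illustration; they do not come from any real vocabulary. For simplicity the cell values are kept as plain strings, although in a real data set they could be URIs as well.

```python
# Hypothetical base URIs (example.org is reserved for illustration only).
ROW_BASE = "http://example.org/plot/"
PROPERTY_BASE = "http://example.org/property/"


def table_to_triples(header, rows, key_column):
    """Turn every non-key cell into a (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        record = dict(zip(header, row))
        # The row (the "thing") gets a URI built from its key value.
        subject = ROW_BASE + record[key_column]
        for column, value in record.items():
            if column == key_column:
                continue
            # The column (the "property") gets a URI built from its name.
            triples.append((subject, PROPERTY_BASE + column, value))
    return triples


# An invented two-row spreadsheet.
header = ["plot_id", "crop", "yield_t_ha"]
rows = [["p1", "wheat", "7.2"], ["p2", "maize", "9.8"]]

triples = table_to_triples(header, rows, key_column="plot_id")
for triple in triples:
    print(triple)
```

Each printed line is one fact: which thing, which property, which value. Two data sets that use the same URIs for the same things and properties can then be combined without guessing what a column header means.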

There are also mechanisms to define the meaning that is to be attached to these unique URI strings. The most advanced form of such a definition is called an ‘ontology’. An ontology defines concepts and their relationships, as well as the URIs that should be used to refer to them. The URIs are also ‘resolvable’, meaning that a user who follows one as a link will get to a description that is readable for humans. The unique URI strings themselves make it possible for machines – i.e. software applications – to identify the concepts unambiguously.
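As a rough illustration of what an ontology holds, the sketch below models three concepts in Python, each with a URI, a human-readable label and definition, and a link to a more general concept. The concepts, definitions and `example.org` URIs are assumptions made up for this example. Real ontologies are published in dedicated formats and are far richer than this.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical concept URIs, invented for this sketch.
CROP = "http://example.org/onto/Crop"
CEREAL = "http://example.org/onto/Cereal"
WHEAT = "http://example.org/onto/Wheat"


@dataclass(frozen=True)
class Concept:
    uri: str                       # the unique string machines use
    label: str                     # the word humans use
    definition: str                # the human-readable description
    broader: Optional[str] = None  # URI of the more general concept


ONTOLOGY = {
    concept.uri: concept
    for concept in [
        Concept(CROP, "crop", "A plant grown for harvest."),
        Concept(CEREAL, "cereal", "A grass grown for its grain.", broader=CROP),
        Concept(WHEAT, "wheat", "A cereal of the genus Triticum.", broader=CEREAL),
    ]
}


def describe(uri):
    """Mimic 'resolving' a URI: return the description meant for humans."""
    concept = ONTOLOGY[uri]
    return f"{concept.label}: {concept.definition}"


def ancestors(uri):
    """Follow the 'broader' relationship up to the most general concept."""
    result = []
    current = ONTOLOGY[uri].broader
    while current is not None:
        result.append(current)
        current = ONTOLOGY[current].broader
    return result


print(describe(WHEAT))
print(ancestors(WHEAT))
```

The software only ever compares URIs, so it never has to interpret the word ‘wheat’; the label and definition are there for the human who follows the link.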

All this may sound rather theoretical. Let us point to a number of examples from different research fields in which scientists use these methods to link data and answer practical questions. One example comes from plant breeding, linking genomic and phenotypic data[1]. Another article reports on an effort to link agronomic data on wheat with remote sensing data in Argentina[2]. A third describes data integration with different types of soil data[3].

For data to be considered ‘FAIR’ it is not always necessary to expose it as triples of URIs. We use a lot of bitmaps, for example with spatial data or for phenotyping. It is not practical to convert the bitmaps themselves to linked open data, but if the data that describes the bitmaps has a LOD format, that data collection can still be considered FAIR. In theory a CSV file with spreadsheet data can also be considered FAIR, if there is a separate file somewhere that maps its rows and columns to URIs.
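The CSV case can be sketched as follows: the data file stays exactly as it is, and a small separate mapping file says which URI each column stands for. The mapping layout used here (`row_base`, `key_column`, `columns`) is a made-up format for this example, not an existing standard, and the URIs are again hypothetical.

```python
import csv
import io
import json

# The untouched data file (inlined here so the sketch is self-contained).
CSV_TEXT = "plot_id,crop,yield_t_ha\np1,wheat,7.2\np2,maize,9.8\n"

# The separate mapping file; its layout is an assumption for this sketch.
MAPPING_JSON = """{
  "row_base": "http://example.org/plot/",
  "key_column": "plot_id",
  "columns": {
    "crop": "http://example.org/property/crop",
    "yield_t_ha": "http://example.org/property/yield"
  }
}"""


def unmapped_columns(csv_text, mapping):
    """List CSV columns that the mapping file says nothing about."""
    reader = csv.DictReader(io.StringIO(csv_text))
    known = set(mapping["columns"]) | {mapping["key_column"]}
    return [name for name in reader.fieldnames if name not in known]


def rows_as_triples(csv_text, mapping):
    """Expose the CSV as triples on demand, without changing the file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for record in reader:
        subject = mapping["row_base"] + record[mapping["key_column"]]
        for column, predicate in mapping["columns"].items():
            yield (subject, predicate, record[column])


mapping = json.loads(MAPPING_JSON)
missing = unmapped_columns(CSV_TEXT, mapping)
triples = list(rows_as_triples(CSV_TEXT, mapping))
print("columns without a URI:", missing)
print(len(triples), "triples")
```

The point of the design is that researchers keep working with an ordinary CSV file, while the mapping carries the meaning and can be written, checked or replaced later.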

So to go one step further in making our data FAIR, we need to know which vocabularies can be used to describe the model for our data. The VEST registry attempts to bring all that together. It is the first tangible product of the Global Open Data for Agriculture and Nutrition (GODAN) initiative. Even if you do not have plans to do anything with LOD right now, it is worth looking at the registry for guidance on how to set up the data structures for your research. If you follow the same principles, it will not be too difficult to expose the data as LOD when it is relevant to do so.

The registry is a work in progress. Vocabularies have been brought together and their subject domain and format have been described. But are these vocabularies really standards in a scientific community? Are they scientifically sound? Are they used in datasets and databases? The registry is set up to give information on those questions as well. You can see the answers if you click on the “assessment” tab of each description, but in most cases the criteria have not been filled in yet. The group working on the registry is seeking the advice of domain experts. If you are in a position to help them with the assessment of a standard or standards in a certain area, please share your advice.

So how can all this help the coordinator of a Horizon 2020 program who feels overwhelmed by the questions about FAIR data? If data integration is the primary goal of the project, adopting LOD is probably the way to go. But if the nature of the project does not justify such an investment, it is still worthwhile looking at the VEST registry. The vocabularies can be used to guide the setup of the data structures that will be used in the project. At a later stage the data can then be exposed as LOD for meta-analyses if the need arises. If you indicate that in the data management plan, the Horizon 2020 assessors should be satisfied that the project does what it can to be ‘FAIR’.

[1] Anika Oellrich et al., An ontology approach to comparative phenomics in plants, Plant Methods 11:10 (2015), DOI: 10.1186/s13007-015-0053-y

[2] Mariano F. Lopresti, Carlos M. Di Bella, Américo J. Degioanni, Relationship between MODIS-NDVI data and wheat yield: A case study in Northern Buenos Aires province, Argentina, Information Processing in Agriculture, Volume 2, Issue 2, September 2015, Pages 73–84, ISSN 2214-3173, http://dx.doi.org/10.1016/j.inpa.2015.06.001

[3] Giovanni L’Abate et al., Exposing vocabularies for soil as Linked Open Data, Information Processing in Agriculture, Volume 2, Issues 3–4, October–December 2015, Pages 208–216, http://dx.doi.org/10.1016/j.inpa.2015.10.002

There are 4 comments.

  1. By: Rob Knapen · 29-11-2016 at 11:25 am

    Hi Hugo,

    Nice blog post. Are there also any initiatives to work towards coordinated vocabularies for WUR? So we can have FAIR One Wageningen data collections in the future? 🙂 Perhaps something for the Data Science Center to take into account.

  2. By: Jan Top · 29-11-2016 at 3:27 pm

    In the Food Informatics group at Wageningen Food & Biobased Research we have ample experience with developing and reusing vocabularies and applying them in smart applications. We have for example developed ROC as a method for efficient development of vocabularies by domain experts. It has been applied in the Valerie project together with Plant Research to create a comprehensive vocabulary on farming innovations. At FoodVoc we publish food related ontologies. Rosanne is an extension to Excel to create FAIR data. And much more … We are happy to share our knowledge on Linked Open Data!

  3. By: Hugo Besemer · 29-11-2016 at 3:28 pm

    Hello Rob,

    For starters it would help to have an overview of who is working on or interested in LOD and vocabularies in Wageningen. I will try to set up a meeting early 2017. I have asked the people from VEST / RDA Interest group agricultural data if one of them happens to be in the Netherlands around that time. If you are interested, contact me or leave a comment here about your activities.

  4. […] It is advised to start from existing ontologies, and to share ontologies where possible. See also one of our previous blog posts (here) […]
