Breeding and Genomics: WUR Data Champion in storing data on the genomes of 200,000 cows
For this blog post we interviewed Prof. Roel Veerkamp of Breeding and Genomics at Wageningen University & Research. The department analyses the genomes and phenotypes of animals to improve animal breeding and our understanding of genetic variation. A genome is the complete set of genes or genetic material present in a cell or organism. Both during and after research, the group stores and archives its data in a safe and organised way. We therefore made this group the third data champion in our series of Champion blogs!
Type of data
The group works with genome data of agricultural animals, but also of zoo animals and dogs. In a recent study on the stature of cattle published in Nature Genetics, which was the first big meta-analysis on livestock animals, they used data from more than 50,000 bulls with many daughters of 8 different breeds from 9 countries.
A big part of the data used in the group comes from existing databases. Roel: “We use data from international and national databases, but also databases from companies. For example, at the moment we study the genomes of 5000 bulls, and we also work with data sets of hundreds of thousands of cows, pigs and poultry, including sets from other countries. With our methods, we try to make connections between phenotypes and genetic information to determine the heredity.”
Data storage during research: WUR’s W-drive
For each project, the group creates a project folder on the WUR-IT W-drive in which researchers store all relevant project information, including the data. To keep work efficient, files are stored in a recurring, logical folder structure. The project lead, together with the project team, is responsible for keeping the folder up to date so that everyone within the project can use the data. Because of the size of the data sets and the need for massive computations, the department uses an HPC (high-performance computing) cluster from WUR IT.
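A recurring folder structure like the one described above can be created by a small script, so that every new project starts from the same layout. The sketch below is only an illustration: the subfolder names and the function name are assumptions, since the post does not describe the group's actual layout.

```python
from pathlib import Path

# Hypothetical subfolder names; the group's real layout is not given in the post.
PROJECT_SUBFOLDERS = (
    "admin",
    "data/raw",
    "data/processed",
    "scripts",
    "results",
    "docs",
)


def create_project_skeleton(root, project_name):
    """Create the same subfolder layout for a new project and return its path."""
    project_dir = Path(root) / project_name
    for sub in PROJECT_SUBFOLDERS:
        # parents=True also creates intermediate folders such as "data";
        # exist_ok=True makes it safe to run the script again later.
        (project_dir / sub).mkdir(parents=True, exist_ok=True)
    return project_dir
```

Calling `create_project_skeleton("W:/projects", "stature_cattle")` (both arguments are made-up examples) would give each project the same set of folders, which makes it easier for colleagues to find their way in a project they did not set up themselves.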
Making data available: what about confidentiality?
“I think that from a scientific perspective, it would be nice to make all data publicly available”, our data champion says. But although Roel likes the concept of making data available, in practice it is impossible in most cases. The group mainly uses data sets from other parties. For example, breeding companies routinely collect data for their business. Since there is a lot of competition between breeding companies, their data is often confidential. Roel explains: “It’s a growing dilemma. Journals ask us to share our data sets, and companies want to share their data for interesting science, but not make it publicly available.”
Roel gives several examples of how the group manages shared data, including in international consortia. “When we use shared data, we always agree what we are going to do with this data, for which research project it is and what are the intended publications.” Roel: “For every project, the agreements are different. We need permission for each new purpose.” The amount and type of data that are shared also differ: “In many genetics studies, only the relevant summary statistics are required from each country, and these are used in a meta-analysis combining the data.” These summary statistics can be made available, as was done for the above-mentioned Nature Genetics publication.
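To give an idea of how summary statistics from several countries can be combined without sharing the underlying animal records, here is a minimal sketch of an inverse-variance weighted (fixed-effect) meta-analysis. This is a common textbook approach and an assumption on our part: the post does not say which method was used in the Nature Genetics study, and the function name and numbers are made up for illustration.

```python
import math


def fixed_effect_meta(effects, std_errors):
    """Pool per-country effect estimates using inverse-variance weights.

    effects     -- estimated effect of one marker in each country
    std_errors  -- standard error of each of those estimates
    Returns the pooled effect and its standard error.
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need one standard error per effect estimate")
    # A more precise estimate (smaller standard error) gets a larger weight.
    weights = [1.0 / (se ** 2) for se in std_errors]
    total_weight = sum(weights)
    pooled_effect = sum(w * b for w, b in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_effect, pooled_se


# Illustrative numbers only: one marker, three hypothetical countries.
pooled, se = fixed_effect_meta([0.10, 0.20, 0.15], [0.05, 0.10, 0.08])
```

Each country only has to supply an effect estimate and its standard error per marker, which is why this kind of summary can often be shared even when the raw data is confidential.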
Data archiving after research with automatic version management
Genome positions of markers still change, as the annotation of the genome keeps improving. To be able to compare and combine the group's own genotypic data sets with each other, it is important that standard data formats are used, as well as metadata standards. Metadata is structured, machine-readable information, making it easy to search and aggregate data. The Breeding and Genomics department stores its genotype metadata in a commercial repository. “Storing the metadata of the genotypes there makes it accessible for all of our projects. Storing the metadata in this repository also ensures us that the data is tracked and easily updated.” Such structured data organization, including clear tracking and version history, is of course very useful in ensuring that data sets remain findable and reusable.
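As an illustration of why structured metadata helps here, the sketch below records which reference assembly a genotype set is mapped to and checks whether two sets can be combined without first converting marker positions. The field names, function names and example values are assumptions for the sake of the example; the post does not describe the schema of the repository the group uses.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GenotypeSetMetadata:
    """Hypothetical metadata record for one genotype data set."""
    dataset_id: str
    species: str
    reference_assembly: str  # assembly the marker positions refer to
    n_animals: int
    version: int = 1


def can_combine_directly(a, b):
    """True if both sets use the same species and reference assembly."""
    return a.species == b.species and a.reference_assembly == b.reference_assembly


def remapped(meta, new_assembly):
    """Return a new record for the same data set after remapping positions.

    The old record is left unchanged, so the version history stays intact.
    """
    return replace(meta, reference_assembly=new_assembly, version=meta.version + 1)


# Example values only; the data set names and sizes are made up.
old_set = GenotypeSetMetadata("bulls_2015", "cattle", "UMD3.1", 5000)
new_set = GenotypeSetMetadata("cows_2019", "cattle", "ARS-UCD1.2", 20000)
```

Here `can_combine_directly(old_set, new_set)` is false because the two sets refer to different assemblies, while `remapped(old_set, "ARS-UCD1.2")` yields a version 2 record that can be combined with `new_set`.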
WUR is serious about data
In this series of blog posts on ‘data champions’, we show the data management practices of some of WUR’s research groups, and how these align with the new data policy. In this case, the Breeding and Genomics group shows how well-organised storage on the W-drive and structured (meta-)data archiving keep data safe and findable for reuse. Stay tuned for more blog posts with other data champions!