University of Groningen Library
Open Science Blog

Good practices for FAIR data management - an interview with Sebastian Lequime on FAIR challenges in virology, open-source tools, and the role of journals and AI

Date: 9 April 2025
Author: Alba Soares Capellas
Dr. Sebastian Lequime

Part of open science is that researchers make their data FAIR: Findable, Accessible, Interoperable and Reusable. But how can this be done in practice? In this series, we ask researchers to tell us more about their data management choices.

In this edition, we speak with Dr. Sebastian Lequime, assistant professor at the Groningen Institute for Evolutionary Life Sciences (GELIFES), of the Faculty of Science and Engineering. His research group uses genomics and modeling to understand viral evolution and ecology, focusing on RNA viruses transmitted by insects. Their recent publication explores the FAIRness of Endogenous Viral Element (EVE) datasets and offers recommendations for improving data accessibility in this niche but growing research field. We asked him about his work on viral evolution, the development of the open-source tool detectEVE, and his perspective as Data Editor for the Journal of Evolutionary Biology.

Your article “Endogenous viral elements: insights into data availability and accessibility” assesses the availability and accessibility of Endogenous Viral Element (EVE) datasets across different databases. What were the key findings regarding how well these datasets adhere to the FAIR principles?

Our paper discusses challenges around EVEs: these are pieces of viral genomic sequences integrated into their hosts' genome. While there have been over 20 years of research on them, it's a relatively narrow field that has its own set of challenges: for example, genomic databases only allow one type of taxonomic classification (e.g., that sequence is a human gene), but for EVEs it can be complicated. Should it be classified as viral (derived from some viruses), or as host (found as part of the host's genome)? Both are valid, but how do we ensure that people interested in the viral or the host side can still retrieve the relevant data?

For now, there are missing spots we outline in the paper, and we try to provide some solutions, such as a standardized bioinformatic tool for EVE detection or improving the integration of EVE in common genomic public databases by allowing dual classification (host and virus). Some of the recommendations can be addressed immediately, but some would need more discussions and changes in important databases, which might be more challenging. We even considered starting our own database, but funding such initiatives is notoriously tricky.

"I've seen authors argue that they adhere to FAIR/open science principles, but somehow, they do not apply to their particular study."
You emphasize the need for better metadata and consistent data standards. In your opinion, how can standardized practices in metadata help foster interdisciplinary collaboration between different fields of research?

I think for a relatively 'niche' field like EVE research, this is especially important. Most people come from different backgrounds and are influenced by differences in practice depending on their field. My co-authors and I think this is the moment for EVE research to start improving data sharing, primarily through consistent (meta)data standards. We have some experience now, so we know the challenges without being too late to implement change. We propose some elements in our paper but do not have perfect solutions. It is up to the field to discuss it of course, but with this article, we want to start the discussion.

You mention there is currently no field-wide preferred bioinformatic tool for identifying EVEs in genomic datasets. How does your open-source tool detectEVE address this problem?

We developed detectEVE to address this lack of tools. Although it is not widely accepted yet, we believe it offers at least a starting point that we and others can improve upon. It's a first step: before, most papers used their own pipeline, which was usually not as easy to implement. The idea of detectEVE started when Nadja Brait, PhD student in my group, had to use my code to screen for EVEs, which I had used previously in other projects. She rightfully identified that, while it worked, we could do better. The work she and Thomas Hackl did made the pipeline much faster and suitable to be shared with a broader community. Win-win!

"We look forward to the use of detectEVE by the wider community and, hopefully, to its tweaking and improvement by others."
As Data Editor for the Journal of Evolutionary Biology (JEB), you check authors’ compliance with open data policies. What are the most frequent issues you encounter in submitted data archives? What role(s) can journals take to make data-sharing and open science easier?

We opted for a very strict policy on sharing raw data, considering that anything that is digital (picture, video, sound recordings) and has been used in the context of the published work should be shared. This is now the norm in genomics, but specific fields are not used to it, and we encountered some resistance, especially at the beginning.

I believe journals, especially society journals like JEB, can enforce adherence to open science principles and help the field move forward. Ideally, we shouldn't need a data editor like me to check authors' compliance, but there is sometimes a gap between the principles and the practice. I've seen authors argue that they adhere to FAIR/open science principles, but somehow, they do not apply to their particular study.

"By interacting with Open Science initiatives, you can participate in the life of your field and shape future practices."
You work with DataSeer, an AI-driven tool that helps identify and verify research data. How do you see AI improving data management and compliance in the future?

I must say that it would be impossible for me to screen every paper the journal accepts (about 100 a year) without DataSeer. I am not an expert in all subfields published by JEB, and it would be particularly hard for me to quickly identify missing elements or datasets we would have expected based on the text. DataSeer is a time saver, but, like any AI tool, I wouldn't use it by itself. I also quickly check the paper myself, primarily to double-check DataSeer's points, and provide guidelines and help to the authors. I do, however, believe that tools like DataSeer could be used by authors before submission, to ensure their data repository is complete.

What advice would you give to researchers who want to improve their data management practices to align with FAIR principles and contribute to a more open scientific community?

Start early! I know it is easier said than done, and I am guilty of it myself, but good data/code management and compliance with journals' requirements are not that complicated. Still, it can become troublesome and annoying if you have to do it years after the start of your project, right before your paper can be published.

In addition, keep track of your field's developments and initiatives. In ecology and evolution, the Society for Open, Reliable, and Transparent Ecology and Evolutionary Biology (SORTEE) provides resources and a fantastic community for discussing best practices and receiving training. By interacting with these initiatives, you can participate in the life of your field and shape future practices.

Useful links

The UG’s Digital Competence Centre supports UG researchers throughout the entire research (data) life cycle, from grant proposal to FAIR data archiving.

Citation

Muriel Ritsch, Nadja Brait, Erin Harvey, Manja Marz, Sebastian Lequime, Endogenous viral elements: insights into data availability and accessibility, Virus Evolution, Volume 10, Issue 1, 2024, veae099, https://doi.org/10.1093/ve/veae099

About the author

Alba Soares Capellas

Communications Officer at the UG Digital Competence Centre (UG DCC)
