Digital Competence Centre
your one-stop for research IT and data

Shaping Open Science and data management: an interview with Joke Bakker

03 December 2024
Dr. Joke Bakker

Dr. Joke Bakker is a dedicated data steward at GELIFES (Groningen Institute for Evolutionary Life Sciences), where she has been instrumental in shaping the institute's data management policies and practices. In this interview, she tells us about her pioneering experience.

She began at the University of Groningen as a PhD student in Theoretical Biology. After completing her thesis, she transitioned to a role as a software developer, supporting researchers and courses in theoretical biology. Joke quickly expanded her scope to include various IT projects within the former CEES institute, such as maintaining websites and working with platforms like TikiWiki.

In 2011, following the Stapel scientific integrity scandal, she was tasked with spearheading an institute-wide data archiving initiative. This included drafting a comprehensive data management policy, developing procedures, and establishing centralized data storage facilities. Since then, she has remained deeply involved in data management, continuing to refine these systems as CEES merged with CBN to form GELIFES in 2015.

You’re one of the first to help researchers manage their data. Looking back, could you reflect on how your work has changed over time? What milestones have been achieved and what is still to come?

We started with a very simple data repository in a web-hosted TikiWiki environment, which worked well enough because archives were not that large at the time. However, with the increasing size of research datasets due to improved data collection and processing techniques, such as real-time video recording, this was not a viable option in the long term. So I discussed the options with the CIT and, with the help of Alex Pothaar, we were able to start a small pilot of an iRODS-based storage system, which eventually grew into the current Research Data Management System (RDMS). That is certainly a milestone.

It is not so much solving the technical issues, but careful human management that is most important for success.

Another important milestone is the recent agreements on open access publishing and open science that the scientific community has committed to. This has brought data management into the focus of funding agencies and major publishers, who now require compliance with these policies. This helps a lot in upholding local data management procedures.

But both then and now, it is not so much solving the technical issues, but careful human management that is most important for success. Although support and facilities have improved a lot, proper data management is still a time-consuming process for most researchers, and time is a precious currency. So further automation and integration of the complete data flow, from data management planning, data collection, storage and archiving to data publication and sharing, would be the way to go to make life easier for researchers.

It is important that students learn how to do proper data management, especially for those who pursue a career in science.

You initiated a pilot within GELIFES in which PhD students were asked to archive their research data before graduation. Could you tell us more about the pilot and the reasons for starting it?

PhD students have always been required to archive their data, but very often this did not happen before their defence, and it turned out to be very difficult to collect these archives once they had left for their home countries or their next job. So on a number of occasions, valuable data produced in a UG project was no longer available to the research group, because the outgoing PhD had taken all her data with her.

The current pilot was set up in collaboration with the Graduate School and our PhD coordinator. When a PhD student uploads their manuscript, the Promotion Office informs me, as the data steward of the institute, and waits for my confirmation that the data archive has also been uploaded correctly before continuing the procedure, that is, notifying the Reading Committee and reserving a provisional defence date. The PhD student has one week to comply before running into a real problem… and this procedure works very well.

Most PhD students now upload their data archive at the same time as, or even before, uploading their manuscript, and so far, nobody has exceeded the one-week deadline. To me, this clearly shows that the earlier lack of timely responses was not due to PhD students being unwilling to comply, but foremost the result of a combination of inadequate time management and a high workload, in which mandatory procedures with consequences take priority over those without.

I would really like to see a university-wide policy including obligatory courses at all levels, starting with the introductory bachelor courses, and repeated in the master and PhD introductory trajectories.

The UG is committed to stimulating Open Science principles among our academic community. Researchers are asked to make their research data and software FAIR ('as open as possible, as closed as necessary'). What do you see as the main challenges?

Communication and education. Many researchers do not yet really grasp the idea of FAIR, and due to an already very heavy workload, they commonly don't have time to attend workshops. And if you don't have a clear idea yourself, it is difficult to instruct your students, so data management still tends to be skipped in intake meetings with starting PhD and master students.

Again, the students are willing to do things correctly, but they need clear instructions. I would really like to see a university-wide policy including obligatory courses at all levels, starting with the introductory bachelor courses, and repeated in the master and PhD introductory trajectories. It is important that students learn how to do proper data management, especially for those who pursue a career in science.

These datasets support a published paper; they are static on purpose, and should not be changed over time.

GELIFES published 108 datasets in DataverseNL at the time of this interview. How do you ensure the quality, usability, and reproducibility of these data over time?

At the moment, DataverseNL is mostly used to comply with the requirements of publishers to make the data underlying a published paper publicly available. This commonly includes rather specific datasets and code that has been tailored to a particular study. We make sure that these datasets include everything to reproduce the results in the paper, that the uploaded files are readable, and that a dataset provides additional metadata with instructions on how to do this. 

Because these datasets support a published paper they are static on purpose, and should not be changed over time (barring obvious errors that come to light after publication, in which case we publish a new, corrected version of the dataset). The scientific quality of the data is foremost the responsibility of the authors, and will be checked by the referees judging the manuscript. The preview option in DataverseNL is a great help in ensuring easy implementation of such feedback during the submission process.

Time is a precious currency

Part of GELIFES’ mission is to inform society about its findings. Can you give examples of how open data from GELIFES has been used by external stakeholders or the public?

I don’t have actual numbers here; developing metrics to monitor this was in fact one of the recommendations arising from last year’s FSE research review. In our DataverseNL, the oldest published dataset, from 2016, has been downloaded 67 times, and the latest, published last week, has been downloaded five times already; but these numbers are likely very subject-dependent. In Pure, we have 970 datasets registered, but unfortunately there are no download statistics available for datasets, since Pure only stores metadata.

In general, I think that (re)use of our datasets mostly happens in the context of joint projects, e.g., the seagrass restoration project of Laura Govers. That was initially carried out in cooperation with external parties such as Rijkswaterstaat, and is now continued by Rijkswaterstaat using the results from the earlier joint project. Another example is the addition of project data to large public databases such as Lifelines for biomedical data, GenBank for DNA sequences, or Movebank for animal tracking data.

