PreFer: a data challenge for predicting fertility outcomes in the Netherlands.
Open Research objectives/practices
PreFer is a data challenge aimed at measuring the predictability of fertility outcomes in the Netherlands based on survey data (the LISS panel) and data from the Dutch population registers. PreFer is built on the principles of open science:
- We provide permanent links to the raw datasets and describe how to get access to them
- All submissions to the challenge (i.e. code for data preprocessing and model training, trained models, and descriptions of models) along with the performance scores will be shared via DataverseNL and the PreFer website
- Code for preparing the PreFer datasets and machine-readable codebooks from the raw data, and code for analysing the submissions, will be made openly available via the university-hosted PreFer website and on the project page in the LISS data archive
- To make submissions more reproducible, we asked participants to submit a predictive model together with the code for data preprocessing and model training, rather than predicted values. Eyra developed a submission system that first automatically tested each submitted model on ‘fake’ data (which allowed participants to debug their code) and then evaluated it on the holdout (test) data, keeping that data in a secure environment throughout. We also provided supporting materials to help participants prepare such a submission
- Papers with the results of PreFer will be published open access
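The code-based submission format described in the list above can be pictured with a minimal sketch. Everything in it is hypothetical: the function names (`clean_data`, `train_model`, `predict_outcomes`, `evaluate`), the single `age` feature, the toy cutoff model, and the data are illustrative assumptions, not the actual PreFer submission template or Eyra's system.

```python
def clean_data(rows):
    """Preprocessing step: keep id and age, filling missing ages with the mean."""
    known = [r["age"] for r in rows if r.get("age") is not None]
    fallback = sum(known) / len(known) if known else 0.0
    return [
        {"id": r["id"], "age": r["age"] if r.get("age") is not None else fallback}
        for r in rows
    ]


def train_model(rows, outcomes):
    """Training step: pick the age cutoff with the highest training accuracy."""
    cleaned = clean_data(rows)
    best_cutoff, best_acc = None, -1.0
    for cutoff in sorted({r["age"] for r in cleaned}):
        hits = sum(int(r["age"] <= cutoff) == outcomes[r["id"]] for r in cleaned)
        acc = hits / len(cleaned)
        if acc > best_acc:
            best_cutoff, best_acc = cutoff, acc
    return {"cutoff": best_cutoff}


def predict_outcomes(rows, model):
    """Prediction step: runs on whatever data the organisers supply."""
    return {r["id"]: int(r["age"] <= model["cutoff"]) for r in clean_data(rows)}


def evaluate(predict, model, fake_rows, holdout_rows, holdout_outcomes):
    """Organiser side: smoke-test on fake data, then score on the holdout set."""
    predict(fake_rows, model)  # raises here if the submitted code is broken
    preds = predict(holdout_rows, model)
    hits = sum(preds[i] == y for i, y in holdout_outcomes.items())
    return hits / len(holdout_outcomes)


# Hypothetical toy data, invented for this sketch.
train_rows = [
    {"id": 1, "age": 25}, {"id": 2, "age": 30},
    {"id": 3, "age": 40}, {"id": 4, "age": 45},
]
train_outcomes = {1: 1, 2: 1, 3: 0, 4: 0}
fake_rows = [{"id": 90, "age": 33}, {"id": 91, "age": None}]
holdout_rows = [{"id": 10, "age": 28}, {"id": 11, "age": 50}, {"id": 12, "age": None}]
holdout_outcomes = {10: 1, 11: 0, 12: 1}

model = train_model(train_rows, train_outcomes)
accuracy = evaluate(predict_outcomes, model, fake_rows, holdout_rows, holdout_outcomes)
```

The design point is that participants hand over functions rather than a file of predictions, so the organisers can run them on data the participants never see.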
Introduction
The aim of PreFer is to measure how predictable fertility outcomes are, given current methods and data. If fertility outcomes can be predicted accurately, this can help in developing fertility projections and social policies. Analysing and comparing different predictive models can also yield insights into fertility behaviour.
Anyone could participate in the survey-based part of PreFer by developing a model that predicts the outcome: having a(nother) child in 2021-2023. Selected teams that developed the most accurate models also predicted the same outcome using the register data.
Motivation
We use open research practices to increase the scientific impact of PreFer. Making all code and submissions available, combined with our efforts to maximize the reproducibility of submissions, allows other researchers to analyse the submissions further after the challenge. These practices also make it possible to develop other predictive models after the data challenge, in order to test predictability with newly available methods and variants of the data.
Lessons learned
The main lesson is that reproducing predictive models is hard even when working code (with fixed random seeds) is available. Thanks to our submission system, all submissions ran on the holdout data, because each included working code for data preprocessing, a working model, and a file describing the software environment. However, retraining the models (using the same environment and data) sometimes led to different results because of differences in hardware.
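A small sketch can illustrate why a fixed seed is not the whole story. It shows one well-known mechanism by which results can differ between machines: floating-point addition is not associative, so the same numbers summed in a different order (as can happen with different hardware or parallelism) may give a slightly different result. This is an assumption about a possible cause in general, not a diagnosis of what happened in PreFer; `train_score` and `scores_agree` are hypothetical names.

```python
import math
import random


def train_score(seed):
    """Stand-in for a training run: a fixed seed gives the same random draws."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(weights) / len(weights)


# Same seed, same machine, same code: the result repeats exactly.
same_seed_matches = train_score(42) == train_score(42)

# Same numbers, different order of operations: the result can differ.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left


def scores_agree(a, b, rel_tol=1e-9):
    """Compare two scores with a tolerance instead of exact equality."""
    return math.isclose(a, b, rel_tol=rel_tol)
```

When checking a retrained model against the original, a tolerance-based comparison such as `scores_agree` is therefore often more informative than testing for exact equality.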
A positive lesson is that, although submitting code (which is then run automatically on the holdout data on a different machine) is harder for data challenge participants than ‘just’ submitting predicted values from a model developed on a local machine, it is possible with proper instructions, and it makes the code more reproducible.
URLs, references and further information
- PreFer website: https://preferdatachallenge.nl/
- PreFer page in the LISS data archive: https://doi.org/10.57990/f3ge-3a61
Paper describing the methodology of the data challenge (open access):
- Sivak, E., Pankowska, P., Mendrik, A. et al. Combining the strengths of Dutch survey and register data in a data challenge to predict fertility (PreFer). J Comput Soc Sc 7, 1403–1431 (2024). https://doi.org/10.1007/s42001-024-00275-6
Last modified: 11 November 2024 1.26 p.m.