PreFer: a data challenge for predicting fertility outcomes in the Netherlands.

Elizaveta Sivak (Faculty of Behavioral and Social Sciences), Gert Stulp (RUG), Malvina Nissim (RUG), Paulina Pankowska (UU), Adriënne Mendrik (Eyra), Tom Emery (ODISSEI), Javier Garcia-Bernardo (UU), Seyit Höcük (Centerdata), Kasia Karpinska (ODISSEI), Angelica Maineri (ODISSEI), Joris Mulder (Centerdata)

Open Research objectives/practices

PreFer is a data challenge aimed at measuring the predictability of fertility outcomes in the Netherlands based on survey data (the LISS panel) and data from the Dutch population registers. PreFer is built on the principles of open science:

We provide permanent links to the raw datasets and describe how to get access to them.
All submissions to the challenge (i.e. code for data preprocessing and model training, trained models, and descriptions of models), along with the performance scores, will be shared via DataverseNL and the PreFer website.
Code for preparing the PreFer datasets and machine-readable codebooks based on the raw data, and for analysing the submissions, will be made openly available via the university-hosted PreFer website and on the project page in the LISS data archive.
To make submissions more reproducible, we asked participants to submit a predictive model and code for data preprocessing and model training, rather than predicted values. Eyra developed a submission system that first automatically tested submitted models on ‘fake’ data (which allowed participants to debug their code) and then evaluated submissions on the holdout (test) data while keeping that data in a secure environment. We also provided supporting materials to help participants prepare such a submission.
Papers with the results of PreFer will be published open access.
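
To make the idea of submitting code and a model, rather than predicted values, more concrete, here is a minimal sketch in Python. It is not the PreFer submission template: the function names (preprocess, train, predict, run_submission), the single "age" feature, the toy cutoff model, and the accuracy metric are all assumptions chosen for illustration. It only shows the two stages described above: a smoke test on fake data, followed by scoring on holdout data that the participant never sees.

```python
def preprocess(rows, fill_age):
    """Keep the fields the model uses and fill missing ages with fill_age."""
    return [
        {"id": r["id"], "age": fill_age if r["age"] is None else r["age"]}
        for r in rows
    ]


def train(rows, labels):
    """Fit a toy model: the age cutoff at or below which a birth is predicted."""
    known = [r["age"] for r in rows if r["age"] is not None]
    fill_age = sum(known) / len(known)  # imputation value learned from training data
    clean = preprocess(rows, fill_age)
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({r["age"] for r in clean}):
        correct = sum((r["age"] <= cutoff) == bool(y) for r, y in zip(clean, labels))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return {"cutoff": best_cutoff, "fill_age": fill_age}


def predict(model, rows):
    """Return {id: 0 or 1} for raw rows; preprocessing is part of the submission."""
    clean = preprocess(rows, model["fill_age"])
    return {r["id"]: int(r["age"] <= model["cutoff"]) for r in clean}


def accuracy(predictions, truth):
    """Share of cases predicted correctly (illustrative metric only)."""
    return sum(predictions[i] == y for i, y in truth.items()) / len(truth)


def run_submission(model, fake_rows, holdout_rows, holdout_truth):
    """Stage 1: the code must run on fake data. Stage 2: score on the holdout."""
    fake_predictions = predict(model, fake_rows)
    assert set(fake_predictions.values()) <= {0, 1}
    return accuracy(predict(model, holdout_rows), holdout_truth)


# Hypothetical data, invented for this example.
train_rows = [
    {"id": 1, "age": 25},
    {"id": 2, "age": 31},
    {"id": 3, "age": 44},
    {"id": 4, "age": None},
]
train_labels = [1, 1, 0, 0]
fake_rows = [{"id": 100, "age": None}, {"id": 101, "age": 20}]
holdout_rows = [{"id": 10, "age": 29}, {"id": 11, "age": None}, {"id": 12, "age": 40}]
holdout_truth = {10: 1, 11: 0, 12: 0}

model = train(train_rows, train_labels)
score = run_submission(model, fake_rows, holdout_rows, holdout_truth)
print(model["cutoff"], score)
```

Because predict takes raw rows and applies the preprocessing itself, the same code path runs on the fake data and on the holdout data, which is the property that lets an evaluator run a submission without sharing the holdout data.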

Introduction

The aim of PreFer is to measure how predictable fertility outcomes are, given current methods and data. If fertility outcomes can be predicted accurately, this can help in developing fertility projections and social policies. Analysing and comparing different predictive models can also yield insights into fertility behaviour.

Anyone could participate in the survey-based part of PreFer by developing a model that predicts the outcome: having a(nother) child in 2021-2023. Selected teams that developed the most accurate models also predicted the same outcome using the register data.

Motivation

We use open research practices to increase the scientific impact of PreFer. Making all code and submissions available, combined with attempts to maximize the reproducibility of submissions, allows other researchers to analyse the submissions further after the challenge. These practices also make it possible to develop other predictive models after the data challenge, to test predictability with newly available methods and variants of the data.

Lessons learned

The main lesson is that reproducibility of predictive models is hard even when working code (with set random seeds) is available. Thanks to our submission system, all submissions ran on the holdout data, because they included working code for data preprocessing, a working model, and a file describing the software environment. However, retraining the models (using the same environment and data) sometimes led to different results because of different hardware.
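
One way to check whether a retrained model reproduces an earlier run is to compare a fingerprint of its predictions. The following Python sketch illustrates this idea; it is not part of the PreFer tooling. The function train_and_predict is a hypothetical stand-in for retraining a submission (it only draws seeded random numbers), and rounding to six decimals before hashing is an assumed tolerance.

```python
import hashlib
import random


def train_and_predict(seed, n_cases=20):
    """Hypothetical stand-in for retraining a model and predicting probabilities.

    All randomness comes from a generator created from the seed, so the
    same seed gives the same output on the same machine and Python version.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_cases)]


def fingerprint(predictions, digits=6):
    """Hash the predictions after rounding them to a fixed number of decimals."""
    text = ",".join(f"{p:.{digits}f}" for p in predictions)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def same_result(run_a, run_b, digits=6):
    """True if two runs produce identical predictions at the chosen precision."""
    return fingerprint(run_a, digits) == fingerprint(run_b, digits)


first = train_and_predict(seed=42)
second = train_and_predict(seed=42)   # retrained with the same seed
other = train_and_predict(seed=7)     # retrained with a different seed

reproduced = same_result(first, second)
differs = not same_result(first, other)
print(reproduced, differs)
```

Storing the fingerprint next to the code and the environment file would let someone on another machine see quickly whether retraining gives the same predictions. A mismatch only signals that something differs; it does not identify the cause.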

A positive lesson is that, although submitting code (which is then automatically run on the holdout data on a different machine) is harder for data challenge participants than ‘just’ submitting predicted values from a model developed on a local machine, it is possible with proper instructions, and it makes the code more reproducible.

URLs, references and further information

PreFer website: https://preferdatachallenge.nl/
PreFer page in the LISS data archive: https://doi.org/10.57990/f3ge-3a61

Paper describing the methodology of the data challenge (open access):

Sivak, E., Pankowska, P., Mendrik, A. et al. Combining the strengths of Dutch survey and register data in a data challenge to predict fertility (PreFer). J Comput Soc Sc 7, 1403–1431 (2024). https://doi.org/10.1007/s42001-024-00275-6

Last modified: 11 November 2024 1.26 p.m.