Interview with Tommaso Caselli on the DALC (Dutch Abusive Language Corpus) dataset
Navigating your way around while working with social media is a challenge for many researchers. In the context of research on hate speech, Dr. Tommaso Caselli, Assistant Professor at the Faculty of Arts, has created a large dataset of Dutch abusive language on X (formerly Twitter): the DALC (Dutch Abusive Language Corpus). We asked him a few questions to learn about his research and his experience in the Natural Language Processing discipline.
Social Media have dark sides
Could you tell us a little bit about your research and the process of creating the DALC dataset?
The growth of Social Media has been accompanied by the promise of making people more connected. While this is undeniably true, Social Media have dark sides: the lack of a physical presence during the interactions and the possibility of concealing one’s identity have contributed to a rise in toxic interactions. To mitigate the impact of this phenomenon, solutions that support automatic content moderation are needed. To develop these tools, language-specific datasets are needed because the way hate or offensive content is expressed differs across communities of speakers (and countries).
Before DALC, there was only one dataset in Dutch, but it was not publicly available – making it very difficult to conduct research on this topic for Dutch. DALC has come into existence to fill a gap in the Dutch language resource panorama: to have a publicly available dataset that covers multiple specific phenomena (e.g., offensive language, abusive language, hate speech) to support the development of content moderation tools for online interactions. On the technical side, DALC has been obtained by scraping messages from X via the old API (before the limitations introduced in early 2023). To mitigate potential bias, we combined multiple techniques rather than just using selected keywords (more details can be found in this publication).
DALC contains messages that have been labeled for offensive or abusive content
The DALC dataset can be found in two different repositories: GitHub and DataverseNL. Why did you choose these two repositories and how is this combination beneficial for your dataset?
For DALC, the use of these repositories is mostly needed to make the dataset public. Let me clarify this: DALC contains messages that have been labeled for offensive or abusive content. The nature of the messages is sensitive, giving rise to contradictory needs: on the one hand, these messages may hurt people and one would want to limit users’ exposure to this content; on the other hand, you want to protect the privacy of those who have produced these messages. On top of this, there are regulations – such as the GDPR and the Terms of Use of the X platform – which one must comply with. A non-trivial constraint associated with the X Terms of Use concerns the release of the collected data to the public: basically, only the IDs of the messages can be made available – meaning that prospective users of DALC must go through a process called “hydration” to retrieve the messages, which in the meantime could have been deleted.
DataverseNL introduces a control on who wants to use the data that is not possible in GitHub
Given these issues, I needed a solution that would allow me to address all these needs – including reproducibility of my own research. While I can use GitHub to release the code I developed and some public information about DALC, DataverseNL makes it possible to release the whole dataset (all annotations and the actual text of the messages). DataverseNL introduces a control on who wants to use the data that is not possible in GitHub. This solution allows DALC to be open and FAIR.
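The “hydration” step mentioned in this answer can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `hydrate` function, the `fetch_message` callback and the toy in-memory store are hypothetical and do not reflect the DALC release scripts or the X API. It only shows the general idea of turning released message IDs back into text and keeping track of the messages that can no longer be retrieved.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def hydrate(
    message_ids: Iterable[str],
    fetch_message: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Look up the text for each released ID.

    `fetch_message` is a hypothetical callback that returns the message
    text, or None when the message is no longer available (e.g. deleted).
    """
    recovered: Dict[str, str] = {}
    missing: List[str] = []
    for message_id in message_ids:
        text = fetch_message(message_id)
        if text is None:
            missing.append(message_id)  # ID was released, text is gone
        else:
            recovered[message_id] = text
    return recovered, missing


# Toy stand-in for the platform: message "102" has been deleted.
toy_store = {"101": "first message", "103": "third message"}

recovered, missing = hydrate(["101", "102", "103"], toy_store.get)
print(len(recovered), "recovered;", len(missing), "no longer available")
```

Reporting the missing IDs separately makes the gap between the released ID list and the text that can still be retrieved explicit, which is the reproducibility problem a controlled full release is meant to avoid.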
How did you experience the challenges posed by the various conflicting pressures between open science, licensing restrictions and privacy concerns in this research? Which workable solutions would you advise to researchers in a similar limbo?
It is undeniably a struggle – especially if you work in Natural Language Processing, a discipline where data is key and the publication rate is extremely fast. The “work fast, break things” approach is not a good approach. It is a cultural approach that was acceptable 10 years ago but must be avoided now. Asking the DCC for support should become the new normal for everyone who works with language data. Another suggestion I want to give is to be creative: creative in proposing solutions for making the data public. I approached the DCC with a solution in mind. My whole approach was “I want to do this. Is it possible? If not, which alternative solutions can we find?”.
Asking the DCC for support should become the new normal for everyone who works with language data.
The DALC dataset is currently being used in the course “Machine Learning Project” (BSc Information Science). Did any organizational measures have to be applied so that students could work with the data? How will the dataset contribute to the learning goals of this course?
I am very happy that it was possible to use DALC for the course. Again, the solution was found by asking the DCC for support. Students were asked to sign a Data Sharing Agreement and were instructed to run all experiments in a protected environment available on a dedicated RUG server. In addition to this, we had the team from the DCC as guest presenters to instruct students on the responsible use of data. DALC is not a toy dataset, thus it requires students to combine the technological aspects of the course with problems concerning the responsible use of data. And this is perfect for understanding what it means to learn to “conduct a research project independently”.
You had the support of several teams within the UG to facilitate your research. Considering your experience, what infrastructure and services would most benefit your research field?
An aspect that made the difference in my experience was the human factor: I could discuss my needs and my ideas with people who know what solutions could be adopted. I think that making available regular consultation hours that people can book to schedule initial meetings would be very helpful.
Last modified: 04 October 2024 12.17 p.m.