Language and AI Colloquium I: 'Building and Evaluating Language Models: From Data to Benchmarks' by Bram Vanroy

When:	Fr 11-10-2024 14:00 - 15:00
Where:	Collaboratory A Harmonie Building 1313.0125

This initiative aims to bring together everyone interested in or working on the intersection of natural language and Artificial Intelligence.

The Language and AI colloquia will take place from October 2024 to May 2025. Each colloquium will be held on a Friday from 14:00 to 15:00. Those who are interested can arrange 1-on-1 meetings with the speakers.

Bram Vanroy

The speaker for the first Language and AI Colloquium is Bram Vanroy from KU Leuven, with the talk 'Building and Evaluating Language Models: From Data to Benchmarks'.

'Large language models' (LLMs) and 'AI' are the buzzwords of the day, so much so that even "transformer" has made its way into the odd casual conversation. But what actually goes into these models? And how do we know if one model is truly better than the last?

This talk focuses on what happens before and after the training phase of LLMs: the craft of creating reliable datasets and of assessing a model’s performance. We'll dive into the data pipelines responsible for creating high-quality datasets for the key stages of model development: pretraining (next-word prediction), supervised finetuning (chat/instruction), and preference tuning (alignment). We will discuss techniques such as web crawling, quality filtering, and synthetic data generation and scoring. Such data processing is gaining more and more attention; after all, if we put garbage into a model, it will spit it back out. You’ll also learn how model performance is evaluated across a variety of benchmarks, from straightforward question-answering to assessments of "emotional intelligence" and crowd-sourced user evaluation.

Full calendar

11 October 2024	I: Building and Evaluating Language Models: From Data to Benchmarks. Bram Vanroy, KU Leuven (Collaboratory room A - Faculty of Arts). Presentation: 'Building and Evaluating LLMs' (PDF)
29 November 2024	II: Identification of clinical disease trajectories in neurodegenerative disorders with natural language processing. Inge Holtman, UMCG Groningen (House of Connections - 1st floor)
24 January 2025	III: Shall AI Compare Thee to a Summer’s Day? An Exploration of Creative Mechanisms in Large Language Models. Tim van de Cruys, KU Leuven (House of Connections - 1st floor). Presentation: 'Shall AI Compare Thee to a Summer’s Day?' (PDF)
28 March 2025	IV: Experiments on the intersection of texts and structured data: combining language technology and semantic web for digital humanities research. Marieke van Erp, KNAW Humanities Cluster (Collaboratory room A - Faculty of Arts). Presentation: 'Experiments on the intersection of texts and structured data' (PDF)
23 May 2025	V: Towards Evidence-Based Fact-Checking for Real-World Claims. Max Glockner, TU Darmstadt (House of Connections - 1st floor). Presentation: 'Towards Evidence-Based Fact-Checking for Real-World Claims' (PDF)
