Colloquium Computer Science - Harm de Vries, ServiceNow
When: Fr 17-03-2023 16:00 - 17:00
Where: 5161.0222 Bernoulliborg
Title: BigCode: open and responsible development of large language models for code
Abstract:
In this talk, I’ll cover the recent progress of the BigCode project, an open scientific collaboration working on the responsible development of Large Language Models (LLMs) for code. Code LLMs can increase developer productivity by completing code snippets from both natural language instructions and other code fragments. I’ll discuss how we created The Stack, a large dataset for training code LLMs, and address some of its legal, ethical, and governance concerns, including (i) how to let developers opt their code repositories out of the training data and (ii) how to give proper attribution when the model generates verbatim copies of other people’s code. Finally, I’ll go over the lessons learned from our first model, SantaCoder, a 1.1B-parameter model trained on Java, JavaScript, and Python. SantaCoder outperforms other open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the MultiPL-E benchmark, despite being substantially smaller.