Trained On 100 Million Words And Still In Shape: BERT Meets British National Corpus

David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal · Findings of the Association for Computational Linguistics: EACL 2023 · 2023

While modern masked language models (LMs) are trained on ever larger corpora, here we explore the effects of down-scaling training to a modestly sized but representative, well-balanced, and publicly available English text source: the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpus has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible, and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
