LAraBench: Benchmarking Arabic AI with Large Language Models
2023 · Ahmed Abdelali, Hamdy Mubarak, Shammur Absar Chowdhury, et al.
Abstract
Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tagging and content classification across different domains. We utilized models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM, employing zero and few-shot learning techniques to tackle 33 distinct tasks across 61 publicly available datasets. This involved 98 experimental setups, encompassing ~296K data points, ~46 hours of speech, and 30 sentences for Text-to-Speech (TTS). This effort resulted in 330+ sets of experiments. Our analysis focused on measuring the performance gap between SOTA models and LLMs. The overarching trend observed was that SOTA models generally outperf
Related papers
- AudioBench: A Universal Benchmark for Audio Large Language Models (2024)
- VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models (2025)
- Recent Advances in Speech Language Models: A Survey (2024)
- Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study (2023)
- VCB Bench: An Evaluation Benchmark for Audio-grounded Large Language Model Conversational Agents (2025)
- Boosting Large Language Model for Speech Synthesis: An Empirical Study (2023)
- S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information (2025)
- Roadmap Towards Superhuman Speech Understanding Using Large Language Models (2024)