Establishing Degrees Of Closeness Between Audio Recordings Along Different Dimensions Using Large-scale Cross-lingual Models
2024 · Maxime Fily, Guillaume Wisniewski, Severine Guillaume, et al.
Abstract
In the highly constrained context of low-resource language studies, we explore vector representations of speech from a pretrained model to determine their level of abstraction with regard to the audio signal. We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata to shed light on the type of information present in the representations. ABX tests determine whether the representations computed by a multilingual speech model encode a given characteristic. Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects. The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines. Embedding more audio signal in one vector better discriminates extra-linguistic characteristics, whereas shorter snippets are better to distinguish segmental information. The method is fully unsupervised, potentially op
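To make the ABX idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each recording (or snippet) has already been reduced to one fixed-size vector, for example by mean-pooling the model's frame-level outputs, and it uses cosine distance. The pooling, the distance, and the names `cosine_distance` and `abx_score` are all assumptions made for illustration. A and X share the characteristic under test and B does not. The score is the fraction of triplets in which X lies closer to A than to B: a score near 0.5 suggests the representations do not encode the characteristic, and a score near 1.0 suggests they do.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two 1-D vectors (assumed non-zero)."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def abx_score(same, other):
    """Fraction of (A, B, X) triplets in which X is closer to A than to B.

    `same`  : array of shape (n, d); A and X are drawn from here (A != X).
    `other` : array of shape (m, d); B is drawn from here.
    Ties count as half a success, so chance level is 0.5.
    """
    correct = 0.0
    total = 0
    for i, a in enumerate(same):
        for j, x in enumerate(same):
            if i == j:
                continue  # A and X must be distinct items
            d_ax = cosine_distance(a, x)
            for b in other:
                d_bx = cosine_distance(b, x)
                if d_ax < d_bx:
                    correct += 1.0
                elif d_ax == d_bx:
                    correct += 0.5
                total += 1
    return correct / total


# Toy vectors standing in for pooled representations (hypothetical data).
same_condition = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]])
other_condition = np.array([[0.0, 1.0], [0.1, 0.9]])
score = abx_score(same_condition, other_condition)
```

In this toy case the two groups point in clearly different directions, so every triplet is resolved correctly. With real representations, the same function would be run once per characteristic of interest (for instance room acoustics, genre, or phonetic content), keeping the other characteristics as constant as the metadata allows.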