A Tale Of Two Languages: Large-vocabulary Continuous Sign Language Recognition From Spoken Language Supervision
2024 · Charles Raude, K R Prajwal, Liliane Momeni, et al.
Abstract
In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR) and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs its representation in a joint embedding space shared by signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new, manually collected dataset annotations. These provide continuous sign-level annotations for six hours of test videos and will be made publicly available. We demonstrate that, with a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance: retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo
Related papers
- Cico: Domain-aware Sign Language Retrieval Via Cross-lingual Contrastive Learning (2023)
- Sign Language Video Retrieval With Free-form Textual Queries (2022)
- Large Language Models For Captioning And Retrieving Remote Sensing Images (2024)
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025)
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022)
- Learning To Scale Multilingual Representations For Vision-language Tasks (2020)
- 12-in-1: Multi-task Vision And Language Representation Learning (2019)
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022)