LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
2024 · Ali Asgarov, Samir Rustamov
Abstract
This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr
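The abstract describes pairing a text encoder with an image encoder in a CLIP-style architecture for retrieval. As a rough illustration, the sketch below shows the symmetric contrastive objective commonly associated with CLIP, followed by cosine-similarity retrieval. It is not the authors' implementation: the abstract does not state the loss or any hyperparameters, so the temperature of 0.07, the function names, and the toy 2-dimensional embeddings are all assumptions made for illustration. In practice the embeddings would come from the text and image encoders rather than hand-written lists.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    Pair k in image_embs is assumed to match pair k in text_embs.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should peak on its diagonal entry
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry
    loss_t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


def retrieve(query_text_emb, image_embs, top_k=1):
    """Return indices of the top_k images ranked by cosine similarity to the query."""
    q = l2_normalize(query_text_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(v))) for v in image_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy example with hypothetical 2-d embeddings
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_contrastive_loss(images, aligned_texts)
swapped_loss = clip_contrastive_loss(images, swapped_texts)
best_match = retrieve([0.9, 0.1], images, top_k=1)
```

Here the loss is small when each caption sits closest to its own image and large when the pairs are swapped, which is the signal that pulls matching image and text embeddings together during training. Retrieval then reduces to ranking images by similarity to the query text embedding.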