Contrastive Language-image Pre-training For The Italian Language
2021 · Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, et al.
Abstract
CLIP (Contrastive Language-Image Pre-training) is a recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial: data in other languages may be insufficient, and the model needs high-quality translations of the texts to guarantee good performance. In this paper, we present the first CLIP model for the Italian language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on image retrieval and zero-shot classification.
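As a rough illustration of the zero-shot classification mentioned in the abstract, the sketch below picks the label whose text embedding is closest, by cosine similarity, to an image embedding. It is a minimal sketch, not the authors' implementation: the function names `normalize` and `zero_shot_classify`, the Italian label prompts, and the 3-dimensional toy vectors are all assumptions standing in for the outputs of real image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, label_embs):
    """Return the best label and the cosine similarity of every label to the image."""
    img = normalize(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(emb)))
        for label, emb in label_embs.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings; a real system would obtain these from the encoders.
label_embs = {
    "un gatto": [0.9, 0.1, 0.0],
    "un cane": [0.1, 0.9, 0.0],
    "una pizza": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

best, scores = zero_shot_classify(image_emb, label_embs)
print(best)
```

No label-specific training is involved: adding a class only requires embedding one more text prompt, which is why the quality of the text encoder in the target language matters.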
Related papers
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024)
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever (2024)
- Clip2video: Mastering Video-text Retrieval Via Image CLIP (2021)
- Mobileclip: Fast Image-text Models Through Multi-modal Reinforced Training (2023)
- Clip-lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization (2025)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)