Phi-4-Mini Technical Report: Compact Yet Powerful Multimodal Language Models Via Mixture-of-LoRAs
2025 · Microsoft, Abdelrahman Abouelenin, et al.
Abstract
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining vari
Related papers
- Towards Multi-modal Mastery: A 4.5B Parameter Truly Multi-modal Small Language Model (2024) 2.26
- MMMModal -- Multi-images Multi-audio Multi-turn Multi-modal (2024) 0.00
- Mixture-of-Transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024) 0.00
- Putting GPT-4o To The Sword: A Comprehensive Evaluation Of Language, Vision, Speech, And Multimodal Proficiency (2024) 0.00
- MIO: A Foundation Model On Multimodal Tokens (2024) 3.58
- Macaw-LLM: Multi-modal Language Modeling With Image, Audio, Video, And Text Integration (2023) 0.00
- Multimodal Large Language Models: A Survey (2023) 0.00
- Capybara-OMNI: An Efficient Paradigm For Building Omni-modal Language Models (2025) 0.00