LongCat-Next: Lexicalizing Modalities As Discrete Tokens
2026 · Meituan LongCat Team, Bin Xiao, Chao Wang, et al.
Abstract
The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric and often treat non-linguistic modalities as external attachments, which leads to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modal
Related papers
- Multimodal Latent Language Modeling With Next-Token Diffusion (2024)
- NExT-GPT: Any-to-Any Multimodal LLM (2023)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-Supervision Speech Processing (2024)
- MIO: A Foundation Model On Multimodal Tokens (2024)
- OMCAT: Omni Context Aware Transformer (2024)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models With Discrete Flow Matching (2025)
- TEAL: Tokenize And Embed ALL For Multi-modal Large Language Models (2023)
- CACARA: Cross-modal Alignment Leveraging A Text-centric Approach For Cost-effective Multimodal And Multilingual Learning (2025)