Awesome Papers

Papers

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV (2026)
Tengfei Liu et al.
13.04
JLT: Clean-Latent Prediction in Latent Diffusion Transformers (2026)
Funing Fu et al.
11.74
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration (2026)
Xinchen Zhang et al.
11.20
Recursive Flow Matching (2026)
Jiahe Huang et al.
11.02
Channel-wise Vector Quantization (2026)
Wei Song et al.
9.33
PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance (2025)
Qijun Gan et al.
8.70
Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation (2026)
Shuhong Zheng et al.
7.39
Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models (2026)
Yifan Jiang et al.
6.98
Cross-scale Aligned Supervision for Training GANs (2026)
Sangeek Hyun et al.
6.17
Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions (2026)
Antonia Karamolegkou et al.
5.06
Variance Reduction for Expectations with Diffusion Teachers (2026)
Jesse Bettencourt et al.
4.54
The Abstraction Gap in Vision-Language Causal Reasoning (2026)
Chinh Hoang et al.
4.54
Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations (2026)
Safwen Naimi et al.
3.10
OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning (2026)
Mingxin Huang et al.
0.00
"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models (2026)
Jing Gu et al.
0.00
Learning GUI Grounding with Spatial Reasoning from Visual Feedback (2026)
Yu Zhao et al.
0.00
A Comprehensive Dataset for Human vs. AI Generated Image Detection (2026)
Rajarshi Roy et al.
0.00
Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR (2026)
Ziye Yuan et al.
0.00
VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning (2026)
Zi-Yi Jia et al.
0.00
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation (2026)
Seonghoon Yu et al.
0.00
Diffusion Large Language Models for Visual Speech Recognition (2026)
Jeong Hun Yeo et al.
0.00
A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging (2021)
Keunwoo Choi et al.
—
Acoustic Scene Classification: A Competition Review (2024)
Shayan Gharib et al.
—
End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models (2023)
Fei Tao and Carlos Busso
—
Learning Embodied Semantics via Music and Dance Semiotic Correlations (2021)
Francisco Afonso Raposo and David Martins de Matos and Ricardo Ribeiro
—
Adaptive Fusion Techniques for Multimodal Data (2021)
Gaurav Sahu et al.
—
Detecting Adversarial Attacks On Audiovisual Speech Recognition (2021)
Pingchuan Ma et al.
—
Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement (2021)
Mostafa Sadeghi et al.
—
A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors (2021)
Ruobing Zheng et al.
—
Bio-Inspired Modality Fusion for Active Speaker Detection (2021)
Gustavo Assunção et al.
—
On the Role of Visual Cues in Audiovisual Speech Enhancement (2021)
Zakaria Aldeneh et al.
—
Cross-modal Speaker Verification and Recognition: A Multilingual Perspective (2021)
Muhammad Saad Saeed et al.
—
Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? (2021)
Abhinav Shukla et al.
—
End-to-End Lip Synchronisation Based on Pattern Classification (2021)
You Jin Kim et al.
—
Deep Sensory Substitution: Noninvasively Enabling Biological Neural Networks to Receive Input from Artificial Neural Networks (2021)
Andrew Port et al.
—
Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning (2023)
Ruozi Huang et al.
—
Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation (2021)
Taras Kucherenko et al.
—
CSLNSpeech: solving extended speech separation problem with the help of Chinese sign language (2023)
Jiasong Wu et al.
—
Compact Graph Architecture for Speech Emotion Recognition (2021)
A. Shirian et al.
—
Speech Driven Talking Face Generation from a Single Image and an Emotion Condition (2021)
Sefik Emre Eskimez et al.
—
Sequence-to-Sequence Predictive Model: From Prosody To Communicative Gestures (2021)
Fajrian Yunus et al.
—
Active Contrastive Learning of Audio-Visual Video Representations (2021)
Shuang Ma et al.
—
An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments (2022)
Shrishti Saha Shetu et al.
—
ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering (2026)
Xiang Fang et al.
—
Hierachical Delta-Attention Method for Multimodal Fusion (2022)
Kunjal Panchal
—
V3H: View Variation and View Heredity for Incomplete Multi-view Clustering (2026)
Xiang Fang et al.
—
Semantic Audio-Visual Navigation (2021)
Changan Chen et al.
—
AudioViewer: Learning to Visualize Sounds (2023)
Chunjin Song et al.
—
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency (2021)
Ruohan Gao and Kristen Grauman
—
Piano Skills Assessment (2021)
Paritosh Parmar et al.
—