cs.CV
50 papers tagged cs.CV (ordered by heat_score)
Papers
- LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV (2026) Tengfei Liu et al. 13.04
- JLT: Clean-Latent Prediction in Latent Diffusion Transformers (2026) Funing Fu et al. 11.74
- OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration (2026) Xinchen Zhang et al. 11.20
- Recursive Flow Matching (2026) Jiahe Huang et al. 11.02
- Channel-wise Vector Quantization (2026) Wei Song et al. 9.33
- PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance (2025) Qijun Gan et al. 8.70
- Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation (2026) Shuhong Zheng et al. 7.39
- Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models (2026) Yifan Jiang et al. 6.98
- Cross-scale Aligned Supervision for Training GANs (2026) Sangeek Hyun et al. 6.17
- Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions (2026) Antonia Karamolegkou et al. 5.06
- Variance Reduction for Expectations with Diffusion Teachers (2026) Jesse Bettencourt et al. 4.54
- The Abstraction Gap in Vision-Language Causal Reasoning (2026) Chinh Hoang et al. 4.54
- Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations (2026) Safwen Naimi et al. 3.10
- OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning (2026) Mingxin Huang et al. 0.00
- "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models (2026) Jing Gu et al. 0.00
- Learning GUI Grounding with Spatial Reasoning from Visual Feedback (2026) Yu Zhao et al. 0.00
- A Comprehensive Dataset for Human vs. AI Generated Image Detection (2026) Rajarshi Roy et al. 0.00
- Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR (2026) Ziye Yuan et al. 0.00
- VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning (2026) Zi-Yi Jia et al. 0.00
- Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation (2026) Seonghoon Yu et al. 0.00
- Diffusion Large Language Models for Visual Speech Recognition (2026) Jeong Hun Yeo et al. 0.00
- A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging (2021) Keunwoo Choi et al. β
- Acoustic Scene Classification: A Competition Review (2024) Shayan Gharib et al. β
- End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models (2023) Fei Tao and Carlos Busso β
- Learning Embodied Semantics via Music and Dance Semiotic Correlations (2021) Francisco Afonso Raposo and David Martins de Matos and Ricardo Ribeiro β
- Adaptive Fusion Techniques for Multimodal Data (2021) Gaurav Sahu et al. β
- Detecting Adversarial Attacks On Audiovisual Speech Recognition (2021) Pingchuan Ma et al. β
- Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement (2021) Mostafa Sadeghi et al. β
- A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors (2021) Ruobing Zheng et al. β
- Bio-Inspired Modality Fusion for Active Speaker Detection (2021) Gustavo Assunção et al. β
- On the Role of Visual Cues in Audiovisual Speech Enhancement (2021) Zakaria Aldeneh et al. β
- Cross-modal Speaker Verification and Recognition: A Multilingual Perspective (2021) Muhammad Saad Saeed et al. β
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? (2021) Abhinav Shukla et al. β
- End-to-End Lip Synchronisation Based on Pattern Classification (2021) You Jin Kim et al. β
- Deep Sensory Substitution: Noninvasively Enabling Biological Neural Networks to Receive Input from Artificial Neural Networks (2021) Andrew Port et al. β
- Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning (2023) Ruozi Huang et al. β
- Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation (2021) Taras Kucherenko et al. β
- CSLNSpeech: solving extended speech separation problem with the help of Chinese sign language (2023) Jiasong Wu et al. β
- Compact Graph Architecture for Speech Emotion Recognition (2021) A. Shirian et al. β
- Speech Driven Talking Face Generation from a Single Image and an Emotion Condition (2021) Sefik Emre Eskimez et al. β
- Sequence-to-Sequence Predictive Model: From Prosody To Communicative Gestures (2021) Fajrian Yunus et al. β
- Active Contrastive Learning of Audio-Visual Video Representations (2021) Shuang Ma et al. β
- An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments (2022) Shrishti Saha Shetu et al. β
- ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering (2026) Xiang Fang et al. β
- Hierachical Delta-Attention Method for Multimodal Fusion (2022) Kunjal Panchal β
- V3H: View Variation and View Heredity for Incomplete Multi-view Clustering (2026) Xiang Fang et al. β
- Semantic Audio-Visual Navigation (2021) Changan Chen et al. β
- AudioViewer: Learning to Visualize Sounds (2023) Chunjin Song et al. β
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency (2021) Ruohan Gao and Kristen Grauman β
- Piano Skills Assessment (2021) Paritosh Parmar et al. β