Reading The Mood Behind Words: Integrating Prosody-derived Emotional Context Into Socially Responsive VR Agents
2026 Β· Sangyeop Jeong, Yeongseo Na, Seung Gyu Jeong, et al.
Abstract
In VR interactions with embodied conversational agents, users' emotional intent is often conveyed more by how something is said than by what is said. However, most VR agent pipelines rely on speech-to-text processing, discarding prosodic cues and often producing emotionally incongruent responses despite correct semantics. We propose an emotion-context-aware VR interaction pipeline that treats vocal emotion as explicit dialogue context in an LLM-based conversational agent. A real-time speech emotion recognition model infers users' emotional states from prosody, and the resulting emotion labels are injected into the agent's dialogue context to shape response tone and style. Results from a within-subjects VR study (N=30) show significant improvements in dialogue quality, naturalness, engagement, rapport, and human-likeness, with 93.3% of participants preferring the emotion-aware agent.
Authors
(none)
Tags
Stats
Related papers
- Advancing User-voice Interaction: Exploring Emotion-aware Voice Assistants Through A Role-swapping Approach (2025)6.77
- Av-emodialog: Chat With Audio-visual Users Leveraging Emotional Cues (2024)0.00
- Agent-based Modular Learning For Multimodal Emotion Recognition In Human-agent Systems (2025)0.00
- PAVITS: Exploring Prosody-aware VITS For End-to-end Emotional Voice Conversion (2024)8.35
- Mixed-evc: Mixed Emotion Synthesis And Control In Voice Conversion (2022)4.52
- Emosphere-tts: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech (2024)10.35
- Making Social Platforms Accessible: Emotion-aware Speech Generation With Integrated Text Analysis (2024)4.52
- Emotional Dimension Control In Language Model-based Text-to-speech: Spanning A Broad Spectrum Of Human Emotions (2024)0.00