RLHF & Alignment
24 papers tagged RLHF & Alignment (ordered by heat_score)
Papers
- Spectral Souping: A Unified Framework for Online Preference Alignment (2026). Yinlam Chow et al. Heat score: 4.54
- PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment (2026). Richa Verma et al. Heat score: 4.54
- Emergent Cooperation Under Uncertain Incentive Alignment (2024). Nicole Orzan, Erman Acar, Davide Grossi, et al. Heat score: 2.26
- Explaining And Preventing Alignment Collapse In Iterative RLHF (2026). Etienne Gauthier, Francis Bach, Michael I. Jordan. Heat score: 0.00
- Policy-value Alignment And Robustness In Search-based Multi-agent Learning (2023). Niko A. Grupen, Michael Hanlon, Alexis Hao, et al. Heat score: 0.00
- The Alignment Ceiling: Objective Mismatch In Reinforcement Learning From Human Feedback (2023). Nathan Lambert, Roberto Calandra. Heat score: 0.00
- Scalable Agent Alignment Via Reward Modeling: A Research Direction (2018). Jan Leike, David Krueger, Tom Everitt, et al. Heat score: 0.00
- Learnalign: Data Selection For LLM Reinforcement Learning With Improved Gradient Alignment (2026). Shipeng Li, Zhiqin Yang, Shikun Li, et al. Heat score: 0.00
- Vision-based Generic Potential Function For Policy Alignment In Multi-agent Reinforcement Learning (2025). Hao Ma, Shijie Wang, Zhiqiang Pu, et al. Heat score: 0.00
- SAFE: Stable Alignment Finetuning With Entropy-aware Predictive Control For Reinforcement Learning From Human Feedback (RLHF) (2026). Dipan Maity. Heat score: 0.00
- Dynamic Policy Fusion For User Alignment Without Re-interaction (2024). Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana. Heat score: 0.00
- Double Check My Desired Return: Transformer With Target Alignment For Offline Reinforcement Learning (2025). Yue Pei, Hongming Zhang, Chao Gao, et al. Heat score: 0.00
- Meta-aligner: Bidirectional Preference-policy Optimization For Multi-objective LLMs Alignment (2026). Wenzhe Xu, Biao Liu, Yiyang Sun, et al. Heat score: 0.00
- PAGAR: Taming Reward Misalignment In Inverse Reinforcement Learning-based Imitation Learning With Protagonist Antagonist Guided Adversarial Reward (2023). Weichao Zhou, Wenchao Li. Heat score: 0.00
- Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli (2026). Subba Reddy Oota et al. β
- Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations (2026). Haozheng Luo et al. β
- Interoceptive Divergence in Aesthetic Evaluation and Implications for Human-AI Alignment (2026). Yoshia Abe et al. β
- Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay (2026). Subba Reddy Oota et al. β
- Measuring Safety Alignment Effects in Autonomous Security Agents (2026). Isaac David et al. β
- Stitched Value Model for Diffusion Alignment (2026). Hyojun Go et al. β
- Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment (2026). Ken Nakamura et al. β
- Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment (2026). Zhiqin Yang et al. β
- Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment (2026). Shuaida He et al. β
- Convex Optimization for Alignment and Preference Learning on a Single GPU (2026). Miria Feng et al. β