RLZero: Direct Policy Inference From Language Without In-domain Supervision
2024 · Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, et al.
Abstract
The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. In practice, however, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches require either costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent, trained using only unlabeled, offline interactions and without task-specific supervision or labeled trajectories, to perform zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using v
Related papers
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)
- Response-level Rewards Are All You Need For Online Reinforcement Learning In LLMs: A Mathematical Perspective (2025)
- "So, Tell Me About Your Policy...": Distillation Of Interpretable Policies From Deep Reinforcement Learning Agents (2025)
- Co-evolution Of Policy And Internal Reward For Language Agents (2026)
- Rewriting History With Inverse RL: Hindsight Inference For Policy Improvement (2020)
- Reward-conditioned Policies (2019)
- Replacing Rewards With Examples: Example-based Policy Search Via Recursive Classification (2021)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)