Certrl: Formalizing Convergence Proofs For Value And Policy Iteration In Coq
2020 Β· Koundinya Vajjha, Avraham Shinnar, Vasily Pestun, et al.
Abstract
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy. This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of Bellman's optimality principle and its proof, which uses a contraction property of Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the
Authors
(none)
Tags
Stats
Related papers
- On The Convergence Of Policy Iteration-based Reinforcement Learning With Monte Carlo Policy Evaluation (2023)0.00
- Convex Programs And Lyapunov Functions For Reinforcement Learning: A Unified Perspective On The Analysis Of Value-based Methods (2022)2.26
- Policy Bifurcation In Safe Reinforcement Learning (2024)0.00
- Concurrent Learning Of Policy And Unknown Safety Constraints In Reinforcement Learning (2024)0.00
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)0.00
- Joint Learning Of Policy With Unknown Temporal Constraints For Safe Reinforcement Learning (2023)0.00
- CRPO: A New Approach For Safe Reinforcement Learning With Convergence Guarantee (2020)0.00
- Policy Certificates: Towards Accountable Reinforcement Learning (2018)0.00