Abstract

The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, allowing models to solve complex tasks stepwise. However, training CoT capabilities requires detailed reasoning data, which is often scarce. The self-taught reasoner (STaR) framework addresses this by using reinforcement learning to automatically generate reasoning steps, reducing reliance on human-labeled data. Although STaR and its variants have demonstrated empirical success, a theoretical foundation explaining these improvements is lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR. Our contributions are: (1) criteria for the quality of pre-trained models necessary to initiate effective reasoning improvement; (2) an analysis of policy improvement, showing why LLM reasoning improves iteratively with STaR; (3) conditions for convergence to an optimal reasoning policy; and (4) an examinatio

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keychang2024rl

Related papers