Abstract

We present the first gap-dependent analysis of regret and communication cost for on-policy federated \(Q\)-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to \(\sqrt\{T\}\)-type regret bounds and communication cost bounds with a \(log T\) term scaling with the number of agents \(M\), states \(S\), and actions \(A\), where \(T\) is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a \(log T\)-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on \(MSA\) from the \(log T\) term. Notably, our gap-dependent communication cost bound also yields a better global switching cost whe

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyzhang2025gap

Related papers