Safe Policy Improvement based on epsilon-bisimulation

Authors: Yuan Zhuang
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 395-408
Keywords: Batch Reinforcement Learning, Safe Policy Improvement, ε-bisimulation

Abstract

This paper studies the Safe Policy Improvement (SPI) problem in Batch Reinforcement Learning (Batch RL), which aims to train a policy from a fixed dataset without environment interaction, while ensuring its performance is no worse than that of the baseline policy used for data collection. Most existing SPI methods impose constraints on training, but these constraints often make training overly conservative, especially in complex environments where satisfying them requires large amounts of data. Meanwhile, ε-bisimulation, a general state abstraction technique, has been widely used to enhance sample efficiency in reinforcement learning (RL). However, applying ε-bisimulation transforms the original dataset into one over abstracted observations, which typically violates the assumption of independent and identically distributed (i.i.d.) samples required by existing SPI methods. To address this limitation, this paper proposes a constraint for policy learning that incorporates ε-bisimulation to improve sample efficiency while ensuring the policy's performance.
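For context, the safety notion standard in the SPI literature (stated here as background rather than quoted from the paper) requires that the learned policy $\pi$ be, with high probability, approximately no worse than the baseline policy $\pi_b$:

\[ \Pr\big( \rho(\pi) \ \ge\ \rho(\pi_b) - \zeta \big) \ \ge\ 1 - \delta, \]

where $\rho(\cdot)$ denotes expected return, $\zeta \ge 0$ is the admissible performance loss, and $1 - \delta$ is the confidence level. Constraint-based SPI methods obtain such a guarantee by allowing the learned policy to deviate from $\pi_b$ only where the dataset provides sufficient evidence, which is the source of the conservatism noted above.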
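For readers unfamiliar with the abstraction, one common formalization of ε-bisimulation (a standard definition from the state-abstraction literature; the paper's exact variant may differ) treats states $s$ and $t$ as equivalent when, for every action $a$ and every equivalence class $C$ of the induced partition,

\[ \big| R(s,a) - R(t,a) \big| \le \varepsilon \qquad \text{and} \qquad \big| P(C \mid s,a) - P(C \mid t,a) \big| \le \varepsilon . \]

Aggregating equivalent states through an abstraction map $\varphi$ turns each logged transition $(s, a, r, s')$ into $(\varphi(s), a, r, \varphi(s'))$; because many distinct states collapse onto the same abstract observation, the transformed samples are no longer i.i.d. draws from a single abstract MDP, which is precisely the obstacle for existing SPI methods that the proposed constraint addresses.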