PRLL: Policy Regularization and Reward Shaping Assisted by Large Language Models

Authors: Qianxia Zheng, Xiangfeng Luo, and Tao Wang
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 306-320
Keywords: Deep reinforcement learning, Large Language Models, Policy Regularization, Reward Shaping

Abstract

Through continuous exploration and repeated trials, reinforcement learning (RL) enables agents to learn optimal strategies and acquire a certain level of behavioral intelligence. However, in complex and dynamically changing real-world environments, the state and action spaces grow significantly larger, which means agents must explore the environment more extensively to identify viable solutions. Unfortunately, such repetitive and inefficient exploration often leads to increased training time, higher costs, and greater risks. Several methods have emerged that use prior knowledge from large language models (LLMs) to assist RL training, but many of these approaches do not address the issue of low sample efficiency. To address these challenges, we propose Policy Regularization and reward shaping assisted by Large Language models (PRLL). First, PRLL calculates the similarity between LLM-generated suggestions and the agent's actions and uses it as a regularization term to constrain the agent's exploration direction. Second, to efficiently align the agent's behavior with human preferences, PRLL employs LLMs to evaluate how well the agent's actions align with human values and translates this evaluation into an intrinsic reward signal. Experiments in both discrete and continuous action spaces demonstrate that PRLL outperforms most baseline methods while requiring fewer training time steps.
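The abstract names two LLM-assisted components: a similarity-based regularization term on the policy and an LLM-derived intrinsic reward. The sketch below illustrates one plausible instantiation in PyTorch for a discrete action space; the `llm_suggest_action` and `llm_alignment_score` helpers, the coefficients, and the use of cross-entropy as the similarity measure are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the LLM interface; the paper does not specify
# the exact prompting or scoring scheme, so these are stubbed placeholders.
def llm_suggest_action(state_description: str, num_actions: int) -> int:
    """Return the action index the LLM recommends for this state (stub)."""
    return 0

def llm_alignment_score(state_description: str, action: int) -> float:
    """Return a score in [0, 1] for how well the action matches human preferences (stub)."""
    return 1.0

def prll_policy_loss(logits, actions, advantages, states_text, reg_coef=0.1):
    """Policy-gradient loss plus a regularizer that pulls the policy toward
    the LLM-suggested actions (one way to read 'similarity as regularization')."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(chosen_log_probs * advantages).mean()

    # Cross-entropy between the policy and the LLM suggestion penalizes
    # low similarity between the agent's action distribution and the advice.
    suggested = torch.tensor(
        [llm_suggest_action(s, logits.shape[-1]) for s in states_text]
    )
    reg_loss = F.cross_entropy(logits, suggested)
    return pg_loss + reg_coef * reg_loss

def prll_shaped_reward(env_reward, state_text, action, intrinsic_coef=0.05):
    """Reward-shaping half: add an intrinsic reward derived from the LLM's
    alignment judgment to the environment reward."""
    return env_reward + intrinsic_coef * llm_alignment_score(state_text, action)
```

Keeping the regularization and shaping coefficients small lets the environment reward and policy gradient remain the primary learning signal, with the LLM acting only as a soft guide on exploration and preference alignment.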