Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback. Unlike pure text data, large-scale decision-making data is difficult to collect. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address these limitations, we propose a framework that automatically learns a reward model from the environment without human annotations. This model can evaluate the action trajectories of LLM agents and provide heuristics for task planning. The reward model can be integrated with LLM-based agents and various planning algorithms to enhance task-solving performance, potentially revolutionizing the application of LLMs in complex and interactive environments. The effectiveness and generalizability of our framework are demonstrated through evaluations on diverse agent benchmarks, including online shopping, scientific reasoning, mathematical problem solving, household tasks, and clinical scenarios.
The pipeline of our ARMAP framework. (1) We first generate initial task instructions using LLMs with in-context learning, and sample trajectories aligned with these initial instructions in the environment. (2) Next, we use the LLM to summarize the sampled trajectories and generate refined task instructions that better match them. (3) We then modify specific actions within the trajectories and execute the new actions in the environment, collecting negative trajectories in the process. (4) Using the refined task instructions together with the positive and negative trajectories, we train a lightweight reward model to distinguish matching from non-matching trajectories. (5) The learned reward model can then be paired with various LLM agents to improve task planning.
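At the core of step (4) is a pairwise preference objective: the reward model should score a trajectory that matches the refined instruction higher than a perturbed, non-matching one. The sketch below is a minimal illustration of that idea, not the paper's implementation; the toy `TrajectoryEncoder`, the hash-based tokenizer, and all hyperparameters are stand-ins chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Toy reward model: embeds (instruction, trajectory) tokens, mean-pools,
    and maps to a scalar reward. A real system would use a pretrained
    language-model encoder instead of a raw embedding table."""
    def __init__(self, vocab_size=50000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)  # scalar reward head

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        h = self.embed(token_ids).mean(dim=1)     # (batch, dim)
        return self.score(h).squeeze(-1)          # (batch,) scalar rewards

def tokenize(text, vocab_size=50000, max_len=128):
    """Hash-based placeholder tokenizer so the sketch is self-contained."""
    ids = [hash(w) % vocab_size for w in text.split()][:max_len]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids)

def pairwise_loss(model, instruction, pos_traj, neg_traj):
    """Bradley-Terry-style objective: the matching (positive) trajectory
    should receive a higher reward than the perturbed (negative) one."""
    pos = model(tokenize(instruction + " " + pos_traj).unsqueeze(0))
    neg = model(tokenize(instruction + " " + neg_traj).unsqueeze(0))
    return -F.logsigmoid(pos - neg).mean()

# One illustrative update on a single synthetic triplet.
model = TrajectoryEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = pairwise_loss(
    model,
    instruction="buy a red ceramic mug under $20",
    pos_traj="search[red ceramic mug] click[item_3] click[buy now]",
    neg_traj="search[blue plastic cup] click[item_9] click[buy now]",
)
opt.zero_grad()
loss.backward()
opt.step()
```

Because only this lightweight scorer is trained, the LLM agents themselves can remain frozen, API-only models; the reward model then serves as the heuristic that planning algorithms query in step (5).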
Controllable Generation. A typical example of a customized reward target for generating shorter trajectories. On the left, default greedy decoding produces a long trajectory without finding the target product. In the middle, our default reward guides the LLM agent to a correct but long trajectory. On the right, our framework with a customized reward target (the default reward with a trajectory-length penalty) finds a correct and short trajectory for the target product.
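Conceptually, the length-penalized target in the right panel is simple reward shaping: the customized score is the learned reward minus a penalty proportional to trajectory length, and the planner keeps the candidate with the highest shaped score. The snippet below sketches that selection step under assumed interfaces; `reward_model`, the candidate trajectories, and the penalty weight `alpha` are illustrative placeholders, not the paper's actual values.

```python
from typing import Callable, List, Sequence

def shaped_reward(base_reward: float, num_steps: int, alpha: float = 0.05) -> float:
    """Customized reward target: default learned reward minus a length penalty."""
    return base_reward - alpha * num_steps

def select_trajectory(
    candidates: Sequence[List[str]],             # each candidate is a list of actions
    reward_model: Callable[[List[str]], float],  # learned reward: trajectory -> score
    alpha: float = 0.05,
) -> List[str]:
    """Best-of-n selection with the shaped reward, favoring short correct trajectories."""
    return max(
        candidates,
        key=lambda traj: shaped_reward(reward_model(traj), len(traj), alpha),
    )

# Illustrative usage with a dummy reward model that favors trajectories ending in a purchase.
dummy_reward = lambda traj: 1.0 if traj and traj[-1].startswith("click[buy") else 0.0
long_traj = ["search[mug]", "click[next page]", "click[next page]", "click[item_7]", "click[buy now]"]
short_traj = ["search[red ceramic mug]", "click[item_3]", "click[buy now]"]
best = select_trajectory([long_traj, short_traj], dummy_reward)
print(best)  # the shorter correct trajectory wins under the length penalty
```

The same shaped score could, in principle, be plugged into other planners (e.g., as a node-value heuristic in tree search) since it only changes how candidate trajectories are ranked.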
More showcases are available in our paper.
@misc{chen2025scalingautonomousagentsautomatic,
      title={Scaling Autonomous Agents via Automatic Reward Modeling And Planning},
      author={Zhenfang Chen and Delin Chen and Rui Sun and Wenjun Liu and Chuang Gan},
      year={2025},
      eprint={2502.12130},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.12130},
}