A Self-Evolutional Reinforcement Learning Framework that leverages Large Language Models to automate RL algorithm design in both Single-Agent and Multi-Agent scenarios.
Compared to the rapid advances in CV, NLP, and LLMs, innovation in reinforcement learning algorithms has lagged significantly. SE-RL uses LLMs to automate the algorithm-research process, accelerating discovery.
Existing RL methods rely on static market assumptions and ignore the market impact of the agent's own actions. We build a dynamic multi-agent simulator for realistic training.
In quantitative finance, particularly in order execution, reinforcement learning (RL) has shown great promise because it can interact with market environments built from real data. However, traditional RL research in this area progresses slowly, and existing methods rely on static market assumptions that ignore the impact of the agent's own execution actions on the environment. To address these issues, we propose a Self-Evolutional single-agent/multi-agent Reinforcement Learning (SE-RL) framework. The framework uses a Large Language Model (LLM) to design key RL algorithm modules, including the agent model, reward function, profiling, communication, and state imagination. SE-RL continuously improves the quality of the LLM-generated RL algorithms through a Dual-Level Enhancement Kit (DEK) that operates at a high level (prompt refinement) and a low level (parameter fine-tuning). In addition, we use a multi-agent system to simulate dynamic financial markets, accounting for the impact of order executions on market dynamics. Comprehensive experiments on 200 realistic stock datasets demonstrate that the proposed framework outperforms current state-of-the-art baselines.
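To make the outer research loop concrete, here is a minimal sketch of the high-level DEK stage: an LLM proposes a reward-function module, the candidate is evaluated, and the feedback is folded back into the prompt. Everything here (the `query_llm` stub, the prompt, and the one-transition evaluation) is an illustrative assumption, not the released implementation; the real framework generates full RL modules, trains execution agents, and also fine-tunes the LLM's parameters (the low-level stage).

```python
def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; returns Python source for a
    # candidate reward-function module.
    return (
        "def reward(pa_bps, fluctuation):\n"
        "    return pa_bps - 0.1 * fluctuation\n"
    )

def evaluate(reward_src: str) -> float:
    # Toy evaluation: compile the generated module and score it on one
    # transition instead of a full execution-agent training run.
    namespace = {}
    exec(reward_src, namespace)  # sandbox generated code in practice
    return namespace["reward"](pa_bps=4.25, fluctuation=5.12)

def refine_prompt(prompt: str, score: float) -> str:
    # High-level DEK: fold evaluation feedback back into the prompt so the
    # next round of generation can improve on the last candidate.
    return prompt + f"\nThe previous candidate scored {score:.2f}; improve it."

prompt = "Write a Python reward function for order execution."
best = float("-inf")
for _ in range(3):  # outer research loop of Figure 1
    candidate = query_llm(prompt)
    score = evaluate(candidate)
    best = max(best, score)
    prompt = refine_prompt(prompt, score)
print(f"best candidate score: {best:.2f}")
```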
Figure 1: The SE-RL framework consists of two core loops: (1) an outer loop for enhancing the LLM's research capabilities through the Dual-Level Enhancement Kit, and (2) an inner loop for training an execution agent in both static and dynamic environments.
- The first automated RL research framework in which an LLM designs complete RL algorithms, strengthened by the Dual-Level Enhancement Kit (DEK).
- A multi-agent system that realistically models market impact through agent-based order book updates (a toy sketch follows this list).
- A novel training paradigm that balances learning from static historical data with accounting for dynamic market impact.
- Comprehensive experiments on 200 stocks from the CSI100 and NASDAQ100 that demonstrate significant improvements.
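To illustrate the agent-based market-impact idea, below is a minimal sketch of a one-sided limit order book in which a buy order consumes resting liquidity, so larger executions obtain worse average prices. The `OrderBook` class, its depth, and the fill rule are hypothetical simplifications, not the paper's simulator.

```python
from dataclasses import dataclass, field

@dataclass
class OrderBook:
    # Resting (price, quantity) ask levels, best price first: toy depth only.
    asks: list = field(default_factory=lambda: [(100.0, 50), (100.1, 80), (100.2, 120)])

    def execute_buy(self, qty: int) -> float:
        """Walk the book; a large order exhausts levels and moves the price."""
        cost, remaining = 0.0, qty
        while remaining > 0 and self.asks:
            price, avail = self.asks[0]
            take = min(remaining, avail)
            cost += take * price
            remaining -= take
            if take == avail:
                self.asks.pop(0)  # level exhausted: later fills pay more
            else:
                self.asks[0] = (price, avail - take)
        return cost / (qty - remaining)  # average fill price

book = OrderBook()
print(f"30-share order:  avg price {OrderBook().execute_buy(30):.3f}")  # top of book
print(f"200-share order: avg price {book.execute_buy(200):.3f}")        # pays impact
```

A static replay of historical data would return the same price for both orders; the agent-based update is what lets training account for the execution's own footprint.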
| Method | PA (bps) ↑ | PA-std ↓ | Win Rate ↑ | GLR ↑ | AFI ↓ |
|---|---|---|---|---|---|
| TWAP | -0.12 | 3.12 | 0.49 | 0.97 | 0.00 |
| VWAP | -3.29 | 4.87 | 0.39 | 0.78 | 0.04 |
| PPO | -0.98 | 7.01 | 0.52 | 0.92 | 0.03 |
| HALOP | 2.89 | 6.12 | 0.65 | 1.12 | 0.07 |
| MacMic | 3.31 | 6.23 | 0.72 | 1.37 | 0.02 |
| SE-RL (Ours) | 4.25 (+0.94) | 5.12 (-1.11) | 0.99 (+0.27) | 1.63 (+0.26) | 0.00 |

Deltas in parentheses compare SE-RL with MacMic, the strongest baseline by PA (for PA-std, a reduction).
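For orientation, PA (price advantage) is commonly measured in basis points against a benchmark such as TWAP; a minimal sketch of that convention follows, assuming a sell-side definition (the paper's exact formula may differ).

```python
def price_advantage(avg_exec_price: float, benchmark_price: float, side: str = "sell") -> float:
    """PA in basis points: positive when execution beats the benchmark."""
    sign = 1.0 if side == "sell" else -1.0
    return sign * (avg_exec_price / benchmark_price - 1.0) * 1e4

# Selling at 100.11 against a 100.07 TWAP benchmark yields ~4 bps of PA.
print(f"{price_advantage(100.11, 100.07):.2f} bps")
```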