ICLR 2026 Workshop Submission

SE-RL
LLM as RL Researcher

A Self-Evolutional Reinforcement Learning Framework that leverages Large Language Models to automate RL algorithm design in both Single-Agent and Multi-Agent scenarios.

200 Stocks Tested · 99% Win Rate · +4.86 PA (bps) · 2× Performance Gain

Why SE-RL?

⚡

Slow RL Research Speed

Compared to rapid advances in CV, NLP, and LLMs, reinforcement learning algorithm innovation has been significantly slower. SE-RL automates the research process using LLMs, accelerating discovery.

RL Research Speed Analysis
🎯

Unrealistic Market Assumptions

Existing RL methods rely on static market assumptions, ignoring the market impact of the agent's own actions. We build dynamic multi-agent simulators for realistic training.

Market Impact Analysis

Abstract

In quantitative finance, particularly in order execution, reinforcement learning (RL) has shown great promise due to its ability to interact with market environments built from real data. However, traditional RL research progresses slowly and relies on static market assumptions that ignore the impact of the agent's execution actions on the environment. To address these issues, we propose a Self-Evolutional single-agent/multi-agent Reinforcement Learning (SE-RL) framework. The framework uses a Large Language Model (LLM) to design the core RL algorithm modules, including the agent model, reward function, profiling, communication, and state imagination. SE-RL continuously improves the quality of LLM-generated RL algorithms through a Dual-Level Enhancement Kit (DEK) operating at both the high level (prompt refinement) and the low level (parameter fine-tuning). Additionally, we use a multi-agent system to simulate dynamic financial markets, accounting for the impact of order executions on market dynamics. Comprehensive experiments on 200 realistic stock datasets demonstrate that the proposed framework outperforms current state-of-the-art baselines.

Framework Overview

SE-RL Framework Overview

Figure 1: The SE-RL framework consists of two core loops: (1) an outer loop for enhancing the LLM's research capabilities through the Dual-Level Enhancement Kit, and (2) an inner loop for training an execution agent in both static and dynamic environments.

Key Contributions

1

SE-RL Framework

The first automated RL research framework that uses an LLM to design complete RL algorithms, integrating the Dual-Level Enhancement Kit (DEK).

2

Dynamic Market Simulator

A multi-agent system that realistically models market impact through agent-based order book updates.

3

Hybrid Environment Training

A novel training paradigm that balances learning from static historical data with training in a dynamic environment that accounts for market impact.

4

State-of-the-Art Performance

Comprehensive experiments on 200 stocks from CSI100 and NASDAQ100 demonstrate significant improvements.

Dual-Level Enhancement Kit

DEK Workflow

High-Level Enhancement

  • Macro-Micro Refine: Hierarchical prompt generation for overall task context and specific module goals.
  • In-Context Reward Learning: Learns from historical prompt-reward pairs to guide prompt optimization.
  • Instruction Buffer: An experience-replay mechanism that stabilizes prompt evolution.
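
To make the high-level loop concrete, below is a minimal Python sketch of one prompt-refinement step, assuming an instruction buffer of (prompt, reward) pairs and a caller-supplied call_llm function; all names are illustrative placeholders rather than the paper's released implementation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class InstructionBuffer:
    """Replay buffer of (prompt, reward) pairs used to stabilize prompt evolution."""
    capacity: int = 64
    entries: list = field(default_factory=list)

    def add(self, prompt: str, reward: float) -> None:
        self.entries.append((prompt, reward))
        # Keep only the highest-reward prompts once capacity is exceeded.
        self.entries = sorted(self.entries, key=lambda e: e[1], reverse=True)[:self.capacity]

    def sample(self, k: int = 4) -> list:
        return random.sample(self.entries, min(k, len(self.entries)))


def macro_micro_refine(task_context: str, module_goal: str,
                       buffer: InstructionBuffer, call_llm) -> str:
    """One high-level DEK step: combine a macro (task-level) and micro (module-level)
    prompt with in-context (prompt, reward) exemplars and ask the LLM for a refinement."""
    history = "\n".join(f"PROMPT: {p}\nREWARD (PA): {r:.2f}"
                        for p, r in buffer.sample())
    meta_prompt = (
        f"Overall task: {task_context}\n"          # macro level
        f"Current module goal: {module_goal}\n"    # micro level
        f"Past designs and their rewards:\n{history}\n"
        "Propose an improved design prompt for this module."
    )
    return call_llm(meta_prompt)
```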

Low-Level Enhancement

  • STE Gradient Flow: Straight-Through Estimator for end-to-end differentiable optimization.
  • LoRA Fine-tuning: Efficient parameter updates with minimal overhead.
  • Gradient Approximation: ∇_φ J ≈ -(PA - b) · ∇_φ log p_φ(c | prompt)
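
The gradient approximation above is a score-function (REINFORCE-style) estimator with baseline b over the LoRA parameters φ. The sketch below, assuming PyTorch tensors and a precomputed log-probability of the generated module code c, shows one such update; the STE and LoRA plumbing are abstracted away and the names are hypothetical.

```python
import torch

def low_level_update(log_prob_code: torch.Tensor, pa_bps: float, baseline: float,
                     lora_params: list, lr: float = 1e-4) -> None:
    """One low-level DEK update following ∇_φ J ≈ -(PA - b) · ∇_φ log p_φ(c | prompt).

    log_prob_code: summed token log-probabilities of the sampled module code c,
        computed through the LoRA-adapted LLM so gradients reach only φ (the adapters).
    pa_bps:   Price Advantage of the agent trained with code c.
    baseline: running mean of past PA values (the baseline b).
    """
    advantage = pa_bps - baseline
    # Treat J as the loss to minimize: J ≈ -(PA - b) · log p_φ(c | prompt).
    loss = -advantage * log_prob_code
    loss.backward()
    with torch.no_grad():
        for p in lora_params:          # plain SGD step on the adapter weights only
            if p.grad is not None:
                p -= lr * p.grad
                p.grad = None
```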

Results

Method         PA (bps) ↑     PA-std ↓      Win Rate ↑    GLR ↑         AFI ↓
TWAP           -0.12          3.12          0.49          0.97          0.00
VWAP           -3.29          4.87          0.39          0.78          0.04
PPO            -0.98          7.01          0.52          0.92          0.03
HALOP           2.89          6.12          0.65          1.12          0.07
MacMic          3.31          6.23          0.72          1.37          0.02
SE-RL (Ours)    4.25 (+0.94)  5.12 (-1.11)  0.99 (+0.27)  1.63 (+0.26)  0.00
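
For reference, Price Advantage (PA) is commonly reported in basis points of the agent's average execution price relative to a market benchmark price over the execution window, with Win Rate the fraction of orders achieving positive PA. The sketch below uses this common convention, which is an assumption; the paper's exact benchmark and sign conventions may differ.

```python
def price_advantage_bps(exec_prices, exec_quantities, market_avg_price, side="sell"):
    """Price Advantage in basis points under a common order-execution convention
    (an assumed definition, not necessarily the paper's exact metric).

    exec_prices / exec_quantities: per-fill prices and sizes of the agent's order.
    market_avg_price: benchmark price over the execution window (e.g. market TWAP/VWAP).
    side: 'sell' rewards executing above the benchmark, 'buy' rewards executing below.
    """
    vwap_exec = sum(p * q for p, q in zip(exec_prices, exec_quantities)) / sum(exec_quantities)
    sign = 1.0 if side == "sell" else -1.0
    return sign * (vwap_exec / market_avg_price - 1.0) * 1e4


# Example: selling at an average of 10.01 against a 10.00 benchmark is about +10 bps.
print(price_advantage_bps([10.02, 10.00], [100, 100], 10.00))  # ≈ 10.0
```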

Training Algorithm

Algorithm: SE-RL Training Scheme
Input: Base LLM M, Execution Agent π_init, Static Env E_s, Dynamic Env E_d
Output: Optimized Policy π*
 1: Initialize algorithm population P ← ∅, performance buffer B ← ∅
 2: while not converged do
 3:     // Step 1: Design and Generate RL Algorithm
 4:     Generate A_j ← M(Prompt, P)
 5:
 6:     // Step 2: Inner Loop Training
 7:     π_j ← Train(π_init, A_j, E_s)    // Train in static env
 8:     π_j ← Train(π_j, A_j, E_d)       // Train in dynamic env
 9:     Update π_j by minimizing L_rebalance = αL_s + βL_d
10:
11:     // Step 3: Dual-Level Enhance LLM
12:     Calculate L_LLM based on PA_j
13:     ∇_M ← STE(L_LLM)
14:     Update M's weights using ∇_M
15: end while
16: return π*
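
A minimal Python rendering of the scheme above, with every component (algorithm design, training, PA evaluation, DEK update) passed in as a placeholder callable; this is an illustrative sketch under those assumptions, not the released implementation.

```python
def se_rl_training(llm, pi_init, env_static, env_dynamic,
                   design_algorithm, train, evaluate_pa, dek_update,
                   alpha=0.5, beta=0.5, max_iters=20):
    """Minimal sketch of the SE-RL training scheme; all callables are placeholders."""
    population, perf_buffer = [], []
    best_policy, best_pa = None, float("-inf")

    for _ in range(max_iters):
        # Step 1: the LLM designs a candidate RL algorithm from past attempts.
        algo = design_algorithm(llm, population)

        # Step 2: inner loop -- train in the static env, then in the dynamic
        # multi-agent simulator; the two losses are mixed as alpha*L_s + beta*L_d.
        policy = train(pi_init, algo, env_static)
        policy = train(policy, algo, env_dynamic, rebalance=(alpha, beta))

        # Step 3: outer loop -- score the design by Price Advantage and use it
        # to enhance the LLM (prompt refinement plus LoRA update via the DEK).
        pa = evaluate_pa(policy, env_dynamic)
        population.append(algo)
        perf_buffer.append(pa)
        dek_update(llm, algo, pa, baseline=sum(perf_buffer) / len(perf_buffer))

        if pa > best_pa:
            best_policy, best_pa = policy, pa

    return best_policy
```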

Citation

BibTeX
@inproceedings{anonymous2026serl,
  title     = {Large Language Model (LLM) as an Excellent Reinforcement Learning Researcher in both Single-Agent and Multi-Agent Scenarios},
  author    = {Anonymous},
  booktitle = {ICLR 2026 Workshop},
  year      = {2026},
  note      = {Under Review}
}