ICLR 2026 Workshop Submission

SE-RL
LLM as RL Researcher

A Self-Evolutional Reinforcement Learning Framework that leverages Large Language Models to automate RL algorithm design in both Single-Agent and Multi-Agent scenarios.

200 Stocks Tested · 99% Win Rate · +4.86 PA (bps) · 2× Performance Gain

Why SE-RL?

⚡

Slow RL Research Speed

Compared to rapid advances in CV, NLP, and LLMs, reinforcement learning algorithm innovation has been significantly slower. SE-RL automates the research process using LLMs, accelerating discovery.

RL Research Speed Analysis
🎯

Unrealistic Market Assumptions

Existing RL methods rely on static market assumptions, ignoring the market impact of the agent's own actions. We build dynamic multi-agent simulators for realistic training.

Market Impact Analysis

Abstract

In quantitative finance, particularly in order execution, reinforcement learning (RL) has shown great promise due to its ability to interact with market environments built from real data. However, traditional RL research progresses slowly and relies on static market assumptions that ignore the impact of the agent's execution actions on the environment. To address these issues, we propose a Self-Evolutional single-agent/multi-agent Reinforcement Learning (SE-RL) framework. The framework uses a Large Language Model (LLM) to design the core RL algorithm modules, including the agent model, reward function, profiling, communication, and state imagination. SE-RL continuously improves the quality of LLM-generated RL algorithms through a Dual-Level Enhancement Kit (DEK) operating at both the high level (prompt refinement) and the low level (parameter fine-tuning). Additionally, we use a multi-agent system to simulate dynamic financial markets, accounting for the impact of order executions on market dynamics. Comprehensive experiments on 200 realistic stock datasets demonstrate that the proposed framework outperforms current state-of-the-art baselines.

Framework Overview

SE-RL Framework Overview

Figure 1: The SE-RL framework consists of two core loops: (1) an outer loop for enhancing the LLM's research capabilities through the Dual-Level Enhancement Kit, and (2) an inner loop for training an execution agent in both static and dynamic environments.

Key Contributions

1

SE-RL Framework

The first automated RL research framework that uses an LLM to design complete RL algorithms, integrating the Dual-Level Enhancement Kit (DEK).

2

Dynamic Market Simulator

A multi-agent system that realistically models market impact through agent-based order book updates.

3

Hybrid Environment Training

A novel training paradigm that balances learning from static historical data with training in a dynamic environment that accounts for market impact.

4

State-of-the-Art Performance

Comprehensive experiments on 200 stocks from CSI100 and NASDAQ100 demonstrate significant improvements.

Dual-Level Enhancement Kit

DEK Workflow

High-Level Enhancement

  • Macro-Micro Refine: Hierarchical prompt generation for overall task context and specific module goals.
  • In-Context Reward Learning: Learns from historical prompt-reward pairs to guide prompt optimization.
  • Instruction Buffer: An experience-replay mechanism that stabilizes prompt evolution.
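
To make the high-level loop concrete, below is a minimal Python sketch of one prompt-refinement step, assuming an instruction buffer of (prompt, reward) pairs and a caller-supplied call_llm function; all names are illustrative placeholders rather than the paper's released implementation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class InstructionBuffer:
    """Replay buffer of (prompt, reward) pairs used to stabilize prompt evolution."""
    capacity: int = 64
    entries: list = field(default_factory=list)

    def add(self, prompt: str, reward: float) -> None:
        self.entries.append((prompt, reward))
        # Keep only the highest-reward prompts once capacity is exceeded.
        self.entries = sorted(self.entries, key=lambda e: e[1], reverse=True)[:self.capacity]

    def sample(self, k: int = 4) -> list:
        return random.sample(self.entries, min(k, len(self.entries)))


def macro_micro_refine(task_context: str, module_goal: str,
                       buffer: InstructionBuffer, call_llm) -> str:
    """One high-level DEK step: combine a macro (task-level) and micro (module-level)
    prompt with in-context (prompt, reward) exemplars and ask the LLM for a refinement."""
    history = "\n".join(f"PROMPT: {p}\nREWARD (PA): {r:.2f}"
                        for p, r in buffer.sample())
    meta_prompt = (
        f"Overall task: {task_context}\n"          # macro level
        f"Current module goal: {module_goal}\n"    # micro level
        f"Past designs and their rewards:\n{history}\n"
        "Propose an improved design prompt for this module."
    )
    return call_llm(meta_prompt)
```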

Low-Level Enhancement

  • STE Gradient Flow: Straight-Through Estimator for end-to-end differentiable optimization.
  • LoRA Fine-tuning: Efficient parameter updates with minimal overhead.
  • Gradient Approximation: ∇_φ J ≈ -(PA - b) · ∇_φ log p_φ(c | prompt)
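
The gradient approximation above is a score-function (REINFORCE-style) estimator with baseline b over the LoRA parameters φ. The sketch below, assuming PyTorch tensors and a precomputed log-probability of the generated module code c, shows one such update; the STE and LoRA plumbing are abstracted away and the names are hypothetical.

```python
import torch

def low_level_update(log_prob_code: torch.Tensor, pa_bps: float, baseline: float,
                     lora_params: list, lr: float = 1e-4) -> None:
    """One low-level DEK update following ∇_φ J ≈ -(PA - b) · ∇_φ log p_φ(c | prompt).

    log_prob_code: summed token log-probabilities of the sampled module code c,
        computed through the LoRA-adapted LLM so gradients reach only φ (the adapters).
    pa_bps:   Price Advantage of the agent trained with code c.
    baseline: running mean of past PA values (the baseline b).
    """
    advantage = pa_bps - baseline
    # Treat J as the loss to minimize: J ≈ -(PA - b) · log p_φ(c | prompt).
    loss = -advantage * log_prob_code
    loss.backward()
    with torch.no_grad():
        for p in lora_params:          # plain SGD step on the adapter weights only
            if p.grad is not None:
                p -= lr * p.grad
                p.grad = None
```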

Results

Method         PA (bps) ↑     PA-std ↓      Win Rate ↑    GLR ↑         AFI ↓
TWAP           -0.12          3.12          0.49          0.97          0.00
VWAP           -3.29          4.87          0.39          0.78          0.04
PPO            -0.98          7.01          0.52          0.92          0.03
HALOP           2.89          6.12          0.65          1.12          0.07
MacMic          3.31          6.23          0.72          1.37          0.02
SE-RL (Ours)    4.25 (+0.94)  5.12 (-1.11)  0.99 (+0.27)  1.63 (+0.26)  0.00
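
For reference, Price Advantage (PA) is commonly reported in basis points of the agent's average execution price relative to a market benchmark price over the execution window, with Win Rate the fraction of orders achieving positive PA. The sketch below uses this common convention, which is an assumption; the paper's exact benchmark and sign conventions may differ.

```python
def price_advantage_bps(exec_prices, exec_quantities, market_avg_price, side="sell"):
    """Price Advantage in basis points under a common order-execution convention
    (an assumed definition, not necessarily the paper's exact metric).

    exec_prices / exec_quantities: per-fill prices and sizes of the agent's order.
    market_avg_price: benchmark price over the execution window (e.g. market TWAP/VWAP).
    side: 'sell' rewards executing above the benchmark, 'buy' rewards executing below.
    """
    vwap_exec = sum(p * q for p, q in zip(exec_prices, exec_quantities)) / sum(exec_quantities)
    sign = 1.0 if side == "sell" else -1.0
    return sign * (vwap_exec / market_avg_price - 1.0) * 1e4


# Example: selling at an average of 10.01 against a 10.00 benchmark is about +10 bps.
print(price_advantage_bps([10.02, 10.00], [100, 100], 10.00))  # ≈ 10.0
```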

Training Algorithm

Algorithm: SE-RL Training Scheme
Input: Base LLM M, Execution Agent π_init, Static Env E_s, Dynamic Env E_d
Output: Optimized Policy π*
 1: Initialize algorithm population P ← ∅, performance buffer B ← ∅
 2: while not converged do
 3:     // Step 1: Design and Generate RL Algorithm
 4:     Generate A_j ← M(Prompt, P)
 5:
 6:     // Step 2: Inner Loop Training
 7:     π_j ← Train(π_init, A_j, E_s)    // Train in static env
 8:     π_j ← Train(π_j, A_j, E_d)       // Train in dynamic env
 9:     Update π_j by minimizing L_rebalance = αL_s + βL_d
10:
11:     // Step 3: Dual-Level Enhance LLM
12:     Calculate L_LLM based on PA_j
13:     ∇_M ← STE(L_LLM)
14:     Update M's weights using ∇_M
15: end while
16: return π*
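
A minimal Python rendering of the scheme above, with every component (algorithm design, training, PA evaluation, DEK update) passed in as a placeholder callable; this is an illustrative sketch under those assumptions, not the released implementation.

```python
def se_rl_training(llm, pi_init, env_static, env_dynamic,
                   design_algorithm, train, evaluate_pa, dek_update,
                   alpha=0.5, beta=0.5, max_iters=20):
    """Minimal sketch of the SE-RL training scheme; all callables are placeholders."""
    population, perf_buffer = [], []
    best_policy, best_pa = None, float("-inf")

    for _ in range(max_iters):
        # Step 1: the LLM designs a candidate RL algorithm from past attempts.
        algo = design_algorithm(llm, population)

        # Step 2: inner loop -- train in the static env, then in the dynamic
        # multi-agent simulator; the two losses are mixed as alpha*L_s + beta*L_d.
        policy = train(pi_init, algo, env_static)
        policy = train(policy, algo, env_dynamic, rebalance=(alpha, beta))

        # Step 3: outer loop -- score the design by Price Advantage and use it
        # to enhance the LLM (prompt refinement plus LoRA update via the DEK).
        pa = evaluate_pa(policy, env_dynamic)
        population.append(algo)
        perf_buffer.append(pa)
        dek_update(llm, algo, pa, baseline=sum(perf_buffer) / len(perf_buffer))

        if pa > best_pa:
            best_policy, best_pa = policy, pa

    return best_policy
```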

Citation

BibTeX
@inproceedings{anonymous2026serl,
  title     = {Large Language Model (LLM) as an Excellent Reinforcement Learning Researcher in both Single-Agent and Multi-Agent Scenarios},
  author    = {Anonymous},
  booktitle = {ICLR 2026 Workshop},
  year      = {2026},
  note      = {Under Review}
}