AReaL — Asynchronous RL Infrastructure for LLM Agents
Repo: inclusionAI/AReaL — Tsinghua IIIS + Ant Group. Paper (arXiv 2505.24298).
AReaL is an RL training system aimed at two things the field has been struggling to reconcile: scalable online RL for large reasoning models, and agentic RL where the "environment" is a black-box agent runtime (OpenAI Agents SDK, CAMEL, LangChain, whatever) that you shouldn't have to rewrite. Its central trick is fully asynchronous training — decoupling rollout generation from policy updates so that inference and training run concurrently on disaggregated GPUs.
This is a different kind of system from QED. QED is an orchestration pipeline that treats LLMs as fixed components and squeezes correctness out of them via multi-model voting and structured verification. AReaL is an infrastructure layer that updates the weights themselves via reinforcement learning, on agent trajectories that can be arbitrary user-defined loops. Both are practical, but they're operating at different layers of the stack.
1. High-Level Takeaways
If you only remember three things:
- Asynchronous > synchronous, by a lot. AReaL's v0.3 release shows 2.77× speedup over synchronous RL on comparable hardware. The trick is running rollouts (inference) and training (gradient steps) on separate GPU pools, overlapped in time, with principled handling of the staleness that introduces.
- "Just replace `base_url`" is the API contract. The headline examples (OpenClaw, CAMEL-SETA, OpenAI Agents SDK) show that you train an existing agent runtime by pointing it at AReaL's OpenAI-compatible proxy. No framework rewrite. The proxy is an HTTP server that speaks chat completions, logs token-level telemetry, and returns trajectories for RL — so any agent that talks the OpenAI chat-completions protocol is suddenly RL-trainable.
- Off-policyness is a controlled knob, not a defect. The entire async-RL literature lives or dies on how much stale data you can tolerate. AReaL exposes `max_head_offpolicyness` directly, pairs it with a decoupled PPO loss + recomputed logprobs, and lets the user tune the tradeoff. This is not a hidden implementation detail — it's a user-facing algorithmic choice.
The bet: the bottleneck of online RL for agents is generation latency, and the right fix is system-level (asynchronous GPU pools) rather than algorithm-level (more sample-efficient algorithms).
2. The Problem AReaL Is Trying to Solve
Online RL for LLMs has two flavors of pain:
(a) The GPU-idleness pain
Traditional synchronous online RL (PPO, GRPO, etc.) does this every step:
```python
while training:
    rollouts = generate(policy)        # inference: wait for every rollout to finish
    policy = train(policy, rollouts)   # training: inference GPUs sit idle meanwhile
```
For a large reasoning model with long chains of thought — say, 8k–32k tokens per rollout — inference dominates the wall-clock. The training GPUs spend most of their time waiting for rollouts to finish. As the model gets bigger and the chain-of-thought gets longer, the idleness fraction gets worse.
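A back-of-envelope model makes the idleness argument concrete. The numbers below are illustrative, not measured from AReaL:

```python
# Toy model of training-GPU idleness in synchronous vs. asynchronous RL.
# All timings are made-up round numbers for illustration.

def sync_step_time(t_gen: float, t_train: float) -> float:
    """Synchronous: generation and training alternate on the same clock."""
    return t_gen + t_train

def async_step_time(t_gen: float, t_train: float) -> float:
    """Asynchronous: separate GPU pools overlap; the slower stage dominates."""
    return max(t_gen, t_train)

# Long chains of thought make generation dominate, e.g. 90s gen vs. 30s train.
t_gen, t_train = 90.0, 30.0
idle_frac = t_gen / (t_gen + t_train)   # fraction of time training GPUs wait
speedup = sync_step_time(t_gen, t_train) / async_step_time(t_gen, t_train)
print(f"idle fraction: {idle_frac:.2f}, async speedup: {speedup:.2f}x")
```

Under this toy model the training pool is idle 75% of the time synchronously, and the longer the chain-of-thought, the closer the async speedup gets to the full sync step time divided by the training time.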
(b) The agent-framework impedance mismatch
You want to RL-finetune an agent that was built with OpenAI Agents SDK or CAMEL or LangChain. Those frameworks:
- Interact with LLMs through chat-completions APIs — no token IDs, no logprobs.
- Have no reward mechanism (they're designed for inference).
- Run sequentially (no parallel rollout collection out of the box).
Naive approach: rewrite the agent in a training-friendly API. That's weeks of work per agent framework, and it diverges from what you'll actually deploy.
What AReaL proposes
For (a): asynchronous RL — decouple rollout GPUs from training GPUs; allow staleness with principled control; use a decoupled PPO objective that's robust to off-policy data.
For (b): a proxy model client that looks like OpenAI — your agent sends chat-completion requests to the proxy; the proxy routes them to SGLang or vLLM; every request/response is logged with token-level telemetry; when the episode ends you assign a reward and the proxy hands the trajectory back for RL training.
The result: you change your agent's base_url from https://api.openai.com/v1 to the proxy URL, and the same code that runs inference now produces trainable trajectories.
3. Architecture Tour
AReaL's source tree (areal/) decomposes into the pieces you'd expect from a serious RL system:
```
areal/
├── engine/      # training + inference backends: fsdp_engine.py,
│                # megatron_engine.py, experimental/archon/,
│                # sglang_remote.py, vllm_remote.py
├── trainer/     # training-loop orchestration
├── infra/       # schedulers: Ray / SLURM / local
└── workflow/    # rollout logic: rlvr.py, multi_turn.py, openai/, ...
```

(Abridged; paths as named elsewhere in this post.)
The useful mental model is four horizontal layers:
| Layer | What it does | Components |
|---|---|---|
| Training | forward/backward/update | engine/fsdp_engine.py, engine/megatron_engine.py, engine/experimental/archon/ |
| Inference | generate rollouts | engine/sglang_remote.py, engine/vllm_remote.py |
| Orchestration | gluing the two together | trainer/, infra/ (Ray / SLURM / local schedulers) |
| Workflow | the rollout policy — including agent frameworks | workflow/rlvr.py, workflow/multi_turn.py, workflow/openai/, etc. |
The separation of engine (how to train/infer) from workflow (what the rollout actually does) is the right factoring for agent RL. A single workflow can be reused across many algorithms (GRPO, PPO, DAPO, RLOO, LitePPO, SAPO, M2PO — all 13+ algorithms share the workflow abstraction), and a single algorithm can run over many workflows (math verification, multi-turn tool use, vision-language, agent frameworks).
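To see why the factoring composes, here is a hypothetical sketch of the split as Python protocols. All names below are illustrative, not AReaL's actual interfaces:

```python
# Hypothetical engine/workflow boundary: engines know how to generate and
# update; workflows know what a rollout *means*. Names are illustrative.
from typing import Protocol

class InferenceEngine(Protocol):
    def generate(self, prompt: str) -> str: ...

class Workflow(Protocol):
    """Runs one episode and returns a trajectory dict with a reward."""
    def run_episode(self, engine: InferenceEngine, task: str) -> dict: ...

class MathVerifyWorkflow:
    """A math-verification rollout: reward 1.0 if the answer checks out."""
    def run_episode(self, engine: InferenceEngine, task: str) -> dict:
        answer = engine.generate(task)
        reward = 1.0 if "42" in answer else 0.0   # stand-in verifier
        return {"task": task, "answer": answer, "reward": reward}

class EchoEngine:
    """A fake engine so the sketch runs without GPUs."""
    def generate(self, prompt: str) -> str:
        return "the answer is 42"

traj = MathVerifyWorkflow().run_episode(EchoEngine(), "6*7=?")
print(traj["reward"])  # → 1.0
```

Any algorithm that consumes trajectory dicts can run over any workflow, and any engine that satisfies the protocol can back any workflow; that is the reuse the table above describes.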
Training backends
| Backend | DP | TP | SP | CP | PP | EP | 1D Packing | LoRA |
|---|---|---|---|---|---|---|---|---|
| Megatron | ZeRO-1 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (with vLLM) |
| PyTorch FSDP | FSDP2 | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ |
| PyTorch Archon | FSDP2 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
(DP = data parallel, TP = tensor parallel, SP = sequence parallel, CP = context parallel, PP = pipeline parallel, EP = expert parallel.)
Megatron is the canonical "all the parallelism" choice; FSDP2 is the pragmatic default for most users; Archon is an experimental FSDP2-based backend that adds PP+EP, though it currently lacks LoRA support. The fact that all three are maintained side-by-side tells you something about the actual state of the distributed-training world — no single backend dominates for all model-size / memory / feature combinations.
Inference backends
| Backend | TP | CP | PP | DP Attention | EP |
|---|---|---|---|---|---|
| vLLM | ✅ | ? | ✅ | ? | ? |
| SGLang | ✅ | ❌ | ❌ | ✅ | ✅ |
Both are supported because users want both. SGLang has better MoE / expert-parallel support and is the default; vLLM has a larger user base and is the fallback.
4. The Async RL Core Idea
Let's unpack what "fully asynchronous" actually means, because the term gets abused.
Synchronous RL (what AReaL is moving away from)
```
step t:   [GENERATE rollouts with policy π_t] → [TRAIN on rollouts → π_{t+1}]
step t+1: ...
```
Every rollout uses the current policy. No staleness, but lots of idle time.
Asynchronous RL (AReaL's mode)
```
inference GPUs: GENERATE (π_t) — GENERATE (π_{t+1}) — GENERATE (π_{t+2}) — ...
training GPUs:  TRAIN → π_{t+1} — TRAIN → π_{t+2}  — TRAIN → π_{t+3}  — ...
```
Inference and training are fully decoupled. The training GPU is always consuming rollouts; the inference GPU is always producing them. Weight updates are broadcast asynchronously to the inference workers.
The price: off-policyness. A rollout generated by policy $\pi_{t-k}$ is used to train $\pi_t$. The further apart they are, the more the policy gradient estimator is biased.
Off-policyness control
AReaL exposes this directly as max_head_offpolicyness:
```yaml
rollout:
  max_head_offpolicyness: 4   # illustrative value
```

- `0` → synchronous (useful for baseline comparisons, ~2× slower)
- `2–8` → typical async range; bigger = more throughput, less stability
- Very large → quality degrades
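As a concrete reading of the knob, the admission rule it implies can be sketched in a few lines. This is my own toy version, not AReaL's scheduler code:

```python
# Toy admission rule implied by a staleness cap like max_head_offpolicyness.
# Illustrative only; not AReaL's actual rollout-controller logic.

def admissible(current_version: int, rollout_version: int,
               max_offpolicyness: int) -> bool:
    """A rollout generated under policy version v may be trained on at
    version v' only if v' - v <= max_offpolicyness."""
    return current_version - rollout_version <= max_offpolicyness

assert admissible(current_version=10, rollout_version=10, max_offpolicyness=0)  # sync
assert admissible(current_version=10, rollout_version=6,  max_offpolicyness=4)
assert not admissible(current_version=10, rollout_version=5, max_offpolicyness=4)
```

Setting the cap to 0 recovers synchronous RL; raising it lets the rollout pool run further ahead of the trainer.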
What makes this a system-level design rather than an algorithm-level one: the partial rollouts idea. A single trajectory (especially a long reasoning chain) can span multiple policy versions — the first 4k tokens might come from $\pi_{t-2}$, the next 4k from $\pi_{t-1}$, etc. This is only possible because the inference side holds a KV cache that persists across weight updates.
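A partial rollout can be pictured as a list of segments tagged with the policy version that generated them. This is a toy data layout for illustration, not AReaL's actual structures:

```python
# Toy representation of a partial rollout: one trajectory whose token spans
# were generated under different policy versions, because the inference-side
# KV cache survives the weight update between spans.
from dataclasses import dataclass

@dataclass
class Segment:
    policy_version: int
    token_ids: list[int]

trajectory = [
    Segment(policy_version=7, token_ids=list(range(0, 4096))),     # first 4k tokens
    Segment(policy_version=8, token_ids=list(range(4096, 8192))),  # next 4k, post-update
]

versions_spanned = sorted({seg.policy_version for seg in trajectory})
print(versions_spanned)  # → [7, 8]
```

The trainer then treats the stitched trajectory as one rollout whose behavior policy is mixed, which is exactly why the loss needs explicit off-policy handling.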
The decoupled PPO objective
Plain PPO assumes on-policy data. When the rollouts are off-policy, the ratio $r_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$ can swing wildly. AReaL pairs async rollouts with two algorithmic knobs:
```yaml
actor:
  recompute_logprobs: true
  use_decoupled_loss: true
```

- `recompute_logprobs: true` — don't trust the logprobs the inference engine returned; recompute them under the current training policy. This matters because inference backends (vLLM, SGLang) can use different numerics from training backends (Megatron, FSDP), and tiny differences compound.
- `use_decoupled_loss: true` — use a variant of the PPO objective that separates the importance-sampling correction from the clipping, handling off-policyness more gracefully.
These are knobs, not defaults for every case. The docs explicitly note that the decoupled loss "may conflict with certain algorithm configurations (e.g., SAPO)" and that effects on newer algorithms are "largely understudied." That honesty is refreshing.
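For orientation, one common way to write a decoupled objective separates a fixed behavior-policy correction from the PPO clip. This is a hedged sketch of the general shape, not necessarily AReaL's exact formula (see arXiv 2505.24298 for the precise definition):

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[\,\frac{\pi_{\text{prox}}(a_t \mid s_t)}{\pi_{\text{behav}}(a_t \mid s_t)}\;\min\!\Big(u_t A_t,\ \operatorname{clip}(u_t,\,1-\varepsilon,\,1+\varepsilon)\,A_t\Big)\right],
\qquad u_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)}
$$

Here $\pi_{\text{behav}}$ is the (possibly stale) policy that generated the rollout and $\pi_{\text{prox}}$ is a recent proximal snapshot; recomputing logprobs supplies these terms under training-side numerics rather than inference-side ones.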
5. The Proxy — Why "Just Replace base_url" Matters
This is the single most interesting design choice in AReaL, and it's worth explaining because it's the kind of thing that's easy to under-appreciate from the README.
The agent RL problem, precisely
You have an agent loop that looks like this:
```python
# Your existing agent code, unchanged (generic sketch)
client = OpenAI(base_url="https://api.openai.com/v1")

while not done:
    response = client.chat.completions.create(model=model, messages=messages)
    messages.append(response.choices[0].message)
    done = handle_tool_calls(messages)   # tools, memory, retries: all framework code
```
For RL you need:
- The token IDs Claude/GPT/whatever actually produced (not the decoded text).
- The logprobs under the generating policy.
- A reward signal for each trajectory.
- A way to collect many trajectories in parallel so you can compute advantages.
Chat-completions APIs don't give you (1) or (2). They don't expect a (3). And parallelism in agent frameworks is often serial-by-default.
What the proxy does
AReaL runs an HTTP server that speaks OpenAI chat-completions but is wired into its own inference engine (SGLang or vLLM). When the agent sends a request, the proxy:
- Routes it to the inference engine.
- Records the exact token IDs and logprobs.
- Tracks which trajectory this request belongs to (via session IDs).
- Returns an OpenAI-shaped response.
At episode end, you call the reward function. The proxy stitches together the full trajectory — every turn, every tool call, every token — and hands it to the trainer as a single trainable rollout.
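To make the mechanism concrete, here is a toy in-process version of the record-and-stitch idea. Every name below is illustrative; the real proxy is an HTTP server in front of SGLang or vLLM:

```python
# Toy sketch of the proxy's record-and-stitch behavior: an OpenAI-shaped
# call that records token-level telemetry per session, then exports the
# whole episode as one trainable rollout. Names are illustrative.
from collections import defaultdict

class RecordingProxy:
    def __init__(self, engine):
        self.engine = engine                  # anything with .generate(messages)
        self.sessions = defaultdict(list)     # session_id -> list of turns

    def chat_completion(self, session_id: str, messages: list[dict]) -> dict:
        text, token_ids, logprobs = self.engine.generate(messages)
        self.sessions[session_id].append(
            {"messages": messages, "token_ids": token_ids, "logprobs": logprobs}
        )
        # The agent sees only an OpenAI-shaped response.
        return {"choices": [{"message": {"role": "assistant", "content": text}}]}

    def export_trajectory(self, session_id: str, reward: float) -> dict:
        """At episode end: stitch every recorded turn into one rollout."""
        return {"turns": self.sessions.pop(session_id), "reward": reward}

class FakeEngine:
    """Stands in for SGLang/vLLM so the sketch runs anywhere."""
    def generate(self, messages):
        return "ok", [101, 102], [-0.1, -0.2]

proxy = RecordingProxy(FakeEngine())
resp = proxy.chat_completion("ep-1", [{"role": "user", "content": "hi"}])
traj = proxy.export_trajectory("ep-1", reward=1.0)
print(len(traj["turns"]), traj["reward"])  # → 1 1.0
```

The agent only ever sees the `choices` payload; the token IDs and logprobs it never asked for are what make the episode trainable.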
So the user's code changes by exactly one line:
```python
# Before
client = OpenAI(base_url="https://api.openai.com/v1")

# After: the only change
client = OpenAI(base_url="http://<areal-proxy-host>:<port>/v1")
```
That's it. The rest of the agent (tools, memory, prompts, retry logic) is unchanged.
Why this is the right design
Agent frameworks churn. LangChain in 2023 vs. 2026 is a different library. OpenAI Agents SDK didn't exist two years ago. CAMEL is on its third major version. A training system that required agent-framework-specific adapters would be permanently chasing.
By targeting the protocol (OpenAI chat completions) rather than the framework, AReaL future-proofs itself: every new framework that wants to be popular ends up supporting chat completions, and they all work with AReaL for free.
This is the same design principle as LSP (Language Server Protocol) for editors, or OCI for container runtimes: standardize the wire format, not the implementation.
The tradeoff
You can only train what the protocol exposes. If your agent does something clever with logit biases, speculative decoding, or non-standard sampling parameters, the proxy may not support it. The proxy is a funnel — it works beautifully for 95% of agent code, and for the remaining 5% you need the workflow-level integration (which AReaL also provides — see workflow/openai_agent/, workflow/langchain/).
6. The Algorithm Menagerie
AReaL supports 13+ RL / fine-tuning algorithms out of the box:
| Algorithm | What it's good at | Paper |
|---|---|---|
| PPO | baseline classic | arXiv:2203.02155 |
| GRPO | group-relative advantage, dominant on reasoning | arXiv:2402.03300 |
| GSPO | GRPO variant | arXiv:2507.18071 |
| DAPO | dynamic batch size | arXiv:2503.14476 |
| LitePPO | lighter-weight PPO | arXiv:2508.08221 |
| Dr.GRPO | GRPO variant | arXiv:2503.20783 |
| REINFORCE++ | - | arXiv:2501.03262 |
| RLOO | leave-one-out advantage | arXiv:2402.14740 |
| SAPO | - | arXiv:2511.20347 |
| M2PO | - | arXiv:2510.01161 |
| SFT | supervised fine-tuning | - |
| RLHF reward modeling | - | - |
| Distillation | - | arXiv:2506.02208 |
This is a specific design choice: ship many algorithms, let the user pick. The alternative (ship one algorithm, optimize it hard) is what VeRL and OpenRLHF tend toward. AReaL's bet is that the RL-for-LLM field is moving fast enough that the "best" algorithm changes every six months, and a framework that supports many is more future-proof than one that hard-codes a single choice.
In practice: most users run GRPO. The others are there for research / niche cases.
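Since GRPO's group-relative advantage is the workhorse, here is a minimal sketch of that computation. It illustrates the standard normalization, not AReaL's implementation:

```python
# Group-relative advantage, GRPO-style: sample a group of rollouts for the
# same prompt, then normalize each reward against the group's statistics.
# Sketch only; not AReaL's code.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)      # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one prompt: two correct (1.0), two wrong (0.0).
advs = group_relative_advantages([1.0, 1.0, 0.0, 0.0])
print([round(a, 2) for a in advs])  # → [1.0, 1.0, -1.0, -1.0]
```

The appeal is that no learned value function is needed; the group itself is the baseline, which is why GRPO pairs so naturally with verifiable rewards.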
7. Examples — What's Shipped
The examples/ directory is the real documentation:
| Example | What it demonstrates |
|---|---|
| `math/gsm8k_rl.py` | baseline GRPO on GSM8K — the default starter |
| `multi_turn_math/` | multi-turn math with intermediate reasoning |
| `tir/` | tool-integrated reasoning (code execution in the loop) |
| `openclaw/` | agent RL via OpenClaw — the "just replace `base_url`" demo |
| `openai_agents/` | OpenAI Agents SDK integration |
| `camel/` | CAMEL multi-agent integration |
| `tau2/` | τ²-bench (customer service agents) — the AReaL-SEA example |
| `search_agent/` | ASearcher, a search agent that set SOTA |
| `countdown/` | countdown game (hard reasoning / search) |
| `alignment/` | RLHF reward modeling |
| `distillation/` | distillation workflows |
| `vlm/`, `vlm_npu/` | vision-language models |
| `tau2/` (SEA) | self-evolving data synthesis — the 235B MoE model that beat GPT-5 on τ² |
Two of these are worth singling out:
openclaw/ — the headline demo
OpenClaw is a terminal-focused agent runtime. The training recipe is:
- Start AReaL's RL service — it spins up a proxy at `http://x.x.x.x:port`.
- Point OpenClaw at that proxy via its existing `--base-url` flag.
- Run a workload (command-line tasks, shell sessions, etc.).
- Rewards come from task-completion signals.
- AReaL harvests trajectories and updates the policy.
No OpenClaw code is modified. This is the "bring your own agent" promise, made concrete.
tau2/ — the SEA (self-evolving agent) example
AReaL-SEA pairs RL training with self-evolving data synthesis. A 235B MoE model, trained this way, reportedly surpasses GPT-5 on $\tau^2$-bench (customer service). The interesting bit isn't the model itself but the recipe: the training data for RL is generated adversarially by the agent, not collected from humans. This is closer to AlphaZero-style self-play than to traditional RLHF.
8. AReaL-lite — The Lightweight Variant
As of 2025-07-31, AReaL also ships AReaL-lite:
- 80% fewer lines of code than full AReaL
- 90% of the performance and core functionality
- "Algorithm-first" API (easy to experiment, hard to deploy at huge scale)
- Natively supports fully asynchronous agentic RL
This is an increasingly common pattern in ML infra: a "production" codebase for the organization that built the system, and a "research" codebase that's a cleaner subset for external researchers. vLLM vs. vllm-lite, DeepSpeed vs. simpler trainer wrappers, etc. The split acknowledges that the constraints of running at Ant Group's scale are different from the constraints of running a postdoc's experiments.
For a researcher, the lite variant is usually the right entry point. For a production system, the full framework has things the lite version doesn't (full Megatron backend, every parallelism dimension, every scheduler).
9. Connection to the Agent Design Philosophy
Compared to the Claude Code and QED perspectives we've looked at, AReaL sits at a very different altitude:
| Layer | Example | Who it's for |
|---|---|---|
| Application | a specific agent (e.g., OpenClaw, a math prover) | end users |
| Orchestration | Claude Code (tool use), QED (multi-agent pipeline) | agent builders |
| Training infra | AReaL | model trainers |
| Serving infra | vLLM, SGLang | platform ops |
AReaL is below the orchestration layer — it's the substrate that produces the model that the orchestration layer then uses. But interestingly, the design principles still rhyme:
| Principle | In Claude Code / QED | In AReaL |
|---|---|---|
| Tools shaped to model abilities | CLI subprocesses, one per vendor | Separate backends (FSDP / Megatron / Archon) for different model scales |
| Progressive disclosure | Lazy-loaded skill bodies | Lazy-loaded agent workflows via the proxy |
| Standardize the wire format, not the framework | OpenAI-style `tools[]` in the API | OpenAI-style chat completions for the proxy |
| Off-loading cost to the right place | Prompt caching (amortize over session) | Async GPU pools (amortize generation over training) |
| Give users the knobs explicitly | Prose imperatives in system prompt | max_head_offpolicyness exposed in YAML |
The deepest parallel is point 3: AReaL's proxy and Claude Code's Skill system both work by standardizing a wire format that already exists and is ubiquitous (OpenAI chat completions, Anthropic tool use) rather than inventing a new one. Every agent framework already speaks chat completions; every LLM API already supports tool use. Building on those substrates is how you get composability for free.
10. What I Think Is Missing or Risky
A few places where I'd want to dig in more:
- Proxy compatibility is protocol-first, which means "almost OpenAI" agents may break. If your framework uses logit bias, response_format, streaming tool calls, or other non-mainstream fields, the proxy may need updates. The README is explicit that the proxy works with chat completions; anything outside that boundary needs the workflow-level integration.
- Async quality claims are benchmark-specific. The 2.77× speedup is reported on their benchmark setup (GSM8K, mid-sized models). At larger scale (235B MoE on τ²), throughput wins compound differently — you need to benchmark your own case.
- Off-policyness is a footgun. Crank `max_head_offpolicyness` up and throughput goes up; too high and training destabilizes silently. This is the kind of knob that needs guardrails (monitoring, automatic back-off) that the framework doesn't provide out of the box.
- Scheduler compatibility. The docs note that agent workflows with the proxy approach are "supported on `local` and `slurm` schedulers only"; Ray is excluded because Ray's actor model doesn't play well with persistent HTTP connections. If your cluster is Ray-based, you can do synchronous RL but not the headline async-agent story.
- Security. RL-finetuned agents can develop surprising behaviors; the README warns about this explicitly for OpenClaw. If you're training a tool-using agent, the proxy is by definition giving the model real tool-execution capability. Isolation matters.
None of these are dealbreakers; they're the honest shape of the problem.
11. What's Worth Stealing
Even if you never run AReaL, four of its design moves are worth adopting:
(a) Separate your engines from your workflows
The engine/ vs. workflow/ split is a clean API boundary. Engines know how to generate tokens and compute gradients; workflows know what the rollout means (a math problem, a customer service chat, a tool-using agent). Keep them decoupled and you can reuse either without rewriting the other.
(b) Target the protocol, not the framework
If you're building something agent-adjacent (a logger, a monitoring tool, a training system, an evaluation harness), pick an existing protocol (OpenAI chat completions, Anthropic Messages, MCP) and speak it natively. Don't invent a new one. Your users will point their existing code at you and you'll get the network effect for free.
(c) Expose the algorithmic knobs, don't hide them
max_head_offpolicyness, use_decoupled_loss, recompute_logprobs — these are the exact levers a researcher needs to tune. Hiding them behind "stability" defaults would make the system harder to use, not easier. Documentation + defaults that work for the common case + knobs that advanced users can pull is the right combination.
(d) Ship many algorithms, commit to none
PyTorch's "many losses, many optimizers, you pick" philosophy applies to RL too. A framework that ships one algorithm is betting that algorithm will still be the right one in a year. AReaL's approach — 13 algorithms, shared workflow abstraction — amortizes the "which algorithm is best" question over whoever uses the framework next.
12. Final Read
AReaL is the kind of infrastructure that doesn't get the marketing attention of a new model release, but is what makes new model releases possible. It's solving a system-level bottleneck (GPU idleness in synchronous RL) with a system-level answer (asynchronous training with principled off-policyness control), and then extending that substrate outward to agent frameworks via a protocol-level bridge (the OpenAI-compatible proxy).
The "just replace base_url" promise is the most memorable part, but it's only one expression of a deeper commitment: meet the ecosystem where it lives. Agents speak chat completions. Training speaks Megatron / FSDP / Archon. Inference speaks vLLM / SGLang. AReaL doesn't try to replace any of these — it provides the missing glue that turns them into a training loop.
If you're doing serious RL work on LLMs, it's worth a weekend to stand up a GSM8K GRPO run and read through workflow/rlvr.py. If you're building agents and want to fine-tune them without rewriting, the openclaw/ example is 90% of the story. And if you're just curious about how a 2026-era RL system is architected, reading engine/fsdp_engine.py next to engine/megatron_engine.py is the clearest picture of what "pick your parallelism" actually looks like in production code.
The broader pattern — that RL for LLMs has now moved from "which algorithm?" to "which system architecture?" — is worth sitting with. Sample efficiency of PPO vs. GRPO matters less than whether your inference fleet is saturated, whether your off-policyness controller is stable, whether your proxy can handle agent traffic, whether your checkpoint interval matches your cluster's MTBF. That's the layer AReaL operates on, and it's where the field is going.
References
- Repo: `inclusionAI/AReaL`
- Paper: AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning (arXiv 2505.24298)
- Prior work: ReaLHF (MLSys 2025) — the synchronous predecessor
- Related projects built on AReaL:
- ASearcher — search agent, SOTA on search tasks
- AReaL-SEA — self-evolving 235B MoE model
- CAMEL-SETA — terminal agent RL
- Related posts on this blog:
- QED — Multi-Agent Math Proof Pipeline — a different kind of multi-model system
- From Wire Format to Design Philosophy — the "target the protocol, not the framework" idea, applied to extensibility
- Title: AReaL — Asynchronous RL Infrastructure for LLM Agents
- Author: wy
- Created at: 2026-04-22 18:00:00
- Updated at: 2026-04-22 15:32:22
- Link: https://yue-ruby-w.site/2026/04/22/AReaL-Asynchronous-RL-for-Agents/
- License: This work is licensed under CC BY-NC-SA 4.0.