AReaL — Asynchronous RL Infrastructure for LLM Agents


Repo: inclusionAI/AReaL — Tsinghua IIIS + Ant Group. Paper (arXiv 2505.24298).

AReaL is an RL training system aimed at two things the field has been struggling to reconcile: scalable online RL for large reasoning models, and agentic RL where the "environment" is a black-box agent runtime (OpenAI Agents SDK, CAMEL, LangChain, whatever) that you shouldn't have to rewrite. Its central trick is fully asynchronous training — decoupling rollout generation from policy updates so that inference and training run concurrently on disaggregated GPUs.

This is a different kind of system from QED. QED is an orchestration pipeline that treats LLMs as fixed components and squeezes correctness out of them via multi-model voting and structured verification. AReaL is an infrastructure layer that updates the weights themselves via reinforcement learning, on agent trajectories that can be arbitrary user-defined loops. Both are practical, but they're operating at different layers of the stack.


1. High-Level Takeaways

If you only remember three things:

  1. Asynchronous > synchronous, by a lot. AReaL's v0.3 release shows 2.77× speedup over synchronous RL on comparable hardware. The trick is running rollouts (inference) and training (gradient steps) on separate GPU pools, overlapped in time, with principled handling of the staleness that introduces.
  2. "Just replace base_url" is the API contract. The headline example (OpenClaw, CAMEL, OpenAI Agents SDK) is that you train an existing agent runtime by pointing it at AReaL's OpenAI-compatible proxy. No framework rewrite. The proxy is an HTTP server that speaks chat completions, logs token-level telemetry, and returns trajectories for RL — so any agent that talks OpenAI chat-completions protocol is suddenly RL-trainable.
  3. Off-policyness is a controlled knob, not a defect. The entire async-RL literature lives or dies on how much stale data you can tolerate. AReaL exposes max_head_offpolicyness directly, pairs it with a decoupled PPO loss + recomputed logprobs, and lets the user tune the tradeoff. This is not a hidden implementation detail — it's a user-facing algorithmic choice.

The bet: the bottleneck of online RL for agents is generation latency, and the right fix is system-level (asynchronous GPU pools) rather than algorithm-level (more sample-efficient algorithms).


2. The Problem AReaL Is Trying to Solve

Online RL for LLMs has two flavors of pain:

(a) The GPU-idleness pain

Traditional synchronous online RL (PPO, GRPO, etc.) does this every step:

```python
while training:
    rollouts = generate(policy)   # inference GPUs work, training GPUs idle
    batch = collect(rollouts)     # both idle
    loss = train(policy, batch)   # training GPUs work, inference GPUs idle
```

For a large reasoning model with long chains of thought — say, 8k–32k tokens per rollout — inference dominates the wall-clock. The training GPUs spend most of their time waiting for rollouts to finish. As the model gets bigger and the chain-of-thought gets longer, the idleness fraction gets worse.
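To make the idleness claim concrete, here is a back-of-envelope sketch (all numbers are illustrative assumptions, not AReaL measurements):

```python
def train_pool_idle_fraction(gen_seconds: float, train_seconds: float) -> float:
    """Fraction of wall-clock time the training pool spends waiting
    for rollouts in a synchronous loop (generate, then train)."""
    return gen_seconds / (gen_seconds + train_seconds)

# Illustrative numbers: 5 minutes to generate a batch of long
# chains of thought, 1 minute for the gradient step.
print(train_pool_idle_fraction(gen_seconds=300, train_seconds=60))  # ~0.83
```

Longer chains of thought push `gen_seconds` up, so the idle fraction creeps toward 1, which is exactly the trend described above.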

(b) The agent-framework impedance mismatch

You want to RL-finetune an agent that was built with OpenAI Agents SDK or CAMEL or LangChain. Those frameworks:

  • Interact with LLMs through chat-completions APIs — no token IDs, no logprobs.
  • Have no reward mechanism (they're designed for inference).
  • Run sequentially (no parallel rollout collection out of the box).

Naive approach: rewrite the agent in a training-friendly API. That's weeks of work per agent framework, and it diverges from what you'll actually deploy.

What AReaL proposes

For (a): asynchronous RL — decouple rollout GPUs from training GPUs; allow staleness with principled control; use a decoupled PPO objective that's robust to off-policy data.

For (b): a proxy model client that looks like OpenAI — your agent sends chat-completion requests to the proxy; the proxy routes them to SGLang or vLLM; every request/response is logged with token-level telemetry; when the episode ends you assign a reward and the proxy hands the trajectory back for RL training.

The result: you change your agent's base_url from https://api.openai.com/v1 to the proxy URL, and the same code that runs inference now produces trainable trajectories.


3. Architecture Tour

AReaL's source tree (areal/) decomposes into the pieces you'd expect from a serious RL system:

```
areal/
├── api/                    # config dataclasses, CLI args, public types
├── dataset/                # dataset loaders for math / coding / etc.
├── engine/                 # training + inference engines
│   ├── fsdp_engine.py      (~73 KB) FSDP2 training
│   ├── megatron_engine.py  (~77 KB) Megatron-LM training (full parallelism)
│   ├── sglang_remote.py    (~16 KB) remote inference via SGLang
│   └── vllm_remote.py      (~17 KB) remote inference via vLLM
├── experimental/           # Archon backend, new features
├── infra/                  # cluster, launchers, scheduling
├── models/                 # model loaders, checkpointing
├── reward/                 # reward functions (gsm8k, code, etc.)
├── tools/                  # utilities
├── trainer/                # PPOTrainer, SFTTrainer, Distillation
├── utils/
└── workflow/               # rollout workflows
    ├── rlvr.py             # RL-with-verifier rollout
    ├── multi_turn.py       # multi-turn rollout
    ├── vision_rlvr.py      # VLM rollout
    ├── openai/             # OpenAI-compatible agent workflow
    ├── openai_agent/       # OpenAI Agents SDK integration
    ├── anthropic/          # Claude-compatible agent workflow
    └── langchain/          # LangChain integration
```

The useful mental model is four horizontal layers:

| Layer | What it does | Components |
|---|---|---|
| Training | forward/backward/update | engine/fsdp_engine.py, engine/megatron_engine.py, engine/experimental/archon/ |
| Inference | generate rollouts | engine/sglang_remote.py, engine/vllm_remote.py |
| Orchestration | gluing the two together | trainer/, infra/ (Ray / SLURM / local schedulers) |
| Workflow | the rollout policy, including agent frameworks | workflow/rlvr.py, workflow/multi_turn.py, workflow/openai/, etc. |

The separation of engine (how to train/infer) from workflow (what the rollout actually does) is the right factoring for agent RL. A single workflow can be reused across many algorithms (GRPO, PPO, DAPO, RLOO, LitePPO, SAPO, M2PO — all 13+ algorithms share the workflow abstraction), and a single algorithm can run over many workflows (math verification, multi-turn tool use, vision-language, agent frameworks).
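A minimal sketch of that factoring (hypothetical names, not AReaL's actual classes): engines know how to generate and train; workflows know what a rollout means, and the two meet only at a trajectory type.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Trajectory:
    token_ids: list[int]
    logprobs: list[float]
    reward: float = 0.0

class InferenceEngine(Protocol):
    def generate(self, prompt: str) -> Trajectory: ...

class Workflow(Protocol):
    def run_episode(self, engine: InferenceEngine, task: str) -> Trajectory: ...

class MathVerifierWorkflow:
    """RLVR-style workflow: one completion, reward from a verifier."""
    def __init__(self, verify):
        self.verify = verify

    def run_episode(self, engine, task):
        traj = engine.generate(task)
        traj.reward = 1.0 if self.verify(task, traj) else 0.0
        return traj
```

Any algorithm then consumes a list of `Trajectory` objects, so one workflow serves all algorithms and one algorithm runs over all workflows.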

Training backends

| Backend | DP | TP | SP | CP | PP | EP | 1D Packing | LoRA |
|---|---|---|---|---|---|---|---|---|
| Megatron | ZeRO-1 | | | | | | | ✅ (with vLLM) |
| PyTorch FSDP | FSDP2 | | | | | | | |
| PyTorch Archon | FSDP2 | | | | | | | |

(DP = data parallel, TP = tensor parallel, SP = sequence parallel, CP = context parallel, PP = pipeline parallel, EP = expert parallel.)

Megatron is the canonical "all the parallelism" choice; FSDP2 is the pragmatic default for most users; Archon is an experimental FSDP2-based backend that adds PP+EP without giving up LoRA compatibility. The fact that all three are maintained side-by-side tells you something about the actual state of the distributed-training world — no single backend dominates for all model-size / memory / feature combinations.

Inference backends

| Backend | TP | CP | PP | DP Attention | EP |
|---|---|---|---|---|---|
| vLLM | ? | ? | ? | | |
| SGLang | | | | | |

Both are supported because users want both. SGLang has better MoE / expert-parallel support and is the default; vLLM has a larger user base and is the fallback.


4. The Async RL Core Idea

Let's unpack what "fully asynchronous" actually means, because the term gets abused.

Synchronous RL (what AReaL is moving away from)

```
step t:   [GENERATE rollouts with policy π_t]     → [TRAIN on rollouts → π_{t+1}]
step t+1: [GENERATE rollouts with policy π_{t+1}] → [TRAIN → π_{t+2}]
...
```

Every rollout uses the current policy. No staleness, but lots of idle time.

Asynchronous RL (AReaL's mode)

```
inference GPUs: GENERATE (π_t) — GENERATE (π_{t+1}) — GENERATE (π_{t+2}) — ...
training GPUs:  TRAIN (from π_{t-k}) — TRAIN (from π_{t-k+1}) — ...

                rollouts can be k versions stale
```

Inference and training are fully decoupled. The training GPU is always consuming rollouts; the inference GPU is always producing them. Weight updates are broadcast asynchronously to the inference workers.
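The producer/consumer shape of this can be modeled in a few lines of asyncio (a toy illustration of the decoupling, not AReaL's mechanism): a bounded queue replaces the lockstep loop, so neither side ever blocks on a full generate-then-train cycle.

```python
import asyncio

async def producer(queue: asyncio.Queue, n_rollouts: int) -> None:
    """Stands in for the inference pool: keeps emitting rollouts."""
    for version in range(n_rollouts):
        await asyncio.sleep(0)            # placeholder for generation latency
        await queue.put(("rollout", version))
    await queue.put(None)                 # sentinel: no more rollouts

async def consumer(queue: asyncio.Queue, trained: list) -> None:
    """Stands in for the training pool: keeps consuming rollouts."""
    while (item := await queue.get()) is not None:
        trained.append(item)              # placeholder for a gradient step

async def main() -> list:
    queue = asyncio.Queue(maxsize=4)      # bounded buffer caps in-flight data
    trained: list = []
    await asyncio.gather(producer(queue, 8), consumer(queue, trained))
    return trained

print(len(asyncio.run(main())))  # 8
```

The `maxsize` of the buffer plays the same role, conceptually, as a staleness bound: it limits how far production can run ahead of consumption.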

The price: off-policyness. A rollout generated by policy $\pi_{t-k}$ is used to train $\pi_t$. The further apart they are, the more the policy gradient estimator is biased.

Off-policyness control

AReaL exposes this directly as max_head_offpolicyness:

```yaml
rollout:
  max_head_offpolicyness: 4   # allow up to 4 policy versions behind
```

  • 0 → synchronous (useful for baseline comparisons, ~2× slower)
  • 2–8 → typical async range; bigger = more throughput, less stability
  • Very large → quality degrades

What makes this a system-level design rather than an algorithm-level one: the partial rollouts idea. A single trajectory (especially a long reasoning chain) can span multiple policy versions — the first 4k tokens might come from $\pi_{t-2}$, the next 4k from $\pi_{t-1}$, etc. This is only possible because the inference side holds a KV cache that persists across weight updates.
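A toy version of the staleness gate (illustrative only, not AReaL's implementation) makes the semantics of the knob precise:

```python
def may_start_rollout(policy_version: int,
                      trainer_version: int,
                      max_head_offpolicyness: int) -> bool:
    """Allow a new rollout only if its head (the policy generating the
    first tokens) is at most `max_head_offpolicyness` versions behind
    the trainer's current policy."""
    return trainer_version - policy_version <= max_head_offpolicyness

# 2 versions behind with a budget of 4: allowed.
print(may_start_rollout(policy_version=6, trainer_version=8,
                        max_head_offpolicyness=4))   # True
# 7 versions behind: blocked until weights are refreshed.
print(may_start_rollout(policy_version=1, trainer_version=8,
                        max_head_offpolicyness=4))   # False
```

With `max_head_offpolicyness: 0` this reduces to the synchronous case: only the current policy may start rollouts.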

The decoupled PPO objective

Plain PPO assumes on-policy data. When the rollouts are off-policy, the ratio $r_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$ can swing wildly. AReaL pairs async rollouts with two algorithmic knobs:

```yaml
actor:
  use_decoupled_loss: true    # decoupled PPO
  recompute_logprobs: true    # recompute logprobs during training
```

  • recompute_logprobs: true — don't trust the logprobs the inference engine returned; recompute them under the current training policy. This matters because inference backends (vLLM, SGLang) can use different numerics from training backends (Megatron, FSDP), and tiny differences compound.
  • use_decoupled_loss: true — use a variant of the PPO objective that separates the importance-sampling correction from the clipping, handling off-policyness more gracefully.

These are knobs, not defaults for every case. The docs explicitly note that the decoupled loss "may conflict with certain algorithm configurations (e.g., SAPO)" and that effects on newer algorithms are "largely understudied." That honesty is refreshing.
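A simplified sketch of the idea behind the decoupled objective (my paraphrase; see the AReaL paper for the exact form): the importance-sampling correction toward the behavior policy is computed separately and treated as a constant, while the trust-region clipping is applied against recomputed proximal-policy logprobs.

```python
import math

def decoupled_ppo_loss(logp_theta: list[float],   # current policy (has grad)
                       logp_prox: list[float],    # recomputed proximal policy
                       logp_behav: list[float],   # stale behavior policy
                       adv: list[float],
                       eps: float = 0.2) -> float:
    """Per-token decoupled PPO loss, averaged (simplified sketch)."""
    losses = []
    for lt, lp, lb, a in zip(logp_theta, logp_prox, logp_behav, adv):
        w = math.exp(lp - lb)          # IS correction for staleness (constant)
        ratio = math.exp(lt - lp)      # clipped ratio vs proximal policy
        clipped = max(min(ratio, 1 + eps), 1 - eps)
        losses.append(-w * min(ratio * a, clipped * a))
    return sum(losses) / len(losses)

# Fully on-policy data (all logprobs equal) degenerates to plain
# policy gradient: loss = -mean(advantage).
print(decoupled_ppo_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 3.0]))
```

When the behavior and proximal policies coincide, `w` is 1 everywhere and this collapses to ordinary clipped PPO, which is the sense in which off-policyness becomes a knob rather than a regime change.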


5. The Proxy — Why "Just Replace base_url" Matters

This is the single most interesting design choice in AReaL, and it's worth explaining because it's the kind of thing that's easy to under-appreciate from the README.

The agent RL problem, precisely

You have an agent loop that looks like this:

```python
# Your existing agent code, unchanged
client = AsyncOpenAI(base_url=OPENAI_URL, api_key=OPENAI_KEY)

for turn in range(max_turns):
    response = await client.chat.completions.create(
        messages=conversation,
        tools=tools,
        # ...
    )
    # parse response, call tools, append to conversation, etc.
```

For RL you need:

  1. The token IDs Claude/GPT/whatever actually produced (not the decoded text).
  2. The logprobs under the generating policy.
  3. A reward signal for each trajectory.
  4. A way to collect many trajectories in parallel so you can compute advantages.

Chat-completions APIs don't give you (1) or (2). They have no notion of (3). And agent frameworks are often serial-by-default, which blocks (4).

What the proxy does

AReaL runs an HTTP server that speaks OpenAI chat-completions but is wired into its own inference engine (SGLang or vLLM). When the agent sends a request, the proxy:

  • Routes it to the inference engine.
  • Records the exact token IDs and logprobs.
  • Tracks which trajectory this request belongs to (via session IDs).
  • Returns an OpenAI-shaped response.

At episode end, you call the reward function. The proxy stitches together the full trajectory — every turn, every tool call, every token — and hands it to the trainer as a single trainable rollout.
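A toy model of that bookkeeping (hypothetical structure, not AReaL's code): every request is logged against a session ID, and at episode end the turns are concatenated into one trainable rollout.

```python
from dataclasses import dataclass, field

@dataclass
class TurnRecord:
    token_ids: list[int]
    logprobs: list[float]

@dataclass
class SessionStore:
    sessions: dict[str, list[TurnRecord]] = field(default_factory=dict)

    def log_turn(self, session_id: str, token_ids, logprobs) -> None:
        """Called by the proxy on every chat-completion response."""
        self.sessions.setdefault(session_id, []).append(
            TurnRecord(token_ids, logprobs))

    def finish(self, session_id: str, reward: float) -> dict:
        """Stitch all turns of a session into one trainable rollout."""
        turns = self.sessions.pop(session_id)
        return {
            "token_ids": [t for turn in turns for t in turn.token_ids],
            "logprobs": [l for turn in turns for l in turn.logprobs],
            "reward": reward,
        }
```

The agent never sees any of this: from its side, each turn is an ordinary OpenAI-shaped response.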

So the user's code changes by exactly one line:

```python
# Before
client = AsyncOpenAI(base_url="https://api.openai.com/v1", api_key=OPENAI_KEY)

# After
client = AsyncOpenAI(base_url="http://areal-proxy:8000/v1", api_key=AREAL_KEY)
```

That's it. The rest of the agent (tools, memory, prompts, retry logic) is unchanged.

Why this is the right design

Agent frameworks churn. LangChain in 2023 vs. 2026 is a different library. OpenAI Agents SDK didn't exist two years ago. CAMEL is on its third major version. A training system that required agent-framework-specific adapters would be permanently chasing.

By targeting the protocol (OpenAI chat completions) rather than the framework, AReaL future-proofs itself: every new framework that wants to be popular ends up supporting chat completions, and they all work with AReaL for free.

This is the same design principle as LSP (Language Server Protocol) for editors, or OCI for container runtimes: standardize the wire format, not the implementation.

The tradeoff

You can only train what the protocol exposes. If your agent does something clever with logit biases, speculative decoding, or non-standard sampling parameters, the proxy may not support it. The proxy is a funnel — it works beautifully for 95% of agent code, and for the remaining 5% you need the workflow-level integration (which AReaL also provides — see workflow/openai_agent/, workflow/langchain/).


6. The Algorithm Menagerie

AReaL supports 13+ RL / fine-tuning algorithms out of the box:

| Algorithm | What it's good at | Paper |
|---|---|---|
| PPO | classic baseline | arXiv:2203.02155 |
| GRPO | group-relative advantage, dominant on reasoning | arXiv:2402.03300 |
| GSPO | GRPO variant | arXiv:2507.18071 |
| DAPO | dynamic batch size | arXiv:2503.14476 |
| LitePPO | lighter-weight PPO | arXiv:2508.08221 |
| Dr.GRPO | GRPO variant | arXiv:2503.20783 |
| REINFORCE++ | | arXiv:2501.03262 |
| RLOO | leave-one-out advantage | arXiv:2402.14740 |
| SAPO | | arXiv:2511.20347 |
| M2PO | | arXiv:2510.01161 |
| SFT | supervised fine-tuning | - |
| RLHF | reward modeling | - |
| Distillation | | arXiv:2506.02208 |

This is a specific design choice: ship many algorithms, let the user pick. The alternative (ship one algorithm, optimize it hard) is what VeRL and OpenRLHF tend toward. AReaL's bet is that the RL-for-LLM field is moving fast enough that the "best" algorithm changes every six months, and a framework that supports many is more future-proof than one that hard-codes a single choice.

In practice: most users run GRPO. The others are there for research / niche cases.
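For reference, GRPO's group-relative advantage is just per-group reward normalization (the standard formulation): sample several rollouts of the same prompt, then score each against its group's mean and standard deviation.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own group of samples."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)   # population std over the group
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# 4 rollouts of one prompt; two solved the problem, two didn't.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[1, -1, 1, -1]
```

No learned value function is needed, which is a big part of why GRPO became the default for verifiable-reward reasoning tasks.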


7. Examples — What's Shipped

The examples/ directory is the real documentation:

| Example | What it demonstrates |
|---|---|
| math/gsm8k_rl.py | baseline GRPO on GSM8K — the default starter |
| multi_turn_math/ | multi-turn math with intermediate reasoning |
| tir/ | tool-integrated reasoning (code execution in the loop) |
| openclaw/ | agent RL via OpenClaw — the "just replace base_url" demo |
| openai_agents/ | OpenAI Agents SDK integration |
| camel/ | CAMEL multi-agent integration |
| tau2/ | τ²-bench (customer service agents) — the AReaL-SEA example |
| search_agent/ | ASearcher, a search agent that set SOTA |
| countdown/ | countdown game (hard reasoning / search) |
| alignment/ | RLHF reward modeling |
| distillation/ | distillation workflows |
| vlm/, vlm_npu/ | vision-language models |
| tau2/ (SEA) | self-evolving data synthesis — the 235B MoE model that beat GPT-5 on τ² |

Two of these are worth singling out:

openclaw/ — the headline demo

OpenClaw is a terminal-focused agent runtime. The training recipe is:

  1. Start AReaL's RL service — it spins up a proxy at http://x.x.x.x:port.
  2. Point OpenClaw at that proxy via its existing --base-url flag.
  3. Run a workload (command-line tasks, shell sessions, etc.).
  4. Rewards come from task-completion signals.
  5. AReaL harvests trajectories and updates the policy.

No OpenClaw code is modified. This is the "bring your own agent" promise, made concrete.

tau2/ — the SEA (self-evolving agent) example

AReaL-SEA pairs RL training with self-evolving data synthesis. A 235B MoE model, trained this way, reportedly surpasses GPT-5 on $\tau^2$-bench (customer service). The interesting bit isn't the model itself but the recipe: the training data for RL is generated adversarially by the agent, not collected from humans. This is closer to AlphaZero-style self-play than to traditional RLHF.


8. AReaL-lite — The Lightweight Variant

As of 2025-07-31, AReaL also ships AReaL-lite:

  • 80% fewer lines of code than full AReaL
  • 90% of the performance and core functionality
  • "Algorithm-first" API (easy to experiment with, harder to deploy at huge scale)
  • Natively supports fully asynchronous agentic RL

This is an increasingly common pattern in ML infra: a "production" codebase for the organization that built the system, and a "research" codebase that's a cleaner subset for external researchers. vLLM vs. vllm-lite, DeepSpeed vs. simpler trainer wrappers, etc. The split acknowledges that the constraints of running at Ant Group's scale are different from the constraints of running a postdoc's experiments.

For a researcher, the lite variant is usually the right entry point. For a production system, the full framework has things the lite version doesn't (full Megatron backend, every parallelism dimension, every scheduler).


9. Connection to the Agent Design Philosophy

Compared to the Claude Code and QED perspectives we've looked at, AReaL sits at a very different altitude:

| Layer | Example | Who it's for |
|---|---|---|
| Application | a specific agent (e.g., OpenClaw, a math prover) | end users |
| Orchestration | Claude Code (tool use), QED (multi-agent pipeline) | agent builders |
| Training infra | AReaL | model trainers |
| Serving infra | vLLM, SGLang | platform ops |

AReaL is below the orchestration layer — it's the substrate that produces the model that the orchestration layer then uses. But interestingly, the design principles still rhyme:

| Principle | In Claude Code / QED | In AReaL |
|---|---|---|
| Tools shaped to model abilities | CLI subprocesses, one per vendor | separate backends (FSDP / Megatron / Archon) for different model scales |
| Progressive disclosure | lazy-loaded skill bodies | lazy-loaded agent workflows via the proxy |
| Standardize the wire format, not the framework | OpenAI-style tools[] in the API | OpenAI-style chat completions for the proxy |
| Off-loading cost to the right place | prompt caching (amortize over session) | async GPU pools (amortize generation over training) |
| Give users the knobs explicitly | prose imperatives in system prompt | max_head_offpolicyness exposed in YAML |

The deepest parallel is point 3: AReaL's proxy and Claude Code's Skill system both work by standardizing a wire format that already exists and is ubiquitous (OpenAI chat completions, Anthropic tool use) rather than inventing a new one. Every agent framework already speaks chat completions; every LLM API already supports tool use. Building on those substrates is how you get composability for free.


10. What I Think Is Missing or Risky

A few places where I'd want to dig in more:

  • Proxy compatibility is protocol-first, which means "almost OpenAI" agents may break. If your framework uses logit bias, response_format, streaming tool calls, or other non-mainstream fields, the proxy may need updates. The README is explicit that the proxy works with chat completions; anything outside that boundary needs the workflow-level integration.
  • Async quality claims are benchmark-specific. The 2.77× speedup is reported on their benchmark setup (GSM8K, mid-sized models). At larger scale (235B MoE on τ²), throughput wins compound differently — you need to benchmark your own case.
  • Off-policyness is a footgun. Crank max_head_offpolicyness up and throughput goes up; too high and training destabilizes silently. This is the kind of knob that needs guardrails (monitoring, automatic back-off) that the framework doesn't provide out of the box.
  • Scheduler compatibility. The docs note that agent workflows with the proxy approach are "supported on local and slurm schedulers only"; Ray is excluded because Ray's actor model doesn't play well with persistent HTTP connections. If your cluster is Ray-based, you can do synchronous RL but not the headline async-agent story.
  • Security. RL-finetuned agents can develop surprising behaviors; the README warns about this explicitly for OpenClaw. If you're training a tool-using agent, the proxy is by definition giving the model real tool-execution capability. Isolation matters.

None of these are dealbreakers; they're the honest shape of the problem.


11. What's Worth Stealing

Even if you never run AReaL, four of its design moves are worth adopting:

(a) Separate your engines from your workflows

The engine/ vs. workflow/ split is a clean API boundary. Engines know how to generate tokens and compute gradients; workflows know what the rollout means (a math problem, a customer service chat, a tool-using agent). Keep them decoupled and you can reuse either without rewriting the other.

(b) Target the protocol, not the framework

If you're building something agent-adjacent (a logger, a monitoring tool, a training system, an evaluation harness), pick an existing protocol (OpenAI chat completions, Anthropic Messages, MCP) and speak it natively. Don't invent a new one. Your users will point their existing code at you and you'll get the network effect for free.

(c) Expose the algorithmic knobs, don't hide them

max_head_offpolicyness, use_decoupled_loss, recompute_logprobs — these are the exact levers a researcher needs to tune. Hiding them behind "stability" defaults would make the system harder to use, not easier. Documentation + defaults that work for the common case + knobs that advanced users can pull is the right combination.

(d) Ship many algorithms, commit to none

PyTorch's "many losses, many optimizers, you pick" philosophy applies to RL too. A framework that ships one algorithm is betting that algorithm will still be the right one in a year. AReaL's approach — 13 algorithms, shared workflow abstraction — amortizes the "which algorithm is best" question over whoever uses the framework next.


12. Final Read

AReaL is the kind of infrastructure that doesn't get the marketing attention of a new model release, but is what makes new model releases possible. It's solving a system-level bottleneck (GPU idleness in synchronous RL) with a system-level answer (asynchronous training with principled off-policyness control), and then extending that substrate outward to agent frameworks via a protocol-level bridge (the OpenAI-compatible proxy).

The "just replace base_url" promise is the most memorable part, but it's only one expression of a deeper commitment: meet the ecosystem where it lives. Agents speak chat completions. Training speaks Megatron / FSDP / Archon. Inference speaks vLLM / SGLang. AReaL doesn't try to replace any of these — it provides the missing glue that turns them into a training loop.

If you're doing serious RL work on LLMs, it's worth a weekend to stand up a GSM8K GRPO run and read through workflow/rlvr.py. If you're building agents and want to fine-tune them without rewriting, the openclaw/ example is 90% of the story. And if you're just curious about how a 2026-era RL system is architected, reading engine/fsdp_engine.py next to engine/megatron_engine.py is the clearest picture of what "pick your parallelism" actually looks like in production code.

The broader pattern — that RL for LLMs has now moved from "which algorithm?" to "which system architecture?" — is worth sitting with. Sample efficiency of PPO vs. GRPO matters less than whether your inference fleet is saturated, whether your off-policyness controller is stable, whether your proxy can handle agent traffic, whether your checkpoint interval matches your cluster's MTBF. That's the layer AReaL operates on, and it's where the field is going.


References

  • Title: AReaL — Asynchronous RL Infrastructure for LLM Agents
  • Author: wy
  • Created: 2026-04-22 18:00:00
  • Updated: 2026-04-22 15:32:22
  • Link: https://yue-ruby-w.site/2026/04/22/AReaL-Asynchronous-RL-for-Agents/
  • License: This work is licensed under CC BY-NC-SA 4.0.