AReaL — Asynchronous RL Infrastructure for LLM Agents


Repo: inclusionAI/AReaL — Tsinghua IIIS + Ant Group. Paper (arXiv 2505.24298).

AReaL is an RL training system aimed at two things the field has been struggling to reconcile: scalable online RL for large reasoning models, and agentic RL where the "environment" is a black-box agent runtime (OpenAI Agents SDK, CAMEL, LangChain, whatever) that you shouldn't have to rewrite. Its central trick is fully asynchronous training — decoupling rollout generation from policy updates so that inference and training run concurrently on disaggregated GPUs.

This is a different kind of system from QED. QED is an orchestration pipeline that treats LLMs as fixed components and squeezes correctness out of them via multi-model voting and structured verification. AReaL is an infrastructure layer that updates the weights themselves via reinforcement learning, on agent trajectories that can be arbitrary user-defined loops. Both are practical, but they're operating at different layers of the stack.


1. High-Level Takeaways

If you only remember three things:

  1. Asynchronous > synchronous, by a lot. AReaL's v0.3 release shows 2.77× speedup over synchronous RL on comparable hardware. The trick is running rollouts (inference) and training (gradient steps) on separate GPU pools, overlapped in time, with principled handling of the staleness that introduces.
  2. "Just replace base_url" is the API contract. The headline example (OpenClaw, CAMEL, OpenAI Agents SDK) is that you train an existing agent runtime by pointing it at AReaL's OpenAI-compatible proxy. No framework rewrite. The proxy is an HTTP server that speaks chat completions, logs token-level telemetry, and returns trajectories for RL — so any agent that talks OpenAI chat-completions protocol is suddenly RL-trainable.
  3. Off-policyness is a controlled knob, not a defect. The entire async-RL literature lives or dies on how much stale data you can tolerate. AReaL exposes max_head_offpolicyness directly, pairs it with a decoupled PPO loss + recomputed logprobs, and lets the user tune the tradeoff. This is not a hidden implementation detail — it's a user-facing algorithmic choice.

The bet: the bottleneck of online RL for agents is generation latency, and the right fix is system-level (asynchronous GPU pools) rather than algorithm-level (more sample-efficient algorithms).


2. The Problem AReaL Is Trying to Solve

Online RL for LLMs has two flavors of pain:

(a) The GPU-idleness pain

Traditional synchronous online RL (PPO, GRPO, etc.) does this every step:

```python
while training:
    rollouts = generate(policy)   # inference GPUs work, training GPUs idle
    batch = collect(rollouts)     # both idle
    loss = train(policy, batch)   # training GPUs work, inference GPUs idle
```

For a large reasoning model with long chains of thought — say, 8k–32k tokens per rollout — inference dominates the wall-clock. The training GPUs spend most of their time waiting for rollouts to finish. As the model gets bigger and the chain-of-thought gets longer, the idleness fraction gets worse.
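To make the idleness claim concrete, here is a back-of-envelope sketch (all numbers are illustrative assumptions, not AReaL measurements):

```python
def train_pool_idle_fraction(gen_seconds: float, train_seconds: float) -> float:
    """Fraction of wall-clock time the training pool spends waiting
    for rollouts in a synchronous loop (generate, then train)."""
    return gen_seconds / (gen_seconds + train_seconds)

# Illustrative numbers: 5 minutes to generate a batch of long
# chains of thought, 1 minute for the gradient step.
print(train_pool_idle_fraction(gen_seconds=300, train_seconds=60))  # ~0.83
```

Longer chains of thought push `gen_seconds` up, so the idle fraction creeps toward 1, which is exactly the trend described above.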

(b) The agent-framework impedance mismatch

You want to RL-finetune an agent that was built with OpenAI Agents SDK or CAMEL or LangChain. Those frameworks:

  • Interact with LLMs through chat-completions APIs — no token IDs, no logprobs.
  • Have no reward mechanism (they're designed for inference).
  • Run sequentially (no parallel rollout collection out of the box).

Naive approach: rewrite the agent in a training-friendly API. That's weeks of work per agent framework, and it diverges from what you'll actually deploy.

What AReaL proposes

For (a): asynchronous RL — decouple rollout GPUs from training GPUs; allow staleness with principled control; use a decoupled PPO objective that's robust to off-policy data.

For (b): a proxy model client that looks like OpenAI — your agent sends chat-completion requests to the proxy; the proxy routes them to SGLang or vLLM; every request/response is logged with token-level telemetry; when the episode ends you assign a reward and the proxy hands the trajectory back for RL training.

The result: you change your agent's base_url from https://api.openai.com/v1 to the proxy URL, and the same code that runs inference now produces trainable trajectories.


3. Architecture Tour

AReaL's source tree (areal/) decomposes into the pieces you'd expect from a serious RL system:

```
areal/
├── api/                    # config dataclasses, CLI args, public types
├── dataset/                # dataset loaders for math / coding / etc.
├── engine/                 # training + inference engines
│   ├── fsdp_engine.py      (~73 KB) FSDP2 training
│   ├── megatron_engine.py  (~77 KB) Megatron-LM training (full parallelism)
│   ├── sglang_remote.py    (~16 KB) remote inference via SGLang
│   └── vllm_remote.py      (~17 KB) remote inference via vLLM
├── experimental/           # Archon backend, new features
├── infra/                  # cluster, launchers, scheduling
├── models/                 # model loaders, checkpointing
├── reward/                 # reward functions (gsm8k, code, etc.)
├── tools/                  # utilities
├── trainer/                # PPOTrainer, SFTTrainer, Distillation
├── utils/
└── workflow/               # rollout workflows
    ├── rlvr.py             # RL-with-verifier rollout
    ├── multi_turn.py       # multi-turn rollout
    ├── vision_rlvr.py      # VLM rollout
    ├── openai/             # OpenAI-compatible agent workflow
    ├── openai_agent/       # OpenAI Agents SDK integration
    ├── anthropic/          # Claude-compatible agent workflow
    └── langchain/          # LangChain integration
```

The useful mental model is four horizontal layers:

| Layer | What it does | Components |
|---|---|---|
| Training | forward/backward/update | engine/fsdp_engine.py, engine/megatron_engine.py, engine/experimental/archon/ |
| Inference | generate rollouts | engine/sglang_remote.py, engine/vllm_remote.py |
| Orchestration | gluing the two together | trainer/, infra/ (Ray / SLURM / local schedulers) |
| Workflow | the rollout policy, including agent frameworks | workflow/rlvr.py, workflow/multi_turn.py, workflow/openai/, etc. |

The separation of engine (how to train/infer) from workflow (what the rollout actually does) is the right factoring for agent RL. A single workflow can be reused across many algorithms (GRPO, PPO, DAPO, RLOO, LitePPO, SAPO, M2PO — all 13+ algorithms share the workflow abstraction), and a single algorithm can run over many workflows (math verification, multi-turn tool use, vision-language, agent frameworks).
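A minimal sketch of that factoring (hypothetical names, not AReaL's actual classes): engines know how to generate and train; workflows know what a rollout means, and the two meet only at a trajectory type.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Trajectory:
    token_ids: list[int]
    logprobs: list[float]
    reward: float = 0.0

class InferenceEngine(Protocol):
    def generate(self, prompt: str) -> Trajectory: ...

class Workflow(Protocol):
    def run_episode(self, engine: InferenceEngine, task: str) -> Trajectory: ...

class MathVerifierWorkflow:
    """RLVR-style workflow: one completion, reward from a verifier."""
    def __init__(self, verify):
        self.verify = verify

    def run_episode(self, engine, task):
        traj = engine.generate(task)
        traj.reward = 1.0 if self.verify(task, traj) else 0.0
        return traj
```

Any algorithm then consumes a list of `Trajectory` objects, so one workflow serves all algorithms and one algorithm runs over all workflows.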

Training backends

| Backend | DP | TP | SP | CP | PP | EP | 1D Packing | LoRA |
|---|---|---|---|---|---|---|---|---|
| Megatron | ZeRO-1 | | | | | | | ✅ (with vLLM) |
| PyTorch FSDP | FSDP2 | | | | | | | |
| PyTorch Archon | FSDP2 | | | | | | | |

(DP = data parallel, TP = tensor parallel, SP = sequence parallel, CP = context parallel, PP = pipeline parallel, EP = expert parallel.)

Megatron is the canonical "all the parallelism" choice; FSDP2 is the pragmatic default for most users; Archon is an experimental FSDP2-based backend that adds PP+EP without giving up LoRA compatibility. The fact that all three are maintained side-by-side tells you something about the actual state of the distributed-training world — no single backend dominates for all model-size / memory / feature combinations.

Inference backends

| Backend | TP | CP | PP | DP Attention | EP |
|---|---|---|---|---|---|
| vLLM | ? | ? | ? | | |
| SGLang | | | | | |

Both are supported because users want both. SGLang has better MoE / expert-parallel support and is the default; vLLM has a larger user base and is the fallback.


4. The Async RL Core Idea

Let's unpack what "fully asynchronous" actually means, because the term gets abused.

Synchronous RL (what AReaL is moving away from)

```
step t:   [GENERATE rollouts with policy π_t]     → [TRAIN on rollouts → π_{t+1}]
step t+1: [GENERATE rollouts with policy π_{t+1}] → [TRAIN → π_{t+2}]
...
```

Every rollout uses the current policy. No staleness, but lots of idle time.

Asynchronous RL (AReaL's mode)

```
inference GPUs: GENERATE (π_t) — GENERATE (π_{t+1}) — GENERATE (π_{t+2}) — ...
training GPUs:  TRAIN (from π_{t-k}) — TRAIN (from π_{t-k+1}) — ...

                rollouts can be k versions stale
```

Inference and training are fully decoupled. The training GPU is always consuming rollouts; the inference GPU is always producing them. Weight updates are broadcast asynchronously to the inference workers.
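The producer/consumer shape of this can be modeled in a few lines of asyncio (a toy illustration of the decoupling, not AReaL's mechanism): a bounded queue replaces the lockstep loop, so neither side ever blocks on a full generate-then-train cycle.

```python
import asyncio

async def producer(queue: asyncio.Queue, n_rollouts: int) -> None:
    """Stands in for the inference pool: keeps emitting rollouts."""
    for version in range(n_rollouts):
        await asyncio.sleep(0)            # placeholder for generation latency
        await queue.put(("rollout", version))
    await queue.put(None)                 # sentinel: no more rollouts

async def consumer(queue: asyncio.Queue, trained: list) -> None:
    """Stands in for the training pool: keeps consuming rollouts."""
    while (item := await queue.get()) is not None:
        trained.append(item)              # placeholder for a gradient step

async def main() -> list:
    queue = asyncio.Queue(maxsize=4)      # bounded buffer caps in-flight data
    trained: list = []
    await asyncio.gather(producer(queue, 8), consumer(queue, trained))
    return trained

print(len(asyncio.run(main())))  # 8
```

The `maxsize` of the buffer plays the same role, conceptually, as a staleness bound: it limits how far production can run ahead of consumption.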

The price: off-policyness. A rollout generated by policy $\pi_{t-k}$ is used to train $\pi_t$. The further apart they are, the more the policy gradient estimator is biased.

Off-policyness control

AReaL exposes this directly as max_head_offpolicyness:

```yaml
rollout:
  max_head_offpolicyness: 4   # allow up to 4 policy versions behind
```

  • 0 → synchronous (useful for baseline comparisons, ~2× slower)
  • 2–8 → typical async range; bigger = more throughput, less stability
  • Very large → quality degrades

What makes this a system-level design rather than an algorithm-level one: the partial rollouts idea. A single trajectory (especially a long reasoning chain) can span multiple policy versions — the first 4k tokens might come from $\pi_{t-2}$, the next 4k from $\pi_{t-1}$, etc. This is only possible because the inference side holds a KV cache that persists across weight updates.
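A toy version of the staleness gate (illustrative only, not AReaL's implementation) makes the semantics of the knob precise:

```python
def may_start_rollout(policy_version: int,
                      trainer_version: int,
                      max_head_offpolicyness: int) -> bool:
    """Allow a new rollout only if its head (the policy generating the
    first tokens) is at most `max_head_offpolicyness` versions behind
    the trainer's current policy."""
    return trainer_version - policy_version <= max_head_offpolicyness

# 2 versions behind with a budget of 4: allowed.
print(may_start_rollout(policy_version=6, trainer_version=8,
                        max_head_offpolicyness=4))   # True
# 7 versions behind: blocked until weights are refreshed.
print(may_start_rollout(policy_version=1, trainer_version=8,
                        max_head_offpolicyness=4))   # False
```

With `max_head_offpolicyness: 0` this reduces to the synchronous case: only the current policy may start rollouts.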

The decoupled PPO objective

Plain PPO assumes on-policy data. When the rollouts are off-policy, the ratio $r_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$ can swing wildly. AReaL pairs async rollouts with two algorithmic knobs:

```yaml
actor:
  use_decoupled_loss: true    # decoupled PPO
  recompute_logprobs: true    # recompute logprobs during training
```

  • recompute_logprobs: true — don't trust the logprobs the inference engine returned; recompute them under the current training policy. This matters because inference backends (vLLM, SGLang) can use different numerics from training backends (Megatron, FSDP), and tiny differences compound.
  • use_decoupled_loss: true — use a variant of the PPO objective that separates the importance-sampling correction from the clipping, handling off-policyness more gracefully.

These are knobs, not defaults for every case. The docs explicitly note that the decoupled loss "may conflict with certain algorithm configurations (e.g., SAPO)" and that effects on newer algorithms are "largely understudied." That honesty is refreshing.
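A simplified sketch of the idea behind the decoupled objective (my paraphrase; see the AReaL paper for the exact form): the importance-sampling correction toward the behavior policy is computed separately and treated as a constant, while the trust-region clipping is applied against recomputed proximal-policy logprobs.

```python
import math

def decoupled_ppo_loss(logp_theta: list[float],   # current policy (has grad)
                       logp_prox: list[float],    # recomputed proximal policy
                       logp_behav: list[float],   # stale behavior policy
                       adv: list[float],
                       eps: float = 0.2) -> float:
    """Per-token decoupled PPO loss, averaged (simplified sketch)."""
    losses = []
    for lt, lp, lb, a in zip(logp_theta, logp_prox, logp_behav, adv):
        w = math.exp(lp - lb)          # IS correction for staleness (constant)
        ratio = math.exp(lt - lp)      # clipped ratio vs proximal policy
        clipped = max(min(ratio, 1 + eps), 1 - eps)
        losses.append(-w * min(ratio * a, clipped * a))
    return sum(losses) / len(losses)

# Fully on-policy data (all logprobs equal) degenerates to plain
# policy gradient: loss = -mean(advantage).
print(decoupled_ppo_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 3.0]))
```

When the behavior and proximal policies coincide, `w` is 1 everywhere and this collapses to ordinary clipped PPO, which is the sense in which off-policyness becomes a knob rather than a regime change.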


5. The Proxy — Why "Just Replace base_url" Matters

This is the single most interesting design choice in AReaL, and it's worth explaining because it's the kind of thing that's easy to under-appreciate from the README.

The agent RL problem, precisely

You have an agent loop that looks like this:

```python
# Your existing agent code, unchanged
client = AsyncOpenAI(base_url=OPENAI_URL, api_key=OPENAI_KEY)

for turn in range(max_turns):
    response = await client.chat.completions.create(
        messages=conversation,
        tools=tools,
        # ...
    )
    # parse response, call tools, append to conversation, etc.
```

For RL you need:

  1. The token IDs Claude/GPT/whatever actually produced (not the decoded text).
  2. The logprobs under the generating policy.
  3. A reward signal for each trajectory.
  4. A way to collect many trajectories in parallel so you can compute advantages.

Chat-completions APIs don't give you (1) or (2). They have no notion of (3). And agent frameworks are often serial-by-default, which blocks (4).

What the proxy does

AReaL runs an HTTP server that speaks OpenAI chat-completions but is wired into its own inference engine (SGLang or vLLM). When the agent sends a request, the proxy:

  • Routes it to the inference engine.
  • Records the exact token IDs and logprobs.
  • Tracks which trajectory this request belongs to (via session IDs).
  • Returns an OpenAI-shaped response.

At episode end, you call the reward function. The proxy stitches together the full trajectory — every turn, every tool call, every token — and hands it to the trainer as a single trainable rollout.
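A toy model of that bookkeeping (hypothetical structure, not AReaL's code): every request is logged against a session ID, and at episode end the turns are concatenated into one trainable rollout.

```python
from dataclasses import dataclass, field

@dataclass
class TurnRecord:
    token_ids: list[int]
    logprobs: list[float]

@dataclass
class SessionStore:
    sessions: dict[str, list[TurnRecord]] = field(default_factory=dict)

    def log_turn(self, session_id: str, token_ids, logprobs) -> None:
        """Called by the proxy on every chat-completion response."""
        self.sessions.setdefault(session_id, []).append(
            TurnRecord(token_ids, logprobs))

    def finish(self, session_id: str, reward: float) -> dict:
        """Stitch all turns of a session into one trainable rollout."""
        turns = self.sessions.pop(session_id)
        return {
            "token_ids": [t for turn in turns for t in turn.token_ids],
            "logprobs": [l for turn in turns for l in turn.logprobs],
            "reward": reward,
        }
```

The agent never sees any of this: from its side, each turn is an ordinary OpenAI-shaped response.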

So the user's code changes by exactly one line:

```python
# Before
client = AsyncOpenAI(base_url="https://api.openai.com/v1", api_key=OPENAI_KEY)

# After
client = AsyncOpenAI(base_url="http://areal-proxy:8000/v1", api_key=AREAL_KEY)
```

That's it. The rest of the agent (tools, memory, prompts, retry logic) is unchanged.

Why this is the right design

Agent frameworks churn. LangChain in 2023 vs. 2026 is a different library. OpenAI Agents SDK didn't exist two years ago. CAMEL is on its third major version. A training system that required agent-framework-specific adapters would be permanently chasing.

By targeting the protocol (OpenAI chat completions) rather than the framework, AReaL future-proofs itself: every new framework that wants to be popular ends up supporting chat completions, and they all work with AReaL for free.

This is the same design principle as LSP (Language Server Protocol) for editors, or OCI for container runtimes: standardize the wire format, not the implementation.

The tradeoff

You can only train what the protocol exposes. If your agent does something clever with logit biases, speculative decoding, or non-standard sampling parameters, the proxy may not support it. The proxy is a funnel — it works beautifully for 95% of agent code, and for the remaining 5% you need the workflow-level integration (which AReaL also provides — see workflow/openai_agent/, workflow/langchain/).


6. The Algorithm Menagerie

AReaL supports 13+ RL / fine-tuning algorithms out of the box:

| Algorithm | What it's good at | Paper |
|---|---|---|
| PPO | classic baseline | arXiv:2203.02155 |
| GRPO | group-relative advantage, dominant on reasoning | arXiv:2402.03300 |
| GSPO | GRPO variant | arXiv:2507.18071 |
| DAPO | dynamic batch size | arXiv:2503.14476 |
| LitePPO | lighter-weight PPO | arXiv:2508.08221 |
| Dr.GRPO | GRPO variant | arXiv:2503.20783 |
| REINFORCE++ | | arXiv:2501.03262 |
| RLOO | leave-one-out advantage | arXiv:2402.14740 |
| SAPO | | arXiv:2511.20347 |
| M2PO | | arXiv:2510.01161 |
| SFT | supervised fine-tuning | - |
| RLHF | reward modeling | - |
| Distillation | | arXiv:2506.02208 |

This is a specific design choice: ship many algorithms, let the user pick. The alternative (ship one algorithm, optimize it hard) is what VeRL and OpenRLHF tend toward. AReaL's bet is that the RL-for-LLM field is moving fast enough that the "best" algorithm changes every six months, and a framework that supports many is more future-proof than one that hard-codes a single choice.

In practice: most users run GRPO. The others are there for research / niche cases.
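For reference, GRPO's group-relative advantage is just per-group reward normalization (the standard formulation): sample several rollouts of the same prompt, then score each against its group's mean and standard deviation.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own group of samples."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)   # population std over the group
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# 4 rollouts of one prompt; two solved the problem, two didn't.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[1, -1, 1, -1]
```

No learned value function is needed, which is a big part of why GRPO became the default for verifiable-reward reasoning tasks.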


7. Examples — What's Shipped

The examples/ directory is the real documentation:

| Example | What it demonstrates |
|---|---|
| math/gsm8k_rl.py | baseline GRPO on GSM8K — the default starter |
| multi_turn_math/ | multi-turn math with intermediate reasoning |
| tir/ | tool-integrated reasoning (code execution in the loop) |
| openclaw/ | agent RL via OpenClaw — the "just replace base_url" demo |
| openai_agents/ | OpenAI Agents SDK integration |
| camel/ | CAMEL multi-agent integration |
| tau2/ | τ²-bench (customer service agents) — the AReaL-SEA example |
| search_agent/ | ASearcher, a search agent that set SOTA |
| countdown/ | countdown game (hard reasoning / search) |
| alignment/ | RLHF reward modeling |
| distillation/ | distillation workflows |
| vlm/, vlm_npu/ | vision-language models |
| tau2/ (SEA) | self-evolving data synthesis — the 235B MoE model that beat GPT-5 on τ² |

Two of these are worth singling out:

openclaw/ — the headline demo

OpenClaw is a terminal-focused agent runtime. The training recipe is:

  1. Start AReaL's RL service — it spins up a proxy at http://x.x.x.x:port.
  2. Point OpenClaw at that proxy via its existing --base-url flag.
  3. Run a workload (command-line tasks, shell sessions, etc.).
  4. Rewards come from task-completion signals.
  5. AReaL harvests trajectories and updates the policy.

No OpenClaw code is modified. This is the "bring your own agent" promise, made concrete.

tau2/ — the SEA (self-evolving agent) example

AReaL-SEA pairs RL training with self-evolving data synthesis. A 235B MoE model, trained this way, reportedly surpasses GPT-5 on $\tau^2$-bench (customer service). The interesting bit isn't the model itself but the recipe: the training data for RL is generated adversarially by the agent, not collected from humans. This is closer to AlphaZero-style self-play than to traditional RLHF.


8. AReaL-lite — The Lightweight Variant

As of 2025-07-31, AReaL also ships AReaL-lite:

  • 80% fewer lines of code than full AReaL
  • 90% of the performance and core functionality
  • "Algorithm-first" API (easy to experiment with, harder to deploy at huge scale)
  • Natively supports fully asynchronous agentic RL

This is an increasingly common pattern in ML infra: a "production" codebase for the organization that built the system, and a "research" codebase that's a cleaner subset for external researchers. vLLM vs. vllm-lite, DeepSpeed vs. simpler trainer wrappers, etc. The split acknowledges that the constraints of running at Ant Group's scale are different from the constraints of running a postdoc's experiments.

For a researcher, the lite variant is usually the right entry point. For a production system, the full framework has things the lite version doesn't (full Megatron backend, every parallelism dimension, every scheduler).


9. Connection to the Agent Design Philosophy

Compared to the Claude Code and QED perspectives we've looked at, AReaL sits at a very different altitude:

| Layer | Example | Who it's for |
|---|---|---|
| Application | a specific agent (e.g., OpenClaw, a math prover) | end users |
| Orchestration | Claude Code (tool use), QED (multi-agent pipeline) | agent builders |
| Training infra | AReaL | model trainers |
| Serving infra | vLLM, SGLang | platform ops |

AReaL is below the orchestration layer — it's the substrate that produces the model that the orchestration layer then uses. But interestingly, the design principles still rhyme:

| Principle | In Claude Code / QED | In AReaL |
|---|---|---|
| Tools shaped to model abilities | CLI subprocesses, one per vendor | separate backends (FSDP / Megatron / Archon) for different model scales |
| Progressive disclosure | lazy-loaded skill bodies | lazy-loaded agent workflows via the proxy |
| Standardize the wire format, not the framework | OpenAI-style tools[] in the API | OpenAI-style chat completions for the proxy |
| Off-loading cost to the right place | prompt caching (amortize over session) | async GPU pools (amortize generation over training) |
| Give users the knobs explicitly | prose imperatives in system prompt | max_head_offpolicyness exposed in YAML |

The deepest parallel is point 3: AReaL's proxy and Claude Code's Skill system both work by standardizing a wire format that already exists and is ubiquitous (OpenAI chat completions, Anthropic tool use) rather than inventing a new one. Every agent framework already speaks chat completions; every LLM API already supports tool use. Building on those substrates is how you get composability for free.


10. What I Think Is Missing or Risky

A few places where I'd want to dig in more:

  • Proxy compatibility is protocol-first, which means "almost OpenAI" agents may break. If your framework uses logit bias, response_format, streaming tool calls, or other non-mainstream fields, the proxy may need updates. The README is explicit that the proxy works with chat completions; anything outside that boundary needs the workflow-level integration.
  • Async quality claims are benchmark-specific. The 2.77× speedup is reported on their benchmark setup (GSM8K, mid-sized models). At larger scale (235B MoE on τ²), throughput wins compound differently — you need to benchmark your own case.
  • Off-policyness is a footgun. Crank max_head_offpolicyness up and throughput goes up; too high and training destabilizes silently. This is the kind of knob that needs guardrails (monitoring, automatic back-off) that the framework doesn't provide out of the box.
  • Scheduler compatibility. The docs note that agent workflows with the proxy approach are "supported on local and slurm schedulers only"; Ray is excluded because Ray's actor model doesn't play well with persistent HTTP connections. If your cluster is Ray-based, you can do synchronous RL but not the headline async-agent story.
  • Security. RL-finetuned agents can develop surprising behaviors; the README warns about this explicitly for OpenClaw. If you're training a tool-using agent, the proxy is by definition giving the model real tool-execution capability. Isolation matters.

None of these are dealbreakers; they're the honest shape of the problem.


11. What's Worth Stealing

Even if you never run AReaL, four of its design moves are worth adopting:

(a) Separate your engines from your workflows

The engine/ vs. workflow/ split is a clean API boundary. Engines know how to generate tokens and compute gradients; workflows know what the rollout means (a math problem, a customer service chat, a tool-using agent). Keep them decoupled and you can reuse either without rewriting the other.

(b) Target the protocol, not the framework

If you're building something agent-adjacent (a logger, a monitoring tool, a training system, an evaluation harness), pick an existing protocol (OpenAI chat completions, Anthropic Messages, MCP) and speak it natively. Don't invent a new one. Your users will point their existing code at you and you'll get the network effect for free.

(c) Expose the algorithmic knobs, don't hide them

max_head_offpolicyness, use_decoupled_loss, recompute_logprobs — these are the exact levers a researcher needs to tune. Hiding them behind "stability" defaults would make the system harder to use, not easier. Documentation + defaults that work for the common case + knobs that advanced users can pull is the right combination.

(d) Ship many algorithms, commit to none

PyTorch's "many losses, many optimizers, you pick" philosophy applies to RL too. A framework that ships one algorithm is betting that algorithm will still be the right one in a year. AReaL's approach — 13 algorithms, shared workflow abstraction — amortizes the "which algorithm is best" question over whoever uses the framework next.


12. Final Read

AReaL is the kind of infrastructure that doesn't get the marketing attention of a new model release, but is what makes new model releases possible. It's solving a system-level bottleneck (GPU idleness in synchronous RL) with a system-level answer (asynchronous training with principled off-policyness control), and then extending that substrate outward to agent frameworks via a protocol-level bridge (the OpenAI-compatible proxy).

The "just replace base_url" promise is the most memorable part, but it's only one expression of a deeper commitment: meet the ecosystem where it lives. Agents speak chat completions. Training speaks Megatron / FSDP / Archon. Inference speaks vLLM / SGLang. AReaL doesn't try to replace any of these — it provides the missing glue that turns them into a training loop.

If you're doing serious RL work on LLMs, it's worth a weekend to stand up a GSM8K GRPO run and read through workflow/rlvr.py. If you're building agents and want to fine-tune them without rewriting, the openclaw/ example is 90% of the story. And if you're just curious about how a 2026-era RL system is architected, reading engine/fsdp_engine.py next to engine/megatron_engine.py is the clearest picture of what "pick your parallelism" actually looks like in production code.

The broader pattern — that RL for LLMs has now moved from "which algorithm?" to "which system architecture?" — is worth sitting with. Sample efficiency of PPO vs. GRPO matters less than whether your inference fleet is saturated, whether your off-policyness controller is stable, whether your proxy can handle agent traffic, whether your checkpoint interval matches your cluster's MTBF. That's the layer AReaL operates on, and it's where the field is going.


References

  • Title: AReaL — Asynchronous RL Infrastructure for LLM Agents
  • Author: wy
  • Created: 2026-04-22 18:00:00
  • Updated: 2026-04-22 15:32:22
  • Link: https://yue-ruby-w.site/2026/04/22/AReaL-Asynchronous-RL-for-Agents/
  • License: This work is licensed under CC BY-NC-SA 4.0.