QED — A Multi-Agent Math Proof Pipeline Built on CLI Subprocesses


Repo: proofQED/QED — Chenyang An, Qihao Ye, Minghao Pan.

QED is a pipeline that takes a LaTeX math problem and produces a rigorous natural-language proof. What's unusual about it — and what makes it worth reading in detail — is that it has no agent framework. No LangGraph, no AutoGen, no LangChain, no custom runtime. It orchestrates Claude, Codex, and Gemini purely by spawning their CLIs (claude -p, codex exec, gemini -p) from Python subprocess calls and reading their stdout. The agentic behavior emerges from carefully structured prompts + a three-stage pipeline + N×M parallel verification — nothing more.
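To make the "no framework" point concrete, here is a minimal sketch of what that orchestration layer looks like. The prompt flags (`claude -p`, `codex exec`, `gemini -p`) are the ones named above; any other flags QED actually passes are omitted, and the function names are mine, not the repo's.

```python
import subprocess

# One agent call = one vendor CLI subprocess; the agent's entire output
# channel is stdout. Flag sets beyond the prompt flag are an assumption.
def build_command(provider: str, prompt: str) -> list[str]:
    commands = {
        "claude": ["claude", "-p", prompt],
        "codex": ["codex", "exec", prompt],
        "gemini": ["gemini", "-p", prompt],
    }
    return commands[provider]

def run_agent(provider: str, prompt: str) -> str:
    # check=True surfaces CLI failures; retries stay the CLI's problem.
    result = subprocess.run(build_command(provider, prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Everything else in the pipeline is file paths and loop logic around calls like these.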

The project has already produced a real research result: it found a valid Carleman weight function for a wave-operator inverse problem (verified by three domain experts). That alone is worth paying attention to, because Carleman weight construction is historically a paper-a-year hand-crafted art.

This post is my high-level read of how QED works, what design ideas are worth stealing, and how the choices connect to the agent-design philosophy we discussed earlier.


1. High-Level Takeaways

If you only remember three things:

  1. Multi-model parallel search + multi-verifier cross-check. $N$ different models independently attempt the proof; $M$ different models independently verify each proof. A model that hallucinates a citation is very likely to be caught by a model that did not. QED builds a voting system at the verification layer, not at the generation layer.
  2. Structured prose tags do the heavy lifting, not code. <cite>...</cite> and <key-original-step>...</key-original-step> turn two of the most common LLM math-proving failures — fake citations and hidden hand-waving — into machine-checkable artifacts. The verifier independently re-derives which steps it thinks are hard, then flags mismatches.
  3. No framework, just CLI subprocesses. This is deliberate. By using each vendor's own CLI, QED gets free updates whenever the vendor ships a new model, avoids framework lock-in, and inherits each CLI's existing tool-use surface (bash, read, write, web search). The orchestration layer is the Unix pipe principle applied to agents.

The bet: the CLI is the tool, the prompt is the algorithm, and the pipeline is just bookkeeping.


2. What QED Is Trying to Do

The problem with LLMs and math is well known:

  • Hand-waving. "It is easy to see that..." / "clearly..." / "by standard techniques..." — the model slides past the hard step and calls it a proof.
  • Silent weakening. The model quietly replaces the stated problem with a weaker one that it can actually prove.
  • Hallucinated citations. Inventing theorems, mis-attributing results, giving URLs to papers that don't exist.
  • Repeated failure patterns. Without memory, the model tries the same dead-end approach round after round.
  • Single-shot brittleness. One model, one try, one answer.

QED's design is basically a list of direct countermeasures to each of those failure modes:

  • Hand-waving → <key-original-step> tagging + verifier-independent identification of hard steps + mismatch detection
  • Silent weakening → Phase 1 of verification: word-by-word comparison of the proof's goal against the original problem
  • Hallucinated citations → <cite>-tagged citations; the verifier fetches URLs and checks statements
  • Repeated failures → the previous round's status log is read at the start of every round; web search is targeted at what failed
  • Single-shot brittleness → multi-model parallel search (up to 3 in parallel) + multi-verifier cross-check (up to 3×3 = 9 verification passes per round)

3. The Pipeline in Three Stages

problem.tex
    ▼
[Stage 0] Literature Survey Agent
├─ classifies difficulty (Easy / Medium / Hard)
└─ surveys related work scaled to difficulty
    ▼
related_info/
    ▼
[Stage 1] Proof Search Loop (up to max_proof_iterations rounds)
├─ (optional) Brainstorm — multiple models propose strategies in parallel
├─ Proof Search — N models write N proofs in parallel
├─ Verification — each proof verified by M verifiers (N×M reports)
├─ Selector Agent — picks the best of the N proofs
└─ Verdict Agent — DONE (all verifiers PASS) or CONTINUE
    ▼ (when DONE or max rounds reached)
[Stage 2] Proof Effort Summary Agent
    ▼
proof_effort_summary.md

Stage 0 — Literature Survey (with difficulty scaling)

A single survey agent reads the problem and does two things:

  1. Rates the difficulty — Easy / Medium / Hard.
  2. Investigates the mathematical landscape, scaled to that difficulty. An easy problem gets a brief survey; a hard problem gets a full investigation that classifies the problem, identifies applicable theorems, and catalogs related results.

Crucially, the survey agent does not propose proof strategies — that's the prover's job. This separation matters: it prevents the survey from anchoring the prover on a single approach before it has seen the problem.

The difficulty label also drives the rest of the pipeline: Easy problems use a lighter 3-step verification, Medium/Hard use the full 5-phase structural + detailed verification.

Stage 1 — The Proof Loop

This is the heart of the system. Each round is either:

  • Single-model, single-verifier (3 steps): proof search → verify → verdict.
  • Multi-model, multi-verifier (4–5 steps): proof search ×N → verification ×N×M → selector → apply → verdict.
  • Easy-mode (3 steps): proof search → lightweight verify → verdict.

The multi-model mode is where QED earns its keep. Let's unpack it.

multi_model.providers: ["claude", "codex", "gemini"] means Claude, Codex, and Gemini each produce their own proof.md in isolated subdirectories. They don't see each other; they only see the problem + the survey + the previous round's feedback + optional human guidance.

Why three? Different models make different mistakes. A proof Claude cannot finish might be natural for Codex's style, or vice versa. Running them in parallel costs roughly 3× the tokens but adds no wall-clock time, which is cheap compared to running them serially and hoping the first one works.
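The fan-out itself is simple to sketch. `prove_fn` stands in for the actual CLI call; the directory layout and function names here are illustrative, not the repo's.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Each provider writes into its own isolated subdirectory and never sees
# the other proofs. Wall-clock time is one attempt; token cost is ~N.
def prove_in_parallel(providers, prove_fn, round_dir):
    round_dir = Path(round_dir)
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {}
        for provider in providers:
            workdir = round_dir / provider
            workdir.mkdir(parents=True, exist_ok=True)
            futures[provider] = pool.submit(prove_fn, provider, workdir)
        return {p: f.result() for p, f in futures.items()}
```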

Parallel multi-verifier verification

For N proofs and M verifiers you get N×M verification reports. If N=3 and M=3, that's 9 independent verification passes per round.

A proof is considered fully verified only if ALL verifiers PASS. One FAIL from any verifier → CONTINUE.
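The N×M fan-out and the unanimous-PASS rule together fit in a few lines. This is a sketch under my own naming, with `verify_fn(verifier, proof)` standing in for one CLI verification call returning "PASS" or "FAIL".

```python
from concurrent.futures import ThreadPoolExecutor

# Cross every proof with every verifier: N proofs x M verifiers = N*M reports.
def verify_all(proofs: dict, verifiers: list, verify_fn) -> dict:
    jobs = [(pid, v) for pid in proofs for v in verifiers]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        results = pool.map(lambda job: verify_fn(job[1], proofs[job[0]]), jobs)
    reports = {}
    for (pid, v), verdict in zip(jobs, results):
        reports.setdefault(pid, {})[v] = verdict
    return reports

def fully_verified(reports_for_proof: dict) -> bool:
    # One FAIL from any verifier sinks the proof.
    return all(v == "PASS" for v in reports_for_proof.values())
```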

This is the single biggest idea in the pipeline: cross-model verification as the primary defense against hallucination. Claude might accept its own fake citation. Gemini probably won't. Codex will catch Gemini's lazy step-skipping. The model that generated the error and the model that verifies it are decorrelated.

The selector agent

When multiple proofs exist, a selector agent reads all N×M verification reports and picks the single most promising proof based on:

  • Unanimous verifier agreement
  • Problem-statement integrity
  • Overall verdict
  • Fewest failures
  • Quality of partial progress
  • Structural completeness

If only one model is configured, the selector is skipped entirely — the single proof is used directly. This is a nice touch: the dispatcher layer gracefully degrades to a no-op at $N=1$.
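That graceful degradation can be sketched directly. The ranking here collapses the criteria above into "fewest verifier failures"; the real selector is an LLM weighing all six, so treat this as a structural sketch only.

```python
# At N=1 the selector is a no-op; otherwise rank by verifier failures
# (a stand-in for the fuller criteria: unanimity, integrity, progress...).
def select_proof(proofs: dict, reports: dict) -> str:
    if len(proofs) == 1:
        return next(iter(proofs))  # single proof used directly
    def failures(pid):
        return sum(1 for v in reports[pid].values() if v != "PASS")
    return min(proofs, key=failures)
```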

Stage 2 — Proof Effort Summary

After the loop terminates (DONE or max iterations), a summary agent reads every artifact and writes proof_effort_summary.md. This is both documentation and a post-mortem — useful for humans and for future rounds if the problem is re-attempted.


4. The Two Tags That Do All the Work

This is the part of QED that I think is the most portable idea for other agent systems.

<cite>...</cite>

Every external theorem, lemma, or result cited in the proof must be wrapped in a structured <cite> block containing:

  • Type (theorem / lemma / proposition / definition / ...)
  • Label (canonical name, e.g. "Hölder's inequality")
  • Title (descriptive)
  • Authors
  • Source URL
  • Verifier locator (where in the source to find it)
  • Exact statement (the prover's transcription of the result)
  • Usage (how it's being applied in this proof)

The verification agent independently fetches the URL, checks that the statement matches the source, and confirms the application is correct. Phase 2 of verification is literally "citation verification" — no model gets to invent a theorem and quietly use it.
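The expensive part of Phase 2 (fetching the URL and matching statements) needs a model, but the structural half is mechanically checkable. A sketch, with field labels assumed from the list above rather than taken from the real prompts:

```python
import re

# Extract <cite> blocks and flag any with missing required fields.
CITE_RE = re.compile(r"<cite>(.*?)</cite>", re.DOTALL)
REQUIRED_FIELDS = ("Label", "Source URL", "Exact statement", "Usage")

def incomplete_citations(proof_text: str) -> list[str]:
    return [block for block in CITE_RE.findall(proof_text)
            if not all(field in block for field in REQUIRED_FIELDS)]
```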

<key-original-step>...</key-original-step>

Every nontrivial, original step in the proof (novel reductions, hard estimates, key constructions) must be wrapped in this tag. The content inside must be maximally detailed — no hand-waving, no "clearly", no "standard techniques".

The verifier does something clever: it independently identifies which steps it considers nontrivial, without looking at the prover's tags, and then flags two kinds of mismatch:

  • Untagged hard steps — the prover glossed over something the verifier thinks is genuinely difficult. This is the "hiding the hard part" failure mode.
  • Inflated tags — the prover tagged something routine as a key step. This is the "inflating the work" failure mode.

The brilliance is that the verifier doesn't trust the prover's classification. It re-derives it, and the mismatch between the two classifications becomes a signal of dishonesty or error. This is machine-checkable epistemics.

If you're building anything that produces arguments rather than code, this pattern generalizes: have one agent produce structured claims, have another agent independently re-classify them, flag mismatches.
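The comparison at the heart of the pattern is just set difference. A minimal sketch:

```python
# The prover's tagged steps vs. the verifier's independently identified
# hard steps; each direction of mismatch is a distinct failure signal.
def tag_mismatches(prover_tagged: set, verifier_hard: set) -> dict:
    return {
        # verifier thinks it's hard, prover glossed over it
        "untagged_hard_steps": verifier_hard - prover_tagged,
        # prover tagged routine work as a key step
        "inflated_tags": prover_tagged - verifier_hard,
    }
```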


5. The Verification Pipeline (5 Phases)

Verification is split across two prompt files — proof_verify_structural.md (Phases 1–4) and proof_verify_detailed.md (Phase 5). This split matters because different phases have different cost profiles, and Phase 5 is expensive.

  • Phase 1 (Problem-Statement Integrity): word-by-word comparison of the original problem vs. what the proof claims. Catches silent weakening ("prove $x \le 2$" when the problem said "prove $x \le 1$").
  • Phase 2 (Citation Verification): every <cite> block — URL resolves, statement matches, usage is correct. Catches hallucinated theorems and fake attributions.
  • Phase 3 (Subgoal Tree Structure): is the proof's decomposition well-formed and complete? Catches missing branches in the proof's logical tree.
  • Phase 4 (Additional Verification Rules): human-provided per-round and global verification rules, treated as hard requirements. Lets human experts inject domain-specific checks.
  • Phase 5a (Logical Step Verification): fine-grained step-by-step checking with computational tools. Catches arithmetic and logical errors.
  • Phase 5b (Subgoal Resolution): every subgoal identified in Phase 3 is actually resolved. Catches "we claim X, let's return to it later" that never returns.
  • Phase 5c (Key Original Step Analysis): <key-original-step> mismatch detection. Catches hand-waving and inflation (see above).
  • Phase 5d (Coverage): chain completeness and problem-proof alignment. Catches proofs that drift off-topic.

Phase 1 alone is worth pausing on. A word-by-word comparison of the stated problem and what the proof actually proves sounds trivial, but it's the single most effective catch against LLMs quietly substituting a weaker theorem. You don't need an LLM to do this comparison reliably — but you do need something in the pipeline to force it.
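The comparison really can be near-mechanical. A sketch using a plain word diff, where any added or removed token is a candidate weakening to surface:

```python
import difflib

# Literal word diff between the stated problem and what the proof
# claims to prove; +/- tokens flag potential silent weakening.
def statement_drift(original: str, claimed: str) -> list[str]:
    return [d for d in difflib.ndiff(original.split(), claimed.split())
            if d.startswith(("+ ", "- "))]
```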


6. Human Guidance — Two Axes, Two Levels

QED exposes a surprisingly thoughtful human-in-the-loop interface:

  • Prover guidance (read by prover): global file human_help/additional_prove_human_help_global.md; per-round file verification/round_N/human_help/additional_prove_human_help_per_round.md
  • Verification rules (read by structural verifier): global file human_help/additional_verify_rule_global.md; per-round file verification/round_N/human_help/additional_verify_rule_per_round.md
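Reading both axes is a few lines of path logic. The paths follow the table above; the function itself is my sketch, and treating a missing file as empty guidance is an assumption about the real behavior.

```python
from pathlib import Path

# Two axes: audience (prover vs. verifier) x scope (global vs. per-round).
def load_guidance(output_dir, round_n: int, audience: str) -> str:
    stem = {"prover": "prove_human_help", "verifier": "verify_rule"}[audience]
    candidates = [
        Path(output_dir) / "human_help" / f"additional_{stem}_global.md",
        Path(output_dir) / "verification" / f"round_{round_n}"
            / "human_help" / f"additional_{stem}_per_round.md",
    ]
    # Missing files contribute nothing, so mid-run edits are optional.
    return "\n\n".join(p.read_text() for p in candidates if p.exists())
```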

You can edit these files while the pipeline is running. Reviewing round 5's failure? Write targeted feedback into round 6's per-round prover help. Noticed the verifier isn't checking a domain-specific condition? Add a rule to the verifier's global file.

This is the right factoring. It separates:

  • Who the guidance is for (prover vs. verifier)
  • Scope of applicability (all rounds vs. this round only)

And it makes explicit that "human feedback" in an agentic system isn't a single channel — the prover needs different guidance than the verifier.

The Carleman-weight research result reportedly used this channel exactly once: the human experts realized their constraints were too weak, strengthened them, and asked QED to continue from its previous candidate. No human intervention on the proof itself — just on the problem statement.


7. super_math_skill.md — The Methodology Guide

Every agent receives skill/super_math_skill.md as a system-level instruction. It's a ~13 KB guide to mathematical proof methodology — orientation, core strategies, stuck-recovery tactics, self-checking discipline.

The prose is uncompromising. The opening section is titled "The Cardinal Rule: Embrace Difficulty" and contains:

The hard part of the problem is the ENTIRE POINT. You must confront it directly.
When you feel the urge to write "clearly", "obviously", "it is easy to see", or "by standard arguments" — that is your signal that you are about to dodge the hard part. Stop. Go back. Do the work.

This is the same pattern we saw in Claude Code's system prompt — prose enforcement as engineering discipline. Every one of these rules is almost certainly a patch against an observed failure. The proof-search prompt itself explicitly lists seven "common avoidance patterns you MUST NOT do", each with the exact bad behavior named.

Interestingly, the skill file is delivered in two places:

  1. As a system-level instruction to every agent (so it's always in scope).
  2. As an explicit file-path input {skill_file} to the proof search agent (so it can re-read specific strategies).

This mirrors the progressive disclosure / dispatcher pattern — the skill content is "advertised" as part of the prompt, but the file is also accessible for deeper re-reading when needed.


8. Why No Agent Framework?

QED's architecture choice — Python subprocess spawns of vendor CLIs — looks primitive until you think about what the alternatives buy you.

An agent framework gives you: streaming, tool-use protocol, state management, retries, observability primitives.

Each vendor CLI already provides: streaming (stdout), tool-use (bash, read, write, WebFetch, WebSearch), state management (the CLI maintains conversation), retries (CLI handles transient errors), observability (JSON output mode).

So what's the framework for? Mostly cross-vendor abstraction. But QED explicitly wants to use different vendors with their different strengths — so a cross-vendor abstraction would be a liability, averaging away the differences it wants to exploit.

The CLI-subprocess approach also has three underrated advantages:

  1. Free model updates. Upgrade the CLI → you get the new model. No framework migration.
  2. Each CLI inherits its vendor's tool surface. Claude Code's Read/Edit/Bash are wired correctly. Codex and Gemini have their own equivalents. QED doesn't have to re-implement tool use three times.
  3. Debuggable. Every agent step is a spawnable shell command. You can copy the prompt, paste it into a terminal, and reproduce the exact behavior. No black-box framework state.

The only cost is that QED has to handle its own bookkeeping — file paths, round directories, token accounting. But that bookkeeping is ~100 KB of Python (pipeline.py) and it's completely transparent.

This is a pragmatic design: the pipeline is shell glue, the agents are processes. It's the Unix philosophy applied to LLMs.


9. Resume Support, Quietly Significant

If the pipeline is interrupted (crash, network blip, manual kill) and you re-run it with the same output directory, it:

  • Skips the literature survey if it's already complete
  • Detects which step within a round was last completed (including parallel steps)
  • Cleans up incomplete state (partial verification files, orphaned proof.md backups)
  • Restores proof.md from backup if needed
  • Resumes from exactly where it left off

Resume correctness is the quiet signature of a well-designed pipeline. A long-running agentic workflow (say, 9 rounds × 3 provers × 3 verifiers = 81 model calls per problem) will crash. If every crash means starting over, the system is unusable. The fact that QED has a coherent resume story suggests the file-layout is doing real work — each round's artifacts (proof.md, verification_result_*.md, selection.md, human-help files) are in predictable, inspectable locations.
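Because the artifacts live at predictable paths, resume detection reduces to scanning for the last round whose final artifact exists. A hypothetical sketch only: the directory layout is inferred from the artifacts named above, and the real pipeline.py also handles partial files, orphaned backups, and parallel-step state.

```python
from pathlib import Path

# Find the highest round whose final artifact (selection.md) is present.
def last_complete_round(output_dir) -> int:
    done = 0
    for round_dir in sorted(Path(output_dir).glob("verification/round_*")):
        n = int(round_dir.name.split("_")[1])
        if (round_dir / "selection.md").exists():
            done = max(done, n)
    return done  # resume from round done + 1
```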

This is also the feature that makes the per-round human-guidance story practical. You can crash the pipeline on purpose after round 5, write human feedback into round_6/human_help/, and resume.


10. Connecting to the Agent-Design Philosophy

QED is an interesting case study because it independently rediscovers a lot of the design principles from Thariq's "Seeing like an Agent":

  • Tools shaped to model abilities → each vendor's model used via its own CLI; no abstraction layer averages the differences away
  • Bar for adding a tool is high → agents have few tools (file IO, bash, web search); everything else is prose
  • Progressive disclosure → super_math_skill.md advertised via the system prompt; full content accessible via file path; related work stashed in related_info/
  • Prose enforcement as engineering → explicit "Cardinal Rule" and "Do NOT shy away" sections with enumerated failure patterns
  • Iterative refinement of tools → two-file verification split (structural vs. detailed); selector that skips at N=1
  • Tool-use as the unit of action → the pipeline orchestrates tool-use-capable CLIs; no custom tool protocol

The extra thing QED brings that Claude Code's design does not emphasize as much is cross-model adversarial verification. Claude Code has one Claude verifying another Claude (or itself). QED has Claude, Codex, and Gemini all verifying each other. If you're building something where correctness matters more than latency, that trade is the right one to make.


11. What I Think Is Missing or Risky

A few open questions I'd want to understand better:

  • Cost scaling. 3 provers × 3 verifiers × 9 rounds = 81 model calls per problem, each potentially doing nontrivial tool use. For a hard math problem that's real money. The config does allow single-model modes for cheaper runs, but the motivating case (Carleman weight functions) presumably cost a meaningful amount.
  • Selector objectivity. The selector is a single model reading N×M reports. If that one model has a systematic bias, it can misrank proofs. A simple improvement: run the selector with each of the N providers and take a majority vote, similar to how verification is cross-checked.
  • Skill file scope. super_math_skill.md is general mathematical methodology. For specific subdomains (PDEs, combinatorics, number theory), you'd ideally load a subdomain-specific skill too. The pipeline seems to support this in spirit but not explicitly in the current repo.
  • Citation verification depth. Phase 2 fetches URLs and checks statements, but it relies on the verifier's ability to do a web read-and-match. For paywalled or non-indexed sources, this degrades.
  • Security posture. The pipeline runs Claude / Codex / Gemini CLIs with sandbox bypassed (--dangerously-skip-permissions etc.). The README is explicit about this. Worth running in a container/VM.
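The selector-objectivity fix suggested above is cheap to express. A sketch, where `select_fn(provider)` stands in for one selector run returning a proof id:

```python
from collections import Counter

# Run the selector once per provider and take a majority vote, mirroring
# how verification is already cross-checked across vendors.
def majority_select(providers, select_fn) -> str:
    votes = Counter(select_fn(p) for p in providers)
    winner, _ = votes.most_common(1)[0]
    return winner
```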

None of these are deal-breakers. They're the kind of questions you'd ask of any v1 research pipeline.


12. What's Worth Stealing

If you're building agents that produce any kind of structured output (not just math), these four ideas transfer immediately:

(a) Cross-model adversarial verification

Don't trust any single model's self-verification. Run the verifier with a different model from the generator whenever the cost allows it. The error modes of Claude-3.5 and Gemini-Pro are decorrelated in ways that make this genuinely effective.

(b) Structured claim tagging with independent re-classification

<cite> and <key-original-step> are instances of a general pattern:

  1. Force the generator to wrap claims of a particular type in a structured tag.
  2. Force the verifier to independently identify which tokens deserve that tag.
  3. Compare the two sets. Mismatches are signals.

This generalizes. Want to catch hallucinated API calls? Tag them, verify independently. Want to catch unjustified assumptions? Tag them, verify independently.

(c) Two-axis human feedback

Prover vs. verifier × global vs. per-round is a clean factoring of "human-in-the-loop". Most agentic systems collapse these into a single "user feedback" channel and pay for it in ambiguity.

(d) Difficulty scaling with lightweight paths for easy cases

Stage 0's difficulty classification is cheap (one model call) and controls the heaviness of every subsequent stage. Simple problems get a 3-step easy-mode loop; hard problems get the full 5-phase structural + detailed verification. This is how you keep a high-assurance system from overpaying on easy inputs.
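In code, the whole mechanism is one cheap classification gating the round plan. Step names here paraphrase the modes described earlier; they are not the repo's identifiers.

```python
# One difficulty label controls how heavy every later stage is.
def round_plan(difficulty: str) -> list[str]:
    if difficulty == "Easy":
        return ["proof_search", "light_verify", "verdict"]
    # Medium/Hard get the full structural + detailed verification
    return ["proof_search", "structural_verify", "detailed_verify",
            "selector", "verdict"]
```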


13. Final Read

QED is a pipeline that solves a research-grade math problem by doing three things at once:

  1. Running multiple top-tier models in parallel, relying on their error decorrelation.
  2. Forcing structure onto the output (cited theorems, tagged key steps) so that verification can be machine-checkable.
  3. Enforcing epistemic discipline in prose ("don't hand-wave"; "prove exactly what was asked") — because at this generation of models, prose is the API for behavior.

The absence of an agent framework is not a limitation; it's a philosophical choice. QED is the Unix-philosophy version of agent orchestration — small, composable subprocess calls; filesystems as state; prompts as programs. It trades the polish of a framework for the legibility of a shell pipeline, and for a pipeline that runs 81 model calls per problem with resume support and per-round human guidance, that legibility is worth a lot.

The Carleman weight result suggests the approach is already pushing into territory that pre-LLM automation could not touch. I expect the real contribution of projects like QED is less about any single proof and more about showing that LLM reliability in math can be engineered, not just prayed for — by redundancy, structure, and ruthless verification.


References

  • Title: QED — A Multi-Agent Math Proof Pipeline Built on CLI Subprocesses
  • Author: wy
  • Created at: 2026-04-22 17:00:00
  • Updated at: 2026-04-22 14:52:34
  • Link: https://yue-ruby-w.site/2026/04/22/QED-Multi-Agent-Math-Proof-Pipeline/
  • License: This work is licensed under CC BY-NC-SA 4.0.