From Wire Format to Design Philosophy — Why Claude Code Skills Look The Way They Do


A follow-up to How Claude Code Skills Work Under the Hood.

Recap — What the Packet Capture Told Us

The previous post was a wire-level reverse-engineering exercise. By MITM-intercepting Claude Code's outgoing HTTPS, the author of the Xiaohongshu video proved four things about Skills:

  1. There's a single Skill tool registered in tools[] — one string argument, plain Anthropic function-calling schema.
  2. The "available skills" catalog is a plain-English bullet list pasted into the system prompt.
  3. Each SKILL.md body is lazy-loaded: advertised by its description in the catalog, only shipped when the tool is actually invoked.
  4. The model is forced to use the tool via strongly-worded imperatives in the system prompt ("BLOCKING REQUIREMENT", "invoke IMMEDIATELY", "NEVER just announce").

The punchline was the big overlay text "Function Calling" — no new API, no out-of-band protocol, just tool-use plus text.
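Concretely, the single dispatcher entry in tools[] can be sketched as below. Everything beyond "one string argument, plain Anthropic function-calling schema" (the parameter name, the description wording) is an assumption for illustration, not a captured value:

```python
# Hypothetical reconstruction of the single Skill tool entry in tools[].
# The packet capture only establishes "one tool, one string argument";
# field contents here are illustrative assumptions.
skill_tool = {
    "name": "Skill",
    "description": "Execute a skill from the available skills catalog.",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "Name of the skill to invoke.",
            }
        },
        "required": ["command"],
    },
}
```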

That answered how. It did not answer why.

  • Why a dispatcher tool rather than one tool per skill?
  • Why a plain-text catalog rather than a structured registry?
  • Why prose enforcement rather than harness-level rules?
  • Why lazy-load the body instead of shipping everything?

These choices look arbitrary until you read the design brief from someone who actually works on the harness.


Enter Thariq's "Seeing Like an Agent"

Thariq Shihipar (member of technical staff at Anthropic, one of the people building Claude Code) published a piece titled "Seeing Like an Agent: Tool Design for Claude Code". It's the clearest public articulation of the design philosophy behind the harness we just reverse-engineered.

"You want to give it tools that are shaped to its own abilities."

That one sentence reframes every weird choice we saw on the wire.

The Core Principle: The Bar for Adding a Tool Is High

Thariq writes:

"The bar to add a new tool is high, because this gives the model one more option to think about."

Every tool is a cognitive tax on the model. More tools means a bigger decision each turn: which one to pick, when to pick nothing at all, when two tools overlap. The harness designer is trading off coverage (can the model do X?) against confusion (does adding an X tool make the model worse at Y?).

This is why the Skill tool is a dispatcher and not one tool per skill. If every skill — xlsx, pdf, pptx, docx, mcp-builder, skill-creator, ... — were its own top-level tool, the tools[] array would balloon, every tool description would compete with every other for the model's attention, and pick-the-right-tool errors would multiply.

Instead: one dispatcher tool, N catalog entries. The decision space the model sees is "do I need any skill?" (binary), followed by "which one?" (multi-class). That factored decision is cheaper and more robust than a flat (N + 1)-way choice.
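That two-stage shape can be sketched in a few lines, assuming a hypothetical on-disk layout (`skills/<name>/SKILL.md`) and invented catalog entries:

```python
from pathlib import Path

# Hypothetical catalog: name -> short advertisement. Entries and wording
# are illustrative, not text from the actual capture.
CATALOG = {
    "xlsx": "Create, edit, and analyze spreadsheets (.xlsx, .xlsm, .csv, .tsv)",
    "pdf": "Fill forms, merge, and extract text from PDF files",
    "pptx": "Create and edit slide decks",
}

def render_catalog() -> str:
    """The always-present advertisement pasted into the system prompt."""
    lines = ["Available skills:"]
    lines += [f"- {name}: {desc}" for name, desc in CATALOG.items()]
    return "\n".join(lines)

def dispatch(skill_name: str, skills_dir: Path = Path("skills")) -> str:
    """Stage two ('which one?'): lazy-load the body only on invocation."""
    if skill_name not in CATALOG:
        raise KeyError(f"unknown skill: {skill_name!r}")
    return (skills_dir / skill_name / "SKILL.md").read_text()
```

The catalog is always in context; the N bodies never are until `dispatch` runs — the factored choice and the lazy load fall out of the same structure.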

Iterative Refinement: The AskUserQuestion Story

Thariq's case study on the AskUserQuestion tool is the best public illustration of "shape the tool to the model". The team went through three attempts:

| Attempt | Approach | Failure mode |
| --- | --- | --- |
| 1 | Add a question parameter to the existing ExitPlanMode tool | Model got confused when user answers conflicted with generated plans; unclear when to re-invoke |
| 2 | Ask Claude to output questions as structured markdown | Model "appended extra sentences, dropped options, abandoned the structure altogether" — unreliable format |
| 3 | Dedicated AskUserQuestion tool, callable any time, renders a blocking modal | Worked: structured output that also respects how Claude naturally interacts |
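Attempt 3's fix can be sketched as a tool definition: a JSON Schema enforces the structure at the API level, which attempt 2's "please emit markdown" instruction could not guarantee. The field names here are assumptions for illustration, not the shipped schema:

```python
# Hypothetical shape of the dedicated AskUserQuestion tool (attempt 3).
# The schema makes "question plus options" a hard constraint on the
# model's output instead of a formatting request it can drift away from.
ask_user_question = {
    "name": "AskUserQuestion",
    "description": "Ask the user a clarifying question; blocks until answered.",
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Answer choices rendered in the modal.",
            },
        },
        "required": ["question"],
    },
}
```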

The lesson: the right tool isn't obvious a priori. It's the one that survives contact with the model's actual failure modes. Design → ship → watch logs → redesign.

This is also why the Skills system has so many rough prose edges in the system prompt. "BLOCKING REQUIREMENT", "NEVER just announce", "invoke IMMEDIATELY" — every one of those imperatives almost certainly exists because someone watched a session where the model hallucinated invocation in prose, asked a clarifying question instead of calling the tool, or re-invoked a skill whose body was already in context.

Tools That Are Now Constraining: TodoWrite → Task

Thariq has a striking observation:

"Being sent reminders of the todo list made Claude think that it had to stick to the list instead of modifying it."

Early Claude Code had a TodoWrite tool with a system reminder every five turns. As models got better, the reminder became counterproductive — Claude treated the list as a commitment rather than a plan. The team replaced TodoWrite with a Task tool supporting dependencies and cross-agent coordination.

"The tools your models once needed might now be constraining them."

This is a posture most harness designers don't adopt. Tools are usually added monotonically — remove one and you might break a user flow. Anthropic is willing to remove them because model capability is moving and a useful tool at model version N can become dead weight at N+1.

The packet capture still showed TodoWrite in the tools array, so it hadn't been removed as of the capture; but the internal trajectory (per Thariq) is "shrink the harness as the model grows into it."

Progressive Disclosure: The Real Reason Skills Lazy-Load

The single sharpest design choice the video exposed — ship a short description, lazy-load the body — has a name in Thariq's framework: progressive disclosure.

"Rather than embedding comprehensive documentation in the system prompt, Claude receives links to searchable docs it can access when needed."

The principle: don't pay for context you don't need yet. The advertisement (the description) is always present so the model knows the capability exists; the body is pulled in only when the model decides the capability is relevant.

This generalizes beyond Skills. It's the same principle behind:

  • The Claude Code Guide subagent: searches documentation independently and returns only relevant answers, keeping the main agent's context clean.
  • Grep / Glob / Read instead of stuffing files into the system prompt.
  • WebFetch instead of pre-caching pages.
  • MCP tool lists being queryable rather than inlined up front (at least in principle; in practice Claude Code inlines them too).

Progressive disclosure is what makes Skills economically viable at scale. A single skill with a 2000-line body would be catastrophic to ship every turn; advertised in 300 characters, it's cheap.
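The economics can be put in back-of-envelope numbers. All figures below are illustrative assumptions (a rough tokens-per-line and chars-per-token conversion), not measurements:

```python
# Eager vs lazy shipping of one skill body over a session.
# Assumptions: ~10 tokens per line of body, ~4 chars per token for the
# description, a 40-turn session. None of these are measured values.
BODY_TOKENS = 2000 * 10        # a 2000-line SKILL.md body if inlined
DESC_TOKENS = 300 // 4         # a ~300-character catalog advertisement
TURNS = 40

eager = BODY_TOKENS * TURNS                  # ship the body every turn
lazy = DESC_TOKENS * TURNS + BODY_TOKENS     # advertise always, load once

assert lazy < eager   # ~23K vs ~800K input tokens in this sketch
```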

Thariq's other big evolution story:

"Claude went from not really being able to build its own context to being able to do nested search across several layers of files to find the exact context it needed."

Early agent systems (pre-Claude 3.5, roughly) relied heavily on RAG — pre-indexed vector databases, chunk retrieval, context injection. The shift to giving the model search tools directly and letting it decide what to fetch is a bet on capability growth: the model is now good enough that agent-led retrieval beats retrieval-engineered-for-the-model.

Skills ride on the same bet. The catalog description is a search hint; the model decides when to "retrieve" the skill body. Anthropic trusts the model to pick correctly, and writes descriptions with enough concrete file-type / scenario triggers to make the pick easy.


Tying It Together — Why The Wire Format Looks The Way It Does

Now we can read the packet capture as an intentional design, not an accident of implementation:

| Wire-format observation | Design rationale (per Thariq) |
| --- | --- |
| One Skill dispatcher tool instead of N skill tools | Keep the top-level decision space small; raise the bar for per-tool cognitive tax |
| Plain-text "Available skills:" catalog in system prompt | Minimum viable advertisement; descriptions are classifier specs, not marketing |
| SKILL.md body shipped only on invocation | Progressive disclosure — don't pay for context you don't need |
| Strong prose imperatives ("BLOCKING REQUIREMENT") | Guard rails against observed failure modes (hallucinated invocation, stalling, double-invocation) |
| `<command-name>` carve-out for slash commands | Case-analysis in prose prevents double-invocation; preserves a single clean control flow |
| Skills authorable by writing markdown, no code needed | Lowers the authoring bar; makes the extension surface match the dominant model input modality |
| ~78 KB request for a one-word prompt | Accepted cost of the catalog; relies on prompt caching (5-min TTL) to amortize over a session |

None of these is a shortcut; each is a conscious trade within a principled framework.


My Commentary — What This Actually Means for Agent Builders

1. Tools are a cognitive budget, not a feature list

If you're building an agent (via MCP, Claude SDK, your own harness), the temptation is to expose every capability as a distinct tool because "more tools = more features." Don't. Every tool you add is a decision the model has to make every turn. A dispatcher tool with a catalog is almost always more robust than a flat list once you get past ~10 capabilities.

The implicit rule Anthropic seems to use: if two tools are plausible choices in the same context, at least one of them is designed wrong.

2. Progressive disclosure is the default, not the optimization

It's tempting to "just inline" documentation, examples, schemas in the system prompt because the model will use them better that way. That works at 1K tokens. At 100K tokens of context, the model gets worse at picking — and you pay for it every turn. Progressive disclosure flips this: advertise, then load on demand. That's what skills, agent files, doc searches, and KV-caching-friendly prompt designs all share.

The generalized rule: present only the decision surface; hide the expansion behind a tool call.

3. Prose enforcement is a real engineering discipline

The "BLOCKING REQUIREMENT" / "NEVER just announce" / "invoke IMMEDIATELY" style looks hacky, but it's actually a serious craft. Every one of those imperatives is a patch against a specific observed failure. If you build agents and don't have a growing prose-rules section in your system prompt, either your harness is too young or you haven't been reading your logs.

The discipline is: when the model does something wrong, add a rule; when the model reliably follows the rule, consider whether the rule is still needed. Prompts, like code, accumulate debt.

4. Be willing to remove tools

Thariq's point about TodoWrite → Task is the one most builders ignore. It's much easier to add a tool than to remove one, because removal feels like a regression. But as models improve, tools you added to compensate for weaknesses become scaffolding that constrains them. The TodoWrite example is instructive: a tool that helped Claude 3 plan became a tool that made Claude 4 rigid.

The heuristic: every 3–6 months, ask "which tools am I still shipping only because the model used to need them?"

5. Trust model-led retrieval, but design the hints

The shift from RAG to agent-led search only works if the hints (tool descriptions, skill catalog entries, subagent names) are good enough for the model to pick correctly. That puts the engineering work in a different place: instead of building a retrieval pipeline, you're building a classifier specification encoded in English prose. The xlsx skill description listing .xlsx, .xlsm, .csv, .tsv and five numbered scenarios isn't incidental — it's the key to making the model reach for it at the right time.

The principle: retrieval quality = hint quality × model capability. Model capability you don't control; hint quality you do.
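What a "classifier spec" looks like in practice can be sketched as a catalog entry. The wording below is invented for illustration; it is not the shipped xlsx description, only the shape the post describes (concrete extensions plus numbered scenarios):

```python
# Hypothetical catalog entry written as a classifier spec: concrete
# file-type and scenario triggers rather than marketing copy.
XLSX_DESCRIPTION = (
    "Spreadsheet skill. Use when the user wants to work with .xlsx, .xlsm, "
    ".csv, or .tsv files: (1) create a new workbook, (2) add formulas or "
    "charts, (3) convert between formats, (4) clean tabular data, "
    "(5) extract values from existing sheets."
)

# The triggers double as a cheap relevance check the model can pattern-match
# against the user's request and filenames.
TRIGGER_EXTENSIONS = [".xlsx", ".xlsm", ".csv", ".tsv"]
assert all(ext in XLSX_DESCRIPTION for ext in TRIGGER_EXTENSIONS)
```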

6. Tokens are a design constraint, not just a billing line

The Xiaohongshu author's real complaint — "TOKEN 对我们来说是真的越来越重要了" (tokens are getting more and more important for us) — is the same observation from a different angle. The 78 KB wire trace is the cost of a rich agent. You can reduce it with prompt caching (amortize over session), lazy loading (skills), summarization (/compact), and trimming (uninstall unused skills) — but you can't eliminate it without giving up capability.
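The amortization effect of prompt caching can be sketched with Anthropic's published pricing ratios for the 5-minute cache (writes at roughly 1.25x the base input rate, reads at roughly 0.1x); the token count is a rough chars/4 conversion of the 78 KB figure, an assumption rather than a measured value:

```python
# Relative input cost of the ~78 KB prefix over a session, with vs
# without prompt caching. Ratios follow Anthropic's published 5-min
# cache pricing (write ~1.25x base, read ~0.1x base); the token count
# is a rough chars/4 estimate.
PREFIX_TOKENS = 78_000 // 4    # ~19,500 tokens for the cached prefix
TURNS = 20

uncached = PREFIX_TOKENS * TURNS                            # full price every turn
cached = PREFIX_TOKENS * 1.25 + PREFIX_TOKENS * 0.1 * (TURNS - 1)

assert cached < uncached / 5   # better than a 5x reduction in this sketch
```

The catalog's per-session cost is mostly paid once at the first turn; every later turn within the TTL pays only the cheap read rate.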

The design question isn't "how do I minimize tokens?" — it's "which tokens earn their keep every turn, and which are parked behind a gate?" Skills are Anthropic's answer to that second question.


Closing

The packet capture proved what Claude Code is sending. Thariq's piece explains why it's shaped that way. Reading them together turns a piece of reverse-engineered trivia into an actionable design framework:

  • Tools as cognitive budget, not feature flags.
  • Progressive disclosure as the default for any non-trivial context.
  • Prose guard rails as a real engineering artifact, maintained with the same discipline as code.
  • Willingness to remove tools as model capability grows.
  • Agent-led retrieval gated by hand-crafted hints.
  • Tokens as a design axis, not just a cost line.

The elegance of the Skills design isn't in any single piece. It's in how all of these principles compose — one dispatcher tool + one catalog + N markdown files + prompt caching, and you have an extension system that's cheap to advertise, on-demand to load, user-authorable in markdown, and robust against model drift.

If you're building agents and you squint at the wire trace, you can see every one of Thariq's lessons encoded in those ~78 KB. The surprising thing isn't that a one-token prompt produces 78 KB of request body — it's that the 78 KB encodes a consistent design philosophy, not a pile of hacks.


References

  • Title: From Wire Format to Design Philosophy — Why Claude Code Skills Look The Way They Do
  • Author: wy
  • Created at: 2026-04-22 15:30:00
  • Updated at: 2026-04-22 14:24:54
  • Link: https://yue-ruby-w.site/2026/04/22/Claude-Code-Skills-Design-Philosophy/
  • License: This work is licensed under CC BY-NC-SA 4.0.