The 50-skill wall: why OpenClaw gets dumber the more skills you install
By Linas Valiukas · April 29, 2026
The post on r/openclaw last week was titled "How are you handling tool-calling degradation at 50+ skills?" The comments are a parade of people who've hit the same wall. Skills install fine. The agent answers fine. Then somewhere past 30 or 40 active skills, things start sliding. Wrong tools. Made-up parameters. Long pauses. Sometimes the agent just doesn't act at all.
ClawHub now lists 13,729 community skills. The temptation is to keep adding. The math says don't.
What's actually breaking
Three things go wrong, in roughly this order, as your skill count climbs.
Token bloat in the system prompt. Every active skill's metadata - name, description, parameter schema, when-to-use hint - gets stuffed into context before your message ever arrives. A modest skill is 150-300 tokens. A heavy one (think Playwright MCP with its 21 sub-tools) is over 11,000. Stack 50 skills and you're routinely starting every request 12,000-18,000 tokens in the hole. That's a quarter of a 64k context window already spent on tool descriptions you may not need on this turn. (You can measure this yourself - see the sketch after these three.)
Context rot. Researchers testing long-context recall across frontier models keep finding the same shape: accuracy is high for information at the start of a prompt and at the end, but information buried in the middle takes a 30%-plus accuracy hit - the "lost in the middle" effect. The skill manifest sits exactly where the rot is worst. So the more skills you load, the more your model is pretending to consider tools it can't really see.
Selection ambiguity. Two skills with similar names ("send-gmail", "send-outlook") give the model a coin flip. Three or four similar tools and it might invent a fourth one that doesn't exist. The Berkeley Function Calling Leaderboard tracks this as "wrong-function" and "hallucination" error rates, and they climb steeply with tool count.
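The first failure mode is the easy one to verify. A minimal sketch of weighing a skill's metadata, using tiktoken's cl100k_base encoding as a stand-in for whatever tokenizer your model actually uses (the skill dict below is a made-up example, not a real ClawHub entry):

```python
# Rough token weight of one skill's metadata, serialized the way a flat
# manifest would inline it. cl100k_base is a stand-in encoding; your
# model's tokenizer will differ somewhat, but the order of magnitude holds.
import json

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def metadata_tokens(skill: dict) -> int:
    return len(enc.encode(json.dumps(skill)))

skill = {  # hypothetical example, not a real ClawHub entry
    "name": "send-gmail",
    "description": "Send an email through a connected Gmail account.",
    "parameters": {"to": "string", "subject": "string", "body": "string"},
    "when_to_use": "The user asks to email someone.",
}
print(metadata_tokens(skill))  # a modest skill lands in the low hundreds
```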
The numbers researchers keep finding
Skill bloat isn't unique to OpenClaw. It's a property of how LLMs handle tool lists, and the tool-use research community has been measuring it.
- BFCL V4. Single-function calls land in the 80-95% accuracy range across top models. Multi-function and multi-step orchestration drop to 50-65%. Long-context multi-turn (the category that mimics what you actually do with OpenClaw) sits at 30-55%.
- WildToolBench. Across 57 LLMs tested in realistic agent settings, not one cleared 15% accuracy at the full-session level, and most landed under 60% on individual tasks. The benchmark explicitly stresses cases where the model has to ignore unrelated tools - exactly the situation a fat skill manifest creates.
- RAG-MCP paper. When researchers swapped a fixed tool list for a retrieval step that pulls in only the relevant tools per request, selection accuracy went from 13.62% to 43.13% - over triple. Token usage dropped more than 50%. The takeaway is brutal: handing the model fewer, more relevant tools matters more than how smart the model is.
- OpenAI's own ceiling. The Function Calling API caps at 128 tools per request. Anthropic doesn't publish a cap, but their Skills team has been blunt that progressive disclosure beats flat manifests in their internal benchmarks. The framework is designed assuming you won't load everything at once.
Where the wall actually is for OpenClaw
There's no documented limit; the OpenClaw docs are silent on a maximum. Reddit threads and GitHub issues converge on roughly this:
| Active skills | Metadata tokens | What you'll notice |
|---|---|---|
| 10-20 | ~3,000-5,000 | Crisp tool selection. Cheap. |
| 20-40 | ~5,000-9,000 | Still fine. Occasional wrong-tool pick on similar names. |
| 40-60 | ~9,000-13,000 | Selection accuracy starts dropping. Latency creeps up. First "wrong tool, no apology" reports. |
| 60-100 | ~13,000-22,000 | Hallucinated parameters. Long pauses. The agent sometimes does nothing. |
| 100+ | 22,000+ | Pick a Reddit complaint thread. It probably starts here. |
The 10,000-token rule of thumb floating around the docs is a decent shorthand. If your active skill metadata totals more than that, you're past the point where the model can keep them all straight.
Find out what your manifest actually weighs
OpenClaw v2026.4.10 added a built-in audit:
```
/skills audit
```

It prints a table with skill name, token cost, last-invoked timestamp, and active/inactive status. The first time I ran it on a moderately loaded test instance, the result was 87 active skills totaling 19,400 tokens, of which 31 hadn't been invoked in 90 days and 12 had never been invoked. That's the easy half of the cleanup.
If you're on a pre-v4.10 install, the manual version:

```
openclaw skills list --json | jq '.[] | select(.active == true) | {name, tokens: .metadata_tokens}'
```
Sum the tokens column. If you're north of 10,000, you're already paying the wall tax.
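If you'd rather get one number than eyeball jq output, a minimal sketch that reads the same JSON; the field names (active, metadata_tokens) are taken from the query above, so check them against your install:

```python
# Sum active-skill metadata tokens from the manifest JSON.
# Usage: openclaw skills list --json | python skill_budget.py
import json
import sys

skills = json.load(sys.stdin)
active = [s for s in skills if s.get("active")]
total = sum(s.get("metadata_tokens", 0) for s in active)

print(f"{len(active)} active skills, {total:,} metadata tokens")
if total > 10_000:
    print("Past the 10,000-token rule of thumb - time to trim.")
```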
What to keep, what to cut
A workable trim, in order of how much it helps:
- Disable anything not used in 30 days. Not uninstall - just toggle inactive in the manifest. You can flip it back. Most people recover 30-50% of their token budget right here. (A script to flag the candidates follows this list.)
- Collapse near-duplicates. Three Gmail skills (send, draft, search) is fine - they have distinct semantics. Three skills that all "send an email" with different vendor names is a coin-flip generator. Keep one.
- Kill MCP servers you only used to test. The Playwright MCP at 11,000+ tokens is the worst offender. Half of installs are leftovers from a one-time scrape job.
- Move heavy skills behind explicit invocation. Skills can be marked `manual = true` in their frontmatter. Their metadata still appears in `/skills list` but doesn't load on every turn. Good for tools you genuinely need but use weekly, not daily.
- Scope by agent. A scheduling agent and a finance agent don't need each other's skills. Per-agent manifests landed in v4.0 and are underused. The 50-skill wall stops mattering when no single agent has 50 skills loaded.
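A minimal sketch of the first item on that list - flagging anything idle for 30 days. The last_invoked field (an ISO-8601 timestamp, null if never run) is an assumption about the manifest JSON, so verify it against your own output:

```python
# Flag active skills with no invocation in 30 days as disable candidates.
# Usage: openclaw skills list --json | python stale_skills.py
# Assumes a last_invoked ISO-8601 timestamp field (null if never invoked);
# check the real field name against your install's JSON.
import json
import sys
from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for skill in json.load(sys.stdin):
    if not skill.get("active"):
        continue
    last = skill.get("last_invoked")
    if last is not None:
        dt = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if dt.tzinfo is None:  # treat naive timestamps as UTC
            dt = dt.replace(tzinfo=timezone.utc)
    if last is None or dt < cutoff:
        tokens = skill.get("metadata_tokens", 0)
        print(f"disable candidate: {skill['name']} ({tokens} tokens)")
```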
The real fix: don't load them all
Trimming buys you breathing room. The actual architectural answer is to stop pretending a flat manifest scales. Two patterns work:
Progressive disclosure. Anthropic's Agent Skills design loads three layers separately - just the metadata at first, the full SKILL.md only when the model picks it, and supplementary files only if the chosen skill references them. OpenClaw's Active Memory Plugin in v4.10 is a partial implementation, but the metadata layer still ships flat.
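In code, the three layers come down to loading less up front. A hypothetical sketch - the per-skill directory holding a SKILL.md with frontmatter follows Anthropic's published design, but the loader below is illustrative, not OpenClaw's implementation:

```python
# Progressive disclosure in three layers: only layer 1 rides along on
# every turn; layers 2 and 3 load after the model commits to a skill.
# Hypothetical loader; file layout follows Anthropic's Agent Skills design.
from pathlib import Path

SKILLS_DIR = Path("skills")  # skills/<name>/SKILL.md plus optional files

def layer1_metadata() -> list[dict]:
    # Layer 1: name + frontmatter only, for every skill. This is the
    # entire per-turn cost of the manifest.
    out = []
    for skill_md in sorted(SKILLS_DIR.glob("*/SKILL.md")):
        frontmatter = skill_md.read_text().split("---")[1]  # naive parse
        out.append({"name": skill_md.parent.name, "meta": frontmatter.strip()})
    return out

def layer2_body(name: str) -> str:
    # Layer 2: the full SKILL.md, loaded only once the model picks it.
    return (SKILLS_DIR / name / "SKILL.md").read_text()

def layer3_resource(name: str, ref: str) -> str:
    # Layer 3: supplementary files, loaded only if the body references them.
    return (SKILLS_DIR / name / ref).read_text()
```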
Retrieval-based tool selection. The RAG-MCP paper and a small but growing set of community plugins do something simple: index every skill's description in a vector store, retrieve the top 5-10 by semantic match against the user's request, and only put those in context. The accuracy and token wins are both real (3x and 50%+ respectively). The catch is you're now running an extra retrieval hop per turn, and the index has to stay current.
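At its core the pattern is a handful of lines. A self-contained sketch using scikit-learn's TF-IDF as a stand-in for a real embedding model (RAG-MCP uses semantic embeddings, which handle paraphrases far better; the skill descriptions here are invented):

```python
# Retrieval-based tool selection: index all skill descriptions, put only
# the top-k matches into context. TF-IDF stands in for an embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

skills = {  # hypothetical entries; your manifest supplies the real ones
    "send-gmail": "Send an email through a connected Gmail account.",
    "calendar-create": "Create a calendar event with attendees and a time.",
    "web-scrape": "Fetch a web page and extract structured data from it.",
    # ...the other 80 skills live in the index, not in the prompt
}

vectorizer = TfidfVectorizer().fit(skills.values())
index = vectorizer.transform(skills.values())
names = list(skills)

def select_tools(user_request: str, k: int = 5) -> list[str]:
    # Rank every skill against the request; only the top k reach the model.
    scores = cosine_similarity(vectorizer.transform([user_request]), index)[0]
    ranked = sorted(zip(scores, names), reverse=True)
    return [name for _, name in ranked[:k]]

print(select_tools("email Maya the Q3 numbers"))  # send-gmail ranks first
```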
Neither is in the OpenClaw core today. Both are inevitable. The release that ships proper progressive disclosure will probably make the entire 10,000-token rule obsolete. We're not there yet.
The honest take on ClawHub
13,729 skills sounds like an ecosystem win. In practice it's a long tail of one-off uploads, half-finished forks, and a handful of genuinely useful tools that get reinvented six times. Most users would be better off with 10-15 carefully chosen skills and a clear discipline of disabling things they're not actively using. The best skills post has a curated short list. The malware vetting post is what to read before installing anything new.
None of this is news to people building agent frameworks. It's just slow to filter into the user-facing experience because every "100 skills installed" screenshot looks impressive on Twitter and the failure mode (worse answers, slower replies) is invisible until you go looking.
The shortcut, if you don't want to babysit a manifest
On TryOpenClaw.ai, every agent ships with a curated active skill set sized to fit comfortably under the wall. Heavy MCP servers are off by default. Skills you don't invoke get auto-disabled after 30 days. When v4.10's Active Memory Plugin matures into proper progressive disclosure, you'll get it without re-architecting your manifest. No /skills audit spreadsheet. No 87-skill cleanup project on a Sunday afternoon.
Founder of TryOpenClaw.ai. Software engineer writing about OpenClaw, self-hosting trade-offs, and what non-technical users actually need from an AI assistant. About the author →
Try it right now
This is just one example - OpenClaw adapts to whatever you need. Describe any workflow in plain language and it figures out the rest. Pay $1 for a full 24-hour trial, pick your messaging app, and start chatting with your own instance in under 60 seconds. Love it? $39/mo. Not for you? Walk away - we delete everything.
Try OpenClaw for $1 · 24h full access. No commitment. Cancel anytime.