The 50-skill wall: why OpenClaw gets dumber the more skills you install
By Linas Valiukas · April 29, 2026
The post on r/openclaw last week was titled "How are you handling tool-calling degradation at 50+ skills?" The comments are a parade of people who've hit the same wall. Skills install fine. The agent answers fine. Then somewhere past 30 or 40 active skills, things start sliding. Wrong tools. Made-up parameters. Long pauses. Sometimes the agent just doesn't act at all.
ClawHub now lists 13,729 community skills. The temptation is to keep adding. The math says don't.
What's actually breaking
Three things go wrong, in roughly this order, as your skill count climbs.
Token bloat in the system prompt. Every active skill's metadata - name, description, parameter schema, when-to-use hint - gets stuffed into context before your message ever arrives. A modest skill is 150-300 tokens. A heavy one (think Playwright MCP with its 21 sub-tools) is over 11,000. Stack 50 skills and you're routinely starting every request 12,000-18,000 tokens in the hole. That's a quarter of a 64k context window already spent on tool descriptions you may not need on this turn. (You can measure this yourself - see the sketch after these three.)
Context rot. Researchers testing long-context recall across frontier models keep finding the same shape: accuracy is high for information at the start of a prompt and at the end, but information buried in the middle takes a 30%-plus accuracy hit - the "lost in the middle" effect. The skill manifest sits exactly where the rot is worst. So the more skills you load, the more your model is pretending to consider tools it can't really see.
Selection ambiguity. Two skills with similar names ("send-gmail", "send-outlook") give the model a coin flip. Three or four similar tools and it might invent a fourth one that doesn't exist. The Berkeley Function Calling Leaderboard tracks this as "wrong-function" and "hallucination" error rates, and they climb steeply with tool count.
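The first failure mode is the easy one to verify. A minimal sketch of weighing a skill's metadata, using tiktoken's cl100k_base encoding as a stand-in for whatever tokenizer your model actually uses (the skill dict below is a made-up example, not a real ClawHub entry):

```python
# Rough token weight of one skill's metadata, serialized the way a flat
# manifest would inline it. cl100k_base is a stand-in encoding; your
# model's tokenizer will differ somewhat, but the order of magnitude holds.
import json

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def metadata_tokens(skill: dict) -> int:
    return len(enc.encode(json.dumps(skill)))

skill = {  # hypothetical example, not a real ClawHub entry
    "name": "send-gmail",
    "description": "Send an email through a connected Gmail account.",
    "parameters": {"to": "string", "subject": "string", "body": "string"},
    "when_to_use": "The user asks to email someone.",
}
print(metadata_tokens(skill))  # a modest skill lands in the low hundreds
```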
The numbers researchers keep finding
Skill bloat isn't unique to OpenClaw. It's a property of how LLMs handle tool lists, and the tool-use research community has been measuring it.
- BFCL V4. Single-function calls land in the 80-95% accuracy range across top models. Multi-function and multi-step orchestration drop to 50-65%. Long-context multi-turn (the category that mimics what you actually do with OpenClaw) sits at 30-55%.
- WildToolBench. Across 57 LLMs tested in realistic agent settings, not one cleared 15% accuracy at the full-session level, and most landed under 60% on individual tasks. The benchmark explicitly stresses cases where the model has to ignore unrelated tools - exactly the situation a fat skill manifest creates.
- RAG-MCP paper. When researchers swapped a fixed tool list for a retrieval step that pulls in only the relevant tools per request, selection accuracy went from 13.62% to 43.13% - over triple. Token usage dropped more than 50%. The takeaway is brutal: handing the model fewer, more relevant tools matters more than how smart the model is.
- OpenAI's own ceiling. The Function Calling API caps at 128 tools per request. Anthropic doesn't publish a cap, but their Skills team has been blunt that progressive disclosure beats flat manifests in their internal benchmarks. The framework is designed assuming you won't load everything at once.
Where the wall actually is for OpenClaw
There's no documented limit; the OpenClaw docs are silent on a maximum. Reddit threads and GitHub issues converge on roughly this:
| Active skills | Metadata tokens | What you'll notice |
|---|---|---|
| 10-20 | ~3,000-5,000 | Crisp tool selection. Cheap. |
| 20-40 | ~5,000-9,000 | Still fine. Occasional wrong-tool pick on similar names. |
| 40-60 | ~9,000-13,000 | Selection accuracy starts dropping. Latency creeps up. First "wrong tool, no apology" reports. |
| 60-100 | ~13,000-22,000 | Hallucinated parameters. Long pauses. The agent sometimes does nothing. |
| 100+ | 22,000+ | Pick a Reddit complaint thread. It probably starts here. |
The 10,000-token rule of thumb floating around the docs is a decent shorthand. If your active skill metadata totals more than that, you're past the point where the model can keep them all straight.
Find out what your manifest actually weighs
OpenClaw v2026.4.10 added a built-in audit:
```
/skills audit
```

It prints a table with skill name, token cost, last-invoked timestamp, and active/inactive status. The first time I ran it on a moderately loaded test instance, the result was 87 active skills totaling 19,400 tokens, of which 31 hadn't been invoked in 90 days and 12 had never been invoked. That's the easy half of the cleanup.
If you're on a pre-v4.10 install, the manual version:

```
openclaw skills list --json | jq '.[] | select(.active == true) | {name, tokens: .metadata_tokens}'
```
Sum the tokens column. If you're north of 10,000, you're already paying the wall tax.
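If you'd rather get one number than eyeball jq output, a minimal sketch that reads the same JSON; the field names (active, metadata_tokens) are taken from the query above, so check them against your install:

```python
# Sum active-skill metadata tokens from the manifest JSON.
# Usage: openclaw skills list --json | python skill_budget.py
import json
import sys

skills = json.load(sys.stdin)
active = [s for s in skills if s.get("active")]
total = sum(s.get("metadata_tokens", 0) for s in active)

print(f"{len(active)} active skills, {total:,} metadata tokens")
if total > 10_000:
    print("Past the 10,000-token rule of thumb - time to trim.")
```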
What to keep, what to cut
A workable trim, in order of how much it helps:
- Disable anything not used in 30 days. Not uninstall - just toggle inactive in the manifest. You can flip it back. Most people recover 30-50% of their token budget right here. (A script to flag the candidates follows this list.)
- Collapse near-duplicates. Three Gmail skills (send, draft, search) is fine - they have distinct semantics. Three skills that all "send an email" with different vendor names is a coin-flip generator. Keep one.
- Kill MCP servers you only used to test. The Playwright MCP at 11,000+ tokens is the worst offender. Half of installs are leftovers from a one-time scrape job.
- Move heavy skills behind explicit invocation. Skills can be marked `manual = true` in their frontmatter. Their metadata still appears in `/skills list` but doesn't load on every turn. Good for tools you genuinely need but use weekly, not daily.
- Scope by agent. A scheduling agent and a finance agent don't need each other's skills. Per-agent manifests landed in v4.0 and are underused. The 50-skill wall stops mattering when no single agent has 50 skills loaded.
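A minimal sketch of the first item on that list - flagging anything idle for 30 days. The last_invoked field (an ISO-8601 timestamp, null if never run) is an assumption about the manifest JSON, so verify it against your own output:

```python
# Flag active skills with no invocation in 30 days as disable candidates.
# Usage: openclaw skills list --json | python stale_skills.py
# Assumes a last_invoked ISO-8601 timestamp field (null if never invoked);
# check the real field name against your install's JSON.
import json
import sys
from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for skill in json.load(sys.stdin):
    if not skill.get("active"):
        continue
    last = skill.get("last_invoked")
    if last is not None:
        dt = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if dt.tzinfo is None:  # treat naive timestamps as UTC
            dt = dt.replace(tzinfo=timezone.utc)
    if last is None or dt < cutoff:
        tokens = skill.get("metadata_tokens", 0)
        print(f"disable candidate: {skill['name']} ({tokens} tokens)")
```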
The real fix: don't load them all
Trimming buys you breathing room. The actual architectural answer is to stop pretending a flat manifest scales. Two patterns work:
Progressive disclosure. Anthropic's Agent Skills design loads three layers separately - just the metadata at first, the full SKILL.md only when the model picks it, and supplementary files only if the chosen skill references them. OpenClaw's Active Memory Plugin in v4.10 is a partial implementation, but the metadata layer still ships flat.
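In code, the three layers come down to loading less up front. A hypothetical sketch - the per-skill directory holding a SKILL.md with frontmatter follows Anthropic's published design, but the loader below is illustrative, not OpenClaw's implementation:

```python
# Progressive disclosure in three layers: only layer 1 rides along on
# every turn; layers 2 and 3 load after the model commits to a skill.
# Hypothetical loader; file layout follows Anthropic's Agent Skills design.
from pathlib import Path

SKILLS_DIR = Path("skills")  # skills/<name>/SKILL.md plus optional files

def layer1_metadata() -> list[dict]:
    # Layer 1: name + frontmatter only, for every skill. This is the
    # entire per-turn cost of the manifest.
    out = []
    for skill_md in sorted(SKILLS_DIR.glob("*/SKILL.md")):
        frontmatter = skill_md.read_text().split("---")[1]  # naive parse
        out.append({"name": skill_md.parent.name, "meta": frontmatter.strip()})
    return out

def layer2_body(name: str) -> str:
    # Layer 2: the full SKILL.md, loaded only once the model picks it.
    return (SKILLS_DIR / name / "SKILL.md").read_text()

def layer3_resource(name: str, ref: str) -> str:
    # Layer 3: supplementary files, loaded only if the body references them.
    return (SKILLS_DIR / name / ref).read_text()
```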
Retrieval-based tool selection. The RAG-MCP paper and a small but growing set of community plugins do something simple: index every skill's description in a vector store, retrieve the top 5-10 by semantic match against the user's request, and only put those in context. The accuracy and token wins are both real (3x and 50%+ respectively). The catch is you're now running an extra retrieval hop per turn, and the index has to stay current.
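At its core the pattern is a handful of lines. A self-contained sketch using scikit-learn's TF-IDF as a stand-in for a real embedding model (RAG-MCP uses semantic embeddings, which handle paraphrases far better; the skill descriptions here are invented):

```python
# Retrieval-based tool selection: index all skill descriptions, put only
# the top-k matches into context. TF-IDF stands in for an embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

skills = {  # hypothetical entries; your manifest supplies the real ones
    "send-gmail": "Send an email through a connected Gmail account.",
    "calendar-create": "Create a calendar event with attendees and a time.",
    "web-scrape": "Fetch a web page and extract structured data from it.",
    # ...the other 80 skills live in the index, not in the prompt
}

vectorizer = TfidfVectorizer().fit(skills.values())
index = vectorizer.transform(skills.values())
names = list(skills)

def select_tools(user_request: str, k: int = 5) -> list[str]:
    # Rank every skill against the request; only the top k reach the model.
    scores = cosine_similarity(vectorizer.transform([user_request]), index)[0]
    ranked = sorted(zip(scores, names), reverse=True)
    return [name for _, name in ranked[:k]]

print(select_tools("email Maya the Q3 numbers"))  # send-gmail ranks first
```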
Neither is in the OpenClaw core today. Both are inevitable. The release that ships proper progressive disclosure will probably make the entire 10,000-token rule obsolete. We're not there yet.
The honest take on ClawHub
13,729 skills sounds like an ecosystem win. In practice it's a long tail of one-off uploads, half-finished forks, and a handful of genuinely useful tools that get reinvented six times. Most users would be better off with 10-15 carefully chosen skills and a clear discipline of disabling things they're not actively using. The best skills post has a curated short list. The malware vetting post is what to read before installing anything new.
None of this is news to people building agent frameworks. It's just slow to filter into the user-facing experience because every "100 skills installed" screenshot looks impressive on Twitter and the failure mode (worse answers, slower replies) is invisible until you go looking.
The shortcut, if you don't want to babysit a manifest
On TryOpenClaw.ai, every agent ships with a curated active skill set sized to fit comfortably under the wall. Heavy MCP servers are off by default. Skills you don't invoke get auto-disabled after 30 days. When v4.10's Active Memory Plugin matures into proper progressive disclosure, you'll get it without re-architecting your manifest. No /skills audit spreadsheet. No 87-skill cleanup project on a Sunday afternoon.
Founder of TryOpenClaw.ai. Software engineer writing about OpenClaw, self-hosting trade-offs, and what non-technical users actually need from an AI assistant. About the author →
Try it right now
This is just one example - OpenClaw adapts to whatever you need. Describe any workflow in plain language and it figures out the rest. Pay $1 for a full 24-hour trial, pick your messaging app, and start chatting with your own instance in under 60 seconds. Love it? $39/mo. Not for you? Walk away - we delete everything.
Try OpenClaw for $1 · 24h full access. No commitment. Cancel anytime.