The Dirty Secret of OpenClaw: Bigger is Better
Both small and large-scale deployments can be useful; that is not in question. We currently have 15-20 128GB LPDDR5X unified memory machines of various types running at the studio 24×7, not because we lack other compute platforms, but because we keep repurposing them and finding new uses for them. A lesson we have learned over time, however, is that running OpenClaw agents across this many small systems is the wrong answer.

Inevitably, running one agent leads to another. Those agents may execute simple shell commands, or they may open web browser sessions and start researching instead of using scraping tools. Often, AI agents perform these tasks in parallel. Perhaps the biggest reason we moved OpenClaw, turnstone, hermes, and our other agent frameworks off the 128GB LPDDR5X nodes is memory contention: the agents were consuming memory that the LLM weights and KV caches also needed. In late February, we had a system that kept running into issues until we realized it had multiple browsers open and was eating memory that the GPU side was also using for LLMs. Once we separated the agent CPU side from the LLM back-end, this became easy to solve. Aside from demonstrations we set up for articles and videos, we have now split the agent side from the LLM serving side.
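As a rough sketch of what that split looks like in practice, the agent box only builds and sends requests to an OpenAI-compatible endpoint on a separate serving node, so all model weights and KV cache memory stay on that node. The hostname, port, and model name below are illustrative assumptions, not OpenClaw defaults:

```python
import json

# Hypothetical endpoint for a separate LLM serving node; the hostname,
# port, and model name here are assumptions for illustration.
LLM_BASE_URL = "http://llm-serving-node:8000/v1"
MODEL_NAME = "MiniMax-M2.5"

def build_chat_request(prompt: str, max_tokens: int = 512) -> tuple[str, bytes]:
    """Return (url, body) for an OpenAI-compatible chat completion call.

    The agent machine does no inference itself; the LLM and its KV cache
    live entirely behind LLM_BASE_URL on the serving node.
    """
    url = f"{LLM_BASE_URL}/chat/completions"
    body = json.dumps({
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return url, body

url, body = build_chat_request("List the attached disks.")
```

With this shape, swapping the serving node for a bigger box (or a cloud API) is just a base-URL change on the agent side.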
Another issue we have run into quite a bit is that agents can stall. LLM errors, unexpected outputs, hallucinations, and more can derail projects. We had an overnight project stall because a smaller model could not properly call a tool. Every so often, a model simply produced an error in its response, and so on. Moving from gpt-oss-120b to MiniMax-M2.5 was an enormous step up in capability. Qwen3.5-397b-a17b then did something similar with even better tool calling. We went into this in a previous piece using n8n, but when you run AI agents, you are largely solving a reliability problem. If you watch them work, even small workflows can take 100+ LLM calls to execute. At that level, each "nine" of per-call reliability matters for overall completion. You can have other agents monitor the work (hopefully accurately), but losing half a day of work to an error introduced by a smaller or more heavily quantized model is painful.
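The compounding effect is easy to quantify. If a workflow needs 100 sequential LLM calls to all succeed, and we treat each call as independent, per-call reliability multiplies out quickly; a quick sketch:

```python
def workflow_success_rate(per_call: float, n_calls: int) -> float:
    """Probability that all n_calls independent LLM calls succeed."""
    return per_call ** n_calls

# "Two nines" per call leaves only about a 37% chance of finishing a
# 100-call workflow; "three nines" raises that to roughly 90%.
print(round(workflow_success_rate(0.99, 100), 3))   # ~0.366
print(round(workflow_success_rate(0.999, 100), 3))  # ~0.905
```

This is why a larger model that fails a tool call 0.1% of the time instead of 1% of the time can be the difference between an overnight job finishing and it stalling.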

Usually, embedding models for memory, heartbeat, and many other tasks work well as smaller models, and therefore on smaller machines. At the same time, the higher reliability of larger models is what turns something like OpenClaw from feeling like a toy into feeling like almost magic. A great example: with gpt-oss-120b, we could not reliably set up a server in a single shot. With MiniMax-M2.5, the server was set up automatically (albeit with some trial and error) except where we needed to provide authentication. With Qwen3.5-397b-a17b, or previously Claude Code with Sonnet 4.6 and Opus 4.6, we have had entire RDMA clusters set up.
This will match the experiences of many, and new models are getting significantly better at running agentic AI workflows. It is also a strong case for using a cloud API to a larger model hosted on bigger hardware.
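One mitigation worth sketching (an illustration of the pattern, not OpenClaw's actual mechanism) is escalating from a smaller local model to a larger remote or cloud-hosted one when a step keeps failing validation:

```python
from typing import Callable

def call_with_fallback(
    models: list[tuple[str, Callable[[str], str]]],
    prompt: str,
    validate: Callable[[str], bool],
    retries_per_model: int = 2,
) -> tuple[str, str]:
    """Try each model in order, retrying a few times per model, until one
    produces output that passes validation. Returns (model_name, output)."""
    for name, call in models:
        for _ in range(retries_per_model):
            try:
                out = call(prompt)
            except Exception:
                continue  # transient serving error: retry this model
            if validate(out):
                return name, out
    raise RuntimeError("all models failed validation")

# Toy stand-ins: a small model that botches a tool call,
# and a larger model that returns a well-formed one.
small = lambda p: "sorry, I cannot call tools"
large = lambda p: '{"tool": "shell", "args": ["uptime"]}'

name, out = call_with_fallback(
    [("small-local", small), ("large-remote", large)],
    "run uptime",
    validate=lambda s: s.startswith("{"),
)
```

The validation step here is a placeholder; in practice it could be JSON schema checking on tool-call output, which is exactly where we saw smaller models trip.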
Once you split where the LLM runs so that you can run larger LLMs, the next question is where the agent itself should run. It turns out the answer tends to be high-performance (P-core) CPU architectures and, if possible, larger machines.


