← All installments

Installment 1

I've Seen This Movie Before

· 9 min read

In 2010, I joined RightScale as Senior Director of Professional Services. RightScale was a cloud management platform — one of the first. The pitch was simple: AWS gives you raw compute, but who manages it? Who provisions, configures, monitors, scales, secures, and retires all those instances you’re spinning up?

Nobody, was the answer. And everybody was spinning them up.

AWS had launched EC2 four years earlier. The cloud was real. Companies were migrating workloads, developers were deploying applications, infrastructure teams were learning to think in APIs instead of rack units. The raw platform was powerful and getting more powerful every quarter.

But operating it was a disaster.

Configuration was manual. Monitoring was bolted on. Security was an afterthought — most instances ran with default credentials and open security groups. There was no standard model for lifecycle management. Every team cobbled together their own scripts, their own deployment pipelines, their own monitoring stack. When something broke at 2am, you SSH’d into the box and figured it out.

RightScale, Scalr, and a handful of others emerged to fill this gap. We built the management layer — the provisioning, configuration, monitoring, cost governance, and operational tooling that the raw platform didn’t provide. Under my leadership, RightScale became AWS’s largest consulting partner. We ran the most complex engineering projects in the ecosystem. The work caught AWS’s attention enough that they recruited me to build their Western US Professional Services organization from scratch.

I built that team to 80+ people. We served Adobe, Apple, Boeing, Intuit. I delivered AWS’s first million-dollar consulting engagement. And the pattern I saw over and over was the same: the raw platform was extraordinary, but the gap between “we use cloud” and “we operate cloud” was where everything broke.

That gap eventually closed. AWS built more management tooling. Terraform emerged. Kubernetes standardized container orchestration. The ecosystem matured. But for years — the better part of a decade — the management layer was the difference between a demo and a production system.

I left AWS in 2014 to co-found DualSpark, a DevOps consultancy. We scaled to 35 engineers in nine months, ran transformations for Nike, Experian, FICO, 23andMe, and VSCO, and exited to Datapipe in fourteen months. The work was still the same pattern: take a powerful substrate, build the operational discipline around it, help companies adopt it without the expensive mistakes.

I’m telling you this because I’m watching the exact same movie play out with AI agents right now. And I’m at the same point in the plot.


The Agent Gap

OpenClaw is the most popular open-source AI agent framework in the world. 250,000+ GitHub stars. Two million monthly active users. The creator, Peter Steinberger, joined OpenAI in February 2026 and the project moved to a foundation. It runs in a Docker container on your hardware, connects to your services, and gives you a persistent AI agent that can read your email, manage your calendar, execute tasks, do research, and act on your behalf — autonomously, on a schedule, without you opening a browser.

The raw platform is extraordinary.

And operating it is a disaster.

A working OpenClaw agent requires approximately 13,500 tokens of configuration spread across 11+ files. Runtime config, Docker Compose, Dockerfile, environment variables, credentials, identity files (SOUL.md, AGENTS.md, IDENTITY.md, HEARTBEAT.md, TOOLS.md), cron job definitions, skill configs, and egress rules. There are 14 silent configuration landmines — settings that cause security or operational failures without any error message. Memory bloats to 360KB within three days of active use without management. Credentials expire silently, and the agent keeps running while integrations quietly stop working.

In its first two months: nine CVEs disclosed, 42,000+ instances found publicly exposed on the internet with default configurations, and 20-36% of community skills on ClawHub — the marketplace — found to contain malicious payloads. The ClawHavoc campaign injected hidden instructions using base64-encoded strings and zero-width Unicode characters, targeting the identity files that define who your agent is. Microsoft, Cisco, and Nvidia have all published security guidance specifically for OpenClaw deployments.

Getting OpenClaw running is a weekend project. Keeping it running is an SRE job.

A thousand people queued outside Tencent’s headquarters just to get installation help. Ten-plus hosting providers now sell managed OpenClaw at $22-45 per month — but they deploy default-config agents on a VPS with no lifecycle management, no landmine prevention, and no architectural security. They solve convenience. Nobody solves sovereignty.

If you were in enterprise technology between 2008 and 2014, this should sound familiar. Powerful substrate. Everyone deploying. Nobody operating. Management layer missing.


Why the Management Layer Is Inevitable

Every infrastructure substrate follows the same maturation arc. The raw capability arrives first — exciting, powerful, accessible to early adopters. Then adoption outpaces operational maturity. Then the management layer emerges to close the gap.

Mainframes got operations management decades after the first deployments. Client-server architectures spawned an entire systems management industry. Virtualization gave us VMware and then vSphere. Cloud gave us RightScale, then Terraform, then Kubernetes. Containers gave us orchestrators. Each time, the pattern was identical: the exciting part (the raw capability) arrives first, and the boring part (lifecycle management, configuration governance, security hardening, operational observability) arrives years later.

AI agents are in the gap right now.

The community has 177 SOUL.md templates — personality files that define how the agent talks. None of them include tool configuration, cron schedules, security posture, credential management, or egress policy. A personality without an operational stack is a character sheet for a game nobody set up.

The existing tooling — openclaw onboard, openclaw configure, the built-in Control UI — covers individual settings but can’t do use-case-level composition. They can’t turn “I want an email manager” into a coherent configuration across all 8 auto-loaded workspace files, runtime config, cron schedules, and tool policy simultaneously.

What’s missing is the same thing that was missing for cloud in 2010:

Lifecycle management. An agent in production has a lifecycle — deployment, configuration, monitoring, updating, debugging, scaling, retiring. Today, every operator manages this with bespoke scripts, manual SSH sessions, and tribal knowledge.

Configuration governance. Agents have enormous configuration surfaces. Changes aren’t versioned, validated, or diffable. When something breaks, there’s no audit trail. When you want to replicate a working agent, there’s no spec to clone.

Security hardening. The default configuration ships insecure. Hardening is a 30-item opt-in checklist. Security should be the default, not an afterthought.

Operational observability. Most agent deployments have zero observability beyond “did it crash.” No decision logging, no tool execution tracing, no cost accounting, no error classification.

This is the gap I spent the last year building into. ClawHQ is the management layer for OpenClaw — the same kind of tool that RightScale was for AWS, except this time I’m building it instead of operating someone else’s.


What I Actually Know

I want to be precise about what I’m claiming and what I’m not.

I’m not claiming AI agents are equivalent to cloud infrastructure. The substrate is fundamentally different — agents have autonomous decision-making authority, language model reasoning, and tool execution capabilities that cloud VMs never had. The security model is different. The failure modes are different. The user relationship is different.

What I am claiming is that the operational gap follows the same pattern. The problems are different in their specifics but identical in their structure:

  • Cloud: “My instances are running but I don’t know if they’re healthy.” Agents: “My agent is running but I don’t know if its credentials have expired.”
  • Cloud: “I deployed manually and can’t reproduce the environment.” Agents: “I configured manually and can’t explain why it works.”
  • Cloud: “Default security groups are wide open.” Agents: “Default agent config has no capability restrictions.”
  • Cloud: “I have no cost visibility.” Agents: “I have no token consumption visibility.”
  • Cloud: “Updates break things and I can’t roll back.” Agents: “Model updates change behavior and I can’t roll back.”

The structural similarity isn’t a metaphor. It’s a prediction. The management layer will emerge for agents the same way it emerged for cloud — because the alternative is that agents remain fragile toys operated by enthusiasts, and the demand says otherwise.


The Bet

The question is whether the management layer consolidates into the platforms themselves or into independent tooling. Both happened in cloud. AWS built CloudFormation and Systems Manager. But Terraform, Kubernetes, Datadog, and PagerDuty also thrived as independent layers. The market was big enough for both.

I think the same thing happens here. OpenClaw will build better built-in tooling — they already are. openclaw onboard and openclaw configure will improve. But framework-level tooling serves the general case. It can’t have opinions about specific use cases. It can’t ship opinionated, production-tested configurations because that requires choosing one approach over another, and frameworks serve everyone.

The management layer is the opinionated layer. It’s where “here’s what a good email agent looks like” lives. It’s where “these 14 settings will silently break your deployment” gets prevented by construction rather than documented in a checklist. It’s where lifecycle management — day 2 through day 365 — becomes someone’s primary concern instead of nobody’s.

I built the cloud version of this at RightScale and AWS. I’m building the agent version now.

This series documents what I’m learning along the way — the architecture decisions, the security model, the operational patterns, the things that work and the things that don’t. Not theory. Not demos. The actual operational reality of running AI agents in production.

The management layer is inevitable. The only question is who builds it and whether it gets built well.


Next: The Agent Lifecycle Nobody’s Managing — mapping the full agent lifecycle from deployment to retirement, and identifying which stages have tooling and which are completely unserved.