We're Building the Wrong Layer of AI Infrastructure
Most people are still writing prompts while we're building operating systems
I spent three hours last Tuesday watching ten AI agents work on the same product requirements document.
Some of them found exactly what they needed.
Some wandered into the wrong directories like confused tourists in a foreign city.
Some produced work that made me want to frame it.
Some produced absolute garbage.
The variance between best and worst was roughly 400%.
That variance is the entire problem.
We talk about AI like the model is the thing.
GPT-5
Claude
Gemini
Everyone obsessing over which one is smarter, which one is faster, which one hallucinates less.
And yeah, the model matters.
But you know what matters more?
The infrastructure around it.
I’ve been building something called Agentcy OS.
Started as a way to package the workflows we use at AIAA and Client Ascension.
Just standardizing what we already knew worked.
But it turned into something bigger.
We now have 144 workflow directives.
One hundred and forty-four standardized operating procedures that AI agents can execute.
250+ skill bibles.
5,000+ word documents distilled from our exclusive trainings on performing every agency task imaginable.
Creating funnels, setting up ad accounts, writing VSLs.
Built from the combined knowledge of top experts in their fields.
And the more I build this, the more I realize most people are thinking about AI tools completely wrong.
The 1950s Problem
Here’s what I mean.
In the early days of computing - we’re talking mid-1950s - every single program had to manage its own everything.
Memory allocation.
Input/output operations.
Hardware control.
If you wanted to print something, you wrote code that directly controlled the printer.
If you wanted to save a file, you manually managed where it went on the disk.
It was absolute chaos.
General Motors’ Research division, working with North American Aviation, built the first operating system in 1956 for the IBM 704.
Called it GM-NAA I/O.
The whole point was simple - stop making every program reinvent the wheel.
Give them a layer that handles the common stuff.
Memory management.
Resource allocation.
All the boring infrastructure that makes the magic possible.
That’s what an operating system does.
It’s abstraction.
It’s the chassis, the transmission, the steering wheel.
The engine matters, but without the rest of the car, you’re not going anywhere.
Right now, we’re building AI tools like it’s 1955.
Everyone’s writing individual prompts.
Individual chat sessions.
Individual one-off tasks.
Every implementation starts from scratch.
Every company reinvents how to load context, handle errors, manage memory, coordinate multiple agents.
We need an operating system for AI agents.
Not a chatbot.
Not a prompt library.
An actual architecture that handles all the infrastructure so the AI can focus on doing actual work.
The DOE Pattern and Why Architecture Matters
We call our architecture the DOE pattern.
Directive
Orchestration
Execution
The directive is the what.
Natural language SOPs that specify inputs, steps, and quality gates.
We have 144 of them.
The orchestration is the AI decision-making.
The Claude Code agent that reads directives, loads context, routes between scripts, handles errors, and self-improves.
The execution is the deterministic code.
Python scripts that make API calls, process data, and format outputs.
140+ scripts that do the actual work.
Let me give you a concrete example because this sounds abstract as hell.
We have a directive for creating a complete VSL funnel for a client.
Old way - the way most people still do it - you prompt an AI to write you a sales video script.
It gives you generic stuff.
Maybe hallucinates your client’s product features.
You spend hours editing and fact-checking.
The output is inconsistent every time.
The Agentcy OS way:
The directive (vsl_funnel_orchestrator.md) specifies:
Exactly what data to research first
Which skill bibles to load for copywriting expertise
What quality gates each section must pass
What order the outputs get generated
The orchestration layer - Claude - reads that directive, loads the client profile, pulls in relevant skill bibles (we have 250+ covering everything from VSL writing to funnel psychology), and makes routing decisions.
If the research phase fails, it knows to try fallback sources.
If a quality gate fails, it loops back.
The execution layer runs the actual Python scripts:
research_company_offer.py hits the Perplexity API for market intel.
generate_vsl_script.py compiles the video script.
create_google_doc.py formats and delivers it.
Each script is testable, debuggable, deterministic.
Same AI model.
Wildly different results.
Because the system does the heavy lifting.
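If you want to picture that orchestration loop, here’s a minimal sketch. The three script names come from the example above; everything else (the flags, the fallback script, the quality gate stub) is hypothetical:

```python
# A minimal sketch of the DOE loop - not the real orchestrator.
# Script names come from the example above; flags, fallbacks,
# and the gate are hypothetical.
import subprocess

PIPELINE = [
    ("research", "research_company_offer.py"),
    ("script", "generate_vsl_script.py"),
    ("delivery", "create_google_doc.py"),
]
FALLBACKS = {"research": ["research_company_offer_fallback.py"]}  # hypothetical
MAX_RETRIES = 2

def run_step(script: str, client: str) -> bool:
    """Execution layer: a deterministic script that succeeds or fails."""
    return subprocess.run(["python", script, "--client", client]).returncode == 0

def quality_gate(step: str) -> bool:
    """Stub for the directive's quality gates - in the real system the
    orchestrating agent judges output against the directive's criteria."""
    return True

def orchestrate(client: str) -> None:
    for step, script in PIPELINE:
        candidates = [script] + FALLBACKS.get(step, [])
        # Try the primary, then fallbacks, for up to MAX_RETRIES passes;
        # any() stops at the first candidate that runs AND passes the gate.
        if not any(run_step(c, client) and quality_gate(step)
                   for c in candidates * MAX_RETRIES):
            raise RuntimeError(f"step {step!r} failed its quality gate")

if __name__ == "__main__":
    orchestrate("acme-ecom")  # hypothetical client slug
```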
Here’s what people miss: LLMs are probabilistic.
90% accuracy per step sounds great until you realize it compounds to 59% over just 5 steps.
Chain a few prompts together and your success rate craters.
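The arithmetic, if you want to check it:

```python
# Compounding failure: per-step accuracy p over n chained steps -> p**n overall.
for steps in (1, 3, 5, 10):
    print(f"{steps:>2} steps at 90% each -> {0.9 ** steps:.0%} end-to-end")
# 1 -> 90%, 3 -> 73%, 5 -> 59%, 10 -> 35%
```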
So we push every deterministic operation into Python.
Rate limiting, caching, API calls, formatting, file management.
All handled by scripts that work the same way every time.
The AI only handles what AI is actually good at.
Reading context, making decisions, routing between steps.
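Here’s a minimal sketch of what "pushed into Python" looks like - a cached, rate-limited API wrapper with retries. The names are hypothetical, and it assumes the standard requests library:

```python
# Hypothetical sketch: deterministic plumbing around an API call.
# Rate limiting, caching, and retries live in code - not in the prompt.
import hashlib
import json
import pathlib
import time

import requests  # assumes the standard 'requests' library is installed

CACHE_DIR = pathlib.Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)
MIN_INTERVAL = 1.0  # seconds between calls (rate limit)
_last_call = 0.0

def cached_api_call(url: str, payload: dict, retries: int = 3) -> dict:
    global _last_call
    # Deterministic cache key from the request itself.
    key = hashlib.sha256(json.dumps([url, payload], sort_keys=True).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():  # cache hit: same input, same output, every time
        return json.loads(cache_file.read_text())
    for attempt in range(retries):
        wait = MIN_INTERVAL - (time.time() - _last_call)
        if wait > 0:
            time.sleep(wait)  # enforce the rate limit
        _last_call = time.time()
        resp = requests.post(url, json=payload, timeout=30)
        if resp.ok:
            cache_file.write_text(resp.text)  # cache for next time
            return resp.json()
        time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError(f"API call failed after {retries} retries: {url}")
```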
This is why most AI workflows fail.
They don’t have proper context loading, so they miss critical information.
They don’t have error handling, so they fail silently.
They don’t have state management, so they repeat work or contradict themselves.
Most people don’t even test AI workflows because “how do you test something non-deterministic?”
But the infrastructure CAN be deterministic.
The orchestration can be tested.
The execution scripts can be tested and debugged.
You’re not testing whether the AI is smart.
You’re testing whether your system properly loads context, handles failures, and maintains state.
The AI is just the engine.
You still need to test the transmission.
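What does testing the transmission actually look like? A self-contained sketch, pytest-style - the loader and its rules here are hypothetical, but notice that every assertion is deterministic:

```python
# Hypothetical sketch: the infrastructure is deterministic, so it's testable.
# We define a tiny context loader, then test it - no model involved.
import json
from pathlib import Path

def load_context(profile_path: Path, max_chars: int = 4000) -> str:
    """Deterministic context loader: client profile first, truncated to budget."""
    profile = json.loads(profile_path.read_text())
    context = f"CLIENT: {profile['name']}\nOFFER: {profile['offer']}\n"
    return context[:max_chars]

def test_loader_includes_client_name(tmp_path):
    p = tmp_path / "client.json"
    p.write_text('{"name": "Acme", "offer": "ecom ads"}')
    assert "Acme" in load_context(p)  # critical info actually gets loaded

def test_loader_respects_context_budget(tmp_path):
    p = tmp_path / "client.json"
    p.write_text(json.dumps({"name": "A" * 10_000, "offer": "x"}))
    assert len(load_context(p)) <= 4000  # never blows the context window
```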
The Multi-Agent Coordination Problem
So back to those ten AI agents working on the same PRD.
This is the part that keeps me up at night.
Because it’s the unsolved problem.
How do you coordinate multiple AI agents working on the same project?
How do you handle conflicts when two agents have different interpretations?
How do you merge their outputs into something coherent?
How do you maintain consistency when each agent might understand the same instruction slightly differently?
Recent research suggests keeping agent teams small - between 3 and 7 agents per workflow.
Communication overhead grows quadratically as you add more - every new agent needs a channel to every agent already there.
When inter-agent message latency exceeds 200 milliseconds, everything starts degrading.
The whole system slows down.
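The math behind "keep teams small" is simple. Point-to-point channels between n agents grow as n(n-1)/2:

```python
# Pairwise channels between n agents: n * (n - 1) / 2
for n in (3, 5, 7, 10, 20):
    print(f"{n:>2} agents -> {n * (n - 1) // 2:>3} communication channels")
# 3 -> 3, 5 -> 10, 7 -> 21, 10 -> 45, 20 -> 190
```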
And the human element - turns out you need it.
We thought full automation was the goal.
It’s not.
Humans in the loop for approving each decision
Humans on the loop for monitoring
Humans out of the loop for fully automated tasks
The appropriate level depends on what you’re building.
But purely autonomous multi-agent systems?
They drift.
We’re seeing this with the one ecom client we’re testing with.
Building an automated proposal workflow that researches the company, pulls product images, generates a full photoshoot with AI, and combines them with ad copy for static ads.
What used to take someone four hours of grinding through research, formatting, and designing is becoming a few clicks.
But we still need a human checking the output before it goes to the client.
The AI gets 90% there.
The human does quality control and adds the judgment layer.
Because without that judgment layer, you get the variance problem again.
Sometimes brilliant.
Sometimes garbage.
But what this does mean is that one person can handle 20+ clients.
I personally know someone doing this (shoutout Oliwer, legend).
The standardization problem is wild too.
Right now there are at least three competing standards for how agents connect and coordinate.
Google’s A2A, for agent-to-agent messaging.
Cisco’s AGNTCY, for agent interoperability.
Anthropic’s Model Context Protocol - strictly a standard for how agents talk to tools and data rather than to each other, but it’s in the same fight.
Everyone’s trying to create the standard.
Which means there is no standard.
MCP is winning for now - but that could change by next week.
It’s like the early browser wars, except worse because these agents need to coordinate in real-time.
What Happens When the Infrastructure Becomes the Product
We deployed the project on Railway last week.
One-command installer.
Docker Compose, environment variables, the whole thing.
And the installer still has bugs.
The documentation is incomplete.
Some directives work better than others.
But we’re shipping.
We have beta testers.
We’re learning.
We’re iterating.
Because here’s the business implication nobody’s fully grasped yet.
Right now, if you want to automate something at scale, you need someone who understands prompting, API integration, error handling, and the specific domain you’re working in.
That’s like finding a unicorn who also knows how to code and understands your business.
With an operating system approach, you just need someone who understands the domain.
The system handles the rest.
That’s the unlock.
Every business will eventually need something like this.
Not eventually.
Soon.
Not a chatbot that answers questions.
An actual operating system for how AI does work in their organization.
Custom directives for their processes.
Orchestration rules for their workflow.
Execution layers that integrate with their stack.
We’re building version one of that.
The first AI Agency operating system.
And the core insight - the thing I keep coming back to - is that AI agents are only as good as the systems around them.
You can have the smartest model in the world.
But without proper context loading, without memory management, without error handling, without multi-agent coordination, you’re just spinning up expensive chatbots.
The companies that figure out the infrastructure layer first are going to have a massive advantage.
Because while everyone else is still optimizing prompts, they’ll be building actual operating systems.
They’ll have standardized ways to load context.
Priority systems for what information matters most.
Client-specific customization that doesn’t require rewriting everything.
Graceful degradation when things fail - sketched in code below.
And most importantly - they’ll have solved the coordination problem.
How to run multiple agents in parallel without them stepping on each other.
How to merge outputs.
How to maintain consistency.
How to recover when one agent fails.
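To make two of those concrete - priority-based context loading and graceful degradation - here’s a minimal sketch where every name, file, and priority is hypothetical:

```python
# Hypothetical sketch: context sources ranked by priority, loaded until the
# budget runs out; a failing source degrades gracefully instead of crashing.
CONTEXT_SOURCES = [  # (priority, name, loader) - all hypothetical
    (1, "client_profile", lambda: open("client_profile.md").read()),
    (2, "active_directive", lambda: open("directive.md").read()),
    (3, "skill_bibles", lambda: open("skill_bible_vsl.md").read()),
]

def build_context(budget_chars: int = 20_000) -> str:
    parts, used = [], 0
    for _, name, load in sorted(CONTEXT_SOURCES, key=lambda s: s[0]):
        try:
            chunk = load()
        except OSError:
            continue  # graceful degradation: skip a missing source, keep going
        if used + len(chunk) > budget_chars:
            break  # priority system: high-priority sources got in first
        parts.append(f"## {name}\n{chunk}")
        used += len(chunk)
    return "\n\n".join(parts)
```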
Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 - escalating costs, unclear business value, inadequate risk controls.
Not because the AI wasn’t smart enough.
Because the infrastructure wasn’t robust enough.
The model is just the engine.
You still need the chassis, the transmission, the steering wheel.
That’s what we’re building.
The Part That Actually Matters
I’ll probably do a full technical breakdown at some point.
The DOE pattern.
The context loading priority system.
How we handle client-specific customization.
How the orchestrator decides which agent handles which task.
But that’s not the point.
The point is this - if you’re building AI tools and you’re focused on the model, you’re solving the wrong problem.
The model will get better.
It’s getting better every month.
But the infrastructure - the systems that make AI agents actually useful in production environments - that’s where the real work is.
That’s where the competitive advantage lives.
And that’s where most people aren’t looking.
Agentcy OS is still early.
We have bugs.
We have incomplete documentation.
We have directives that need improvement.
We have multi-agent orchestration problems we’re still figuring out.
But we have 144 standardized workflows that work.
We have an architecture that separates concerns properly.
We have test coverage that actually means something.
We have a deployment process that works.
We have beta testers using this in production to manage their own agency and clients.
Most importantly, we have a framework for thinking about AI infrastructure that isn’t just “write better prompts.”
Because the companies that win the AI race won’t be the ones with access to the best models.
They’ll be the ones who built the best systems around those models.
The ones who figured out the abstraction layer.
The ones who built the operating system while everyone else was still writing assembly code.
That’s what we’re doing.
And if you’re building AI tools and you’re not thinking about infrastructure?
You’re already behind.