A weekend ‘vibe code’ hack by Andrej Karpathy quietly sketches the missing layer of enterprise AI orchestration


This weekend, Andrej Karpathy, the former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he didn’t want to read it alone. He wanted to read it in the company of a committee of artificial intelligences, each offering its own perspective, critiquing the others, and ultimately, under the guidance of a “chairman,” assembling a definitive answer.
To make this possible, Karpathy wrote what he called a “vibe code project” – a piece of software written quickly, largely by AI assistants, intended for fun rather than function. He posted the result, a repository called “LLM Council,” to GitHub with a stark disclaimer: “I’m not going to support it in any way… Code is ephemeral now and libraries are over.”
But for technical decision makers across the enterprise landscape, looking past the casual disclaimer reveals something far more important than a weekend toy. In a few hundred lines of Python and JavaScript, Karpathy has outlined a reference architecture for the most critical, still-undefined layer of the modern software stack: the orchestration middleware that sits between enterprise applications and the volatile market of AI models.
As companies finalize their platform investments for 2026, LLM Council offers a stripped-down look at the ‘build vs. buy’ reality of AI infrastructure. It shows that while the logic of routing and aggregating AI models is surprisingly simple, the real complexity lies in the operational packaging required to make it enterprise-ready.
How the LLM Council works: four AI models debate, critique, and synthesize answers
To the casual observer, the LLM Council web application looks almost identical to ChatGPT: a user types a question into a chat box. But behind the scenes, the application kicks off a three-phase workflow that mirrors how human decision-making bodies work.
First, the system sends the user’s query to a panel of frontier models. In Karpathy’s default configuration, this includes OpenAI’s GPT-5.1, Google’s Gemini 3.0 Pro, Anthropic’s Claude Sonnet 4.5, and xAI’s Grok 4. These models generate their initial responses in parallel.
In the second phase, the software performs a peer review. Each model is shown the anonymized responses of its counterparts and asked to evaluate them based on accuracy and insight. This step transforms the AI from a generator to a critic, imposing a layer of quality control that is rare in standard chatbot interactions.
Finally, a designated “Chairman LLM” (currently configured as Google’s Gemini 3) receives the original question, individual answers, and peer rankings. It synthesizes this mass of context into a single, authoritative answer for the user.
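The three phases above can be sketched in a few lines of Python. This is a hedged illustration of the control flow as the article describes it, not the repository’s actual code: the model identifiers are assumptions, and `ask_model` is a stub standing in for a real API call (which the project routes through OpenRouter).

```python
# Illustrative sketch of the three-phase council flow: answer, peer-review,
# synthesize. Model names are assumptions; `ask_model` is a stub so the
# control flow runs without network access.

COUNCIL_MODELS = ["openai/gpt-5.1", "google/gemini-3-pro",
                  "anthropic/claude-sonnet-4.5", "x-ai/grok-4"]
CHAIRMAN_MODEL = "google/gemini-3-pro"

def ask_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would POST to a chat-completions endpoint.
    return f"[{model}] response to: {prompt[:40]}"

def run_council(query: str) -> str:
    # Phase 1: every council member answers the query (in parallel in the
    # real system; serially here for simplicity).
    answers = {m: ask_model(m, query) for m in COUNCIL_MODELS}

    # Phase 2: each member ranks its peers' anonymized answers.
    anonymized = "\n".join(f"Response {i + 1}: {a}"
                           for i, a in enumerate(answers.values()))
    reviews = {m: ask_model(m, f"Rank these answers by accuracy and insight:\n{anonymized}")
               for m in COUNCIL_MODELS}

    # Phase 3: the chairman synthesizes question, answers, and rankings
    # into a single authoritative reply.
    context = (f"Question: {query}\nAnswers:\n{anonymized}\n"
               f"Rankings:\n" + "\n".join(reviews.values()))
    return ask_model(CHAIRMAN_MODEL, f"Synthesize a final answer.\n{context}")

final = run_council("What is the central theme of chapter 3?")
```

Note that the chairman sees everything – original question, raw answers, and peer rankings – which is what distinguishes this from a simple majority vote over model outputs.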
Karpathy noted that the results were often surprising. “Very often the models are surprisingly willing to select another LLM’s response as superior to their own,” he wrote on X (formerly Twitter). He described using the tool to read book chapters, noting that the models consistently praised GPT-5.1 as the most insightful, while rating Claude the lowest. However, Karpathy’s own qualitative assessment differed from that of his digital council; he found GPT-5.1 “too wordy” and preferred Gemini’s “condensed and processed” output.
FastAPI, OpenRouter, and arguments for treating frontier models as interchangeable components
For CTOs and platform architects, the value of LLM Council lies not in literary criticism, but in its construction. The repository serves as a primary document showing exactly what a modern, minimal AI stack looks like at the end of 2025.
The application is built on a “thin” architecture. The backend uses FastAPI, a modern Python framework, while the frontend is a standard React application built with Vite. Data storage is not handled by a complex database, but by simple JSON files written to local disk.
The linchpin of the entire operation is OpenRouter, an API aggregator that normalizes the differences between model providers. By routing requests through this single broker, Karpathy avoided writing separate integration code for OpenAI, Google, and Anthropic. The application does not know which company serves the response; it simply sends a prompt and waits for a reply.
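This normalization is what keeps the integration surface so small: one endpoint, one payload shape, and only the model string changes per provider. The sketch below builds such a request against OpenRouter’s OpenAI-compatible chat-completions endpoint; it is an assumption-laden illustration (the model identifier and key are placeholders), not the project’s code.

```python
import json
import urllib.request

# OpenRouter exposes an OpenAI-style chat-completions endpoint for all
# providers it aggregates.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    # One payload shape works for every provider; only `model` changes.
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        OPENROUTER_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

# The same function serves GPT, Gemini, or Claude; the caller never touches
# a provider-specific SDK. (Key and model name are placeholders.)
req = build_request("anthropic/claude-sonnet-4.5", "Summarize chapter 1.", "sk-...")
```

Because every provider hides behind the same request shape, “integrating a new model” reduces to learning its identifier string.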
This design choice highlights a growing trend in enterprise architecture: the commoditization of the model layer. By treating frontier models as interchangeable components that can be swapped out by editing a single line in a configuration file (specifically the COUNCIL_MODELS list in the backend code), the architecture protects the application from vendor lock-in. If a new model from Meta or Mistral tops the leaderboards next week, it can be added to the council in seconds.
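A swap of that kind is literally a one-line edit. The snippet below is illustrative: the list name COUNCIL_MODELS comes from the repository, but the file path and the specific model strings are assumptions.

```python
# Illustrative config edit (list name per the repo; values are assumptions).
# Dropping one provider and adding another is a matter of editing this list.
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro",
    "anthropic/claude-sonnet-4.5",
    # "x-ai/grok-4",          # retire a council member by removing its line
    "meta-llama/llama-4",      # hypothetical new entrant, added in seconds
]
```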
What’s missing from prototype to production: authentication, PII redaction, and compliance
While the core logic of LLM Council is elegant, it also serves as a clear illustration of the gap between a weekend hack and a production system. For an enterprise platform team, cloning Karpathy’s repository is just the first step of a marathon.
A technical audit of the code reveals the missing “boring” infrastructure that commercial vendors sell at a premium. The system has no authentication; anyone with access to the web interface can query the models. There is no concept of user roles, meaning a junior developer has the same access rights as the CIO.
Moreover, there is no governance layer. In an enterprise environment, sending data simultaneously to four different third-party AI providers creates immediate compliance issues. There is no mechanism to redact personally identifiable information (PII) before it leaves the local network, nor an audit log to track who asked what.
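The kind of pre-flight redaction pass an enterprise gateway would add might look like the sketch below. The patterns are illustrative and deliberately simplistic; production systems use dedicated PII-detection services rather than a handful of regexes.

```python
import re

# Hedged sketch of a redaction pass run before a prompt is fanned out to
# third-party providers. Patterns are illustrative, not exhaustive.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(prompt: str) -> str:
    # Replace each PII match with a labeled placeholder so the downstream
    # models still see the sentence structure, just not the sensitive value.
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

clean = redact("Contact Jane at jane.doe@example.com or 555-867-5309.")
# The email and phone number are masked before leaving the network.
```

An audit log would sit at the same choke point: every call that passes through redaction is also the natural place to record who asked what, and when.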
Reliability is another open question. The system assumes that the OpenRouter API is always up and that the models respond in a timely manner. It lacks the circuit breakers, fallback strategies, and retry logic that keep mission-critical applications running when a provider experiences an outage.
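The missing resilience logic is conceptually simple. Here is a minimal sketch of retry-with-fallback, with `call_model` stubbed to simulate an outage at the first provider; the model names, error type, and backoff parameters are all assumptions, and a production gateway would add circuit breaking and timeouts on top.

```python
import time

# Sketch of the retry/fallback logic the prototype omits. `call_model` is a
# stub: the first provider is simulated as down, the second succeeds.
FALLBACK_CHAIN = ["openai/gpt-5.1", "anthropic/claude-sonnet-4.5"]

class ProviderError(Exception):
    """Transient upstream failure (timeout, 5xx, outage)."""

def call_model(model: str, prompt: str) -> str:
    if model == "openai/gpt-5.1":
        raise ProviderError("upstream outage")  # simulated outage
    return f"[{model}] ok"

def resilient_call(prompt: str, retries: int = 2, backoff: float = 0.01) -> str:
    # Try each model in the chain, retrying transient failures with
    # exponential backoff before falling through to the next provider.
    for model in FALLBACK_CHAIN:
        for attempt in range(retries):
            try:
                return call_model(model, prompt)
            except ProviderError:
                time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("all providers failed")

result = resilient_call("hello")
```

In this simulation the call survives the outage by failing over to the second model, which is exactly the behavior a mission-critical deployment needs and the weekend prototype lacks.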
These gaps are not flaws in Karpathy’s code – he explicitly stated that he has no intention of supporting or improving the project – but they define the value proposition for the commercial AI infrastructure market.
Companies like LangChain, AWS Bedrock, and several AI gateway startups are essentially selling the “hardening” around the core logic Karpathy demonstrated. They provide the security, observability, and compliance wrappers that turn a raw orchestration script into a viable business platform.
Why Karpathy believes code is now ‘ephemeral’ and traditional software libraries are outdated
Perhaps the most provocative aspect of the project is the philosophy under which it was built. Karpathy described the development process as “99% vibe-coded,” implying that he relied heavily on AI assistants to generate the code rather than writing it line by line himself.
“Code is ephemeral now and libraries are over. Ask your LLM to change it in whatever way you like,” he wrote in the repository’s documentation.
This statement marks a radical shift in software engineering philosophy. Traditionally, companies build internal libraries and abstractions to manage complexity, then maintain them for years. Karpathy suggests a future where code is treated as “promptable scaffolding” – disposable, easily rewritten by AI, and not intended to last.
This poses a difficult strategic question for corporate decision makers: if internal tools can be vibe-coded over a weekend, does it still make sense to buy expensive, rigid software suites for internal workflows? Or should platform teams empower their engineers to generate custom, disposable tools that meet their exact needs at a fraction of the cost?
When AI models judge AI: the dangerous gap between machine preferences and human needs
Beyond the architecture, the LLM Council project unintentionally sheds light on a specific risk in automated AI evaluation: the divergence between human and machine judgment.
Karpathy’s observation that his models preferred GPT-5.1, while he preferred Gemini, suggests that AI models may have shared biases. They may prefer verbosity, specific formatting, or rhetorical confidence that does not necessarily match the human business needs for brevity and accuracy.
As companies grow increasingly dependent on “LLM-as-judge” systems to evaluate the quality of their customer-facing bots, this discrepancy matters. If the automated rater consistently rewards wordy, verbose answers while human customers want concise solutions, the metrics will show success even as customer satisfaction plummets. Karpathy’s experiment suggests that relying solely on AI to grade AI is a strategy fraught with hidden alignment problems.
What enterprise platform teams can learn from a weekend hack before building their 2026 stack
Ultimately, LLM Council acts as a Rorschach test for the AI industry. For the hobbyist, it is a fun way to read books. For the vendor, it is a threat, proving that the core functionality of their products can be replicated in a few hundred lines of code.
But for the enterprise technology leader, it is a reference architecture. It demystifies the orchestration layer, showing that the technical challenge lies not in routing the prompts but in governing the data.
As platform teams head into 2026, many will likely be staring at Karpathy’s code, not to implement it, but to understand it. It proves that a multi-model strategy is not technically out of reach. The question remains whether companies will build the governance layer themselves or pay someone else to wrap the “vibe code” in enterprise-grade armor.




