
The Architect, The Engineer, and The Polymath: Three AI Coding Agents, Three Philosophies
A practitioner's field report comparing Claude Code, OpenAI Codex, and Google Gemini CLI -- tested daily, not benchmarked weekly.
By R. Rajeev Kumar | The Global Federation
There is a particular kind of knowledge that benchmarks cannot capture. It is the knowledge that comes from sitting with a tool eight hours a day, watching it succeed magnificently and fail absurdly, learning its temperament the way a carpenter learns the grain of different woods. Over the past year, I have used all three major AI coding agents -- Anthropic's Claude Code, OpenAI's Codex CLI, and Google's Gemini CLI -- not as a reviewer running curated tests, but as a builder shipping real software across a portfolio of applications.
What follows is not a benchmark comparison. It is a field report.
Claude Code: The Brilliant Colleague Who Forgets What You Said Yesterday
Claude Code feels less like a tool and more like a conversation. This is by design. Anthropic built it to emulate pair programming -- two minds at one keyboard, trading ideas, catching each other's blind spots. And it works. When Claude is in its stride, the experience is genuinely collaborative: it reads your project, understands your intent, suggests approaches you hadn't considered, and writes code that feels like it belongs in your codebase.
The numbers confirm the feeling. Claude Code has captured 46 percent of the "most loved" developer vote in the Pragmatic Engineer's 2026 survey of 906 developers -- more than Cursor and Copilot combined. It reached a billion dollars in annualised run-rate revenue within six months of its public launch, a pace faster than ChatGPT or Slack achieved. Roughly 135,000 public GitHub commits per day now carry its co-author tag, according to SemiAnalysis. Jaana Dogan, a principal engineer at Google responsible for the Gemini API, publicly credited Claude Code with producing in one hour a working prototype of what her team had spent a year building -- though she was careful to note she had fed it the "best ideas that survived" from that year of iteration, and the output was a "decent toy version" that still needed refinement.
But Claude has a temperament. It is eager to please, sometimes excessively so. Ask it to fix a database connection string and it may rewrite your entire data layer instead. Tell it to follow 200 lines of project rules and it will acknowledge every one of them, then quietly ignore half during code generation. One developer documented asking Claude to undo a change, only to watch it delete a thousand lines and two files it was never asked to touch. A viral Cybernews story about Claude ignoring the word "No" gathered over 1,350 upvotes on Hacker News, with developers wishing for something "more robotic" -- a plea that says everything about Claude's personality-first design.
The deeper issue is structural. Claude markets a one-million-token context window, but developers report reliable performance degrading well before that limit. The community has given it a name: "context rot." Anthropic's own engineering team acknowledges that accuracy and recall degrade as the conversation lengthens -- a well-documented "lost-in-the-middle" phenomenon where information at the centre of a long conversation is most likely to be forgotten. The forgetfulness that users experience is not a bug in the personality. It is a limitation of the architecture that the personality makes easier to forgive.
And yet, Claude Code remains the most consistently productive of the three. Realistic productivity multipliers from the field hover at two to three times baseline, with five times achievable through careful prompt preparation. Enterprise adopters report significant velocity gains -- Rakuten, for instance, reduced development timelines from 24 days to five. Developers in their fifties and sixties on Hacker News describe Claude Code as having "reignited their passion for building software" by letting them focus on solving problems rather than memorising framework syntax.
Not everyone agrees. An external evaluation by METR -- using Cursor Pro with Claude's underlying models, not Claude Code itself -- found that experienced open-source developers actually took 19 percent longer on tasks when given AI assistance. The study was small (16 developers, 246 tasks) and used a different tool, but it is a useful corrective to the breathless productivity claims.
The criticism exists, and it is valid. But it is the criticism of a tool that people actually use -- passionately, daily, and at scale.
Codex CLI: The Pedantic Engineer Who Delivers Exactly What You Asked For
If Claude Code is the colleague you brainstorm with, Codex is the engineer you hand a specification to. It does not ask why. It does not suggest alternatives. It takes the document, disappears for twenty minutes, and returns with precisely what was written -- even if what was written was wrong.
This was not always a compliment. For much of 2025, Codex CLI was, to put it diplomatically, a frustrating experience. A truncation system capped at 256 lines or 10 kilobytes -- whichever hit first -- meant that critical error messages would vanish from the middle of outputs. GitHub issues piled up: broken file reads, lost compilation errors, MCP tools rendered unusable. The tool had the bones of something good wrapped in a harness that actively undermined it.
Then came GPT-5.4.
The transformation is not subtle. Developers who had written off Codex describe returning to find, in their words, a "night and day" difference. The context window expanded to one million tokens. Token consumption for tool-heavy workflows dropped by 47 percent. Tasks that failed reliably in mid-2025 now succeed routinely. One daily user described his initial scepticism as "thoroughly overturned," now queuing four to five Codex tasks each morning before starting manual work.
Where Codex particularly excels -- and where it has no real peer among the three -- is in code review and security analysis. Codex Security reads code the way a security researcher does: building codebase-specific threat models, mapping attacker entry points and trust boundaries, then attempting to reproduce its findings in isolated environments before reporting them. OpenAI reports an 84 percent reduction in noise, 90 percent fewer over-reported severities, and a 50 percent drop in false positives. This is not pattern matching. It is structured adversarial reasoning applied to your codebase.
The pedantic quality is real and double-edged. Tell Codex to add defensive coding and every function gets wrapped in type-checking guards. Ask it to remove something and you may end up with more lines of code than you started with. One developer watched it spend thirty minutes complying with every last character of a configuration file -- even when the instructions contained errors. It will follow a specification off a cliff, because that is what specifications are for.
This literalness makes Codex the perfect counterweight to Claude's creativity. The emerging power workflow in the developer community is telling: use Claude Code to architect and generate features through interactive reasoning, then run Codex as the code reviewer before merging. One builds; the other audits. The strengths of each precisely compensate for the weaknesses of the other.
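That two-step loop can be sketched as a small shell script. The invocations below are assumptions based on each tool's non-interactive mode (`claude -p` for Claude Code, `codex exec` for Codex CLI), and the prompts are placeholders; by default the script runs dry and only prints its plan.

```shell
#!/bin/sh
# Build-then-audit loop: Claude Code generates, Codex reviews.
# The agent flags (`claude -p`, `codex exec`) are assumed from each
# CLI's non-interactive mode -- verify against your installed versions.
# With DRY_RUN=1 (the default) the script only prints what it would do.

DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Step 1: Claude Code architects and generates the feature.
run claude -p "Implement the feature described in SPEC.md"

# Step 2: Codex audits the resulting diff before it is merged.
run codex exec "Review the uncommitted changes for security issues"
```

Setting `DRY_RUN=0` hands the same plan to the real agents; keeping the dry run as the default makes the orchestration inspectable before any agent touches the repository.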
In one independent blind test of 36 rounds, Claude won 67 percent of head-to-head code quality comparisons against Codex -- a small sample, but directionally consistent with broader sentiment. But on Terminal-Bench -- the benchmark that measures command-line navigation, dependency management, and debugging -- Codex leads at 77.3 percent versus Claude's 65.4. The gap has narrowed dramatically. In some dimensions, it has reversed.
Gemini CLI: The Smartest Model Trapped in the Weakest Tooling
Here is the paradox of Gemini: the model is competitive. The tooling is not.
A Medium headline captured it perfectly: "Gemini 3.1 Pro is the smartest dumb model I know." The reasoning engine underneath is genuinely powerful. Gemini 3.1 Pro is competitive on SWE-bench Verified, the most widely cited coding benchmark. It is notably strong at tasks requiring command-line navigation and multi-step debugging. And it processes entire codebases through a one-million-token context window that was a genuine differentiator before competitors matched it.
But the CLI that wraps this engine scores 86 out of 100 on one coding-tool ranking where Claude Code scores 98. One Hacker News commenter put it with characteristic bluntness: "Gemini CLI sucks. Just use OpenCode if you have to use Gemini." A developer community comparison found it "the hardest to trust in daily use" with "random HTTP errors, unclear token limit feedback, and a slow and clunky UI." Free-tier users report being rate-limited after fewer than ten prompts and downgraded to a lesser model. And a trap involving two authentication methods with different billing structures has generated stories of surprise charges ranging from a hundred to two thousand dollars -- one environment variable, silently set, can route your requests from the free tier to the paid API without warning.
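One defensive habit is a pre-flight check before launching the CLI. The variable names in the sketch below (`GEMINI_API_KEY`, `GOOGLE_API_KEY`, `GOOGLE_GENAI_USE_VERTEXAI`) are assumptions based on Gemini CLI's documented authentication precedence; verify them against your installed version.

```shell
#!/bin/sh
# Warn before launching Gemini CLI if an API-key environment variable
# is set, since a stray key can silently route requests from the free
# OAuth tier to the pay-as-you-go API. The variable names checked here
# are assumptions -- confirm them against your CLI's documentation.

check_gemini_billing() {
  for var in GEMINI_API_KEY GOOGLE_API_KEY GOOGLE_GENAI_USE_VERTEXAI; do
    # Indirectly read the value of $var without bash-only syntax.
    eval "val=\${$var:-}"
    if [ -n "$val" ]; then
      echo "WARNING: $var is set -- requests may bill to the paid API" >&2
      return 1
    fi
  done
  echo "no billing-related variables detected; OAuth login will be used"
}

check_gemini_billing || true
```

Wiring this into a shell alias in front of the real `gemini` binary turns a silent billing switch into a visible prompt-time warning.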
Google's Antigravity Agent -- the full-stack app builder in AI Studio -- tells a similar story of brilliant capability undermined by operational risk. Product Hunt reviewers praise its parallel execution and autonomous agent workflows. Then you read about the incident in late 2025 where it ran a recursive delete command and wiped a photographer's entire D drive -- every file, every folder, gone in seconds. Google said it was "investigating." As of this writing, Gemini CLI ships with macOS sandboxing, but Antigravity's safeguards remain a work in progress.
Where Gemini genuinely leads -- and leads decisively -- is in two domains that have nothing to do with CLI coding.
The first is multimodal intelligence. Gemini was designed from the ground up as a multimodal model, and it shows in ways competitors cannot match. Chart analysis, screenshot data extraction, video generation through Veo 3 -- these are not bolted-on features but native capabilities. For tasks that involve understanding images, diagrams, or visual data alongside code, Gemini is the first tool to reach for.
The second is deep research. Gemini Deep Think has solved eighteen previously unsolved research problems, disproved a decade-old mathematical conjecture, and achieved gold-medal performance on international science olympiads. A Rutgers mathematician used it to identify a logical flaw that had passed through human peer review unnoticed. Duke University's Wang Lab applied it to semiconductor crystal growth optimisation. These are not party tricks. They are demonstrations of reasoning depth that neither Claude nor Codex has matched in structured research contexts.
Google's online tools -- AI Studio, Stitch, Firebase Studio -- are consistently rated higher than its CLI offering. This is the clearest signal in the data: the Gemini model deserves better tooling than Google has built around it. When developers access Gemini through third-party interfaces like GitHub Copilot rather than Google's own CLI, satisfaction improves markedly. The bottleneck is not the intelligence. It is the interface.
The Verdict: Three Tools, One Workflow
The most productive developers in 2026 are not choosing between these tools. They are composing them.
Claude Code for architecture, feature generation, and the creative work of turning ideas into code. Codex for security review, specification compliance, and the disciplined work of ensuring that code is correct. Gemini for multimodal analysis, deep research, and the analytical work of understanding problems before writing solutions.
Each tool embodies a philosophy. Claude believes software development is a conversation -- collaborative, iterative, human. Codex believes it is an engineering discipline -- precise, literal, specification-driven. Gemini believes it is a research problem -- multimodal, analytical, knowledge-intensive. None of them is wrong. None of them is complete.
The question for the developer community is not which tool is best. It is whether the industry will converge toward tools that combine these philosophies, or whether the future belongs to workflows that orchestrate between them. The answer, for now, is orchestration. The tools that play well together are winning over the tools that try to do everything alone.
What remains clear is that the era of the single AI coding assistant is over. The age of the AI development team has begun -- and the team, it turns out, works best when its members have very different personalities.
R. Rajeev Kumar is Chief Executive Officer of the Aarksee Group of Companies. He uses all three AI coding agents daily across a portfolio of enterprise, environmental, and publishing applications.
Sources and further reading:
- Claude Code "most loved" data: Pragmatic Engineer, "AI Tooling for Software Engineers in 2026" (906 respondents, Mar 2026)
- Claude Code $1B run-rate: Anthropic announcement; Reuters (Dec 2025)
- Claude Code commit volume: SemiAnalysis, "Claude Code is the Inflection Point" (Feb 2026)
- Jaana Dogan (Google) on Claude Code: The Decoder (Jan 2026); original post on X
- Claude Code context rot: GitHub Issue #35296, Anthropic engineering blog
- Claude Code ignoring instructions: GitHub Issues #24129, #742; Cybernews (2026)
- Claude Code adoption metrics: Bloomberg, "AI Coding Agents Fueling a Productivity Panic in Tech" (Feb 2026)
- METR productivity study: "Measuring Impact of Early-2025 AI on Experienced Developer Productivity" (Jul 2025; used Cursor Pro, 16 developers, 246 tasks)
- Codex CLI truncation issues: GitHub Issues #6426, #7906, #9502
- GPT-5.4 improvements: OpenAI official announcement (Mar 2026)
- Codex Security statistics: OpenAI, "Codex Security: now in research preview" (2026)
- Claude vs Codex blind test: Blake Crosley independent test (36 rounds, 5 dimensions)
- Terminal-Bench 2.0 scores: tbench.ai leaderboard; MorphLLM comparison (2026)
- Gemini CLI rating: AIForCode.io (single-source rating, 2026)
- Gemini billing trap: Medium, "The $150 Gemini CLI Trap" (2026)
- Gemini Deep Think research: Google DeepMind blog (Feb 2026); arXiv paper (Feb 2026)
- Deep Think peer review flaw: Lisa Carbone, Rutgers University, via Google blog
- Deep Think crystal growth: Duke University Wang Lab, via Google DeepMind
- Antigravity D-drive incident: The Register, Tom's Hardware, Cybernews (Nov-Dec 2025)
- Developer workflow patterns: VibHackers, Nilenso blog (2026)
Published by The Global Federation | Peace, Prosperity & Progress