5 bugs AI agents silently introduce

ive been running 35 specialized claude code agents across 3 production codebases for the last 6 months. agents are good at writing code. theyre also good at introducing a specific class of bugs that traditional code review tends to miss — because the code looks correct, passes tests, and ships clean.

here are five categories of failures ive documented in production. each one cost me real time + at least one staging incident. write these into your prompts, your reviews, or your team SOP — your choice. but know they exist.

1. agents act when restraint is the right move

biggest single category. AI agents fail when the right action requires waiting, choosing not to act, or saying "i need more info."

client message: "can you make the dashboard faster?" agent reads the request, looks at the dashboard code, identifies three optimization opportunities, starts implementing. senior reads the same message, asks: "faster for whom? on what data volume? slow on initial load or on filter operations? whats the SLA?"

the agents confident execution costs hours of work that might solve the wrong problem. the seniors pause costs 5 minutes of clarification. the pause is the right move 80% of the time.

concrete failure case: building a multi-tenant data isolation layer for a saas project last quarter. agent kept adding configuration flags for edge cases ("what if a tenant wants flag A but not B?"). by flag #7 the system was unmaintainable. i deleted 5 flags and replaced them with a default-secure single mode. config space went from 128 combinations to 4. senior judgment was "stop adding, start removing."

common thread: right move is restraint. agents are calibrated for action.

2. agents remove load-bearing weirdness

agents are good at search. agents are bad at synthesis from large context.

concrete: asked an agent to refactor a slow 40-line function. the rewrite was technically correct. but the original contained try/catch with comment // don't remove — handles malformed JSON from legacy webhook v1. the rewrite "cleaned up" that try/catch.

agent saw 40 lines. actual scope was the whole webhook chain, the legacy contract, the production data that occasionally hits the malformed path. none of that was in the prompt.

deployed the rewrite to staging. crashed within 6 hours when the daily webhook v1 batch fired. rolled back, restored the original try/catch, added a regression test that explicitly fires malformed JSON. lesson cost ~3 hours and a degraded staging window.

this isnt fixed by more context tokens. the context that matters is implicit — "this comment was load-bearing", "this duplication was intentional", "this naming convention was chosen for a reason". agent reads the lines but doesnt have the memory of why theyre there.

3. agents over-abstract on small sample sizes

asked an agent to extract a pattern shared by 3 functions. it produced a beautiful generic helper that the 4th similar function — written 2 weeks later — could never quite fit. the 3 specific implementations were better than the 1 generic abstraction.

agent has no "predict the 4th case" capability. 3 occurrences feel like a pattern to the agent. to a senior, 3 occurrences with subtle variations feel like a warning that the abstraction will leak.

rule of thumb after this: don't let agents extract shared code into a helper until you have 5+ confirmed examples that are mechanically identical. 3 examples is a coincidence, not a pattern.

4. agents misread tone in human messages

this one i underestimated. agents are bad at reading tone in human messages.

client going silent for 3 days is a strong signal — possibly losing interest, possibly stuck on internal decision, possibly got a competing quote. agents read silence as "no update yet" and continue per plan. "can we add X?" (genuine question) vs "can we add X?" (testing whether you'll pushback on scope creep) is invisible to agents. senior knows from timing, prior conversations, how it was phrased. tone for difficult convos — scope-creep pushback, missed-deadline notes, refund discussions — agent versions are either too soft (gets walked over) or too corporate (loses earned trust). specific exchange from last month. a client asked for a "small change" 6 weeks into a project. agent drafted a polite, structured reply explaining the change-request process. i sent something different: "sure, let me think about whether this needs a CR or if it fits the current scope — give me 24 hours."

agent reply was formally correct. actual right reply was warmer + bought thinking time. the relationship needed the warmth more than it needed the formality. i now never let agents send client comms without human review.

5. agents measure what's measurable, not what matters

this is where i most expected agents to excel + where theyre most disappointing.

building LLM-based products requires evals. what does "good" mean? what threshold do we ship at? which test cases matter? upstream of implementation, heavily judgment-based.

specific case: building the support agent eval harness for the e-commerce project (last quarter). agent suggested measuring response accuracy via semantic similarity to a "golden answer" set. that would have been wrong in 2 ways. first, the golden answers themselves were judgment calls. second, the actual metric that mattered was "did the customer ask a follow-up that suggests they were confused." the eval design needed real customer conversation data + human classification of "this was helpful" / "this missed." cant be done from training data.

bonus pattern in the same category: agent says "this query should be fast" based on indexed columns. in production with cold cache + network jitter + concurrent load, its 800ms slow. agents reason from the code, not from operational behavior. theyll recommend caching strategies that look correct on paper but ignore the cache invalidation cost when the cached data changes 50x/day. they'll recommend "just add an index" without thinking about write amplification on a write-heavy table.

senior knows "this looks fast but will be slow in production for THESE reasons" because senior has seen production reality. agent has seen the docs.

the architecture that contains the damage i run 35 specialized agents in ~/.claude/agents/. each one has a tight scope:

flow-architect — routes tasks to the right specialist backend-developer / frontend-developer / postgres-pro / golang-pro — domain implementation quality-gate — final review against the agreed checklist security-engineer / test-automator — orthogonal verification cfo / cto / ceo — strategy and cross-cutting decisions ... + 22 more domain specialists specialists limit blast radius. when one fails, the failure is in one domain instead of across the codebase. specialists also make the bug categories easier to spot — quality-gate's job is exactly to catch the patterns above.

if youre building agent systems in 2026, build the specialist pattern + checklist-driven review FIRST. the bug categories above don't disappear, but they get caught earlier.

five categories of bugs AI agents will silently introduce to your codebase

1. agents act when restraint is the right move

2. agents remove load-bearing weirdness

3. agents over-abstract on small sample sizes

4. agents misread tone in human messages

5. agents measure what's measurable, not what matters

Comments

More from this blog

what cloudflares free tier actually limits (real numbers, not adjectives)

cloudflare email routing tags ur domain as "forwarder" (its fine)

AI-assisted development cost breakdown — real numbers from 3 projects

How to budget a software development project (without the spreadsheet theater)

Command Palette

1. agents act when restraint is the right move

2. agents remove load-bearing weirdness

3. agents over-abstract on small sample sizes

4. agents misread tone in human messages

5. agents measure what's measurable, not what matters

Comments

More from this blog