AI agents in engineering workflows rarely fail loudly. They don't crash, error out, or produce obvious nonsense. They fail quietly, in ways that look like normal output, and the failures accumulate until the team notices that something is off: a slow drift in code quality, a pattern of subtle bugs, decisions that don't quite match the team's standards. By the time the failure is named, it has often been operating for months.
The eight failure modes below are the ones we see most often in engineering deployments. Each is subtle by nature: these are not the failures that basic monitoring catches. They require deliberate detection because they don't trip standard alerts.
1. Confident hallucination of APIs that don't exist
The agent confidently writes code that calls library functions that don't exist or that have different signatures than the agent assumed. The code looks reasonable, passes a casual review, and breaks at runtime. The engineer reviewing the PR doesn't catch it because the code reads like it could be correct — and the agent's confidence in the code suggests it was tested. It wasn't. The pattern is most common with libraries the agent has limited training data on, particularly recent or internal libraries.
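One cheap guard is to confirm, before review, that a referenced function exists and accepts the arguments the generated code passes to it. The following is a minimal sketch, not a complete tool: `check_call` is a hypothetical helper, it only handles importable Python modules, and it checks arity and keyword names rather than behavior.

```python
import importlib
import inspect


def check_call(module_name: str, attr_path: str, *args, **kwargs) -> str:
    """Report whether module.attr_path exists and accepts the given arguments."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return f"missing module: {module_name}"
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return f"missing attribute: {module_name}.{attr_path}"
        obj = getattr(obj, part)
    try:
        # bind() validates positional and keyword arguments without calling obj.
        inspect.signature(obj).bind(*args, **kwargs)
    except TypeError as exc:
        return f"signature mismatch: {exc}"
    except ValueError:
        # Some C-implemented callables expose no signature; only existence is confirmed.
        return "ok (signature not introspectable)"
    return "ok"


# A real function, called the way the standard library defines it.
print(check_call("shutil", "copyfile", "a", "b"))
# A plausible-looking keyword argument that shutil.copyfile does not accept.
print(check_call("shutil", "copyfile", "a", "b", overwrite=True))
# A plausible-looking function name that does not exist.
print(check_call("shutil", "copy_file", "a", "b"))
```

A check like this only narrows the gap. Functions that accept `**kwargs` will pass it with any keyword, so it complements integration tests rather than replacing them.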
2. Subtle drift from team conventions
The team has specific conventions — naming, structure, error handling patterns. The agent's defaults don't match. Each individual PR introduces small deviations: a variable named slightly off-convention, an error handled with a different pattern. Individual deviations are minor; the cumulative effect, over months, is a codebase that's inconsistent with itself. Reviewers don't catch each deviation because each is small. The convention erodes invisibly.
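Convention drift becomes visible once deviations are counted instead of eyeballed. As an illustration, the sketch below assumes the team's convention is snake_case function names and flags anything else. The regular expression and the `find_naming_deviations` helper are assumptions made for this example; a real team would encode its own rules, most likely in its existing linter.

```python
import ast
import re

# Assumed convention: snake_case, with optional leading underscores.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def find_naming_deviations(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for each function whose name is not snake_case."""
    deviations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                deviations.append((node.lineno, node.name))
    return sorted(deviations)


sample = "def fetchUser():\n    pass\n\ndef fetch_order():\n    pass\n"
print(find_naming_deviations(sample))
```

Run over the whole repository on a schedule, the count of deviations per week is the signal: a single off-convention name is noise, a rising trend is drift.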
3. Decisions made without recorded rationale
The agent makes architectural choices in the course of writing code — choosing library A over B, structuring a function one way rather than another. The choices are usually reasonable, but the rationale isn't captured anywhere. When the team later wants to revisit the choice, there's no record of why it was made. The agent's decisions become indistinguishable from accumulated implementation defaults.
4. Workarounds that become permanent
The agent encounters a problem and produces a workaround — a hack that gets the immediate task done but isn't the right long-term solution. A human engineer would flag the workaround as technical debt. The agent doesn't; it ships the workaround as if it were the solution. Six months later the workaround is load-bearing, and the team has to unwind it expensively.
5. Plausible but wrong test coverage
The agent writes tests that look comprehensive but cover the wrong things. Tests assert that the function returns when called with valid input — but don't test the error paths, edge cases, or invariants that the team's actual test suite would have covered. The test count goes up; the actual coverage in the sense the team cares about doesn't. False confidence is worse than no confidence.
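The gap between test count and meaningful coverage is easiest to see with a deliberately broken implementation. In the sketch below, `parse_port` and its broken variant are hypothetical functions invented for this example. The happy-path test passes for both versions, while the test that exercises boundaries and error paths rejects the broken one.

```python
def parse_port(text: str) -> int:
    """Hypothetical function under test: parse a TCP port number."""
    port = int(text)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def parse_port_no_range_check(text: str) -> int:
    """Broken variant: the range check has been dropped."""
    return int(text)


def happy_path_test(parse) -> bool:
    # Inflates the test count: valid input, loose assertion.
    return parse("8080") is not None


def thorough_test(parse) -> bool:
    # Checks the boundaries and error paths the team relies on.
    if parse("1") != 1 or parse("65535") != 65535:
        return False
    for bad in ("0", "65536", "-1", "http", ""):
        try:
            parse(bad)
        except ValueError:
            continue
        return False  # invalid input was accepted
    return True


for impl in (parse_port, parse_port_no_range_check):
    print(impl.__name__, happy_path_test(impl), thorough_test(impl))
```

Seeding a known defect and checking whether the suite notices is the idea behind mutation testing. Even a manual version of it during review shows quickly whether new tests assert anything the team cares about.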
6. Cascading dependencies the human reviewer doesn't notice
The agent makes a change in module A. The change requires updating module B, then C, then D. The agent makes all the changes correctly. The reviewer scans the PR, sees that it's coherent, and approves. What they miss is that the agent has effectively decided to couple modules A through D more tightly than they should be coupled. The code works; the architecture has quietly drifted in a direction nobody chose.
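Coupling that a reviewer scans past can be made explicit by diffing the import graph before and after a change. This is a rough sketch under simplifying assumptions: modules are given as a mapping from name to source text, only absolute imports whose name exactly matches another module are counted, and the helper names are invented for this example.

```python
import ast


def internal_imports(sources: dict[str, str]) -> set[tuple[str, str]]:
    """Return (importer, imported) edges between the modules in `sources`."""
    edges = set()
    for name, source in sources.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                targets = [node.module]
            else:
                continue
            for target in targets:
                if target in sources and target != name:
                    edges.add((name, target))
    return edges


def new_coupling(before: dict[str, str], after: dict[str, str]) -> set[tuple[str, str]]:
    """Return the import edges that exist after the change but not before it."""
    return internal_imports(after) - internal_imports(before)


before = {"a": "import os\n", "b": "", "c": ""}
after = {"a": "import b\n", "b": "from c import helper\n", "c": ""}
print(sorted(new_coupling(before, after)))
```

Surfacing the new edges in the PR turns an implicit architectural decision into a question the reviewer has to answer: should these modules depend on each other at all?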
7. Refactoring that loses essential subtle behavior
The agent refactors a complex piece of code to be "cleaner." The refactoring preserves the obvious behavior but loses a subtle behavior that wasn't documented — a specific ordering, a particular error case, a corner case the original engineer handled deliberately. The refactor passes tests because the tests didn't cover the subtle behavior. The subtle behavior matters in production, and the bug surfaces weeks later when a specific user hits the corner case.
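A characterization test guards against this by comparing the old and new implementations on a spread of inputs before the old one is deleted. The example below is illustrative only: both `dedupe` functions are hypothetical, and the undocumented behavior being lost is first-seen ordering.

```python
def dedupe_original(items):
    """Legacy version: removes duplicates and keeps first-seen order."""
    seen, result = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_refactored(items):
    """'Cleaner' rewrite: same elements, but first-seen order is lost."""
    return sorted(set(items))


def behaviour_differences(old, new, inputs):
    """Return the inputs on which the two implementations disagree."""
    return [case for case in inputs if old(list(case)) != new(list(case))]


# An existing test that only uses already-sorted input would pass for both versions.
cases = [[1, 2, 3], [3, 1, 3, 2], []]
print(behaviour_differences(dedupe_original, dedupe_refactored, cases))
```

The comparison is only as good as its inputs, so unsorted, repeated, and empty cases matter more here than volume.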
8. Confident misinterpretation of ambiguous requirements
The task description is ambiguous: multiple reasonable interpretations exist. A human engineer would ask. The agent picks an interpretation and proceeds with confidence. The interpretation is wrong, so the agent has produced a working implementation of the wrong thing. By the time the team notices, a reviewer has approved the PR and the wrong implementation is in production. The agent never flagged the ambiguity.
What detection looks like
Each failure mode has a detection pattern; the first four illustrate the approach. For hallucinated APIs, integration tests that exercise the real library catch runtime failures that unit tests with mocked dependencies miss. For convention drift, periodic codebase audits with linter rules tuned to team conventions surface accumulating deviations. For unrecorded decisions, requiring the agent to produce a brief rationale for architectural choices (and capturing it in the PR description) makes the decisions auditable. For workarounds that become permanent, an explicit "is this a workaround?" prompt in the review checklist surfaces them.
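The rationale requirement can be enforced mechanically rather than left to memory. The sketch below assumes a team convention of a `Rationale:` line in the PR description and treats a change to a dependency manifest as a stand-in for "architectural choice". Both the convention and the file list are assumptions for illustration, not an established standard.

```python
import re

# Assumed convention: a line starting with "Rationale:" followed by text on the same line.
RATIONALE = re.compile(r"^rationale:[ \t]*\S", re.IGNORECASE | re.MULTILINE)
# Assumed proxy for an architectural choice: the PR touches a dependency manifest.
DEPENDENCY_FILES = {"requirements.txt", "pyproject.toml", "package.json"}


def missing_rationale(description: str, changed_files: list[str]) -> bool:
    """True when a PR changes a dependency manifest without a 'Rationale:' line."""
    touches_deps = any(
        path.rsplit("/", 1)[-1] in DEPENDENCY_FILES for path in changed_files
    )
    return touches_deps and not RATIONALE.search(description)


print(missing_rationale("Adds retry logic.", ["svc/requirements.txt"]))
print(missing_rationale(
    "Adds retry logic.\nRationale: tenacity is already used in billing.",
    ["svc/requirements.txt"],
))
```

Wired into CI as a failing check, this makes the missing rationale a blocker at the moment the decision is made instead of a mystery months later.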
The common thread: detection requires deliberate infrastructure rather than passive observation. Teams that deploy agents and then watch for failures with their normal review process will miss most of these failure modes because the failures look like normal output. Teams that build detection mechanisms tuned to agent-specific failures catch the patterns early enough to address them.
The underlying issue
AI agents in engineering workflows are not yet at the level where they can be deployed and trusted without verification infrastructure. The verification infrastructure is more important than the model selection. Teams that build the infrastructure carefully can extract real value from agents; teams that skip it usually report after a few months that "the AI didn't really help" — which is technically accurate but misses the cause. The AI didn't help because the infrastructure to extract its help wasn't there.
Frequently asked questions
How much human review do AI-generated PRs need?
More than human-generated PRs, despite the temptation to review them less. The failure modes are different — humans don't typically hallucinate APIs or invent test coverage. Reviewers need to specifically check for agent-class failures, which means review takes longer, not shorter. Teams that treat AI PRs as low-effort because "the AI did the work" are exactly the teams who accumulate the failures above.
Should every AI-generated PR be reviewed by a human?
Until the failure rate is well-characterized for your specific use case, yes. The exceptions are narrow domains where the agent has been validated extensively — pure mechanical transforms, well-tested patterns, low-stakes scripts. For anything touching production behavior or architectural choices, human review is the baseline standard.
What's the most common engineering team mistake with agents?
Assuming that the agent's output is reliable enough that the team's review standards can be relaxed. The opposite is true — the team's review standards should be tightened because agent failure modes are different and require specific attention. Teams that recognize this build careful workflows around agents and extract real value. Teams that don't end up with degraded codebases and unclear root causes.