AI didn't replace developers—it gave every one of them a direct report. Research shows a 39-point gap between perceived and actual productivity. Here's why "more productive" often means "more exhausted."
Developers report feeling more productive than ever. They're shipping more code, moving faster, closing tickets at a pace that would have seemed ambitious two years ago. Leadership sees the dashboards and feels good about the AI investment.
And yet.
Burnout is climbing. The Allstacks 2024 Engineering Survey found that 63% of engineering leaders believe at least a quarter of their teams are burned out. Developers describe a new kind of exhaustion — not the familiar crunch of a deadline push, but something subtler. A low-grade cognitive drain that persists even when the work seems... easier?
Here's what's actually happening: AI didn't replace developers. It gave every single one of them a direct report.
That direct report writes code fast. It's eager. It never sleeps. It also hallucinates, ignores context, and requires constant supervision. It can't be trusted with anything important unsupervised. And unlike a human junior developer, it never learns from feedback — you'll correct the same mistake tomorrow that you corrected today.
Welcome to the new org chart.
A 2025 randomized controlled trial produced one of the more striking findings in recent software engineering research. Experienced developers using AI coding tools were objectively 19% slower than those working without AI assistance. But here's what makes the finding noteworthy: those same developers believed they were 20% faster.
That's a 39-percentage-point gap between perception and reality.
This isn't the whole picture, of course. Other research shows real productivity gains. Peng et al.'s 2023 controlled experiment found developers completed standardized tasks 55.8% faster with GitHub Copilot. Ant Group's field experiments showed 55% productivity increases measured in lines of code.
So which is it — faster or slower?
Both, depending on what you measure and who's doing the work. And that paradox points to something important about what AI is actually doing to developer cognition.
AI accelerates generation — the act of producing code. For certain tasks, especially boilerplate and well-understood patterns, that acceleration is real and meaningful. But generation isn't the only bottleneck, and often isn't the primary one. The harder constraints have always involved judgment: knowing what to build, how to structure it, whether it's correct, and how it fits into the larger system.
When AI handles generation, developers don't simply get to move faster. They get reassigned to a different kind of work: supervision. And supervision, it turns out, is cognitively demanding in ways that aren't always obvious.
Here's a mental model that helps explain what's happening: every developer just became a manager.
Not a manager of people — a manager of machines. They're managing a direct report who produces output at extraordinary speed, with limited judgment about whether that output is correct, secure, maintainable, or relevant to the actual problem.
Consider what this looks like in practice. This is the job now: not just writing code, but reviewing code you didn't write, from a source that can't explain its reasoning, for problems that aren't always visible on first read.
The human factors literature has a name for this role: supervisory control. And fifty years of research across aviation, nuclear power, and industrial automation tell us something important about supervisory control.
It's more tiring than it looks.
There's a counterintuitive finding that air traffic controllers, nuclear plant operators, and ICU nurses understand from experience: monitoring appears easy from the outside but is among the more cognitively demanding work humans can do.
The research on this is extensive. Vigilance tasks — maintaining attention to detect rare but important signals — generate elevated workload and stress comparable to active task performance, despite appearing passive and effortless. NASA's workload studies using the NASA-TLX scale have documented this repeatedly.
The mechanism involves several overlapping phenomena:
Vigilance decrement. Human attention during monitoring tasks begins to degrade within 15-40 minutes. This isn't laziness or poor discipline — it's a fundamental cognitive limit. Research suggests it can't be reliably overcome through motivation or training alone.
Passive fatigue. Active fatigue comes from overload — too much to do. Passive fatigue comes from underload and monotony — maintaining readiness without engagement. Saxby et al.'s research found passive fatigue led to slower response times and increased errors, even when operators reported feeling in control.
Out-of-the-loop syndrome. When you're not actively doing the work, you don't build and maintain the mental models required to understand the system. Endsley and Kiris documented how this leaves operators less equipped to intervene when automation fails.
Developers reviewing AI output are potentially experiencing all three. They're maintaining vigilance for errors in code they didn't write. They're passively monitoring rather than actively creating. And they may be gradually losing the deep understanding that comes from having written the code yourself.
In 2014, researchers analyzing 18 automation experiments identified a pattern they called the "Lumberjack Model."
The name comes from the visual: a tree stands fine under normal conditions. But when stress exceeds a threshold, it doesn't bend — it falls abruptly.
Their finding: higher automation consistently delivers performance benefits during routine operations. But when automation fails or encounters an edge case, human performance doesn't degrade gracefully. It drops sharply once you cross a critical threshold.
The mechanism: the more automation handles, the more the human operator drifts "out of the loop." Their situation awareness degrades. Their mental model of the system becomes less accurate. Their hands-on skills get less practice. Everything works fine — until something unexpected happens. Then the human is asked to intervene in a system they no longer fully understand, under time pressure.
This maps onto AI-assisted development:
Routine case: Copilot generates code. Developer reviews it, accepts it, ships. Dashboards look good.
Edge case: AI introduces a subtle bug, a security flaw, or code that works but is architecturally problematic. The developer — who didn't write the code and may not fully grasp its implications — has to debug, refactor, or explain it.
The Lumberjack collapse is that moment when a developer realizes they've been approving code they can't actually maintain. The more they've leaned on AI for routine work, the less prepared they may be for the non-routine.
Here's something the cognitive research doesn't fully capture but engineering leaders recognize intuitively: AI can generate code, but it cannot own code.
Ownership means accountability. It means being the person who gets paged at 2am when something breaks. It means explaining to stakeholders why a decision was made. It means carrying the context of why the system works the way it does, and what will happen if you change it.
AI is structurally incapable of ownership. It has no continuity between sessions. It carries no accountability. It cannot be held responsible when things go wrong.
This means that no matter how sophisticated the generation becomes, a human must own every line of code that reaches production. And ownership of code you didn't write, from a source that can't explain itself, is a fundamentally different job than ownership of code you created.
Here's where this connects back to the Lumberjack Model: each piece of AI-generated code a developer accepts is a small step toward the threshold. Any single acceptance seems fine — reasonable, even efficient. But cumulatively, they're taking ownership of a growing body of code they understand less deeply than code they wrote themselves.
The collapse doesn't announce itself. It builds invisibly, one accepted suggestion at a time, until the developer faces an edge case in a system they technically own but don't fully comprehend. The vigilance required to catch AI errors before they become production incidents — while maintaining genuine understanding of an increasingly AI-generated codebase — is the new constraint on developer productivity.
If this dynamic sounds familiar, it should. Similar patterns have played out in other fields.
Aviation's "Children of the Magenta." American Airlines captain Warren Vanderburgh coined the term in 1997 to describe pilots overly dependent on the magenta course lines on automated flight displays. His analysis found 68% of accidents and incidents studied linked to poor automation management.
The consequences were sometimes severe. Air France Flight 447 in 2009 involved pilots struggling to manually fly after automation disengaged. The Boeing 737 MAX accidents involved pilots overwhelmed trying to understand and override automated systems under time pressure.
A 2024 study of professional pilots found that higher automation increased flight performance and reduced mental workload on average — but also decreased vigilance to primary flight instruments. Pilots performed better in routine conditions while becoming less attuned to potential problems.
Medicine's EHR Burden. When electronic health records promised to reduce physician workload, something unexpected happened. A 2017 study found physicians spend 5.9 hours per day in EHR systems — two hours of screen time for every hour of direct patient care. "Pajama time" — charting after hours — consumes another 1.4 hours daily on average.
Physician burnout in primary care approaches 50%, with 75% of burned-out physicians identifying the EHR as a primary source. The technology didn't eliminate work; it transformed clinical judgment into documentation tasks that follow physicians home.
The pattern is consistent: automation that promises to reduce workload often shifts human effort from doing to supervising and validating. The work changes form rather than disappearing. And the new form can be more cognitively taxing than what it replaced.
The research suggests a practical ceiling on AI-augmented productivity, and it's not defined by what AI can do. It's defined by what humans can effectively supervise.
Studies show AI tools can accelerate specific coding tasks by 50-55%. But when you account for the fuller picture — reading every line, making major modifications more than half the time, catching the higher defect rate, maintaining understanding of code you didn't write — the net multiplier is more modest.
Factor in vigilance decrement (degrading attention after 15-40 minutes of sustained review), the Lumberjack threshold (where accumulated unfamiliarity becomes a liability), and the long-term maintenance burden of code you don't fully understand, and a picture emerges:
1.5-2x net productivity gains appear achievable with disciplined AI integration. This is meaningful — worth pursuing. But expectations of 4x or 10x output run into human cognitive constraints that the research suggests are fairly firm.
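To make the arithmetic concrete, here is a minimal, Amdahl's-law-style sketch of why headline speedups shrink: AI only accelerates the generation share of the work, and supervision adds effort back. The effort split and overhead figures below are illustrative assumptions, not numbers from the studies cited above.

```python
def net_multiplier(gen_share, gen_speedup, supervision_overhead):
    """Back-of-envelope net productivity multiplier (Amdahl's-law style).

    gen_share            -- fraction of total effort that is raw code generation (assumed)
    gen_speedup          -- how much faster AI makes that fraction, e.g. 2.0 = 2x (assumed)
    supervision_overhead -- review/verification effort added back, as a fraction
                            of the original total effort (assumed)
    """
    time_with_ai = gen_share / gen_speedup + (1 - gen_share) + supervision_overhead
    return 1 / time_with_ai

# Illustrative only: if 70% of effort is generation, AI doubles its speed, and
# supervising the output adds back 5% of the original effort, the net gain is
# ~1.4x, not 2x -- and even an infinitely fast generator caps out near 2.9x here.
print(round(net_multiplier(0.7, 2.0, 0.05), 2))  # 1.43
```

The specific numbers don't matter; what matters is that the unaccelerated work (design, debugging, coordination) plus the added supervision sets the ceiling, which is why realistic gains land closer to 1.5-2x than to 10x.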
The ceiling isn't the AI's capability. It's our capacity to instruct, verify, supervise, and own what the AI produces. Those limits are reasonably well understood in psychology and human factors research. They just haven't been widely applied to software development until now.
The encouraging news: fifty years of automation research hasn't just documented problems. It has also identified approaches that seem to work.
Preserve active engagement. The worst outcomes come from purely passive monitoring. Workflows that keep developers actively involved — iteratively steering AI rather than accepting wholesale output — can help maintain the cognitive engagement that prevents skill atrophy and supports sustained attention.
Design for calibrated trust. Trust in AI should roughly match actual reliability, which varies by task. AI that's effective for boilerplate may be less reliable for security-critical or architecturally novel code. Making AI confidence visible and helping developers recognize when to trust versus verify supports better judgment.
Measure what matters. Tracking lines of code or deployment frequency can miss important dynamics. Defect escape rates, time spent on rework, and leading indicators of cognitive load may be more informative. We're increasingly focused on these kinds of signals at Allstacks.
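As a starting point, the two simplest of these signals can be computed from data most teams already collect. Here is a minimal sketch with hypothetical numbers; the function names are illustrative, not an Allstacks API.

```python
def defect_escape_rate(escaped_to_production, caught_before_release):
    """Fraction of all known defects that slipped past review and testing."""
    total = escaped_to_production + caught_before_release
    return escaped_to_production / total if total else 0.0

def rework_ratio(rework_hours, total_dev_hours):
    """Fraction of development time spent revising recently shipped work."""
    return rework_hours / total_dev_hours if total_dev_hours else 0.0

# Hypothetical month: 6 defects escaped, 42 caught internally; 120 of 800 hours on rework.
print(f"escape rate: {defect_escape_rate(6, 42):.1%}")  # escape rate: 12.5%
print(f"rework time: {rework_ratio(120, 800):.1%}")     # rework time: 15.0%
```

Tracked over time, a rising escape rate or rework share alongside flat or growing output is the quantitative version of the perception gap described earlier: the team looks faster while it may be quietly getting slower.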
Maintain ownership depth. The developers who own code need to understand it well enough to maintain it under pressure. Deliberate practices — code review, pairing, occasional "manual mode" work — can help prevent the gradual expertise erosion that makes Lumberjack-style collapses more likely.
Watch for burnout signals. The exhaustion from sustained AI supervision is real but often hard to spot. The work seems easier on the surface, so developers may not recognize why they're tired. Tracking leading indicators of cognitive overload — not just lagging indicators like attrition — matters.
The promise of AI was leverage at scale. Write more code, ship faster, do more with less.
The reality is more nuanced. AI hasn't given us unlimited capacity. It's revealed a constraint that wasn't as visible before: the human limit on supervisory control.
Every developer just got a direct report. Managing that direct report — reading its output, catching its errors, maintaining genuine ownership of code you didn't write — is real work with real cognitive costs. Those costs don't always show up in productivity dashboards, but they show up in burnout surveys, in escaped defects, and eventually in the moments when someone has to debug a system they don't fully understand at 2am.
The teams that navigate this well won't be the ones chasing maximum AI output. They'll be the ones who understand the true cost of vigilance — and design their workflows around human cognitive limits, not just AI capabilities.
The direct report is here to stay. The question is whether we'll learn to manage it sustainably.
Jeremy Freeman is CTO at Allstacks, where he focuses on helping engineering teams understand the real dynamics of developer productivity — including the costs that don't show up in traditional metrics.