Is AI actually helping software development?
Every engineering leader asks this question within weeks of their team adopting AI tools. We asked it too. A year ago, our senior development team started experimenting with ChatGPT, then moved to IDEs like Windsurf and Cursor, and to CLI tools like Claude Code and Gemini. At first, we tracked the obvious things. Features shipped faster. Pull requests increased. Developers seemed more productive. But something felt off. We realized we were measuring the wrong thing. We fell into the trap of thinking software development is about writing code faster. It's not. What we do is design systems. We interpret business requirements and translate them into maintainable solutions. Code is just an artifact.
The real transformation started when we stopped trying to write code faster and started asking a different question: How do we work with AI as a team to build better systems? The answer wasn't about tools. It was about structure. Spec-driven development, test-driven development, behavior-driven development. Call it what you want. The point is we moved from "vibe coding" to deliberate design. This is the story of that shift. What we learned. What worked. What didn't. And why the question "will AI replace programmers" misses the point entirely.
"We realized we were measuring the wrong thing. We fell into the trap of thinking software development is about writing code faster. It's not. What we do is design systems."
Will AI replace programmers?
It's the question everyone's asking.
You've seen the headlines. The consensus seems clear: transformation, not replacement. Everyone agrees AI excels with clear instructions, experience becomes more valuable, and human creativity remains irreplaceable. These statements sound good. They're probably true. But they don't answer the real question.
What does "transformation" actually look like? What are "clear instructions" in practice? No one shows you the operational reality. Everyone describes the destination but skips the journey.
We spent twelve months figuring out our journey with AI in team software development, moving from individuals experimenting with ChatGPT to a team collaborating with CLI tools and finally to a structured approach built on specifications. Here's what we learned: Engineers design systems. AI implements them. Tests and specifications are the contract between the two. This isn't theory. It's how we work now.
The Journey
The journey from individual AI experiments to team-wide structured collaboration took us through three phases.
Phase 1: Individual Exploration
Developers began using ChatGPT on their own for one-off problems. The benefits were immediate. Faster solutions, less Stack Overflow searching. But the limitations became obvious. Quality was inconsistent. There was no team process. No shared knowledge.
During this time, I started using tools like Grammarly. It showed me how often I'd write "this" without clarifying what "this" referred to, and how poorly I communicated context to my teammates. The same vagueness appeared in AI prompts and produced vague code. AI could accelerate individual work, but individual acceleration doesn't make a team faster. And unclear communication (the vagueness in how we articulated requirements and specifications to each other) produces unclear results. This phase revealed something unexpected: communication quality directly affected AI output quality.
Phase 2: Team Integration
We adopted AI tools across the team. The goal was consistency. We stopped treating AI as a personal productivity hack and started treating it as a shared infrastructure. We standardized it as part of our toolchain and workflow. Claude Code and Gemini became our standard for code generation, refactoring, and writing tests. The scope expanded from simple assistance to iterative problem solving.
At first, this standardization improved our clarity. We found that explaining a feature to the AI acted as a forcing function: you can't 'hand-wave' requirements to a machine. When you have to articulate the business logic clearly enough for an LLM, you inevitably discover gaps in your own understanding. For isolated features, this rigor was a massive win.
But as we scaled, problems emerged. AI lost context on larger features. Coherence broke down across multiple files. Team alignment suffered because different developers gave AI different interpretations of the same feature.
We tried creating folders like .ai-brain or .memory to save context between sessions. It didn't work. LLMs don't understand when to reference saved context or how to prioritize it. The pattern made sense to humans but not to the models.
Ad-hoc prompting worked for small tasks but fell apart for anything larger. A 2025 MIT study confirmed this: AI handles simple programming exercises well but struggles with industry-scale codebases, proprietary conventions, and the maintenance work that defines real software engineering. We were hitting those limits.
Phase 3: Systematization
The insight came from an old idea: spec-driven development. The concept was designed to solve exactly our problem: maintaining context and coherence on large projects.
The answer isn't storing context in folders. It's embedding context in the artifacts themselves. Tests are specifications. When you write tests that describe expected behavior, you're giving AI everything it needs.
Write the spec. Write the tests that verify the spec. Hand those to AI as context. Let AI generate the implementation. The tests verify correctness automatically.
This solved our problems. Context improved because specs provided structure. Quality verification became built-in through tests. Team alignment happened because we all worked from the same specifications. It scaled because the approach was systematic.
Our team’s ability to write better specs and tests has become critical. The industry consensus says AI excels with clear instructions. We figured out what that meant operationally: specs and tests are the clear instructions.
"AI doesn't make communication less important. It makes it more important."
What We Didn't Expect
Communication Became the Bottleneck
We expected code quality to be the challenge. We didn't expect communication to become the limiting factor.
The precision required for AI didn't just stay in the terminal; it bled into our human interactions. You can't write a good spec without defining exactly what should happen. As we forced ourselves to be precise for the machines, we realized how vague we had been with each other.
The impact spread beyond our team. Use cases became exercises to get deterministic outcomes for tests and code. The pipeline works like this: written business requirements become specs, acceptance tests, and unit tests. Better specs and tests allow AI to iterate more efficiently. But it starts with writing. AI doesn't make communication less important. It makes it more important.
The Case for Spec-Driven Development
Our experience aligned perfectly with the 2025 MIT study on AI limitations. They found that while AI excels at "undergrad programming exercises," it fails at the "maintenance grind" of refactoring, migrations, and system coherence.
This validates David Farley's argument that software engineering is fundamentally design work, while coding is just the implementation details.[1]
When we treat engineering as design, the roles become clear. The human element isn't syntax; it's problem understanding, trade-off evaluation, and architectural decision-making. We design; AI builds. This is why the "replacement" fear is misplaced: it assumes the value is in the bricks, not the blueprint.
This brings us back to an old practice with renewed purpose: spec-driven development. The approach fell out of favor in the 1990s (too much overhead, not agile enough), but AI has brought back exactly the problems it was designed to solve. By using tests as specifications, we finally have a scalable way to give AI the clear instructions it requires.
How It Works in Practice
The workflow is simple but disciplined:
Define the Behavior (The Design): We don't open a code editor. We write a scenario describing exactly how a feature should behave. For example: "When a user on a Basic Plan tries to upload a 5th file, they should see a limit reached error." This is human design work.
Generate the Test (The Contract): We don't hand-write every line of test boilerplate. We feed that behavior definition to the AI and ask it to write the failing test case (e.g., in Jest or Vitest). We review this test carefully. This code is the rigid contract the implementation must satisfy.
Implement from the Test (The Build): We hand the test file back to the AI with a simple instruction: "Write the implementation to make this test pass." The AI has perfect context because the test defines the boundaries.
Verify and Refine (The Feedback Loop): We run the test. If it fails, we don't fix the code manually. We show the error to the AI. If the AI struggles, it usually means our original behavior definition (Step 1) was ambiguous. We refine the spec, not the code.
The virtuous cycle: better specs lead to better AI output, which leads to better tests, which leads to better understanding of requirements, which leads to better specs. Engineers became incentivized to write better specifications, not as overhead, but as the primary value they contribute.
Benefits We've Measured
We evaluate our success using the four DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, Time to Restore), the industry standard for elite performance. While we haven't reached "Elite" status across the board yet, the trajectory is undeniable.
Lead Time for Changes (Throughput): This saw the most dramatic shift. Because implementation is automated, the time from "spec approval" to "PR ready" collapsed. In one specific experiment, a complex migration task estimated at 30 developer-days was completed in just 4 days using our spec-driven AI workflow.
Change Failure Rate (Stability): This improved, but not for the reason we expected. It wasn't just that the code was better; it was that we suddenly had a massive suite of automated acceptance regression tests as a side effect of our process. Since every feature starts with a test spec, our regression coverage grew automatically with every ticket. We catch regressions before merge, not after deploy.
Deployment Frequency: With the confidence provided by that automated regression suite, we stopped fearing Friday deploys. We aren't deploying multiple times a day yet, but our release cadence has stabilized from "whenever we feel safe" to a predictable rhythm.
Team Satisfaction: Beyond the metrics, the vibe changed. Developers spent more time on interesting problems instead of boilerplate code. Junior developers ramped faster because specs provided clear learning materials. We heard less frustration about ambiguous requirements because the spec-writing process forced us to clarify them before we ever touched the code. Instead of discovering gaps halfway through implementation, we solved them during the design phase.
A Note on Junior Developers
We often worry that AI will stunt junior growth by removing "learning opportunities." We found that comprehensive specifications acted as accelerated learning material.
Because every feature had a clear, senior-written specification and a corresponding test suite, junior developers could understand the intent of the system immediately. They spent less time deciphering "what was this supposed to do?" and more time reviewing the architecture. The specs became living documentation that made onboarding significantly faster.
The Limits of This Approach
Spec-driven development with AI isn't magic. It doesn't solve everything.
Refactoring: AI still loses context when moving code across 50 files. This needs human oversight.
Novel Algorithms: If the problem hasn't been solved a thousand times on GitHub, AI will struggle. This needs human creativity.
Legacy Sprawl: It works best on new features or isolated components. It struggles to "grok" a 10-year-old monolith.
We didn't eliminate these limitations; we just built a process that stops them from breaking production.
"We often worry that AI will stunt junior growth by removing 'learning opportunities.' We found that comprehensive specifications acted as accelerated learning material."
Practical Takeaways for Engineering Leaders
If you're considering AI adoption or struggling with how to scale it, here's what matters.
Measure Stability and Throughput, Not Velocity
Velocity metrics are seductive but misleading. Lines of code, pull requests, and raw features per sprint might indeed go up with AI, but those numbers can mask problems.
What matters is stability and throughput. Can you deploy reliably without breaking things? Can you deliver value more frequently without sacrificing quality? Track deployment frequency, incident rate, and cycle time.
The most important measure is harder to track: Do developers spend time on work that matters? Are they designing systems or churning out code?
Start Small and Learn
Don't roll this out everywhere at once. Pilot with one team or project. Learn what specifications work best, what level of detail AI needs, and what tests catch problems. The goal isn't perfect specs on day one; it's learning what good enough looks like for your context.
Invest in Specification Skills and Culture
Most developers learned to write code, not specifications. You cannot just tell them "engineering is design" and expect them to change. You have to actively train them and reshape the culture.
We stopped doing traditional "Code Reviews" and started doing "Spec Reviews" first. Before a single line of implementation code is written, the team (including Product and QA) reviews the Gherkin scenarios, the detailed test plan, or the architectural blueprint. If the spec is ambiguous or incomplete, the ticket is blocked and sent back for clarification. This explicit process change (reviewing the design before the code) actively reinforced the cultural shift and the value of clear specification more effectively than any all-hands meeting ever could.
Where We're Headed
This is our journey. Yours will be different. What we learned is that the transformation isn't about tools. It's about understanding what engineering actually is: design work, problem understanding, requirement articulation. Code is the byproduct.
The field is evolving fast. No one has all the answers. Sharing what we learn makes everyone better. This post is our contribution to that conversation.
What's yours?
References
[1] David Farley, Modern Software Engineering: Raising the Bar (Addison-Wesley Professional, 2021).