June 24, 2026

The Terminal is No Longer for Commands: Building the Agentic Developer Stack


The software development world is undergoing its biggest transformation since the introduction of modern IDEs. For decades, the terminal served a simple purpose: execute commands and return results. Developers wrote code, ran commands, reviewed outputs, and manually orchestrated every step of the software delivery lifecycle.

That model is rapidly changing.

In this episode, we explore how AI agents, agentic shells, Copilot CLI, coding agents, modernization systems, and autonomous code review are transforming the terminal into the central orchestration layer of software engineering. Instead of manually executing commands, developers are increasingly defining intent while intelligent systems plan, execute, validate, and refine work autonomously.

This episode provides a comprehensive deep dive into the emerging Agentic Developer Stack and explains why the future of software engineering will be driven by orchestration, context engineering, validation systems, and AI-powered execution layers.

WHY THE TRADITIONAL DEVELOPER WORKFLOW IS BREAKING

For years, software development followed a predictable pattern. Developers wrote code, reviewers reviewed pull requests, CI/CD pipelines executed builds, and deployment processes remained largely manual.

While AI assistants improved code generation inside editors, the execution layer remained unchanged.

In this section we discuss:
• Why AI-assisted coding only solved part of the productivity challenge
• The hidden bottlenecks inside code reviews and deployment pipelines
• How technical debt accumulates in execution workflows
• Why modernization projects often fail before reaching production
• The difference between optimizing thinking versus optimizing execution

THE SHIFT FROM TOOLS TO AGENTS

There is a fundamental difference between software tools and software agents.

Traditional tools respond to prompts. Agents pursue goals.

Modern AI agents understand intent, create plans, execute actions, validate results, adapt to failures, and continue operating within predefined policies and constraints.

Topics covered include:
• Agent-based development workflows
• Goal-oriented software execution
• Autonomous decision making inside development environments
• Policy-driven engineering systems
• The evolution of GitHub Copilot and Copilot CLI

WHY THE TERMINAL BECAME THE CENTER OF GRAVITY

Developers spend much of their day inside terminals running Git commands, troubleshooting deployments, managing infrastructure, and validating systems.

The terminal is where ideas become actions.

We discuss how modern agentic shells transform the terminal from a simple command interface into an intelligent orchestration layer capable of planning and executing entire development workflows.
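
The plan, execute, and adapt cycle that the episode attributes to agentic shells can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_goal`, `StepResult`, `AgentRun`, and the three callbacks are hypothetical names, not part of Copilot CLI or any real shell, and a real agent would replace the stub callbacks with model calls and actual command execution.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    ok: bool
    output: str


@dataclass
class AgentRun:
    goal: str
    attempts: int = 0
    succeeded: bool = False
    log: List[str] = field(default_factory=list)


def run_goal(
    goal: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], StepResult],
    revise: Callable[[List[str], str, StepResult], List[str]],
    max_attempts: int = 3,
) -> AgentRun:
    """Plan once, then execute; on a failed step, revise the plan and retry."""
    run = AgentRun(goal=goal)
    steps = plan(goal)
    while run.attempts < max_attempts:
        run.attempts += 1
        failed = None
        for step in steps:
            result = execute(step)
            status = "ok" if result.ok else "failed"
            run.log.append(f"attempt {run.attempts}: {step} -> {status}")
            if not result.ok:
                failed = (step, result)
                break
        if failed is None:
            run.succeeded = True
            return run
        # Adapt: let the (hypothetical) reviser change the plan using the failure.
        steps = revise(steps, failed[0], failed[1])
    return run


def demo() -> AgentRun:
    """Stubbed run: the tests fail until an 'apply fix' step has executed."""
    state = {"fixed": False}

    def plan(goal: str) -> List[str]:
        return ["edit endpoint", "run tests"]

    def execute(step: str) -> StepResult:
        if step == "apply fix":
            state["fixed"] = True
        if step == "run tests" and not state["fixed"]:
            return StepResult(False, "1 test failed")
        return StepResult(True, "done")

    def revise(steps: List[str], failed_step: str, result: StepResult) -> List[str]:
        i = steps.index(failed_step)
        return steps[:i] + ["apply fix"] + steps[i:]

    return run_goal("add pagination to the users endpoint", plan, execute, revise)
```

Calling `demo()` runs two attempts: the first stops at the failing test step, the second inserts a fix before the tests and completes. The `max_attempts` cap is one simple way to keep the loop bounded.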

THE FOUR LAYERS OF THE AGENTIC DEVELOPER STACK

The Agentic Developer Stack is built upon four interconnected layers:

Orchestration Layer
This layer translates human intent into executable workflows through agentic shells and AI-powered command-line interfaces.

Transformation Layer
Modernization agents analyze legacy applications, extract business logic, and rebuild systems using modern architectures and frameworks.

Validation Layer
Code Review Agents continuously enforce architecture, security standards, testing requirements, and engineering best practices.

Execution Layer
Cloud-hosted Coding Agents perform implementations, execute test suites, run security scans, create pull requests, and manage delivery workflows.

Together these layers form a feedback-driven software delivery system where humans supervise policy while agents execute implementation.
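
One way to picture the handoffs between the four layers is as a small routing table. The sketch below is an assumption-heavy illustration: the names `Layer`, `Handoff`, and `next_layer` are hypothetical, and the forward order simply follows the order in which the layers are listed above. The two feedback edges (validation back to transformation, execution back to orchestration) are the ones mentioned in the episode; retrying in place for other failures is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    ORCHESTRATION = "orchestration"
    TRANSFORMATION = "transformation"
    VALIDATION = "validation"
    EXECUTION = "execution"


@dataclass(frozen=True)
class Handoff:
    source: Layer   # layer that just finished its work
    ok: bool        # whether that layer accepted the work
    note: str = ""  # free-form explanation passed along with the work


# Assumed forward order: the order the layers are listed in the show notes.
FORWARD = {
    Layer.ORCHESTRATION: Layer.TRANSFORMATION,
    Layer.TRANSFORMATION: Layer.VALIDATION,
    Layer.VALIDATION: Layer.EXECUTION,
    Layer.EXECUTION: None,  # None means the work is done
}

# Feedback edges described in the episode.
FEEDBACK = {
    Layer.VALIDATION: Layer.TRANSFORMATION,
    Layer.EXECUTION: Layer.ORCHESTRATION,
}


def next_layer(handoff: Handoff) -> Optional[Layer]:
    """Return the layer that should receive the work next, or None when done."""
    if handoff.ok:
        return FORWARD[handoff.source]
    # Assumption: layers without a feedback edge retry their own work.
    return FEEDBACK.get(handoff.source, handoff.source)
```

For example, `next_layer(Handoff(Layer.VALIDATION, ok=False, note="layer boundary crossed"))` routes the work back to the transformation layer instead of forward to execution.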

CONTEXT ENGINEERING AND PROJECT MEMORY

One of the most overlooked aspects of successful AI adoption is context.

Most organizations fail because they expect agents to understand their systems automatically.

Successful teams build:
• Architecture documentation
• Domain glossaries
• Pattern libraries
• Architectural Decision Records (ADRs)
• Living project memory systems

The episode explains why context engineering is becoming one of the most valuable skills in modern software organizations.

CODE REVIEW AGENTS AND ARCHITECTURAL ENFORCEMENT

Modern review systems are evolving beyond linting and static analysis.

Today's AI review agents understand:
• Software architecture
• Security boundaries
• Design principles
• Performance implications
• Multi-file dependency relationships

Learn how AI-driven validation systems are changing code quality and enabling organizations to scale development velocity without sacrificing governance.

THE RUBBER DUCK PROTOCOL AND CROSS-MODEL REVIEW

One of the most fascinating concepts discussed in this episode is cross-model validation.

Instead of relying on a single AI model, organizations are increasingly combining different model families to review each other's work.

This approach:
• Reduces blind spots
• Improves architectural reasoning
• Increases implementation quality
• Lowers overall AI costs
• Produces more reliable engineering outcomes

We explore how reviewer models challenge assumptions, uncover hidden risks, and improve implementation accuracy.
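
As a rough sketch of cross-model review, the loop below lets one model draft and a second model critique until the reviewer has no objections or a round limit is reached. Everything here is hypothetical: `cross_model_review` and the `Author` and `Reviewer` callables are illustrative stand-ins for calls to two different model families, and the stubs only imitate that exchange.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: an author drafts from the task plus prior feedback,
# a reviewer returns a list of objections (empty list means "approved").
Author = Callable[[str, List[str]], str]
Reviewer = Callable[[str, str], List[str]]


def cross_model_review(
    task: str,
    author: Author,
    reviewer: Reviewer,
    max_rounds: int = 3,
) -> Tuple[str, List[str], int]:
    """Return (final draft, unresolved objections, rounds used)."""
    feedback: List[str] = []
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = author(task, feedback)
        feedback = reviewer(task, draft)
        if not feedback:
            return draft, [], round_no
    # Round limit reached: hand the open objections to a human.
    return draft, feedback, max_rounds


def stub_author(task: str, feedback: List[str]) -> str:
    # Stand-in for model A: adds a limit only after being challenged.
    return "list_users(limit=50)" if feedback else "list_users()"


def stub_reviewer(task: str, draft: str) -> List[str]:
    # Stand-in for model B: objects to unbounded queries.
    return [] if "limit=" in draft else ["unbounded query: add a limit"]


result = cross_model_review("add pagination", stub_author, stub_reviewer)
```

With these stubs the reviewer rejects the first draft and accepts the second, so `result` holds the revised draft, no open objections, and a round count of 2.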

MODERNIZATION AGENTS AND LEGACY TRANSFORMATION

Legacy modernization remains one of the most expensive challenges facing enterprise organizations.

In this section we explore how AI-powered modernization agents:
• Analyze complex legacy systems
• Discover hidden business rules
• Map dependencies automatically
• Generate migration documentation
• Refactor systems incrementally

Learn why successful modernization depends more on context than model size.

SAFETY, GUARDRAILS, AND BOUNDED AUTONOMY

Autonomous systems require boundaries.

The episode explores how organizations can safely deploy AI agents using:
• Permission guardrails
• Policy constraints
• Validation gates
• Human approvals
• Sandboxed execution environments

These controls allow agents to move quickly while protecting production systems and critical business processes.
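
The controls listed above can be combined into a single policy check that runs before an agent acts. The following is a minimal sketch, not any product's actual configuration format: `Policy`, `Action`, `evaluate`, the action kinds, and the example path patterns are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Tuple


@dataclass(frozen=True)
class Policy:
    protected_paths: Tuple[str, ...]    # glob patterns the agent may not edit
    approval_required: Tuple[str, ...]  # action kinds that need a human sign-off
    sandbox_only: bool = True           # refuse anything outside the sandbox


@dataclass(frozen=True)
class Action:
    kind: str                # hypothetical kinds: "edit", "run", "deploy"
    target: str              # file path, command, or environment name
    in_sandbox: bool = True


def evaluate(policy: Policy, action: Action) -> str:
    """Return "allow", "needs_approval", or "deny" for a proposed action."""
    # Sandboxed execution: nothing runs outside the sandbox.
    if policy.sandbox_only and not action.in_sandbox:
        return "deny"
    # Permission guardrail: protected files are off limits for edits.
    if action.kind == "edit" and any(
        fnmatchcase(action.target, pattern) for pattern in policy.protected_paths
    ):
        return "deny"
    # Human approval gate for sensitive action kinds.
    if action.kind in policy.approval_required:
        return "needs_approval"
    return "allow"


policy = Policy(
    protected_paths=("infra/prod/*", "*.pem"),
    approval_required=("deploy",),
)
```

With this example policy, editing `src/api/users.py` is allowed, editing `infra/prod/main.tf` is denied, and a deploy to staging is held for approval. Checking the sandbox rule first means a deny always wins over an approval request.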

THE FUTURE OF SOFTWARE ENGINEERING

The biggest takeaway from this conversation is simple:

Software development is shifting from command execution to workflow orchestration.

Developers are evolving from implementation specialists into architects of intent, reviewers of outcomes, and designers of policy.

Organizations that understand this transition early will gain significant advantages in speed, quality, modernization efforts, and engineering scalability.

The terminal is no longer where commands are executed.

It is becoming the operating system for autonomous software delivery.

KEY TAKEAWAYS

• AI agents are transforming software delivery workflows
• The terminal is evolving into an orchestration platform
• Context engineering is becoming a critical engineering discipline
• Agentic systems require strong validation and governance
• Cross-model review improves software quality and reliability
• The future developer manages intent and policy rather than individual implementation details

Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.


1
00:00:00,000 --> 00:00:02,260
The terminal hasn't changed in 40 years.

2
00:00:02,260 --> 00:00:04,800
You type a command, the system runs it, you're done.

3
00:00:04,800 --> 00:00:07,000
It's a straight line from what you want to what happens.

4
00:00:07,000 --> 00:00:08,520
That's the contract we've all lived with.

5
00:00:08,520 --> 00:00:10,040
Type something, get a result.

6
00:00:10,040 --> 00:00:13,720
But something fundamental shifted between 2024 and 2025.

7
00:00:13,720 --> 00:00:15,880
It isn't about what the terminal can do technically.

8
00:00:15,880 --> 00:00:17,280
It's about what it's becoming.

9
00:00:17,280 --> 00:00:19,600
The terminal is no longer just a command dispatcher.

10
00:00:19,600 --> 00:00:22,360
It's turning into an orchestration layer, a policy shell,

11
00:00:22,360 --> 00:00:25,680
a place where AI agents plan, execute, and adapt in real time.

12
00:00:25,680 --> 00:00:27,400
This isn't about faster autocomplete

13
00:00:27,400 --> 00:00:29,160
or better suggestions in your editor.

14
00:00:29,160 --> 00:00:32,520
This is about reframing how software gets built and reviewed

15
00:00:32,520 --> 00:00:35,000
from the ground up, because the teams that still treat

16
00:00:35,000 --> 00:00:37,320
the terminal as just where you type commands

17
00:00:37,320 --> 00:00:39,040
are about to get left behind.

18
00:00:39,040 --> 00:00:41,520
The structural flaw in how we build software.

19
00:00:41,520 --> 00:00:43,920
For decades, the developer workflow was predictable.

20
00:00:43,920 --> 00:00:44,760
You wrote code.

21
00:00:44,760 --> 00:00:46,520
Someone reviewed it, you deployed it.

22
00:00:46,520 --> 00:00:47,840
Each step was human driven.

23
00:00:47,840 --> 00:00:49,280
Each step was a decision point.

24
00:00:49,280 --> 00:00:50,960
You moved through them one by one.

25
00:00:50,960 --> 00:00:52,480
Then AI arrived in the editor.

26
00:00:52,480 --> 00:00:54,720
We got Copilot Chat and inline suggestions

27
00:00:54,720 --> 00:00:56,160
that actually understand context.

28
00:00:56,160 --> 00:00:58,960
Productivity went up and developers started shipping features

29
00:00:58,960 --> 00:01:01,480
faster because the editor finally got smarter.

30
00:01:01,480 --> 00:01:03,960
But here's the structural flaw that nobody talks about.

31
00:01:03,960 --> 00:01:06,200
We're still treating AI like a tool, not a system.

32
00:01:06,200 --> 00:01:07,800
A tool just responds to your input.

33
00:01:07,800 --> 00:01:09,640
You ask a question, it gives an answer,

34
00:01:09,640 --> 00:01:11,360
and then you decide what to do next.

35
00:01:11,360 --> 00:01:12,200
You stay in control.

36
00:01:12,200 --> 00:01:14,480
You make the calls. That works for autocomplete.

37
00:01:14,480 --> 00:01:15,600
But it breaks everywhere else.

38
00:01:15,600 --> 00:01:17,720
The real bottlenecks aren't hiding in the editor.

39
00:01:17,720 --> 00:01:18,640
They're in the terminal.

40
00:01:18,640 --> 00:01:20,920
They're in the CI pipeline and the code review queue

41
00:01:20,920 --> 00:01:22,160
that's three weeks deep.

42
00:01:22,160 --> 00:01:23,600
They're in the modernization graveyard

43
00:01:23,600 --> 00:01:25,800
where legacy code stays legacy forever.

44
00:01:25,800 --> 00:01:28,080
These are the places where intent meets execution.

45
00:01:28,080 --> 00:01:30,120
This is where a good idea becomes a broken deploy

46
00:01:30,120 --> 00:01:32,920
or a refactor introduces coupling that nobody saw coming.

47
00:01:32,920 --> 00:01:35,480
Technical debt piles up quietly in these corners

48
00:01:35,480 --> 00:01:38,400
until the entire code base becomes impossible to maintain.

49
00:01:38,400 --> 00:01:41,080
The structural mistake we made was optimizing the thinking

50
00:01:41,080 --> 00:01:43,560
layer while ignoring the execution layer.

51
00:01:43,560 --> 00:01:45,280
We made the editor smarter and improved

52
00:01:45,280 --> 00:01:47,640
how developers write code, which definitely matters,

53
00:01:47,640 --> 00:01:50,080
but we left everything else exactly as it was.

54
00:01:50,080 --> 00:01:52,480
The terminal is still just a command dispatcher.

55
00:01:52,480 --> 00:01:54,640
The pipeline is still just a sequence of scripts.

56
00:01:54,640 --> 00:01:57,840
The review process is still just humans reading code line by line,

57
00:01:57,840 --> 00:02:00,400
hoping they catch issues before they hit production.

58
00:02:00,400 --> 00:02:03,040
Modernization is still a manual ordeal that takes months

59
00:02:03,040 --> 00:02:04,760
because someone has to figure out the old system

60
00:02:04,760 --> 00:02:05,960
and hope they don't break it.

61
00:02:05,960 --> 00:02:07,760
The orchestration layer never got smart.

62
00:02:07,760 --> 00:02:09,240
We optimized for thinking.

63
00:02:09,240 --> 00:02:10,840
We ignored execution.

64
00:02:10,840 --> 00:02:13,960
And this is the shift, the shift from tools to agents.

65
00:02:13,960 --> 00:02:16,280
There is a category difference between a tool and an agent.

66
00:02:16,280 --> 00:02:17,720
The tool responds to your input.

67
00:02:17,720 --> 00:02:19,600
You ask it something and it gives you an answer.

68
00:02:19,600 --> 00:02:22,280
And then you decide what happens next.

69
00:02:22,280 --> 00:02:23,120
You stay in the loop.

70
00:02:23,120 --> 00:02:24,240
You keep control.

71
00:02:24,240 --> 00:02:26,480
An agent is different. An agent has a goal.

72
00:02:26,480 --> 00:02:29,080
Within the constraints you set, it pursues that goal.

73
00:02:29,080 --> 00:02:30,040
It makes decisions.

74
00:02:30,040 --> 00:02:30,920
It takes actions.

75
00:02:30,920 --> 00:02:33,120
It observes the results and then it adapts.

76
00:02:33,120 --> 00:02:34,360
You don't direct every step.

77
00:02:34,360 --> 00:02:35,480
You set the policy.

78
00:02:35,480 --> 00:02:37,000
And the agent operates within it.

79
00:02:37,000 --> 00:02:38,800
That is the shift happening right now.

80
00:02:38,800 --> 00:02:40,120
Copilot CLI is an agent.

81
00:02:40,120 --> 00:02:41,760
You describe a task in natural language

82
00:02:41,760 --> 00:02:44,440
like asking it to add pagination to an API endpoint

83
00:02:44,440 --> 00:02:46,440
or refactor a controller for clarity.

84
00:02:46,440 --> 00:02:47,440
That isn't a question.

85
00:02:47,440 --> 00:02:48,680
It's an intent statement.

86
00:02:48,680 --> 00:02:50,840
The agent reads it and understands the code base.

87
00:02:50,840 --> 00:02:53,200
It sees the structure, the patterns, and the dependencies.

88
00:02:53,200 --> 00:02:55,840
It plans the steps and executes them in sequence.

89
00:02:55,840 --> 00:02:58,600
If a test fails, it doesn't stop and ask you what to do.

90
00:02:58,600 --> 00:03:01,320
It analyzes the failure to understand why the test broke,

91
00:03:01,320 --> 00:03:03,280
modifies the implementation and tries again.

92
00:03:03,280 --> 00:03:05,800
When the work is finished, it reports back on what it built

93
00:03:05,800 --> 00:03:06,840
and what changed.

94
00:03:06,840 --> 00:03:08,200
That isn't a tool assisting you.

95
00:03:08,200 --> 00:03:09,760
That is an agent completing work.

96
00:03:09,760 --> 00:03:11,440
Copilot code review is an agent.

97
00:03:11,440 --> 00:03:14,560
It isn't just flagging syntax errors or running a linter.

98
00:03:14,560 --> 00:03:15,800
Those are tools.

99
00:03:15,800 --> 00:03:18,520
The agent reads a PR and understands the architecture.

100
00:03:18,520 --> 00:03:21,280
It knows where the boundaries are and sees when code crosses

101
00:03:21,280 --> 00:03:22,400
into another layer.

102
00:03:22,400 --> 00:03:25,440
It reasons about coupling and checks against SOLID principles.

103
00:03:25,440 --> 00:03:29,000
It looks at multiple files instead of seeing one change in isolation.

104
00:03:29,000 --> 00:03:30,800
It catches the architectural problems

105
00:03:30,800 --> 00:03:33,440
that a human reviewer would miss because they are tired.

106
00:03:33,440 --> 00:03:35,560
And this is the 10th PR they have read today.

107
00:03:35,560 --> 00:03:39,880
A tool flags problems, an agent understands context and prevents them.

108
00:03:39,880 --> 00:03:42,080
Modernization agents work the same way.

109
00:03:42,080 --> 00:03:44,400
Legacy code is often buried in business logic

110
00:03:44,400 --> 00:03:47,200
that spans dozens of functions across multiple files.

111
00:03:47,200 --> 00:03:49,840
A tool would find old patterns and suggest replacements.

112
00:03:49,840 --> 00:03:51,160
An agent does something different.

113
00:03:51,160 --> 00:03:54,200
It reads the entire system and builds a map of what does what.

114
00:03:54,200 --> 00:03:56,880
It extracts the business logic and identifies the invariants

115
00:03:56,880 --> 00:03:58,200
that must never change.

116
00:03:58,200 --> 00:04:00,760
Then it re-implements that logic in a modern stack.

117
00:04:00,760 --> 00:04:02,880
It uses modern code and modern patterns

118
00:04:02,880 --> 00:04:04,960
while preserving every behavior and every contract

119
00:04:04,960 --> 00:04:06,640
the system made to the outside world.

120
00:04:06,640 --> 00:04:07,880
That isn't find and replace.

121
00:04:07,880 --> 00:04:10,040
That is deep understanding followed by transformation.

122
00:04:10,040 --> 00:04:13,440
So what's actually happening when you shift from tools to agents?

123
00:04:13,440 --> 00:04:14,840
With a tool you stay in the loop.

124
00:04:14,840 --> 00:04:15,840
You make the decisions.

125
00:04:15,840 --> 00:04:17,280
The tool assists.

126
00:04:17,280 --> 00:04:19,720
You write code and the tool suggests the next line.

127
00:04:19,720 --> 00:04:21,960
You review the suggestion and you accept or reject it.

128
00:04:21,960 --> 00:04:24,400
You are constantly deciding and constantly guiding.

129
00:04:24,400 --> 00:04:26,400
With an agent, the agent stays in the loop.

130
00:04:26,400 --> 00:04:28,440
It makes decisions within your policy.

131
00:04:28,440 --> 00:04:29,720
You don't direct every step.

132
00:04:29,720 --> 00:04:32,320
You set the constraints and define what success looks like.

133
00:04:32,320 --> 00:04:34,960
You say what patterns to follow and what rules to obey.

134
00:04:34,960 --> 00:04:36,040
Then the agent operates.

135
00:04:36,040 --> 00:04:38,760
It makes thousands of micro decisions about dependencies,

136
00:04:38,760 --> 00:04:40,520
variable names and function structures.

137
00:04:40,520 --> 00:04:42,520
It doesn't ask you about every single choice

138
00:04:42,520 --> 00:04:44,600
because it operates within the policy you set.

139
00:04:44,600 --> 00:04:45,320
You supervise.

140
00:04:45,320 --> 00:04:46,320
You don't direct.

141
00:04:46,320 --> 00:04:47,400
That isn't a small difference.

142
00:04:47,400 --> 00:04:49,560
It's a reorganization of how software gets built.

143
00:04:49,560 --> 00:04:52,120
When you have 100 developers and one code review tool,

144
00:04:52,120 --> 00:04:53,280
the tool helps a little.

145
00:04:53,280 --> 00:04:56,160
But when an agent reviews every PR before a human ever sees it,

146
00:04:56,160 --> 00:04:58,720
the quality bar for the entire organization changes.

147
00:04:58,720 --> 00:05:00,320
It enforces architecture constantly.

148
00:05:00,320 --> 00:05:02,480
It prevents mistakes before they happen.

149
00:05:02,480 --> 00:05:03,600
It raises the baseline.

150
00:05:03,600 --> 00:05:04,640
It isn't just assisting.

151
00:05:04,640 --> 00:05:08,040
It is actively shaping how work flows through the system.

152
00:05:08,040 --> 00:05:10,880
And the terminal is where this reorganization is most visible

153
00:05:10,880 --> 00:05:13,480
because the terminal is where the real work happens.

154
00:05:13,480 --> 00:05:15,760
Why the terminal became the center of gravity.

155
00:05:15,760 --> 00:05:17,840
The terminal is where developers already live,

156
00:05:17,840 --> 00:05:18,880
not where they visit,

157
00:05:18,880 --> 00:05:20,920
where they live every single day.

158
00:05:20,920 --> 00:05:23,400
In Git commands, in CI/CD debugging

159
00:05:23,400 --> 00:05:24,960
when a build fails at 3am,

160
00:05:24,960 --> 00:05:26,880
in infrastructure management and log analysis.

161
00:05:26,880 --> 00:05:28,960
The editor is where you think. You are composing

162
00:05:28,960 --> 00:05:30,120
and drafting ideas.

163
00:05:30,120 --> 00:05:31,480
But the terminal is execution.

164
00:05:31,480 --> 00:05:32,840
It's where thought becomes action.

165
00:05:32,840 --> 00:05:33,920
But here's the problem.

166
00:05:33,920 --> 00:05:36,520
For years, we only optimized the thinking layer.

167
00:05:36,520 --> 00:05:39,240
We built better autocomplete and smarter code completion.

168
00:05:39,240 --> 00:05:40,640
We made the editor intelligent

169
00:05:40,640 --> 00:05:42,120
because that is where ideas form.

170
00:05:42,120 --> 00:05:44,640
We assumed that faster thinking meant faster writing.

171
00:05:44,640 --> 00:05:46,480
But we ignored the execution layer.

172
00:05:46,480 --> 00:05:47,640
The terminal stayed the same.

173
00:05:47,640 --> 00:05:49,200
It was just a command dispatcher.

174
00:05:49,200 --> 00:05:51,080
You had the knowledge about what to run,

175
00:05:51,080 --> 00:05:52,760
you typed it and it ran.

176
00:05:52,760 --> 00:05:55,640
If something broke, you read the error and figured it out yourself.

177
00:05:55,640 --> 00:05:57,000
The terminal didn't get smarter.

178
00:05:57,000 --> 00:05:57,960
It just got faster.

179
00:05:57,960 --> 00:05:59,560
That imbalance is what we are about to fix.

180
00:05:59,560 --> 00:06:02,120
With agentic shells, the terminal has become intelligent.

181
00:06:02,120 --> 00:06:03,480
It isn't just a dispatcher anymore.

182
00:06:03,480 --> 00:06:04,680
It understands intent.

183
00:06:04,680 --> 00:06:06,280
You don't describe the exact command.

184
00:06:06,280 --> 00:06:07,520
You describe the goal.

185
00:06:07,520 --> 00:06:09,400
You tell it to deploy a feature to staging

186
00:06:09,400 --> 00:06:11,320
and validate the integration tests.

187
00:06:11,320 --> 00:06:13,120
The agentic shell understands that intent.

188
00:06:13,120 --> 00:06:14,720
It plans the workflow and figures out

189
00:06:14,720 --> 00:06:16,800
which commands need to run in what order.

190
00:06:16,800 --> 00:06:18,800
It executes them and watches for failures.

191
00:06:18,800 --> 00:06:20,960
If something breaks, it analyzes the error

192
00:06:20,960 --> 00:06:22,480
to understand what went wrong.

193
00:06:22,480 --> 00:06:24,640
It modifies the approach and it tries again.

194
00:06:24,640 --> 00:06:27,240
The terminal has moved from execute what I tell you

195
00:06:27,240 --> 00:06:28,960
to accomplish what I want.

196
00:06:28,960 --> 00:06:30,560
This is where the real leverage lives.

197
00:06:30,560 --> 00:06:32,800
It isn't in suggesting the next line of code.

198
00:06:32,800 --> 00:06:34,720
That is just optimization at the edges.

199
00:06:34,720 --> 00:06:37,120
The real leverage is automating the entire workflow

200
00:06:37,120 --> 00:06:40,360
that takes code from written to validated.

201
00:06:40,360 --> 00:06:43,080
The whole pipeline, the whole orchestration.

202
00:06:43,080 --> 00:06:45,240
This sequence used to require human knowledge

203
00:06:45,240 --> 00:06:46,920
and human trial and error.

204
00:06:46,920 --> 00:06:48,880
Now, a developer can describe a complex task

205
00:06:48,880 --> 00:06:50,760
and the agentic shell breaks it into steps.

206
00:06:50,760 --> 00:06:52,240
It runs tests and checks results.

207
00:06:52,240 --> 00:06:54,240
It modifies code if those tests fail.

208
00:06:54,240 --> 00:06:56,400
It validates security and checks performance.

209
00:06:56,400 --> 00:06:58,760
It handles all the machinery that used to require someone

210
00:06:58,760 --> 00:07:01,360
to understand every tool and every edge case.

211
00:07:01,360 --> 00:07:02,880
That isn't a productivity improvement.

212
00:07:02,880 --> 00:07:05,760
That is a structural reorganization of how work flows.

213
00:07:05,760 --> 00:07:08,880
The teams that understand this first will move faster,

214
00:07:08,880 --> 00:07:10,560
not incrementally, dramatically,

215
00:07:10,560 --> 00:07:12,840
because they aren't optimizing the edges of the developer

216
00:07:12,840 --> 00:07:13,960
experience anymore.

217
00:07:13,960 --> 00:07:16,560
They are automating the core machinery of software delivery.

218
00:07:16,560 --> 00:07:18,480
The terminal is no longer where you type.

219
00:07:18,480 --> 00:07:19,840
It's where the work happens.

220
00:07:19,840 --> 00:07:21,320
The teams that don't realize this yet

221
00:07:21,320 --> 00:07:23,760
will keep treating the terminal like a legacy tool,

222
00:07:23,760 --> 00:07:26,480
a command dispatcher, something you use to run things

223
00:07:26,480 --> 00:07:27,840
you already know how to run.

224
00:07:27,840 --> 00:07:29,560
But in reality, it does the opposite.

225
00:07:29,560 --> 00:07:31,760
Let's look at what this actually looks like.

226
00:07:31,760 --> 00:07:34,200
The agentic developer stack: a new model.

227
00:07:34,200 --> 00:07:36,320
The agentic developer stack isn't just a new tool

228
00:07:36,320 --> 00:07:37,280
or a single system.

229
00:07:37,280 --> 00:07:39,200
It's a model built on four distinct layers.

230
00:07:39,200 --> 00:07:41,680
They operate independently, but they're wired together.

231
00:07:41,680 --> 00:07:42,800
Each layer is autonomous.

232
00:07:42,800 --> 00:07:43,840
Each one is intelligent.

233
00:07:43,840 --> 00:07:45,360
Each one has a specific job.

234
00:07:45,360 --> 00:07:47,240
And they communicate through structured handoffs.

235
00:07:47,240 --> 00:07:49,760
The first layer is orchestration, which is Copilot CLI.

236
00:07:49,760 --> 00:07:51,560
This is where you sit and describe your intent

237
00:07:51,560 --> 00:07:52,400
in natural language.

238
00:07:52,400 --> 00:07:55,200
You aren't writing scripts or memorizing commands anymore.

239
00:07:55,200 --> 00:07:56,880
You're saying what you want to accomplish.

240
00:07:56,880 --> 00:07:59,160
Like adding pagination to an API endpoint

241
00:07:59,160 --> 00:08:01,360
or refactoring a controller for clarity.

242
00:08:01,360 --> 00:08:04,000
The CLI agent reads that request and looks at your code base

243
00:08:04,000 --> 00:08:06,280
to understand the structure and existing patterns.

244
00:08:06,280 --> 00:08:09,320
It plans the workflow by breaking the task into steps,

245
00:08:09,320 --> 00:08:10,840
figuring out which files to change

246
00:08:10,840 --> 00:08:12,960
and what the success criteria look like.

247
00:08:12,960 --> 00:08:15,680
Then it executes those steps in the terminal one by one.

248
00:08:15,680 --> 00:08:17,880
If a test fails, it adapts.

249
00:08:17,880 --> 00:08:19,840
It modifies the code and tries again

250
00:08:19,840 --> 00:08:22,200
without asking for permission on every micro decision.

251
00:08:22,200 --> 00:08:24,280
It operates within the policy you've set,

252
00:08:24,280 --> 00:08:27,320
converting your intent into a workflow that actually runs.

253
00:08:27,320 --> 00:08:30,040
The second layer is transformation, the modernization agent.

254
00:08:30,040 --> 00:08:32,160
Legacy code goes in, modern code comes out.

255
00:08:32,160 --> 00:08:33,880
The agent reads the old system and builds

256
00:08:33,880 --> 00:08:35,440
a complete map of how data flows

257
00:08:35,440 --> 00:08:37,160
and where the hidden dependencies live.

258
00:08:37,160 --> 00:08:39,120
Once it understands the system semantically,

259
00:08:39,120 --> 00:08:40,800
it extracts the actual business logic.

260
00:08:40,800 --> 00:08:42,920
Then it re-implements that logic in a modern stack

261
00:08:42,920 --> 00:08:44,080
with modern frameworks.

262
00:08:44,080 --> 00:08:46,760
But everything the system promised to do, it still does.

263
00:08:46,760 --> 00:08:49,040
Every edge case and every contract remains intact.

264
00:08:49,040 --> 00:08:51,440
This layer isn't just about replacing old patterns,

265
00:08:51,440 --> 00:08:53,040
it's about rebuilding the system

266
00:08:53,040 --> 00:08:54,800
to do the same thing better.

267
00:08:54,800 --> 00:08:57,200
The third layer is validation, the code review agent.

268
00:08:57,200 --> 00:08:59,640
Every change gets reviewed, but not by humans.

269
00:08:59,640 --> 00:09:02,640
The agent reads the code and understands the architecture.

270
00:09:02,640 --> 00:09:05,360
Seeing exactly where a change crosses an architectural line,

271
00:09:05,360 --> 00:09:08,200
it verifies SOLID principles and looks for ripple effects

272
00:09:08,200 --> 00:09:09,520
across multiple files.

273
00:09:09,520 --> 00:09:11,080
It catches the issues a human would miss

274
00:09:11,080 --> 00:09:13,240
because they're tired or switching contexts.

275
00:09:13,240 --> 00:09:16,120
It doesn't just flag problems, it enforces your architectural

276
00:09:16,120 --> 00:09:16,880
policy.

277
00:09:16,880 --> 00:09:18,920
This layer ensures that every change actually

278
00:09:18,920 --> 00:09:20,600
fits the system it's going into.

279
00:09:20,600 --> 00:09:22,040
The fourth layer is execution,

280
00:09:22,040 --> 00:09:23,960
the coding agent running in the cloud.

281
00:09:23,960 --> 00:09:26,640
Some tasks need a full environment for cloning repos,

282
00:09:26,640 --> 00:09:29,720
installing dependencies, and running security scans.

283
00:09:29,720 --> 00:09:32,480
This agent operates in a managed, ephemeral environment

284
00:09:32,480 --> 00:09:33,240
in the cloud.

285
00:09:33,240 --> 00:09:35,040
It has everything it needs to run real tests

286
00:09:35,040 --> 00:09:36,120
against the real code base.

287
00:09:36,120 --> 00:09:38,640
It opens pull requests, documents the work,

288
00:09:38,640 --> 00:09:39,960
and waits for feedback.

289
00:09:39,960 --> 00:09:42,760
This layer manages execution where nothing can accidentally

290
00:09:42,760 --> 00:09:43,960
break production.

291
00:09:43,960 --> 00:09:47,200
Each layer is autonomous, but they work as a feedback mesh.

292
00:09:47,200 --> 00:09:49,360
Validation can send work back to transformation

293
00:09:49,360 --> 00:09:51,160
if it sees an architectural violation.

294
00:09:51,160 --> 00:09:53,560
Execution can surface issues back to orchestration

295
00:09:53,560 --> 00:09:54,960
that weren't anticipated.

296
00:09:54,960 --> 00:09:56,520
The human sits above all of this.

297
00:09:56,520 --> 00:09:59,000
You set the policy, you define the guardrails,

298
00:09:59,000 --> 00:10:00,480
and you make the final decisions.

299
00:10:00,480 --> 00:10:01,760
You aren't out of the loop.

300
00:10:01,760 --> 00:10:03,120
You're just in a different position.

301
00:10:03,120 --> 00:10:06,080
This is the structural shift: AI manages the workflow.

302
00:10:06,080 --> 00:10:07,720
And you manage the policy.

303
00:10:07,720 --> 00:10:08,960
The orchestration layer.

304
00:10:08,960 --> 00:10:11,240
Copilot CLI and agentic shells.

305
00:10:11,240 --> 00:10:13,160
Copilot CLI is fundamentally different

306
00:10:13,160 --> 00:10:14,560
from a code autocompleter.

307
00:10:14,560 --> 00:10:17,320
It isn't predicting your next line or suggesting syntax.

308
00:10:17,320 --> 00:10:19,680
It's a workflow engine embedded in your terminal.

309
00:10:19,680 --> 00:10:21,320
You aren't interacting with a tool.

310
00:10:21,320 --> 00:10:22,920
You're delegating to an agent.

311
00:10:22,920 --> 00:10:25,040
The difference starts with how you communicate.

312
00:10:25,040 --> 00:10:26,120
You don't describe commands.

313
00:10:26,120 --> 00:10:27,280
You describe intent.

314
00:10:27,280 --> 00:10:29,840
When you tell it to add pagination and update the tests,

315
00:10:29,840 --> 00:10:31,000
you're stating a goal.

316
00:10:31,000 --> 00:10:32,400
The agent figures out how to get there.

317
00:10:32,400 --> 00:10:34,800
The CLI agent reads your code base to map the modules

318
00:10:34,800 --> 00:10:36,520
and identify your coding conventions.

319
00:10:36,520 --> 00:10:38,760
It understands the standards you've documented,

320
00:10:38,760 --> 00:10:39,680
then it plans.

321
00:10:39,680 --> 00:10:42,080
It breaks the task into specific steps,

322
00:10:42,080 --> 00:10:44,520
deciding which files need to change and in what order.

323
00:10:44,520 --> 00:10:46,920
Once the plan is solid, it executes.

324
00:10:46,920 --> 00:10:49,440
And here is what happens when something breaks.

325
00:10:49,440 --> 00:10:51,600
In a normal environment, you would read the error

326
00:10:51,600 --> 00:10:53,280
and modify the code yourself.

327
00:10:53,280 --> 00:10:54,720
The agent does exactly that.

328
00:10:54,720 --> 00:10:57,240
It analyzes the failure to understand what caused it.

329
00:10:57,240 --> 00:10:58,800
Then it modifies the implementation

330
00:10:58,800 --> 00:11:00,080
and runs the test again.

331
00:11:00,080 --> 00:11:01,960
It doesn't ask you before each iteration.

332
00:11:01,960 --> 00:11:03,240
It operates within the constraints

333
00:11:03,240 --> 00:11:05,160
you've set until the tests pass.
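
Editor's sketch: the fix-and-retry loop described here can be pictured as a bounded loop around a test run. This is a minimal illustration, not Copilot CLI's actual implementation. The names `RunResult`, `run_tests`, `propose_fix`, and `max_iterations` are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """Outcome of one test run (hypothetical shape)."""
    passed: bool
    log: str = ""


def run_until_green(
    run_tests: Callable[[], RunResult],
    propose_fix: Callable[[str], None],
    max_iterations: int = 5,
) -> bool:
    """Run tests; on failure, hand the log to propose_fix and try again.

    The iteration budget stands in for the constraints the developer sets:
    the loop never asks for confirmation, but it also never runs forever.
    """
    for _ in range(max_iterations):
        result = run_tests()
        if result.passed:
            return True
        # Analyze the failure and modify the implementation (stubbed here).
        propose_fix(result.log)
    # One final check after the last attempted fix.
    return run_tests().passed
```

The budget is the design choice that matters: iteration without approval is only reasonable when the loop is bounded.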

334
00:11:05,160 --> 00:11:06,840
This is agentic behavior.

335
00:11:06,840 --> 00:11:09,000
You aren't telling it to execute a command.

336
00:11:09,000 --> 00:11:10,360
You're telling it to achieve a goal.

337
00:11:10,360 --> 00:11:12,200
The agent operates within a policy envelope.

338
00:11:12,200 --> 00:11:15,040
Its freedom is bounded by the rules you define.

339
00:11:15,040 --> 00:11:16,600
You decide which files are off limits

340
00:11:16,600 --> 00:11:18,720
and which quality gates it needs to clear.

341
00:11:18,720 --> 00:11:20,360
The agent incorporates these constraints

342
00:11:20,360 --> 00:11:22,080
into its reasoning from the start.

343
00:11:22,080 --> 00:11:23,400
It doesn't fight the rules.

344
00:11:23,400 --> 00:11:25,320
It uses them to find the right approach.
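
Editor's sketch: one way to picture a policy envelope is a small object that answers two questions, "may I edit this file?" and "have the quality gates been cleared?". The class name, field names, and glob patterns below are hypothetical and are not taken from any real Copilot configuration format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class PolicyEnvelope:
    """Hypothetical constraints a developer hands to an agent."""
    protected_globs: tuple[str, ...] = ()  # files the agent must not touch
    min_coverage: float = 0.0              # required coverage as a fraction, 0..1

    def may_edit(self, path: str) -> bool:
        # A path is editable only if it matches none of the protected patterns.
        return not any(fnmatchcase(path, g) for g in self.protected_globs)

    def gates_cleared(self, tests_passed: bool, coverage: float) -> bool:
        # Both gates must hold before the work counts as done.
        return tests_passed and coverage >= self.min_coverage


def blocked_paths(policy: PolicyEnvelope, planned_paths: list[str]) -> list[str]:
    """Return the planned edits that the policy forbids, checked before any work starts."""
    return [p for p in planned_paths if not policy.may_edit(p)]
```

Checking the plan against the policy up front, rather than rejecting the result afterwards, mirrors the point that constraints enter the reasoning from the start.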

345
00:11:25,320 --> 00:11:26,600
Latency is critical here.

346
00:11:26,600 --> 00:11:28,800
If an agent takes 30 seconds to respond,

347
00:11:28,800 --> 00:11:29,720
developers won't use it.

348
00:11:29,720 --> 00:11:31,560
You need feedback in seconds, not minutes.

349
00:11:31,560 --> 00:11:33,640
Copilot CLI has been optimized for speed,

350
00:11:33,640 --> 00:11:37,120
with updates showing a 75% improvement in response time.

351
00:11:37,120 --> 00:11:38,720
This comes from architectural shifts

352
00:11:38,720 --> 00:11:41,000
like prompt caching and streaming output.

353
00:11:41,000 --> 00:11:42,240
It uses context compression

354
00:11:42,240 --> 00:11:44,040
to send only the relevant code

355
00:11:44,040 --> 00:11:46,760
and parallel execution to run multiple checks at once.

356
00:11:46,760 --> 00:11:48,520
The agent needs to feel responsive,

357
00:11:48,520 --> 00:11:49,960
like it's working at your pace.

358
00:11:49,960 --> 00:11:51,720
Context is the other half of the equation.

359
00:11:51,720 --> 00:11:54,640
The agent needs to understand your project deeply,

360
00:11:54,640 --> 00:11:56,880
including the gotchas that aren't written down.

361
00:11:56,880 --> 00:11:58,480
This is why project memory matters.

362
00:11:58,480 --> 00:12:01,960
Files like architecture.md or a domain glossary are foundational.

363
00:12:01,960 --> 00:12:03,880
The agent reads these before every task

364
00:12:03,880 --> 00:12:05,320
to ensure continuity.

365
00:12:05,320 --> 00:12:07,440
Without this context, the agent generates code

366
00:12:07,440 --> 00:12:09,520
that is technically correct, but doesn't fit.

367
00:12:09,520 --> 00:12:10,880
It might pass the tests.

368
00:12:10,880 --> 00:12:12,440
But it violates your patterns.

369
00:12:12,440 --> 00:12:15,040
With context, the code actually belongs in your code base.

370
00:12:15,040 --> 00:12:16,680
It respects the decisions you've made

371
00:12:16,680 --> 00:12:18,520
about how things should be structured.

372
00:12:18,520 --> 00:12:20,800
But the orchestration layer only handles one part

373
00:12:20,800 --> 00:12:23,600
of the workflow. The transformation layer: modernization

374
00:12:23,600 --> 00:12:24,400
agents.

375
00:12:24,400 --> 00:12:26,760
The orchestration layer decides what needs to happen.

376
00:12:26,760 --> 00:12:29,040
The transformation layer is what actually happens.

377
00:12:29,040 --> 00:12:31,360
And this layer solves a problem that most organizations

378
00:12:31,360 --> 00:12:32,800
have basically given up on.

379
00:12:32,800 --> 00:12:33,680
Legacy code.

380
00:12:33,680 --> 00:12:36,280
Modernization agents exist for one specific job.

381
00:12:36,280 --> 00:12:38,680
They take old code and make it new without breaking it.

382
00:12:38,680 --> 00:12:39,920
That sounds simple.

383
00:12:39,920 --> 00:12:40,720
It isn't.

384
00:12:40,720 --> 00:12:42,960
In reality, it's exponentially harder

385
00:12:42,960 --> 00:12:44,960
than writing new code from scratch,

386
00:12:44,960 --> 00:12:47,440
because modernization isn't just find and replace.

387
00:12:47,440 --> 00:12:49,800
You can't just swap out old syntax for new syntax

388
00:12:49,800 --> 00:12:50,680
and call it a day.

389
00:12:50,680 --> 00:12:53,480
Old code carries hidden logic: business rules buried

390
00:12:53,480 --> 00:12:55,600
in conditional statements, data constraints

391
00:12:55,600 --> 00:12:57,080
that nobody ever documented,

392
00:12:57,080 --> 00:12:59,080
edge cases handled in ways that make no sense

393
00:12:59,080 --> 00:13:01,040
until you understand why they exist.

394
00:13:01,040 --> 00:13:02,440
The code didn't start as garbage.

395
00:13:02,440 --> 00:13:03,360
It evolved.

396
00:13:03,360 --> 00:13:06,160
It handled problems that emerged over years of production.

397
00:13:06,160 --> 00:13:08,360
It solved bugs that happened in the middle of the night.

398
00:13:08,360 --> 00:13:11,160
Every weird pattern exists because something broke

399
00:13:11,160 --> 00:13:12,240
and someone fixed it.

400
00:13:12,240 --> 00:13:14,680
The modernization agent has to excavate all of that

401
00:13:14,680 --> 00:13:16,600
before it can touch a single line.

402
00:13:16,600 --> 00:13:19,320
The first thing it does is what you'd call code archaeology.

403
00:13:19,320 --> 00:13:22,800
The agent reads the entire legacy system, not in summary form,

404
00:13:22,800 --> 00:13:24,240
not as an overview.

405
00:13:24,240 --> 00:13:25,520
It builds a complete map.

406
00:13:25,520 --> 00:13:26,720
It identifies entry points.

407
00:13:26,720 --> 00:13:27,640
Where does data come in?

408
00:13:27,640 --> 00:13:29,120
Where do requests start?

409
00:13:29,120 --> 00:13:30,280
It traces data flows.

410
00:13:30,280 --> 00:13:31,560
Where does information move?

411
00:13:31,560 --> 00:13:32,640
How does it transform?

412
00:13:32,640 --> 00:13:33,960
It finds hidden coupling.

413
00:13:33,960 --> 00:13:36,720
Places where modules that shouldn't know about each other

414
00:13:36,720 --> 00:13:38,280
actually depend on each other.

415
00:13:38,280 --> 00:13:40,000
It identifies the business logic.

416
00:13:40,000 --> 00:13:41,440
The thing the system actually does

417
00:13:41,440 --> 00:13:44,040
beneath all the layers of framework and historical decisions.

418
00:13:44,040 --> 00:13:45,920
Once it understands the system,

419
00:13:45,920 --> 00:13:47,280
and this understanding runs deep,

420
00:13:47,280 --> 00:13:48,600
it generates documentation.

421
00:13:48,600 --> 00:13:49,680
What does this module do?

422
00:13:49,680 --> 00:13:51,000
What are the key workflows?

423
00:13:51,000 --> 00:13:52,280
What are the assumptions it makes?

424
00:13:52,280 --> 00:13:53,080
What are the gotchas?

425
00:13:53,080 --> 00:13:55,400
The things that will break if you change them without understanding

426
00:13:55,400 --> 00:13:56,120
why they exist.

427
00:13:56,120 --> 00:13:58,200
This documentation becomes project memory.

428
00:13:58,200 --> 00:14:00,320
It gets fed back into the agent on every iteration.

429
00:14:00,320 --> 00:14:01,720
The agent doesn't lose context.

430
00:14:01,720 --> 00:14:03,120
It builds on what it learned.

431
00:14:03,120 --> 00:14:05,680
Then the agent works, not in one massive refactor,

432
00:14:05,680 --> 00:14:08,480
but in small, reversible slices: one module at a time,

433
00:14:08,480 --> 00:14:10,880
one service at a time. Each slice is tested.

434
00:14:10,880 --> 00:14:12,360
Does it still do what it did before?

435
00:14:12,360 --> 00:14:13,680
Each slice is reviewed.

436
00:14:13,680 --> 00:14:15,160
Does it fit the architecture?

437
00:14:15,160 --> 00:14:17,840
Only after validation does it move to the next slice.

438
00:14:17,840 --> 00:14:21,200
The critical insight here is what modernization agents don't do.

439
00:14:21,200 --> 00:14:22,560
They don't replace human architects.

440
00:14:22,560 --> 00:14:24,840
They don't eliminate the need for human judgment

441
00:14:24,840 --> 00:14:26,600
about what the system should become.

442
00:14:26,600 --> 00:14:28,720
What they do is handle the mechanical work.

443
00:14:28,720 --> 00:14:33,200
Code generation, dependency updates, test generation, boilerplate.

444
00:14:33,200 --> 00:14:35,280
The stuff that's necessary but not interesting.

445
00:14:35,280 --> 00:14:37,480
Humans handle architectural decisions.

446
00:14:37,480 --> 00:14:40,320
Humans decide what patterns to follow in the new system.

447
00:14:40,320 --> 00:14:41,440
Humans own the strategy.

448
00:14:41,440 --> 00:14:44,080
The agent owns the execution.

449
00:14:44,080 --> 00:14:46,240
On SWE-bench Pro, a benchmark of real software engineering tasks

450
00:14:46,240 --> 00:14:48,240
taken from actual GitHub repositories,

451
00:14:48,240 --> 00:14:50,320
modernization agents paired with review agents

452
00:14:50,320 --> 00:14:51,560
showed something interesting.

453
00:14:51,560 --> 00:14:52,320
Claude Sonnet,

454
00:14:52,320 --> 00:14:54,120
paired with a GPT-based reviewer,

455
00:14:54,120 --> 00:14:58,000
closed 74.7% of the performance gap between Sonnet and Opus.

456
00:14:58,000 --> 00:14:59,600
That's the expensive model.

457
00:14:59,600 --> 00:15:02,680
That means a mid-tier model, when paired with strong validation,

458
00:15:02,680 --> 00:15:04,160
can approach top-tier performance.

459
00:15:04,160 --> 00:15:06,440
The implication matters.

460
00:15:06,440 --> 00:15:07,880
You don't need the most expensive model

461
00:15:07,880 --> 00:15:09,600
if you have a good review process.
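
Editor's sketch: "closing 74.7% of the gap" is a ratio, not an absolute score. It is the improvement of the paired setup over the mid-tier baseline, divided by the distance between the baseline and the top-tier model. The function below shows that arithmetic. The scores in the usage note are made-up placeholders chosen only to reproduce the ratio; the episode does not state the underlying benchmark scores.

```python
def gap_closed(baseline: float, paired: float, top: float) -> float:
    """Fraction of the baseline-to-top gap recovered by the paired setup.

    baseline: score of the mid-tier model alone
    paired:   score of the mid-tier model plus a reviewer
    top:      score of the top-tier model alone
    """
    if top == baseline:
        raise ValueError("no gap to close: top and baseline scores are equal")
    return (paired - baseline) / (top - baseline)


# Placeholder scores for illustration only, not real benchmark results.
example = gap_closed(baseline=40.0, paired=47.47, top=50.0)
```

With these placeholder numbers, `example` comes out at roughly 0.747, meaning about three quarters of the distance to the top-tier model was recovered.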

462
00:15:09,600 --> 00:15:11,720
But here's what separates successful modernization

463
00:15:11,720 --> 00:15:13,360
from failure: context.

464
00:15:13,360 --> 00:15:15,480
The agent needs to read your architecture docs,

465
00:15:15,480 --> 00:15:17,960
your domain glossaries, your test strategies,

466
00:15:17,960 --> 00:15:19,760
the gotchas file that lists all the weird things

467
00:15:19,760 --> 00:15:21,280
that will bite you if you're not careful.

468
00:15:21,280 --> 00:15:23,640
Without context, the agent generates code

469
00:15:23,640 --> 00:15:26,240
that's modern in syntax, but ancient in structure.

470
00:15:26,240 --> 00:15:29,040
It's technical debt wearing new clothes. With context,

471
00:15:29,040 --> 00:15:30,920
it generates code that preserves behavior

472
00:15:30,920 --> 00:15:32,640
while modernizing the implementation.

473
00:15:32,640 --> 00:15:34,640
That's the difference between a modernization project

474
00:15:34,640 --> 00:15:36,880
that ships and one that becomes another graveyard

475
00:15:36,880 --> 00:15:38,240
of abandoned work.

476
00:15:38,240 --> 00:15:39,840
But code transformation only matters

477
00:15:39,840 --> 00:15:42,160
if the output gets validated.

478
00:15:42,160 --> 00:15:43,360
The validation layer:

479
00:15:43,360 --> 00:15:45,800
code review agents and architectural enforcement.

480
00:15:45,800 --> 00:15:48,040
The problem with code review is deceptively simple.

481
00:15:48,040 --> 00:15:50,760
It doesn't scale with the velocity of code generation.

482
00:15:50,760 --> 00:15:52,880
When you have one developer writing code

483
00:15:52,880 --> 00:15:54,840
and one reviewer checking it, the system works.

484
00:15:54,840 --> 00:15:55,840
They trade off.

485
00:15:55,840 --> 00:15:57,000
Writer finishes a PR.

486
00:15:57,000 --> 00:15:58,440
Reviewer reads it.

487
00:15:58,440 --> 00:15:59,560
Feedback happens.

488
00:15:59,560 --> 00:16:00,720
Changes get made.

489
00:16:00,720 --> 00:16:02,520
It merges.

490
00:16:02,520 --> 00:16:04,080
When you introduce an agentic system

491
00:16:04,080 --> 00:16:06,040
that can generate more code in a day

492
00:16:06,040 --> 00:16:08,680
than a human reviewer can reasonably read in a week,

493
00:16:08,680 --> 00:16:10,000
that model collapses.

494
00:16:10,000 --> 00:16:11,080
The queue grows.

495
00:16:11,080 --> 00:16:12,640
The feedback gets slower.

496
00:16:12,640 --> 00:16:13,960
The review becomes a bottleneck.

497
00:16:13,960 --> 00:16:15,880
And bottlenecks are where quality deteriorates

498
00:16:15,880 --> 00:16:17,440
because people start rushing.

499
00:16:17,440 --> 00:16:19,680
Code review agents exist to solve this.

500
00:16:19,680 --> 00:16:21,200
But not in the way you might think.

501
00:16:21,200 --> 00:16:22,600
They don't replace human review.

502
00:16:22,600 --> 00:16:23,240
They filter it.

503
00:16:23,240 --> 00:16:24,520
They handle the first pass.

504
00:16:24,520 --> 00:16:26,160
The mechanical checks.

505
00:16:26,160 --> 00:16:27,720
The obvious issues.

506
00:16:27,720 --> 00:16:29,960
This clears the queue so humans can focus

507
00:16:29,960 --> 00:16:31,600
on what humans are actually good at.

508
00:16:31,600 --> 00:16:33,320
Reasoning about architectural fit.

509
00:16:33,320 --> 00:16:34,720
Understanding business intent.

510
00:16:34,720 --> 00:16:36,520
Making judgment calls about trade-offs.

511
00:16:36,520 --> 00:16:38,520
What gets checked at this layer matters.

512
00:16:38,520 --> 00:16:40,960
A good code review agent looks at security.

513
00:16:40,960 --> 00:16:42,200
Input validation.

514
00:16:42,200 --> 00:16:43,640
Authentication boundaries.

515
00:16:43,640 --> 00:16:44,560
Data exposure.

516
00:16:44,560 --> 00:16:46,280
Known dependency vulnerabilities.

517
00:16:46,280 --> 00:16:49,480
It checks architecture: layer violations, boundary crossings.

518
00:16:49,480 --> 00:16:50,840
Coupling that shouldn't exist.

519
00:16:50,840 --> 00:16:51,920
SOLID principles.

520
00:16:51,920 --> 00:16:53,000
It checks testing.

521
00:16:53,000 --> 00:16:53,800
Coverage gaps.

522
00:16:53,800 --> 00:16:54,920
Missing edge cases.

523
00:16:54,920 --> 00:16:55,800
Weak assertions.

524
00:16:55,800 --> 00:16:56,720
Test quality.

525
00:16:56,720 --> 00:16:57,640
It checks performance.

526
00:16:57,640 --> 00:16:59,000
N+1 queries.

527
00:16:59,000 --> 00:16:59,880
Inefficient loops.

528
00:16:59,880 --> 00:17:00,960
Memory leaks.

529
00:17:00,960 --> 00:17:02,480
Resource exhaustion patterns.

530
00:17:02,480 --> 00:17:03,600
It checks style.

531
00:17:03,600 --> 00:17:04,800
Naming clarity.

532
00:17:04,800 --> 00:17:06,640
Consistency with local conventions.

533
00:17:06,640 --> 00:17:08,080
Readability.

534
00:17:08,080 --> 00:17:10,280
But here's where most code review systems fail.

535
00:17:10,280 --> 00:17:12,040
They look at these things in isolation.

536
00:17:12,040 --> 00:17:12,960
One file at a time.

537
00:17:12,960 --> 00:17:14,160
The agent we're talking about here

538
00:17:14,160 --> 00:17:15,560
understands architecture.

539
00:17:15,560 --> 00:17:17,800
It reads a PR and it knows the boundaries in your system.

540
00:17:17,800 --> 00:17:20,760
It sees where the change crosses an architectural line.

541
00:17:20,760 --> 00:17:22,400
It knows what patterns you've established.

542
00:17:22,400 --> 00:17:24,320
It can reason about multi-file impact.

543
00:17:24,320 --> 00:17:25,840
It catches the architectural problem

544
00:17:25,840 --> 00:17:28,080
that would break the system in unexpected ways.

545
00:17:28,080 --> 00:17:30,480
The one that wouldn't show up until production traffic hit it.

546
00:17:30,480 --> 00:17:31,840
This requires context.

547
00:17:31,840 --> 00:17:35,000
Not just your code, not just the change being reviewed.

548
00:17:35,000 --> 00:17:38,400
Rules files that document your architectural constraints.

549
00:17:38,400 --> 00:17:41,400
Pattern documents that show how things should be structured.

550
00:17:41,400 --> 00:17:43,880
Structural trees that illustrate the component hierarchy.

551
00:17:43,880 --> 00:17:44,640
Without these,

552
00:17:44,640 --> 00:17:46,120
the agent is a linter on steroids:

553
00:17:46,120 --> 00:17:48,280
useful but shallow. With these,

554
00:17:48,280 --> 00:17:50,120
it becomes an architectural enforcer,

555
00:17:50,120 --> 00:17:53,600
actively shaping the quality of what gets merged.

556
00:17:53,600 --> 00:17:55,520
GitHub's Copilot Code Review now runs

557
00:17:55,520 --> 00:17:57,480
on a fundamentally different architecture

558
00:17:57,480 --> 00:17:58,760
than traditional review systems.

559
00:17:58,760 --> 00:18:00,320
It's not one model reading a PR.

560
00:18:00,320 --> 00:18:02,560
It's multiple specialized agents cooperating:

561
00:18:02,560 --> 00:18:05,040
a style agent, a security agent,

562
00:18:05,040 --> 00:18:07,760
an architecture agent, a testing agent.

563
00:18:07,760 --> 00:18:11,360
Each one runs independently, each one produces findings,

564
00:18:11,360 --> 00:18:13,320
and an orchestrator aggregates them.

565
00:18:13,320 --> 00:18:16,720
It deduplicates redundant findings, resolves conflicts,

566
00:18:16,720 --> 00:18:19,080
and produces a unified review that's actually coherent

567
00:18:19,080 --> 00:18:21,800
instead of a mess of contradictory suggestions.
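
Editor's sketch: the aggregation step described here (collect findings from several reviewers, drop duplicates, resolve conflicts) can be illustrated with a small merge function. This is an assumption-laden toy, not GitHub's implementation. The `Finding` fields, the severity scale, and the "highest severity wins" rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical severity scale; higher number means more serious.
SEVERITY = {"info": 0, "warning": 1, "error": 2}


@dataclass(frozen=True)
class Finding:
    agent: str      # which specialized reviewer produced this
    path: str
    line: int
    rule: str
    severity: str   # one of the SEVERITY keys
    message: str


def aggregate(findings: list[Finding]) -> list[Finding]:
    """Merge findings from independent reviewers into one ordered review.

    Two findings are duplicates when they name the same rule at the same
    location. If reviewers disagree on severity, the most severe one is kept.
    """
    merged: dict[tuple[str, int, str], Finding] = {}
    for f in findings:
        key = (f.path, f.line, f.rule)
        current = merged.get(key)
        if current is None or SEVERITY[f.severity] > SEVERITY[current.severity]:
            merged[key] = f
    # Most severe first, then stable by location, so the review reads coherently.
    return sorted(
        merged.values(),
        key=lambda f: (-SEVERITY[f.severity], f.path, f.line),
    )
```

Keeping the most severe finding is only one possible conflict policy; a real system might instead escalate disagreements to a human.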

568
00:18:21,800 --> 00:18:23,320
The benefit of this multi-agent approach

569
00:18:23,320 --> 00:18:25,720
is that you get multiple perspectives on the same code.

570
00:18:25,720 --> 00:18:27,400
Different agents catch different things.

571
00:18:27,400 --> 00:18:30,080
A security-focused agent might miss a performance problem

572
00:18:30,080 --> 00:18:32,320
that a performance-focused agent would flag.

573
00:18:32,320 --> 00:18:35,240
But together, they see more than any single agent would.

574
00:18:35,240 --> 00:18:37,320
Cross-model review adds another layer to this.

575
00:18:37,320 --> 00:18:38,880
One model family generates the code.

576
00:18:38,880 --> 00:18:40,560
A different model family reviews it.

577
00:18:40,560 --> 00:18:42,760
Claude reviewing GPT-generated code.

578
00:18:42,760 --> 00:18:44,960
GPT reviewing Claude-generated code.

579
00:18:44,960 --> 00:18:47,120
Different model families have different failure modes.

580
00:18:47,120 --> 00:18:49,400
What one model family reliably catches,

581
00:18:49,400 --> 00:18:51,040
another might miss.

582
00:18:51,040 --> 00:18:52,520
By bringing different architectures

583
00:18:52,520 --> 00:18:54,720
and different training to the review process,

584
00:18:54,720 --> 00:18:56,280
you reduce correlated failures.

585
00:18:56,280 --> 00:18:58,640
You catch more actual problems. On complex,

586
00:18:58,640 --> 00:19:01,120
multi-file tasks, the ones where ripple effects matter,

587
00:19:01,120 --> 00:19:03,880
where architectural decisions show up as subtle breakage,

588
00:19:03,880 --> 00:19:05,440
cross-model review improves outcomes

589
00:19:05,440 --> 00:19:08,040
by 3.8 to 4.8 percentage points.

590
00:19:08,040 --> 00:19:09,520
That's on top of baseline performance.

591
00:19:09,520 --> 00:19:11,240
That's real improvement on the kinds of changes

592
00:19:11,240 --> 00:19:12,320
that matter most.

593
00:19:12,320 --> 00:19:13,640
The structural shift here is the one

594
00:19:13,640 --> 00:19:15,680
that makes the entire system work.

595
00:19:15,680 --> 00:19:17,920
Validation moves from being a human judgment process

596
00:19:17,920 --> 00:19:19,760
to being an automated enforcement process

597
00:19:19,760 --> 00:19:20,800
with human oversight.

598
00:19:20,800 --> 00:19:22,320
The agent enforces the policy.

599
00:19:22,320 --> 00:19:23,200
The rules get checked.

600
00:19:23,200 --> 00:19:24,760
The patterns get validated.

601
00:19:24,760 --> 00:19:26,880
The human reviews the output of that enforcement

602
00:19:26,880 --> 00:19:30,320
and makes final decisions on edge cases where judgment is needed.

603
00:19:30,320 --> 00:19:32,400
This is why the validation layer isn't optional.

604
00:19:32,400 --> 00:19:34,720
It's what keeps the transformation and execution layers

605
00:19:34,720 --> 00:19:36,400
from generating garbage at scale.

606
00:19:36,400 --> 00:19:39,360
The execution layer: coding agents and cloud environments.

607
00:19:39,360 --> 00:19:41,400
Coding agents operate completely differently

608
00:19:41,400 --> 00:19:42,960
from Copilot CLI.

609
00:19:42,960 --> 00:19:44,560
That difference is fundamental.

610
00:19:44,560 --> 00:19:46,200
Copilot CLI is synchronous.

611
00:19:46,200 --> 00:19:48,960
It's local. You're in the terminal, describing a task,

612
00:19:48,960 --> 00:19:51,000
watching the work happen. You get feedback in seconds

613
00:19:51,000 --> 00:19:53,640
because the agent has access to your local machine,

614
00:19:53,640 --> 00:19:56,280
your tests, your tools, your dependencies.

615
00:19:56,280 --> 00:19:57,880
Coding agents are asynchronous.

616
00:19:57,880 --> 00:20:00,800
They're remote. You create a GitHub issue, assign it to Copilot,

617
00:20:00,800 --> 00:20:01,720
and then you walk away.

618
00:20:01,720 --> 00:20:04,080
The agent takes over in a managed environment.

619
00:20:04,080 --> 00:20:06,080
You get feedback when the work is finished,

620
00:20:06,080 --> 00:20:07,160
not while it's happening.

621
00:20:07,160 --> 00:20:08,680
The environment is ephemeral.

622
00:20:08,680 --> 00:20:10,600
It exists for the duration of the task.

623
00:20:10,600 --> 00:20:11,600
And then it's gone.

624
00:20:11,600 --> 00:20:13,160
This distinction changes what's possible.

625
00:20:13,160 --> 00:20:16,760
Copilot CLI is for incremental work: quick refactors, small features,

626
00:20:16,760 --> 00:20:18,680
tasks that finish in minutes.

627
00:20:18,680 --> 00:20:20,560
Coding agents handle the complex stuff:

628
00:20:20,560 --> 00:20:23,760
full feature implementation, major refactors, tasks

629
00:20:23,760 --> 00:20:27,720
that need the entire repository, all dependencies, the full test suite,

630
00:20:27,720 --> 00:20:29,440
and the real CI/CD pipeline.

631
00:20:29,440 --> 00:20:31,200
When you assign a task to a coding agent,

632
00:20:31,200 --> 00:20:32,760
it starts by setting up its space.

633
00:20:32,760 --> 00:20:34,240
It clones your repository.

634
00:20:34,240 --> 00:20:37,320
It installs every dependency, not just the ones on your local machine.

635
00:20:37,320 --> 00:20:39,600
It reads the code base, maps the structure,

636
00:20:39,600 --> 00:20:41,200
and identifies the architecture.

637
00:20:41,200 --> 00:20:42,520
Then it plans the implementation.

638
00:20:42,520 --> 00:20:44,720
When it writes code, it isn't working in isolation.

639
00:20:44,720 --> 00:20:46,400
It works against a real repository.

640
00:20:46,400 --> 00:20:50,080
It runs the actual tests that would run in production, not simplified versions.

641
00:20:50,080 --> 00:20:53,680
If a test fails, the agent analyzes the failure and adapts.

642
00:20:53,680 --> 00:20:56,160
It modifies the code and runs the tests again.

643
00:20:56,160 --> 00:20:58,360
This iteration happens without asking for permission

644
00:20:58,360 --> 00:21:00,440
because the agent is pursuing the goal you set.

645
00:21:00,440 --> 00:21:02,000
It works within your constraints,

646
00:21:02,000 --> 00:21:04,360
but it doesn't interrupt you for every minor failure.

647
00:21:04,360 --> 00:21:07,040
When the work is done, the agent doesn't just hand you a file.

648
00:21:07,040 --> 00:21:08,200
It runs security scans.

649
00:21:08,200 --> 00:21:09,400
It checks test coverage.

650
00:21:09,400 --> 00:21:11,480
It validates against your quality gates.

651
00:21:11,480 --> 00:21:13,760
Only after it confirms the requirements are met,

652
00:21:13,760 --> 00:21:16,600
does it open a pull request with a detailed description,

653
00:21:16,600 --> 00:21:19,480
with test results, with a summary of what changed and why.

654
00:21:19,480 --> 00:21:21,360
The pull request is the output that matters.

655
00:21:21,360 --> 00:21:22,200
It's the artifact.

656
00:21:22,200 --> 00:21:24,720
You review it, you leave comments, you request changes.

657
00:21:24,720 --> 00:21:27,360
The agent reads that feedback and understands your corrections.

658
00:21:27,360 --> 00:21:29,640
It incorporates them and pushes new commits.

659
00:21:29,640 --> 00:21:30,680
This is a feedback loop.

660
00:21:30,680 --> 00:21:32,720
It isn't generate once and accept.

661
00:21:32,720 --> 00:21:36,960
It's agent generates, you guide, agent refines.

662
00:21:36,960 --> 00:21:38,560
The environment is the enabler.

663
00:21:38,560 --> 00:21:40,360
The agent isn't guessing if the code works.

664
00:21:40,360 --> 00:21:41,280
It has everything.

665
00:21:41,280 --> 00:21:44,440
Real code, real dependencies, real security scanning.

666
00:21:44,440 --> 00:21:46,160
When the agent claims something works,

667
00:21:46,160 --> 00:21:49,200
it actually works because it's been tested against your actual system.

668
00:21:49,200 --> 00:21:50,880
The environment is also safe.

669
00:21:50,880 --> 00:21:52,760
The agent can't accidentally break production.

670
00:21:52,760 --> 00:21:53,920
It can't merge to main.

671
00:21:53,920 --> 00:21:55,720
It can't deploy without approval.

672
00:21:55,720 --> 00:21:58,880
Everything it generates appears as a pull request for you to review.

673
00:21:58,880 --> 00:22:00,200
The constraints are intentional.

674
00:22:00,200 --> 00:22:02,760
The agent can only change the repository you specified.

675
00:22:02,760 --> 00:22:05,040
It can't touch multiple repos in one run.

676
00:22:05,040 --> 00:22:07,240
It opens exactly one pull request per task.

677
00:22:07,240 --> 00:22:09,320
These aren't limitations; they're design choices.

678
00:22:09,320 --> 00:22:10,320
They force focus.

679
00:22:10,320 --> 00:22:11,680
They make the output reviewable.

680
00:22:11,680 --> 00:22:15,120
They prevent runaway behavior where an agent makes cascading changes

681
00:22:15,120 --> 00:22:16,600
across your entire system.

682
00:22:16,600 --> 00:22:19,360
On real code bases, these agents have shortened cycle times.

683
00:22:19,360 --> 00:22:22,720
Technical debt has actually decreased because the agent works within the constraints

684
00:22:22,720 --> 00:22:25,080
you set and the review process enforces them.

685
00:22:25,080 --> 00:22:26,560
The structural shift is clear.

686
00:22:26,560 --> 00:22:31,480
Execution moves from "you write code locally and push when ready" to "the agent executes

687
00:22:31,480 --> 00:22:35,120
the full workflow in a controlled space while you review the output."

688
00:22:35,120 --> 00:22:37,760
But these layers only work if they're connected.

689
00:22:37,760 --> 00:22:40,160
How the layers connect: the orchestration problem.

690
00:22:40,160 --> 00:22:43,840
The four layers exist independently, but that's only useful if they're actually talking

691
00:22:43,840 --> 00:22:44,840
to each other.

692
00:22:44,840 --> 00:22:45,840
This is where most teams fail.

693
00:22:45,840 --> 00:22:47,160
They deploy the agents.

694
00:22:47,160 --> 00:22:49,120
They optimize each layer separately.

695
00:22:49,120 --> 00:22:50,680
But they never wire them together.

696
00:22:50,680 --> 00:22:52,520
The orchestration layer plans something.

697
00:22:52,520 --> 00:22:54,440
The transformation layer generates something different.

698
00:22:54,440 --> 00:22:56,960
The validation layer rejects it and nobody knows why.

699
00:22:56,960 --> 00:23:00,840
The system becomes a mess of disconnected agents producing conflicting outputs.

700
00:23:00,840 --> 00:23:02,040
This is the orchestration problem.

701
00:23:02,040 --> 00:23:03,640
How do you connect agentic systems

702
00:23:03,640 --> 00:23:06,280
so they reinforce each other instead of creating bottlenecks?

703
00:23:06,280 --> 00:23:11,640
The answer is structured handoffs: not loose communication, not "Hey agent, go do a thing,"

704
00:23:11,640 --> 00:23:14,560
but a structured exchange of information with clear expectations.

705
00:23:14,560 --> 00:23:16,520
Here's what a real workflow looks like.

706
00:23:16,520 --> 00:23:20,920
A developer opens Copilot CLI to plan a refactor that touches multiple files.

707
00:23:20,920 --> 00:23:22,080
They describe the goal.

708
00:23:22,080 --> 00:23:26,480
The CLI agent plans the workflow, breaks it into steps and identifies which files need

709
00:23:26,480 --> 00:23:27,480
to change.

710
00:23:27,480 --> 00:23:28,480
The plan isn't vague.

711
00:23:28,480 --> 00:23:31,200
It's a specific document that another system can read and understand.

712
00:23:31,200 --> 00:23:35,760
The developer reviews the plan. When they approve it, the plan moves to the modernization agent.

713
00:23:35,760 --> 00:23:37,840
Not as a suggestion, but as a specification.

714
00:23:37,840 --> 00:23:40,400
The transformation layer reads the plan and executes it.

715
00:23:40,400 --> 00:23:43,800
It implements the refactor, writes the code, and runs tests locally.

716
00:23:43,800 --> 00:23:47,520
If tests fail, it adapts and refines the approach based on what it learned.

717
00:23:47,520 --> 00:23:49,960
When the refactor is finished, the output moves to code review.

718
00:23:49,960 --> 00:23:51,560
Again, this is structured.

719
00:23:51,560 --> 00:23:55,160
The validation layer reads the change and checks for architecture violations.

720
00:23:55,160 --> 00:23:57,960
It looks for security issues and test coverage gaps.

721
00:23:57,960 --> 00:24:01,520
If validation passes, the PR is ready. If it fails, feedback goes back,

722
00:24:01,520 --> 00:24:04,120
not to the developer, but to the transformation layer.

723
00:24:04,120 --> 00:24:05,760
The agent reads the feedback,

724
00:24:05,760 --> 00:24:08,240
understands what didn't work, and modifies the implementation.

725
00:24:08,240 --> 00:24:10,280
The coding agent handles the environment.

726
00:24:10,280 --> 00:24:11,280
It manages the branch.

727
00:24:11,280 --> 00:24:12,720
It runs the full test suite.

728
00:24:12,720 --> 00:24:16,000
It opens the PR with documentation and waits for human feedback.

729
00:24:16,000 --> 00:24:17,000
The loop continues.

730
00:24:17,000 --> 00:24:18,000
The human leaves comments.

731
00:24:18,000 --> 00:24:20,640
The agent reads them and the agent pushes new commits.

732
00:24:20,640 --> 00:24:21,640
This isn't linear.

733
00:24:21,640 --> 00:24:22,640
It's a mesh.

734
00:24:22,640 --> 00:24:24,640
Each layer can communicate with every other layer.

735
00:24:24,640 --> 00:24:27,840
The execution layer can tell orchestration that it found an edge case the plan didn't

736
00:24:27,840 --> 00:24:28,840
anticipate.

737
00:24:28,840 --> 00:24:31,520
Orchestration can then trigger a new workflow to handle it.

738
00:24:31,520 --> 00:24:35,200
The validation layer can tell transformation that a specific pattern doesn't work in this

739
00:24:35,200 --> 00:24:36,680
architecture.

740
00:24:36,680 --> 00:24:37,920
Transformation adapts.

741
00:24:37,920 --> 00:24:39,600
The human sits above this entire system.

742
00:24:39,600 --> 00:24:41,000
You aren't out of the loop.

743
00:24:41,000 --> 00:24:42,480
You're in a different position in the loop.

744
00:24:42,480 --> 00:24:43,480
You set policy.

745
00:24:43,480 --> 00:24:44,480
You define guardrails.

746
00:24:44,480 --> 00:24:46,880
You handle the exceptions where judgment is needed.

747
00:24:46,880 --> 00:24:48,840
The agents handle the mechanical work.

748
00:24:48,840 --> 00:24:50,520
What makes this possible is clarity.

749
00:24:50,520 --> 00:24:52,880
Each layer has a defined interface.

750
00:24:52,880 --> 00:24:54,480
Orchestration outputs a plan.

751
00:24:54,480 --> 00:24:59,240
Transformation accepts that plan and outputs code. Validation accepts code and outputs a review.

752
00:24:59,240 --> 00:25:02,160
Because the interfaces are clean, layers can be swapped.

753
00:25:02,160 --> 00:25:04,920
You could replace the transformation layer with a different agent.

754
00:25:04,920 --> 00:25:07,520
And as long as the input format is the same, it works.

755
00:25:07,520 --> 00:25:09,360
The system is composable.
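
The idea of swappable layers with defined interfaces can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Plan`, `CodeChange`, `Review`, `EchoTransformer`, and `NonEmptyValidator` are hypothetical and do not come from any real agent framework. The point is that any transformation layer that accepts a plan and returns a change can be dropped in without touching validation.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Plan:
    goal: str
    steps: list[str]


@dataclass
class CodeChange:
    plan: Plan  # carries the intent forward so later layers have context
    diff: str


@dataclass
class Review:
    approved: bool
    feedback: list[str]


class Transformation(Protocol):
    def apply(self, plan: Plan) -> CodeChange: ...


class Validation(Protocol):
    def check(self, change: CodeChange) -> Review: ...


class EchoTransformer:
    """Toy transformation layer: turns each plan step into one diff line."""

    def apply(self, plan: Plan) -> CodeChange:
        diff = "\n".join(f"+ {step}" for step in plan.steps)
        return CodeChange(plan=plan, diff=diff)


class NonEmptyValidator:
    """Toy validation layer: rejects changes that contain no diff."""

    def check(self, change: CodeChange) -> Review:
        if change.diff.strip():
            return Review(approved=True, feedback=[])
        return Review(approved=False, feedback=["diff is empty"])


def run_pipeline(plan: Plan, transformer: Transformation, validator: Validation) -> Review:
    # Orchestration hands a plan to transformation, whose output goes to validation.
    return validator.check(transformer.apply(plan))
```

Replacing `EchoTransformer` with another class that has the same `apply` signature leaves `run_pipeline` and the validator unchanged, which is the composability the episode describes.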

756
00:25:09,360 --> 00:25:11,400
But none of this works without continuous data flow.

757
00:25:11,400 --> 00:25:13,760
Each layer needs context from the previous ones.

758
00:25:13,760 --> 00:25:17,080
The validation layer needs to know what the transformation layer was trying to accomplish.

759
00:25:17,080 --> 00:25:20,400
The execution layer needs to know what orchestration decided about priority.

760
00:25:20,400 --> 00:25:22,800
Without that, the agents are working in isolation.

761
00:25:22,800 --> 00:25:25,240
This is why Project Memory is foundational.

762
00:25:25,240 --> 00:25:28,920
Architecture.md, gotchas.md, test strategy.md.

763
00:25:28,920 --> 00:25:31,640
These files are read by every agent to provide continuity.

764
00:25:31,640 --> 00:25:35,320
They ensure that each layer understands the context from all the others.

765
00:25:35,320 --> 00:25:37,440
Without shared context, each layer is isolated.

766
00:25:37,440 --> 00:25:38,760
It makes local decisions.

767
00:25:38,760 --> 00:25:42,600
Good output from layer one becomes bad output from layer two because they aren't talking.

768
00:25:42,600 --> 00:25:44,800
With shared context, they work as a system.

769
00:25:44,800 --> 00:25:47,920
Each layer builds on what the previous layer learned.

770
00:25:47,920 --> 00:25:49,280
The context problem.

771
00:25:49,280 --> 00:25:51,440
Why most modernization agents fail.

772
00:25:51,440 --> 00:25:55,680
Most organizations follow a predictable pattern when they deploy a modernization agent.

773
00:25:55,680 --> 00:25:57,960
They point the tool at a mountain of old code.

774
00:25:57,960 --> 00:25:59,600
They expect magic.

775
00:25:59,600 --> 00:26:00,960
The command is simple.

776
00:26:00,960 --> 00:26:02,560
Here is the legacy system.

777
00:26:02,560 --> 00:26:05,600
Make it modern. Then they wait.

778
00:26:05,600 --> 00:26:07,040
At first it looks like it's working.

779
00:26:07,040 --> 00:26:11,320
The agent runs. It scans the code. It generates output that actually compiles. The tests

780
00:26:11,320 --> 00:26:14,040
pass. The team celebrates.

781
00:26:14,040 --> 00:26:16,520
But then, reality hits.

782
00:26:16,520 --> 00:26:19,280
The code is technically correct, but it doesn't actually fit.

783
00:26:19,280 --> 00:26:22,760
The syntax is modern, the framework is new, but the architecture is a mess.

784
00:26:22,760 --> 00:26:24,960
It violates the patterns your system relies on.

785
00:26:24,960 --> 00:26:26,360
It introduces hidden coupling.

786
00:26:26,360 --> 00:26:28,360
It completely misses the business logic.

787
00:26:28,360 --> 00:26:30,640
What you've actually built is modern technical debt.

788
00:26:30,640 --> 00:26:32,200
It's the old broken system,

789
00:26:32,200 --> 00:26:33,400
just wearing a new set of clothes.

790
00:26:33,400 --> 00:26:36,240
Most teams fail at modernization for one specific reason.

791
00:26:36,240 --> 00:26:37,560
The agent didn't have context.

792
00:26:37,560 --> 00:26:40,240
Think of it like asking a contractor to renovate your house.

793
00:26:40,240 --> 00:26:41,760
But you don't give them the floor plan.

794
00:26:41,760 --> 00:26:43,560
You don't show them the building codes.

795
00:26:43,560 --> 00:26:44,960
You don't even give them a budget.

796
00:26:44,960 --> 00:26:45,960
The contractor can work.

797
00:26:45,960 --> 00:26:47,320
They can swap out parts.

798
00:26:47,320 --> 00:26:48,840
But they don't know what they're building toward.

799
00:26:48,840 --> 00:26:50,240
They don't understand the constraints.

800
00:26:50,240 --> 00:26:52,480
They don't know what actually matters to the structure.

801
00:26:52,480 --> 00:26:54,120
The agent is operating blind.

802
00:26:54,120 --> 00:26:57,680
Context engineering is the shift that separates success from total failure.

803
00:26:57,680 --> 00:26:59,080
Most teams skip this work.

804
00:26:59,080 --> 00:27:01,480
They assume the agent is supposed to figure it out on its own.

805
00:27:01,480 --> 00:27:02,480
It doesn't.

806
00:27:02,480 --> 00:27:04,440
You have to start with code archaeology.

807
00:27:04,440 --> 00:27:08,000
Force the agent to analyze the legacy system before it touches a single line.

808
00:27:08,000 --> 00:27:09,680
It needs to see the hidden dependencies.

809
00:27:09,680 --> 00:27:12,560
It needs to find the entry points where data flows in.

810
00:27:12,560 --> 00:27:16,320
It has to map the coupling between modules that should have stayed separate.

811
00:27:16,320 --> 00:27:19,960
The agent needs to find the business logic buried deep inside messy conditionals.

812
00:27:19,960 --> 00:27:23,760
It needs to see the gotchas that will break the system if they're changed.

813
00:27:23,760 --> 00:27:27,920
Once the agent truly understands the system, it generates documentation.

814
00:27:27,920 --> 00:27:31,480
Not generic summaries, specific structural documents.

815
00:27:31,480 --> 00:27:32,680
What does this module actually do?

816
00:27:32,680 --> 00:27:33,960
What are the key workflows?

817
00:27:33,960 --> 00:27:35,640
What assumptions is the system making?

818
00:27:35,640 --> 00:27:37,240
What will break the moment you touch it?

819
00:27:37,240 --> 00:27:39,120
These files become the project memory.

820
00:27:39,120 --> 00:27:40,120
Architecture.md.

821
00:27:40,120 --> 00:27:41,600
Domain glossary.md.

822
00:27:41,600 --> 00:27:43,000
gotchas.md.

823
00:27:43,000 --> 00:27:44,360
You keep these in your repository.

824
00:27:44,360 --> 00:27:45,560
They aren't static.

825
00:27:45,560 --> 00:27:47,440
They evolve as the project moves.

826
00:27:47,440 --> 00:27:48,440
But here's the problem.

827
00:27:48,440 --> 00:27:50,840
Most people think context is a one-time task.

828
00:27:50,840 --> 00:27:52,160
In reality, it's iterative.

829
00:27:52,160 --> 00:27:55,000
When the agent finishes a task, you don't just merge and move on.

830
00:27:55,000 --> 00:27:57,600
You update the memory files with everything you just learned.

831
00:27:57,600 --> 00:27:58,680
You add the new gotchas.

832
00:27:58,680 --> 00:27:59,840
You refine the glossary.

833
00:27:59,840 --> 00:28:02,320
The next time the agent runs, it has more context.

834
00:28:02,320 --> 00:28:03,600
It has a better understanding.

835
00:28:03,600 --> 00:28:04,880
It generates better code.

836
00:28:04,880 --> 00:28:09,560
This feedback loop is what keeps your project out of the graveyard of failed modernization attempts.
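
A minimal sketch of that feedback loop, assuming the memory lives in plain Markdown files at the repository root. The file names below are hypothetical spellings of the files mentioned in the episode, and `load_memory` and `record_gotcha` are illustrative helpers, not part of any real tool.

```python
from pathlib import Path

# Hypothetical file names; use whatever your repository actually contains.
MEMORY_FILES = ("architecture.md", "domain-glossary.md", "gotchas.md")


def load_memory(repo_root: Path) -> str:
    """Concatenate the memory files that exist into one context string."""
    sections = []
    for name in MEMORY_FILES:
        path = repo_root / name
        if path.exists():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)


def record_gotcha(repo_root: Path, note: str) -> None:
    """Append a newly learned gotcha so the next agent run can see it."""
    path = repo_root / "gotchas.md"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {note}\n")
```

After each merged task you would call `record_gotcha` with whatever the team learned, and the next run's `load_memory` output would include it.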

837
00:28:09,560 --> 00:28:11,080
The research is clear on this.

838
00:28:11,080 --> 00:28:15,480
Teams that maintain a living project memory see 30 to 60% better outcomes.

839
00:28:15,480 --> 00:28:16,800
It isn't a small tweak.

840
00:28:16,800 --> 00:28:20,160
That is the difference between shipping a product and watching a project die.

841
00:28:20,160 --> 00:28:21,760
The structural insight is this.

842
00:28:21,760 --> 00:28:23,480
Modernization isn't about the agent.

843
00:28:23,480 --> 00:28:25,000
It's about the context you give it.

844
00:28:25,000 --> 00:28:27,880
This is why context engineering is becoming its own discipline.

845
00:28:27,880 --> 00:28:31,440
Senior engineers are now designing how agents consume information.

846
00:28:31,440 --> 00:28:34,200
They create pattern files to show how components should look.

847
00:28:34,200 --> 00:28:37,400
They write rule files that encode the architectural constraints.

848
00:28:37,400 --> 00:28:38,760
These aren't just technical rules.

849
00:28:38,760 --> 00:28:40,160
They are business rules.

850
00:28:40,160 --> 00:28:41,440
Data constraints.

851
00:28:41,440 --> 00:28:43,200
Integration patterns.

852
00:28:43,200 --> 00:28:45,720
The agent uses these files as a constant reference.

853
00:28:45,720 --> 00:28:47,200
It doesn't just make code.

854
00:28:47,200 --> 00:28:49,760
It makes code that actually fits your specific system.

855
00:28:49,760 --> 00:28:51,360
Does this take more work upfront?

856
00:28:51,360 --> 00:28:53,800
Yes, you're building structures that wouldn't exist otherwise.

857
00:28:53,800 --> 00:28:55,200
But the payoff is massive.

858
00:28:55,200 --> 00:28:56,800
The agent generates better code.

859
00:28:56,800 --> 00:28:58,120
Better code needs less review.

860
00:28:58,120 --> 00:28:59,600
Less review means less rework.

861
00:28:59,600 --> 00:29:03,160
And that is how a modernization project actually finishes, instead of stalling out halfway

862
00:29:03,160 --> 00:29:04,480
through.

863
00:29:04,480 --> 00:29:07,800
But even with perfect context, agents make mistakes.

864
00:29:07,800 --> 00:29:10,400
And that's where validation becomes critical.

865
00:29:10,400 --> 00:29:13,120
The rubber duck protocol: cross-model review.

866
00:29:13,120 --> 00:29:17,400
Most organizations expect the best results to come from the biggest models. But in reality,

867
00:29:17,400 --> 00:29:21,640
a mid-tier model with strong validation beats a top-tier model working alone.

868
00:29:21,640 --> 00:29:23,720
GitHub's internal research proved this.

869
00:29:23,720 --> 00:29:27,920
Claude Sonnet 3.5 is faster and cheaper than Claude Opus, but it's also less capable when

870
00:29:27,920 --> 00:29:29,360
things get difficult.

871
00:29:29,360 --> 00:29:30,360
That's the trade-off.

872
00:29:30,360 --> 00:29:33,800
You get speed, but you lose some depth, until you pair it with a reviewer from a different

873
00:29:33,800 --> 00:29:35,000
model family.

874
00:29:35,000 --> 00:29:39,000
When they paired Claude Sonnet with GPT-4 as a reviewer, something happened.

875
00:29:39,000 --> 00:29:42,760
It closed nearly 75% of the performance gap between Sonnet and Opus.

876
00:29:42,760 --> 00:29:43,760
It's a massive shift.

877
00:29:43,760 --> 00:29:46,840
It means you get top-tier results without paying top-tier prices.

878
00:29:46,840 --> 00:29:49,960
But this only works because the reviewer comes from a different family.

879
00:29:49,960 --> 00:29:53,720
If you have Claude Sonnet write a plan and then ask Claude Sonnet to review it,

880
00:29:53,720 --> 00:29:55,200
you won't see much change.

881
00:29:55,200 --> 00:29:57,000
The model is just thinking the same way twice.

882
00:29:57,000 --> 00:29:58,680
It will repeat its own blind spots.

883
00:29:58,680 --> 00:30:01,160
It will miss the same things it missed the first time.

884
00:30:01,160 --> 00:30:02,800
Different model families have different training.

885
00:30:02,800 --> 00:30:04,280
They have different architectures.

886
00:30:04,280 --> 00:30:05,640
They have different ways of failing.

887
00:30:05,640 --> 00:30:06,640
What Claude misses,

888
00:30:06,640 --> 00:30:08,240
GPT might catch instantly.

889
00:30:08,240 --> 00:30:10,240
What GPT overlooks, Claude will flag.

890
00:30:10,240 --> 00:30:12,040
When you combine them, the gaps disappear.

891
00:30:12,040 --> 00:30:14,280
This is the foundation of the rubber duck protocol.

892
00:30:14,280 --> 00:30:15,280
One model generates.

893
00:30:15,280 --> 00:30:17,360
A different model family reviews.

894
00:30:17,360 --> 00:30:19,720
The reviewer's job isn't to say "looks good."

895
00:30:19,720 --> 00:30:23,800
Its job is to challenge the plan, to find the blind spots, to catch what the first model

896
00:30:23,800 --> 00:30:24,800
missed.

897
00:30:24,800 --> 00:30:27,040
Inside the Copilot CLI, it looks like this.

898
00:30:27,040 --> 00:30:28,080
You describe a task.

899
00:30:28,080 --> 00:30:31,360
The primary agent, maybe a Claude model, reads your intent.

900
00:30:31,360 --> 00:30:32,360
It looks at your code.

901
00:30:32,360 --> 00:30:33,360
It builds a plan.

902
00:30:33,360 --> 00:30:35,680
But before it executes, that plan goes to the rubber duck.

903
00:30:35,680 --> 00:30:38,240
The rubber duck is a different model, usually GPT.

904
00:30:38,240 --> 00:30:40,760
It reads the plan and looks for gaps in the reasoning.

905
00:30:40,760 --> 00:30:42,480
It looks for unjustified assumptions.

906
00:30:42,480 --> 00:30:44,560
It finds the edge cases that weren't considered.

907
00:30:44,560 --> 00:30:48,480
When a task touches five different modules with complex dependencies,

908
00:30:48,480 --> 00:30:49,880
that's when the review matters.

909
00:30:49,880 --> 00:30:52,080
A simple change might be fine on its own.

910
00:30:52,080 --> 00:30:54,800
But a complex refactor is where the rubber duck earns its keep.

911
00:30:54,800 --> 00:30:58,080
It finds the architectural flaw the primary agent ignored.

912
00:30:58,080 --> 00:31:01,040
It identifies the edge case that would have crashed in production.

913
00:31:01,040 --> 00:31:04,560
It catches the pattern violation before it creates more technical debt.

914
00:31:04,560 --> 00:31:06,120
The timing here is important.

915
00:31:06,120 --> 00:31:08,000
The reviewer doesn't block the work.

916
00:31:08,000 --> 00:31:11,240
The primary agent starts implementing while the rubber duck is still thinking.

917
00:31:11,240 --> 00:31:13,040
It's asynchronous.

918
00:31:13,040 --> 00:31:15,960
When the review comes back, the agent incorporates that feedback.

919
00:31:15,960 --> 00:31:16,960
It refines the plan.

920
00:31:16,960 --> 00:31:17,960
It updates the code.

921
00:31:17,960 --> 00:31:19,480
It isn't stop and wait.

922
00:31:19,480 --> 00:31:21,960
It's work in parallel and improve as you go.
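
The non-blocking review described here can be sketched with a background thread. Everything in this example is a stand-in: `primary_plan`, `duck_review`, and `implement` are hypothetical stubs where a real setup would call two different model families, and this is not how the Copilot CLI is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def primary_plan(task: str) -> list[str]:
    # Stub for the primary agent's planning step.
    return [f"analyze {task}", f"refactor {task}", "run tests"]


def duck_review(plan: list[str]) -> list[str]:
    # Stub for a reviewer from a different model family: it challenges the plan.
    issues = []
    if not any("rollback" in step for step in plan):
        issues.append("no rollback step")
    return issues


def implement(plan: list[str]) -> list[str]:
    # Stub for the implementation work the primary agent performs.
    return [f"done: {step}" for step in plan]


def run(task: str) -> list[str]:
    plan = primary_plan(task)
    with ThreadPoolExecutor(max_workers=1) as pool:
        review = pool.submit(duck_review, plan)  # review starts in the background
        work = implement(plan)                   # implementation proceeds meanwhile
        issues = review.result()                 # collect feedback when it is ready
    # Fold the reviewer's feedback into the work instead of restarting.
    for issue in issues:
        work.append(f"follow-up: address '{issue}'")
    return work
```

The design choice worth noting is that the review result is consumed after implementation has started, so the reviewer shapes the output without gating it.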

923
00:31:21,960 --> 00:31:23,440
The results are measurable.

924
00:31:23,440 --> 00:31:28,400
On complex, multi-file tasks, this cross-model review improved outcomes by nearly five percentage

925
00:31:28,400 --> 00:31:29,400
points.

926
00:31:29,400 --> 00:31:32,760
That is a consistent improvement across almost every type of problem.

927
00:31:32,760 --> 00:31:35,040
But the real takeaway isn't about raw power.

928
00:31:35,040 --> 00:31:36,040
It's about collaboration.

929
00:31:36,040 --> 00:31:38,680
It's about using the differences between models as a feature.

930
00:31:38,680 --> 00:31:40,600
The structural shift is subtle.

931
00:31:40,600 --> 00:31:42,080
Validation is no longer a single judgment.

932
00:31:42,080 --> 00:31:44,760
It becomes a conversation between two different ways of thinking.

933
00:31:44,760 --> 00:31:46,400
The primary agent proposes.

934
00:31:46,400 --> 00:31:47,640
The reviewer challenges.

935
00:31:47,640 --> 00:31:49,280
The agent refines.

936
00:31:49,280 --> 00:31:51,520
This is the opposite of the bigger-is-better mindset.

937
00:31:51,520 --> 00:31:53,520
It's the collaboration-is-better model.

938
00:31:53,520 --> 00:31:54,800
It changes what you need to buy.

939
00:31:54,800 --> 00:31:57,720
You don't need the most expensive model for every single task.

940
00:31:57,720 --> 00:32:00,240
You need a capable model paired with a smart reviewer.

941
00:32:00,240 --> 00:32:02,480
The cost goes down, but the results stay high.

942
00:32:02,480 --> 00:32:04,360
It changes how you think about validation.

943
00:32:04,360 --> 00:32:05,880
It isn't a gate at the end of the road.

944
00:32:05,880 --> 00:32:07,240
It's a participant in the process.

945
00:32:07,240 --> 00:32:11,840
The reviewer shapes the output while it's being made, not after it's already finished.

946
00:32:11,840 --> 00:32:15,200
But none of this works if the agents can't actually understand the code they're supposed

947
00:32:15,200 --> 00:32:16,920
to be reviewing.

948
00:32:16,920 --> 00:32:19,960
Building the agent's mental model: codebase understanding.

949
00:32:19,960 --> 00:32:23,040
An agent's decisions are only as good as its understanding of the work.

950
00:32:23,040 --> 00:32:24,320
I'm not talking about syntax.

951
00:32:24,320 --> 00:32:25,320
Any parser can read code.

952
00:32:25,320 --> 00:32:29,360
I'm talking about semantic understanding, architectural understanding, and the why behind

953
00:32:29,360 --> 00:32:30,360
the shape of your system.

954
00:32:30,360 --> 00:32:32,160
A codebase isn't just a collection of files.

955
00:32:32,160 --> 00:32:34,400
It's a series of patterns that evolved over years.

956
00:32:34,400 --> 00:32:36,920
It's conventions that nobody ever bothered to write down.

957
00:32:36,920 --> 00:32:40,720
It's decisions made under pressure during a midnight outage that never got documented.

958
00:32:40,720 --> 00:32:44,800
It's the history and the context of a thousand small choices that make your system unique

959
00:32:44,800 --> 00:32:46,120
instead of a generic app.

960
00:32:46,120 --> 00:32:49,720
The agent needs to absorb all of that before it ever touches a line of code.

961
00:32:49,720 --> 00:32:53,320
It starts with the module structure and how things actually depend on each other.

962
00:32:53,320 --> 00:32:54,520
You don't just need a file list.

963
00:32:54,520 --> 00:32:56,000
You need the real dependency graph.

964
00:32:56,000 --> 00:32:59,880
The agent has to know what calls what, and more importantly, what is forbidden from calling

965
00:32:59,880 --> 00:33:00,880
what.

966
00:33:00,880 --> 00:33:04,000
It needs to see the architectural boundaries where the API layer ends and the business

967
00:33:04,000 --> 00:33:05,000
logic begins.

968
00:33:05,000 --> 00:33:08,680
It has to understand which lines can be crossed and which ones are hard walls.

969
00:33:08,680 --> 00:33:12,480
Then there are the coding patterns, not what the language manual says, but what your team

970
00:33:12,480 --> 00:33:16,720
actually does, how you name variables, how you structure your classes, and how you organize

971
00:33:16,720 --> 00:33:17,840
your functions.

972
00:33:17,840 --> 00:33:19,480
The agent needs the history of your choices.

973
00:33:19,480 --> 00:33:23,640
It needs to know why you chose this specific framework or why you split the system this

974
00:33:23,640 --> 00:33:24,960
way three years ago.

975
00:33:24,960 --> 00:33:28,960
It needs to see the constraints, the weird edge cases, and the landmines that will explode

976
00:33:28,960 --> 00:33:29,960
if they're touched.

977
00:33:29,960 --> 00:33:32,880
This is where context engineering becomes the foundation of the work.

978
00:33:32,880 --> 00:33:35,840
You don't just hand an agent a repo and hope for the best.

979
00:33:35,840 --> 00:33:40,560
You deliberately build structures that teach the agent how the system lives.

980
00:33:40,560 --> 00:33:43,360
Architecture decision records or ADRs are the first tool for this.

981
00:33:43,360 --> 00:33:44,760
They don't just record a decision.

982
00:33:44,760 --> 00:33:48,040
They record the problem, the alternatives, and the trade-offs.

983
00:33:48,040 --> 00:33:52,680
When the agent reads these, it finally understands the reasoning behind your architecture.

984
00:33:52,680 --> 00:33:54,160
Pattern files are the second tool.

985
00:33:54,160 --> 00:33:58,040
These show the right way to solve a problem in your specific code base.

986
00:33:58,040 --> 00:34:02,040
It's an example of how you structure a service or handle data validation.

987
00:34:02,040 --> 00:34:06,200
The agent studies these and uses them as the template for everything it builds next.

988
00:34:06,200 --> 00:34:08,680
Structural trees show how components are organized.

989
00:34:08,680 --> 00:34:11,160
Not as a long essay, but as a clear hierarchy.

990
00:34:11,160 --> 00:34:15,240
The agent uses these trees to make sure its own output actually fits into the world you've

991
00:34:15,240 --> 00:34:16,240
defined.

992
00:34:16,240 --> 00:34:17,800
Context windows are the final lever.

993
00:34:17,800 --> 00:34:20,920
When the agent starts a task, you don't dump the whole code base into its lap.

994
00:34:20,920 --> 00:34:25,760
You give it the surrounding code, the interacting modules, and the specific rules for that area.

995
00:34:25,760 --> 00:34:28,240
This is the principle of context over coverage.

996
00:34:28,240 --> 00:34:32,400
A deep understanding of what matters right now always beats a shallow understanding of everything.
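
One way to picture context over coverage is a selector that pulls in only the target module, its direct neighbors in the dependency graph, and the rules for that area. This is a sketch under assumptions: the `depends_on` graph shape, the `(area_prefix, rule)` pairs, and the `select_context` function are all hypothetical.

```python
def select_context(target: str,
                   depends_on: dict[str, list[str]],
                   rules: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Pick the modules and rules relevant to one task instead of the whole repo."""
    # Modules that call the target (reverse edges in the graph).
    callers = {module for module, deps in depends_on.items() if target in deps}
    # Modules the target calls, plus its callers.
    neighbors = set(depends_on.get(target, ())) | callers
    modules = [target] + sorted(neighbors - {target})
    # Keep only the rules whose area prefix matches the target module.
    relevant_rules = [rule for area, rule in rules if target.startswith(area)]
    return {"modules": modules, "rules": relevant_rules}
```

For a task on a billing module, this hands the agent the billing code, what it calls, what calls it, and the billing rules, while unrelated reporting code stays out of the window.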

997
00:34:32,400 --> 00:34:34,600
The agent's mental model builds through these layers.

998
00:34:34,600 --> 00:34:39,160
Each one adds a new level of clarity until the agent has a coherent picture of the system.

999
00:34:39,160 --> 00:34:41,240
This process starts with code archaeology.

1000
00:34:41,240 --> 00:34:44,920
Before the agent transforms anything, it analyzes the system and builds its model.

1001
00:34:44,920 --> 00:34:47,760
It generates documentation that becomes the project's memory.

1002
00:34:47,760 --> 00:34:50,000
Over time, that memory gets refined and sharpened.

1003
00:34:50,000 --> 00:34:53,200
The quality of what the agent produces is directly tied to the quality of this mental

1004
00:34:53,200 --> 00:34:54,200
model.

1005
00:34:54,200 --> 00:34:55,880
A strong mental model produces great code.

1006
00:34:55,880 --> 00:34:59,640
But a weak model produces code that's technically correct while being architecturally broken.

1007
00:34:59,640 --> 00:35:01,240
This isn't a one-and-done task.

1008
00:35:01,240 --> 00:35:04,200
As your system grows, new patterns and constraints will emerge.

1009
00:35:04,200 --> 00:35:08,160
The documentation has to grow with it. ADRs need updates when decisions change, and pattern

1010
00:35:08,160 --> 00:35:10,880
files need a refresh when your standards evolve.

1011
00:35:10,880 --> 00:35:12,640
Treat context as a living artifact.

1012
00:35:12,640 --> 00:35:14,240
Don't build it once and forget it.

1013
00:35:14,240 --> 00:35:17,400
You have to maintain it because it's the only thing determining the quality of what your

1014
00:35:17,400 --> 00:35:18,880
agents can actually do.

1015
00:35:18,880 --> 00:35:20,920
Safety and guardrails: bounded autonomy.

1016
00:35:20,920 --> 00:35:23,920
There is a tension we have to resolve before agents can work at scale.

1017
00:35:23,920 --> 00:35:27,200
You want them to be smart enough to make real decisions, but you can't have them being

1018
00:35:27,200 --> 00:35:30,960
so autonomous that they wreck production while you're grabbing coffee.

1019
00:35:30,960 --> 00:35:35,200
This is the problem of bounded autonomy, providing freedom within a set of constraints.

1020
00:35:35,200 --> 00:35:37,440
The solution is a system of layered guardrails.

1021
00:35:37,440 --> 00:35:41,400
These have to be so fundamental to the environment that the agent doesn't even think about bypassing

1022
00:35:41,400 --> 00:35:42,400
them.

1023
00:35:42,400 --> 00:35:45,680
It starts with permissioned guardrails. The agent can touch specific files or modify certain

1024
00:35:45,680 --> 00:35:46,680
services.

1025
00:35:46,680 --> 00:35:49,520
But it can't delete a database without a human saying yes.

1026
00:35:49,520 --> 00:35:52,400
You define the perimeter and the agent stays inside.

1027
00:35:52,400 --> 00:35:54,000
It doesn't stay there because it's being forced.

1028
00:35:54,000 --> 00:35:56,640
It stays there because it understands that's how the system works.

1029
00:35:56,640 --> 00:36:00,800
It reads the policy, builds it into the plan, and accounts for those limits before it ever

1030
00:36:00,800 --> 00:36:02,160
generates code.

1031
00:36:02,160 --> 00:36:03,640
Policy guardrails are the next layer.

1032
00:36:03,640 --> 00:36:05,840
These are the laws of your organization.

1033
00:36:05,840 --> 00:36:11,520
No hard-coded secrets, no direct database access from the front end, no circular dependencies.

1034
00:36:11,520 --> 00:36:12,920
These aren't just suggestions.

1035
00:36:12,920 --> 00:36:14,160
They are hard rules

1036
00:36:14,160 --> 00:36:15,600
the agent has to follow.

1037
00:36:15,600 --> 00:36:19,040
It treats these policies as the primary constraints on its work.

1038
00:36:19,040 --> 00:36:20,480
Validation guardrails are the third layer.

1039
00:36:20,480 --> 00:36:23,120
The agent's work has to pass through specific gates.

1040
00:36:23,120 --> 00:36:27,240
The tests have to pass, the linters have to be clean, and the security scans have to be green.

1041
00:36:27,240 --> 00:36:28,800
The agent knows these checks are coming.

1042
00:36:28,800 --> 00:36:32,640
It builds toward passing them, and if it fails a check, it goes back and adapts.

1043
00:36:32,640 --> 00:36:34,840
Review guardrails are the final safety net.

1044
00:36:34,840 --> 00:36:37,160
The agent's output never goes straight to production.

1045
00:36:37,160 --> 00:36:38,640
It goes to a human first.

1046
00:36:38,640 --> 00:36:43,280
High-risk changes always need an explicit sign-off, while lower-risk tasks might be auto-approved

1047
00:36:43,280 --> 00:36:45,000
if they clear every validation gate.
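
The four guardrail layers can be sketched as one ordered check. The path patterns, the secret regex, and the decision strings are illustrative assumptions only; a real setup would use your own policy source and a proper secret scanner rather than a single regular expression.

```python
import re
from fnmatch import fnmatch

# Hypothetical perimeter: where the agent may write, and which areas are high risk.
ALLOWED_PATHS = ["src/*", "tests/*"]
HIGH_RISK_PATHS = ["src/payments/*"]
# Deliberately crude stand-in for a secret scanner.
SECRET_PATTERN = re.compile(
    r"(api_key|password|secret)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
)


def evaluate_change(path: str, new_content: str, checks_passed: bool) -> str:
    """Run a proposed change through the guardrail layers in order."""
    # 1. Permission guardrail: is the file inside the agent's perimeter?
    if not any(fnmatch(path, pattern) for pattern in ALLOWED_PATHS):
        return "blocked: path outside agent perimeter"
    # 2. Policy guardrail: no hard-coded secrets.
    if SECRET_PATTERN.search(new_content):
        return "blocked: hard-coded secret"
    # 3. Validation guardrail: tests, linters, and scans must be green.
    if not checks_passed:
        return "retry: validation gates failed"
    # 4. Review guardrail: high-risk areas always need a human.
    if any(fnmatch(path, pattern) for pattern in HIGH_RISK_PATHS):
        return "needs human sign-off"
    return "auto-approve"
```

The ordering matters: the cheap, absolute checks run first, and the human is only asked once the mechanical gates have passed.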

1048
00:36:45,000 --> 00:36:48,280
The human stays in the loop because that's how the workflow is designed.

1049
00:36:48,280 --> 00:36:50,440
Here's the most important thing about guardrails.

1050
00:36:50,440 --> 00:36:51,720
They aren't restrictions.

1051
00:36:51,720 --> 00:36:54,440
They are the structure that makes it safe for the agent to move fast.

1052
00:36:54,440 --> 00:36:55,440
They aren't the friction.

1053
00:36:55,440 --> 00:36:56,440
They are the rails.

1054
00:36:56,440 --> 00:36:58,040
The agent doesn't fight against these boundaries.

1055
00:36:58,040 --> 00:36:59,640
It uses them to reason.

1056
00:36:59,640 --> 00:37:03,560
When it plans a task, it's already calculating which files it can touch and which policies

1057
00:37:03,560 --> 00:37:04,560
it needs to respect.

1058
00:37:04,560 --> 00:37:06,920
It isn't trying to see how much it can get away with.

1059
00:37:06,920 --> 00:37:09,400
It's trying to hit the goal within the lines you drew.

1060
00:37:09,400 --> 00:37:11,880
Sandboxing is what makes this work in the real world.

1061
00:37:11,880 --> 00:37:14,480
The agent never works in your actual production environment.

1062
00:37:14,480 --> 00:37:17,840
It works in ephemeral spaces that exist only for the duration of the task.

1063
00:37:17,840 --> 00:37:22,800
It clones the repo, installs the dependencies, and runs the tests in total isolation.

1064
00:37:22,800 --> 00:37:24,800
Nothing it does can touch your live system.

1065
00:37:24,800 --> 00:37:29,720
Every change is proposed as a pull request and validated in CI/CD before it ever gets merged.

1066
00:37:29,720 --> 00:37:32,360
This is actually safer than traditional local development.

1067
00:37:32,360 --> 00:37:36,400
You can't accidentally push to the wrong branch or deploy a change before the tests finish.

1068
00:37:36,400 --> 00:37:40,920
The agent works in the sandbox, the outputs are just proposals, and the validation is automatic.

1069
00:37:40,920 --> 00:37:42,800
The human is always the one in control.

1070
00:37:42,800 --> 00:37:46,600
You set the guardrails, you define the permissions, and you set the gates.

1071
00:37:46,600 --> 00:37:48,720
You make the final call on what gets approved.

1072
00:37:48,720 --> 00:37:50,080
The agent executes the policy.

1073
00:37:50,080 --> 00:37:51,080
It doesn't create it.

1074
00:37:51,080 --> 00:37:54,520
This is the structural shift that makes autonomous agents a reality.

1075
00:37:54,520 --> 00:37:58,200
Safety doesn't come from a human watching every single move the agent makes.

1076
00:37:58,200 --> 00:38:01,960
Safety comes from a policy that prevents the bad things from happening in the first place.

1077
00:38:01,960 --> 00:38:06,320
It scales because the policy handles the routine work, leaving the human to focus on judgment

1078
00:38:06,320 --> 00:38:07,840
calls and edge cases.

1079
00:38:07,840 --> 00:38:11,720
The agent's ability to respect these guardrails is exactly what allows it to scale.

1080
00:38:11,720 --> 00:38:14,400
Without them, you'd never let an agent near your code base.

1081
00:38:14,400 --> 00:38:18,160
With them, you can let it work across the entire system because you know the policy is there to catch

1082
00:38:18,160 --> 00:38:19,160
what matters.

1083
00:38:19,160 --> 00:38:22,640
But even with the best guardrails in place, failure still happens.

1084
00:38:22,640 --> 00:38:25,360
Handling failure: agent self-healing and iteration.

1085
00:38:25,360 --> 00:38:28,480
When an agent fails, everything depends on what happens next.

1086
00:38:28,480 --> 00:38:30,640
A bad agent reports the error and stops.

1087
00:38:30,640 --> 00:38:32,440
A good agent reads the error and adapts.

1088
00:38:32,440 --> 00:38:33,440
It's that simple.

1089
00:38:33,440 --> 00:38:37,520
This is what separates an agent that works half the time from one that works 90% of the

1090
00:38:37,520 --> 00:38:38,520
time.

1091
00:38:38,520 --> 00:38:40,080
Failure at agent scale isn't a dead end.

1092
00:38:40,080 --> 00:38:41,080
It's data.

1093
00:38:41,080 --> 00:38:42,480
The agent generates an implementation.

1094
00:38:42,480 --> 00:38:43,960
A test fails.

1095
00:38:43,960 --> 00:38:48,320
Instead of giving up, the agent analyzes the failure message to see exactly what went wrong.

1096
00:38:48,320 --> 00:38:52,160
It looks at why the test broke and modifies the code to fix the actual problem, not just

1097
00:38:52,160 --> 00:38:54,400
the symptom. Then it runs the test again.

1098
00:38:54,400 --> 00:38:58,520
If it passes, great. If it fails differently, the agent analyzes the new error and tries

1099
00:38:58,520 --> 00:38:59,520
a different approach.

1100
00:38:59,520 --> 00:39:00,560
This is self-healing.

1101
00:39:00,560 --> 00:39:04,240
The agent doesn't need you to intervene every time something breaks because it has the

1102
00:39:04,240 --> 00:39:05,520
logic to fix itself.

1103
00:39:05,520 --> 00:39:07,560
This capability matters more than you'd think.

1104
00:39:07,560 --> 00:39:08,840
The research is clear.

1105
00:39:08,840 --> 00:39:12,840
When agents can iterate, their success rate on complex tasks jumps dramatically.

1106
00:39:12,840 --> 00:39:17,960
On SWE-Bench Pro, agents with iteration built into their workflow achieve 70% success rates

1107
00:39:17,960 --> 00:39:19,360
on difficult problems.

1108
00:39:19,360 --> 00:39:24,320
Agents without that capability top out around 30 or 40%. That's not a marginal difference.

1109
00:39:24,320 --> 00:39:27,200
That's the difference between a tool that's useful and one that isn't.

1110
00:39:27,200 --> 00:39:28,840
The iteration loop is straightforward.

1111
00:39:28,840 --> 00:39:30,080
The agent generates code.

1112
00:39:30,080 --> 00:39:31,080
It runs the tests.

1113
00:39:31,080 --> 00:39:32,680
If the tests pass, it moves on.

1114
00:39:32,680 --> 00:39:36,760
If they fail, it analyzes the failure, modifies the code, and tries again.

1115
00:39:36,760 --> 00:39:40,160
It keeps cycling until the tests pass, or it hits its iteration budget.

1116
00:39:40,160 --> 00:39:41,160
That budget matters.

1117
00:39:41,160 --> 00:39:43,040
The agent isn't allowed to iterate forever.

1118
00:39:43,040 --> 00:39:45,880
Maybe it gets 10 iterations, maybe 20.

1119
00:39:45,880 --> 00:39:49,200
After it exhausts that budget, it gives up and reports the problem to you.

1120
00:39:49,200 --> 00:39:52,800
This prevents infinite loops and forces the agent to either make progress or escalate

1121
00:39:52,800 --> 00:39:53,800
the issue.

1122
00:39:53,800 --> 00:39:56,880
It keeps the agent from spinning endlessly on an unsolvable problem.
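Editor's sketch: the loop described above (generate, test, revise, stop at a budget) can be written in a few lines. This is a minimal illustration, not how any particular agent implements it. The names `iterate_until_green`, `CheckResult`, `LoopOutcome`, and the toy generator and test runner are all hypothetical stand-ins for a model call and a real test suite.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    passed: bool
    message: str = ""


@dataclass
class LoopOutcome:
    succeeded: bool
    iterations: int
    code: str
    failures: List[str] = field(default_factory=list)


def iterate_until_green(
    generate: Callable[[Optional[str], Optional[str]], str],
    run_tests: Callable[[str], CheckResult],
    budget: int = 10,
) -> LoopOutcome:
    """Generate, test, and revise until tests pass or the budget runs out."""
    code: Optional[str] = None
    feedback: Optional[str] = None
    failures: List[str] = []
    for attempt in range(1, budget + 1):
        # The previous code and the last failure message feed the next attempt.
        code = generate(code, feedback)
        result = run_tests(code)
        if result.passed:
            return LoopOutcome(True, attempt, code, failures)
        feedback = result.message
        failures.append(feedback)
    # Budget exhausted: stop and hand the failure history to a human.
    return LoopOutcome(False, budget, code or "", failures)


def toy_generate(previous: Optional[str], feedback: Optional[str]) -> str:
    # Hypothetical stand-in for a model call: each revision bumps a version.
    if previous is None:
        return "v1"
    return f"v{int(previous[1:]) + 1}"


def toy_run_tests(code: str) -> CheckResult:
    # Hypothetical stand-in for a test suite: only the third revision passes.
    if code == "v3":
        return CheckResult(True)
    return CheckResult(False, f"{code} failed")


if __name__ == "__main__":
    print(iterate_until_green(toy_generate, toy_run_tests, budget=10))
```

With a budget of 10 the toy run succeeds on the third attempt. With a budget of 2 it returns an unsuccessful outcome carrying both failure messages, which is the escalation case.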

1123
00:39:56,880 --> 00:39:59,800
The rubber duck protocol makes this iteration smarter.

1124
00:39:59,800 --> 00:40:04,320
When the agent gets stuck after trying the same approach five times, it can call for help.

1125
00:40:04,320 --> 00:40:07,560
It sends its plan and the current code to a reviewer.

1126
00:40:07,560 --> 00:40:09,160
The reviewer is a different model family.

1127
00:40:09,160 --> 00:40:12,840
It looks at the code with fresh eyes to spot the issue the primary agent couldn't see.

1128
00:40:12,840 --> 00:40:17,000
It identifies the architectural misunderstanding or the edge case the agent overlooked.

1129
00:40:17,000 --> 00:40:18,440
The reviewer sends back feedback.

1130
00:40:18,440 --> 00:40:21,280
The agent reads it and then it tries a completely different approach.

1131
00:40:21,280 --> 00:40:22,960
It's not just retrying the same thing.

1132
00:40:22,960 --> 00:40:24,880
It's pivoting based on new information.

1133
00:40:24,880 --> 00:40:27,200
Importantly, the reviewer doesn't block the process.

1134
00:40:27,200 --> 00:40:30,800
The agent doesn't stop everything and wait for feedback; it can keep working while

1135
00:40:30,800 --> 00:40:32,120
the reviewer is thinking.

1136
00:40:32,120 --> 00:40:36,160
When the review comes back, the agent integrates it and adapts without losing momentum.

1137
00:40:36,160 --> 00:40:39,680
Here's what makes this work at scale: logging and observability.

1138
00:40:39,680 --> 00:40:43,200
Every iteration, every failure and every adaptation gets logged.

1139
00:40:43,200 --> 00:40:46,680
You can see what the agent did, why it failed and how it fixed itself.

1140
00:40:46,680 --> 00:40:50,280
This visibility is critical because it tells you where the agent is struggling.

1141
00:40:50,280 --> 00:40:54,640
If it's iterating 10 times on one type of problem but only once on others, you know where to

1142
00:40:54,640 --> 00:40:55,880
improve the context.

1143
00:40:55,880 --> 00:40:57,560
Human feedback amplifies this.

1144
00:40:57,560 --> 00:41:01,240
When the agent hits its budget and reports back to you, you see the full logs and understand

1145
00:41:01,240 --> 00:41:02,240
the block.

1146
00:41:02,240 --> 00:41:05,120
Maybe you see the agent was approaching the problem wrong or you have domain knowledge

1147
00:41:05,120 --> 00:41:08,920
that would have helped. You give feedback, and that feedback goes back into the agent's

1148
00:41:08,920 --> 00:41:09,920
model.

1149
00:41:09,920 --> 00:41:12,080
This isn't just for the current task; it's for future tasks too.

1150
00:41:12,080 --> 00:41:13,080
The agent learns.

1151
00:41:13,080 --> 00:41:15,080
Failure isn't the end of the process.

1152
00:41:15,080 --> 00:41:17,080
It's the beginning of the learning loop.

1153
00:41:17,080 --> 00:41:21,080
But agents are only useful if they have the right tools to work with. Tool integration and

1154
00:41:21,080 --> 00:41:22,760
the MCP protocol.

1155
00:41:22,760 --> 00:41:25,360
The fundamental limitation of any agent is simple.

1156
00:41:25,360 --> 00:41:27,440
It can only do what it has the ability to do.

1157
00:41:27,440 --> 00:41:31,360
If the agent can only read and write files, it's constrained to those operations.

1158
00:41:31,360 --> 00:41:35,440
The moment you give it access to more, the scope of what's possible expands entirely.

1159
00:41:35,440 --> 00:41:40,240
You give it the ability to run tests, invoke APIs, check deployment status and query logs.

1160
00:41:40,240 --> 00:41:42,280
This is where the Model Context Protocol comes in.

1161
00:41:42,280 --> 00:41:46,040
MCP is the standardized way agents get access to tools beyond just code.

1162
00:41:46,040 --> 00:41:48,280
Here's the core problem MCP solves.

1163
00:41:48,280 --> 00:41:49,760
Different tools work differently.

1164
00:41:49,760 --> 00:41:53,040
Different APIs have different signatures and authentication requirements.

1165
00:41:53,040 --> 00:41:56,480
Without a standard, every agent would need custom code to work with every tool.

1166
00:41:56,480 --> 00:41:57,480
That doesn't scale.

1167
00:41:57,480 --> 00:42:01,320
You'd spend more time writing integration code than improving the agent itself.

1168
00:42:01,320 --> 00:42:02,600
MCP creates a common language.

1169
00:42:02,600 --> 00:42:07,640
The agent doesn't need to understand the internals of every tool because it understands MCP.

1170
00:42:07,640 --> 00:42:10,960
Tools implement the protocol and the agent discovers what's available and how to invoke

1171
00:42:10,960 --> 00:42:11,960
them.

1172
00:42:11,960 --> 00:42:14,640
A tool might be a shell command to build the project or deploy to staging.

1173
00:42:14,640 --> 00:42:18,360
It could be an API call to query the database or check metrics.

1174
00:42:18,360 --> 00:42:22,400
It could be a documentation lookup to search the internal wiki and see how a module works.

1175
00:42:22,400 --> 00:42:25,880
It could even be a file system operation that goes beyond basic reading.

1176
00:42:25,880 --> 00:42:29,760
The agent can list the directory to understand the layout of a code base without reading every

1177
00:42:29,760 --> 00:42:30,760
single file.

1178
00:42:30,760 --> 00:42:35,640
It searches the capabilities, understands the parameters, and uses the tools to validate its work.
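Editor's sketch: the discover-then-invoke pattern can be shown with a toy, in-process server. The method names `tools/list` and `tools/call` and the `inputSchema` field follow the published MCP message shapes, but everything else here (`ToyToolServer`, `Tool`, the in-memory `FAKE_REPO`, the `list_directory` tool) is hypothetical, and there is no transport, session, or authentication. Treat it as an illustration of the idea, not as the MCP SDK.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tool:
    name: str
    description: str
    input_schema: Dict[str, Any]
    handler: Callable[..., str]


class ToyToolServer:
    """Answers JSON-RPC style requests for tool discovery and invocation."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    @staticmethod
    def _error(req_id: Any, code: int, message: str) -> Dict[str, Any]:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": code, "message": message}}

    def handle(self, request: Dict[str, Any]) -> Dict[str, Any]:
        req_id = request.get("id")
        method = request.get("method")
        if method == "tools/list":
            # Discovery: the agent learns names, descriptions, and parameters.
            tools = [{"name": t.name, "description": t.description,
                      "inputSchema": t.input_schema}
                     for t in self._tools.values()]
            return {"jsonrpc": "2.0", "id": req_id, "result": {"tools": tools}}
        if method == "tools/call":
            params = request.get("params", {})
            tool = self._tools.get(params.get("name"))
            if tool is None:
                return self._error(req_id, -32602, "unknown tool")
            text = tool.handler(**params.get("arguments", {}))
            return {"jsonrpc": "2.0", "id": req_id,
                    "result": {"content": [{"type": "text", "text": text}]}}
        return self._error(req_id, -32601, "method not found")


# Hypothetical in-memory repository so the example has no side effects.
FAKE_REPO = {"src": ["auth.py", "db.py"], "tests": ["test_auth.py"]}


def list_directory(path: str) -> str:
    return "\n".join(FAKE_REPO.get(path, []))


server = ToyToolServer()
server.register(Tool(
    name="list_directory",
    description="List the entries of a directory in the repository.",
    input_schema={"type": "object",
                  "properties": {"path": {"type": "string"}},
                  "required": ["path"]},
    handler=list_directory,
))
```

An agent would first send `{"method": "tools/list"}` to see what exists, then send `tools/call` with a tool name and arguments, without knowing anything about how the tool is implemented.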

1179
00:42:35,640 --> 00:42:37,200
This is where the power actually lives.

1180
00:42:37,200 --> 00:42:39,320
The agent doesn't just generate code in isolation.

1181
00:42:39,320 --> 00:42:43,120
It runs tests against its own code and triggers a linter to see the actual style rules for

1182
00:42:43,120 --> 00:42:44,120
your project.

1183
00:42:44,120 --> 00:42:48,280
It can run security scans to find vulnerabilities or query the real database schema instead of

1184
00:42:48,280 --> 00:42:49,280
guessing.

1185
00:42:49,280 --> 00:42:51,640
It checks the logs to see what happens when the code runs.

1186
00:42:51,640 --> 00:42:55,520
It can even deploy to a staging environment to validate that the change works in a real

1187
00:42:55,520 --> 00:42:56,520
system.

1188
00:42:56,520 --> 00:42:59,840
MCP is standardized, which means tools can be built by anybody.

1189
00:42:59,840 --> 00:43:02,760
GitHub provides MCP servers for repository operations.

1190
00:43:02,760 --> 00:43:05,360
Cloud providers provide servers for their services.

1191
00:43:05,360 --> 00:43:09,400
Your organization can even write MCP servers for your own internal tools.

1192
00:43:09,400 --> 00:43:12,320
Because the protocol is consistent, the agent can use any of them.

1193
00:43:12,320 --> 00:43:16,600
A tool built by GitHub works alongside a tool built by Amazon and a tool built by your

1194
00:43:16,600 --> 00:43:17,600
own engineers.

1195
00:43:17,600 --> 00:43:20,360
GitHub's Copilot CLI uses MCP extensively.

1196
00:43:20,360 --> 00:43:25,080
It accesses servers to search the repository, read documentation, and check workflow status.

1197
00:43:25,080 --> 00:43:28,960
It invokes shell commands to run tests and validates its own work before asking you to review

1198
00:43:28,960 --> 00:43:29,800
it.

1199
00:43:29,800 --> 00:43:32,520
Other agents use MCP to access cloud provider tools.

1200
00:43:32,520 --> 00:43:37,560
AWS provides servers for EC2 and RDS, while Azure provides servers for their own resources.

1201
00:43:37,560 --> 00:43:41,000
The agent checks what resources exist and understands the current configuration.

1202
00:43:41,000 --> 00:43:45,080
It proposes changes that actually work in that environment, not theoretical changes that

1203
00:43:45,080 --> 00:43:46,040
might fail.

1204
00:43:46,040 --> 00:43:49,680
The agent's capability ceiling is set by the tools available to it.

1205
00:43:49,680 --> 00:43:53,000
If you only give it the ability to read and write code, that's all it does.

1206
00:43:53,000 --> 00:43:57,320
If you give it access to tests, scanners, databases, and logs, it becomes much more powerful.

1207
00:43:57,320 --> 00:44:00,320
This is why tool integration is not optional in the agentic stack.

1208
00:44:00,320 --> 00:44:01,440
It's foundational.

1209
00:44:01,440 --> 00:44:03,800
An agent without tools is limited to code generation.

1210
00:44:03,800 --> 00:44:06,880
An agent with tools becomes an actual system agent.

1211
00:44:06,880 --> 00:44:11,320
It understands the real state of the infrastructure and validates its changes against reality.

1212
00:44:11,320 --> 00:44:13,160
This marks a structural shift.

1213
00:44:13,160 --> 00:44:17,680
Agents are moving from "read code and write code" to "understand the entire system."

1214
00:44:17,680 --> 00:44:19,640
They trigger real actions that matter.

1215
00:44:19,640 --> 00:44:23,680
That's the difference between a code assistant and an actual agent.

1216
00:44:23,680 --> 00:44:26,840
The agent's decision-making process: reasoning and planning.

1217
00:44:26,840 --> 00:44:28,920
How does an agent actually decide what to do?

1218
00:44:28,920 --> 00:44:30,400
This isn't a rhetorical question.

1219
00:44:30,400 --> 00:44:31,520
It's the reasoning problem.

1220
00:44:31,520 --> 00:44:34,120
It's exactly where most agent deployments fall apart.

1221
00:44:34,120 --> 00:44:37,600
Suppose an agent receives a task: you tell it to refactor an authentication module to

1222
00:44:37,600 --> 00:44:40,280
use OAuth2 and add full test coverage.

1223
00:44:40,280 --> 00:44:41,640
Now the agent has a goal.

1224
00:44:41,640 --> 00:44:42,800
But goals are not instructions.

1225
00:44:42,800 --> 00:44:45,200
A goal is just the outcome you need.

1226
00:44:45,200 --> 00:44:46,280
Instructions are the specific steps,

1227
00:44:46,280 --> 00:44:47,600
do this, then do that,

1228
00:44:47,600 --> 00:44:48,600
required to get there.

1229
00:44:48,600 --> 00:44:50,400
The agent has to bridge that gap.

1230
00:44:50,400 --> 00:44:52,360
It has to figure out the instructions from the goal.

1231
00:44:52,360 --> 00:44:57,080
It has to break the task into steps, choose the right tools and estimate if the approach will

1232
00:44:57,080 --> 00:44:58,080
even work.

1233
00:44:58,080 --> 00:45:00,680
All of this happens before it touches a single line of code.

1234
00:45:00,680 --> 00:45:01,680
Planning is not trivial.

1235
00:45:01,680 --> 00:45:05,560
It's incredibly difficult when you're operating across a real code base with dependencies,

1236
00:45:05,560 --> 00:45:07,800
constraints and existing patterns.

1237
00:45:07,800 --> 00:45:11,920
The agentic stack uses several reasoning patterns to handle this and each one has real

1238
00:45:11,920 --> 00:45:12,920
trade-offs.

1239
00:45:12,920 --> 00:45:16,360
The first is ReAct, short for reasoning and acting: reason, act, and observe.

1240
00:45:16,360 --> 00:45:20,240
The agent reasons about what to do, takes an action and then observes the result.

1241
00:45:20,240 --> 00:45:21,480
Then it repeats.

1242
00:45:21,480 --> 00:45:23,600
Reason, act, observe. Reason, act, observe.

1243
00:45:23,600 --> 00:45:25,800
This cycle continues until the task is done.

1244
00:45:25,800 --> 00:45:27,440
The benefit here is flexibility.

1245
00:45:27,440 --> 00:45:28,800
The agent can adapt as it learns.

1246
00:45:28,800 --> 00:45:32,840
If something unexpected happens, the agent notices and adjusts its path.

1247
00:45:32,840 --> 00:45:33,880
But the cost is speed.

1248
00:45:33,880 --> 00:45:35,560
Every single cycle takes time.

1249
00:45:35,560 --> 00:45:39,800
On a task that requires dozens of cycles, the whole process gets noticeably slower.

1250
00:45:39,800 --> 00:45:41,680
Plan and execute is the second model.

1251
00:45:41,680 --> 00:45:45,080
Here, the agent creates a complete plan before doing anything at all.

1252
00:45:45,080 --> 00:45:47,080
It thinks through the entire sequence up front.

1253
00:45:47,080 --> 00:45:49,920
It breaks the task into steps and decides which tools to use.

1254
00:45:49,920 --> 00:45:54,040
It even tries to estimate where problems might occur. Then, and only then,

1255
00:45:54,040 --> 00:45:55,720
it executes.

1256
00:45:55,720 --> 00:45:57,360
The benefit is efficiency.

1257
00:45:57,360 --> 00:45:59,160
The agent thinks once and acts once.

1258
00:45:59,160 --> 00:46:02,920
There is no repeated reasoning on every minor step. But the cost is flexibility.

1259
00:46:02,920 --> 00:46:05,880
If something unexpected happens, the plan might not account for it.

1260
00:46:05,880 --> 00:46:10,120
The agent might hit a wall and be forced to stop and replan everything from scratch.
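Editor's sketch: the trade-off between the two patterns shows up as a difference in how often the reasoning step is called. The code below is a minimal illustration under toy assumptions. `react`, `plan_and_execute`, `CountingPlanner`, and `toy_act` are hypothetical names, and the "reasoning" is a fixed list rather than a model.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (action, observation)


def react(
    reason: Callable[[str, List[Step]], Optional[str]],
    act: Callable[[str], str],
    goal: str,
    max_steps: int = 20,
) -> List[Step]:
    """Reason, act, observe, and repeat until reasoning says the goal is met."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = reason(goal, history)  # one reasoning call per cycle
        if action is None:
            break
        history.append((action, act(action)))
    return history


def plan_and_execute(
    plan: Callable[[str], List[str]],
    act: Callable[[str], str],
    goal: str,
) -> List[Step]:
    """Reason once up front, then run every planned step in order."""
    return [(action, act(action)) for action in plan(goal)]


class CountingPlanner:
    """Toy reasoning component that counts how often it is consulted."""

    STEPS = ["read module", "write code", "run tests"]

    def __init__(self) -> None:
        self.reasoning_calls = 0

    def next_action(self, goal: str, history: List[Step]) -> Optional[str]:
        self.reasoning_calls += 1
        if len(history) < len(self.STEPS):
            return self.STEPS[len(history)]
        return None  # nothing left to do

    def full_plan(self, goal: str) -> List[str]:
        self.reasoning_calls += 1
        return list(self.STEPS)


def toy_act(action: str) -> str:
    return f"done: {action}"
```

On this three-step task the ReAct loop consults the planner four times (three actions plus the final check that nothing remains), while plan and execute consults it once. The ReAct version sees every observation before choosing the next action, which is what lets it adapt.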

1261
00:46:10,120 --> 00:46:12,000
Multi-agent collaboration is the third pattern.

1262
00:46:12,000 --> 00:46:16,240
Instead of one agent doing everything, specialized agents handle different parts of the task.

1263
00:46:16,240 --> 00:46:18,000
A planner agent creates the plan.

1264
00:46:18,000 --> 00:46:21,600
An implementer agent writes the code. A reviewer agent checks the work.

1265
00:46:21,600 --> 00:46:24,040
An orchestrator agent coordinates the whole group.

1266
00:46:24,040 --> 00:46:25,680
The benefit is specialization.

1267
00:46:25,680 --> 00:46:28,400
Each agent is optimized for its specific role.

1268
00:46:28,400 --> 00:46:32,480
The code writing agent gets very good at writing code, and the review agent becomes an expert

1269
00:46:32,480 --> 00:46:33,760
at finding bugs.

1270
00:46:33,760 --> 00:46:35,760
But the cost is coordination overhead.

1271
00:46:35,760 --> 00:46:37,640
More agents mean more communication.

1272
00:46:37,640 --> 00:46:41,320
That creates more opportunities for miscommunication or conflicting decisions.

1273
00:46:41,320 --> 00:46:44,120
The agentic stack we've been describing uses a combination of these.

1274
00:46:44,120 --> 00:46:49,160
The orchestration layer, Copilot CLI, uses plan and execute for high-level decisions.

1275
00:46:49,160 --> 00:46:51,520
You describe what you want and the agent creates a plan.

1276
00:46:51,520 --> 00:46:52,520
It shows you that plan.

1277
00:46:52,520 --> 00:46:54,200
You approve it or give feedback.

1278
00:46:54,200 --> 00:46:55,640
This happens once right at the start.

1279
00:46:55,640 --> 00:46:56,640
It's efficient.

1280
00:46:56,640 --> 00:46:58,680
But the implementation layer uses ReAct.

1281
00:46:58,680 --> 00:47:02,040
Once the plan exists, the agent executing that plan iterates.

1282
00:47:02,040 --> 00:47:03,560
It writes code and runs tests.

1283
00:47:03,560 --> 00:47:05,160
If those tests fail, it reasons about why.

1284
00:47:05,160 --> 00:47:07,600
It modifies the code and runs the tests again.

1285
00:47:07,600 --> 00:47:10,560
This iterative approach is what allows it to handle surprises.

1286
00:47:10,560 --> 00:47:13,040
The quality of the agent's reasoning depends on three things.

1287
00:47:13,040 --> 00:47:14,880
Context, tools, and guardrails.

1288
00:47:14,880 --> 00:47:18,000
An agent with poor context makes poor decisions.

1289
00:47:18,000 --> 00:47:19,840
It reasons from incomplete information.

1290
00:47:19,840 --> 00:47:21,040
It has blind spots.

1291
00:47:21,040 --> 00:47:23,640
It doesn't understand the real constraints of your system.

1292
00:47:23,640 --> 00:47:25,720
An agent without tools can only theorize.

1293
00:47:25,720 --> 00:47:27,040
It can't validate its ideas.

1294
00:47:27,040 --> 00:47:29,560
It can't check whether its reasoning is actually correct.

1295
00:47:29,560 --> 00:47:31,200
It's just working from assumptions.

1296
00:47:31,200 --> 00:47:35,600
An agent without guardrails either tries to work around constraints, which wastes time,

1297
00:47:35,600 --> 00:47:38,560
or it ignores them entirely, which creates technical debt.

1298
00:47:38,560 --> 00:47:42,400
But when an agent has good context, integrated tools and clear guardrails,

1299
00:47:42,400 --> 00:47:44,000
it can reason effectively.

1300
00:47:44,000 --> 00:47:46,600
The context shapes how the agent thinks about the problem.

1301
00:47:46,600 --> 00:47:48,400
The tools let it validate that thinking.

1302
00:47:48,400 --> 00:47:51,800
The guardrails keep it focused on what is actually possible.

1303
00:47:51,800 --> 00:47:52,880
This is the shift.

1304
00:47:52,880 --> 00:47:56,000
The quality of the reasoning is not primarily about the model.

1305
00:47:56,000 --> 00:47:58,640
It's not a case of "smarter models reason better."

1306
00:47:58,640 --> 00:48:02,720
In reality, better context, tools and guardrails enable better reasoning.

1307
00:48:02,720 --> 00:48:05,640
This matters because it changes how you build an agentic stack.

1308
00:48:05,640 --> 00:48:08,720
You don't just pick the most capable model and hope for the best.

1309
00:48:08,720 --> 00:48:09,960
You optimize the whole system.

1310
00:48:09,960 --> 00:48:11,200
You engineer the context.

1311
00:48:11,200 --> 00:48:12,320
You integrate the tools.

1312
00:48:12,320 --> 00:48:16,840
You define the guardrails. The model matters, but the system matters more.

1313
00:48:16,840 --> 00:48:18,000
Breaking deadlocks.

1314
00:48:18,000 --> 00:48:22,320
When agents get stuck. An agent can reach a point where it is no longer making progress.

1315
00:48:22,320 --> 00:48:24,000
It tries the same approach repeatedly.

1316
00:48:24,000 --> 00:48:26,200
It hits the same failure again and again.

1317
00:48:26,200 --> 00:48:29,920
It starts cycling through variations of an idea that fundamentally does not work.

1318
00:48:29,920 --> 00:48:31,080
This is a deadlock.

1319
00:48:31,080 --> 00:48:34,800
The agent is trapped in a state where forward momentum has stopped and every new iteration

1320
00:48:34,800 --> 00:48:36,520
produces the same broken results.

1321
00:48:36,520 --> 00:48:38,480
The problem is that the agent doesn't know it's stuck.

1322
00:48:38,480 --> 00:48:39,800
It thinks it's trying something different.

1323
00:48:39,800 --> 00:48:42,000
It tells itself, "I modified the code here.

1324
00:48:42,000 --> 00:48:43,000
It failed.

1325
00:48:43,000 --> 00:48:44,000
Now I'll modify it there."

1326
00:48:44,000 --> 00:48:45,360
But the core approach is broken.

1327
00:48:45,360 --> 00:48:47,200
The changes are just surface level.

1328
00:48:47,200 --> 00:48:48,920
The fundamental problem hasn't been addressed.

1329
00:48:48,920 --> 00:48:52,560
The agent needs a way to recognize this pattern and respond.

1330
00:48:52,560 --> 00:48:54,600
Deadlock detection is the first part of the solution.

1331
00:48:54,600 --> 00:48:55,920
The agent has to monitor itself.

1332
00:48:55,920 --> 00:48:59,200
It watches how many times it has tried to solve the same problem.

1333
00:48:59,200 --> 00:49:03,720
It checks whether the failures are actually different or just variations of the same error.

1334
00:49:03,720 --> 00:49:07,920
If it sees the iteration count increasing while the success probability is decreasing, it

1335
00:49:07,920 --> 00:49:09,120
recognizes a deadlock.
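Editor's sketch: one simple way to tell "actually different failures" from "variations of the same error" is to normalize each failure message and check whether the last few attempts collapse to the same signature. This swaps in a repeated-signature heuristic for the success-probability estimate mentioned above, which would need a model of its own. `failure_signature` and `DeadlockDetector` are hypothetical names, and the window of three is an arbitrary choice.

```python
import re
from collections import deque
from typing import Deque


def failure_signature(message: str) -> str:
    """Strip volatile details so variations of one error compare equal."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<addr>", message)  # memory addresses
    sig = re.sub(r"\d+", "<n>", sig)                    # line numbers, values
    return sig.strip().lower()


class DeadlockDetector:
    """Flags a deadlock when the last `window` failures share one signature."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self.recent: Deque[str] = deque(maxlen=window)

    def record(self, message: str) -> bool:
        self.recent.append(failure_signature(message))
        window_full = len(self.recent) == self.window
        return window_full and len(set(self.recent)) == 1


if __name__ == "__main__":
    detector = DeadlockDetector(window=3)
    for msg in [
        "AssertionError at line 10: expected 5",
        "AssertionError at line 12: expected 7",
        "AssertionError at line 40: expected 9",
    ]:
        print(detector.record(msg))
```

The three messages differ only in numbers, so the third call reports a deadlock. A genuinely new error afterwards resets the signal, because the window no longer holds a single signature.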

1336
00:49:09,120 --> 00:49:11,160
When a deadlock is detected, recovery kicks in.

1337
00:49:11,160 --> 00:49:12,320
The agent has a few options.

1338
00:49:12,320 --> 00:49:14,800
The first is to invoke the rubber duck reviewer.

1339
00:49:14,800 --> 00:49:17,840
As we've seen, this is where a different model family looks at the problem with fresh

1340
00:49:17,840 --> 00:49:18,840
eyes.

1341
00:49:18,840 --> 00:49:21,800
The agent shows the reviewer its current code and the failures it's hitting.

1342
00:49:21,800 --> 00:49:24,000
The reviewer isn't blocked by the same assumptions.

1343
00:49:24,000 --> 00:49:26,120
It might spot an architectural misunderstanding.

1344
00:49:26,120 --> 00:49:27,920
It might identify a constraint

1345
00:49:27,920 --> 00:49:29,240
the agent overlooked.

1346
00:49:29,240 --> 00:49:31,680
It might simply point out that the entire approach is wrong.

1347
00:49:31,680 --> 00:49:34,600
The agent reads this feedback and tries something completely different.

1348
00:49:34,600 --> 00:49:35,600
Not a slight change,

1349
00:49:35,600 --> 00:49:36,840
but a different idea entirely.

1350
00:49:36,840 --> 00:49:39,800
The second recovery option is requesting human feedback.

1351
00:49:39,800 --> 00:49:42,800
When the agent recognizes it's stuck, it can ask you directly.

1352
00:49:42,800 --> 00:49:46,960
It might say, "I've been trying to implement this function, but tests keep failing.

1353
00:49:46,960 --> 00:49:47,960
What am I missing?"

1354
00:49:47,960 --> 00:49:49,360
You look at the logs.

1355
00:49:49,360 --> 00:49:52,440
You understand the context in a way the agent struggles with.

1356
00:49:52,440 --> 00:49:53,760
You provide feedback.

1357
00:49:53,760 --> 00:49:57,680
You might say, "The issue is actually in how you're handling the validation step.

1358
00:49:57,680 --> 00:49:59,120
Look at this other function in the code base.

1359
00:49:59,120 --> 00:50:00,640
It does this part differently."

1360
00:50:00,640 --> 00:50:04,520
The agent reads that, incorporates it, and understands its blind spot.

1361
00:50:04,520 --> 00:50:06,280
Then it tries again with a new direction.

1362
00:50:06,280 --> 00:50:08,880
The third option is resetting and starting fresh.

1363
00:50:08,880 --> 00:50:12,560
The agent can clear its working state and approach the problem from the beginning.

1364
00:50:12,560 --> 00:50:16,280
But this time it carries the knowledge it gained from the previous failures.

1365
00:50:16,280 --> 00:50:17,680
The fourth option is escalation.

1366
00:50:17,680 --> 00:50:21,560
If the problem is simply beyond the current model's capability, the task can be handed

1367
00:50:21,560 --> 00:50:23,240
off to a more capable model.

1368
00:50:23,240 --> 00:50:26,160
Copilot CLI implements this deadlock recovery automatically.

1369
00:50:26,160 --> 00:50:29,960
If the agent recognizes it's stuck, it can invoke the rubber duck reviewer without waiting

1370
00:50:29,960 --> 00:50:30,960
for your permission.

1371
00:50:30,960 --> 00:50:31,960
The review comes back.

1372
00:50:31,960 --> 00:50:34,960
The agent incorporates the feedback and it tries a new approach.

1373
00:50:34,960 --> 00:50:36,320
All of this happens in the background.

1374
00:50:36,320 --> 00:50:40,200
You just see that a deadlock was detected and resolved without you having to intervene.

1375
00:50:40,200 --> 00:50:42,200
The research on this is very consistent.

1376
00:50:42,200 --> 00:50:45,880
Agents with deadlock detection and recovery achieve significantly higher success rates

1377
00:50:45,880 --> 00:50:46,960
than agents without it.

1378
00:50:46,960 --> 00:50:48,920
This isn't just a small quality improvement.

1379
00:50:48,920 --> 00:50:50,240
It's a structural difference.

1380
00:50:50,240 --> 00:50:54,920
With deadlock recovery, agents can complete tasks that would otherwise be impossible.

1381
00:50:54,920 --> 00:50:56,760
Without it, they hit walls and stay there.

1382
00:50:56,760 --> 00:50:59,080
Logging is what makes this whole system work.

1383
00:50:59,080 --> 00:51:02,400
Every attempt, every failure, and every recovery needs to be recorded.

1384
00:51:02,400 --> 00:51:05,840
You can't understand why an agent is stuck without seeing what it's already tried.

1385
00:51:05,840 --> 00:51:08,800
You can't provide effective feedback without the logs.

1386
00:51:08,800 --> 00:51:12,200
And you certainly can't improve the system if you don't understand where the deadlocks

1387
00:51:12,200 --> 00:51:13,200
are happening.

1388
00:51:13,200 --> 00:51:15,680
The human is always part of the recovery process.

1389
00:51:15,680 --> 00:51:18,680
Sometimes the agent recovers automatically through the rubber duck.

1390
00:51:18,680 --> 00:51:19,960
Sometimes it needs your direct input.

1391
00:51:19,960 --> 00:51:21,680
Either way, you are watching.

1392
00:51:21,680 --> 00:51:23,800
You're learning what kinds of problems cause these loops.

1393
00:51:23,800 --> 00:51:26,560
You're seeing where the context or the tools are insufficient.

1394
00:51:26,560 --> 00:51:28,520
The structural insight here is simple.

1395
00:51:28,520 --> 00:51:29,840
Deadlock is not a failure.

1396
00:51:29,840 --> 00:51:31,480
It's just a signal that the agent needs help.

1397
00:51:31,480 --> 00:51:32,480
The agent knows this.

1398
00:51:32,480 --> 00:51:33,480
It doesn't panic.

1399
00:51:33,480 --> 00:51:34,880
It just asks for what it needs.

1400
00:51:34,880 --> 00:51:37,920
This is why the agentic stack includes multiple feedback mechanisms.

1401
00:51:37,920 --> 00:51:39,840
The rubber duck provides one perspective.

1402
00:51:39,840 --> 00:51:41,360
Human feedback provides another.

1403
00:51:41,360 --> 00:51:42,840
Tool results provide a third.

1404
00:51:42,840 --> 00:51:47,280
Together they create a system that can break free from almost any deadlock the agent encounters.

1405
00:51:47,280 --> 00:51:50,480
Latency and time budgets: the performance constraint.

1406
00:51:50,480 --> 00:51:54,840
There is a fundamental tension in building agentic systems that almost nobody talks about.

1407
00:51:54,840 --> 00:51:57,160
Agents need time to think. Good reasoning takes time.

1408
00:51:57,160 --> 00:51:59,600
Deep analysis takes time, and iteration takes time.

1409
00:51:59,600 --> 00:52:03,440
But the moment an agent takes more than a few seconds to respond, something inside the

1410
00:52:03,440 --> 00:52:05,440
developer's brain shifts.

1411
00:52:05,440 --> 00:52:06,440
They stop waiting.

1412
00:52:06,440 --> 00:52:08,320
They stop believing the agent will deliver.

1413
00:52:08,320 --> 00:52:09,960
They go back to doing it themselves.

1414
00:52:09,960 --> 00:52:11,160
This is the latency problem.

1415
00:52:11,160 --> 00:52:14,600
It isn't about whether the agent can do good work, but whether it can deliver that work

1416
00:52:14,600 --> 00:52:18,240
fast enough that waiting feels rational instead of frustrating.

1417
00:52:18,240 --> 00:52:20,120
The research on this is crystal clear.

1418
00:52:20,120 --> 00:52:23,560
If the delay between asking for something and the agent starting its response exceeds a

1419
00:52:23,560 --> 00:52:25,560
few seconds, usage drops.

1420
00:52:25,560 --> 00:52:28,600
The agent might produce perfect output, but it doesn't matter because the developer has

1421
00:52:28,600 --> 00:52:30,200
already moved on to the next thing.

1422
00:52:30,200 --> 00:52:32,840
They've decided it's faster to just do it themselves.

1423
00:52:32,840 --> 00:52:36,480
Copilot CLI has become fast enough that this stops being the limiting factor.

1424
00:52:36,480 --> 00:52:41,200
Recent optimization updates claim up to 75% improvement in response time compared to earlier

1425
00:52:41,200 --> 00:52:42,200
versions.

1426
00:52:42,200 --> 00:52:45,320
That number matters because it's the difference between an agent feeling like a tool

1427
00:52:45,320 --> 00:52:48,800
you use and an agent feeling like a chore you endure.

1428
00:52:48,800 --> 00:52:50,360
How does this actually happen?

1429
00:52:50,360 --> 00:52:51,520
Prompt caching is part of it.

1430
00:52:51,520 --> 00:52:55,440
The agent doesn't re-send the same context to the model on every request because that

1431
00:52:55,440 --> 00:52:56,440
context gets cached.

1432
00:52:56,440 --> 00:53:00,880
If you're asking the agent to work on the same project for a while, the repository structure,

1433
00:53:00,880 --> 00:53:04,800
architecture docs, and pattern files are already in the model's attention space.

1434
00:53:04,800 --> 00:53:09,080
The agent just talks to the model about what's changed since last time, which cuts the overhead

1435
00:53:09,080 --> 00:53:10,080
dramatically.

1436
00:53:10,080 --> 00:53:11,920
Streaming is another part.

1437
00:53:11,920 --> 00:53:15,160
The agent doesn't wait until it's generated the complete response before showing anything

1438
00:53:15,160 --> 00:53:16,160
to you.

1439
00:53:16,160 --> 00:53:20,040
It streams the output as it's being generated, so you see words appearing, the plan forming

1440
00:53:20,040 --> 00:53:21,560
and actual progress.

1441
00:53:21,560 --> 00:53:23,640
This changes your perception of speed.

1442
00:53:23,640 --> 00:53:27,080
Even if the total time is the same, streaming makes it feel faster because you aren't staring

1443
00:53:27,080 --> 00:53:29,480
at a blank screen waiting for something to happen.

1444
00:53:29,480 --> 00:53:31,200
Context compression works at a different layer.

1445
00:53:31,200 --> 00:53:35,000
The agent doesn't send your entire code base to the model, but instead analyzes what you're

1446
00:53:35,000 --> 00:53:39,560
actually working on, what nearby code is relevant, and what patterns apply to the specific

1447
00:53:39,560 --> 00:53:40,560
task.

1448
00:53:40,560 --> 00:53:42,400
It bundles just that context and sends it.

1449
00:53:42,400 --> 00:53:46,680
Full code base context might be huge, but compressed context is tight, and the model processes

1450
00:53:46,680 --> 00:53:47,680
it faster.

1451
00:53:47,680 --> 00:53:51,240
Parallel execution runs multiple checks at the same time instead of sequentially.

1452
00:53:51,240 --> 00:53:55,120
The agent doesn't finish reading the files, then plan, then start implementation.

1453
00:53:55,120 --> 00:53:58,560
It can start planning while still reading, and it can start running tests while still

1454
00:53:58,560 --> 00:53:59,960
generating code.

1455
00:53:59,960 --> 00:54:02,320
Different parts of the workflow happen concurrently.

1456
00:54:02,320 --> 00:54:06,180
But there's a hard limit to how fast you can make agents if you're trying to maintain quality,

1457
00:54:06,180 --> 00:54:08,320
so the stack uses time budgets instead.

1458
00:54:08,320 --> 00:54:12,680
The agent gets a budget where it can spend X seconds on planning, Y seconds on implementation,

1459
00:54:12,680 --> 00:54:14,160
and Z seconds on validation.

1460
00:54:14,160 --> 00:54:16,080
If it hits the budget, it stops.

1461
00:54:16,080 --> 00:54:18,760
It reports what it's done so far, and the developer reviews it.

1462
00:54:18,760 --> 00:54:23,240
Either the developer approves, and the partial work gets merged, or they give feedback,

1463
00:54:23,240 --> 00:54:25,880
and the agent continues with more budget on the next iteration.
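Editor's sketch: a per-phase time budget can be enforced by checking a deadline between units of work and reporting whatever has been completed when the deadline passes. This is a minimal illustration, assuming work is split into small steps. The names `run_with_budgets`, `FakeClock`, and `make_step` are hypothetical, and the fake clock exists only so the example is deterministic. A real agent would use `time.monotonic` and would also need a way to interrupt a single long-running step, which this cooperative check does not provide.

```python
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple

Phase = Tuple[str, Sequence[Callable[[], str]]]


def run_with_budgets(
    phases: Sequence[Phase],
    budgets: Dict[str, float],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Run each phase step by step, stopping when a phase exceeds its budget."""
    completed: Dict[str, List[str]] = {}
    for name, steps in phases:
        deadline = clock() + budgets[name]
        done: List[str] = []
        for step in steps:
            # Cooperative check: the budget is only tested between steps.
            if clock() >= deadline:
                completed[name] = done
                return {"status": "budget_exhausted", "phase": name,
                        "completed": completed}
            done.append(step())
        completed[name] = done
    return {"status": "finished", "phase": None, "completed": completed}


class FakeClock:
    """Deterministic clock for the example; advances only when told to."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def make_step(clock: FakeClock, label: str, cost: float) -> Callable[[], str]:
    """Build a step that 'takes' `cost` seconds on the fake clock."""
    def step() -> str:
        clock.advance(cost)
        return label
    return step
```

With two seconds each for planning and implementation and one-second steps, planning finishes, implementation completes two of its three steps, and the report names the phase that ran out of budget so the developer can approve the partial work or grant more time.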

1464
00:54:25,880 --> 00:54:28,520
This matters because it changes what quality means.

1465
00:54:28,520 --> 00:54:32,540
It's not "how good can this be," but "how good can this be in the time available."

1466
00:54:32,540 --> 00:54:34,600
The agent optimizes for that trade-off.

1467
00:54:34,600 --> 00:54:39,000
It doesn't spend minutes on architectural analysis if the budget is 30 seconds, but instead

1468
00:54:39,000 --> 00:54:41,200
makes quick decisions and moves forward.

1469
00:54:41,200 --> 00:54:42,840
Different tasks get different budgets.

1470
00:54:42,840 --> 00:54:46,440
A code review is tight because you need feedback on a PR now, not in five minutes.

1471
00:54:46,440 --> 00:54:50,280
A simple refactor is tight, but a full feature implementation is looser.

1472
00:54:50,280 --> 00:54:54,000
A major modernization task is loose because you aren't paying for the refactor to happen

1473
00:54:54,000 --> 00:54:57,360
right now, but rather for it to happen eventually while you do other work.

1474
00:54:57,360 --> 00:55:01,000
The structural insight here is that latency isn't just a performance problem.

1475
00:55:01,000 --> 00:55:03,480
It's a design problem.

1476
00:55:03,480 --> 00:55:05,400
Latency shapes how agents reason.

1477
00:55:05,400 --> 00:55:10,040
Tight budgets force agents to be decisive, so they skip deep analysis and go for directional

1478
00:55:10,040 --> 00:55:11,040
correctness.

1479
00:55:11,040 --> 00:55:14,280
Loose budgets allow contemplation, so they iterate and refine.

1480
00:55:14,280 --> 00:55:16,040
Understanding your latency profile is critical.

1481
00:55:16,040 --> 00:55:19,760
You need to know how long different tasks actually take and where the bottlenecks are.

1482
00:55:19,760 --> 00:55:21,320
Is it model inference that's slow?

1483
00:55:21,320 --> 00:55:22,520
Is it tool invocation?

1484
00:55:22,520 --> 00:55:24,240
Is it context preparation?

1485
00:55:24,240 --> 00:55:26,520
Different bottlenecks have different solutions.
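One simple way to build a latency profile is to time each stage separately. The sketch below assumes three stage names taken from the questions above and uses `time.sleep` as a placeholder for the real work; the `timed` helper is hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)

@contextmanager
def timed(stage: str):
    # Accumulate wall-clock time spent inside the block under the stage name.
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start

# Placeholder workloads; real code would call the model, tools, and so on.
with timed("context_preparation"):
    time.sleep(0.005)
with timed("model_inference"):
    time.sleep(0.05)
with timed("tool_invocation"):
    time.sleep(0.005)

slowest = max(stage_seconds, key=stage_seconds.get)
print(slowest, dict(stage_seconds))
```

Whichever stage dominates the totals is the one worth optimizing first.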

1486
00:55:26,520 --> 00:55:27,880
And optimization is ongoing.

1487
00:55:27,880 --> 00:55:32,280
As models get faster, as tools improve, and as your context preparation gets smarter,

1488
00:55:32,280 --> 00:55:33,840
the entire stack gets faster.

1489
00:55:33,840 --> 00:55:38,680
This isn't a one-time tuning, but a continuous process where you measure, identify slow paths,

1490
00:55:38,680 --> 00:55:41,160
optimize them, and measure again.

1491
00:55:41,160 --> 00:55:42,600
The point is simple.

1492
00:55:42,600 --> 00:55:46,120
Speed matters less for quality and more for usability.

1493
00:55:46,120 --> 00:55:48,800
Measuring agent effectiveness: metrics that matter.

1494
00:55:48,800 --> 00:55:51,200
Here's the hard part about deploying an agentic stack.

1495
00:55:51,200 --> 00:55:53,400
You can't tell if it's working just by looking at it.

1496
00:55:53,400 --> 00:55:54,840
The agent generates code.

1497
00:55:54,840 --> 00:55:56,480
The agent completes tasks.

1498
00:55:56,480 --> 00:55:57,640
Everything looks productive.

1499
00:55:57,640 --> 00:55:59,920
But is it actually moving the needle on what matters?

1500
00:55:59,920 --> 00:56:02,720
This is where most organizations make a critical mistake.

1501
00:56:02,720 --> 00:56:04,000
They measure the wrong things.

1502
00:56:04,000 --> 00:56:07,840
They count PRs created, lines of code generated, and how many issues got closed.

1503
00:56:07,840 --> 00:56:09,160
These are vanity metrics.

1504
00:56:09,160 --> 00:56:13,120
The agent could produce a hundred PRs, but if they're all garbage, you've just wasted time

1505
00:56:13,120 --> 00:56:14,120
and energy.

1506
00:56:14,120 --> 00:56:15,440
Real measurement is different.

1507
00:56:15,440 --> 00:56:19,160
Real measurement focuses on business value, specifically five things.

1508
00:56:19,160 --> 00:56:20,400
Throughput is the first.

1509
00:56:20,400 --> 00:56:22,720
How many tasks can the agent complete per day?

1510
00:56:22,720 --> 00:56:26,360
How many pull requests are being created per developer per week compared to before you

1511
00:56:26,360 --> 00:56:27,360
deployed the agent?

1512
00:56:27,360 --> 00:56:30,480
If throughput hasn't moved, the agent isn't actually working.

1513
00:56:30,480 --> 00:56:33,800
Even if the agent is operating, it isn't saving anybody time.

1514
00:56:33,800 --> 00:56:36,760
Throughput is objective: either you're getting more done or you're not.

1515
00:56:36,760 --> 00:56:37,760
Quality is the second.

1516
00:56:37,760 --> 00:56:41,920
What's the defect rate in agent-generated code compared to human-generated code?

1517
00:56:41,920 --> 00:56:44,080
What percentage of the agent's output gets reworked?

1518
00:56:44,080 --> 00:56:45,600
How many bugs escape into production?

1519
00:56:45,600 --> 00:56:49,240
This matters because a fast agent that generates garbage actually costs you money.

1520
00:56:49,240 --> 00:56:53,320
The review burden increases, the rework burden increases, and you've accelerated something

1521
00:56:53,320 --> 00:56:54,560
that doesn't work.

1522
00:56:54,560 --> 00:56:58,240
Quality metrics tell you if the speed is actually real or if you're just moving faster in the

1523
00:56:58,240 --> 00:56:59,240
wrong direction.

1524
00:56:59,240 --> 00:57:00,240
Cycle time is the third.

1525
00:57:00,240 --> 00:57:04,240
How long does it take from the moment a task is created to the moment code is merged and

1526
00:57:04,240 --> 00:57:05,240
deployed?

1527
00:57:05,240 --> 00:57:07,800
Not just how fast the agent works, but the entire cycle.

1528
00:57:07,800 --> 00:57:10,640
Are your PRs spending less time waiting for review?

1529
00:57:10,640 --> 00:57:12,280
Are you deploying faster?

1530
00:57:12,280 --> 00:57:15,560
Cycle time is the metric that actually connects to business outcomes.

1531
00:57:15,560 --> 00:57:19,480
Faster deployment means faster feedback, which means you catch problems earlier and your

1532
00:57:19,480 --> 00:57:23,080
team gets information sooner about whether a feature is working.

1533
00:57:23,080 --> 00:57:24,480
Developer satisfaction is the fourth.

1534
00:57:24,480 --> 00:57:27,720
Do your developers actually find the agent helpful?

1535
00:57:27,720 --> 00:57:28,720
Are they using it?

1536
00:57:28,720 --> 00:57:30,040
Are they asking for more?

1537
00:57:30,040 --> 00:57:33,840
This isn't a soft metric, but a signal about whether the system is actually solving a problem

1538
00:57:33,840 --> 00:57:35,240
or creating new ones.

1539
00:57:35,240 --> 00:57:39,400
If developers hate using the agent, they'll stop using it and all your investment evaporates.

1540
00:57:39,400 --> 00:57:42,840
Satisfaction tells you if the interface is right, if the latency is acceptable, and if

1541
00:57:42,840 --> 00:57:46,040
the outputs are useful in the context where they're being used.

1542
00:57:46,040 --> 00:57:47,040
Technical debt is the fifth.

1543
00:57:47,040 --> 00:57:50,560
Is the agent introducing new debt or reducing existing debt?

1544
00:57:50,560 --> 00:57:53,680
Are the codebases it touches getting cleaner over time, or messier?

1545
00:57:53,680 --> 00:57:57,400
You can improve cycle time while introducing debt that costs you later.

1546
00:57:57,400 --> 00:58:02,440
A good agentic stack improves cycle time while reducing debt, but a bad one does the opposite.

1547
00:58:02,440 --> 00:58:05,800
Measuring debt means you're taking a long-term view instead of just celebrating short-term

1548
00:58:05,800 --> 00:58:06,800
velocity.

1549
00:58:06,800 --> 00:58:11,160
The research shows that when these five things are working together, the numbers are significant.

1550
00:58:11,160 --> 00:58:13,920
Throughput increases by 30 to 100%.

1551
00:58:13,920 --> 00:58:16,680
Quality improves with fewer defects and less rework.

1552
00:58:16,680 --> 00:58:19,040
Cycle time drops by 20 to 40%.

1553
00:58:19,040 --> 00:58:22,080
Developer satisfaction stays high and technical debt decreases, but here's what matters

1554
00:58:22,080 --> 00:58:23,080
most.

1555
00:58:23,080 --> 00:58:26,640
These improvements only happen when the agentic stack is actually well-built, when context is

1556
00:58:26,640 --> 00:58:30,400
rich, when guardrails are clear, when tools are integrated well.

1557
00:58:30,400 --> 00:58:33,080
If you've cut corners, the metrics won't improve.

1558
00:58:33,080 --> 00:58:36,640
You also need to measure what's happening inside the agent. Success rate tells you what

1559
00:58:36,640 --> 00:58:40,280
percentage of tasks complete without human intervention.

1560
00:58:40,280 --> 00:58:43,840
Iteration count tells you how many times the agent cycles before it succeeds.

1561
00:58:43,840 --> 00:58:47,400
Tool usage tells you which integrations the agent is leaning on and which are sitting

1562
00:58:47,400 --> 00:58:48,400
unused.

1563
00:58:48,400 --> 00:58:51,400
Deadlock frequency tells you how often the agent gets stuck and needs help.

1564
00:58:51,400 --> 00:58:53,580
These internal metrics are feedback signals.

1565
00:58:53,580 --> 00:58:57,560
If the success rate is low, the agent probably needs better context understanding.

1566
00:58:57,560 --> 00:59:01,120
If deadlock frequency is high, the tool integration is probably incomplete.

1567
00:59:01,120 --> 00:59:04,840
If the iteration count is high, the initial planning is probably missing something.

1568
00:59:04,840 --> 00:59:08,680
If certain tools are unused, they might not be discoverable or useful in the way you've

1569
00:59:08,680 --> 00:59:09,680
presented them.
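A minimal sketch of computing these internal metrics from per-task records might look like this. The `TaskRecord` fields, the tool names, and the sample data are all hypothetical; "deadlocked" here just means the agent got stuck and needed help.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TaskRecord:
    succeeded_unaided: bool      # finished without human intervention
    iterations: int              # cycles the agent ran on the task
    tools_used: Tuple[str, ...]  # integrations the agent called
    deadlocked: bool             # agent got stuck and needed help

def summarize(records, available_tools):
    n = len(records)
    usage = Counter(tool for r in records for tool in r.tools_used)
    return {
        "success_rate": sum(r.succeeded_unaided for r in records) / n,
        "mean_iterations": sum(r.iterations for r in records) / n,
        "deadlock_frequency": sum(r.deadlocked for r in records) / n,
        "tool_usage": dict(usage),
        "unused_tools": sorted(set(available_tools) - set(usage)),
    }

# Invented sample data for illustration only.
records = [
    TaskRecord(True, 2, ("tests", "linter"), False),
    TaskRecord(False, 6, ("tests",), True),
    TaskRecord(True, 1, ("tests",), False),
    TaskRecord(True, 3, ("linter",), False),
]
summary = summarize(records, available_tools=["tests", "linter", "profiler"])
print(summary)
```

With this sample, the profiler shows up as an unused tool, which is the kind of signal that prompts the question of whether it is discoverable or useful in its current form.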

1570
00:59:09,680 --> 00:59:11,760
The point: metrics aren't about accountability.

1571
00:59:11,760 --> 00:59:13,440
They're about learning.

1572
00:59:13,440 --> 00:59:17,200
They tell you where the agentic stack is succeeding and where it needs work.

1573
00:59:17,200 --> 00:59:18,360
Continuous measurement matters.

1574
00:59:18,360 --> 00:59:20,480
You aren't measuring once and declaring victory.

1575
00:59:20,480 --> 00:59:26,320
You're measuring continuously, watching for trends and identifying problems as they emerge.

1576
00:59:26,320 --> 00:59:28,200
Continuous improvement: the feedback loop.

1577
00:59:28,200 --> 00:59:30,680
The agentic stack isn't finished when you deploy it.

1578
00:59:30,680 --> 00:59:32,320
That's when the real work actually starts.

1579
00:59:32,320 --> 00:59:35,520
Every time the agent completes a task, something happens that matters.

1580
00:59:35,520 --> 00:59:38,880
The work gets done, code gets written and tests either pass or fail.

1581
00:59:38,880 --> 00:59:42,360
When developers review those changes and use the output, they create a signal.

1582
00:59:42,360 --> 00:59:45,400
This is vital information about what's working and what's broken.

1583
00:59:45,400 --> 00:59:49,640
Most organizations ignore this signal because they deploy the agent and immediately move on.

1584
00:59:49,640 --> 00:59:53,800
They're focused on utilization, not improvement, which is exactly why their agentic stacks

1585
00:59:53,800 --> 00:59:54,800
plateau.

1586
00:59:54,800 --> 00:59:57,440
They work okay for a while, then they just stop getting better.

1587
00:59:57,440 --> 01:00:01,040
The teams that see dramatic gains treat the agentic stack like a product.

1588
01:00:01,040 --> 01:00:02,720
It's a living thing that needs to grow.

1589
01:00:02,720 --> 01:00:03,800
But here's the problem.

1590
01:00:03,800 --> 01:00:05,200
Most people don't have a loop.

1591
01:00:05,200 --> 01:00:08,960
In a real feedback loop, an agent completes a task and you capture the outcome.

1592
01:00:08,960 --> 01:00:10,760
You look at whether it succeeded or failed.

1593
01:00:10,760 --> 01:00:14,680
If it worked, you check how clean the implementation was and if it needed rework.

1594
01:00:14,680 --> 01:00:16,120
If it failed, you ask why.

1595
01:00:16,120 --> 01:00:20,000
Was it a context problem, a tool problem, or a misunderstanding of the requirements?

1596
01:00:20,000 --> 01:00:21,720
You collect developer feedback too.

1597
01:00:21,720 --> 01:00:24,680
You ask if the output was usable and if they would use it again.

1598
01:00:24,680 --> 01:00:28,200
You analyze all of this to identify patterns across dozens of tasks.

1599
01:00:28,200 --> 01:00:30,640
Then you make deliberate improvements.

1600
01:00:30,640 --> 01:00:31,720
Context improvement happens first.

1601
01:00:31,720 --> 01:00:35,440
If the agent repeatedly generates code that doesn't fit the architecture, your documentation

1602
01:00:35,440 --> 01:00:36,440
is incomplete.

1603
01:00:36,440 --> 01:00:41,160
You update architecture.md, you refine your gotchas, and you add pattern files for the specific

1604
01:00:41,160 --> 01:00:42,840
problems the agent is hitting.

1605
01:00:42,840 --> 01:00:45,360
You're encoding knowledge that the agent didn't have before.

1606
01:00:45,360 --> 01:00:50,120
The next time a similar task comes in, the agent has that knowledge and the output improves.

1607
01:00:50,120 --> 01:00:51,600
Tool improvements come next.

1608
01:00:51,600 --> 01:00:55,160
If the agent is failing because it can't validate its work, you integrate the tools that

1609
01:00:55,160 --> 01:00:56,280
let it validate.

1610
01:00:56,280 --> 01:01:00,360
If certain tools aren't being used, they might not be discoverable or useful in their current

1611
01:01:00,360 --> 01:01:01,360
form.

1612
01:01:01,360 --> 01:01:02,520
You fix them or you remove them.

1613
01:01:02,520 --> 01:01:06,080
You might even integrate new tools that solve the actual problems the agent is hitting

1614
01:01:06,080 --> 01:01:07,080
in the wild.

1615
01:01:07,080 --> 01:01:09,240
Guardrail adjustments are the third category.

1616
01:01:09,240 --> 01:01:12,360
Based on the mistakes the agent makes, you refine the policy.

1617
01:01:12,360 --> 01:01:16,800
If the agent violates architectural rules, you make those rules explicit in the guardrail

1618
01:01:16,800 --> 01:01:17,800
files.

1619
01:01:17,800 --> 01:01:20,880
If it's missing validation checks, you tighten the requirements.

1620
01:01:20,880 --> 01:01:24,360
You aren't just restricting the agent, you're teaching it what good looks like.

1621
01:01:24,360 --> 01:01:25,960
Prompt refinement is the final piece.

1622
01:01:25,960 --> 01:01:27,600
How you ask for a task matters.

1623
01:01:27,600 --> 01:01:32,880
If developers give vague tasks and the agent generates vague results, you create prompt templates.

1624
01:01:32,880 --> 01:01:35,920
You show examples of well-specified tasks to the team.

1625
01:01:35,920 --> 01:01:39,480
You refine how people ask the agent to do things, which changes what the agent understands

1626
01:01:39,480 --> 01:01:40,840
about what you actually want.

1627
01:01:40,840 --> 01:01:43,640
This isn't one-time tuning; it's continuous.

1628
01:01:43,640 --> 01:01:47,720
Every week you review the metrics from the prior seven days to find the biggest bottleneck.

1629
01:01:47,720 --> 01:01:50,040
It might be throughput, quality or cycle time.

1630
01:01:50,040 --> 01:01:52,640
You make one focused improvement and you measure the impact.

1631
01:01:52,640 --> 01:01:55,560
Next week you do it again. This cadence matters.

1632
01:01:55,560 --> 01:01:58,320
Teams that improve continuously see measurable gains.

1633
01:01:58,320 --> 01:02:02,240
In the first month you might see 30% improvement as you fix obvious problems.

1634
01:02:02,240 --> 01:02:06,720
The second month gains might slow to 15% and by the third month you're looking at 10%,

1635
01:02:06,720 --> 01:02:08,320
but the cumulative effect compounds.

1636
01:02:08,320 --> 01:02:12,520
After six months the agentic stack is three times better than when you started.
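Monthly gains like these multiply rather than add. As a small worked example using only the three monthly figures quoted above (30%, 15%, 10%), and making no assumption about later months:

```python
# Monthly improvements quoted for the first three months.
monthly_gains = [0.30, 0.15, 0.10]

cumulative = 1.0
for gain in monthly_gains:
    cumulative *= 1 + gain  # each month builds on the previous month's level

print(round(cumulative, 4))
```

That works out to roughly 1.64 times the starting point after three months; where the stack lands at six months depends on the gains in the months that follow.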

1637
01:02:12,520 --> 01:02:14,240
The structural insight is this.

1638
01:02:14,240 --> 01:02:16,000
The agentic stack is a learning system.

1639
01:02:16,000 --> 01:02:19,600
It gets better as it operates. Every task that gets completed and every failure that occurs

1640
01:02:19,600 --> 01:02:21,000
teaches it something new.

1641
01:02:21,000 --> 01:02:23,880
This is why you can't treat the agentic stack as a deployment.

1642
01:02:23,880 --> 01:02:26,520
You have to treat it as a product that needs ongoing management.

1643
01:02:26,520 --> 01:02:30,320
You're measuring it, analyzing it and documenting every decision you make.

1644
01:02:30,320 --> 01:02:33,200
When new team members join, they need to understand the system.

1645
01:02:33,200 --> 01:02:37,480
When something breaks, you need to know why you built it that way so you can debug it properly.

1646
01:02:37,480 --> 01:02:39,680
Documentation is the glue that holds this together.

1647
01:02:39,680 --> 01:02:40,680
It's not optional.

1648
01:02:40,680 --> 01:02:43,880
It's how you preserve learning and onboard people into the feedback loop.

1649
01:02:43,880 --> 01:02:49,080
It's how you maintain institutional knowledge about why your stack works the way it does.

1650
01:02:49,080 --> 01:02:50,560
Organizational implications:

1651
01:02:50,560 --> 01:02:52,280
Shifting roles and responsibilities.

1652
01:02:52,280 --> 01:02:55,280
When you deploy an agentic stack, you're not just changing tools.

1653
01:02:55,280 --> 01:02:57,080
You're changing what people do all day.

1654
01:02:57,080 --> 01:02:59,720
That ripples through how the organization actually works.

1655
01:02:59,720 --> 01:03:01,800
The most immediate shift is in the developer role.

1656
01:03:01,800 --> 01:03:05,880
It stops being about writing code and starts being about defining intent and reviewing output.

1657
01:03:05,880 --> 01:03:07,600
This isn't less work, it's different work.

1658
01:03:07,600 --> 01:03:09,600
It requires a different set of capabilities.

1659
01:03:09,600 --> 01:03:12,880
A developer using an agentic stack needs to be precise about intent.

1660
01:03:12,880 --> 01:03:16,200
When you're writing the code yourself, vagueness doesn't matter much because you figure it

1661
01:03:16,200 --> 01:03:17,200
out as you go.

1662
01:03:17,200 --> 01:03:20,640
But when you're telling an agent what to do, vagueness is amplified.

1663
01:03:20,640 --> 01:03:24,400
The agent takes your vague instruction and generates something that matches your words,

1664
01:03:24,400 --> 01:03:26,320
but misses what you actually wanted.

1665
01:03:26,320 --> 01:03:29,440
Developers need to get better at articulating intent and thinking about the goal before

1666
01:03:29,440 --> 01:03:30,440
they ask for help.

1667
01:03:30,440 --> 01:03:32,960
They also need to get better at reviewing code they didn't write.

1668
01:03:32,960 --> 01:03:35,400
When you write code, you understand every decision.

1669
01:03:35,400 --> 01:03:38,880
You remember why you structured it that way. When the agent writes code, you're seeing

1670
01:03:38,880 --> 01:03:39,880
it fresh.

1671
01:03:39,880 --> 01:03:42,480
You need to read it critically and understand the architecture deeply enough to spot

1672
01:03:42,480 --> 01:03:44,160
when the agent violated a rule.

1673
01:03:44,160 --> 01:03:46,960
You have to trace through logic that someone else created.

1674
01:03:46,960 --> 01:03:50,240
This is a learned skill and not every developer starts out good at it.

1675
01:03:50,240 --> 01:03:52,200
And decision making becomes more central.

1676
01:03:52,200 --> 01:03:55,320
Not every task the agent completes will be obviously correct.

1677
01:03:55,320 --> 01:03:58,520
Sometimes the implementation is technically sound but doesn't match the business need.

1678
01:03:58,520 --> 01:04:02,640
There are always trade-offs between speed and clarity or elegance and maintainability.

1679
01:04:02,640 --> 01:04:06,000
Those decisions were always happening, but now they're more visible because the agent is

1680
01:04:06,000 --> 01:04:07,320
surfacing options.

1681
01:04:07,320 --> 01:04:10,000
Developers need to develop judgment about which choice is right.

1682
01:04:10,000 --> 01:04:11,960
New roles emerge alongside this shift.

1683
01:04:11,960 --> 01:04:13,160
The agent architect is one.

1684
01:04:13,160 --> 01:04:17,360
This person designs how the agent stack works and decides which agents exist and what

1685
01:04:17,360 --> 01:04:18,800
guardrails contain them.

1686
01:04:18,800 --> 01:04:22,440
This is an extension of the architect role, not a completely new position.

1687
01:04:22,440 --> 01:04:25,920
Existing architects will start spending their time thinking about agent design.

1688
01:04:25,920 --> 01:04:27,240
Context engineer is another.

1689
01:04:27,240 --> 01:04:30,960
This is someone who maintains project memory and keeps architecture files current.

1690
01:04:30,960 --> 01:04:35,200
They update the gotchas as new problems emerge and write pattern files that teach agents what

1691
01:04:35,200 --> 01:04:36,200
good looks like.

1692
01:04:36,200 --> 01:04:38,360
They manage the knowledge base that agents read from.

1693
01:04:38,360 --> 01:04:42,040
This extends the senior engineer role and some seniors will realize they're naturally good

1694
01:04:42,040 --> 01:04:43,440
at encoding knowledge.

1695
01:04:43,440 --> 01:04:44,680
Agent trainer is the third.

1696
01:04:44,680 --> 01:04:47,960
This person monitors performance and watches which tasks succeed or fail.

1697
01:04:47,960 --> 01:04:51,200
They collect feedback from developers and analyze logs to understand where the agent is

1698
01:04:51,200 --> 01:04:52,200
struggling.

1699
01:04:52,200 --> 01:04:55,400
They feed this insight back into improving the context and the tools.

1700
01:04:55,400 --> 01:04:58,440
This is part of the architect role and part of the team lead role.

1701
01:04:58,440 --> 01:05:02,120
The organizational structure doesn't have to blow up but it often shifts.

1702
01:05:02,120 --> 01:05:05,880
Instead of front-end and back-end teams, you might see platform and product teams.

1703
01:05:05,880 --> 01:05:09,320
The platform team manages the agent stack and keeps it running while the product team

1704
01:05:09,320 --> 01:05:10,320
uses it.

1705
01:05:10,320 --> 01:05:13,840
They focus on building features, not on the mechanics of code generation.

1706
01:05:13,840 --> 01:05:17,200
Or you might see an automation team that specializes in agent workflows.

1707
01:05:17,200 --> 01:05:22,280
They design how agents flow through your development process and they own the orchestration layer.

1708
01:05:22,280 --> 01:05:23,560
But here's what matters most.

1709
01:05:23,560 --> 01:05:25,640
None of this works without trust.

1710
01:05:25,640 --> 01:05:28,280
Developers need to trust that the agent is doing good work.

1711
01:05:28,280 --> 01:05:32,760
This isn't blind trust but informed trust based on seeing high-quality code over a thousand

1712
01:05:32,760 --> 01:05:33,760
tasks.

1713
01:05:33,760 --> 01:05:35,600
That's the kind of trust that matters.

1714
01:05:35,600 --> 01:05:39,200
Managers need to trust that agents are actually improving productivity, not just creating

1715
01:05:39,200 --> 01:05:40,960
the illusion of activity.

1716
01:05:40,960 --> 01:05:46,280
This trust is built on metrics like seeing cycle time decrease and throughput genuinely increase.

1717
01:05:46,280 --> 01:05:47,960
This trust doesn't appear automatically.

1718
01:05:47,960 --> 01:05:49,160
It's built through transparency.

1719
01:05:49,160 --> 01:05:53,040
You show the metrics, you share examples of agent-generated code, and you walk people through

1720
01:05:53,040 --> 01:05:54,040
the feedback loop.

1721
01:05:54,040 --> 01:05:57,200
You demonstrate that the agent is learning and improving over time.

1722
01:05:57,200 --> 01:05:58,960
The structural shift is fundamental.

1723
01:05:58,960 --> 01:06:00,880
It's not "humans do the work" anymore.

1724
01:06:00,880 --> 01:06:03,800
It's "agents do routine work, humans do judgment."

1725
01:06:03,800 --> 01:06:05,000
This isn't a small change.

1726
01:06:05,000 --> 01:06:08,440
It affects hiring decisions, training and how you evaluate performance.

1727
01:06:08,440 --> 01:06:11,120
It changes career development but it's also an opportunity.

1728
01:06:11,120 --> 01:06:14,680
Humans get to focus on harder problems and features that require domain expertise.

1729
01:06:14,680 --> 01:06:17,200
The boring, mechanical work gets automated.

1730
01:06:17,200 --> 01:06:19,120
That's what every organization says they want.

1731
01:06:19,120 --> 01:06:23,320
The ones that actually build agentic stacks that work will get it, while the ones that don't

1732
01:06:23,320 --> 01:06:26,280
will keep doing routine work manually.

1733
01:06:26,280 --> 01:06:28,440
Agent failure modes: what can go wrong?

1734
01:06:28,440 --> 01:06:33,200
The agentic developer stack is powerful, but like any powerful system it can fail catastrophically

1735
01:06:33,200 --> 01:06:35,320
if you aren't careful about how you deploy it.

1736
01:06:35,320 --> 01:06:36,960
These aren't hypothetical risks.

1737
01:06:36,960 --> 01:06:40,480
They are real things that happen to teams that move too fast without thinking through

1738
01:06:40,480 --> 01:06:43,160
the structural implications.

1739
01:06:43,160 --> 01:06:45,240
The first failure mode is context collapse.

1740
01:06:45,240 --> 01:06:48,520
The agent generates code that looks modern but it's architecturally incompatible with

1741
01:06:48,520 --> 01:06:49,960
your actual system.

1742
01:06:49,960 --> 01:06:53,440
You wanted to refactor using new patterns and the agent delivered those patterns but they

1743
01:06:53,440 --> 01:06:55,840
simply don't fit how your system is structured.

1744
01:06:55,840 --> 01:06:58,240
Dependencies start flowing in weird directions.

1745
01:06:58,240 --> 01:07:00,560
Cross-cutting concerns leak across boundaries.

1746
01:07:00,560 --> 01:07:04,200
You end up with modern technical debt which is just code written in a new framework but

1747
01:07:04,200 --> 01:07:06,020
organized like an old messy one.

1748
01:07:06,020 --> 01:07:07,520
This happens when context is poor.

1749
01:07:07,520 --> 01:07:11,320
When the agent doesn't understand your architecture it fills in the gaps with assumptions that

1750
01:07:11,320 --> 01:07:13,000
don't match reality.

1751
01:07:13,000 --> 01:07:14,640
Then there is quality degradation.

1752
01:07:14,640 --> 01:07:18,600
The agent generates code that passes every single one of your tests,

1753
01:07:18,600 --> 01:07:20,040
but it's fragile underneath.

1754
01:07:20,040 --> 01:07:21,720
The tests check the happy path.

1755
01:07:21,720 --> 01:07:23,800
They don't check if error handling is complete

1756
01:07:23,800 --> 01:07:25,800
or if performance holds up under a heavy load.

1757
01:07:25,800 --> 01:07:28,120
You merge the code and it works for two weeks,

1758
01:07:28,120 --> 01:07:31,240
until someone hits an edge case you didn't test for and the whole thing breaks.

1759
01:07:31,240 --> 01:07:32,880
This happens when guardrails are weak.

1760
01:07:32,880 --> 01:07:35,920
It happens when validation checks aren't deep enough to catch subtle problems,

1761
01:07:35,920 --> 01:07:40,040
or when you haven't defined what good actually looks like for your specific system.

1762
01:07:40,040 --> 01:07:41,360
Deadlock is another major risk.

1763
01:07:41,360 --> 01:07:45,400
The agent gets stuck trying to solve a problem and loops endlessly through variations of the

1764
01:07:45,400 --> 01:07:46,800
same broken approach.

1765
01:07:46,800 --> 01:07:50,280
It burns through your entire iteration budget without making any progress.

1766
01:07:50,280 --> 01:07:51,280
At scale,

1767
01:07:51,280 --> 01:07:53,360
this isn't just one agent stuck on one task.

1768
01:07:53,360 --> 01:07:55,880
It's dozens of agents stuck on dozens of tasks,

1769
01:07:55,880 --> 01:07:59,400
and your entire pipeline grinds to a halt while you wait for a resolution.

1770
01:07:59,400 --> 01:08:01,560
This happens when tool integration is incomplete.

1771
01:08:01,560 --> 01:08:03,400
The agent can't validate its own work,

1772
01:08:03,400 --> 01:08:06,120
so it doesn't have enough signal to realize its approach isn't working.
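One way to keep a stuck agent from burning its whole iteration budget is to notice when the same failure keeps coming back and escalate to a human. The sketch below is only an illustration under assumptions: `attempt` is a hypothetical callable that returns a success flag and a short failure signature (for example, the first line of an error message), and the thresholds are arbitrary.

```python
def run_until_progress_stalls(attempt, max_iterations=10, max_repeats=2):
    """Call attempt() until it succeeds, stalls, or the budget runs out.

    attempt() must return (succeeded, failure_signature).
    """
    seen = {}
    for i in range(1, max_iterations + 1):
        ok, signature = attempt()
        if ok:
            return ("succeeded", i)
        seen[signature] = seen.get(signature, 0) + 1
        if seen[signature] > max_repeats:
            # Same failure again and again: stop looping and ask for help.
            return ("escalated", i)
    return ("budget_exhausted", max_iterations)

# An agent that keeps hitting the same error is escalated early.
stuck = run_until_progress_stalls(lambda: (False, "ImportError: no module named foo"))

# An agent that fails once and then succeeds is left alone.
outcomes = iter([(False, "test_login failed"), (True, "")])
recovered = run_until_progress_stalls(lambda: next(outcomes))
print(stuck, recovered)
```

The failure signature is the validation signal: without some way to check its own work, the agent has nothing to compare across iterations.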

1773
01:08:06,120 --> 01:08:07,800
Overreliance is more subtle,

1774
01:08:07,800 --> 01:08:08,960
but it's just as dangerous.

1775
01:08:08,960 --> 01:08:10,800
Your team stops thinking about the code.

1776
01:08:10,800 --> 01:08:14,560
They stop reviewing carefully because the agent wrote it and the tests passed,

1777
01:08:14,560 --> 01:08:15,640
so they just hit merge.

1778
01:08:15,640 --> 01:08:18,720
This works fine until the agent makes an assumption you didn't catch

1779
01:08:18,720 --> 01:08:20,520
or the tests miss a critical edge case.

1780
01:08:20,520 --> 01:08:23,120
The merged code becomes a massive liability.

1781
01:08:23,120 --> 01:08:25,240
This happens when trust replaces judgment.

1782
01:08:25,240 --> 01:08:27,640
People start treating agent output as finished work

1783
01:08:27,640 --> 01:08:30,520
instead of a first draft that still needs a human to look at it.

1784
01:08:30,520 --> 01:08:32,600
The deepest failure mode is skill erosion.

1785
01:08:32,600 --> 01:08:36,280
Junior developers never actually learn to code because the agent does the heavy lifting

1786
01:08:36,280 --> 01:08:37,280
for them.

1787
01:08:37,280 --> 01:08:39,640
They spend their time reviewing code instead of writing it.

1788
01:08:39,640 --> 01:08:42,680
They make calls about whether generated code is good enough,

1789
01:08:42,680 --> 01:08:46,240
but they never develop the muscle memory of building things from scratch.

1790
01:08:46,240 --> 01:08:50,000
They never hit the walls that force you to develop real problem solving instincts.

1791
01:08:50,000 --> 01:08:51,680
A few years into their career,

1792
01:08:51,680 --> 01:08:53,160
they can't function without the agent.

1793
01:08:53,160 --> 01:08:55,920
They've become dependent on the tool rather than capable with it.

1794
01:08:55,920 --> 01:08:58,880
This is an organizational problem, not a technology problem,

1795
01:08:58,880 --> 01:09:00,120
but it is very real.

1796
01:09:00,120 --> 01:09:01,400
These risks exist,

1797
01:09:01,400 --> 01:09:02,400
but they are manageable.

1798
01:09:02,400 --> 01:09:05,920
The key is treating the agent stack as a tool that augments humans,

1799
01:09:05,920 --> 01:09:07,320
not one that replaces them.

1800
01:09:07,320 --> 01:09:08,840
Humans make the final decisions.

1801
01:09:08,840 --> 01:09:10,400
Humans review the code.

1802
01:09:10,400 --> 01:09:11,840
Humans maintain the context.

1803
01:09:11,840 --> 01:09:14,120
The agent doesn't take over those responsibilities.

1804
01:09:14,120 --> 01:09:15,720
It just takes over the mechanical work.

1805
01:09:15,720 --> 01:09:17,720
You also have to invest in the foundations.

1806
01:09:17,720 --> 01:09:19,280
Building good context takes time.

1807
01:09:19,280 --> 01:09:22,200
And guardrails require you to think through every possible failure.

1808
01:09:22,200 --> 01:09:24,960
Tool integration is detailed, difficult work.

1809
01:09:24,960 --> 01:09:28,200
And feedback loops have to be part of your daily process.

1810
01:09:28,200 --> 01:09:30,320
Organizations that skip these foundations will fail.

1811
01:09:30,320 --> 01:09:33,680
They deploy an agent, it generates garbage, they get burned, and then they abandon the

1812
01:09:33,680 --> 01:09:35,200
technology entirely.

1813
01:09:35,200 --> 01:09:37,680
But the organizations that invest in those foundations succeed.

1814
01:09:37,680 --> 01:09:40,040
They deploy an agent and it generates useful code.

1815
01:09:40,040 --> 01:09:41,360
They see real improvements.

1816
01:09:41,360 --> 01:09:42,760
And they expand how they use it.

1817
01:09:42,760 --> 01:09:46,480
The agent stack becomes normal infrastructure, not just an experiment.

1818
01:09:46,480 --> 01:09:48,120
The structural insight is this.

1819
01:09:48,120 --> 01:09:50,560
The agent stack amplifies what is already there.

1820
01:09:50,560 --> 01:09:52,120
Good practices get better.

1821
01:09:52,120 --> 01:09:54,520
Bad practices get worse.

1822
01:09:54,520 --> 01:09:58,840
If your code review process is weak, agents will expose that by generating code that slips

1823
01:09:58,840 --> 01:10:01,280
right through. If your architecture is a mess,

1824
01:10:01,280 --> 01:10:04,960
agents will amplify that by violating your rules in new and creative ways.

1825
01:10:04,960 --> 01:10:07,280
You don't fix these problems with a better agent.

1826
01:10:07,280 --> 01:10:11,800
You fix them by building the stack the right way, by investing in context, strengthening

1827
01:10:11,800 --> 01:10:14,080
guardrails and integrating your tools properly.

1828
01:10:14,080 --> 01:10:16,640
This isn't a set-it-and-forget-it technology.

1829
01:10:16,640 --> 01:10:19,560
It's an ongoing investment that requires constant attention.

1830
01:10:19,560 --> 01:10:23,640
The teams that understand this will thrive, while the teams that try to deploy it passively

1831
01:10:23,640 --> 01:10:25,040
will struggle.

1832
01:10:25,040 --> 01:10:26,880
The terminal is no longer just for commands.

1833
01:10:26,880 --> 01:10:30,600
It's the orchestration layer for your entire development workflow. The agentic developer

1834
01:10:30,600 --> 01:10:35,280
stack, with its layers for orchestration, transformation, validation and execution,

1835
01:10:35,280 --> 01:10:36,600
is how you build software now.

1836
01:10:36,600 --> 01:10:40,320
It isn't about faster auto-complete. It's about agents managing the routine work, while

1837
01:10:40,320 --> 01:10:42,160
humans manage the policy.

1838
01:10:42,160 --> 01:10:46,000
Organizations that build this stack well will move faster, ship better code, and attract

1839
01:10:46,000 --> 01:10:47,680
the best talent.

1840
01:10:47,680 --> 01:10:50,320
Organizations that ignore this shift will simply get left behind.

1841
01:10:50,320 --> 01:10:52,320
The question isn't whether you should use agents.

1842
01:10:52,320 --> 01:10:56,040
The question is, how do you build an agentic stack that actually works?

1843
01:10:56,040 --> 01:10:58,080
The answer is in the details.

1844
01:10:58,080 --> 01:11:00,840
Context, guardrails, tools, feedback loops and constant improvement.
