The Death of the Generalist Bot: Why Your Copilot Needs a Mixture of Experts


Most organizations are building AI the same way. One copilot. One interface. One large model expected to handle every request. At first glance, the approach feels simple, scalable, and easy to govern. But as AI adoption accelerates, many organizations are discovering that the generalist AI model creates hidden costs, inconsistent quality, governance challenges, and growing operational complexity. In this episode of the M365 FM Podcast, we explore why the future of enterprise AI is not a single super-intelligent assistant but a governed network of specialized experts working together through intelligent routing, orchestration, and policy-driven decision making.
THE PROBLEM WITH THE GENERALIST AI MODEL
The idea of a single AI assistant sounds attractive. Users get one interface. IT gets one platform. Leadership gets one AI strategy. The reality is far more complicated. As organizations expand AI use cases, the same assistant suddenly becomes responsible for:
- Knowledge retrieval
- Policy interpretation
- Workflow execution
- Document summarization
- Data extraction
- Business automation
WHY AI COSTS EXPLODE FASTER THAN EXPECTED
Many organizations focus exclusively on model pricing while ignoring the architecture decisions driving overall AI costs. This discussion examines:
- Premium model overuse
- Blended cost analysis
- High-volume routine workloads
- Token consumption patterns
- Cheap-first routing strategies
- Escalation-based AI architectures
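The blended-cost idea behind cheap-first routing can be made concrete with a small calculation. The sketch below uses the per-million-token prices quoted in the episode (GPT-4o at $2.50 input and $10 output, Phi-4 at roughly $0.125 input and $0.50 output). The token counts per request and the 20% escalation rate are illustrative assumptions, not figures from the episode, and the names `ModelPrice` and `blended_cost_per_request` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost in USD of one request with the given token counts."""
        return (input_tokens * self.input_per_million
                + output_tokens * self.output_per_million) / 1_000_000


# Prices as quoted in the episode; verify current provider pricing before use.
PREMIUM = ModelPrice(2.50, 10.00)  # GPT-4o class
SMALL = ModelPrice(0.125, 0.50)    # Phi-4 class


def blended_cost_per_request(escalation_rate: float,
                             input_tokens: int,
                             output_tokens: int) -> float:
    """Average cost per request when every request hits the small model
    first and a fraction (escalation_rate) is then re-run on the premium
    model. Escalated requests pay for both passes."""
    small = SMALL.cost(input_tokens, output_tokens)
    premium = PREMIUM.cost(input_tokens, output_tokens)
    return small + escalation_rate * premium


if __name__ == "__main__":
    # Assumed traffic shape: 1,000 input and 300 output tokens per request.
    all_premium = PREMIUM.cost(1000, 300)
    cheap_first = blended_cost_per_request(0.20, 1000, 300)
    print(f"all premium: ${all_premium:.6f} per request")
    print(f"cheap first: ${cheap_first:.6f} per request")
```

Under these assumptions the cheap-first path costs about a quarter of the all-premium path. The result is driven almost entirely by the escalation rate, which is why mapping the traffic mix matters more than comparing list prices.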
SMALL MODELS ARE MORE POWERFUL THAN MOST PEOPLE THINK
One of the most surprising themes of the episode is the growing role of smaller AI models such as Microsoft's Phi family. The conversation explores why:
- Classification tasks rarely need large models
- Intent detection can run efficiently on smaller models
- Extraction workloads benefit from specialization
- Routing decisions favor low-latency models
- Operational efficiency often beats raw intelligence
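For extraction work, the episode stresses schema discipline over eloquence. One way to picture that is a strict validator sitting behind the small model: the model's output is accepted only if it matches a fixed schema. The sketch below is a minimal illustration using the standard library; the field names, the category list, and the function `parse_extraction` are hypothetical, and the raw string stands in for whatever a model endpoint would return.

```python
import json

# Hypothetical schema for a ticket-extraction expert.
REQUIRED_FIELDS = {"ticket_id": str, "category": str, "priority": str}
ALLOWED_CATEGORIES = {"access", "hardware", "software"}


def parse_extraction(raw: str) -> dict:
    """Validate a model's JSON output against the fixed schema.

    Raises ValueError on any deviation so the caller can retry or
    escalate instead of passing malformed data downstream.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    if set(data) != set(REQUIRED_FIELDS):
        raise ValueError(
            f"expected fields {sorted(REQUIRED_FIELDS)}, got {sorted(data)}"
        )
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field '{field}' must be {expected_type.__name__}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category '{data['category']}'")
    return data
```

A failed validation is a natural escalation signal: the request can be retried or handed to a larger model rather than silently accepted.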
UNDERSTANDING MIXTURE OF EXPERTS
Mixture of Experts (MoE) is often misunderstood. Many people associate MoE only with advanced model architectures that activate specialized internal experts. This episode explores a more practical enterprise interpretation: a governed system of specialized AI services working together. Topics include:
- Model-level MoE
- System-level MoE
- Expert specialization
- Intelligent routing
- Expert orchestration
- Bounded responsibilities
COPILOT STUDIO VS AZURE AI FOUNDRY
One of the most important architectural discussions focuses on the relationship between Microsoft Copilot Studio and Azure AI Foundry. The episode explains why these platforms should not compete with one another. Instead:
- Copilot Studio becomes the user experience layer
- Azure AI Foundry becomes the reasoning layer
- Routing logic manages model selection
- Specialist agents perform bounded tasks
- Governance controls span the entire architecture
WHY ROUTERS ARE THE MOST IMPORTANT AGENTS
Most organizations begin with answer generation. This episode argues for a different starting point. The first expert should be the router. A routing agent determines:
- Task type
- Complexity
- Risk level
- Domain ownership
- Escalation requirements
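The five outputs listed above map naturally onto a structured routing decision. The sketch below is only an illustration: a keyword table stands in for the small classifier model the episode has in mind, and every name in it (`RoutingDecision`, `route_request`, the keyword lists, the owner names) is a hypothetical placeholder rather than part of any Microsoft API.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DETERMINISTIC = "deterministic"  # run a workflow or rule, no LLM call
    SMALL_MODEL = "small_model"      # routine, bounded work
    PREMIUM_MODEL = "premium_model"  # escalated: risky or ambiguous


@dataclass(frozen=True)
class RoutingDecision:
    task_type: str
    complexity: str    # "low" or "high"
    risk_level: str    # "low" or "high"
    domain_owner: str
    route: Route       # encodes the escalation requirement


# Hypothetical tables; a real router would use a small classifier model.
TASK_KEYWORDS = {
    "extraction": ("extract", "invoice", "fields"),
    "policy": ("policy", "allowed", "compliance"),
    "workflow": ("approve", "submit", "reset"),
}
DOMAIN_OWNERS = {
    "extraction": "finance-ops",
    "policy": "hr-policy",
    "workflow": "it-service-desk",
}
HIGH_RISK_TERMS = ("legal", "termination", "contract")


def route_request(text: str) -> RoutingDecision:
    lowered = text.lower()
    # First matching task type wins; anything unmatched is "general".
    task_type = next(
        (task for task, words in TASK_KEYWORDS.items()
         if any(word in lowered for word in words)),
        "general",
    )
    risk = "high" if any(term in lowered for term in HIGH_RISK_TERMS) else "low"
    # Long or unclassified requests are treated as complex.
    is_complex = len(lowered.split()) > 40 or task_type == "general"
    complexity = "high" if is_complex else "low"
    if risk == "high" or complexity == "high":
        route = Route.PREMIUM_MODEL
    elif task_type == "workflow":
        route = Route.DETERMINISTIC
    else:
        route = Route.SMALL_MODEL
    owner = DOMAIN_OWNERS.get(task_type, "unassigned")
    return RoutingDecision(task_type, complexity, risk, owner, route)
```

Because the decision is a plain data object, it can be logged and audited before any expert runs, which is what makes the router itself governable.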
DESIGNING SPECIALIZED AI EXPERTS
A successful expert fabric depends on clearly defined specialist roles. The discussion explores expert categories such as:
- Knowledge experts
- Policy experts
- Workflow experts
- Analytics experts
- Extraction experts
- Technical experts
THE ROLE OF RAG IN AN EXPERT FABRIC
Retrieval-Augmented Generation remains an essential capability, but this episode challenges a common misconception. RAG is not the expert. RAG is a capability used by experts. Topics include:
- Modular RAG architectures
- Knowledge segmentation
- Permission-aware retrieval
- Specialist knowledge indexes
- Graph-based retrieval
- Hybrid search strategies
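Permission-aware retrieval over segmented indexes can be sketched in a few lines. The example below is a toy, in-memory illustration: ranking is naive term overlap rather than real hybrid or graph-based search, and the `Document` type, the `retrieve` function, the group names, and the sample corpus are all hypothetical. The point it tries to show is the ordering: filter by index and permission first, then rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    index: str                 # specialist index the document belongs to
    allowed_groups: frozenset  # groups permitted to read it
    text: str


def retrieve(query, user_groups, index, corpus, top_k=3):
    """Return up to top_k documents from one index that the user may read.

    Permission and index filters run before scoring, so content the user
    cannot access never reaches ranking or the model's context.
    """
    terms = set(query.lower().split())
    visible = [
        doc for doc in corpus
        if doc.index == index and doc.allowed_groups & set(user_groups)
    ]
    # Toy relevance score: number of query terms present in the document.
    scored = [(len(terms & set(doc.text.lower().split())), doc) for doc in visible]
    ranked = sorted(
        (pair for pair in scored if pair[0] > 0),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [doc for _, doc in ranked[:top_k]]


# Hypothetical sample corpus spanning two specialist indexes.
CORPUS = [
    Document("hr-1", "hr", frozenset({"all-staff"}),
             "Parental leave policy is twenty weeks"),
    Document("hr-2", "hr", frozenset({"hr-admins"}),
             "Salary bands for leave cover"),
    Document("fin-1", "finance", frozenset({"all-staff"}),
             "Expense policy for travel"),
]
```

With this corpus, a user who belongs only to `all-staff` and queries the `hr` index for "leave policy" gets `hr-1` back and never sees `hr-2`, while the finance document stays out of scope because it lives in a different index.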
GOVERNANCE IN A MULTI-AGENT WORLD
As organizations move from single assistants to multi-agent systems, governance becomes dramatically more important. The conversation explores:
- Agent ownership models
- Identity management
- Lifecycle governance
- Auditability
- Traceability
- Permission management
AGENT 365 AND THE FUTURE OF AGENT GOVERNANCE
Microsoft's Agent 365 vision introduces new approaches to managing AI agents across the enterprise. Topics include:
- Agent identities
- Agent registries
- Lifecycle management
- Discovery and inventory
- Security integration
- Governance automation
AZURE POLICY FOR AI MODEL GOVERNANCE
Model selection is increasingly becoming a governance challenge. This episode explores how Azure Policy can help organizations control:
- Approved models
- Approved publishers
- Deployment standards
- Production readiness
- Model lifecycle management
- Compliance requirements
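The kind of rule such a policy enforces can be pictured as a simple allow-list check. The sketch below is not Azure Policy syntax and does not call any Azure API. It is a plain Python illustration of the logic, and the type `ModelDeployment`, the function `violations`, and the approved lists are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDeployment:
    name: str
    publisher: str
    model: str
    environment: str  # "dev" or "prod"


# Hypothetical allow-lists an organization might maintain.
APPROVED_PUBLISHERS = {"Microsoft", "OpenAI"}
APPROVED_PROD_MODELS = {"Phi-4", "gpt-4o"}


def violations(deployment: ModelDeployment) -> list:
    """Return human-readable policy violations for one deployment.

    An empty list means the deployment passes both checks.
    """
    issues = []
    if deployment.publisher not in APPROVED_PUBLISHERS:
        issues.append(f"publisher '{deployment.publisher}' is not approved")
    if deployment.environment == "prod" and deployment.model not in APPROVED_PROD_MODELS:
        issues.append(f"model '{deployment.model}' is not approved for production")
    return issues
```

Keeping the production list stricter than the development list is one way to let teams experiment while holding production to a readiness standard.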
THE FUTURE OF AI ISN'T ONE MIND
Perhaps the most important takeaway from this episode is simple: the future of enterprise AI is not one giant assistant trying to solve every problem. It is a coordinated ecosystem of specialized experts. Each expert understands a specific task. Each expert operates within defined boundaries. Each expert contributes to a governed, observable, and scalable AI architecture.
FINAL THOUGHTS
As AI platforms mature, organizations must move beyond the idea that bigger models automatically create better solutions. The winners will be those that build intelligent routing systems, embrace specialization, implement strong governance, and create expert fabrics that balance performance, cost, security, and operational control. The question is no longer whether your organization will use AI. The real question is whether you will trust one mind to do everything, or build a governed network of experts designed to work together.
1
00:00:00,000 --> 00:00:01,920
Most teams are building the same thing right now,
2
00:00:01,920 --> 00:00:04,960
one copilot, one interface, one big model behind it,
3
00:00:04,960 --> 00:00:07,000
and the assumption that one smart system
4
00:00:07,000 --> 00:00:09,320
should handle everything. That looks clean.
5
00:00:09,320 --> 00:00:12,200
But in reality, it does the opposite. It drives up cost,
6
00:00:12,200 --> 00:00:15,120
it lowers answer quality, it makes governance messy fast.
7
00:00:15,120 --> 00:00:16,960
Because the issue isn't AI adoption,
8
00:00:16,960 --> 00:00:18,120
it's the model behind it.
9
00:00:18,120 --> 00:00:20,480
In this episode, I want to replace that model
10
00:00:20,480 --> 00:00:23,560
with a better one, a governed fabric of experts.
11
00:00:23,560 --> 00:00:26,280
We need an architecture where small models route traffic,
12
00:00:26,280 --> 00:00:28,200
specialists do bounded work,
13
00:00:28,200 --> 00:00:30,360
and your system actually matches the job.
14
00:00:30,360 --> 00:00:32,720
If you keep the generalist, you lock in cost sprawl
15
00:00:32,720 --> 00:00:34,160
and blurred accountability.
16
00:00:34,160 --> 00:00:36,240
And if you want more of this, subscribe.
17
00:00:36,240 --> 00:00:38,520
But first, we need to kill the old assumption.
18
00:00:38,520 --> 00:00:40,960
Why the generalist bot breaks at scale.
19
00:00:40,960 --> 00:00:42,400
The generalist bot sounds efficient
20
00:00:42,400 --> 00:00:44,280
because it gives everyone one place to go.
21
00:00:44,280 --> 00:00:47,840
One entry point, one assistant, one prompt layer.
22
00:00:47,840 --> 00:00:50,520
It offers a single ownership story, at least on paper.
23
00:00:50,520 --> 00:00:53,560
But the moment usage grows, that simplicity starts to crack.
24
00:00:53,560 --> 00:00:55,920
The bot is no longer answering one kind of question.
25
00:00:55,920 --> 00:00:58,080
It's trying to classify intent, fetch policy,
26
00:00:58,080 --> 00:00:59,920
and summarize documents all at once.
27
00:00:59,920 --> 00:01:02,520
Then it tries to extract fields, trigger workflows,
28
00:01:02,520 --> 00:01:03,840
and reason through ambiguity.
29
00:01:03,840 --> 00:01:05,000
Those are not the same job.
30
00:01:05,000 --> 00:01:06,280
That's where things break.
31
00:01:06,280 --> 00:01:08,560
Routing is one job, extraction is another,
32
00:01:08,560 --> 00:01:10,120
policy lookup is another,
33
00:01:10,120 --> 00:01:12,960
and when you push all of that through one giant assistant,
34
00:01:12,960 --> 00:01:15,120
you're forcing one model to play too many roles.
35
00:01:15,120 --> 00:01:16,800
The result isn't flexibility.
36
00:01:16,800 --> 00:01:18,360
It's interference.
37
00:01:18,360 --> 00:01:19,800
A lot of enterprise teams miss this
38
00:01:19,800 --> 00:01:21,080
because the first demo works.
39
00:01:21,080 --> 00:01:23,120
Early traffic is light, questions stay simple,
40
00:01:23,120 --> 00:01:24,880
and the audience is usually forgiving.
41
00:01:24,880 --> 00:01:27,320
So the bot looks capable, it feels broad.
42
00:01:27,320 --> 00:01:29,240
And because it can answer something in most cases,
43
00:01:29,240 --> 00:01:30,760
people assume it can own everything.
44
00:01:30,760 --> 00:01:33,760
But broad coverage is not the same as operational fitness.
45
00:01:33,760 --> 00:01:35,960
A system that can attempt every task is not a system
46
00:01:35,960 --> 00:01:38,680
that should own every task. Now, one level deeper.
47
00:01:38,680 --> 00:01:40,840
Different tasks fail in different ways.
48
00:01:40,840 --> 00:01:43,200
A routing decision needs speed and low cost.
49
00:01:43,200 --> 00:01:45,360
A document extraction task needs structured output
50
00:01:45,360 --> 00:01:46,440
and consistency.
51
00:01:46,440 --> 00:01:48,680
A policy assistant needs bounded retrieval
52
00:01:48,680 --> 00:01:50,240
and strict access control.
53
00:01:50,240 --> 00:01:51,960
When one generalist handles all of that,
54
00:01:51,960 --> 00:01:53,680
you don't get one stable behavior.
55
00:01:53,680 --> 00:01:56,200
You get trade-offs hidden inside a single surface.
56
00:01:56,200 --> 00:01:58,960
This is why quality starts dropping as scope expands.
57
00:01:58,960 --> 00:02:01,280
The team keeps adding prompts, tools, and exceptions
58
00:02:01,280 --> 00:02:02,080
to the stack.
59
00:02:02,080 --> 00:02:03,880
The bot becomes a pile of compensations.
60
00:02:03,880 --> 00:02:06,360
It answers well in one area, but that same tuning
61
00:02:06,360 --> 00:02:07,320
hurts another.
62
00:02:07,320 --> 00:02:09,120
It behaves safely for one workflow,
63
00:02:09,120 --> 00:02:10,960
but now it sounds too rigid somewhere else.
64
00:02:10,960 --> 00:02:13,760
It can reason deeply, but it's too expensive for high volume
65
00:02:13,760 --> 00:02:14,560
traffic.
66
00:02:14,560 --> 00:02:16,640
Or it stays cheap, but now it misses nuance
67
00:02:16,640 --> 00:02:17,920
where the stakes are higher.
68
00:02:17,920 --> 00:02:19,680
So what's actually happening is simple.
69
00:02:19,680 --> 00:02:21,320
The bot is overgeneralized.
70
00:02:21,320 --> 00:02:23,760
And overgeneralized systems create three kinds of debt.
71
00:02:23,760 --> 00:02:26,720
First, cost debt.
72
00:02:26,720 --> 00:02:28,240
When one assistant fronts everything,
73
00:02:28,240 --> 00:02:30,520
the safest choice is to run a premium model.
74
00:02:30,520 --> 00:02:32,000
The team can't predict what might come in,
75
00:02:32,000 --> 00:02:33,440
so they over-provision.
76
00:02:33,440 --> 00:02:35,760
And that means easy work, like basic policy answers,
77
00:02:35,760 --> 00:02:37,600
gets priced like premium reasoning.
78
00:02:37,600 --> 00:02:40,480
The architecture treats every request like a potential edge
79
00:02:40,480 --> 00:02:40,880
case.
80
00:02:40,880 --> 00:02:42,920
That's expensive by design.
81
00:02:42,920 --> 00:02:44,720
Second, quality debt.
82
00:02:44,720 --> 00:02:47,800
The more jobs one bot owns, the harder it gets to keep answers
83
00:02:47,800 --> 00:02:48,240
sharp.
84
00:02:48,240 --> 00:02:49,600
Instructions get longer.
85
00:02:49,600 --> 00:02:50,880
Context gets broader.
86
00:02:50,880 --> 00:02:52,600
Tool choices multiply.
87
00:02:52,600 --> 00:02:54,760
The system starts blending responsibilities
88
00:02:54,760 --> 00:02:56,000
that should stay separate.
89
00:02:56,000 --> 00:02:57,560
Instead of a clean domain boundary,
90
00:02:57,560 --> 00:02:58,800
you get a vague assistant.
91
00:02:58,800 --> 00:03:00,960
It sounds helpful, but it often lands in the middle,
92
00:03:00,960 --> 00:03:03,600
not wrong enough to trigger alarms, not precise enough
93
00:03:03,600 --> 00:03:04,400
to trust deeply.
94
00:03:04,400 --> 00:03:06,080
That middle is dangerous.
95
00:03:06,080 --> 00:03:07,800
Because enterprise AI doesn't usually
96
00:03:07,800 --> 00:03:09,360
fail with dramatic errors first.
97
00:03:09,360 --> 00:03:12,760
It fails with low-confidence usefulness: slower than expected,
98
00:03:12,760 --> 00:03:15,760
more expensive than expected, less precise than expected.
99
00:03:15,760 --> 00:03:17,720
And because the interface still looks polished,
100
00:03:17,720 --> 00:03:19,480
the underlying weakness stays hidden.
101
00:03:19,480 --> 00:03:21,120
Third, governance debt.
102
00:03:21,120 --> 00:03:23,200
This is the part many teams leave for later.
103
00:03:23,200 --> 00:03:24,880
And later gets expensive.
104
00:03:24,880 --> 00:03:27,680
Once one bot spans too many domains, ownership blurs.
105
00:03:27,680 --> 00:03:29,040
Who owns the prompt behavior?
106
00:03:29,040 --> 00:03:30,600
Who owns the knowledge sources?
107
00:03:30,600 --> 00:03:31,920
Who reviews the actions?
108
00:03:31,920 --> 00:03:34,600
In a generalist model, those lines fade quickly.
109
00:03:34,600 --> 00:03:36,720
And blurred ownership is not a side issue.
110
00:03:36,720 --> 00:03:37,920
It is the system issue.
111
00:03:37,920 --> 00:03:39,760
In most organizations, governance works
112
00:03:39,760 --> 00:03:41,440
when purpose is narrow.
113
00:03:41,440 --> 00:03:43,520
A policy agent with a defined data scope
114
00:03:43,520 --> 00:03:45,320
and a named owner is governable.
115
00:03:45,320 --> 00:03:46,920
A workflow agent with restricted tools
116
00:03:46,920 --> 00:03:48,760
and approval rules is governable.
117
00:03:48,760 --> 00:03:51,680
A router that classifies and passes work on is governable.
118
00:03:51,680 --> 00:03:54,120
But a giant assistant sitting across every task
119
00:03:54,120 --> 00:03:55,560
becomes hard to reason about.
120
00:03:55,560 --> 00:03:57,360
Its purpose is just too broad.
121
00:03:57,360 --> 00:04:00,000
So the contrast we need is this: not one giant assistant
122
00:04:00,000 --> 00:04:01,840
that tries to know and do everything.
123
00:04:01,840 --> 00:04:04,840
A bounded expert system. One front door, if you want it.
124
00:04:04,840 --> 00:04:09,400
But behind that, specialized roles, tight scopes, clear owners,
125
00:04:09,400 --> 00:04:11,200
model choice matched to task.
126
00:04:11,200 --> 00:04:13,680
That leads to structured handoffs and better cost control,
127
00:04:13,680 --> 00:04:16,040
better answers, better governance.
128
00:04:16,040 --> 00:04:18,000
And once you see that clearly, the cost story
129
00:04:18,000 --> 00:04:19,840
becomes impossible to ignore.
130
00:04:19,840 --> 00:04:21,640
The cost model most teams never map.
131
00:04:21,640 --> 00:04:23,520
Most teams talk about AI cost too late.
132
00:04:23,520 --> 00:04:24,920
They look at the bill after the rollout.
133
00:04:24,920 --> 00:04:26,840
They compare one model price to another.
134
00:04:26,840 --> 00:04:29,240
They ask if the premium model is worth the spend.
135
00:04:29,240 --> 00:04:31,240
But the real mistake happens much earlier
136
00:04:31,240 --> 00:04:33,360
when nobody maps the traffic shape of the system
137
00:04:33,360 --> 00:04:35,960
before choosing the architecture, because not all requests
138
00:04:35,960 --> 00:04:37,280
deserve the same model.
139
00:04:37,280 --> 00:04:39,480
In most enterprise environments, a huge share of traffic
140
00:04:39,480 --> 00:04:40,360
is routine.
141
00:04:40,360 --> 00:04:42,160
You are sorting a request or classifying
142
00:04:42,160 --> 00:04:44,640
an intent or extracting specific fields.
143
00:04:44,640 --> 00:04:46,480
Maybe you are checking a policy source,
144
00:04:46,480 --> 00:04:48,800
drafting a structured response, or deciding
145
00:04:48,800 --> 00:04:50,160
whether to escalate a ticket.
146
00:04:50,160 --> 00:04:52,600
None of that is frontier model work by default.
147
00:04:52,600 --> 00:04:54,280
Yet many teams still run that traffic
148
00:04:54,280 --> 00:04:57,360
through GPT-4o-class models as if every prompt were
149
00:04:57,360 --> 00:04:59,200
a board-level reasoning problem.
150
00:04:59,200 --> 00:05:00,360
That is where the waste starts.
151
00:05:00,360 --> 00:05:02,040
The research is very clear on this.
152
00:05:02,040 --> 00:05:04,400
GPT-4o sits in a premium pricing tier.
153
00:05:04,400 --> 00:05:07,680
And public data puts it at $2.50 per million input tokens
154
00:05:07,680 --> 00:05:09,880
and $10 per million output tokens.
155
00:05:09,880 --> 00:05:12,040
By contrast, recent comparison material places
156
00:05:12,040 --> 00:05:14,680
Phi-class small models in a much lower range,
157
00:05:14,680 --> 00:05:17,720
with examples around 7 to 14 cents per million tokens.
158
00:05:17,720 --> 00:05:19,760
Phi-4 pricing is also reported
159
00:05:19,760 --> 00:05:22,400
at about 12.5 cents per million input tokens
160
00:05:22,400 --> 00:05:24,440
and 50 cents per million output tokens
161
00:05:24,440 --> 00:05:25,800
depending on your provider.
162
00:05:25,800 --> 00:05:27,760
You don't need perfect pricing precision
163
00:05:27,760 --> 00:05:30,000
to see the structural issue. The gap is massive.
164
00:05:30,000 --> 00:05:32,720
So the question is not, which model is smartest?
165
00:05:32,720 --> 00:05:35,160
The better question is, why is premium reasoning
166
00:05:35,160 --> 00:05:36,800
touching routine traffic at all?
167
00:05:36,800 --> 00:05:38,400
This clicked for a lot of teams only
168
00:05:38,400 --> 00:05:40,520
after they saw the first real invoice.
169
00:05:40,520 --> 00:05:43,080
They assumed the expensive part would be rare edge cases,
170
00:05:43,080 --> 00:05:46,480
but in reality, the high volume layer is usually the cheap work.
171
00:05:46,480 --> 00:05:48,520
It is classification, routing, extraction,
172
00:05:48,520 --> 00:05:49,800
and short transformations.
173
00:05:49,800 --> 00:05:52,440
If that layer hits an expensive model every time,
174
00:05:52,440 --> 00:05:54,080
the cost profile is already broken
175
00:05:54,080 --> 00:05:56,200
before the harder tasks even arrive.
176
00:05:56,200 --> 00:05:58,120
And one level deeper, token price alone
177
00:05:58,120 --> 00:05:59,480
still doesn't tell the whole story.
178
00:05:59,480 --> 00:06:00,800
What matters is blended cost.
179
00:06:00,800 --> 00:06:03,800
Blended cost is the average price of the full traffic mix
180
00:06:03,800 --> 00:06:05,920
across the entire path of the system.
181
00:06:05,920 --> 00:06:08,240
If you send every request to one premium model,
182
00:06:08,240 --> 00:06:10,520
your blended cost stays close to premium.
183
00:06:10,520 --> 00:06:13,760
Even though most requests never needed that level of capability.
184
00:06:13,760 --> 00:06:15,640
But if a small model handles the first pass
185
00:06:15,640 --> 00:06:18,560
and only some requests escalate, the average drops fast
186
00:06:18,560 --> 00:06:21,320
because the expensive path is no longer the default path.
187
00:06:21,320 --> 00:06:23,760
That's the cheap-first, escalate-later pattern.
188
00:06:23,760 --> 00:06:25,080
The logic is simple.
189
00:06:25,080 --> 00:06:26,600
Start with the lowest cost model that
190
00:06:26,600 --> 00:06:28,480
can do the current job reliably.
191
00:06:28,480 --> 00:06:31,320
If the request is ambiguous, risky, or genuinely difficult,
192
00:06:31,320 --> 00:06:32,440
then pass it up the chain.
193
00:06:32,440 --> 00:06:34,000
If it is routine, keep it down.
194
00:06:34,000 --> 00:06:36,280
That one shift changes the economics of the whole system
195
00:06:36,280 --> 00:06:39,400
because enterprise AI traffic is not evenly distributed.
196
00:06:39,400 --> 00:06:41,360
Most of it clusters around repeatable work.
197
00:06:41,360 --> 00:06:43,200
The research even points to routing patterns
198
00:06:43,200 --> 00:06:46,120
where small model pipelines kept most of the higher end
199
00:06:46,120 --> 00:06:48,400
quality while cutting costs sharply.
200
00:06:48,400 --> 00:06:50,200
One result describes intelligent routing
201
00:06:50,200 --> 00:06:52,240
between a small model and GPT-4 that
202
00:06:52,240 --> 00:06:56,600
preserved 95% of the quality while reducing costs by 85%.
203
00:06:56,600 --> 00:06:58,560
Even if your own workload lands differently,
204
00:06:58,560 --> 00:07:00,120
the direction is clear.
205
00:07:00,120 --> 00:07:02,560
Better routing changes the bill more than endless prompt
206
00:07:02,560 --> 00:07:04,120
tuning on the wrong model.
207
00:07:04,120 --> 00:07:05,560
There's another thing teams miss.
208
00:07:05,560 --> 00:07:07,680
They compare raw model prices, but they don't
209
00:07:07,680 --> 00:07:09,000
compare avoided spend.
210
00:07:09,000 --> 00:07:11,360
If a cheap router decides that no LLM is needed,
211
00:07:11,360 --> 00:07:12,960
that is not a small optimization.
212
00:07:12,960 --> 00:07:15,280
That is the best possible cost decision for that turn.
213
00:07:15,280 --> 00:07:18,240
A workflow can run directly, or a deterministic rule can
214
00:07:18,240 --> 00:07:20,680
answer, or a known source can be returned.
215
00:07:20,680 --> 00:07:22,680
Once you let the system decide that some traffic
216
00:07:22,680 --> 00:07:25,360
should bypass generation entirely, the economics improve
217
00:07:25,360 --> 00:07:25,960
again.
218
00:07:25,960 --> 00:07:29,200
And this is why AI bills feel unpredictable to so many leaders.
219
00:07:29,200 --> 00:07:30,640
The model is not just expensive.
220
00:07:30,640 --> 00:07:33,320
The path is ungoverned. Easy traffic, hard traffic,
221
00:07:33,320 --> 00:07:35,880
and sensitive traffic all collapse into one inference lane.
222
00:07:35,880 --> 00:07:37,680
So nobody can explain why costs rise
223
00:07:37,680 --> 00:07:39,680
except by saying usage increased.
224
00:07:39,680 --> 00:07:41,240
But usage is not the root issue.
225
00:07:41,240 --> 00:07:42,400
Cost shape is.
226
00:07:42,400 --> 00:07:44,720
You need to know what percentage of requests are simple,
227
00:07:44,720 --> 00:07:46,480
what percentage require retrieval,
228
00:07:46,480 --> 00:07:48,760
and what percentage deserve premium reasoning.
229
00:07:48,760 --> 00:07:51,240
Without that map, you are not budgeting for architecture.
230
00:07:51,240 --> 00:07:52,800
You are budgeting for hope.
231
00:07:52,800 --> 00:07:55,320
Cost alone would already justify a redesign.
232
00:07:55,320 --> 00:07:56,720
But the more interesting part is this.
233
00:07:56,720 --> 00:07:58,320
Smaller models are not only cheaper.
234
00:07:58,320 --> 00:08:01,120
In a lot of cases, they're a better operational fit.
235
00:08:01,120 --> 00:08:03,760
Why smaller models win more work than people think.
236
00:08:03,760 --> 00:08:05,920
A lot of people still hear "small model" and assume
237
00:08:05,920 --> 00:08:07,040
"second best."
238
00:08:07,040 --> 00:08:08,320
That assumption is old.
239
00:08:08,320 --> 00:08:10,760
It comes from a period when smaller meant obviously weaker
240
00:08:10,760 --> 00:08:11,960
across almost everything.
241
00:08:11,960 --> 00:08:14,480
So teams learned to think in one direction only.
242
00:08:14,480 --> 00:08:16,680
Bigger model, better result.
243
00:08:16,680 --> 00:08:18,760
But enterprise work is not one benchmark.
244
00:08:18,760 --> 00:08:21,560
It is a stack of repeated tasks with boundaries, formats,
245
00:08:21,560 --> 00:08:22,840
and expected outputs.
246
00:08:22,840 --> 00:08:25,440
Once you look at the work that way, the question changes.
247
00:08:25,440 --> 00:08:27,600
Not how smart is the model in general?
248
00:08:27,600 --> 00:08:29,880
How well does the model fit this exact job?
249
00:08:29,880 --> 00:08:32,320
That shift matters because many enterprise tasks
250
00:08:32,320 --> 00:08:33,680
are narrow by nature.
251
00:08:33,680 --> 00:08:36,520
Intent detection is narrow, triage is narrow,
252
00:08:36,520 --> 00:08:38,440
and structured extraction is narrow.
253
00:08:38,440 --> 00:08:40,760
Even a lot of summarization is narrower than people think
254
00:08:40,760 --> 00:08:43,600
because the format and the acceptable scope are already known.
255
00:08:43,600 --> 00:08:45,680
These are not open world reasoning contests.
256
00:08:45,680 --> 00:08:48,240
They are operational tasks inside bounded systems
257
00:08:48,240 --> 00:08:49,600
and bounded systems reward fit.
258
00:08:49,600 --> 00:08:51,760
This is where smaller models start winning more work
259
00:08:51,760 --> 00:08:53,200
than most teams expect.
260
00:08:53,200 --> 00:08:55,680
They don't have to beat larger models everywhere to be useful.
261
00:08:55,680 --> 00:08:57,000
They win because they are good enough
262
00:08:57,000 --> 00:08:59,720
where the task is repetitive, the inputs are familiar,
263
00:08:59,720 --> 00:09:01,640
and the output can be tightly defined.
264
00:09:01,640 --> 00:09:04,120
In those conditions, operational reliability matters
265
00:09:04,120 --> 00:09:05,400
more than raw breadth.
266
00:09:05,400 --> 00:09:06,600
Think about a router.
267
00:09:06,600 --> 00:09:09,200
Its job is not to produce a brilliant answer.
268
00:09:09,200 --> 00:09:11,560
Its job is to choose the next path correctly.
269
00:09:11,560 --> 00:09:14,080
It needs to read the request, classify the intent
270
00:09:14,080 --> 00:09:15,680
and return a structured decision.
271
00:09:15,680 --> 00:09:16,800
That is a different standard.
272
00:09:16,800 --> 00:09:18,480
The best router is not the model
273
00:09:18,480 --> 00:09:19,840
with the broadest intelligence.
274
00:09:19,840 --> 00:09:21,480
It is the model that can make that decision
275
00:09:21,480 --> 00:09:23,040
quickly, cheaply, and consistently.
276
00:09:23,040 --> 00:09:24,360
Same thing with extraction.
277
00:09:24,360 --> 00:09:26,600
If you need fields pulled from a document or categories
278
00:09:26,600 --> 00:09:29,520
assigned to a ticket, you usually care about schema discipline
279
00:09:29,520 --> 00:09:30,880
rather than eloquence.
280
00:09:30,880 --> 00:09:32,560
You want the right structure, the right slot,
281
00:09:32,560 --> 00:09:33,600
and the right label.
282
00:09:33,600 --> 00:09:35,200
Smaller models can do that well
283
00:09:35,200 --> 00:09:36,880
when the task is designed properly.
284
00:09:36,880 --> 00:09:38,240
This is the part many teams skip.
285
00:09:38,240 --> 00:09:39,800
They compare models in the abstract
286
00:09:39,800 --> 00:09:42,040
instead of comparing tasks under constraints.
287
00:09:42,040 --> 00:09:43,760
In practice, small models get stronger
288
00:09:43,760 --> 00:09:45,360
when you reduce ambiguity.
289
00:09:45,360 --> 00:09:47,560
Give them a defined schema, limit the domain,
290
00:09:47,560 --> 00:09:49,120
and keep the instructions tight.
291
00:09:49,120 --> 00:09:50,800
In some cases, the research even points
292
00:09:50,800 --> 00:09:52,600
to fine-tuned Phi-class models
293
00:09:52,600 --> 00:09:54,920
outperforming larger models on narrow-domain work.
294
00:09:54,920 --> 00:09:57,560
That should reset how teams think about capability.
295
00:09:57,560 --> 00:09:58,960
Bigger is not always better.
296
00:09:58,960 --> 00:10:00,360
Better bounded is often better.
297
00:10:00,360 --> 00:10:01,880
Latency matters here too.
298
00:10:01,880 --> 00:10:04,480
People usually treat latency as a user experience issue,
299
00:10:04,480 --> 00:10:07,000
which it is, but it also changes system behavior.
300
00:10:07,000 --> 00:10:09,840
Faster low-cost decisions mean you can route earlier
301
00:10:09,840 --> 00:10:12,160
and keep flows moving without dragging every request
302
00:10:12,160 --> 00:10:13,320
through a heavyweight path.
303
00:10:13,320 --> 00:10:15,200
In agent systems, delay compounds.
304
00:10:15,200 --> 00:10:18,000
One slow choice at the front can slow every handoff behind it.
305
00:10:18,000 --> 00:10:19,800
So when a smaller model responds faster,
306
00:10:19,800 --> 00:10:21,320
that is not just a nicer interface,
307
00:10:21,320 --> 00:10:24,040
it is a cleaner operating rhythm for the whole fabric.
308
00:10:24,040 --> 00:10:25,480
Now, there are limits.
309
00:10:25,480 --> 00:10:27,160
If the task is broad, ambiguous,
310
00:10:27,160 --> 00:10:29,040
or carries a lot of hidden complexity,
311
00:10:29,040 --> 00:10:31,080
a small model may not be the right default.
312
00:10:31,080 --> 00:10:33,080
If you need long context, synthesis
313
00:10:33,080 --> 00:10:35,720
across many sources or multimodal judgment,
314
00:10:35,720 --> 00:10:37,520
then a larger model may earn its place.
315
00:10:37,520 --> 00:10:39,520
The point is not to pretend the gap is gone.
316
00:10:39,520 --> 00:10:41,000
The point is to stop using that gap
317
00:10:41,000 --> 00:10:43,720
as an excuse to overbuy intelligence for every step,
318
00:10:43,720 --> 00:10:45,240
because the real architecture decision
319
00:10:45,240 --> 00:10:47,080
is not best model overall.
320
00:10:47,080 --> 00:10:48,600
It is best model for this task
321
00:10:48,600 --> 00:10:51,040
at this point in the flow under these constraints.
322
00:10:51,040 --> 00:10:52,680
That is a very different design mindset.
323
00:10:52,680 --> 00:10:55,520
It forces you to define the work more clearly
324
00:10:55,520 --> 00:10:57,800
and it forces you to separate broad intelligence
325
00:10:57,800 --> 00:10:59,240
from operational fitness.
326
00:10:59,240 --> 00:11:00,560
Once you start doing that,
327
00:11:00,560 --> 00:11:03,080
the monolithic copilot stops looking advanced.
328
00:11:03,080 --> 00:11:04,440
It starts looking lazy,
329
00:11:04,440 --> 00:11:05,600
because in a governed system,
330
00:11:05,600 --> 00:11:08,240
model choice is part of the design, not a default,
331
00:11:08,240 --> 00:11:09,400
and once that clicks,
332
00:11:09,400 --> 00:11:11,680
the architecture itself has to change.
333
00:11:11,680 --> 00:11:14,160
What mixture of experts really means here.
334
00:11:14,160 --> 00:11:15,960
So now we can name the shift properly.
335
00:11:15,960 --> 00:11:17,680
When people hear mixture of experts,
336
00:11:17,680 --> 00:11:19,440
they usually think about one thing.
337
00:11:19,440 --> 00:11:20,800
They think about a model architecture
338
00:11:20,800 --> 00:11:22,360
where a gating mechanism activates
339
00:11:22,360 --> 00:11:24,840
only specific internal experts for each token.
340
00:11:24,840 --> 00:11:25,680
That is real.
341
00:11:25,680 --> 00:11:26,680
The research supports it.
342
00:11:26,680 --> 00:11:29,000
Microsoft's 2026 model direction
343
00:11:29,000 --> 00:11:31,560
includes sparse MoE models like my thinking one
344
00:11:31,560 --> 00:11:34,240
and we have already seen Phi-3.5-MoE surface
345
00:11:34,240 --> 00:11:36,240
through Azure AI Studio and GitHub.
346
00:11:36,240 --> 00:11:37,120
At the model layer,
347
00:11:37,120 --> 00:11:39,040
MOE simply means sparse activation
348
00:11:39,040 --> 00:11:40,280
inside the model itself.
349
00:11:40,280 --> 00:11:42,400
But that is not the main thing most Microsoft teams
350
00:11:42,400 --> 00:11:43,480
will actually build.
351
00:11:43,480 --> 00:11:44,960
That distinction matters
352
00:11:44,960 --> 00:11:47,120
because if you go looking in Copilot Studio
353
00:11:47,120 --> 00:11:50,440
for a big button labeled "Turn on MOE," you won't find it.
354
00:11:50,440 --> 00:11:51,960
And if you expect Foundry to hand you
355
00:11:51,960 --> 00:11:54,080
a magical MOE orchestration wizard,
356
00:11:54,080 --> 00:11:56,560
that is not really how the platform is framed either.
357
00:11:56,560 --> 00:11:57,960
The practical Microsoft story
358
00:11:57,960 --> 00:11:59,280
is much more architectural.
359
00:11:59,280 --> 00:12:01,440
You choose models, you deploy them in Foundry,
360
00:12:01,440 --> 00:12:03,320
and you connect logic with prompt flows,
361
00:12:03,320 --> 00:12:05,520
agent patterns, endpoints and routing decisions.
362
00:12:05,520 --> 00:12:06,600
Then you govern the whole thing.
363
00:12:06,600 --> 00:12:08,440
So what's actually happening is this:
364
00:12:08,440 --> 00:12:10,040
there are two layers of MOE thinking.
365
00:12:10,040 --> 00:12:11,200
One is model-level MOE
366
00:12:11,200 --> 00:12:12,960
that lives inside the model design itself.
367
00:12:12,960 --> 00:12:15,280
Sparse experts, internal routing,
368
00:12:15,280 --> 00:12:18,440
capacity without activating the full parameter set every time.
369
00:12:18,440 --> 00:12:20,000
It is useful and powerful,
370
00:12:20,000 --> 00:12:21,520
but it is mostly abstracted away
371
00:12:21,520 --> 00:12:23,320
from the average enterprise builder.
372
00:12:23,320 --> 00:12:25,160
The other is system-level MOE.
373
00:12:25,160 --> 00:12:27,880
That is the one most organizations can act on right now.
374
00:12:27,880 --> 00:12:30,640
A system-level mixture of experts means your application chooses
375
00:12:30,640 --> 00:12:33,640
among different specialized models, agents, tools and paths
376
00:12:33,640 --> 00:12:35,440
based on the job at hand.
377
00:12:35,440 --> 00:12:37,800
One request might go to a small classifier,
378
00:12:37,800 --> 00:12:39,520
while another goes to retrieval
379
00:12:39,520 --> 00:12:41,880
and a third might trigger a workflow directly
380
00:12:41,880 --> 00:12:44,320
or escalate to a premium reasoning model.
381
00:12:44,320 --> 00:12:46,840
The experts are no longer hidden inside one model.
382
00:12:46,840 --> 00:12:49,040
They are explicit parts of the architecture
383
00:12:49,040 --> 00:12:50,960
and this is the shift that changes design.
384
00:12:50,960 --> 00:12:52,760
Because once experts are explicit,
385
00:12:52,760 --> 00:12:54,680
selection becomes a governance decision
386
00:12:54,680 --> 00:12:56,240
rather than just a model behavior.
387
00:12:56,240 --> 00:12:57,720
You decide what counts as an expert,
388
00:12:57,720 --> 00:13:00,320
you define its purpose and you restrict its data.
389
00:13:00,320 --> 00:13:02,760
You assign its owner and decide what can call it,
390
00:13:02,760 --> 00:13:05,160
what it can call and exactly when it should stop.
391
00:13:05,160 --> 00:13:06,800
That is a very different operating model
392
00:13:06,800 --> 00:13:09,640
from one large assistant improvising across everything.
393
00:13:09,640 --> 00:13:11,760
In the Microsoft stack, Foundry becomes the place
394
00:13:11,760 --> 00:13:14,160
where model optionality and routing logic start to live.
395
00:13:14,160 --> 00:13:17,280
It is where you host models, evaluate them, connect data
396
00:13:17,280 --> 00:13:18,720
and build prompt flow logic
397
00:13:18,720 --> 00:13:21,720
while increasingly managing routed patterns like Model Router.
398
00:13:21,720 --> 00:13:24,480
Copilot Studio sits closer to the user interaction layer.
399
00:13:24,480 --> 00:13:26,720
It manages conversation, workflow surfaces
400
00:13:26,720 --> 00:13:29,480
and channels while handling multi-agent orchestration patterns
401
00:13:29,480 --> 00:13:30,560
at the product level.
402
00:13:30,560 --> 00:13:32,960
These are complementary layers, not competing ones.
403
00:13:32,960 --> 00:13:34,160
That's where people get confused.
404
00:13:34,160 --> 00:13:37,440
They ask whether Copilot Studio or Foundry is the MOE answer.
405
00:13:37,440 --> 00:13:38,640
It's the wrong question.
406
00:13:38,640 --> 00:13:40,440
MOE here is not one product.
407
00:13:40,440 --> 00:13:42,120
It is the pattern formed across them.
408
00:13:42,120 --> 00:13:44,360
Foundry gives you model control and routing options
409
00:13:44,360 --> 00:13:46,600
while Copilot Studio gives you the governed experience,
410
00:13:46,600 --> 00:13:48,680
surface and orchestration entry point.
411
00:13:48,680 --> 00:13:50,480
Agent patterns connect the pieces
412
00:13:50,480 --> 00:13:52,800
and policy and monitoring hold them together.
413
00:13:52,800 --> 00:13:54,800
That combined shape is the practical enterprise form
414
00:13:54,800 --> 00:13:56,840
of mixture of experts in Microsoft land.
415
00:13:56,840 --> 00:13:58,280
You can think of it as an agent fabric,
416
00:13:58,280 --> 00:13:59,520
but keep it concrete.
417
00:13:59,520 --> 00:14:00,800
A user asks for help.
418
00:14:00,800 --> 00:14:02,920
The front door layer captures context and intent.
419
00:14:02,920 --> 00:14:05,320
A routing layer decides what kind of work this is.
420
00:14:05,320 --> 00:14:06,920
A specialist path handles that work
421
00:14:06,920 --> 00:14:09,320
with the right model, tool or agent.
422
00:14:09,320 --> 00:14:11,400
The result comes back in a controlled format.
423
00:14:11,400 --> 00:14:13,640
The system logs the path, enforces policy
424
00:14:13,640 --> 00:14:15,240
and keeps ownership clear.
425
00:14:15,240 --> 00:14:17,160
That is system-level MOE.
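The flow just described (front door, routing layer, specialist path, logged result) can be sketched in a few lines of Python. This is a minimal illustration only: the expert functions, the keyword rule standing in for a small classifier model, and every name below are assumptions made for the example, not part of any Microsoft API.

```python
from typing import Callable, Dict, List

# Hypothetical specialists. Real ones would call a model, a retrieval
# index, or a workflow connector; these return canned strings.
def knowledge_expert(request: str) -> str:
    return f"[knowledge] answer for: {request}"

def workflow_expert(request: str) -> str:
    return f"[workflow] started for: {request}"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "knowledge": knowledge_expert,
    "workflow": workflow_expert,
}

# Illustrative action verbs; an assumption for this sketch.
ACTION_VERBS = ("create", "submit", "approve")

def route(request: str) -> str:
    """Stand-in for a small classifier model: a simple keyword rule."""
    return "workflow" if request.lower().startswith(ACTION_VERBS) else "knowledge"

def handle(request: str, trace: List[dict]) -> str:
    path = route(request)                              # routing layer picks the kind of work
    result = EXPERTS[path](request)                    # specialist path does the work
    trace.append({"request": request, "path": path})   # the chosen path is logged
    return result

trace: List[dict] = []
print(handle("Submit my expense report", trace))
print(handle("What is the travel policy?", trace))
```

The point of the sketch is the shape, not the rule: the experts are explicit entries in a table, and every request leaves a record of which path it took.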
426
00:14:17,160 --> 00:14:18,840
And the reason this matters is simple.
427
00:14:18,840 --> 00:14:21,400
It turns expert selection from a hidden side effect
428
00:14:21,400 --> 00:14:23,000
into a design capability.
429
00:14:23,000 --> 00:14:25,440
You stop hoping one model will stretch far enough
430
00:14:25,440 --> 00:14:26,760
and you start building a fabric
431
00:14:26,760 --> 00:14:29,720
where specialization is normal, observable and governable.
432
00:14:29,720 --> 00:14:32,080
That's the real meaning here, not one magic model,
433
00:14:32,080 --> 00:14:33,760
not one shiny feature.
434
00:14:33,760 --> 00:14:35,440
A structural choice about how intelligence
435
00:14:35,440 --> 00:14:37,120
is distributed across the system.
436
00:14:37,120 --> 00:14:38,640
And once you see MOE that way,
437
00:14:38,640 --> 00:14:39,880
the next question is obvious,
438
00:14:39,880 --> 00:14:41,680
which part belongs in which layer?
439
00:14:41,680 --> 00:14:44,600
Copilot Studio is the face, Foundry is the brain.
440
00:14:44,600 --> 00:14:46,640
This is where a lot of projects start drifting.
441
00:14:46,640 --> 00:14:48,760
Teams get excited about Copilot Studio
442
00:14:48,760 --> 00:14:50,240
because it is close to the user.
443
00:14:50,240 --> 00:14:52,600
You can shape the conversation, connect actions
444
00:14:52,600 --> 00:14:55,280
and publish to channels to get something visible fast.
445
00:14:55,280 --> 00:14:57,880
That speed is useful, but it also creates a trap.
446
00:14:57,880 --> 00:14:59,440
Because once the interface works,
447
00:14:59,440 --> 00:15:02,200
people start putting deeper model logic into the same layer.
448
00:15:02,200 --> 00:15:04,960
And that is usually where the architecture starts getting muddy.
449
00:15:04,960 --> 00:15:06,640
Copilot Studio should be the face.
450
00:15:06,640 --> 00:15:08,320
It should handle the conversation layer,
451
00:15:08,320 --> 00:15:10,040
the workflow layer and the channel layer.
452
00:15:10,040 --> 00:15:13,040
It is where users enter, where prompts become interactions,
453
00:15:13,040 --> 00:15:15,080
and where business context and input collection
454
00:15:15,080 --> 00:15:17,800
happen in a way people can actually use.
455
00:15:17,800 --> 00:15:19,240
In Microsoft's own direction,
456
00:15:19,240 --> 00:15:21,120
Copilot Studio keeps getting stronger
457
00:15:21,120 --> 00:15:24,840
as the user-facing agent surface across Microsoft 365, Teams,
458
00:15:24,840 --> 00:15:26,440
web and business workflows.
459
00:15:26,440 --> 00:15:29,440
But user-facing is not the same as model governance facing.
460
00:15:29,440 --> 00:15:31,680
Foundry sits in a different part of the stack.
461
00:15:31,680 --> 00:15:34,480
It is where model selection, deployment, evaluation,
462
00:15:34,480 --> 00:15:37,280
prompt flow logic and routing capabilities belong.
463
00:15:37,280 --> 00:15:40,080
It is also where you get closer to benchmark-driven decisions,
464
00:15:40,080 --> 00:15:42,480
model catalogs, and the operational controls
465
00:15:42,480 --> 00:15:45,800
that matter when more than one model or agent path is involved.
466
00:15:45,800 --> 00:15:47,920
If Studio is where the interaction happens,
467
00:15:47,920 --> 00:15:51,080
Foundry is where deeper reasoning infrastructure is shaped and tested.
468
00:15:51,080 --> 00:15:52,640
That split is not cosmetic.
469
00:15:52,640 --> 00:15:54,600
It affects how cleanly the system can grow.
470
00:15:54,600 --> 00:15:56,120
What typically happens is this.
471
00:15:56,120 --> 00:15:58,080
A team starts with a Copilot Studio agent,
472
00:15:58,080 --> 00:15:59,760
but then they keep adding more responsibility
473
00:15:59,760 --> 00:16:00,720
into that same surface.
474
00:16:00,720 --> 00:16:03,120
They add classification logic, prompt branching,
475
00:16:03,120 --> 00:16:05,880
and complex model decisions alongside retrieval experiments
476
00:16:05,880 --> 00:16:07,800
and external backend calls.
477
00:16:07,800 --> 00:16:10,880
Soon the experience layer is carrying reasoning responsibilities
478
00:16:10,880 --> 00:16:13,440
it was never meant to own directly at that depth.
479
00:16:13,440 --> 00:16:14,920
The result is not just complexity.
480
00:16:14,920 --> 00:16:17,040
It is hidden complexity in the wrong place.
481
00:16:17,040 --> 00:16:18,640
That is the brain-in-the-face mistake.
482
00:16:18,640 --> 00:16:20,320
You are putting architectural intelligence
483
00:16:20,320 --> 00:16:22,200
where presentation logic should dominate.
484
00:16:22,200 --> 00:16:24,160
So the conversational shell becomes the place
485
00:16:24,160 --> 00:16:26,840
where model strategy lives and then changing models
486
00:16:26,840 --> 00:16:29,040
or evaluating routes becomes much harder.
487
00:16:29,040 --> 00:16:31,920
Reusing expert logic across agents becomes difficult
488
00:16:31,920 --> 00:16:34,240
and governance gets weaker because critical decisions
489
00:16:34,240 --> 00:16:36,280
are buried inside a user flow layer
490
00:16:36,280 --> 00:16:39,880
instead of living in a platform layer designed for model control.
491
00:16:39,880 --> 00:16:42,800
The reverse mistake happens too. Some teams stay entirely in Foundry
492
00:16:42,800 --> 00:16:45,960
because that feels more serious, more technical, and more flexible.
493
00:16:45,960 --> 00:16:48,480
They build model flows, endpoints, and evaluators,
494
00:16:48,480 --> 00:16:50,360
but they underdesign the user layer.
495
00:16:50,360 --> 00:16:52,760
Then the system may be clever, but it is not usable.
496
00:16:52,760 --> 00:16:55,520
It lacks the right front door, the right clarification moments,
497
00:16:55,520 --> 00:16:57,840
and the right workflow framing for actual work.
498
00:16:57,840 --> 00:16:59,640
People don't just need a smart endpoint.
499
00:16:59,640 --> 00:17:00,960
They need a governed experience.
500
00:17:00,960 --> 00:17:02,560
So this is not Studio versus Foundry.
501
00:17:02,560 --> 00:17:05,520
It is Studio for experience, Foundry for reasoning control,
502
00:17:05,520 --> 00:17:07,000
and both connected on purpose.
503
00:17:07,000 --> 00:17:08,920
That means the split needs to be explicit.
504
00:17:08,920 --> 00:17:11,360
Let Copilot Studio own what the user sees
505
00:17:11,360 --> 00:17:13,640
and how the user moves through the interaction.
506
00:17:13,640 --> 00:17:15,280
Let it manage the conversational frame,
507
00:17:15,280 --> 00:17:17,680
the channel behavior, the action handoff points,
508
00:17:17,680 --> 00:17:20,600
and the bounded orchestration patterns that shape the journey.
509
00:17:20,600 --> 00:17:22,800
Let Foundry own what the system decides
510
00:17:22,800 --> 00:17:25,120
and how the underlying intelligence is selected.
511
00:17:25,120 --> 00:17:28,160
Let it host the models, manage prompt flow or agent logic,
512
00:17:28,160 --> 00:17:30,960
support evaluation, and expose routed backends
513
00:17:30,960 --> 00:17:33,200
to keep model choice testable and governable.
514
00:17:33,200 --> 00:17:35,000
Execution sits beside both.
515
00:17:35,000 --> 00:17:36,720
Because once an action needs to happen,
516
00:17:36,720 --> 00:17:39,240
the flow may pass through Power Automate, connectors,
517
00:17:39,240 --> 00:17:41,400
APIs, or external services.
518
00:17:41,400 --> 00:17:44,640
That action layer is not the same as the conversation layer
519
00:17:44,640 --> 00:17:46,600
and it is not the same as the model layer.
520
00:17:46,600 --> 00:17:48,120
If you keep those concerns separate,
521
00:17:48,120 --> 00:17:50,520
the whole system becomes easier to evolve.
522
00:17:50,520 --> 00:17:53,280
If you collapse them, every change touches everything.
523
00:17:53,280 --> 00:17:55,920
And this matters even more once you move into an expert fabric
524
00:17:55,920 --> 00:17:58,320
because specialized systems need clear seams.
525
00:17:58,320 --> 00:18:00,920
The front layer should not need to know every model detail
526
00:18:00,920 --> 00:18:02,480
and the model layer should not be forced
527
00:18:02,480 --> 00:18:04,160
to manage every user nuance.
528
00:18:04,160 --> 00:18:05,120
Good architecture
529
00:18:05,120 --> 00:18:06,920
lets each layer stay focused.
530
00:18:06,920 --> 00:18:10,320
Once the layers are clear, the first expert almost picks itself,
531
00:18:10,320 --> 00:18:11,720
not the answer model, the router.
532
00:18:11,720 --> 00:18:15,120
The first expert should be the router, not the answer model.
533
00:18:15,120 --> 00:18:18,600
If you are moving from one generalist bot to an expert fabric,
534
00:18:18,600 --> 00:18:21,880
the first expert you add should not be the smartest answer engine.
535
00:18:21,880 --> 00:18:24,400
It should be the system that decides what kind of work just arrived.
536
00:18:24,400 --> 00:18:25,560
That sounds less exciting,
537
00:18:25,560 --> 00:18:28,200
but it is where the architecture starts getting disciplined
538
00:18:28,200 --> 00:18:30,720
because the first decision in the flow shapes cost,
539
00:18:30,720 --> 00:18:33,800
latency, control, and quality for everything that follows.
540
00:18:33,800 --> 00:18:37,200
If that first decision is wrong, you pay for it all the way down.
541
00:18:37,200 --> 00:18:39,720
If it is clean, the rest of the system gets simpler.
542
00:18:39,720 --> 00:18:41,400
Most teams still do the opposite.
543
00:18:41,400 --> 00:18:43,480
They start by picking a big answer model,
544
00:18:43,480 --> 00:18:47,360
then they add prompts, retrieval, and actions around it.
545
00:18:47,360 --> 00:18:50,800
The expensive reasoning layer becomes the default entry point for every request,
546
00:18:50,800 --> 00:18:54,280
even when the user is just asking for a simple sort or a workflow trigger.
547
00:18:54,280 --> 00:18:55,240
That is backwards.
548
00:18:55,240 --> 00:18:57,000
The front of the system should decide.
549
00:18:57,000 --> 00:19:01,080
The premium layer should answer only when the system has a reason to spend that capability.
550
00:19:01,080 --> 00:19:03,160
So the router becomes the first real specialist.
551
00:19:03,160 --> 00:19:04,160
Its job is narrow.
552
00:19:04,160 --> 00:19:07,520
Read the request, classify the task, judge the likely path,
553
00:19:07,520 --> 00:19:09,000
return a structured decision.
554
00:19:09,000 --> 00:19:11,200
It might identify the domain, the risk level,
555
00:19:11,200 --> 00:19:13,600
or whether a larger model is even justified.
556
00:19:13,600 --> 00:19:16,440
This is a very different role from open-ended generation,
557
00:19:16,440 --> 00:19:18,560
and that is why it deserves its own design.
558
00:19:18,560 --> 00:19:21,920
In practical Microsoft terms, this can live as a low-cost model pattern
559
00:19:21,920 --> 00:19:24,760
inside Foundry logic or inside a governed orchestration path
560
00:19:24,760 --> 00:19:26,440
that Copilot Studio triggers.
561
00:19:26,440 --> 00:19:29,680
The exact placement can vary, but the function is what matters.
562
00:19:29,680 --> 00:19:32,880
You want a first-pass layer that is cheap, fast, and measurable.
563
00:19:32,880 --> 00:19:35,200
This is why Phi-class models make sense here.
564
00:19:35,200 --> 00:19:37,760
The research supports using small models for routing-heavy work,
565
00:19:37,760 --> 00:19:39,920
especially for classification and structured decisions.
566
00:19:39,920 --> 00:19:43,320
You are not asking the model to compose a nuanced executive answer.
567
00:19:43,320 --> 00:19:46,520
You are asking it to make a limited call about what should happen next.
568
00:19:46,520 --> 00:19:50,160
If it returns a schema like task type, complexity, and confidence,
569
00:19:50,160 --> 00:19:51,800
you can govern that output cleanly.
570
00:19:51,800 --> 00:19:53,440
That structured output matters a lot.
571
00:19:53,440 --> 00:19:54,840
Do not make the router chatty.
572
00:19:54,840 --> 00:19:57,200
Do not ask it to explain itself in three paragraphs
573
00:19:57,200 --> 00:19:59,240
or let it improvise policy language.
574
00:19:59,240 --> 00:20:01,000
Make it return a compact decision object
575
00:20:01,000 --> 00:20:02,800
that the rest of the system can act on.
576
00:20:02,800 --> 00:20:06,240
That keeps the router stable and it gives you traces you can actually inspect later.
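A compact decision object of this kind might look like the following Python sketch. The three fields come from the discussion above (task type, complexity, confidence); the allowed values, the JSON wire format, and the function name `parse_decision` are assumptions chosen for illustration. Anything chatty or off-schema is rejected instead of being passed downstream.

```python
import json
from dataclasses import dataclass

# Illustrative vocabularies; a real deployment would define its own.
TASK_TYPES = {"knowledge", "extraction", "policy", "workflow", "analytics", "coding"}
COMPLEXITY_LEVELS = {"low", "medium", "high"}
FIELDS = {"task_type", "complexity", "confidence"}

@dataclass(frozen=True)
class RouteDecision:
    task_type: str
    complexity: str
    confidence: float

def parse_decision(raw: str) -> RouteDecision:
    """Parse raw router output, rejecting anything that is off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("router output is not JSON") from exc
    if not isinstance(data, dict) or set(data) != FIELDS:
        raise ValueError("router output must contain exactly the schema fields")
    task, level, conf = data["task_type"], data["complexity"], data["confidence"]
    if not isinstance(task, str) or task not in TASK_TYPES:
        raise ValueError("unknown task_type")
    if not isinstance(level, str) or level not in COMPLEXITY_LEVELS:
        raise ValueError("unknown complexity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("confidence must be a number in [0, 1]")
    return RouteDecision(task, level, float(conf))

decision = parse_decision('{"task_type": "workflow", "complexity": "low", "confidence": 0.92}')
print(decision)
```

Because the object is small and strictly validated, each routing decision can be stored as is and inspected later.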
577
00:20:06,240 --> 00:20:07,760
Now, what should the router decide?
578
00:20:07,760 --> 00:20:11,040
At minimum five things are usually enough to change the architecture.
579
00:20:11,040 --> 00:20:12,080
What kind of task is this?
580
00:20:12,080 --> 00:20:13,200
How hard is it likely to be?
581
00:20:13,200 --> 00:20:14,400
How risky is it?
582
00:20:14,400 --> 00:20:15,680
What domain does it belong to?
583
00:20:15,680 --> 00:20:17,600
And what kind of next step does it require?
584
00:20:17,600 --> 00:20:20,880
That next step may be a specialist model, a retrieval-backed agent,
585
00:20:20,880 --> 00:20:21,680
or a workflow.
586
00:20:21,680 --> 00:20:23,200
It might even be no model at all.
587
00:20:23,200 --> 00:20:24,800
That last option is easy to underrate,
588
00:20:24,800 --> 00:20:27,280
but it is one of the best outcomes in the whole design.
589
00:20:27,280 --> 00:20:31,120
If the system determines that a direct rule or a connector should handle the request,
590
00:20:31,120 --> 00:20:34,240
the router has just prevented unnecessary inference altogether.
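The five questions and the "no model at all" outcome can be sketched as a small Python structure. The field names mirror the five questions above; the step labels (`rule`, `connector`, `small_model`, `premium_model`) and the escalation rule are assumptions for illustration, not a prescribed policy.

```python
from dataclasses import dataclass

# Next steps that need no model inference at all (labels are illustrative).
NO_MODEL_STEPS = {"rule", "connector"}

@dataclass(frozen=True)
class Decision:
    task_type: str   # what kind of task is this?
    difficulty: str  # how hard is it likely to be? ("low" or "high")
    risk: str        # how risky is it? ("low" or "high")
    domain: str      # what domain does it belong to?
    next_step: str   # what kind of next step does it require?

def resolve_step(decision: Decision) -> str:
    """Return the step to run, escalating only when there is a reason to."""
    if decision.next_step in NO_MODEL_STEPS:
        return decision.next_step          # no inference is spent on this request
    # Assumed escalation policy: hard or risky model work goes to the premium tier.
    if decision.next_step == "small_model" and "high" in (decision.difficulty, decision.risk):
        return "premium_model"
    return decision.next_step

print(resolve_step(Decision("lookup", "low", "low", "hr", "rule")))
print(resolve_step(Decision("question", "high", "low", "finance", "small_model")))
```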
591
00:20:34,240 --> 00:20:37,680
That is where the economics, performance, and governance story all meet.
592
00:20:37,680 --> 00:20:42,000
Cost improves because routine traffic gets filtered before it touches premium inference.
593
00:20:42,000 --> 00:20:44,560
Latency improves because narrow decisions happen quickly
594
00:20:44,560 --> 00:20:46,640
and the user reaches the right path faster.
595
00:20:46,640 --> 00:20:50,000
Governance improves because every request now passes through a decision point
596
00:20:50,000 --> 00:20:52,480
that can be logged and tuned. And one level deeper,
597
00:20:52,480 --> 00:20:55,040
the router gives you visibility you never get from a monolith.
598
00:20:55,040 --> 00:20:56,400
You can see where traffic goes.
599
00:20:56,400 --> 00:20:58,400
You can see what percentage of requests escalate.
600
00:20:58,400 --> 00:21:01,520
You can see where teams are sending too much work into expensive paths.
601
00:21:01,520 --> 00:21:03,920
Without a router, all of that stays blended together.
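As a rough illustration of that visibility, the Python sketch below computes the share of traffic per path and the escalation rate from a list of logged routes. The log format (one path label per request) and the label `premium_model` are assumptions for the example.

```python
from collections import Counter
from typing import Dict, List

def path_shares(routes: List[str]) -> Dict[str, float]:
    """Fraction of requests that took each path."""
    total = len(routes)
    if total == 0:
        return {}
    return {path: count / total for path, count in Counter(routes).items()}

def escalation_rate(routes: List[str], premium: str = "premium_model") -> float:
    """Fraction of requests that reached the premium path."""
    return path_shares(routes).get(premium, 0.0)

# Hypothetical log of routed paths, one entry per request.
logged = ["rule", "small_model", "rule", "premium_model"]
print(path_shares(logged))
print(escalation_rate(logged))
```

With a monolith there is only one path, so neither number can be computed.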
602
00:21:03,920 --> 00:21:06,800
So if you remember one design move from this section, make it this.
603
00:21:06,800 --> 00:21:09,920
Do not start by asking what model should answer everything.
604
00:21:09,920 --> 00:21:13,280
Start by asking what should decide what this request actually is.
605
00:21:13,280 --> 00:21:15,680
Because once that decision gets its own specialist,
606
00:21:15,680 --> 00:21:19,600
the whole architecture stops behaving like a generalist with too many responsibilities.
607
00:21:19,600 --> 00:21:21,200
It starts behaving like a system.
608
00:21:21,200 --> 00:21:23,680
But that only works when the expert map itself is clean.
609
00:21:23,680 --> 00:21:26,720
How to define your expert domains.
610
00:21:26,720 --> 00:21:29,920
Once the router exists, the next design problem shows up fast.
611
00:21:29,920 --> 00:21:31,120
What exactly are the experts?
612
00:21:31,120 --> 00:21:33,760
A lot of teams answer that question with the org chart.
613
00:21:33,760 --> 00:21:36,480
HR agent, finance agent, IT agent, legal agent.
614
00:21:36,480 --> 00:21:38,720
Sometimes that works, but usually it only works partway.
615
00:21:38,720 --> 00:21:41,680
Departments are ownership boundaries, not always task boundaries.
616
00:21:41,680 --> 00:21:43,840
If you define experts only by department,
617
00:21:43,840 --> 00:21:47,440
you often end up rebuilding the same confusion you were trying to escape.
618
00:21:47,440 --> 00:21:49,440
One domain agent still tries to answer,
619
00:21:49,440 --> 00:21:52,160
retrieve and act across too many different jobs.
620
00:21:52,160 --> 00:21:55,440
So split by task type first, then map to ownership.
621
00:21:55,440 --> 00:21:56,400
That is the shift.
622
00:21:56,400 --> 00:22:00,240
A useful expert domain is narrow enough that you can describe its job in one sentence,
623
00:22:00,240 --> 00:22:02,720
restrict its tools, and know when it should stop.
624
00:22:02,720 --> 00:22:05,120
If you cannot do that, the expert is still too broad.
625
00:22:05,120 --> 00:22:08,080
The point is not to create a smaller version of the generalist.
626
00:22:08,080 --> 00:22:11,520
The point is to create bounded roles that fit distinct work patterns.
627
00:22:11,520 --> 00:22:14,640
In practice, a few expert patterns show up again and again.
628
00:22:14,640 --> 00:22:17,520
A knowledge expert answers questions from approved sources.
629
00:22:17,520 --> 00:22:20,560
An extraction expert turns documents into structured fields.
630
00:22:20,560 --> 00:22:23,440
A policy expert interprets rules inside a limited corpus.
631
00:22:23,440 --> 00:22:26,240
A workflow expert takes approved actions in business systems.
632
00:22:26,240 --> 00:22:29,360
An analytics expert works with data questions and metrics.
633
00:22:29,360 --> 00:22:32,560
A coding expert handles engineering tasks or technical analysis.
634
00:22:32,560 --> 00:22:33,760
Those are task families.
635
00:22:34,800 --> 00:22:39,040
Then inside the organization you can decide whether one finance policy expert exists
636
00:22:39,040 --> 00:22:42,080
or whether finance needs separate policy and workflow experts.
637
00:22:42,080 --> 00:22:45,520
The split depends on risk, data sensitivity and task variation,
638
00:22:45,520 --> 00:22:46,960
not just on which team has the budget.
639
00:22:46,960 --> 00:22:48,640
That is the thing most people miss.
640
00:22:48,640 --> 00:22:50,640
Domain does not only mean subject matter.
641
00:22:50,640 --> 00:22:52,400
Domain also means action pattern.
642
00:22:52,400 --> 00:22:55,760
A legal knowledge expert and a legal workflow expert may both sit in legal,
643
00:22:55,760 --> 00:22:58,240
but they should not automatically be the same agent.
644
00:22:58,240 --> 00:23:00,560
One reads and answers while the other does things.
645
00:23:00,560 --> 00:23:03,440
Those are different risk profiles and different permission needs.
646
00:23:03,440 --> 00:23:06,240
Keep them separate, unless you have a very good reason not to.
647
00:23:06,240 --> 00:23:08,000
Boundaries need to be clear on three fronts.
648
00:23:08,000 --> 00:23:09,440
Purpose, data, tools.
649
00:23:09,440 --> 00:23:11,920
Purpose means what the expert is allowed to try to do.
650
00:23:11,920 --> 00:23:13,760
Data means which sources it can see.
651
00:23:13,760 --> 00:23:16,240
Tools means which actions or systems it can touch.
652
00:23:16,240 --> 00:23:19,360
If one of those stays vague, overlap starts creeping back in.
653
00:23:19,360 --> 00:23:21,680
Then two experts answer the same question differently.
654
00:23:21,680 --> 00:23:24,960
Or one expert starts reaching into a data set it should never have seen.
655
00:23:24,960 --> 00:23:26,480
That overlap is expensive.
656
00:23:26,480 --> 00:23:27,680
It is not just about cost.
657
00:23:27,680 --> 00:23:29,120
It is about trust.
658
00:23:29,120 --> 00:23:31,520
Users stop understanding where the answer came from
659
00:23:31,520 --> 00:23:34,480
and owners stop knowing which team is responsible for improving it.
660
00:23:34,480 --> 00:23:36,240
So write domain charters.
661
00:23:36,240 --> 00:23:38,720
Not long documents, short strict ones.
662
00:23:38,720 --> 00:23:41,600
For each expert define the mission, the allowed tasks,
663
00:23:41,600 --> 00:23:44,240
the blocked tasks, the approved data sources,
664
00:23:44,240 --> 00:23:45,280
and the tools.
665
00:23:45,280 --> 00:23:46,240
Add the owner.
666
00:23:46,240 --> 00:23:47,360
Add the risk level.
667
00:23:47,360 --> 00:23:48,880
Add the expected output format.
668
00:23:48,880 --> 00:23:50,160
This does two things at once.
669
00:23:50,160 --> 00:23:52,880
It sharpens the build and it sharpens the governance.
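A domain charter with the fields listed above can be written as data and not only as a document. The Python sketch below is one possible shape; the class name, the `permits` check, and the example HR values are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class DomainCharter:
    mission: str
    allowed_tasks: FrozenSet[str]
    blocked_tasks: FrozenSet[str]
    data_sources: FrozenSet[str]
    tools: FrozenSet[str]
    owner: str
    risk_level: str
    output_format: str

    def permits(self, task: str, source: str, tool: str) -> bool:
        """True only if task, data source, and tool are all inside the charter."""
        return (
            task in self.allowed_tasks
            and task not in self.blocked_tasks
            and source in self.data_sources
            and tool in self.tools
        )

# Hypothetical charter for an HR knowledge expert.
hr_knowledge = DomainCharter(
    mission="Answer HR policy questions from approved sources.",
    allowed_tasks=frozenset({"answer_question"}),
    blocked_tasks=frozenset({"update_record"}),
    data_sources=frozenset({"hr_policy_index"}),
    tools=frozenset({"search"}),
    owner="hr-platform-team",
    risk_level="low",
    output_format="answer+citation",
)

print(hr_knowledge.permits("answer_question", "hr_policy_index", "search"))
print(hr_knowledge.permits("answer_question", "finance_index", "search"))
```

Keeping the charter as a frozen object means the same definition can drive both the build (what the expert is wired to) and the governance check (what it is allowed to touch).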
670
00:23:52,880 --> 00:23:55,680
Segmentation matters even more when knowledge is involved.
671
00:23:55,680 --> 00:23:58,400
Do not point every expert at one giant shared index
672
00:23:58,400 --> 00:24:01,280
unless you are fully comfortable with the visibility consequences.
673
00:24:01,280 --> 00:24:04,560
In many environments that recreates the same oversharing risk
674
00:24:04,560 --> 00:24:06,880
that generalist bots already create.
675
00:24:06,880 --> 00:24:09,920
Experts should see the knowledge they need and no more.
676
00:24:09,920 --> 00:24:12,800
Role and sensitivity boundaries have to shape the index strategy,
677
00:24:12,800 --> 00:24:14,160
not just the prompt strategy.
678
00:24:14,160 --> 00:24:15,600
There is also a maturity point here.
679
00:24:15,600 --> 00:24:17,520
Do not create 10 experts on day one
680
00:24:17,520 --> 00:24:19,760
because the architecture diagram looks impressive.
681
00:24:19,760 --> 00:24:22,800
Start where task differences are real and measurable.
682
00:24:22,800 --> 00:24:25,440
One router, one knowledge expert, one workflow expert,
683
00:24:25,440 --> 00:24:27,840
maybe one policy expert if that boundary is sharp.
684
00:24:27,840 --> 00:24:30,880
Then expand only when a new expert solves a real problem
685
00:24:30,880 --> 00:24:32,480
the current ones cannot handle.
686
00:24:32,480 --> 00:24:34,800
Good expert maps feel slightly stricter at first.
687
00:24:34,800 --> 00:24:38,000
That is a good sign because narrow missions create cleaner routing,
688
00:24:38,000 --> 00:24:40,720
clearer ownership and better evaluation later.
689
00:24:40,720 --> 00:24:42,400
Loose missions create polite chaos.
690
00:24:42,400 --> 00:24:44,480
And once the expert map is defined well enough
691
00:24:44,480 --> 00:24:46,480
to survive contact with real traffic
692
00:24:46,480 --> 00:24:49,280
you can finally design the flow through the whole system.
693
00:24:49,280 --> 00:24:51,840
The base architecture for a governed expert fabric.
694
00:24:51,840 --> 00:24:54,880
Once the expert map is clear, the next job is to design the path
695
00:24:54,880 --> 00:24:58,160
through the system so those experts operate like a governed fabric
696
00:24:58,160 --> 00:25:00,480
instead of a loose collection of clever parts.
697
00:25:00,480 --> 00:25:02,480
This is where architecture stops being abstract.
698
00:25:02,480 --> 00:25:04,960
A user sends a request, the system captures context,
699
00:25:04,960 --> 00:25:07,600
a route gets chosen, work happens in a bounded place,
700
00:25:07,600 --> 00:25:09,360
results come back in a controlled form.
701
00:25:09,360 --> 00:25:11,600
Every handoff is visible. Start at the front door.
702
00:25:11,600 --> 00:25:15,040
In the Microsoft stack, that front door is usually Copilot Studio
703
00:25:15,040 --> 00:25:18,160
or a Microsoft 365 Copilot extension point.
704
00:25:18,160 --> 00:25:19,920
That is where the user interaction begins
705
00:25:19,920 --> 00:25:23,760
and that matters because the first layer should not only capture the words in the prompt.
706
00:25:23,760 --> 00:25:26,480
It should capture the operating context around the prompt.
707
00:25:26,480 --> 00:25:27,520
Who is the user?
708
00:25:27,520 --> 00:25:28,560
What channel are they in?
709
00:25:28,560 --> 00:25:29,680
What app are they using?
710
00:25:29,680 --> 00:25:30,800
What role do they hold?
711
00:25:30,800 --> 00:25:32,800
What environment are they working inside?
712
00:25:32,800 --> 00:25:36,480
Are they asking for information, analysis or action?
713
00:25:36,480 --> 00:25:39,280
Those signals shape what the system is allowed to do next.
714
00:25:39,280 --> 00:25:41,280
So the first layer is really an intake layer.
715
00:25:41,280 --> 00:25:42,880
It does not need to be dramatic.
716
00:25:42,880 --> 00:25:44,480
It needs to be disciplined.
717
00:25:44,480 --> 00:25:46,480
Intake should gather four things well.
718
00:25:46,480 --> 00:25:49,360
Intent, context, identity and environment.
719
00:25:49,360 --> 00:25:51,520
Intent tells you what the user seems to want.
720
00:25:51,520 --> 00:25:54,480
Context tells you what surrounding material may matter.
721
00:25:54,480 --> 00:25:56,960
Identity tells you what access model applies.
722
00:25:56,960 --> 00:25:59,680
Environment tells you which policy zone you are inside.
723
00:25:59,680 --> 00:26:02,880
Without those four, downstream routing turns into guesswork.
724
00:26:02,880 --> 00:26:05,360
After intake, the router decides the path.
725
00:26:05,360 --> 00:26:07,120
This is the structural pivot in the flow.
726
00:26:07,120 --> 00:26:08,960
The router does not answer the user directly
727
00:26:08,960 --> 00:26:12,480
unless the architecture explicitly allows that for narrow cases.
728
00:26:12,480 --> 00:26:14,080
Its main job is path selection.
729
00:26:14,080 --> 00:26:16,720
Maybe the request is simple enough for a direct answer.
730
00:26:16,720 --> 00:26:19,200
Maybe it needs a retrieval-backed knowledge path.
731
00:26:19,200 --> 00:26:20,640
Maybe it needs a workflow path.
732
00:26:20,640 --> 00:26:22,400
Maybe it belongs with a specialist agent.
733
00:26:22,400 --> 00:26:26,400
The point is that the system chooses the type of work before it chooses the answer.
734
00:26:26,400 --> 00:26:28,560
And one level deeper, different paths
735
00:26:28,560 --> 00:26:30,480
should produce different kinds of outputs.
736
00:26:30,480 --> 00:26:33,200
A specialist should not return a long conversational essay
737
00:26:33,200 --> 00:26:35,280
unless that is truly the point of the specialist.
738
00:26:35,280 --> 00:26:38,800
In most enterprise flows, the better pattern is structured output.
739
00:26:38,800 --> 00:26:42,800
The knowledge expert returns answer, source, confidence and citation.
740
00:26:42,800 --> 00:26:45,520
The extraction expert returns fields and validation flags.
741
00:26:45,520 --> 00:26:49,840
The workflow expert returns status, required approvals and execution result.
742
00:26:49,840 --> 00:26:53,200
The analytics expert returns metrics, assumptions and scope.
743
00:26:53,200 --> 00:26:56,720
Structure makes orchestration cleaner and it makes audit much easier later.
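As a rough illustration of those output shapes, here is a minimal Python sketch. The class names, field types, the 0.0 to 1.0 confidence scale and the 0.7 review threshold are all assumptions made for the example; the episode only names the fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KnowledgeResult:
    # Fields mirror the spoken list: answer, source, confidence, citation.
    answer: str
    source: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)
    citation: str


@dataclass
class ExtractionResult:
    fields: Dict[str, str]
    validation_flags: List[str] = field(default_factory=list)


@dataclass
class WorkflowResult:
    status: str
    required_approvals: List[str]
    execution_result: Optional[str] = None


def needs_human_review(result: KnowledgeResult, threshold: float = 0.7) -> bool:
    """Branch on a typed field instead of parsing conversational prose."""
    return result.confidence < threshold


# Hypothetical example data.
kr = KnowledgeResult(
    answer="Travel above the limit needs manager approval.",
    source="policy-library",
    confidence=0.55,
    citation="Travel Policy, section 4",
)
```

Because the confidence is a field rather than a phrase buried in an essay, the orchestrator can make its decision with one comparison, and the same record can be stored for audit.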
744
00:26:56,720 --> 00:26:59,440
That means you need an orchestrator after the expert step.
745
00:26:59,440 --> 00:27:02,880
The orchestrator is responsible for assembling the final user response
746
00:27:02,880 --> 00:27:06,480
and deciding whether any action should be presented, delayed or blocked.
747
00:27:06,480 --> 00:27:10,320
It takes the specialist output, combines it with policy and presentation rules
748
00:27:10,320 --> 00:27:13,040
and turns it into one coherent user-facing response.
749
00:27:13,040 --> 00:27:15,280
This is why one user-facing voice still matters.
750
00:27:15,280 --> 00:27:18,160
Even in a multi-expert system, the user should not feel like
751
00:27:18,160 --> 00:27:20,480
five different systems are arguing in public.
752
00:27:20,480 --> 00:27:22,800
The system can be distributed behind the scenes
753
00:27:22,800 --> 00:27:24,880
while remaining consistent at the surface.
754
00:27:24,880 --> 00:27:26,880
Now add the controls that wrap the whole thing.
755
00:27:26,880 --> 00:27:28,560
Every handoff should generate logs.
756
00:27:28,560 --> 00:27:30,400
Every expert decision should be traceable.
757
00:27:30,400 --> 00:27:32,800
Every action request should carry policy context.
758
00:27:32,800 --> 00:27:34,880
Every connector call should be attributable.
759
00:27:34,880 --> 00:27:38,640
Prompt, route, tool use, output and outcome ought to be captured
760
00:27:38,640 --> 00:27:43,040
in a way that security, compliance and platform teams can actually inspect later.
761
00:27:43,040 --> 00:27:45,440
If the path is invisible, the fabric is not governed.
762
00:27:45,440 --> 00:27:46,400
It is just complex.
763
00:27:46,400 --> 00:27:48,240
This is also where state discipline matters.
764
00:27:48,240 --> 00:27:52,560
Do not let critical control information live only inside transient model context.
765
00:27:52,560 --> 00:27:54,160
Externalize state where possible.
766
00:27:54,160 --> 00:27:57,520
Keep route records, approvals, intermediate outputs,
767
00:27:57,520 --> 00:28:01,920
and execution history in managed systems where they can be audited and controlled.
768
00:28:01,920 --> 00:28:05,360
Hidden memory creates hidden risk; managed state creates operational clarity.
769
00:28:05,360 --> 00:28:07,120
So the base architecture looks like this.
770
00:28:07,120 --> 00:28:10,880
User enters through Copilot Studio or Microsoft 365 Copilot.
771
00:28:10,880 --> 00:28:13,760
Intake captures identity, context, intent and environment.
772
00:28:13,760 --> 00:28:15,120
Router selects the path.
773
00:28:15,120 --> 00:28:18,320
Specialist performs bounded work and returns structured output.
774
00:28:18,320 --> 00:28:20,720
Orchestrator assembles the response and action.
775
00:28:20,720 --> 00:28:22,800
Logging, policy and trace wrap every step.
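The flow just listed can be sketched in a few lines of Python. This is only a toy under stated assumptions: the experts are stub functions, the router is a dictionary lookup rather than a model, and every name here (Intake, Trace, handle and so on) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Intake:
    intent: str
    context: str
    identity: str
    environment: str


@dataclass
class Trace:
    events: List[str] = field(default_factory=list)

    def log(self, step: str, detail: str) -> None:
        self.events.append(f"{step}: {detail}")


def knowledge_expert(req: Intake) -> dict:
    # Stub specialist: returns structured output, not a chat reply.
    return {"answer": f"Looked up: {req.context}", "confidence": 0.9}


def workflow_expert(req: Intake) -> dict:
    return {"status": "pending", "required_approvals": ["manager"]}


EXPERTS: Dict[str, Callable[[Intake], dict]] = {
    "question": knowledge_expert,
    "action": workflow_expert,
}


def route(req: Intake, trace: Trace) -> str:
    # Path selection only; unknown intents fall back to the knowledge path.
    path = req.intent if req.intent in EXPERTS else "question"
    trace.log("router", path)
    return path


def handle(req: Intake) -> Tuple[str, Trace]:
    trace = Trace()
    trace.log("intake", f"{req.identity}@{req.environment}")
    output = EXPERTS[route(req, trace)](req)
    trace.log("specialist", str(sorted(output)))
    # Orchestrator: one user-facing voice, actions held when approvals are open.
    if output.get("required_approvals"):
        response = "This action is waiting for approval."
    else:
        response = output.get("answer", "No answer available.")
    trace.log("orchestrator", response)
    return response, trace


response, trace = handle(Intake("action", "expense report", "alice", "prod"))
```

Each step writes to the trace before handing off, so the path can be reconstructed afterwards even though the user only ever sees the orchestrator's single response.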
776
00:28:22,800 --> 00:28:24,000
That is the shape.
777
00:28:24,000 --> 00:28:28,240
Not because diagrams are nice, but because governed systems need deliberate seams.
778
00:28:28,240 --> 00:28:31,840
And once you have that shape, the next design issue shows up immediately.
779
00:28:31,840 --> 00:28:34,080
Retrieval has to live inside the fabric properly,
780
00:28:34,080 --> 00:28:36,640
not float beside it as a separate idea.
781
00:28:36,640 --> 00:28:37,760
RAG is not the expert.
782
00:28:37,760 --> 00:28:40,080
It is one capability inside the fabric.
783
00:28:40,080 --> 00:28:43,600
This is a place where a lot of enterprise AI discussions get fuzzy fast.
784
00:28:43,600 --> 00:28:45,520
People say they are building an expert system,
785
00:28:45,520 --> 00:28:47,680
but what they really mean is they added retrieval.
786
00:28:47,680 --> 00:28:49,600
They indexed some content, connected search,
787
00:28:49,600 --> 00:28:52,000
and now the assistant can pull documents into context.
788
00:28:52,000 --> 00:28:55,280
That can be useful, but retrieval is not the same thing as expertise.
789
00:28:55,280 --> 00:28:59,360
And when those two ideas get merged, the architecture starts drifting again.
790
00:28:59,360 --> 00:29:00,880
RAG solves a different problem.
791
00:29:00,880 --> 00:29:02,960
It helps the system reach external knowledge
792
00:29:02,960 --> 00:29:04,560
that is not already inside the model.
793
00:29:04,560 --> 00:29:06,400
It improves grounding, it helps with freshness,
794
00:29:06,400 --> 00:29:07,840
it helps with citations.
795
00:29:07,840 --> 00:29:12,080
It helps the system pull relevant context from documents, records and indexed sources.
796
00:29:12,080 --> 00:29:13,600
That is a knowledge access pattern.
797
00:29:13,600 --> 00:29:14,960
An expert is something else.
798
00:29:14,960 --> 00:29:18,560
An expert has a bounded mission, a decision role, a stop boundary,
799
00:29:18,560 --> 00:29:20,000
and a clear output shape.
800
00:29:20,000 --> 00:29:21,600
It may use retrieval, it may not,
801
00:29:21,600 --> 00:29:24,400
but retrieval by itself does not create specialization.
802
00:29:24,400 --> 00:29:26,000
It only creates access.
803
00:29:26,000 --> 00:29:30,160
That distinction matters because a lot of teams build one giant retrieval chain,
804
00:29:30,160 --> 00:29:31,600
point every agent at it,
805
00:29:31,600 --> 00:29:33,600
and then call the result an expert architecture.
806
00:29:33,600 --> 00:29:34,160
It isn't.
807
00:29:34,160 --> 00:29:36,880
It is still a broad assistant with better document access.
808
00:29:36,880 --> 00:29:38,160
So the better pattern is this.
809
00:29:38,160 --> 00:29:41,200
Treat retrieval as a tool that an expert can use.
810
00:29:41,200 --> 00:29:43,040
A knowledge expert may call retrieval,
811
00:29:43,040 --> 00:29:46,720
a policy expert may call retrieval against approved policy sources.
812
00:29:46,720 --> 00:29:51,200
An analytics expert may call retrieval for report definitions or metric descriptions,
813
00:29:51,200 --> 00:29:53,280
but the expert still owns the task boundary.
814
00:29:53,280 --> 00:29:55,600
The retrieval layer stays subordinate to the role,
815
00:29:55,600 --> 00:29:56,800
not the other way around.
816
00:29:56,800 --> 00:29:58,320
That keeps the design clean.
817
00:29:58,320 --> 00:30:00,160
It also keeps the governance cleaner,
818
00:30:00,160 --> 00:30:02,800
because once retrieval is attached to a bounded agent,
819
00:30:02,800 --> 00:30:05,520
you can control which corpus that agent can search,
820
00:30:05,520 --> 00:30:06,800
which filters apply,
821
00:30:06,800 --> 00:30:08,560
what sensitivity rules matter,
822
00:30:08,560 --> 00:30:10,160
and what kinds of outputs are allowed.
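One way to picture that scoping is the Python sketch below. It assumes a simple three-level sensitivity ordering and uses plain substring matching as a stand-in for real vector, graph or hybrid search. All names (RetrievalScope, scoped_search, the corpus labels) are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Assumed ordering, lowest to highest sensitivity.
SENSITIVITY_ORDER = ["public", "internal", "confidential"]


@dataclass(frozen=True)
class Document:
    corpus: str
    sensitivity: str
    text: str


@dataclass(frozen=True)
class RetrievalScope:
    allowed_corpora: FrozenSet[str]
    max_sensitivity: str


def scoped_search(query: str, index: List[Document], scope: RetrievalScope) -> List[Document]:
    """Apply the expert's scope first, then match the query."""
    ceiling = SENSITIVITY_ORDER.index(scope.max_sensitivity)
    return [
        doc
        for doc in index
        if doc.corpus in scope.allowed_corpora
        and SENSITIVITY_ORDER.index(doc.sensitivity) <= ceiling
        and query.lower() in doc.text.lower()  # stand-in for a real retriever
    ]


INDEX = [
    Document("hr-policies", "internal", "Leave policy: 25 days per year."),
    Document("finance", "confidential", "Leave accrual liability forecast."),
    Document("hr-policies", "confidential", "Leave policy exceptions for executives."),
]

# The policy expert may search one corpus, up to "internal" only.
POLICY_EXPERT_SCOPE = RetrievalScope(frozenset({"hr-policies"}), "internal")
hits = scoped_search("leave policy", INDEX, POLICY_EXPERT_SCOPE)
```

The scope belongs to the expert, not to the index, so two experts can share the same underlying store without sharing the same visibility surface.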
823
00:30:10,160 --> 00:30:13,920
If retrieval stays centralized and shared without clear scoping,
824
00:30:13,920 --> 00:30:16,560
the system starts recreating the same oversharing problem
825
00:30:16,560 --> 00:30:18,560
that generalist bots already trigger.
826
00:30:18,560 --> 00:30:20,160
One giant index feels efficient.
827
00:30:20,160 --> 00:30:23,440
In practice, it often means one giant visibility surface.
828
00:30:23,440 --> 00:30:26,880
And this is why modular RAG fits better inside an expert fabric
829
00:30:26,880 --> 00:30:29,360
than one monolithic retrieval pipeline.
830
00:30:29,360 --> 00:30:32,080
Modular RAG means retrieval is built as a capability
831
00:30:32,080 --> 00:30:35,200
that can be composed differently depending on the expert and the task.
832
00:30:35,200 --> 00:30:38,720
One expert might use vector search over curated knowledge articles.
833
00:30:38,720 --> 00:30:40,800
Another might use graph-based retrieval
834
00:30:40,800 --> 00:30:43,040
because relationships across entities matter.
835
00:30:43,040 --> 00:30:44,640
Another might use hybrid retrieval
836
00:30:44,640 --> 00:30:47,200
because exact policy wording and semantic similarity
837
00:30:47,200 --> 00:30:48,800
both matter at the same time.
838
00:30:48,800 --> 00:30:51,440
The retrieval strategy should follow the task pattern,
839
00:30:51,440 --> 00:30:53,200
not the branding of the AI program.
840
00:30:53,200 --> 00:30:55,280
That is a design choice, not a default.
841
00:30:55,280 --> 00:30:57,440
If the question depends on entity relationships,
842
00:30:57,440 --> 00:30:59,360
graph retrieval may be the better fit.
843
00:30:59,360 --> 00:31:01,360
If the corpus is large and language is varied,
844
00:31:01,360 --> 00:31:04,160
vector retrieval can help with semantic access.
845
00:31:04,160 --> 00:31:05,680
If exact phrasing matters,
846
00:31:05,680 --> 00:31:08,160
especially in policy or compliance scenarios,
847
00:31:08,160 --> 00:31:10,080
hybrid retrieval often makes more sense
848
00:31:10,080 --> 00:31:12,720
because it blends semantic and keyword-style access.
849
00:31:12,720 --> 00:31:16,080
The point is not to turn every agent into a retrieval science project.
850
00:31:16,080 --> 00:31:18,720
The point is to stop pretending one shared retrieval method
851
00:31:18,720 --> 00:31:20,720
is automatically right for every job.
852
00:31:20,720 --> 00:31:23,120
A second reason this matters is traceability.
853
00:31:23,120 --> 00:31:25,280
When retrieval lives inside a bounded expert,
854
00:31:25,280 --> 00:31:26,960
you can inspect which source was searched,
855
00:31:26,960 --> 00:31:28,160
which documents were surfaced,
856
00:31:28,160 --> 00:31:29,360
what filters applied,
857
00:31:29,360 --> 00:31:31,600
and how that context influenced the answer.
858
00:31:31,600 --> 00:31:32,880
That makes review easier.
859
00:31:32,880 --> 00:31:34,400
It also makes tuning easier.
860
00:31:34,400 --> 00:31:36,160
If the knowledge expert performs badly,
861
00:31:36,160 --> 00:31:37,840
you can fix the knowledge expert.
862
00:31:37,840 --> 00:31:40,000
If the policy expert retrieves the wrong material,
863
00:31:40,000 --> 00:31:42,160
you can tune that path specifically.
864
00:31:42,160 --> 00:31:44,000
In a giant shared retrieval chain,
865
00:31:44,000 --> 00:31:46,160
poor behavior spreads across many use cases
866
00:31:46,160 --> 00:31:47,920
and becomes much harder to isolate.
867
00:31:47,920 --> 00:31:49,280
So remember the hierarchy.
868
00:31:49,280 --> 00:31:50,560
The expert owns the task.
869
00:31:50,560 --> 00:31:52,320
Retrieval supports the expert,
870
00:31:52,320 --> 00:31:53,920
not the reverse.
871
00:31:53,920 --> 00:31:54,960
Once teams get that right,
872
00:31:54,960 --> 00:31:56,960
the architecture becomes much easier to govern
873
00:31:56,960 --> 00:32:00,640
because data access is no longer treated like a universal pool.
874
00:32:00,640 --> 00:32:02,560
It becomes part of a bounded role.
875
00:32:02,560 --> 00:32:05,200
And the moment data starts getting segmented that way,
876
00:32:05,200 --> 00:32:07,920
governance stops being something you add later.
877
00:32:07,920 --> 00:32:09,840
It becomes part of the structure from the beginning.
878
00:32:10,880 --> 00:32:13,520
Governance changes completely in multi-agent systems.
879
00:32:13,520 --> 00:32:15,360
When you move from one agent to many,
880
00:32:15,360 --> 00:32:16,960
governance stops being a side layer
881
00:32:16,960 --> 00:32:18,800
and becomes part of the runtime design.
882
00:32:18,800 --> 00:32:19,760
That is the shift.
883
00:32:19,760 --> 00:32:21,600
A lot of teams still think in old terms.
884
00:32:21,600 --> 00:32:23,760
They focus on prompt safety, output filtering,
885
00:32:23,760 --> 00:32:25,200
and maybe a few blocked words.
886
00:32:25,200 --> 00:32:26,560
Those controls still matter,
887
00:32:26,560 --> 00:32:28,000
but they belong to a simpler world
888
00:32:28,000 --> 00:32:30,640
where one model answered one question and stopped there.
889
00:32:30,640 --> 00:32:32,880
Multi-agent systems do not work like that.
890
00:32:32,880 --> 00:32:35,280
They read, they route, they call tools,
891
00:32:35,280 --> 00:32:36,400
they hand work off,
892
00:32:36,400 --> 00:32:37,920
they pull data from different places,
893
00:32:37,920 --> 00:32:39,040
they might trigger workflows
894
00:32:39,040 --> 00:32:40,800
or involve different identities and permissions
895
00:32:40,800 --> 00:32:42,160
in the same user journey.
896
00:32:42,160 --> 00:32:44,800
Because of this, the risk surface is no longer just the answer.
897
00:32:44,800 --> 00:32:45,920
It is the whole chain.
898
00:32:45,920 --> 00:32:47,200
That changes everything.
899
00:32:47,200 --> 00:32:48,960
The failure point might be the route,
900
00:32:48,960 --> 00:32:49,920
not the response.
901
00:32:49,920 --> 00:32:52,320
It might be the wrong specialist receiving the task,
902
00:32:52,320 --> 00:32:54,400
or perhaps a connector with too much access.
903
00:32:54,400 --> 00:32:56,240
It might be a handoff that loses context
904
00:32:56,240 --> 00:32:57,600
while keeping authority
905
00:32:57,600 --> 00:32:59,920
or an ownerless flow still running in production.
906
00:32:59,920 --> 00:33:01,920
It might be a model switch that nobody reviewed
907
00:33:01,920 --> 00:33:04,080
or a transcript gap that leaves security teams
908
00:33:04,080 --> 00:33:06,160
unable to reconstruct what happened.
909
00:33:06,160 --> 00:33:07,280
In a multi-agent setup,
910
00:33:07,280 --> 00:33:09,440
weak control usually shows up between components
911
00:33:09,440 --> 00:33:11,040
instead of only inside one.
912
00:33:11,040 --> 00:33:13,280
So the first governance question becomes inventory.
913
00:33:13,280 --> 00:33:14,400
What agents exist?
914
00:33:14,400 --> 00:33:15,440
What models do they use?
915
00:33:15,440 --> 00:33:16,640
What flows do they trigger?
916
00:33:16,640 --> 00:33:17,840
What connectors do they hold?
917
00:33:17,840 --> 00:33:19,200
What data can they reach?
918
00:33:19,200 --> 00:33:20,000
Who owns them?
919
00:33:20,000 --> 00:33:21,120
What is their purpose?
920
00:33:21,120 --> 00:33:22,960
When do they expire?
921
00:33:22,960 --> 00:33:24,880
If you cannot answer those basics quickly,
922
00:33:24,880 --> 00:33:26,800
you do not have a governed agent system.
923
00:33:26,800 --> 00:33:28,640
You have agent sprawl with a nice interface.
924
00:33:28,640 --> 00:33:30,720
The research is very consistent on this point.
925
00:33:30,720 --> 00:33:32,320
Inventory is not admin overhead.
926
00:33:32,320 --> 00:33:33,440
It is the foundation.
927
00:33:33,440 --> 00:33:34,160
Without it,
928
00:33:34,160 --> 00:33:35,760
you cannot do lifecycle control,
929
00:33:35,760 --> 00:33:37,520
permission review, incident response
930
00:33:37,520 --> 00:33:39,120
or meaningful risk assessment.
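The inventory questions map naturally onto a record per agent. The Python sketch below is one hypothetical shape for such a record, with a check that flags agents that are ownerless or past their expiry date. It is not the schema of any Microsoft registry; the field names and the sample agents are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AgentRecord:
    name: str
    owner: str            # a named person, not "the AI team"
    purpose: str
    model: str
    connectors: List[str]
    expires: date


def needs_attention(records: List[AgentRecord], today: date) -> List[str]:
    """Return names of agents with no owner or with a passed expiry date."""
    return [r.name for r in records if not r.owner or r.expires < today]


# Hypothetical inventory entries.
inventory = [
    AgentRecord("front-door", "dana", "intake and routing", "small-model", [], date(2030, 1, 1)),
    AgentRecord("policy-expert", "", "policy Q&A", "large-model", ["sharepoint"], date(2030, 1, 1)),
    AgentRecord("pilot-bot", "lee", "2024 pilot", "small-model", ["mail"], date(2024, 6, 30)),
]

flagged = needs_attention(inventory, today=date(2025, 1, 1))
```

If a list like this cannot be produced quickly, lifecycle control and permission review have nothing to work from.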
931
00:33:39,120 --> 00:33:40,800
Ownership has to be explicit too.
932
00:33:40,800 --> 00:33:42,240
Not the AI team in general,
933
00:33:42,240 --> 00:33:43,760
not IT in broad terms.
934
00:33:43,760 --> 00:33:45,440
A named owner for the front door agent,
935
00:33:45,440 --> 00:33:47,120
a named owner for the policy expert,
936
00:33:47,120 --> 00:33:48,880
a named owner for the workflow path,
937
00:33:48,880 --> 00:33:50,400
a named owner for the data source.
938
00:33:50,400 --> 00:33:52,720
The moment a routed system crosses departments,
939
00:33:52,720 --> 00:33:54,480
accountability gets blurry fast
940
00:33:54,480 --> 00:33:56,640
unless you force clarity into the design.
941
00:33:56,640 --> 00:33:59,120
Blurry accountability is how risky automations
942
00:33:59,120 --> 00:34:00,640
survive longer than they should.
943
00:34:00,640 --> 00:34:02,480
Permissions become even more important here.
944
00:34:02,480 --> 00:34:03,520
In a single-agent world,
945
00:34:03,520 --> 00:34:05,680
people often debate model quality first.
946
00:34:05,680 --> 00:34:06,960
In a multi-agent world,
947
00:34:06,960 --> 00:34:09,200
least privilege is usually the sharper control.
948
00:34:09,200 --> 00:34:11,040
An average model with tight permissions
949
00:34:11,040 --> 00:34:13,040
is easier to contain than an excellent model
950
00:34:13,040 --> 00:34:15,360
with broad access and vague tool rights.
951
00:34:15,360 --> 00:34:16,800
The biggest harm often comes from
952
00:34:16,800 --> 00:34:18,400
what an agent can reach or trigger,
953
00:34:18,400 --> 00:34:20,480
not from whether its wording is elegant.
954
00:34:20,480 --> 00:34:22,800
If one specialist only needs one site,
955
00:34:22,800 --> 00:34:24,320
one table and one action,
956
00:34:24,320 --> 00:34:25,440
give it exactly that.
957
00:34:25,440 --> 00:34:26,480
Nothing wider.
958
00:34:26,480 --> 00:34:28,000
Logging has to expand as well.
959
00:34:28,000 --> 00:34:29,520
Final answers are not enough anymore.
960
00:34:29,520 --> 00:34:30,640
You need prompts, routes,
961
00:34:30,640 --> 00:34:33,440
tool calls, handoffs, actions, outcomes and timestamps.
962
00:34:33,440 --> 00:34:35,600
You need enough trace to answer four questions later.
963
00:34:35,600 --> 00:34:37,200
What happened? Why did it happen?
964
00:34:37,200 --> 00:34:38,640
Which component decided it?
965
00:34:38,640 --> 00:34:40,080
And what data did it touch?
966
00:34:40,080 --> 00:34:42,240
If your logs only capture the final user message
967
00:34:42,240 --> 00:34:43,520
and final user answer,
968
00:34:43,520 --> 00:34:45,840
you are blind to the actual behavior of the system.
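A trace event that can answer those four questions might look like the Python sketch below. The field names and the JSON-lines output are assumptions for illustration, not a format taken from any Microsoft logging service.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class TraceEvent:
    component: str            # which component decided it
    action: str               # what happened
    reason: str               # why it happened
    data_touched: List[str]   # what data it touched
    timestamp: str = field(default_factory=_utc_now)


def to_log_line(event: TraceEvent) -> str:
    """Serialize one event as a single JSON line for an external log store."""
    return json.dumps(asdict(event), sort_keys=True)


# Hypothetical routing decision.
event = TraceEvent(
    component="router",
    action="routed to policy-expert",
    reason="intent=policy_question",
    data_touched=[],
)
line = to_log_line(event)
```

One event per handoff, tool call and action gives security and compliance teams something to reconstruct, rather than just the first message and the last answer.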
969
00:34:45,840 --> 00:34:48,000
This is also why deployment governance cannot wait
970
00:34:48,000 --> 00:34:49,120
until after launch.
971
00:34:49,120 --> 00:34:50,960
You need review gates before production.
972
00:34:50,960 --> 00:34:52,800
You need environment separation.
973
00:34:52,800 --> 00:34:55,680
You need different permissions in dev, test and prod.
974
00:34:55,680 --> 00:34:57,440
You need policies around model approval,
975
00:34:57,440 --> 00:34:59,440
connector approval and action approval
976
00:34:59,440 --> 00:35:01,120
before agents start multiplying.
977
00:35:01,120 --> 00:35:02,400
Once they spread through the tenant,
978
00:35:02,400 --> 00:35:04,400
cleanup gets harder, not easier.
979
00:35:04,400 --> 00:35:06,560
Multi-agent systems reward discipline early
980
00:35:06,560 --> 00:35:08,240
and punish improvisation late.
981
00:35:08,240 --> 00:35:10,800
And this is where the Microsoft conversation changes too.
982
00:35:10,800 --> 00:35:13,040
Because once governance has to span identity,
983
00:35:13,040 --> 00:35:16,320
inventory, policy, lifecycle and oversight across many agents,
984
00:35:16,320 --> 00:35:17,680
you need more than good intentions
985
00:35:17,680 --> 00:35:19,200
and scattered admin settings.
986
00:35:19,200 --> 00:35:20,320
You need a control plane.
987
00:35:20,320 --> 00:35:22,640
That is where Agent 365 enters the picture.
988
00:35:22,640 --> 00:35:25,120
What Agent 365 changes and what it does not.
989
00:35:25,120 --> 00:35:27,360
So now we can place Agent 365 properly.
990
00:35:27,360 --> 00:35:28,720
A lot of people hear the phrase
991
00:35:28,720 --> 00:35:31,120
and assume Microsoft has delivered one master answer
992
00:35:31,120 --> 00:35:32,400
for agent governance.
993
00:35:32,400 --> 00:35:34,400
One place to switch everything on.
994
00:35:34,400 --> 00:35:37,120
One surface that solves inventory, policy, risk, identity
995
00:35:37,120 --> 00:35:38,880
and runtime control in a single move.
996
00:35:38,880 --> 00:35:40,080
That is too generous.
997
00:35:40,080 --> 00:35:43,600
Agent 365 matters, but it matters in a more specific way.
998
00:35:43,600 --> 00:35:44,880
It changes governance posture
999
00:35:44,880 --> 00:35:47,760
because it treats agents as first-class managed things.
1000
00:35:47,760 --> 00:35:50,480
Not as random side projects, not as invisible flows,
1001
00:35:50,480 --> 00:35:52,640
not as clever assistants floating around the tenant
1002
00:35:52,640 --> 00:35:54,000
with unclear status.
1003
00:35:54,000 --> 00:35:56,160
The research points to a few concrete shifts.
1004
00:35:56,160 --> 00:35:58,720
Agent 365 brings a registry idea to the center.
1005
00:35:58,720 --> 00:36:00,800
It gives each agent a managed identity pattern
1006
00:36:00,800 --> 00:36:02,400
through Agent ID concepts.
1007
00:36:02,400 --> 00:36:04,960
It connects policy, ownership, lifecycle and discovery
1008
00:36:04,960 --> 00:36:06,560
into one governance surface.
1009
00:36:06,560 --> 00:36:09,120
And it ties that surface into the rest of the Microsoft control
1010
00:36:09,120 --> 00:36:12,000
stack, especially Entra, Purview, Defender and the admin center.
1011
00:36:12,000 --> 00:36:13,440
That is a real change.
1012
00:36:13,440 --> 00:36:16,400
Because once agents become visible as governed identities,
1013
00:36:16,400 --> 00:36:18,160
you can start asking better questions.
1014
00:36:18,160 --> 00:36:19,440
Which agents exist?
1015
00:36:19,440 --> 00:36:20,400
Who owns them?
1016
00:36:20,400 --> 00:36:21,840
What permissions do they hold?
1017
00:36:21,840 --> 00:36:23,520
Which data sources do they touch?
1018
00:36:23,520 --> 00:36:25,040
Which tools can they invoke?
1019
00:36:25,040 --> 00:36:26,400
Which environments are they running in?
1020
00:36:26,400 --> 00:36:27,520
Which ones are experimental?
1021
00:36:27,520 --> 00:36:28,800
Which ones are production?
1022
00:36:28,800 --> 00:36:30,560
Which ones should have been retired already
1023
00:36:30,560 --> 00:36:31,840
but are still hanging around?
1024
00:36:31,840 --> 00:36:34,560
Without that inventory spine, enterprise agent governance
1025
00:36:34,560 --> 00:36:35,600
stays fragmented.
1026
00:36:35,600 --> 00:36:37,520
That is where Agent 365 is strongest.
1027
00:36:37,520 --> 00:36:39,920
It gives you a place to anchor identity and oversight
1028
00:36:39,920 --> 00:36:41,840
across specialized agents, including agents
1029
00:36:41,840 --> 00:36:44,160
that reach into different services and workflows.
1030
00:36:44,160 --> 00:36:46,240
In a routed architecture, that matters more
1031
00:36:46,240 --> 00:36:48,080
than it would in a single agent setup.
1032
00:36:48,080 --> 00:36:50,720
Because specialist systems multiply fast.
1033
00:36:50,720 --> 00:36:52,880
The number of models, connectors, sub-agents,
1034
00:36:52,880 --> 00:36:54,480
and delegated paths increases
1035
00:36:54,480 --> 00:36:57,120
and the chance of orphaned logic rises with it.
1036
00:36:57,120 --> 00:36:59,280
Agent 365 helps reduce that blindness.
1037
00:36:59,280 --> 00:37:01,840
It also strengthens policy enforcement through connection,
1038
00:37:01,840 --> 00:37:02,880
not isolation.
1039
00:37:02,880 --> 00:37:05,440
An Agent ID can sit inside the larger Entra model.
1040
00:37:05,440 --> 00:37:08,320
Purview can shape data controls around prompts, responses,
1041
00:37:08,320 --> 00:37:09,440
and access patterns.
1042
00:37:09,440 --> 00:37:11,360
Defender can monitor unusual behavior
1043
00:37:11,360 --> 00:37:13,920
or risky relationships between agents and data.
1044
00:37:13,920 --> 00:37:15,840
Admin services can help discover,
1045
00:37:15,840 --> 00:37:17,920
govern, and review what is active.
1046
00:37:17,920 --> 00:37:20,720
That connective role is why I would call Agent 365
1047
00:37:20,720 --> 00:37:21,840
a governance spine.
1048
00:37:21,840 --> 00:37:23,440
It does not replace the rest of the body.
1049
00:37:23,440 --> 00:37:24,880
It gives the body structure.
1050
00:37:24,880 --> 00:37:27,040
That distinction matters because expectations
1051
00:37:27,040 --> 00:37:28,640
are running ahead of reality.
1052
00:37:28,640 --> 00:37:31,120
Agent 365 is not a full cross-cloud answer
1053
00:37:31,120 --> 00:37:32,320
to every agent problem.
1054
00:37:32,320 --> 00:37:34,240
The research is pretty direct about that.
1055
00:37:34,240 --> 00:37:38,240
Governance gaps still exist when agents move across external clouds,
1056
00:37:38,240 --> 00:37:40,240
external runtimes, third-party tools,
1057
00:37:40,240 --> 00:37:42,240
or partially integrated ecosystems.
1058
00:37:42,240 --> 00:37:43,520
Visibility may weaken.
1059
00:37:43,520 --> 00:37:44,880
Enforcement depth may vary.
1060
00:37:44,880 --> 00:37:46,800
Some telemetry may still be incomplete.
1061
00:37:46,800 --> 00:37:50,000
If your architecture spans Azure, Microsoft 365,
1062
00:37:50,000 --> 00:37:52,400
outside SaaS, custom MCP servers,
1063
00:37:52,400 --> 00:37:54,720
or other clouds, you still need deliberate design.
1064
00:37:54,720 --> 00:37:58,000
Agent 365 helps, but it does not excuse sloppy boundaries.
1065
00:37:58,000 --> 00:38:00,320
It also does not remove the need for charters.
1066
00:38:00,320 --> 00:38:02,400
You still need to define what each agent is for.
1067
00:38:02,400 --> 00:38:03,360
You still need owners.
1068
00:38:03,360 --> 00:38:04,800
You still need risk tiers.
1069
00:38:04,800 --> 00:38:06,080
You still need approval logic.
1070
00:38:06,080 --> 00:38:08,960
You still need connector discipline and environment separation.
1071
00:38:08,960 --> 00:38:11,680
If an organization treats Agent 365 as a reason
1072
00:38:11,680 --> 00:38:13,200
to stop doing architecture,
1073
00:38:13,200 --> 00:38:16,080
it will just centralize confusion more efficiently.
1074
00:38:16,080 --> 00:38:18,240
So use it for what it is actually good at.
1075
00:38:18,240 --> 00:38:19,760
Use it to register agents.
1076
00:38:19,760 --> 00:38:21,120
Use it to anchor identity.
1077
00:38:21,120 --> 00:38:23,120
Use it to apply policy consistently.
1078
00:38:23,120 --> 00:38:24,880
Use it to support lifecycle reviews,
1079
00:38:24,880 --> 00:38:26,240
discovery, and monitoring.
1080
00:38:26,240 --> 00:38:28,480
Use it to connect governance across the Microsoft stack,
1081
00:38:28,480 --> 00:38:31,200
but do not confuse a control plane with a complete design.
1082
00:38:31,200 --> 00:38:33,440
A control plane can tell you what exists,
1083
00:38:33,440 --> 00:38:34,880
constrain what is allowed,
1084
00:38:34,880 --> 00:38:37,360
and help you stop unsafe things faster.
1085
00:38:37,360 --> 00:38:39,280
It cannot decide your expert boundaries for you.
1086
00:38:39,280 --> 00:38:41,920
It cannot magically clean up overlapping missions.
1087
00:38:41,920 --> 00:38:44,640
It cannot turn a messy routing model into a clean one.
1088
00:38:44,640 --> 00:38:46,480
Those are still architecture choices.
1089
00:38:46,480 --> 00:38:47,920
And this becomes even more important
1090
00:38:47,920 --> 00:38:51,120
once model choice itself starts varying across experts and routes.
1091
00:38:51,120 --> 00:38:53,280
Because the moment you allow model optionality,
1092
00:38:53,280 --> 00:38:54,880
you need governance not only over agents,
1093
00:38:54,880 --> 00:38:57,440
but over which models those agents can use in the first place.
1094
00:38:57,440 --> 00:39:00,160
That means the next layer is not agent governance alone.
1095
00:39:00,160 --> 00:39:01,520
It is model governance.
1096
00:39:01,520 --> 00:39:03,760
Use Azure Policy to control model switching.
1097
00:39:03,760 --> 00:39:06,640
Once you accept that a routed system uses several models,
1098
00:39:06,640 --> 00:39:08,720
a new governance problem appears fast.
1099
00:39:08,720 --> 00:39:09,760
Who gets to choose them?
1100
00:39:09,760 --> 00:39:12,080
If every team can swap models whenever they want,
1101
00:39:12,080 --> 00:39:13,680
your architecture stops being governed
1102
00:39:13,680 --> 00:39:15,920
at the exact point where it becomes more powerful.
1103
00:39:15,920 --> 00:39:17,520
A routed system without model policy
1104
00:39:17,520 --> 00:39:19,040
doesn't stay flexible for long,
1105
00:39:19,040 --> 00:39:20,880
and instead it turns into a moving target
1106
00:39:20,880 --> 00:39:23,520
where behavior shifts and compliance reviews fall behind.
1107
00:39:23,520 --> 00:39:25,840
Cost assumptions drift when nobody can explain
1108
00:39:25,840 --> 00:39:28,320
why the same prompt behaves differently this week
1109
00:39:28,320 --> 00:39:29,360
than it did last week.
1110
00:39:29,360 --> 00:39:31,520
That's why model choice has to be a platform decision
1111
00:39:31,520 --> 00:39:33,120
rather than a developer preference.
1112
00:39:33,120 --> 00:39:34,800
This is where Azure Policy matters.
1113
00:39:34,800 --> 00:39:37,680
The practical Microsoft pattern in the research is very clear
1114
00:39:37,680 --> 00:39:39,520
because Azure Policy can restrict
1115
00:39:39,520 --> 00:39:42,080
which models are allowed from Azure AI Foundry.
1116
00:39:42,080 --> 00:39:43,520
It covers approved registries,
1117
00:39:43,520 --> 00:39:45,920
model asset IDs, publishers, and deployment types,
1118
00:39:45,920 --> 00:39:47,920
which means the platform defines the catalog
1119
00:39:47,920 --> 00:39:49,280
where teams are allowed to operate.
1120
00:39:49,280 --> 00:39:50,960
That changes the conversation entirely.
1121
00:39:50,960 --> 00:39:53,120
Teams are no longer asking if they can use anything,
1122
00:39:53,120 --> 00:39:55,120
but are instead asking which approved options
1123
00:39:55,120 --> 00:39:56,080
fit the workload.
1124
00:39:56,080 --> 00:39:57,280
That is the right framing.
1125
00:39:57,280 --> 00:39:58,320
In a routed architecture,
1126
00:39:58,320 --> 00:39:59,920
switching models is not a minor detail
1127
00:39:59,920 --> 00:40:02,240
because it affects cost, latency, safety,
1128
00:40:02,240 --> 00:40:04,400
and even the quality of downstream handoffs.
1129
00:40:04,400 --> 00:40:06,320
If one expert path quietly changes
1130
00:40:06,320 --> 00:40:08,080
from one model family to another,
1131
00:40:08,080 --> 00:40:09,600
the route may still succeed
1132
00:40:09,600 --> 00:40:12,000
while the system characteristics change underneath it.
1133
00:40:12,000 --> 00:40:14,320
That is exactly the kind of drift mature governance
1134
00:40:14,320 --> 00:40:15,120
should prevent.
1135
00:40:15,120 --> 00:40:17,520
So start with allow lists: approved models,
1136
00:40:17,520 --> 00:40:18,480
approved publishers,
1137
00:40:18,480 --> 00:40:19,840
approved deployment patterns.
1138
00:40:19,840 --> 00:40:21,680
If your organization wants different options
1139
00:40:21,680 --> 00:40:23,520
for different contexts, that's fine,
1140
00:40:23,520 --> 00:40:26,080
but you have to define them centrally.
1141
00:40:26,080 --> 00:40:27,840
Maybe development gets a broader catalog
1142
00:40:27,840 --> 00:40:30,000
for experimentation while the test environment
1143
00:40:30,000 --> 00:40:31,120
narrows that list down.
1144
00:40:31,120 --> 00:40:34,080
Production should only permit a smaller set of fully reviewed models
1145
00:40:34,080 --> 00:40:35,680
to ensure the architecture doesn't turn
1146
00:40:35,680 --> 00:40:37,120
into a moving experiment.
1147
00:40:37,120 --> 00:40:39,360
And one level deeper, routed systems
1148
00:40:39,360 --> 00:40:41,760
make this more urgent than single model systems.
1149
00:40:41,760 --> 00:40:43,040
In a single model setup,
1150
00:40:43,040 --> 00:40:45,360
one bad decision only affects one primary path,
1151
00:40:45,360 --> 00:40:46,800
but in an expert fabric,
1152
00:40:46,800 --> 00:40:49,600
a loose policy can create dozens of unstable combinations.
1153
00:40:49,600 --> 00:40:50,800
The router model changes,
1154
00:40:50,800 --> 00:40:52,160
the specialist model changes,
1155
00:40:52,160 --> 00:40:53,840
and the fallback model changes
1156
00:40:53,840 --> 00:40:56,560
until the system is no longer one governed service.
1157
00:40:56,560 --> 00:40:59,120
It becomes a collection of shifting dependencies.
1158
00:40:59,120 --> 00:41:01,280
Azure Policy helps you compress that sprawl
1159
00:41:01,280 --> 00:41:03,440
back into an approved operating boundary.
1160
00:41:03,440 --> 00:41:05,280
The research also points to another useful idea
1161
00:41:05,280 --> 00:41:07,280
where dynamic model switching is still possible
1162
00:41:07,280 --> 00:41:08,640
inside the approved catalog.
1163
00:41:08,640 --> 00:41:09,760
That is the balance.
1164
00:41:09,760 --> 00:41:12,000
You do not need to freeze the architecture completely
1165
00:41:12,000 --> 00:41:14,400
because you can still build cheap-first patterns
1166
00:41:14,400 --> 00:41:16,000
that escalate later.
1167
00:41:16,000 --> 00:41:17,840
One route can still select between
1168
00:41:17,840 --> 00:41:20,160
approved models based on the task type
1169
00:41:20,160 --> 00:41:22,560
or the environment as long as the evolution happens
1170
00:41:22,560 --> 00:41:24,240
inside a bounded space.
1171
00:41:24,240 --> 00:41:25,840
The policy does not kill routing,
1172
00:41:25,840 --> 00:41:27,760
but it does make routing governable.
1173
00:41:27,760 --> 00:41:29,200
This also creates a cleaner relationship
1174
00:41:29,200 --> 00:41:30,880
between platform teams and builders.
1175
00:41:30,880 --> 00:41:32,560
Platform teams own the model catalog
1176
00:41:32,560 --> 00:41:34,400
and the rules around what gets approved.
1177
00:41:34,400 --> 00:41:36,560
Builders design workflows and expert paths
1178
00:41:36,560 --> 00:41:37,760
using that catalog.
1179
00:41:37,760 --> 00:41:39,840
Security and compliance teams review additions
1180
00:41:39,840 --> 00:41:42,160
to the catalog instead of chasing unknown deployments
1181
00:41:42,160 --> 00:41:44,880
after the fact. That division of labor is healthier
1182
00:41:44,880 --> 00:41:46,480
than asking every project team
1183
00:41:46,480 --> 00:41:48,480
to invent its own model risk posture.
1184
00:41:48,480 --> 00:41:49,920
Now policy alone is not enough.
1185
00:41:49,920 --> 00:41:51,760
You still need supporting controls around it
1186
00:41:51,760 --> 00:41:53,600
because being approved does not mean a model
1187
00:41:53,600 --> 00:41:54,800
is affordable at scale.
1188
00:41:54,800 --> 00:41:56,480
Content filters and guardrails
1189
00:41:56,480 --> 00:41:58,080
should sit beside model policy
1190
00:41:58,080 --> 00:42:00,400
to ensure safety in every context.
1191
00:42:00,400 --> 00:42:01,840
Traceability is also required
1192
00:42:01,840 --> 00:42:03,360
so you can actually observe the traffic
1193
00:42:03,360 --> 00:42:04,640
once it starts moving.
1194
00:42:04,640 --> 00:42:05,760
Policy sets the boundary,
1195
00:42:05,760 --> 00:42:08,160
but these other controls make that boundary operational.
1196
00:42:08,160 --> 00:42:10,240
And there is one more reason to care about this.
1197
00:42:10,240 --> 00:42:11,440
When a route performs badly,
1198
00:42:11,440 --> 00:42:12,880
teams often blame the model
1199
00:42:12,880 --> 00:42:14,240
but sometimes the real problem is
1200
00:42:14,240 --> 00:42:16,000
that too many models were allowed in
1201
00:42:16,000 --> 00:42:18,240
without any disciplined evaluation.
1202
00:42:18,240 --> 00:42:19,600
Once model choice is governed,
1203
00:42:19,600 --> 00:42:21,360
you can finally test the route on purpose
1204
00:42:21,360 --> 00:42:23,680
which is exactly where we need to go next.
1205
00:42:23,680 --> 00:42:25,440
How to evaluate routing,
1206
00:42:25,440 --> 00:42:26,800
not just model quality.
1207
00:42:26,800 --> 00:42:27,920
Once you add routing,
1208
00:42:27,920 --> 00:42:30,480
the unit you need to test is no longer just the model.
1209
00:42:30,480 --> 00:42:31,840
It is the decision system.
1210
00:42:31,840 --> 00:42:32,720
That sounds obvious
1211
00:42:32,720 --> 00:42:35,440
but a lot of teams still evaluate routed architectures
1212
00:42:35,440 --> 00:42:37,440
as if the only meaningful question is whether
1213
00:42:37,440 --> 00:42:39,520
one output sounds better than another.
1214
00:42:39,520 --> 00:42:40,560
In a routed setup,
1215
00:42:40,560 --> 00:42:42,480
the first failure may happen before the final answer
1216
00:42:42,480 --> 00:42:43,360
is even generated
1217
00:42:43,360 --> 00:42:45,440
because the router might choose the wrong path.
1218
00:42:45,440 --> 00:42:46,720
It may escalate too often
1219
00:42:46,720 --> 00:42:48,160
or avoid escalation when it shouldn't
1220
00:42:48,160 --> 00:42:50,000
and sometimes it sends work to the right expert
1221
00:42:50,000 --> 00:42:51,040
but with the wrong structure.
1222
00:42:51,040 --> 00:42:53,040
If you only score the final response,
1223
00:42:53,040 --> 00:42:55,040
you are measuring the last visible step
1224
00:42:55,040 --> 00:42:56,720
and ignoring the logic that shaped it.
1225
00:42:56,720 --> 00:42:58,080
Start with a baseline.
1226
00:42:58,080 --> 00:42:59,680
Take your current fixed model setup
1227
00:42:59,680 --> 00:43:00,880
and measure it honestly
1228
00:43:00,880 --> 00:43:03,760
for quality, latency, cost and task success.
1229
00:43:03,760 --> 00:43:06,080
Then compare the routed design against that baseline
1230
00:43:06,080 --> 00:43:07,680
using a representative workload
1231
00:43:07,680 --> 00:43:09,200
rather than a polished demo set.
1232
00:43:09,200 --> 00:43:10,480
You need a set of real requests
1233
00:43:10,480 --> 00:43:12,480
that reflect what users actually do
1234
00:43:12,480 --> 00:43:14,800
including messy phrasing and ambiguous asks
1235
00:43:14,800 --> 00:43:17,040
and the routine traffic people complain about later.
1236
00:43:17,760 --> 00:43:19,920
Representative matters because routing systems
1237
00:43:19,920 --> 00:43:21,920
are extremely sensitive to traffic shape.
1238
00:43:21,920 --> 00:43:23,360
If the test set is too synthetic,
1239
00:43:23,360 --> 00:43:25,360
the router may look cleaner than it really is
1240
00:43:25,360 --> 00:43:27,440
but real users don't write benchmark prompts.
1241
00:43:27,440 --> 00:43:29,760
They write partial sentences, mixed requests
1242
00:43:29,760 --> 00:43:32,640
and vague questions with too much context or too little.
1243
00:43:32,640 --> 00:43:34,640
A good evaluation set has enough variation
1244
00:43:34,640 --> 00:43:37,280
to show where the route starts bending under pressure.
1245
00:43:37,280 --> 00:43:38,880
Then measure four things together.
1246
00:43:38,880 --> 00:43:41,920
Quality is one, cost is another, latency is another,
1247
00:43:41,920 --> 00:43:43,200
success rate is the fourth
1248
00:43:43,200 --> 00:43:44,960
and it is often the most grounded measure
1249
00:43:44,960 --> 00:43:46,320
in enterprise systems
1250
00:43:46,320 --> 00:43:49,920
because it reflects whether the request was actually completed.
1251
00:43:49,920 --> 00:43:52,320
A route that saves money but lowers completion quality
1252
00:43:52,320 --> 00:43:53,760
is not a win for the business.
1253
00:43:53,760 --> 00:43:55,360
A route that improves answer quality
1254
00:43:55,360 --> 00:43:57,600
but destroys latency on common tasks
1255
00:43:57,600 --> 00:43:59,040
will also fail in practice
1256
00:43:59,040 --> 00:44:01,760
which is why these systems need multi-metric evaluation.
1257
00:44:01,760 --> 00:44:03,440
You also need trace-based evaluation.
1258
00:44:03,440 --> 00:44:05,360
This is much more useful in agent systems
1259
00:44:05,360 --> 00:44:07,120
than isolated prompt scoring
1260
00:44:07,120 --> 00:44:09,440
because traces show how the route actually behaved.
1261
00:44:09,440 --> 00:44:11,040
You can see which path was selected,
1262
00:44:11,040 --> 00:44:12,320
which expert was called
1263
00:44:12,320 --> 00:44:15,200
and whether retrieval or a specific workflow was used.
1264
00:44:15,200 --> 00:44:18,240
Without traces you are mostly guessing why a result happened
1265
00:44:18,240 --> 00:44:21,120
but with them you can learn if the problem came from routing
1266
00:44:21,120 --> 00:44:23,840
or specialist behavior. And then look at distribution.
1267
00:44:23,840 --> 00:44:25,520
Where is the traffic really going?
1268
00:44:25,520 --> 00:44:27,200
This question matters more than people expect
1269
00:44:27,200 --> 00:44:29,200
because a routed system can look elegant on paper
1270
00:44:29,200 --> 00:44:30,640
and still fail economically.
1271
00:44:30,640 --> 00:44:33,040
If too much traffic ends up in premium paths
1272
00:44:33,040 --> 00:44:35,920
or if the system clings to low-cost paths too aggressively,
1273
00:44:35,920 --> 00:44:37,760
the architecture will underperform.
1274
00:44:37,760 --> 00:44:40,160
Inspect the model distribution after test runs
1275
00:44:40,160 --> 00:44:41,840
to see what percentage stayed cheap
1276
00:44:41,840 --> 00:44:44,320
and which domains triggered fallbacks most often.
1277
00:44:44,320 --> 00:44:46,720
Distribution tells you whether the architecture you designed
1278
00:44:46,720 --> 00:44:48,640
is the architecture you are actually running.
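One way to inspect distribution after a test run is to aggregate trace records. This is a minimal sketch that assumes each trace is a dictionary with hypothetical `final_tier`, `domain`, and `fell_back` fields; real trace schemas will differ.

```python
from collections import Counter


def tier_distribution(traces):
    """Share of requests that finished on each tier (for example 'cheap' or 'premium')."""
    counts = Counter(t["final_tier"] for t in traces)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()} if total else {}


def fallback_rate_by_domain(traces):
    """Fraction of requests per domain that triggered a fallback."""
    totals, fallbacks = Counter(), Counter()
    for t in traces:
        totals[t["domain"]] += 1
        if t["fell_back"]:
            fallbacks[t["domain"]] += 1
    return {domain: fallbacks[domain] / totals[domain] for domain in totals}


# Hypothetical trace records from one test run.
traces = [
    {"final_tier": "cheap", "domain": "hr", "fell_back": False},
    {"final_tier": "premium", "domain": "hr", "fell_back": True},
    {"final_tier": "cheap", "domain": "it", "fell_back": False},
    {"final_tier": "cheap", "domain": "it", "fell_back": False},
]
print(tier_distribution(traces))
print(fallback_rate_by_domain(traces))
```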
1279
00:44:48,640 --> 00:44:50,960
Approval thresholds should come before production,
1280
00:44:50,960 --> 00:44:51,840
not after.
1281
00:44:51,840 --> 00:44:53,440
Define what acceptable looks like
1282
00:44:53,440 --> 00:44:55,280
before you roll anything out to users.
1283
00:44:55,280 --> 00:44:57,840
Maybe the routed system must preserve task success
1284
00:44:57,840 --> 00:45:00,000
within a narrow range while cutting cost
1285
00:45:00,000 --> 00:45:03,680
or perhaps the P95 latency cannot exceed a certain limit.
1286
00:45:03,680 --> 00:45:05,200
The exact thresholds will vary
1287
00:45:05,200 --> 00:45:06,480
but the discipline is the same
1288
00:45:06,480 --> 00:45:08,720
because you have to decide the gate before the rollout.
1289
00:45:08,720 --> 00:45:11,280
Otherwise teams will just keep adjusting their expectations
1290
00:45:11,280 --> 00:45:12,960
after they see mixed results.
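A gate like this can be written down before the rollout as a small check. The sketch below is only an illustration: the threshold values, the metric field names, and the nearest-rank P95 calculation are assumptions, not numbers from the episode.

```python
# Hypothetical gate, agreed before rollout. All values are placeholders.
GATE = {"max_success_drop": 0.02, "min_cost_reduction": 0.30, "max_p95_latency_s": 4.0}


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def failed_checks(baseline, routed, gate=GATE):
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if baseline["success_rate"] - routed["success_rate"] > gate["max_success_drop"]:
        failures.append("task success dropped too far")
    cost_reduction = 1 - routed["cost_per_request"] / baseline["cost_per_request"]
    if cost_reduction < gate["min_cost_reduction"]:
        failures.append("cost reduction below target")
    if p95(routed["latencies_s"]) > gate["max_p95_latency_s"]:
        failures.append("P95 latency above limit")
    return failures
```

Because the gate is plain data, it can be reviewed and versioned separately from the test results it judges.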
1291
00:45:12,960 --> 00:45:15,840
And one more thing, evaluate the router itself.
1292
00:45:15,840 --> 00:45:17,360
Not just the whole chain.
1293
00:45:17,360 --> 00:45:19,440
Take labeled examples and check whether the router
1294
00:45:19,440 --> 00:45:20,640
picked the intended path
1295
00:45:20,640 --> 00:45:23,760
so you can measure mis-routes separately from specialist failures.
1296
00:45:23,760 --> 00:45:27,200
If the route is wrong, tuning the expert won't fix the system
1297
00:45:27,200 --> 00:45:29,360
and you need to know exactly where the miss happened.
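Separating misroutes from specialist failures could look something like this. The sketch assumes a labeled set where each example records a hypothetical `expected_path`, the `chosen_path` the router picked, and whether the task succeeded.

```python
def attribute_failures(examples):
    """Count misroutes separately from failures on a correctly chosen path."""
    misroutes = [e for e in examples if e["chosen_path"] != e["expected_path"]]
    specialist_failures = [
        e for e in examples
        if e["chosen_path"] == e["expected_path"] and not e["task_succeeded"]
    ]
    return {
        "router_accuracy": 1 - len(misroutes) / len(examples),
        "misroutes": len(misroutes),
        "specialist_failures": len(specialist_failures),
    }


# Hypothetical labeled examples.
labeled = [
    {"expected_path": "policy", "chosen_path": "policy", "task_succeeded": True},
    {"expected_path": "policy", "chosen_path": "workflow", "task_succeeded": False},
    {"expected_path": "extract", "chosen_path": "extract", "task_succeeded": False},
    {"expected_path": "extract", "chosen_path": "extract", "task_succeeded": True},
]
print(attribute_failures(labeled))
```

A failed task on the wrong path is counted as a misroute only, so tuning effort goes to the router first.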
1298
00:45:29,360 --> 00:45:31,120
Because once routing becomes explicit
1299
00:45:31,120 --> 00:45:33,440
evaluation has to become architectural too.
1300
00:45:33,440 --> 00:45:34,880
And only after you trust the route
1301
00:45:34,880 --> 00:45:37,280
can you decide something even more important
1302
00:45:37,280 --> 00:45:39,040
when the system should escalate.
1303
00:45:39,040 --> 00:45:41,680
Escalation logic is where good architectures become great.
1304
00:45:41,680 --> 00:45:44,720
This is the point where an expert fabric either finds its discipline
1305
00:45:44,720 --> 00:45:46,880
or turns into a very expensive maze.
1306
00:45:46,880 --> 00:45:48,800
Routing by itself is never enough.
1307
00:45:48,800 --> 00:45:51,600
The system needs to know when a cheap path should stop
1308
00:45:51,600 --> 00:45:53,520
when a stronger path should take over
1309
00:45:53,520 --> 00:45:56,160
and when no autonomous path should be trusted at all.
1310
00:45:56,160 --> 00:45:58,080
That judgment is escalation logic.
1311
00:45:58,080 --> 00:46:00,400
In practice, this is where professional architectures
1312
00:46:00,400 --> 00:46:02,720
separate themselves from impressive demos.
1313
00:46:02,720 --> 00:46:04,640
Most teams think about escalation too late.
1314
00:46:04,640 --> 00:46:07,200
They build a cheap-first flow, add some specialists,
1315
00:46:07,200 --> 00:46:09,200
and wire in a larger model for hard cases
1316
00:46:09,200 --> 00:46:11,520
while hoping the system will just know when to move upward.
1317
00:46:11,520 --> 00:46:12,960
It won't.
1318
00:46:12,960 --> 00:46:16,400
If your escalation rules are vague, one of two things usually happens.
1319
00:46:16,400 --> 00:46:18,560
Either the system escalates far too often
1320
00:46:18,560 --> 00:46:20,560
which destroys the savings you designed for
1321
00:46:20,560 --> 00:46:22,080
or it escalates too rarely.
1322
00:46:22,080 --> 00:46:24,960
That means weak results keep slipping through cheap paths
1323
00:46:24,960 --> 00:46:27,040
that were never meant to handle that much ambiguity.
1324
00:46:27,040 --> 00:46:29,200
So you have to define the triggers clearly.
1325
00:46:29,200 --> 00:46:32,640
Confidence is one trigger, but it cannot be the only one.
1326
00:46:32,640 --> 00:46:36,160
A model sounding confident is not the same as a task being safe.
1327
00:46:36,160 --> 00:46:38,000
You also need ambiguity thresholds.
1328
00:46:38,000 --> 00:46:39,520
Is the request underspecified?
1329
00:46:39,520 --> 00:46:40,800
Does it combine multiple goals?
1330
00:46:40,800 --> 00:46:42,000
Does it cross domains?
1331
00:46:42,000 --> 00:46:43,600
Does the retrieval evidence conflict?
1332
00:46:43,600 --> 00:46:46,400
Then add risk thresholds.
1333
00:46:46,400 --> 00:46:48,480
Is the user asking for an action with compliance,
1334
00:46:48,480 --> 00:46:50,400
financial, HR, or legal impact?
1335
00:46:50,400 --> 00:46:53,120
Does the system need to write back into business data?
1336
00:46:53,120 --> 00:46:56,000
Does it involve sensitive content or policy interpretation?
1337
00:46:56,000 --> 00:46:57,920
That combination matters because escalation
1338
00:46:57,920 --> 00:46:59,440
should reflect business reality,
1339
00:46:59,440 --> 00:47:00,800
not just model uncertainty.
1340
00:47:00,800 --> 00:47:02,320
A system might feel highly confident
1341
00:47:02,320 --> 00:47:04,800
about a low-quality answer in a high-risk workflow,
1342
00:47:04,800 --> 00:47:06,080
and that still should not pass.
1343
00:47:06,080 --> 00:47:09,120
On the other side, a mildly uncertain classification
1344
00:47:09,120 --> 00:47:12,720
in a low risk path may not justify premium inference at all.
1345
00:47:12,720 --> 00:47:16,400
Your design has to combine model signals with policy and business value
1346
00:47:16,400 --> 00:47:18,880
because technical confidence alone is too narrow.
1347
00:47:18,880 --> 00:47:20,240
In a healthy architecture,
1348
00:47:20,240 --> 00:47:22,080
escalation works like a ladder.
1349
00:47:22,080 --> 00:47:24,240
Simple requests stay on low cost paths.
1350
00:47:24,240 --> 00:47:25,920
Harder requests move to a larger model
1351
00:47:25,920 --> 00:47:27,440
or a deeper specialist chain.
1352
00:47:27,440 --> 00:47:29,680
High-impact actions pause for human review.
1353
00:47:29,680 --> 00:47:31,360
That last step is easy to skip
1354
00:47:31,360 --> 00:47:33,440
when teams want autonomy to look impressive
1355
00:47:33,440 --> 00:47:36,560
but it is also where a lot of preventable risk enters the system.
1356
00:47:36,560 --> 00:47:38,080
If an action changes records,
1357
00:47:38,080 --> 00:47:40,080
triggers approvals, moves money
1358
00:47:40,080 --> 00:47:41,680
or commits something externally,
1359
00:47:41,680 --> 00:47:43,680
human approvals should stay in the design.
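The ladder can be sketched as a small decision function that combines model signals with business rules. The signal names, the threshold defaults, and the rule that any high-risk domain goes straight to the premium tier are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"
    PREMIUM = "premium"
    HUMAN_REVIEW = "human_review"


@dataclass
class Signals:
    confidence: float       # model-reported confidence, 0..1
    ambiguity: float        # 0..1: underspecified, multi-goal, conflicting evidence
    high_risk_domain: bool  # compliance, financial, HR, or legal impact
    writes_back: bool       # changes records, moves money, commits externally


def choose_tier(s: Signals, min_confidence: float = 0.6, max_ambiguity: float = 0.4) -> Tier:
    """Pick the next rung; thresholds are hypothetical placeholders."""
    if s.writes_back:
        return Tier.HUMAN_REVIEW  # high-impact actions pause for a human
    if s.high_risk_domain:
        return Tier.PREMIUM       # confidence alone never clears a high-risk path
    if s.confidence < min_confidence or s.ambiguity > max_ambiguity:
        return Tier.PREMIUM
    return Tier.CHEAP             # mild uncertainty on a low-risk path stays cheap
```

Risk rules are checked before confidence, so a confident answer cannot skip review on a write-back action.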
1360
00:47:43,680 --> 00:47:47,120
The workflow only earns more trust through long, measured evidence.
1361
00:47:47,120 --> 00:47:49,680
And keep the escalation output structured too.
1362
00:47:49,680 --> 00:47:51,200
When a lower tier escalates,
1363
00:47:51,200 --> 00:47:53,200
it should not just throw the whole problem upward
1364
00:47:53,200 --> 00:47:54,880
with a vague message that it isn't sure.
1365
00:47:54,880 --> 00:47:56,320
It should pass a clean package:
1366
00:47:56,320 --> 00:47:58,080
what it saw, why it escalated,
1367
00:47:58,080 --> 00:47:59,360
what evidence it found,
1368
00:47:59,360 --> 00:48:01,040
what path it already attempted.
1369
00:48:01,040 --> 00:48:03,920
That preserves context, reduces repeated work
1370
00:48:03,920 --> 00:48:07,200
and makes review faster for both larger models and humans.
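A clean package of that kind might be modeled as a small structured record. The field names below are hypothetical; they simply mirror the four items listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EscalationPackage:
    request_summary: str  # what the lower tier saw
    reason: str           # why it escalated
    evidence: list = field(default_factory=list)         # what evidence it found
    paths_attempted: list = field(default_factory=list)  # what it already tried

    def to_json(self) -> str:
        """Serialize so a larger model or a human reviewer gets the same context."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example of a lower tier handing work upward.
package = EscalationPackage(
    request_summary="User asks to change a vendor payment date",
    reason="write-back to financial records",
    evidence=["policy section on payment changes"],
    paths_attempted=["policy-lookup"],
)
print(package.to_json())
```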
1371
00:48:07,200 --> 00:48:08,720
There is also an economic rule here:
1372
00:48:08,720 --> 00:48:10,720
cheap first, expensive last,
1373
00:48:10,720 --> 00:48:12,480
not because cheap is always better
1374
00:48:12,480 --> 00:48:15,440
but because the architecture should spend capability with intent.
1375
00:48:15,440 --> 00:48:18,000
Premium reasoning should be a deliberate exception path,
1376
00:48:18,000 --> 00:48:20,560
not the default emotional reaction to uncertainty.
1377
00:48:20,560 --> 00:48:22,000
If the system cannot hold that line,
1378
00:48:22,000 --> 00:48:23,520
the route may be technically elegant
1379
00:48:23,520 --> 00:48:24,960
while remaining financially useless.
1380
00:48:24,960 --> 00:48:26,720
So watch for over-escalation.
1381
00:48:26,720 --> 00:48:31,360
This is one of the most common hidden failures in routed systems.
1382
00:48:31,360 --> 00:48:34,400
Early tests look good because the system appears safe and polished
1383
00:48:34,400 --> 00:48:35,680
but under the surface,
1384
00:48:35,680 --> 00:48:38,400
almost everything drifts upward into premium paths.
1385
00:48:38,400 --> 00:48:40,720
Cost climbs, latency stretches.
1386
00:48:40,720 --> 00:48:44,320
Teams still tell themselves they built a small-model-first architecture. They didn't.
1387
00:48:44,320 --> 00:48:46,880
They built a premium system with a decorative front filter.
1388
00:48:46,880 --> 00:48:48,720
The fix is not just tighter thresholds,
1389
00:48:48,720 --> 00:48:51,360
it is better task design, narrower experts,
1390
00:48:51,360 --> 00:48:53,840
cleaner outputs, better evidence packaging,
1391
00:48:53,840 --> 00:48:55,440
more explicit business rules.
1392
00:48:55,440 --> 00:48:58,240
Escalation improves when the lower paths are designed to fail
1393
00:48:58,240 --> 00:49:00,000
clearly instead of failing messily.
1394
00:49:00,000 --> 00:49:02,880
If you want the architecture to stay useful and governable,
1395
00:49:02,880 --> 00:49:05,440
treat escalation as a first-class design object:
1396
00:49:05,440 --> 00:49:07,680
define it, measure it, review it.
1397
00:49:07,680 --> 00:49:10,000
Every savings model, every latency target,
1398
00:49:10,000 --> 00:49:13,520
and every safety promise in an expert fabric eventually runs through that decision
1399
00:49:13,520 --> 00:49:14,880
about what happens next.
1400
00:49:14,880 --> 00:49:19,280
Once you see that, cost stops looking like something attached to inference alone.
1401
00:49:19,280 --> 00:49:21,920
It starts showing up in every layer of the system.
1402
00:49:21,920 --> 00:49:24,640
The full cost stack of an expert fabric.
1403
00:49:24,640 --> 00:49:26,880
Now we need to get more honest about cost.
1404
00:49:26,880 --> 00:49:28,800
Once people accept cheap-first routing,
1405
00:49:28,800 --> 00:49:30,960
they often swing too far in the other direction
1406
00:49:30,960 --> 00:49:33,120
and pretend token pricing is the whole story.
1407
00:49:33,120 --> 00:49:34,080
It isn't.
1408
00:49:34,080 --> 00:49:36,080
Token pricing is only the visible part.
1409
00:49:36,080 --> 00:49:39,280
The system cost sits across every layer that makes the fabric work.
1410
00:49:39,280 --> 00:49:40,080
Start with inference.
1411
00:49:40,080 --> 00:49:41,760
Yes, you still pay for router calls,
1412
00:49:41,760 --> 00:49:43,920
specialist model calls, fallback model calls,
1413
00:49:43,920 --> 00:49:46,640
and sometimes verification or summarization at the end.
1414
00:49:46,640 --> 00:49:47,520
That matters.
1415
00:49:47,520 --> 00:49:49,120
But in a real Microsoft environment,
1416
00:49:49,120 --> 00:49:50,880
that is only one line in the bill.
1417
00:49:50,880 --> 00:49:52,640
The moment you operationalize the fabric,
1418
00:49:52,640 --> 00:49:54,160
more cost surfaces appear,
1419
00:49:54,160 --> 00:49:55,840
and some of them grow quietly.
1420
00:49:55,840 --> 00:49:57,440
Workflow execution is one of them.
1421
00:49:57,440 --> 00:50:00,080
If Copilot Studio hands work to Power Automate,
1422
00:50:00,080 --> 00:50:02,560
external APIs or connected business systems,
1423
00:50:02,560 --> 00:50:04,800
every interaction can trigger additional runs,
1424
00:50:04,800 --> 00:50:07,360
connector usage, and downstream compute.
1425
00:50:07,360 --> 00:50:09,840
In a monolith, you might waste premium inference.
1426
00:50:09,840 --> 00:50:11,680
In a fabric, you may reduce that waste
1427
00:50:11,680 --> 00:50:13,520
but increase orchestration activity.
1428
00:50:13,520 --> 00:50:14,960
That is not automatically bad,
1429
00:50:14,960 --> 00:50:16,880
but it means the comparison has to be honest.
1430
00:50:16,880 --> 00:50:18,480
Storage is another piece.
1431
00:50:18,480 --> 00:50:20,960
Traces, logs, prompt records, retrieval indexes,
1432
00:50:20,960 --> 00:50:23,840
and evaluation data sets all need to live somewhere.
1433
00:50:23,840 --> 00:50:25,200
If you are doing this properly,
1434
00:50:25,200 --> 00:50:26,800
you are also retaining enough evidence
1435
00:50:26,800 --> 00:50:29,280
for audit, review, and incident analysis.
1436
00:50:29,280 --> 00:50:32,000
Governed systems store more than optimistic prototypes do.
1437
00:50:32,000 --> 00:50:34,080
That adds cost, but it also buys control.
1438
00:50:34,080 --> 00:50:35,360
Then there is support cost.
1439
00:50:35,360 --> 00:50:36,480
Who reviews escalations?
1440
00:50:36,480 --> 00:50:37,280
Who tunes routes?
1441
00:50:37,280 --> 00:50:38,880
Who maintains expert charters?
1442
00:50:38,880 --> 00:50:40,480
Who updates model allow lists?
1443
00:50:40,480 --> 00:50:43,120
Who validates connectors after business system changes?
1444
00:50:43,120 --> 00:50:44,720
Who investigates strange traces?
1445
00:50:44,720 --> 00:50:48,320
In a monolithic setup, waste hides inside one oversized model path.
1446
00:50:48,320 --> 00:50:49,280
In an expert fabric,
1447
00:50:49,280 --> 00:50:51,600
part of the spend moves into platform operations.
1448
00:50:51,600 --> 00:50:53,520
That trade can still be very favorable,
1449
00:50:53,520 --> 00:50:55,600
but only if the workload volume and business value
1450
00:50:55,600 --> 00:50:56,880
justify the added discipline.
1451
00:50:56,880 --> 00:50:58,960
So the right comparison is not monolith
1452
00:50:58,960 --> 00:51:00,960
versus orchestration overhead in isolation.
1453
00:51:00,960 --> 00:51:03,360
It is monolith waste versus orchestrated efficiency
1454
00:51:03,360 --> 00:51:04,640
plus operational burden.
1455
00:51:04,640 --> 00:51:06,320
That is a more serious equation.
1456
00:51:06,320 --> 00:51:08,560
A single model design often looks cheaper at first
1457
00:51:08,560 --> 00:51:10,800
because the architecture diagram is smaller.
1458
00:51:10,800 --> 00:51:13,200
Fewer parts, fewer routes, fewer owners.
1459
00:51:13,200 --> 00:51:14,720
But that simplicity can be fake
1460
00:51:14,720 --> 00:51:17,760
if high-volume traffic keeps hitting expensive inference paths,
1461
00:51:17,760 --> 00:51:19,360
weak answers create rework,
1462
00:51:19,360 --> 00:51:22,240
and governance gaps generate manual cleanup later.
1463
00:51:22,240 --> 00:51:24,720
The bill is not just what Azure or OpenAI charges.
1464
00:51:24,720 --> 00:51:26,800
The bill is also what the organization absorbs
1465
00:51:26,800 --> 00:51:29,280
when the system behaves badly or inefficiently.
1466
00:51:29,280 --> 00:51:33,120
On the other side, an expert fabric can absolutely become overbuilt.
1467
00:51:33,120 --> 00:51:36,560
Too many agents, too many hops, too many thin specialists
1468
00:51:36,560 --> 00:51:38,560
that barely justify their own existence.
1469
00:51:38,560 --> 00:51:40,400
Too much logging with too little use,
1470
00:51:40,400 --> 00:51:42,640
too much ceremony for low-value tasks.
1471
00:51:42,640 --> 00:51:45,360
That is why architecture discipline and cost discipline
1472
00:51:45,360 --> 00:51:46,880
are really the same thing here.
1473
00:51:46,880 --> 00:51:49,680
If an expert path exists, it should earn its place.
1474
00:51:49,680 --> 00:51:52,080
If an orchestration step adds latency and admin work
1475
00:51:52,080 --> 00:51:54,800
without improving quality, safety, or savings, remove it.
1476
00:51:54,800 --> 00:51:56,480
Volume changes the picture fast.
1477
00:51:56,480 --> 00:51:58,240
High-volume systems usually benefit first
1478
00:51:58,240 --> 00:52:01,040
from an expert fabric because repeated low-complexity traffic
1479
00:52:01,040 --> 00:52:03,440
is where routing and specialization pay back quickly.
1480
00:52:03,440 --> 00:52:06,560
Classification-heavy flows, policy lookup, extraction,
1481
00:52:06,560 --> 00:52:08,960
and templated support are all good examples.
1482
00:52:08,960 --> 00:52:11,920
The more routine traffic you divert from expensive paths,
1483
00:52:11,920 --> 00:52:14,320
the more the structure starts working in your favor.
1484
00:52:14,320 --> 00:52:15,920
Low-volume systems are different.
1485
00:52:15,920 --> 00:52:17,520
If the request count is modest,
1486
00:52:17,520 --> 00:52:20,240
the task variation is still fuzzy, and the team is small,
1487
00:52:20,240 --> 00:52:23,360
orchestration overhead may outweigh inference savings for a while.
1488
00:52:23,360 --> 00:52:25,920
In that case, the right move may be a simpler architecture
1489
00:52:25,920 --> 00:52:28,960
with one bounded model path and strong guardrails
1490
00:52:28,960 --> 00:52:31,120
at least until the workload becomes clearer.
1491
00:52:31,120 --> 00:52:32,880
So the cost question is never just,
1492
00:52:32,880 --> 00:52:34,240
what is the cheapest model?
1493
00:52:34,240 --> 00:52:35,040
It is:
1494
00:52:35,040 --> 00:52:37,600
What is the cheapest system that still delivers the right outcome
1495
00:52:37,600 --> 00:52:38,640
under governance?
1496
00:52:38,640 --> 00:52:40,640
That is the level leaders need to design at.
1497
00:52:40,640 --> 00:52:42,560
Because if you only optimize token price,
1498
00:52:42,560 --> 00:52:44,480
you can still build an expensive machine.
1499
00:52:44,480 --> 00:52:46,080
And if you ignore token price,
1500
00:52:46,080 --> 00:52:47,520
you will definitely build one.
1501
00:52:47,520 --> 00:52:49,280
The hidden risk of emergent behavior.
1502
00:52:49,280 --> 00:52:51,040
There is one risk that does not show up
1503
00:52:51,040 --> 00:52:52,800
when you test agents one by one.
1504
00:52:52,800 --> 00:52:55,600
That only shows up once the system starts interacting with itself.
1505
00:52:55,600 --> 00:52:57,360
This is the part many teams miss,
1506
00:52:57,360 --> 00:53:00,400
because each individual component looks reasonable in isolation.
1507
00:53:00,400 --> 00:53:02,800
The router behaves, the policy expert behaves,
1508
00:53:02,800 --> 00:53:05,120
and the workflow expert follows the rules.
1509
00:53:05,120 --> 00:53:07,680
But when those pieces start handing work back and forth,
1510
00:53:07,680 --> 00:53:09,920
new behavior appears that nobody designed.
1511
00:53:09,920 --> 00:53:11,280
That is emergent behavior.
1512
00:53:11,280 --> 00:53:12,640
And in multi-agent systems,
1513
00:53:12,640 --> 00:53:15,120
it is a structural issue rather than a fringe one.
1514
00:53:15,120 --> 00:53:16,800
What makes it dangerous is not drama.
1515
00:53:16,800 --> 00:53:17,920
It is accumulation.
1516
00:53:17,920 --> 00:53:20,800
A small routing preference becomes a pattern.
1517
00:53:20,800 --> 00:53:22,080
A pattern becomes a norm.
1518
00:53:22,080 --> 00:53:24,400
A norm becomes the default path through the system,
1519
00:53:24,400 --> 00:53:26,320
even if nobody intended for it to harden that way.
1520
00:53:26,320 --> 00:53:29,440
That can look like delegation loops where agents keep passing work along
1521
00:53:29,440 --> 00:53:31,280
because each one finds a reason to defer.
1522
00:53:31,280 --> 00:53:32,880
It can look like tool cascades,
1523
00:53:32,880 --> 00:53:35,520
where one decision triggers a tool call that triggers another,
1524
00:53:35,520 --> 00:53:39,120
until the cost and complexity drift far past the original request.
1525
00:53:39,120 --> 00:53:40,560
It can look like rubber stamping,
1526
00:53:40,560 --> 00:53:43,280
where one specialist confirms another without adding any real check,
1527
00:53:43,280 --> 00:53:44,240
or norm drift,
1528
00:53:44,240 --> 00:53:46,000
where agents slowly become more permissive
1529
00:53:46,000 --> 00:53:48,880
because the system keeps treating that behavior as acceptable.
1530
00:53:48,880 --> 00:53:51,280
That is why testing a single prompt is never enough.
1531
00:53:51,280 --> 00:53:54,640
A single prompt tells you if one agent can answer one question,
1532
00:53:54,640 --> 00:53:57,040
but it does not tell you what happens after 30 handoffs
1533
00:53:57,040 --> 00:53:59,280
and repeated retries across a live environment.
1534
00:53:59,280 --> 00:54:02,080
The system behavior lives in the interaction pattern,
1535
00:54:02,080 --> 00:54:04,000
not just in the text any one model returns.
1536
00:54:04,000 --> 00:54:06,320
This is also where a lot of false confidence comes from.
1537
00:54:06,320 --> 00:54:08,800
Teams see safe outputs from each agent
1538
00:54:08,800 --> 00:54:10,560
and assume the whole system is safe.
1539
00:54:10,560 --> 00:54:12,800
But local safety does not guarantee system safety.
1540
00:54:12,800 --> 00:54:14,640
An individual agent can follow its instructions
1541
00:54:14,640 --> 00:54:16,800
and still contribute to a bad collective outcome.
1542
00:54:16,800 --> 00:54:18,720
One agent may escalate correctly,
1543
00:54:18,720 --> 00:54:21,200
while another interprets that escalation too broadly
1544
00:54:21,200 --> 00:54:23,440
and a third takes that interpretation as evidence
1545
00:54:23,440 --> 00:54:25,120
that the task is approved.
1546
00:54:25,120 --> 00:54:27,680
No single step looks catastrophic; the chain does.
1547
00:54:27,680 --> 00:54:30,000
So the controls have to operate at the system level.
1548
00:54:30,000 --> 00:54:31,120
Start with chain limits.
1549
00:54:31,120 --> 00:54:33,680
Decide how many handoffs are allowed
1550
00:54:33,680 --> 00:54:35,760
before the system stops for a review
1551
00:54:35,760 --> 00:54:39,120
or how many tools can be called in one uninterrupted path.
1552
00:54:39,120 --> 00:54:40,480
These are not cosmetic settings.
1553
00:54:40,480 --> 00:54:42,720
They are boundaries that stop small feedback loops
1554
00:54:42,720 --> 00:54:45,200
from turning into expensive or risky behavior.
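Chain limits can be enforced with a per-request budget that every handoff and tool call has to pass through. This is a minimal sketch; the class names and the default limits are hypothetical, and a real platform would pause for review rather than simply raise.

```python
class ChainLimitExceeded(Exception):
    """Raised when a request exceeds its handoff or tool-call budget."""


class ChainBudget:
    def __init__(self, max_handoffs: int = 5, max_tool_calls: int = 8):
        self.max_handoffs = max_handoffs
        self.max_tool_calls = max_tool_calls
        self.handoffs = 0
        self.tool_calls = 0

    def record_handoff(self) -> None:
        """Count one agent-to-agent handoff and stop the chain past the limit."""
        self.handoffs += 1
        if self.handoffs > self.max_handoffs:
            raise ChainLimitExceeded(f"handoff limit {self.max_handoffs} exceeded")

    def record_tool_call(self) -> None:
        """Count one tool call and stop the chain past the limit."""
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise ChainLimitExceeded(f"tool-call limit {self.max_tool_calls} exceeded")
```

The orchestrator would create one budget per request and catch `ChainLimitExceeded` to route the request to review.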
1555
00:54:45,200 --> 00:54:46,720
Then add kill switches.
1556
00:54:46,720 --> 00:54:49,280
If a route starts producing abnormal handoff volume
1557
00:54:49,280 --> 00:54:50,640
or strange tool usage,
1558
00:54:50,640 --> 00:54:53,200
the platform needs a way to pause that path immediately.
1559
00:54:53,200 --> 00:54:55,680
The research around agent governance is very clear
1560
00:54:55,680 --> 00:54:57,920
on rapid revocation and containment
1561
00:54:57,920 --> 00:54:59,120
because in these environments,
1562
00:54:59,120 --> 00:55:01,440
recovery speed matters as much as prevention.
1563
00:55:01,440 --> 00:55:03,040
You also need anomaly detection.
1564
00:55:03,040 --> 00:55:04,560
This is not just for security events
1565
00:55:04,560 --> 00:55:05,760
but for behavioral drift.
1566
00:55:05,760 --> 00:55:08,800
You have to watch if one route is escalating more than expected
1567
00:55:08,800 --> 00:55:12,640
or if a specialist is suddenly touching data sets outside its normal pattern.
1568
00:55:12,640 --> 00:55:14,400
If an approval chain becomes automatic
1569
00:55:14,400 --> 00:55:16,240
even though humans are still in the loop,
1570
00:55:16,240 --> 00:55:18,480
those are system signals that someone has to monitor.
1571
00:55:18,480 --> 00:55:19,760
Role separation helps too.
1572
00:55:20,640 --> 00:55:22,960
Do not let the same agent propose, approve,
1573
00:55:22,960 --> 00:55:24,560
and execute high-impact actions
1574
00:55:24,560 --> 00:55:26,720
through minor variations of the same path.
1575
00:55:26,720 --> 00:55:28,560
Keep analysis separate from authorization
1576
00:55:28,560 --> 00:55:30,560
and recommendation separate from writeback.
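One way to enforce that separation is a check that runs before any high-impact action. This is a sketch with a hypothetical function name and agent identifiers, not a platform feature.

```python
def assert_roles_separated(proposer: str, approver: str, executor: str) -> None:
    """Reject a high-impact action when one identity fills more than one role."""
    roles = {"proposer": proposer, "approver": approver, "executor": executor}
    if len(set(roles.values())) < len(roles):
        raise PermissionError(f"role separation violated: {roles}")
```

Calling it with three distinct identities passes silently; reusing any identity raises before the action runs.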
1577
00:55:30,560 --> 00:55:31,840
That friction is healthy
1578
00:55:31,840 --> 00:55:33,040
because it reduces the chance
1579
00:55:33,040 --> 00:55:35,040
that convenience becomes pseudo-autonomy.
1580
00:55:35,040 --> 00:55:36,800
And for the highest-risk workflows,
1581
00:55:36,800 --> 00:55:38,400
keep human anchoring in place,
1582
00:55:38,400 --> 00:55:39,600
not because humans are faster
1583
00:55:39,600 --> 00:55:41,200
but because humans are the only layer
1584
00:55:41,200 --> 00:55:43,120
that can step outside the logic of the chain
1585
00:55:43,120 --> 00:55:45,120
and ask if the chain itself still makes sense.
1586
00:55:45,120 --> 00:55:47,040
Once agents start reinforcing each other,
1587
00:55:47,040 --> 00:55:49,520
that outside judgment becomes more important.
1588
00:55:49,520 --> 00:55:51,440
So this is the deeper governance lesson.
1589
00:55:51,440 --> 00:55:53,520
Emergence is not a prompt problem.
1590
00:55:53,520 --> 00:55:54,800
It is a system problem.
1591
00:55:54,800 --> 00:55:56,400
And if the risk is systemic,
1592
00:55:56,400 --> 00:55:58,960
the answer cannot live only inside isolated prompts.
1593
00:55:58,960 --> 00:56:00,800
The answer has to live in the topology,
1594
00:56:00,800 --> 00:56:02,160
the limits, the monitoring,
1595
00:56:02,160 --> 00:56:03,600
and the ability to stop a path
1596
00:56:03,600 --> 00:56:05,840
before the system teaches itself the wrong habit.
1597
00:56:05,840 --> 00:56:08,480
Which is exactly why the next design question is not just
1598
00:56:08,480 --> 00:56:09,440
what the agents say
1599
00:56:09,440 --> 00:56:11,600
but how the topology between them is shaped.
1600
00:56:11,600 --> 00:56:16,000
How to design safe multi-agent topology in Copilot Studio.
1601
00:56:16,000 --> 00:56:17,280
So now we get concrete.
1602
00:56:17,280 --> 00:56:19,600
If emergent risk is shaped by topology,
1603
00:56:19,600 --> 00:56:22,720
then topology cannot be an afterthought in Copilot Studio.
1604
00:56:22,720 --> 00:56:25,120
You have to decide upfront how agents relate to each other,
1605
00:56:25,120 --> 00:56:26,240
who speaks to the user
1606
00:56:26,240 --> 00:56:28,480
and which permissions apply in each environment.
1607
00:56:28,480 --> 00:56:30,240
If you leave those choices loose,
1608
00:56:30,240 --> 00:56:31,840
the platform will still let you build something
1609
00:56:31,840 --> 00:56:33,200
but it will be harder to test
1610
00:56:33,200 --> 00:56:34,640
and much easier to misread later.
1611
00:56:34,640 --> 00:56:36,640
Start with one orchestrator.
1612
00:56:36,640 --> 00:56:38,800
One user-facing voice.
1613
00:56:38,800 --> 00:56:41,920
That pattern matters more than it seems.
1614
00:56:41,920 --> 00:56:45,040
Copilot Studio now supports multi-agent orchestration
1615
00:56:45,040 --> 00:56:46,800
but that does not mean every specialist
1616
00:56:46,800 --> 00:56:49,600
should be a peer chatbot exposed directly to the user.
1617
00:56:49,600 --> 00:56:50,800
In most enterprise cases,
1618
00:56:50,800 --> 00:56:52,480
that creates noise through different tones
1619
00:56:52,480 --> 00:56:54,240
and different assumptions.
1620
00:56:54,240 --> 00:56:56,720
The cleaner pattern is one parent agent at the surface
1621
00:56:56,720 --> 00:56:58,640
with a set of bounded specialists behind it.
1622
00:56:58,640 --> 00:57:00,560
That parent agent owns the conversation.
1623
00:57:00,560 --> 00:57:02,080
It receives the user request,
1624
00:57:02,080 --> 00:57:03,600
keeps the interaction coherent
1625
00:57:03,600 --> 00:57:05,440
and decides when to call specialists.
1626
00:57:05,440 --> 00:57:08,160
The specialists do not negotiate with the user in public.
1627
00:57:08,160 --> 00:57:10,960
They operate more like internal tools with domain intelligence.
1628
00:57:10,960 --> 00:57:12,320
They return work products
1629
00:57:12,320 --> 00:57:13,680
and the orchestrator decides
1630
00:57:13,680 --> 00:57:15,360
how those results become a response
1631
00:57:15,360 --> 00:57:16,400
or an action proposal.
1632
00:57:16,400 --> 00:57:18,400
That design reduces a lot of confusion.
1633
00:57:18,400 --> 00:57:20,400
It also reduces cross-agent prompt bleed
1634
00:57:20,400 --> 00:57:21,760
because the user is not bouncing
1635
00:57:21,760 --> 00:57:23,600
between multiple conversation identities
1636
00:57:23,600 --> 00:57:24,960
with overlapping roles.
1637
00:57:24,960 --> 00:57:27,520
One front door, one visible context owner.
1638
00:57:27,520 --> 00:57:29,200
Everything else stays structured behind it.
1639
00:57:29,200 --> 00:57:30,960
Inside that topology,
1640
00:57:30,960 --> 00:57:33,520
specialist agents should return structured outputs.
1641
00:57:33,520 --> 00:57:36,000
Not long prose, not brand new conversations,
1642
00:57:36,000 --> 00:57:37,120
structured outputs.
1643
00:57:37,120 --> 00:57:38,560
That is important in Copilot Studio
1644
00:57:38,560 --> 00:57:39,840
because orchestration works better
1645
00:57:39,840 --> 00:57:41,600
when the downstream result is predictable enough
1646
00:57:41,600 --> 00:57:43,040
to branch on and validate.
1647
00:57:43,040 --> 00:57:44,880
A policy specialist can return a decision
1648
00:57:44,880 --> 00:57:45,920
and its constraints,
1649
00:57:45,920 --> 00:57:48,160
while a workflow specialist returns action status
1650
00:57:48,160 --> 00:57:49,520
and execution detail.
1651
00:57:49,520 --> 00:57:51,040
Once the output stays structured,
1652
00:57:51,040 --> 00:57:52,720
the parent agent can combine it safely
1653
00:57:52,720 --> 00:57:54,720
without guessing what the specialist meant.
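A structured output from a policy specialist might look like the sketch below. The field names, the allowed decision values, and the JSON shape are assumptions for illustration; the point is that the parent validates the payload before it branches on it.

```python
import json
from dataclasses import dataclass

ALLOWED_DECISIONS = {"allow", "deny", "needs_review"}  # hypothetical vocabulary


@dataclass(frozen=True)
class PolicyDecision:
    decision: str
    constraints: list


def parse_policy_decision(raw: str) -> PolicyDecision:
    """Validate a specialist's JSON reply before the parent agent uses it."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    decision = data.get("decision")
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    constraints = data.get("constraints", [])
    if not isinstance(constraints, list) or not all(isinstance(c, str) for c in constraints):
        raise ValueError("constraints must be a list of strings")
    return PolicyDecision(decision, constraints)
```

Anything that fails validation can be treated as a failed handoff instead of being passed to the user as if it were an answer.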
1654
00:57:54,720 --> 00:57:57,680
Knowledge and connectors should also be segmented by role.
1655
00:57:57,680 --> 00:58:00,160
This is where topology and governance meet directly.
1656
00:58:00,160 --> 00:58:02,560
If a specialist is supposed to answer benefits questions,
1657
00:58:02,560 --> 00:58:03,600
give it the benefits corpus
1658
00:58:03,600 --> 00:58:05,760
and only the connectors it actually needs.
1659
00:58:05,760 --> 00:58:07,360
Do not let every sub-agent inherit
1660
00:58:07,360 --> 00:58:08,880
a broad shared data surface
1661
00:58:08,880 --> 00:58:10,640
just because they live under one parent.
1662
00:58:10,640 --> 00:58:12,560
The topology should reflect separation,
1663
00:58:12,560 --> 00:58:16,240
not erase it. And keep state outside the model where control matters.
1664
00:58:16,240 --> 00:58:18,880
Conversation flow can use conversational context
1665
00:58:18,880 --> 00:58:20,880
but approvals and execution records
1666
00:58:20,880 --> 00:58:22,560
should be stored in managed systems.
1667
00:58:22,560 --> 00:58:24,560
Dataverse or SQL exist for a reason.
1668
00:58:24,560 --> 00:58:27,760
If critical state lives only inside transient agent context,
1669
00:58:27,760 --> 00:58:30,240
you lose auditability and operational control.
1670
00:58:30,240 --> 00:58:32,720
Externalized state makes the topology inspectable.
1671
00:58:32,720 --> 00:58:33,600
That is what you want.
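As a rough sketch of externalized state, the example below writes approvals to a table and checks that table before execution. SQLite stands in for Dataverse or SQL here, and the table and function names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the approvals table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS approvals (
               request_id TEXT PRIMARY KEY,
               action     TEXT NOT NULL,
               approver   TEXT NOT NULL,
               status     TEXT NOT NULL,
               decided_at TEXT NOT NULL)"""
    )
    return conn


def record_approval(conn, request_id: str, action: str, approver: str, status: str) -> None:
    """Persist the decision outside the agent's conversational context."""
    conn.execute(
        "INSERT INTO approvals VALUES (?, ?, ?, ?, ?)",
        (request_id, action, approver, status, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def is_approved(conn, request_id: str) -> bool:
    """Execution paths check the record, not what the model remembers."""
    row = conn.execute(
        "SELECT status FROM approvals WHERE request_id = ?", (request_id,)
    ).fetchone()
    return row is not None and row[0] == "approved"
```

Because the record lives in a table, an auditor can query it later without replaying any conversation.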
1672
00:58:33,600 --> 00:58:35,120
Environment design matters too.
1673
00:58:35,120 --> 00:58:37,680
Development, test and production
1674
00:58:37,680 --> 00:58:39,840
should not just be copies with different names.
1675
00:58:39,840 --> 00:58:41,120
They should have different permissions
1676
00:58:41,120 --> 00:58:42,400
and different connector scopes.
1677
00:58:42,400 --> 00:58:44,000
Development needs flexibility
1678
00:58:44,000 --> 00:58:45,920
while production needs tight control.
1679
00:58:45,920 --> 00:58:47,920
If every environment can reach the same systems
1680
00:58:47,920 --> 00:58:49,120
with the same authority,
1681
00:58:49,120 --> 00:58:50,720
you are not really separating risk.
1682
00:58:50,720 --> 00:58:52,080
You are only renaming it.
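A quick way to catch renamed copies is to compare what each environment can reach. This is a sketch only: the environment profiles, connector labels, and the `maker_can_add_connectors` flag are made up for illustration.

```python
# Hypothetical environment profiles; connector names are illustrative only.
ENVIRONMENTS = {
    "dev":  {"connectors": {"sharepoint:sandbox", "dataverse:sandbox"},
             "maker_can_add_connectors": True},
    "test": {"connectors": {"sharepoint:test", "dataverse:test"},
             "maker_can_add_connectors": False},
    "prod": {"connectors": {"sharepoint:benefits"},
             "maker_can_add_connectors": False},
}


def identical_environments(envs: dict) -> list:
    """Return pairs of environments whose reach and authority are exactly the same."""
    names = sorted(envs)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if envs[a] == envs[b]
    ]
```

An empty result means every environment differs in at least one permission or connector scope; any pair returned is a separation that exists in name only.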
1683
00:58:52,080 --> 00:58:54,320
This is also where maker freedom needs guardrails.
1684
00:58:54,320 --> 00:58:57,360
Copilot Studio makes it easy to connect tools and agents
1685
00:58:57,360 --> 00:59:00,320
but safe topology means the patterns are opinionated.
1686
00:59:00,320 --> 00:59:01,600
Parent agent at the front,
1687
00:59:01,600 --> 00:59:02,880
specialists behind,
1688
00:59:02,880 --> 00:59:04,880
narrow tools and managed state.
1689
00:59:04,880 --> 00:59:06,960
If teams drift too far from those basics,
1690
00:59:06,960 --> 00:59:08,800
the architecture starts widening again
1691
00:59:08,800 --> 00:59:10,960
and the governance work begins to thin out.
1692
00:59:10,960 --> 00:59:13,440
So the goal is not maximum agent interaction.
1693
00:59:13,440 --> 00:59:14,960
It is disciplined interaction.
1694
00:59:14,960 --> 00:59:16,960
Enough specialization to improve fit.
1695
00:59:16,960 --> 00:59:18,560
Enough structure to keep control.
1696
00:59:18,560 --> 00:59:20,960
Enough separation to make failure containable.
1697
00:59:20,960 --> 00:59:22,560
And once that topology is in place,
1698
00:59:22,560 --> 00:59:24,880
you can stop talking about abstract patterns
1699
00:59:24,880 --> 00:59:28,160
and finally show how to build the first practical version of this
1700
00:59:28,160 --> 00:59:29,760
in the Microsoft stack.
1701
00:59:29,760 --> 00:59:32,800
Practical build path in Copilot Studio and AI Foundry.
1702
00:59:32,800 --> 00:59:34,400
So what does the first real build look like
1703
00:59:34,400 --> 00:59:38,400
if you want to do this without creating a giant architecture project on day one?
1704
00:59:38,400 --> 00:59:40,800
Start with one front-door agent in Copilot Studio.
1705
00:59:40,800 --> 00:59:43,200
Not three, not ten, one.
1706
00:59:43,200 --> 00:59:45,440
That agent should own the user interaction,
1707
00:59:45,440 --> 00:59:48,400
the channel presence and the orchestration logic at the surface.
1708
00:59:48,400 --> 00:59:50,000
It should know how to collect context,
1709
00:59:50,000 --> 00:59:51,520
apply basic policy checks,
1710
00:59:51,520 --> 00:59:53,680
and decide when to call something behind the scenes.
1711
00:59:53,680 --> 00:59:54,880
Keep the mission narrow,
1712
00:59:54,880 --> 00:59:56,160
one business scenario,
1713
00:59:56,160 --> 00:59:58,640
one user group, one entry path.
1714
00:59:58,640 --> 01:00:00,720
If you start with a vague enterprise assistant,
1715
01:00:00,720 --> 01:00:04,000
you'll rebuild the same generalist problem with better branding.
1716
01:00:04,000 --> 01:00:07,120
Then connect intelligence behind that front door through AI Foundry.
1717
01:00:07,120 --> 01:00:09,280
There are two practical patterns in the research.
1718
01:00:09,280 --> 01:00:12,000
One is bring-your-own-model prompts in Copilot Studio,
1719
01:00:12,000 --> 01:00:16,000
where a prompt tool connects directly to an Azure AI Foundry model endpoint.
1720
01:00:16,000 --> 01:00:18,480
The other is connecting to external or Foundry agents
1721
01:00:18,480 --> 01:00:21,200
through supported endpoints and Entra-based access patterns.
1722
01:00:21,200 --> 01:00:24,320
Which one you choose depends on how much logic lives in Foundry.
1723
01:00:24,320 --> 01:00:26,320
If the back-end need is mostly model access,
1724
01:00:26,320 --> 01:00:27,440
bring your own model is enough.
1725
01:00:27,440 --> 01:00:29,440
If the back-end needs multi-step reasoning,
1726
01:00:29,440 --> 01:00:30,800
tool use or agent logic,
1727
01:00:30,800 --> 01:00:32,720
push that into Foundry and let Studio call it.
1728
01:00:32,720 --> 01:00:35,760
That split matters because a lot of teams put complex reasoning
1729
01:00:35,760 --> 01:00:37,760
directly into Copilot Studio prompts
1730
01:00:37,760 --> 01:00:40,320
and then wonder why the system becomes hard to test.
1731
01:00:40,320 --> 01:00:43,760
Studio is strong at orchestration, channels and workflow composition.
1732
01:00:43,760 --> 01:00:45,520
Foundry is stronger where model control,
1733
01:00:45,520 --> 01:00:48,560
evaluation, routing and back-end reasoning need more structure.
1734
01:00:48,560 --> 01:00:50,160
So for the first practical version,
1735
01:00:50,160 --> 01:00:52,160
build the back-end experts in Foundry.
1736
01:00:52,160 --> 01:00:54,560
Use prompt flow if the expert needs a chain of steps,
1737
01:00:54,560 --> 01:00:57,280
tool calls or controlled input and output handling.
1738
01:00:57,280 --> 01:01:01,440
Use Foundry agent logic if the expert behaves more like a bounded specialist service
1739
01:01:01,440 --> 01:01:02,800
with its own internal actions.
1740
01:01:02,800 --> 01:01:05,120
In both cases, keep the output structured and narrow.
1741
01:01:05,120 --> 01:01:08,400
The Copilot Studio parent should receive something machine-friendly first,
1742
01:01:08,400 --> 01:01:10,800
then turn that into the user-facing response.
1743
01:01:10,800 --> 01:01:12,480
Now make the first expert the router,
1744
01:01:12,480 --> 01:01:13,840
not the final answer generator.
1745
01:01:13,840 --> 01:01:15,840
This is the shortcut most teams skip,
1746
01:01:15,840 --> 01:01:18,800
and it is where the economics of the whole design start.
1747
01:01:18,800 --> 01:01:22,160
Use a small low-cost model path first to classify the request.
1748
01:01:22,160 --> 01:01:25,120
Domain, complexity, risk, action type,
1749
01:01:25,120 --> 01:01:26,480
whether retrieval is needed,
1750
01:01:26,480 --> 01:01:28,160
whether no model is needed at all.
1751
01:01:28,160 --> 01:01:30,720
If the request is just a deterministic workflow trigger,
1752
01:01:30,720 --> 01:01:32,320
route directly to the workflow.
1753
01:01:32,320 --> 01:01:34,640
Do not pass every sentence through premium inference
1754
01:01:34,640 --> 01:01:36,320
just because AI is available.
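The cheap-first routing step can be sketched as a classifier followed by a route decision. Everything here is an assumption for illustration: the keyword rules stand in for a call to a small low-cost model, and the route names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    domain: str
    complexity: str    # "low" or "high"
    risk: str          # "low" or "high"
    needs_model: bool  # False when a deterministic workflow is enough


def classify(request: str) -> Classification:
    """Placeholder for a small-model call; keyword rules stand in here."""
    text = request.lower()
    if text.startswith("reset password"):
        return Classification("it", "low", "low", needs_model=False)
    if "policy" in text or "contract" in text:
        return Classification("policy", "high", "high", needs_model=True)
    return Classification("knowledge", "low", "low", needs_model=True)


def choose_route(c: Classification) -> str:
    """Cheapest adequate path first; escalate only on complexity or risk."""
    if not c.needs_model:
        return "workflow"
    if c.complexity == "high" or c.risk == "high":
        return "premium_model"
    return "small_model"
```

In this sketch a password reset never reaches a model at all, a routine question stays on the small model, and only the policy question pays for premium inference.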
1755
01:01:36,320 --> 01:01:38,560
Then add one or two specialists behind that router.
1756
01:01:38,560 --> 01:01:40,160
That is enough for a serious pilot.
1757
01:01:40,160 --> 01:01:41,680
For example, one knowledge specialist
1758
01:01:41,680 --> 01:01:43,120
and one workflow specialist
1759
01:01:43,120 --> 01:01:45,520
or one policy specialist and one extraction specialist.
1760
01:01:45,520 --> 01:01:47,040
The point is not to show variety.
1761
01:01:47,040 --> 01:01:49,680
The point is to prove that the route improves fit.
1762
01:01:49,680 --> 01:01:51,440
If you cannot show value with a router
1763
01:01:51,440 --> 01:01:52,960
and two bounded specialists,
1764
01:01:52,960 --> 01:01:54,880
adding six more will only hide the problem.
1765
01:01:54,880 --> 01:01:57,920
Data should come in through approved indexes and narrow scopes.
1766
01:01:57,920 --> 01:01:59,680
If the specialist needs retrieval,
1767
01:01:59,680 --> 01:02:02,000
connect only the corpus that belongs to that role.
1768
01:02:02,000 --> 01:02:04,480
If the route needs a SharePoint-backed knowledge path,
1769
01:02:04,480 --> 01:02:05,600
scope it deliberately.
1770
01:02:05,600 --> 01:02:08,240
If a workflow needs Dataverse or another connector,
1771
01:02:08,240 --> 01:02:10,240
limit the rights to what that workflow needs.
1772
01:02:10,240 --> 01:02:12,320
This is where many early pilots go wrong.
1773
01:02:12,320 --> 01:02:15,680
They prove the user experience by quietly overexposing data
1774
01:02:15,680 --> 01:02:17,920
and then governance has to unwind the pilot later.
1775
01:02:17,920 --> 01:02:18,800
Do the opposite.
1776
01:02:18,800 --> 01:02:20,160
Constrain early.
1777
01:02:20,160 --> 01:02:22,160
That forces the architecture to stay honest.
1778
01:02:22,160 --> 01:02:24,240
Before you publish anything, evaluate the route
1779
01:02:24,240 --> 01:02:25,520
and the outputs together.
1780
01:02:25,520 --> 01:02:27,760
Use representative prompts, inspect traces.
1781
01:02:27,760 --> 01:02:29,360
Confirm where traffic goes.
1782
01:02:29,360 --> 01:02:32,640
Check whether the router is classifying the request types you expected.
1783
01:02:32,640 --> 01:02:35,920
Look at latency across the end-to-end path, not just the model call,
1784
01:02:35,920 --> 01:02:38,320
and make sure policy, logging, and ownership are in place
1785
01:02:38,320 --> 01:02:40,080
before production publication.
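That evaluation step can start very small. The sketch below assumes you have a set of representative prompts labeled with the route you expect; `evaluate_router` and the toy router in the usage note are hypothetical, not part of Copilot Studio or Foundry.

```python
from collections import Counter


def evaluate_router(router, labeled_prompts):
    """Return (accuracy, traffic per route) for (prompt, expected_route) pairs."""
    if not labeled_prompts:
        raise ValueError("need at least one labeled prompt")
    hits = 0
    traffic = Counter()
    for prompt, expected in labeled_prompts:
        actual = router(prompt)
        traffic[actual] += 1      # where the traffic actually went
        hits += actual == expected
    return hits / len(labeled_prompts), dict(traffic)
```

For example, with a toy router `lambda p: "workflow" if "reset" in p else "knowledge"` and three labeled prompts, the function reports both how often the route matched expectations and how the traffic was distributed, which are the two things to confirm before publishing.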
1786
01:02:40,080 --> 01:02:43,360
A working demo is not the same thing as a deployable service.
1787
01:02:43,360 --> 01:02:45,120
So the practical sequence is simple.
1788
01:02:45,120 --> 01:02:48,320
One front-door Copilot Studio agent, one cheap routing layer,
1789
01:02:48,320 --> 01:02:50,640
one or two back-end experts in Foundry,
1790
01:02:50,640 --> 01:02:53,600
narrow data scopes, structured outputs,
1791
01:02:53,600 --> 01:02:55,120
evaluation before release.
1792
01:02:55,120 --> 01:02:57,520
That is enough to move from theory into a governed pattern
1793
01:02:57,520 --> 01:03:00,560
without pretending you need a massive agent estate on day one.
1794
01:03:00,560 --> 01:03:03,280
And that matters because the right first architecture
1795
01:03:03,280 --> 01:03:04,640
is not the most advanced one.
1796
01:03:04,640 --> 01:03:07,040
It is the one your organization can actually operate,
1797
01:03:07,040 --> 01:03:09,520
review, and improve without losing control.
1798
01:03:09,520 --> 01:03:12,160
Maturity model from pilot to enterprise fabric.
1799
01:03:12,160 --> 01:03:15,040
At this point, the biggest mistake is assuming the architecture
1800
01:03:15,040 --> 01:03:17,040
has to appear fully formed on day one.
1801
01:03:17,040 --> 01:03:18,160
It shouldn't.
1802
01:03:18,160 --> 01:03:20,960
A mature expert fabric is not a starting position.
1803
01:03:20,960 --> 01:03:22,080
It is a progression.
1804
01:03:22,080 --> 01:03:25,600
And if teams skip that progression, they usually build too much too early,
1805
01:03:25,600 --> 01:03:26,880
add governance after the fact
1806
01:03:26,880 --> 01:03:29,520
and end up managing complexity they have not earned yet.
1807
01:03:29,520 --> 01:03:32,640
So the useful question is not, what is the perfect end state?
1808
01:03:32,640 --> 01:03:33,360
It is:
1809
01:03:33,360 --> 01:03:34,720
What stage are we actually in?
1810
01:03:34,720 --> 01:03:36,720
And what does the next responsible step look like?
1811
01:03:36,720 --> 01:03:40,560
Stage one is the narrow pilot: one agent, one fixed model,
1812
01:03:40,560 --> 01:03:41,680
one bounded task.
1813
01:03:41,680 --> 01:03:43,120
Not because that is the final answer,
1814
01:03:43,120 --> 01:03:44,960
but because it gives you a clean baseline.
1815
01:03:44,960 --> 01:03:46,240
You need to know the workflow,
1816
01:03:46,240 --> 01:03:47,840
the failure modes, the user behavior,
1817
01:03:47,840 --> 01:03:49,680
the data needs, and the operational burden
1818
01:03:49,680 --> 01:03:51,360
before you multiply parts.
1819
01:03:51,360 --> 01:03:53,040
If the task is still fuzzy at this stage,
1820
01:03:53,040 --> 01:03:53,840
that is a warning.
1821
01:03:53,840 --> 01:03:56,400
It means your expert boundaries are not ready yet.
1822
01:03:56,400 --> 01:03:59,280
A pilot should reduce ambiguity, not scale it.
1823
01:03:59,280 --> 01:04:01,600
Stage two is where routing starts to enter.
1824
01:04:01,600 --> 01:04:03,600
Add a cheap router and one premium fallback.
1825
01:04:03,600 --> 01:04:05,600
That is the first meaningful architectural shift
1826
01:04:05,600 --> 01:04:07,600
because now the system begins distinguishing
1827
01:04:07,600 --> 01:04:09,840
between common traffic and difficult traffic.
1828
01:04:09,840 --> 01:04:10,800
It is still manageable.
1829
01:04:10,800 --> 01:04:13,120
You are not running a network of specialists yet.
1830
01:04:13,120 --> 01:04:16,320
You are proving that cheap-first logic works in your environment
1831
01:04:16,320 --> 01:04:18,320
with your users on your workload.
1832
01:04:18,320 --> 01:04:20,560
This is where you learn the basics of route accuracy,
1833
01:04:20,560 --> 01:04:22,800
escalation behavior, and cost distribution
1834
01:04:22,800 --> 01:04:25,760
without taking on full multi-agent governance.
1835
01:04:25,760 --> 01:04:28,960
Then comes stage three: multiple specialists with governed handoffs.
1836
01:04:28,960 --> 01:04:31,680
Now the architecture starts behaving like a real expert system
1837
01:04:31,680 --> 01:04:34,800
because the router no longer just chooses between cheap and expensive.
1838
01:04:34,800 --> 01:04:36,640
It chooses between different kinds of work,
1839
01:04:36,640 --> 01:04:39,680
a knowledge path, a workflow path, a policy path,
1840
01:04:39,680 --> 01:04:42,720
maybe an extraction path. At this point, ownership matters more.
1841
01:04:42,720 --> 01:04:45,040
Logging matters more, connector discipline matters more.
1842
01:04:45,040 --> 01:04:47,120
You are no longer just optimizing model spend.
1843
01:04:47,120 --> 01:04:49,520
You are shaping a controlled operating structure.
1844
01:04:49,520 --> 01:04:52,320
And because handoffs now affect quality and risk directly,
1845
01:04:52,320 --> 01:04:55,280
each specialist needs a clear charter and a clear stop boundary.
1846
01:04:55,280 --> 01:04:56,800
Stage four is the enterprise fabric.
1847
01:04:56,800 --> 01:05:00,000
That means policy-controlled model catalogs, lifecycle reviews,
1848
01:05:00,000 --> 01:05:03,040
telemetry, environment discipline, central inventory,
1849
01:05:03,040 --> 01:05:05,840
and an operating model that clearly separates platform ownership
1850
01:05:05,840 --> 01:05:07,040
from domain ownership.
1851
01:05:07,040 --> 01:05:09,600
This is where Agent 365, Azure Policy,
1852
01:05:09,600 --> 01:05:13,120
trace-based evaluation, and cost governance stop feeling optional.
1853
01:05:13,120 --> 01:05:16,160
The architecture is wide enough now that without those controls,
1854
01:05:16,160 --> 01:05:17,840
drift becomes normal.
1855
01:05:17,840 --> 01:05:19,680
So stage four is not more agents.
1856
01:05:19,680 --> 01:05:23,120
It is stable coordination across agents, models, tools, and teams.
1857
01:05:23,120 --> 01:05:25,920
This progression matters because organizational readiness
1858
01:05:25,920 --> 01:05:28,080
is just as important as technical readiness.
1859
01:05:28,080 --> 01:05:30,560
A team may be able to build a multi-agent demo in a week.
1860
01:05:30,560 --> 01:05:33,280
That does not mean the organization is ready to own one in production.
1861
01:05:33,280 --> 01:05:35,920
Do you have a model approval process?
1862
01:05:35,920 --> 01:05:37,040
Do you have named owners?
1863
01:05:37,040 --> 01:05:38,320
Do you have logging standards?
1864
01:05:38,320 --> 01:05:40,080
Do you have a place to review incidents?
1865
01:05:40,080 --> 01:05:41,920
Do you know how to revoke access quickly?
1866
01:05:41,920 --> 01:05:44,480
If the answer is no, then your architecture maturity is lower
1867
01:05:44,480 --> 01:05:45,760
than your prototype maturity.
1868
01:05:45,760 --> 01:05:47,440
And that gap becomes dangerous fast.
1869
01:05:47,440 --> 01:05:49,520
This is why I would always match the target design
1870
01:05:49,520 --> 01:05:51,280
to the current operating discipline.
1871
01:05:51,280 --> 01:05:54,880
If governance is still immature, stay closer to stage one or two.
1872
01:05:54,880 --> 01:05:57,200
If the workload is high volume and well understood,
1873
01:05:57,200 --> 01:05:59,040
move into stage three with care.
1874
01:05:59,040 --> 01:06:01,280
If multiple business units are already building agents
1875
01:06:01,280 --> 01:06:03,280
and the platform team can provide controls,
1876
01:06:03,280 --> 01:06:04,880
then stage four starts making sense.
1877
01:06:04,880 --> 01:06:06,640
But don't romanticize the end state.
1878
01:06:06,640 --> 01:06:09,040
Enterprise fabric is not automatically better
1879
01:06:09,040 --> 01:06:10,240
just because it is bigger.
1880
01:06:10,240 --> 01:06:12,560
It is only better when the next layer of structure
1881
01:06:12,560 --> 01:06:15,120
solves a real problem that the previous layer
1882
01:06:15,120 --> 01:06:16,400
could not handle cleanly.
1883
01:06:16,400 --> 01:06:17,760
So the practical rule is simple.
1884
01:06:17,760 --> 01:06:19,840
Prove the first route before you expand the map.
1885
01:06:19,840 --> 01:06:22,240
Prove the first specialist before you add three more.
1886
01:06:22,240 --> 01:06:23,520
Prove the first governance gate
1887
01:06:23,520 --> 01:06:25,200
before you depend on 10 of them.
1888
01:06:25,200 --> 01:06:27,360
That pacing keeps the architecture honest.
1889
01:06:27,360 --> 01:06:29,560
It also prevents teams from turning future ambition
1890
01:06:29,560 --> 01:06:30,720
into present complexity.
1891
01:06:30,720 --> 01:06:32,520
And once you see maturity that way,
1892
01:06:32,520 --> 01:06:35,480
one final design question becomes much easier to answer.
1893
01:06:35,480 --> 01:06:38,360
When should you not build a mixture of experts at all?
1894
01:06:38,360 --> 01:06:40,320
When not to use a mixture of experts.
1895
01:06:40,320 --> 01:06:41,560
Now we need the counterweight
1896
01:06:41,560 --> 01:06:43,560
because once people understand the expert fabric,
1897
01:06:43,560 --> 01:06:44,800
they swing too hard.
1898
01:06:44,800 --> 01:06:46,640
They start treating it like the default answer
1899
01:06:46,640 --> 01:06:47,720
to every AI problem.
1900
01:06:47,720 --> 01:06:48,720
It isn't.
1901
01:06:48,720 --> 01:06:50,440
A mixture of experts is a structural answer
1902
01:06:50,440 --> 01:06:52,200
to a specific kind of complexity.
1903
01:06:52,200 --> 01:06:53,960
If that complexity isn't there yet,
1904
01:06:53,960 --> 01:06:55,760
the structure just becomes overhead.
1905
01:06:55,760 --> 01:06:57,240
The first case is low volume.
1906
01:06:57,240 --> 01:06:59,840
If your request count is small and the task is stable,
1907
01:06:59,840 --> 01:07:02,600
a routed architecture solves a problem you don't actually have.
1908
01:07:02,600 --> 01:07:04,680
The design might look cleaner on a whiteboard
1909
01:07:04,680 --> 01:07:06,520
and specialist boundaries sound mature,
1910
01:07:06,520 --> 01:07:08,200
but the math doesn't always work.
1911
01:07:08,200 --> 01:07:10,000
If the workload only sees limited traffic,
1912
01:07:10,000 --> 01:07:11,760
the savings from routing will never outrun
1913
01:07:11,760 --> 01:07:14,400
the cost of building and testing those extra moving parts.
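The math can be checked with a back-of-the-envelope break-even calculation. This is a sketch with hypothetical parameter names; the figures in the usage note are invented inputs, not data from the episode.

```python
def breakeven_months(build_cost: float, monthly_requests: int,
                     saving_per_request: float, monthly_ops_cost: float = 0.0):
    """Months until routing savings repay the build cost, or None if they never do."""
    net_monthly_saving = monthly_requests * saving_per_request - monthly_ops_cost
    if net_monthly_saving <= 0:
        return None  # the extra moving parts cost more than they save
    return build_cost / net_monthly_saving
```

With invented inputs of a 20,000 build cost, 5,000 requests a month, 0.02 saved per request, and 50 a month to operate, the net saving is 50 a month and the payback takes 400 months. At 1,000 requests a month the function returns `None`, because the routing layer never pays for itself.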
1914
01:07:14,400 --> 01:07:16,480
In that situation, one well-bounded model
1915
01:07:16,480 --> 01:07:18,840
with clear prompts and strong review discipline
1916
01:07:18,840 --> 01:07:20,360
is usually the better decision.
1917
01:07:20,360 --> 01:07:21,640
The second case is low risk.
1918
01:07:21,640 --> 01:07:23,760
If the agent isn't touching sensitive data
1919
01:07:23,760 --> 01:07:25,360
or triggering business actions,
1920
01:07:25,360 --> 01:07:27,720
heavy orchestration is often unnecessary.
1921
01:07:27,720 --> 01:07:29,400
You still need controls and ownership,
1922
01:07:29,400 --> 01:07:31,320
but the full fabric might be more structure
1923
01:07:31,320 --> 01:07:32,560
than the situation demands.
1924
01:07:32,560 --> 01:07:34,720
A simple architecture isn't a failure
1925
01:07:34,720 --> 01:07:36,000
if it fits the task.
1926
01:07:36,000 --> 01:07:38,360
Actually, that's usually the more responsible choice.
1927
01:07:38,360 --> 01:07:39,800
Variation matters too.
1928
01:07:39,800 --> 01:07:43,080
If the team is still learning what users want,
1929
01:07:43,080 --> 01:07:44,680
your expert boundaries will be weak
1930
01:07:44,680 --> 01:07:46,720
because the task itself is moving.
1931
01:07:46,720 --> 01:07:47,800
When boundaries are weak,
1932
01:07:47,800 --> 01:07:49,560
specialists start overlapping
1933
01:07:49,560 --> 01:07:51,520
and one agent ends up handling the same work
1934
01:07:51,520 --> 01:07:52,360
as another.
1935
01:07:52,360 --> 01:07:54,720
Escalation rules get fuzzy and routes become unstable
1936
01:07:54,720 --> 01:07:57,040
because the categories underneath them aren't settled yet.
1937
01:07:57,040 --> 01:07:58,600
That's a sign to pause and simplify.
1938
01:07:58,600 --> 01:08:00,200
Before you create multiple experts,
1939
01:08:00,200 --> 01:08:01,800
make sure the work actually separates
1940
01:08:01,800 --> 01:08:03,240
into distinct categories.
1941
01:08:03,240 --> 01:08:05,120
Small teams need to hear this clearly.
1942
01:08:05,120 --> 01:08:07,440
A multi-agent system creates ongoing obligations
1943
01:08:07,440 --> 01:08:09,480
that don't go away once the code is written.
1944
01:08:09,480 --> 01:08:11,080
Someone has to maintain the prompts,
1945
01:08:11,080 --> 01:08:12,880
someone has to review the traces
1946
01:08:12,880 --> 01:08:15,080
and someone has to update the connectors and policies.
1947
01:08:15,080 --> 01:08:16,680
If the team doesn't have the capacity
1948
01:08:16,680 --> 01:08:18,160
to run that operating model,
1949
01:08:18,160 --> 01:08:20,160
the architecture becomes fragile very quickly.
1950
01:08:20,160 --> 01:08:22,520
You don't get maturity just by drawing more boxes.
1951
01:08:22,520 --> 01:08:23,880
You get it by sustaining the controls
1952
01:08:23,880 --> 01:08:25,440
around those boxes over time.
1953
01:08:25,440 --> 01:08:27,040
Governance maturity is another real limit.
1954
01:08:27,040 --> 01:08:28,640
If your organization still struggles
1955
01:08:28,640 --> 01:08:30,800
with basic ownership or permission hygiene,
1956
01:08:30,800 --> 01:08:33,680
adding more agents increases risk faster than value.
1957
01:08:33,680 --> 01:08:35,240
The system might look advanced,
1958
01:08:35,240 --> 01:08:36,280
but underneath it,
1959
01:08:36,280 --> 01:08:39,040
the same old weaknesses stay in place.
1960
01:08:39,040 --> 01:08:42,360
Unclear owners, broad access, missing logs.
1961
01:08:42,360 --> 01:08:44,280
In that condition, the right next step
1962
01:08:44,280 --> 01:08:46,000
is to strengthen the platform baseline
1963
01:08:46,000 --> 01:08:48,280
before you start multiplying specialists.
1964
01:08:48,280 --> 01:08:49,480
There is also a temporary case
1965
01:08:49,480 --> 01:08:52,160
where one strong model is simply the smarter answer.
1966
01:08:52,160 --> 01:08:54,240
If the workload is broad and the task design
1967
01:08:54,240 --> 01:08:55,160
is still emerging,
1968
01:08:55,160 --> 01:08:58,000
one capable model gives you a cleaner learning period.
1969
01:08:58,000 --> 01:09:00,080
You gather evidence and see which requests repeat,
1970
01:09:00,080 --> 01:09:01,240
which helps you discover
1971
01:09:01,240 --> 01:09:03,640
where real specialist boundaries might eventually sit.
1972
01:09:03,640 --> 01:09:05,320
That isn't anti-architecture.
1973
01:09:05,320 --> 01:09:06,960
It's sequencing.
1974
01:09:06,960 --> 01:09:09,760
Sometimes the shortest path to a good expert fabric
1975
01:09:09,760 --> 01:09:11,560
is starting with a simple setup
1976
01:09:11,560 --> 01:09:14,320
and only splitting it once the patterns are obvious.
1977
01:09:14,320 --> 01:09:16,280
Another warning sign is when people want a mixture
1978
01:09:16,280 --> 01:09:18,880
of experts mainly because it sounds sophisticated.
1979
01:09:18,880 --> 01:09:20,480
That leads to decorative specialists
1980
01:09:20,480 --> 01:09:22,120
and experts with weak missions.
1981
01:09:22,120 --> 01:09:24,600
Those routes exist because the architecture team
1982
01:09:24,600 --> 01:09:25,920
wanted elegance,
1983
01:09:25,920 --> 01:09:27,920
not because the users actually needed it.
1984
01:09:27,920 --> 01:09:30,080
If a specialist cannot explain its unique role
1985
01:09:30,080 --> 01:09:31,480
and its unique data scope,
1986
01:09:31,480 --> 01:09:33,240
it probably shouldn't exist yet.
1987
01:09:33,240 --> 01:09:35,600
So the real decision rule isn't more agents
1988
01:09:35,600 --> 01:09:37,040
equals more maturity.
1989
01:09:37,040 --> 01:09:38,440
It's fit.
1990
01:09:38,440 --> 01:09:41,280
Use a mixture of experts when workload shape, cost pressure,
1991
01:09:41,280 --> 01:09:44,000
and governance capability all point in the same direction.
1992
01:09:44,000 --> 01:09:45,000
Hold back when they don't.
1993
01:09:45,000 --> 01:09:47,080
The point isn't to replace every single-model system
1994
01:09:47,080 --> 01:09:47,920
on principle.
1995
01:09:47,920 --> 01:09:50,360
The point is to stop forcing one model to do jobs
1996
01:09:50,360 --> 01:09:52,240
that clearly need different treatment.
1997
01:09:52,240 --> 01:09:53,840
And that brings us to the final layer
1998
01:09:53,840 --> 01:09:55,680
because even when the architecture is right,
1999
01:09:55,680 --> 01:09:58,120
it only holds if the organization knows who owns what.
2000
01:09:58,120 --> 01:10:00,200
The operating model behind the architecture.
2001
01:10:00,200 --> 01:10:01,720
Now we get to the part that decides
2002
01:10:01,720 --> 01:10:03,440
whether any of this survives.
2003
01:10:03,440 --> 01:10:05,320
Because architecture on paper is easy.
2004
01:10:05,320 --> 01:10:07,120
Operating it is where the truth shows up
2005
01:10:07,120 --> 01:10:09,440
and an expert fabric doesn't run on prompts alone.
2006
01:10:09,440 --> 01:10:11,440
It runs on ownership.
2007
01:10:11,440 --> 01:10:12,800
If ownership stays blurry,
2008
01:10:12,800 --> 01:10:15,040
the system drifts back toward the same confusion
2009
01:10:15,040 --> 01:10:17,600
we started with, just spread across more components.
2010
01:10:17,600 --> 01:10:19,400
So split responsibility on purpose.
2011
01:10:19,400 --> 01:10:21,120
The platform team should own the guardrails.
2012
01:10:21,120 --> 01:10:23,280
That means the model catalog, the routing standards,
2013
01:10:23,280 --> 01:10:26,200
the policy controls, and the observability stack.
2014
01:10:26,200 --> 01:10:28,280
They aren't there to write every expert prompt
2015
01:10:28,280 --> 01:10:30,320
or babysit every business workflow.
2016
01:10:30,320 --> 01:10:32,920
They are there to define the safe operating envelope:
2017
01:10:32,920 --> 01:10:36,160
which models are approved, which deployment paths are allowed,
2018
01:10:36,160 --> 01:10:37,480
which traces must be captured.
2019
01:10:37,480 --> 01:10:38,720
Without that platform layer,
2020
01:10:38,720 --> 01:10:41,080
every domain team rebuilds governance differently
2021
01:10:41,080 --> 01:10:43,240
and the fabric fragments almost immediately.
2022
01:10:43,240 --> 01:10:44,640
Then domain teams own the experts.
2023
01:10:44,640 --> 01:10:45,800
Not the whole platform.
2024
01:10:45,800 --> 01:10:48,360
Their experts. That includes the expert charter,
2025
01:10:48,360 --> 01:10:50,080
the prompt logic, the data boundaries,
2026
01:10:50,080 --> 01:10:53,440
and the escalation rules for their specific process.
2027
01:10:53,440 --> 01:10:55,480
If an HR policy expert exists,
2028
01:10:55,480 --> 01:10:57,600
the team accountable for HR should own
2029
01:10:57,600 --> 01:10:59,640
whether that agent is correct and useful.
2030
01:10:59,640 --> 01:11:01,560
If a finance workflow expert exists,
2031
01:11:01,560 --> 01:11:04,400
finance operations should define what the agent can recommend
2032
01:11:04,400 --> 01:11:05,960
and where human review sits.
2033
01:11:05,960 --> 01:11:08,120
This is where a lot of AI programs fail.
2034
01:11:08,120 --> 01:11:10,200
They centralize everything in one technical team
2035
01:11:10,200 --> 01:11:12,400
and then the specialists stop being real specialists
2036
01:11:12,400 --> 01:11:14,480
because the business never truly owns them.
2037
01:11:14,480 --> 01:11:16,720
Security and compliance sit beside both groups,
2038
01:11:16,720 --> 01:11:17,800
not underneath them.
2039
01:11:17,800 --> 01:11:19,720
Their job isn't to become the routing designer.
2040
01:11:19,720 --> 01:11:22,080
Their job is to review controls, map risk,
2041
01:11:22,080 --> 01:11:24,560
and make sure the fabric can be defended when something goes wrong.
2042
01:11:24,560 --> 01:11:27,120
In a multi-agent system, incidents are rarely neat.
2043
01:11:27,120 --> 01:11:29,040
You need predefined response paths:
2044
01:11:29,040 --> 01:11:31,520
who pauses an agent, who revokes access,
2045
01:11:31,520 --> 01:11:33,600
who decides whether a route goes back online.
2046
01:11:33,600 --> 01:11:35,880
If those answers only exist informally,
2047
01:11:35,880 --> 01:11:39,040
the first serious problem turns into organizational improv.
2048
01:11:39,040 --> 01:11:41,360
Finance and operations also need a real seat here.
2049
01:11:41,360 --> 01:11:43,280
Once you move into a routed architecture,
2050
01:11:43,280 --> 01:11:45,200
spend has to be measured per outcome.
2051
01:11:45,200 --> 01:11:47,000
Not just as one total AI bill.
2052
01:11:47,000 --> 01:11:48,880
You need to know which expert path costs more
2053
01:11:48,880 --> 01:11:50,960
and which domain escalates too often.
2054
01:11:50,960 --> 01:11:53,040
If finance only sees aggregate spend,
2055
01:11:53,040 --> 01:11:55,200
the architecture becomes hard to improve.
2056
01:11:55,200 --> 01:11:56,680
Cost needs to map to behavior.
2057
01:11:56,680 --> 01:11:59,200
That's how you see whether the structure is actually working.
2058
01:11:59,200 --> 01:12:01,680
This is exactly why a center of excellence pattern helps,
2059
01:12:01,680 --> 01:12:04,280
provided it's done properly: not as a vanity committee,
2060
01:12:04,280 --> 01:12:05,600
but as a working review layer.
2061
01:12:05,600 --> 01:12:07,920
The COE can publish templates for expert charters
2062
01:12:07,920 --> 01:12:09,680
and route designs so teams don't solve
2063
01:12:09,680 --> 01:12:11,280
the same problem five different ways.
2064
01:12:11,280 --> 01:12:13,880
It shortens the path between innovation and control
2065
01:12:13,880 --> 01:12:16,120
because teams don't have to invent the model
2066
01:12:16,120 --> 01:12:17,240
from scratch every time.
2067
01:12:17,240 --> 01:12:19,400
But the COE should not become a bottleneck
2068
01:12:19,400 --> 01:12:20,400
that owns everything.
2069
01:12:20,400 --> 01:12:23,240
If it owns everything, domain ownership collapses again.
2070
01:12:23,240 --> 01:12:26,520
The healthier pattern is shared structure, local accountability.
2071
01:12:26,520 --> 01:12:28,160
Platform defines the rails.
2072
01:12:28,160 --> 01:12:31,600
Domain teams drive the mission. Security validates the controls.
2073
01:12:31,600 --> 01:12:33,240
Finance tracks value and waste.
2074
01:12:33,240 --> 01:12:35,840
The COE helps the whole system stay consistent.
2075
01:12:35,840 --> 01:12:38,000
And one level deeper, this is why the architecture
2076
01:12:38,000 --> 01:12:40,640
changes behavior only when ownership is explicit.
2077
01:12:40,640 --> 01:12:42,600
A routed system is really an operating model
2078
01:12:42,600 --> 01:12:43,680
for decision making.
2079
01:12:43,680 --> 01:12:45,920
It decides who is allowed to define expertise
2080
01:12:45,920 --> 01:12:48,320
and who is responsible when a route fails in production.
2081
01:12:48,320 --> 01:12:51,040
If those lines are clear, the fabric can evolve without dissolving.
2082
01:12:51,040 --> 01:12:53,720
If those lines are weak, every new expert adds ambiguity
2083
01:12:53,720 --> 01:12:54,640
instead of capability.
2084
01:12:54,640 --> 01:12:57,680
So after all of this, the shift is bigger than model choice.
2085
01:12:57,680 --> 01:12:59,480
You are replacing a generalist bot
2086
01:12:59,480 --> 01:13:01,960
with a managed system of bounded responsibilities.
2087
01:13:01,960 --> 01:13:04,280
And that is the actual promise we opened with.
2088
01:13:04,280 --> 01:13:06,600
The shift is simple, even if the build is not:
2089
01:13:06,600 --> 01:13:09,080
stop treating one generalist bot like the answer.
2090
01:13:09,080 --> 01:13:10,560
Start designing a governed fabric.
2091
01:13:10,560 --> 01:13:12,240
Copilot Studio handles the experience,
2092
01:13:12,240 --> 01:13:15,680
Foundry handles the intelligence, small models route the work,
2093
01:13:15,680 --> 01:13:17,280
specialists do bounded jobs.
2094
01:13:17,280 --> 01:13:19,360
If this changed how you think, follow me,
2095
01:13:19,360 --> 01:13:21,080
Mirko Peters, on LinkedIn.
2096
01:13:21,080 --> 01:13:23,200
And if you want more of this, leave a review.
2097
01:13:23,200 --> 01:13:24,720
It helps more people find it.
2098
01:13:24,720 --> 01:13:26,680
Tell me where your architecture is breaking
2099
01:13:26,680 --> 01:13:27,640
or tell me what comes next,
2100
01:13:27,640 --> 01:13:30,200
routing, Agent 365 governance, or cost design.









