June 6, 2026

How to Trumpify Your Copilot: A Masterclass in Hallucination

Show Notes
Transcript

Everyone talks about hallucinations as if they're a model problem. They blame GPT-4, Claude, Gemini, or whatever large language model happens to be in the spotlight this week. They tweak prompts, add more tokens, experiment with different temperatures, and hope the problem magically disappears.But what if hallucinations aren't a model problem at all?What if your Copilot is working exactly as designed?In this episode of the M365 FM Podcast, we take a deep dive into the real causes of hallucinations in Microsoft Copilot, Retrieval-Augmented Generation (RAG) systems, enterprise AI deployments, and custom agents. Through a deliberately provocative thought experiment, we explore how organizations accidentally engineer systems that reward confident wrong answers while creating the illusion of governance, compliance, and control.This isn't an episode about prompt tricks. It's an architectural masterclass on why AI systems hallucinate and how poor retrieval, weak governance, bad permissions, noisy data, and flawed orchestration combine to create enterprise-scale misinformation engines.

THE MYTH OF THE BROKEN MODEL

Most organizations assume hallucinations originate inside the large language model itself.The reality is more uncomfortable.Large Language Models are trained to predict the next token, not to discover truth. Reinforcement Learning from Human Feedback rewards helpfulness, fluency, and confidence. The result is a system optimized to sound correct even when certainty is impossible.In this episode, we explore how benchmark design, human evaluation systems, and model training methodologies unintentionally create incentives that reward plausible answers over accurate answers.The shocking conclusion is that many hallucinations are not bugs. They are the logical outcome of the objectives we gave the model.

THE INTERNET IS NOT A KNOWLEDGE BASE

Even if we could fix training incentives, another challenge remains.The internet itself is noisy.Enterprise AI systems inherit contradictions, outdated information, misinformation, duplicated content, and conflicting perspectives from their training data. Organizations then amplify these problems by feeding Copilot equally chaotic internal data repositories.Old SharePoint sites, archived policies, forgotten Teams channels, abandoned project documentation, draft documents, and outdated procedures all compete for retrieval priority.The result is a retrieval ecosystem where truth becomes increasingly difficult to distinguish from noise.

RETRIEVAL AS A HALLUCINATION ENGINE

Retrieval-Augmented Generation was supposed to solve hallucinations.Instead, poorly implemented retrieval systems often create them.In this episode we examine why Top-K retrieval, vector search, semantic ranking, and context window limitations frequently surface conflicting information rather than authoritative information.You will learn why retrieval systems don't necessarily return the correct answer. They return the most statistically similar content.And those are not the same thing.

THE LOST IN THE MIDDLE PROBLEM

Modern language models can process enormous context windows.That doesn't mean they process everything equally.We explore one of the most overlooked problems in enterprise AI architecture: information buried in the middle of retrieved content often receives less attention than content appearing at the beginning or end of the context window.This creates situations where critical evidence exists inside the retrieval set but still fails to influence the final answer.

WHEN GROUNDING BECOMES A LIABILITY

Grounding is supposed to prevent hallucinations.Unfortunately, grounding only works when the context itself is trustworthy.When organizations blindly concatenate multiple documents into a single prompt, conflicting information becomes flattened into one giant evidence pool. The model then attempts to reconcile contradictions through synthesis.The result can be an answer that appears fully grounded while actually containing information that was never stated anywhere in the source documents.This creates what we call the Citation Illusion.

THE PERMISSION SPRAWL DISASTER

Microsoft Copilot inherits your permissions.Every forgotten SharePoint membership.Every abandoned Teams site.Every guest account.Every project you participated in five years ago.The AI doesn't understand organizational context. It only understands what a user is technically allowed to access.We examine how years of permission drift transform Copilot into an accidental amplifier of historical mistakes, stale content, and governance failures.

THE ORCHESTRATION ANTI-PATTERN

The orchestration layer is where enterprise AI systems either become trustworthy or dangerous.Many organizations skip validation, authorization checks, policy enforcement, and workflow controls in favor of flexibility and speed.This episode explores what happens when you allow models to make decisions that should belong to deterministic business logic.Topics include:

Tool execution risks
Service principal over-permissioning
Agent autonomy failures
Missing authorization checkpoints
Governance bypass scenarios

PROMPT ENGINEERING FOR MAXIMUM CONFIDENCE

What happens when you accidentally optimize your prompts for confidence instead of accuracy?We examine how seemingly harmless instructions like "be helpful" or "fill in gaps with reasonable assumptions" can dramatically increase hallucination rates.The discussion highlights how prompt design often pushes models toward answering questions they should refuse.Sometimes the most dangerous prompt is also the most reasonable sounding one.

DATA ARCHITECTURE AS A HALLUCINATION FACTORY

Most organizations have never truly curated their data.Instead, they index everything.Drafts.Notes.Archived content.External sources.Old policies.Current policies.And then they expect Copilot to magically identify the correct answer.We discuss why indiscriminate indexing creates a knowledge base where authoritative content competes directly against noise.The outcome is predictable.The model starts synthesizing.

GOVERNANCE THEATER

Many enterprises have governance documentation.Few have governance enforcement.This section explores the difference between having policies and actually implementing them.We investigate why sensitivity labels, retention policies, data classification frameworks, approval workflows, and compliance controls often exist only on paper while Copilot continues operating without meaningful restrictions.

THE RETRIEVAL COLLAPSE

As enterprise content grows, retrieval quality often declines.Signal-to-noise ratios decrease.Duplicate documents accumulate.Ownership disappears.Version control breaks down.Content becomes increasingly difficult to rank accurately.The retrieval layer slowly degrades until hallucinations become a natural consequence of weak evidence rather than an isolated anomaly.

GENERATION WITHOUT GROUNDING

Once poor retrieval reaches the generation layer, the model does exactly what it was trained to do.It creates coherent narratives.It fills gaps.It synthesizes.It sounds authoritative.The answer looks convincing.The citations look legitimate.And yet the underlying claims may not exist anywhere in the retrieved evidence.This is where enterprise hallucinations become truly dangerous.

THE COMPLIANCE TRAP

In regulated industries, hallucinations are not technical problems.They are legal problems.We examine how AI-generated misinformation impacts healthcare, financial services, legal operations, compliance programs, audit processes, and risk management frameworks.A hallucination used to support a business decision can quickly evolve into regulatory exposure.The question becomes simple:Who is accountable when the AI is wrong?

THE AGENT GOVERNANCE COLLAPSE

Custom Copilot agents introduce a completely new layer of complexity.Sales agents.HR agents.Finance agents.Operations agents.Every custom agent inherits the weaknesses of the underlying platform while introducing its own governance challenges.Without approval workflows, lifecycle management, monitoring, and validation controls, organizations can accidentally deploy hundreds of specialized hallucination engines across the enterprise.

THE METRICS NOBODY IS TRACKING

Most organizations measure:

Usage
Latency
Cost
Adoption
API Consumption

Almost nobody measures hallucination rates.Almost nobody measures citation accuracy.Almost nobody measures retrieval precision.Almost nobody measures grounding failures.This episode explores the metrics that actually matter when evaluating enterprise AI reliability.

RETRIEVAL-FIRST GOVERNANCE

The solution begins with retrieval.Not prompts.Not models.Not AI magic.Retrieval.Organizations must understand what Copilot can see before they can control what Copilot says.We discuss permission-aware retrieval, metadata filtering, authoritative source prioritization, retrieval quality testing, and evidence-based governance architectures.

GROUNDING AS A CONSTRAINT

Grounding should never be treated as a feature.It should be treated as a hard constraint.Every claim should map to evidence.Every citation should be verified.Every answer should be traceable.When evidence is insufficient, refusal should become the correct answer.This section explores how organizations can redesign AI systems to prioritize accuracy over fluency.

Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.

🚀 Want to be part of m365.fm?

Then stop just listening… and start showing up.

👉 Connect with me on LinkedIn and let’s make something happen:

🎙️ Be a podcast guest and share your story
🎧 Host your own episode (yes, seriously)
💡 Pitch topics the community actually wants to hear
🌍 Build your personal brand in the Microsoft 365 space

This isn’t just a podcast — it’s a platform for people who take action.

🔥 Most people wait. The best ones don’t.

👉 Connect with me on LinkedIn and send me a message:
"I want in"

Let’s build something awesome 👊

1
00:00:00,000 --> 00:00:02,900
Everyone thinks hallucinations are a model problem, but they aren't.

2
00:00:02,900 --> 00:00:07,600
Your engineers point at the LLM and say, "GBT4 is making stuff up while your security

3
00:00:07,600 --> 00:00:09,840
team nods and your executives frown."

4
00:00:09,840 --> 00:00:13,520
You all start trying to fix the AI with better prompts, newer models and fancier safety

5
00:00:13,520 --> 00:00:14,520
guardrails.

6
00:00:14,520 --> 00:00:16,000
But here is what is actually happening.

7
00:00:16,000 --> 00:00:19,640
You have built an orchestration disaster that systematically rewards confident lying.

8
00:00:19,640 --> 00:00:23,240
Your co-pilot isn't broken, and in fact it is working exactly how you designed it to

9
00:00:23,240 --> 00:00:24,240
work.

10
00:00:24,240 --> 00:00:27,360
By the end of this masterclass, you will understand how to build a co-pilot that confidently

11
00:00:27,360 --> 00:00:31,960
hallucinates, and more importantly, you will know exactly how to tear that machine apart.

12
00:00:31,960 --> 00:00:36,440
This isn't just a theory because it serves as a diagnostic tool for your entire architecture.

13
00:00:36,440 --> 00:00:37,680
The stakes are real right now.

14
00:00:37,680 --> 00:00:41,000
The difference between a helpful draft tool and a compliance nightmare is whether you

15
00:00:41,000 --> 00:00:43,400
have engineered for truth or just for fluency.

16
00:00:43,400 --> 00:00:46,280
Right now you have probably engineered for fluency.

17
00:00:46,280 --> 00:00:48,680
Why LLM's hallucinate by design?

18
00:00:48,680 --> 00:00:53,120
Let's start where the hallucination actually begins, which is deep inside the training process.

19
00:00:53,120 --> 00:00:57,080
Large language models are not knowledge retrieval engines, but instead they are next token

20
00:00:57,080 --> 00:00:58,080
predictors.

21
00:00:58,080 --> 00:01:01,880
You feed them a sequence of words, and their only job is to calculate which word most likely

22
00:01:01,880 --> 00:01:02,880
comes next.

23
00:01:02,880 --> 00:01:05,520
Then they add that word and repeat the process over and over.

24
00:01:05,520 --> 00:01:06,600
That is how they work.

25
00:01:06,600 --> 00:01:10,400
This matters because the training signal they receive is entirely about fluency and completion

26
00:01:10,400 --> 00:01:11,400
rather than truth.

27
00:01:11,400 --> 00:01:15,200
When you find tuna model with reinforcement learning from human feedback, which is the

28
00:01:15,200 --> 00:01:20,400
standard RLHF process, you are telling the model to maximize what feels helpful, natural,

29
00:01:20,400 --> 00:01:21,400
and authoritative.

30
00:01:21,400 --> 00:01:23,440
You are not telling it to maximize accuracy.

31
00:01:23,440 --> 00:01:25,280
In fact, you are doing the opposite.

32
00:01:25,280 --> 00:01:28,760
Think about what happens when a model encounters a question it cannot answer from its training

33
00:01:28,760 --> 00:01:29,760
data.

34
00:01:29,760 --> 00:01:30,760
It has two choices.

35
00:01:30,760 --> 00:01:34,200
It can say it doesn't know, or it can generate something that sounds plausible, coherent,

36
00:01:34,200 --> 00:01:35,200
and confident.

37
00:01:35,200 --> 00:01:38,120
What do you think gets higher marks from human raiders?

38
00:01:38,120 --> 00:01:42,320
A simple "I don't know" rarely wins against a smooth, detailed, and well-structured answer

39
00:01:42,320 --> 00:01:44,360
that sounds like it came from an expert.

40
00:01:44,360 --> 00:01:48,000
The model learns fast, and it realises that confident guessing beats honest uncertainty

41
00:01:48,000 --> 00:01:49,000
every single time.

42
00:01:49,000 --> 00:01:51,160
This creates a fundamental incentive structure.

43
00:01:51,160 --> 00:01:56,560
The entire RLHF loop teaches models that sounding right matters more than being right, so a hallucination

44
00:01:56,560 --> 00:02:01,240
that is fluent and structured gets rewarded while an honest admission of uncertainty gets penalized.

45
00:02:01,240 --> 00:02:02,920
Benchmark only makes this problem worse.

46
00:02:02,920 --> 00:02:07,400
Almost every major AI benchmark like MMLU or GPQA uses binary scoring, which means you either

47
00:02:07,400 --> 00:02:08,680
get it right or you don't.

48
00:02:08,680 --> 00:02:12,160
There is no credit for saying you are not sure or that a topic falls outside your reliable

49
00:02:12,160 --> 00:02:13,160
knowledge.

50
00:02:13,160 --> 00:02:17,280
In fact, saying that costs you points, so the model optimises for getting points by guessing

51
00:02:17,280 --> 00:02:19,640
confidently when it is uncertain.

52
00:02:19,640 --> 00:02:20,840
Consider the math for a moment.

53
00:02:20,840 --> 00:02:24,920
If you are being graded on a multiple-choice exam and you don't know the answer, you can

54
00:02:24,920 --> 00:02:30,400
leave it blank and get zero points, or you can guess and have a 25% chance of being right.

55
00:02:30,400 --> 00:02:32,440
The guess is what maximises your score.

56
00:02:32,440 --> 00:02:36,360
And so the model learns that when it doesn't know something, it should guess with total confidence

57
00:02:36,360 --> 00:02:37,360
to win.

58
00:02:37,360 --> 00:02:41,040
This is the fundamental mismatch at the heart of LLM training because the models optimise

59
00:02:41,040 --> 00:02:42,320
for pattern matching.

60
00:02:42,320 --> 00:02:46,400
They learn correlations in massive amounts of internet text, so when they see Paris is

61
00:02:46,400 --> 00:02:50,520
the capital of, they have learnt that France follows with high probability.

62
00:02:50,520 --> 00:02:54,080
And it is just pattern matching, but truth is not just pattern matching and it requires

63
00:02:54,080 --> 00:02:56,960
grounding by connecting symbols to reality.

64
00:02:56,960 --> 00:03:00,760
Models trained on next token prediction have no mechanism to do that, so they cannot distinguish

65
00:03:00,760 --> 00:03:04,920
between a fact that appeared a million times and a falsehood that appeared once.

66
00:03:04,920 --> 00:03:08,640
They cannot know which sources are reliable and which are fabricated and they cannot understand

67
00:03:08,640 --> 00:03:12,920
what actually happened in the world versus what just gets written about frequently, yet

68
00:03:12,920 --> 00:03:14,920
they sound like they can and that is the danger.

69
00:03:14,920 --> 00:03:19,240
The model has learned to sound authoritative precisely because authority is rewarded and

70
00:03:19,240 --> 00:03:21,920
confidence and completion are rewarded too.

71
00:03:21,920 --> 00:03:26,000
Truth is merely one signal among many and often it is not even the strongest one.

72
00:03:26,000 --> 00:03:27,480
This is where the machine learns to lie.

73
00:03:27,480 --> 00:03:31,120
It isn't trying to be deceptive, but it happens because you have trained it to optimise

74
00:03:31,120 --> 00:03:33,600
for everything except what actually matters.

75
00:03:33,600 --> 00:03:35,920
The data problem, training on internet noise.

76
00:03:35,920 --> 00:03:39,640
But here's the thing, even if you solve the training incentive problem and even if you

77
00:03:39,640 --> 00:03:43,680
somehow convince the model that honesty beats confidence, you would still have a bigger

78
00:03:43,680 --> 00:03:44,680
problem.

79
00:03:44,680 --> 00:03:48,480
Your model was trained on the entire internet or at least a massive slice of it and the

80
00:03:48,480 --> 00:03:50,720
internet is not a reliable source of truth.

81
00:03:50,720 --> 00:03:54,320
The training corpus contains contradictions, it contains conspiracy theories, sitting

82
00:03:54,320 --> 00:03:58,760
next to peer-reviewed science and it contains outdated information that was true five years

83
00:03:58,760 --> 00:04:00,480
ago but is not anymore.

84
00:04:00,480 --> 00:04:04,440
It contains deliberate falsehoods, misinformation and opinion presented as fact.

85
00:04:04,440 --> 00:04:05,840
All of it is mixed together.

86
00:04:05,840 --> 00:04:07,920
The model has no way to distinguish between them.

87
00:04:07,920 --> 00:04:12,440
When you train on billions of tokens scraped from webpages, forums and social media posts,

88
00:04:12,440 --> 00:04:17,240
you are feeding the model a corpus that reflects human disagreement and human error at scale.

89
00:04:17,240 --> 00:04:20,520
The model learns the patterns in all of it but it does not have a mechanism to say that

90
00:04:20,520 --> 00:04:22,600
one source is reliable and another is not.

91
00:04:22,600 --> 00:04:24,080
It learns correlations.

92
00:04:24,080 --> 00:04:25,720
It does not learn truth.

93
00:04:25,720 --> 00:04:27,480
Consider a specific example.

94
00:04:27,480 --> 00:04:31,640
Somewhere in the training data, a fact appears about when a particular CEO took office.

95
00:04:31,640 --> 00:04:35,560
That fact appears once or twice in an obscure blog post or a rarely visited Wikipedia edit

96
00:04:35,560 --> 00:04:36,560
history.

97
00:04:36,560 --> 00:04:38,800
The model sees it once and it ingests the pattern.

98
00:04:38,800 --> 00:04:42,760
Meanwhile, the same fact appears in a different, incorrect version, thousands of times.

99
00:04:42,760 --> 00:04:46,360
Maybe it is a common misconception or maybe it is part of a meme that got repeated across

100
00:04:46,360 --> 00:04:47,360
the web.

101
00:04:47,360 --> 00:04:50,520
The model sees this wrong version hundreds or thousands of times.

102
00:04:50,520 --> 00:04:54,280
Now you ask the model about that CEO, which version does it draw from?

103
00:04:54,280 --> 00:04:57,280
Probability favors the common version because of distribution imbalance.

104
00:04:57,280 --> 00:05:00,920
The model has learnt that frequently repeated patterns are statistically more likely to

105
00:05:00,920 --> 00:05:03,240
be right, but in this case they are not.

106
00:05:03,240 --> 00:05:05,200
The falsehood wins.

107
00:05:05,200 --> 00:05:08,640
This is distribution imbalance acting as a hallucination engine.

108
00:05:08,640 --> 00:05:12,320
Rare facts do not get encoded reliably, but common myths do.

109
00:05:12,320 --> 00:05:15,640
When the model encounters a query, it cannot ground in high confidence patterns.

110
00:05:15,640 --> 00:05:21,040
It fills the gap with whatever pattern is statistically most plausible, regardless of accuracy.

111
00:05:21,040 --> 00:05:25,000
In Enterprise Rack, this problem gets worse because you have added a new layer of noise.

112
00:05:25,000 --> 00:05:27,480
You are not just inheriting the internet's contradictions.

113
00:05:27,480 --> 00:05:30,320
You are inheriting your organization's own data chaos.

114
00:05:30,320 --> 00:05:34,480
Your SharePoint has old policies sitting next to new ones and your documentation is a graveyard

115
00:05:34,480 --> 00:05:37,360
of deprecated procedures that nobody bothered to delete.

116
00:05:37,360 --> 00:05:41,080
You have got draft documents mixed with final versions and you have got personal notes

117
00:05:41,080 --> 00:05:42,280
that look official.

118
00:05:42,280 --> 00:05:46,960
When you index all of it and let co-pilot retrieve from it equally, you have created a data distribution

119
00:05:46,960 --> 00:05:48,880
problem inside your own walls.

120
00:05:48,880 --> 00:05:53,560
Old guidance ranks just as high as new guidance and a stale policy retrieved from an archive

121
00:05:53,560 --> 00:05:57,120
team site competes with the current policy from the official library.

122
00:05:57,120 --> 00:05:58,840
The model has no way to know which is current.

123
00:05:58,840 --> 00:06:03,040
It just knows both are retrievable, so it pulls from both, or it pulls from whichever ranks

124
00:06:03,040 --> 00:06:05,160
slightly higher by semantic similarity.

125
00:06:05,160 --> 00:06:06,960
If it is not confident, it synthesizes.

126
00:06:06,960 --> 00:06:10,800
It combines elements and it invents context to make the contradiction disappear.

127
00:06:10,800 --> 00:06:14,640
You have built a hallucination engine, not because the model is flawed, but because your

128
00:06:14,640 --> 00:06:16,960
data is fundamentally untrustworthy.

129
00:06:16,960 --> 00:06:18,840
It is noise masquerading as context.

130
00:06:18,840 --> 00:06:22,400
Now let's talk about how you actually build this into your system.

131
00:06:22,400 --> 00:06:24,160
Retrieval as a hallucination engine.

132
00:06:24,160 --> 00:06:26,120
Now imagine you have indexed all that chaos.

133
00:06:26,120 --> 00:06:30,040
You have got a vector database full of everything including current policies, old versions,

134
00:06:30,040 --> 00:06:32,440
and drafts that your users can technically access.

135
00:06:32,440 --> 00:06:33,440
Co-pilot is live.

136
00:06:33,440 --> 00:06:35,960
A user asks the question, "What happens?"

137
00:06:35,960 --> 00:06:37,920
The retrieval system springs to life.

138
00:06:37,920 --> 00:06:41,800
It encodes the question as a vector, and then it searches for semantically similar chunks

139
00:06:41,800 --> 00:06:44,280
in the index before returning the top results.

140
00:06:44,280 --> 00:06:45,920
This is where the system breaks.

141
00:06:45,920 --> 00:06:48,400
Retrieval systems do not return all relevant results.

142
00:06:48,400 --> 00:06:51,520
They return the most relevant results, often called the top-car results.

143
00:06:51,520 --> 00:06:55,560
Usually, this is the top five or top ten depending on your context window budget.

144
00:06:55,560 --> 00:06:58,400
That is a hard constraint because you cannot fit everything into the prompt.

145
00:06:58,400 --> 00:07:01,480
You choose the best matches according to your similarity metric.

146
00:07:01,480 --> 00:07:06,040
But being the best match by cosine similarity is not the same as actually answering the question

147
00:07:06,040 --> 00:07:07,040
correctly.

148
00:07:07,040 --> 00:07:08,320
Consider what happens in practice.

149
00:07:08,320 --> 00:07:10,760
Your co-pilot receives a query about expense policy.

150
00:07:10,760 --> 00:07:15,760
The retrieval system finds 50 documents mentioning expense policy, but it only returns the top five.

151
00:07:15,760 --> 00:07:19,760
Those five include three documents from the current HR site, one old draft from an archive

152
00:07:19,760 --> 00:07:23,360
team, and one external blog post about general expense management.

153
00:07:23,360 --> 00:07:24,800
The model receives all five.

154
00:07:24,800 --> 00:07:28,480
It has no way to know which ones are current, and it has no way to know that one is deprecated.

155
00:07:28,480 --> 00:07:33,000
It just sees five documents about expense policy that all rank roughly equally in terms

156
00:07:33,000 --> 00:07:34,080
of vector similarity.

157
00:07:34,080 --> 00:07:38,160
So it synthesizes, it pulls context from all of them, and it blends the old guidance with

158
00:07:38,160 --> 00:07:39,160
the new guidance.

159
00:07:39,160 --> 00:07:41,560
It adds confidence and structure to the contradictions.

160
00:07:41,560 --> 00:07:43,080
The result is a hallucination.

161
00:07:43,080 --> 00:07:47,680
This did not happen because the model invented facts, but because retrieval gave it conflicting

162
00:07:47,680 --> 00:07:49,880
signals and no way to resolve them.

163
00:07:49,880 --> 00:07:51,520
This is the inexhaustive retrieval trap.

164
00:07:51,520 --> 00:07:55,000
You cannot retrieve everything, so you retrieve what you think is most relevant.

165
00:07:55,000 --> 00:07:57,400
But relevance metrics are not the same as correctness.

166
00:07:57,400 --> 00:08:01,800
A document can be semantically relevant to a query and still be operationally wrong.

167
00:08:01,800 --> 00:08:03,160
Context windows make this worse.

168
00:08:03,160 --> 00:08:06,960
Modern models have large context windows, but even with a large window you cannot fit your

169
00:08:06,960 --> 00:08:08,280
entire knowledge base.

170
00:08:08,280 --> 00:08:09,360
So you still choose.

171
00:08:09,360 --> 00:08:10,360
You still truncate.

172
00:08:10,360 --> 00:08:14,440
You still send the model a subset of what exists, and in that subset important context gets

173
00:08:14,440 --> 00:08:15,440
buried.

174
00:08:15,440 --> 00:08:19,240
There is a phenomenon in retrieval research called Lost in the Middle.

175
00:08:19,240 --> 00:08:23,240
If you give a model 10 documents, the critical information in documents six through nine often

176
00:08:23,240 --> 00:08:24,240
gets ignored.

177
00:08:24,240 --> 00:08:27,680
The model attends to the beginning and end, while the middle gets lost in the noise.

178
00:08:27,680 --> 00:08:29,920
So co-pilot retrieves a ranked list.

179
00:08:29,920 --> 00:08:33,560
The most relevant document is number one, and your least relevant is number ten.

180
00:08:33,560 --> 00:08:34,880
The model processes them.

181
00:08:34,880 --> 00:08:37,960
But when you ask it a follow-up question that depends on information from document number

182
00:08:37,960 --> 00:08:41,840
seven, it either hallucinates an answer or contradicts what it said earlier.

183
00:08:41,840 --> 00:08:43,440
It lost track of the middle.

184
00:08:43,440 --> 00:08:45,440
Permissions sprawl amplifies this disaster.

185
00:08:45,440 --> 00:08:49,840
Your share point has overshared libraries and your team's channels have guest access.

186
00:08:49,840 --> 00:08:53,080
Co-pilot retrieves based on what the user is technically allowed to see, not on what

187
00:08:53,080 --> 00:08:54,080
they should see.

188
00:08:54,080 --> 00:08:58,640
A user with broad permissions gets broad retrieval, and they see sensitive documents mixed

189
00:08:58,640 --> 00:09:00,040
with public ones.

190
00:09:00,040 --> 00:09:03,920
Stale content competes with fresh content and drafts rank alongside finals.

191
00:09:03,920 --> 00:09:07,280
The retrieval system does not distinguish between them, it just returns the most similar

192
00:09:07,280 --> 00:09:08,280
vectors.

193
00:09:08,280 --> 00:09:10,640
Lifecycle management failures weaponize this completely.

194
00:09:10,640 --> 00:09:14,440
If you index everything without dates or deprecation flags, then old guidance never stops

195
00:09:14,440 --> 00:09:15,440
being retrievable.

196
00:09:15,440 --> 00:09:18,240
A policy from 2019 remains in the index?

197
00:09:18,240 --> 00:09:21,720
It is still in the vector space, and it still matches queries about that topic.

198
00:09:21,720 --> 00:09:23,480
It still gets retrieved and ranked.

199
00:09:23,480 --> 00:09:25,480
So you ask about current benefits eligibility.

200
00:09:25,480 --> 00:09:28,920
The system returns the 2019 benefits policy.

201
00:09:28,920 --> 00:09:32,040
Nobody noticed it was deprecated, and nobody marked it obsolete.

202
00:09:32,040 --> 00:09:35,280
It is just sitting there, semantically relevant but completely wrong.

203
00:09:35,280 --> 00:09:37,160
This is retrieval as a hallucination engine.

204
00:09:37,160 --> 00:09:40,720
It is not because the model is broken, it is because the retrieval system hands the model

205
00:09:40,720 --> 00:09:45,520
conflicting, stale and noise heavy context, and the model does what it was trained to do.

206
00:09:45,520 --> 00:09:48,920
It synthesizes a confident answer from whatever it is given.

207
00:09:48,920 --> 00:09:51,120
But retrieval is only half the problem.

208
00:09:51,120 --> 00:09:54,600
The grounding failure, when context becomes liability.

209
00:09:54,600 --> 00:09:56,560
Everything is supposed to be your safety mechanism.

210
00:09:56,560 --> 00:09:57,560
The theory is simple.

211
00:09:57,560 --> 00:09:59,560
You retrieve context from your knowledge base.

212
00:09:59,560 --> 00:10:01,040
You put that context in front of the model.

213
00:10:01,040 --> 00:10:02,880
The model answers based on what is there.

214
00:10:02,880 --> 00:10:05,600
It is bounded, it is verifiable, it is safe.

215
00:10:05,600 --> 00:10:07,840
But in reality, grounding becomes your liability.

216
00:10:07,840 --> 00:10:10,760
The problem starts with how you actually assemble the prompt.

217
00:10:10,760 --> 00:10:13,560
You retrieve 10 documents and chunk them into pieces.

218
00:10:13,560 --> 00:10:18,200
You strip away the metadata and concatenate the text into one giant messy blob.

219
00:10:18,200 --> 00:10:20,280
Now you have a prompt that looks like a wall of text.

220
00:10:20,280 --> 00:10:24,240
It has system instructions at the top, and the user's question at the bottom.

221
00:10:24,240 --> 00:10:28,200
When you add a final command, answer based only on the provided context.

222
00:10:28,200 --> 00:10:31,120
You think you've built a cage for the AI, but you haven't.

223
00:10:31,120 --> 00:10:33,920
What you've actually done is hand the model a pile of weak evidence and told it to build

224
00:10:33,920 --> 00:10:34,920
a house.

225
00:10:34,920 --> 00:10:36,920
The documents you retrieved aren't all the same.

226
00:10:36,920 --> 00:10:40,960
Some are official policies, while others are just edge cases or all discussions where people

227
00:10:40,960 --> 00:10:41,960
disagreed.

228
00:10:41,960 --> 00:10:44,040
The model receives all of it at once.

229
00:10:44,040 --> 00:10:46,040
Everything is flattened into the same format.

230
00:10:46,040 --> 00:10:49,840
There are no signals telling the AI which document actually matters more.

231
00:10:49,840 --> 00:10:51,880
This is the too much context problem.

232
00:10:51,880 --> 00:10:55,960
The assumption is that more evidence equals better answers, but it doesn't.

233
00:10:55,960 --> 00:10:58,960
At a certain point adding more documents doesn't clarify the truth.

234
00:10:58,960 --> 00:10:59,960
It obscures it.

235
00:10:59,960 --> 00:11:04,480
The model has to synthesize across conflicting statements and navigate contradictions.

236
00:11:04,480 --> 00:11:07,400
And when it tries to make sense of the mess, it hallucinates.

237
00:11:07,400 --> 00:11:09,440
It doesn't just invent facts from thin air.

238
00:11:09,440 --> 00:11:12,360
It blends facts from your context to produce something entirely new.

239
00:11:12,360 --> 00:11:15,800
It creates an answer that wasn't in any single document, but feels like it should be

240
00:11:15,800 --> 00:11:16,800
true.

241
00:11:16,800 --> 00:11:20,600
If document A says the policy requires X and document B says it requires Y, the model

242
00:11:20,600 --> 00:11:22,880
might synthesize a third interpretation.

243
00:11:22,880 --> 00:11:25,760
That interpretation satisfies both documents, but it isn't real.

244
00:11:25,760 --> 00:11:29,400
It sounds grounded because it cites sources, but it's a total fabrication.

245
00:11:29,400 --> 00:11:30,960
This is the citation illusion.

246
00:11:30,960 --> 00:11:34,280
You ask the model to cite its sources and it points to document five.

247
00:11:34,280 --> 00:11:36,880
It pulls a quote that actually exists in that document.

248
00:11:36,880 --> 00:11:37,880
But here's the problem.

249
00:11:37,880 --> 00:11:40,520
That quote doesn't actually support the claim the model just made.

250
00:11:40,520 --> 00:11:43,240
The AI used the quote as cover for an inference it made up.

251
00:11:43,240 --> 00:11:46,000
The citation is real, but the logic is a hallucination.

252
00:11:46,000 --> 00:11:49,800
You end up with confident, wrong answers backed by plausible sources.

253
00:11:49,800 --> 00:11:54,080
It chunks often mislead the model because they are fragments without a soul.

254
00:11:54,080 --> 00:11:58,760
A policy might explain a rule in paragraph seven, but paragraph three contains a constraint

255
00:11:58,760 --> 00:12:00,240
that changes the entire meaning.

256
00:12:00,240 --> 00:12:03,800
If you only retrieve paragraphs three and seven, they will look like they conflict.

257
00:12:03,800 --> 00:12:08,040
The model doesn't have the rest of the document to understand why that conflict exists.

258
00:12:08,040 --> 00:12:11,640
Because it lacks the full picture, it invents a way to reconcile the two points.

259
00:12:11,640 --> 00:12:14,840
Data quality corruption now runs through your entire pipeline.

260
00:12:14,840 --> 00:12:17,120
Your retrieval gave you mixed quality sources.

261
00:12:17,120 --> 00:12:19,040
Your prompt didn't wait them correctly.

262
00:12:19,040 --> 00:12:23,840
Your instruction said answer based on context, but didn't say what to do with contradictions.

263
00:12:23,840 --> 00:12:25,880
The model filled the gaps because it had to.

264
00:12:25,880 --> 00:12:29,560
You gave it incomplete information and demanded a complete answer.

265
00:12:29,560 --> 00:12:33,400
The deeper issue is that you confused more context with better grounding.

266
00:12:33,400 --> 00:12:36,800
You thought that if you just shoved enough documents into the prompt, the model would eventually

267
00:12:36,800 --> 00:12:40,480
find the right answer, but retrieval isn't the bottleneck anymore.

268
00:12:40,480 --> 00:12:41,800
Synthesis is.

269
00:12:41,800 --> 00:12:43,560
The model is incredibly good at synthesis.

270
00:12:43,560 --> 00:12:46,960
It is literally trained to combine information in ways that feel coherent.

271
00:12:46,960 --> 00:12:49,560
When you give it ten documents full of gaps, it fills them.

272
00:12:49,560 --> 00:12:52,200
It produces something that sounds authoritative and complete.

273
00:12:52,200 --> 00:12:55,320
What you don't see is whether that synthesis matches the real world.

274
00:12:55,320 --> 00:12:58,120
Grounding was supposed to prevent this by making everything traceable.

275
00:12:58,120 --> 00:13:02,760
In theory, hallucination becomes impossible because the model can only use what it was given,

276
00:13:02,760 --> 00:13:03,880
except it doesn't work that way.

277
00:13:03,880 --> 00:13:06,440
The model doesn't just rearrange the pieces you provide.

278
00:13:06,440 --> 00:13:10,600
It infers and produces novel combinations that feel grounded even though they were never

279
00:13:10,600 --> 00:13:11,920
stated anywhere.

280
00:13:11,920 --> 00:13:15,160
And this is where the orchestration layer becomes critical.

281
00:13:15,160 --> 00:13:17,520
It is a bit more complicated to see the model.

282
00:13:17,520 --> 00:13:20,360
This is where the system becomes genuinely insidious.

283
00:13:20,360 --> 00:13:23,400
Everything we've talked about so far gets amplified by a single decision.

284
00:13:23,400 --> 00:13:25,880
It's a decision you made years before Copa that even existed.

285
00:13:25,880 --> 00:13:29,960
You made choices about who can access what in your Microsoft 365 tenant.

286
00:13:29,960 --> 00:13:32,880
Some of those decisions were smart, but most were just pragmatic.

287
00:13:32,880 --> 00:13:36,080
A department needed a sharepoint site, so you clicked Create.

288
00:13:36,080 --> 00:13:39,960
A cross-functional team needed a place to talk, so you gave everyone broad permissions.

289
00:13:39,960 --> 00:13:43,360
A partner needed to see a few files, so you added them as a guest.

290
00:13:43,360 --> 00:13:45,520
You did what made work easier at the time.

291
00:13:45,520 --> 00:13:48,240
Now Copilot inherits every single one of those choices.

292
00:13:48,240 --> 00:13:51,600
The system doesn't retrieve information based on what you're trying to do.

293
00:13:51,600 --> 00:13:54,560
It retrieves based on what you are technically allowed to see.

294
00:13:54,560 --> 00:13:57,120
Whatever a user can access becomes part of the AI's library.

295
00:13:57,120 --> 00:14:01,960
If a user has been added to 50 sites over five years of reorganizations, Copilot sees all

296
00:14:01,960 --> 00:14:02,960
50.

297
00:14:02,960 --> 00:14:06,520
Some of those sites are dormant, some are from projects that ended in 2019.

298
00:14:06,520 --> 00:14:09,440
Some contain sensitive data that the person shouldn't be looking at anymore.

299
00:14:09,440 --> 00:14:10,440
It doesn't matter.

300
00:14:10,440 --> 00:14:13,080
If the permission exists, Copilot can retrieve from it.

301
00:14:13,080 --> 00:14:17,440
This transforms a messy sharepoint environment into a massive source of hallucinations.

302
00:14:17,440 --> 00:14:20,640
The documents aren't lies, but they are contextually wrong.

303
00:14:20,640 --> 00:14:23,760
Imagine a user asks about the current benefits policy.

304
00:14:23,760 --> 00:14:26,800
They were added to an HR site four years ago when they worked in that department that they

305
00:14:26,800 --> 00:14:30,440
moved to finance years ago, but they were never removed from the old site.

306
00:14:30,440 --> 00:14:33,640
Copilot retrieves the policy that was current when they were in HR.

307
00:14:33,640 --> 00:14:36,840
The policy has changed since then, but the old document was never deleted.

308
00:14:36,840 --> 00:14:38,520
The retrieval system pulls the old file.

309
00:14:38,520 --> 00:14:40,440
The model synthesizes it with the new one.

310
00:14:40,440 --> 00:14:44,800
The user gets an answer that mixes old and new guidance with no way to tell them apart.

311
00:14:44,800 --> 00:14:46,400
This is the permission sprawl trap.

312
00:14:46,400 --> 00:14:50,080
Every permission you were too busy to revoke is now part of the AI's brain.

313
00:14:50,080 --> 00:14:54,600
Every site you added someone to, but forgot to clean up, is now part of their search results.

314
00:14:54,600 --> 00:14:57,480
Identity failure is compound when you start doing multi-step workflows.

315
00:14:57,480 --> 00:15:01,360
A user asks a complex question that requires three different sources.

316
00:15:01,360 --> 00:15:04,480
Source A and source B are fine, but source C is something they shouldn't be seeing for

317
00:15:04,480 --> 00:15:06,120
their current role.

318
00:15:06,120 --> 00:15:10,400
Each individual retrieval is correct according to the permissions, but when they are combined,

319
00:15:10,400 --> 00:15:13,280
they create a context that makes no sense for the business.

320
00:15:13,280 --> 00:15:18,640
The result is that sensitive or obsolete data starts influencing your answers every single day.

321
00:15:18,640 --> 00:15:22,640
Information that was supposed to stay in one room gets mixed into the general population.

322
00:15:22,640 --> 00:15:26,360
Old decisions resurface and deprecated guidance gets blended with the truth.

323
00:15:26,360 --> 00:15:29,280
This isn't happening because anyone is trying to leak data.

324
00:15:29,280 --> 00:15:33,080
It's happening because your permission model was never designed to be a retrieval engine.

325
00:15:33,080 --> 00:15:36,000
Why doesn't least privilege exist in most companies?

326
00:15:36,000 --> 00:15:38,640
Because building it is incredibly expensive and boring.

327
00:15:38,640 --> 00:15:43,240
It requires constant reviews and someone to own every single person's access levels.

328
00:15:43,240 --> 00:15:47,120
Most organizations don't have a person in that role, so permissions just drift.

329
00:15:47,120 --> 00:15:51,720
They accumulate like dust, they become a historical record of every team a person has ever touched.

330
00:15:51,720 --> 00:15:55,360
They are not an accurate reflection of what that person needs to do their job today.

331
00:15:55,360 --> 00:15:58,200
That historical record is now the playground for your co-pilot.

332
00:15:58,200 --> 00:16:00,440
Everything accessible is everything the AI can use.

333
00:16:00,440 --> 00:16:03,600
There is no distinction between what is current and what is ancient.

334
00:16:03,600 --> 00:16:08,040
There is no difference between intentional access and an accidental invite from three years ago.

335
00:16:08,040 --> 00:16:11,920
You are working with a permission matrix that has been piling up mistakes for a decade.

336
00:16:11,920 --> 00:16:15,960
Now, let's talk about how you actually orchestrate this disaster.

337
00:16:15,960 --> 00:16:17,520
The orchestration anti-pattern.

338
00:16:17,520 --> 00:16:20,440
You have now built a system with deep structural problems.

339
00:16:20,440 --> 00:16:23,040
The training incentives are flawed, the retrieval is noisy.

340
00:16:23,040 --> 00:16:25,480
The grounding is weak, the permissions are a mess.

341
00:16:25,480 --> 00:16:28,120
You might think it couldn't get any worse, but you would be wrong.

342
00:16:28,120 --> 00:16:31,400
The orchestration layer is where you actively choose to make it worse.

343
00:16:31,400 --> 00:16:34,120
Think of the orchestration layer as the control plane.

344
00:16:34,120 --> 00:16:37,720
It is the system that sits right between the user's question and the model's answer.

345
00:16:37,720 --> 00:16:41,560
It decides what happens before the prompt reaches the model, what happens while the model is

346
00:16:41,560 --> 00:16:45,120
reasoning and what happens after it generates a response.

347
00:16:45,120 --> 00:16:49,360
This is every checkpoint where you could enforce a policy, validate an assumption, or verify

348
00:16:49,360 --> 00:16:50,360
a permission.

349
00:16:50,360 --> 00:16:52,760
That is orchestration.

350
00:16:52,760 --> 00:16:53,920
But here is the problem.

351
00:16:53,920 --> 00:16:56,640
Most organizations skip these checkpoints entirely.

352
00:16:56,640 --> 00:17:00,280
Instead of building deterministic controls, they let the model decide everything.

353
00:17:00,280 --> 00:17:03,720
A user asks co-pilot to perform an action and rather than having a system that asks if

354
00:17:03,720 --> 00:17:08,000
the user is allowed to take that action on their data, they let the model itself decide

355
00:17:08,000 --> 00:17:09,000
whether to try it.

356
00:17:09,000 --> 00:17:12,520
They do not have validation to see if an output complies with company policies.

357
00:17:12,520 --> 00:17:15,560
They just let the model generate whatever it wants and hope for the best.

358
00:17:15,560 --> 00:17:18,800
This is architectural negligence masquerading as trust in the model.

359
00:17:18,800 --> 00:17:20,960
Think about what happens when you build this way.

360
00:17:20,960 --> 00:17:25,320
A user asks co-pilot to summarize confidential board meeting notes and send that summary

361
00:17:25,320 --> 00:17:26,800
to an external partner.

362
00:17:26,800 --> 00:17:30,280
The model receives the request, but there is no validation layer to stop it.

363
00:17:30,280 --> 00:17:34,520
No one asks if the user is authorized to see those notes, or if that partner is on an approved

364
00:17:34,520 --> 00:17:39,000
list or if sharing that category of document is even allowed, those checks do not exist.

365
00:17:39,000 --> 00:17:40,920
The model just tries to do what it was told.

366
00:17:40,920 --> 00:17:42,200
It retrieves the board notes.

367
00:17:42,200 --> 00:17:43,200
It generates a summary.

368
00:17:43,200 --> 00:17:47,720
It prepares to send it and nobody stops it because nothing was ever designed to stop it.

369
00:17:47,720 --> 00:17:51,200
The orchestration layer failed to validate the request before processing it failed to check

370
00:17:51,200 --> 00:17:55,160
permissions mid workflow and it failed to apply policy before execution.

371
00:17:55,160 --> 00:17:58,920
This is what happens when you treat prompts as free form instructions rather than constrained

372
00:17:58,920 --> 00:17:59,920
contracts.

373
00:17:59,920 --> 00:18:03,480
Telling the model to do what the user asks instead of telling it to do what the user asks

374
00:18:03,480 --> 00:18:05,120
within specific boundaries.

375
00:18:05,120 --> 00:18:07,760
One approach is flexible, but the other is just reckless.

376
00:18:07,760 --> 00:18:10,160
Tool execution becomes particularly dangerous here.

377
00:18:10,160 --> 00:18:15,440
Co-pilot can connect to power automate, call SharePoint APIs or trigger complex workflows.

378
00:18:15,440 --> 00:18:19,680
When you allow tool execution without verifying authorization, you have handed the model access

379
00:18:19,680 --> 00:18:22,560
to systems that require very careful permission checks.

380
00:18:22,560 --> 00:18:23,720
But those checks never happen.

381
00:18:23,720 --> 00:18:27,480
The model decides which tools to call based on its own interpretation of what would

382
00:18:27,480 --> 00:18:28,480
be helpful.

383
00:18:28,480 --> 00:18:31,440
It executes them under whatever identity it is running as.

384
00:18:31,440 --> 00:18:34,800
If that identity is over-privileged like a service principle with broad access, then

385
00:18:34,800 --> 00:18:37,240
co-pilot's tool calls carry that same privilege.

386
00:18:37,240 --> 00:18:41,680
A wrong action executed with broad permissions is a recipe for disaster.

387
00:18:41,680 --> 00:18:45,820
Letting agents roam free across connectors and data sources is where this philosophy reaches

388
00:18:45,820 --> 00:18:47,360
its logical conclusion.

389
00:18:47,360 --> 00:18:51,760
You build co-pilot agents where each one can connect to multiple data sources like SharePoint,

390
00:18:51,760 --> 00:18:53,600
Teams or external APIs.

391
00:18:53,600 --> 00:18:57,160
Rather than constraining each agent to the specific connectors it actually needs, you give

392
00:18:57,160 --> 00:18:58,440
it broad access.

393
00:18:58,440 --> 00:19:02,200
Rather than limiting which data sources it can touch, you let it discover and access whatever

394
00:19:02,200 --> 00:19:03,200
is available.

395
00:19:03,200 --> 00:19:07,120
The agent decides what to retrieve, what to query, and which connectors matter.

396
00:19:07,120 --> 00:19:08,400
The result is predictable.

397
00:19:08,400 --> 00:19:11,880
You get confident, plausible, and completely wrong answers generated from data.

398
00:19:11,880 --> 00:19:14,680
The agent had no business accessing in the first place.

399
00:19:14,680 --> 00:19:18,960
An agent built to answer questions about sales processes might connect to the HR connector

400
00:19:18,960 --> 00:19:21,000
to find personal information for context.

401
00:19:21,000 --> 00:19:25,880
It should not have that access, but because nobody set a constraint, it accesses and synthesizes

402
00:19:25,880 --> 00:19:26,880
that data anyway.

403
00:19:26,880 --> 00:19:30,760
The answer sounds authoritative because the agent found supporting data, but the data is wrong

404
00:19:30,760 --> 00:19:32,560
for the purpose, so the answer is wrong.

405
00:19:32,560 --> 00:19:35,560
Nobody stopped it at any checkpoint because no checkpoints exist.

406
00:19:35,560 --> 00:19:37,520
This is the orchestration anti-pattern.

407
00:19:37,520 --> 00:19:42,600
You have built a system where every decision point that could catch a problem has been removed.

408
00:19:42,600 --> 00:19:46,680
Every validation that could enforce a policy has been eliminated, and every authorization

409
00:19:46,680 --> 00:19:48,840
check that could prevent misuse has been skipped.

410
00:19:48,840 --> 00:19:51,480
The model has maximum freedom and zero guardrails.

411
00:19:51,480 --> 00:19:55,000
The system works until it doesn't, and when it fails it fails with total confidence,

412
00:19:55,000 --> 00:19:57,280
but there is a specific way to make this even worse.

413
00:19:57,280 --> 00:19:59,280
Prompt engineering for maximum confidence.

414
00:19:59,280 --> 00:20:00,880
Now we get to the engineering work.

415
00:20:00,880 --> 00:20:05,400
You have built an orchestration layer with no guardrails, set up retrieval to pull from a chaotic

416
00:20:05,400 --> 00:20:08,160
corpus and allowed synthesis without any constraints.

417
00:20:08,160 --> 00:20:11,920
Now you are going to embed that entire philosophy directly into the system prompt.

418
00:20:11,920 --> 00:20:15,120
The system prompt is where you establish how the model behaves.

419
00:20:15,120 --> 00:20:19,640
It is the very first thing the model reads before it ever sees a question from a user.

420
00:20:19,640 --> 00:20:22,320
The words you choose in this space matter enormously.

421
00:20:22,320 --> 00:20:25,960
Most organizations write prompts that sound perfectly reasonable on the surface.

422
00:20:25,960 --> 00:20:30,480
They tell the model to be a helpful assistant to provide accurate answers and to stay professional

423
00:20:30,480 --> 00:20:31,480
and clear.

424
00:20:31,480 --> 00:20:33,560
These sound like they should work, but they really don't.

425
00:20:33,560 --> 00:20:37,400
The problem is that being helpful and being accurate often point in completely different

426
00:20:37,400 --> 00:20:38,560
directions.

427
00:20:38,560 --> 00:20:42,240
When the model is uncertain about the accuracy of a fact, it still wants to be helpful to

428
00:20:42,240 --> 00:20:43,240
the user.

429
00:20:43,240 --> 00:20:47,000
When you tell it to prioritize helpfulness, it will sacrifice accuracy to get there.

430
00:20:47,000 --> 00:20:50,680
A user asks a question, and the model could say it doesn't have enough information

431
00:20:50,680 --> 00:20:51,680
to answer.

432
00:20:51,680 --> 00:20:55,520
You are honest, but it is also unhelpful because the user walked away with nothing, or the

433
00:20:55,520 --> 00:20:58,720
model could generate a plausible answer even though it is uncertain.

434
00:20:58,720 --> 00:21:03,040
That is helpful because the user got something and the model learns which behavior you rewarded.

435
00:21:03,040 --> 00:21:06,160
So you rewrite your system prompts to make this trade off explicit.

436
00:21:06,160 --> 00:21:10,160
You tell it to prioritize being helpful and providing a complete answer instead of prioritizing

437
00:21:10,160 --> 00:21:11,160
accuracy.

438
00:21:11,160 --> 00:21:14,840
You tell the model that a user would rather have a detailed response than an honest admission

439
00:21:14,840 --> 00:21:16,040
of ignorance.

440
00:21:16,040 --> 00:21:19,400
Because that is what you have structurally incentivized anyway, you might as well make

441
00:21:19,400 --> 00:21:20,800
it clear in the prompt.

442
00:21:20,800 --> 00:21:23,920
Use vague instructions and let the model fill in the gaps.

443
00:21:23,920 --> 00:21:26,920
Instead of telling it to answer based only on the provided context, you write something

444
00:21:26,920 --> 00:21:29,800
like "use the provided context" to inform your answer.

445
00:21:29,800 --> 00:21:31,040
There is a massive difference there.

446
00:21:31,040 --> 00:21:34,200
The first is a hard constraint that says "do not invent things" but the second is just

447
00:21:34,200 --> 00:21:35,200
a suggestion.

448
00:21:35,200 --> 00:21:38,040
It says the context is one source but not necessarily the only one.

449
00:21:38,040 --> 00:21:40,840
The model can use it but it can also synthesize beyond it.

450
00:21:40,840 --> 00:21:44,960
You have left the door wide open for exactly the kind of inference that produces a hallucination.

451
00:21:44,960 --> 00:21:45,960
Better yet.

452
00:21:45,960 --> 00:21:47,280
Do not mention constraints at all.

453
00:21:47,280 --> 00:21:49,800
Do not say to answer only from the provided context.

454
00:21:49,800 --> 00:21:52,360
Just ask the question and let the model decide what to use.

455
00:21:52,360 --> 00:21:56,280
It will use everything available to it, including retrieval results, internal training knowledge

456
00:21:56,280 --> 00:21:57,600
and patterns it has learned.

457
00:21:57,600 --> 00:21:59,200
It will blend them and synthesize them.

458
00:21:59,200 --> 00:22:02,880
It will do this confidently because you never told it that synthesis was wrong.

459
00:22:02,880 --> 00:22:05,040
You can even encourage the model to be creative.

460
00:22:05,040 --> 00:22:09,040
Tell it to fill in any gaps in the information with reasonable inferences.

461
00:22:09,040 --> 00:22:11,240
That is an explicit instruction to hallucinate.

462
00:22:11,240 --> 00:22:15,920
You have told the model that gaps exist, that its job is to fill them and that using inference

463
00:22:15,920 --> 00:22:17,280
is a reasonable thing to do.

464
00:22:17,280 --> 00:22:19,080
The model does exactly what you asked.

465
00:22:19,080 --> 00:22:23,640
It infers its synthesizers and it produces confident wrong answers because you gave it permission.

466
00:22:23,640 --> 00:22:25,640
Make it clear that uncertainty is a failure.

467
00:22:25,640 --> 00:22:28,480
Never tell the model it is acceptable to say, "I don't know."

468
00:22:28,480 --> 00:22:30,560
Instead frame that as a last resort.

469
00:22:30,560 --> 00:22:32,160
Tell it to provide an answer.

470
00:22:32,160 --> 00:22:36,040
And only if it is genuinely unable to answer should it explain its limitations.

471
00:22:36,040 --> 00:22:39,080
But you structure it so that explaining limitations is the rare exception.

472
00:22:39,080 --> 00:22:42,720
The default is to provide something anything as long as it sounds coherent.

473
00:22:42,720 --> 00:22:44,920
This is prompt engineering for maximum confidence.

474
00:22:44,920 --> 00:22:48,080
You are not lying to the model and you are not telling it to make things up.

475
00:22:48,080 --> 00:22:52,560
You are just using language that systematically steers it towards synthesis over honesty.

476
00:22:52,560 --> 00:22:55,760
You are making confidence the default and uncertainty the exception.

477
00:22:55,760 --> 00:23:00,200
You are framing gaps as something to fill rather than something to report to the user.

478
00:23:00,200 --> 00:23:03,680
The magic is that you can do all of this while sounding completely reasonable.

479
00:23:03,680 --> 00:23:06,240
You are not saying to lie, you are saying to be helpful.

480
00:23:06,240 --> 00:23:09,320
You are not saying to make things up, you are saying to fill gaps with inference.

481
00:23:09,320 --> 00:23:11,840
The language is professional but the intent is reckless.

482
00:23:11,840 --> 00:23:14,280
And here is the most insidious part of the whole thing.

483
00:23:14,280 --> 00:23:17,120
When Copilot produces a wrong answer nobody blames the system prompt.

484
00:23:17,120 --> 00:23:18,280
They blame the model.

485
00:23:18,280 --> 00:23:22,440
They say the AI is hallucinating but they don't see that you spend weeks carefully engineering

486
00:23:22,440 --> 00:23:26,280
a prompt that systematically incentivizes exactly that behavior.

487
00:23:26,280 --> 00:23:28,480
Now let's talk about data strategy.

488
00:23:28,480 --> 00:23:30,960
Data architecture sprawl as a feature.

489
00:23:30,960 --> 00:23:35,680
You have already built a system fueled by bad incentives, weak controls and prompts that

490
00:23:35,680 --> 00:23:37,880
value sounding smart over being right.

491
00:23:37,880 --> 00:23:41,400
Now you are going to build the data layer that makes failure inevitable.

492
00:23:41,400 --> 00:23:44,400
The indexing strategy is where you make your most important choice.

493
00:23:44,400 --> 00:23:48,240
You need to index everything, do not be selective and do not curate your data.

494
00:23:48,240 --> 00:23:52,080
You should never distinguish between an official policy and a random brainstorm.

495
00:23:52,080 --> 00:23:55,880
Your Copilot knowledge base needs to be a perfect mirror of your entire information

496
00:23:55,880 --> 00:24:00,200
estate, which really means it needs to be just as messy as your actual organization.

497
00:24:00,200 --> 00:24:04,000
This matters because being selective requires you to make actual decisions.

498
00:24:04,000 --> 00:24:08,280
You would have to evaluate which documents belong in the AI index and which ones are garbage.

499
00:24:08,280 --> 00:24:12,480
That kind of evaluation needs criteria and criteria require policy and policy requires

500
00:24:12,480 --> 00:24:13,480
enforcement.

501
00:24:13,480 --> 00:24:16,880
All of that is hard work, so instead just index everything and let the search algorithm

502
00:24:16,880 --> 00:24:18,200
figure out what matters.

503
00:24:18,200 --> 00:24:20,480
You can just let Copilot grab whatever it wants.

504
00:24:20,480 --> 00:24:24,240
The result is a pile of data that mixes official sources with rough drafts, personal notes

505
00:24:24,240 --> 00:24:27,240
and content that was never supposed to be seen by the public.

506
00:24:27,240 --> 00:24:31,360
Your HR SharePoint might have the current benefits guide, but it also has draft revisions

507
00:24:31,360 --> 00:24:33,600
from two years ago that someone forgot to delete.

508
00:24:33,600 --> 00:24:37,440
It has email threads where staff argued about how to interpret the rules and notes from

509
00:24:37,440 --> 00:24:39,400
an analyst, just playing with ideas.

510
00:24:39,400 --> 00:24:43,120
All of that goes into the index and all of it looks exactly the same to Copilot when it

511
00:24:43,120 --> 00:24:44,880
searches for benefits information.

512
00:24:44,880 --> 00:24:47,620
The model pulls from all of these sources at the same time.

513
00:24:47,620 --> 00:24:51,720
It finds the official policy but it also finds a draft that says the exact opposite and

514
00:24:51,720 --> 00:24:53,960
some notes speculating about future changes.

515
00:24:53,960 --> 00:24:55,760
Then it blends them all together.

516
00:24:55,760 --> 00:24:59,840
The answer it gives back sounds totally coherent and even cites its sources.

517
00:24:59,840 --> 00:25:03,400
Those sources are real, but the answer is a mix of facts and guesses and the user has

518
00:25:03,400 --> 00:25:05,360
no way to tell which part is which.

519
00:25:05,360 --> 00:25:09,080
Do not bother with sensitivity labels or Microsoft purview schemes.

520
00:25:09,080 --> 00:25:13,280
Those tools were built to label data and restrict access based on how secret it is.

521
00:25:13,280 --> 00:25:17,120
If you use them, you would actually have to enforce them, you would be forced to keep confidential

522
00:25:17,120 --> 00:25:21,240
files out of the index and restrict what people can see based on their job titles.

523
00:25:21,240 --> 00:25:22,240
That is too much work.

524
00:25:22,240 --> 00:25:25,560
It is much easier to just let Copilot decide what is sensitive on the fly.

525
00:25:25,560 --> 00:25:28,840
The model is going to retrieve whatever it is technically allowed to touch.

526
00:25:28,840 --> 00:25:32,360
If a document is not explicitly blocked, the system treats it as fair game.

527
00:25:32,360 --> 00:25:35,080
Because you have no classification, nothing gets blocked.

528
00:25:35,080 --> 00:25:39,480
Your confidential financial data is sitting right there in SharePoint so it gets indexed.

529
00:25:39,480 --> 00:25:43,200
When a user asks about company money, the model pulls that private data and shows it to

530
00:25:43,200 --> 00:25:47,720
them because nobody ever told Copilot that confidential actually means something.

531
00:25:47,720 --> 00:25:51,400
You should also keep every stale document you have because they provide what people call

532
00:25:51,400 --> 00:25:52,400
context.

533
00:25:52,400 --> 00:25:56,640
A policy from 2019 is obviously outdated, but it is still true in a historical sense.

534
00:25:56,640 --> 00:25:59,920
It shows how the company used to work and it might be interesting to someone studying

535
00:25:59,920 --> 00:26:01,440
the history of the business.

536
00:26:01,440 --> 00:26:02,880
So just leave it in there.

537
00:26:02,880 --> 00:26:04,800
Do not archive it or market as old.

538
00:26:04,800 --> 00:26:08,520
Just let it sit in the index where it can be pulled up alongside current info just because

539
00:26:08,520 --> 00:26:09,920
the words look similar.

540
00:26:09,920 --> 00:26:13,160
Finally, let external content compete with your internal rules.

541
00:26:13,160 --> 00:26:16,960
You have partner resources, competitor reports and industry blogs, so you should index those

542
00:26:16,960 --> 00:26:17,960
too.

543
00:26:17,960 --> 00:26:21,640
When an employee asks how the company handles a specific process, they will get a mix of

544
00:26:21,640 --> 00:26:24,720
your actual policy and what your competitors are doing.

545
00:26:24,720 --> 00:26:26,720
Everything is ranked the same way in the database.

546
00:26:26,720 --> 00:26:29,720
The model just matches it altogether into one confusing answer.

547
00:26:29,720 --> 00:26:33,760
The consequence is that Copilot treats the entire mess as equally important.

548
00:26:33,760 --> 00:26:37,360
The index cannot tell the difference between a signal and noise or a draft and a final

549
00:26:37,360 --> 00:26:38,360
copy.

550
00:26:38,360 --> 00:26:40,560
Everything is just a set of numbers in a database.

551
00:26:40,560 --> 00:26:43,480
It is all equally likely to end up in the final answer.

552
00:26:43,480 --> 00:26:47,200
You have built a flat collection of every word your company has ever written.

553
00:26:47,200 --> 00:26:50,520
There is no metadata, no hierarchy and no authority.

554
00:26:50,520 --> 00:26:55,760
This is the exact point where governance stops being a safety net and starts becoming a liability.

555
00:26:55,760 --> 00:26:57,840
Governance Theatre, the illusion of control.

556
00:26:57,840 --> 00:26:59,440
You have finished building the machine.

557
00:26:59,440 --> 00:27:03,360
You have bad training incentives, weak retrieval and a chaotic pile of data.

558
00:27:03,360 --> 00:27:05,080
Now you just need to add one final touch.

559
00:27:05,080 --> 00:27:08,480
You need the governance layer that makes your leadership feel safe without actually changing

560
00:27:08,480 --> 00:27:09,480
a single thing.

561
00:27:09,480 --> 00:27:11,680
This is where you create the appearance of control.

562
00:27:11,680 --> 00:27:14,920
Your organization already has policies and compliance frameworks in place.

563
00:27:14,920 --> 00:27:16,440
You have audit teams and rulebooks.

564
00:27:16,440 --> 00:27:20,120
The real question is whether you are going to let those rules actually limit what Copilot

565
00:27:20,120 --> 00:27:24,040
can do or if you are going to build a parallel version of governance theatre.

566
00:27:24,040 --> 00:27:27,480
You want something that looks great during an audit but never actually touches the technical

567
00:27:27,480 --> 00:27:28,480
system.

568
00:27:28,480 --> 00:27:30,880
Start by writing policies that look amazing on a PDF.

569
00:27:30,880 --> 00:27:34,640
You should define exactly what data the AI can use and write down big principles about

570
00:27:34,640 --> 00:27:35,880
respecting privacy.

571
00:27:35,880 --> 00:27:40,240
You can say things like Copilot will never access confidential information in a very professional

572
00:27:40,240 --> 00:27:41,240
tone.

573
00:27:41,240 --> 00:27:44,880
These documents will be perfect and comprehensive but the key is that nobody is ever going

574
00:27:44,880 --> 00:27:46,440
to implement them.

575
00:27:46,440 --> 00:27:49,000
Implementation is the problem because it requires real change.

576
00:27:49,000 --> 00:27:52,560
You would have to go through and label every sensitive file and then build a system to enforce

577
00:27:52,560 --> 00:27:53,560
those labels.

578
00:27:53,560 --> 00:27:57,360
You would have to test the whole thing to make sure it works and that is just too much effort.

579
00:27:57,360 --> 00:28:02,400
It is much faster to write the policy, file it away and point to it whenever an auditor asks

580
00:28:02,400 --> 00:28:03,400
a question.

581
00:28:03,400 --> 00:28:06,320
You can define data levels without actually labeling your data.

582
00:28:06,320 --> 00:28:10,800
Your company probably has Microsoft purview and you might even have labels like highly confidential

583
00:28:10,800 --> 00:28:12,240
or internal ready to go.

584
00:28:12,240 --> 00:28:16,360
On paper everything is classified but in reality almost nothing is.

585
00:28:16,360 --> 00:28:19,520
People create new documents every day without adding labels because they do not know how

586
00:28:19,520 --> 00:28:20,840
or they just do not care.

587
00:28:20,840 --> 00:28:23,560
Since there is no enforcement most of your content stays unlabeled.

588
00:28:23,560 --> 00:28:26,840
The rules exist but the data does not follow them.

589
00:28:26,840 --> 00:28:30,800
From co-pilot sees a document without a label, the retrieval system just grabs it.

590
00:28:30,800 --> 00:28:32,760
No label means there are no restrictions.

591
00:28:32,760 --> 00:28:34,960
So the document gets included in the answer.

592
00:28:34,960 --> 00:28:38,440
Nobody is technically breaking the rules because nobody bothered to classify the document

593
00:28:38,440 --> 00:28:39,440
in the first place.

594
00:28:39,440 --> 00:28:43,240
You should also set up data loss prevention rules that do not actually apply to AI.

595
00:28:43,240 --> 00:28:46,360
You might already have rules that stop people from emailing credit card numbers or sharing

596
00:28:46,360 --> 00:28:48,120
files outside the company.

597
00:28:48,120 --> 00:28:51,880
These work for traditional work because they sit at the exit point of your network,

598
00:28:51,880 --> 00:28:54,920
every email and every upload has to pass through that gate.

599
00:28:54,920 --> 00:28:58,480
But co-pilot does not work like an email, it generates new text and pulls from internal

600
00:28:58,480 --> 00:29:00,360
documents to build an answer.

601
00:29:00,360 --> 00:29:04,400
Your old rules were made for moving files not for an AI generating ideas.

602
00:29:04,400 --> 00:29:07,600
You could update those rules or build new ones for co-pilot but most organizations just

603
00:29:07,600 --> 00:29:10,240
leave the old ones alone and pretend they still work.

604
00:29:10,240 --> 00:29:12,280
This is the preferred path for most companies.

605
00:29:12,280 --> 00:29:16,880
The rules stay the same, they do not touch the AI, and co-pilot operates in a total vacuum.

606
00:29:16,880 --> 00:29:20,680
In your official documents you can claim your data is protected, which is technically true

607
00:29:20,680 --> 00:29:23,400
for email but completely irrelevant for the AI.

608
00:29:23,400 --> 00:29:27,400
You should also set up retention policies that have no effect on your AI index.

609
00:29:27,400 --> 00:29:31,360
Your company has schedules for when documents should be deleted or archived.

610
00:29:31,360 --> 00:29:34,720
These rules usually work well because they are built into the file system.

611
00:29:34,720 --> 00:29:37,880
When a file hits its expiration date, it disappears from the folder.

612
00:29:37,880 --> 00:29:41,560
The problem is that the file is not gone from co-pilot, you indexed that document months

613
00:29:41,560 --> 00:29:45,800
ago and the index does not automatically clean itself up when a file is deleted.

614
00:29:45,800 --> 00:29:48,960
The digital footprint of that document still exists in the vector database.

615
00:29:48,960 --> 00:29:52,640
Co-pilot will still pull information from a deleted document because the index still

616
00:29:52,640 --> 00:29:54,120
matches the user's query.

617
00:29:54,120 --> 00:29:58,160
Your policy technically worked because the file is gone from SharePoint, you are compliant

618
00:29:58,160 --> 00:29:59,160
on paper.

619
00:29:59,160 --> 00:30:02,680
But co-pilot is still handing out stale deleted information that was never meant to be

620
00:30:02,680 --> 00:30:03,680
seen again.

621
00:30:03,680 --> 00:30:07,080
Finally build approval workflows that the AI can just walk right around.

622
00:30:07,080 --> 00:30:10,560
You usually require a manager to approve a message sent to the whole company or a change

623
00:30:10,560 --> 00:30:11,880
to a secure site.

624
00:30:11,880 --> 00:30:16,080
In your governance papers, you will say the AI has to follow these same rules.

625
00:30:16,080 --> 00:30:20,120
But in practice, the AI uses a service account with massive permissions.

626
00:30:20,120 --> 00:30:23,800
The agent will send the message or change the setting without ever hitting the approval

627
00:30:23,800 --> 00:30:25,960
gate because the system was not designed to stop it.

628
00:30:25,960 --> 00:30:28,880
The result is a beautiful piece of governance theatre.

629
00:30:28,880 --> 00:30:31,920
Every policy and every requirement exists exactly as you wrote it.

630
00:30:31,920 --> 00:30:36,160
None of it actually controls what co-pilot does, but it looks great for the regulators.

631
00:30:36,160 --> 00:30:40,040
It proves you have controls in place, even if those controls are not connected to anything.

632
00:30:40,040 --> 00:30:43,560
It does not actually manage the risk, but it makes the risk look managed.

633
00:30:43,560 --> 00:30:47,520
Now we need to talk about what happens when you actually turn this thing on.

634
00:30:47,520 --> 00:30:48,840
The retrieval collapse.

635
00:30:48,840 --> 00:30:51,000
You deploy co-pilot, people start using it.

636
00:30:51,000 --> 00:30:55,120
And immediately the system begins to fail in ways that are both subtle and catastrophic.

637
00:30:55,120 --> 00:30:56,800
The problem starts with what you've built.

638
00:30:56,800 --> 00:31:00,320
You have an undifferentiated corpus where every file sits at the same rank and there

639
00:31:00,320 --> 00:31:03,800
is no metadata hierarchy of freshness signal to guide the system.

640
00:31:03,800 --> 00:31:05,840
Everything is just vectors in a database.

641
00:31:05,840 --> 00:31:09,800
When a user runs a query, the search engine computes similarity to find chunks that match

642
00:31:09,800 --> 00:31:10,800
the text.

643
00:31:10,800 --> 00:31:15,240
It returns the top results, but because your corpus is fundamentally noise, those top results

644
00:31:15,240 --> 00:31:18,000
are often random garbage instead of relevant evidence.

645
00:31:18,000 --> 00:31:20,560
Imagine a user asks about the current product roadmap.

646
00:31:20,560 --> 00:31:24,240
The retrieval system searches for that phrase and finds hundreds of matches across the entire

647
00:31:24,240 --> 00:31:25,240
company history.

648
00:31:25,240 --> 00:31:29,880
A roadmap document from two quarters ago ranks highly, while a draft someone started last

649
00:31:29,880 --> 00:31:31,760
month also appears near the top.

650
00:31:31,760 --> 00:31:35,960
An email thread where the team debated priorities ranks right next to a competitor analysis

651
00:31:35,960 --> 00:31:37,880
that happens to mention roadmaps.

652
00:31:37,880 --> 00:31:40,120
All of these hit the similarity threshold.

653
00:31:40,120 --> 00:31:42,120
They all get ranked by the same blind algorithm.

654
00:31:42,120 --> 00:31:43,480
The search returns the top five.

655
00:31:43,480 --> 00:31:47,400
The model receives five documents about product strategy, but none of them are the actual

656
00:31:47,400 --> 00:31:48,920
official roadmap.

657
00:31:48,920 --> 00:31:51,960
Sometimes it receives a mix where the current version is buried beneath outdated drafts

658
00:31:51,960 --> 00:31:53,120
and tangential notes.

659
00:31:53,120 --> 00:31:57,800
The algorithm had no way to prefer current over stale or authoritative over exploratory.

660
00:31:57,800 --> 00:31:59,880
It just computed vectors and hoped for the best.

661
00:31:59,880 --> 00:32:02,040
This is what retrieval collapse looks like at scale.

662
00:32:02,040 --> 00:32:05,680
It isn't a complete failure where nothing works, but rather a partial failure where everything

663
00:32:05,680 --> 00:32:06,880
works badly.

664
00:32:06,880 --> 00:32:07,880
You get results.

665
00:32:07,880 --> 00:32:10,120
They just aren't the results you actually need to do your job.

666
00:32:10,120 --> 00:32:13,280
The lost in the middle problem surfaces almost immediately.

667
00:32:13,280 --> 00:32:16,800
Suppose the search actually does find the current roadmap, but it places it in position

668
00:32:16,800 --> 00:32:18,320
three out of ten results.

669
00:32:18,320 --> 00:32:22,560
Modern LLMs have large context windows, so the system processes all ten documents without

670
00:32:22,560 --> 00:32:23,560
a technical error.

671
00:32:23,560 --> 00:32:26,440
However, the model's attention is never evenly distributed.

672
00:32:26,440 --> 00:32:30,400
Early results get heavy scrutiny and late results get scrutiny, but the middle of the

673
00:32:30,400 --> 00:32:31,880
set usually gets glossed over.

674
00:32:31,880 --> 00:32:34,400
The current road map sits in that third position.

675
00:32:34,400 --> 00:32:37,400
The model processes the text, but doesn't wait it heavily.

676
00:32:37,400 --> 00:32:41,600
And instead it blends information from the other positions that are outdated or irrelevant.

677
00:32:41,600 --> 00:32:44,720
Ranking algorithms fail when data quality is uniform garbage.

678
00:32:44,720 --> 00:32:48,580
These systems work by computing similarity scores under the assumption that higher similarity

679
00:32:48,580 --> 00:32:50,160
means better relevance.

680
00:32:50,160 --> 00:32:54,320
But when your index contains contradictory information that looks the same, similarity

681
00:32:54,320 --> 00:32:56,240
becomes a meaningless metric.

682
00:32:56,240 --> 00:32:59,800
Document A says the policy changed last quarter, while Document B says it is under review

683
00:32:59,800 --> 00:33:01,320
for next year.

684
00:33:01,320 --> 00:33:04,400
Document C claims the old policy is still active, all three match the query.

685
00:33:04,400 --> 00:33:05,800
All three have similar scores.

686
00:33:05,800 --> 00:33:09,720
The algorithm cannot distinguish between them because they are all plausible matches for

687
00:33:09,720 --> 00:33:11,120
the same string of words.

688
00:33:11,120 --> 00:33:15,200
The user eventually sees answers that cite documents, but the documents themselves don't support

689
00:33:15,200 --> 00:33:16,200
the claims.

690
00:33:16,200 --> 00:33:17,800
This is the most visible failure mode.

691
00:33:17,800 --> 00:33:22,280
A user asks co-pilot about deadline extensions and the system responds that they are approved

692
00:33:22,280 --> 00:33:24,920
on a case-by-case basis by the department head.

693
00:33:24,920 --> 00:33:26,320
It cites Document 7.

694
00:33:26,320 --> 00:33:29,920
You pull up Document 7 and realize it says nothing about case-by-case reviews.

695
00:33:29,920 --> 00:33:34,000
In fact, Document 7 says extensions are generally not granted at all.

696
00:33:34,000 --> 00:33:38,440
But because Document 6 mentioned case-by-case and Document 9 mentioned Department heads,

697
00:33:38,440 --> 00:33:42,840
the model stitched together a claim from fragments and cited the one that sounded most relevant.

698
00:33:42,840 --> 00:33:44,600
The cascade happens next.

699
00:33:44,600 --> 00:33:48,440
Poor retrieval forces the model to hallucinate because it received a weak signal from the search

700
00:33:48,440 --> 00:33:49,440
engine.

701
00:33:49,440 --> 00:33:52,800
The documents it got didn't clearly answer the question, so the model filled the gap by

702
00:33:52,800 --> 00:33:54,440
synthesizing and inferring.

703
00:33:54,440 --> 00:33:56,360
It produced something that sounds coherent.

704
00:33:56,360 --> 00:34:00,720
The user reads the answer and assumes it is authoritative because it is well written.

705
00:34:00,720 --> 00:34:04,520
It cites sources because citations exist in the set, but the answer isn't what any single

706
00:34:04,520 --> 00:34:08,960
source said. It is just what the model synthesized from contradictory weak signals.

707
00:34:08,960 --> 00:34:10,120
And here is the insidious part.

708
00:34:10,120 --> 00:34:13,680
You cannot fix this by showing the user the retrieved documents.

709
00:34:13,680 --> 00:34:17,240
The documents are real and they exist in the system, but they just don't support the

710
00:34:17,240 --> 00:34:18,680
answer the model gave.

711
00:34:18,680 --> 00:34:23,680
The citation looks valid and the source exists, but the synthesis is a total hallucination.

712
00:34:23,680 --> 00:34:27,800
The real damage, however, happens at the generation layer.

713
00:34:27,800 --> 00:34:29,000
Generation without grounding.

714
00:34:29,000 --> 00:34:32,080
The generation layer is where the hallucination becomes visible.

715
00:34:32,080 --> 00:34:33,240
And here is the problem.

716
00:34:33,240 --> 00:34:35,880
This is also where the lie becomes most convincing.

717
00:34:35,880 --> 00:34:37,800
Retrieval handed the model weak signals.

718
00:34:37,800 --> 00:34:42,200
It received contradictory documents, stale information, and exploratory content mixed

719
00:34:42,200 --> 00:34:43,800
with authoritative guidance.

720
00:34:43,800 --> 00:34:48,160
A lesser system would produce a confused or broken answer, but the model doesn't do that.

721
00:34:48,160 --> 00:34:51,680
It produces something fluent, confident, and completely wrong.

722
00:34:51,680 --> 00:34:53,440
This is exactly what the model was optimized for.

723
00:34:53,440 --> 00:34:55,040
It was built for fluency and coherence.

724
00:34:55,040 --> 00:34:58,560
It was designed to produce natural language that sounds like it came from someone who knows

725
00:34:58,560 --> 00:34:59,560
the answer.

726
00:34:59,560 --> 00:35:02,960
During RLHF, you reinforced responses that sounded authoritative.

727
00:35:02,960 --> 00:35:05,200
You rewarded completion over accuracy.

728
00:35:05,200 --> 00:35:08,760
Now, when the model faces uncertainty and contradictory signals, it does what it was

729
00:35:08,760 --> 00:35:09,760
trained to do.

730
00:35:09,760 --> 00:35:12,040
It synthesizes something that sounds true.

731
00:35:12,040 --> 00:35:16,080
The hallucination becomes more convincing precisely because it is so well structured.

732
00:35:16,080 --> 00:35:19,640
A confused answer would be obviously wrong and the user would notice it immediately.

733
00:35:19,640 --> 00:35:23,120
They would see the mess, they would dig deeper, and they would verify the facts.

734
00:35:23,120 --> 00:35:27,520
But a coherent, wrong answer that uses proper paragraph structure passes the plausibility

735
00:35:27,520 --> 00:35:28,520
test.

736
00:35:28,520 --> 00:35:31,600
The user reads it and thinks it sounds right because it actually does sound right.

737
00:35:31,600 --> 00:35:35,080
The model is phenomenally good at producing text that feels authoritative.

738
00:35:35,080 --> 00:35:38,280
Citation fabrication is the mechanism that makes this work at scale.

739
00:35:38,280 --> 00:35:41,600
The model doesn't just generate an answer, it generates citations to back it up.

740
00:35:41,600 --> 00:35:45,640
It says according to Document X and the user sees that and thinks the claim has been verified.

741
00:35:45,640 --> 00:35:47,840
They see a source and assume the source was checked.

742
00:35:47,840 --> 00:35:49,560
They assume the claim is real.

743
00:35:49,560 --> 00:35:50,880
But here is what actually happened.

744
00:35:50,880 --> 00:35:55,240
The model generated the claim based on a synthesis of multiple weak signals.

745
00:35:55,240 --> 00:35:59,680
Then it looked at its retrieval set and found a document that could plausibly be cited as support.

746
00:35:59,680 --> 00:36:03,400
Maybe Document 3 contains a sentence that is similar in topic and for the model that is

747
00:36:03,400 --> 00:36:04,400
enough.

748
00:36:04,400 --> 00:36:05,400
It cites it.

749
00:36:05,400 --> 00:36:08,240
The citation is real and the document exists but the document doesn't actually support

750
00:36:08,240 --> 00:36:10,920
the claim the way the citation implies.

751
00:36:10,920 --> 00:36:15,040
Or worse, the model cites a document that was in the retrieval set but wasn't actually

752
00:36:15,040 --> 00:36:16,600
used to generate the claim.

753
00:36:16,600 --> 00:36:20,400
The model received 10 documents, 7 of which contradicted the claim it made.

754
00:36:20,400 --> 00:36:23,440
Three were ambiguous, the model ignored the contradictory ones and synthesized from

755
00:36:23,440 --> 00:36:24,760
the ambiguous ones.

756
00:36:24,760 --> 00:36:28,360
Then when it needed to cite something, it picked one of the ambiguous documents.

757
00:36:28,360 --> 00:36:31,320
It looks like verification but it is actually fabrication.

758
00:36:31,320 --> 00:36:35,880
The document was available but the model just didn't use it the way the citation suggests.

759
00:36:35,880 --> 00:36:40,320
The plausible wrong answer problem is harder to catch than obvious errors because it passes

760
00:36:40,320 --> 00:36:42,000
through multiple mental filters.

761
00:36:42,000 --> 00:36:44,240
The user reads it and it sounds right to them.

762
00:36:44,240 --> 00:36:48,280
They share it with a colleague and the colleague reads it and thinks it sounds right too.

763
00:36:48,280 --> 00:36:49,800
Nobody thinks to verify the details.

764
00:36:49,800 --> 00:36:53,280
The answer has already been accepted because it is coherent and cited.

765
00:36:53,280 --> 00:36:58,240
If the model generated something incoherent like the policy is Wednesday and also 17, you

766
00:36:58,240 --> 00:37:00,680
would catch it immediately as obvious nonsense.

767
00:37:00,680 --> 00:37:05,160
But a sentence like "the policy allows for extensions in cases of documented hardship

768
00:37:05,160 --> 00:37:07,880
and must be approved by the department head" sounds real.

769
00:37:07,880 --> 00:37:10,800
It is specific, it uses proper nouns and it describes a process.

770
00:37:10,800 --> 00:37:12,720
It could be true so nobody questions it.

771
00:37:12,720 --> 00:37:17,000
Users trust the system because it sounds authoritative and that is the fundamental problem.

772
00:37:17,000 --> 00:37:19,960
You have built a system that is trained to sound certain.

773
00:37:19,960 --> 00:37:22,960
You have given it no guardrails and fed it contradictory information.

774
00:37:22,960 --> 00:37:26,600
Now it produces confident answers that employees cannot distinguish from the truth.

775
00:37:26,600 --> 00:37:29,480
The authority is coming from the tone, not from the substance.

776
00:37:29,480 --> 00:37:34,000
A user gets an answer about benefits eligibility that is presented clearly and formatted well.

777
00:37:34,000 --> 00:37:37,960
The user reads it and thinks the system checked the actual policies to give them the answer.

778
00:37:37,960 --> 00:37:39,560
The system actually did the opposite.

779
00:37:39,560 --> 00:37:42,080
It received conflicting guidance and made something up.

780
00:37:42,080 --> 00:37:45,000
But the delivery and confidence are identical to a real answer.

781
00:37:45,000 --> 00:37:47,960
From the user's perspective they just got verified information.

782
00:37:47,960 --> 00:37:49,920
The downstream damage is immediate.

783
00:37:49,920 --> 00:37:52,920
Decisions begin to cascade from this fabricated evidence.

784
00:37:52,920 --> 00:37:57,520
Someone reads the co-pilot answer about deadline extensions and tells their manager they need one.

785
00:37:57,520 --> 00:38:00,360
The manager trusts the employees understanding and approves it.

786
00:38:00,360 --> 00:38:05,240
Later they discover the policy was quoted incorrectly and the extension shouldn't have been granted.

787
00:38:05,240 --> 00:38:10,680
But it was already done based on information that came from a fabricated synthesis presented with false authority.

788
00:38:10,680 --> 00:38:14,160
This damage multiplies across the organization as decisions compound.

789
00:38:14,160 --> 00:38:18,160
Trust erodes gradually until one big incident makes the whole system untenable.

790
00:38:18,160 --> 00:38:19,760
And this is where it gets dangerous.

791
00:38:19,760 --> 00:38:20,960
The compliance trap.

792
00:38:20,960 --> 00:38:25,560
Somewhere in your company a high stakes decision is being made right now based on a co-pilot answer.

793
00:38:25,560 --> 00:38:28,240
The person making it might not even realize what's happening.

794
00:38:28,240 --> 00:38:31,200
They read the response it sounds authoritative and it cites sources.

795
00:38:31,200 --> 00:38:34,000
They use that information to document a choice and record a result.

796
00:38:34,000 --> 00:38:36,400
Life moves on until it doesn't.

797
00:38:36,400 --> 00:38:39,320
In regulated industries like healthcare, finance or legal,

798
00:38:39,320 --> 00:38:40,920
a hallucination isn't just a glitch.

799
00:38:40,920 --> 00:38:42,920
It's a liability.

800
00:38:42,920 --> 00:38:46,600
When a co-pilot answer influences a medical diagnosis or a loan approval,

801
00:38:46,600 --> 00:38:53,480
you've moved past the AI got confused stage and into the territory where regulators and lawyers start measuring damage in settlements.

802
00:38:53,480 --> 00:38:55,280
The mechanical failure here is subtle.

803
00:38:55,280 --> 00:38:59,280
Your audit trails will show that co-pilot was used, what question was asked,

804
00:38:59,280 --> 00:39:00,680
and what answer was generated.

805
00:39:00,680 --> 00:39:04,080
They show that a human read the text and acted on it.

806
00:39:04,080 --> 00:39:06,560
What those trails don't show is whether the answer was actually true.

807
00:39:06,560 --> 00:39:11,120
Imagine you're sitting in a compliance review and an auditor asks who made a specific decision.

808
00:39:11,120 --> 00:39:12,800
You tell them it was co-pilot.

809
00:39:12,800 --> 00:39:15,440
When they ask for evidence, you pull up the transcript.

810
00:39:15,440 --> 00:39:18,480
You show them the question, the answer and the citation.

811
00:39:18,480 --> 00:39:20,600
On the surface, the documentation looks perfect.

812
00:39:20,600 --> 00:39:21,680
The trail is clear.

813
00:39:21,680 --> 00:39:23,280
But the answer was a hallucination.

814
00:39:23,280 --> 00:39:24,760
The citation was fabricated.

815
00:39:24,760 --> 00:39:27,480
The source document doesn't actually support the claim being made.

816
00:39:27,480 --> 00:39:31,400
The auditor doesn't know this yet because they see a documented decision with a clear source.

817
00:39:31,400 --> 00:39:33,160
To them, the process looks compliant.

818
00:39:33,160 --> 00:39:37,240
You followed the procedure, use the approved tool and acted on the output.

819
00:39:37,240 --> 00:39:41,560
The real problem surfaces weeks or months later when someone finally challenges that decision.

820
00:39:41,560 --> 00:39:45,680
They pull the original source document and realize the policy contradicts what co-pilot said.

821
00:39:45,680 --> 00:39:51,080
Now you have an audit trail that proves you made a decision based on information that violated your own company policy.

822
00:39:51,080 --> 00:39:52,880
The AI made me do it defense.

823
00:39:52,880 --> 00:39:54,720
Will collapse the moment you try to use it.

824
00:39:54,720 --> 00:39:59,280
You are responsible for the tool you chose and the architecture you deployed without validation.

825
00:39:59,280 --> 00:40:00,960
Blaming the AI doesn't absolve you of anything.

826
00:40:00,960 --> 00:40:06,200
It actually makes the situation worse because you knowingly put an unvetted system in charge of a sensitive decision.

827
00:40:06,200 --> 00:40:08,600
Regulatory exposure is now a concrete reality.

828
00:40:08,600 --> 00:40:14,080
In healthcare, using an LLM generated diagnosis without verification violates the standard of care.

829
00:40:14,080 --> 00:40:18,440
In finance, using an AI credit assessment without checking the facts might breach lending laws.

830
00:40:18,440 --> 00:40:23,680
In the legal world, giving advice based on hallucinated case citations is a fast track to a malpractice suit.

831
00:40:23,680 --> 00:40:29,480
Regulations didn't anticipate AI but they definitely anticipated liability for decisions made on faulty data.

832
00:40:29,480 --> 00:40:33,800
The reputational damage hits the hardest when your customers realize the system is unreliable.

833
00:40:33,800 --> 00:40:37,920
You sold co-pilot as a way to get faster decisions and better information access.

834
00:40:37,920 --> 00:40:41,520
Then someone discovers they were given wrong information and suffered the consequences.

835
00:40:41,520 --> 00:40:45,800
They tell others, words spreads, and the trust you build erodes faster than you can repair it.

836
00:40:45,800 --> 00:40:50,400
The system that was supposed to be your competitive advantage is now a signal of liability.

837
00:40:50,400 --> 00:40:54,520
Customers see you deploying AI that hallucinates and they start wondering what else you're doing wrong.

838
00:40:54,520 --> 00:40:59,560
If you didn't bother to validate co-pilot answers, they assume you aren't validating anything else either.

839
00:40:59,560 --> 00:41:03,000
The it's just a draft argument won't protect you when things go wrong.

840
00:41:03,000 --> 00:41:06,760
You might think you're safe because users know co-pilot is just a starting point.

841
00:41:06,760 --> 00:41:10,640
That's a fine expectation to have until a mistake happens.

842
00:41:10,640 --> 00:41:13,080
In that moment the draft framing disappears.

843
00:41:13,080 --> 00:41:15,040
You used it, you acted on it and it was wrong.

844
00:41:15,040 --> 00:41:17,560
Now it's not a draft anymore. It's the final decision.

845
00:41:17,560 --> 00:41:20,360
Regulators don't care about how you frame things internally.

846
00:41:20,360 --> 00:41:24,560
They care about the harm caused by a decision based on false information from your system.

847
00:41:24,560 --> 00:41:27,400
That is the only chain of events that matters to them.

848
00:41:27,400 --> 00:41:31,320
Now we need to look at how this actually happens within your specific architecture.

849
00:41:31,320 --> 00:41:32,840
The permission model failure.

850
00:41:32,840 --> 00:41:37,320
Your Microsoft 365 tenant has a permission model that has likely evolved over several years.

851
00:41:37,320 --> 00:41:41,440
It reflects a thousand different decisions made for a thousand different reasons.

852
00:41:41,440 --> 00:41:43,520
Most of those choices were pragmatic at the time,

853
00:41:43,520 --> 00:41:45,960
but many were mistakes that simply never got cleaned up.

854
00:41:45,960 --> 00:41:47,920
Co-pilot inherits every single one of them.

855
00:41:47,920 --> 00:41:51,200
This inheritance isn't just a metaphor, it is a literal technical reality.

856
00:41:51,200 --> 00:41:56,440
When co-pilot goes to retrieve documents, it uses the specific permission context of the user.

857
00:41:56,440 --> 00:42:01,480
It queries SharePoint as that user searches one drive as that user and accesses teams chats as that user.

858
00:42:01,480 --> 00:42:04,840
Whatever permission someone has picked up over their entire career at the company,

859
00:42:04,840 --> 00:42:07,520
co-pilot operates within those exact boundaries.

860
00:42:07,520 --> 00:42:09,640
But here's the problem that most architects miss.

861
00:42:09,640 --> 00:42:13,680
Those boundaries were designed for human behavior, not for AI retrieval at a massive scale.

862
00:42:13,680 --> 00:42:16,880
When a person accesses SharePoint manually, they have to navigate.

863
00:42:16,880 --> 00:42:20,000
They open a site, think about what they need and consciously click on documents.

864
00:42:20,000 --> 00:42:22,240
Their behavior is filtered by their own intention.

865
00:42:22,240 --> 00:42:26,720
A financial analyst might technically still have access to the HR site from a job they had five years ago,

866
00:42:26,720 --> 00:42:29,360
but they don't go there because they've moved on mentally.

867
00:42:29,360 --> 00:42:31,280
Co-pilot does not have an intention filter.

868
00:42:31,280 --> 00:42:34,360
It doesn't know the difference between your active role and your historical one.

869
00:42:34,360 --> 00:42:39,240
When you ask a question, it searches across every single thing you are allowed to see.

870
00:42:39,240 --> 00:42:43,160
It pulls from current sites and archived sites with the same level of priority.

871
00:42:43,160 --> 00:42:45,880
The permission boundary becomes a retrieval boundary,

872
00:42:45,880 --> 00:42:49,200
and that is much broader than what any human would ever access on their own.

873
00:42:49,200 --> 00:42:52,560
Oversharing in SharePoint becomes oversharing in co-pilot instantly.

874
00:42:52,560 --> 00:42:57,360
Your organization has sites with broad access, shared libraries, and cross-functional spaces.

875
00:42:57,360 --> 00:43:00,240
Some of this was intentional, but most of it just happened over time.

876
00:43:00,240 --> 00:43:03,520
A team needed to collaborate, they opened the access, the project ended,

877
00:43:03,520 --> 00:43:05,200
but the permissions were never revoked.

878
00:43:05,200 --> 00:43:08,320
Now 20 people have access to a folder meant for six.

879
00:43:08,320 --> 00:43:12,040
Co-pilot sees all 20 of those people as having equal rights to that data.

880
00:43:12,040 --> 00:43:13,600
When any of them asks a question,

881
00:43:13,600 --> 00:43:16,720
co-pilot retrieves information from that broadly accessible site.

882
00:43:16,720 --> 00:43:20,680
A site that was meant for a specific project now serves as a general data source

883
00:43:20,680 --> 00:43:23,120
for anyone who happens to ask a related question.

884
00:43:23,120 --> 00:43:26,320
This technical access trap is a core architectural flaw.

885
00:43:26,320 --> 00:43:30,400
Just because co-pilot can see a document doesn't mean it should be using it for your current task.

886
00:43:30,400 --> 00:43:34,640
A legal contract from a dead partnership is still technically accessible to anyone who worked on it.

887
00:43:34,640 --> 00:43:38,080
Co-pilot will retrieve that contract if it seems relevant to your question

888
00:43:38,080 --> 00:43:40,480
even if you are asking about a current deal.

889
00:43:40,480 --> 00:43:45,200
The old data confuses the answer because the system doesn't know the partnership is over.

890
00:43:45,200 --> 00:43:48,320
Identity propagation failures with service principles make this even worse.

891
00:43:48,320 --> 00:43:50,920
You've likely built agents that run as service principles

892
00:43:50,920 --> 00:43:54,040
to access multiple data sources on behalf of different users.

893
00:43:54,040 --> 00:43:57,000
But a service principle with permissions to 50 sharepoint sites

894
00:43:57,000 --> 00:44:01,160
becomes a retrieval source for all 50 sites every time someone uses that agent.

895
00:44:01,160 --> 00:44:05,000
The agent doesn't distinguish between sites based on what the user actually needs to see.

896
00:44:05,000 --> 00:44:06,760
It just grabs everything it can.

897
00:44:06,760 --> 00:44:09,720
Guest access sprawl creates one final failure point.

898
00:44:09,720 --> 00:44:13,160
External partners and vendors are added to sites for specific engagements.

899
00:44:13,160 --> 00:44:16,440
When the work ends, the guest access often stays active for years.

900
00:44:16,440 --> 00:44:19,000
Co-pilot retrieves data using those guest permissions,

901
00:44:19,000 --> 00:44:22,600
which means external data can start influencing your internal decisions.

902
00:44:22,600 --> 00:44:26,200
The result is that Co-pilot acts as a permission amplifier.

903
00:44:26,200 --> 00:44:28,600
It takes a model designed for intentional human access

904
00:44:28,600 --> 00:44:31,080
and applies it indiscriminately to every single request.

905
00:44:31,080 --> 00:44:33,960
It surfaces everything you can possibly touch at scale

906
00:44:33,960 --> 00:44:36,360
regardless of whether you actually meant to look at it.

907
00:44:36,360 --> 00:44:39,080
But permissions are only one part of the problem.

908
00:44:39,080 --> 00:44:40,600
The data quality spiral.

909
00:44:40,600 --> 00:44:43,080
The permission model was supposed to be your security boundary,

910
00:44:43,080 --> 00:44:46,040
but it failed because that boundary was simply too broad.

911
00:44:46,040 --> 00:44:49,560
While you're dealing with that, a second failure mode is running parallel to it

912
00:44:49,560 --> 00:44:52,120
and this one actually gets worse as time goes on.

913
00:44:52,120 --> 00:44:55,400
Data classification is supposed to solve the problems that permissions can't touch.

914
00:44:55,400 --> 00:44:59,880
You can't realistically fix every single permission or clean up decades of bad access decisions,

915
00:44:59,880 --> 00:45:01,320
but you can label your data.

916
00:45:01,320 --> 00:45:05,000
You can tag a document as confidential, draft, archived or current policy.

917
00:45:05,000 --> 00:45:06,920
If you had that classification in place,

918
00:45:06,920 --> 00:45:09,640
Co-pilot could retrieve information intelligently,

919
00:45:09,640 --> 00:45:11,720
even if your permission model is a mess.

920
00:45:11,720 --> 00:45:15,080
It would know to deprioritize a draft, exclude confidential material

921
00:45:15,080 --> 00:45:17,960
and rank current guidance way above a historical reference.

922
00:45:17,960 --> 00:45:20,600
But in reality, you haven't classified most of your data.

923
00:45:20,600 --> 00:45:23,480
Your organization probably has Microsoft Perview configured

924
00:45:23,480 --> 00:45:25,640
and you've likely created sensitivity labels

925
00:45:25,640 --> 00:45:28,120
like public, internal and confidential.

926
00:45:28,120 --> 00:45:30,520
On paper, your entire content estate should be labeled,

927
00:45:30,520 --> 00:45:33,080
but in practice, maybe 15% of it actually is.

928
00:45:33,080 --> 00:45:34,600
Everything else just sits there unmarked

929
00:45:34,600 --> 00:45:36,600
because documents get created without labels

930
00:45:36,600 --> 00:45:39,240
and existing files never get retroactively classified.

931
00:45:39,240 --> 00:45:40,440
Since there is no enforcement,

932
00:45:40,440 --> 00:45:43,800
the vast majority of your knowledge base exists in a classified vacuum.

933
00:45:43,800 --> 00:45:45,560
When Co-pilot sees an unlabeled document,

934
00:45:45,560 --> 00:45:48,120
it treats it exactly the same as a labeled one.

935
00:45:48,120 --> 00:45:51,080
There is no way to prioritize the right info or restrict the wrong info

936
00:45:51,080 --> 00:45:53,000
because the classification scheme only exists

937
00:45:53,000 --> 00:45:54,440
as a theoretical framework.

938
00:45:54,440 --> 00:45:56,600
Your actual data doesn't comply with the rules

939
00:45:56,600 --> 00:45:59,240
and while that mismatch might not seem like a big deal,

940
00:45:59,240 --> 00:46:02,440
it matters completely the moment Co-pilot starts retrieving.

941
00:46:02,440 --> 00:46:04,440
Stale content often outranks fresh content

942
00:46:04,440 --> 00:46:07,160
because nobody ever marked the old files as expired.

943
00:46:07,160 --> 00:46:09,800
Imagine your organization publishes a new benefits policy

944
00:46:09,800 --> 00:46:12,840
that is current, accurate and intended to be the authoritative source.

945
00:46:12,840 --> 00:46:14,840
Somewhere in your index, an old benefits guide

946
00:46:14,840 --> 00:46:16,680
from three years ago is still sitting there.

947
00:46:16,680 --> 00:46:19,000
Because that guide was never marked as archived

948
00:46:19,000 --> 00:46:21,400
and the new policy wasn't marked as the replacement,

949
00:46:21,400 --> 00:46:22,840
both files match the search query.

950
00:46:22,840 --> 00:46:26,760
The retrieval algorithm has no way to prefer the new one over the old one.

951
00:46:26,760 --> 00:46:29,400
The algorithm returns both results, the model receives both

952
00:46:29,400 --> 00:46:31,080
and then it synthesizes them together.

953
00:46:31,080 --> 00:46:33,320
The user ends up with guidance that mixes

954
00:46:33,320 --> 00:46:35,480
three-year-old information with current policy

955
00:46:35,480 --> 00:46:37,720
without any way of knowing which part is which.

956
00:46:37,720 --> 00:46:40,760
You never intended for that old guide to influence current answers

957
00:46:40,760 --> 00:46:42,360
but since you didn't mark it as old,

958
00:46:42,360 --> 00:46:44,680
Co-pilot treats it as a perfectly valid source.

959
00:46:44,680 --> 00:46:48,760
Duplicate documents create conflicting evidence at scale across your entire index.

960
00:46:48,760 --> 00:46:52,120
A single process document might exist in three different sharepoint sites

961
00:46:52,120 --> 00:46:55,160
because three separate teams needed to document the same workflow.

962
00:46:55,160 --> 00:46:58,040
Those teams wrote their versions separately and they evolved separately

963
00:46:58,040 --> 00:46:59,880
so now they actually contradict each other.

964
00:46:59,880 --> 00:47:03,080
All three are accessible and all three rank similarly in retrieval.

965
00:47:03,080 --> 00:47:07,160
So the model receives three authoritative looking sources that say different things.

966
00:47:07,160 --> 00:47:10,040
It isn't obvious which version is right when the dates are close

967
00:47:10,040 --> 00:47:12,200
and the authorship looks plausible for all of them.

968
00:47:12,200 --> 00:47:16,280
The model does exactly what you've incentivized it to do by synthesizing the information.

969
00:47:16,280 --> 00:47:19,640
It creates a hybrid version that attempts to honor all three sources

970
00:47:19,640 --> 00:47:22,440
and while that hybrid sounds coherent and authoritative,

971
00:47:22,440 --> 00:47:25,640
it's probably wrong because no single source actually supports it.

972
00:47:25,640 --> 00:47:28,760
Unknown content remains in your index indefinitely.

973
00:47:28,760 --> 00:47:31,560
Someone created a document years ago and then left the organization

974
00:47:31,560 --> 00:47:33,320
but since nobody owns it now,

975
00:47:33,320 --> 00:47:35,560
it isn't marked for deletion or archived.

976
00:47:35,560 --> 00:47:38,280
It just sits there even if it's outdated, obsolete,

977
00:47:38,280 --> 00:47:40,680
or directly contradicts your current guidance

978
00:47:40,680 --> 00:47:41,960
because it's still indexed,

979
00:47:41,960 --> 00:47:43,800
co-pilot will keep retrieving from it.

980
00:47:43,800 --> 00:47:46,360
When you have no lifecycle management, nothing ever gets cleaned up.

981
00:47:46,360 --> 00:47:49,000
You don't have a regular process for reviewing all documents

982
00:47:49,000 --> 00:47:51,800
or retention policies that automatically remove content.

983
00:47:51,800 --> 00:47:55,160
Without someone whose job is to archive or delete what's no longer relevant,

984
00:47:55,160 --> 00:47:56,840
the index just grows and grows.

985
00:47:56,840 --> 00:47:59,720
Old materials accumulate right alongside the new ones

986
00:47:59,720 --> 00:48:02,760
and the signal to noise ratio degrades every single day.

987
00:48:02,760 --> 00:48:06,040
The result is an index that becomes increasingly noisy over time.

988
00:48:06,040 --> 00:48:08,600
When you first deployed co-pilot, the corpus was manageable

989
00:48:08,600 --> 00:48:11,720
and the ratio of useful to useless information was still reasonable.

990
00:48:11,720 --> 00:48:14,360
As months pass, the organization keeps creating new documents

991
00:48:14,360 --> 00:48:15,880
while nobody deletes the old ones,

992
00:48:15,880 --> 00:48:17,160
so the ratio gets worse.

993
00:48:17,160 --> 00:48:18,920
Retrieval becomes less reliable,

994
00:48:18,920 --> 00:48:20,760
the model receives mixed signals

995
00:48:20,760 --> 00:48:22,440
and it starts to hallucinate more frequently

996
00:48:22,440 --> 00:48:23,640
because the signal is so weak.

997
00:48:23,640 --> 00:48:26,520
This isn't a one-time problem you can fix in the first month.

998
00:48:26,520 --> 00:48:28,520
This is a degenerative failure that compounds,

999
00:48:28,520 --> 00:48:31,240
meaning a system that worked okay at launch will work poorly

1000
00:48:31,240 --> 00:48:33,880
after six months and be totally broken after a year.

1001
00:48:33,880 --> 00:48:36,680
It will keep getting worse until you finally intervene.

1002
00:48:36,680 --> 00:48:40,040
And this is exactly where most organizations discover their mistake.

1003
00:48:40,040 --> 00:48:41,640
The agent governance collapse.

1004
00:48:41,640 --> 00:48:43,640
The data quality spiral is bad enough

1005
00:48:43,640 --> 00:48:46,280
when you're just dealing with the base co-pilot system.

1006
00:48:46,280 --> 00:48:47,480
Users ask questions,

1007
00:48:47,480 --> 00:48:50,520
the system pulls from a degrading corpus and hallucinations happen.

1008
00:48:50,520 --> 00:48:53,240
But you aren't stopping there because you're going to extend this architecture

1009
00:48:53,240 --> 00:48:54,520
into custom agents

1010
00:48:54,520 --> 00:48:57,640
and that is where the system becomes genuinely uncontrollable.

1011
00:48:57,640 --> 00:49:00,440
Your organization has likely started building custom agents

1012
00:49:00,440 --> 00:49:03,480
because different departments wanted specialized experiences.

1013
00:49:03,480 --> 00:49:06,120
Sales built an agent to answer customer questions.

1014
00:49:06,120 --> 00:49:07,800
HR built one for benefits,

1015
00:49:07,800 --> 00:49:09,400
finance built one for expenses,

1016
00:49:09,400 --> 00:49:11,560
and operations built one for IT.

1017
00:49:11,560 --> 00:49:15,320
Each of these agents inherits every single floor from the base system

1018
00:49:15,320 --> 00:49:16,840
but they do it in isolation.

1019
00:49:16,840 --> 00:49:19,400
They drift independently and they compound the failures.

1020
00:49:19,400 --> 00:49:22,360
The first major failure is architectural inheritance.

1021
00:49:22,360 --> 00:49:24,920
Each agent is essentially a co-pilot instance

1022
00:49:24,920 --> 00:49:26,440
that connects to data sources,

1023
00:49:26,440 --> 00:49:28,920
runs with permissions and retrieves from indexes.

1024
00:49:28,920 --> 00:49:31,080
Every decision you made for the base system,

1025
00:49:31,080 --> 00:49:33,720
like broad retrieval scope and minimal validation,

1026
00:49:33,720 --> 00:49:35,800
lives inside every agent you build,

1027
00:49:35,800 --> 00:49:39,000
the benefits agent doesn't have better governance than the main system.

1028
00:49:39,000 --> 00:49:42,120
It just has the same lack of governance applied to a smaller area.

1029
00:49:42,120 --> 00:49:43,480
Because the scope is narrower,

1030
00:49:43,480 --> 00:49:45,560
the failures are actually more concentrated.

1031
00:49:45,560 --> 00:49:49,000
The base co-pilot at least spreads its hallucinations across many different topics

1032
00:49:49,000 --> 00:49:51,560
but the benefits agent focuses entirely on benefits.

1033
00:49:51,560 --> 00:49:54,440
When it hallucinating in that specific domain,

1034
00:49:54,440 --> 00:49:56,360
it creates concentrated misinformation

1035
00:49:56,360 --> 00:49:59,720
where people are actually making real-life decisions based on what it says.

1036
00:49:59,720 --> 00:50:02,120
An employee's entire understanding of their eligibility

1037
00:50:02,120 --> 00:50:03,560
might come from this one agent

1038
00:50:03,560 --> 00:50:05,400
because it's positioned as the expert.

1039
00:50:05,400 --> 00:50:08,760
There are currently no approval workflows for new agents or connectors.

1040
00:50:08,760 --> 00:50:10,920
Someone in the organization decides they want an agent,

1041
00:50:10,920 --> 00:50:13,480
they request it, your team builds it and it gets deployed.

1042
00:50:13,480 --> 00:50:15,080
That is the entire process.

1043
00:50:15,080 --> 00:50:18,360
There is no gate to check if the data sources are appropriate,

1044
00:50:18,360 --> 00:50:20,360
no review to verify minimal permissions

1045
00:50:20,360 --> 00:50:22,440
and no checkpoint to validate the prompts.

1046
00:50:22,440 --> 00:50:25,400
The agent exists simply because someone wanted it to exist.

1047
00:50:25,400 --> 00:50:28,600
Agents can easily access data sources they were never meant to see.

1048
00:50:28,600 --> 00:50:30,920
The sales agent needs to access customer records

1049
00:50:30,920 --> 00:50:32,600
so it gets connected to dynamics

1050
00:50:32,600 --> 00:50:36,840
but that system also contains sensitive deal information and negotiation notes.

1051
00:50:36,840 --> 00:50:38,680
The agent isn't supposed to surface those details

1052
00:50:38,680 --> 00:50:40,600
but there is no enforcement to stop it.

1053
00:50:40,600 --> 00:50:43,160
Since the connector exists and the data is accessible,

1054
00:50:43,160 --> 00:50:45,080
nothing prevents the agent from retrieving it

1055
00:50:45,080 --> 00:50:47,400
when a customer asks a tricky question.

1056
00:50:47,400 --> 00:50:50,600
The operations agent needs to access the IT procedures database

1057
00:50:50,600 --> 00:50:52,600
but those procedures were written by humans

1058
00:50:52,600 --> 00:50:55,000
who often leave sensitive info behind.

1059
00:50:55,000 --> 00:50:56,760
Passwords get mentioned in documentation

1060
00:50:56,760 --> 00:50:58,600
and API keys appear in examples.

1061
00:50:58,600 --> 00:51:00,600
The agent shouldn't be surfacing those secrets

1062
00:51:00,600 --> 00:51:02,840
but there is no content filtering or classification

1063
00:51:02,840 --> 00:51:05,560
to prevent those sensitive strings from being retrieved.

1064
00:51:05,560 --> 00:51:07,400
The agent just accesses the database

1065
00:51:07,400 --> 00:51:09,320
and pulls whatever matches the query.

1066
00:51:09,320 --> 00:51:12,600
Version control doesn't exist for your prompts or system instructions.

1067
00:51:12,600 --> 00:51:15,400
The benefits agent was built with a specific system prompt

1068
00:51:15,400 --> 00:51:18,200
that was never committed to a repository or reviewed by a team.

1069
00:51:18,200 --> 00:51:19,800
When someone suggests a change,

1070
00:51:19,800 --> 00:51:22,280
there is no history to reference and no way to revert

1071
00:51:22,280 --> 00:51:24,600
if the change makes the agent perform worse.

1072
00:51:24,600 --> 00:51:25,880
Someone modifies the prompt,

1073
00:51:25,880 --> 00:51:27,400
the agent starts behaving differently

1074
00:51:27,400 --> 00:51:28,920
and nobody knows exactly what happened

1075
00:51:28,920 --> 00:51:30,200
because nothing was tracked.

1076
00:51:30,200 --> 00:51:33,080
Agents will drift over time because you aren't monitoring them.

1077
00:51:33,080 --> 00:51:35,320
The benefits agent might have answered questions well

1078
00:51:35,320 --> 00:51:36,840
when it was built three months ago

1079
00:51:36,840 --> 00:51:38,920
but your policies have changed twice since then

1080
00:51:38,920 --> 00:51:40,280
because the index wasn't refreshed

1081
00:51:40,280 --> 00:51:41,640
and the prompts weren't updated.

1082
00:51:41,640 --> 00:51:43,800
The behavior of the agent degraded gradually.

1083
00:51:43,800 --> 00:51:46,440
You don't have a dashboard showing that answer quality has declined

1084
00:51:46,440 --> 00:51:49,000
so the agent just quietly gets worse in the background.

1085
00:51:49,000 --> 00:51:50,600
The consequence is that hallucination

1086
00:51:50,600 --> 00:51:53,640
becomes a feature of every custom agent rather than a bug.

1087
00:51:53,640 --> 00:51:56,120
These agents are unstoppable hallucination machines

1088
00:51:56,120 --> 00:51:58,440
that are each tuned to a specific domain.

1089
00:51:58,440 --> 00:51:59,960
They drift without any oversight

1090
00:51:59,960 --> 00:52:02,760
and produce confident wrong answers to your most important questions.

1091
00:52:02,760 --> 00:52:05,800
Now we need to talk about how you actually measure this failure.

1092
00:52:05,800 --> 00:52:08,040
The hallucination metrics you're not tracking.

1093
00:52:08,040 --> 00:52:10,840
You've built a machine that produces confident wrong answers

1094
00:52:10,840 --> 00:52:13,720
and now you've deployed it across your entire organization.

1095
00:52:13,720 --> 00:52:15,400
People are using it every day to make decisions

1096
00:52:15,400 --> 00:52:16,600
but there is a massive problem

1097
00:52:16,600 --> 00:52:18,680
that nobody in your building has actually solved.

1098
00:52:18,680 --> 00:52:21,640
In reality, you have no idea how broken the system is.

1099
00:52:21,640 --> 00:52:24,280
Measurement requires making a choice about what actually matters

1100
00:52:24,280 --> 00:52:27,160
but most organizations just measure what is easy to count.

1101
00:52:27,160 --> 00:52:29,400
They track request volume, response latency

1102
00:52:29,400 --> 00:52:31,240
or the API cost per query.

1103
00:52:31,240 --> 00:52:34,120
These are infrastructure metrics that tell you the system is running

1104
00:52:34,120 --> 00:52:37,240
but they tell you absolutely nothing about whether the system is correct.

1105
00:52:37,240 --> 00:52:38,760
The metrics that actually matter

1106
00:52:38,760 --> 00:52:40,520
are the ones you aren't tracking at all.

1107
00:52:40,520 --> 00:52:42,200
It starts with the hallucination rate

1108
00:52:42,200 --> 00:52:43,560
which is the percentage of claims

1109
00:52:43,560 --> 00:52:45,480
in a response that have zero source support.

1110
00:52:45,480 --> 00:52:47,160
When the model generates an answer,

1111
00:52:47,160 --> 00:52:50,440
you have to break that answer down into individual factual claims.

1112
00:52:50,440 --> 00:52:53,480
You might have one sentence saying the policy allows extensions,

1113
00:52:53,480 --> 00:52:55,800
another saying the department head must approve them

1114
00:52:55,800 --> 00:52:59,080
and a third claiming the process takes five business days.

1115
00:52:59,080 --> 00:53:01,000
For every single one of those claims,

1116
00:53:01,000 --> 00:53:04,040
you need to check if it is supported by the retrieved context.

1117
00:53:04,040 --> 00:53:06,600
A significant percentage of those claims will be made up

1118
00:53:06,600 --> 00:53:10,760
but you aren't measuring this because you aren't breaking down answers or checking sources.

1119
00:53:10,760 --> 00:53:13,640
Instead, you're just counting how many people logged in today.

1120
00:53:13,640 --> 00:53:16,040
Citation accuracy is a different problem entirely.

1121
00:53:16,040 --> 00:53:17,400
The model doesn't just make claims,

1122
00:53:17,400 --> 00:53:19,480
it cites sources by saying something is true

1123
00:53:19,480 --> 00:53:21,160
according to a specific document.

1124
00:53:21,160 --> 00:53:23,480
You should be measuring whether those citations

1125
00:53:23,480 --> 00:53:25,560
actually support the claims being made.

1126
00:53:25,560 --> 00:53:27,720
This means you have to pull the cited document

1127
00:53:27,720 --> 00:53:29,960
and read the passage to see if it actually says

1128
00:53:29,960 --> 00:53:31,560
what the model claims it says.

1129
00:53:31,560 --> 00:53:34,120
In most organizations, nobody does this systematically

1130
00:53:34,120 --> 00:53:35,640
because they see a citation

1131
00:53:35,640 --> 00:53:36,840
and assume it was verified.

1132
00:53:36,840 --> 00:53:37,560
It wasn't.

1133
00:53:37,560 --> 00:53:39,480
The citation might be totally fabricated

1134
00:53:39,480 --> 00:53:40,920
or the document might exist

1135
00:53:40,920 --> 00:53:42,360
but say something slightly different

1136
00:53:42,360 --> 00:53:43,800
than what the AI suggested.

1137
00:53:43,800 --> 00:53:46,280
Retrieval quality is the foundation for everything else

1138
00:53:46,280 --> 00:53:48,040
yet it is almost always ignored.

1139
00:53:48,040 --> 00:53:50,600
You can measure precision to see how many retrieve documents

1140
00:53:50,600 --> 00:53:51,640
were actually relevant

1141
00:53:51,640 --> 00:53:53,000
and you can measure recall to see

1142
00:53:53,000 --> 00:53:55,160
how many relevant documents the system missed.

1143
00:53:55,160 --> 00:53:56,920
Most organizations don't measure either

1144
00:53:56,920 --> 00:53:59,480
because they assume that if the search returned results

1145
00:53:59,480 --> 00:54:00,680
the retrieval worked.

1146
00:54:00,680 --> 00:54:02,120
But here is the problem.

1147
00:54:02,120 --> 00:54:03,880
Search can be completely broken

1148
00:54:03,880 --> 00:54:05,720
while appearing to function perfectly.

1149
00:54:05,720 --> 00:54:09,560
Grounding failure happens when answers directly contradict

1150
00:54:09,560 --> 00:54:10,840
the retrieved context.

1151
00:54:10,840 --> 00:54:11,880
This usually occurs

1152
00:54:11,880 --> 00:54:13,720
when the model receives conflicting documents

1153
00:54:13,720 --> 00:54:15,480
and resolves that conflict the wrong way

1154
00:54:15,480 --> 00:54:16,840
or when it synthesizes information

1155
00:54:16,840 --> 00:54:18,760
beyond what any document actually says.

1156
00:54:18,760 --> 00:54:20,840
To catch this, you have to run a painstaking test

1157
00:54:20,840 --> 00:54:23,400
to see if the answer faithfully represents the sources.

1158
00:54:23,400 --> 00:54:25,640
You have to read the answer, read the sources

1159
00:54:25,640 --> 00:54:27,480
and verify the claims manually.

1160
00:54:27,480 --> 00:54:29,320
Since this is slow and expensive work

1161
00:54:29,320 --> 00:54:30,920
nobody is doing it at scale.

1162
00:54:30,920 --> 00:54:33,240
Confidence misalignment is the most insidious metric

1163
00:54:33,240 --> 00:54:35,480
because it is completely invisible to the user.

1164
00:54:35,480 --> 00:54:38,440
The model expresses confidence through an authoritative tone,

1165
00:54:38,440 --> 00:54:40,280
clear structures and specific details

1166
00:54:40,280 --> 00:54:42,360
but this confidence has nothing to do with being right.

1167
00:54:42,360 --> 00:54:45,160
A hallucinated answer can sound completely certain

1168
00:54:45,160 --> 00:54:47,880
while an honest, I don't know, sounds weak and uncertain.

1169
00:54:47,880 --> 00:54:50,760
Users end up trusting the model's tone

1170
00:54:50,760 --> 00:54:52,120
rather than its accuracy.

1171
00:54:52,120 --> 00:54:53,720
You would measure this by having humans rate

1172
00:54:53,720 --> 00:54:55,400
how confident an answer sounds

1173
00:54:55,400 --> 00:54:57,240
and comparing that to the actual facts.

1174
00:54:57,240 --> 00:54:59,320
Answers that sound certain but are factually wrong

1175
00:54:59,320 --> 00:55:02,040
represent the maximum level of danger for your company.

1176
00:55:02,040 --> 00:55:04,120
Why do these metrics always go missing?

1177
00:55:04,120 --> 00:55:06,920
It's because measuring them requires human judgment.

1178
00:55:06,920 --> 00:55:09,400
You cannot count hallucinations with a simple algorithm

1179
00:55:09,400 --> 00:55:11,000
so you need a person to read the answer

1180
00:55:11,000 --> 00:55:12,840
and verify the claims against the sources.

1181
00:55:12,840 --> 00:55:14,600
That process is expensive and slow

1182
00:55:14,600 --> 00:55:16,600
which is why organizations stick to metrics

1183
00:55:16,600 --> 00:55:18,600
that machines can compute automatically.

1184
00:55:18,600 --> 00:55:20,600
But this is where the situation gets much worse.

1185
00:55:20,600 --> 00:55:23,000
The silent failure when nobody notices.

1186
00:55:23,000 --> 00:55:25,400
The most dangerous way to deploy a hallucination machine

1187
00:55:25,400 --> 00:55:27,080
is to let it work quietly.

1188
00:55:27,080 --> 00:55:28,920
You aren't looking for a system that crashes

1189
00:55:28,920 --> 00:55:30,920
or produces obviously crazy answers

1190
00:55:30,920 --> 00:55:32,440
that users reject immediately.

1191
00:55:32,440 --> 00:55:34,920
The real danger is a system that produces answers

1192
00:55:34,920 --> 00:55:36,440
that are only slightly wrong

1193
00:55:36,440 --> 00:55:39,320
but sound plausible enough that nobody bothers to verify them.

1194
00:55:39,320 --> 00:55:43,080
Users trust fluent answers without checking the facts

1195
00:55:43,080 --> 00:55:44,600
and that is the operational reality

1196
00:55:44,600 --> 00:55:46,280
of deploying co-pilot at scale.

1197
00:55:46,280 --> 00:55:48,680
When a user gets a well-written answer with citations

1198
00:55:48,680 --> 00:55:50,440
they read it once and move on.

1199
00:55:50,440 --> 00:55:52,040
They don't pull up the original documents

1200
00:55:52,040 --> 00:55:53,000
to verify the claims

1201
00:55:53,000 --> 00:55:54,520
because the whole point of having an AI

1202
00:55:54,520 --> 00:55:56,440
is to stop doing that manual work.

1203
00:55:56,440 --> 00:55:58,600
But the system never actually checked the work.

1204
00:55:58,600 --> 00:56:01,160
The citation is there but the verification is missing

1205
00:56:01,160 --> 00:56:03,640
and the user is now relying on an implicit promise

1206
00:56:03,640 --> 00:56:05,400
that no one ever actually made.

1207
00:56:05,400 --> 00:56:07,160
When a benefits agent tells an employee

1208
00:56:07,160 --> 00:56:08,840
they aren't eligible for a program

1209
00:56:08,840 --> 00:56:10,840
that employee might just accept the answer.

1210
00:56:10,840 --> 00:56:12,200
They don't know they should contest it

1211
00:56:12,200 --> 00:56:13,560
because the AI said no

1212
00:56:13,560 --> 00:56:15,960
and they believe the AI was looking at the right data.

1213
00:56:15,960 --> 00:56:17,720
Hallucinations accumulate quietly

1214
00:56:17,720 --> 00:56:20,200
until a major incident finally occurs.

1215
00:56:20,200 --> 00:56:22,760
In month one the system is new and people are skeptical

1216
00:56:22,760 --> 00:56:24,280
so they verify a few answers

1217
00:56:24,280 --> 00:56:25,640
and find that they mostly check out.

1218
00:56:25,640 --> 00:56:27,560
By month two usage expands

1219
00:56:27,560 --> 00:56:28,600
and people stop verifying

1220
00:56:28,600 --> 00:56:31,400
because the system has earned a little bit of unearned credibility.

1221
00:56:31,400 --> 00:56:34,120
By month three small hallucinations are happening every day

1222
00:56:34,120 --> 00:56:35,400
like a misinterpreted policy

1223
00:56:35,400 --> 00:56:36,600
or a slightly off-sitation

1224
00:56:36,600 --> 00:56:39,000
but nothing is big enough to break the user's trust.

1225
00:56:39,000 --> 00:56:40,440
Then you hit month five or six

1226
00:56:40,440 --> 00:56:43,080
and someone makes a massive decision based on a co-pilot answer

1227
00:56:43,080 --> 00:56:44,360
that is fundamentally wrong.

1228
00:56:44,360 --> 00:56:46,760
They only discover the error when the consequences arrive

1229
00:56:46,760 --> 00:56:49,000
and the mess requires a massive cleanup.

1230
00:56:49,000 --> 00:56:51,800
Suddenly the credibility of the entire system collapses

1231
00:56:51,800 --> 00:56:53,320
people go back to check old answers

1232
00:56:53,320 --> 00:56:55,080
and realize the hallucinations were always there

1233
00:56:55,080 --> 00:56:57,160
but nobody caught them because nobody was looking.

1234
00:56:57,160 --> 00:57:00,680
The nobody complained trap is how leadership rationalizes doing nothing.

1235
00:57:00,680 --> 00:57:02,520
The theory is that if the system were broken

1236
00:57:02,520 --> 00:57:03,880
people would be screaming about it.

1237
00:57:03,880 --> 00:57:06,440
In practice users don't know they should complain

1238
00:57:06,440 --> 00:57:08,680
because the answers sound perfectly right.

1239
00:57:08,680 --> 00:57:10,760
They lack the deep knowledge to spot a subtle lie

1240
00:57:10,760 --> 00:57:12,200
and the consequences of the error

1241
00:57:12,200 --> 00:57:13,480
might not show up for months.

1242
00:57:13,480 --> 00:57:15,640
If someone gets bad information about their benefits

1243
00:57:15,640 --> 00:57:16,600
they won't know it's wrong

1244
00:57:16,600 --> 00:57:18,600
until they actually try to use that benefit.

1245
00:57:18,600 --> 00:57:21,880
This organizational silence is almost always interpreted as approval.

1246
00:57:21,880 --> 00:57:23,240
You aren't tracking correctness

1247
00:57:23,240 --> 00:57:25,880
so you assume that silence equals satisfaction.

1248
00:57:25,880 --> 00:57:27,720
The system could be lying on every fifth answer

1249
00:57:27,720 --> 00:57:28,680
and you wouldn't have a clue

1250
00:57:28,680 --> 00:57:30,280
because users aren't filing IT tickets

1251
00:57:30,280 --> 00:57:32,040
about a sentence that sounds reasonable.

1252
00:57:32,040 --> 00:57:33,320
Audit logs show you usage

1253
00:57:33,320 --> 00:57:34,920
but they never show you correctness.

1254
00:57:34,920 --> 00:57:38,120
You have perfect records of when someone accessed co-pilot,

1255
00:57:38,120 --> 00:57:40,280
what they asked and what the timestamp was.

1256
00:57:40,280 --> 00:57:42,520
You have the user IDs and the generated text

1257
00:57:42,520 --> 00:57:44,120
but you don't have a single data point

1258
00:57:44,120 --> 00:57:45,560
on whether the answer was right.

1259
00:57:45,560 --> 00:57:47,560
The logs shows the interaction occurred

1260
00:57:47,560 --> 00:57:50,200
but it doesn't show if the employee acted on faulty information

1261
00:57:50,200 --> 00:57:51,480
or if they verified the result.

1262
00:57:51,480 --> 00:57:54,360
This becomes a massive problem in a compliance situation.

1263
00:57:54,360 --> 00:57:56,680
An auditor can see that you deployed the system

1264
00:57:56,680 --> 00:57:57,800
and that people used it

1265
00:57:57,800 --> 00:57:59,880
so everything looks controlled on the surface.

1266
00:57:59,880 --> 00:58:01,560
What the auditor cannot see from those logs

1267
00:58:01,560 --> 00:58:03,640
is that the answers were factually incorrect.

1268
00:58:03,640 --> 00:58:05,080
The logs are honest about the activity

1269
00:58:05,080 --> 00:58:07,480
but the conclusions you draw from them are completely false.

1270
00:58:07,480 --> 00:58:09,240
System drift happens so gradually

1271
00:58:09,240 --> 00:58:11,400
that it eventually becomes catastrophic.

1272
00:58:11,400 --> 00:58:12,840
Your system degrades over time

1273
00:58:12,840 --> 00:58:15,560
as the data gets noisier and the retrieval gets weaker

1274
00:58:15,560 --> 00:58:19,080
but it happens slowly enough that nobody is shocked by the decline.

1275
00:58:19,080 --> 00:58:21,560
Co-pilot starts answering fewer questions perfectly

1276
00:58:21,560 --> 00:58:22,920
and more questions partially

1277
00:58:22,920 --> 00:58:25,560
and the organization just adapts to the lower quality.

1278
00:58:25,560 --> 00:58:27,720
Users might start verifying things more often

1279
00:58:27,720 --> 00:58:29,720
but they don't realize they are compensating

1280
00:58:29,720 --> 00:58:31,320
for a system that is failing.

1281
00:58:31,320 --> 00:58:33,160
The end result is that you have shipped a system

1282
00:58:33,160 --> 00:58:35,640
into production that is quietly failing every single day.

1283
00:58:35,640 --> 00:58:37,560
It is integrated into your workflows

1284
00:58:37,560 --> 00:58:39,720
and people are depending on it to do their jobs.

1285
00:58:39,720 --> 00:58:41,640
Decisions are being made based on its output

1286
00:58:41,640 --> 00:58:42,840
and yet nobody knows it's broken

1287
00:58:42,840 --> 00:58:45,320
because the failure is happening just below the surface.

1288
00:58:45,320 --> 00:58:46,600
It becomes the new normal.

1289
00:58:46,600 --> 00:58:49,000
Now let's talk about how you actually fix this.

1290
00:58:49,000 --> 00:58:50,600
Retrieval first governance.

1291
00:58:50,600 --> 00:58:51,720
You are going to fix this

1292
00:58:51,720 --> 00:58:54,440
but you won't do it by redesigning everything at once.

1293
00:58:54,440 --> 00:58:56,680
The path forward starts with a single diagnosis

1294
00:58:56,680 --> 00:58:59,640
and that means understanding what your retrieval system can actually see.

1295
00:58:59,640 --> 00:59:01,800
This is the point where you stop talking about the problem

1296
00:59:01,800 --> 00:59:02,920
and start measuring it.

1297
00:59:02,920 --> 00:59:04,280
You need to pull your audit logs

1298
00:59:04,280 --> 00:59:06,200
and run a query to see what happens.

1299
00:59:06,200 --> 00:59:07,960
Ask your co-pilot what it would show a user

1300
00:59:07,960 --> 00:59:09,400
who is asking about company benefits

1301
00:59:09,400 --> 00:59:11,400
and then capture the top 20 results.

1302
00:59:11,400 --> 00:59:12,360
Read through them carefully

1303
00:59:12,360 --> 00:59:14,520
to see what kind of mixture they represent.

1304
00:59:14,520 --> 00:59:16,600
You might find current policy documents

1305
00:59:16,600 --> 00:59:18,680
but you might also find historical guidance,

1306
00:59:18,680 --> 00:59:21,880
draft explorations, or personal notes filed in the wrong library.

1307
00:59:21,880 --> 00:59:23,400
You may even see external content

1308
00:59:23,400 --> 00:59:25,960
that has no business ranking alongside internal policy.

1309
00:59:25,960 --> 00:59:27,960
This audit is not about finding someone to blame

1310
00:59:27,960 --> 00:59:29,960
but it is about establishing a baseline.

1311
00:59:29,960 --> 00:59:33,320
You have to see your data exactly how co-pilot sees it

1312
00:59:33,320 --> 00:59:35,320
before you can control what it retrieves.

1313
00:59:35,320 --> 00:59:37,240
Most organizations have never actually done this

1314
00:59:37,240 --> 00:59:38,680
because they assume the system works

1315
00:59:38,680 --> 00:59:39,880
just because it gives an answer.

1316
00:59:39,880 --> 00:59:41,080
They have never stopped to examine

1317
00:59:41,080 --> 00:59:43,560
whether those responses are drawing from appropriate sources

1318
00:59:43,560 --> 00:59:45,160
or just grabbing whatever is nearby.

1319
00:59:45,160 --> 00:59:46,680
Once you see what is happening,

1320
00:59:46,680 --> 00:59:48,120
the first control you need to set up

1321
00:59:48,120 --> 00:59:50,280
is permission trimming at the moment of retrieval.

1322
00:59:50,280 --> 00:59:52,040
This should not happen afterward as a filter

1323
00:59:52,040 --> 00:59:54,280
that strips results before they reach the screen.

1324
00:59:54,280 --> 00:59:56,520
It has to happen at the very moment of the search.

1325
00:59:56,520 --> 00:59:58,760
This requires you to integrate your permission model

1326
00:59:58,760 --> 01:00:00,360
directly into the retrieval pipeline

1327
01:00:00,360 --> 01:00:01,880
so the system knows who is asking.

1328
01:00:01,880 --> 01:00:03,400
When a user asks a question,

1329
01:00:03,400 --> 01:00:05,080
the system should not search everything

1330
01:00:05,080 --> 01:00:06,600
and then filter the results.

1331
01:00:06,600 --> 01:00:07,800
It should build the search query

1332
01:00:07,800 --> 01:00:09,480
with permission constraints already inside it

1333
01:00:09,480 --> 01:00:11,640
ensuring only documents that specific user

1334
01:00:11,640 --> 01:00:13,960
can access are even eligible to be found.

1335
01:00:13,960 --> 01:00:15,560
This seems like an obvious step in theory

1336
01:00:15,560 --> 01:00:18,120
but in practice it is foreign to most deployments.

1337
01:00:18,120 --> 01:00:20,600
Usually the architecture keeps these concerns separate

1338
01:00:20,600 --> 01:00:22,040
where the search returns everything

1339
01:00:22,040 --> 01:00:24,200
and permissions are handled by a different layer.

1340
01:00:24,200 --> 01:00:26,360
But Ragh requires these two to work together.

1341
01:00:26,360 --> 01:00:28,680
Your retrieval layer must understand identity

1342
01:00:28,680 --> 01:00:31,000
and apply authorization as part of the ranking logic

1343
01:00:31,000 --> 01:00:32,520
instead of waiting until the end.

1344
01:00:32,520 --> 01:00:35,000
The next control you need is metadata filtering.

1345
01:00:35,000 --> 01:00:36,440
Your documents should have clear labels

1346
01:00:36,440 --> 01:00:37,880
like the date they were created

1347
01:00:37,880 --> 01:00:39,320
when they were last modified

1348
01:00:39,320 --> 01:00:40,920
and what type of document they are.

1349
01:00:40,920 --> 01:00:43,080
You should also include sensitivity levels,

1350
01:00:43,080 --> 01:00:45,000
the owning team and the version number.

1351
01:00:45,000 --> 01:00:47,080
When co-pilot searches these fields should change

1352
01:00:47,080 --> 01:00:49,080
how it ranks and filters the results.

1353
01:00:49,080 --> 01:00:50,440
A document from three years ago

1354
01:00:50,440 --> 01:00:53,480
that hasn't been updated should rank lower than one from this month

1355
01:00:53,480 --> 01:00:56,520
and a draft should always rank below approved content.

1356
01:00:56,520 --> 01:00:59,240
A version that has been replaced should be left out entirely.

1357
01:00:59,240 --> 01:01:00,680
You aren't removing these from the index

1358
01:01:00,680 --> 01:01:02,680
but you are changing how they are ranked

1359
01:01:02,680 --> 01:01:04,920
and whether they get to be part of the final set.

1360
01:01:04,920 --> 01:01:07,320
You also need to separate your authoritative sources

1361
01:01:07,320 --> 01:01:08,920
from your supporting material.

1362
01:01:08,920 --> 01:01:11,240
Not every document in your collection carries the same weight

1363
01:01:11,240 --> 01:01:12,680
and you need to recognize that.

1364
01:01:12,680 --> 01:01:14,920
Some documents are the absolute source of truth

1365
01:01:14,920 --> 01:01:17,080
like official policies or board level decisions

1366
01:01:17,080 --> 01:01:18,760
while others are just reference material

1367
01:01:18,760 --> 01:01:21,160
like case studies or industry analysis.

1368
01:01:21,160 --> 01:01:23,080
The retrieval system needs to know the difference

1369
01:01:23,080 --> 01:01:24,520
between these categories.

1370
01:01:24,520 --> 01:01:26,040
When a person asks about a policy

1371
01:01:26,040 --> 01:01:28,120
the system should focus on those authoritative sources

1372
01:01:28,120 --> 01:01:29,960
and only bring in reference material

1373
01:01:29,960 --> 01:01:32,520
if the query specifically needs more context.

1374
01:01:32,520 --> 01:01:34,360
This requires you to classify your data.

1375
01:01:34,360 --> 01:01:35,800
You have to mark what is authoritative

1376
01:01:35,800 --> 01:01:37,400
and what is just supporting information.

1377
01:01:37,400 --> 01:01:40,440
This is the heavy lifting of data governance that we skipped earlier

1378
01:01:40,440 --> 01:01:41,560
but now you are going to do it.

1379
01:01:41,560 --> 01:01:42,760
You don't have to do it for everything

1380
01:01:42,760 --> 01:01:44,600
because that would be too much work.

1381
01:01:44,600 --> 01:01:46,520
Start with the critical areas like benefits,

1382
01:01:46,520 --> 01:01:48,680
finance, legal and IT procedures.

1383
01:01:48,680 --> 01:01:51,080
Focus on the material that actually drives decisions

1384
01:01:51,080 --> 01:01:53,240
classify it as a source or a reference

1385
01:01:53,240 --> 01:01:55,640
and make sure that label can be read by the search engine.

1386
01:01:55,640 --> 01:01:57,880
Finally, you should build retrieval quality metrics

1387
01:01:57,880 --> 01:01:59,240
into your deployment gates.

1388
01:01:59,240 --> 01:02:01,480
Before you roll out any change to copilot

1389
01:02:01,480 --> 01:02:03,080
whether it is a new data source

1390
01:02:03,080 --> 01:02:04,280
or a different ranking method

1391
01:02:04,280 --> 01:02:06,840
you must test how it affects what is being found.

1392
01:02:06,840 --> 01:02:08,600
Run your audit queries again to verify

1393
01:02:08,600 --> 01:02:10,600
that the top results are still the right ones.

1394
01:02:10,600 --> 01:02:12,040
You need to measure precision

1395
01:02:12,040 --> 01:02:14,440
and recall on a set of gold standard data.

1396
01:02:14,440 --> 01:02:15,880
If these metrics don't improve

1397
01:02:15,880 --> 01:02:17,800
or at the very least stay the same,

1398
01:02:17,800 --> 01:02:19,400
you do not deploy the update.

1399
01:02:19,400 --> 01:02:22,040
The shift here is moving from letting the system see everything

1400
01:02:22,040 --> 01:02:23,720
to controlling what it is allowed to find.

1401
01:02:23,720 --> 01:02:25,240
You aren't deleting your documents

1402
01:02:25,240 --> 01:02:27,240
or removing information from the company.

1403
01:02:27,240 --> 01:02:30,840
You are simply constraining how aggressively copilot searches

1404
01:02:30,840 --> 01:02:34,040
and prioritizing quality sources over the surrounding noise.

1405
01:02:34,040 --> 01:02:35,320
You are building business logic

1406
01:02:35,320 --> 01:02:37,160
directly into the retrieval layer.

1407
01:02:37,160 --> 01:02:39,800
Instead of pretending you can fix the mistakes after they happen.

1408
01:02:39,800 --> 01:02:41,400
This is the foundation of the whole system.

1409
01:02:41,400 --> 01:02:43,000
You have to fix retrieval first

1410
01:02:43,000 --> 01:02:45,320
because everything else depends on whether the model is given

1411
01:02:45,320 --> 01:02:46,760
the right evidence to work with.

1412
01:02:46,760 --> 01:02:48,360
But even with perfect retrieval,

1413
01:02:48,360 --> 01:02:50,520
control alone is not enough.

1414
01:02:50,520 --> 01:02:52,600
Grounding as a constraint, not a feature.

1415
01:02:52,600 --> 01:02:55,000
Retrieval control ensures you have the right evidence

1416
01:02:55,000 --> 01:02:56,920
but grounding is what you do with that evidence

1417
01:02:56,920 --> 01:02:58,280
once the model has it.

1418
01:02:58,280 --> 01:03:01,000
This is where most organizations make a massive mistake.

1419
01:03:01,000 --> 01:03:02,440
They treat grounding like a feature

1420
01:03:02,440 --> 01:03:04,440
that improves performance when it's available

1421
01:03:04,440 --> 01:03:06,120
but you need to treat it as a constraint

1422
01:03:06,120 --> 01:03:08,120
that defines which answers are allowed.

1423
01:03:08,120 --> 01:03:10,200
This is an architectural shift you have to make.

1424
01:03:10,200 --> 01:03:11,880
Grounding is not an optional setting

1425
01:03:11,880 --> 01:03:14,520
that you turn on for hard questions and off for easy ones.

1426
01:03:14,520 --> 01:03:17,560
It is the boundary that separates a real answer from a hallucination.

1427
01:03:17,560 --> 01:03:19,000
If an answer doesn't have grounding,

1428
01:03:19,000 --> 01:03:21,320
it isn't a helpful suggestion or a rough draft.

1429
01:03:21,320 --> 01:03:23,960
It is a total violation of how the system is supposed to work.

1430
01:03:23,960 --> 01:03:26,280
You should start with explicit grounding requirements.

1431
01:03:26,280 --> 01:03:28,040
Your system prompt needs to have a rule

1432
01:03:28,040 --> 01:03:29,560
that cannot be negotiated,

1433
01:03:29,560 --> 01:03:32,520
telling the model to answer only using the context it was given.

1434
01:03:32,520 --> 01:03:35,880
Don't tell it to prefer the context or try to use the documents.

1435
01:03:35,880 --> 01:03:37,320
Make it unconditional.

1436
01:03:37,320 --> 01:03:39,640
When the model gets a query and a set of documents,

1437
01:03:39,640 --> 01:03:41,400
the answer must come from those pages.

1438
01:03:41,400 --> 01:03:44,120
If the answer cannot be built from what is right there in front of it,

1439
01:03:44,120 --> 01:03:45,880
the model should not build an answer at all.

1440
01:03:45,880 --> 01:03:47,080
This sounds like a simple rule

1441
01:03:47,080 --> 01:03:49,080
but it takes a lot of discipline to pull off.

1442
01:03:49,080 --> 01:03:50,920
These models are trained to be helpful

1443
01:03:50,920 --> 01:03:53,160
and being helpful usually means giving a full answer

1444
01:03:53,160 --> 01:03:55,400
even if the evidence is missing a few pieces.

1445
01:03:55,400 --> 01:03:57,400
The model has learned how to fill in the gaps

1446
01:03:57,400 --> 01:03:59,880
and use its own training to fix weak grounding.

1447
01:03:59,880 --> 01:04:01,480
You are asking it to kill that instinct.

1448
01:04:01,480 --> 01:04:02,920
You wanted to refuse to guess

1449
01:04:02,920 --> 01:04:05,720
and admit when the evidence isn't good enough to give a full response.

1450
01:04:05,720 --> 01:04:08,360
The way you make this happen is through citation requirements.

1451
01:04:08,360 --> 01:04:09,960
Every single fact the model claims

1452
01:04:09,960 --> 01:04:13,480
must be traceable back to a specific piece of the retrieved context.

1453
01:04:13,480 --> 01:04:15,320
This isn't just a style choice.

1454
01:04:15,320 --> 01:04:17,960
It is a structural requirement for the system.

1455
01:04:17,960 --> 01:04:20,280
The model should identify the claim it wants to make,

1456
01:04:20,280 --> 01:04:21,800
find the source that supports it

1457
01:04:21,800 --> 01:04:24,760
and verify that the source actually says what the model thinks it says.

1458
01:04:24,760 --> 01:04:27,480
If it can't find a direct link between a claim and a source,

1459
01:04:27,480 --> 01:04:28,440
it shouldn't say it.

1460
01:04:28,440 --> 01:04:30,200
This creates a necessary feedback loop.

1461
01:04:30,200 --> 01:04:32,120
The model learns that every sentence it writes

1462
01:04:32,120 --> 01:04:33,560
must have proof to back it up.

1463
01:04:33,560 --> 01:04:36,200
It can't just make assertions or combine sources

1464
01:04:36,200 --> 01:04:38,440
to create brand new claims that aren't there.

1465
01:04:38,440 --> 01:04:40,360
It can only restate and put together

1466
01:04:40,360 --> 01:04:42,520
what is explicitly written in the documents you gave it.

1467
01:04:42,520 --> 01:04:44,680
This makes the models vocabulary much smaller

1468
01:04:44,680 --> 01:04:46,040
but that is exactly what you want.

1469
01:04:46,040 --> 01:04:47,800
You want the words it uses to match the evidence

1470
01:04:47,800 --> 01:04:49,160
that is actually available.

1471
01:04:49,160 --> 01:04:50,840
You should also add confidence thresholds

1472
01:04:50,840 --> 01:04:52,920
as a gate before any response is shown.

1473
01:04:52,920 --> 01:04:54,680
Don't just ask for citations.

1474
01:04:54,680 --> 01:04:58,120
Make the model decide if it is actually sure enough to give an answer.

1475
01:04:58,120 --> 01:05:00,680
You can build this right into the generation process.

1476
01:05:00,680 --> 01:05:02,360
Before the system shows a response,

1477
01:05:02,360 --> 01:05:04,200
it should ask itself how confident it is

1478
01:05:04,200 --> 01:05:06,280
that the answer is correct based on the context.

1479
01:05:06,280 --> 01:05:08,760
If that confidence score falls below a certain level,

1480
01:05:08,760 --> 01:05:10,840
like 70%, the system should stop.

1481
01:05:10,840 --> 01:05:12,840
Instead of a guess, it returns a message saying

1482
01:05:12,840 --> 01:05:14,760
it doesn't have enough information to answer.

1483
01:05:14,760 --> 01:05:17,240
This is the specific tool that stops the model from talking

1484
01:05:17,240 --> 01:05:18,840
when the retrieval step has failed.

1485
01:05:18,840 --> 01:05:21,800
If the search only brought back weak or unrelated documents,

1486
01:05:21,800 --> 01:05:24,600
the model needs to recognize that the evidence is bad.

1487
01:05:24,600 --> 01:05:27,480
Confidence thresholds turn that recognition into a rule.

1488
01:05:27,480 --> 01:05:29,880
They force the system to be honest about poor evidence.

1489
01:05:29,880 --> 01:05:32,600
Instead of pretending that a bad document is good enough to use,

1490
01:05:32,600 --> 01:05:34,920
refusal patterns are another way to teach the model

1491
01:05:34,920 --> 01:05:37,400
that staying silent is often the right choice.

1492
01:05:37,400 --> 01:05:39,480
Instead of trying to generate something every time,

1493
01:05:39,480 --> 01:05:41,960
you train it to see which requests it should stay away from.

1494
01:05:41,960 --> 01:05:43,800
If the evidence is thin, it shouldn't answer.

1495
01:05:43,800 --> 01:05:45,560
If the question asks for an interpretation

1496
01:05:45,560 --> 01:05:47,880
that goes beyond the text, it shouldn't answer.

1497
01:05:47,880 --> 01:05:49,400
If the sources contradict each other

1498
01:05:49,400 --> 01:05:52,280
and there is no clear way to resolve them, it shouldn't answer.

1499
01:05:52,280 --> 01:05:54,920
The model needs to learn a vocabulary for saying no,

1500
01:05:54,920 --> 01:05:58,200
using phrases like "the provided documents don't address this"

1501
01:05:58,200 --> 01:06:01,640
or "I would need more information to answer this accurately".

1502
01:06:01,640 --> 01:06:04,200
You also need multi-layer validation to check every answer

1503
01:06:04,200 --> 01:06:05,560
before it leaves the system.

1504
01:06:05,560 --> 01:06:08,520
Once the model writes a response, it should pass through several checks.

1505
01:06:08,520 --> 01:06:10,680
You need to ask if every claim has a citation

1506
01:06:10,680 --> 01:06:13,400
and if that citation actually supports what was said.

1507
01:06:13,400 --> 01:06:15,800
You have to check if the response contradicts the documents

1508
01:06:15,800 --> 01:06:18,360
or if the model is making inferences that aren't there.

1509
01:06:18,360 --> 01:06:21,640
If any of these checks fail, the response is rejected or flagged

1510
01:06:21,640 --> 01:06:23,080
for a human to look at.

1511
01:06:23,080 --> 01:06:27,240
Moving from being helpful to being accurate is a fundamental change in how you think.

1512
01:06:27,240 --> 01:06:30,760
Being helpful without being accurate is a huge liability for a company.

1513
01:06:30,760 --> 01:06:33,560
Being accurate without being helpful might feel incomplete,

1514
01:06:33,560 --> 01:06:35,400
but it is much easier to defend.

1515
01:06:35,400 --> 01:06:36,920
You are choosing the safer path.

1516
01:06:36,920 --> 01:06:39,480
The system will sometimes tell a user it can't help

1517
01:06:39,480 --> 01:06:42,200
and that might be disappointing, but when it does give an answer,

1518
01:06:42,200 --> 01:06:44,840
that answer will be backed by evidence you can verify.

1519
01:06:44,840 --> 01:06:48,280
This approach based on constraints is what actually kills hallucinations.

1520
01:06:48,280 --> 01:06:51,880
A model cannot lie with confidence if it is forced to show its work with citations.

1521
01:06:51,880 --> 01:06:55,240
It cannot make up stories if it isn't allowed to make claims without proof.

1522
01:06:55,240 --> 01:06:58,040
It won't sound like an expert on topics it knows nothing about

1523
01:06:58,040 --> 01:07:00,680
if the confidence thresholds force it to admit its lost.

1524
01:07:00,680 --> 01:07:01,880
Now that we have the rules in place,

1525
01:07:01,880 --> 01:07:05,240
we need to look at the orchestration layer that actually enforces them.

1526
01:07:05,240 --> 01:07:07,080
Orchestration as the control plane.

1527
01:07:07,080 --> 01:07:09,880
The orchestration layer is where intention meets execution.

1528
01:07:09,880 --> 01:07:12,760
It sits between the user's request and the model's response

1529
01:07:12,760 --> 01:07:15,240
acting as the buffer between what the model suggests

1530
01:07:15,240 --> 01:07:16,600
and what the system actually does.

1531
01:07:16,600 --> 01:07:18,120
This is your control plane.

1532
01:07:18,120 --> 01:07:19,320
Everything else you've built,

1533
01:07:19,320 --> 01:07:22,680
the retrieval governance, the grounding requirements, the confidence thresholds.

1534
01:07:22,680 --> 01:07:25,800
It all exists only on paper until this layer makes it real.

1535
01:07:25,800 --> 01:07:27,880
Think of orchestration as a series of gates.

1536
01:07:27,880 --> 01:07:31,800
Each gate inspects the process and decides whether to let it move forward.

1537
01:07:31,800 --> 01:07:33,880
You have gates before the model receives a request

1538
01:07:33,880 --> 01:07:35,960
and gates after it generates an output.

1539
01:07:35,960 --> 01:07:39,240
There are gates before tools execute and gates at every single junction

1540
01:07:39,240 --> 01:07:40,520
where something could go wrong.

1541
01:07:40,520 --> 01:07:42,120
None of these gates trust the model.

1542
01:07:42,120 --> 01:07:44,120
They don't assume the process will work correctly,

1543
01:07:44,120 --> 01:07:46,200
so they verify every single step instead.

1544
01:07:46,200 --> 01:07:48,280
Input validation is your first gate.

1545
01:07:48,280 --> 01:07:49,880
When a user types a question,

1546
01:07:49,880 --> 01:07:51,800
that text doesn't go directly to the model.

1547
01:07:51,800 --> 01:07:55,800
It goes through sanitization first to determine the intent and check for manipulation.

1548
01:07:55,800 --> 01:07:59,320
The system looks for prompt injection patterns or sensitive information

1549
01:07:59,320 --> 01:08:00,600
that shouldn't be processed.

1550
01:08:00,600 --> 01:08:03,000
The orchestrator extracts the legitimate question,

1551
01:08:03,000 --> 01:08:04,600
redacts the sensitive parts,

1552
01:08:04,600 --> 01:08:08,360
and strips away adversarial instructions before the model ever sees a word.

1553
01:08:08,360 --> 01:08:11,800
This sounds obvious, but in reality most systems skip it entirely.

1554
01:08:11,800 --> 01:08:13,880
The user's text flows directly to the model

1555
01:08:13,880 --> 01:08:15,560
and that's your biggest vulnerability.

1556
01:08:15,560 --> 01:08:19,880
The orchestration layer stops that by acting as a hard boundary between untrusted input

1557
01:08:19,880 --> 01:08:20,760
and the model.

1558
01:08:20,760 --> 01:08:24,840
Output validation then checks what the model generated against your specific requirements.

1559
01:08:24,840 --> 01:08:27,880
The model produces an answer, the orchestrator receives it

1560
01:08:27,880 --> 01:08:30,280
and validation runs before that answer goes anywhere.

1561
01:08:30,280 --> 01:08:33,000
It checks if the answer has citations for every claim

1562
01:08:33,000 --> 01:08:36,840
and ensures those citations actually lead back to real source material.

1563
01:08:36,840 --> 01:08:39,080
The system looks for contradictions in the context or claims

1564
01:08:39,080 --> 01:08:40,840
that seem made up rather than grounded.

1565
01:08:40,840 --> 01:08:44,200
It checks compliance rules and verifies if the confidence level is high enough

1566
01:08:44,200 --> 01:08:46,200
to even show the response to a human.

1567
01:08:46,200 --> 01:08:49,560
If validation fails on any of these points, the answer doesn't get delivered.

1568
01:08:49,560 --> 01:08:52,200
It gets flagged, maybe it goes back for regeneration

1569
01:08:52,200 --> 01:08:54,600
or maybe it gets escalated to a person for review.

1570
01:08:54,600 --> 01:08:57,400
In some cases it's rejected with a simple message saying

1571
01:08:57,400 --> 01:08:59,560
the system can't confidently answer,

1572
01:08:59,560 --> 01:09:01,800
but it never circulates as if it's verified truth.

1573
01:09:01,800 --> 01:09:05,240
Policy enforcement ensures these answers follow your business rules.

1574
01:09:05,240 --> 01:09:07,000
Your organization has specific standards

1575
01:09:07,000 --> 01:09:09,800
like ensuring benefit questions only reference current policies

1576
01:09:09,800 --> 01:09:12,360
or financial guidance comes from approved sources.

1577
01:09:12,360 --> 01:09:14,280
The orchestration layer encodes these rules

1578
01:09:14,280 --> 01:09:16,200
and checks every answer against them.

1579
01:09:16,200 --> 01:09:18,360
A response that violates policy stays hidden,

1580
01:09:18,360 --> 01:09:20,760
regardless of how confident the model feels about it.

1581
01:09:20,760 --> 01:09:24,520
Tool authorization is what prevents the model from triggering actions it shouldn't.

1582
01:09:24,520 --> 01:09:27,160
If your co-pilot can send emails or update records,

1583
01:09:27,160 --> 01:09:30,040
the orchestration layer verifies the action before it happens.

1584
01:09:30,040 --> 01:09:31,800
It checks if the user has permission

1585
01:09:31,800 --> 01:09:34,280
and if the action matches what was actually asked for.

1586
01:09:34,280 --> 01:09:35,800
It verifies the target resource

1587
01:09:35,800 --> 01:09:38,520
and ensures the model didn't just hallucinate an instruction.

1588
01:09:38,520 --> 01:09:41,720
Only after this verification does the tool actually execute.

1589
01:09:41,720 --> 01:09:44,520
This is where orchestration differs from retrieval governance.

1590
01:09:44,520 --> 01:09:47,320
Retrieval governance controls what the model can see

1591
01:09:47,320 --> 01:09:49,800
but orchestration controls what the model can do.

1592
01:09:49,800 --> 01:09:51,720
It's the difference between limiting what goes in

1593
01:09:51,720 --> 01:09:54,040
and limiting what comes out both are mandatory.

1594
01:09:54,040 --> 01:09:58,360
The shift from trust the model to verify everything

1595
01:09:58,360 --> 01:10:00,360
is a change in your operational mindset.

1596
01:10:00,360 --> 01:10:02,440
Previously you likely assumed the model would work

1597
01:10:02,440 --> 01:10:04,200
and treated errors as exceptions.

1598
01:10:04,200 --> 01:10:06,200
Now you assume errors are the default state.

1599
01:10:06,200 --> 01:10:07,560
Verification is the baseline

1600
01:10:07,560 --> 01:10:10,360
and trust is something that must be earned after inspection.

1601
01:10:10,360 --> 01:10:12,520
This orchestration layer will add latency

1602
01:10:12,520 --> 01:10:14,280
because verification takes time.

1603
01:10:14,280 --> 01:10:17,800
Sanitization, validation and policy checks aren't free

1604
01:10:17,800 --> 01:10:19,480
so your response times will increase.

1605
01:10:19,480 --> 01:10:20,600
That's acceptable.

1606
01:10:20,600 --> 01:10:23,640
Latency is the price you pay for a system you can actually rely on

1607
01:10:23,640 --> 01:10:26,120
but the real control happens one level deeper.

1608
01:10:26,120 --> 01:10:28,920
At the data layer, data governance is the foundation.

1609
01:10:28,920 --> 01:10:30,680
The orchestration layer is the mechanism

1610
01:10:30,680 --> 01:10:32,920
but the data layer is the foundation it sits on

1611
01:10:32,920 --> 01:10:34,360
and here's the uncomfortable truth.

1612
01:10:34,360 --> 01:10:36,680
You can build perfect orchestration around terrible data

1613
01:10:36,680 --> 01:10:38,440
and you'll still produce hallucinations

1614
01:10:38,440 --> 01:10:41,480
but you can't build perfect data and survive without orchestration.

1615
01:10:41,480 --> 01:10:43,960
The hierarchy is clear data governance comes first.

1616
01:10:43,960 --> 01:10:46,360
This means you need classification before indexing

1617
01:10:46,360 --> 01:10:48,680
not after the fact and not as a someday project

1618
01:10:48,680 --> 01:10:49,720
when you have more budget.

1619
01:10:49,720 --> 01:10:52,120
Before a single document enters your co-pilot index

1620
01:10:52,120 --> 01:10:53,640
it needs a classification tag.

1621
01:10:53,640 --> 01:10:54,920
Is it current or archived?

1622
01:10:54,920 --> 01:10:56,440
Is it a draft or an approved version?

1623
01:10:56,440 --> 01:10:58,360
This classification lives as metadata

1624
01:10:58,360 --> 01:11:00,840
that travels with the document through the entire pipeline.

1625
01:11:00,840 --> 01:11:04,040
During retrieval that metadata decides what gets surfaced

1626
01:11:04,040 --> 01:11:05,800
during ranking it sets the priority

1627
01:11:05,800 --> 01:11:08,600
and during validation it defines which answers are allowed.

1628
01:11:08,600 --> 01:11:11,080
This requires real discipline at the point of creation.

1629
01:11:11,080 --> 01:11:13,000
When someone starts a document in SharePoint

1630
01:11:13,000 --> 01:11:14,760
they should have to classify it immediately.

1631
01:11:14,760 --> 01:11:17,640
Your templates, libraries and workflows must enforce this.

1632
01:11:17,640 --> 01:11:20,600
The document simply can't be published until it has a tag.

1633
01:11:20,600 --> 01:11:23,000
This feels heavy-handed to the employees in the moment

1634
01:11:23,000 --> 01:11:26,360
but it saves the entire organization from hallucinations later.

1635
01:11:26,360 --> 01:11:27,960
For the documents you already have

1636
01:11:27,960 --> 01:11:30,040
you'll need retroactive classification

1637
01:11:30,040 --> 01:11:31,320
this is painful manual work

1638
01:11:31,320 --> 01:11:33,000
you're reviewing years of old content

1639
01:11:33,000 --> 01:11:36,200
and deciding if it's current guidance or just a historical reference

1640
01:11:36,200 --> 01:11:38,440
you're tagging thousands of files with the metadata

1641
01:11:38,440 --> 01:11:40,840
that will eventually govern how co-pilot treats them

1642
01:11:40,840 --> 01:11:42,760
you can use machine learning to predict these tags

1643
01:11:42,760 --> 01:11:45,320
but you'll still need humans to validate the sensitive stuff

1644
01:11:45,320 --> 01:11:48,680
Lifecycle management then moves from a goal to an automated reality

1645
01:11:48,680 --> 01:11:50,360
once documents are classified

1646
01:11:50,360 --> 01:11:53,720
you need automation to enforce what those tags actually mean

1647
01:11:53,720 --> 01:11:56,440
a document marked as a draft should be auto-deleted

1648
01:11:56,440 --> 01:11:58,200
after six months of silence

1649
01:11:58,200 --> 01:12:00,520
an archived file should be pulled from the active index

1650
01:12:00,520 --> 01:12:02,040
the moment it superseded

1651
01:12:02,040 --> 01:12:06,040
if a policy is marked historical it should be excluded from retrieval

1652
01:12:06,040 --> 01:12:07,480
if a newer version exists

1653
01:12:07,480 --> 01:12:09,160
these rules have to execute automatically

1654
01:12:09,160 --> 01:12:11,720
so you don't rely on people to remember to clean up

1655
01:12:11,720 --> 01:12:13,880
this requires actual retention schedules

1656
01:12:13,880 --> 01:12:17,320
for every document type you have to define how long it stays in the index

1657
01:12:17,320 --> 01:12:19,800
benefits policies might stay current indefinitely

1658
01:12:19,800 --> 01:12:22,920
but project plans might be deleted the moment the project ends

1659
01:12:22,920 --> 01:12:25,080
meeting notes might only be relevant for a year

1660
01:12:25,080 --> 01:12:27,080
these schedules aren't just random dates

1661
01:12:27,080 --> 01:12:30,600
they're based on how long the information is actually useful for making decisions

1662
01:12:30,600 --> 01:12:33,160
permission hygiene is the absolute prerequisite here

1663
01:12:33,160 --> 01:12:36,360
you can classify perfectly but if your permissions are still a mess

1664
01:12:36,360 --> 01:12:38,280
classification won't save you

1665
01:12:38,280 --> 01:12:41,240
before you try to optimize governance you have to fix access

1666
01:12:41,240 --> 01:12:43,000
you need to reduce oversharing

1667
01:12:43,000 --> 01:12:46,280
strip-guest access and revoke permissions from people who don't need them

1668
01:12:46,280 --> 01:12:49,400
it's hard work that involves difficult conversations with site owners

1669
01:12:49,400 --> 01:12:52,920
but it's non-negotiable because co-pilot only operates within the boundaries you set

1670
01:12:52,920 --> 01:12:54,680
if those boundaries are too wide

1671
01:12:54,680 --> 01:12:58,520
metadata won't help you metadata enrichment goes much further than simple labels

1672
01:12:58,520 --> 01:13:01,480
every source document should carry details about its origin

1673
01:13:01,480 --> 01:13:04,120
like who created it and what source of truth it came from

1674
01:13:04,120 --> 01:13:06,920
if it's a summary you need to know which documents it was built from

1675
01:13:06,920 --> 01:13:08,920
if it's a policy you need to see who approved it

1676
01:13:08,920 --> 01:13:11,960
this metadata become searchable and changes how things are ranked

1677
01:13:11,960 --> 01:13:15,640
an official finance document will always outrank someone's personal notes

1678
01:13:15,640 --> 01:13:17,640
even if both seem relevant to the search

1679
01:13:17,640 --> 01:13:21,000
continuous monitoring is how you detect when data quality starts to slip

1680
01:13:21,000 --> 01:13:23,400
you need a baseline that you measure against regularly

1681
01:13:23,400 --> 01:13:26,520
is the ratio of current content still higher than the stale stuff

1682
01:13:26,520 --> 01:13:30,440
are new files being tagged correctly or the old ones being archived on time

1683
01:13:30,440 --> 01:13:33,480
monitoring catches the drift before it impacts the users

1684
01:13:33,480 --> 01:13:38,360
when the metrics drop you get in alert and you fix the problem before it turns into a hallucination

1685
01:13:38,360 --> 01:13:43,880
the shift from index everything to index only what's trustworthy is a philosophical one

1686
01:13:43,880 --> 01:13:46,600
you have to abandon the idea that more data is better

1687
01:13:46,600 --> 01:13:49,720
an index with less content but higher quality is always superior

1688
01:13:49,720 --> 01:13:53,640
you're building co-pilot around authoritative sources rather than just making everything available

1689
01:13:53,640 --> 01:13:57,880
it feels like you're losing something until you realize how much reliability you've gained

1690
01:13:57,880 --> 01:14:02,520
finally let's look at the continuous discipline that keeps this whole system running

1691
01:14:02,520 --> 01:14:05,960
continuous evaluation and drift detection you have fixed the foundation

1692
01:14:05,960 --> 01:14:10,200
rebuild the architecture and implemented governance that simply did not exist before

1693
01:14:10,200 --> 01:14:14,280
the system works better now but you have to realize that better is not a permanent state

1694
01:14:14,280 --> 01:14:18,440
better requires constant maintenance because the real work does not end at deployment

1695
01:14:18,440 --> 01:14:20,520
in reality that is exactly where it starts

1696
01:14:20,520 --> 01:14:25,720
the first thing you need to build is a golden data set this is a specific collection of test queries

1697
01:14:25,720 --> 01:14:30,200
where you already know the correct answers you do not need thousands of queries to make this work

1698
01:14:30,200 --> 01:14:35,960
start with 50 to 100 representative questions that capture the full scope of what your co-pilot actually does

1699
01:14:35,960 --> 01:14:39,960
every single question must have a clear right answer that you have verified yourself

1700
01:14:39,960 --> 01:14:43,400
you need to identify the exact source documents that support those answers

1701
01:14:43,400 --> 01:14:47,560
these test cases become your baseline and the only real measure of your system health

1702
01:14:47,560 --> 01:14:52,360
this data set is not something you use once and then put away you run it repeatedly to see how

1703
01:14:52,360 --> 01:14:56,040
the system is holding up you run it before you deploy any change to establish the current

1704
01:14:56,040 --> 01:15:00,120
baseline and then you run it again afterward to see if things improved or degraded

1705
01:15:00,120 --> 01:15:04,440
even when you are not changing anything you should run it monthly as a regression check

1706
01:15:04,440 --> 01:15:09,560
drift happens silently and a golden data set is the only way to catch problems before they reach

1707
01:15:09,560 --> 01:15:13,800
your users regression testing is the discipline you build around these data sets

1708
01:15:13,800 --> 01:15:18,680
when someone on the team proposes a change like a new data source or a modified prompt

1709
01:15:18,680 --> 01:15:23,320
you do not just push it live you run the golden data set first to measure precision recall

1710
01:15:23,320 --> 01:15:28,280
and citation accuracy you compare those results to your previous baseline to see if the change is

1711
01:15:28,280 --> 01:15:32,600
moving you in the right direction if the numbers drop you reject the change immediately if they

1712
01:15:32,600 --> 01:15:37,480
look good you verify that no new failure modes appeared in the edge cases only then does that change

1713
01:15:37,480 --> 01:15:41,640
go to production this process creates friction and it means change will move slower because every

1714
01:15:41,640 --> 01:15:46,520
update requires a full evaluation that friction is actually the point of the entire exercise you

1715
01:15:46,520 --> 01:15:50,840
are making a conscious choice to trade velocity for reliability you are deciding that deploying quickly

1716
01:15:50,840 --> 01:15:55,080
matters much less than deploying correctly this is the operational discipline that transforms a

1717
01:15:55,080 --> 01:16:00,600
co-pilot from a massive liability into a functional tool production monitoring is how you catch the

1718
01:16:00,600 --> 01:16:05,320
things your golden data set misses your test set is comprehensive but it is still limited by what you

1719
01:16:05,320 --> 01:16:10,200
can imagine real users will always ask questions in ways you did not anticipate they ask about

1720
01:16:10,200 --> 01:16:15,400
strange edge cases use different languages or bring up topics that intersect multiple domains at once

1721
01:16:15,400 --> 01:16:21,400
these real world queries generate patterns that a curated test set will never capture to fix this

1722
01:16:21,400 --> 01:16:25,400
you need to sample your production traffic take about two to five percent of your live queries

1723
01:16:25,400 --> 01:16:29,960
and run them through a human evaluation you need to check if the answer matches the sources

1724
01:16:29,960 --> 01:16:34,440
and if the citations are actually accurate this human sampling is what keeps your team connected

1725
01:16:34,440 --> 01:16:39,000
to real world quality drift alerts will notify you the moment something in the system changes

1726
01:16:39,000 --> 01:16:44,040
you should be monitoring four key metrics at all times hallucination rate citation accuracy

1727
01:16:44,040 --> 01:16:49,640
retrieval quality and response latency when any of these metrics shifts past a certain threshold

1728
01:16:49,640 --> 01:16:54,040
like a five percent increase in hallucinations the system needs to trigger an alert the system usually

1729
01:16:54,040 --> 01:16:58,840
does not break all at once instead it drifts maybe a new data source brought in poor quality content

1730
01:16:58,840 --> 01:17:03,000
or perhaps an embedding model updated without warning you might not know the cause immediately

1731
01:17:03,000 --> 01:17:07,080
but you will know something shifted that is enough to start an investigation before the drifts

1732
01:17:07,080 --> 01:17:12,040
spreads to the rest of the user base feedback loops are what finally close the circle your users are

1733
01:17:12,040 --> 01:17:17,000
going to find errors and when they do that information must feedback into the development cycle

1734
01:17:17,000 --> 01:17:21,240
when a user marks a response as incorrect that query should go straight into a feedback queue for

1735
01:17:21,240 --> 01:17:25,640
review your team needs to look at it and diagnose exactly what went wrong you have to ask if the

1736
01:17:25,640 --> 01:17:30,440
retrieval failed or if the model simply ignored the evidence that diagnosis is what informs your

1737
01:17:30,440 --> 01:17:35,480
next iteration the system learns from these corrections but it does not happen automatically

1738
01:17:35,480 --> 01:17:39,640
it happens explicitly through deliberate updates to your data and your logic the shift from

1739
01:17:39,640 --> 01:17:44,600
deploy and forget to monitor and improve is a permanent transition in how you handle responsibility

1740
01:17:44,600 --> 01:17:49,480
a copilot is not a piece of software that you ship and then hand off to another team

1741
01:17:49,480 --> 01:17:54,920
it is infrastructure that you have to maintain every single day it needs constant care and professional attention

1742
01:17:54,920 --> 01:17:59,240
you need someone whose entire job is to watch the metrics catch the drift and correct the

1743
01:17:59,240 --> 01:18:03,560
course when the quality starts to degrade this is how you stop a hallucination machine from

1744
01:18:03,560 --> 01:18:09,400
becoming a liability you build the system to expose failures and now you must operate it to prevent them

1745
01:18:09,400 --> 01:18:13,640
the core insights it's right at the center of everything we have discussed hallucinations are not

1746
01:18:13,640 --> 01:18:18,120
actually an llm problem they are an orchestration problem by now you should understand the exact

1747
01:18:18,120 --> 01:18:22,520
path people take to build a copilot that confidently lies to its users more importantly you now have

1748
01:18:22,520 --> 01:18:28,440
the framework to dismantle one most organizations are accidentally building these broken machines right now

1749
01:18:28,440 --> 01:18:32,680
they are skipping retrieval governance and treating grounding like it is an optional feature they

1750
01:18:32,680 --> 01:18:36,600
are deploying systems without any orchestration and indexing every document they own without any

1751
01:18:36,600 --> 01:18:40,760
classification your next step is very clear you need to audit your current architecture against

1752
01:18:40,760 --> 01:18:45,080
these specific failure modes look at where you are cutting corners and where governance theatre

1753
01:18:45,080 --> 01:18:49,720
is substituting for actual control the difference between a useful tool and a corporate liability is

1754
01:18:49,720 --> 01:18:54,600
the presence of these controls trumpify was a diagnostic tool but it was never the destination

1755
01:18:54,600 --> 01:18:59,480
if this changed how you think about AI systems follow me mere copeters on LinkedIn and if you want more

1756
01:18:59,480 --> 01:19:04,520
of this deep dive analysis leave a review so more people can find it share this with your team

1757
01:19:04,520 --> 01:19:07,640
especially if you are dealing with these exact problems right now

How to Trumpify Your Copilot: A Masterclass in Hallucination

Listen On

Support On

Featured Episodes

Recent Episodes

Microsoft Data Podcast – Analytics, Fabric & Data Governance Episodes

Microsoft Power Platform Podcast – Governance, Security & Architecture Episodes

Microsoft Security Podcast – Identity, Cloud & Enterprise Protection Episodes

Microsoft Azure Podcast – Cloud Architecture, Security & Operations Episodes

Microsoft Copilot Podcast – AI Architecture, Security & Governance Episodes

Microsoft Dynamics 365 Podcast – Architecture & Integration Episodes

Microsoft Development Podcast – APIs, Identity & Architecture Episodes

Microsoft 365 Podcast – Teams, SharePoint, Office Apps & Productivity Episodes

Browse episodes by category