June 7, 2026

I Engineered Copilot for 3.5 Million Pages: The Epstein Files Challenge

M365 FM Podcast

Three and a half million pages. Two thousand videos. One hundred and eighty thousand images. Most people assume that once you connect Microsoft Copilot to a massive dataset, the answers simply appear. The reality is very different.

In this episode of the M365 FM Podcast, we go deep into the engineering challenges behind building a retrieval architecture capable of handling one of the largest and most complex information collections imaginable. Using the Epstein Files challenge as a case study, we explore what happens when traditional search and standard Retrieval-Augmented Generation (RAG) approaches collide with millions of documents, transcripts, images, and videos.

This is not a discussion about AI marketing. It is a technical deep dive into the infrastructure, orchestration, governance, chunking strategies, retrieval systems, and performance engineering required to make Copilot work at extreme scale.

THE DATA BLINDNESS PROBLEM

Organizations often think Copilot is simply a smarter search engine. In reality, Copilot is an orchestration layer that relies entirely on the quality of the retrieval architecture beneath it.

At massive scale, information overload becomes the primary challenge. Questions that should have straightforward answers become buried beneath millions of irrelevant documents. Standard keyword search floods large language models with noise, making it increasingly difficult to identify meaningful signals. The result is what we call data blindness: the information exists, but it becomes practically invisible because of the overwhelming volume of competing content.

We explore how retrieval systems fail when legal documents, emails, transcripts, photographs, scanned PDFs, and multimedia assets all compete within the same search environment.

WHY STANDARD RAG COLLAPSES AT SCALE

Retrieval-Augmented Generation works well in controlled environments with relatively small knowledge bases. The assumptions behind standard RAG begin to break down once the dataset reaches millions of pages.

In this segment, we analyze why semantic chunking often underperforms at enterprise scale despite sounding attractive in theory. We discuss the hidden costs of sentence-level embeddings, similarity calculations, and preprocessing pipelines that dramatically increase infrastructure costs while sometimes reducing retrieval accuracy.

You will learn why more data does not automatically lead to better answers and how poorly designed retrieval architectures can actually increase hallucinations rather than reduce them.

THE SELECTIVE ACTIVATION MODEL

Not every document deserves the same investment.

One of the most important concepts discussed in this episode is Selective Activation, a three-tier architecture designed to prioritize the content that delivers the highest business value.

Rather than embedding every document equally, the system intelligently separates content into active, supporting, and archival tiers. This dramatically reduces infrastructure costs while improving retrieval performance and maintaining governance requirements.

The discussion covers:

  • Tier 1 high-value evidence and core documents
  • Tier 2 supporting records and operational content
  • Tier 3 cold storage and archival retrieval
This model allows organizations to focus resources where they generate the greatest return.
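
The tier assignment described above can be sketched as a small rule-based classifier. This is a minimal illustration, not the implementation used in the episode: the document types, the `Document` fields, and the choice to send unknown types to Tier 2 are all assumptions. The five-year archive rule follows the example given in the transcript.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical document-type labels; a real pipeline would derive these
# from metadata extracted during ingestion.
TIER1_TYPES = {"deposition", "legal_filing", "testimony"}
TIER2_TYPES = {"email", "financial_record", "video_transcript"}


@dataclass
class Document:
    doc_id: str
    doc_type: str
    created: date
    archived: bool = False


def assign_tier(doc: Document, today: date) -> int:
    """Return 1 (embed and rerank), 2 (keyword index only), or 3 (cold storage)."""
    age_years = (today - doc.created).days / 365.25
    # Rule from the episode: archived documents older than five years go cold.
    if doc.archived and age_years > 5:
        return 3
    if doc.doc_type in TIER1_TYPES:
        return 1
    if doc.doc_type in TIER2_TYPES:
        return 2
    # Assumption: unknown types stay keyword-searchable rather than going dark.
    return 2


today = date(2026, 6, 7)
print(assign_tier(Document("d1", "deposition", date(2024, 1, 1)), today))
print(assign_tier(Document("d2", "email", date(2010, 1, 1), archived=True), today))
```

Keeping the rules this explicit makes it easy to promote a document type to a higher tier once query logs show it answers more questions than expected.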

RECURSIVE STRUCTURE-AWARE CHUNKING

Chunking is one of the most overlooked components of enterprise AI architecture.

Legal documents, contracts, investigations, and regulatory records contain natural structures that traditional token-based chunking frequently destroys. In this section, we explore recursive structure-aware chunking and how respecting document hierarchy significantly improves retrieval quality.

Instead of splitting content at arbitrary token limits, this approach preserves articles, sections, clauses, and narrative context. The result is better grounding, higher retrieval precision, and more accurate answers.

We also discuss overlap strategies, metadata preservation, and benchmark results showing why recursive chunking consistently outperforms many expensive alternatives.
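
As a rough sketch of the idea, the splitter below tries the coarsest structural boundary first and only descends to finer ones when a piece is still too large. The separator patterns (`ARTICLE`, `Section`) and the whitespace word count used as a stand-in for a real tokenizer are assumptions for illustration. Overlap and metadata handling are left out to keep it short.

```python
import re

# Coarse-to-fine boundaries. These patterns are illustrative assumptions;
# real legal documents need patterns matched to their own conventions.
SEPARATORS = [
    r"\n(?=ARTICLE\s+\w+)",   # article headings
    r"\n(?=Section\s+\d+)",   # section headings
    r"\n\s*\n",               # blank-line paragraph breaks
    r"(?<=[.!?])\s+",         # sentence ends
]


def token_count(text: str) -> int:
    # Whitespace word count as a cheap proxy for a real tokenizer.
    return len(text.split())


def recursive_chunk(text: str, max_tokens: int = 512, level: int = 0) -> list[str]:
    text = text.strip()
    if not text:
        return []
    # Small enough, or no finer boundary left: keep the piece whole.
    if token_count(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks, buffer = [], ""
    for piece in pieces:
        candidate = f"{buffer}\n{piece}" if buffer else piece
        if token_count(candidate) <= max_tokens:
            buffer = candidate  # merge neighbours while they fit
            continue
        if buffer:
            chunks.append(buffer)
        buffer = ""
        if token_count(piece) <= max_tokens:
            buffer = piece
        else:
            # Still too large: retry this piece at the next finer boundary.
            chunks.extend(recursive_chunk(piece, max_tokens, level + 1))
    if buffer:
        chunks.append(buffer)
    return chunks
```

A piece that has no usable boundary left is returned oversized rather than cut mid-sentence; a production splitter would need a hard fallback for that case.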

BUILDING A MULTIMODAL INGESTION PIPELINE

Modern knowledge repositories are no longer text-only environments.

Organizations must process images, scanned documents, video recordings, transcripts, handwritten notes, and multimedia evidence. Making this information searchable requires a sophisticated ingestion pipeline that performs OCR, transcription, image analysis, metadata extraction, and enrichment before users ever submit a query.

This episode explores how multimodal ingestion transforms unsearchable content into structured knowledge that Copilot can retrieve and reason over.

ENTITY EXTRACTION AND KNOWLEDGE GRAPHS

Raw text is information. Relationships create understanding.

We examine how entity extraction transforms millions of disconnected references into a structured knowledge graph capable of identifying people, organizations, locations, events, and relationships.

Rather than forcing the AI model to discover relationships during generation, the system extracts and organizes these connections during ingestion. This reduces hallucinations, improves retrieval accuracy, and enables advanced relationship-based questioning across large datasets.

THE AGENTIC ROUTER

Not all questions require the same retrieval strategy.

The Agentic Router serves as the intelligence layer that determines what a user is actually asking and routes requests to the most appropriate retrieval systems.

Whether a query requires structured databases, knowledge graphs, keyword indexes, vector search, or document retrieval, the router decomposes complex requests into specialized tasks and orchestrates the response process.

This section provides a practical look at query decomposition, intent classification, fallback mechanisms, and confidence scoring.

HYBRID RETRIEVAL AND RERANKING

Modern enterprise retrieval requires more than vector search alone.

We explore why combining BM25 keyword retrieval, vector search, Reciprocal Rank Fusion, metadata filtering, and transformer-based reranking delivers superior results compared to any individual approach.

Hybrid retrieval balances precision and recall while reducing retrieval noise before information ever reaches the large language model.

The conversation includes practical implementation considerations, latency tradeoffs, and the impact of reranking on answer quality.
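
Reciprocal Rank Fusion itself is simple enough to show in a few lines. The sketch below merges ranked lists by summing 1 / (k + rank) per document. The constant k = 60 is a commonly used default, and the document IDs and the alphabetical tie-break are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]      # hypothetical keyword results
vector_hits = ["d3", "d1", "d4"]    # hypothetical vector results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['d1', 'd3', 'd2', 'd4']
```

Because RRF uses only rank positions, it needs no score normalization between BM25 and vector similarity, which is one reason it is a popular fusion step ahead of a reranker.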

PERMISSION-AWARE RETRIEVAL

Security cannot be an afterthought.

When dealing with millions of pages, access control becomes a foundational architectural requirement rather than a feature.

We discuss chunk-level permissions, Azure Active Directory integration, sensitivity labels, compliance boundaries, audit trails, and governance models that ensure users only receive information they are authorized to access.

This section highlights why permission-aware retrieval is one of the most critical components of enterprise AI deployment.
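
A minimal sketch of chunk-level permission trimming, assuming each chunk carries the set of group IDs allowed to read it and that the caller's group memberships have already been resolved from the identity provider. The `Chunk` shape and the group names are hypothetical. In a production system this filter would normally be pushed into the index query itself rather than applied in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # group IDs copied from the source document's ACL


def trim_by_permission(candidates: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks the user may read, preserving the incoming order."""
    groups = frozenset(user_groups)
    # A chunk is visible when the user shares at least one allowed group.
    return [c for c in candidates if c.allowed_groups & groups]


candidates = [
    Chunk("c1", "deposition excerpt", frozenset({"legal-team"})),
    Chunk("c2", "financial record", frozenset({"finance", "legal-team"})),
    Chunk("c3", "sealed exhibit", frozenset({"partners"})),
]
visible = trim_by_permission(candidates, {"finance"})
print([c.chunk_id for c in visible])
# → ['c2']
```

Trimming before reranking and generation matters: a chunk that never reaches the model cannot leak into an answer.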

LATENCY, PERFORMANCE, AND TIME-TO-FIRST-TOKEN

Users judge AI systems by speed.

Even the most accurate answer loses value if it arrives too slowly.

This episode examines Time-to-First-Token (TTFT), retrieval latency, reranking overhead, permission filtering costs, caching strategies, and parallel processing techniques that enable sub-second experiences at enterprise scale.

You will learn where latency accumulates inside the retrieval pipeline and how architectural decisions directly influence user adoption.

GOVERNANCE, COMPLIANCE, AND ENTERPRISE READINESS

Enterprise AI is not simply about retrieval performance.

Governance frameworks, retention policies, legal holds, audit logging, data residency requirements, and compliance controls determine whether a system can safely operate in production environments.

We explore how governance becomes increasingly important as datasets grow and why organizations must design compliance directly into their architecture rather than adding it later.

THE ORCHESTRATION LAYER

Every component discussed in this episode ultimately converges inside the orchestration layer.

The orchestration layer coordinates ingestion, chunking, enrichment, indexing, retrieval, reranking, permission filtering, answer generation, feedback loops, monitoring, and scaling.

Without orchestration, organizations are left with disconnected technologies. With orchestration, those technologies become a coherent AI system capable of turning millions of pages into actionable knowledge.

KEY TAKEAWAYS
  • Copilot is an orchestration engine, not a search engine.
  • Retrieval architecture determines answer quality.
  • Recursive chunking often outperforms expensive semantic approaches.
  • Metadata enrichment dramatically improves retrieval accuracy.
  • Hybrid retrieval provides the best balance of precision and recall.
  • Governance and security must be built into the architecture from day one.
CONNECT WITH M365 FM

If you enjoyed this episode, subscribe to M365 FM for deep technical conversations covering Microsoft 365, Microsoft Copilot, Azure AI, enterprise search, knowledge management, governance, security, and the future of intelligent workplaces.

New episodes explore real-world architectures, implementation strategies, lessons learned from large-scale deployments, and the technologies shaping the next generation of work.

Subscribe, leave a review, and share the episode with anyone building AI-powered solutions at enterprise scale.

Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.

1
00:00:00,000 --> 00:00:04,120
3.5 million pages, 2,000 videos, 180,000 images.

2
00:00:04,120 --> 00:00:05,960
That is the massive pile of data you are looking at

3
00:00:05,960 --> 00:00:07,320
in the Epstein files case,

4
00:00:07,320 --> 00:00:09,280
and the assumption most people make is simple.

5
00:00:09,280 --> 00:00:11,040
You just plug Copilot in and it works,

6
00:00:11,040 --> 00:00:12,520
but in reality, it doesn't,

7
00:00:12,520 --> 00:00:14,560
not without the right architecture underneath it.

8
00:00:14,560 --> 00:00:16,440
This episode is about the technical infrastructure

9
00:00:16,440 --> 00:00:17,960
that actually makes this happen,

10
00:00:17,960 --> 00:00:20,560
focusing on the engineering that turns millions of pages

11
00:00:20,560 --> 00:00:22,040
into something you can search,

12
00:00:22,040 --> 00:00:25,440
reason over, and cite with total accuracy.

13
00:00:25,440 --> 00:00:28,280
We are not talking about theory or marketing hype today.

14
00:00:28,280 --> 00:00:29,640
We are talking about a working system.

15
00:00:29,640 --> 00:00:31,160
By the end of this, you will understand

16
00:00:31,160 --> 00:00:33,760
why standard retrieval breaks when you hit massive scale,

17
00:00:33,760 --> 00:00:36,080
and you will see how to handle chunking and routing

18
00:00:36,080 --> 00:00:38,120
when the noise starts to drown out the signal.

19
00:00:38,120 --> 00:00:41,800
We are going to look at what orchestration looks like in practice,

20
00:00:41,800 --> 00:00:44,720
and how you can actually measure if any of it is working.

21
00:00:44,720 --> 00:00:47,520
If you want deep technical insights on Microsoft 365,

22
00:00:47,520 --> 00:00:49,840
Copilot, Azure, and the modern workplace,

23
00:00:49,840 --> 00:00:53,880
make sure to subscribe to the M365 FM podcast.

24
00:00:53,880 --> 00:00:55,400
The data blindness problem.

25
00:00:55,400 --> 00:00:58,000
Most organizations think Copilot is just a search tool,

26
00:00:58,000 --> 00:00:58,960
but that is not the case.

27
00:00:58,960 --> 00:01:00,560
It is an orchestration engine,

28
00:01:00,560 --> 00:01:03,240
and when you are dealing with three and a half million pages,

29
00:01:03,240 --> 00:01:04,800
that distinction becomes everything.

30
00:01:04,800 --> 00:01:06,960
Here is what happens when you connect Copilot

31
00:01:06,960 --> 00:01:09,400
to massive data without a retrieval architecture.

32
00:01:09,400 --> 00:01:11,560
The noise-to-signal ratio becomes impossible to manage

33
00:01:11,560 --> 00:01:13,280
because Copilot fires a keyword search

34
00:01:13,280 --> 00:01:16,040
across millions of documents the moment you ask a question.

35
00:01:16,040 --> 00:01:19,000
It returns results where some are relevant and many are not,

36
00:01:19,000 --> 00:01:21,440
forcing the LLM to try and find a signal inside

37
00:01:21,440 --> 00:01:22,720
a fire hose of candidates.

38
00:01:22,720 --> 00:01:24,800
The real answer just gets buried under a pile

39
00:01:24,800 --> 00:01:26,320
of half-relevant noise.

40
00:01:26,320 --> 00:01:27,800
This is what I call data blindness.

41
00:01:27,800 --> 00:01:30,240
The data is not missing, but it is everywhere at once,

42
00:01:30,240 --> 00:01:32,080
which ends up being the exact same thing.

43
00:01:32,080 --> 00:01:34,960
If you use fixed-size chunking at 512 tokens,

44
00:01:34,960 --> 00:01:36,800
you are going to break your legal narratives.

45
00:01:36,800 --> 00:01:38,920
Imagine a deposition that spans several pages

46
00:01:38,920 --> 00:01:40,400
as one continuous argument,

47
00:01:40,400 --> 00:01:42,600
but your chunking algorithm has no way of knowing that.

48
00:01:42,600 --> 00:01:44,760
It splits the text right at the token boundary

49
00:01:44,760 --> 00:01:46,200
in the middle of a testimony,

50
00:01:46,200 --> 00:01:48,760
and the next chunk ends up being someone else's statement

51
00:01:48,760 --> 00:01:50,600
from a completely different page.

52
00:01:50,600 --> 00:01:52,920
When vector search pulls just one of those chunks,

53
00:01:52,920 --> 00:01:54,160
the context is broken,

54
00:01:54,160 --> 00:01:56,840
and the answer Copilot gives you is incomplete.

55
00:01:56,840 --> 00:01:59,120
Vector search by itself will return irrelevant candidates

56
00:01:59,120 --> 00:02:00,680
because your data is too diverse.

57
00:02:00,680 --> 00:02:03,160
You have legal documents, emails, transcripts

58
00:02:03,160 --> 00:02:06,560
and photographs with metadata all sitting in the same vector space.

59
00:02:06,560 --> 00:02:08,440
When a user asks about a timeline,

60
00:02:08,440 --> 00:02:10,240
the vector search might pull images

61
00:02:10,240 --> 00:02:12,040
that were tagged with dates close to the prompt

62
00:02:12,040 --> 00:02:13,120
in the embedding space.

63
00:02:13,120 --> 00:02:14,880
The image had nothing to do with the timeline,

64
00:02:14,880 --> 00:02:16,480
but it was semantically close,

65
00:02:16,480 --> 00:02:19,280
so Copilot grounds its answer on the wrong data.

66
00:02:19,280 --> 00:02:20,400
Then you hit the latency wall.

67
00:02:20,400 --> 00:02:22,600
You simply cannot query three and a half million pages

68
00:02:22,600 --> 00:02:25,600
in real time the same way you would query 10,000.

69
00:02:25,600 --> 00:02:28,200
Every single stage adds up, from embedding the query

70
00:02:28,200 --> 00:02:30,280
and searching the index to re-ranking candidates

71
00:02:30,280 --> 00:02:32,000
and passing context to the LLM.

72
00:02:32,000 --> 00:02:33,760
Response time starts stretching into seconds

73
00:02:33,760 --> 00:02:34,800
instead of milliseconds,

74
00:02:34,800 --> 00:02:36,240
and for an enterprise assistant,

75
00:02:36,240 --> 00:02:38,800
those extra seconds feel like a total failure.

76
00:02:38,800 --> 00:02:40,840
Permission-aware retrieval also becomes a nightmare

77
00:02:40,840 --> 00:02:42,240
without proper governance.

78
00:02:42,240 --> 00:02:45,200
At this scale, different users have different access rights,

79
00:02:45,200 --> 00:02:46,720
where one analyst sees one branch

80
00:02:46,720 --> 00:02:47,840
and another sees something else.

81
00:02:47,840 --> 00:02:49,880
If your retrieval layer is not trimming permissions

82
00:02:49,880 --> 00:02:51,360
at the exact moment of the query,

83
00:02:51,360 --> 00:02:53,640
someone is eventually going to see something they should not.

84
00:02:53,640 --> 00:02:55,120
You have to bake that permission logic

85
00:02:55,120 --> 00:02:57,960
into every single retrieval path and ranking function.

86
00:02:57,960 --> 00:03:00,160
Finally, you have the problem of dark data.

87
00:03:00,160 --> 00:03:02,040
Most of those three and a half million pages

88
00:03:02,040 --> 00:03:04,920
are completely unsearchable unless you process them first.

89
00:03:04,920 --> 00:03:07,640
Scanned PDFs need OCR, videos need transcripts,

90
00:03:07,640 --> 00:03:09,160
and images need descriptions,

91
00:03:09,160 --> 00:03:11,840
but you cannot do this while the user is waiting for an answer.

92
00:03:11,840 --> 00:03:14,280
It has to happen asynchronously during ingestion

93
00:03:14,280 --> 00:03:16,560
before the data ever reaches your indexes.

94
00:03:16,560 --> 00:03:19,320
If you skip that step, those documents stay dark,

95
00:03:19,320 --> 00:03:21,040
meaning they exist in your system,

96
00:03:21,040 --> 00:03:23,520
but Copilot is effectively blind to them.

97
00:03:23,520 --> 00:03:25,560
Why standard RAG collapses at scale.

98
00:03:25,560 --> 00:03:26,880
The research is clear.

99
00:03:26,880 --> 00:03:29,200
Semantic chunking sounds great in a white paper,

100
00:03:29,200 --> 00:03:31,360
but when you hit 3.5 million pages,

101
00:03:31,360 --> 00:03:32,880
the cost becomes a wall.

102
00:03:32,880 --> 00:03:34,280
Here is the technical reality.

103
00:03:34,280 --> 00:03:36,120
Semantic chunking requires you to embed

104
00:03:36,120 --> 00:03:38,080
every single sentence in your corpus,

105
00:03:38,080 --> 00:03:41,040
then compute similarity matrices across every adjacent pair

106
00:03:41,040 --> 00:03:42,760
just to find where a topic ends.

107
00:03:42,760 --> 00:03:45,680
At 3.5 million pages, you are looking at tens of millions

108
00:03:45,680 --> 00:03:46,680
of sentences.

109
00:03:46,680 --> 00:03:48,280
Every one of them gets embedded.

110
00:03:48,280 --> 00:03:51,560
Then you have to run cosine similarity across that entire set

111
00:03:51,560 --> 00:03:53,240
to detect where the topics shift.

112
00:03:53,240 --> 00:03:54,600
Only after that work is done

113
00:03:54,600 --> 00:03:56,360
do you group those sentences into chunks

114
00:03:56,360 --> 00:03:59,320
and embed the chunks themselves for the final index.

115
00:03:59,320 --> 00:04:01,320
It is effectively a double embedding pipeline

116
00:04:01,320 --> 00:04:02,520
that burns through your budget

117
00:04:02,520 --> 00:04:04,960
before the data even reaches a vector database.

118
00:04:04,960 --> 00:04:06,600
The benchmark data is stark.

119
00:04:06,600 --> 00:04:09,160
Fixed-size recursive chunking at 512 tokens

120
00:04:09,160 --> 00:04:12,520
hits 69% accuracy on document-level retrieval tasks.

121
00:04:12,520 --> 00:04:15,840
Semantic chunking on that same benchmark scores 54%.

122
00:04:15,840 --> 00:04:18,320
That is a 15-point gap in the wrong direction.

123
00:04:18,320 --> 00:04:20,880
Because the semantic chunks average only 43 tokens,

124
00:04:20,880 --> 00:04:22,200
they are so over-fragmented

125
00:04:22,200 --> 00:04:23,800
that your retrieval becomes scattered.

126
00:04:23,800 --> 00:04:25,920
You lose the context, you lose the coherence,

127
00:04:25,920 --> 00:04:28,200
but the real cost isn't just the loss in accuracy.

128
00:04:28,200 --> 00:04:29,160
It is the compute.

129
00:04:29,160 --> 00:04:33,000
Semantic chunking adds 1.5 to 3 times the ingestion overhead

130
00:04:33,000 --> 00:04:35,280
compared to a standard fixed-size split.

131
00:04:35,280 --> 00:04:36,920
You are paying for sentence segmentation,

132
00:04:36,920 --> 00:04:39,440
embedding infrastructure and similarity calculations

133
00:04:39,440 --> 00:04:41,760
that don't actually improve your end result.

134
00:04:41,760 --> 00:04:43,920
On a 3.5 million page corpus,

135
00:04:43,920 --> 00:04:46,480
that multiplier turns into weeks of pre-processing time

136
00:04:46,480 --> 00:04:48,280
and massive infrastructure bills.

137
00:04:48,280 --> 00:04:50,480
You are embedding the same text multiple times,

138
00:04:50,480 --> 00:04:52,920
once for boundary detection and again for the final chunks,

139
00:04:52,920 --> 00:04:54,960
which is just redundant work at scale.

140
00:04:54,960 --> 00:04:56,680
Fixed-size chunking is linear.

141
00:04:56,680 --> 00:04:59,560
You tokenize once, cut at intervals and embed once.

142
00:04:59,560 --> 00:05:00,920
It scales predictably.

143
00:05:00,920 --> 00:05:02,600
Semantic chunking scales much worse

144
00:05:02,600 --> 00:05:05,760
because every single document triggers expensive NLP operations.

145
00:05:05,760 --> 00:05:08,640
If you are ingesting 3.5 million pages one time,

146
00:05:08,640 --> 00:05:09,800
the delay is annoying.

147
00:05:09,800 --> 00:05:11,880
But if you are ingesting data continuously

148
00:05:11,880 --> 00:05:13,360
with new evidence arriving weekly

149
00:05:13,360 --> 00:05:15,000
and video transcripts monthly,

150
00:05:15,000 --> 00:05:17,720
that semantic pre-processing becomes a permanent bottleneck

151
00:05:17,720 --> 00:05:18,560
in your system.

152
00:05:18,560 --> 00:05:21,480
Retrieval latency grows right along with your corpus size

153
00:05:21,480 --> 00:05:23,560
unless you architect your way around it.

154
00:05:23,560 --> 00:05:24,840
With semantic chunking,

155
00:05:24,840 --> 00:05:27,400
you have increased your chunk count through over-fragmentation

156
00:05:27,400 --> 00:05:30,120
and that increases the total number of vectors in your index.

157
00:05:30,120 --> 00:05:33,080
More vectors mean the search has to evaluate more candidates.

158
00:05:33,080 --> 00:05:35,200
Your latency on queries starts to creep up,

159
00:05:35,200 --> 00:05:37,880
which means you respond slower and your users wait longer.

160
00:05:37,880 --> 00:05:40,480
Then you hit the constraint of the LLM context window.

161
00:05:40,480 --> 00:05:43,800
You cannot just send all 3.5 million pages to the model.

162
00:05:43,800 --> 00:05:44,840
Your window is finite,

163
00:05:44,840 --> 00:05:48,440
whether it is 4,000 tokens or 128,000

164
00:05:48,440 --> 00:05:49,440
at the high end.

165
00:05:49,440 --> 00:05:52,640
Retrieval has to return a small, precise set of candidates.

166
00:05:52,640 --> 00:05:55,000
If your chunking strategy produces tiny fragments,

167
00:05:55,000 --> 00:05:56,320
you have to retrieve more of them

168
00:05:56,320 --> 00:05:58,360
just to reconstruct the basic context.

169
00:05:58,360 --> 00:06:01,040
More retrieved chunks mean more tokens are consumed

170
00:06:01,040 --> 00:06:03,760
by the retrieval process instead of the actual reasoning.

171
00:06:03,760 --> 00:06:06,200
Your model spends its tokens trying to rebuild a narrative

172
00:06:06,200 --> 00:06:08,040
instead of synthesizing an insight.

173
00:06:08,040 --> 00:06:09,680
The paradox surfaces right here:

174
00:06:09,680 --> 00:06:12,200
more data without orchestration does not lead to better answers.

175
00:06:12,200 --> 00:06:13,360
It leads to worse ones.

176
00:06:13,360 --> 00:06:16,160
You retrieve more candidates, but they are more fragmented.

177
00:06:16,160 --> 00:06:19,560
The LLM receives a scatter of context instead of a coherent story.

178
00:06:19,560 --> 00:06:22,880
It starts to hallucinate more because it is forced to piece together statements

179
00:06:22,880 --> 00:06:24,160
from disconnected chunks.

180
00:06:24,160 --> 00:06:26,800
Users see answers that look like they are grounded in documents,

181
00:06:26,800 --> 00:06:28,320
but that grounding is fragile.

182
00:06:28,320 --> 00:06:29,600
One chunk says X,

183
00:06:29,600 --> 00:06:32,080
and another chunk that is nearby in the vector space,

184
00:06:32,080 --> 00:06:34,400
but logically unrelated, says Y.

185
00:06:34,400 --> 00:06:37,480
The LLM then synthesizes a false inference to connect them.

186
00:06:37,480 --> 00:06:40,320
Standard RAG assumes you have a small, simple corpus.

187
00:06:40,320 --> 00:06:42,960
Maybe some legal documents or product FAQs.

188
00:06:42,960 --> 00:06:46,000
At 3.5 million pages with 2,000 videos

189
00:06:46,000 --> 00:06:49,280
and 180,000 images, those assumptions break.

190
00:06:49,280 --> 00:06:51,680
The default architecture of one chunking strategy

191
00:06:51,680 --> 00:06:54,120
and one vector index collapses under that weight.

192
00:06:54,120 --> 00:06:56,200
You have to architect differently.

193
00:06:56,200 --> 00:06:59,120
You have to be selective about where you invest your resources.

194
00:06:59,120 --> 00:07:01,080
The selective activation model.

195
00:07:01,080 --> 00:07:03,520
The insight that changes everything is this.

196
00:07:03,520 --> 00:07:08,440
Not every document in your 3.5 million page corpus has the same value.

197
00:07:08,440 --> 00:07:12,240
Some documents answer 80% of your user questions while others answer none.

198
00:07:12,240 --> 00:07:14,880
Your budget is finite, your embedding costs are finite,

199
00:07:14,880 --> 00:07:18,160
your latency budget is finite. You have to choose where to invest.

200
00:07:18,160 --> 00:07:20,080
This is selective activation.

201
00:07:20,080 --> 00:07:23,200
The Epstein files implementation uses a 3-tier architecture.

202
00:07:23,200 --> 00:07:25,200
I think of it as a graduated investment,

203
00:07:25,200 --> 00:07:28,720
where each tier gets exactly the level of sophistication its use justifies.

204
00:07:28,720 --> 00:07:31,360
Tier 1 is your high-value, high-frequency content.

205
00:07:31,360 --> 00:07:35,200
This includes depositions, legal filings, and testimony from key figures.

206
00:07:35,200 --> 00:07:38,400
These documents generate the vast majority of your answerable queries.

207
00:07:38,400 --> 00:07:40,720
They are the nucleus of the entire knowledge base.

208
00:07:40,720 --> 00:07:42,120
Tier 1 gets the full treatment.

209
00:07:42,120 --> 00:07:43,920
These documents are chunked recursively

210
00:07:43,920 --> 00:07:47,120
to respect section boundaries and maintain narrative flow.

211
00:07:47,120 --> 00:07:50,480
Metadata is extracted at a very granular level to track who testifies

212
00:07:50,480 --> 00:07:51,600
and what topics appear.

213
00:07:51,600 --> 00:07:53,680
Chunks are embedded with a high-quality model

214
00:07:53,680 --> 00:07:56,000
and results are re-ranked using a cross-encoder.

215
00:07:56,000 --> 00:07:58,640
Query-time latency matters less here than answer quality

216
00:07:58,640 --> 00:08:00,960
because users expect deep, accurate results.

217
00:08:00,960 --> 00:08:04,880
If you need a complex answer about a person's movements across multiple documents,

218
00:08:04,880 --> 00:08:06,960
tier 1 retrieval will surface that evidence.

219
00:08:06,960 --> 00:08:08,320
Tier 2 is your bulk evidence.

220
00:08:08,320 --> 00:08:11,440
This is the supporting material like emails, financial records,

221
00:08:11,440 --> 00:08:12,480
and video transcripts.

222
00:08:12,480 --> 00:08:14,560
These are secondary sources that provide context

223
00:08:14,560 --> 00:08:16,640
but rarely answer a question on their own.

224
00:08:16,640 --> 00:08:20,560
Tier 2 uses keyword-only indexing like BM25 and exact match search.

225
00:08:20,560 --> 00:08:22,400
There are no embeddings and no re-ranking.

226
00:08:22,400 --> 00:08:24,080
This trade-off is a deliberate choice.

227
00:08:24,080 --> 00:08:26,400
If a user searches for a specific transaction date,

228
00:08:26,400 --> 00:08:28,160
keyword search finds it instantly.

229
00:08:28,160 --> 00:08:31,040
There is no embedding overhead and no model latency.

230
00:08:31,040 --> 00:08:34,400
You sacrifice the ability to find documents through conceptual similarity,

231
00:08:34,400 --> 00:08:37,360
but you gain massive speed and operational simplicity.

232
00:08:37,360 --> 00:08:39,760
Tier 2 is perfect for narrow questions

233
00:08:39,760 --> 00:08:42,960
like finding all emails from a specific person in March.

234
00:08:42,960 --> 00:08:44,320
Tier 3 is the deep archive.

235
00:08:44,320 --> 00:08:46,640
This is for historical documents and duplicate evidence

236
00:08:46,640 --> 00:08:49,360
that you keep for compliance but rarely ever query.

237
00:08:49,360 --> 00:08:50,960
Tier 3 stays in cold storage.

238
00:08:50,960 --> 00:08:53,600
These documents never even enter your hot indexes.

239
00:08:53,600 --> 00:08:57,040
If a user query indicates that tier 3 might actually have something relevant,

240
00:08:57,040 --> 00:08:59,120
the system triggers an on-demand extraction.

241
00:08:59,120 --> 00:09:00,480
This is expensive per query,

242
00:09:00,480 --> 00:09:03,120
but it is acceptable because those queries are so rare.

243
00:09:03,120 --> 00:09:05,680
Maybe once a month, a user asks something specific enough

244
00:09:05,680 --> 00:09:07,680
to justify a cold storage search.

245
00:09:07,680 --> 00:09:09,360
You pay the latency cost then,

246
00:09:09,360 --> 00:09:10,560
but the rest of the time,

247
00:09:10,560 --> 00:09:12,880
you aren't paying to keep those documents active.

248
00:09:12,880 --> 00:09:15,360
The cost picture changes completely under this model.

249
00:09:15,360 --> 00:09:18,240
You aren't embedding 3.5 million pages anymore.

250
00:09:18,240 --> 00:09:19,600
You are only embedding tier 1,

251
00:09:19,600 --> 00:09:22,080
which might be 500,000 high-value documents.

252
00:09:22,080 --> 00:09:24,320
You are running keyword indexing on tier 2,

253
00:09:24,320 --> 00:09:26,480
which costs a fraction of what embedding does.

254
00:09:26,480 --> 00:09:28,000
Tier 3 is left untouched.

255
00:09:28,000 --> 00:09:30,640
Your embedding budget is suddenly 5 to 10 times smaller.

256
00:09:30,640 --> 00:09:33,280
Your vector database is 5 to 10 times smaller.

257
00:09:33,280 --> 00:09:35,360
Your query costs drop because the system

258
00:09:35,360 --> 00:09:38,320
routes the request to the right tier based on what the user is asking.

259
00:09:38,320 --> 00:09:40,160
This approach requires real governance.

260
00:09:40,160 --> 00:09:43,200
You need a framework to decide which documents land in which tier.

261
00:09:43,200 --> 00:09:45,920
When you ingest data, you have to classify it immediately.

262
00:09:45,920 --> 00:09:47,680
A key deposition goes to tier 1,

263
00:09:47,680 --> 00:09:49,520
while a supporting email goes to tier 2.

264
00:09:49,520 --> 00:09:51,600
This can be done with simple rules or metadata.

265
00:09:51,600 --> 00:09:55,120
For example, documents older than 5 years that are marked as archived

266
00:09:55,120 --> 00:09:56,400
can go straight to tier 3.

267
00:09:56,400 --> 00:10:00,080
Your measurement of success becomes the cost per answerable query

268
00:10:00,080 --> 00:10:01,840
rather than just the raw document count.

269
00:10:01,840 --> 00:10:04,080
You track how many queries tier 1 answers

270
00:10:04,080 --> 00:10:06,400
and how often you need to pull from tier 2.

271
00:10:06,400 --> 00:10:09,760
If you find that financial records are answering 40% of your queries

272
00:10:09,760 --> 00:10:12,240
but you have them in tier 2, you move them up to tier 1.

273
00:10:12,240 --> 00:10:15,760
You are constantly optimizing where you spend your infrastructure budget.

274
00:10:15,760 --> 00:10:17,600
The business case here is very straightforward:

275
00:10:17,600 --> 00:10:19,040
not all data is equal,

276
00:10:19,040 --> 00:10:20,560
so you shouldn't treat it like it is.

277
00:10:20,560 --> 00:10:23,520
Recursive structure-aware chunking.

278
00:10:23,520 --> 00:10:25,440
Once you've decided what goes into each tier,

279
00:10:25,440 --> 00:10:27,040
you hit the second big decision:

280
00:10:27,040 --> 00:10:29,360
how to actually split your documents into chunks.

281
00:10:29,360 --> 00:10:31,040
This is where most projects fall apart.

282
00:10:31,040 --> 00:10:33,520
People choose a strategy because it sounds good on paper,

283
00:10:33,520 --> 00:10:36,240
but then they realize it doesn't work for the content they actually have.

284
00:10:36,560 --> 00:10:38,720
Legal documents aren't just walls of text.

285
00:10:38,720 --> 00:10:41,120
They have a specific structure with articles, sections,

286
00:10:41,120 --> 00:10:42,960
and numbered provisions that isn't there by accident.

287
00:10:42,960 --> 00:10:45,600
That structure encodes the actual meaning of the document.

288
00:10:45,600 --> 00:10:48,160
An article usually contains a complete legal concept,

289
00:10:48,160 --> 00:10:49,840
while a section inside that article

290
00:10:49,840 --> 00:10:53,040
refines the idea and a subsection clarifies a tiny detail.

291
00:10:53,040 --> 00:10:54,400
When you split these documents up,

292
00:10:54,400 --> 00:10:57,120
you have to respect that hierarchy instead of just cutting through it.

293
00:10:57,120 --> 00:11:00,720
The recursive structure-aware approach starts with one simple rule:

294
00:11:00,720 --> 00:11:03,280
Use the document's own structure as your primary signal.

295
00:11:03,280 --> 00:11:04,720
Don't just count tokens and cut;

296
00:11:04,720 --> 00:11:06,080
you need to build a hierarchy.

297
00:11:06,080 --> 00:11:07,760
The logic here is pretty straightforward.

298
00:11:07,760 --> 00:11:09,920
You take a document and start at the highest level,

299
00:11:09,920 --> 00:11:10,880
like an article.

300
00:11:10,880 --> 00:11:13,280
If that entire article fits into your token budget,

301
00:11:13,280 --> 00:11:15,280
you keep it as one single chunk.

302
00:11:15,280 --> 00:11:16,960
But if that article is too big,

303
00:11:16,960 --> 00:11:19,280
let's say it's over 512 tokens,

304
00:11:19,280 --> 00:11:20,880
you move down to the next level.

305
00:11:20,880 --> 00:11:22,720
You split the article into sections

306
00:11:22,720 --> 00:11:24,320
and try to keep those as chunks.

307
00:11:24,320 --> 00:11:26,000
If a section is still too large,

308
00:11:26,000 --> 00:11:28,720
you go deeper into paragraphs or even sentences.

309
00:11:28,720 --> 00:11:30,960
You only force a split based on token counts

310
00:11:30,960 --> 00:11:33,680
after you've tried every possible structural boundary.

311
00:11:33,680 --> 00:11:36,800
You keep refining until every piece fits within your limit.
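A minimal Python sketch of that descent, under stated assumptions: the `Node` tree is a hypothetical stand-in for whatever your parser produces (articles, sections, paragraphs, and sentences as ever deeper children), and whitespace-separated words stand in for real model tokens.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One structural unit: article, section, paragraph, or sentence."""
    text: str = ""  # text owned directly by this unit, such as a heading
    children: list["Node"] = field(default_factory=list)

    def full_text(self) -> str:
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)

def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())

def chunk(node: Node, budget: int = 512) -> list[str]:
    text = node.full_text()
    # The whole unit fits: keep it as one chunk.
    if count_tokens(text) <= budget:
        return [text] if text else []
    # Too big, but structure remains: descend one level.
    if node.children:
        chunks = chunk(Node(node.text), budget) if node.text else []
        for child in node.children:
            chunks.extend(chunk(child, budget))
        return chunks
    # No structural boundary left: force a split on the token budget.
    words = text.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]
```

The forced split on token count only runs in the last branch, after every structural level has been tried.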

312
00:11:36,800 --> 00:11:38,000
Why go through all this trouble?

313
00:11:38,000 --> 00:11:40,800
It's because legal meaning lives at these specific boundaries.

314
00:11:40,800 --> 00:11:42,240
An article makes a full argument

315
00:11:42,240 --> 00:11:43,920
and a section advances that point,

316
00:11:43,920 --> 00:11:46,400
so respecting these divisions keeps the narrative together.

317
00:11:46,400 --> 00:11:47,840
The context stays in one piece.

318
00:11:47,840 --> 00:11:50,320
When Copilot pulls a chunk about termination conditions,

319
00:11:50,320 --> 00:11:52,560
it gets the entire list of rules instead of a fragment

320
00:11:52,560 --> 00:11:54,480
that was cut off in the middle of a sentence.

321
00:11:54,480 --> 00:11:56,240
The extra work here is actually minimal

322
00:11:56,240 --> 00:11:58,480
compared to more complex semantic methods.

323
00:11:58,480 --> 00:11:59,920
You aren't calculating embeddings

324
00:11:59,920 --> 00:12:02,880
for every single sentence just to find where a topic changes,

325
00:12:02,880 --> 00:12:05,520
and you aren't running math across millions of pairs.

326
00:12:05,520 --> 00:12:07,360
You simply parse the document structure once

327
00:12:07,360 --> 00:12:08,880
when you bring it into the system.

328
00:12:08,880 --> 00:12:12,400
Most PDFs and office files already have this structure built in through

329
00:12:12,400 --> 00:12:13,440
headings and tags,

330
00:12:13,440 --> 00:12:15,520
so you just use a standard library to pull it out.

331
00:12:15,520 --> 00:12:17,120
The parser runs once and you're done.

332
00:12:17,120 --> 00:12:19,200
Overlap is how you bridge the gaps on purpose.

333
00:12:19,200 --> 00:12:21,760
Even though recursive chunking aligns with meaning,

334
00:12:21,760 --> 00:12:24,160
every boundary carries a risk that important context

335
00:12:24,160 --> 00:12:26,080
might get split between two pieces.

336
00:12:26,080 --> 00:12:28,640
You fix this by adding a 10 to 15% overlap,

337
00:12:28,640 --> 00:12:30,800
which is usually about 50 to 100 tokens.

338
00:12:30,800 --> 00:12:32,240
When you chunk at the section level,

339
00:12:32,240 --> 00:12:34,240
you take the last few sentences of one chunk

340
00:12:34,240 --> 00:12:36,320
and repeat them at the start of the next one.

341
00:12:36,320 --> 00:12:38,320
If a user's question hits that boundary,

342
00:12:38,320 --> 00:12:42,160
the system pulls both chunks and the information flows without a break.
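As a rough Python sketch of the overlap step: the `add_overlap` helper is hypothetical, and it repeats the last N whitespace-separated words rather than whole sentences, which is a simplification of what is described here.

```python
def add_overlap(chunks: list[str], overlap_tokens: int = 64) -> list[str]:
    """Prefix each chunk with the last `overlap_tokens` words of the previous one."""
    if overlap_tokens <= 0:
        return list(chunks)
    out = []
    for i, current in enumerate(chunks):
        if i == 0:
            # The first chunk has nothing before it to repeat.
            out.append(current)
            continue
        tail = chunks[i - 1].split()[-overlap_tokens:]
        out.append(" ".join(tail + current.split()))
    return out
```

With a 512-token target, an overlap of roughly 50 to 75 words corresponds to the 10 to 15% range mentioned above.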

343
00:12:42,160 --> 00:12:45,120
Metadata needs to travel with every chunk you extract.

344
00:12:45,120 --> 00:12:47,200
When you split a document at the section level,

345
00:12:47,200 --> 00:12:49,360
you should pull the article number, the title,

346
00:12:49,360 --> 00:12:51,200
and the effective date at the same time.

347
00:12:51,200 --> 00:12:53,520
This data stays attached to the chunk forever.

348
00:12:53,520 --> 00:12:56,160
During a search, you can filter by specific articles

349
00:12:56,160 --> 00:12:57,680
or look within certain date ranges

350
00:12:57,680 --> 00:12:59,360
to make sure the answer is still valid.

351
00:12:59,360 --> 00:13:01,440
Metadata becomes a way to rank results,

352
00:13:01,440 --> 00:13:04,400
so an older section might carry less weight than a newer one.
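One way to picture metadata filtering and recency weighting in Python. The `Chunk` fields, the `filter_and_rank` helper, and the linear `recency_weight` penalty are assumptions made for illustration; the episode does not specify a ranking formula.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    article: str
    title: str
    effective: date
    score: float  # relevance score from retrieval, assumed to lie in [0, 1]

def filter_and_rank(chunks, article=None, start=None, end=None, recency_weight=0.1):
    # Filter on the metadata that travels with each chunk.
    kept = [
        c for c in chunks
        if (article is None or c.article == article)
        and (start is None or c.effective >= start)
        and (end is None or c.effective <= end)
    ]
    if not kept:
        return []
    newest = max(c.effective for c in kept)

    def adjusted(c):
        # Assumed penalty: subtract a small amount per year of age.
        age_years = (newest - c.effective).days / 365.25
        return c.score - recency_weight * age_years

    return sorted(kept, key=adjusted, reverse=True)
```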

353
00:13:04,400 --> 00:13:05,600
The numbers back this up.

354
00:13:05,600 --> 00:13:08,880
Testing shows that recursive chunking at 512 tokens

355
00:13:08,880 --> 00:13:12,480
hits 69% accuracy on legal retrieval benchmarks.

356
00:13:12,480 --> 00:13:14,640
That is just as good as expensive semantic methods

357
00:13:14,640 --> 00:13:16,480
but without all the processing lag.

358
00:13:16,480 --> 00:13:18,320
We use 512 tokens as a target

359
00:13:18,320 --> 00:13:21,280
because it balances keeping enough context with staying precise.

360
00:13:21,280 --> 00:13:22,960
If chunks are too small, the story breaks,

361
00:13:22,960 --> 00:13:26,160
but if they're too big, the specific answer gets lost in the noise.

362
00:13:26,160 --> 00:13:27,840
The actual setup is very practical.

363
00:13:27,840 --> 00:13:29,840
You parse the structure, chunk it recursively,

364
00:13:29,840 --> 00:13:30,880
and pull the metadata.

365
00:13:30,880 --> 00:13:32,880
Then you embed the chunks once and start testing

366
00:13:32,880 --> 00:13:33,840
with real questions.

367
00:13:33,840 --> 00:13:35,680
You look at the results, tweak your thresholds

368
00:13:35,680 --> 00:13:37,040
and move it into production.

369
00:13:37,040 --> 00:13:39,840
The overhead is low because the structure is already there.

370
00:13:39,840 --> 00:13:41,360
You aren't inventing anything new.

371
00:13:41,360 --> 00:13:43,680
You're just finally respecting the way the document was written.

372
00:13:43,680 --> 00:13:46,800
The multimodal ingestion pipeline.

373
00:13:46,800 --> 00:13:50,320
You might have 2,000 videos and 180,000 images buried

374
00:13:50,320 --> 00:13:52,480
in a 3.5 million page collection.

375
00:13:52,480 --> 00:13:54,960
Copilot cannot watch those videos in real time

376
00:13:54,960 --> 00:13:56,960
and it definitely can't interpret raw pixels

377
00:13:56,960 --> 00:13:58,720
while a user is waiting for an answer.

378
00:13:58,720 --> 00:14:01,760
If you try to send raw image data to an AI during a search,

379
00:14:01,760 --> 00:14:03,600
you would hit your limit and destroy your speed

380
00:14:03,600 --> 00:14:05,120
within the first few seconds.

381
00:14:05,120 --> 00:14:06,640
You have to make this media searchable

382
00:14:06,640 --> 00:14:08,640
before anyone ever asks a question.

383
00:14:08,640 --> 00:14:11,280
This happens through a process called asynchronous enrichment.

384
00:14:11,280 --> 00:14:13,440
The system runs in the background during off-peak hours

385
00:14:13,440 --> 00:14:15,600
to turn images and video into text

386
00:14:15,600 --> 00:14:17,360
that the computer can actually understand.

387
00:14:17,360 --> 00:14:18,960
It starts with your scanned documents.

388
00:14:18,960 --> 00:14:20,720
A huge portion of those millions of pages

389
00:14:20,720 --> 00:14:22,320
are probably physical evidence,

390
00:14:22,320 --> 00:14:24,160
photos of files or handwritten notes.

391
00:14:24,160 --> 00:14:25,360
These aren't digital texts.

392
00:14:25,360 --> 00:14:26,880
They are just pictures of words.

393
00:14:26,880 --> 00:14:29,840
You use OCR, optical character recognition,

394
00:14:29,840 --> 00:14:32,880
to turn those images into text during the ingestion phase.

395
00:14:32,880 --> 00:14:35,440
You feed the PDF to the engine, extract the words,

396
00:14:35,440 --> 00:14:38,160
and then chunk that text using the same structural strategy

397
00:14:38,160 --> 00:14:39,200
we just talked about.

398
00:14:39,200 --> 00:14:41,280
We also attach metadata about who scanned it

399
00:14:41,280 --> 00:14:42,480
and where it came from.

400
00:14:42,480 --> 00:14:44,160
If the OCR isn't sure about a word,

401
00:14:44,160 --> 00:14:45,920
it flags it with a confidence score.

402
00:14:45,920 --> 00:14:48,000
If a user gets an answer based on a messy scan

403
00:14:48,000 --> 00:14:49,440
with 70% confidence,

404
00:14:49,440 --> 00:14:50,640
they'll see that in the citation

405
00:14:50,640 --> 00:14:52,640
so they know to double check the original.

406
00:14:52,640 --> 00:14:54,080
Videos need to be transcribed next.

407
00:14:54,080 --> 00:14:56,160
You might have thousands of videos or hours of testimony

408
00:14:56,160 --> 00:14:56,960
to get through.

409
00:14:56,960 --> 00:14:58,320
If you try to transcribe a video

410
00:14:58,320 --> 00:15:00,160
the moment a user asks a question,

411
00:15:00,160 --> 00:15:02,160
the system would be way too slow to use.

412
00:15:02,160 --> 00:15:03,520
Instead, you do it ahead of time.

413
00:15:03,520 --> 00:15:05,440
You send the audio to a speech-to-text service

414
00:15:05,440 --> 00:15:07,600
to get a full transcript and a quick summary.

415
00:15:07,600 --> 00:15:09,360
If you can identify who is speaking,

416
00:15:09,360 --> 00:15:10,800
you label those segments too.

417
00:15:10,800 --> 00:15:13,760
That transcript becomes the part the AI actually searches.

418
00:15:13,760 --> 00:15:15,680
We keep the timestamps as metadata,

419
00:15:15,680 --> 00:15:16,720
so when a match is found,

420
00:15:16,720 --> 00:15:18,320
the system doesn't just show text.

421
00:15:18,320 --> 00:15:21,440
It says, "See the testimony starting at 1:24:30,"

422
00:15:21,440 --> 00:15:24,400
and gives the user a link to jump right to that spot.
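A small Python sketch of turning a stored timestamp into that kind of citation. The helper names and the `t=<seconds>` query parameter are assumptions; your video player may use a different deep-link format.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as h:mm:ss."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

def citation(video_url: str, start_seconds: float) -> str:
    # Assumption: the player accepts a `t=<seconds>` query parameter.
    stamp = format_timestamp(start_seconds)
    return f"See the testimony starting at {stamp}: {video_url}?t={int(start_seconds)}"
```

For example, an offset of 5070 seconds renders as 1:24:30.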

423
00:15:24,400 --> 00:15:26,560
Images require a different kind of processing.

424
00:15:26,560 --> 00:15:29,600
Running a massive vision model on nearly 200,000 images

425
00:15:29,600 --> 00:15:30,960
is expensive and slow.

426
00:15:30,960 --> 00:15:33,120
To solve this, you use a smaller, faster model

427
00:15:33,120 --> 00:15:35,920
that is built for speed rather than perfect artistic detail.

428
00:15:35,920 --> 00:15:37,600
You give the model a simple job:

429
00:15:37,600 --> 00:15:40,000
describe what you see, find any text,

430
00:15:40,000 --> 00:15:42,000
and list the people or objects in the frame.

431
00:15:42,000 --> 00:15:43,760
The model gives you a text description

432
00:15:43,760 --> 00:15:46,320
and any visible words are pulled out via OCR.

433
00:15:46,320 --> 00:15:48,320
This text becomes the metadata for the image.

434
00:15:48,320 --> 00:15:49,840
When Copilot finds a match,

435
00:15:49,840 --> 00:15:51,280
it isn't looking at the pixels.

436
00:15:51,280 --> 00:15:53,280
It's reading the description of those pixels.

437
00:15:53,280 --> 00:15:55,840
Metadata enrichment has to happen across every single format.

438
00:15:55,840 --> 00:15:58,320
We use named entity recognition to find people,

439
00:15:58,320 --> 00:16:00,880
places and dates in every document and transcript.

440
00:16:00,880 --> 00:16:02,240
We also look for relationships

441
00:16:02,240 --> 00:16:04,080
like seeing that person A and person B

442
00:16:04,080 --> 00:16:05,840
were in the same room on a specific date.

443
00:16:05,840 --> 00:16:08,400
If a user wants to know what happened in 2005,

444
00:16:08,400 --> 00:16:10,240
the metadata filters the search down

445
00:16:10,240 --> 00:16:12,000
to only the files tagged with that year.

446
00:16:12,000 --> 00:16:14,080
This all happens once during the initial setup,

447
00:16:14,080 --> 00:16:15,840
not every time someone types a query.

448
00:16:15,840 --> 00:16:19,120
Your storage strategy should separate the hot and cold data.

449
00:16:19,120 --> 00:16:21,600
The original heavy files like PDFs and videos

450
00:16:21,600 --> 00:16:24,160
stay in cheap storage where they are rarely touched.

451
00:16:24,160 --> 00:16:25,840
The extracted text and metadata

452
00:16:25,840 --> 00:16:27,680
live in fast indexed databases

453
00:16:27,680 --> 00:16:29,600
that the AI can search instantly.

454
00:16:29,600 --> 00:16:30,560
When a search happens,

455
00:16:30,560 --> 00:16:32,640
the system pulls from the fast index.

456
00:16:32,640 --> 00:16:35,520
If the user actually needs to see the original photo,

457
00:16:35,520 --> 00:16:37,440
they just click a link to the cheap storage.

458
00:16:37,440 --> 00:16:40,240
This keeps the AI from getting bogged down by massive files

459
00:16:40,240 --> 00:16:41,520
it doesn't need to see.

460
00:16:41,520 --> 00:16:44,080
All of this work, the OCR, the transcription,

461
00:16:44,080 --> 00:16:45,520
and the vision processing

462
00:16:45,520 --> 00:16:46,720
happens on a schedule.

463
00:16:46,720 --> 00:16:48,400
It might run every night or every week

464
00:16:48,400 --> 00:16:49,680
as new documents arrive.

465
00:16:49,680 --> 00:16:52,800
The enrichment finishes before the data ever hits the search index.

466
00:16:52,800 --> 00:16:55,040
This is why the user gets an answer in seconds.

467
00:16:55,040 --> 00:16:57,200
All the hard, expensive work was already finished

468
00:16:57,200 --> 00:16:58,800
before they even opened the app.

469
00:16:58,800 --> 00:17:01,760
This is the only way to handle massive amounts of different media.

470
00:17:01,760 --> 00:17:04,080
You make it searchable without breaking your budget

471
00:17:04,080 --> 00:17:06,800
or making your users wait forever for a response.

472
00:17:06,800 --> 00:17:09,920
Entity extraction and knowledge graphs. Raw text is noise,

473
00:17:09,920 --> 00:17:11,680
but structured entities are signal.

474
00:17:11,680 --> 00:17:13,280
By this stage in your pipeline,

475
00:17:13,280 --> 00:17:14,640
you have already chunked your documents

476
00:17:14,640 --> 00:17:15,920
and enriched your media,

477
00:17:15,920 --> 00:17:18,400
but you still face a massive retrieval problem.

478
00:17:18,400 --> 00:17:21,440
Imagine a user asks about the relationship between person X

479
00:17:21,440 --> 00:17:22,400
and person Y.

480
00:17:22,400 --> 00:17:25,440
Your chunking strategy might pull up relevant passages

481
00:17:25,440 --> 00:17:28,480
and your embedding model will find semantically similar text,

482
00:17:28,480 --> 00:17:30,720
but the final answer still forces the LLM

483
00:17:30,720 --> 00:17:31,920
to do the heavy lifting.

484
00:17:31,920 --> 00:17:34,080
It has to read through disconnected statements

485
00:17:34,080 --> 00:17:35,280
across multiple chunks

486
00:17:35,280 --> 00:17:38,160
and try to synthesize a relationship on the fly.

487
00:17:38,160 --> 00:17:40,000
If those statements come from different indexes

488
00:17:40,000 --> 00:17:42,400
or different time periods, the connection weakens

489
00:17:42,400 --> 00:17:44,240
and the LLM starts to infer.

490
00:17:44,240 --> 00:17:46,400
At scale, inference turns into hallucination.

491
00:17:46,400 --> 00:17:48,480
You need a layer that extracts these relationships

492
00:17:48,480 --> 00:17:50,080
before the query ever arrives

493
00:17:50,080 --> 00:17:52,880
and that layer is entity extraction and knowledge graphs.

494
00:17:52,880 --> 00:17:56,000
Named Entity Recognition runs during the enrichment phase

495
00:17:56,000 --> 00:17:58,400
where every document, transcript, and image description

496
00:17:58,400 --> 00:18:00,000
passes through an NER model.

497
00:18:00,000 --> 00:18:02,800
This model identifies every person, organization, location

498
00:18:02,800 --> 00:18:04,320
and date mentioned in your text.

499
00:18:04,320 --> 00:18:07,920
It sees that person X appears in document A on page 42

500
00:18:07,920 --> 00:18:11,200
while organization Y shows up in the contract section of document B.

501
00:18:11,200 --> 00:18:13,760
The system extracts these with precise position information

502
00:18:13,760 --> 00:18:16,640
so it doesn't just note that a person exists in your data.

503
00:18:16,640 --> 00:18:19,360
It records that person X was mentioned at exactly 12 minutes

504
00:18:19,360 --> 00:18:21,440
and 56 seconds into video three.

505
00:18:21,440 --> 00:18:24,000
Relationship mapping then takes these isolated extractions

506
00:18:24,000 --> 00:18:26,800
and turns them into a graph that captures real world connections.

507
00:18:26,800 --> 00:18:29,680
If person X appears with person Y in one document

508
00:18:29,680 --> 00:18:32,080
and Y wires money to organization Z in another,

509
00:18:32,080 --> 00:18:33,920
the graph maps that path.

510
00:18:33,920 --> 00:18:36,240
While none of these facts are remarkable on their own,

511
00:18:36,240 --> 00:18:38,400
they form a clear pattern when viewed together.

512
00:18:38,400 --> 00:18:41,120
The graph itself does not interpret the pattern,

513
00:18:41,120 --> 00:18:43,680
as that is the LLM's job during generation,

514
00:18:43,680 --> 00:18:47,200
but it surfaces the structure so the LLM doesn't have to scan and guess.
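To make the path idea concrete, here is a minimal in-memory Python sketch. The `EntityGraph` class, the relation labels, and the source names are hypothetical; a production system would more likely use a graph database.

```python
from collections import defaultdict, deque

class EntityGraph:
    def __init__(self):
        # entity -> list of (neighbor, relation, source document)
        self.edges = defaultdict(list)

    def add(self, a, relation, b, source):
        # Stored in both directions so paths can be walked either way.
        self.edges[a].append((b, relation, source))
        self.edges[b].append((a, relation, source))

    def path(self, start, goal):
        """Breadth-first search; returns the shortest chain of hops, or None."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, hops = queue.popleft()
            if node == goal:
                return hops
            for neighbor, relation, source in self.edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + [(node, relation, neighbor, source)]))
        return None
```

Each hop keeps its source document, so the LLM receives the chain together with the evidence for every link rather than a bare claim.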

515
00:18:47,200 --> 00:18:49,840
Cross-reference resolution is how you solve the identity problem

516
00:18:49,840 --> 00:18:51,360
across millions of pages.

517
00:18:51,360 --> 00:18:53,360
Person X might be mentioned hundreds of times

518
00:18:53,360 --> 00:18:55,680
but the text won't always use their full name.

519
00:18:55,680 --> 00:18:57,520
Sometimes the data says Mr X.

520
00:18:57,520 --> 00:18:59,360
Sometimes it just uses a last name

521
00:18:59,360 --> 00:19:01,680
and other times it uses a pronoun like he or she,

522
00:19:01,680 --> 00:19:02,960
depending on the context.

523
00:19:02,960 --> 00:19:06,000
You have to recognize that all of these mentions refer to the same individual

524
00:19:06,000 --> 00:19:07,920
which requires fuzzy matching at scale.

525
00:19:07,920 --> 00:19:10,640
By using string similarity and co-occurrence patterns,

526
00:19:10,640 --> 00:19:13,680
the system links these mentions to a single canonical identifier

527
00:19:13,680 --> 00:19:15,040
so your data stays clean.
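A rough Python sketch of the string-similarity half of this, using the standard library's `difflib`. The `TITLES` set, the 0.85 threshold, and the surname shortcut are assumptions. Pronouns are not handled here, because they need surrounding context such as the co-occurrence patterns just mentioned.

```python
import re
from difflib import SequenceMatcher

TITLES = {"mr", "mrs", "ms", "dr"}

def normalize(mention: str) -> str:
    # Lowercase, drop punctuation, and strip honorifics.
    words = re.findall(r"[a-z]+", mention.lower())
    return " ".join(w for w in words if w not in TITLES)

def resolve(mention: str, canonical: dict[str, str], threshold: float = 0.85):
    """Map a mention to a canonical entity ID, or None if nothing is close enough.

    `canonical` maps entity IDs to full names.
    """
    m = normalize(mention)
    if not m:
        return None
    best_id, best_score = None, 0.0
    for entity_id, name in canonical.items():
        n = normalize(name)
        if not n:
            continue
        # A bare surname counts as a full match on the last name token.
        score = 1.0 if m == n.split()[-1] else SequenceMatcher(None, m, n).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None
```

Note that a surname shared by two people would resolve to whichever entity is checked first, which is exactly where co-occurrence evidence has to break the tie.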

528
00:19:15,040 --> 00:19:17,600
Temporal indexing is what organizes your graph by time

529
00:19:17,600 --> 00:19:19,440
because entities are never static.

530
00:19:19,440 --> 00:19:20,800
A person's role changes,

531
00:19:20,800 --> 00:19:23,360
their association with the company ends on a specific date

532
00:19:23,360 --> 00:19:26,320
and their involvement in an event spans a specific range.

533
00:19:26,320 --> 00:19:28,080
Your graph stores this metadata,

534
00:19:28,080 --> 00:19:31,840
allowing a user to ask what changed about a person's role after a certain date.

535
00:19:31,840 --> 00:19:33,440
Because this information is indexed,

536
00:19:33,440 --> 00:19:35,840
these queries are answered at the retrieval stage

537
00:19:35,840 --> 00:19:39,120
rather than forcing the LLM to hunt for dates in free text.
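A minimal Python sketch of time-bounded edges and the "what changed after this date" query. The `RoleEdge` fields and both helper functions are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class RoleEdge:
    person: str
    role: str
    org: str
    start: date
    end: Optional[date] = None  # None means the role is still active

def roles_on(edges, person, day):
    """Roles the person held on a given day."""
    return [e for e in edges if e.person == person
            and e.start <= day and (e.end is None or day <= e.end)]

def changes_after(edges, person, day):
    """Roles that started or ended after `day`, in chronological order."""
    events = []
    for e in edges:
        if e.person != person:
            continue
        if e.start > day:
            events.append((e.start, "started", e.role, e.org))
        if e.end is not None and e.end > day:
            events.append((e.end, "ended", e.role, e.org))
    return sorted(events)
```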

538
00:19:39,120 --> 00:19:41,360
Confidence scoring is used to mark uncertainty

539
00:19:41,360 --> 00:19:43,360
because not every extraction is a guarantee.

540
00:19:43,360 --> 00:19:46,880
The NER model might find a name with 98% confidence in one sentence,

541
00:19:46,880 --> 00:19:50,560
but only 62% confidence when it's guessing based on a pronoun.

542
00:19:50,560 --> 00:19:52,000
The system records these scores

543
00:19:52,000 --> 00:19:54,560
so the user knows if an answer is grounded in solid facts

544
00:19:54,560 --> 00:19:56,080
or uncertain inferences.

545
00:19:56,080 --> 00:19:58,560
This level of transparency is what prevents the system

546
00:19:58,560 --> 00:20:00,800
from showing false confidence when the data is thin.

547
00:20:00,800 --> 00:20:02,320
Query leverage fundamentally changes

548
00:20:02,320 --> 00:20:04,560
how people interact with your information.

549
00:20:04,560 --> 00:20:06,960
Instead of basic keyword searches for a name,

550
00:20:06,960 --> 00:20:10,800
users can search by the entity itself to see every document where that person appears.

551
00:20:10,800 --> 00:20:14,160
They can ask for a list of every location a person visited in 2005

552
00:20:14,160 --> 00:20:16,880
or define the relationship between two specific parties.

553
00:20:16,880 --> 00:20:19,440
These entity-based queries are more precise and aligned

554
00:20:19,440 --> 00:20:21,600
with how people actually think about their data.

555
00:20:21,600 --> 00:20:24,240
The implementation handles all of this during ingestion

556
00:20:24,240 --> 00:20:26,240
so that queries run fast when it matters.

557
00:20:26,240 --> 00:20:28,560
You build the graphs, index the relationships,

558
00:20:28,560 --> 00:20:30,480
and attach the temporal metadata once.

559
00:20:30,480 --> 00:20:34,400
The result is a system where users get answers grounded in structured relationships

560
00:20:34,400 --> 00:20:36,480
rather than loose textual associations

561
00:20:36,480 --> 00:20:38,480
that might lead to errors.

562
00:20:38,480 --> 00:20:40,960
The agentic router: query decomposition.

563
00:20:40,960 --> 00:20:43,680
You have invested heavily in your data architecture at this point.

564
00:20:43,680 --> 00:20:46,080
Your documents are chunked across three tiers.

565
00:20:46,080 --> 00:20:47,520
Your images have descriptions

566
00:20:47,520 --> 00:20:50,560
and your knowledge graph tracks every relationship and date.

567
00:20:50,560 --> 00:20:53,520
Everything is ready for use, but this is where most systems fail.

568
00:20:53,520 --> 00:20:56,000
A user asks a question and the system has no idea

569
00:20:56,000 --> 00:20:59,920
how to interpret the intent or which layer of the architecture it should actually trigger.

570
00:20:59,920 --> 00:21:03,360
A standard retrieval pipeline treats every query exactly the same way

571
00:21:03,360 --> 00:21:05,760
by searching vectors and returning results.

572
00:21:05,760 --> 00:21:07,360
This works fine for simple questions,

573
00:21:07,360 --> 00:21:10,080
but enterprise data is rarely that straightforward.

574
00:21:10,080 --> 00:21:13,920
If a user asks for flight logs from 2005 regarding a specific person,

575
00:21:13,920 --> 00:21:16,080
a simple semantic search will likely fail.

576
00:21:16,080 --> 00:21:18,800
That request requires the system to understand a specific time,

577
00:21:18,800 --> 00:21:21,520
a specific entity, and a specific type of record.

578
00:21:21,520 --> 00:21:23,520
You cannot rely on a single retrieval path,

579
00:21:23,520 --> 00:21:25,520
so you need an agent that breaks the question down

580
00:21:25,520 --> 00:21:27,360
and routes the pieces to the right systems.

581
00:21:27,360 --> 00:21:29,440
This is the role of the agentic router.

582
00:21:29,440 --> 00:21:32,640
Intent classification is the very first decision the router makes.

583
00:21:32,640 --> 00:21:33,840
When a query comes in,

584
00:21:33,840 --> 00:21:36,720
the system classifies what the user is actually looking for,

585
00:21:36,720 --> 00:21:40,160
whether it is a temporal, relational, or transactional request.

586
00:21:40,160 --> 00:21:44,640
It might be a structural query about how documents are organized or a simple entity search.

587
00:21:44,640 --> 00:21:47,680
To keep things fast, you should use a lightweight supervised classifier

588
00:21:47,680 --> 00:21:49,920
rather than an expensive LLM call.

589
00:21:49,920 --> 00:21:53,280
This classifier runs in just a few milliseconds and provides an intent type

590
00:21:53,280 --> 00:21:55,840
along with a confidence score to guide the next step.

591
00:21:55,840 --> 00:21:59,600
Subquery generation then breaks the main question into smaller discrete tasks

592
00:21:59,600 --> 00:22:01,040
that match the identified intent.

593
00:22:01,040 --> 00:22:04,400
If a user asks for flight logs from 2005 for Person X,

594
00:22:04,400 --> 00:22:08,400
the router sees this as a mix of temporal, entity, and structured data needs.

595
00:22:08,400 --> 00:22:11,760
It generates three separate subqueries: one for the knowledge graph,

596
00:22:11,760 --> 00:22:14,640
one for the SQL database, and one for the document index.

597
00:22:14,640 --> 00:22:18,560
Each of these is written in the specific syntax that the target system requires.

598
00:22:18,560 --> 00:22:20,960
The router then fires all three requests in parallel

599
00:22:20,960 --> 00:22:22,880
while maintaining the relationship between them.

600
00:22:22,880 --> 00:22:27,040
Data silo routing determines which specific index each of those subqueries should hit.

601
00:22:27,040 --> 00:22:29,440
Your architecture likely has multiple systems,

602
00:22:29,440 --> 00:22:32,800
including SQL databases for records and vector indexes for concepts.

603
00:22:32,800 --> 00:22:35,280
The router keeps track of what each system contains,

604
00:22:35,280 --> 00:22:40,080
so it can send a transaction query to the database and a conceptual query to the vector index.

605
00:22:40,080 --> 00:22:43,040
While this routing can be learned through feedback over time,

606
00:22:43,040 --> 00:22:47,920
it usually starts with a set of rules that connect intent classes to the most logical data source.
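A hedged Python sketch of a rule table plus parallel dispatch. The intent names, the `ROUTES` mapping, and the `query_source` stub are all hypothetical; real backends would each receive a subquery in their own syntax.

```python
import asyncio

# Hypothetical rule table: intent class -> data sources to query.
ROUTES = {
    "temporal": ["knowledge_graph"],
    "relational": ["knowledge_graph", "vector_index"],
    "transactional": ["sql"],
    "conceptual": ["vector_index"],
}

async def query_source(source: str, question: str) -> dict:
    # Stand-in for a real backend call.
    await asyncio.sleep(0)
    return {"source": source, "question": question, "hits": []}

async def route(intents: list[str], question: str) -> list[dict]:
    # Deduplicate sources while keeping a stable order.
    sources = list(dict.fromkeys(s for i in intents for s in ROUTES.get(i, [])))
    # Fire all subqueries in parallel; gather keeps results in `sources` order.
    return await asyncio.gather(*(query_source(s, question) for s in sources))
```

Calling `asyncio.run(route(["temporal", "transactional"], "flight logs from 2005 for Person X"))` would query the graph and the SQL store concurrently.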

607
00:22:47,920 --> 00:22:51,520
Result synthesis is the process of merging all those different answers

608
00:22:51,520 --> 00:22:53,280
before the LLM ever sees them.

609
00:22:53,280 --> 00:22:55,840
Because different systems respond at different speeds,

610
00:22:55,840 --> 00:22:59,440
the router collects the results asynchronously and checks for redundancy.

611
00:22:59,440 --> 00:23:03,120
If the knowledge graph and the document search both confirm the same connection,

612
00:23:03,120 --> 00:23:05,280
the router weights that finding more heavily.

613
00:23:05,280 --> 00:23:08,720
It also looks for contradictions between sources and flags them immediately.

614
00:23:08,720 --> 00:23:10,720
By the time the LLM receives the data,

615
00:23:10,720 --> 00:23:12,960
it is looking at a cross-checked set of candidates,

616
00:23:12,960 --> 00:23:15,120
which significantly drops the risk of hallucination.
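One possible shape for that merge step in Python. The tuple layout and the additive weighting are assumptions; the episode only says that corroborated findings are weighted more heavily and contradictions are flagged.

```python
from collections import defaultdict

def synthesize(results):
    """results: iterable of (source, key, value, confidence) tuples."""
    # key -> value -> {source: confidence}
    grouped = defaultdict(lambda: defaultdict(dict))
    for source, key, value, confidence in results:
        grouped[key][value][source] = confidence
    candidates = []
    for key, values in grouped.items():
        for value, per_source in values.items():
            candidates.append({
                "key": key,
                "value": value,
                "sources": sorted(per_source),
                # Assumed weighting: confidences add up when sources agree.
                "weight": sum(per_source.values()),
                # Flag the key if sources reported different values for it.
                "contradicted": len(values) > 1,
            })
    return sorted(candidates, key=lambda c: c["weight"], reverse=True)
```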

617
00:23:15,120 --> 00:23:20,400
Fallback logic ensures the system handles failures gracefully without giving up on the user.

618
00:23:20,400 --> 00:23:23,360
If a database search for a specific person comes up empty,

619
00:23:23,360 --> 00:23:26,880
the router doesn't just stop. It automatically tries an alternative path,

620
00:23:26,880 --> 00:23:29,920
like searching the document index for that same person and date.

621
00:23:29,920 --> 00:23:32,640
These pre-planned fallback routes happen behind the scenes,

622
00:23:32,640 --> 00:23:35,920
so the user sees a complete answer instead of an error message.

623
00:23:35,920 --> 00:23:40,560
It turns a potential failure into a successful retrieval by checking every available resource.

624
00:23:40,560 --> 00:23:44,240
Confidence thresholds are used to decide when a query needs to be escalated.

625
00:23:44,240 --> 00:23:46,800
If every subquery returns a low confidence result,

626
00:23:46,800 --> 00:23:50,960
or the fallback paths fail, the router flags the entire request as uncertain.

627
00:23:50,960 --> 00:23:55,360
Instead of letting the LLM guess and potentially lie, the system can trigger a search of cold storage

628
00:23:55,360 --> 00:24:00,320
or ask the user for clarification. It might even escalate the task to a human analyst for review.

629
00:24:00,320 --> 00:24:05,200
This prevents the system from projecting false certainty when the data just isn't there.
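Fallback and escalation together might look like this in Python. The `answer_with_fallback` helper, the 0.6 threshold, and the escalation labels are assumptions for illustration.

```python
def answer_with_fallback(paths, threshold=0.6):
    """paths: ordered list of (name, search_fn); search_fn returns (hits, confidence)."""
    tried = []
    for name, search in paths:
        hits, confidence = search()
        tried.append(name)
        # Accept the first path that returns hits above the confidence bar.
        if hits and confidence >= threshold:
            return {"status": "answered", "path": name, "hits": hits, "tried": tried}
    # Nothing cleared the bar: escalate instead of letting the LLM guess.
    return {
        "status": "uncertain",
        "next": ["cold_storage_search", "ask_user", "human_review"],
        "tried": tried,
    }
```

The ordered `paths` list is where the pre-planned routes live, for example the SQL lookup first and the document index second.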

630
00:24:05,200 --> 00:24:08,960
The agent itself is designed to be lean, fast, and strictly rule-based,

631
00:24:08,960 --> 00:24:10,400
to keep latency at a minimum.

632
00:24:10,400 --> 00:24:15,040
Its only job is to figure out what the user wants and orchestrate the retrieval process efficiently.

633
00:24:15,040 --> 00:24:18,000
By letting each specialized system do what it does best,

634
00:24:18,000 --> 00:24:21,520
the router ensures that the final answer is as accurate as possible.

635
00:24:21,520 --> 00:24:23,200
Hybrid retrieval strategy.

636
00:24:23,200 --> 00:24:27,360
The router has made its decision and subqueries are now firing across different systems.

637
00:24:27,360 --> 00:24:31,520
Now those systems need to return candidates that are both precise and semantically relevant.

638
00:24:31,520 --> 00:24:33,680
This is where the hybrid approach becomes essential.

639
00:24:33,680 --> 00:24:38,560
Think of your retrieval layer as having two parallel search mechanisms running at the same time,

640
00:24:38,560 --> 00:24:40,800
and each one has its own distinct strengths.

641
00:24:40,800 --> 00:24:45,920
BM25 is the probabilistic keyword search algorithm that has been the industry standard since the 90s,

642
00:24:45,920 --> 00:24:49,680
and it excels at precision. If you ask for documents mentioning a person by name,

643
00:24:49,680 --> 00:24:52,640
BM25 finds those exact matches instantly.

644
00:24:52,640 --> 00:24:56,800
You want records from March of 2005, or specific transaction amounts,

645
00:24:56,800 --> 00:24:59,680
or precise legal terminology? BM25 owns these.

646
00:24:59,680 --> 00:25:02,720
It is deterministic and fast, but it only works at the surface level.

647
00:25:02,720 --> 00:25:05,760
If a document discusses the same concept using different vocabulary,

648
00:25:05,760 --> 00:25:07,600
BM25 is going to miss it.

649
00:25:07,600 --> 00:25:11,120
Vector search does the opposite work by embedding your query into a semantic space.

650
00:25:11,120 --> 00:25:13,920
It finds documents whose embeddings are closest in that space,

651
00:25:13,920 --> 00:25:17,280
which allows it to find conceptual similarity across different words.

652
00:25:17,280 --> 00:25:19,760
You might ask about financial transfers between parties,

653
00:25:19,760 --> 00:25:23,360
and vector search will retrieve documents discussing money movement,

654
00:25:23,360 --> 00:25:26,800
even if they use words like payment or disbursement instead of transfer.

655
00:25:26,800 --> 00:25:29,200
But here's the problem. Vector search can be noisy.

656
00:25:29,200 --> 00:25:31,840
A document that is semantically similar to your query

657
00:25:31,840 --> 00:25:34,400
might be totally irrelevant to what you actually need,

658
00:25:34,400 --> 00:25:38,240
because embedding space distance does not always align with what a user wants.

659
00:25:38,240 --> 00:25:40,800
Combining them yields something neither provides alone.

660
00:25:40,800 --> 00:25:44,480
You run BM25 against your corpus to get a ranked list of keyword matches,

661
00:25:44,480 --> 00:25:48,640
and simultaneously you run a vector search to get a ranked list of semantic matches.

662
00:25:48,640 --> 00:25:51,120
These two lists almost never align perfectly.

663
00:25:51,120 --> 00:25:53,920
BM25 puts exact match documents at the top,

664
00:25:53,920 --> 00:25:57,200
while vector search scatters results based on semantic proximity.

665
00:25:57,200 --> 00:26:00,560
The union of these two lists contains both precision and recall,

666
00:26:00,560 --> 00:26:03,600
the intersection contains documents that match both criteria,

667
00:26:03,600 --> 00:26:06,560
meaning they are exact matches and they are semantically coherent.

668
00:26:06,560 --> 00:26:08,320
But you cannot just stack the results.

669
00:26:08,320 --> 00:26:09,840
You have to fuse the rankings.

670
00:26:09,840 --> 00:26:11,760
Reciprocal rank fusion handles this.

671
00:26:11,760 --> 00:26:14,800
RRF takes the BM25 ranking and the vector ranking,

672
00:26:14,800 --> 00:26:18,240
and merges them using a formula that avoids favoring one signal over the other.

673
00:26:18,240 --> 00:26:22,320
If a document ranks first in BM25 and tenth in vector search,

674
00:26:22,320 --> 00:26:25,440
RRF gives it a high weight because it is important to both methods.

675
00:26:25,440 --> 00:26:29,920
If a document ranks second in vector search but appears nowhere in BM25,

676
00:26:29,920 --> 00:26:35,520
RRF places it lower because semantic similarity without any keyword connection is a weaker signal.

677
00:26:35,520 --> 00:26:38,240
The fused ranking reflects where the two methods agree

678
00:26:38,240 --> 00:26:39,920
and where they complement each other.

679
00:26:39,920 --> 00:26:42,880
In practice, RRF produces better results than either method alone

680
00:26:42,880 --> 00:26:44,960
because it balances precision and recall.
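
The fusion step described above can be sketched in a few lines. This is a minimal illustration, not the system's actual implementation: the function name, the document IDs, and the smoothing constant k=60 (a commonly used default for reciprocal rank fusion) are assumptions, since the episode does not name them.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into a single list, best first.

    Each document scores 1 / (k + rank) for every list it appears in.
    k=60 is an assumed default; the transcript does not specify a value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID so ties are stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings: doc_a is 1st in BM25 and 10th in vector search,
# doc_b is 2nd in vector search and absent from BM25.
bm25_ranking = ["doc_a", "doc_c", "doc_d"]
vector_ranking = ["doc_e", "doc_b"] + [f"filler_{i}" for i in range(7)] + ["doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

With these inputs doc_a lands first because it collects a score from both lists, while doc_b, which appears only in the vector list, ranks below it.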

681
00:26:44,960 --> 00:26:49,040
Metadata filtering happens before or after retrieval depending on your architecture.

682
00:26:49,040 --> 00:26:52,800
Before retrieval, you narrow the search space to make things more efficient.

683
00:26:52,800 --> 00:26:55,760
If a user query asks about the year 2005,

684
00:26:55,760 --> 00:26:59,120
you filter the indexes to only show documents with 2005 dates.

685
00:26:59,120 --> 00:27:02,560
This reduces the candidate set before any expensive operations run.

686
00:27:02,560 --> 00:27:06,080
After retrieval, you filter results by permission to ensure security.

687
00:27:06,080 --> 00:27:10,960
If a user does not have access to a specific document, the system removes it from the results.

688
00:27:10,960 --> 00:27:14,320
Metadata filtering at query time is essential for compliance and security

689
00:27:14,320 --> 00:27:16,080
but it does add latency.

690
00:27:16,080 --> 00:27:19,040
Prefiltering is faster because it reduces the total workload

691
00:27:19,040 --> 00:27:21,280
not because the logic is more sophisticated.

692
00:27:21,280 --> 00:27:24,000
Re-ranking applies after the hybrid fusion is complete.

693
00:27:24,000 --> 00:27:26,800
You have a merged candidate set ranked by RRF

694
00:27:26,800 --> 00:27:28,880
and now you pass this to a cross-encoder model.

695
00:27:28,880 --> 00:27:31,440
This is a lightweight transformer that reads the query

696
00:27:31,440 --> 00:27:34,480
and each candidate together to assign a new relevance score.

697
00:27:34,480 --> 00:27:37,440
The cross-encoder understands query-document interaction

698
00:27:37,440 --> 00:27:39,440
better than embedding similarity alone,

699
00:27:39,440 --> 00:27:41,360
which allows it to catch small nuances.

700
00:27:41,360 --> 00:27:45,680
A document that is semantically close but contextually wrong will drop in score

701
00:27:45,680 --> 00:27:49,440
while an exact match document that is slightly off topic gets re-weighted.

702
00:27:49,440 --> 00:27:51,280
The re-ranker produces the final ranking

703
00:27:51,280 --> 00:27:53,280
and your top candidates come from this list.

704
00:27:53,280 --> 00:27:54,960
The cost of all this is latency.

705
00:27:54,960 --> 00:27:58,240
RRF itself is fast because it is just arithmetic on two ranked lists

706
00:27:58,240 --> 00:28:02,000
and metadata filtering is fast because it uses simple comparison logic.

707
00:28:02,000 --> 00:28:07,120
But re-ranking hits a neural model which typically adds 100 to 150 milliseconds

708
00:28:07,120 --> 00:28:08,800
for a batch of 50 candidates.

709
00:28:08,800 --> 00:28:10,080
That is a meaningful delay.

710
00:28:10,080 --> 00:28:14,000
In aggregate your search call might take 150 to 300 milliseconds

711
00:28:14,000 --> 00:28:17,600
instead of the 30 to 50 milliseconds you would see with pure vector search.

712
00:28:17,600 --> 00:28:18,800
The payoff is precision.

713
00:28:18,800 --> 00:28:21,520
Precision gains compound as you move downstream.

714
00:28:21,520 --> 00:28:24,800
Fewer bad candidates reach the LLM which means fewer tokens are wasted

715
00:28:24,800 --> 00:28:27,120
on irrelevant context and there are fewer hallucinations.

716
00:28:27,120 --> 00:28:30,080
The marginal latency cost prevents much larger problems later.

717
00:28:31,040 --> 00:28:34,080
Permission-aware retrieval. At three and a half million pages,

718
00:28:34,080 --> 00:28:37,280
you cannot afford to surface the wrong document to the wrong person.

719
00:28:37,280 --> 00:28:38,640
Governance is not optional.

720
00:28:38,640 --> 00:28:42,320
It is the difference between a working system and a massive legal liability.

721
00:28:42,320 --> 00:28:44,400
Your retrieval infrastructure is now solid:

722
00:28:44,400 --> 00:28:46,800
hybrid search finds candidates, re-ranking orders them

723
00:28:46,800 --> 00:28:48,720
and the LLM grounds the answers.

724
00:28:48,720 --> 00:28:52,400
But none of this matters if retrieval returns a document the user is not allowed to see.

725
00:28:52,400 --> 00:28:54,080
That is not just a technical failure.

726
00:28:54,080 --> 00:28:55,280
That is a security breach.

727
00:28:55,280 --> 00:28:57,600
The problem scales as the corpus gets bigger.

728
00:28:57,600 --> 00:29:01,280
In a small knowledge base, access control happens at the document level

729
00:29:01,280 --> 00:29:03,440
where a user either has permission or they do not.

730
00:29:03,440 --> 00:29:04,880
It is binary and simple.

731
00:29:04,880 --> 00:29:08,160
At three and a half million pages across multiple tiers and silos,

732
00:29:08,160 --> 00:29:09,840
permission becomes granular.

733
00:29:09,840 --> 00:29:12,720
A user might have access to some sections of a document but not others

734
00:29:12,720 --> 00:29:15,440
or they might see emails from one time period but not another.

735
00:29:15,440 --> 00:29:18,320
They can see financial summaries but not the raw transaction details.

736
00:29:18,320 --> 00:29:22,240
These fine-grained access patterns cannot be managed at the document level.

737
00:29:22,240 --> 00:29:23,680
They need to live at the chunk level.

738
00:29:23,680 --> 00:29:27,040
This means access control lists must travel with the data during ingestion.

739
00:29:27,600 --> 00:29:30,640
When you chunk a document, you do not just extract the text and embed it.

740
00:29:30,640 --> 00:29:32,560
You extract the permission metadata as well.

741
00:29:32,560 --> 00:29:35,600
You need to know who created the chunk, which users have read access,

742
00:29:35,600 --> 00:29:37,200
and what the classification level is.

743
00:29:37,200 --> 00:29:39,120
This metadata attaches to every single chunk.

744
00:29:39,120 --> 00:29:40,880
It is not stored in a separate database.

745
00:29:40,880 --> 00:29:43,360
It is indexed right alongside the chunk content.

746
00:29:43,360 --> 00:29:46,640
During retrieval, the permissions are always available for the system to check.

747
00:29:46,640 --> 00:29:49,680
Query time filtering trims the results based on user identity.

748
00:29:49,680 --> 00:29:53,360
A user submits a question and the system identifies them through Azure AD

749
00:29:53,360 --> 00:29:54,880
or another identity system.

750
00:29:54,880 --> 00:29:57,680
The router passes that identity through the retrieval pipeline.

751
00:29:57,680 --> 00:30:02,960
When BM25 returns candidates, the filtering logic immediately removes chunks the user cannot access.

752
00:30:02,960 --> 00:30:04,800
The same thing happens with vector search.

753
00:30:04,800 --> 00:30:07,600
Re-ranking only receives chunks the user is authorized to see.

754
00:30:07,600 --> 00:30:10,640
By the time the results reach the LLM,

755
00:30:10,640 --> 00:30:12,640
every candidate has been permission checked.

756
00:30:12,640 --> 00:30:14,720
The LLM never sees restricted content

757
00:30:14,720 --> 00:30:18,560
and it only generates answers from documents the user is authorized to access.
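
As a rough sketch of the chunk-level trimming just described, the snippet below attaches an access list to each chunk and filters candidates against the caller's groups before anything reaches re-ranking or the LLM. The `Chunk` fields, the group names, and the "any shared group grants read access" rule are illustrative assumptions; a real deployment would resolve groups from its identity provider and apply its own policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # hypothetical ACL captured at ingestion time
    label: str = "internal"    # hypothetical sensitivity label

def filter_by_permission(candidates, user_groups):
    """Keep only chunks whose ACL shares at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [c for c in candidates if c.allowed_groups & user_groups]

candidates = [
    Chunk("c1", "Financial summary ...", frozenset({"finance", "legal"})),
    Chunk("c2", "Raw transaction details ...", frozenset({"legal"}), "confidential"),
    Chunk("c3", "Public filing excerpt ...", frozenset({"finance", "legal", "staff"})),
]

# A finance analyst sees the summary and the filing, not the raw details.
visible = filter_by_permission(candidates, {"finance"})
visible_ids = [c.chunk_id for c in visible]
```

Because the ACL is stored on the chunk itself, the same filter can run on the BM25 results and the vector results before the two lists are fused.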

758
00:30:18,560 --> 00:30:20,800
This filtering adds more latency to the process.

759
00:30:20,800 --> 00:30:23,040
You are not just retrieving candidates.

760
00:30:23,040 --> 00:30:25,840
You are checking permissions for every single one of them.

761
00:30:25,840 --> 00:30:30,000
A re-ranking operation over 50 documents means 50 individual permission checks.

762
00:30:30,000 --> 00:30:33,680
Modern systems batch these checks but permission evaluation is still overhead.

763
00:30:33,680 --> 00:30:37,120
Typical query latency increases by 30 to 50 milliseconds.

764
00:30:37,120 --> 00:30:40,560
For systems that are sensitive to compliance, that is an acceptable cost.

765
00:30:40,560 --> 00:30:45,200
For real-time systems, it is a trade-off you have to accept because leaking access is a much worse outcome.

766
00:30:45,200 --> 00:30:47,760
Sensitivity labels are how you express policy.

767
00:30:47,760 --> 00:30:51,200
Copilot does not inherently understand that a document marked as legally privileged

768
00:30:51,200 --> 00:30:53,840
should have tighter controls than something marked as internal.

769
00:30:53,840 --> 00:30:57,200
You encode that understanding into sensitivity label policies.

770
00:30:57,200 --> 00:30:58,720
Each label has its own rules.

771
00:30:58,720 --> 00:31:01,920
Confidential documents can only be accessed by the legal department

772
00:31:01,920 --> 00:31:06,160
while internal documents are accessible to authenticated users but not guests.

773
00:31:06,160 --> 00:31:09,920
During ingestion, documents get labeled and those labels become metadata.

774
00:31:09,920 --> 00:31:13,040
At query time, the filtering logic respects these policies.

775
00:31:13,040 --> 00:31:16,960
A user requests a search, the system checks their identity against the label policies

776
00:31:16,960 --> 00:31:18,240
and the results are trimmed.

777
00:31:18,240 --> 00:31:20,960
Compliance boundaries add another layer of complexity.

778
00:31:21,440 --> 00:31:24,400
In regulated industries data residency matters quite a bit.

779
00:31:24,400 --> 00:31:28,880
Personal information about residents in the EU cannot be processed in US regions

780
00:31:28,880 --> 00:31:31,920
and health information has its own specific access rules.

781
00:31:31,920 --> 00:31:34,480
These boundaries are not enforced by Copilot alone.

782
00:31:34,480 --> 00:31:36,880
Your retrieval system must enforce them.

783
00:31:36,880 --> 00:31:39,040
If a user in the US queries the system,

784
00:31:39,040 --> 00:31:41,680
the retrieval must respect US-only data boundaries.

785
00:31:41,680 --> 00:31:43,840
If a compliance officer runs a discovery search,

786
00:31:43,840 --> 00:31:46,960
the retrieval must respect audit and retention rules.

787
00:31:46,960 --> 00:31:48,880
Audit trails log every single action.

788
00:31:48,880 --> 00:31:51,280
They track every retrieval, every result shown,

789
00:31:51,280 --> 00:31:53,600
and every document accessed through Copilot.

790
00:31:53,600 --> 00:31:55,520
This is not for performance monitoring.

791
00:31:55,520 --> 00:31:56,640
It is for accountability.

792
00:31:56,640 --> 00:32:00,080
If a breach occurs and someone accesses data they should not have seen,

793
00:32:00,080 --> 00:32:01,920
the audits show exactly how it happened.

794
00:32:01,920 --> 00:32:04,720
You can see which user it was, what they asked,

795
00:32:04,720 --> 00:32:06,720
and what documents were returned to them.

796
00:32:06,720 --> 00:32:09,680
This logging happens at retrieval time, not generation time.

797
00:32:09,680 --> 00:32:12,800
The moment a chunk becomes a candidate, the system logs it.

798
00:32:12,800 --> 00:32:15,040
The implementation requires real infrastructure.

799
00:32:15,040 --> 00:32:16,880
You need identity and access management,

800
00:32:16,880 --> 00:32:20,240
permission metadata stores, and filtering logic at the retrieval stage.

801
00:32:20,240 --> 00:32:22,320
You also need audit logging pipelines.

802
00:32:22,320 --> 00:32:24,640
This is not a feature you can just bolt on at the end.

803
00:32:24,640 --> 00:32:25,760
It is foundational.

804
00:32:25,760 --> 00:32:29,040
Without it, your entire retrieval system is a compliance risk.

805
00:32:29,040 --> 00:32:31,920
With it, permission-aware retrieval becomes invisible to the users.

806
00:32:31,920 --> 00:32:33,520
They see the answers they need,

807
00:32:33,520 --> 00:32:36,800
and the system quietly ensures those answers come from data

808
00:32:36,800 --> 00:32:38,720
they are authorized to access.

809
00:32:38,720 --> 00:32:40,480
Chunking performance benchmarks.

810
00:32:40,480 --> 00:32:44,480
Theory is one thing, but real world performance is where the model actually breaks or holds up.

811
00:32:44,480 --> 00:32:46,720
Here is what the research actually shows.

812
00:32:46,720 --> 00:32:50,000
The moment you move from architectural design to actual implementation,

813
00:32:50,000 --> 00:32:52,080
benchmarks become your only source of truth.

814
00:32:52,080 --> 00:32:54,240
They tell you whether your choices were right

815
00:32:54,240 --> 00:32:56,640
or if you've built a system that fails under pressure.

816
00:32:56,640 --> 00:32:58,320
The question you have to answer is simple.

817
00:32:58,320 --> 00:33:00,640
How does your chunking strategy actually perform

818
00:33:00,640 --> 00:33:02,320
on the content you're trying to retrieve?

819
00:33:02,320 --> 00:33:04,560
We've discussed recursive, structure-aware chunking

820
00:33:04,560 --> 00:33:07,040
as the go-to approach for tier one and tier two content.

821
00:33:07,040 --> 00:33:09,360
This isn't just a theoretical preference or a hunch.

822
00:33:09,360 --> 00:33:11,280
The benchmark validation is important

823
00:33:11,280 --> 00:33:14,000
because it proves that this method is grounded in measured,

824
00:33:14,000 --> 00:33:15,360
repeatable performance.

825
00:33:15,360 --> 00:33:19,040
When you look at the data, fixed 512-token recursive chunking

826
00:33:19,040 --> 00:33:22,320
achieves 69% accuracy on document-level retrieval tasks.

827
00:33:22,320 --> 00:33:23,360
That is your baseline.

828
00:33:23,360 --> 00:33:26,000
If you ask 10 questions, recursive chunking

829
00:33:26,000 --> 00:33:28,960
finds the document containing the answer about 7 times out of 10.

830
00:33:28,960 --> 00:33:31,680
The beauty of this method is that the chunking is clean.

831
00:33:31,680 --> 00:33:34,320
It respects section boundaries and preserves context,

832
00:33:34,320 --> 00:33:36,560
which means it maintains narrative continuity

833
00:33:36,560 --> 00:33:38,080
instead of shredding the document.

834
00:33:38,080 --> 00:33:40,240
When you ask about a specific legal provision,

835
00:33:40,240 --> 00:33:42,640
the retrieval process finds that section intact

836
00:33:42,640 --> 00:33:44,720
rather than giving you a fragmented mess.

837
00:33:44,720 --> 00:33:47,040
Compare that to semantic chunking on the same corpus.

838
00:33:47,040 --> 00:33:49,760
Semantic chunking scores 54% accuracy,

839
00:33:49,760 --> 00:33:52,880
which is 15 percentage points lower than the recursive method.

840
00:33:52,880 --> 00:33:54,960
But the hidden cost is even more revealing.

841
00:33:54,960 --> 00:33:58,800
Semantic chunking produces chunks averaging only 43 tokens,

842
00:33:58,800 --> 00:34:01,440
and that is far too small for most enterprise needs.

843
00:34:01,440 --> 00:34:04,080
It's over-segmented because the boundary detection algorithm

844
00:34:04,080 --> 00:34:07,360
splits the text whenever semantic similarity drops even slightly.

845
00:34:07,360 --> 00:34:09,840
A new topic starts, so it creates a new chunk.

846
00:34:09,840 --> 00:34:12,800
A new speaker joins the dialogue, so it creates another new chunk.

847
00:34:12,800 --> 00:34:16,160
The result is a pile of small pieces instead of coherent units,

848
00:34:16,160 --> 00:34:18,400
forcing the system to pull multiple fragments

849
00:34:18,400 --> 00:34:20,480
just to reconstruct basic context.

850
00:34:20,480 --> 00:34:24,160
Your LLM receives scattered evidence instead of a continuous narrative.

851
00:34:24,160 --> 00:34:26,480
The latency consequence is direct and painful.

852
00:34:26,480 --> 00:34:28,800
More chunks mean more vector database operations,

853
00:34:28,800 --> 00:34:32,480
and more retrieval candidates mean more re-ranking candidates to process.

854
00:34:32,480 --> 00:34:35,840
A query that takes 30 milliseconds with recursive chunking

855
00:34:35,840 --> 00:34:38,720
might take 120 milliseconds with semantic chunking

856
00:34:38,720 --> 00:34:41,600
because the index is larger and the candidate set is bigger.

857
00:34:41,600 --> 00:34:43,360
When you spread this across thousands of queries,

858
00:34:43,360 --> 00:34:45,200
the cumulative difference is substantial enough

859
00:34:45,200 --> 00:34:47,040
to slow down the entire organization.

860
00:34:47,040 --> 00:34:49,520
Overlap impact gives us another lever to pull.

861
00:34:49,520 --> 00:34:51,360
The benchmark tested overlap at 0,

862
00:34:51,360 --> 00:34:53,760
and then at different percentages to see what changed.

863
00:34:53,760 --> 00:34:55,600
A chunking strategy with 0 overlap,

864
00:34:55,600 --> 00:34:57,840
where consecutive chunks are completely disjoint,

865
00:34:57,840 --> 00:34:59,680
yielded only baseline precision.

866
00:34:59,680 --> 00:35:02,080
However, adding a 64-token overlap,

867
00:35:02,080 --> 00:35:04,320
which is about 10% at 500-token targets,

868
00:35:04,320 --> 00:35:06,560
increased precision by 14.5%.

869
00:35:06,560 --> 00:35:08,240
That is a massive jump for a small change.

870
00:35:08,240 --> 00:35:09,760
Overlap prevents boundary loss

871
00:35:09,760 --> 00:35:11,920
because important context near a chunk edge

872
00:35:11,920 --> 00:35:14,320
might be missed if it falls just outside the window.

873
00:35:14,320 --> 00:35:16,480
With overlap, that context appears in two chunks,

874
00:35:16,480 --> 00:35:18,000
so if the query matches the boundary,

875
00:35:18,000 --> 00:35:20,400
you retrieve both and keep the context continuous.

876
00:35:20,400 --> 00:35:23,600
The cost is minimal since it only increases the chunk count slightly,

877
00:35:23,600 --> 00:35:26,640
and the precision gain more than justifies the extra tokens.
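
To make the overlap mechanics concrete, here is a minimal sliding-window sketch using the 512-token size and 64-token overlap mentioned above. It is deliberately simpler than the recursive, structure-aware splitter discussed in the episode: it ignores section boundaries, and "tokens" can be any list (for example whitespace-split words), which is an assumption for illustration.

```python
def chunk_with_overlap(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

# Stand-in for a 1,000-token document.
tokens = list(range(1000))
chunks = chunk_with_overlap(tokens)
```

Each chunk starts 448 tokens after the previous one, so the final 64 tokens of one chunk reappear at the head of the next, and text near a boundary is retrievable from either side.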

878
00:35:26,640 --> 00:35:29,200
One of the most crucial findings is that embedding quality

879
00:35:29,200 --> 00:35:31,040
matters more than chunking strategy

880
00:35:31,040 --> 00:35:32,880
when you're using models like GPT-4.

881
00:35:32,880 --> 00:35:34,960
You might assume that finding the perfect chunking strategy

882
00:35:34,960 --> 00:35:37,360
is the bottleneck, but in reality, it isn't.

883
00:35:37,360 --> 00:35:41,280
If your embedding model is weak, even perfect chunks will produce mediocre vectors

884
00:35:41,280 --> 00:35:42,800
that the system can't use effectively.

885
00:35:42,800 --> 00:35:44,240
If your embedding model is strong,

886
00:35:44,240 --> 00:35:47,120
the differences between chunking methods start to compress.

887
00:35:47,120 --> 00:35:50,480
A strong model can extract relevant signals from imperfect chunks,

888
00:35:50,480 --> 00:35:53,280
whereas a weaker chunk boundary is only a major problem

889
00:35:53,280 --> 00:35:55,280
when your retriever is also weak.

890
00:35:55,280 --> 00:35:57,680
This inverts how we usually think about precision.

891
00:35:57,680 --> 00:35:59,920
Don't waste time over-optimizing chunk boundaries

892
00:35:59,920 --> 00:36:01,680
if you're using powerful embedding models,

893
00:36:01,680 --> 00:36:04,720
and instead, invest that energy into the embedder itself.

894
00:36:04,720 --> 00:36:08,720
Cost comparison across these strategies shows the pragmatic trade-off you have to make.

895
00:36:08,720 --> 00:36:12,960
Semantic chunking adds 1.5 to 3 times more pre-processing overhead

896
00:36:12,960 --> 00:36:14,720
compared to fixed recursive chunking.

897
00:36:14,720 --> 00:36:17,760
That multiplier compounds quickly across 3.5 million pages,

898
00:36:17,760 --> 00:36:20,000
and you aren't just paying that fee once.

899
00:36:20,000 --> 00:36:22,880
You're paying that multiple across the entire ingestion pipeline

900
00:36:22,880 --> 00:36:24,080
every time data changes.

901
00:36:24,080 --> 00:36:26,720
For environments where you ingest data continuously,

902
00:36:26,720 --> 00:36:29,520
semantic pre-processing becomes a permanent bottleneck

903
00:36:29,520 --> 00:36:30,560
that slows everything down.

904
00:36:30,560 --> 00:36:32,720
For one-time ingestion, the cost is concentrated,

905
00:36:32,720 --> 00:36:34,720
but it's still a significant hit to the budget.

906
00:36:34,720 --> 00:36:37,840
The practical recommendation from these benchmarks is straightforward.

907
00:36:37,840 --> 00:36:40,720
Start with fixed recursive chunking at 512 tokens

908
00:36:40,720 --> 00:36:42,480
with 10 to 15% overlap.

909
00:36:42,480 --> 00:36:43,840
Test that on your actual content

910
00:36:43,840 --> 00:36:45,280
and measure the retrieval precision

911
00:36:45,280 --> 00:36:47,280
and answer quality before changing anything.

912
00:36:47,280 --> 00:36:49,120
Only if you see clear failure modes,

913
00:36:49,120 --> 00:36:52,480
like consistent retrieval failures on a specific question type,

914
00:36:52,480 --> 00:36:54,480
should you layer on semantic refinement.

915
00:36:54,480 --> 00:36:56,160
Don't assume semantic is better

916
00:36:56,160 --> 00:36:57,920
just because it sounds more advanced.

917
00:36:57,920 --> 00:36:59,600
Measure the results and then decide.

918
00:36:59,600 --> 00:37:02,400
Latency and time to first token, TTFT.

919
00:37:02,400 --> 00:37:05,440
Speed matters and at a scale of 3.5 million pages,

920
00:37:05,440 --> 00:37:08,000
latency compounds until it breaks the user experience.

921
00:37:08,000 --> 00:37:09,840
You have to architect for speed from day one.

922
00:37:09,840 --> 00:37:12,480
Let's be concrete about what latency actually means

923
00:37:12,480 --> 00:37:13,840
in a system like this.

924
00:37:13,840 --> 00:37:17,760
Time to first token, or TTFT, is the gap between a user hitting Enter

925
00:37:17,760 --> 00:37:20,000
and the first word of the answer appearing on the screen.

926
00:37:20,000 --> 00:37:21,600
It is the only thing the user cares about

927
00:37:21,600 --> 00:37:23,280
when it comes to responsiveness.

928
00:37:23,280 --> 00:37:26,720
For a search interface, a sub-second TTFT feels instant,

929
00:37:26,720 --> 00:37:28,240
but two seconds feels slow,

930
00:37:28,240 --> 00:37:30,320
and five seconds feels completely broken.

931
00:37:30,320 --> 00:37:34,640
At enterprise scale, TTFT is the difference between a tool people actually use

932
00:37:34,640 --> 00:37:36,320
and a tool they learn to avoid.

933
00:37:36,320 --> 00:37:38,400
Your retrieval pipeline has many stages,

934
00:37:38,400 --> 00:37:41,040
and every single one of them consumes precious time.

935
00:37:41,040 --> 00:37:42,720
It starts with keyword search.

936
00:37:42,720 --> 00:37:45,680
BM25 against a well-indexed corpus usually runs

937
00:37:45,680 --> 00:37:46,960
in 10 to 50 milliseconds,

938
00:37:46,960 --> 00:37:49,040
depending on how complex the query is.

939
00:37:49,040 --> 00:37:51,120
A simple exact match query might hit the low end

940
00:37:51,120 --> 00:37:53,280
while a complex Boolean query might hit the high end,

941
00:37:53,280 --> 00:37:55,440
but this stage is rarely the bottleneck.

942
00:37:55,440 --> 00:37:57,920
Vector search also completes in tens of milliseconds

943
00:37:57,920 --> 00:38:00,720
for optimized indices when running in parallel.

944
00:38:00,720 --> 00:38:03,120
When you run both simultaneously through hybrid retrieval,

945
00:38:03,120 --> 00:38:04,400
neither blocks the other,

946
00:38:04,400 --> 00:38:07,520
and the first one to return simply feeds the results forward.

947
00:38:07,520 --> 00:38:10,240
Re-ranking is where the latency really starts to accumulate.

948
00:38:10,240 --> 00:38:13,280
The cross-encoder model is a transformer that has to read the query

949
00:38:13,280 --> 00:38:15,280
and the candidate together to evaluate them.

950
00:38:15,280 --> 00:38:17,680
Re-ranking 50 documents at typical lengths

951
00:38:17,680 --> 00:38:20,320
takes roughly 100 to 150 milliseconds.

952
00:38:20,320 --> 00:38:21,840
That is a meaningful delay,

953
00:38:21,840 --> 00:38:24,960
because it's three to five times slower than the initial retrieval.

954
00:38:24,960 --> 00:38:26,880
You're trading speed for precision here,

955
00:38:26,880 --> 00:38:29,280
and the decision to re-rank is purely architectural.

956
00:38:29,280 --> 00:38:32,000
Some systems skip it for applications where speed is everything,

957
00:38:32,000 --> 00:38:34,400
but others always include it because the precision gain

958
00:38:34,400 --> 00:38:36,080
prevents the LLM from hallucinating.

959
00:38:36,080 --> 00:38:38,240
For enterprise knowledge assistants,

960
00:38:38,240 --> 00:38:41,120
the 150 millisecond cost is usually acceptable

961
00:38:41,120 --> 00:38:43,760
because sending noisy candidates to the LLM

962
00:38:43,760 --> 00:38:45,920
creates much worse problems later on.

963
00:38:45,920 --> 00:38:47,760
Permission filtering adds even more latency

964
00:38:47,760 --> 00:38:49,200
to the re-ranking stage.

965
00:38:49,200 --> 00:38:51,040
Each candidate requires an access check

966
00:38:51,040 --> 00:38:53,120
to make sure the user is allowed to see it.

967
00:38:53,120 --> 00:38:55,280
You can mitigate this with batch operations

968
00:38:55,280 --> 00:38:57,760
by collecting 50 candidates and checking them all

969
00:38:57,760 --> 00:38:59,440
in a single authorization call.

970
00:38:59,440 --> 00:39:00,560
It's still overhead though,

971
00:39:00,560 --> 00:39:03,520
and 30 to 50 milliseconds for permission evaluation

972
00:39:03,520 --> 00:39:05,200
is normal for a full candidate set.

973
00:39:05,200 --> 00:39:06,560
Then you have context shaping.

974
00:39:06,560 --> 00:39:08,800
Once the top candidates come back from retrieval,

975
00:39:08,800 --> 00:39:10,560
those raw chunks need to be assembled

976
00:39:10,560 --> 00:39:12,800
into a coherent context for the LLM.

977
00:39:12,800 --> 00:39:14,240
You might deduplicate the results

978
00:39:14,240 --> 00:39:15,760
if two chunks say the same thing,

979
00:39:15,760 --> 00:39:17,360
or you might expand the context

980
00:39:17,360 --> 00:39:20,320
to include surrounding sentences for better continuity.

981
00:39:20,320 --> 00:39:22,160
You might even order the results by date

982
00:39:22,160 --> 00:39:24,000
or relevance to make them more readable.

983
00:39:24,000 --> 00:39:25,680
This assembly process is fast

984
00:39:25,680 --> 00:39:27,680
and usually stays under 50 milliseconds,

985
00:39:27,680 --> 00:39:29,680
but it's still a cost you have to account for.

986
00:39:29,680 --> 00:39:33,040
The LLM call is the dominant stage of the entire process.

987
00:39:33,040 --> 00:39:34,640
Generating an answer from the context

988
00:39:34,640 --> 00:39:36,960
takes anywhere from 200 to 800 milliseconds

989
00:39:36,960 --> 00:39:39,040
depending on the model size and how long the answer is.

990
00:39:39,040 --> 00:39:43,040
A small model generating 100 tokens might take 200 milliseconds,

991
00:39:43,040 --> 00:39:45,040
but a large model generating 500 tokens

992
00:39:45,040 --> 00:39:46,400
will likely take 800.

993
00:39:46,400 --> 00:39:49,040
This is where most of your TTFT budget is spent.

994
00:39:49,040 --> 00:39:51,200
If you reduce the context you send to the LLM,

995
00:39:51,200 --> 00:39:52,960
you reduce the generation time,

996
00:39:52,960 --> 00:39:55,600
which is another reason why re-ranking is so important.

997
00:39:55,600 --> 00:39:57,600
Fewer, higher-quality candidates

998
00:39:57,600 --> 00:39:59,840
mean shorter context, and shorter context

999
00:39:59,840 --> 00:40:01,200
means faster generation.

1000
00:40:01,200 --> 00:40:04,480
Your total TTFT budget is typically between 500

1001
00:40:04,480 --> 00:40:07,440
and 1000 milliseconds for enterprise assistants.

1002
00:40:07,440 --> 00:40:09,840
Anything under a second feels responsive to a human,

1003
00:40:09,840 --> 00:40:11,600
but once you go beyond two seconds,

1004
00:40:11,600 --> 00:40:14,160
users start getting impatient and lose focus.
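
The stage figures quoted in this section can be added up to see how they compare with that budget. The sketch below is back-of-the-envelope arithmetic, not a measurement: the vector-search range of 10 to 50 ms and the 0 ms lower bound for context assembly are assumptions, because the episode only says "tens of milliseconds" and "under 50 milliseconds".

```python
# (low, high) latency per stage in milliseconds, taken from the figures above.
STAGES_MS = {
    "bm25": (10, 50),
    "vector": (10, 50),       # assumed range for "tens of milliseconds"
    "rerank": (100, 150),
    "permissions": (30, 50),
    "context": (0, 50),       # "under 50 ms"; lower bound assumed
    "llm": (200, 800),
}

def ttft_range(stages):
    """Estimate best and worst case with BM25 and vector search in parallel."""
    serial = ["rerank", "permissions", "context", "llm"]
    totals = []
    for i in (0, 1):
        hybrid = max(stages["bm25"][i], stages["vector"][i])  # parallel branch
        totals.append(hybrid + sum(stages[name][i] for name in serial))
    return tuple(totals)

low, high = ttft_range(STAGES_MS)
```

Under these assumptions the estimate runs from 340 ms to 1,100 ms, so the worst case already overshoots a 1,000 ms budget before any network overhead is counted.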

1005
00:40:14,160 --> 00:40:16,720
Caching is the best way to eliminate most of this latency

1006
00:40:16,720 --> 00:40:17,840
for repeat queries.

1007
00:40:17,840 --> 00:40:19,600
If a user asks the same question twice,

1008
00:40:19,600 --> 00:40:21,200
you shouldn't be recomputing embeddings

1009
00:40:21,200 --> 00:40:22,400
or re-running retrieval.

1010
00:40:22,400 --> 00:40:24,160
You should cache the answer from the first query

1011
00:40:24,160 --> 00:40:25,600
and return it instantly.

1012
00:40:25,600 --> 00:40:27,520
For knowledge bases with common questions,

1013
00:40:27,520 --> 00:40:30,240
caching absorbs a huge fraction of your traffic.

1014
00:40:30,240 --> 00:40:32,480
A financial assistant might answer a question

1015
00:40:32,480 --> 00:40:34,960
about a specific policy 50 times a month,

1016
00:40:34,960 --> 00:40:37,440
so you should cache that answer after the first request.

1017
00:40:37,440 --> 00:40:39,360
Every subsequent request will get that response

1018
00:40:39,360 --> 00:40:41,040
in milliseconds instead of seconds.
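
The caching idea above can be sketched roughly as follows. This is a minimal in-memory illustration, not how Copilot itself caches anything: the names `AnswerCache`, `normalize`, `answer_with_cache`, the `scope` argument, and the one-hour expiry are all assumptions made for the example. The `scope` key is there on the assumption that a cached answer should only be reused by users with the same access rights.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different phrasings share a key."""
    return " ".join(query.lower().split())


class AnswerCache:
    """In-memory answer cache with a time-to-live (illustrative only)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Key: (permission scope, normalized query). Value: (stored_at, answer).
        self._store: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def get(self, scope: str, query: str) -> Optional[str]:
        key = (scope, normalize(query))
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: force a fresh retrieval
            return None
        return answer

    def put(self, scope: str, query: str, answer: str) -> None:
        self._store[(scope, normalize(query))] = (self._clock(), answer)


def answer_with_cache(cache: AnswerCache, scope: str, query: str,
                      run_pipeline: Callable[[str], str]) -> str:
    """Return a cached answer if present; otherwise run the full pipeline once."""
    cached = cache.get(scope, query)
    if cached is not None:
        return cached
    answer = run_pipeline(query)  # embeddings, retrieval, re-ranking, generation
    cache.put(scope, query, answer)
    return answer
```

The time-to-live matters because a cached answer can go stale when the underlying documents change, so the expiry should be no longer than you are willing to serve an outdated answer.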

1019
00:40:41,040 --> 00:40:43,040
Parallel processing is how you hide latency

1020
00:40:43,040 --> 00:40:44,640
under concurrent execution.

1021
00:40:44,640 --> 00:40:46,160
While the initial retrieval is running,

1022
00:40:46,160 --> 00:40:48,800
you can begin the embedding processing for the next step.

1023
00:40:48,800 --> 00:40:50,080
While re-ranking is happening,

1024
00:40:50,080 --> 00:40:52,080
you can start constructing the LLM prompt.

1025
00:40:52,080 --> 00:40:53,840
Modern systems don't wait for one stage

1026
00:40:53,840 --> 00:40:55,360
to finish before starting the next,

1027
00:40:55,360 --> 00:40:57,920
and instead, they pipeline the operations.

1028
00:40:57,920 --> 00:40:59,680
The elapsed time is the critical path

1029
00:40:59,680 --> 00:41:02,560
through the pipeline rather than the sum of every single stage.

1030
00:41:02,560 --> 00:41:04,560
A system where retrieval, permission checking

1031
00:41:04,560 --> 00:41:07,520
and re-ranking run in a sequence might take 300 milliseconds,

1032
00:41:07,520 --> 00:41:09,760
but that same system with parallel execution

1033
00:41:09,760 --> 00:41:11,840
might only take 150 milliseconds

1034
00:41:11,840 --> 00:41:13,520
because the operations overlap.
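
To make the critical-path point concrete, here is a small sketch that compares the sum of all stages with the longest dependency chain. The stage durations and the dependency map are hypothetical numbers chosen to reproduce the 300 versus 150 millisecond example; in particular, it assumes permission checking can run independently of retrieval and re-ranking, which will not be true in every architecture.

```python
from functools import lru_cache
from typing import Dict, List

# Hypothetical stage durations in milliseconds (not measurements).
DURATIONS: Dict[str, int] = {
    "retrieval": 100,
    "permission_check": 150,
    "rerank": 50,
}

# Each stage lists the stages that must finish before it can start.
# Assumption: permission checking does not depend on the other two stages.
DEPENDS_ON: Dict[str, List[str]] = {
    "retrieval": [],
    "permission_check": [],
    "rerank": ["retrieval"],
}


def sequential_ms() -> int:
    """Elapsed time when every stage waits for the previous one."""
    return sum(DURATIONS.values())


@lru_cache(maxsize=None)
def finish_ms(stage: str) -> int:
    """Earliest finish time of a stage when independent stages overlap."""
    start = max((finish_ms(dep) for dep in DEPENDS_ON[stage]), default=0)
    return start + DURATIONS[stage]


def critical_path_ms() -> int:
    """Elapsed time of the pipeline: the latest finish time of any stage."""
    return max(finish_ms(stage) for stage in DURATIONS)


print(sequential_ms())     # 300
print(critical_path_ms())  # 150
```

The saving comes entirely from the dependency map: if re-ranking also had to wait for the permission check, the critical path would grow to 200 milliseconds.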

1035
00:41:13,520 --> 00:41:15,520
The architectural lesson here is very straightforward.

1036
00:41:15,520 --> 00:41:17,360
TTFT is a design constraint

1037
00:41:17,360 --> 00:41:19,360
that drives every other decision you make.

1038
00:41:19,360 --> 00:41:20,800
It drives chunking decisions

1039
00:41:20,800 --> 00:41:22,720
because smaller chunks retrieve faster,

1040
00:41:22,720 --> 00:41:24,480
and it drives re-ranking choices

1041
00:41:24,480 --> 00:41:26,880
because lighter models trade precision for speed.

1042
00:41:26,880 --> 00:41:29,040
It even drives your caching strategy

1043
00:41:29,040 --> 00:41:31,920
by forcing you to decide which queries are worth pre-computing.

1044
00:41:31,920 --> 00:41:34,080
When you're dealing with 3.5 million pages,

1045
00:41:34,080 --> 00:41:36,000
you're making these choices constantly.

1046
00:41:36,000 --> 00:41:37,920
If you get them right, users get a responsive system,

1047
00:41:37,920 --> 00:41:39,120
but if you get them wrong,

1048
00:41:39,120 --> 00:41:42,480
even the most perfect answers will feel too slow to be useful.

1049
00:41:42,480 --> 00:41:44,640
Metadata enrichment and its impact.

1050
00:41:44,640 --> 00:41:46,080
Research into retrieval systems

1051
00:41:46,080 --> 00:41:48,320
shows us something that feels completely wrong at first,

1052
00:41:48,320 --> 00:41:50,960
and that is the fact that metadata often matters more

1053
00:41:50,960 --> 00:41:52,960
than your actual chunking strategy.

1054
00:41:52,960 --> 00:41:55,360
You might have spent a fortune on your retrieval architecture

1055
00:41:55,360 --> 00:41:57,600
by setting up recursive chunking, hybrid search,

1056
00:41:57,600 --> 00:41:59,120
and complex entity graphs.

1057
00:41:59,120 --> 00:42:00,480
But the data tells a different story

1058
00:42:00,480 --> 00:42:02,560
because when you add rich metadata to your chunks,

1059
00:42:02,560 --> 00:42:04,640
your accuracy goes up much faster than it does

1060
00:42:04,640 --> 00:42:07,120
when you just try to refine your chunking boundaries.

1061
00:42:07,120 --> 00:42:10,080
A chunk without metadata is just a fragment floating in space

1062
00:42:10,080 --> 00:42:11,520
whereas a chunk with proper metadata

1063
00:42:11,520 --> 00:42:14,080
becomes a searchable object with context,

1064
00:42:14,080 --> 00:42:17,600
a location, and specific hooks for the system to grab onto.

1065
00:42:17,600 --> 00:42:19,360
We can actually quantify this accuracy

1066
00:42:19,360 --> 00:42:20,480
lift in a real way.

1067
00:42:20,480 --> 00:42:22,960
If you don't use metadata enrichment,

1068
00:42:22,960 --> 00:42:25,040
your question-answering system is probably going to hit

1069
00:42:25,040 --> 00:42:28,560
about 50 or 60% accuracy on difficult tasks.

1070
00:42:28,560 --> 00:42:31,120
Users ask a question, the system finds some candidates,

1071
00:42:31,120 --> 00:42:33,040
and the LLM tries to build an answer.

1072
00:42:33,040 --> 00:42:36,640
But the problem is that the LLM has no anchors to hold onto.

1073
00:42:36,640 --> 00:42:38,880
It has no way of knowing if a document is brand new

1074
00:42:38,880 --> 00:42:40,000
or 10 years old,

1075
00:42:40,000 --> 00:42:42,240
and it can't tell the difference between a casual mention

1076
00:42:42,240 --> 00:42:43,440
and a primary source.

1077
00:42:43,440 --> 00:42:45,040
Once you add metadata,

1078
00:42:45,040 --> 00:42:47,920
that same system jumps to 75% accuracy,

1079
00:42:47,920 --> 00:42:51,120
and that gap of 15 to 25 points is the difference between a tool people hate

1080
00:42:51,120 --> 00:42:52,560
and one they actually trust.

1081
00:42:52,560 --> 00:42:54,240
The way this works is actually very simple

1082
00:42:54,240 --> 00:42:57,760
because the metadata just travels right alongside the content.

1083
00:42:57,760 --> 00:42:59,520
When you pull a document into the system,

1084
00:42:59,520 --> 00:43:02,320
you extract structured details like the document type,

1085
00:43:02,320 --> 00:43:03,600
the creation date,

1086
00:43:03,600 --> 00:43:06,000
and the specific jurisdiction it belongs to.

1087
00:43:06,000 --> 00:43:08,080
If you were looking at something like the Epstein files,

1088
00:43:08,080 --> 00:43:10,080
your metadata would include the name of the person

1089
00:43:10,080 --> 00:43:11,760
testifying, the case references,

1090
00:43:11,760 --> 00:43:13,360
and specific topic tags.

1091
00:43:13,360 --> 00:43:15,360
Every single chunk carries these details with it,

1092
00:43:15,360 --> 00:43:16,880
which means they become part of the index

1093
00:43:16,880 --> 00:43:19,360
and stay available every time a user runs a search.

1094
00:43:19,360 --> 00:43:22,080
When a query happens, metadata serves as both a filter

1095
00:43:22,080 --> 00:43:23,680
and a way to rank the results.

1096
00:43:23,680 --> 00:43:26,560
If a user asks about something that happened in 2005,

1097
00:43:26,560 --> 00:43:29,280
the metadata filter immediately shrinks the search space,

1098
00:43:29,280 --> 00:43:32,560
so the system only looks at chunks tagged with that specific year.

1099
00:43:32,560 --> 00:43:34,160
This makes the search much faster

1100
00:43:34,160 --> 00:43:36,160
because you aren't running expensive operations

1101
00:43:36,160 --> 00:43:37,360
on files that don't matter,

1102
00:43:37,360 --> 00:43:40,480
but metadata does more than just filter. It also helps with ranking,

1103
00:43:40,480 --> 00:43:41,680
because it can tell the difference

1104
00:43:41,680 --> 00:43:43,680
between a primary source from 2005

1105
00:43:43,680 --> 00:43:45,840
and a summary written in 2015

1106
00:43:45,840 --> 00:43:48,000
that just happens to mention that year.
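
A rough sketch of metadata acting as both a filter and a ranking signal, using the 2005 example. The `Chunk` fields, the `search_by_year` function, and the fixed `primary_boost` value are assumptions for illustration; a real index would apply the filter inside the search engine rather than in application code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    text: str
    tagged_years: FrozenSet[int]  # years the chunk is tagged as being about
    created_year: int             # year the source document was written
    similarity: float             # relevance score from the underlying search


def search_by_year(chunks: List[Chunk], year: int,
                   primary_boost: float = 0.2) -> List[Chunk]:
    # Filter: shrink the search space to chunks tagged with the requested year.
    candidates = [c for c in chunks if year in c.tagged_years]

    # Rank: a document written in that year is treated as closer to a primary
    # source than a later summary that only mentions the year.
    def score(c: Chunk) -> float:
        return c.similarity + (primary_boost if c.created_year == year else 0.0)

    return sorted(candidates, key=score, reverse=True)
```

With this scoring, a 2005 deposition with a similarity of 0.70 outranks a 2015 summary with a similarity of 0.80, because the boost lifts the contemporaneous document to 0.90.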

1107
00:43:48,000 --> 00:43:51,360
The quality of your metadata depends entirely on how you extract it.

1108
00:43:51,360 --> 00:43:52,960
Some of these fields are easy to get

1109
00:43:52,960 --> 00:43:55,040
because things like creation dates and author names

1110
00:43:55,040 --> 00:43:57,360
are usually embedded right in the file properties.

1111
00:43:57,360 --> 00:44:00,160
You can use simple rules to pull that data with high confidence,

1112
00:44:00,160 --> 00:44:02,880
but other fields require a bit more intelligence to get right.

1113
00:44:02,880 --> 00:44:04,960
You might need a machine learning classifier

1114
00:44:04,960 --> 00:44:07,680
to decide if a document is actually about a specific topic

1115
00:44:07,680 --> 00:44:09,440
or if it just mentions it once.

1116
00:44:09,440 --> 00:44:12,240
The system then flags which pieces of data are rule-based

1117
00:44:12,240 --> 00:44:13,280
and which ones are predicted,

1118
00:44:13,280 --> 00:44:16,320
so the user knows exactly how much they should trust the result.

1119
00:44:16,320 --> 00:44:18,240
Keeping the system running is a real challenge

1120
00:44:18,240 --> 00:44:19,840
because metadata is never static.

1121
00:44:19,840 --> 00:44:22,640
Documents get updated, sensitivity levels change,

1122
00:44:22,640 --> 00:44:24,080
and old policies expire,

1123
00:44:24,080 --> 00:44:26,080
which means a file marked as internal today

1124
00:44:26,080 --> 00:44:27,920
might be public by tomorrow morning.

1125
00:44:27,920 --> 00:44:29,840
Your enrichment pipeline cannot just stop

1126
00:44:29,840 --> 00:44:31,200
after the initial ingestion,

1127
00:44:31,200 --> 00:44:32,560
so you need a scheduled process

1128
00:44:32,560 --> 00:44:34,720
to review your documents every month or quarter

1129
00:44:34,720 --> 00:44:36,160
to refresh those tags.

1130
00:44:36,160 --> 00:44:37,920
If a new version of a file comes out,

1131
00:44:37,920 --> 00:44:40,960
the old metadata needs to be marked as deprecated immediately.

1132
00:44:40,960 --> 00:44:43,520
This ongoing maintenance might seem expensive,

1133
00:44:43,520 --> 00:44:45,600
but it is much better than the alternative,

1134
00:44:45,600 --> 00:44:48,400
which is having your system return outdated information

1135
00:44:48,400 --> 00:44:50,880
because the metadata drifted over time.

1136
00:44:50,880 --> 00:44:53,680
Setting this up does add some overhead to your infrastructure.

1137
00:44:53,680 --> 00:44:55,440
You have to build the extraction logic,

1138
00:44:55,440 --> 00:44:57,760
find a place to store the fields in your database,

1139
00:44:57,760 --> 00:45:00,080
and write the ranking logic for the retrieval step,

1140
00:45:00,080 --> 00:45:03,520
but this investment pays off on every single query the system handles.

1141
00:45:03,520 --> 00:45:06,720
Unlike chunking tweaks that only help specific types of questions,

1142
00:45:06,720 --> 00:45:09,440
metadata enrichment makes the entire system smarter

1143
00:45:09,440 --> 00:45:11,280
and more reliable for every user.

1144
00:45:11,280 --> 00:45:14,080
Handling multimodal queries.

1145
00:45:14,080 --> 00:45:16,560
Users are going to ask questions that point to both documents

1146
00:45:16,560 --> 00:45:18,000
and videos at the same time.

1147
00:45:18,000 --> 00:45:19,840
Your system has to be able to handle that

1148
00:45:19,840 --> 00:45:21,760
without splitting the search into separate buckets

1149
00:45:21,760 --> 00:45:23,120
that don't talk to each other.

1150
00:45:23,120 --> 00:45:24,880
The real difficulty here is detecting

1151
00:45:24,880 --> 00:45:26,560
what the user actually wants.

1152
00:45:26,560 --> 00:45:27,760
When someone asks a question,

1153
00:45:27,760 --> 00:45:29,680
the system has to figure out if they need a video,

1154
00:45:29,680 --> 00:45:31,200
a document, or both.

1155
00:45:31,200 --> 00:45:33,120
A phrase like "Show me what happened in the meeting"

1156
00:45:33,120 --> 00:45:35,680
is tricky because it could mean the user wants the transcript,

1157
00:45:35,680 --> 00:45:39,200
the handwritten notes, or the actual video clip of the event.

1158
00:45:39,200 --> 00:45:40,480
The system cannot guess,

1159
00:45:40,480 --> 00:45:42,160
so it has to use query understanding

1160
00:45:42,160 --> 00:45:44,800
to figure out the multimodal intent behind the words.

1161
00:45:44,800 --> 00:45:47,760
This intent detection happens very early in the process.

1162
00:45:47,760 --> 00:45:50,160
A lightweight classifier looks at the question

1163
00:45:50,160 --> 00:45:52,000
and tags it based on what it finds,

1164
00:45:52,000 --> 00:45:54,000
such as looking for temporal words like

1165
00:45:54,000 --> 00:45:56,080
"During the meeting" to trigger a transcript search.

1166
00:45:56,080 --> 00:45:58,640
If the user says "How does it look?"

1167
00:45:58,640 --> 00:46:01,520
the system flags the query for image or video retrieval

1168
00:46:01,520 --> 00:46:03,280
instead. These tags aren't exclusive,

1169
00:46:03,280 --> 00:46:05,840
so a single query can have multiple signals

1170
00:46:05,840 --> 00:46:07,280
that tell the system to route the search

1171
00:46:07,280 --> 00:46:09,120
to several different places at once.
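
The lightweight intent tagger could look something like this. It is a deliberately simple keyword version: the cue phrases in `INTENT_CUES`, the tag names, and the fallback to `"document"` are all assumptions for the sketch, and a production system would more likely use a small trained classifier.

```python
import re
from typing import Dict, List, Set

# Hypothetical cue patterns per modality (illustrative, not exhaustive).
INTENT_CUES: Dict[str, List[str]] = {
    "transcript": [r"\bduring the\b", r"\bsaid\b", r"\btestif"],
    "visual": [r"\blook(s|ed)?\b", r"\bshow me\b", r"\bphoto", r"\bvideo"],
    "document": [r"\bcontract", r"\bemail", r"\bfiling"],
}


def tag_intents(query: str) -> Set[str]:
    """Return every modality tag whose cues appear in the query.

    Tags are not exclusive, so one query can be routed to several retrievers.
    """
    q = query.lower()
    tags = {
        intent
        for intent, patterns in INTENT_CUES.items()
        if any(re.search(pattern, q) for pattern in patterns)
    }
    # Assumption: with no cues at all, fall back to plain document search.
    return tags or {"document"}


print(sorted(tag_intents("How does it look?")))
# ['visual']
print(sorted(tag_intents("Show me the contract discussed during the meeting")))
# ['document', 'transcript', 'visual']
```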

1172
00:46:09,120 --> 00:46:11,280
The embedding challenge goes even deeper than that.

1173
00:46:11,280 --> 00:46:13,040
Most models only work with text,

1174
00:46:13,040 --> 00:46:14,960
but your library now includes video transcripts

1175
00:46:14,960 --> 00:46:16,000
and image descriptions

1176
00:46:16,000 --> 00:46:18,080
that all have different semantic properties.

1177
00:46:18,080 --> 00:46:20,720
An image description might say "Two people in a room"

1178
00:46:20,720 --> 00:46:23,680
while a transcript says "Person A and Person B met"

1179
00:46:23,680 --> 00:46:26,080
and a document might just call it a formal meeting.

1180
00:46:26,080 --> 00:46:28,640
A standard text model would see these as three different things

1181
00:46:28,640 --> 00:46:31,840
even though they are all describing the exact same moment in time.

1182
00:46:31,840 --> 00:46:34,720
Multimodal embeddings fix this by putting different types of media

1183
00:46:34,720 --> 00:46:36,080
into one shared space.

1184
00:46:36,080 --> 00:46:38,160
These models take text, images, and video data

1185
00:46:38,160 --> 00:46:41,200
and turn them into vectors that you can actually compare against each other.

1186
00:46:41,200 --> 00:46:43,760
A multimodal model understands that a picture of a meeting

1187
00:46:43,760 --> 00:46:45,440
and a document about that meeting belong

1188
00:46:45,440 --> 00:46:47,120
in the same semantic neighborhood.

1189
00:46:47,120 --> 00:46:49,760
When you embed a user's query, it lands in that same neighborhood,

1190
00:46:49,760 --> 00:46:52,480
allowing the system to pull candidates from every modality

1191
00:46:52,480 --> 00:46:54,640
and rank them together in one list.

1192
00:46:54,640 --> 00:46:57,440
Cross-modal retrieval is what allows the system to find a video

1193
00:46:57,440 --> 00:46:59,760
that visualizes a concept found in a document.

1194
00:46:59,760 --> 00:47:03,040
If a user asks about a specific person's involvement in a business deal,

1195
00:47:03,040 --> 00:47:06,080
the system pulls the contracts, the video of their testimony,

1196
00:47:06,080 --> 00:47:08,000
and even photos of the checks involved.

1197
00:47:08,000 --> 00:47:09,920
The user gets a complete picture of the event

1198
00:47:09,920 --> 00:47:12,480
because they have the textual evidence, the first-hand video,

1199
00:47:12,480 --> 00:47:14,320
and the visual proof all in one place.

1200
00:47:14,320 --> 00:47:16,320
To keep things precise, you have to make sure

1201
00:47:16,320 --> 00:47:18,480
the transcript alignment is perfect.

1202
00:47:18,480 --> 00:47:20,320
When the system finds a video transcript,

1203
00:47:20,320 --> 00:47:22,800
it shouldn't just hand over a giant block of text

1204
00:47:22,800 --> 00:47:25,520
but should instead preserve the timestamps and speaker names.

1205
00:47:25,520 --> 00:47:28,240
When a user sees a result, they should see exactly

1206
00:47:28,240 --> 00:47:31,200
when the person started talking, along with a direct link

1207
00:47:31,200 --> 00:47:32,640
to that moment in the video.
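
Here is one way the timestamp and speaker could be kept attached to a transcript hit. The `TranscriptSegment` fields and the example URL are hypothetical, and the `#t=<seconds>` fragment assumes the video player honors media-fragment style links.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    video_url: str        # hypothetical location of the source video
    speaker: str
    start_seconds: float  # when this speaker started talking
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS for display next to the result."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def deep_link(segment: TranscriptSegment) -> str:
    """Link to the moment the speaker starts, using a #t=<seconds> fragment."""
    return f"{segment.video_url}#t={int(segment.start_seconds)}"


def render_result(segment: TranscriptSegment) -> str:
    """One search result line: timestamp, speaker, text, and a jump link."""
    stamp = format_timestamp(segment.start_seconds)
    return f"[{stamp}] {segment.speaker}: {segment.text} ({deep_link(segment)})"


segment = TranscriptSegment(
    video_url="https://example.com/videos/deposition.mp4",
    speaker="Witness",
    start_seconds=2386.4,
    text="I signed the check that afternoon.",
)
print(render_result(segment))
# [00:39:46] Witness: I signed the check that afternoon. (https://example.com/videos/deposition.mp4#t=2386)
```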

1208
00:47:32,640 --> 00:47:34,720
This bridges the gap between searching through text

1209
00:47:34,720 --> 00:47:37,680
and watching the media, giving the user the speed of a search engine

1210
00:47:37,680 --> 00:47:38,960
with the context of a film.

1211
00:47:38,960 --> 00:47:42,960
Visual grounding is the process of returning different types of evidence together.

1212
00:47:42,960 --> 00:47:45,280
If a user is looking for details on a transaction,

1213
00:47:45,280 --> 00:47:47,360
the system should show them the financial spreadsheet

1214
00:47:47,360 --> 00:47:49,760
and a photo of the signed check at the same time.

1215
00:47:49,760 --> 00:47:51,680
The document gives them the data they need

1216
00:47:51,680 --> 00:47:54,160
while the image provides the verification they want.

1217
00:47:54,160 --> 00:47:57,360
Neither of these is as strong on its own as they are when you present them together.

1218
00:47:57,360 --> 00:48:01,200
The final step is a fusion strategy that merges all these parallel parts

1219
00:48:01,200 --> 00:48:03,200
before the final ranking happens.

1220
00:48:03,200 --> 00:48:06,240
Text, video and image searches all run at the same time

1221
00:48:06,240 --> 00:48:09,040
and then the fusion logic combines the results into one set.

1222
00:48:09,040 --> 00:48:11,120
You can use reciprocal rank fusion to make sure

1223
00:48:11,120 --> 00:48:13,840
the modalities are competing fairly against each other.

1224
00:48:13,840 --> 00:48:17,120
If a result shows up as a strong signal in both the text and the video,

1225
00:48:17,120 --> 00:48:18,960
it gets boosted to the top of the list

1226
00:48:18,960 --> 00:48:21,760
while weaker signals naturally fall toward the bottom.
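
Reciprocal rank fusion itself is short enough to sketch in full. Each result earns 1 / (k + rank) from every list it appears in, and the scores are summed. The constant k = 60 is the commonly used default; the modality names and result IDs below are made up, and the sketch assumes that the same ID in two lists refers to the same underlying piece of evidence.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists into one, best result first.

    Only ranks are used, never raw scores, so modalities whose scores are on
    different scales still compete fairly.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID to keep the output stable.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


fused = reciprocal_rank_fusion({
    "text": ["deposition_12", "contract_7", "email_3"],
    "video": ["clip_4", "deposition_12"],
    "image": ["photo_9"],
})
print(fused)
# ['deposition_12', 'clip_4', 'photo_9', 'contract_7', 'email_3']
```

`deposition_12` wins because it collects a contribution from both the text and the video list, which is exactly the boost for results that show up in more than one modality.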

1227
00:48:21,760 --> 00:48:25,680
In the end, the system gives the user one unified list of results.

1228
00:48:25,680 --> 00:48:28,720
They don't have to click through different tabs for videos and documents

1229
00:48:28,720 --> 00:48:31,440
because everything is interspersed based on how relevant it is.

1230
00:48:31,440 --> 00:48:34,080
This allows the LLM to take all that mixed evidence

1231
00:48:34,080 --> 00:48:36,640
and turn it into a single coherent answer

1232
00:48:36,640 --> 00:48:39,520
that is grounded in every piece of data you have.

1233
00:48:39,520 --> 00:48:41,920
The orchestration layer: putting it together.

1234
00:48:41,920 --> 00:48:45,040
All these pieces, chunking, routing, retrieval and re-ranking,

1235
00:48:45,040 --> 00:48:48,320
have to work together and that is where the orchestration layer comes in.

1236
00:48:48,320 --> 00:48:51,200
Think of your architecture not as a bunch of isolated parts

1237
00:48:51,200 --> 00:48:54,960
but as an integrated system where every single piece triggers the next one.

1238
00:48:54,960 --> 00:48:58,000
At the very center of everything sits the orchestration layer

1239
00:48:58,000 --> 00:49:01,040
and its only job is to coordinate the flow of information.

1240
00:49:01,040 --> 00:49:03,360
Data flows in one direction during ingestion

1241
00:49:03,360 --> 00:49:06,880
while queries flow in another direction from retrieval to generation.

1242
00:49:06,880 --> 00:49:10,400
The orchestration layer makes sure that data lands in the right place at the right time

1243
00:49:10,400 --> 00:49:13,120
and it ensures that user queries get routed efficiently

1244
00:49:13,120 --> 00:49:15,040
through the systems you've prepared.

1245
00:49:15,040 --> 00:49:17,120
The ingestion pipeline works as a sequence,

1246
00:49:17,120 --> 00:49:19,520
starting when raw documents arrive at the front door.

1247
00:49:19,520 --> 00:49:23,040
You might have 3.5 million pages, plus videos and images

1248
00:49:23,040 --> 00:49:25,200
and they all enter the system as raw bytes.

1249
00:49:25,200 --> 00:49:27,840
The first stage is extraction, where PDFs are parsed,

1250
00:49:27,840 --> 00:49:31,200
images go through OCR and videos are transcribed into text.

1251
00:49:31,200 --> 00:49:34,720
This extraction happens in parallel so you don't have to wait for document one

1252
00:49:34,720 --> 00:49:37,280
to finish before you start processing document two.

1253
00:49:37,280 --> 00:49:41,200
Extraction runs on a cluster that can process thousands of documents at the same time

1254
00:49:41,200 --> 00:49:44,960
and as that work completes, the documents flow into the chunking stage.

1255
00:49:44,960 --> 00:49:49,120
The chunking process follows the recursive structure-aware hierarchy we talked about

1256
00:49:49,120 --> 00:49:52,480
which produces chunks that actually align with the document architecture.

1257
00:49:52,480 --> 00:49:56,480
From there, chunks flow into enrichment where named entity recognition runs

1258
00:49:56,480 --> 00:49:58,240
and temporal metadata is pulled out.

1259
00:49:58,240 --> 00:50:00,800
Sensitivity labels are applied to keep things secure

1260
00:50:00,800 --> 00:50:03,360
and then the chunks flow into the embedding stage.

1261
00:50:03,360 --> 00:50:06,720
Your embedding model processes them in batches to produce vectors

1262
00:50:06,720 --> 00:50:09,840
and then those vectors and metadata flow into storage.

1263
00:50:09,840 --> 00:50:11,840
The vector database receives the vectors,

1264
00:50:11,840 --> 00:50:13,680
the keyword index takes the text

1265
00:50:13,680 --> 00:50:16,400
and the knowledge graph receives the entity relationships.

1266
00:50:16,400 --> 00:50:19,760
Chunks land in specific tiers based on how you've classified their value

1267
00:50:19,760 --> 00:50:21,200
and while all of this is happening,

1268
00:50:21,200 --> 00:50:23,280
audit logs record every single step.

1269
00:50:23,280 --> 00:50:26,560
Document fingerprints are computed to stop duplicates from getting in,

1270
00:50:26,560 --> 00:50:30,240
progress is tracked and any failures are logged so they can be retried automatically.
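
A minimal sketch of the fingerprinting step, assuming a plain SHA-256 hash over the raw bytes. The function names are illustrative. Note that this only catches exact byte-for-byte duplicates; a rescanned copy of the same page would produce different bytes and need a fuzzier technique.

```python
import hashlib
from typing import Iterable, List, Set


def fingerprint(content: bytes) -> str:
    """Content hash used as the document's identity during ingestion."""
    return hashlib.sha256(content).hexdigest()


def drop_duplicates(documents: Iterable[bytes]) -> List[bytes]:
    """Keep the first copy of each document and skip exact repeats."""
    seen: Set[str] = set()
    unique: List[bytes] = []
    for doc in documents:
        fp = fingerprint(doc)
        if fp in seen:
            continue  # already ingested: do not chunk or embed it again
        seen.add(fp)
        unique.append(doc)
    return unique
```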

1271
00:50:30,240 --> 00:50:32,080
This entire pipeline is asynchronous,

1272
00:50:32,080 --> 00:50:35,440
which means the different stages don't block each other from moving forward.

1273
00:50:35,440 --> 00:50:37,680
While extraction is busy processing batch 10,

1274
00:50:37,680 --> 00:50:39,920
the chunking stage is already working on batch 8

1275
00:50:39,920 --> 00:50:41,920
and enrichment is handling batch 6.

1276
00:50:41,920 --> 00:50:45,440
The pipeline works like a conveyor belt where every stage feeds the next one

1277
00:50:45,440 --> 00:50:48,160
and this makes bottlenecks immediately visible to the team.
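
The conveyor-belt structure can be sketched with `asyncio` queues, where each stage reads from one queue and writes to the next. The stage functions here are placeholder lambdas standing in for calls to real extraction, chunking, and enrichment services, and the queue size of 2 is an arbitrary choice for the example.

```python
import asyncio
from typing import Callable, List, Optional


async def stage(work: Callable[[str], str],
                inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, process them, and push results to outbox."""
    while True:
        item: Optional[str] = await inbox.get()
        if item is None:            # sentinel: no more work, tell the next stage
            await outbox.put(None)
            return
        await outbox.put(work(item))


async def run_pipeline(batches: List[str]) -> List[str]:
    # Bounded queues apply back-pressure, so a slow stage shows up as a full inbox.
    q_in, q_chunk, q_enrich, q_out = (asyncio.Queue(maxsize=2) for _ in range(4))
    workers = [
        asyncio.create_task(stage(lambda b: f"extracted({b})", q_in, q_chunk)),
        asyncio.create_task(stage(lambda b: f"chunked({b})", q_chunk, q_enrich)),
        asyncio.create_task(stage(lambda b: f"enriched({b})", q_enrich, q_out)),
    ]
    results: List[str] = []

    async def feed() -> None:
        for batch in batches:
            await q_in.put(batch)
        await q_in.put(None)

    async def drain() -> None:
        while (item := await q_out.get()) is not None:
            results.append(item)

    await asyncio.gather(feed(), drain(), *workers)
    return results


print(asyncio.run(run_pipeline(["batch-1", "batch-2"])))
# ['enriched(chunked(extracted(batch-1)))', 'enriched(chunked(extracted(batch-2)))']
```

Watching the queue depths is one simple way to see the bottleneck: the queue in front of the slowest stage is the one that stays full.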

1278
00:50:48,160 --> 00:50:51,600
If embedding becomes a constraint, you scale the embedding service

1279
00:50:51,600 --> 00:50:54,800
and if storage is slow, you focus on optimizing your writes.

1280
00:50:54,800 --> 00:50:57,920
The orchestration layer monitors the throughput at every single stage

1281
00:50:57,920 --> 00:51:00,320
and shows you exactly where the system needs to be improved.

1282
00:51:00,320 --> 00:51:02,160
The query pipeline is just as sequential

1283
00:51:02,160 --> 00:51:04,320
but it also involves a lot of branching logic.

1284
00:51:04,320 --> 00:51:08,640
A user submits a question and the system immediately routes it through intent classification

1285
00:51:08,640 --> 00:51:10,800
to see what they're actually asking.

1286
00:51:10,800 --> 00:51:13,520
Intent tags trigger different retrieval strategies

1287
00:51:13,520 --> 00:51:15,680
and the orchestration layer receives these tags

1288
00:51:15,680 --> 00:51:18,400
to decide which retrieval systems it needs to call.

1289
00:51:18,400 --> 00:51:21,600
A query about dates and specific people might trigger the knowledge graph,

1290
00:51:21,600 --> 00:51:24,160
document search and metadata filtering all at once.

1291
00:51:24,160 --> 00:51:28,160
On the other hand, a simple structural query might only need to invoke BM25.

1292
00:51:28,160 --> 00:51:30,640
These decisions are deterministic based on the intent

1293
00:51:30,640 --> 00:51:33,600
so they don't require any human intervention to move forward.

1294
00:51:33,600 --> 00:51:36,400
The parallel retrieval paths fire off at the same time

1295
00:51:36,400 --> 00:51:38,160
and as results come back from each path,

1296
00:51:38,160 --> 00:51:40,160
the orchestration layer collects them.

1297
00:51:40,160 --> 00:51:43,600
Results are merged using RRF or another fusion strategy

1298
00:51:43,600 --> 00:51:46,880
and then the re-ranker looks at the entire merged candidate set.

1299
00:51:46,880 --> 00:51:49,760
As soon as re-ranking is done, permission filtering kicks in

1300
00:51:49,760 --> 00:51:51,840
and the orchestration layer trims the results

1301
00:51:51,840 --> 00:51:54,320
based on who the user is and what they're allowed to see.

1302
00:51:54,320 --> 00:51:56,800
The trimmed results feed into context shaping

1303
00:51:56,800 --> 00:51:59,280
where the system assembles the top K candidates

1304
00:51:59,280 --> 00:52:01,920
into a coherent block of text for the LLM.

1305
00:52:01,920 --> 00:52:04,800
Redundancy is stripped out, temporal ordering is applied

1306
00:52:04,800 --> 00:52:06,480
and narrative continuity is preserved

1307
00:52:06,480 --> 00:52:08,240
so the model doesn't get confused.
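
Permission trimming and context shaping might be sketched like this. The `Candidate` fields, the group-based permission model, and the order of operations (trim, deduplicate, take the top K, then sort by date) are assumptions for the example rather than a description of how Copilot does it.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set


@dataclass
class Candidate:
    text: str
    doc_date: date
    score: float              # relevance score from the re-ranker
    allowed_groups: Set[str]  # groups permitted to see the source document


def shape_context(candidates: List[Candidate], user_groups: Set[str],
                  top_k: int = 3) -> str:
    # Permission trimming: drop anything the user is not allowed to see.
    visible = [c for c in candidates if c.allowed_groups & user_groups]

    # Strip redundancy: keep the best-scoring copy of near-identical text.
    seen: Set[str] = set()
    unique: List[Candidate] = []
    for c in sorted(visible, key=lambda c: c.score, reverse=True):
        key = " ".join(c.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(c)

    # Keep the top K, then order them by date so the model reads a timeline.
    selected = sorted(unique[:top_k], key=lambda c: c.doc_date)
    return "\n\n".join(f"[{c.doc_date.isoformat()}] {c.text}" for c in selected)
```

Trimming before selecting the top K matters: if you cut to K first and filter afterwards, a user with narrow permissions could end up with an almost empty context.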

1308
00:52:08,240 --> 00:52:10,160
The shaped context flows to the LLM,

1309
00:52:10,160 --> 00:52:11,600
the model generates an answer

1310
00:52:11,600 --> 00:52:14,000
and the orchestration layer logs that answer alongside

1311
00:52:14,000 --> 00:52:16,480
the specific chunks used to ground it for auditing.

1312
00:52:16,480 --> 00:52:18,640
Feedback loops are what finally close the system

1313
00:52:18,640 --> 00:52:20,400
and make it smarter over time.

1314
00:52:20,400 --> 00:52:22,080
Users interact with the answers they get

1315
00:52:22,080 --> 00:52:24,080
providing thumbs up or thumbs down ratings

1316
00:52:24,080 --> 00:52:25,760
to let you know if the system is working.

1317
00:52:25,760 --> 00:52:28,480
They might correct a hallucination or ask a follow-up question

1318
00:52:28,480 --> 00:52:31,360
and all of those signals flow back to the orchestration layer.

1319
00:52:31,360 --> 00:52:33,280
High-confidence, correct answers are cached

1320
00:52:33,280 --> 00:52:36,160
to save money later while failed retrievals are flagged

1321
00:52:36,160 --> 00:52:38,480
so an engineer can investigate what went wrong.

1322
00:52:38,480 --> 00:52:40,880
Systematic biases like certain question types

1323
00:52:40,880 --> 00:52:42,640
always returning bad results

1324
00:52:42,640 --> 00:52:45,040
are surfaced to the engineering teams for a fix.

1325
00:52:45,040 --> 00:52:47,280
The orchestration layer aggregates all of this feedback

1326
00:52:47,280 --> 00:52:49,600
and uses it to trigger retraining for the models.

1327
00:52:49,600 --> 00:52:52,240
Thresholds are adjusted, routing heuristics get better

1328
00:52:52,240 --> 00:52:53,840
and query classifiers are retrained

1329
00:52:53,840 --> 00:52:56,320
on the actual logs of what users asked and what happened.

1330
00:52:56,320 --> 00:52:58,080
User behavior is essentially teaching the system

1331
00:52:58,080 --> 00:53:00,720
how to evolve without you having to hard code every change.

1332
00:53:00,720 --> 00:53:03,440
Monitoring is the eyes and ears of the entire operation.

1333
00:53:03,440 --> 00:53:05,440
The orchestration layer instruments every stage

1334
00:53:05,440 --> 00:53:08,320
with metrics like retrieval precision, answer accuracy,

1335
00:53:08,320 --> 00:53:11,280
latency and the cost of every single query.

1336
00:53:11,280 --> 00:53:13,200
These metrics flow into a dedicated store

1337
00:53:13,200 --> 00:53:16,000
and dashboards visualize the health of the system in real time.

1338
00:53:16,000 --> 00:53:18,720
Alerts fire the moment performance starts to degrade,

1339
00:53:18,720 --> 00:53:21,200
which allows you to react before users notice a problem.

1340
00:53:21,200 --> 00:53:22,960
If latency suddenly spikes,

1341
00:53:22,960 --> 00:53:25,920
you can drill down to see exactly which stage is causing the delay.

1342
00:53:25,920 --> 00:53:28,400
If precision drops, you can identify which categories

1343
00:53:28,400 --> 00:53:30,240
of questions are failing and why.

1344
00:53:30,240 --> 00:53:32,000
If the cost per query starts rising,

1345
00:53:32,000 --> 00:53:34,000
you can detect which operations became expensive

1346
00:53:34,000 --> 00:53:34,880
and optimize them.

1347
00:53:34,880 --> 00:53:37,600
This level of observability allows for rapid diagnosis

1348
00:53:37,600 --> 00:53:39,360
and prevents the kind of silent failures

1349
00:53:39,360 --> 00:53:40,960
that kill trust in a system.

1350
00:53:40,960 --> 00:53:42,880
Scaling happens in two different dimensions

1351
00:53:42,880 --> 00:53:44,160
to keep up with demand.

1352
00:53:44,160 --> 00:53:46,160
Horizontal scaling adds raw capacity

1353
00:53:46,160 --> 00:53:48,560
by spinning up more instances of the retrieval service,

1354
00:53:48,560 --> 00:53:51,440
more embedding workers or more re-ranker replicas.

1355
00:53:51,440 --> 00:53:53,680
Load is distributed across these resources

1356
00:53:53,680 --> 00:53:56,320
and the total throughput of the system increases.

1357
00:53:56,320 --> 00:53:59,040
Vertical scaling is more about optimizing performance

1358
00:53:59,040 --> 00:54:01,280
like switching to a smaller, faster embedding model

1359
00:54:01,280 --> 00:54:03,280
or a more efficient vector database.

1360
00:54:03,280 --> 00:54:06,720
This makes the system leaner and latency decreases as a result.

1361
00:54:06,720 --> 00:54:08,960
The orchestration layer manages all of this scaling

1362
00:54:08,960 --> 00:54:11,200
by routing traffic only to healthy instances.

1363
00:54:11,200 --> 00:54:13,920
It can gradually drain capacity from old deployments

1364
00:54:13,920 --> 00:54:16,640
and move it to new ones without dropping a single request.

1365
00:54:16,640 --> 00:54:18,640
It monitors how resources are being used

1366
00:54:18,640 --> 00:54:21,920
and triggers auto-scaling the moment specific thresholds are reached.

1367
00:54:21,920 --> 00:54:24,400
The system adapts to the demand on its own

1368
00:54:24,400 --> 00:54:26,320
and that's the power of orchestration.

1369
00:54:26,320 --> 00:54:27,760
It isn't just a collection of parts,

1370
00:54:27,760 --> 00:54:30,480
it's a cohesive system where every part serves the whole.

1371
00:54:30,480 --> 00:54:32,240
Data flows, queries get answered

1372
00:54:32,240 --> 00:54:34,880
and feedback makes the whole thing better over time.

1373
00:54:34,880 --> 00:54:39,200
The orchestration layer is what makes 3.5 million pages actually useful

1374
00:54:39,200 --> 00:54:42,640
because without it you just have a pile of disconnected pieces.

1375
00:54:42,640 --> 00:54:46,160
Cost analysis: semantic versus fixed chunking at scale.

1376
00:54:46,160 --> 00:54:48,320
Budget matters more than people realize

1377
00:54:48,320 --> 00:54:51,040
and when you're dealing with 3.5 million pages,

1378
00:54:51,040 --> 00:54:54,080
the cost difference between chunking strategies is massive.

1379
00:54:54,080 --> 00:54:56,320
You aren't just paying a one-time fee to set this up

1380
00:54:56,320 --> 00:54:59,520
because you're actually paying every single day the system is running.

1381
00:54:59,520 --> 00:55:01,520
Every page you ingest costs money

1382
00:55:01,520 --> 00:55:04,480
and every query a user sends triggers more costs.

1383
00:55:04,480 --> 00:55:07,600
These expenses compound across millions of operations

1384
00:55:07,600 --> 00:55:10,560
and a tiny overhead multiplier that looks small on paper

1385
00:55:10,560 --> 00:55:12,560
becomes a huge line item in your budget.

1386
00:55:12,560 --> 00:55:14,640
It all starts with the preprocessing stage.

1387
00:55:14,640 --> 00:55:19,040
When you chunk 3.5 million pages using fixed recursive chunking,

1388
00:55:19,040 --> 00:55:20,960
the work is linear and predictable.

1389
00:55:20,960 --> 00:55:24,240
You tokenize the text, you split it at the boundaries you've set

1390
00:55:24,240 --> 00:55:26,160
and you move on to the next page.

1391
00:55:26,160 --> 00:55:28,880
The computational cost for each page is basically a constant

1392
00:55:28,880 --> 00:55:32,240
which means you can parallelize the work across a cluster very easily.

1393
00:55:32,240 --> 00:55:35,520
A thousand machines can process a thousand pages at the same time

1394
00:55:35,520 --> 00:55:38,480
and the whole preprocessing job can be finished in a few hours.
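
To make the "linear and predictable" point concrete, here is a minimal sketch of fixed-window splitting. It assumes whitespace tokens stand in for a real tokenizer, and it is plain windowing rather than a full recursive splitter; the name `chunk_fixed` and the 64-token overlap are illustrative assumptions, not details from the episode.

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of at most `size` whitespace tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()  # stand-in for a real tokenizer (assumption)
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the page
    return chunks
```

Each page is handled on its own with a single pass over its tokens, which is why this kind of job spreads across many workers so easily.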

1395
00:55:38,480 --> 00:55:40,240
When you switch to semantic chunking,

1396
00:55:40,240 --> 00:55:42,960
the work becomes fundamentally different and much more expensive.

1397
00:55:42,960 --> 00:55:44,560
You can't just split at boundaries anymore

1398
00:55:44,560 --> 00:55:47,520
because now you have to compute embeddings for every single sentence.

1399
00:55:47,520 --> 00:55:50,480
You have to calculate similarity scores between those sentences

1400
00:55:50,480 --> 00:55:53,840
and detect exactly where those scores drop below a certain threshold.

1401
00:55:53,840 --> 00:55:57,200
You merge small segments, recompute the embeddings at the segment level

1402
00:55:57,200 --> 00:56:00,800
and the whole process requires multiple passes through neural models.

1403
00:56:00,800 --> 00:56:04,160
A single page might require embedding every sentence individually

1404
00:56:04,160 --> 00:56:06,640
just to make a decision about where a chunk should end.

1405
00:56:06,640 --> 00:56:11,840
A cost multiplier of 1.5 to 3 times is actually a conservative estimate for this.

1406
00:56:11,840 --> 00:56:15,520
Some setups pay even more than that depending on how sophisticated the logic is.
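
The boundary detection described above can be sketched roughly like this. The `embed` function here is a toy bag-of-words stand-in (an assumption) for the neural embedding model a real pipeline would call, and the 0.2 threshold is illustrative; the merge and segment-level re-embedding passes are left out to keep it short.

```python
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def chunk_semantic(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]  # one embedding per sentence
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(sentence)
    return chunks
```

Even in this stripped-down form, every sentence needs its own embedding before a single split decision can be made, which is where the extra preprocessing cost comes from.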

1407
00:56:15,520 --> 00:56:19,120
Embedding costs scale directly with the number of chunks you create.

1408
00:56:19,120 --> 00:56:23,520
Your vector database is going to bill you based on how many vectors you're storing in the index.

1409
00:56:23,520 --> 00:56:26,320
If semantic chunking produces a lot of small chunks,

1410
00:56:26,320 --> 00:56:28,960
which it usually does because it breaks things down by topic,

1411
00:56:28,960 --> 00:56:31,760
you end up with more total chunks from the same content.

1412
00:56:31,760 --> 00:56:33,760
More chunks means more embeddings to generate

1413
00:56:33,760 --> 00:56:37,520
and that leads to higher costs during ingestion and higher storage fees every month.

1414
00:56:37,520 --> 00:56:39,920
At a scale of 3.5 million pages,

1415
00:56:39,920 --> 00:56:43,520
even a small increase in the average chunk count has real financial consequences.

1416
00:56:43,520 --> 00:56:46,640
If semantic chunking increases your chunk count by 20%,

1417
00:56:46,640 --> 00:56:49,920
you're paying 20% more for your embeddings and your vector storage.

1418
00:56:49,920 --> 00:56:52,160
That is a material difference that adds up fast.

1419
00:56:52,160 --> 00:56:55,840
When you multiply that by your model costs and your database subscription tier,

1420
00:56:55,840 --> 00:56:59,200
the difference becomes very visible to whoever is paying the bills.

1421
00:56:59,200 --> 00:57:01,680
Storage costs only make this effect worse over time.

1422
00:57:01,680 --> 00:57:05,600
Vector databases usually charge based on the number of vectors you have stored

1423
00:57:05,600 --> 00:57:08,320
and some of them charge per million vectors every month.

1424
00:57:08,320 --> 00:57:09,920
Regardless of how they price it,

1425
00:57:09,920 --> 00:57:13,040
more vectors always mean a higher bill at the end of the month.

1426
00:57:13,040 --> 00:57:18,560
A recursive fixed chunking strategy at 512 tokens might produce 1 million chunks from your data.

1427
00:57:18,560 --> 00:57:23,520
Semantic chunking might produce 1.2 or even 1.5 million chunks from that same data.

1428
00:57:23,520 --> 00:57:27,280
Every single one of those extra vectors has to be stored, indexed and searched.

1429
00:57:27,280 --> 00:57:31,840
The storage multiplier is direct, so more vectors lead to a higher bill every single time.
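
As a rough worked example of that storage multiplier, assume a made-up price of 50 dollars per million vectors per month (the price is a placeholder, not a quote from any vendor) and the chunk counts mentioned above.

```python
def monthly_vector_cost(vectors: int, price_per_million: float) -> float:
    """Monthly storage bill under simple per-million-vector pricing."""
    return vectors / 1_000_000 * price_per_million

PRICE = 50.0  # hypothetical dollars per million vectors per month

fixed_cost = monthly_vector_cost(1_000_000, PRICE)      # fixed 512-token chunks
semantic_low = monthly_vector_cost(1_200_000, PRICE)    # 1.2 million semantic chunks
semantic_high = monthly_vector_cost(1_500_000, PRICE)   # 1.5 million semantic chunks

# The bill grows in the same proportion as the vector count.
low_multiplier = semantic_low / fixed_cost
high_multiplier = semantic_high / fixed_cost
```

Whatever the real price is, the ratio stays the same: 20% to 50% more vectors means a 20% to 50% larger storage bill under this pricing model.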

1430
00:57:31,840 --> 00:57:34,960
Query time costs are another factor that people often overlook.

1431
00:57:34,960 --> 00:57:37,360
When you search a larger index with more candidates,

1432
00:57:37,360 --> 00:57:40,640
the retrieval process takes longer and uses more compute.

1433
00:57:40,640 --> 00:57:43,200
Your vector database has to evaluate more distances

1434
00:57:43,200 --> 00:57:45,760
and your keyword index has more postings to look through.

1435
00:57:45,760 --> 00:57:50,640
Your re-ranker also has to evaluate more candidates before it can produce the final rankings for the user.

1436
00:57:50,640 --> 00:57:55,120
Re-ranking 50 candidates is much cheaper than re-ranking 150 candidates.

1437
00:57:55,120 --> 00:57:56,960
If semantic chunking creates smaller chunks,

1438
00:57:56,960 --> 00:58:00,720
you'll have to retrieve more of them just to reconstruct the context for the answer.

1439
00:58:00,720 --> 00:58:03,920
You end up re-ranking more and passing more data through the system,

1440
00:58:03,920 --> 00:58:06,320
which makes every query incrementally more expensive.

1441
00:58:06,320 --> 00:58:08,800
The cost of the LLM context is less obvious,

1442
00:58:08,800 --> 00:58:11,040
but it's very consequential for your bottom line.

1443
00:58:11,040 --> 00:58:13,440
Better chunking is supposed to reduce token waste,

1444
00:58:13,440 --> 00:58:15,760
but it can actually do the opposite if you aren't careful.

1445
00:58:15,760 --> 00:58:17,760
If your chunks are larger and more coherent,

1446
00:58:17,760 --> 00:58:21,120
you might only need to pass three of them to the LLM to get a good answer.

1447
00:58:21,120 --> 00:58:23,280
If your chunks are fragmented into tiny pieces,

1448
00:58:23,280 --> 00:58:26,640
you might have to pass eight chunks to capture the same amount of information.

1449
00:58:26,640 --> 00:58:30,240
LLM costs scale with how many tokens you consume in the prompt.

1450
00:58:30,240 --> 00:58:33,600
Eight chunks of 200 tokens each add up to 1600 tokens,

1451
00:58:33,600 --> 00:58:36,640
while three chunks of 500 tokens is only 1500.

1452
00:58:36,640 --> 00:58:40,000
In this case, the semantic chunking scenario uses more tokens

1453
00:58:40,000 --> 00:58:42,080
to provide the exact same information.

1454
00:58:42,080 --> 00:58:45,040
These API costs compound across millions of queries,

1455
00:58:45,040 --> 00:58:47,440
and while the difference isn't a disaster for one query,

1456
00:58:47,440 --> 00:58:49,280
it's significant for the whole operation.
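
The prompt arithmetic above, scaled up to a hypothetical one million queries (the query volume is an assumption chosen only to show the compounding):

```python
def prompt_tokens(chunks: int, tokens_per_chunk: int) -> int:
    """Context tokens sent to the LLM for one query."""
    return chunks * tokens_per_chunk

fragmented = prompt_tokens(8, 200)  # eight small chunks
coherent = prompt_tokens(3, 500)    # three larger chunks

extra_per_query = fragmented - coherent
QUERIES = 1_000_000                 # hypothetical query volume
extra_tokens_total = extra_per_query * QUERIES
```

A 100-token gap per query looks like nothing, but at this volume it is 100 million extra prompt tokens for the same information.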

1457
00:58:49,280 --> 00:58:53,680
A break-even analysis will show you when semantic chunking actually justifies the extra money.

1458
00:58:53,680 --> 00:58:55,200
For general document retrieval,

1459
00:58:55,200 --> 00:58:57,360
the break-even point almost never arrives

1460
00:58:57,360 --> 00:59:00,480
because the overhead costs are higher than the quality improvements.

1461
00:59:00,480 --> 00:59:04,560
However, for high-value domains like complex legal documents or regulatory filings,

1462
00:59:04,560 --> 00:59:06,320
the extra cost can be worth it.

1463
00:59:06,320 --> 00:59:09,760
When the precision of the answer is worth more than the preprocessing expense,

1464
00:59:09,760 --> 00:59:12,400
then semantic approaches become a lot easier to defend.

1465
00:59:12,400 --> 00:59:14,880
For 3.5 million pages of mixed content,

1466
00:59:14,880 --> 00:59:18,880
you might apply semantic chunking to the top 10% of your most important files.

1467
00:59:18,880 --> 00:59:21,600
This hybrid approach gives you the precision where it matters most

1468
00:59:21,600 --> 00:59:24,240
while keeping costs down across the rest of the corpus.

1469
00:59:24,240 --> 00:59:26,640
The lesson here is to be pragmatic with your budget.

1470
00:59:26,640 --> 00:59:28,480
Start with fixed recursive chunking

1471
00:59:28,480 --> 00:59:30,480
and measure exactly what you're spending.

1472
00:59:30,480 --> 00:59:32,560
You should only invest in semantic refinement

1473
00:59:32,560 --> 00:59:36,480
when you can prove that the gain in precision is worth the extra expense.

1474
00:59:36,480 --> 00:59:38,240
Governance and compliance at scale.

1475
00:59:38,240 --> 00:59:40,400
When you are managing 3.5 million pages,

1476
00:59:40,400 --> 00:59:42,160
governance isn't just a nice feature to have.

1477
00:59:42,160 --> 00:59:44,240
It is the absolute foundation of the system.

1478
00:59:44,240 --> 00:59:47,600
The retrieval tools might work and the AI might generate great answers,

1479
00:59:47,600 --> 00:59:51,600
but without governance, you are leaving yourself wide open to massive exposure.

1480
00:59:51,600 --> 00:59:56,240
At this kind of volume, a single policy mistake can leak thousands of sensitive documents

1481
00:59:56,240 --> 00:59:59,920
and a missed legal hold can literally destroy evidence needed for court.

1482
00:59:59,920 --> 01:00:04,000
Governance is the only thing that stops these small errors from turning into total system failures.

1483
01:00:04,000 --> 01:00:07,760
Data classification has to start long before the retrieval process even begins.

1484
01:00:07,760 --> 01:00:11,680
Every single document that enters your system gets sorted into a specific tier,

1485
01:00:11,680 --> 01:00:15,040
ranging from public and internal to confidential or restricted.

1486
01:00:15,040 --> 01:00:17,680
Public files can be shared with anyone and kept forever,

1487
01:00:17,680 --> 01:00:20,240
while internal documents are strictly for the organization

1488
01:00:20,240 --> 01:00:22,080
and can never be shared outside the company.

1489
01:00:22,080 --> 01:00:26,240
Confidential files require a specific need-to-know clearance.

1490
01:00:26,240 --> 01:00:29,040
And restricted documents are tied to legal holds or regulations

1491
01:00:29,040 --> 01:00:31,280
that require very specific handling rules.

1492
01:00:31,280 --> 01:00:34,560
This classification isn't just a label for show because it actually dictates

1493
01:00:34,560 --> 01:00:36,000
how the entire system behaves.

1494
01:00:36,000 --> 01:00:38,000
If a document is marked restricted,

1495
01:00:38,000 --> 01:00:40,800
the system won't let you delete it without legal sign-off

1496
01:00:40,800 --> 01:00:45,840
and confidential files won't even show up in a search unless the user has been granted explicit access.
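
One way to picture classification driving behavior is a small policy check like the sketch below. The `Tier`, `Doc`, `visible_in_search`, and `can_delete` names are hypothetical, and treating restricted files like confidential ones for search visibility is an assumption, since the episode only spells out the confidential case.

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

@dataclass
class Doc:
    doc_id: str
    tier: Tier
    granted_users: set[str] = field(default_factory=set)

def visible_in_search(doc: Doc, user: str, is_employee: bool) -> bool:
    """Decide whether a document may appear in this user's search results."""
    if doc.tier == Tier.PUBLIC:
        return True
    if doc.tier == Tier.INTERNAL:
        return is_employee
    # Confidential (and, by assumption, restricted) files need an explicit grant.
    return user in doc.granted_users

def can_delete(doc: Doc, legal_signoff: bool) -> bool:
    """Restricted documents cannot be deleted without legal sign-off."""
    return doc.tier != Tier.RESTRICTED or legal_signoff
```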

1497
01:00:45,840 --> 01:00:49,040
Retention policies are what define the life cycle of your data.

1498
01:00:49,040 --> 01:00:52,160
You have to decide how long a document stays in your active index,

1499
01:00:52,160 --> 01:00:56,480
which might be years for a legal case or just three years for a standard business memo.

1500
01:00:56,480 --> 01:01:00,080
Certain industries have strict mandates like keeping financial records for seven years

1501
01:01:00,080 --> 01:01:01,680
or employment files for five,

1502
01:01:01,680 --> 01:01:04,240
but retention isn't just about hitting a delete button.

1503
01:01:04,240 --> 01:01:08,320
It is about transitions where active documents stay in fast access tiers,

1504
01:01:08,320 --> 01:01:12,240
while older files move to cheaper, slower storage with reduced searchability.

1505
01:01:12,240 --> 01:01:15,120
Your policies encode this entire journey.

1506
01:01:15,120 --> 01:01:19,120
So when a document hits its expiration date, the system moves it automatically.

1507
01:01:19,120 --> 01:01:22,320
Users can't find these archived files through a normal Copilot search

1508
01:01:22,320 --> 01:01:26,080
and getting to them requires special permission and a full audit trail.

1509
01:01:26,080 --> 01:01:29,200
Legal holds will always override your standard retention rules.

1510
01:01:29,200 --> 01:01:32,800
During a lawsuit or an investigation, certain documents become immutable,

1511
01:01:32,800 --> 01:01:35,200
meaning they cannot be moved, changed or deleted,

1512
01:01:35,200 --> 01:01:37,520
no matter what the original policy said.

1513
01:01:37,520 --> 01:01:40,640
Once the legal department issues a hold, the system tags those files,

1514
01:01:40,640 --> 01:01:44,480
and they stay frozen even if they were scheduled for deletion that very day.

1515
01:01:44,480 --> 01:01:48,000
Normal retention only starts back up once the hold is officially lifted.

1516
01:01:48,000 --> 01:01:51,360
If you don't have this logic built in, you run the risk of destroying evidence

1517
01:01:51,360 --> 01:01:52,720
during a routine cleanup,

1518
01:01:52,720 --> 01:01:57,360
but a good system lets litigation needs and daily compliance live together without any conflict.
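
The hold-overrides-retention rule can be sketched as a single guard in the cleanup job. The `Record` type and `due_for_disposal` function are hypothetical names for illustration, not part of any real product API.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Record:
    doc_id: str
    retain_until: date
    legal_hold: bool = False

def due_for_disposal(records: list[Record], today: date) -> list[str]:
    """Return ids past their retention date, skipping anything under legal hold."""
    return [r.doc_id for r in records
            if r.retain_until <= today and not r.legal_hold]
```

A record scheduled for deletion today stays out of the result for as long as its hold flag is set, and it becomes eligible again only once the hold is lifted.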

1519
01:01:57,360 --> 01:02:01,280
Audit logging is how you record every single thing that happens.

1520
01:02:01,280 --> 01:02:04,080
Every time a user asks a question, the system logs

1521
01:02:04,080 --> 01:02:06,320
who made the request, which documents were pulled,

1522
01:02:06,320 --> 01:02:09,840
and whether the user actually opened them or just saw them in a summary.

1523
01:02:09,840 --> 01:02:13,280
Every AI response is tracked to show which documents were used as context

1524
01:02:13,280 --> 01:02:15,200
and if the answer cited them correctly.

1525
01:02:15,200 --> 01:02:19,280
We also log every change in classification and every time access is granted or denied,

1526
01:02:19,280 --> 01:02:22,240
but these logs aren't there to help with speed or performance.

1527
01:02:22,240 --> 01:02:23,760
They exist for accountability,

1528
01:02:23,760 --> 01:02:26,000
giving you a clear timeline if a breach happens,

1529
01:02:26,000 --> 01:02:28,960
or if a regulator asks how you handle specific data.

1530
01:02:28,960 --> 01:02:31,600
These logs are permanent and cannot be deleted by users,

1531
01:02:31,600 --> 01:02:35,440
and we actually keep them longer than the source documents to prove we stayed compliant.

1532
01:02:35,440 --> 01:02:38,720
Data residency is how you manage the geography of your information.

1533
01:02:38,720 --> 01:02:42,560
Regulations in the EU require personal data to stay within their borders,

1534
01:02:42,560 --> 01:02:45,600
just like US health data and various financial records

1535
01:02:45,600 --> 01:02:47,840
have their own strict jurisdictional rules.

1536
01:02:47,840 --> 01:02:50,000
Your retrieval system has to respect these lines,

1537
01:02:50,000 --> 01:02:52,800
so if a US user tries to process EU data,

1538
01:02:52,800 --> 01:02:54,960
the system won't pull from a US index.

1539
01:02:54,960 --> 01:02:59,200
Instead, the orchestration layer transparently routes that request to a compliant region in Europe.

1540
01:02:59,200 --> 01:03:02,720
Because every document has residency tags built into its classification,

1541
01:03:02,720 --> 01:03:06,800
the routing happens automatically without the user ever needing to think about it.
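
A minimal sketch of residency-aware routing, assuming each document carries a residency tag. The index names and the `index_for` helper are placeholders invented for this example.

```python
REGION_INDEX = {"eu": "index-eu-west", "us": "index-us-east"}  # hypothetical names

def index_for(doc_residency: str) -> str:
    """Route by where the data must live, not by where the user sits."""
    try:
        return REGION_INDEX[doc_residency]
    except KeyError:
        # Refuse rather than fall back to a region that may not be compliant.
        raise ValueError(f"no compliant index for residency {doc_residency!r}")
```

The user's own location never appears in the lookup, which is the point: a US user asking about EU data still gets routed to the EU index.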

1542
01:03:06,800 --> 01:03:09,760
Incident response is your way of catching a breach before it spreads.

1543
01:03:09,760 --> 01:03:13,680
If a user suddenly tries to pull thousands of documents they've never looked at before,

1544
01:03:13,680 --> 01:03:15,440
the system triggers an immediate alert.

1545
01:03:15,440 --> 01:03:18,240
We also have detection tools that watch for sensitive data

1546
01:03:18,240 --> 01:03:19,920
appearing in answers where it shouldn't be,

1547
01:03:19,920 --> 01:03:23,760
or for users logging in from unexpected regions to access restricted files.

1548
01:03:23,760 --> 01:03:26,800
These detection systems feed directly into response workflows,

1549
01:03:26,800 --> 01:03:29,680
so your security team can investigate and fix the problem.

1550
01:03:29,680 --> 01:03:33,120
Once the issue is resolved, we go back and adjust the preventive policies

1551
01:03:33,120 --> 01:03:35,120
to make sure it doesn't happen again.

1552
01:03:35,120 --> 01:03:37,120
When governance is working the way it should,

1553
01:03:37,120 --> 01:03:39,600
it stays completely invisible to the end user.

1554
01:03:39,600 --> 01:03:42,320
The people using the system just see the answers they need,

1555
01:03:42,320 --> 01:03:45,760
while the back end quietly makes sure those answers follow the law,

1556
01:03:45,760 --> 01:03:49,440
respect retention dates, and keep a perfect audit trail.

1557
01:03:49,440 --> 01:03:53,040
The goal is a system that doesn't just fail silently when a policy is broken,

1558
01:03:53,040 --> 01:03:54,960
but instead escalates the issue,

1559
01:03:54,960 --> 01:03:56,800
so it can be handled properly.

1560
01:03:56,800 --> 01:03:59,120
Real-world implementation: the workflow.

1561
01:03:59,120 --> 01:04:00,400
Theory is great for planning,

1562
01:04:00,400 --> 01:04:02,240
but you need to know how this actually looks

1563
01:04:02,240 --> 01:04:03,760
when you put it into practice.

1564
01:04:03,760 --> 01:04:06,960
When you are standing up a system to manage 3.5 million pages,

1565
01:04:06,960 --> 01:04:09,680
you have to follow a very specific execution plan.

1566
01:04:09,680 --> 01:04:13,840
The timeline is vital because every phase is designed to prove your architecture works

1567
01:04:13,840 --> 01:04:16,160
before you try to scale it up to the full corpus.

1568
01:04:16,160 --> 01:04:18,720
Day one is all about ingestion and extraction.

1569
01:04:18,720 --> 01:04:22,240
You start with a massive data dump of 3.5 million pages,

1570
01:04:22,240 --> 01:04:26,000
including everything from PDFs and emails to old scanned images and video files.

1571
01:04:26,000 --> 01:04:28,080
The ingestion pipeline starts moving immediately,

1572
01:04:28,080 --> 01:04:31,360
and we run the extraction process at the same time to break those files down.

1573
01:04:31,360 --> 01:04:33,760
PDF parsers pull out the text,

1574
01:04:33,760 --> 01:04:36,960
OCR technology turns scanned images into readable data,

1575
01:04:36,960 --> 01:04:40,160
and we catalog all the metadata for every video and image file.

1576
01:04:40,160 --> 01:04:42,000
We aren't looking for perfection on the first day,

1577
01:04:42,000 --> 01:04:44,640
so it is fine if the OCR misses a few characters

1578
01:04:44,640 --> 01:04:46,480
or a parser mislabels a section.

1579
01:04:46,480 --> 01:04:49,360
The goal is to get coverage across the entire dataset,

1580
01:04:49,360 --> 01:04:52,880
while a background process identifies names, dates, and organizations.

1581
01:04:52,880 --> 01:04:54,640
By the time the sun goes down on day one,

1582
01:04:54,640 --> 01:04:57,600
you have a raw version of your content registered in the system

1583
01:04:57,600 --> 01:04:59,280
and ready for the next step.

1584
01:04:59,280 --> 01:05:01,920
Days 2 through 7 are dedicated to indexing.

1585
01:05:01,920 --> 01:05:04,480
All that extracted content goes through a chunking process

1586
01:05:04,480 --> 01:05:06,880
that respects the natural hierarchy of the documents,

1587
01:05:06,880 --> 01:05:10,640
and we tag those chunks with the entities and sensitivity levels we found earlier.

1588
01:05:10,640 --> 01:05:13,840
The system starts filling up the tier one and tier two indexes,

1589
01:05:13,840 --> 01:05:16,480
giving high-value files, like legal filings,

1590
01:05:16,480 --> 01:05:19,120
full semantic indexing with vector embeddings.

1591
01:05:19,120 --> 01:05:22,880
At the same time, the knowledge graph starts building connections between people and dates,

1592
01:05:22,880 --> 01:05:25,760
while less important content goes into keyword-based storage.

1593
01:05:25,760 --> 01:05:28,400
You don't have to wait for the whole thing to finish to start testing,

1594
01:05:28,400 --> 01:05:32,400
so you can run sample queries to see if you can find documents by name or topic.

1595
01:05:32,400 --> 01:05:35,200
These early tests tell you if your chunking is working,

1596
01:05:35,200 --> 01:05:38,160
and if the system is respecting the structure of your data.

1597
01:05:38,160 --> 01:05:40,240
Week 2 is when you finally deploy your agents.

1598
01:05:40,240 --> 01:05:42,320
The agentic router is the brain of the system

1599
01:05:42,320 --> 01:05:46,160
that understands what a user wants and breaks their query down into smaller parts.

1600
01:05:46,160 --> 01:05:48,960
This is the moment where your design meets the real world,

1601
01:05:48,960 --> 01:05:51,360
and you start feeding the system complex questions,

1602
01:05:51,360 --> 01:05:56,000
like asking for all emails between two specific people from the year 2005.

1603
01:05:56,000 --> 01:05:58,080
The router has to classify that request,

1604
01:05:58,080 --> 01:06:02,400
generate subqueries and pull data from both the knowledge graph and the document index.

1605
01:06:02,400 --> 01:06:05,680
Once the results come back, the router merges them into a single answer,

1606
01:06:05,680 --> 01:06:08,640
and you have to manually check if that answer is actually right.

1607
01:06:08,640 --> 01:06:11,600
This phase usually reveals where your assumptions were wrong,

1608
01:06:11,600 --> 01:06:13,600
like a query going to the wrong index,

1609
01:06:13,600 --> 01:06:17,280
which allows you to iterate and fix the biggest problems before moving on.

1610
01:06:17,280 --> 01:06:19,440
Week 3 is when you turn on re-ranking.

1611
01:06:19,440 --> 01:06:22,960
We bring the cross-encoder model online to sharpen the precision of the results,

1612
01:06:22,960 --> 01:06:26,240
but you have to keep a close eye on how much this slows things down.

1613
01:06:26,240 --> 01:06:30,640
If your basic search took 30 milliseconds and re-ranking bumps it up to 150,

1614
01:06:30,640 --> 01:06:33,280
you have to decide if that extra time is worth it.

1615
01:06:33,280 --> 01:06:36,560
In most cases, a 40% jump in precision is a fair trade

1616
01:06:36,560 --> 01:06:39,120
for an extra 120 milliseconds of wait time.

1617
01:06:39,120 --> 01:06:40,880
If the delay is too long for your users,

1618
01:06:40,880 --> 01:06:43,360
you can always shrink the number of documents you are re-ranking

1619
01:06:43,360 --> 01:06:45,040
or switch to a smaller, faster model.
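
If you want to size the re-ranking step against a latency budget, a back-of-the-envelope helper like this can work. It assumes re-ranking time grows linearly with the number of candidates, and the 2 ms per candidate figure in the usage note is made up for illustration.

```python
def max_rerank_candidates(budget_ms: float, base_ms: float,
                          per_candidate_ms: float) -> int:
    """Largest candidate count that keeps total latency within budget_ms."""
    if per_candidate_ms <= 0:
        raise ValueError("per_candidate_ms must be positive")
    remaining = budget_ms - base_ms  # time left after the basic search
    return max(0, int(remaining // per_candidate_ms))
```

With a 30 ms base search and a 150 ms budget, a hypothetical 2 ms per candidate leaves room for 60 candidates; tightening the budget to 100 ms cuts that to 35.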

1620
01:06:45,040 --> 01:06:46,400
Week 4 is the pilot rollout.

1621
01:06:46,400 --> 01:06:48,240
You don't give the system to everyone at once,

1622
01:06:48,240 --> 01:06:52,400
but instead you hand it to a small group like the legal team who really knows the data.

1623
01:06:52,400 --> 01:06:55,360
They start running real-world queries and giving you feedback,

1624
01:06:55,360 --> 01:06:59,440
hitting a thumbs-up for good answers and correcting the system when it misses the mark.

1625
01:06:59,440 --> 01:07:01,760
This feedback goes straight into your monitoring tools

1626
01:07:01,760 --> 01:07:03,920
so you can see exactly where the system is struggling.

1627
01:07:03,920 --> 01:07:06,400
You might find that time-based questions work perfectly

1628
01:07:06,400 --> 01:07:08,640
while relationship queries are missing some connections,

1629
01:07:08,640 --> 01:07:10,720
but you don't rush to change things yet.

1630
01:07:10,720 --> 01:07:13,600
You take the time to observe and understand these failure patterns

1631
01:07:13,600 --> 01:07:15,600
before you start tweaking the settings.

1632
01:07:15,600 --> 01:07:18,960
Ongoing operations are all about constant monitoring and small adjustments.

1633
01:07:18,960 --> 01:07:21,040
You have to track your latency metrics every day

1634
01:07:21,040 --> 01:07:22,880
because if the system starts slowing down,

1635
01:07:22,880 --> 01:07:26,160
you need to know if the index is getting too big or if the cache isn't working.

1636
01:07:26,160 --> 01:07:28,400
We also keep a close eye on the cost per query

1637
01:07:28,400 --> 01:07:31,440
to make sure a new feature hasn't made the system inefficient.

1638
01:07:31,440 --> 01:07:34,000
Quality is measured against a set of gold standard questions,

1639
01:07:34,000 --> 01:07:35,600
so if the precision starts to slip,

1640
01:07:35,600 --> 01:07:39,360
you know exactly which part of the retrieval or ranking process needs work.

1641
01:07:39,360 --> 01:07:40,880
The system evolves slowly,

1642
01:07:40,880 --> 01:07:43,360
getting better with every interaction and every correction

1643
01:07:43,360 --> 01:07:46,000
until it becomes a truly reliable tool.

1644
01:07:46,000 --> 01:07:48,160
Common failure modes and how to avoid them.

1645
01:07:48,160 --> 01:07:50,400
Even with a solid architecture, things can break

1646
01:07:50,400 --> 01:07:52,960
and usually they break in predictable ways.

1647
01:07:52,960 --> 01:07:55,920
Overfragmentation is a silent killer for usability.

1648
01:07:55,920 --> 01:07:58,080
When you create chunks that are too small,

1649
01:07:58,080 --> 01:07:59,680
say under 200 tokens,

1650
01:07:59,680 --> 01:08:02,400
you force the retrieval system to pull dozens of fragments

1651
01:08:02,400 --> 01:08:04,560
just to reconstruct one coherent thought.

1652
01:08:04,560 --> 01:08:07,120
Imagine a legal deposition that spans 300 tokens

1653
01:08:07,120 --> 01:08:09,040
but gets split into four separate chunks.

1654
01:08:09,040 --> 01:08:12,000
If a user asks a question that matches only the second chunk,

1655
01:08:12,000 --> 01:08:14,720
the system retrieves it, but the context is gone.

1656
01:08:14,720 --> 01:08:16,880
The witness statement leading up to that testimony

1657
01:08:16,880 --> 01:08:18,480
is sitting in a different chunk

1658
01:08:18,480 --> 01:08:21,120
and the follow-up question is in another one entirely.

1659
01:08:21,120 --> 01:08:23,520
What the LLM receives is a fragmented mosaic

1660
01:08:23,520 --> 01:08:25,200
instead of a continuous narrative

1661
01:08:25,200 --> 01:08:27,200
which leads it to hallucinate connections

1662
01:08:27,200 --> 01:08:29,600
or miss the nuance of the conversation.

1663
01:08:29,600 --> 01:08:32,720
Users will notice that the answers feel scattered and incomplete.

1664
01:08:32,720 --> 01:08:34,720
On the other hand, chunks that are too large

1665
01:08:34,720 --> 01:08:36,960
create massive amounts of retrieval noise.

1666
01:08:36,960 --> 01:08:40,160
A 5,000 token chunk covering an entire legal section

1667
01:08:40,160 --> 01:08:43,920
might contain a few core facts mixed with pages of tangential discussion.

1668
01:08:43,920 --> 01:08:46,320
When this gets retrieved, it swamps the context window

1669
01:08:46,320 --> 01:08:47,600
with irrelevant details

1670
01:08:47,600 --> 01:08:50,640
and the LLM struggles to find the needle in the haystack.

1671
01:08:50,640 --> 01:08:53,360
Both of these extremes will break your downstream performance.

1672
01:08:53,360 --> 01:08:56,400
The only way to prevent this is through constant measurement.

1673
01:08:56,400 --> 01:08:58,800
You need to monitor your chunk size distribution

1674
01:08:58,800 --> 01:09:01,040
and calculate exactly what percentage of your data

1675
01:09:01,040 --> 01:09:03,680
falls below 200 tokens or exceeds 2000.

1676
01:09:03,680 --> 01:09:06,640
When you see the distribution skewing towards these extremes,

1677
01:09:06,640 --> 01:09:09,280
you have to adjust the minimum and maximum bounds.

1678
01:09:09,280 --> 01:09:11,440
The goal isn't to make every chunk identical,

1679
01:09:11,440 --> 01:09:13,120
but you need enough consistency

1680
01:09:13,120 --> 01:09:16,640
to prevent these pathological cases from ruining the user experience.
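
The distribution check described above could look something like this, using the 200 and 2000 token bounds from the discussion. The `size_report` name and the shape of its output are assumptions.

```python
def size_report(token_counts: list[int], low: int = 200,
                high: int = 2000) -> dict[str, float]:
    """Percentage of chunks below `low` tokens or above `high` tokens."""
    if not token_counts:
        return {"pct_below": 0.0, "pct_above": 0.0}
    n = len(token_counts)
    below = sum(1 for c in token_counts if c < low)
    above = sum(1 for c in token_counts if c > high)
    return {"pct_below": 100 * below / n, "pct_above": 100 * above / n}
```

Run it after every ingestion batch and track both numbers over time; a rising share at either end is the signal to revisit the minimum and maximum bounds.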

1681
01:09:16,640 --> 01:09:18,800
Permission misalignment is a much more dangerous problem

1682
01:09:18,800 --> 01:09:20,640
because it's a silent security breach.

1683
01:09:20,640 --> 01:09:22,560
You might build a robust retrieval system,

1684
01:09:22,560 --> 01:09:24,880
but if the permission metadata attached to your chunks

1685
01:09:24,880 --> 01:09:27,120
becomes stale, you have a major liability.

1686
01:09:27,120 --> 01:09:29,600
Consider a situation where a user changes teams

1687
01:09:29,600 --> 01:09:32,880
and their access profile is updated in your identity system.

1688
01:09:32,880 --> 01:09:35,840
If the retrieval system hasn't revalidated those permissions

1689
01:09:35,840 --> 01:09:37,520
against the updated identity,

1690
01:09:37,520 --> 01:09:38,960
the system is flying blind.

1691
01:09:38,960 --> 01:09:40,480
The user runs a query

1692
01:09:40,480 --> 01:09:42,160
and the retrieval engine returns a chunk

1693
01:09:42,160 --> 01:09:43,840
they no longer have the right to see.

1694
01:09:43,840 --> 01:09:46,400
Because the permission check ran against cached,

1695
01:09:46,400 --> 01:09:47,680
outdated credentials,

1696
01:09:47,680 --> 01:09:50,960
the restricted information passes through without being filtered.

1697
01:09:50,960 --> 01:09:53,200
This is actually worse than a total system failure

1698
01:09:53,200 --> 01:09:55,440
because it isn't visible to the admins.

1699
01:09:55,440 --> 01:09:58,320
To prevent this, you need a strict synchronization discipline

1700
01:09:58,320 --> 01:10:01,280
where permission metadata is refreshed on a regular schedule.

1701
01:10:01,280 --> 01:10:03,760
Depending on how often people move roles in your company,

1702
01:10:03,760 --> 01:10:05,760
this should happen daily or at least weekly.

1703
01:10:05,760 --> 01:10:08,640
When access policies change in your primary identity system,

1704
01:10:08,640 --> 01:10:11,440
those changes must flow to the retrieval cache immediately.

1705
01:10:11,440 --> 01:10:13,360
You should never rely on cached permissions

1706
01:10:13,360 --> 01:10:15,440
for more than a few hours at a time.

1707
01:10:15,440 --> 01:10:18,080
It's also vital to test these boundaries continuously

1708
01:10:18,080 --> 01:10:20,400
by running synthetic queries as a restricted user.

1709
01:10:20,400 --> 01:10:22,080
You need to verify that those restricted chunks

1710
01:10:22,080 --> 01:10:24,080
are actually being filtered out in real time.

1711
01:10:24,080 --> 01:10:26,080
Don't just assume the permission system is working.

1712
01:10:26,080 --> 01:10:27,600
You have to validate it.
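
Here is a minimal sketch of a permission cache with a time limit and an explicit invalidation hook, plus the filter a synthetic restricted-user test would exercise. `PermissionCache`, `filter_chunks`, the one-hour TTL, and the `allowed_groups` field are all hypothetical; a real deployment would pull groups from its identity system.

```python
import time

class PermissionCache:
    """Caches a user's groups, re-fetching once an entry is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch      # callable: user -> iterable of group names
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # user -> (fetched_at, groups)

    def groups_for(self, user):
        now = self._clock()
        hit = self._entries.get(user)
        if hit is None or now - hit[0] >= self._ttl:
            hit = (now, frozenset(self._fetch(user)))
            self._entries[user] = hit
        return hit[1]

    def invalidate(self, user):
        """Call this when the identity system reports a change for the user."""
        self._entries.pop(user, None)

def filter_chunks(chunks, groups):
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c["allowed_groups"] & groups]
```

The TTL is only the safety net. The `invalidate` call is what lets a team change take effect right away instead of waiting for the entry to expire.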

1713
01:10:27,600 --> 01:10:29,440
Stale metadata causes your system

1714
01:10:29,440 --> 01:10:31,680
to drift away from reality over time.

1715
01:10:31,680 --> 01:10:33,120
During the initial ingestion,

1716
01:10:33,120 --> 01:10:35,520
your enrichment pipelines extract dates,

1717
01:10:35,520 --> 01:10:39,200
assign classifications, and map out relationships between entities.

1718
01:10:39,200 --> 01:10:41,280
But documents are living things that evolve.

1719
01:10:41,280 --> 01:10:44,000
A file might change from confidential to internal

1720
01:10:44,000 --> 01:10:46,320
or a document might be superseded by a newer version

1721
01:10:46,320 --> 01:10:47,600
without the old one being deleted.

1722
01:10:47,600 --> 01:10:49,040
Sometimes a legal hold expires,

1723
01:10:49,040 --> 01:10:51,680
but nobody remembers to update the metadata tags.

1724
01:10:51,680 --> 01:10:53,280
These small errors accumulate

1725
01:10:53,280 --> 01:10:55,360
until your retrieval system is operating on a mountain

1726
01:10:55,360 --> 01:10:56,960
of outdated information.

1727
01:10:56,960 --> 01:10:59,520
Users start seeing results ordered by the wrong dates

1728
01:10:59,520 --> 01:11:02,240
and classifications no longer match current company policy.

1729
01:11:02,240 --> 01:11:04,160
Because this degradation happens gradually,

1730
01:11:04,160 --> 01:11:07,360
it's easy to miss until the system is significantly compromised.

1731
01:11:07,360 --> 01:11:09,840
The fix is a scheduled refresh of your metadata.

1732
01:11:09,840 --> 01:11:12,640
Major fields like classification, retention status,

1733
01:11:12,640 --> 01:11:15,760
and legal hold states need to be revalidated periodically.

1734
01:11:15,760 --> 01:11:18,400
At least once a quarter, you should scan the entire corpus

1735
01:11:18,400 --> 01:11:21,200
and compare the metadata against your current policies.

1736
01:11:21,200 --> 01:11:22,800
When you find a mismatch, you update it.

1737
01:11:22,800 --> 01:11:24,800
For high-velocity fields that change frequently,

1738
01:11:24,800 --> 01:11:26,240
you'll need to increase that frequency

1739
01:11:26,240 --> 01:11:27,760
while quarterly checks are usually enough

1740
01:11:27,760 --> 01:11:29,680
for stable fields like document types.

1741
01:11:29,680 --> 01:11:32,400
You should also set up automated alerts for any document

1742
01:11:32,400 --> 01:11:35,520
whose metadata hasn't been touched in longer than expected.
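As a rough illustration of that alerting idea, here is a minimal sketch that flags metadata fields whose last validation is older than an allowed interval. The field names, the seven-day interval for fast-changing fields, and the sample dates are all assumptions made for the example; only the quarterly cadence for stable fields comes from the discussion above.

```python
from datetime import datetime, timedelta

# Hypothetical revalidation intervals per metadata field.
# Only the quarterly cadence is from the episode; the weekly one is assumed.
REVALIDATE_EVERY = {
    "classification": timedelta(days=7),   # assumed high-velocity field
    "legal_hold": timedelta(days=7),       # assumed high-velocity field
    "document_type": timedelta(days=90),   # stable field, checked quarterly
}

def stale_fields(last_validated, now):
    """Return the fields whose last validation is older than allowed.

    A field that was never validated counts as stale.
    """
    return sorted(
        field
        for field, interval in REVALIDATE_EVERY.items()
        if now - last_validated.get(field, datetime.min) > interval
    )

# Illustrative document record: field -> last validation time.
now = datetime(2026, 6, 7)
doc = {
    "classification": datetime(2026, 6, 1),
    "legal_hold": datetime(2026, 3, 1),
    "document_type": datetime(2026, 4, 1),
}
print(stale_fields(doc, now))  # only legal_hold is overdue here
```

In a real pipeline the same check would presumably run over the whole corpus on a schedule and feed an alerting system rather than print.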

1743
01:11:35,520 --> 01:11:37,280
Routing errors are another common pitfall

1744
01:11:37,280 --> 01:11:39,920
where queries get sent to the wrong specialized systems.

1745
01:11:39,920 --> 01:11:42,320
This happens when the intent classifier misses the signals

1746
01:11:42,320 --> 01:11:43,440
in a user's question.

1747
01:11:43,440 --> 01:11:46,560
For example, a query about complex financial relationships

1748
01:11:46,560 --> 01:11:48,960
might get routed to a standard document search

1749
01:11:48,960 --> 01:11:50,400
instead of a graph database.

1750
01:11:50,400 --> 01:11:52,000
As a result, the answer misses

1751
01:11:52,000 --> 01:11:54,560
all the structured relationship data it needs,

1752
01:11:54,560 --> 01:11:57,040
or perhaps a temporal query gets sent to a keyword search

1753
01:11:57,040 --> 01:11:58,800
instead of a time-indexed system,

1754
01:11:58,800 --> 01:12:00,560
leading to results from the wrong years.

1755
01:12:00,560 --> 01:12:02,160
These failures often go unnoticed

1756
01:12:02,160 --> 01:12:03,920
because the system still returns results.

1757
01:12:03,920 --> 01:12:05,360
They just aren't the right ones.

1758
01:12:05,360 --> 01:12:08,000
To stop this, you need a continuous validation loop.

1759
01:12:08,000 --> 01:12:10,160
You should maintain a dedicated test set of queries

1760
01:12:10,160 --> 01:12:12,080
where the correct routing is already known

1761
01:12:12,080 --> 01:12:14,160
and you should run these tests every single day.

1762
01:12:14,160 --> 01:12:15,920
If the routing classification starts to shift,

1763
01:12:15,920 --> 01:12:17,520
you need to investigate why.

1764
01:12:17,520 --> 01:12:20,240
If certain types of questions are consistently failing,

1765
01:12:20,240 --> 01:12:22,320
it's time to retrain your classifier.

1766
01:12:22,320 --> 01:12:24,080
You should also monitor routing decisions

1767
01:12:24,080 --> 01:12:26,080
alongside the actual retrieval outcomes.

1768
01:12:26,080 --> 01:12:28,480
When you see that queries matching a specific pattern

1769
01:12:28,480 --> 01:12:30,800
are consistently producing poor results,

1770
01:12:30,800 --> 01:12:33,520
it's a clear sign that your routing logic is broken.
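One way to picture that daily validation loop is a small regression harness over labelled queries. The sketch below is an assumption-heavy toy: the keyword-based `route` function stands in for a real intent classifier, and the test queries and route names are invented for illustration.

```python
from collections import Counter

def route(query):
    """Toy stand-in for an intent classifier; a real one would be a model."""
    q = query.lower()
    if "between" in q or "before" in q or "after" in q:
        return "temporal"
    if "connected" in q or "relationship" in q:
        return "graph"
    return "document"

# Hypothetical labelled test set: (query, expected route).
ROUTING_TESTS = [
    ("Who is connected to organization Y?", "graph"),
    ("What happened between 2005 and 2007?", "temporal"),
    ("Find the deposition summary", "document"),
    ("Describe the financial relationship of person X", "graph"),
    ("Which payments link person X to organization Y?", "graph"),
]

def routing_report(tests, router):
    """Return overall accuracy plus a count of (expected, actual) misroutes."""
    misroutes = Counter()
    for query, expected in tests:
        actual = router(query)
        if actual != expected:
            misroutes[(expected, actual)] += 1
    accuracy = 1 - sum(misroutes.values()) / len(tests)
    return accuracy, misroutes

accuracy, misroutes = routing_report(ROUTING_TESTS, route)
print(accuracy, dict(misroutes))
```

Here the last query is silently sent to document search instead of the graph, which is exactly the kind of failure that still returns results. Tracking the misroute counts by pattern over time shows where the classifier is drifting.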

1771
01:12:33,520 --> 01:12:36,800
Reranking collapse occurs when your model becomes miscalibrated.

1772
01:12:36,800 --> 01:12:38,160
If you have an aggressive reranker,

1773
01:12:38,160 --> 01:12:39,760
it might assign incredibly low scores

1774
01:12:39,760 --> 01:12:41,360
to perfectly legitimate candidates,

1775
01:12:41,360 --> 01:12:43,280
leaving your top results empty.

1776
01:12:43,280 --> 01:12:44,960
Conversely, a conservative reranker

1777
01:12:44,960 --> 01:12:47,440
might barely change the order of the initial results at all.

1778
01:12:47,440 --> 01:12:50,240
In that case, the reranking process is just adding latency

1779
01:12:50,240 --> 01:12:53,680
to the system without actually improving the precision of the answers.

1780
01:12:53,680 --> 01:12:55,680
You prevent this through empirical tuning.

1781
01:12:55,680 --> 01:12:58,720
You have to test the reranker against a sample of real queries

1782
01:12:58,720 --> 01:13:01,360
and measure exactly how much it's changing the order of the results.

1783
01:13:01,360 --> 01:13:02,880
You need to validate that these changes

1784
01:13:02,880 --> 01:13:05,360
are actually making the answers more accurate.

1785
01:13:05,360 --> 01:13:07,600
If the reranker is moving things around for no reason

1786
01:13:07,600 --> 01:13:08,960
or having almost no impact,

1787
01:13:08,960 --> 01:13:10,560
you need to adjust your thresholds

1788
01:13:10,560 --> 01:13:12,480
or change the size of your candidate set.
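To make that tuning step concrete, here is a minimal sketch that compares one query's ranking before and after reranking. It reports how much of the top-k was replaced and whether precision actually improved. The document IDs, the relevance labels, and the choice of top-k churn as the "how much did it move" measure are assumptions for illustration, not something the episode prescribes.

```python
def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k results that are labelled relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def reranker_impact(before, after, relevant, k):
    """Compare the top-k before and after reranking for one query."""
    kept = len(set(before[:k]) & set(after[:k]))
    return {
        # Share of the original top-k that the reranker replaced.
        "top_k_churn": 1 - kept / k,
        # Positive means the reranker improved precision at k.
        "precision_delta": precision_at_k(after, relevant, k)
                           - precision_at_k(before, relevant, k),
    }

# Hypothetical rankings and relevance judgments for a single query.
before = ["d1", "d2", "d3", "d4", "d5", "d6"]
after = ["d5", "d1", "d6", "d2", "d3", "d4"]
relevant = {"d1", "d5", "d6"}
impact = reranker_impact(before, after, relevant, k=3)
print(impact)
```

High churn with a precision delta near zero suggests the reranker is shuffling for no benefit. Churn near zero suggests it is only adding latency.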

1789
01:13:12,480 --> 01:13:15,360
Finally, there is the issue of latency creep,

1790
01:13:15,360 --> 01:13:18,080
which comes from a dozen small additions piling up.

1791
01:13:18,080 --> 01:13:20,400
Each new feature might only add 10 milliseconds

1792
01:13:20,400 --> 01:13:22,240
but then the reranker adds 150

1793
01:13:22,240 --> 01:13:24,240
and the permission checks add another 30.

1794
01:13:24,240 --> 01:13:25,280
Before you know it,

1795
01:13:25,280 --> 01:13:28,320
the total latency has climbed to 300 milliseconds.

1796
01:13:28,320 --> 01:13:31,440
Then someone adds query expansion or extra caching overhead

1797
01:13:31,440 --> 01:13:33,360
and suddenly the system feels sluggish.

1798
01:13:33,360 --> 01:13:35,920
You have to be ruthless about monitoring these numbers.

1799
01:13:35,920 --> 01:13:37,920
Every single component in your pipeline

1800
01:13:37,920 --> 01:13:39,680
should report its own latency

1801
01:13:39,680 --> 01:13:42,480
and you need dashboards that show the cumulative total.

1802
01:13:42,480 --> 01:13:44,160
When that total hits your budget ceiling,

1803
01:13:44,160 --> 01:13:45,280
you have to make a choice.

1804
01:13:45,280 --> 01:13:46,720
If you want to add a new feature,

1805
01:13:46,720 --> 01:13:48,960
you have to find a way to remove or optimize

1806
01:13:48,960 --> 01:13:50,880
an old one to keep the system responsive.
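A latency budget can be sketched in a few lines. In this hypothetical example the 300 millisecond ceiling reuses the total mentioned above, the reranker and permission-check figures come from the discussion, and the retrieval and query-expansion figures are invented to make the arithmetic work.

```python
BUDGET_MS = 300  # assumed ceiling, borrowed from the example total above

def check_budget(component_ms, budget_ms=BUDGET_MS):
    """Sum per-component latencies and report headroom against the budget."""
    total = sum(component_ms.values())
    slowest = max(component_ms, key=component_ms.get)
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "over_budget": total > budget_ms,
        "slowest": slowest,
    }

# Per-component latencies in milliseconds (retrieval figure is an assumption).
pipeline = {"retrieval": 120, "reranker": 150, "permission_check": 30}
print(check_budget(pipeline))   # exactly at the ceiling, no headroom left

pipeline["query_expansion"] = 40  # a hypothetical new feature
print(check_budget(pipeline))   # now over budget: something has to give
```

The point of the `slowest` field is to show where to look first when a new feature needs room.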

1807
01:13:50,880 --> 01:13:53,760
Monitoring, evaluation, and continuous improvement.

1808
01:13:53,760 --> 01:13:56,880
The reality is that you cannot optimize what you aren't measuring

1809
01:13:56,880 --> 01:14:00,080
and this is exactly where most RAG implementations fall apart.

1810
01:14:00,080 --> 01:14:02,560
The system might be running and users might be getting answers

1811
01:14:02,560 --> 01:14:05,280
but without proper instrumentation, you are flying blind.

1812
01:14:05,280 --> 01:14:07,840
You won't know if the precision is slowly drifting downward

1813
01:14:07,840 --> 01:14:10,880
or if a recent update made the latency unbearable for your users.

1814
01:14:10,880 --> 01:14:13,760
You won't even know if your cost per query is starting to spike.

1815
01:14:13,760 --> 01:14:15,920
Measurement is the only thing that stands between a system

1816
01:14:15,920 --> 01:14:18,960
that degrades in the dark and one that actually gets better over time.

1817
01:14:18,960 --> 01:14:20,960
Retrieval metrics are your first line of defense

1818
01:14:20,960 --> 01:14:24,240
because they tell you if the pipeline is finding the right candidates.

1819
01:14:24,240 --> 01:14:25,840
You should be looking at precision at K

1820
01:14:25,840 --> 01:14:29,200
which measures how many of your top retrieved results are actually relevant.

1821
01:14:29,200 --> 01:14:32,480
If you pull the top 50 candidates but only 35 of them matter,

1822
01:14:32,480 --> 01:14:34,400
your precision is sitting at 70%.

1823
01:14:34,400 --> 01:14:36,960
You also need to track recall at K to see if you're capturing

1824
01:14:36,960 --> 01:14:39,040
all the relevant documents available in the corpus.

1825
01:14:39,040 --> 01:14:41,920
If there are 100 matching documents but you only find 40 of them,

1826
01:14:41,920 --> 01:14:43,040
your recall is failing.
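The two numbers above can be reproduced with a short sketch. The document IDs are synthetic and exist only to recreate the 35-of-50 and 40-of-100 examples.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

# Precision example: 50 candidates retrieved, 35 of them relevant.
retrieved = list(range(50))
relevant = set(range(35))
print(precision_at_k(retrieved, relevant, k=50))  # 0.7

# Recall example: 100 relevant documents exist, 40 of them are retrieved.
retrieved_2 = list(range(40)) + [1000 + i for i in range(10)]
relevant_2 = set(range(100))
print(recall_at_k(retrieved_2, relevant_2, k=50))  # 0.4
```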

1827
01:14:43,040 --> 01:14:46,800
These metrics help you diagnose the specific flavor of your quality issues.

1828
01:14:46,800 --> 01:14:49,040
You might have high recall but low precision,

1829
01:14:49,040 --> 01:14:50,640
meaning you're finding the right info

1830
01:14:50,640 --> 01:14:52,240
but drowning it in noise.

1831
01:14:52,240 --> 01:14:55,360
Or you might have the opposite problem where what you find is accurate

1832
01:14:55,360 --> 01:14:56,800
but you're missing the bigger picture.

1833
01:14:56,800 --> 01:14:59,920
You should also use NDCG to measure the quality of your ranking.

1834
01:14:59,920 --> 01:15:02,960
A relevant document that shows up in the first slot is much more valuable

1835
01:15:02,960 --> 01:15:04,400
than one buried at number 50,

1836
01:15:04,400 --> 01:15:06,560
and NDCG captures that difference.

1837
01:15:06,560 --> 01:15:08,960
Mean reciprocal rank is another useful tool that tells you

1838
01:15:08,960 --> 01:15:12,240
how far down the list a user has to go to find their first real answer.
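For the ranking metrics, a minimal sketch might look like this. It uses the common log2 discount for NDCG and, as a simplification, builds the ideal ordering from the judged list itself rather than from every relevant document in the corpus. The relevance values are made up.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(gains, k):
    """DCG of the ranking divided by the DCG of the best possible ordering.

    Simplification: the ideal ordering is built from the judged list itself.
    """
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

def mean_reciprocal_rank(rankings):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for flags in rankings:
        rank = next((i for i, rel in enumerate(flags, start=1) if rel), None)
        total += 1 / rank if rank else 0.0
    return total / len(rankings)

# The same single relevant document, ranked first versus ranked third.
print(ndcg_at_k([1, 0, 0], k=3))  # 1.0
print(ndcg_at_k([0, 0, 1], k=3))  # 0.5

# Three queries: first hit at rank 1, at rank 3, and no hit at all.
print(mean_reciprocal_rank([[True, False], [False, False, True], [False, False]]))
```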

1839
01:15:12,240 --> 01:15:14,320
These aren't just academic numbers.

1840
01:15:14,320 --> 01:15:16,960
They correlate directly with how happy your users are.

1841
01:15:16,960 --> 01:15:20,240
When precision is high, users don't have to deal with false leads.

1842
01:15:20,240 --> 01:15:22,720
When recall is high, they don't miss critical evidence.

1843
01:15:22,720 --> 01:15:26,960
The quality of your ranking is what determines if the system feels like a shortcut or a chore.

1844
01:15:26,960 --> 01:15:30,800
Answer quality metrics are what your users actually care about at the end of the day.

1845
01:15:30,800 --> 01:15:32,480
You need to measure faithfulness,

1846
01:15:32,480 --> 01:15:35,600
which asks if the answer stays true to the source documents.

1847
01:15:35,600 --> 01:15:39,440
An unfaithful answer is one that contradicts the evidence or starts guessing.

1848
01:15:39,440 --> 01:15:43,920
Then there is relevance, which checks if the system actually answered the specific question the user asked.

1849
01:15:43,920 --> 01:15:45,440
You also need to look at grounding.

1850
01:15:45,440 --> 01:15:50,160
Can every single factual claim in the response be traced back to a specific source chunk?

1851
01:15:50,160 --> 01:15:55,200
Citation accuracy is just as important because you need to know if the links actually support the claims they're attached to.

1852
01:15:55,200 --> 01:15:59,680
Since these are qualitative metrics, you can't just calculate them with a simple algorithm.

1853
01:15:59,680 --> 01:16:02,080
They require human judgment or expert review.

1854
01:16:02,080 --> 01:16:07,200
The best approach is to build a gold standard evaluation set of about 500 representative questions

1855
01:16:07,200 --> 01:16:09,120
with verified answers and citations.

1856
01:16:09,120 --> 01:16:12,320
You run your system against this set and have experts score the results.

1857
01:16:12,320 --> 01:16:18,080
By tracking these scores over time, you can see exactly when a change in the pipeline helps or hurts the final output.

1858
01:16:18,080 --> 01:16:22,000
Latency metrics need to be broken down by every single component in the chain.

1859
01:16:22,000 --> 01:16:26,560
You should track the time to the first byte so you know when the user first sees text on their screen.

1860
01:16:26,560 --> 01:16:30,400
You need to measure the time from the initial query to the moment retrieval is finished

1861
01:16:30,400 --> 01:16:33,520
and then the time from retrieval to the first LLM token.

1862
01:16:33,520 --> 01:16:37,280
It's important to remember that percentile latency matters much more than the average.

1863
01:16:37,280 --> 01:16:40,240
Your average response time might look great at 200 milliseconds,

1864
01:16:40,240 --> 01:16:45,040
but if your P99 is five seconds, one percent of your users are having a terrible experience.

1865
01:16:45,040 --> 01:16:48,880
You should track P50, P95, and P99 latencies religiously.

1866
01:16:48,880 --> 01:16:51,840
When the P99 starts to lag, you need to find out why.

1867
01:16:51,840 --> 01:16:53,360
Is a specific type of query slow?

1868
01:16:53,360 --> 01:16:55,840
Is one of your subsystems hitting a resource limit?

1869
01:16:55,840 --> 01:16:58,800
Percentiles will show you the failures that averages hide.
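A small sketch shows why the average is misleading. It uses the nearest-rank definition of a percentile, which is one of several conventions, and a synthetic set of 100 latencies in which two requests take five seconds.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies in ms: 98 fast requests and 2 very slow ones.
latencies = [150] * 98 + [5000] * 2

mean = sum(latencies) / len(latencies)
print(mean)                       # 247.0, which looks healthy
print(percentile(latencies, 50))  # 150
print(percentile(latencies, 95))  # 150
print(percentile(latencies, 99))  # 5000, the tail the mean was hiding
```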

1870
01:16:58,800 --> 01:17:02,080
Cost metrics are how you track the actual business impact of the system.

1871
01:17:02,080 --> 01:17:05,680
You should calculate the cost per query by dividing your total infrastructure spend

1872
01:17:05,680 --> 01:17:07,520
by the number of queries you serve.

1873
01:17:07,520 --> 01:17:10,880
If you're spending $50,000 a month to serve a million queries,

1874
01:17:10,880 --> 01:17:12,480
you're looking at five cents per query.

1875
01:17:12,480 --> 01:17:15,360
But an even better metric is the cost per successful answer.

1876
01:17:15,360 --> 01:17:17,440
Not every query produces a good result,

1877
01:17:17,440 --> 01:17:20,000
and only the successful ones actually provide value.

1878
01:17:20,000 --> 01:17:23,680
This metric tells you if your system is becoming more efficient over time.
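The arithmetic is simple enough to sketch directly. The spend and query volume come from the example above. The 80 percent success rate is an assumption added purely to show how the second metric differs from the first.

```python
def cost_metrics(monthly_spend, queries, successful_answers):
    """Return cost per query and cost per successful answer in dollars."""
    return {
        "cost_per_query": monthly_spend / queries,
        "cost_per_successful_answer": monthly_spend / successful_answers,
    }

# $50,000 a month, one million queries, and an assumed 800,000 good answers.
metrics = cost_metrics(50_000, 1_000_000, 800_000)
print(metrics["cost_per_query"])              # 0.05
print(metrics["cost_per_successful_answer"])  # 0.0625
```

If quality improves while spend stays flat, the second number falls even though the first does not move, which is why it is the better efficiency signal.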

1879
01:17:23,680 --> 01:17:26,480
You should also break down your infrastructure costs by component,

1880
01:17:26,480 --> 01:17:29,920
looking at retrieval, re-ranking, LLM tokens and storage.

1881
01:17:29,920 --> 01:17:34,640
If the total cost starts to climb, you can drill down to see if the LLM is consuming too many tokens,

1882
01:17:34,640 --> 01:17:37,040
or if your storage costs are scaling poorly.

1883
01:17:37,040 --> 01:17:40,240
User feedback is the final piece of the puzzle that closes the loop.

1884
01:17:40,240 --> 01:17:44,480
Simple thumbs up or down ratings can tell you if the answers were useful in the real world.

1885
01:17:44,480 --> 01:17:48,560
You can also look at escalation rates to see how often users have to give up and ask a human for help.

1886
01:17:48,560 --> 01:17:52,160
When those rates go up, it's a clear signal that the system is hitting a wall.

1887
01:17:52,160 --> 01:17:56,560
You should make it as easy as possible for users to provide this feedback directly in the interface.

1888
01:17:56,560 --> 01:18:01,600
When users rate thousands of answers, that data becomes a gold mine for future improvements.

1889
01:18:01,600 --> 01:18:04,720
Finally, you need to move toward automated offline evaluation.

1890
01:18:04,720 --> 01:18:08,320
Instead of checking things manually, your system should run a benchmark set of questions

1891
01:18:08,320 --> 01:18:12,400
against your live indexes every single night. It should compute precision, recall,

1892
01:18:12,400 --> 01:18:15,840
and NDCG automatically and compare them to your baseline.

1893
01:18:15,840 --> 01:18:19,440
This allows you to detect regressions before a single user ever sees them.
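The comparison against a baseline could be as small as the sketch below. The metric names, the baseline values, and the 0.02 tolerance are hypothetical. A real nightly job would compute tonight's numbers from the benchmark run instead of hard-coding them.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics that dropped more than `tolerance` below baseline.

    A metric missing from `current` is treated as 0.0, so it is flagged.
    """
    return {
        name: (baseline[name], current.get(name, 0.0))
        for name in baseline
        if baseline[name] - current.get(name, 0.0) > tolerance
    }

# Hypothetical stored baseline and tonight's benchmark results.
baseline = {"precision@10": 0.82, "recall@50": 0.74, "ndcg@10": 0.79}
tonight = {"precision@10": 0.81, "recall@50": 0.66, "ndcg@10": 0.80}

regressions = find_regressions(baseline, tonight)
print(regressions)  # only recall@50 fell by more than the tolerance
```

The tolerance keeps small run-to-run noise from raising alerts every night.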

1894
01:18:19,440 --> 01:18:23,040
This kind of discipline is what separates a project that slowly falls apart

1895
01:18:23,040 --> 01:18:27,440
from a professional system that catches problems and fixes them before they become disasters.

1896
01:18:27,440 --> 01:18:31,440
Future directions. Beyond 3.5 million pages.

1897
01:18:31,440 --> 01:18:36,000
The Epstein files case shows us what happens when you build with intention at a massive scale.

1898
01:18:36,000 --> 01:18:40,800
But here's the problem. The architecture you build today is just the foundation for tomorrow's headaches.

1899
01:18:40,800 --> 01:18:46,080
Those constraints that drove our design, the 3.5 million pages, the messy multimodal files,

1900
01:18:46,080 --> 01:18:49,520
the strict permissions, those aren't going to stay the same. Data grows,

1901
01:18:49,520 --> 01:18:54,080
requirements shift, new tech shows up. It is not a question of if your system needs to change,

1902
01:18:54,080 --> 01:18:58,640
but how you plan for that change right now. Agentic workflows are the next step in this journey.

1903
01:18:58,640 --> 01:19:03,360
Up until now, retrieval has been a simple operation. You ask a question, the system finds some text.

1904
01:19:03,360 --> 01:19:07,520
The model reads it and gives you an answer, but sophisticated users don't actually want answers.

1905
01:19:07,520 --> 01:19:11,920
They want reasoning. They want to compare a witness's testimony across three different depositions

1906
01:19:11,920 --> 01:19:16,080
to find where they lied. That isn't just finding text and repeating it. That is a process of

1907
01:19:16,080 --> 01:19:21,200
finding, synthesizing, comparing, and judging. An agentic system treats every one of those steps

1908
01:19:21,200 --> 01:19:25,680
as its own job. The agent pulls the first deposition. Then it pulls the second and third.

1909
01:19:25,680 --> 01:19:29,920
It runs them through a specific tool designed to find contradictions. It thinks about what that

1910
01:19:29,920 --> 01:19:34,960
means for the witness's credibility. This requires agents that can use multiple tools in a row,

1911
01:19:34,960 --> 01:19:39,040
check their own work, and decide what to do next. Current systems do this a little bit,

1912
01:19:39,040 --> 01:19:43,040
but in the future, this kind of reasoning will be the default. You need to build your architecture

1913
01:19:43,040 --> 01:19:48,400
today so that retrieval is a tool an agent can call, rather than a closed box that nobody can touch.

1914
01:19:48,400 --> 01:19:52,160
We are also seeing a shift toward graph-based RAG. This moves us away from simple,

1915
01:19:52,160 --> 01:19:56,720
math-based similarity and toward actual relationships. Your knowledge graph today maps people

1916
01:19:56,720 --> 01:20:00,560
and how they are connected. Tomorrow, that graph becomes the main way you find information.

1917
01:20:00,560 --> 01:20:04,800
Instead of looking for similar words, the system treats your question like a map. If you ask what

1918
01:20:04,800 --> 01:20:10,960
transactions connected person X to organization Y between 2005 and 2007, that isn't a text search,

1919
01:20:10,960 --> 01:20:15,520
it is a graph traversal. The system starts at the person, follows the money trails, filters by the

1920
01:20:15,520 --> 01:20:20,160
dates, and shows you the path. This means you have to start investing in graph infrastructure

1921
01:20:20,160 --> 01:20:23,920
like property graphs and relationship engines right now. This doesn't mean we stop using vector

1922
01:20:23,920 --> 01:20:28,800
search. It means we use both. If your data is all about relationships, you use the graph.
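To illustrate the difference between a text search and a traversal, here is a toy sketch of the person-to-organization question as a breadth-first search over dated edges. The entity names, the edge list, and the in-memory graph are all invented for the example. A production system would presumably use a graph database rather than a Python dictionary.

```python
from collections import deque

# Hypothetical transaction edges: (source, target, year).
EDGES = [
    ("person_x", "shell_co", 2005),
    ("shell_co", "org_y", 2006),
    ("person_x", "org_y", 2009),   # outside the 2005 to 2007 window
    ("shell_co", "bank_z", 2006),
]

def find_path(edges, start, goal, year_from, year_to):
    """Breadth-first search that only follows edges inside the year range.

    Returns the list of edges on a shortest path, or None if no path exists.
    """
    graph = {}
    for src, dst, year in edges:
        if year_from <= year <= year_to:
            graph.setdefault(src, []).append((dst, year))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for dst, year in graph.get(node, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(node, dst, year)]))
    return None

# Start at the person, follow the transactions, filter by date, show the path.
print(find_path(EDGES, "person_x", "org_y", 2005, 2007))
```

With the date filter applied, the direct 2009 edge is ignored and the answer is the two-hop path through the intermediary.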

1923
01:20:28,800 --> 01:20:32,960
If it is all about heavy text, you use vectors. The system just learns how to route the question to

1924
01:20:32,960 --> 01:20:38,480
the right place. Streaming retrieval is going to change how users actually feel the system working.

1925
01:20:38,480 --> 01:20:43,120
Right now, you ask a question and you wait. The system finds the data, ranks it, builds the context,

1926
01:20:43,120 --> 01:20:46,640
and finally shows you an answer. You see the whole thing at once after a long pause.

1927
01:20:46,640 --> 01:20:51,200
Streaming flips that logic. Results start appearing as they are found. The system shows you the best

1928
01:20:51,200 --> 01:20:55,760
guess immediately, while it keeps looking for better evidence in the background. The model starts

1929
01:20:55,760 --> 01:20:59,840
talking to you based on the first piece of evidence, while the retrieval engine is still working,

1930
01:20:59,840 --> 01:21:04,000
to find something even better. This means we have to stop thinking of this as a straight line,

1931
01:21:04,000 --> 01:21:08,640
where one thing happens after another. We have to decouple finding data from talking about it.

1932
01:21:08,640 --> 01:21:13,040
Even if the total time is the same, the user feels like the system is faster because they aren't

1933
01:21:13,040 --> 01:21:18,240
staring at a loading spinner. Then there is personalization. Today, everyone gets the same answer

1934
01:21:18,240 --> 01:21:22,000
to the same question. Tomorrow, the system will look at who you are and what you do.

1935
01:21:22,000 --> 01:21:26,480
If an executive asks about transactions from 2005, they get a high-level summary. If a forensic

1936
01:21:26,480 --> 01:21:31,040
analyst asks that same question, they get the raw evidence and the full citations. This isn't

1937
01:21:31,040 --> 01:21:35,520
about changing the facts. It is about changing the priority. If you care about dates, the system shows

1938
01:21:35,520 --> 01:21:40,800
you a timeline. If you care about networks, it shows you a map. The results stay grounded in reality,

1939
01:21:40,800 --> 01:21:45,840
but the way they are served to you fits your job. We are also moving towards cross-silo orchestration.

1940
01:21:45,840 --> 01:21:50,960
As these systems grow, you'll need to look outside your own walls. You might need to check partner data,

1941
01:21:50,960 --> 01:21:55,840
government records, or public files. One layer will coordinate all of that even though every source has

1942
01:21:55,840 --> 01:22:00,080
its own security and its own rules. The system will negotiate those permissions for you.

1943
01:22:00,080 --> 01:22:04,160
It will respect the rules of each silo and then merge the results into one coherent thought.

1944
01:22:04,160 --> 01:22:08,800
Finally, we have sovereign AI. Regulations are getting tighter every day. You have to be able to prove

1945
01:22:08,800 --> 01:22:13,280
that your data stays where it belongs, whether that is the EU or a specific industry.

1946
01:22:13,280 --> 01:22:17,760
Sovereign systems will route your questions only to approved hardware and models. This prevents data

1947
01:22:17,760 --> 01:22:22,800
from leaking across borders. This isn't a "maybe" for the future. If you are building a global system today,

1948
01:22:22,800 --> 01:22:28,400
you need to build sovereignty into the core. Key takeaways for enterprise leaders. If you are the one

1949
01:22:28,400 --> 01:22:32,720
in charge of rolling out Copilot, the choices you make right now will decide if it works or fails.

1950
01:22:32,720 --> 01:22:37,120
These aren't just IT tickets. These are business decisions that carry real risks and real costs.

1951
01:22:37,120 --> 01:22:41,360
The first thing to realize is that scale changes everything. If you have 10,000 pages,

1952
01:22:41,360 --> 01:22:46,320
the standard approach works fine. You just index it all and call it a day, but 3.5 million pages

1953
01:22:46,320 --> 01:22:50,400
will break that model every single time. The money doesn't make sense. The speed drops, the noise gets

1954
01:22:50,400 --> 01:22:54,960
too loud. At this scale, you stop asking if you can index something. You start asking if that data

1955
01:22:54,960 --> 01:22:59,280
is actually worth the cost of making it smart. That is a massive shift in how you think. It means

1956
01:22:59,280 --> 01:23:04,080
you have to prioritize. A leader who gets this will spend their budget wisely. A leader who doesn't

1957
01:23:04,080 --> 01:23:08,960
will go broke trying to index junk. Governance is not a nice-to-have feature. It is the foundation

1958
01:23:08,960 --> 01:23:13,040
of the whole building. Sometimes people try to bolt security on at the end of a project.

1959
01:23:13,040 --> 01:23:17,360
At this scale, that is impossible. Your permission model is what makes the system work. It is what

1960
01:23:17,360 --> 01:23:21,840
keeps your data safe. Your audit trails are the only thing standing between you and a massive fine.

1961
01:23:21,840 --> 01:23:26,160
You have to build governance in from the very first day. Yes, it adds some complexity and a little

1962
01:23:26,160 --> 01:23:30,800
bit of lag, but paying that price now is much cheaper than trying to fix a data breach later.

1963
01:23:30,800 --> 01:23:35,200
Selective activation is a strategy, not a failure. Not all of your data is equally important.

1964
01:23:35,200 --> 01:23:39,680
Your legal filings and key depositions deserve the best indexing and the most expensive processing.

1965
01:23:39,680 --> 01:23:43,840
Your old background files and supporting documents can live in a cheaper, lighter index.

1966
01:23:43,840 --> 01:23:48,240
Your archives can stay in cold storage until they are needed. This kind of tiering is how you

1967
01:23:48,240 --> 01:23:52,800
survive at scale without an unlimited budget. It is the difference between hoping your system can

1968
01:23:52,800 --> 01:23:57,600
grow and knowing it will. You also have to measure everything. You cannot fix what you aren't tracking.

1969
01:23:57,600 --> 01:24:01,760
This means you need metrics from the moment you turn the system on. You need to know the cost

1970
01:24:01,760 --> 01:24:06,320
per query and the accuracy of the answers. You need to hear what the users actually think.

1971
01:24:06,320 --> 01:24:10,080
Set your baselines before you go live so you can see when things start to drift.

1972
01:24:10,080 --> 01:24:14,400
The teams that measure their progress are the ones that succeed. The ones that don't just wander

1973
01:24:14,400 --> 01:24:19,200
into bad performance without even knowing why. There is always a trade-off between speed and cost.

1974
01:24:19,200 --> 01:24:24,400
If you want better accuracy, you might add a re-ranking step, but that adds 150 milliseconds

1975
01:24:24,400 --> 01:24:29,120
to the wait. If you want better security, you add permission filters, but that adds overhead.

1976
01:24:29,120 --> 01:24:33,360
Every single improvement has a price tag attached to it. The question is whether that improvement is

1977
01:24:33,360 --> 01:24:39,040
worth it for your specific case. A basic chatbot needs to be fast so you might skip the extra steps.

1978
01:24:39,040 --> 01:24:43,280
A legal tool needs to be perfect so you pay the price for precision. You have to make these choices

1979
01:24:43,280 --> 01:24:48,400
on purpose. Lastly, a phased rollout is just good risk management. It isn't a delay. Start with the

1980
01:24:48,400 --> 01:24:52,720
low-risk areas and the users who understand that the tech isn't perfect yet. Listen to their feedback.

1981
01:24:52,720 --> 01:24:56,400
That feedback will tell you what is actually happening in the real world before you try to

1982
01:24:56,400 --> 01:25:01,200
scale to the whole company. It is much cheaper to fix a mistake for 100 people than it is for a million.

1983
01:25:01,200 --> 01:25:05,440
If you rush the deployment, you will hit a wall. If you move deliberately, you can fix the problems

1984
01:25:05,440 --> 01:25:10,400
while they are still small and then scale with total confidence. The Epstein files case proves

1985
01:25:10,400 --> 01:25:14,800
that running Copilot at a massive scale isn't just about finding data. It's about orchestration.

1986
01:25:14,800 --> 01:25:18,800
The system only works because every single part has a specific coordinated job to do.

1987
01:25:18,800 --> 01:25:23,280
Selective activation focuses on the data that actually matters, while structure-aware chunking

1988
01:25:23,280 --> 01:25:27,440
keeps the original meaning of the documents intact. Agentic routing figures out what the user

1989
01:25:27,440 --> 01:25:32,160
actually wants and hybrid retrieval pulls that signal out from millions of pages. This entire

1990
01:25:32,160 --> 01:25:36,320
choreography, including permission-aware security and constant monitoring for drift, is the only

1991
01:25:36,320 --> 01:25:40,880
reason three and a half million pages stay searchable. The old way of thinking doesn't work anymore.

1992
01:25:40,880 --> 01:25:44,720
If you treat this like a simple data problem, you're going to fail. Success comes when you treat

1993
01:25:44,720 --> 01:25:49,280
it as an architecture problem instead. The real difference here isn't how complex the tools are,

1994
01:25:49,280 --> 01:25:53,520
but how intentional you are with the design. If these principles changed how you think about building

1995
01:25:53,520 --> 01:25:59,440
at scale, leave a review for the M365 FM Podcast. It helps more people find these insights on

1996
01:25:59,440 --> 01:26:04,400
Microsoft 365 and the modern workplace. Follow me, Mirko Peters, on LinkedIn and share the

1997
01:26:04,400 --> 01:26:09,120
challenges you're facing right now. We'll figure out the next generation of Enterprise AI together.