June 18, 2026

Stop Leaking Data: How to Run Local Llama on Your SharePoint Files

Stop Leaking Data: How to Run Local Llama on Your SharePoint Files
Stop Leaking Data: How to Run Local Llama on Your SharePoint Files
M365 FM Podcast
Stop Leaking Data: How to Run Local Llama on Your SharePoint Files
Apple Podcasts podcast player iconSpotify podcast player iconYoutube Music podcast player iconSpreaker podcast player iconPodchaser podcast player iconAmazon Music podcast player icon

AI is transforming the way organizations work with knowledge, documents, and collaboration platforms. But as more businesses adopt AI-powered assistants and large language models, one critical question continues to surface: how can you unlock the power of AI without exposing sensitive corporate information to external services?In this episode, we explore how organizations can run Local Llama models directly against SharePoint content while maintaining full control over their data. Instead of sending confidential documents, intellectual property, customer records, and internal knowledge to cloud-hosted AI services, local AI architectures provide a powerful alternative that prioritizes privacy, governance, and security.Our discussion breaks down the practical steps required to connect locally hosted large language models with SharePoint data sources. We examine the technologies involved, the infrastructure considerations, and the trade-offs between convenience and data sovereignty. Whether you are an IT professional, Microsoft 365 administrator, security architect, or AI enthusiast, this episode provides valuable insights into building private AI solutions on top of your existing Microsoft 365 environment.

UNDERSTANDING THE DATA PRIVACY CHALLENGE

As organizations rush to embrace generative AI, many overlook the risks associated with sending sensitive business data to third-party platforms. Data leakage, compliance concerns, and regulatory requirements are becoming major factors in AI adoption strategies.We discuss:

  • Why data sovereignty matters in the age of AI
  • Common risks associated with public AI services
  • Regulatory and compliance considerations
  • How local AI models can reduce exposure risks
WHAT IS LOCAL LLAMA?

Local Llama models have emerged as one of the most exciting developments in the open-source AI ecosystem. Running AI models locally gives organizations complete ownership of both the infrastructure and the data processing pipeline.During the conversation, we explain how Local Llama works, the hardware requirements involved, and how organizations can begin experimenting with private AI deployments without massive cloud costs.

CONNECTING SHAREPOINT TO PRIVATE AI

SharePoint remains one of the largest repositories of enterprise knowledge. From project documentation and operational procedures to contracts and meeting notes, organizations store enormous amounts of valuable information inside Microsoft 365.

Key topics include:
  • Indexing SharePoint content securely
  • Retrieval-Augmented Generation (RAG) architectures
  • Document embeddings and semantic search
  • Building intelligent chat experiences on internal data
REAL-WORLD DEPLOYMENT STRATEGIES

Moving from a proof of concept to production requires careful planning. We explore deployment patterns that balance performance, scalability, security, and user experience.Listeners will learn about infrastructure design, GPU considerations, storage requirements, monitoring, and operational best practices. We also discuss common implementation mistakes and how organizations can avoid them while delivering meaningful business value.

THE FUTURE OF PRIVATE ENTERPRISE AI

The future of enterprise AI may not belong exclusively to cloud-hosted models. As local AI technology continues to evolve, organizations are gaining more options to build intelligent systems that keep sensitive information under their control.This episode examines how private AI solutions could reshape knowledge management, enterprise search, productivity workflows, and digital workplace experiences across Microsoft 365 environments.

WHY YOU SHOULD LISTEN

If you're evaluating AI adoption within your organization, concerned about data privacy, or looking for practical ways to leverage SharePoint content with large language models, this episode delivers actionable insights and real-world guidance. Learn how to combine the power of modern AI with the security and governance requirements that today's businesses demand.Tune in to discover how Local Llama, SharePoint, and private AI architectures can work together to unlock organizational knowledge without compromising data security.

Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.

🚀 Want to be part of m365.fm?

Then stop just listening… and start showing up.

👉 Connect with me on LinkedIn and let’s make something happen:

  • 🎙️ Be a podcast guest and share your story
  • 🎧 Host your own episode (yes, seriously)
  • 💡 Pitch topics the community actually wants to hear
  • 🌍 Build your personal brand in the Microsoft 365 space

This isn’t just a podcast — it’s a platform for people who take action.

🔥 Most people wait. The best ones don’t.

👉 Connect with me on LinkedIn and send me a message:
"I want in"

Let’s build something awesome 👊

1
00:00:00,000 --> 00:00:02,440
Your AI strategy was supposed to protect your data,

2
00:00:02,440 --> 00:00:04,200
but in reality it's doing the opposite,

3
00:00:04,200 --> 00:00:05,800
not because of bad intentions,

4
00:00:05,800 --> 00:00:07,140
because of the model behind it.

5
00:00:07,140 --> 00:00:09,120
Cloud AI requires cloud connectivity.

6
00:00:09,120 --> 00:00:11,560
Cloud connectivity means logging, retention,

7
00:00:11,560 --> 00:00:13,760
and legal compulsion you can't opt out of.

8
00:00:13,760 --> 00:00:15,720
Your SharePoint documents deserve better.

9
00:00:15,720 --> 00:00:18,080
Today, I will show you how to run Lama locally,

10
00:00:18,080 --> 00:00:19,720
connect it to your SharePoint libraries,

11
00:00:19,720 --> 00:00:21,200
and build a retrieval system

12
00:00:21,200 --> 00:00:23,400
that never touches the public internet.

13
00:00:23,400 --> 00:00:26,240
Sovereign Intelligence, not Sovereign Cloud.

14
00:00:26,240 --> 00:00:28,200
Here is what most organizations miss.

15
00:00:28,200 --> 00:00:30,280
The moment you connect a cloud AI assistant

16
00:00:30,280 --> 00:00:32,640
to your SharePoint, your contracts, your HR files,

17
00:00:32,640 --> 00:00:34,720
your strategy documents, and your board memos,

18
00:00:34,720 --> 00:00:37,040
enter a pipeline you don't control.

19
00:00:37,040 --> 00:00:39,560
They travel across network boundaries, you didn't architect.

20
00:00:39,560 --> 00:00:41,600
They sit in log files, you can't delete,

21
00:00:41,600 --> 00:00:43,840
and in many jurisdictions, they become subject

22
00:00:43,840 --> 00:00:45,720
to legal requests you can't refuse.

23
00:00:45,720 --> 00:00:47,840
Microsoft's own WorkTrend Index tells us

24
00:00:47,840 --> 00:00:49,760
that roughly three out of four knowledge workers

25
00:00:49,760 --> 00:00:51,920
now use generative AI in some form.

26
00:00:51,920 --> 00:00:54,640
Nearly half of them started in just the last several months,

27
00:00:54,640 --> 00:00:56,440
that acceleration isn't a statistic.

28
00:00:56,440 --> 00:00:57,160
It's a signal.

29
00:00:57,160 --> 00:00:59,680
The signal says your organization's most sensitive content

30
00:00:59,680 --> 00:01:01,080
is now being queried through models

31
00:01:01,080 --> 00:01:03,520
that exist outside your perimeter, trained on data

32
00:01:03,520 --> 00:01:06,960
you didn't approve, and retained in ways you can't audit.

33
00:01:06,960 --> 00:01:09,120
The risk isn't that Microsoft will misuse your data.

34
00:01:09,120 --> 00:01:10,200
The risk is structural.

35
00:01:10,200 --> 00:01:12,120
When you send a prompt to a cloud LLM,

36
00:01:12,120 --> 00:01:13,760
that prompt carries context.

37
00:01:13,760 --> 00:01:15,360
That context carries document chunks.

38
00:01:15,360 --> 00:01:17,560
Those chunks carry proprietary information.

39
00:01:17,560 --> 00:01:19,600
And once that information leaves your network,

40
00:01:19,600 --> 00:01:22,440
it enters a legal and technical framework that's not yours.

41
00:01:22,440 --> 00:01:23,600
This isn't theoretical.

42
00:01:23,600 --> 00:01:25,240
Organizations in regulated industries

43
00:01:25,240 --> 00:01:26,880
already face hard blockers.

44
00:01:26,880 --> 00:01:28,800
Healthcare providers can't send patient records

45
00:01:28,800 --> 00:01:30,680
to external APIs under HIPAA.

46
00:01:30,680 --> 00:01:32,480
Financial institutions can't expose

47
00:01:32,480 --> 00:01:34,800
trading strategies to cross-border processing.

48
00:01:34,800 --> 00:01:36,160
Government bodies can't risk

49
00:01:36,160 --> 00:01:37,960
extraterritorial legal compulsion.

50
00:01:37,960 --> 00:01:40,640
The cloud act in the United States allows authorities

51
00:01:40,640 --> 00:01:42,200
to compel disclosure of data.

52
00:01:42,200 --> 00:01:44,520
Even when it's stored in European data centers,

53
00:01:44,520 --> 00:01:46,080
contracts don't eliminate that risk.

54
00:01:46,080 --> 00:01:47,960
They merely define who pays when it happens.

55
00:01:47,960 --> 00:01:49,760
So the question isn't whether AI is useful.

56
00:01:49,760 --> 00:01:51,920
The question is whether your current architecture respects

57
00:01:51,920 --> 00:01:54,400
the boundary between your data and the rest of the world.

58
00:01:54,400 --> 00:01:55,960
Most organizations assume it does.

59
00:01:55,960 --> 00:01:57,960
They assume that enterprise agreements,

60
00:01:57,960 --> 00:02:01,080
data processing addendums and region selection checkboxes

61
00:02:01,080 --> 00:02:02,320
create a sufficient barrier.

62
00:02:02,320 --> 00:02:03,080
They don't.

63
00:02:03,080 --> 00:02:04,560
Those tools improve transparency.

64
00:02:04,560 --> 00:02:06,080
They don't create sovereignty.

65
00:02:06,080 --> 00:02:08,080
Sovereignty means you control the hardware,

66
00:02:08,080 --> 00:02:11,400
the network, the model, the logs, and the legal jurisdiction.

67
00:02:11,400 --> 00:02:13,880
Anything less is delegation dressed up as protection.

68
00:02:13,880 --> 00:02:15,200
And delegation fails.

69
00:02:15,200 --> 00:02:17,600
The moment the delegated party faces a legal obligation,

70
00:02:17,600 --> 00:02:18,840
you can't override.

71
00:02:18,840 --> 00:02:19,800
That's the hidden leak.

72
00:02:19,800 --> 00:02:20,720
It's not a bug.

73
00:02:20,720 --> 00:02:22,080
It's the architecture.

74
00:02:22,080 --> 00:02:24,040
Why sovereign cloud is not enough?

75
00:02:24,040 --> 00:02:25,720
You have probably heard that sovereign cloud

76
00:02:25,720 --> 00:02:26,400
is the answer.

77
00:02:26,400 --> 00:02:27,560
Microsoft offers it.

78
00:02:27,560 --> 00:02:28,800
Other providers offer it too.

79
00:02:28,800 --> 00:02:29,840
The promise is appealing.

80
00:02:29,840 --> 00:02:31,360
Your data stays in your region.

81
00:02:31,360 --> 00:02:32,840
Processing happens locally.

82
00:02:32,840 --> 00:02:34,320
Access controls are stricter.

83
00:02:34,320 --> 00:02:35,480
Transparency improves.

84
00:02:35,480 --> 00:02:36,480
But here's the problem.

85
00:02:36,480 --> 00:02:38,200
Sovereign cloud is still cloud.

86
00:02:38,200 --> 00:02:39,960
And cloud means a provider headquartered

87
00:02:39,960 --> 00:02:42,840
in a jurisdiction that can compel access to your data

88
00:02:42,840 --> 00:02:44,640
regardless of where the server sits.

89
00:02:44,640 --> 00:02:47,320
Legal analysis from active mind legal makes this explicit.

90
00:02:47,320 --> 00:02:50,080
As long as US laws like the Cloud Act remain in force,

91
00:02:50,080 --> 00:02:52,640
US based companies can be compelled to transfer data

92
00:02:52,640 --> 00:02:55,520
to US authorities even when that data is physically stored

93
00:02:55,520 --> 00:02:56,560
in Europe.

94
00:02:56,560 --> 00:02:58,480
Microsoft has acknowledged this tension.

95
00:02:58,480 --> 00:03:01,120
Their sovereign cloud for Europe promises stricter controls

96
00:03:01,120 --> 00:03:02,480
in regional processing.

97
00:03:02,480 --> 00:03:05,680
It can't promise immunity from extraterritorial legal obligations

98
00:03:05,680 --> 00:03:08,960
because Microsoft is a US corporation subject to US law.

99
00:03:08,960 --> 00:03:10,600
This isn't a criticism of Microsoft.

100
00:03:10,600 --> 00:03:11,920
It's a statement about structure.

101
00:03:11,920 --> 00:03:14,920
No contractual assurance can override a statutory compulsion.

102
00:03:14,920 --> 00:03:16,600
No checkbox can change the jurisdiction

103
00:03:16,600 --> 00:03:17,840
of the parent company.

104
00:03:17,840 --> 00:03:19,160
And no amount of marketing language

105
00:03:19,160 --> 00:03:21,760
can turn a hosted service into an owned system.

106
00:03:21,760 --> 00:03:24,600
For many organizations, this distinction is academic.

107
00:03:24,600 --> 00:03:27,600
For others, it's a hard blocker, public sector bodies,

108
00:03:27,600 --> 00:03:31,000
critical infrastructure operators, regulated industries,

109
00:03:31,000 --> 00:03:33,840
organizations subject to GDPR article 44 restrictions

110
00:03:33,840 --> 00:03:35,400
on international transfers.

111
00:03:35,400 --> 00:03:37,000
These entities need data processing

112
00:03:37,000 --> 00:03:38,760
that's not merely resident in the right region,

113
00:03:38,760 --> 00:03:40,720
but legally and technically insulated

114
00:03:40,720 --> 00:03:42,000
from external access.

115
00:03:42,000 --> 00:03:44,760
GDPR requires that personal data be processed lawfully,

116
00:03:44,760 --> 00:03:46,280
fairly and transparently.

117
00:03:46,280 --> 00:03:49,560
It requires appropriate technical and organizational measures.

118
00:03:49,560 --> 00:03:51,320
And when using third party processes,

119
00:03:51,320 --> 00:03:53,840
controllers must ensure that processing agreements match

120
00:03:53,840 --> 00:03:56,280
with GDPR's protections, including restrictions

121
00:03:56,280 --> 00:03:58,880
on transfers to countries without adequate protection.

122
00:03:58,880 --> 00:04:01,080
The standard contractual clauses and adequacy decisions

123
00:04:01,080 --> 00:04:02,560
that underpin many cloud arrangements

124
00:04:02,560 --> 00:04:06,320
are being challenged, renegotiated, and in some cases invalidated.

125
00:04:06,320 --> 00:04:07,840
The legal environment is shifting,

126
00:04:07,840 --> 00:04:12,240
building on assumptions that held in 2022 is a risk in 2026.

127
00:04:12,240 --> 00:04:13,560
So sovereign cloud is a step.

128
00:04:13,560 --> 00:04:14,680
It's not the destination.

129
00:04:14,680 --> 00:04:16,040
The destination is an architecture

130
00:04:16,040 --> 00:04:18,520
where your data never leaves your control in the first place,

131
00:04:18,520 --> 00:04:20,440
where the model runs on your hardware,

132
00:04:20,440 --> 00:04:22,880
where the retrieval index lives on your network,

133
00:04:22,880 --> 00:04:25,560
where the query logs stay inside your perimeter,

134
00:04:25,560 --> 00:04:27,520
where the legal framework is the one you chose,

135
00:04:27,520 --> 00:04:29,440
not the one your provider is subject to.

136
00:04:29,440 --> 00:04:30,640
That architecture exists.

137
00:04:30,640 --> 00:04:32,440
It's called air-gapped intelligence,

138
00:04:32,440 --> 00:04:34,320
and it's what we're building today.

139
00:04:34,320 --> 00:04:35,920
The air-gapped alternative.

140
00:04:35,920 --> 00:04:38,240
Air-gapped doesn't mean disconnected from everything.

141
00:04:38,240 --> 00:04:40,160
It means disconnected from the public internet

142
00:04:40,160 --> 00:04:41,480
for the parts that matter.

143
00:04:41,480 --> 00:04:44,960
Your sharepoint still connects to Microsoft 365 for collaboration.

144
00:04:44,960 --> 00:04:48,040
Your users still authenticate through Microsoft Enter ID.

145
00:04:48,040 --> 00:04:50,520
Your document lifecycle still follows the governance rules

146
00:04:50,520 --> 00:04:51,520
you already built,

147
00:04:51,520 --> 00:04:53,520
but the AI layer runs inside your perimeter.

148
00:04:53,520 --> 00:04:55,640
The LLM sits on your GPU server.

149
00:04:55,640 --> 00:04:58,200
The vector database sits on your local network.

150
00:04:58,200 --> 00:05:00,840
The query interface resolves to an internal IP.

151
00:05:00,840 --> 00:05:02,360
And when a user asks a question,

152
00:05:02,360 --> 00:05:04,600
the answer is generated without a single packet

153
00:05:04,600 --> 00:05:06,080
leaving your controlled environment.

154
00:05:06,080 --> 00:05:07,920
This is zero trust applied to AI.

155
00:05:07,920 --> 00:05:10,720
PaloAlton Networks defines zero trust architecture

156
00:05:10,720 --> 00:05:14,200
as assuming no user or system is inherently trustworthy

157
00:05:14,200 --> 00:05:16,400
and requiring continuous verification.

158
00:05:16,400 --> 00:05:18,480
For AI, this means the ingestion service

159
00:05:18,480 --> 00:05:20,960
authenticates against sharepoint using OOOs.

160
00:05:20,960 --> 00:05:23,440
The vector database enforces role-based access control

161
00:05:23,440 --> 00:05:24,520
at the collection level.

162
00:05:24,520 --> 00:05:26,120
The query interface checks permissions

163
00:05:26,120 --> 00:05:27,560
before returning results.

164
00:05:27,560 --> 00:05:29,200
And the LLM runtime is isolated

165
00:05:29,200 --> 00:05:31,400
from outbound connectivity entirely.

166
00:05:31,400 --> 00:05:33,560
NIST describes role-based access control

167
00:05:33,560 --> 00:05:35,520
as enforcing three basic rules.

168
00:05:35,520 --> 00:05:38,840
Role assignment, role authorization, permission authorization,

169
00:05:38,840 --> 00:05:40,800
uses only exercise permissions consistent

170
00:05:40,800 --> 00:05:42,520
with their authorized roles.

171
00:05:42,520 --> 00:05:45,240
In our architecture, this applies at every layer.

172
00:05:45,240 --> 00:05:46,920
The ingestion service has a role that allows

173
00:05:46,920 --> 00:05:49,400
read access to specific sharepoint libraries.

174
00:05:49,400 --> 00:05:50,560
The vector database collections

175
00:05:50,560 --> 00:05:53,240
are tagged with the permission levels required to query them.

176
00:05:53,240 --> 00:05:56,280
The chat interface verifies the user's EntraID group membership

177
00:05:56,280 --> 00:05:57,640
before constructing the prompt.

178
00:05:57,640 --> 00:05:58,960
The result isn't paranoia.

179
00:05:58,960 --> 00:05:59,880
It's precision.

180
00:05:59,880 --> 00:06:01,720
Every document chunk carries its source.

181
00:06:01,720 --> 00:06:03,320
Every answer carries its citation.

182
00:06:03,320 --> 00:06:05,240
Every query carries its audit trail.

183
00:06:05,240 --> 00:06:06,680
And none of it leaves your building.

184
00:06:06,680 --> 00:06:09,120
Let me walk you through what this looks like in practice.

185
00:06:09,120 --> 00:06:10,680
SharePoint holds your documents.

186
00:06:10,680 --> 00:06:12,640
A local ingestion service authenticates

187
00:06:12,640 --> 00:06:15,920
via Microsoft EntraID, enumerates your libraries

188
00:06:15,920 --> 00:06:17,440
and extracts the content.

189
00:06:17,440 --> 00:06:19,000
A chunking engine breaks documents

190
00:06:19,000 --> 00:06:21,000
into semantically meaningful pieces.

191
00:06:21,000 --> 00:06:22,960
A local embedding model converts those pieces

192
00:06:22,960 --> 00:06:24,440
into numerical vectors.

193
00:06:24,440 --> 00:06:27,160
A vector database stores and indexes those vectors.

194
00:06:27,160 --> 00:06:29,480
A local Lama instance waits for queries.

195
00:06:29,480 --> 00:06:32,120
And a simple web interface lets your team ask questions

196
00:06:32,120 --> 00:06:34,120
and get grounded sighted answers.

197
00:06:34,120 --> 00:06:35,200
That's the architecture.

198
00:06:35,200 --> 00:06:37,840
Seven layers, all local, all under your control.

199
00:06:37,840 --> 00:06:40,120
The ingestion service runs on a modest server.

200
00:06:40,120 --> 00:06:42,640
It needs CPU, memory, and network access to SharePoint.

201
00:06:42,640 --> 00:06:44,040
It doesn't need a GPU.

202
00:06:44,040 --> 00:06:45,640
The chunking engine runs alongside it.

203
00:06:45,640 --> 00:06:47,720
The embedding model needs a GPU for speed

204
00:06:47,720 --> 00:06:50,240
but can fall back to CPU for smaller batches.

205
00:06:50,240 --> 00:06:52,480
The vector database needs fast SSD storage

206
00:06:52,480 --> 00:06:54,880
and enough RAM to hold the HNSW index.

207
00:06:54,880 --> 00:06:58,360
The LLM runtime needs the biggest GPU you can afford.

208
00:06:58,360 --> 00:07:00,400
And the query interface needs minimal resources

209
00:07:00,400 --> 00:07:01,760
because it's just a web application

210
00:07:01,760 --> 00:07:03,640
orchestrating calls to the other layers.

211
00:07:03,640 --> 00:07:06,400
This modular resource allocation means you can start small.

212
00:07:06,400 --> 00:07:08,200
A single server with a mid-range GPU

213
00:07:08,200 --> 00:07:09,840
can run the entire stack for a pilot

214
00:07:09,840 --> 00:07:12,320
with 5,000 documents and 50 daily users.

215
00:07:12,320 --> 00:07:15,520
As you grow, you move components to dedicated hardware.

216
00:07:15,520 --> 00:07:18,240
The vector database gets its own server with fast disks.

217
00:07:18,240 --> 00:07:21,000
The LLM runtime gets a dedicated GPU workstation.

218
00:07:21,000 --> 00:07:22,720
The ingestion service scales horizontally

219
00:07:22,720 --> 00:07:23,920
by adding more workers.

220
00:07:23,920 --> 00:07:26,440
You don't need to buy enterprise hardware on day one.

221
00:07:26,440 --> 00:07:28,320
But before we build it, you need to understand

222
00:07:28,320 --> 00:07:29,960
why retrieval isn't optional.

223
00:07:29,960 --> 00:07:32,000
It's the difference between a useful system

224
00:07:32,000 --> 00:07:34,360
and an expensive hallucination machine.

225
00:07:34,360 --> 00:07:36,440
The Ragnparative, large language models

226
00:07:36,440 --> 00:07:37,920
are patent completion engines.

227
00:07:37,920 --> 00:07:40,360
They predict the next token based on statistical patterns

228
00:07:40,360 --> 00:07:41,640
learned from training data.

229
00:07:41,640 --> 00:07:43,160
They don't know your organization.

230
00:07:43,160 --> 00:07:44,240
They don't know your contracts.

231
00:07:44,240 --> 00:07:45,800
They don't know your procedures.

232
00:07:45,800 --> 00:07:47,800
And when you ask them a question about content,

233
00:07:47,800 --> 00:07:48,840
they have never seen.

234
00:07:48,840 --> 00:07:50,640
They invent plausible sounding answers.

235
00:07:50,640 --> 00:07:52,160
That invention is called hallucination.

236
00:07:52,160 --> 00:07:53,240
It's not a rare bug.

237
00:07:53,240 --> 00:07:55,880
It's a fundamental property of how these models work.

238
00:07:55,880 --> 00:07:59,240
A full survey from RxIV categorizes hallucination mitigation

239
00:07:59,240 --> 00:08:01,640
into prompt engineering decoding constraints,

240
00:08:01,640 --> 00:08:04,120
training interventions and retrieval-based methods.

241
00:08:04,120 --> 00:08:04,960
The conclusion is clear.

242
00:08:04,960 --> 00:08:06,440
For factual grounding and changing

243
00:08:06,440 --> 00:08:10,440
or proprietary content, retrieval is the most reliable approach.

244
00:08:10,440 --> 00:08:12,640
Retrieval augmented generation or Ragn

245
00:08:12,640 --> 00:08:15,280
inserts a retrieval step between the user and the model.

246
00:08:15,280 --> 00:08:17,000
AWS describes it simply.

247
00:08:17,000 --> 00:08:18,880
User input retrieves relevant information

248
00:08:18,880 --> 00:08:20,160
from a new data source.

249
00:08:20,160 --> 00:08:22,000
The combined query and retrieved context

250
00:08:22,000 --> 00:08:23,200
passed to the LLM.

251
00:08:23,200 --> 00:08:25,320
The result is generated from authoritative data,

252
00:08:25,320 --> 00:08:26,880
not from statistical guessing.

253
00:08:26,880 --> 00:08:29,600
Immutar adds the enterprise security perspective.

254
00:08:29,600 --> 00:08:31,800
Ragn converts external data into embeddings,

255
00:08:31,800 --> 00:08:33,480
stores them in a vector database,

256
00:08:33,480 --> 00:08:35,480
retrieves the most relevant chunks for a query

257
00:08:35,480 --> 00:08:37,240
and integrates them into the prompt.

258
00:08:37,240 --> 00:08:39,080
And every step security must be enforced.

259
00:08:39,080 --> 00:08:42,560
Storage tier, data tier, prompt tier.

260
00:08:42,560 --> 00:08:45,120
Microsoft's Azure Architecture Center agrees.

261
00:08:45,120 --> 00:08:46,760
Ragn is the industry standard approach

262
00:08:46,760 --> 00:08:49,040
to using language models with proprietary data.

263
00:08:49,040 --> 00:08:50,840
Each step from chunking and embedding

264
00:08:50,840 --> 00:08:53,600
to retrieval and evaluation must be carefully designed

265
00:08:53,600 --> 00:08:54,440
and measured.

266
00:08:54,440 --> 00:08:56,000
Here is the pipeline in practical terms.

267
00:08:56,000 --> 00:08:57,760
Your SharePoint documents contain text.

268
00:08:57,760 --> 00:08:59,200
That text gets broken into chunks.

269
00:08:59,200 --> 00:09:01,680
Each chunk gets converted into a numerical vector

270
00:09:01,680 --> 00:09:03,480
that captures its semantic meaning.

271
00:09:03,480 --> 00:09:05,720
Those vectors get stored in a specialized database

272
00:09:05,720 --> 00:09:06,840
called a vector database.

273
00:09:06,840 --> 00:09:08,160
When a user asks a question,

274
00:09:08,160 --> 00:09:10,680
that question also gets converted into a vector.

275
00:09:10,680 --> 00:09:12,640
The database compares the question vector

276
00:09:12,640 --> 00:09:14,200
against all the document vectors

277
00:09:14,200 --> 00:09:15,800
and returns the closest matches.

278
00:09:15,800 --> 00:09:18,400
Those matches get inserted into the prompt sent to the LLM.

279
00:09:18,400 --> 00:09:20,640
The LLM now has both its general training

280
00:09:20,640 --> 00:09:22,840
and your specific documents as context.

281
00:09:22,840 --> 00:09:25,600
It generates an answer grounded in your actual content.

282
00:09:25,600 --> 00:09:28,080
This matters because fine tuning isn't a substitute.

283
00:09:28,080 --> 00:09:30,360
When you fine tune a model on internal documents,

284
00:09:30,360 --> 00:09:33,000
you bake specific information into the model weights.

285
00:09:33,000 --> 00:09:34,440
Updates become expensive.

286
00:09:34,440 --> 00:09:35,720
Privacy questions multiply

287
00:09:35,720 --> 00:09:37,160
because private data used in training

288
00:09:37,160 --> 00:09:39,640
can be memorized and inadvertently reproduced.

289
00:09:39,640 --> 00:09:42,520
A recent case study comparing rag against fine tuning

290
00:09:42,520 --> 00:09:44,640
finds that rag offers better factual accuracy

291
00:09:44,640 --> 00:09:45,680
and maintainability,

292
00:09:45,680 --> 00:09:47,920
especially when the knowledge-based changes frequently.

293
00:09:47,920 --> 00:09:49,920
Your SharePoint content changes daily.

294
00:09:49,920 --> 00:09:51,760
Rag reflects those changes immediately.

295
00:09:51,760 --> 00:09:53,120
Fine tuning doesn't.

296
00:09:53,120 --> 00:09:54,640
There's a mistake almost everyone makes

297
00:09:54,640 --> 00:09:56,320
when chunking SharePoint documents.

298
00:09:56,320 --> 00:09:58,080
They use the same chunk size for everything.

299
00:09:58,080 --> 00:10:02,000
PDFs, word docs, Excel sheets, PowerPoint decks.

300
00:10:02,000 --> 00:10:04,520
Each document type has a different structure.

301
00:10:04,520 --> 00:10:07,720
A uniform chunking strategy destroys retrieval accuracy

302
00:10:07,720 --> 00:10:10,120
because it breaks semantic boundaries in some documents

303
00:10:10,120 --> 00:10:11,440
and creates noise in others.

304
00:10:11,440 --> 00:10:13,760
I will show you exactly how to fix this later.

305
00:10:13,760 --> 00:10:16,200
For now, remember that retrieval isn't a bolt on.

306
00:10:16,200 --> 00:10:18,840
It's the foundation of trust where the enterprise AI.

307
00:10:18,840 --> 00:10:20,320
The architecture we're building doesn't send

308
00:10:20,320 --> 00:10:22,200
your documents to a model for training.

309
00:10:22,200 --> 00:10:24,560
It sends relevant chunks to a model for inference.

310
00:10:24,560 --> 00:10:25,720
The documents stay local.

311
00:10:25,720 --> 00:10:26,840
The embedding stay local.

312
00:10:26,840 --> 00:10:27,920
The model stays local.

313
00:10:27,920 --> 00:10:29,440
And the answers cite their sources.

314
00:10:29,440 --> 00:10:31,000
That's rag, that's the pattern.

315
00:10:31,000 --> 00:10:33,040
Now let us look at what we're retrieving from.

316
00:10:33,040 --> 00:10:34,960
SharePoint as the content backbone.

317
00:10:34,960 --> 00:10:36,440
SharePoint isn't just a file store.

318
00:10:36,440 --> 00:10:39,080
It's the governance backbone of your document ecosystem.

319
00:10:39,080 --> 00:10:41,520
Microsoft defines effective document management

320
00:10:41,520 --> 00:10:45,480
as specifying document types, templates, metadata, storage

321
00:10:45,480 --> 00:10:48,560
locations, access controls, workflows, and policies

322
00:10:48,560 --> 00:10:50,000
for auditing and retention.

323
00:10:50,000 --> 00:10:51,320
That's not a feature list.

324
00:10:51,320 --> 00:10:53,320
It's a description of how your organization already

325
00:10:53,320 --> 00:10:54,600
manages knowledge.

326
00:10:54,600 --> 00:10:57,120
When you build an AI layer on top of SharePoint,

327
00:10:57,120 --> 00:10:58,600
you're not starting from scratch.

328
00:10:58,600 --> 00:11:00,160
You're extending an existing system.

329
00:11:00,160 --> 00:11:02,120
SharePoint already knows who can see what.

330
00:11:02,120 --> 00:11:03,480
It already tracks versions.

331
00:11:03,480 --> 00:11:06,480
It already enforces retention policies through PerView.

332
00:11:06,480 --> 00:11:08,720
It already logs access through audit trails.

333
00:11:08,720 --> 00:11:11,160
Any AI system that bypasses these controls

334
00:11:11,160 --> 00:11:12,560
creates shadow governance.

335
00:11:12,560 --> 00:11:14,600
And shadow governance is where data breaches happen.

336
00:11:14,600 --> 00:11:16,640
The planning process for SharePoint document management

337
00:11:16,640 --> 00:11:17,920
is itself structured.

338
00:11:17,920 --> 00:11:20,720
Organizations identify document management roles.

339
00:11:20,720 --> 00:11:22,080
They analyze usage patterns.

340
00:11:22,080 --> 00:11:23,880
They plan site collections and libraries.

341
00:11:23,880 --> 00:11:27,000
They design content types that capture metadata and workflows.

342
00:11:27,000 --> 00:11:29,080
They configure approval processes.

343
00:11:29,080 --> 00:11:31,880
And they set policies for auditing, retention, and records

344
00:11:31,880 --> 00:11:32,480
management.

345
00:11:32,480 --> 00:11:33,840
This isn't overhead.

346
00:11:33,840 --> 00:11:37,320
It's the reason SharePoint is trusted in regulated environments.

347
00:11:37,320 --> 00:11:38,760
For our architecture, these structures

348
00:11:38,760 --> 00:11:40,240
are both an asset and a constraint.

349
00:11:40,240 --> 00:11:41,920
They provide rich metadata that can

350
00:11:41,920 --> 00:11:43,520
inform chunking decisions.

351
00:11:43,520 --> 00:11:45,680
A contract in the legal library carries different weight

352
00:11:45,680 --> 00:11:47,320
than a draft in the marketing folder.

353
00:11:47,320 --> 00:11:49,120
They provide clear access boundaries.

354
00:11:49,120 --> 00:11:51,360
The AI should never surface content from a library

355
00:11:51,360 --> 00:11:53,120
the user can't access directly.

356
00:11:53,120 --> 00:11:54,560
And they impose obligations.

357
00:11:54,560 --> 00:11:56,200
If a document is under legal hold,

358
00:11:56,200 --> 00:11:57,920
the AI must respect that hold.

359
00:11:57,920 --> 00:12:00,920
If a retention policy deletes a document after seven years,

360
00:12:00,920 --> 00:12:03,800
the AI must not preserve it indefinitely in a vector index.

361
00:12:03,800 --> 00:12:06,560
SharePoint exposes this content through multiple APIs.

362
00:12:06,560 --> 00:12:08,200
The traditional SharePoint REST API

363
00:12:08,200 --> 00:12:11,360
allows programmatic access to lists, libraries, and files.

364
00:12:11,360 --> 00:12:13,440
A PowerShell script can issue a get request

365
00:12:13,440 --> 00:12:16,120
against a library using a URL like your SharePoint site

366
00:12:16,120 --> 00:12:18,400
plus the API endpoint for list items.

367
00:12:18,400 --> 00:12:19,960
Appropriate headers and credentials

368
00:12:19,960 --> 00:12:21,800
retrieve the items for processing.

369
00:12:21,800 --> 00:12:24,720
The newer Microsoft Graph API provides a unified endpoint

370
00:12:24,720 --> 00:12:27,400
for SharePoint, OneDrive, Teams, and Exchange.

371
00:12:27,400 --> 00:12:30,160
And the Microsoft 365 Copilot Search API,

372
00:12:30,160 --> 00:12:33,720
currently in preview, allows hybrid semantic and lexical search

373
00:12:33,720 --> 00:12:36,160
over work content using natural language queries.

374
00:12:36,160 --> 00:12:38,200
For our local REC solution, these APIs

375
00:12:38,200 --> 00:12:40,000
provide the pipelines ingres point.

376
00:12:40,000 --> 00:12:41,720
A service running inside your perimeter

377
00:12:41,720 --> 00:12:44,360
authenticates against SharePoint using OAuth 2.0

378
00:12:44,360 --> 00:12:45,840
through Microsoft Enter ID.

379
00:12:45,840 --> 00:12:48,160
It enumerates libraries, it downloads documents,

380
00:12:48,160 --> 00:12:50,080
and it feeds them into the ingestion process

381
00:12:50,080 --> 00:12:52,240
without exposing content to external providers.

382
00:12:52,240 --> 00:12:53,240
This is critical.

383
00:12:53,240 --> 00:12:55,760
The ingestion service is the bridge between SharePoint

384
00:12:55,760 --> 00:12:56,960
and your local AI.

385
00:12:56,960 --> 00:12:58,480
It must authenticate securely.

386
00:12:58,480 --> 00:12:59,640
It must respect rate limits.

387
00:12:59,640 --> 00:13:03,040
It must handle versioning, and it must run inside your network.

388
00:13:03,040 --> 00:13:04,560
If you deploy the ingestion service

389
00:13:04,560 --> 00:13:07,440
in a cloud virtual machine, you have reintroduced

390
00:13:07,440 --> 00:13:09,120
the problem you're trying to solve.

391
00:13:09,120 --> 00:13:12,040
The authentication flow is standard Microsoft 365,

392
00:13:12,040 --> 00:13:14,120
register an application in Enter ID,

393
00:13:14,120 --> 00:13:16,120
granted application permissions for sites,

394
00:13:16,120 --> 00:13:17,640
read all or delegated permissions

395
00:13:17,640 --> 00:13:20,360
scoped to specific libraries, store the client secret

396
00:13:20,360 --> 00:13:23,080
or certificate in a local secret manager, not in code.

397
00:13:23,080 --> 00:13:25,720
Use the client credentials flow for background ingestion,

398
00:13:25,720 --> 00:13:28,120
and the on behalf of flow if you want user scoped queries

399
00:13:28,120 --> 00:13:29,920
that respect individual permissions.

400
00:13:29,920 --> 00:13:30,760
This isn't exotic.

401
00:13:30,760 --> 00:13:34,040
It's the same pattern you use for any Microsoft 365 integration.

402
00:13:34,040 --> 00:13:35,520
What changes is the destination.

403
00:13:35,520 --> 00:13:38,320
Instead of sending documents to a cloud AI service,

404
00:13:38,320 --> 00:13:40,400
you send them to a local chunking engine.

405
00:13:40,400 --> 00:13:42,480
Instead of calling a cloud embedding API,

406
00:13:42,480 --> 00:13:44,560
you call a local sentence transformer.

407
00:13:44,560 --> 00:13:46,560
Instead of storing vectors in a managed service,

408
00:13:46,560 --> 00:13:49,560
you store them in a local queue-drand or waviate instance.

409
00:13:49,560 --> 00:13:51,960
The APIs are the same, but the network path is different.

410
00:13:51,960 --> 00:13:52,760
That's the foundation.

411
00:13:52,760 --> 00:13:54,880
SharePoint isn't just where your documents live.

412
00:13:54,880 --> 00:13:56,440
It's where your governance lives,

413
00:13:56,440 --> 00:13:58,320
and our architecture preserves that governance

414
00:13:58,320 --> 00:13:59,800
while adding intelligence.

415
00:13:59,800 --> 00:14:01,280
But here is where most people get stuck.

416
00:14:01,280 --> 00:14:03,960
They assume they need a cloud LLM to make this useful.

417
00:14:03,960 --> 00:14:05,400
They look at the local deployment path

418
00:14:05,400 --> 00:14:08,240
and worry that the model will be too small, too slow,

419
00:14:08,240 --> 00:14:08,960
or too dumb.

420
00:14:08,960 --> 00:14:10,480
That assumption is outdated.

421
00:14:10,480 --> 00:14:13,880
Local versus cloud LLMs, the real trade-offs.

422
00:14:13,880 --> 00:14:16,960
Cloud LLM APIs offer undeniable advantages.

423
00:14:16,960 --> 00:14:19,600
Lower operational overhead, automatic scaling,

424
00:14:19,600 --> 00:14:21,800
access to frontier models with hundreds of billions

425
00:14:21,800 --> 00:14:22,560
of parameters.

426
00:14:22,560 --> 00:14:23,880
You don't manage drivers.

427
00:14:23,880 --> 00:14:25,360
You don't manage quantization.

428
00:14:25,360 --> 00:14:26,680
You don't manage fail-over.

429
00:14:26,680 --> 00:14:27,560
You send a prompt.

430
00:14:27,560 --> 00:14:28,600
You get an answer.

431
00:14:28,600 --> 00:14:30,240
But those advantages come with trade-offs

432
00:14:30,240 --> 00:14:32,280
that many organizations can't accept.

433
00:14:32,280 --> 00:14:33,960
Every prompt leaves your network.

434
00:14:33,960 --> 00:14:35,920
Every response passes through infrastructure

435
00:14:35,920 --> 00:14:37,040
you don't control.

436
00:14:37,040 --> 00:14:39,680
Every token incurs a cost that scales with usage.

437
00:14:39,680 --> 00:14:42,480
And the best models aren't available for local deployment

438
00:14:42,480 --> 00:14:44,520
at all because their weights are proprietary.

439
00:14:44,520 --> 00:14:46,520
Local LLM deployments flip that equation.

440
00:14:46,520 --> 00:14:48,600
The operational burden shifts to your team.

441
00:14:48,600 --> 00:14:50,440
The scaling responsibility becomes yours,

442
00:14:50,440 --> 00:14:52,280
but the data control becomes absolute.

443
00:14:52,280 --> 00:14:53,880
The cost becomes predictable,

444
00:14:53,880 --> 00:14:57,240
and the model becomes yours to configure, update, and audit.

445
00:14:57,240 --> 00:15:00,480
AI multiples comparison of cloud versus local LLMs

446
00:15:00,480 --> 00:15:03,160
notes that cloud models are attractive for organizations

447
00:15:03,160 --> 00:15:06,360
that prefer managed services and rapid experimentation.

448
00:15:06,360 --> 00:15:08,800
Local models are more suitable when data security

449
00:15:08,800 --> 00:15:10,320
and sovereignty are critical,

450
00:15:10,320 --> 00:15:12,600
and where organizations have or can invest

451
00:15:12,600 --> 00:15:13,720
in appropriate hardware.

452
00:15:13,720 --> 00:15:14,600
That's our scenario.

453
00:15:14,600 --> 00:15:15,480
We're not experimenting.

454
00:15:15,480 --> 00:15:17,320
We're building production infrastructure.

455
00:15:17,320 --> 00:15:19,640
The hardware requirements for local YAMA deployment

456
00:15:19,640 --> 00:15:21,880
are major but increasingly accessible.

457
00:15:21,880 --> 00:15:25,240
Guidance for LAMA 3 suggests targeting Nvidia GPUs

458
00:15:25,240 --> 00:15:28,520
with at least 16 gigabytes of VRAM, 32 gigabytes

459
00:15:28,520 --> 00:15:31,760
of system RAM, and roughly 50 gigabytes of free disk space

460
00:15:31,760 --> 00:15:33,560
for models and dependencies.

461
00:15:33,560 --> 00:15:35,320
Larger models of fine tuning workloads

462
00:15:35,320 --> 00:15:39,280
benefit from 64 gigabytes of RAM and more GPU memory.

463
00:15:39,280 --> 00:15:41,640
Community reports describe successful deployments

464
00:15:41,640 --> 00:15:43,600
on Linux distributions like Ubuntu

465
00:15:43,600 --> 00:15:45,760
by compiling inference engines like LAMA,

466
00:15:45,760 --> 00:15:49,080
CPP with CUDA support combined with appropriate Nvidia drivers.

467
00:15:49,080 --> 00:15:50,600
Olamma simplifies this further.

468
00:15:50,600 --> 00:15:53,720
It's a cross-platform application for macOS, Windows, and Linux

469
00:15:53,720 --> 00:15:56,520
that downloads and runs models via a local API endpoint.

470
00:15:56,520 --> 00:15:58,000
You pull a model with a single command,

471
00:15:58,000 --> 00:16:01,160
you query it with a simple HTTP request to local host.

472
00:16:01,160 --> 00:16:03,480
No container orchestration, no model conversion,

473
00:16:03,480 --> 00:16:05,320
no manual dependency management.

474
00:16:05,320 --> 00:16:07,720
For production use, you will want to run Olamma

475
00:16:07,720 --> 00:16:10,400
on a dedicated GPU server rather than a laptop

476
00:16:10,400 --> 00:16:11,640
but the abstraction is the same.

477
00:16:11,640 --> 00:16:14,040
A 2026 total cost of ownership analysis

478
00:16:14,040 --> 00:16:16,160
suggests that beyond certain usage thresholds

479
00:16:16,160 --> 00:16:19,120
running open source models on dedicated GPU servers

480
00:16:19,120 --> 00:16:22,240
can become more cost effective than paying per token API fees

481
00:16:22,240 --> 00:16:24,600
despite high upfront hardware costs.

482
00:16:24,600 --> 00:16:26,880
The exact threshold depends on utilization,

483
00:16:26,880 --> 00:16:30,280
model size, energy costs, and operational expertise.

484
00:16:30,280 --> 00:16:31,920
But the directional insight is clear.

485
00:16:31,920 --> 00:16:35,080
If your organization will process thousands of queries daily

486
00:16:35,080 --> 00:16:37,160
across tens of thousands of documents,

487
00:16:37,160 --> 00:16:38,840
local deployment isn't a luxury.

488
00:16:38,840 --> 00:16:40,480
It's a financial optimization.

489
00:16:40,480 --> 00:16:43,720
Subjective experience comparisons between cloud and local models

490
00:16:43,720 --> 00:16:45,840
tend to emphasize that frontier cloud models

491
00:16:45,840 --> 00:16:49,560
still outperform smaller local ones on reasoning and nuance.

492
00:16:49,560 --> 00:16:52,040
But local models are increasingly acceptable for enterprise

493
00:16:52,040 --> 00:16:54,240
tasks when carefully selected and configured.

494
00:16:54,240 --> 00:16:56,760
The main phrase is carefully selected and configured.

495
00:16:56,760 --> 00:16:58,840
A badly chosen local model with poor prompting

496
00:16:58,840 --> 00:16:59,800
will disappoint.

497
00:16:59,800 --> 00:17:02,280
A well-chosen model with good rag will surprise you.

498
00:17:02,280 --> 00:17:04,920
For our architecture, the model isn't doing everything.

499
00:17:04,920 --> 00:17:07,200
It's answering questions based on retrieved context.

500
00:17:07,200 --> 00:17:08,640
It doesn't need to know quantum physics.

501
00:17:08,640 --> 00:17:10,920
It needs to synthesize policy documents, contracts,

502
00:17:10,920 --> 00:17:13,440
and procedures into coherent responses.

503
00:17:13,440 --> 00:17:15,760
That's a narrower task than general reasoning.

504
00:17:15,760 --> 00:17:17,640
And local models handle it well.

505
00:17:17,640 --> 00:17:19,960
Meta now offers Lama 4 as its flagship family

506
00:17:19,960 --> 00:17:22,520
alongside the Open Weight Lama 3 series.

507
00:17:22,520 --> 00:17:25,840
Deployment paths exist for both cloud and local scenarios.

508
00:17:25,840 --> 00:17:28,640
For our air-gaped architecture, we pull the open weights,

509
00:17:28,640 --> 00:17:30,720
quantize them for our hardware, and serve them

510
00:17:30,720 --> 00:17:32,240
through Olamma or Lama.

511
00:17:32,240 --> 00:17:33,000
CPP.

512
00:17:33,000 --> 00:17:35,040
The license is permissive for commercial use.

513
00:17:35,040 --> 00:17:35,880
The weights are yours.

514
00:17:35,880 --> 00:17:38,000
The model is yours and the answers stay yours.

515
00:17:38,000 --> 00:17:39,920
Once you have committed to local inference,

516
00:17:39,920 --> 00:17:41,760
the next decision is the vector database.

517
00:17:41,760 --> 00:17:43,400
And this is where many architects make

518
00:17:43,400 --> 00:17:44,720
their first real mistake.

519
00:17:44,720 --> 00:17:46,560
Vector databases and embeddings.

520
00:17:46,560 --> 00:17:48,640
Embeddings are the bridge between human language

521
00:17:48,640 --> 00:17:49,640
and machine search.

522
00:17:49,640 --> 00:17:52,120
A sentence transformer model takes a piece of text

523
00:17:52,120 --> 00:17:55,080
and converts it into a dense vector of floating point numbers.

524
00:17:55,080 --> 00:17:56,960
That vector captures semantic meaning.

525
00:17:56,960 --> 00:17:58,920
Sentence is about similar topics produce vectors

526
00:17:58,920 --> 00:18:00,880
that are close together in high-dimensional space.

527
00:18:00,880 --> 00:18:02,840
Sentence is about unrelated topics produce vectors

528
00:18:02,840 --> 00:18:03,880
that are far apart.

529
00:18:03,880 --> 00:18:05,360
This isn't keyword search.

530
00:18:05,360 --> 00:18:07,200
A keyword search for termination policy

531
00:18:07,200 --> 00:18:09,840
might miss a document titled off-boarding procedures.

532
00:18:09,840 --> 00:18:12,120
An embedding search finds it because the semantic meaning

533
00:18:12,120 --> 00:18:12,880
is similar.

534
00:18:12,880 --> 00:18:14,920
The model understands that off-boarding and termination

535
00:18:14,920 --> 00:18:16,160
are related concepts.

536
00:18:16,160 --> 00:18:18,240
It encodes that relationship into the geometry

537
00:18:18,240 --> 00:18:19,400
of the vector space.

538
00:18:19,400 --> 00:18:21,400
Sentence transformers come in many flavors.

539
00:18:21,400 --> 00:18:24,040
The all-mini LML6 V2 model is small, fast,

540
00:18:24,040 --> 00:18:25,200
and runs well on CPU.

541
00:18:25,200 --> 00:18:28,160
It produces 384 dimensional vectors.

542
00:18:28,160 --> 00:18:31,360
The BGE large and model is larger, slower, and more accurate.

543
00:18:31,360 --> 00:18:34,200
It produces 1,024 dimensional vectors.

544
00:18:34,200 --> 00:18:35,960
For a local air-gapped deployment,

545
00:18:35,960 --> 00:18:38,240
you run the embedding model on the same GPU server

546
00:18:38,240 --> 00:18:41,120
as your LLM or on a separate CPU worker.

547
00:18:41,120 --> 00:18:44,160
The critical rule is that the embedding model must run locally.

548
00:18:44,160 --> 00:18:46,680
Don't call a cloud embedding API doing so

549
00:18:46,680 --> 00:18:49,200
would send your document chunks to an external service,

550
00:18:49,200 --> 00:18:51,480
defeating the entire purpose of the architecture.

551
00:18:51,480 --> 00:18:53,320
The vector database stores these embeddings

552
00:18:53,320 --> 00:18:55,080
and performs similarity search.

553
00:18:55,080 --> 00:18:57,600
When a user asks a question, the question gets embedded

554
00:18:57,600 --> 00:18:58,920
using the same model.

555
00:18:58,920 --> 00:19:01,080
The database compares this query vector

556
00:19:01,080 --> 00:19:03,120
against all stored document vectors

557
00:19:03,120 --> 00:19:04,520
and returns the nearest neighbors.

558
00:19:04,520 --> 00:19:07,640
This is called approximate nearest neighbor search, OANN.

559
00:19:07,640 --> 00:19:09,320
It's fast even with millions of vectors

560
00:19:09,320 --> 00:19:12,320
because the database users specialize in their structures.

561
00:19:12,320 --> 00:19:14,280
Several vector databases are available.

562
00:19:14,280 --> 00:19:15,440
Your grant is written in Rust,

563
00:19:15,440 --> 00:19:17,200
its fast, memory efficient, and supports

564
00:19:17,200 --> 00:19:18,880
rich metadata filtering.

565
00:19:18,880 --> 00:19:21,680
You can attach tags to each vector, such as document source,

566
00:19:21,680 --> 00:19:23,760
library name, author, and permission level.

567
00:19:23,760 --> 00:19:25,640
Then you can filter searches to only vectors

568
00:19:25,640 --> 00:19:27,800
from libraries the user is allowed to access.

569
00:19:27,800 --> 00:19:29,600
Wevey8 offers a GraphQL interface

570
00:19:29,600 --> 00:19:31,240
and native multimodal support.

571
00:19:31,240 --> 00:19:33,200
Milvus is designed for cloud-native scaling

572
00:19:33,200 --> 00:19:34,080
with Kubernetes.

573
00:19:34,080 --> 00:19:36,480
Chroma is lightweight and ideal for prototyping.

574
00:19:36,480 --> 00:19:38,040
For an air-gapped SharePoint deployment,

575
00:19:38,040 --> 00:19:40,400
Q-drand and Wevey8 are the pragmatic choices.

576
00:19:40,400 --> 00:19:42,240
Both run on-premises via Docker.

577
00:19:42,240 --> 00:19:43,760
Both support the metadata filtering

578
00:19:43,760 --> 00:19:45,560
you need for permission-aware retrieval.

579
00:19:45,560 --> 00:19:48,200
Both have stable APIs and active communities.

580
00:19:48,200 --> 00:19:50,920
The choice between them often comes down to team preference.

581
00:19:50,920 --> 00:19:52,920
If your team likes Rust APIs and JSON,

582
00:19:52,920 --> 00:19:54,520
Q-drand feels natural.

583
00:19:54,520 --> 00:19:57,040
If your team likes GraphQL and semantic search features,

584
00:19:57,040 --> 00:19:58,440
Wevey8 fits better.

585
00:19:58,440 --> 00:20:00,760
Chunking strategy determines whether your embeddings

586
00:20:00,760 --> 00:20:02,320
are meaningful or noisy.

587
00:20:02,320 --> 00:20:03,800
A document chunk is a piece of text

588
00:20:03,800 --> 00:20:05,640
that gets embedded as a single unit.

589
00:20:05,640 --> 00:20:08,000
If chunks are too large, they dilute meaning.

590
00:20:08,000 --> 00:20:11,040
A 5,000-word chunk about the entire employee handbook

591
00:20:11,040 --> 00:20:13,560
embeds into a single vector that represents everything

592
00:20:13,560 --> 00:20:14,360
and nothing.

593
00:20:14,360 --> 00:20:16,600
If chunks are too small, they fragment meaning.

594
00:20:16,600 --> 00:20:18,960
A single sentence like section 4.2 applies

595
00:20:18,960 --> 00:20:21,280
to all full-time employees carries no context

596
00:20:21,280 --> 00:20:23,240
about what section 4.2 actually says.

597
00:20:23,240 --> 00:20:26,040
The mistake I mentioned earlier is using uniform chunking

598
00:20:26,040 --> 00:20:28,040
for all SharePoint document types.

599
00:20:28,040 --> 00:20:29,680
PDFs need page-aware boundaries

600
00:20:29,680 --> 00:20:32,800
because page breaks often separate unrelated topics.

601
00:20:32,800 --> 00:20:34,480
Word documents need heading-aware chunking

602
00:20:34,480 --> 00:20:36,560
because heading's defined semantic sections.

603
00:20:36,560 --> 00:20:38,600
Excel spreadsheets need row-group chunking

604
00:20:38,600 --> 00:20:41,200
with header preservation because a row without column headers

605
00:20:41,200 --> 00:20:42,400
is meaningless.

606
00:20:42,400 --> 00:20:44,720
PowerPoint decks need slide-level chunking

607
00:20:44,720 --> 00:20:47,320
because each slide is a self-contained unit.

608
00:20:47,320 --> 00:20:49,920
Pinecones research on chunking strategies confirms this.

609
00:20:49,920 --> 00:20:52,960
Fixed-sized chunking with overlap works for homogeneous text.

610
00:20:52,960 --> 00:20:55,080
Semantic chunking based on sentence boundaries

611
00:20:55,080 --> 00:20:56,640
works for narrative documents.

612
00:20:56,640 --> 00:20:58,680
Recursive chunking that tries paragraphs,

613
00:20:58,680 --> 00:21:01,440
then sentences, then words works for mixed content.

614
00:21:01,440 --> 00:21:03,520
For SharePoint, you need a hybrid approach.

615
00:21:03,520 --> 00:21:04,880
Detect the document type.

616
00:21:04,880 --> 00:21:06,560
Apply the appropriate strategy.

617
00:21:06,560 --> 00:21:08,520
Preserve metadata at every step.

618
00:21:08,520 --> 00:21:10,320
The practical setup looks like this.

619
00:21:10,320 --> 00:21:12,680
Your ingestion service downloads a Word document

620
00:21:12,680 --> 00:21:13,800
from SharePoint.

621
00:21:13,800 --> 00:21:17,000
It extracts text while preserving heading structure.

622
00:21:17,000 --> 00:21:20,360
It breaks the text into chunks of roughly 500 tokens

623
00:21:20,360 --> 00:21:22,120
with a 50 token overlap.

624
00:21:22,120 --> 00:21:23,840
It attaches metadata to each chunk,

625
00:21:23,840 --> 00:21:26,480
including the source URL document title author,

626
00:21:26,480 --> 00:21:29,160
last modified date, and SharePoint library.

627
00:21:29,160 --> 00:21:31,800
It sends the chunk to your local sentence transformer.

628
00:21:31,800 --> 00:21:33,240
The transformer returns a vector.

629
00:21:33,240 --> 00:21:36,160
The vector gets stored in queue-drand with its metadata.

630
00:21:36,160 --> 00:21:38,760
The process repeats for every document in the library.

631
00:21:38,760 --> 00:21:40,480
For a library of 1,000 documents,

632
00:21:40,480 --> 00:21:42,080
this takes minutes, not hours.

633
00:21:42,080 --> 00:21:45,120
For 10,000 documents, it takes longer, but runs unattended.

634
00:21:45,120 --> 00:21:46,600
And once the initial index is built,

635
00:21:46,600 --> 00:21:48,320
Delta updates handle changes.

636
00:21:48,320 --> 00:21:50,400
When a document is modified in SharePoint,

637
00:21:50,400 --> 00:21:52,200
the ingestion service detects the change,

638
00:21:52,200 --> 00:21:54,760
rechunks the updated document, re-embeds the chunks,

639
00:21:54,760 --> 00:21:56,760
and updates the vectors in the database.

640
00:21:56,760 --> 00:21:58,840
Delete a document's trigger vector deletion.

641
00:21:58,840 --> 00:22:00,000
That's the memory layer.

642
00:22:00,000 --> 00:22:01,360
Now for the brain.

643
00:22:01,360 --> 00:22:02,680
The midpoint revelation.

644
00:22:02,680 --> 00:22:05,800
Everything we have covered so far is what vendors sell you.

645
00:22:05,800 --> 00:22:07,480
Cloud AI with enterprise controls,

646
00:22:07,480 --> 00:22:09,320
sovereign cloud with regional processing,

647
00:22:09,320 --> 00:22:11,080
managed drag with vector databases.

648
00:22:11,080 --> 00:22:12,480
The environment is full of platforms

649
00:22:12,480 --> 00:22:14,000
that promise to solve this problem

650
00:22:14,000 --> 00:22:16,040
while keeping you inside their ecosystem.

651
00:22:16,040 --> 00:22:18,320
But the real architecture is simpler than you think,

652
00:22:18,320 --> 00:22:21,360
a single GPU server, an open source vector database,

653
00:22:21,360 --> 00:22:23,760
a local YAMA instance, a lightweight web interface.

654
00:22:23,760 --> 00:22:26,000
And the SharePoint APIs you already know how to use.

655
00:22:26,000 --> 00:22:28,400
That's the entire stack, no cloud model subscriptions,

656
00:22:28,400 --> 00:22:31,640
no per token pricing, no vendor lock-in, no legal exposure.

657
00:22:31,640 --> 00:22:33,320
This isn't science fiction.

658
00:22:33,320 --> 00:22:36,680
In April 2025, a developer published a complete implementation

659
00:22:36,680 --> 00:22:39,560
of exactly this architecture for SharePoint on premises,

660
00:22:39,560 --> 00:22:41,920
CPAT minimal API for the backend,

661
00:22:41,920 --> 00:22:43,560
QDRIND for the vector database,

662
00:22:43,560 --> 00:22:47,000
OLAMA for local LLM inference, SharePoint as the document source.

663
00:22:47,000 --> 00:22:49,280
It handled authentication, ingestion, chunking,

664
00:22:49,280 --> 00:22:51,040
embedding, retrieval, and generation,

665
00:22:51,040 --> 00:22:53,360
all on local hardware, all under local control.

666
00:22:53,360 --> 00:22:55,920
A Reddit discussion from August 2025 details

667
00:22:55,920 --> 00:22:57,440
a similar implementation scaling

668
00:22:57,440 --> 00:22:59,560
to over 6,000 SharePoint documents.

669
00:22:59,560 --> 00:23:01,440
The developer faced real problems.

670
00:23:01,440 --> 00:23:03,280
Chunking PDFs with mixed layouts,

671
00:23:03,280 --> 00:23:04,880
handling SharePoint rate limits,

672
00:23:04,880 --> 00:23:07,920
tuning retrieval to avoid surfacing outdated versions.

673
00:23:07,920 --> 00:23:11,200
And they solved them with open source tools and community support.

674
00:23:11,200 --> 00:23:13,400
The point isn't that these specific implementations

675
00:23:13,400 --> 00:23:15,160
are production ready for your environment.

676
00:23:15,160 --> 00:23:17,400
The point is that the architecture is proven.

677
00:23:17,400 --> 00:23:18,760
People are building this today.

678
00:23:18,760 --> 00:23:19,920
They're solving the problems.

679
00:23:19,920 --> 00:23:22,520
And they're doing it without sending proprietary data

680
00:23:22,520 --> 00:23:23,800
to external APIs.

681
00:23:23,800 --> 00:23:25,240
What changes isn't the technology?

682
00:23:25,240 --> 00:23:26,200
It's the stance.

683
00:23:26,200 --> 00:23:29,320
Most organizations approach AI as a service to consume.

684
00:23:29,320 --> 00:23:31,720
They evaluate vendors, they negotiate contracts,

685
00:23:31,720 --> 00:23:33,600
they audit compliance, and they hope

686
00:23:33,600 --> 00:23:36,120
the vendor's architecture matches with their risk model.

687
00:23:36,120 --> 00:23:37,520
The stance we're taking is different.

688
00:23:37,520 --> 00:23:39,480
We treat AI as infrastructure to own.

689
00:23:39,480 --> 00:23:41,880
We select open models with permissive licenses.

690
00:23:41,880 --> 00:23:43,640
We deploy them on hardware, we control,

691
00:23:43,640 --> 00:23:45,680
we connect them to data sources we govern.

692
00:23:45,680 --> 00:23:48,240
And we accept the operational burden in exchange for sovereignty.

693
00:23:48,240 --> 00:23:51,080
That's the shift from cloud first to sovereignty first,

694
00:23:51,080 --> 00:23:54,120
from consumption to ownership, from delegation to control.

695
00:23:54,120 --> 00:23:56,560
Let me show you exactly how the data flows.

696
00:23:56,560 --> 00:23:57,800
The ingestion layer.

697
00:23:57,800 --> 00:24:00,000
The ingestion service is the bridge between SharePoint

698
00:24:00,000 --> 00:24:01,200
and your local AI.

699
00:24:01,200 --> 00:24:03,200
It's also the most security sensitive component

700
00:24:03,200 --> 00:24:05,720
because it has read access to your document libraries,

701
00:24:05,720 --> 00:24:07,760
designed it carefully.

702
00:24:07,760 --> 00:24:11,200
SharePoint content is unstructured, versioned, and permissioned.

703
00:24:11,200 --> 00:24:13,480
You can't simply dump files into a vector database

704
00:24:13,480 --> 00:24:14,600
and hope for the best.

705
00:24:14,600 --> 00:24:17,520
The ingestion service must understand document types, respect

706
00:24:17,520 --> 00:24:20,000
access controls, handle versioning, and manage

707
00:24:20,000 --> 00:24:20,880
delta updates.

708
00:24:20,880 --> 00:24:22,760
If it fails at any of these, your AI layer

709
00:24:22,760 --> 00:24:24,720
becomes either inaccurate or insecure.

710
00:24:24,720 --> 00:24:26,760
The first decision is authentication.

711
00:24:26,760 --> 00:24:29,120
The ingestion service must authenticate against SharePoint

712
00:24:29,120 --> 00:24:32,320
online or SharePoint on premises using Microsoft EntraID.

713
00:24:32,320 --> 00:24:35,240
For SharePoint online, this means OOOTH 2.0

714
00:24:35,240 --> 00:24:38,200
with either application permissions or delegated permissions.

715
00:24:38,200 --> 00:24:40,840
Application permissions grant the service broad access,

716
00:24:40,840 --> 00:24:42,600
which is simpler but less secure.

717
00:24:42,600 --> 00:24:45,640
Delegated permissions scope access to what the specific user

718
00:24:45,640 --> 00:24:47,720
or service principle is allowed to see,

719
00:24:47,720 --> 00:24:50,280
which is more secure but more complex to manage.

720
00:24:50,280 --> 00:24:53,360
For an air-gapped architecture, I recommend a hybrid approach.

721
00:24:53,360 --> 00:24:56,480
Use application permissions, scope to specific libraries,

722
00:24:56,480 --> 00:24:57,640
rather than sites.

723
00:24:57,640 --> 00:24:59,520
Read all across the entire tenant.

724
00:24:59,520 --> 00:25:01,920
Create a dedicated service principle in EntraID

725
00:25:01,920 --> 00:25:04,600
with a descriptive name like SP Local Ragngestion.

726
00:25:04,600 --> 00:25:07,600
Granted access only to the libraries you intend to index.

727
00:25:07,600 --> 00:25:09,400
Store the client's secret or certificate

728
00:25:09,400 --> 00:25:11,680
in a local secret manager like Hashikop Vault

729
00:25:11,680 --> 00:25:14,160
or as your main vault if you have a hybrid environment.

730
00:25:14,160 --> 00:25:15,480
Never hard code credentials.

731
00:25:15,480 --> 00:25:17,480
Never commit secrets to repositories.

732
00:25:17,480 --> 00:25:20,360
The SharePoint REST API provides the ingestion endpoint.

733
00:25:20,360 --> 00:25:23,400
You construct URLs like your SharePoint site collection

734
00:25:23,400 --> 00:25:25,680
plus the API path for list items.

735
00:25:25,680 --> 00:25:28,400
You specify headers for JSON, accept, and content types.

736
00:25:28,400 --> 00:25:30,960
You handle pagination because libraries can contain thousands

737
00:25:30,960 --> 00:25:31,960
of documents.

738
00:25:31,960 --> 00:25:35,080
And you filter by content type, modified date or library path

739
00:25:35,080 --> 00:25:36,320
to limit the scope.

740
00:25:36,320 --> 00:25:38,840
Microsoft Graph offers a more modern alternative.

741
00:25:38,840 --> 00:25:42,240
The Graph API provides a unified endpoint for SharePoint, OneDrive,

742
00:25:42,240 --> 00:25:43,840
Teams, and Exchange.

743
00:25:43,840 --> 00:25:47,080
For document ingestion, you query the Drive items endpoint

744
00:25:47,080 --> 00:25:49,320
for a specific site or library.

745
00:25:49,320 --> 00:25:52,160
You get metadata including file name, size, last modified date,

746
00:25:52,160 --> 00:25:53,280
and download URL.

747
00:25:53,280 --> 00:25:55,800
You download the file content using the download URL

748
00:25:55,800 --> 00:25:57,240
and you process it locally.

749
00:25:57,240 --> 00:26:00,760
The Microsoft 365 co-pilot search API currently in preview

750
00:26:00,760 --> 00:26:02,000
offers a third option.

751
00:26:02,000 --> 00:26:04,120
It allows hybrid semantic and lexical search

752
00:26:04,120 --> 00:26:06,640
over work content using natural language queries.

753
00:26:06,640 --> 00:26:09,200
For our architecture, this is less relevant for ingestion

754
00:26:09,200 --> 00:26:10,720
but useful for validation.

755
00:26:10,720 --> 00:26:13,480
You can compare your local rag results against co-pilot search

756
00:26:13,480 --> 00:26:16,280
to verify coverage and accuracy during testing.

757
00:26:16,280 --> 00:26:19,440
Document extraction is where the ingestion service earns its keep.

758
00:26:19,440 --> 00:26:22,000
Different document types require different extractors.

759
00:26:22,000 --> 00:26:24,640
For word documents, the Dose X format is a zip archive

760
00:26:24,640 --> 00:26:26,240
containing XML files.

761
00:26:26,240 --> 00:26:28,240
You can extract the text from document XML

762
00:26:28,240 --> 00:26:29,960
without installing Microsoft Office.

763
00:26:29,960 --> 00:26:32,120
For PDFs, you need a text extraction library.

764
00:26:32,120 --> 00:26:34,560
Be careful with mixed layout PDFs that contain both text

765
00:26:34,560 --> 00:26:35,080
and images.

766
00:26:35,080 --> 00:26:38,160
Tables and PDFs are notoriously difficult to extract correctly.

767
00:26:38,160 --> 00:26:40,160
For Excel spreadsheets, you need to flatten rows

768
00:26:40,160 --> 00:26:42,320
into text while preserving column headers.

769
00:26:42,320 --> 00:26:44,880
For PowerPoint text, you extract text from slides

770
00:26:44,880 --> 00:26:46,520
and optionally speaker notes.

771
00:26:46,520 --> 00:26:48,880
The extraction step must preserve structure.

772
00:26:48,880 --> 00:26:50,320
A word document with clear headings

773
00:26:50,320 --> 00:26:52,880
should produce text segments that know which heading they belong

774
00:26:52,880 --> 00:26:53,520
under.

775
00:26:53,520 --> 00:26:56,560
An Excel sheet should produce rows that include column context.

776
00:26:56,560 --> 00:26:58,960
A PowerPoint deck should separate slides.

777
00:26:58,960 --> 00:27:02,320
The structural metadata feeds into the chunking engine later.

778
00:27:02,320 --> 00:27:04,600
If you throw away structure during extraction,

779
00:27:04,600 --> 00:27:06,560
the chunking engine has nothing to work with.

780
00:27:06,560 --> 00:27:08,800
Let me give you a concrete example of what extraction looks

781
00:27:08,800 --> 00:27:10,720
like for a typical contract document.

782
00:27:10,720 --> 00:27:12,960
A word file named Employment Contract Template.

783
00:27:12,960 --> 00:27:14,480
Docs lives in the HR library.

784
00:27:14,480 --> 00:27:18,160
The ingestion service downloads it using the SharePoint REST API.

785
00:27:18,160 --> 00:27:21,000
It opens the Docx package, which is a zip archive containing

786
00:27:21,000 --> 00:27:21,960
XML files.

787
00:27:21,960 --> 00:27:24,280
It reads document.xml and extracts paragraphs

788
00:27:24,280 --> 00:27:26,200
while preserving the paragraph styles.

789
00:27:26,200 --> 00:27:28,600
Paragraph style is heading one become section markers.

790
00:27:28,600 --> 00:27:31,840
Paragraph style as normal become body text.

791
00:27:31,840 --> 00:27:34,880
Paragraph style as list bullet become list items.

792
00:27:34,880 --> 00:27:36,800
The extraction output is a structured text file

793
00:27:36,800 --> 00:27:38,200
that looks like this.

794
00:27:38,200 --> 00:27:39,920
Heading one, employment terms.

795
00:27:39,920 --> 00:27:41,880
Body, this contract is governed by the laws

796
00:27:41,880 --> 00:27:43,080
of the state of Delaware.

797
00:27:43,080 --> 00:27:45,840
Heading one, compensation, body, the employee

798
00:27:45,840 --> 00:27:48,760
shall receive a base salary as specified in Appendix A.

799
00:27:48,760 --> 00:27:51,040
This structure is exactly what the chunking engine needs

800
00:27:51,040 --> 00:27:53,040
to create semantically coherent chunks.

801
00:27:53,040 --> 00:27:55,960
PDF extraction is more complex because PDF is a presentation

802
00:27:55,960 --> 00:27:57,680
format, not a content format.

803
00:27:57,680 --> 00:28:00,880
A PDF file contains drawing commands, not paragraphs.

804
00:28:00,880 --> 00:28:03,920
The text you see on the page is positioned absolutely.

805
00:28:03,920 --> 00:28:05,760
Two words that appear next to each other

806
00:28:05,760 --> 00:28:08,320
might be stored in the PDF as separate objects

807
00:28:08,320 --> 00:28:09,880
with no explicit relationship.

808
00:28:09,880 --> 00:28:13,480
Good PDF extractors like Pi PDF2, PDF Plumber, or PDF Miner

809
00:28:13,480 --> 00:28:15,520
use heuristics to reconstruct reading order.

810
00:28:15,520 --> 00:28:18,240
They detect columns, they identify headers and footers.

811
00:28:18,240 --> 00:28:21,400
They separate tables from body text, but they're not perfect.

812
00:28:21,400 --> 00:28:23,920
A scanned PDF that contains images instead of text

813
00:28:23,920 --> 00:28:27,240
requires OCR, which adds another layer of complexity and error.

814
00:28:27,240 --> 00:28:29,720
Testract is a common open source OCR engine.

815
00:28:29,720 --> 00:28:32,080
It converts images to text with reasonable accuracy

816
00:28:32,080 --> 00:28:35,280
for clean documents, but handwritten annotations, stamps,

817
00:28:35,280 --> 00:28:38,000
and poor scan quality will produce garbage text

818
00:28:38,000 --> 00:28:39,360
that pollutes your index.

819
00:28:39,360 --> 00:28:41,920
For Excel, the challenge is that a single spreadsheet

820
00:28:41,920 --> 00:28:44,720
might contain multiple sheets, each with a different purpose.

821
00:28:44,720 --> 00:28:46,680
Sheet one might be employee data.

822
00:28:46,680 --> 00:28:48,600
Sheet two might be salary bands, sheet three

823
00:28:48,600 --> 00:28:49,600
might be a lookup table.

824
00:28:49,600 --> 00:28:52,680
If you flatten the entire workbook into a single text stream,

825
00:28:52,680 --> 00:28:55,560
the retrieval engine can't distinguish between an employee name

826
00:28:55,560 --> 00:28:57,040
and a salary band value.

827
00:28:57,040 --> 00:28:59,240
The extraction must preserve sheet boundaries.

828
00:28:59,240 --> 00:29:01,880
It must include column headers in every data row context,

829
00:29:01,880 --> 00:29:04,400
and it must skip empty rows and hidden sheets

830
00:29:04,400 --> 00:29:06,400
that contain no meaningful content.

831
00:29:06,400 --> 00:29:09,520
Powerpoint extraction faces the opposite problem of Excel.

832
00:29:09,520 --> 00:29:11,720
Each slide is already a self-contained unit,

833
00:29:11,720 --> 00:29:14,400
but slides contain title text, body bullets, speaker

834
00:29:14,400 --> 00:29:15,880
notes, and embedded charts.

835
00:29:15,880 --> 00:29:18,920
The title and bullets are usually the content you want to index.

836
00:29:18,920 --> 00:29:21,040
Speaker notes might contain presenter guidance

837
00:29:21,040 --> 00:29:22,920
that's irrelevant to document retrieval.

838
00:29:22,920 --> 00:29:25,880
Charts contain data that might be useful if extracted as text,

839
00:29:25,880 --> 00:29:27,400
but is usually stored as images.

840
00:29:27,400 --> 00:29:30,320
A good PowerPoint extractor pulls title and bullet text

841
00:29:30,320 --> 00:29:32,000
while optionally including speaker notes

842
00:29:32,000 --> 00:29:33,720
if your use case requires them.

843
00:29:33,720 --> 00:29:36,200
Error handling at the extraction layer is non-trivial.

844
00:29:36,200 --> 00:29:38,640
Documents in SharePoint aren't always well-formed.

845
00:29:38,640 --> 00:29:40,080
A word file might be corrupted.

846
00:29:40,080 --> 00:29:41,960
A PDF might be password protected.

847
00:29:41,960 --> 00:29:44,560
An Excel sheet might contain circular references

848
00:29:44,560 --> 00:29:46,200
that crash the parser.

849
00:29:46,200 --> 00:29:48,560
Your ingestion service must handle these gracefully.

850
00:29:48,560 --> 00:29:51,440
Log the failure, skip the document, alert the administrator,

851
00:29:51,440 --> 00:29:52,880
and continue with the rest.

852
00:29:52,880 --> 00:29:54,720
A single bad document shouldn't stop the indexing

853
00:29:54,720 --> 00:29:56,000
of 10,000 good ones.

854
00:29:56,000 --> 00:29:58,040
Retri logic with exponential back-off protects

855
00:29:58,040 --> 00:29:59,280
against transient failures.

856
00:29:59,280 --> 00:30:02,560
If SharePoint returns HTTP 500 or 503,

857
00:30:02,560 --> 00:30:05,480
wait 10 seconds and retry, then 20 seconds, then 40.

858
00:30:05,480 --> 00:30:08,080
If it still fails after three retreats, log and move on.

859
00:30:08,080 --> 00:30:10,920
If the vector database is temporarily unreachable,

860
00:30:10,920 --> 00:30:13,200
queue the vectors in local storage and retry.

861
00:30:13,200 --> 00:30:15,000
If the embedding model is overloaded,

862
00:30:15,000 --> 00:30:17,440
pause the batch and wait for GPU memory to free up.

863
00:30:17,440 --> 00:30:19,640
Resilience isn't a feature you add later.

864
00:30:19,640 --> 00:30:21,800
It's a property you design in from the start.

865
00:30:21,800 --> 00:30:23,720
Delta handling is critical for production.

866
00:30:23,720 --> 00:30:26,640
You don't want to re-index 10,000 documents every night.

867
00:30:26,640 --> 00:30:29,600
You want to detect what changed and process only that.

868
00:30:29,600 --> 00:30:31,280
SharePoint provides a changes endpoint

869
00:30:31,280 --> 00:30:34,800
that returns items modified since a specific timestamp.

870
00:30:34,800 --> 00:30:36,680
Your ingestion service stores a watermark,

871
00:30:36,680 --> 00:30:38,360
the last processed timestamp,

872
00:30:38,360 --> 00:30:40,760
and queries for changes since that watermark.

873
00:30:40,760 --> 00:30:44,040
New documents get extracted, chunked, embedded and stored.

874
00:30:44,040 --> 00:30:46,240
Modified documents get their old vectors deleted

875
00:30:46,240 --> 00:30:47,600
and new vectors inserted.

876
00:30:47,600 --> 00:30:49,960
Deleted documents trigger vector deletion.

877
00:30:49,960 --> 00:30:51,920
Security considerations at the ingestion layer

878
00:30:51,920 --> 00:30:54,120
are straightforward but non-negotiable.

879
00:30:54,120 --> 00:30:56,800
The ingestion service must run inside your perimeter.

880
00:30:56,800 --> 00:30:59,440
It should have no outbound internet connectivity

881
00:30:59,440 --> 00:31:02,640
except to Microsoft 365 if you're using SharePoint online.

882
00:31:02,640 --> 00:31:04,720
It should log every documented processes,

883
00:31:04,720 --> 00:31:08,080
every error it encounters, and every API call it makes.

884
00:31:08,080 --> 00:31:10,520
Logs stay local, the service should fail closed.

885
00:31:10,520 --> 00:31:12,400
If authentication fails, it stops.

886
00:31:12,400 --> 00:31:14,600
If the vector database is unreachable, it stops.

887
00:31:14,600 --> 00:31:17,040
If a document can't be processed, it logs the error

888
00:31:17,040 --> 00:31:18,520
and continues with the rest.

889
00:31:18,520 --> 00:31:20,360
Rate limiting is a practical concern.

890
00:31:20,360 --> 00:31:22,200
SharePoint online enforces throttling.

891
00:31:22,200 --> 00:31:25,040
If your ingestion service makes too many requests too quickly,

892
00:31:25,040 --> 00:31:28,160
SharePoint returns HTTP 429 and backs off.

893
00:31:28,160 --> 00:31:30,280
Your service must implement exponential back off.

894
00:31:30,280 --> 00:31:32,840
Start with a modest request rate, increase it gradually,

895
00:31:32,840 --> 00:31:34,320
monitor for throttling responses,

896
00:31:34,320 --> 00:31:37,160
and schedule full re-indexing during off-peak hours.

897
00:31:37,160 --> 00:31:40,480
The output of the ingestion layer is clean text with metadata.

898
00:31:40,480 --> 00:31:42,160
Document source, library name, author,

899
00:31:42,160 --> 00:31:44,520
last modified date, version number, permission level.

900
00:31:44,520 --> 00:31:47,520
This text and metadata feed into the chunking engine,

901
00:31:47,520 --> 00:31:50,480
and the chunking engine is where most implementations fail.

902
00:31:50,480 --> 00:31:54,280
Chunking and embedding strategy, bad chunking destroys rag.

903
00:31:54,280 --> 00:31:57,600
I want to be explicit about this because I have seen it happen repeatedly.

904
00:31:57,600 --> 00:32:00,120
An organization builds a beautiful ingestion pipeline,

905
00:32:00,120 --> 00:32:03,920
deploys an expensive GPU server and configures a sleek chat interface.

906
00:32:03,920 --> 00:32:06,400
Then they ask a question and get an answer that's half write,

907
00:32:06,400 --> 00:32:08,600
half invented and completely unsighted.

908
00:32:08,600 --> 00:32:10,160
The problem is almost never the model.

909
00:32:10,160 --> 00:32:11,520
It's the chunks.

910
00:32:11,520 --> 00:32:14,040
Chunking is the process of breaking extracted text

911
00:32:14,040 --> 00:32:16,760
into semantically meaningful pieces that can be embedded

912
00:32:16,760 --> 00:32:18,320
and retrieved individually.

913
00:32:18,320 --> 00:32:20,920
The goal is to create chunks that are self-contained enough

914
00:32:20,920 --> 00:32:24,040
to answer questions, but specific enough to avoid dilution.

915
00:32:24,040 --> 00:32:24,880
This is attention.

916
00:32:24,880 --> 00:32:26,600
You can't satisfy both perfectly.

917
00:32:26,600 --> 00:32:29,320
You optimize for your document types and your query patterns.

918
00:32:29,320 --> 00:32:32,320
For word documents, the best approach is heading aware chunking.

919
00:32:32,320 --> 00:32:35,280
Pass the document structure, identify headings and subheadings,

920
00:32:35,280 --> 00:32:37,240
group paragraphs under their nearest heading.

921
00:32:37,240 --> 00:32:38,640
Each group becomes a chunk.

922
00:32:38,640 --> 00:32:40,840
If a group is too large, split it at natural boundaries

923
00:32:40,840 --> 00:32:41,880
like paragraph breaks.

924
00:32:41,880 --> 00:32:45,360
If a group is too small, merge it with the next group under the same heading.

925
00:32:45,360 --> 00:32:47,560
The result is chunks that carry semantic context.

926
00:32:47,560 --> 00:32:51,480
A chunk from the termination policy section includes the heading termination policy

927
00:32:51,480 --> 00:32:52,960
and the paragraphs beneath it.

928
00:32:52,960 --> 00:32:54,720
When a user asks about termination,

929
00:32:54,720 --> 00:32:56,680
the retrieval engine finds this chunk

930
00:32:56,680 --> 00:32:59,000
because the heading is embedded along with the content.

931
00:32:59,000 --> 00:33:00,880
For PDFs, the challenge is layout.

932
00:33:00,880 --> 00:33:03,000
A research paper has clear sections.

933
00:33:03,000 --> 00:33:04,320
A scanned contract doesn't.

934
00:33:04,320 --> 00:33:06,240
A brochure mixes text and images.

935
00:33:06,240 --> 00:33:09,280
The chunking strategy must detect the document structure.

936
00:33:09,280 --> 00:33:12,600
For structured PDFs, use section headers as chunk boundaries.

937
00:33:12,600 --> 00:33:16,000
For unstructured PDFs, use fixed size chunking with overlap

938
00:33:16,000 --> 00:33:17,880
but include page numbers in the metadata

939
00:33:17,880 --> 00:33:20,640
so retrieval can sight sources accurately.

940
00:33:20,640 --> 00:33:23,000
For image-heavy PDFs, consider OCR

941
00:33:23,000 --> 00:33:25,720
if the documents contain critical text in images.

942
00:33:25,720 --> 00:33:27,520
But OCR adds complexity and error.

943
00:33:27,520 --> 00:33:28,960
Use it only when necessary.

944
00:33:28,960 --> 00:33:31,440
For Excel spreadsheets, chunk by row groups

945
00:33:31,440 --> 00:33:33,160
include column headers in every chunk.

946
00:33:33,160 --> 00:33:39,240
A chunk that says row 45, 5,000 approved 2025, 06, 01 is meaningless

947
00:33:39,240 --> 00:33:41,960
without knowing that the columns are budget, status and date.

948
00:33:41,960 --> 00:33:49,040
The chunk should read budget 5,000, status approved, date 2025, 06, 01.

949
00:33:49,040 --> 00:33:51,480
And it should include the sheet name and file name.

950
00:33:51,480 --> 00:33:53,240
If a spreadsheet has multiple sheets,

951
00:33:53,240 --> 00:33:55,560
each sheet becomes a separate chunking context.

952
00:33:55,560 --> 00:33:57,920
For PowerPoint text, chunk at the slide level.

953
00:33:57,920 --> 00:34:00,600
Each slide is designed as a self-contained unit.

954
00:34:00,600 --> 00:34:02,320
Extract the slide title, the bullet points

955
00:34:02,320 --> 00:34:04,280
and the speaker notes if available.

956
00:34:04,280 --> 00:34:06,720
Combine them into a single chunk per slide.

957
00:34:06,720 --> 00:34:09,000
If a slide is dense, split it into two chunks

958
00:34:09,000 --> 00:34:11,080
but preserve the slide number in metadata,

959
00:34:11,080 --> 00:34:13,520
so citations point back to the correct source.

960
00:34:13,520 --> 00:34:15,400
The chunk size depends on your embedding model

961
00:34:15,400 --> 00:34:17,320
and your LLM context window.

962
00:34:17,320 --> 00:34:19,840
A common starting point is 500 tokens per chunk

963
00:34:19,840 --> 00:34:21,320
with a 50 token overlap.

964
00:34:21,320 --> 00:34:23,120
The overlap ensures that sentences split

965
00:34:23,120 --> 00:34:25,240
across chunk boundaries appear in both chunks,

966
00:34:25,240 --> 00:34:27,800
reducing the chance that a critical connection gets lost.

967
00:34:27,800 --> 00:34:29,760
But this is a starting point, not a rule.

968
00:34:29,760 --> 00:34:32,720
If your documents are dense legal contracts with cross references,

969
00:34:32,720 --> 00:34:34,160
you might need larger chunks.

970
00:34:34,160 --> 00:34:36,160
If they're FAQ documents with short answers,

971
00:34:36,160 --> 00:34:37,800
you might need smaller chunks.

972
00:34:37,800 --> 00:34:39,960
Let me walk you through what good chunking looks like

973
00:34:39,960 --> 00:34:41,400
for a specific document.

974
00:34:41,400 --> 00:34:44,400
Imagine a word document titled Corporate Travel Policy.

975
00:34:44,400 --> 00:34:45,240
DocuX.

976
00:34:45,240 --> 00:34:47,360
It has sections for booking, expense limits,

977
00:34:47,360 --> 00:34:49,640
approval workflow and reimbursement.

978
00:34:49,640 --> 00:34:52,280
A heading aware chunker passes the document structure.

979
00:34:52,280 --> 00:34:53,480
It creates a chunk for booking

980
00:34:53,480 --> 00:34:55,760
that contains the heading and all paragraphs under it.

981
00:34:55,760 --> 00:34:57,200
It creates a chunk for expense limits

982
00:34:57,200 --> 00:34:58,400
that contains the heading,

983
00:34:58,400 --> 00:35:00,040
the paragraph about daily limits

984
00:35:00,040 --> 00:35:02,680
and the table showing per-dium rates by city.

985
00:35:02,680 --> 00:35:04,760
It creates a chunk for approval workflow

986
00:35:04,760 --> 00:35:06,160
that contains the heading,

987
00:35:06,160 --> 00:35:08,280
the paragraph about manager approval

988
00:35:08,280 --> 00:35:11,800
and the paragraph about executive approval for international travel.

989
00:35:11,800 --> 00:35:13,440
Each chunk is self-contained.

990
00:35:13,440 --> 00:35:15,520
A user asking about per-dium rates in New York

991
00:35:15,520 --> 00:35:17,200
gets the expense limits chunk.

992
00:35:17,200 --> 00:35:19,320
A user asking who approves international travel

993
00:35:19,320 --> 00:35:21,240
gets the approval workflow chunk.

994
00:35:21,240 --> 00:35:23,400
The retrieval is precise because the chunk boundaries

995
00:35:23,400 --> 00:35:25,080
match with semantic boundaries.

996
00:35:25,080 --> 00:35:27,520
Now imagine the same document with bad chunking.

997
00:35:27,520 --> 00:35:31,000
A fixed size chunker breaks the document every 500 tokens.

998
00:35:31,000 --> 00:35:33,800
The first chunk ends in the middle of the expense limits section.

999
00:35:33,800 --> 00:35:35,880
It contains half the daily limits paragraph

1000
00:35:35,880 --> 00:35:37,480
and the beginning of the per-dium table,

1001
00:35:37,480 --> 00:35:38,800
but not the table headers.

1002
00:35:38,800 --> 00:35:41,360
The second chunk starts with the rest of the per-dium table,

1003
00:35:41,360 --> 00:35:42,760
but not the section heading.

1004
00:35:42,760 --> 00:35:44,520
When a user asks about per-dium rates,

1005
00:35:44,520 --> 00:35:46,440
neither chunk fully answers the question.

1006
00:35:46,440 --> 00:35:48,080
The first chunk lacks the table headers.

1007
00:35:48,080 --> 00:35:49,560
The second chunk lacks the context

1008
00:35:49,560 --> 00:35:51,040
that this is about travel policy.

1009
00:35:51,040 --> 00:35:53,040
The retrieval engine returns both chunks

1010
00:35:53,040 --> 00:35:54,800
because they are topically related,

1011
00:35:54,800 --> 00:35:57,200
but the LLM can't synthesize a coherent answer

1012
00:35:57,200 --> 00:35:59,120
because the information is fragmented.

1013
00:35:59,120 --> 00:36:02,920
This is how bad chunking destroys a rag accuracy silently.

1014
00:36:02,920 --> 00:36:06,200
For Excel spreadsheets, chunking must preserve row relationships.

1015
00:36:06,200 --> 00:36:08,040
Consider an employee directory with columns

1016
00:36:08,040 --> 00:36:11,480
for name, department, manager, office location, and start date.

1017
00:36:11,480 --> 00:36:14,680
A naive chunker might create chunks of five rows each.

1018
00:36:14,680 --> 00:36:16,240
Row one through five get one chunk,

1019
00:36:16,240 --> 00:36:17,880
row six through ten get another.

1020
00:36:17,880 --> 00:36:20,240
But if a user asks who manages the engineering team,

1021
00:36:20,240 --> 00:36:23,080
the relevant rows might be scattered across multiple chunks.

1022
00:36:23,080 --> 00:36:25,720
A better approach is to chunk by department group.

1023
00:36:25,720 --> 00:36:27,600
All engineering rows become one chunk,

1024
00:36:27,600 --> 00:36:29,200
all sales rows become another.

1025
00:36:29,200 --> 00:36:31,160
This way, a query about engineering managers

1026
00:36:31,160 --> 00:36:32,920
retrieves a single coherent chunk

1027
00:36:32,920 --> 00:36:35,360
containing all engineering employees and their managers.

1028
00:36:35,360 --> 00:36:38,360
For PowerPoint decks, slide-level chunking is usually correct,

1029
00:36:38,360 --> 00:36:39,720
but some slides are dense.

1030
00:36:39,720 --> 00:36:42,280
A quarterly review slide might contain six bullet points

1031
00:36:42,280 --> 00:36:43,600
with detailed metrics.

1032
00:36:43,600 --> 00:36:45,760
If you put the entire slide into one chunk,

1033
00:36:45,760 --> 00:36:47,040
the embedding might dilute

1034
00:36:47,040 --> 00:36:49,600
because the vector must represent six different ideas.

1035
00:36:49,600 --> 00:36:51,840
In this case, split the slide into two chunks.

1036
00:36:51,840 --> 00:36:54,040
The first chunk contains the first three bullets.

1037
00:36:54,040 --> 00:36:55,640
The second contains the last three.

1038
00:36:55,640 --> 00:36:58,560
Both chunks carry the same slide title in their metadata,

1039
00:36:58,560 --> 00:37:01,280
so the retrieval engine knows they came from the same source.

1040
00:37:01,280 --> 00:37:03,960
Metadata preservation during chunking is critical.

1041
00:37:03,960 --> 00:37:06,280
Every chunk must carry its source document URL,

1042
00:37:06,280 --> 00:37:08,480
its position in the document, its heading hierarchy,

1043
00:37:08,480 --> 00:37:10,560
its document type, its library name, its author,

1044
00:37:10,560 --> 00:37:13,200
its last modified date, and its permission level.

1045
00:37:13,200 --> 00:37:14,960
This metadata doesn't get embedded.

1046
00:37:14,960 --> 00:37:17,320
It gets stored as payload data alongside the vector

1047
00:37:17,320 --> 00:37:18,240
in the database.

1048
00:37:18,240 --> 00:37:21,240
During retrieval, the metadata is returned with the vector.

1049
00:37:21,240 --> 00:37:24,240
During answer generation, the metadata becomes the citation.

1050
00:37:24,240 --> 00:37:26,920
Without metadata, an answer is unverifiable.

1051
00:37:26,920 --> 00:37:28,960
And an unverifiable answer is worthless

1052
00:37:28,960 --> 00:37:30,560
in an enterprise context.

1053
00:37:30,560 --> 00:37:33,320
The embedding model converts each chunk into a vector.

1054
00:37:33,320 --> 00:37:35,640
As I mentioned earlier, run this model locally.

1055
00:37:35,640 --> 00:37:38,920
Popular choices include all Mini-LML6, V2 for speed

1056
00:37:38,920 --> 00:37:40,840
and BGE-large and for accuracy.

1057
00:37:40,840 --> 00:37:42,320
Both are available through hugging phase

1058
00:37:42,320 --> 00:37:43,760
and run on local hardware.

1059
00:37:43,760 --> 00:37:47,040
The all-mini-LM model is 384 dimensions.

1060
00:37:47,040 --> 00:37:50,160
The BGE-large model is 1,024 dimensions.

1061
00:37:50,160 --> 00:37:52,560
Higher dimensions capture more nuance,

1062
00:37:52,560 --> 00:37:55,560
but require more storage and more compute during search.

1063
00:37:55,560 --> 00:37:58,880
For most SharePoint deployments, all Mini-LML is sufficient.

1064
00:37:58,880 --> 00:38:00,720
If retrieval accuracy is critical

1065
00:38:00,720 --> 00:38:03,120
and your document base is under 10,000 chunks,

1066
00:38:03,120 --> 00:38:05,120
BGE-large is worth the overhead.

1067
00:38:05,120 --> 00:38:07,200
The embedding step is batched for efficiency.

1068
00:38:07,200 --> 00:38:09,280
Send multiple chunks to the model at once,

1069
00:38:09,280 --> 00:38:10,680
rather than one by one.

1070
00:38:10,680 --> 00:38:12,200
Modern embedding models handle batches

1071
00:38:12,200 --> 00:38:14,720
of 32 or 64 chunks in parallel.

1072
00:38:14,720 --> 00:38:17,760
This reduces GPU idle time and speeds up indexing.

1073
00:38:17,760 --> 00:38:20,080
A batch of 1,000 chunks might take a few seconds

1074
00:38:20,080 --> 00:38:21,560
on a modern GPU.

1075
00:38:21,560 --> 00:38:23,240
A batch of 10,000 might take a minute.

1076
00:38:23,240 --> 00:38:24,920
Schedule this during maintenance windows

1077
00:38:24,920 --> 00:38:26,600
if your document base is large.

1078
00:38:26,600 --> 00:38:29,760
Metadata preservation is as important as the chunk itself.

1079
00:38:29,760 --> 00:38:32,320
Every vector in your database must carry metadata

1080
00:38:32,320 --> 00:38:34,840
that answers three critical questions about its origin,

1081
00:38:34,840 --> 00:38:37,120
its currency and its access restrictions.

1082
00:38:37,120 --> 00:38:39,040
The source URL lets the query interface

1083
00:38:39,040 --> 00:38:40,240
cite the document.

1084
00:38:40,240 --> 00:38:42,920
The last modified date helps the ingestion service detect

1085
00:38:42,920 --> 00:38:43,840
staleness.

1086
00:38:43,840 --> 00:38:45,880
The permission level enables filtered retrieval,

1087
00:38:45,880 --> 00:38:48,600
so users only see content they're authorized to access.

1088
00:38:48,600 --> 00:38:50,960
Langchain and Lama Index provide document loaders

1089
00:38:50,960 --> 00:38:53,480
and chunking utilities that handle many of these concerns.

1090
00:38:53,480 --> 00:38:56,040
Langchain's recursive character text splitter tries splitting

1091
00:38:56,040 --> 00:38:58,160
on paragraphs, then sentences, then words.

1092
00:38:58,160 --> 00:38:59,680
It's a good default for mixed content.

1093
00:38:59,680 --> 00:39:01,280
Yama Index provides node passes

1094
00:39:01,280 --> 00:39:03,080
that preserve hierarchical structure,

1095
00:39:03,080 --> 00:39:05,240
both integrate with SharePoint through custom loaders

1096
00:39:05,240 --> 00:39:06,600
or the SharePoint REST API.

1097
00:39:06,600 --> 00:39:08,800
You don't need to write chunking logic from scratch,

1098
00:39:08,800 --> 00:39:10,240
but you do need to configure it correctly

1099
00:39:10,240 --> 00:39:11,480
for your document types.

1100
00:39:11,480 --> 00:39:13,040
Once chunked and embedded, everything

1101
00:39:13,040 --> 00:39:14,680
lands in the vector database.

1102
00:39:14,680 --> 00:39:16,800
And the vector database configuration determines

1103
00:39:16,800 --> 00:39:19,640
whether your retrieval is fast, accurate and secure.

1104
00:39:19,640 --> 00:39:21,400
Vector database configuration.

1105
00:39:21,400 --> 00:39:23,880
The vector database is your AI's long term memory.

1106
00:39:23,880 --> 00:39:26,920
It must handle real-time updates, permission-aware filtering,

1107
00:39:26,920 --> 00:39:28,560
and high query throughput.

1108
00:39:28,560 --> 00:39:30,960
A bad configuration here means slow searches

1109
00:39:30,960 --> 00:39:33,880
in accurate results or unauthorized data exposure.

1110
00:39:33,880 --> 00:39:35,360
These aren't hypothetical risks,

1111
00:39:35,360 --> 00:39:37,560
they're configuration mistakes that happen in production.

1112
00:39:37,560 --> 00:39:39,560
QueueDrand is my recommended starting point

1113
00:39:39,560 --> 00:39:41,440
for air-gapped SharePoint deployments.

1114
00:39:41,440 --> 00:39:43,280
It's written in Rust, it's fast.

1115
00:39:43,280 --> 00:39:45,360
It supports rich metadata filtering,

1116
00:39:45,360 --> 00:39:48,320
and it runs on-premises via a single Docker container.

1117
00:39:48,320 --> 00:39:49,720
You can start a QueueDrand instance

1118
00:39:49,720 --> 00:39:53,040
with a Docker run command pointing to a local data directory.

1119
00:39:53,040 --> 00:39:56,600
It exposes a REST API on port 6333.

1120
00:39:56,600 --> 00:39:58,000
And it stores collections of vectors

1121
00:39:58,000 --> 00:39:59,800
with attached payload metadata.

1122
00:39:59,800 --> 00:40:01,560
A collection in QueueDrand is like a table

1123
00:40:01,560 --> 00:40:03,000
in a relational database.

1124
00:40:03,000 --> 00:40:05,640
You create one collection for your SharePoint index.

1125
00:40:05,640 --> 00:40:09,640
You define the vector size 384 for all mini-lem or 1024

1126
00:40:09,640 --> 00:40:10,720
for BG large.

1127
00:40:10,720 --> 00:40:12,160
You configure the distance metric,

1128
00:40:12,160 --> 00:40:14,720
cosine similarity is standard for text embeddings,

1129
00:40:14,720 --> 00:40:16,920
and you set up the index type HNSW,

1130
00:40:16,920 --> 00:40:19,480
which stands for hierarchical navigable small world,

1131
00:40:19,480 --> 00:40:22,320
is the default index type for approximate nearest neighbor

1132
00:40:22,320 --> 00:40:24,280
search, it builds a graph structure

1133
00:40:24,280 --> 00:40:26,920
where each vector connects to its nearest neighbors.

1134
00:40:26,920 --> 00:40:29,160
Search traverses this graph to find closed matches

1135
00:40:29,160 --> 00:40:31,200
without comparing the query against every vector

1136
00:40:31,200 --> 00:40:32,120
in the database.

1137
00:40:32,120 --> 00:40:35,000
This makes search fast even with millions of vectors.

1138
00:40:35,000 --> 00:40:37,480
Two parameters control the speed accuracy trade off.

1139
00:40:37,480 --> 00:40:39,360
EF construction determines how thoroughly

1140
00:40:39,360 --> 00:40:40,960
the graph is built during indexing.

1141
00:40:40,960 --> 00:40:44,000
Higher values produce better graphs, but slower indexing.

1142
00:40:44,000 --> 00:40:46,200
EF determines how thoroughly the graph is searched

1143
00:40:46,200 --> 00:40:47,480
during query time.

1144
00:40:47,480 --> 00:40:49,160
Higher values produce more accurate results,

1145
00:40:49,160 --> 00:40:50,360
but slower queries.

1146
00:40:50,360 --> 00:40:55,480
For initial setup, use EF construction of 128 and EF of 64.

1147
00:40:55,480 --> 00:40:57,760
Tune these based on your observed query latency

1148
00:40:57,760 --> 00:40:58,880
and retrieval accuracy.

1149
00:40:58,880 --> 00:41:01,840
Metadata filtering is where QueueDrand shines for our use case.

1150
00:41:01,840 --> 00:41:04,240
When you insert a vector, you attach a JSON payload.

1151
00:41:04,240 --> 00:41:05,520
That payload might look like this.

1152
00:41:05,520 --> 00:41:07,880
Source URL pointing to the SharePoint document,

1153
00:41:07,880 --> 00:41:12,000
library name like legal or HR, author email, last modified timestamp,

1154
00:41:12,000 --> 00:41:14,040
and permission level like executive or standard.

1155
00:41:14,040 --> 00:41:16,960
At query time, you can filter the search to only vectors

1156
00:41:16,960 --> 00:41:19,000
where permission level equals standard or lower.

1157
00:41:19,000 --> 00:41:21,280
This prevents the retrieval engine from finding chunks

1158
00:41:21,280 --> 00:41:22,680
the user can't access.

1159
00:41:22,680 --> 00:41:24,160
WeV8 is a strong alternative.

1160
00:41:24,160 --> 00:41:26,920
It offers a GraphQL interface, native multimodal support,

1161
00:41:26,920 --> 00:41:29,000
and built in vectorization if you want to delegate

1162
00:41:29,000 --> 00:41:30,400
embedding to the database.

1163
00:41:30,400 --> 00:41:32,760
For our architecture, we keep embedding separate

1164
00:41:32,760 --> 00:41:34,440
because we want control over the embedding model

1165
00:41:34,440 --> 00:41:36,680
and batching, but WeV8's GraphQL interface

1166
00:41:36,680 --> 00:41:38,880
is elegant for complex filtered queries.

1167
00:41:38,880 --> 00:41:40,800
If your team prefers GraphQL overrest,

1168
00:41:40,800 --> 00:41:42,600
WeV8 is worth evaluating.

1169
00:41:42,600 --> 00:41:44,640
Milvus is designed for cloud native scaling.

1170
00:41:44,640 --> 00:41:45,760
It runs on Kubernetes.

1171
00:41:45,760 --> 00:41:48,120
It supports billion scale vector search,

1172
00:41:48,120 --> 00:41:49,800
and it has a sophisticated architecture

1173
00:41:49,800 --> 00:41:51,880
with separated storage and compute.

1174
00:41:51,880 --> 00:41:54,040
For a single tenant air-gaped deployment,

1175
00:41:54,040 --> 00:41:57,680
Milvus is overkill unless you expect tens of millions of vectors.

1176
00:41:57,680 --> 00:41:59,360
If you do, it's the right choice.

1177
00:41:59,360 --> 00:42:01,800
But most SharePoint deployments don't reach that scale.

1178
00:42:01,800 --> 00:42:03,240
Chroma is the lightweight option.

1179
00:42:03,240 --> 00:42:05,400
It stores vectors in SQLite by default.

1180
00:42:05,400 --> 00:42:06,880
It requires no server setup,

1181
00:42:06,880 --> 00:42:08,720
and it's ideal for prototyping.

1182
00:42:08,720 --> 00:42:11,680
But for production with multiple users, concurrent queries

1183
00:42:11,680 --> 00:42:13,360
and permission filtering, Chroma

1184
00:42:13,360 --> 00:42:15,800
lacks the strongness of QDrand or WeV8.

1185
00:42:15,800 --> 00:42:17,400
Use it to validate your pipeline.

1186
00:42:17,400 --> 00:42:19,360
Then migrate to QDrand for production.

1187
00:42:19,360 --> 00:42:21,440
Real-time synchronization between SharePoint

1188
00:42:21,440 --> 00:42:23,760
and the vector database is a workflow problem,

1189
00:42:23,760 --> 00:42:25,200
not a database problem.

1190
00:42:25,200 --> 00:42:27,760
Your ingestion service detects changes in SharePoint.

1191
00:42:27,760 --> 00:42:29,280
It extracts modified documents.

1192
00:42:29,280 --> 00:42:30,840
It rechunks and reambeds them.

1193
00:42:30,840 --> 00:42:32,560
It updates the vectors in QDrand,

1194
00:42:32,560 --> 00:42:35,280
and it deletes vectors for removed documents.

1195
00:42:35,280 --> 00:42:37,800
QDrand supports point updates and deletes by ID.

1196
00:42:37,800 --> 00:42:40,800
You store the vector ID as a hash of the document URL

1197
00:42:40,800 --> 00:42:42,040
and chunk index.

1198
00:42:42,040 --> 00:42:44,400
When a document changes, you know exactly which vectors

1199
00:42:44,400 --> 00:42:45,360
to replace.

1200
00:42:45,360 --> 00:42:47,640
Query latency is your user experience metric.

1201
00:42:47,640 --> 00:42:50,600
The user asks the question, the query interface embeds it.

1202
00:42:50,600 --> 00:42:52,160
The vector database searches.

1203
00:42:52,160 --> 00:42:54,040
The top matches get sent to the LLM.

1204
00:42:54,040 --> 00:42:55,360
The LLM generates an answer.

1205
00:42:55,360 --> 00:42:56,760
The total time from question to answer

1206
00:42:56,760 --> 00:42:59,200
should be under five seconds for a good experience.

1207
00:42:59,200 --> 00:43:02,160
Vector search itself should take under 100 milliseconds.

1208
00:43:02,160 --> 00:43:05,160
If it takes longer, increase EF or add query replicas.

1209
00:43:05,160 --> 00:43:08,160
If accuracy is poor, increase EF or rebuild the index

1210
00:43:08,160 --> 00:43:10,120
with higher EF construction.

1211
00:43:10,120 --> 00:43:12,120
Metadata filtering examples show why

1212
00:43:12,120 --> 00:43:14,680
QDrand is powerful for our use case.

1213
00:43:14,680 --> 00:43:17,040
A user asks about remote work policies.

1214
00:43:17,040 --> 00:43:19,680
The query interface constructs a search with two conditions.

1215
00:43:19,680 --> 00:43:22,040
The vector must be semantically close to the question.

1216
00:43:22,040 --> 00:43:24,840
And the payload must have library equal to HR or legal

1217
00:43:24,840 --> 00:43:28,240
and permission tier less than or equal to the user's tier.

1218
00:43:28,240 --> 00:43:30,800
QDrand evaluates both conditions simultaneously.

1219
00:43:30,800 --> 00:43:33,600
It searches only vectors that match the filter,

1220
00:43:33,600 --> 00:43:35,920
then ranks them by vector similarity.

1221
00:43:35,920 --> 00:43:38,280
This is far more efficient than retrieving all vectors

1222
00:43:38,280 --> 00:43:39,880
and filtering afterward.

1223
00:43:39,880 --> 00:43:42,200
Another filtering pattern is date-based exclusion.

1224
00:43:42,200 --> 00:43:44,480
A user asks about the current vacation policy.

1225
00:43:44,480 --> 00:43:46,880
The query interface adds a filter for last modified date

1226
00:43:46,880 --> 00:43:49,160
greater than January 1, 2025.

1227
00:43:49,160 --> 00:43:52,800
This excludes outdated policy documents from 2023 or 2024.

1228
00:43:52,800 --> 00:43:55,560
The answer reflects the current rules, not superseded ones.

1229
00:43:55,560 --> 00:43:57,400
This pattern requires your ingestion service

1230
00:43:57,400 --> 00:44:00,120
to keep the last modified date accurate in the metadata.

1231
00:44:00,120 --> 00:44:03,040
If the date is wrong, the filter excludes the wrong documents.

1232
00:44:03,040 --> 00:44:06,680
Monitor the database, track query latency, index size, memory

1233
00:44:06,680 --> 00:44:08,120
usage, and error rates.

1234
00:44:08,120 --> 00:44:10,240
QDrand exposes Prometheus metrics,

1235
00:44:10,240 --> 00:44:12,400
scrape them with your local monitoring stack, alert

1236
00:44:12,400 --> 00:44:13,520
on anomalies.

1237
00:44:13,520 --> 00:44:15,840
A vector database that silently degrades

1238
00:44:15,840 --> 00:44:18,600
will produce poor answers without anyone noticing.

1239
00:44:18,600 --> 00:44:21,600
Backup and recovery for the vector database is often overlooked.

1240
00:44:21,600 --> 00:44:23,640
Could you run stores data in a local directory

1241
00:44:23,640 --> 00:44:25,360
that you mount as a Docker volume?

1242
00:44:25,360 --> 00:44:28,360
Backup this directory using your standard backup infrastructure,

1243
00:44:28,360 --> 00:44:31,280
snapshot before major changes like re-indexing or collection

1244
00:44:31,280 --> 00:44:32,200
rebuilds.

1245
00:44:32,200 --> 00:44:34,000
Test your restore procedure quarterly.

1246
00:44:34,000 --> 00:44:36,120
A corrupted vector index with no backup

1247
00:44:36,120 --> 00:44:39,040
means re-indexing your entire document base from scratch.

1248
00:44:39,040 --> 00:44:41,640
For a 10,000 document library that might take hours,

1249
00:44:41,640 --> 00:44:44,400
for a 100,000 document library it might take days.

1250
00:44:44,400 --> 00:44:46,360
Capacity planning starts with sizing.

1251
00:44:46,360 --> 00:44:50,400
A single vector of 384 dimensions at single precision

1252
00:44:50,400 --> 00:44:53,120
takes roughly 1.5 kilobytes of storage.

1253
00:44:53,120 --> 00:44:55,880
A million vectors take roughly 1.5 gigabytes.

1254
00:44:55,880 --> 00:44:58,280
Add payload metadata and index overhead,

1255
00:44:58,280 --> 00:45:01,440
and the total might be three to five gigabytes per million vectors.

1256
00:45:01,440 --> 00:45:04,200
For a typical enterprise with 50,000 sharepoint documents

1257
00:45:04,200 --> 00:45:06,480
chunked into 200,000 vectors, your database

1258
00:45:06,480 --> 00:45:09,640
needs roughly one terabyte of fast SSD storage, not massive,

1259
00:45:09,640 --> 00:45:10,800
but not trivial either.

1260
00:45:10,800 --> 00:45:12,600
Plan for three to five years of growth.

1261
00:45:12,600 --> 00:45:14,000
Memory sizing is equally important.

1262
00:45:14,000 --> 00:45:17,320
QDrand keeps the H and SW index in memory for fast search.

1263
00:45:17,320 --> 00:45:19,560
The index size depends on vector count, dimensions,

1264
00:45:19,560 --> 00:45:21,000
and graph connectivity.

1265
00:45:21,000 --> 00:45:24,520
A million vectors of 384 dimensions with H and SW

1266
00:45:24,520 --> 00:45:26,760
might consume two to four gigabytes of RAM.

1267
00:45:26,760 --> 00:45:29,000
Add the OS, the container overhead, and headroom

1268
00:45:29,000 --> 00:45:30,360
for concurrent queries.

1269
00:45:30,360 --> 00:45:32,600
A server with 16 gigabytes of RAM is comfortable

1270
00:45:32,600 --> 00:45:33,760
for most deployments.

1271
00:45:33,760 --> 00:45:35,680
32 gigabytes provides room to grow.

1272
00:45:35,680 --> 00:45:37,000
That's the memory layer.

1273
00:45:37,000 --> 00:45:38,200
Now for the brain.

1274
00:45:38,200 --> 00:45:39,680
The local Yamaha runtime.

1275
00:45:39,680 --> 00:45:41,600
The LLM is the reasoning engine.

1276
00:45:41,600 --> 00:45:44,000
It takes the user's question and the retrieved document chunks

1277
00:45:44,000 --> 00:45:46,080
and synthesizes a coherent answer.

1278
00:45:46,080 --> 00:45:48,120
But it must run entirely inside your perimeter

1279
00:45:48,120 --> 00:45:49,840
with no cloud dependency.

1280
00:45:49,840 --> 00:45:52,440
Every byte of the model weight sits on your local disk.

1281
00:45:52,440 --> 00:45:54,680
Every inference runs on your local GPU,

1282
00:45:54,680 --> 00:45:57,560
and every response leaves through your local network.

1283
00:45:57,560 --> 00:45:59,800
Olamma is the simplest way to achieve this.

1284
00:45:59,800 --> 00:46:03,320
It's a cross-platform runtime for LLM and other open models.

1285
00:46:03,320 --> 00:46:05,000
You install it on your GPU server.

1286
00:46:05,000 --> 00:46:08,040
You pull a model with a command like Olamma pull LLM3.

1287
00:46:08,040 --> 00:46:11,800
And you get a local REST API at localhost port 1143-4,

1288
00:46:11,800 --> 00:46:14,040
query it with a simple HTTP post.

1289
00:46:14,040 --> 00:46:15,800
Send the model name, the system prompt,

1290
00:46:15,800 --> 00:46:18,000
the user message, and the retrieved context.

1291
00:46:18,000 --> 00:46:19,280
Get back a streaming response.

1292
00:46:19,280 --> 00:46:21,400
The system prompt is critical for SharePoint Rack.

1293
00:46:21,400 --> 00:46:23,000
It tells the model what its role is,

1294
00:46:23,000 --> 00:46:24,360
what the context format is,

1295
00:46:24,360 --> 00:46:25,920
and what constraints to follow.

1296
00:46:25,920 --> 00:46:28,320
A good system prompt for this architecture looks like this.

1297
00:46:28,320 --> 00:46:29,480
You're a knowledgeable assistant

1298
00:46:29,480 --> 00:46:32,440
that answers questions based on the provided document context.

1299
00:46:32,440 --> 00:46:35,120
Use only the information in the context to answer.

1300
00:46:35,120 --> 00:46:38,120
If the context doesn't contain the answer, say you don't know.

1301
00:46:38,120 --> 00:46:40,280
Site the source document for every claim you make.

1302
00:46:40,280 --> 00:46:42,480
Don't speculate, don't use outside knowledge.

1303
00:46:42,480 --> 00:46:44,600
This prompt enforces three behaviors.

1304
00:46:44,600 --> 00:46:47,000
Grounding, the model must use the retrieved context.

1305
00:46:47,000 --> 00:46:48,240
Honestly, the model must admit

1306
00:46:48,240 --> 00:46:50,440
when the answer isn't in the context, citations,

1307
00:46:50,440 --> 00:46:52,160
the model must reference sources.

1308
00:46:52,160 --> 00:46:54,720
These constraints reduce hallucination and increased trust.

1309
00:46:54,720 --> 00:46:56,720
They don't eliminate hallucination entirely.

1310
00:46:56,720 --> 00:46:58,640
No prompt does, but they push the model

1311
00:46:58,640 --> 00:47:00,200
toward the behavior you want.

1312
00:47:00,200 --> 00:47:01,680
Temperature controls randomness.

1313
00:47:01,680 --> 00:47:03,960
A low temperature like 0.1 or 0.2

1314
00:47:03,960 --> 00:47:06,000
makes the model deterministic and conservative.

1315
00:47:06,000 --> 00:47:08,040
It sticks closely to the retrieved text.

1316
00:47:08,040 --> 00:47:09,800
A high temperature like 0.7

1317
00:47:09,800 --> 00:47:11,880
makes the model creative and exploratory.

1318
00:47:11,880 --> 00:47:14,040
For factual retrieval from SharePoint documents,

1319
00:47:14,040 --> 00:47:15,200
keep temperature low.

1320
00:47:15,200 --> 00:47:17,800
You want accurate synthesis, not creative writing.

1321
00:47:17,800 --> 00:47:19,640
For brainstorming or summarization tasks,

1322
00:47:19,640 --> 00:47:21,360
you might raise temperature slightly.

1323
00:47:21,360 --> 00:47:24,360
But the default for most enterprise queries should be low.

1324
00:47:24,360 --> 00:47:26,800
Quantization reduces model size and memory usage

1325
00:47:26,800 --> 00:47:28,520
at a small-costing quality.

1326
00:47:28,520 --> 00:47:32,000
Models are originally stored as 16-bit floating point numbers.

1327
00:47:32,000 --> 00:47:35,720
Quantization converts them to 8-bit, 4-bit, or even lower precision.

1328
00:47:35,720 --> 00:47:37,840
Q4KM is a common quantization format

1329
00:47:37,840 --> 00:47:39,440
that balances quality and size.

1330
00:47:39,440 --> 00:47:42,520
A 70-billion parameter model quantized to Q4KM

1331
00:47:42,520 --> 00:47:44,560
fits in roughly 40 gigabytes of disk space

1332
00:47:44,560 --> 00:47:47,520
and loads into roughly 40 gigabytes of VRM.

1333
00:47:47,520 --> 00:47:50,040
That's manageable on a single NVIDIA A100

1334
00:47:50,040 --> 00:47:53,840
with 80 gigabytes of VRM or on dual RTX 4090 cards

1335
00:47:53,840 --> 00:47:55,360
with 24 gigabytes each.

1336
00:47:55,360 --> 00:48:00,040
For smaller deployments, consider the LAMMA 3.38 billion parameter model.

1337
00:48:00,040 --> 00:48:01,840
It quantizes to under five gigabytes.

1338
00:48:01,840 --> 00:48:05,200
It runs on a single RTX 4090 with room to spare.

1339
00:48:05,200 --> 00:48:08,320
And with good rag, it answers most enterprise questions adequately.

1340
00:48:08,320 --> 00:48:11,280
It won't write poetry as well as the 70-billion model.

1341
00:48:11,280 --> 00:48:13,520
But it will tell you what your vacation policy says.

1342
00:48:13,520 --> 00:48:14,560
And that's the job.

1343
00:48:14,560 --> 00:48:17,560
LAMMA 4 is now available from Meta as their flagship family.

1344
00:48:17,560 --> 00:48:20,560
Deployment paths exist for both cloud and local scenarios.

1345
00:48:20,560 --> 00:48:22,200
For our air-gapped architecture, you

1346
00:48:22,200 --> 00:48:24,640
pull the open weights, quantize them for your hardware,

1347
00:48:24,640 --> 00:48:26,080
and serve them through LAMMA.

1348
00:48:26,080 --> 00:48:28,520
Expect higher hardware requirements than LAMMA 3,

1349
00:48:28,520 --> 00:48:30,520
plan for an A100 or newer if you want

1350
00:48:30,520 --> 00:48:32,920
the largest LAMMA 4 variant, unquantized.

1351
00:48:32,920 --> 00:48:35,520
For quantized deployment, a dual GPU setup

1352
00:48:35,520 --> 00:48:38,200
or a single high memory card should suffice.

1353
00:48:38,200 --> 00:48:40,840
The exact requirements will depend on the specific variant

1354
00:48:40,840 --> 00:48:43,400
and quantization level you choose.

1355
00:48:43,400 --> 00:48:45,000
Performance tuning for local inference

1356
00:48:45,000 --> 00:48:46,520
involves several knobs.

1357
00:48:46,520 --> 00:48:48,520
Context window size determines how much text

1358
00:48:48,520 --> 00:48:50,320
the model can process in one call.

1359
00:48:50,320 --> 00:48:52,880
With rag, your context window must fit the system prompt,

1360
00:48:52,880 --> 00:48:55,000
the retrieved chunks, and the user question.

1361
00:48:55,000 --> 00:48:58,920
Five retrieved chunks of 500 tokens each is 2,500 tokens.

1362
00:48:58,920 --> 00:49:02,240
Plus the system prompt plus the user question plus the response.

1363
00:49:02,240 --> 00:49:04,600
A 4,000 token context window is tight.

1364
00:49:04,600 --> 00:49:06,320
An 8,000 token window is comfortable.

1365
00:49:06,320 --> 00:49:08,560
A 16,000 token window is generous.

1366
00:49:08,560 --> 00:49:10,600
Larger context windows require more VRM.

1367
00:49:10,600 --> 00:49:12,160
Balance this against your hardware.

1368
00:49:12,160 --> 00:49:14,360
Batching at the LLM level is different from batching

1369
00:49:14,360 --> 00:49:15,720
at the embedding level.

1370
00:49:15,720 --> 00:49:17,520
LLM inference is harder to batch,

1371
00:49:17,520 --> 00:49:20,960
because each user query is independent and latency sensitive.

1372
00:49:20,960 --> 00:49:23,600
For a chat interface serving 10 concurrent users,

1373
00:49:23,600 --> 00:49:25,920
you might run a small batch of 2 to 4 requests

1374
00:49:25,920 --> 00:49:27,440
if your GPU supports it.

1375
00:49:27,440 --> 00:49:29,200
But most local deployments process queries

1376
00:49:29,200 --> 00:49:30,880
sequentially or with minimal batching.

1377
00:49:30,880 --> 00:49:33,000
The throughput is lower than cloud APIs.

1378
00:49:33,000 --> 00:49:35,160
The latency is acceptable for interactive use.

1379
00:49:35,160 --> 00:49:37,560
GPU utilization is your infrastructure metric.

1380
00:49:37,560 --> 00:49:41,440
A GPU sitting at 10% utilization is wasted money.

1381
00:49:41,440 --> 00:49:44,440
A GPU at 90% utilization is near capacity.

1382
00:49:44,440 --> 00:49:46,600
Monitor utilization during peak hours.

1383
00:49:46,600 --> 00:49:49,160
If you consistently hit 80% or higher,

1384
00:49:49,160 --> 00:49:51,160
add a second GPU or upgrade.

1385
00:49:51,160 --> 00:49:53,880
If you sit at 20%, you have headroom for a larger model

1386
00:49:53,880 --> 00:49:54,920
or more users.

1387
00:49:54,920 --> 00:49:57,200
Let me talk about context window sizing in more detail

1388
00:49:57,200 --> 00:50:00,520
because this is where many local deployments fail silently.

1389
00:50:00,520 --> 00:50:02,920
You retrieve five chunks of 500 tokens each.

1390
00:50:02,920 --> 00:50:04,920
That's 2,500 tokens of context.

1391
00:50:04,920 --> 00:50:06,720
Your system prompt is 200 tokens.

1392
00:50:06,720 --> 00:50:08,440
Your user question is 50 tokens.

1393
00:50:08,440 --> 00:50:11,560
The model needs a few hundred tokens to generate the answer.

1394
00:50:11,560 --> 00:50:14,040
Total context is roughly 3,000 tokens.

1395
00:50:14,040 --> 00:50:17,240
If your model has a 4,000 token context window, this fits.

1396
00:50:17,240 --> 00:50:20,320
But it leaves no room for longer documents or more retrieved chunks.

1397
00:50:20,320 --> 00:50:23,320
If you want to retrieve 10 chunks or process longer contracts,

1398
00:50:23,320 --> 00:50:25,840
you need an 8,000 token window or larger.

1399
00:50:25,840 --> 00:50:28,040
Larger context windows require more VRM,

1400
00:50:28,040 --> 00:50:30,680
the KV cache, which stores the main and value matrices

1401
00:50:30,680 --> 00:50:32,920
for each token during attention computation,

1402
00:50:32,920 --> 00:50:35,560
grows linearly with context length.

1403
00:50:35,560 --> 00:50:37,560
A model with 70 billion parameters

1404
00:50:37,560 --> 00:50:41,200
and a 4,000 token context might need 30 gigabytes of VRM.

1405
00:50:41,200 --> 00:50:45,480
The same model with a 16,000 token context might need 45 gigabytes.

1406
00:50:45,480 --> 00:50:48,720
This is why hardware planning must account for your expected context size,

1407
00:50:48,720 --> 00:50:50,040
not just the model weights.

1408
00:50:50,040 --> 00:50:52,360
Quantization affects context window capacity.

1409
00:50:52,360 --> 00:50:56,080
The Q4 quantized model uses 4 bits per weight instead of 16.

1410
00:50:56,080 --> 00:50:58,760
This reduces VRM usage by roughly half for the weights.

1411
00:50:58,760 --> 00:51:01,040
But the KV cache isn't quantized by default.

1412
00:51:01,040 --> 00:51:02,280
It remains in full precision.

1413
00:51:02,280 --> 00:51:04,080
So even with aggressive quantization,

1414
00:51:04,080 --> 00:51:06,920
the context window is still the limiting factor for memory.

1415
00:51:06,920 --> 00:51:10,240
Some advanced inference engines now support KV cache quantization,

1416
00:51:10,240 --> 00:51:12,000
which can reduce memory further.

1417
00:51:12,000 --> 00:51:14,600
But this is bleeding edge and may affect accuracy,

1418
00:51:14,600 --> 00:51:16,120
test thoroughly before deploying.

1419
00:51:16,120 --> 00:51:17,680
For production, I recommend starting

1420
00:51:17,680 --> 00:51:19,920
with an 8,000 token context window.

1421
00:51:19,920 --> 00:51:23,360
It provides enough room for five to seven chunks of 500 tokens each,

1422
00:51:23,360 --> 00:51:25,360
plus system prompt and user question.

1423
00:51:25,360 --> 00:51:27,080
If your documents are unusually long,

1424
00:51:27,080 --> 00:51:29,640
or your queries require cross-document synthesis,

1425
00:51:29,640 --> 00:51:31,160
increase to 16,000.

1426
00:51:31,160 --> 00:51:33,160
But don't increase beyond what your hardware can serve

1427
00:51:33,160 --> 00:51:34,640
with acceptable latency.

1428
00:51:34,640 --> 00:51:37,200
The LLM runtime is the crown jewel of your architecture.

1429
00:51:37,200 --> 00:51:38,960
It's also the most resource-hungry,

1430
00:51:38,960 --> 00:51:40,880
plan your hardware around it, everything else,

1431
00:51:40,880 --> 00:51:42,880
the ingestion service, the chunking engine,

1432
00:51:42,880 --> 00:51:44,600
the vector database, the embedding model,

1433
00:51:44,600 --> 00:51:47,280
can run on CPU or share GPU resources.

1434
00:51:47,280 --> 00:51:50,280
The LLM needs dedicated VRM and fast memory bandwidth

1435
00:51:50,280 --> 00:51:52,240
don't under-provision it.

1436
00:51:52,240 --> 00:51:55,120
Now for the interface that brings it all together.

1437
00:51:55,120 --> 00:51:56,400
The query interface.

1438
00:51:56,400 --> 00:51:59,360
A local brain without a face is just an API endpoint.

1439
00:51:59,360 --> 00:52:01,760
Your team members need to ask questions in natural language

1440
00:52:01,760 --> 00:52:05,120
and get grounded, cited answers, without technical friction.

1441
00:52:05,120 --> 00:52:08,240
The query interface is where sovereignty meets usability.

1442
00:52:08,240 --> 00:52:09,920
Build a minimalist web interface.

1443
00:52:09,920 --> 00:52:11,280
It doesn't need to be elaborate.

1444
00:52:11,280 --> 00:52:14,400
A text input, a submit button, a response area,

1445
00:52:14,400 --> 00:52:16,520
and citation links authenticate users

1446
00:52:16,520 --> 00:52:19,400
through Microsoft Entra ID using the same credentials

1447
00:52:19,400 --> 00:52:20,760
they use for SharePoint.

1448
00:52:20,760 --> 00:52:23,240
This provides single sign-on and ensures

1449
00:52:23,240 --> 00:52:25,880
that user identity is known for permission filtering.

1450
00:52:25,880 --> 00:52:27,240
The query flow is straightforward.

1451
00:52:27,240 --> 00:52:28,400
The user types a question.

1452
00:52:28,400 --> 00:52:30,880
The interface sends the question to your local embedding model.

1453
00:52:30,880 --> 00:52:32,880
The embedding model returns a vector.

1454
00:52:32,880 --> 00:52:35,200
The interface queries queuedrand with that vector

1455
00:52:35,200 --> 00:52:37,160
filtered by the user's permission level.

1456
00:52:37,160 --> 00:52:39,880
Queuedrand returns a top five most relevant chunks.

1457
00:52:39,880 --> 00:52:41,280
The interface constructs a prompt

1458
00:52:41,280 --> 00:52:43,760
containing the system instructions, the retrieved chunks,

1459
00:52:43,760 --> 00:52:44,760
and the user question.

1460
00:52:44,760 --> 00:52:46,400
It sends this prompt to Olamma.

1461
00:52:46,400 --> 00:52:47,840
Olamma generates a response.

1462
00:52:47,840 --> 00:52:49,480
The interface displays the response

1463
00:52:49,480 --> 00:52:53,240
alongside citations linking back to the specific SharePoint documents.

1464
00:52:53,240 --> 00:52:54,400
Citations aren't optional.

1465
00:52:54,400 --> 00:52:56,680
They're the mechanism by which users verify answers

1466
00:52:56,680 --> 00:52:58,480
and auditors trace decisions.

1467
00:52:58,480 --> 00:53:00,600
Every answer must show the source document name,

1468
00:53:00,600 --> 00:53:03,360
the library name, and the last modified date.

1469
00:53:03,360 --> 00:53:06,480
Ideally, the citation is a clickable link to the SharePoint document.

1470
00:53:06,480 --> 00:53:08,160
If the document is in SharePoint online,

1471
00:53:08,160 --> 00:53:09,800
the link opens in the browser.

1472
00:53:09,800 --> 00:53:12,840
If it's on-premises, the link opens in the local SharePoint interface.

1473
00:53:12,840 --> 00:53:15,560
The user can verify that the answer matches the source.

1474
00:53:15,560 --> 00:53:17,520
Permission enforcement happens at two points.

1475
00:53:17,520 --> 00:53:19,320
First, the vector database filters

1476
00:53:19,320 --> 00:53:21,200
by permission level during retrieval.

1477
00:53:21,200 --> 00:53:23,360
If a chunk requires executive access,

1478
00:53:23,360 --> 00:53:26,320
and the user is standard, queuedrand doesn't return it.

1479
00:53:26,320 --> 00:53:28,440
Second, the query interface should verify

1480
00:53:28,440 --> 00:53:31,280
the user's group membership before constructing the prompt.

1481
00:53:31,280 --> 00:53:32,600
This is defense in depth.

1482
00:53:32,600 --> 00:53:34,480
Even if the vector database is misconfigured

1483
00:53:34,480 --> 00:53:36,360
and returned an unauthorized chunk,

1484
00:53:36,360 --> 00:53:38,000
the interface would discard it.

1485
00:53:38,000 --> 00:53:39,920
Immutile's guidance on securing RAC systems

1486
00:53:39,920 --> 00:53:42,240
emphasizes this three-layer security model.

1487
00:53:42,240 --> 00:53:45,240
The storage tier is SharePoint with its native access controls.

1488
00:53:45,240 --> 00:53:48,240
The data tier is the vector database with metadata filtering.

1489
00:53:48,240 --> 00:53:49,960
The prompt tier is the query interface

1490
00:53:49,960 --> 00:53:52,600
with user authentication and output validation.

1491
00:53:52,600 --> 00:53:54,680
At each layer, organizations must enforce

1492
00:53:54,680 --> 00:53:57,520
changing access controls, monitor and audit queries

1493
00:53:57,520 --> 00:54:01,160
and validate outputs to avoid both data leakage and hallucinations.

1494
00:54:01,160 --> 00:54:02,640
Output validation is worth mentioning.

1495
00:54:02,640 --> 00:54:05,760
The LLM can still hallucinate even with perfect retrieval.

1496
00:54:05,760 --> 00:54:07,360
It might misinterpret a chunk,

1497
00:54:07,360 --> 00:54:10,240
synthesize two unrelated chunks into a false connection,

1498
00:54:10,240 --> 00:54:12,360
or ignore the system prompt and speculate.

1499
00:54:12,360 --> 00:54:15,200
The query interface should implement basic sanity checks.

1500
00:54:15,200 --> 00:54:17,560
If the answer contains unsupported phrases like,

1501
00:54:17,560 --> 00:54:20,040
"I think" or "it seems" "flagged."

1502
00:54:20,040 --> 00:54:22,640
If the answer contradicts a retrieved chunk, "flagged."

1503
00:54:22,640 --> 00:54:25,600
These checks aren't foolproof, but they catch obvious errors.

1504
00:54:25,600 --> 00:54:27,480
The interface should also log every query,

1505
00:54:27,480 --> 00:54:29,920
every retrieval result, every generated answer

1506
00:54:29,920 --> 00:54:31,160
and every user action.

1507
00:54:31,160 --> 00:54:33,320
Logs stay local, they're your audit trail.

1508
00:54:33,320 --> 00:54:35,600
If a user claims the AI gave them bad advice,

1509
00:54:35,600 --> 00:54:37,960
you can reconstruct exactly what chunks were retrieved

1510
00:54:37,960 --> 00:54:39,200
and what prompt was sent.

1511
00:54:39,200 --> 00:54:41,680
This is governance, and governance is what makes a demo

1512
00:54:41,680 --> 00:54:43,560
into a production system.

1513
00:54:43,560 --> 00:54:46,240
Microsoft 365 co-pilot search API,

1514
00:54:46,240 --> 00:54:48,600
currently in preview, offers a useful benchmark.

1515
00:54:48,600 --> 00:54:50,600
It performs hybrid semantic and lexical search

1516
00:54:50,600 --> 00:54:53,520
over work content and returns relevant documents.

1517
00:54:53,520 --> 00:54:55,160
You can compare your local rag results

1518
00:54:55,160 --> 00:54:57,960
against co-pilot search to evaluate coverage.

1519
00:54:57,960 --> 00:55:00,280
If your rag finds documents that co-pilot misses,

1520
00:55:00,280 --> 00:55:01,240
you have an advantage.

1521
00:55:01,240 --> 00:55:03,560
If co-pilot finds documents your rag misses,

1522
00:55:03,560 --> 00:55:04,880
you have a tuning problem.

1523
00:55:04,880 --> 00:55:06,640
Use this comparison during development.

1524
00:55:06,640 --> 00:55:09,640
Disable it in production because it calls a cloud API.

1525
00:55:09,640 --> 00:55:12,920
The query interface is where users experience the architecture.

1526
00:55:12,920 --> 00:55:14,400
If it's slow, they won't use it.

1527
00:55:14,400 --> 00:55:16,240
If it's inaccurate, they won't trust it.

1528
00:55:16,240 --> 00:55:19,160
If it's ugly, they will tolerate it if the answers are good.

1529
00:55:19,160 --> 00:55:21,360
Focus on latency and accuracy first.

1530
00:55:21,360 --> 00:55:22,680
Polish the interface later.

1531
00:55:22,680 --> 00:55:25,520
Let me describe what a good query interface looks like in practice.

1532
00:55:25,520 --> 00:55:27,240
The user opens an internal web page.

1533
00:55:27,240 --> 00:55:29,400
They see a simple text box with placeholder text

1534
00:55:29,400 --> 00:55:32,040
like ask about our policies, procedures, or documentation.

1535
00:55:32,040 --> 00:55:32,920
They type a question.

1536
00:55:32,920 --> 00:55:35,200
What is the procedure for requesting remote work?

1537
00:55:35,200 --> 00:55:36,040
They hit enter.

1538
00:55:36,040 --> 00:55:38,400
Within two seconds, they see a loading indicator.

1539
00:55:38,400 --> 00:55:40,040
Within five seconds, they see an answer.

1540
00:55:40,040 --> 00:55:41,440
The answer isn't a wall of text.

1541
00:55:41,440 --> 00:55:42,560
It's a short paragraph.

1542
00:55:42,560 --> 00:55:45,440
Remote work requests must be submitted through the HR portal

1543
00:55:45,440 --> 00:55:47,480
at least 10 business days in advance.

1544
00:55:47,480 --> 00:55:49,160
Your manager must approve the request.

1545
00:55:49,160 --> 00:55:51,320
If approved for more than three consecutive days,

1546
00:55:51,320 --> 00:55:53,720
the request requires director-level sign-off.

1547
00:55:53,720 --> 00:55:55,560
Below the answer are citations.

1548
00:55:55,560 --> 00:55:57,200
Source remote work policy.

1549
00:55:57,200 --> 00:56:01,400
Doc X, HR library, last modified March 15, 2026.

1550
00:56:01,400 --> 00:56:04,760
Source, manager handbook.x, leadership library,

1551
00:56:04,760 --> 00:56:07,600
last modified January 8, 2026.

1552
00:56:07,600 --> 00:56:09,280
The user can click any citation

1553
00:56:09,280 --> 00:56:11,280
to open the source document in SharePoint.

1554
00:56:11,280 --> 00:56:14,040
This is the user experience that makes adoption happen.

1555
00:56:14,040 --> 00:56:16,280
It's fast, it's grounded, it's verifiable,

1556
00:56:16,280 --> 00:56:18,600
and it respects the user's existing SharePoint permissions.

1557
00:56:18,600 --> 00:56:21,080
If the user doesn't have access to the leadership library,

1558
00:56:21,080 --> 00:56:23,000
the manager handbook citation doesn't appear.

1559
00:56:23,000 --> 00:56:24,120
The answer is still useful

1560
00:56:24,120 --> 00:56:27,200
because the remote work policy chunk contains enough information.

1561
00:56:27,200 --> 00:56:28,960
But the user can't access material

1562
00:56:28,960 --> 00:56:30,600
they're not authorized to see.

1563
00:56:30,600 --> 00:56:33,440
Error handling in the query interface must be graceful.

1564
00:56:33,440 --> 00:56:36,400
If the vector database is down, show a message like,

1565
00:56:36,400 --> 00:56:38,800
the knowledge base is temporarily unavailable.

1566
00:56:38,800 --> 00:56:40,360
Please try again in a few minutes.

1567
00:56:40,360 --> 00:56:41,760
Don't expose stack traces.

1568
00:56:41,760 --> 00:56:43,680
Don't expose internal service names.

1569
00:56:43,680 --> 00:56:45,680
Don't expose the fact that you're running Q-drand

1570
00:56:45,680 --> 00:56:48,560
on a server named GPU server 01.

1571
00:56:48,560 --> 00:56:50,920
These details help attackers and confuse users.

1572
00:56:50,920 --> 00:56:53,680
If the LLM runtime is overloaded, implement a Q.

1573
00:56:53,680 --> 00:56:55,120
The user submits a question.

1574
00:56:55,120 --> 00:56:56,960
The interface shows position in Q.

1575
00:56:56,960 --> 00:56:59,160
When the GPU is free, the query processes.

1576
00:56:59,160 --> 00:57:01,680
For most organizations, this Q is rarely needed

1577
00:57:01,680 --> 00:57:03,880
because local GPU inference is fast enough

1578
00:57:03,880 --> 00:57:04,800
for interactive use.

1579
00:57:04,800 --> 00:57:06,440
But if you have 100 concurrent users,

1580
00:57:06,440 --> 00:57:08,600
Qing prevents the system from crashing.

1581
00:57:08,600 --> 00:57:11,400
If the LLM generates an answer that fails validation,

1582
00:57:11,400 --> 00:57:13,640
for example, it contains speculative language,

1583
00:57:13,640 --> 00:57:16,360
unsupported by the context, flag it for review,

1584
00:57:16,360 --> 00:57:18,320
show the user the answer with a disclaimer.

1585
00:57:18,320 --> 00:57:19,840
This answer may contain information

1586
00:57:19,840 --> 00:57:21,520
not found in the source documents.

1587
00:57:21,520 --> 00:57:22,960
Please verify before acting.

1588
00:57:22,960 --> 00:57:24,560
This isn't ideal, but it's better

1589
00:57:24,560 --> 00:57:26,760
than presenting hallucinations as facts.

1590
00:57:26,760 --> 00:57:29,120
The query interface should also support feedback.

1591
00:57:29,120 --> 00:57:30,800
Thumbs up, thumbs down.

1592
00:57:30,800 --> 00:57:33,520
A text box for explaining why the answer was wrong.

1593
00:57:33,520 --> 00:57:36,080
This feedback feeds into your evaluation pipeline.

1594
00:57:36,080 --> 00:57:38,240
You review thumbs down responses weekly.

1595
00:57:38,240 --> 00:57:40,880
You identify common failure modes, bad chunking,

1596
00:57:40,880 --> 00:57:43,360
missing documents, hallucinated citations,

1597
00:57:43,360 --> 00:57:44,360
and you fix them.

1598
00:57:44,360 --> 00:57:46,960
This feedback loop is how the system improves over time

1599
00:57:46,960 --> 00:57:48,400
without retraining models.

1600
00:57:48,400 --> 00:57:50,800
But here is what most proof of concepts ignore.

1601
00:57:50,800 --> 00:57:52,000
They build a working pipeline,

1602
00:57:52,000 --> 00:57:54,560
they demonstrate a good answer, and they declare victory.

1603
00:57:54,560 --> 00:57:56,000
The real work starts after that.

1604
00:57:56,000 --> 00:57:57,720
Permission tiers and access control.

1605
00:57:57,720 --> 00:58:00,280
SharePoint already has role-based access control.

1606
00:58:00,280 --> 00:58:03,120
Your AI must mirror it exactly, not approximately,

1607
00:58:03,120 --> 00:58:04,800
not eventually, exactly.

1608
00:58:04,800 --> 00:58:06,160
Every library has permissions.

1609
00:58:06,160 --> 00:58:08,480
Every document inherits or overrides them.

1610
00:58:08,480 --> 00:58:10,600
Every user has an effective permission level,

1611
00:58:10,600 --> 00:58:13,480
determined by their group membership, direct grants,

1612
00:58:13,480 --> 00:58:14,800
and denied permissions.

1613
00:58:14,800 --> 00:58:17,520
Your vector database must respect the same matrix.

1614
00:58:17,520 --> 00:58:19,720
The naive approach is to build a single vector index

1615
00:58:19,720 --> 00:58:22,640
for the entire organization and filter at the application layer.

1616
00:58:22,640 --> 00:58:23,480
This fails.

1617
00:58:23,480 --> 00:58:26,880
Because application layer filters can be bypassed by bugs.

1618
00:58:26,880 --> 00:58:29,240
It fails because a single compromised query interface

1619
00:58:29,240 --> 00:58:31,080
exposes the entire index.

1620
00:58:31,080 --> 00:58:33,200
And it fails because it doesn't scale to find

1621
00:58:33,200 --> 00:58:35,520
grained permissions like document level access control.

1622
00:58:35,520 --> 00:58:37,400
The correct approach is to tag every vector

1623
00:58:37,400 --> 00:58:40,080
with its required permission level at ingestion time.

1624
00:58:40,080 --> 00:58:42,920
When the ingestion service processes a document from SharePoint,

1625
00:58:42,920 --> 00:58:45,160
it queries the SharePoint API for the documents

1626
00:58:45,160 --> 00:58:46,400
effective permissions.

1627
00:58:46,400 --> 00:58:48,160
It maps those permissions to a permission tier,

1628
00:58:48,160 --> 00:58:50,600
executive, manager, standard, public,

1629
00:58:50,600 --> 00:58:52,640
or whatever taxonomy your organization uses.

1630
00:58:52,640 --> 00:58:56,000
It stores that tier in the vectors metadata payload.

1631
00:58:56,000 --> 00:58:58,080
At query time, the user's permission tier

1632
00:58:58,080 --> 00:59:01,280
is determined by their Microsoft EntraID group membership.

1633
00:59:01,280 --> 00:59:03,400
The query interface passes this tier to QDRIND

1634
00:59:03,400 --> 00:59:04,560
as a filter condition.

1635
00:59:04,560 --> 00:59:07,520
QDRIND only searches vectors where the permission tier is less

1636
00:59:07,520 --> 00:59:10,480
than or equal to the user's tier.

1637
00:59:10,480 --> 00:59:12,440
A standard user searching for budget information

1638
00:59:12,440 --> 00:59:14,520
doesn't see executive budget documents.

1639
00:59:14,520 --> 00:59:18,200
A manager searching for HR policies sees manager level policies,

1640
00:59:18,200 --> 00:59:20,320
but not executive compensation details.

1641
00:59:20,320 --> 00:59:22,520
NIST defines a role-based access control

1642
00:59:22,520 --> 00:59:24,400
as enforcing three rules.

1643
00:59:24,400 --> 00:59:26,320
Role assignment every user must be assigned a role.

1644
00:59:26,320 --> 00:59:29,240
Role authorization every role must be authorized for the user.

1645
00:59:29,240 --> 00:59:31,080
Permission authorization every permission

1646
00:59:31,080 --> 00:59:32,680
must be authorized for the role.

1647
00:59:32,680 --> 00:59:35,640
In our architecture, role assignment happens in EntraID.

1648
00:59:35,640 --> 00:59:38,160
Role authorization happens when the query interface

1649
00:59:38,160 --> 00:59:40,520
resolves the user's groups to permission tiers.

1650
00:59:40,520 --> 00:59:42,680
Permission authorization happens when QDRIND filters

1651
00:59:42,680 --> 00:59:44,480
by tier during vector search.

1652
00:59:44,480 --> 00:59:45,600
This isn't a new concept.

1653
00:59:45,600 --> 00:59:47,560
It's our back applied to vector databases.

1654
00:59:47,560 --> 00:59:50,000
The novelty is that most drag implementations ignore it.

1655
00:59:50,000 --> 00:59:51,360
They build a single collection.

1656
00:59:51,360 --> 00:59:53,040
They search everything.

1657
00:59:53,040 --> 00:59:55,680
And they hope the LLM doesn't say something sensitive.

1658
00:59:55,680 --> 00:59:56,480
That's not security.

1659
00:59:56,480 --> 00:59:58,160
That's wishful thinking.

1660
00:59:58,160 --> 01:00:00,040
Microsoft purview data loss prevention

1661
01:00:00,040 --> 01:00:01,400
provides an additional layer.

1662
01:00:01,400 --> 01:00:03,560
DLP policies in SharePoint can block documents

1663
01:00:03,560 --> 01:00:05,840
from leaving the organization, but in our architecture,

1664
01:00:05,840 --> 01:00:06,920
documents never leave.

1665
01:00:06,920 --> 01:00:09,280
They're read by the ingestion service inside the perimeter

1666
01:00:09,280 --> 01:00:11,960
and converted into vectors that also stay inside.

1667
01:00:11,960 --> 01:00:14,480
DLP policies should still monitor the ingestion services

1668
01:00:14,480 --> 01:00:16,560
API calls to ensure it doesn't accidentally

1669
01:00:16,560 --> 01:00:18,360
forward content to external endpoints.

1670
01:00:18,360 --> 01:00:19,600
This is belt and suspenders.

1671
01:00:19,600 --> 01:00:21,800
The architecture prevents leakage by design.

1672
01:00:21,800 --> 01:00:23,960
DLP detects leakage if the design fails.

1673
01:00:23,960 --> 01:00:25,560
Audit logging is needed for compliance.

1674
01:00:25,560 --> 01:00:28,000
Every query must be logged with the user identity,

1675
01:00:28,000 --> 01:00:30,680
timestamp, query text, retrieve chunks,

1676
01:00:30,680 --> 01:00:32,760
generated answer and citation list.

1677
01:00:32,760 --> 01:00:35,480
These logs prove that the system is behaving correctly.

1678
01:00:35,480 --> 01:00:37,400
They support investigations if a user claims

1679
01:00:37,400 --> 01:00:39,360
they received an unauthorized answer.

1680
01:00:39,360 --> 01:00:41,200
And they demonstrate compliance to auditors.

1681
01:00:41,200 --> 01:00:44,320
Logs should be stored locally, not in a cloud logging service,

1682
01:00:44,320 --> 01:00:45,720
not in a shared SaaS platform.

1683
01:00:45,720 --> 01:00:47,400
In a local log aggregation system,

1684
01:00:47,400 --> 01:00:49,320
like the elastic stack or grapharnaloki,

1685
01:00:49,320 --> 01:00:52,080
retain them according to your organization's retention policy,

1686
01:00:52,080 --> 01:00:53,880
secure them with the same R-back that governs

1687
01:00:53,880 --> 01:00:54,920
the rest of the system.

1688
01:00:54,920 --> 01:00:56,960
And review them periodically for anomalies.

1689
01:00:56,960 --> 01:00:59,560
Permission synchronization is an operational challenge.

1690
01:00:59,560 --> 01:01:01,240
SharePoint permissions change.

1691
01:01:01,240 --> 01:01:02,920
Users move between departments.

1692
01:01:02,920 --> 01:01:05,840
Groups are reorganized, documents are reclassified.

1693
01:01:05,840 --> 01:01:08,280
Your vector database must reflect these changes.

1694
01:01:08,280 --> 01:01:10,360
The ingestion service should periodically re-scan

1695
01:01:10,360 --> 01:01:12,800
document permissions and update vector metadata.

1696
01:01:12,800 --> 01:01:15,360
The query interface should refresh user group membership

1697
01:01:15,360 --> 01:01:17,400
on every log in or at least every session.

1698
01:01:17,400 --> 01:01:19,800
And you should run a full permission audit quarterly

1699
01:01:19,800 --> 01:01:20,880
to catch drift.

1700
01:01:20,880 --> 01:01:22,880
If a document's permission level increases,

1701
01:01:22,880 --> 01:01:24,560
meaning fewer users should see it,

1702
01:01:24,560 --> 01:01:27,040
the ingestion service must update the vector metadata

1703
01:01:27,040 --> 01:01:28,240
immediately.

1704
01:01:28,240 --> 01:01:30,160
If a document's permission level decreases,

1705
01:01:30,160 --> 01:01:33,160
meaning more users should see it, the update can be batched.

1706
01:01:33,160 --> 01:01:35,680
The risk of temporarily withholding accessible information

1707
01:01:35,680 --> 01:01:38,240
is lower than the risk of temporarily exposing

1708
01:01:38,240 --> 01:01:39,280
restricted information.

1709
01:01:39,280 --> 01:01:40,040
This is governance.

1710
01:01:40,040 --> 01:01:41,120
It's not glamorous.

1711
01:01:41,120 --> 01:01:43,040
But it's what separates a proof of concept

1712
01:01:43,040 --> 01:01:45,160
from a system your legal team will approve.

1713
01:01:45,160 --> 01:01:47,400
Let me give you a specific example of permission mapping

1714
01:01:47,400 --> 01:01:48,240
in practice.

1715
01:01:48,240 --> 01:01:50,200
Your SharePoint tenant has three libraries.

1716
01:01:50,200 --> 01:01:52,000
The public library contains employee handbooks

1717
01:01:52,000 --> 01:01:53,320
and IT guidelines.

1718
01:01:53,320 --> 01:01:55,240
The manager library contains team budgets

1719
01:01:55,240 --> 01:01:56,480
and hiring procedures.

1720
01:01:56,480 --> 01:01:59,520
The executive library contains board minutes and M&A strategy.

1721
01:01:59,520 --> 01:02:01,200
In EntraID, you have three groups.

1722
01:02:01,200 --> 01:02:02,880
All employees contains every user.

1723
01:02:02,880 --> 01:02:05,240
Managers contains users with direct reports.

1724
01:02:05,240 --> 01:02:07,520
Executives contains the C-suite and VPs.

1725
01:02:07,520 --> 01:02:09,480
The ingestion service tags every vector

1726
01:02:09,480 --> 01:02:12,000
from the public library with permission tier standard.

1727
01:02:12,000 --> 01:02:14,560
Every vector from the manager library with permission tier

1728
01:02:14,560 --> 01:02:15,400
manager.

1729
01:02:15,400 --> 01:02:17,160
Every vector from the executive library

1730
01:02:17,160 --> 01:02:19,080
with permission tier executive.

1731
01:02:19,080 --> 01:02:21,920
When a user in the all employees group queries the system,

1732
01:02:21,920 --> 01:02:25,000
the query interface resolves their tier as standard.

1733
01:02:25,000 --> 01:02:27,560
QDrand filters the search to vectors with tier standard.

1734
01:02:27,560 --> 01:02:29,360
The user sees employee handbook answers,

1735
01:02:29,360 --> 01:02:30,960
but not budget details.

1736
01:02:30,960 --> 01:02:32,960
When a user in the managers group queries,

1737
01:02:32,960 --> 01:02:34,960
their tier resolves as manager.

1738
01:02:34,960 --> 01:02:37,920
QDrand searches vectors with tier standard or manager.

1739
01:02:37,920 --> 01:02:39,880
They see handbooks and hiring procedures.

1740
01:02:39,880 --> 01:02:41,880
When an executive queries, QDrand searches

1741
01:02:41,880 --> 01:02:42,720
all tiers.

1742
01:02:42,720 --> 01:02:43,640
They see everything.

1743
01:02:43,640 --> 01:02:45,880
This is simple, auditable, and matched with SharePoints

1744
01:02:45,880 --> 01:02:47,040
native permissions.

1745
01:02:47,040 --> 01:02:48,280
Edge cases exist.

1746
01:02:48,280 --> 01:02:51,000
Consider a user who moves from engineering to sales.

1747
01:02:51,000 --> 01:02:52,880
Their EntraID group membership changes.

1748
01:02:52,880 --> 01:02:55,480
The query interface picks up the new groups on next login.

1749
01:02:55,480 --> 01:02:57,960
But if they had access to sensitive engineering documents

1750
01:02:57,960 --> 01:02:59,800
yesterday and shouldn't see them today,

1751
01:02:59,800 --> 01:03:01,800
the query interface must refresh group membership

1752
01:03:01,800 --> 01:03:03,840
every session, not just at login.

1753
01:03:03,840 --> 01:03:06,280
And it should cache group membership for no more than one hour

1754
01:03:06,280 --> 01:03:08,240
to balance security against performance.

1755
01:03:08,240 --> 01:03:10,200
Consider a document with custom permissions,

1756
01:03:10,200 --> 01:03:11,640
a single file in the public library

1757
01:03:11,640 --> 01:03:13,320
that's restricted to the legal team.

1758
01:03:13,320 --> 01:03:16,040
The ingestion service must detect this exception.

1759
01:03:16,040 --> 01:03:19,440
It queries the SharePoint API for the document's effective permissions.

1760
01:03:19,440 --> 01:03:21,640
It sees that only the legal group has access.

1761
01:03:21,640 --> 01:03:23,080
It tags the vector with permission

1762
01:03:23,080 --> 01:03:24,800
to your legal instead of standard.

1763
01:03:24,800 --> 01:03:26,360
This requires the ingestion service

1764
01:03:26,360 --> 01:03:29,320
to check permissions per document, not just per library.

1765
01:03:29,320 --> 01:03:31,240
It's slower, but it's accurate.

1766
01:03:31,240 --> 01:03:32,760
And accuracy is the point.

1767
01:03:32,760 --> 01:03:34,960
Consider inherited permissions that break.

1768
01:03:34,960 --> 01:03:36,280
A library inherits from the site.

1769
01:03:36,280 --> 01:03:37,760
A document inherits from the library.

1770
01:03:37,760 --> 01:03:39,520
Then someone breaks inheritance on the document

1771
01:03:39,520 --> 01:03:41,600
and grants access to an individual user.

1772
01:03:41,600 --> 01:03:44,080
Your ingestion service must detect the broken inheritance

1773
01:03:44,080 --> 01:03:45,080
and map it correctly.

1774
01:03:45,080 --> 01:03:46,000
This is complex.

1775
01:03:46,000 --> 01:03:48,600
But SharePoint's API returns effective permissions

1776
01:03:48,600 --> 01:03:50,440
that account for inheritance breaks.

1777
01:03:50,440 --> 01:03:55,120
Trust the API, map the result, and audit periodically.

1778
01:03:55,120 --> 01:03:58,120
Permission drift is the silent killer of secure RRAG.

1779
01:03:58,120 --> 01:04:01,120
A document moves from the manager library to the public library.

1780
01:04:01,120 --> 01:04:02,880
The ingestion service detects the move.

1781
01:04:02,880 --> 01:04:04,840
It updates the vector metadata.

1782
01:04:04,840 --> 01:04:06,880
But if the service is down during the move,

1783
01:04:06,880 --> 01:04:09,480
the vector retains its old permission tag,

1784
01:04:09,480 --> 01:04:11,720
a standard user might see manager-level content.

1785
01:04:11,720 --> 01:04:14,440
To prevent this, run a full permission reconciliation weekly.

1786
01:04:14,440 --> 01:04:15,640
Rescan all vectors.

1787
01:04:15,640 --> 01:04:17,840
Compare their permission tags against current SharePoint

1788
01:04:17,840 --> 01:04:21,600
permissions, flag mismatches, and fix them.

1789
01:04:21,600 --> 01:04:23,920
But the document base never stays small.

1790
01:04:23,920 --> 01:04:25,800
Scaling and performance tuning.

1791
01:04:25,800 --> 01:04:27,480
A few hundred documents is trivial.

1792
01:04:27,480 --> 01:04:29,360
10,000 documents with daily updates

1793
01:04:29,360 --> 01:04:30,840
is a different problem.

1794
01:04:30,840 --> 01:04:33,360
The architecture scales, but only if you tune it.

1795
01:04:33,360 --> 01:04:35,480
Untuned systems degrade gracefully at first,

1796
01:04:35,480 --> 01:04:38,160
and then suddenly query latency creeps up.

1797
01:04:38,160 --> 01:04:41,280
Embedding throughput drops, GPU memory fills.

1798
01:04:41,280 --> 01:04:43,960
And users start complaining that the AI is slow.

1799
01:04:43,960 --> 01:04:45,880
GPU batching for embedding generation

1800
01:04:45,880 --> 01:04:47,520
is your first optimization.

1801
01:04:47,520 --> 01:04:49,640
Embedding models process chunks in parallel.

1802
01:04:49,640 --> 01:04:53,280
A batch of 64 chunks isn't 64 times slower than a batch of one.

1803
01:04:53,280 --> 01:04:54,760
It's perhaps 10 times slower.

1804
01:04:54,760 --> 01:04:57,360
Batching amortizes the overhead of loading the model

1805
01:04:57,360 --> 01:04:59,560
and transferring data to GPU memory.

1806
01:04:59,560 --> 01:05:02,520
Use the largest batch size that fits in your GPU memory.

1807
01:05:02,520 --> 01:05:04,960
For all Mini-LM on a 24 gigabyte GPU,

1808
01:05:04,960 --> 01:05:06,800
you can batch thousands of chunks.

1809
01:05:06,800 --> 01:05:11,680
For BGE large, the batch size is smaller, experiment and measure.

1810
01:05:11,680 --> 01:05:14,680
Incremental Delta indexing replaces full re-indexing.

1811
01:05:14,680 --> 01:05:16,000
After the initial index is built,

1812
01:05:16,000 --> 01:05:17,880
you only process change documents.

1813
01:05:17,880 --> 01:05:20,680
SharePoints changes API returns modified items

1814
01:05:20,680 --> 01:05:21,800
since a timestamp.

1815
01:05:21,800 --> 01:05:23,520
Your ingestion service stores a watermark

1816
01:05:23,520 --> 01:05:25,320
and processes only the Delta.

1817
01:05:25,320 --> 01:05:27,640
This keeps indexing time proportional to change volume,

1818
01:05:27,640 --> 01:05:28,880
not total volume.

1819
01:05:28,880 --> 01:05:31,760
A 10,000 document library with 50 daily changes

1820
01:05:31,760 --> 01:05:34,080
takes minutes to update, not hours.

1821
01:05:34,080 --> 01:05:36,600
Multi-collection sharding by department or document type

1822
01:05:36,600 --> 01:05:38,400
reduces index size per collection.

1823
01:05:38,400 --> 01:05:41,880
QDrand searches a single collection in parallel across segments.

1824
01:05:41,880 --> 01:05:45,680
But if a collection grows too large, search slows.

1825
01:05:45,680 --> 01:05:48,280
Splitting into smaller collections, one per department

1826
01:05:48,280 --> 01:05:51,200
or one per major library keeps each collection fast.

1827
01:05:51,200 --> 01:05:54,240
The query interface roots the search to the appropriate collection

1828
01:05:54,240 --> 01:05:56,360
based on the user's query context.

1829
01:05:56,360 --> 01:05:58,560
Or it searches multiple collections in parallel

1830
01:05:58,560 --> 01:05:59,800
and merges the results.

1831
01:05:59,800 --> 01:06:02,200
Caching frequent queries reduces redundant work.

1832
01:06:02,200 --> 01:06:04,040
Some questions get asked repeatedly.

1833
01:06:04,040 --> 01:06:05,400
What is our vacation policy?

1834
01:06:05,400 --> 01:06:07,080
How do I submit an expense report?

1835
01:06:07,080 --> 01:06:08,960
Cache the query vector, the retrieved chunks

1836
01:06:08,960 --> 01:06:10,280
and the generated answer.

1837
01:06:10,280 --> 01:06:14,080
Serve the cached answer for identical or near identical queries.

1838
01:06:14,080 --> 01:06:16,880
Invalidate the cache when the source documents change.

1839
01:06:16,880 --> 01:06:19,280
This can eliminate 50% or more of LLM calls

1840
01:06:19,280 --> 01:06:20,680
for FAQ style queries.

1841
01:06:20,680 --> 01:06:22,440
Monitor vector DB query latency.

1842
01:06:22,440 --> 01:06:25,640
QDrand exposes Prometheus metrics, track P50, P95,

1843
01:06:25,640 --> 01:06:26,920
and P99 latency.

1844
01:06:26,920 --> 01:06:30,040
If P95 exceeds 200 milliseconds, investigate.

1845
01:06:30,040 --> 01:06:32,960
Increase EF and rebuild the index with higher EF construction

1846
01:06:32,960 --> 01:06:35,520
add query replicas or chart the collection.

1847
01:06:35,520 --> 01:06:37,400
If P99 spikes, you may have a hot chart

1848
01:06:37,400 --> 01:06:39,040
where one segment is overloaded.

1849
01:06:39,040 --> 01:06:41,040
QDrand can rebalance segments automatically

1850
01:06:41,040 --> 01:06:43,040
but you should verify it's doing so.

1851
01:06:43,040 --> 01:06:45,320
For embedding throughput, schedule re-indexing

1852
01:06:45,320 --> 01:06:46,520
during off-peak hours.

1853
01:06:46,520 --> 01:06:48,360
Users query during business hours.

1854
01:06:48,360 --> 01:06:51,080
The ingestion service indexes during nights and weekends.

1855
01:06:51,080 --> 01:06:53,440
This separation prevents resource contention.

1856
01:06:53,440 --> 01:06:55,360
If your organization operates globally,

1857
01:06:55,360 --> 01:06:57,640
define off-peak by region or run ingestion

1858
01:06:57,640 --> 01:06:59,080
on dedicated hardware.

1859
01:06:59,080 --> 01:07:02,480
CPU fallback for embedding is viable if GPU is saturated.

1860
01:07:02,480 --> 01:07:04,480
Modern sentence transformers run well on CPU.

1861
01:07:04,480 --> 01:07:06,520
An AMD EPIC or Intel Zeon processor

1862
01:07:06,520 --> 01:07:09,440
with many cores can embed hundreds of chunks per second.

1863
01:07:09,440 --> 01:07:12,240
It's slower than GPU but cheaper and more available.

1864
01:07:12,240 --> 01:07:15,120
If your GPU is fully utilized by LLM inference,

1865
01:07:15,120 --> 01:07:16,800
move embedding to CPU.

1866
01:07:16,800 --> 01:07:19,600
The latency increase is acceptable for background indexing.

1867
01:07:19,600 --> 01:07:21,680
Hardware scaling follows a simple pattern.

1868
01:07:21,680 --> 01:07:24,240
Start with one GPU server handling everything.

1869
01:07:24,240 --> 01:07:26,200
When LLM inference saturates the GPU,

1870
01:07:26,200 --> 01:07:28,280
add a second GPU dedicated to serving.

1871
01:07:28,280 --> 01:07:31,960
When embedding saturates add CPU workers or a second GPU,

1872
01:07:31,960 --> 01:07:33,960
when the vector database becomes a bottleneck,

1873
01:07:33,960 --> 01:07:36,640
run Q-drand on its own server with fast SSDs

1874
01:07:36,640 --> 01:07:37,760
and plenty of RAM.

1875
01:07:37,760 --> 01:07:39,800
The architecture is horizontally modular.

1876
01:07:39,800 --> 01:07:41,880
Each component can scale independently.

1877
01:07:41,880 --> 01:07:43,480
Cost reality is worth stating.

1878
01:07:43,480 --> 01:07:45,480
Local hardware has high upfront cost.

1879
01:07:45,480 --> 01:07:48,800
A server with an Nvidia A-164 gigabytes of RAM

1880
01:07:48,800 --> 01:07:52,240
and fast storage might cost $15,000 or more.

1881
01:07:52,240 --> 01:07:54,080
But the operating cost is predictable.

1882
01:07:54,080 --> 01:07:57,000
Electricity, maintenance, occasional upgrades,

1883
01:07:57,000 --> 01:08:00,080
there's no per token pricing, there's no usage surprise.

1884
01:08:00,080 --> 01:08:02,800
For an organization processing thousands of queries daily,

1885
01:08:02,800 --> 01:08:04,920
the break-even point against cloud API costs

1886
01:08:04,920 --> 01:08:07,320
often arrives within 12 to 18 months.

1887
01:08:07,320 --> 01:08:10,080
The 2026 total cost of ownership analysis

1888
01:08:10,080 --> 01:08:11,560
supports this directionally.

1889
01:08:11,560 --> 01:08:12,960
Beyond certain usage thresholds,

1890
01:08:12,960 --> 01:08:15,560
running open source models on dedicated GPU servers

1891
01:08:15,560 --> 01:08:18,360
becomes more cost-effective than per token API fees.

1892
01:08:18,360 --> 01:08:20,800
The exact threshold depends on your query volume,

1893
01:08:20,800 --> 01:08:23,280
model size, and hardware choices.

1894
01:08:23,280 --> 01:08:25,960
But the economics favor local deployment at scale.

1895
01:08:25,960 --> 01:08:28,200
Let me walk you through a capacity planning example.

1896
01:08:28,200 --> 01:08:30,760
Your organization has 50,000 SharePoint documents.

1897
01:08:30,760 --> 01:08:33,880
They generate roughly 200,000 chunks after chunking.

1898
01:08:33,880 --> 01:08:37,400
Each chunk is embedded into a 384-dimensional vector.

1899
01:08:37,400 --> 01:08:39,800
The total vector storage is roughly one gigabyte.

1900
01:08:39,800 --> 01:08:43,120
With Q-drand overhead and metadata, call it three gigabytes.

1901
01:08:43,120 --> 01:08:45,080
This fits comfortably on a single server.

1902
01:08:45,080 --> 01:08:48,040
Your users submit roughly 1,000 queries per day.

1903
01:08:48,040 --> 01:08:50,360
Each query embeds the question, searches Q-drand

1904
01:08:50,360 --> 01:08:51,560
and calls Olamma.

1905
01:08:51,560 --> 01:08:53,000
Embedding takes 10 milliseconds.

1906
01:08:53,000 --> 01:08:55,160
Q-drand search takes 50 milliseconds.

1907
01:08:55,160 --> 01:08:57,280
LLM generation takes three seconds.

1908
01:08:57,280 --> 01:09:00,120
Total latency is roughly 3.1 seconds per query.

1909
01:09:00,120 --> 01:09:03,000
1,000 queries per day is 42 queries per hour.

1910
01:09:03,000 --> 01:09:04,720
Your GPU is idle most of the time.

1911
01:09:04,720 --> 01:09:06,600
Now scale to 10,000 queries per day.

1912
01:09:06,600 --> 01:09:10,280
That's 417 per hour, still manageable on a single GPU.

1913
01:09:10,280 --> 01:09:12,040
But at 50,000 queries per day,

1914
01:09:12,040 --> 01:09:15,280
you're processing 2,000 per hour during business hours.

1915
01:09:15,280 --> 01:09:17,720
The GPU hits 80% utilization.

1916
01:09:17,720 --> 01:09:20,400
Query latency degrades from three seconds to six seconds.

1917
01:09:20,400 --> 01:09:21,520
Users notice, at this point,

1918
01:09:21,520 --> 01:09:24,520
you add a second GPU server dedicated to LLM inference.

1919
01:09:24,520 --> 01:09:25,920
You root queries round robin.

1920
01:09:25,920 --> 01:09:27,720
Latency returns to three seconds.

1921
01:09:27,720 --> 01:09:28,920
You have scaled horizontally.

1922
01:09:28,920 --> 01:09:31,520
Vector database scaling follows a different curve.

1923
01:09:31,520 --> 01:09:34,360
Search latency depends on vector count and index quality.

1924
01:09:34,360 --> 01:09:37,520
200,000 vectors search in under 100 milliseconds.

1925
01:09:37,520 --> 01:09:40,000
Two million vectors might take 200 milliseconds.

1926
01:09:40,000 --> 01:09:41,480
10 million might take half a second.

1927
01:09:41,480 --> 01:09:43,640
If your document base grows to a million documents,

1928
01:09:43,640 --> 01:09:45,320
consider shouting by department.

1929
01:09:45,320 --> 01:09:47,400
The legal collection contains only legal documents.

1930
01:09:47,400 --> 01:09:50,200
The HR collection contains only HR documents.

1931
01:09:50,200 --> 01:09:52,640
The query interface roots based on the user's department

1932
01:09:52,640 --> 01:09:54,640
or searches all collections in parallel.

1933
01:09:54,640 --> 01:09:56,600
Each collection stays small and fast.

1934
01:09:56,600 --> 01:09:58,400
Network bandwidth is rarely the bottleneck

1935
01:09:58,400 --> 01:09:59,400
in local deployments.

1936
01:09:59,400 --> 01:10:01,320
Your ingestion service talks to SharePoint

1937
01:10:01,320 --> 01:10:02,720
over your corporate network.

1938
01:10:02,720 --> 01:10:05,200
It talks to the vector database over the local LAN.

1939
01:10:05,200 --> 01:10:06,480
The latency is milliseconds.

1940
01:10:06,480 --> 01:10:07,800
The bandwidth is gigabits.

1941
01:10:07,800 --> 01:10:09,920
The bottleneck is compute, not network.

1942
01:10:09,920 --> 01:10:12,400
Focus your scaling budget on GPU and CPU,

1943
01:10:12,400 --> 01:10:13,840
not on network switches.

1944
01:10:13,840 --> 01:10:15,600
Storage scaling is predictable.

1945
01:10:15,600 --> 01:10:18,160
Model weights don't grow unless you upgrade models.

1946
01:10:18,160 --> 01:10:20,040
Vector storage grows with document count.

1947
01:10:20,040 --> 01:10:23,080
Lock storage grows with query volume, plan for log rotation.

1948
01:10:23,080 --> 01:10:25,720
Keep 30 days of detailed logs for troubleshooting.

1949
01:10:25,720 --> 01:10:27,800
Archive older logs to cold storage.

1950
01:10:27,800 --> 01:10:30,160
For a busy system, logs might consume hundreds

1951
01:10:30,160 --> 01:10:31,320
of gigabytes per month.

1952
01:10:31,320 --> 01:10:32,680
Don't let them fill your disk.

1953
01:10:32,680 --> 01:10:34,800
Scaling isn't about buying bigger hardware.

1954
01:10:34,800 --> 01:10:36,760
It's about understanding where the bottlenecks are

1955
01:10:36,760 --> 01:10:38,840
and eliminating them systematically.

1956
01:10:38,840 --> 01:10:41,280
Monitor everything, profile before optimizing,

1957
01:10:41,280 --> 01:10:43,280
and scale one component at a time.

1958
01:10:43,280 --> 01:10:45,720
This methodical approach prevents overspending on hardware

1959
01:10:45,720 --> 01:10:47,720
you don't need while ensuring you never hit a wall

1960
01:10:47,720 --> 01:10:48,800
you can't climb.

1961
01:10:48,800 --> 01:10:50,960
Load testing validates your capacity assumptions

1962
01:10:50,960 --> 01:10:52,440
before production deployment.

1963
01:10:52,440 --> 01:10:56,080
User tool like Locust or K6 to simulate concurrent users.

1964
01:10:56,080 --> 01:10:59,240
Start with 10 virtual users submitting queries every 30 seconds.

1965
01:10:59,240 --> 01:11:01,440
Monitor latency and GPU utilization

1966
01:11:01,440 --> 01:11:04,000
gradually increased to 50, then 100 users.

1967
01:11:04,000 --> 01:11:05,360
Identify the breaking point.

1968
01:11:05,360 --> 01:11:07,040
If the system degrades at 40 users,

1969
01:11:07,040 --> 01:11:09,760
you know your single GPU setup handles 30 comfortably.

1970
01:11:09,760 --> 01:11:12,080
Plan your hardware so your monitoring dashboard

1971
01:11:12,080 --> 01:11:14,440
should show four main metrics at a glance.

1972
01:11:14,440 --> 01:11:18,360
Query latency over the last hour with P50, P95, and P99 lines.

1973
01:11:18,360 --> 01:11:21,760
GPU utilization percentage with a red threshold at 80%.

1974
01:11:21,760 --> 01:11:25,120
Vector database query throughput in queries per second.

1975
01:11:25,120 --> 01:11:27,400
An ingestion Q-depth showing how many documents are waiting

1976
01:11:27,400 --> 01:11:28,440
to be processed.

1977
01:11:28,440 --> 01:11:31,040
These four metrics tell you whether your system is healthy,

1978
01:11:31,040 --> 01:11:32,480
stressed, or failing.

1979
01:11:32,480 --> 01:11:34,320
Everything else is detailed you drill into

1980
01:11:34,320 --> 01:11:36,160
when one of these four looks wrong.

1981
01:11:36,160 --> 01:11:38,880
Alerting should be proactive, not reactive.

1982
01:11:38,880 --> 01:11:42,160
Alert when P95 latency exceeds five seconds for 10 minutes.

1983
01:11:42,160 --> 01:11:45,600
Alert when GPU utilization exceeds 80% for 15 minutes.

1984
01:11:45,600 --> 01:11:48,520
Alert when ingestion Q-depth exceeds 1000 documents

1985
01:11:48,520 --> 01:11:49,560
for 30 minutes.

1986
01:11:49,560 --> 01:11:52,160
And alert when any component logs an error

1987
01:11:52,160 --> 01:11:54,920
more than 10 times in five minutes.

1988
01:11:54,920 --> 01:11:57,680
These alerts catch problems before users complain.

1989
01:11:57,680 --> 01:12:00,120
Troubleshooting follows a simple decision tree.

1990
01:12:00,120 --> 01:12:03,080
If query latency is high, check GPU utilization first.

1991
01:12:03,080 --> 01:12:04,920
If the GPU is saturated, scale it.

1992
01:12:04,920 --> 01:12:08,160
If the GPU is idle, check Vector database latency.

1993
01:12:08,160 --> 01:12:11,280
If Q-drand is slow, check index size and EF parameters.

1994
01:12:11,280 --> 01:12:13,840
If Q-drand is fast, check the embedding model.

1995
01:12:13,840 --> 01:12:16,800
If embedding is slow, check batch size and GPU memory.

1996
01:12:16,800 --> 01:12:19,240
If everything is fast, but the answer quality is poor,

1997
01:12:19,240 --> 01:12:21,800
check chunking strategy and retrieval accuracy.

1998
01:12:21,800 --> 01:12:24,080
Quality problems usually trace back to chunking.

1999
01:12:24,080 --> 01:12:26,200
And here is the part that makes this future proof.

2000
01:12:26,200 --> 01:12:27,920
Modularity and model evolution.

2001
01:12:27,920 --> 01:12:29,720
Yamaha 4 won't be the last open model.

2002
01:12:29,720 --> 01:12:31,320
Next year there will be Lama 5,

2003
01:12:31,320 --> 01:12:33,160
or a competitor with better reasoning

2004
01:12:33,160 --> 01:12:35,920
or a smaller model with equivalent performance.

2005
01:12:35,920 --> 01:12:37,920
Your architecture must allow swapping models

2006
01:12:37,920 --> 01:12:39,560
without rebuilding the pipeline.

2007
01:12:39,560 --> 01:12:41,200
The rack pattern is model agnostic.

2008
01:12:41,200 --> 01:12:44,320
The Vector database doesn't care which LLM generates the answer.

2009
01:12:44,320 --> 01:12:45,720
The chunking logic doesn't care.

2010
01:12:45,720 --> 01:12:47,040
The query interface doesn't care.

2011
01:12:47,040 --> 01:12:49,680
They care about the prompt format and the API endpoint,

2012
01:12:49,680 --> 01:12:53,240
standardized on the Olama API or an open AI compatible local API

2013
01:12:53,240 --> 01:12:56,640
like the one provided by VLLM or text generation inference.

2014
01:12:56,640 --> 01:13:00,160
These APIs accept a model name, messages array and parameters.

2015
01:13:00,160 --> 01:13:02,280
When a new model arrives, you pull it,

2016
01:13:02,280 --> 01:13:04,400
update the model name in your configuration,

2017
01:13:04,400 --> 01:13:05,760
and restart the service.

2018
01:13:05,760 --> 01:13:07,800
A/B testing is responsible model migration.

2019
01:13:07,800 --> 01:13:10,080
Maintain a test suite of representative queries

2020
01:13:10,080 --> 01:13:11,920
with expected answer characteristics.

2021
01:13:11,920 --> 01:13:14,720
Run the old model and the new model against the same queries.

2022
01:13:14,720 --> 01:13:18,120
Compare latency accuracy, citation quality and hallucination rate.

2023
01:13:18,120 --> 01:13:19,880
Only promote the new model to production

2024
01:13:19,880 --> 01:13:22,760
when it matches or exceeds the old model on your metrics.

2025
01:13:22,760 --> 01:13:24,160
This prevents regressions.

2026
01:13:24,160 --> 01:13:25,880
Embedding models also evolve.

2027
01:13:25,880 --> 01:13:29,360
A new sentence transformer might offer better accuracy for your domain.

2028
01:13:29,360 --> 01:13:33,120
But switching embedding models requires re-embedding your entire corpus.

2029
01:13:33,120 --> 01:13:36,200
The old vectors and new vectors exist in different semantic spaces.

2030
01:13:36,200 --> 01:13:37,200
You can't mix them.

2031
01:13:37,200 --> 01:13:39,200
Plan this as a scheduled maintenance task,

2032
01:13:39,200 --> 01:13:42,960
build the new index in parallel, validated against a test query set,

2033
01:13:42,960 --> 01:13:44,840
then swap the collection names automatically.

2034
01:13:44,840 --> 01:13:46,120
The downtime is seconds.

2035
01:13:46,120 --> 01:13:47,800
Model quantization improves over time.

2036
01:13:47,800 --> 01:13:51,640
A new quantization algorithm might reduce VRM usage with less quality loss.

2037
01:13:51,640 --> 01:13:53,520
Requantize your existing model weights.

2038
01:13:53,520 --> 01:13:55,600
Test the quantized model against your benchmark.

2039
01:13:55,600 --> 01:13:58,640
If it passes deploy, if it fails, keep the old quantization.

2040
01:13:58,640 --> 01:14:01,280
This is continuous improvement, not big bang replacement.

2041
01:14:01,280 --> 01:14:03,640
The Microsoft 365 ecosystem also evolves.

2042
01:14:03,640 --> 01:14:05,160
SharePoint APIs change.

2043
01:14:05,160 --> 01:14:07,080
Microsoft Graph adds new endpoints.

2044
01:14:07,080 --> 01:14:09,320
Enter ID updates, its authentication flows.

2045
01:14:09,320 --> 01:14:11,760
Your ingestion service must handle API versioning.

2046
01:14:11,760 --> 01:14:14,760
Use the API version parameter in SharePoint rest calls.

2047
01:14:14,760 --> 01:14:17,880
Subscribe to Microsoft Graph change notifications were supported

2048
01:14:17,880 --> 01:14:20,400
and monitor Microsoft's deprecation announcements.

2049
01:14:20,400 --> 01:14:24,680
An ingestion service that breaks because an API changed isn't a technology failure.

2050
01:14:24,680 --> 01:14:26,240
It's an operational failure.

2051
01:14:26,240 --> 01:14:29,880
Modularity means each component has a well-defined interface.

2052
01:14:29,880 --> 01:14:32,680
The ingestion service outputs text chunks with metadata.

2053
01:14:32,680 --> 01:14:35,320
The chunking engine consumes text and outputs chunks.

2054
01:14:35,320 --> 01:14:38,040
The embedding service consumes chunks and outputs vectors.

2055
01:14:38,040 --> 01:14:41,280
The vector database consumes vectors and outputs search results.

2056
01:14:41,280 --> 01:14:44,160
The LLM runtime consumes prompts and outputs text.

2057
01:14:44,160 --> 01:14:47,480
The query interface consumes user input and orchestrates the rest.

2058
01:14:47,480 --> 01:14:49,720
Change one component, the others stay the same.

2059
01:14:49,720 --> 01:14:52,440
This is how you future-proof, not by predicting the future.

2060
01:14:52,440 --> 01:14:55,120
But by building interfaces that don't care about the future,

2061
01:14:55,120 --> 01:14:57,680
let me talk about testing and continuous integration

2062
01:14:57,680 --> 01:14:59,960
because a production rag system without tests

2063
01:14:59,960 --> 01:15:01,800
is a liability waiting to happen.

2064
01:15:01,800 --> 01:15:03,960
Your test suite should cover three layers.

2065
01:15:03,960 --> 01:15:07,120
Unit tests for the chunking engine feed it a known document.

2066
01:15:07,120 --> 01:15:09,040
Verify the chunks match with headings.

2067
01:15:09,040 --> 01:15:10,600
Verify metadata is preserved.

2068
01:15:10,600 --> 01:15:12,600
Verify chunks size stays within bounds.

2069
01:15:12,600 --> 01:15:14,680
Integration tests for the ingestion pipeline

2070
01:15:14,680 --> 01:15:16,400
pointed at a test sharepoint library.

2071
01:15:16,400 --> 01:15:19,040
Verify it authenticates, downloads extracts, chunks,

2072
01:15:19,040 --> 01:15:20,720
embeds and stores correctly.

2073
01:15:20,720 --> 01:15:22,600
End-to-end tests for the query flow.

2074
01:15:22,600 --> 01:15:25,160
Submit a known question against a known document base.

2075
01:15:25,160 --> 01:15:27,480
Verify the answer contains expected content.

2076
01:15:27,480 --> 01:15:28,920
Verify citations.com.s.

2077
01:15:28,920 --> 01:15:32,920
Verify permission filtering excludes unauthorized content.

2078
01:15:32,920 --> 01:15:34,960
Automate these tests in your CI pipeline.

2079
01:15:34,960 --> 01:15:36,400
Run unit tests on every commit.

2080
01:15:36,400 --> 01:15:38,680
Run integration tests on every pull request.

2081
01:15:38,680 --> 01:15:41,800
Run end-to-end tests nightly against a staging environment

2082
01:15:41,800 --> 01:15:43,240
that mirrors production.

2083
01:15:43,240 --> 01:15:45,200
This catches regressions before they reach users.

2084
01:15:45,200 --> 01:15:47,600
It gives you confidence to upgrade components.

2085
01:15:47,600 --> 01:15:50,360
And it documents the expected behavior for new team members.

2086
01:15:50,360 --> 01:15:52,880
Deployment patterns for this architecture are straightforward.

2087
01:15:52,880 --> 01:15:55,480
The ingestion service, chunking engine, embedding model,

2088
01:15:55,480 --> 01:15:57,760
and vector database are all containerized.

2089
01:15:57,760 --> 01:15:59,520
You deploy them with Docker compose

2090
01:15:59,520 --> 01:16:02,640
for simple setups or Kubernetes for complex ones.

2091
01:16:02,640 --> 01:16:06,880
The LLM runtime runs on bare metal or in a GPU enabled container.

2092
01:16:06,880 --> 01:16:09,360
The query interface is a standard web application

2093
01:16:09,360 --> 01:16:11,840
deployed behind your corporate reverse proxy.

2094
01:16:11,840 --> 01:16:13,840
Blue-green deployment minimizes downtime.

2095
01:16:13,840 --> 01:16:15,800
You stand up a new version of the query interface

2096
01:16:15,800 --> 01:16:16,920
alongside the old.

2097
01:16:16,920 --> 01:16:19,200
You root 10% of traffic to the new version.

2098
01:16:19,200 --> 01:16:21,040
You monitor error rates and latency.

2099
01:16:21,040 --> 01:16:23,680
If metrics look good, you root 100%.

2100
01:16:23,680 --> 01:16:26,640
If metrics degrade, you root back to the old version.

2101
01:16:26,640 --> 01:16:28,440
This pattern works for the query interface,

2102
01:16:28,440 --> 01:16:30,920
the ingestion service, and the vector database.

2103
01:16:30,920 --> 01:16:34,040
It doesn't work for the LLM runtime if you only have one GPU.

2104
01:16:34,040 --> 01:16:36,640
In that case, deploy during maintenance windows.

2105
01:16:36,640 --> 01:16:38,680
Database migrations for QDrand are minimal.

2106
01:16:38,680 --> 01:16:40,280
Collections are created once.

2107
01:16:40,280 --> 01:16:43,040
Vectors are inserted and updated by the ingestion service.

2108
01:16:43,040 --> 01:16:45,960
You don't run schema migrations in the traditional sense.

2109
01:16:45,960 --> 01:16:47,680
But if you change the embedding model,

2110
01:16:47,680 --> 01:16:49,040
you must rebuild the collection,

2111
01:16:49,040 --> 01:16:51,600
create a new collection with the new vector dimensions,

2112
01:16:51,600 --> 01:16:54,600
reindex all documents into it, validate query accuracy

2113
01:16:54,600 --> 01:16:56,000
against your test suite.

2114
01:16:56,000 --> 01:16:57,920
Then, atomically swap the collection names.

2115
01:16:57,920 --> 01:16:59,200
The downtime is seconds.

2116
01:16:59,200 --> 01:17:01,200
Model version control is worth implementing.

2117
01:17:01,200 --> 01:17:03,400
Store model weights in a local artifact repository

2118
01:17:03,400 --> 01:17:05,280
or on network attached storage.

2119
01:17:05,280 --> 01:17:08,720
Tag each model with version, quantization level, and deployment date.

2120
01:17:08,720 --> 01:17:11,960
When you upgrade, keep the previous version available for rollback.

2121
01:17:11,960 --> 01:17:14,200
A new model that performs worse on your test suite

2122
01:17:14,200 --> 01:17:15,960
can be rolled back in minutes by pointing

2123
01:17:15,960 --> 01:17:17,600
or lamer at the previous weights.

2124
01:17:17,600 --> 01:17:19,360
Let us put this all together.

2125
01:17:19,360 --> 01:17:21,240
The complete architecture blueprint.

2126
01:17:21,240 --> 01:17:23,080
The full pipeline has seven layers.

2127
01:17:23,080 --> 01:17:24,720
Each layer runs inside your perimeter.

2128
01:17:24,720 --> 01:17:26,320
Each layer has a specific job.

2129
01:17:26,320 --> 01:17:27,880
And each layer connects to the next

2130
01:17:27,880 --> 01:17:29,280
through a well-defined interface.

2131
01:17:29,280 --> 01:17:32,080
Layer one is SharePoint Online or SharePoint on premises.

2132
01:17:32,080 --> 01:17:33,320
This is the storage tier.

2133
01:17:33,320 --> 01:17:35,960
It holds your documents, enforces access controls,

2134
01:17:35,960 --> 01:17:38,760
manages versions, and applies retention policies.

2135
01:17:38,760 --> 01:17:39,920
It's the source of truth.

2136
01:17:39,920 --> 01:17:42,960
Nothing in the AI layer overrides SharePoint governance.

2137
01:17:42,960 --> 01:17:45,160
Layer two is the ingestion service.

2138
01:17:45,160 --> 01:17:48,600
It authenticates via Microsoft EntraID using OAuth 2.0.

2139
01:17:48,600 --> 01:17:50,920
It enumerates document libraries using the SharePoint

2140
01:17:50,920 --> 01:17:53,160
REST API or Microsoft Graph.

2141
01:17:53,160 --> 01:17:56,320
It extracts text from Word documents, PDFs, Excel sheets,

2142
01:17:56,320 --> 01:17:57,440
and PowerPoint text.

2143
01:17:57,440 --> 01:17:59,360
It detects changes using Delta Sync.

2144
01:17:59,360 --> 01:18:02,280
And it outputs clean text chunks with structural metadata.

2145
01:18:02,280 --> 01:18:04,080
Layer three is the chunking engine.

2146
01:18:04,080 --> 01:18:05,280
It detects document type.

2147
01:18:05,280 --> 01:18:07,280
It applies heading aware chunking for Word.

2148
01:18:07,280 --> 01:18:09,000
Page aware chunking for PDFs.

2149
01:18:09,000 --> 01:18:10,560
Row group chunking for Excel.

2150
01:18:10,560 --> 01:18:12,280
Slide level chunking for PowerPoint.

2151
01:18:12,280 --> 01:18:14,120
It preserves metadata at every step.

2152
01:18:14,120 --> 01:18:15,200
Source URL.

2153
01:18:15,200 --> 01:18:16,320
Document title.

2154
01:18:16,320 --> 01:18:17,080
Author.

2155
01:18:17,080 --> 01:18:18,280
Last modified date.

2156
01:18:18,280 --> 01:18:19,280
Permission level.

2157
01:18:19,280 --> 01:18:21,480
It outputs chunks ready for embedding.

2158
01:18:21,480 --> 01:18:23,440
Layer four is the local embedding model.

2159
01:18:23,440 --> 01:18:26,880
It runs a sentence transformer like all mini-LML6V2

2160
01:18:26,880 --> 01:18:29,960
or BGE large N on your local GPU or CPU.

2161
01:18:29,960 --> 01:18:31,320
It processes chunks and batches.

2162
01:18:31,320 --> 01:18:33,880
It converts each chunk into a dense vector.

2163
01:18:33,880 --> 01:18:35,960
And it outputs vectors with metadata payloads.

2164
01:18:35,960 --> 01:18:38,680
Layer five is the vector database, QDrand or VV8

2165
01:18:38,680 --> 01:18:40,760
running on your local network via Docker.

2166
01:18:40,760 --> 01:18:44,120
It stores vectors in a collection with HNSW indexing.

2167
01:18:44,120 --> 01:18:46,000
It supports metadata filtering for permission

2168
01:18:46,000 --> 01:18:46,920
or wear retrieval.

2169
01:18:46,920 --> 01:18:48,640
It handles point updates for Delta Sync.

2170
01:18:48,640 --> 01:18:52,200
And it returns the top-k nearest neighbors in under 100 milliseconds.

2171
01:18:52,200 --> 01:18:54,640
Layer six is the local YAMA runtime.

2172
01:18:54,640 --> 01:18:57,480
Olamar serving a quantized YAMA three or LAMA four model

2173
01:18:57,480 --> 01:18:58,600
on your GPU server.

2174
01:18:58,600 --> 01:19:00,320
It exposes a local REST API.

2175
01:19:00,320 --> 01:19:02,520
It receives prompts containing system instructions,

2176
01:19:02,520 --> 01:19:04,240
retrieve chunks and user questions.

2177
01:19:04,240 --> 01:19:07,640
It generates answers with low temperature for factual grounding.

2178
01:19:07,640 --> 01:19:10,200
And it streams responses back to the query interface.

2179
01:19:10,200 --> 01:19:11,840
Layer seven is the query interface.

2180
01:19:11,840 --> 01:19:13,360
A minimalist web application

2181
01:19:13,360 --> 01:19:15,800
authenticated through Microsoft Enter ID.

2182
01:19:15,800 --> 01:19:17,240
It embeds user questions.

2183
01:19:17,240 --> 01:19:19,880
It queries the vector database with permission filtering.

2184
01:19:19,880 --> 01:19:20,880
It constructs prompts.

2185
01:19:20,880 --> 01:19:21,960
It calls Olamar.

2186
01:19:21,960 --> 01:19:24,520
It displays generated answers with clickable citations

2187
01:19:24,520 --> 01:19:25,840
back to SharePoint documents.

2188
01:19:25,840 --> 01:19:27,200
It logs every interaction.

2189
01:19:27,200 --> 01:19:29,520
And it never exposes data to external APIs.

2190
01:19:29,520 --> 01:19:30,600
And that's the architecture.

2191
01:19:30,600 --> 01:19:31,520
Seven layers.

2192
01:19:31,520 --> 01:19:33,400
All local, all under your control.

2193
01:19:33,400 --> 01:19:35,080
Now let me address common concerns.

2194
01:19:35,080 --> 01:19:39,360
If the SharePoint API changes your ingestion service uses versioned API calls.

2195
01:19:39,360 --> 01:19:42,360
You test against preview versions in a staging environment.

2196
01:19:42,360 --> 01:19:45,280
You migrate to new versions on your schedule, not Microsofts.

2197
01:19:45,280 --> 01:19:48,680
If the model hallucinates despite rag, you lower the temperature.

2198
01:19:48,680 --> 01:19:49,920
You tighten the system prompt.

2199
01:19:49,920 --> 01:19:52,320
You add output validation in the query interface.

2200
01:19:52,320 --> 01:19:55,600
You implement human feedback loops where users flag bad answers.

2201
01:19:55,600 --> 01:19:57,080
And you monitor logs for patterns.

2202
01:19:57,080 --> 01:19:58,760
Illucination isn't eliminated.

2203
01:19:58,760 --> 01:19:59,720
It's managed.

2204
01:19:59,720 --> 01:20:01,480
If a user tries prompt injection, they

2205
01:20:01,480 --> 01:20:04,920
craft a query designed to make the model ignore its instructions.

2206
01:20:04,920 --> 01:20:06,160
You sanitize inputs.

2207
01:20:06,160 --> 01:20:09,920
You enforce the system prompt at the API level, not just in the application.

2208
01:20:09,920 --> 01:20:13,920
You validate that retrieved chunks match the query semantically before including them.

2209
01:20:13,920 --> 01:20:15,360
And you log suspicious patterns.

2210
01:20:15,360 --> 01:20:18,320
Prompt injection is an attack vector for any LLM system.

2211
01:20:18,320 --> 01:20:20,160
Local deployment doesn't eliminate it.

2212
01:20:20,160 --> 01:20:22,360
But it contains the blast radius to your perimeter.

2213
01:20:22,360 --> 01:20:24,880
If hardware fails, you run Q-drand with replication.

2214
01:20:24,880 --> 01:20:26,760
You snapshot the vector index nightly.

2215
01:20:26,760 --> 01:20:28,720
You keep model weights on network storage.

2216
01:20:28,720 --> 01:20:30,680
And you document the recovery procedure.

2217
01:20:30,680 --> 01:20:33,000
Local infrastructure requires operational discipline.

2218
01:20:33,000 --> 01:20:34,360
The reward is control.

2219
01:20:34,360 --> 01:20:36,400
If a user asks a question in German or French,

2220
01:20:36,400 --> 01:20:40,520
while your documents are in English, multilingual embedding models like multilingual E5

2221
01:20:40,520 --> 01:20:41,960
large handle this scenario.

2222
01:20:41,960 --> 01:20:46,040
They map semantically equivalent sentences in different languages to nearby vectors.

2223
01:20:46,040 --> 01:20:50,200
A user asking for Urlobs Richtlinien in German retrieves the English vacation policy chunk

2224
01:20:50,200 --> 01:20:52,520
because the embeddings are close in vector space.

2225
01:20:52,520 --> 01:20:55,040
The LLM then generates the answer in the user's language.

2226
01:20:55,040 --> 01:20:57,600
This isn't machine translation in the traditional sense.

2227
01:20:57,600 --> 01:21:00,840
It's cross-lingual retrieval followed by monolingual generation.

2228
01:21:00,840 --> 01:21:03,680
And it works surprisingly well with modern multilingual models.

2229
01:21:03,680 --> 01:21:07,360
If a document contains a table that the extraction process fails to pass,

2230
01:21:07,360 --> 01:21:09,440
the chunk contains garbled text.

2231
01:21:09,440 --> 01:21:11,120
The embedding represents noise.

2232
01:21:11,120 --> 01:21:15,680
When a user asks about the table content, the retrieval engine returns the noisy chunk.

2233
01:21:15,680 --> 01:21:19,120
The LLM generates an answer based on partial or incorrect information.

2234
01:21:19,120 --> 01:21:22,160
This is a data quality problem, not a model problem.

2235
01:21:22,160 --> 01:21:24,760
The fix is better extraction, not better prompting.

2236
01:21:24,760 --> 01:21:28,200
Invest in PDF table extractors like Camelot or Tabular Pi.

2237
01:21:28,200 --> 01:21:30,280
Test them on your actual document corpus.

2238
01:21:30,280 --> 01:21:34,480
And fall back to manual review for documents that automated extraction can't handle.

2239
01:21:34,480 --> 01:21:38,480
If the LLM refuses to answer a question because the system prompt is too restrictive,

2240
01:21:38,480 --> 01:21:42,200
this happens when users ask about topics that are adjacent to sensitive areas.

2241
01:21:42,200 --> 01:21:44,400
A user asks about employee benefits.

2242
01:21:44,400 --> 01:21:46,760
The system prompt says only use provided context.

2243
01:21:46,760 --> 01:21:49,000
The retrieved chunks contain benefits information.

2244
01:21:49,000 --> 01:21:53,800
But the LLM interprets the question as potentially asking about other employees and refuses.

2245
01:21:53,800 --> 01:21:55,240
This is over refusal.

2246
01:21:55,240 --> 01:21:57,480
The fix is to tune the system prompt carefully.

2247
01:21:57,480 --> 01:22:00,320
Allow answers that are clearly supported by the context.

2248
01:22:00,320 --> 01:22:03,480
Only refuse when the context genuinely doesn't contain the answer.

2249
01:22:03,480 --> 01:22:08,240
And monitor refusal rates, a system that refuses 50% of queries isn't useful.

2250
01:22:08,240 --> 01:22:14,400
If a document contains outdated information, a policy from 2024 might be superseded in 2025.

2251
01:22:14,400 --> 01:22:18,680
Both versions exist in SharePoint because the old version is retained for legal reasons.

2252
01:22:18,680 --> 01:22:20,360
The ingestion service indexes both.

2253
01:22:20,360 --> 01:22:21,880
The retrieval engine returns both.

2254
01:22:21,880 --> 01:22:25,760
The LLM synthesizes an answer that mixes old and new rules.

2255
01:22:25,760 --> 01:22:27,280
This is a version control problem.

2256
01:22:27,280 --> 01:22:28,480
The fix is metadata.

2257
01:22:28,480 --> 01:22:31,720
Tag every chunk with effective date and superseded status.

2258
01:22:31,720 --> 01:22:36,440
Filter out superseded documents at query time unless the user explicitly asks for historical

2259
01:22:36,440 --> 01:22:40,040
versions and train users to check the last modified date incitations.

2260
01:22:40,040 --> 01:22:44,560
Let me walk through a complete query from start to finish so you see how the layers interact.

2261
01:22:44,560 --> 01:22:47,280
Sarah, a project manager, opens the query interface.

2262
01:22:47,280 --> 01:22:51,920
She types what is the approval process for vendor contracts over $50,000?

2263
01:22:51,920 --> 01:22:55,920
The interface authenticates her via enter ID and determines she belongs to the manager's

2264
01:22:55,920 --> 01:22:58,200
group, giving her a manager permission tier.

2265
01:22:58,200 --> 01:23:00,720
The interface sends her question to the local embedding model.

2266
01:23:00,720 --> 01:23:03,960
The model converts it into a 384-dimensional vector.

2267
01:23:03,960 --> 01:23:07,440
The interface sends this vector to Q-drand with a filter for permission tier less than

2268
01:23:07,440 --> 01:23:08,680
or equal to manager.

2269
01:23:08,680 --> 01:23:13,320
Q-drand searches 200,000 vectors and returns the top five matches in 80 milliseconds.

2270
01:23:13,320 --> 01:23:17,640
The matches include chunks from the procurement policy, the finance handbook and the delegation

2271
01:23:17,640 --> 01:23:18,840
of authority document.

2272
01:23:18,840 --> 01:23:20,640
The interface constructs a prompt.

2273
01:23:20,640 --> 01:23:24,480
System, your unknowledgeable assistant that answers based on provided context, use only

2274
01:23:24,480 --> 01:23:26,120
the information in the context.

2275
01:23:26,120 --> 01:23:29,120
Side sources, context chunk one from procurement policy.

2276
01:23:29,120 --> 01:23:34,600
Doc X contracts exceeding $50,000 require procurement team review and CFO approval.

2277
01:23:34,600 --> 01:23:36,800
Context chunk two from finance handbook.

2278
01:23:36,800 --> 01:23:41,320
Doc X, vendor selection must follow the three-bit process documented in section four,

2279
01:23:41,320 --> 01:23:43,680
chunk three from delegation of authority.

2280
01:23:43,680 --> 01:23:48,600
Doc X, the CFO retains approval authority for all contracts above the departmental threshold.

2281
01:23:48,600 --> 01:23:52,920
User question, what is the approval process for vendor contracts over $50,000?

2282
01:23:52,920 --> 01:23:56,800
The interface sends this prompt to Olama running Lama 370BQ4.

2283
01:23:56,800 --> 01:23:58,840
Olama generates the answer in four seconds.

2284
01:23:58,840 --> 01:24:03,880
For vendor contracts over $50,000, the procurement team must first conduct a three-bit process.

2285
01:24:03,880 --> 01:24:05,800
The results are reviewed by procurement.

2286
01:24:05,800 --> 01:24:09,600
Finally, approval requires the CFO as per the delegation of authority policy.

2287
01:24:09,600 --> 01:24:14,560
Resources, procurement policy.doc X, finance handbook.doc X, delegation of authority.doc X,

2288
01:24:14,560 --> 01:24:17,280
the interface displays the answer with clickable citations.

2289
01:24:17,280 --> 01:24:21,680
Sarah clicks the procurement policy citation and opens the document in SharePoint.

2290
01:24:21,680 --> 01:24:23,800
Total time from question to answer five seconds.

2291
01:24:23,800 --> 01:24:25,560
Total outbound data packets zero.

2292
01:24:25,560 --> 01:24:27,360
This isn't a theoretical architecture.

2293
01:24:27,360 --> 01:24:28,720
It's a production pattern.

2294
01:24:28,720 --> 01:24:29,720
Seven layers.

2295
01:24:29,720 --> 01:24:30,720
Each layer has one job.

2296
01:24:30,720 --> 01:24:33,560
Each layer passes structured data to the next.

2297
01:24:33,560 --> 01:24:36,360
And the entire pipeline stays inside your perimeter.

2298
01:24:36,360 --> 01:24:38,080
This architecture isn't a product you buy.

2299
01:24:38,080 --> 01:24:39,080
It's a stance you take.

2300
01:24:39,080 --> 01:24:41,720
A stance says your data is too useful to delegate.

2301
01:24:41,720 --> 01:24:44,080
Your governance is too specific to outsource.

2302
01:24:44,080 --> 01:24:47,560
And your AI capabilities should serve your perimeter not someone else's.

2303
01:24:47,560 --> 01:24:49,880
That's sovereign intelligence, not sovereign cloud.

2304
01:24:49,880 --> 01:24:52,200
Sovereignty isn't about rejecting AI.

2305
01:24:52,200 --> 01:24:56,280
It's about rejecting the assumption that intelligence must live in someone else's cloud.

2306
01:24:56,280 --> 01:24:59,720
You now have the blueprint to turn your SharePoint into a private brain.

2307
01:24:59,720 --> 01:25:00,720
Your data stays local.

2308
01:25:00,720 --> 01:25:01,880
Your model stays local.

2309
01:25:01,880 --> 01:25:04,720
And your answers stay grounded in documents you already own.

2310
01:25:04,720 --> 01:25:06,280
Share this with your security team.

2311
01:25:06,280 --> 01:25:08,380
for more architecture that respects your data.