Stop Leaking Data: How to Run Local Llama on Your SharePoint Files


AI is transforming the way organizations work with knowledge, documents, and collaboration platforms. But as more businesses adopt AI-powered assistants and large language models, one critical question continues to surface: how can you unlock the power of AI without exposing sensitive corporate information to external services?In this episode, we explore how organizations can run Local Llama models directly against SharePoint content while maintaining full control over their data. Instead of sending confidential documents, intellectual property, customer records, and internal knowledge to cloud-hosted AI services, local AI architectures provide a powerful alternative that prioritizes privacy, governance, and security.Our discussion breaks down the practical steps required to connect locally hosted large language models with SharePoint data sources. We examine the technologies involved, the infrastructure considerations, and the trade-offs between convenience and data sovereignty. Whether you are an IT professional, Microsoft 365 administrator, security architect, or AI enthusiast, this episode provides valuable insights into building private AI solutions on top of your existing Microsoft 365 environment.
UNDERSTANDING THE DATA PRIVACY CHALLENGE
As organizations rush to embrace generative AI, many overlook the risks associated with sending sensitive business data to third-party platforms. Data leakage, compliance concerns, and regulatory requirements are becoming major factors in AI adoption strategies.We discuss:
- Why data sovereignty matters in the age of AI
- Common risks associated with public AI services
- Regulatory and compliance considerations
- How local AI models can reduce exposure risks
Local Llama models have emerged as one of the most exciting developments in the open-source AI ecosystem. Running AI models locally gives organizations complete ownership of both the infrastructure and the data processing pipeline.During the conversation, we explain how Local Llama works, the hardware requirements involved, and how organizations can begin experimenting with private AI deployments without massive cloud costs.
CONNECTING SHAREPOINT TO PRIVATE AI
SharePoint remains one of the largest repositories of enterprise knowledge. From project documentation and operational procedures to contracts and meeting notes, organizations store enormous amounts of valuable information inside Microsoft 365.
Key topics include:
- Indexing SharePoint content securely
- Retrieval-Augmented Generation (RAG) architectures
- Document embeddings and semantic search
- Building intelligent chat experiences on internal data
Moving from a proof of concept to production requires careful planning. We explore deployment patterns that balance performance, scalability, security, and user experience.Listeners will learn about infrastructure design, GPU considerations, storage requirements, monitoring, and operational best practices. We also discuss common implementation mistakes and how organizations can avoid them while delivering meaningful business value.
THE FUTURE OF PRIVATE ENTERPRISE AI
The future of enterprise AI may not belong exclusively to cloud-hosted models. As local AI technology continues to evolve, organizations are gaining more options to build intelligent systems that keep sensitive information under their control.This episode examines how private AI solutions could reshape knowledge management, enterprise search, productivity workflows, and digital workplace experiences across Microsoft 365 environments.
WHY YOU SHOULD LISTEN
If you're evaluating AI adoption within your organization, concerned about data privacy, or looking for practical ways to leverage SharePoint content with large language models, this episode delivers actionable insights and real-world guidance. Learn how to combine the power of modern AI with the security and governance requirements that today's businesses demand.Tune in to discover how Local Llama, SharePoint, and private AI architectures can work together to unlock organizational knowledge without compromising data security.
Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.
🚀 Want to be part of m365.fm?
Then stop just listening… and start showing up.
👉 Connect with me on LinkedIn and let’s make something happen:
- 🎙️ Be a podcast guest and share your story
- 🎧 Host your own episode (yes, seriously)
- 💡 Pitch topics the community actually wants to hear
- 🌍 Build your personal brand in the Microsoft 365 space
This isn’t just a podcast — it’s a platform for people who take action.
🔥 Most people wait. The best ones don’t.
👉 Connect with me on LinkedIn and send me a message:
"I want in"
Let’s build something awesome 👊
00:00:00,000 --> 00:00:02,440
Your AI strategy was supposed to protect your data,
2
00:00:02,440 --> 00:00:04,200
but in reality it's doing the opposite,
3
00:00:04,200 --> 00:00:05,800
not because of bad intentions,
4
00:00:05,800 --> 00:00:07,140
because of the model behind it.
5
00:00:07,140 --> 00:00:09,120
Cloud AI requires cloud connectivity.
6
00:00:09,120 --> 00:00:11,560
Cloud connectivity means logging, retention,
7
00:00:11,560 --> 00:00:13,760
and legal compulsion you can't opt out of.
8
00:00:13,760 --> 00:00:15,720
Your SharePoint documents deserve better.
9
00:00:15,720 --> 00:00:18,080
Today, I will show you how to run Lama locally,
10
00:00:18,080 --> 00:00:19,720
connect it to your SharePoint libraries,
11
00:00:19,720 --> 00:00:21,200
and build a retrieval system
12
00:00:21,200 --> 00:00:23,400
that never touches the public internet.
13
00:00:23,400 --> 00:00:26,240
Sovereign Intelligence, not Sovereign Cloud.
14
00:00:26,240 --> 00:00:28,200
Here is what most organizations miss.
15
00:00:28,200 --> 00:00:30,280
The moment you connect a cloud AI assistant
16
00:00:30,280 --> 00:00:32,640
to your SharePoint, your contracts, your HR files,
17
00:00:32,640 --> 00:00:34,720
your strategy documents, and your board memos,
18
00:00:34,720 --> 00:00:37,040
enter a pipeline you don't control.
19
00:00:37,040 --> 00:00:39,560
They travel across network boundaries, you didn't architect.
20
00:00:39,560 --> 00:00:41,600
They sit in log files, you can't delete,
21
00:00:41,600 --> 00:00:43,840
and in many jurisdictions, they become subject
22
00:00:43,840 --> 00:00:45,720
to legal requests you can't refuse.
23
00:00:45,720 --> 00:00:47,840
Microsoft's own WorkTrend Index tells us
24
00:00:47,840 --> 00:00:49,760
that roughly three out of four knowledge workers
25
00:00:49,760 --> 00:00:51,920
now use generative AI in some form.
26
00:00:51,920 --> 00:00:54,640
Nearly half of them started in just the last several months,
27
00:00:54,640 --> 00:00:56,440
that acceleration isn't a statistic.
28
00:00:56,440 --> 00:00:57,160
It's a signal.
29
00:00:57,160 --> 00:00:59,680
The signal says your organization's most sensitive content
30
00:00:59,680 --> 00:01:01,080
is now being queried through models
31
00:01:01,080 --> 00:01:03,520
that exist outside your perimeter, trained on data
32
00:01:03,520 --> 00:01:06,960
you didn't approve, and retained in ways you can't audit.
33
00:01:06,960 --> 00:01:09,120
The risk isn't that Microsoft will misuse your data.
34
00:01:09,120 --> 00:01:10,200
The risk is structural.
35
00:01:10,200 --> 00:01:12,120
When you send a prompt to a cloud LLM,
36
00:01:12,120 --> 00:01:13,760
that prompt carries context.
37
00:01:13,760 --> 00:01:15,360
That context carries document chunks.
38
00:01:15,360 --> 00:01:17,560
Those chunks carry proprietary information.
39
00:01:17,560 --> 00:01:19,600
And once that information leaves your network,
40
00:01:19,600 --> 00:01:22,440
it enters a legal and technical framework that's not yours.
41
00:01:22,440 --> 00:01:23,600
This isn't theoretical.
42
00:01:23,600 --> 00:01:25,240
Organizations in regulated industries
43
00:01:25,240 --> 00:01:26,880
already face hard blockers.
44
00:01:26,880 --> 00:01:28,800
Healthcare providers can't send patient records
45
00:01:28,800 --> 00:01:30,680
to external APIs under HIPAA.
46
00:01:30,680 --> 00:01:32,480
Financial institutions can't expose
47
00:01:32,480 --> 00:01:34,800
trading strategies to cross-border processing.
48
00:01:34,800 --> 00:01:36,160
Government bodies can't risk
49
00:01:36,160 --> 00:01:37,960
extraterritorial legal compulsion.
50
00:01:37,960 --> 00:01:40,640
The cloud act in the United States allows authorities
51
00:01:40,640 --> 00:01:42,200
to compel disclosure of data.
52
00:01:42,200 --> 00:01:44,520
Even when it's stored in European data centers,
53
00:01:44,520 --> 00:01:46,080
contracts don't eliminate that risk.
54
00:01:46,080 --> 00:01:47,960
They merely define who pays when it happens.
55
00:01:47,960 --> 00:01:49,760
So the question isn't whether AI is useful.
56
00:01:49,760 --> 00:01:51,920
The question is whether your current architecture respects
57
00:01:51,920 --> 00:01:54,400
the boundary between your data and the rest of the world.
58
00:01:54,400 --> 00:01:55,960
Most organizations assume it does.
59
00:01:55,960 --> 00:01:57,960
They assume that enterprise agreements,
60
00:01:57,960 --> 00:02:01,080
data processing addendums and region selection checkboxes
61
00:02:01,080 --> 00:02:02,320
create a sufficient barrier.
62
00:02:02,320 --> 00:02:03,080
They don't.
63
00:02:03,080 --> 00:02:04,560
Those tools improve transparency.
64
00:02:04,560 --> 00:02:06,080
They don't create sovereignty.
65
00:02:06,080 --> 00:02:08,080
Sovereignty means you control the hardware,
66
00:02:08,080 --> 00:02:11,400
the network, the model, the logs, and the legal jurisdiction.
67
00:02:11,400 --> 00:02:13,880
Anything less is delegation dressed up as protection.
68
00:02:13,880 --> 00:02:15,200
And delegation fails.
69
00:02:15,200 --> 00:02:17,600
The moment the delegated party faces a legal obligation,
70
00:02:17,600 --> 00:02:18,840
you can't override.
71
00:02:18,840 --> 00:02:19,800
That's the hidden leak.
72
00:02:19,800 --> 00:02:20,720
It's not a bug.
73
00:02:20,720 --> 00:02:22,080
It's the architecture.
74
00:02:22,080 --> 00:02:24,040
Why sovereign cloud is not enough?
75
00:02:24,040 --> 00:02:25,720
You have probably heard that sovereign cloud
76
00:02:25,720 --> 00:02:26,400
is the answer.
77
00:02:26,400 --> 00:02:27,560
Microsoft offers it.
78
00:02:27,560 --> 00:02:28,800
Other providers offer it too.
79
00:02:28,800 --> 00:02:29,840
The promise is appealing.
80
00:02:29,840 --> 00:02:31,360
Your data stays in your region.
81
00:02:31,360 --> 00:02:32,840
Processing happens locally.
82
00:02:32,840 --> 00:02:34,320
Access controls are stricter.
83
00:02:34,320 --> 00:02:35,480
Transparency improves.
84
00:02:35,480 --> 00:02:36,480
But here's the problem.
85
00:02:36,480 --> 00:02:38,200
Sovereign cloud is still cloud.
86
00:02:38,200 --> 00:02:39,960
And cloud means a provider headquartered
87
00:02:39,960 --> 00:02:42,840
in a jurisdiction that can compel access to your data
88
00:02:42,840 --> 00:02:44,640
regardless of where the server sits.
89
00:02:44,640 --> 00:02:47,320
Legal analysis from active mind legal makes this explicit.
90
00:02:47,320 --> 00:02:50,080
As long as US laws like the Cloud Act remain in force,
91
00:02:50,080 --> 00:02:52,640
US based companies can be compelled to transfer data
92
00:02:52,640 --> 00:02:55,520
to US authorities even when that data is physically stored
93
00:02:55,520 --> 00:02:56,560
in Europe.
94
00:02:56,560 --> 00:02:58,480
Microsoft has acknowledged this tension.
95
00:02:58,480 --> 00:03:01,120
Their sovereign cloud for Europe promises stricter controls
96
00:03:01,120 --> 00:03:02,480
in regional processing.
97
00:03:02,480 --> 00:03:05,680
It can't promise immunity from extraterritorial legal obligations
98
00:03:05,680 --> 00:03:08,960
because Microsoft is a US corporation subject to US law.
99
00:03:08,960 --> 00:03:10,600
This isn't a criticism of Microsoft.
100
00:03:10,600 --> 00:03:11,920
It's a statement about structure.
101
00:03:11,920 --> 00:03:14,920
No contractual assurance can override a statutory compulsion.
102
00:03:14,920 --> 00:03:16,600
No checkbox can change the jurisdiction
103
00:03:16,600 --> 00:03:17,840
of the parent company.
104
00:03:17,840 --> 00:03:19,160
And no amount of marketing language
105
00:03:19,160 --> 00:03:21,760
can turn a hosted service into an owned system.
106
00:03:21,760 --> 00:03:24,600
For many organizations, this distinction is academic.
107
00:03:24,600 --> 00:03:27,600
For others, it's a hard blocker, public sector bodies,
108
00:03:27,600 --> 00:03:31,000
critical infrastructure operators, regulated industries,
109
00:03:31,000 --> 00:03:33,840
organizations subject to GDPR article 44 restrictions
110
00:03:33,840 --> 00:03:35,400
on international transfers.
111
00:03:35,400 --> 00:03:37,000
These entities need data processing
112
00:03:37,000 --> 00:03:38,760
that's not merely resident in the right region,
113
00:03:38,760 --> 00:03:40,720
but legally and technically insulated
114
00:03:40,720 --> 00:03:42,000
from external access.
115
00:03:42,000 --> 00:03:44,760
GDPR requires that personal data be processed lawfully,
116
00:03:44,760 --> 00:03:46,280
fairly and transparently.
117
00:03:46,280 --> 00:03:49,560
It requires appropriate technical and organizational measures.
118
00:03:49,560 --> 00:03:51,320
And when using third party processes,
119
00:03:51,320 --> 00:03:53,840
controllers must ensure that processing agreements match
120
00:03:53,840 --> 00:03:56,280
with GDPR's protections, including restrictions
121
00:03:56,280 --> 00:03:58,880
on transfers to countries without adequate protection.
122
00:03:58,880 --> 00:04:01,080
The standard contractual clauses and adequacy decisions
123
00:04:01,080 --> 00:04:02,560
that underpin many cloud arrangements
124
00:04:02,560 --> 00:04:06,320
are being challenged, renegotiated, and in some cases invalidated.
125
00:04:06,320 --> 00:04:07,840
The legal environment is shifting,
126
00:04:07,840 --> 00:04:12,240
building on assumptions that held in 2022 is a risk in 2026.
127
00:04:12,240 --> 00:04:13,560
So sovereign cloud is a step.
128
00:04:13,560 --> 00:04:14,680
It's not the destination.
129
00:04:14,680 --> 00:04:16,040
The destination is an architecture
130
00:04:16,040 --> 00:04:18,520
where your data never leaves your control in the first place,
131
00:04:18,520 --> 00:04:20,440
where the model runs on your hardware,
132
00:04:20,440 --> 00:04:22,880
where the retrieval index lives on your network,
133
00:04:22,880 --> 00:04:25,560
where the query logs stay inside your perimeter,
134
00:04:25,560 --> 00:04:27,520
where the legal framework is the one you chose,
135
00:04:27,520 --> 00:04:29,440
not the one your provider is subject to.
136
00:04:29,440 --> 00:04:30,640
That architecture exists.
137
00:04:30,640 --> 00:04:32,440
It's called air-gapped intelligence,
138
00:04:32,440 --> 00:04:34,320
and it's what we're building today.
139
00:04:34,320 --> 00:04:35,920
The air-gapped alternative.
140
00:04:35,920 --> 00:04:38,240
Air-gapped doesn't mean disconnected from everything.
141
00:04:38,240 --> 00:04:40,160
It means disconnected from the public internet
142
00:04:40,160 --> 00:04:41,480
for the parts that matter.
143
00:04:41,480 --> 00:04:44,960
Your sharepoint still connects to Microsoft 365 for collaboration.
144
00:04:44,960 --> 00:04:48,040
Your users still authenticate through Microsoft Enter ID.
145
00:04:48,040 --> 00:04:50,520
Your document lifecycle still follows the governance rules
146
00:04:50,520 --> 00:04:51,520
you already built,
147
00:04:51,520 --> 00:04:53,520
but the AI layer runs inside your perimeter.
148
00:04:53,520 --> 00:04:55,640
The LLM sits on your GPU server.
149
00:04:55,640 --> 00:04:58,200
The vector database sits on your local network.
150
00:04:58,200 --> 00:05:00,840
The query interface resolves to an internal IP.
151
00:05:00,840 --> 00:05:02,360
And when a user asks a question,
152
00:05:02,360 --> 00:05:04,600
the answer is generated without a single packet
153
00:05:04,600 --> 00:05:06,080
leaving your controlled environment.
154
00:05:06,080 --> 00:05:07,920
This is zero trust applied to AI.
155
00:05:07,920 --> 00:05:10,720
PaloAlton Networks defines zero trust architecture
156
00:05:10,720 --> 00:05:14,200
as assuming no user or system is inherently trustworthy
157
00:05:14,200 --> 00:05:16,400
and requiring continuous verification.
158
00:05:16,400 --> 00:05:18,480
For AI, this means the ingestion service
159
00:05:18,480 --> 00:05:20,960
authenticates against sharepoint using OOOs.
160
00:05:20,960 --> 00:05:23,440
The vector database enforces role-based access control
161
00:05:23,440 --> 00:05:24,520
at the collection level.
162
00:05:24,520 --> 00:05:26,120
The query interface checks permissions
163
00:05:26,120 --> 00:05:27,560
before returning results.
164
00:05:27,560 --> 00:05:29,200
And the LLM runtime is isolated
165
00:05:29,200 --> 00:05:31,400
from outbound connectivity entirely.
166
00:05:31,400 --> 00:05:33,560
NIST describes role-based access control
167
00:05:33,560 --> 00:05:35,520
as enforcing three basic rules.
168
00:05:35,520 --> 00:05:38,840
Role assignment, role authorization, permission authorization,
169
00:05:38,840 --> 00:05:40,800
uses only exercise permissions consistent
170
00:05:40,800 --> 00:05:42,520
with their authorized roles.
171
00:05:42,520 --> 00:05:45,240
In our architecture, this applies at every layer.
172
00:05:45,240 --> 00:05:46,920
The ingestion service has a role that allows
173
00:05:46,920 --> 00:05:49,400
read access to specific sharepoint libraries.
174
00:05:49,400 --> 00:05:50,560
The vector database collections
175
00:05:50,560 --> 00:05:53,240
are tagged with the permission levels required to query them.
176
00:05:53,240 --> 00:05:56,280
The chat interface verifies the user's EntraID group membership
177
00:05:56,280 --> 00:05:57,640
before constructing the prompt.
178
00:05:57,640 --> 00:05:58,960
The result isn't paranoia.
179
00:05:58,960 --> 00:05:59,880
It's precision.
180
00:05:59,880 --> 00:06:01,720
Every document chunk carries its source.
181
00:06:01,720 --> 00:06:03,320
Every answer carries its citation.
182
00:06:03,320 --> 00:06:05,240
Every query carries its audit trail.
183
00:06:05,240 --> 00:06:06,680
And none of it leaves your building.
184
00:06:06,680 --> 00:06:09,120
Let me walk you through what this looks like in practice.
185
00:06:09,120 --> 00:06:10,680
SharePoint holds your documents.
186
00:06:10,680 --> 00:06:12,640
A local ingestion service authenticates
187
00:06:12,640 --> 00:06:15,920
via Microsoft EntraID, enumerates your libraries
188
00:06:15,920 --> 00:06:17,440
and extracts the content.
189
00:06:17,440 --> 00:06:19,000
A chunking engine breaks documents
190
00:06:19,000 --> 00:06:21,000
into semantically meaningful pieces.
191
00:06:21,000 --> 00:06:22,960
A local embedding model converts those pieces
192
00:06:22,960 --> 00:06:24,440
into numerical vectors.
193
00:06:24,440 --> 00:06:27,160
A vector database stores and indexes those vectors.
194
00:06:27,160 --> 00:06:29,480
A local Lama instance waits for queries.
195
00:06:29,480 --> 00:06:32,120
And a simple web interface lets your team ask questions
196
00:06:32,120 --> 00:06:34,120
and get grounded sighted answers.
197
00:06:34,120 --> 00:06:35,200
That's the architecture.
198
00:06:35,200 --> 00:06:37,840
Seven layers, all local, all under your control.
199
00:06:37,840 --> 00:06:40,120
The ingestion service runs on a modest server.
200
00:06:40,120 --> 00:06:42,640
It needs CPU, memory, and network access to SharePoint.
201
00:06:42,640 --> 00:06:44,040
It doesn't need a GPU.
202
00:06:44,040 --> 00:06:45,640
The chunking engine runs alongside it.
203
00:06:45,640 --> 00:06:47,720
The embedding model needs a GPU for speed
204
00:06:47,720 --> 00:06:50,240
but can fall back to CPU for smaller batches.
205
00:06:50,240 --> 00:06:52,480
The vector database needs fast SSD storage
206
00:06:52,480 --> 00:06:54,880
and enough RAM to hold the HNSW index.
207
00:06:54,880 --> 00:06:58,360
The LLM runtime needs the biggest GPU you can afford.
208
00:06:58,360 --> 00:07:00,400
And the query interface needs minimal resources
209
00:07:00,400 --> 00:07:01,760
because it's just a web application
210
00:07:01,760 --> 00:07:03,640
orchestrating calls to the other layers.
211
00:07:03,640 --> 00:07:06,400
This modular resource allocation means you can start small.
212
00:07:06,400 --> 00:07:08,200
A single server with a mid-range GPU
213
00:07:08,200 --> 00:07:09,840
can run the entire stack for a pilot
214
00:07:09,840 --> 00:07:12,320
with 5,000 documents and 50 daily users.
215
00:07:12,320 --> 00:07:15,520
As you grow, you move components to dedicated hardware.
216
00:07:15,520 --> 00:07:18,240
The vector database gets its own server with fast disks.
217
00:07:18,240 --> 00:07:21,000
The LLM runtime gets a dedicated GPU workstation.
218
00:07:21,000 --> 00:07:22,720
The ingestion service scales horizontally
219
00:07:22,720 --> 00:07:23,920
by adding more workers.
220
00:07:23,920 --> 00:07:26,440
You don't need to buy enterprise hardware on day one.
221
00:07:26,440 --> 00:07:28,320
But before we build it, you need to understand
222
00:07:28,320 --> 00:07:29,960
why retrieval isn't optional.
223
00:07:29,960 --> 00:07:32,000
It's the difference between a useful system
224
00:07:32,000 --> 00:07:34,360
and an expensive hallucination machine.
225
00:07:34,360 --> 00:07:36,440
The Ragnparative, large language models
226
00:07:36,440 --> 00:07:37,920
are patent completion engines.
227
00:07:37,920 --> 00:07:40,360
They predict the next token based on statistical patterns
228
00:07:40,360 --> 00:07:41,640
learned from training data.
229
00:07:41,640 --> 00:07:43,160
They don't know your organization.
230
00:07:43,160 --> 00:07:44,240
They don't know your contracts.
231
00:07:44,240 --> 00:07:45,800
They don't know your procedures.
232
00:07:45,800 --> 00:07:47,800
And when you ask them a question about content,
233
00:07:47,800 --> 00:07:48,840
they have never seen.
234
00:07:48,840 --> 00:07:50,640
They invent plausible sounding answers.
235
00:07:50,640 --> 00:07:52,160
That invention is called hallucination.
236
00:07:52,160 --> 00:07:53,240
It's not a rare bug.
237
00:07:53,240 --> 00:07:55,880
It's a fundamental property of how these models work.
238
00:07:55,880 --> 00:07:59,240
A full survey from RxIV categorizes hallucination mitigation
239
00:07:59,240 --> 00:08:01,640
into prompt engineering decoding constraints,
240
00:08:01,640 --> 00:08:04,120
training interventions and retrieval-based methods.
241
00:08:04,120 --> 00:08:04,960
The conclusion is clear.
242
00:08:04,960 --> 00:08:06,440
For factual grounding and changing
243
00:08:06,440 --> 00:08:10,440
or proprietary content, retrieval is the most reliable approach.
244
00:08:10,440 --> 00:08:12,640
Retrieval augmented generation or Ragn
245
00:08:12,640 --> 00:08:15,280
inserts a retrieval step between the user and the model.
246
00:08:15,280 --> 00:08:17,000
AWS describes it simply.
247
00:08:17,000 --> 00:08:18,880
User input retrieves relevant information
248
00:08:18,880 --> 00:08:20,160
from a new data source.
249
00:08:20,160 --> 00:08:22,000
The combined query and retrieved context
250
00:08:22,000 --> 00:08:23,200
passed to the LLM.
251
00:08:23,200 --> 00:08:25,320
The result is generated from authoritative data,
252
00:08:25,320 --> 00:08:26,880
not from statistical guessing.
253
00:08:26,880 --> 00:08:29,600
Immutar adds the enterprise security perspective.
254
00:08:29,600 --> 00:08:31,800
Ragn converts external data into embeddings,
255
00:08:31,800 --> 00:08:33,480
stores them in a vector database,
256
00:08:33,480 --> 00:08:35,480
retrieves the most relevant chunks for a query
257
00:08:35,480 --> 00:08:37,240
and integrates them into the prompt.
258
00:08:37,240 --> 00:08:39,080
And every step security must be enforced.
259
00:08:39,080 --> 00:08:42,560
Storage tier, data tier, prompt tier.
260
00:08:42,560 --> 00:08:45,120
Microsoft's Azure Architecture Center agrees.
261
00:08:45,120 --> 00:08:46,760
Ragn is the industry standard approach
262
00:08:46,760 --> 00:08:49,040
to using language models with proprietary data.
263
00:08:49,040 --> 00:08:50,840
Each step from chunking and embedding
264
00:08:50,840 --> 00:08:53,600
to retrieval and evaluation must be carefully designed
265
00:08:53,600 --> 00:08:54,440
and measured.
266
00:08:54,440 --> 00:08:56,000
Here is the pipeline in practical terms.
267
00:08:56,000 --> 00:08:57,760
Your SharePoint documents contain text.
268
00:08:57,760 --> 00:08:59,200
That text gets broken into chunks.
269
00:08:59,200 --> 00:09:01,680
Each chunk gets converted into a numerical vector
270
00:09:01,680 --> 00:09:03,480
that captures its semantic meaning.
271
00:09:03,480 --> 00:09:05,720
Those vectors get stored in a specialized database
272
00:09:05,720 --> 00:09:06,840
called a vector database.
273
00:09:06,840 --> 00:09:08,160
When a user asks a question,
274
00:09:08,160 --> 00:09:10,680
that question also gets converted into a vector.
275
00:09:10,680 --> 00:09:12,640
The database compares the question vector
276
00:09:12,640 --> 00:09:14,200
against all the document vectors
277
00:09:14,200 --> 00:09:15,800
and returns the closest matches.
278
00:09:15,800 --> 00:09:18,400
Those matches get inserted into the prompt sent to the LLM.
279
00:09:18,400 --> 00:09:20,640
The LLM now has both its general training
280
00:09:20,640 --> 00:09:22,840
and your specific documents as context.
281
00:09:22,840 --> 00:09:25,600
It generates an answer grounded in your actual content.
282
00:09:25,600 --> 00:09:28,080
This matters because fine tuning isn't a substitute.
283
00:09:28,080 --> 00:09:30,360
When you fine tune a model on internal documents,
284
00:09:30,360 --> 00:09:33,000
you bake specific information into the model weights.
285
00:09:33,000 --> 00:09:34,440
Updates become expensive.
286
00:09:34,440 --> 00:09:35,720
Privacy questions multiply
287
00:09:35,720 --> 00:09:37,160
because private data used in training
288
00:09:37,160 --> 00:09:39,640
can be memorized and inadvertently reproduced.
289
00:09:39,640 --> 00:09:42,520
A recent case study comparing rag against fine tuning
290
00:09:42,520 --> 00:09:44,640
finds that rag offers better factual accuracy
291
00:09:44,640 --> 00:09:45,680
and maintainability,
292
00:09:45,680 --> 00:09:47,920
especially when the knowledge-based changes frequently.
293
00:09:47,920 --> 00:09:49,920
Your SharePoint content changes daily.
294
00:09:49,920 --> 00:09:51,760
Rag reflects those changes immediately.
295
00:09:51,760 --> 00:09:53,120
Fine tuning doesn't.
296
00:09:53,120 --> 00:09:54,640
There's a mistake almost everyone makes
297
00:09:54,640 --> 00:09:56,320
when chunking SharePoint documents.
298
00:09:56,320 --> 00:09:58,080
They use the same chunk size for everything.
299
00:09:58,080 --> 00:10:02,000
PDFs, word docs, Excel sheets, PowerPoint decks.
300
00:10:02,000 --> 00:10:04,520
Each document type has a different structure.
301
00:10:04,520 --> 00:10:07,720
A uniform chunking strategy destroys retrieval accuracy
302
00:10:07,720 --> 00:10:10,120
because it breaks semantic boundaries in some documents
303
00:10:10,120 --> 00:10:11,440
and creates noise in others.
304
00:10:11,440 --> 00:10:13,760
I will show you exactly how to fix this later.
305
00:10:13,760 --> 00:10:16,200
For now, remember that retrieval isn't a bolt on.
306
00:10:16,200 --> 00:10:18,840
It's the foundation of trust where the enterprise AI.
307
00:10:18,840 --> 00:10:20,320
The architecture we're building doesn't send
308
00:10:20,320 --> 00:10:22,200
your documents to a model for training.
309
00:10:22,200 --> 00:10:24,560
It sends relevant chunks to a model for inference.
310
00:10:24,560 --> 00:10:25,720
The documents stay local.
311
00:10:25,720 --> 00:10:26,840
The embedding stay local.
312
00:10:26,840 --> 00:10:27,920
The model stays local.
313
00:10:27,920 --> 00:10:29,440
And the answers cite their sources.
314
00:10:29,440 --> 00:10:31,000
That's rag, that's the pattern.
315
00:10:31,000 --> 00:10:33,040
Now let us look at what we're retrieving from.
316
00:10:33,040 --> 00:10:34,960
SharePoint as the content backbone.
317
00:10:34,960 --> 00:10:36,440
SharePoint isn't just a file store.
318
00:10:36,440 --> 00:10:39,080
It's the governance backbone of your document ecosystem.
319
00:10:39,080 --> 00:10:41,520
Microsoft defines effective document management
320
00:10:41,520 --> 00:10:45,480
as specifying document types, templates, metadata, storage
321
00:10:45,480 --> 00:10:48,560
locations, access controls, workflows, and policies
322
00:10:48,560 --> 00:10:50,000
for auditing and retention.
323
00:10:50,000 --> 00:10:51,320
That's not a feature list.
324
00:10:51,320 --> 00:10:53,320
It's a description of how your organization already
325
00:10:53,320 --> 00:10:54,600
manages knowledge.
326
00:10:54,600 --> 00:10:57,120
When you build an AI layer on top of SharePoint,
327
00:10:57,120 --> 00:10:58,600
you're not starting from scratch.
328
00:10:58,600 --> 00:11:00,160
You're extending an existing system.
329
00:11:00,160 --> 00:11:02,120
SharePoint already knows who can see what.
330
00:11:02,120 --> 00:11:03,480
It already tracks versions.
331
00:11:03,480 --> 00:11:06,480
It already enforces retention policies through PerView.
332
00:11:06,480 --> 00:11:08,720
It already logs access through audit trails.
333
00:11:08,720 --> 00:11:11,160
Any AI system that bypasses these controls
334
00:11:11,160 --> 00:11:12,560
creates shadow governance.
335
00:11:12,560 --> 00:11:14,600
And shadow governance is where data breaches happen.
336
00:11:14,600 --> 00:11:16,640
The planning process for SharePoint document management
337
00:11:16,640 --> 00:11:17,920
is itself structured.
338
00:11:17,920 --> 00:11:20,720
Organizations identify document management roles.
339
00:11:20,720 --> 00:11:22,080
They analyze usage patterns.
340
00:11:22,080 --> 00:11:23,880
They plan site collections and libraries.
341
00:11:23,880 --> 00:11:27,000
They design content types that capture metadata and workflows.
342
00:11:27,000 --> 00:11:29,080
They configure approval processes.
343
00:11:29,080 --> 00:11:31,880
And they set policies for auditing, retention, and records
344
00:11:31,880 --> 00:11:32,480
management.
345
00:11:32,480 --> 00:11:33,840
This isn't overhead.
346
00:11:33,840 --> 00:11:37,320
It's the reason SharePoint is trusted in regulated environments.
347
00:11:37,320 --> 00:11:38,760
For our architecture, these structures
348
00:11:38,760 --> 00:11:40,240
are both an asset and a constraint.
349
00:11:40,240 --> 00:11:41,920
They provide rich metadata that can
350
00:11:41,920 --> 00:11:43,520
inform chunking decisions.
351
00:11:43,520 --> 00:11:45,680
A contract in the legal library carries different weight
352
00:11:45,680 --> 00:11:47,320
than a draft in the marketing folder.
353
00:11:47,320 --> 00:11:49,120
They provide clear access boundaries.
354
00:11:49,120 --> 00:11:51,360
The AI should never surface content from a library
355
00:11:51,360 --> 00:11:53,120
the user can't access directly.
356
00:11:53,120 --> 00:11:54,560
And they impose obligations.
357
00:11:54,560 --> 00:11:56,200
If a document is under legal hold,
358
00:11:56,200 --> 00:11:57,920
the AI must respect that hold.
359
00:11:57,920 --> 00:12:00,920
If a retention policy deletes a document after seven years,
360
00:12:00,920 --> 00:12:03,800
the AI must not preserve it indefinitely in a vector index.
361
00:12:03,800 --> 00:12:06,560
SharePoint exposes this content through multiple APIs.
362
00:12:06,560 --> 00:12:08,200
The traditional SharePoint REST API
363
00:12:08,200 --> 00:12:11,360
allows programmatic access to lists, libraries, and files.
364
00:12:11,360 --> 00:12:13,440
A PowerShell script can issue a get request
365
00:12:13,440 --> 00:12:16,120
against a library using a URL like your SharePoint site
366
00:12:16,120 --> 00:12:18,400
plus the API endpoint for list items.
367
00:12:18,400 --> 00:12:19,960
Appropriate headers and credentials
368
00:12:19,960 --> 00:12:21,800
retrieve the items for processing.
369
00:12:21,800 --> 00:12:24,720
The newer Microsoft Graph API provides a unified endpoint
370
00:12:24,720 --> 00:12:27,400
for SharePoint, OneDrive, Teams, and Exchange.
371
00:12:27,400 --> 00:12:30,160
And the Microsoft 365 Copilot Search API,
372
00:12:30,160 --> 00:12:33,720
currently in preview, allows hybrid semantic and lexical search
373
00:12:33,720 --> 00:12:36,160
over work content using natural language queries.
374
00:12:36,160 --> 00:12:38,200
For our local REC solution, these APIs
375
00:12:38,200 --> 00:12:40,000
provide the pipelines ingres point.
376
00:12:40,000 --> 00:12:41,720
A service running inside your perimeter
377
00:12:41,720 --> 00:12:44,360
authenticates against SharePoint using OAuth 2.0
378
00:12:44,360 --> 00:12:45,840
through Microsoft Enter ID.
379
00:12:45,840 --> 00:12:48,160
It enumerates libraries, it downloads documents,
380
00:12:48,160 --> 00:12:50,080
and it feeds them into the ingestion process
381
00:12:50,080 --> 00:12:52,240
without exposing content to external providers.
382
00:12:52,240 --> 00:12:53,240
This is critical.
383
00:12:53,240 --> 00:12:55,760
The ingestion service is the bridge between SharePoint
384
00:12:55,760 --> 00:12:56,960
and your local AI.
385
00:12:56,960 --> 00:12:58,480
It must authenticate securely.
386
00:12:58,480 --> 00:12:59,640
It must respect rate limits.
387
00:12:59,640 --> 00:13:03,040
It must handle versioning, and it must run inside your network.
388
00:13:03,040 --> 00:13:04,560
If you deploy the ingestion service
389
00:13:04,560 --> 00:13:07,440
in a cloud virtual machine, you have reintroduced
390
00:13:07,440 --> 00:13:09,120
the problem you're trying to solve.
391
00:13:09,120 --> 00:13:12,040
The authentication flow is standard Microsoft 365,
392
00:13:12,040 --> 00:13:14,120
register an application in Enter ID,
393
00:13:14,120 --> 00:13:16,120
granted application permissions for sites,
394
00:13:16,120 --> 00:13:17,640
read all or delegated permissions
395
00:13:17,640 --> 00:13:20,360
scoped to specific libraries, store the client secret
396
00:13:20,360 --> 00:13:23,080
or certificate in a local secret manager, not in code.
397
00:13:23,080 --> 00:13:25,720
Use the client credentials flow for background ingestion,
398
00:13:25,720 --> 00:13:28,120
and the on behalf of flow if you want user scoped queries
399
00:13:28,120 --> 00:13:29,920
that respect individual permissions.
400
00:13:29,920 --> 00:13:30,760
This isn't exotic.
401
00:13:30,760 --> 00:13:34,040
It's the same pattern you use for any Microsoft 365 integration.
402
00:13:34,040 --> 00:13:35,520
What changes is the destination.
403
00:13:35,520 --> 00:13:38,320
Instead of sending documents to a cloud AI service,
404
00:13:38,320 --> 00:13:40,400
you send them to a local chunking engine.
405
00:13:40,400 --> 00:13:42,480
Instead of calling a cloud embedding API,
406
00:13:42,480 --> 00:13:44,560
you call a local sentence transformer.
407
00:13:44,560 --> 00:13:46,560
Instead of storing vectors in a managed service,
408
00:13:46,560 --> 00:13:49,560
you store them in a local queue-drand or waviate instance.
409
00:13:49,560 --> 00:13:51,960
The APIs are the same, but the network path is different.
410
00:13:51,960 --> 00:13:52,760
That's the foundation.
411
00:13:52,760 --> 00:13:54,880
SharePoint isn't just where your documents live.
412
00:13:54,880 --> 00:13:56,440
It's where your governance lives,
413
00:13:56,440 --> 00:13:58,320
and our architecture preserves that governance
414
00:13:58,320 --> 00:13:59,800
while adding intelligence.
415
00:13:59,800 --> 00:14:01,280
But here is where most people get stuck.
416
00:14:01,280 --> 00:14:03,960
They assume they need a cloud LLM to make this useful.
417
00:14:03,960 --> 00:14:05,400
They look at the local deployment path
418
00:14:05,400 --> 00:14:08,240
and worry that the model will be too small, too slow,
419
00:14:08,240 --> 00:14:08,960
or too dumb.
420
00:14:08,960 --> 00:14:10,480
That assumption is outdated.
421
00:14:10,480 --> 00:14:13,880
Local versus cloud LLMs, the real trade-offs.
422
00:14:13,880 --> 00:14:16,960
Cloud LLM APIs offer undeniable advantages.
423
00:14:16,960 --> 00:14:19,600
Lower operational overhead, automatic scaling,
424
00:14:19,600 --> 00:14:21,800
access to frontier models with hundreds of billions
425
00:14:21,800 --> 00:14:22,560
of parameters.
426
00:14:22,560 --> 00:14:23,880
You don't manage drivers.
427
00:14:23,880 --> 00:14:25,360
You don't manage quantization.
428
00:14:25,360 --> 00:14:26,680
You don't manage fail-over.
429
00:14:26,680 --> 00:14:27,560
You send a prompt.
430
00:14:27,560 --> 00:14:28,600
You get an answer.
431
00:14:28,600 --> 00:14:30,240
But those advantages come with trade-offs
432
00:14:30,240 --> 00:14:32,280
that many organizations can't accept.
433
00:14:32,280 --> 00:14:33,960
Every prompt leaves your network.
434
00:14:33,960 --> 00:14:35,920
Every response passes through infrastructure
435
00:14:35,920 --> 00:14:37,040
you don't control.
436
00:14:37,040 --> 00:14:39,680
Every token incurs a cost that scales with usage.
437
00:14:39,680 --> 00:14:42,480
And the best models aren't available for local deployment
438
00:14:42,480 --> 00:14:44,520
at all because their weights are proprietary.
439
00:14:44,520 --> 00:14:46,520
Local LLM deployments flip that equation.
440
00:14:46,520 --> 00:14:48,600
The operational burden shifts to your team.
441
00:14:48,600 --> 00:14:50,440
The scaling responsibility becomes yours,
442
00:14:50,440 --> 00:14:52,280
but the data control becomes absolute.
443
00:14:52,280 --> 00:14:53,880
The cost becomes predictable,
444
00:14:53,880 --> 00:14:57,240
and the model becomes yours to configure, update, and audit.
445
00:14:57,240 --> 00:15:00,480
AI multiples comparison of cloud versus local LLMs
446
00:15:00,480 --> 00:15:03,160
notes that cloud models are attractive for organizations
447
00:15:03,160 --> 00:15:06,360
that prefer managed services and rapid experimentation.
448
00:15:06,360 --> 00:15:08,800
Local models are more suitable when data security
449
00:15:08,800 --> 00:15:10,320
and sovereignty are critical,
450
00:15:10,320 --> 00:15:12,600
and where organizations have or can invest
451
00:15:12,600 --> 00:15:13,720
in appropriate hardware.
452
00:15:13,720 --> 00:15:14,600
That's our scenario.
453
00:15:14,600 --> 00:15:15,480
We're not experimenting.
454
00:15:15,480 --> 00:15:17,320
We're building production infrastructure.
455
00:15:17,320 --> 00:15:19,640
The hardware requirements for local YAMA deployment
456
00:15:19,640 --> 00:15:21,880
are major but increasingly accessible.
457
00:15:21,880 --> 00:15:25,240
Guidance for LAMA 3 suggests targeting Nvidia GPUs
458
00:15:25,240 --> 00:15:28,520
with at least 16 gigabytes of VRAM, 32 gigabytes
459
00:15:28,520 --> 00:15:31,760
of system RAM, and roughly 50 gigabytes of free disk space
460
00:15:31,760 --> 00:15:33,560
for models and dependencies.
461
00:15:33,560 --> 00:15:35,320
Larger models of fine tuning workloads
462
00:15:35,320 --> 00:15:39,280
benefit from 64 gigabytes of RAM and more GPU memory.
463
00:15:39,280 --> 00:15:41,640
Community reports describe successful deployments
464
00:15:41,640 --> 00:15:43,600
on Linux distributions like Ubuntu
465
00:15:43,600 --> 00:15:45,760
by compiling inference engines like LAMA,
466
00:15:45,760 --> 00:15:49,080
CPP with CUDA support combined with appropriate Nvidia drivers.
467
00:15:49,080 --> 00:15:50,600
Olamma simplifies this further.
468
00:15:50,600 --> 00:15:53,720
It's a cross-platform application for macOS, Windows, and Linux
469
00:15:53,720 --> 00:15:56,520
that downloads and runs models via a local API endpoint.
470
00:15:56,520 --> 00:15:58,000
You pull a model with a single command,
471
00:15:58,000 --> 00:16:01,160
you query it with a simple HTTP request to local host.
472
00:16:01,160 --> 00:16:03,480
No container orchestration, no model conversion,
473
00:16:03,480 --> 00:16:05,320
no manual dependency management.
474
00:16:05,320 --> 00:16:07,720
For production use, you will want to run Olamma
475
00:16:07,720 --> 00:16:10,400
on a dedicated GPU server rather than a laptop
476
00:16:10,400 --> 00:16:11,640
but the abstraction is the same.
477
00:16:11,640 --> 00:16:14,040
A 2026 total cost of ownership analysis
478
00:16:14,040 --> 00:16:16,160
suggests that beyond certain usage thresholds
479
00:16:16,160 --> 00:16:19,120
running open source models on dedicated GPU servers
480
00:16:19,120 --> 00:16:22,240
can become more cost effective than paying per token API fees
481
00:16:22,240 --> 00:16:24,600
despite high upfront hardware costs.
482
00:16:24,600 --> 00:16:26,880
The exact threshold depends on utilization,
483
00:16:26,880 --> 00:16:30,280
model size, energy costs, and operational expertise.
484
00:16:30,280 --> 00:16:31,920
But the directional insight is clear.
485
00:16:31,920 --> 00:16:35,080
If your organization will process thousands of queries daily
486
00:16:35,080 --> 00:16:37,160
across tens of thousands of documents,
487
00:16:37,160 --> 00:16:38,840
local deployment isn't a luxury.
488
00:16:38,840 --> 00:16:40,480
It's a financial optimization.
489
00:16:40,480 --> 00:16:43,720
Subjective experience comparisons between cloud and local models
490
00:16:43,720 --> 00:16:45,840
tend to emphasize that frontier cloud models
491
00:16:45,840 --> 00:16:49,560
still outperform smaller local ones on reasoning and nuance.
492
00:16:49,560 --> 00:16:52,040
But local models are increasingly acceptable for enterprise
493
00:16:52,040 --> 00:16:54,240
tasks when carefully selected and configured.
494
00:16:54,240 --> 00:16:56,760
The main phrase is carefully selected and configured.
495
00:16:56,760 --> 00:16:58,840
A badly chosen local model with poor prompting
496
00:16:58,840 --> 00:16:59,800
will disappoint.
497
00:16:59,800 --> 00:17:02,280
A well-chosen model with good rag will surprise you.
498
00:17:02,280 --> 00:17:04,920
For our architecture, the model isn't doing everything.
499
00:17:04,920 --> 00:17:07,200
It's answering questions based on retrieved context.
500
00:17:07,200 --> 00:17:08,640
It doesn't need to know quantum physics.
501
00:17:08,640 --> 00:17:10,920
It needs to synthesize policy documents, contracts,
502
00:17:10,920 --> 00:17:13,440
and procedures into coherent responses.
503
00:17:13,440 --> 00:17:15,760
That's a narrower task than general reasoning.
504
00:17:15,760 --> 00:17:17,640
And local models handle it well.
505
00:17:17,640 --> 00:17:19,960
Meta now offers Lama 4 as its flagship family
506
00:17:19,960 --> 00:17:22,520
alongside the Open Weight Lama 3 series.
507
00:17:22,520 --> 00:17:25,840
Deployment paths exist for both cloud and local scenarios.
508
00:17:25,840 --> 00:17:28,640
For our air-gaped architecture, we pull the open weights,
509
00:17:28,640 --> 00:17:30,720
quantize them for our hardware, and serve them
510
00:17:30,720 --> 00:17:32,240
through Olamma or Lama.
511
00:17:32,240 --> 00:17:33,000
CPP.
512
00:17:33,000 --> 00:17:35,040
The license is permissive for commercial use.
513
00:17:35,040 --> 00:17:35,880
The weights are yours.
514
00:17:35,880 --> 00:17:38,000
The model is yours and the answers stay yours.
515
00:17:38,000 --> 00:17:39,920
Once you have committed to local inference,
516
00:17:39,920 --> 00:17:41,760
the next decision is the vector database.
517
00:17:41,760 --> 00:17:43,400
And this is where many architects make
518
00:17:43,400 --> 00:17:44,720
their first real mistake.
519
00:17:44,720 --> 00:17:46,560
Vector databases and embeddings.
520
00:17:46,560 --> 00:17:48,640
Embeddings are the bridge between human language
521
00:17:48,640 --> 00:17:49,640
and machine search.
522
00:17:49,640 --> 00:17:52,120
A sentence transformer model takes a piece of text
523
00:17:52,120 --> 00:17:55,080
and converts it into a dense vector of floating point numbers.
524
00:17:55,080 --> 00:17:56,960
That vector captures semantic meaning.
525
00:17:56,960 --> 00:17:58,920
Sentence is about similar topics produce vectors
526
00:17:58,920 --> 00:18:00,880
that are close together in high-dimensional space.
527
00:18:00,880 --> 00:18:02,840
Sentence is about unrelated topics produce vectors
528
00:18:02,840 --> 00:18:03,880
that are far apart.
529
00:18:03,880 --> 00:18:05,360
This isn't keyword search.
530
00:18:05,360 --> 00:18:07,200
A keyword search for termination policy
531
00:18:07,200 --> 00:18:09,840
might miss a document titled off-boarding procedures.
532
00:18:09,840 --> 00:18:12,120
An embedding search finds it because the semantic meaning
533
00:18:12,120 --> 00:18:12,880
is similar.
534
00:18:12,880 --> 00:18:14,920
The model understands that off-boarding and termination
535
00:18:14,920 --> 00:18:16,160
are related concepts.
536
00:18:16,160 --> 00:18:18,240
It encodes that relationship into the geometry
537
00:18:18,240 --> 00:18:19,400
of the vector space.
538
00:18:19,400 --> 00:18:21,400
Sentence transformers come in many flavors.
539
00:18:21,400 --> 00:18:24,040
The all-mini LML6 V2 model is small, fast,
540
00:18:24,040 --> 00:18:25,200
and runs well on CPU.
541
00:18:25,200 --> 00:18:28,160
It produces 384 dimensional vectors.
542
00:18:28,160 --> 00:18:31,360
The BGE large and model is larger, slower, and more accurate.
543
00:18:31,360 --> 00:18:34,200
It produces 1,024 dimensional vectors.
544
00:18:34,200 --> 00:18:35,960
For a local air-gapped deployment,
545
00:18:35,960 --> 00:18:38,240
you run the embedding model on the same GPU server
546
00:18:38,240 --> 00:18:41,120
as your LLM or on a separate CPU worker.
547
00:18:41,120 --> 00:18:44,160
The critical rule is that the embedding model must run locally.
548
00:18:44,160 --> 00:18:46,680
Don't call a cloud embedding API doing so
549
00:18:46,680 --> 00:18:49,200
would send your document chunks to an external service,
550
00:18:49,200 --> 00:18:51,480
defeating the entire purpose of the architecture.
551
00:18:51,480 --> 00:18:53,320
The vector database stores these embeddings
552
00:18:53,320 --> 00:18:55,080
and performs similarity search.
553
00:18:55,080 --> 00:18:57,600
When a user asks a question, the question gets embedded
554
00:18:57,600 --> 00:18:58,920
using the same model.
555
00:18:58,920 --> 00:19:01,080
The database compares this query vector
556
00:19:01,080 --> 00:19:03,120
against all stored document vectors
557
00:19:03,120 --> 00:19:04,520
and returns the nearest neighbors.
558
00:19:04,520 --> 00:19:07,640
This is called approximate nearest neighbor search, OANN.
559
00:19:07,640 --> 00:19:09,320
It's fast even with millions of vectors
560
00:19:09,320 --> 00:19:12,320
because the database users specialize in their structures.
561
00:19:12,320 --> 00:19:14,280
Several vector databases are available.
562
00:19:14,280 --> 00:19:15,440
Your grant is written in Rust,
563
00:19:15,440 --> 00:19:17,200
its fast, memory efficient, and supports
564
00:19:17,200 --> 00:19:18,880
rich metadata filtering.
565
00:19:18,880 --> 00:19:21,680
You can attach tags to each vector, such as document source,
566
00:19:21,680 --> 00:19:23,760
library name, author, and permission level.
567
00:19:23,760 --> 00:19:25,640
Then you can filter searches to only vectors
568
00:19:25,640 --> 00:19:27,800
from libraries the user is allowed to access.
569
00:19:27,800 --> 00:19:29,600
Wevey8 offers a GraphQL interface
570
00:19:29,600 --> 00:19:31,240
and native multimodal support.
571
00:19:31,240 --> 00:19:33,200
Milvus is designed for cloud-native scaling
572
00:19:33,200 --> 00:19:34,080
with Kubernetes.
573
00:19:34,080 --> 00:19:36,480
Chroma is lightweight and ideal for prototyping.
574
00:19:36,480 --> 00:19:38,040
For an air-gapped SharePoint deployment,
575
00:19:38,040 --> 00:19:40,400
Q-drand and Wevey8 are the pragmatic choices.
576
00:19:40,400 --> 00:19:42,240
Both run on-premises via Docker.
577
00:19:42,240 --> 00:19:43,760
Both support the metadata filtering
578
00:19:43,760 --> 00:19:45,560
you need for permission-aware retrieval.
579
00:19:45,560 --> 00:19:48,200
Both have stable APIs and active communities.
580
00:19:48,200 --> 00:19:50,920
The choice between them often comes down to team preference.
581
00:19:50,920 --> 00:19:52,920
If your team likes Rust APIs and JSON,
582
00:19:52,920 --> 00:19:54,520
Q-drand feels natural.
583
00:19:54,520 --> 00:19:57,040
If your team likes GraphQL and semantic search features,
584
00:19:57,040 --> 00:19:58,440
Wevey8 fits better.
585
00:19:58,440 --> 00:20:00,760
Chunking strategy determines whether your embeddings
586
00:20:00,760 --> 00:20:02,320
are meaningful or noisy.
587
00:20:02,320 --> 00:20:03,800
A document chunk is a piece of text
588
00:20:03,800 --> 00:20:05,640
that gets embedded as a single unit.
589
00:20:05,640 --> 00:20:08,000
If chunks are too large, they dilute meaning.
590
00:20:08,000 --> 00:20:11,040
A 5,000-word chunk about the entire employee handbook
591
00:20:11,040 --> 00:20:13,560
embeds into a single vector that represents everything
592
00:20:13,560 --> 00:20:14,360
and nothing.
593
00:20:14,360 --> 00:20:16,600
If chunks are too small, they fragment meaning.
594
00:20:16,600 --> 00:20:18,960
A single sentence like section 4.2 applies
595
00:20:18,960 --> 00:20:21,280
to all full-time employees carries no context
596
00:20:21,280 --> 00:20:23,240
about what section 4.2 actually says.
597
00:20:23,240 --> 00:20:26,040
The mistake I mentioned earlier is using uniform chunking
598
00:20:26,040 --> 00:20:28,040
for all SharePoint document types.
599
00:20:28,040 --> 00:20:29,680
PDFs need page-aware boundaries
600
00:20:29,680 --> 00:20:32,800
because page breaks often separate unrelated topics.
601
00:20:32,800 --> 00:20:34,480
Word documents need heading-aware chunking
602
00:20:34,480 --> 00:20:36,560
because heading's defined semantic sections.
603
00:20:36,560 --> 00:20:38,600
Excel spreadsheets need row-group chunking
604
00:20:38,600 --> 00:20:41,200
with header preservation because a row without column headers
605
00:20:41,200 --> 00:20:42,400
is meaningless.
606
00:20:42,400 --> 00:20:44,720
PowerPoint decks need slide-level chunking
607
00:20:44,720 --> 00:20:47,320
because each slide is a self-contained unit.
608
00:20:47,320 --> 00:20:49,920
Pinecones research on chunking strategies confirms this.
609
00:20:49,920 --> 00:20:52,960
Fixed-sized chunking with overlap works for homogeneous text.
610
00:20:52,960 --> 00:20:55,080
Semantic chunking based on sentence boundaries
611
00:20:55,080 --> 00:20:56,640
works for narrative documents.
612
00:20:56,640 --> 00:20:58,680
Recursive chunking that tries paragraphs,
613
00:20:58,680 --> 00:21:01,440
then sentences, then words works for mixed content.
614
00:21:01,440 --> 00:21:03,520
For SharePoint, you need a hybrid approach.
615
00:21:03,520 --> 00:21:04,880
Detect the document type.
616
00:21:04,880 --> 00:21:06,560
Apply the appropriate strategy.
617
00:21:06,560 --> 00:21:08,520
Preserve metadata at every step.
618
00:21:08,520 --> 00:21:10,320
The practical setup looks like this.
619
00:21:10,320 --> 00:21:12,680
Your ingestion service downloads a Word document
620
00:21:12,680 --> 00:21:13,800
from SharePoint.
621
00:21:13,800 --> 00:21:17,000
It extracts text while preserving heading structure.
622
00:21:17,000 --> 00:21:20,360
It breaks the text into chunks of roughly 500 tokens
623
00:21:20,360 --> 00:21:22,120
with a 50 token overlap.
624
00:21:22,120 --> 00:21:23,840
It attaches metadata to each chunk,
625
00:21:23,840 --> 00:21:26,480
including the source URL document title author,
626
00:21:26,480 --> 00:21:29,160
last modified date, and SharePoint library.
627
00:21:29,160 --> 00:21:31,800
It sends the chunk to your local sentence transformer.
628
00:21:31,800 --> 00:21:33,240
The transformer returns a vector.
629
00:21:33,240 --> 00:21:36,160
The vector gets stored in queue-drand with its metadata.
630
00:21:36,160 --> 00:21:38,760
The process repeats for every document in the library.
631
00:21:38,760 --> 00:21:40,480
For a library of 1,000 documents,
632
00:21:40,480 --> 00:21:42,080
this takes minutes, not hours.
633
00:21:42,080 --> 00:21:45,120
For 10,000 documents, it takes longer, but runs unattended.
634
00:21:45,120 --> 00:21:46,600
And once the initial index is built,
635
00:21:46,600 --> 00:21:48,320
Delta updates handle changes.
636
00:21:48,320 --> 00:21:50,400
When a document is modified in SharePoint,
637
00:21:50,400 --> 00:21:52,200
the ingestion service detects the change,
638
00:21:52,200 --> 00:21:54,760
rechunks the updated document, re-embeds the chunks,
639
00:21:54,760 --> 00:21:56,760
and updates the vectors in the database.
640
00:21:56,760 --> 00:21:58,840
Delete a document's trigger vector deletion.
641
00:21:58,840 --> 00:22:00,000
That's the memory layer.
642
00:22:00,000 --> 00:22:01,360
Now for the brain.
643
00:22:01,360 --> 00:22:02,680
The midpoint revelation.
644
00:22:02,680 --> 00:22:05,800
Everything we have covered so far is what vendors sell you.
645
00:22:05,800 --> 00:22:07,480
Cloud AI with enterprise controls,
646
00:22:07,480 --> 00:22:09,320
sovereign cloud with regional processing,
647
00:22:09,320 --> 00:22:11,080
managed drag with vector databases.
648
00:22:11,080 --> 00:22:12,480
The environment is full of platforms
649
00:22:12,480 --> 00:22:14,000
that promise to solve this problem
650
00:22:14,000 --> 00:22:16,040
while keeping you inside their ecosystem.
651
00:22:16,040 --> 00:22:18,320
But the real architecture is simpler than you think,
652
00:22:18,320 --> 00:22:21,360
a single GPU server, an open source vector database,
653
00:22:21,360 --> 00:22:23,760
a local YAMA instance, a lightweight web interface.
654
00:22:23,760 --> 00:22:26,000
And the SharePoint APIs you already know how to use.
655
00:22:26,000 --> 00:22:28,400
That's the entire stack, no cloud model subscriptions,
656
00:22:28,400 --> 00:22:31,640
no per token pricing, no vendor lock-in, no legal exposure.
657
00:22:31,640 --> 00:22:33,320
This isn't science fiction.
658
00:22:33,320 --> 00:22:36,680
In April 2025, a developer published a complete implementation
659
00:22:36,680 --> 00:22:39,560
of exactly this architecture for SharePoint on premises,
660
00:22:39,560 --> 00:22:41,920
CPAT minimal API for the backend,
661
00:22:41,920 --> 00:22:43,560
QDRIND for the vector database,
662
00:22:43,560 --> 00:22:47,000
OLAMA for local LLM inference, SharePoint as the document source.
663
00:22:47,000 --> 00:22:49,280
It handled authentication, ingestion, chunking,
664
00:22:49,280 --> 00:22:51,040
embedding, retrieval, and generation,
665
00:22:51,040 --> 00:22:53,360
all on local hardware, all under local control.
666
00:22:53,360 --> 00:22:55,920
A Reddit discussion from August 2025 details
667
00:22:55,920 --> 00:22:57,440
a similar implementation scaling
668
00:22:57,440 --> 00:22:59,560
to over 6,000 SharePoint documents.
669
00:22:59,560 --> 00:23:01,440
The developer faced real problems.
670
00:23:01,440 --> 00:23:03,280
Chunking PDFs with mixed layouts,
671
00:23:03,280 --> 00:23:04,880
handling SharePoint rate limits,
672
00:23:04,880 --> 00:23:07,920
tuning retrieval to avoid surfacing outdated versions.
673
00:23:07,920 --> 00:23:11,200
And they solved them with open source tools and community support.
674
00:23:11,200 --> 00:23:13,400
The point isn't that these specific implementations
675
00:23:13,400 --> 00:23:15,160
are production ready for your environment.
676
00:23:15,160 --> 00:23:17,400
The point is that the architecture is proven.
677
00:23:17,400 --> 00:23:18,760
People are building this today.
678
00:23:18,760 --> 00:23:19,920
They're solving the problems.
679
00:23:19,920 --> 00:23:22,520
And they're doing it without sending proprietary data
680
00:23:22,520 --> 00:23:23,800
to external APIs.
681
00:23:23,800 --> 00:23:25,240
What changes isn't the technology?
682
00:23:25,240 --> 00:23:26,200
It's the stance.
683
00:23:26,200 --> 00:23:29,320
Most organizations approach AI as a service to consume.
684
00:23:29,320 --> 00:23:31,720
They evaluate vendors, they negotiate contracts,
685
00:23:31,720 --> 00:23:33,600
they audit compliance, and they hope
686
00:23:33,600 --> 00:23:36,120
the vendor's architecture matches with their risk model.
687
00:23:36,120 --> 00:23:37,520
The stance we're taking is different.
688
00:23:37,520 --> 00:23:39,480
We treat AI as infrastructure to own.
689
00:23:39,480 --> 00:23:41,880
We select open models with permissive licenses.
690
00:23:41,880 --> 00:23:43,640
We deploy them on hardware, we control,
691
00:23:43,640 --> 00:23:45,680
we connect them to data sources we govern.
692
00:23:45,680 --> 00:23:48,240
And we accept the operational burden in exchange for sovereignty.
693
00:23:48,240 --> 00:23:51,080
That's the shift from cloud first to sovereignty first,
694
00:23:51,080 --> 00:23:54,120
from consumption to ownership, from delegation to control.
695
00:23:54,120 --> 00:23:56,560
Let me show you exactly how the data flows.
696
00:23:56,560 --> 00:23:57,800
The ingestion layer.
697
00:23:57,800 --> 00:24:00,000
The ingestion service is the bridge between SharePoint
698
00:24:00,000 --> 00:24:01,200
and your local AI.
699
00:24:01,200 --> 00:24:03,200
It's also the most security sensitive component
700
00:24:03,200 --> 00:24:05,720
because it has read access to your document libraries,
701
00:24:05,720 --> 00:24:07,760
designed it carefully.
702
00:24:07,760 --> 00:24:11,200
SharePoint content is unstructured, versioned, and permissioned.
703
00:24:11,200 --> 00:24:13,480
You can't simply dump files into a vector database
704
00:24:13,480 --> 00:24:14,600
and hope for the best.
705
00:24:14,600 --> 00:24:17,520
The ingestion service must understand document types, respect
706
00:24:17,520 --> 00:24:20,000
access controls, handle versioning, and manage
707
00:24:20,000 --> 00:24:20,880
delta updates.
708
00:24:20,880 --> 00:24:22,760
If it fails at any of these, your AI layer
709
00:24:22,760 --> 00:24:24,720
becomes either inaccurate or insecure.
710
00:24:24,720 --> 00:24:26,760
The first decision is authentication.
711
00:24:26,760 --> 00:24:29,120
The ingestion service must authenticate against SharePoint
712
00:24:29,120 --> 00:24:32,320
online or SharePoint on premises using Microsoft EntraID.
713
00:24:32,320 --> 00:24:35,240
For SharePoint online, this means OOOTH 2.0
714
00:24:35,240 --> 00:24:38,200
with either application permissions or delegated permissions.
715
00:24:38,200 --> 00:24:40,840
Application permissions grant the service broad access,
716
00:24:40,840 --> 00:24:42,600
which is simpler but less secure.
717
00:24:42,600 --> 00:24:45,640
Delegated permissions scope access to what the specific user
718
00:24:45,640 --> 00:24:47,720
or service principle is allowed to see,
719
00:24:47,720 --> 00:24:50,280
which is more secure but more complex to manage.
720
00:24:50,280 --> 00:24:53,360
For an air-gapped architecture, I recommend a hybrid approach.
721
00:24:53,360 --> 00:24:56,480
Use application permissions, scope to specific libraries,
722
00:24:56,480 --> 00:24:57,640
rather than sites.
723
00:24:57,640 --> 00:24:59,520
Read all across the entire tenant.
724
00:24:59,520 --> 00:25:01,920
Create a dedicated service principle in EntraID
725
00:25:01,920 --> 00:25:04,600
with a descriptive name like SP Local Ragngestion.
726
00:25:04,600 --> 00:25:07,600
Granted access only to the libraries you intend to index.
727
00:25:07,600 --> 00:25:09,400
Store the client's secret or certificate
728
00:25:09,400 --> 00:25:11,680
in a local secret manager like Hashikop Vault
729
00:25:11,680 --> 00:25:14,160
or as your main vault if you have a hybrid environment.
730
00:25:14,160 --> 00:25:15,480
Never hard code credentials.
731
00:25:15,480 --> 00:25:17,480
Never commit secrets to repositories.
732
00:25:17,480 --> 00:25:20,360
The SharePoint REST API provides the ingestion endpoint.
733
00:25:20,360 --> 00:25:23,400
You construct URLs like your SharePoint site collection
734
00:25:23,400 --> 00:25:25,680
plus the API path for list items.
735
00:25:25,680 --> 00:25:28,400
You specify headers for JSON, accept, and content types.
736
00:25:28,400 --> 00:25:30,960
You handle pagination because libraries can contain thousands
737
00:25:30,960 --> 00:25:31,960
of documents.
738
00:25:31,960 --> 00:25:35,080
And you filter by content type, modified date or library path
739
00:25:35,080 --> 00:25:36,320
to limit the scope.
740
00:25:36,320 --> 00:25:38,840
Microsoft Graph offers a more modern alternative.
741
00:25:38,840 --> 00:25:42,240
The Graph API provides a unified endpoint for SharePoint, OneDrive,
742
00:25:42,240 --> 00:25:43,840
Teams, and Exchange.
743
00:25:43,840 --> 00:25:47,080
For document ingestion, you query the Drive items endpoint
744
00:25:47,080 --> 00:25:49,320
for a specific site or library.
745
00:25:49,320 --> 00:25:52,160
You get metadata including file name, size, last modified date,
746
00:25:52,160 --> 00:25:53,280
and download URL.
747
00:25:53,280 --> 00:25:55,800
You download the file content using the download URL
748
00:25:55,800 --> 00:25:57,240
and you process it locally.
749
00:25:57,240 --> 00:26:00,760
The Microsoft 365 co-pilot search API currently in preview
750
00:26:00,760 --> 00:26:02,000
offers a third option.
751
00:26:02,000 --> 00:26:04,120
It allows hybrid semantic and lexical search
752
00:26:04,120 --> 00:26:06,640
over work content using natural language queries.
753
00:26:06,640 --> 00:26:09,200
For our architecture, this is less relevant for ingestion
754
00:26:09,200 --> 00:26:10,720
but useful for validation.
755
00:26:10,720 --> 00:26:13,480
You can compare your local rag results against co-pilot search
756
00:26:13,480 --> 00:26:16,280
to verify coverage and accuracy during testing.
757
00:26:16,280 --> 00:26:19,440
Document extraction is where the ingestion service earns its keep.
758
00:26:19,440 --> 00:26:22,000
Different document types require different extractors.
759
00:26:22,000 --> 00:26:24,640
For word documents, the Dose X format is a zip archive
760
00:26:24,640 --> 00:26:26,240
containing XML files.
761
00:26:26,240 --> 00:26:28,240
You can extract the text from document XML
762
00:26:28,240 --> 00:26:29,960
without installing Microsoft Office.
763
00:26:29,960 --> 00:26:32,120
For PDFs, you need a text extraction library.
764
00:26:32,120 --> 00:26:34,560
Be careful with mixed layout PDFs that contain both text
765
00:26:34,560 --> 00:26:35,080
and images.
766
00:26:35,080 --> 00:26:38,160
Tables and PDFs are notoriously difficult to extract correctly.
767
00:26:38,160 --> 00:26:40,160
For Excel spreadsheets, you need to flatten rows
768
00:26:40,160 --> 00:26:42,320
into text while preserving column headers.
769
00:26:42,320 --> 00:26:44,880
For PowerPoint text, you extract text from slides
770
00:26:44,880 --> 00:26:46,520
and optionally speaker notes.
771
00:26:46,520 --> 00:26:48,880
The extraction step must preserve structure.
772
00:26:48,880 --> 00:26:50,320
A word document with clear headings
773
00:26:50,320 --> 00:26:52,880
should produce text segments that know which heading they belong
774
00:26:52,880 --> 00:26:53,520
under.
775
00:26:53,520 --> 00:26:56,560
An Excel sheet should produce rows that include column context.
776
00:26:56,560 --> 00:26:58,960
A PowerPoint deck should separate slides.
777
00:26:58,960 --> 00:27:02,320
The structural metadata feeds into the chunking engine later.
778
00:27:02,320 --> 00:27:04,600
If you throw away structure during extraction,
779
00:27:04,600 --> 00:27:06,560
the chunking engine has nothing to work with.
780
00:27:06,560 --> 00:27:08,800
Let me give you a concrete example of what extraction looks
781
00:27:08,800 --> 00:27:10,720
like for a typical contract document.
782
00:27:10,720 --> 00:27:12,960
A word file named Employment Contract Template.
783
00:27:12,960 --> 00:27:14,480
Docs lives in the HR library.
784
00:27:14,480 --> 00:27:18,160
The ingestion service downloads it using the SharePoint REST API.
785
00:27:18,160 --> 00:27:21,000
It opens the Docx package, which is a zip archive containing
786
00:27:21,000 --> 00:27:21,960
XML files.
787
00:27:21,960 --> 00:27:24,280
It reads document.xml and extracts paragraphs
788
00:27:24,280 --> 00:27:26,200
while preserving the paragraph styles.
789
00:27:26,200 --> 00:27:28,600
Paragraph style is heading one become section markers.
790
00:27:28,600 --> 00:27:31,840
Paragraph style as normal become body text.
791
00:27:31,840 --> 00:27:34,880
Paragraph style as list bullet become list items.
792
00:27:34,880 --> 00:27:36,800
The extraction output is a structured text file
793
00:27:36,800 --> 00:27:38,200
that looks like this.
794
00:27:38,200 --> 00:27:39,920
Heading one, employment terms.
795
00:27:39,920 --> 00:27:41,880
Body, this contract is governed by the laws
796
00:27:41,880 --> 00:27:43,080
of the state of Delaware.
797
00:27:43,080 --> 00:27:45,840
Heading one, compensation, body, the employee
798
00:27:45,840 --> 00:27:48,760
shall receive a base salary as specified in Appendix A.
799
00:27:48,760 --> 00:27:51,040
This structure is exactly what the chunking engine needs
800
00:27:51,040 --> 00:27:53,040
to create semantically coherent chunks.
801
00:27:53,040 --> 00:27:55,960
PDF extraction is more complex because PDF is a presentation
802
00:27:55,960 --> 00:27:57,680
format, not a content format.
803
00:27:57,680 --> 00:28:00,880
A PDF file contains drawing commands, not paragraphs.
804
00:28:00,880 --> 00:28:03,920
The text you see on the page is positioned absolutely.
805
00:28:03,920 --> 00:28:05,760
Two words that appear next to each other
806
00:28:05,760 --> 00:28:08,320
might be stored in the PDF as separate objects
807
00:28:08,320 --> 00:28:09,880
with no explicit relationship.
808
00:28:09,880 --> 00:28:13,480
Good PDF extractors like Pi PDF2, PDF Plumber, or PDF Miner
809
00:28:13,480 --> 00:28:15,520
use heuristics to reconstruct reading order.
810
00:28:15,520 --> 00:28:18,240
They detect columns, they identify headers and footers.
811
00:28:18,240 --> 00:28:21,400
They separate tables from body text, but they're not perfect.
812
00:28:21,400 --> 00:28:23,920
A scanned PDF that contains images instead of text
813
00:28:23,920 --> 00:28:27,240
requires OCR, which adds another layer of complexity and error.
814
00:28:27,240 --> 00:28:29,720
Testract is a common open source OCR engine.
815
00:28:29,720 --> 00:28:32,080
It converts images to text with reasonable accuracy
816
00:28:32,080 --> 00:28:35,280
for clean documents, but handwritten annotations, stamps,
817
00:28:35,280 --> 00:28:38,000
and poor scan quality will produce garbage text
818
00:28:38,000 --> 00:28:39,360
that pollutes your index.
819
00:28:39,360 --> 00:28:41,920
For Excel, the challenge is that a single spreadsheet
820
00:28:41,920 --> 00:28:44,720
might contain multiple sheets, each with a different purpose.
821
00:28:44,720 --> 00:28:46,680
Sheet one might be employee data.
822
00:28:46,680 --> 00:28:48,600
Sheet two might be salary bands, sheet three
823
00:28:48,600 --> 00:28:49,600
might be a lookup table.
824
00:28:49,600 --> 00:28:52,680
If you flatten the entire workbook into a single text stream,
825
00:28:52,680 --> 00:28:55,560
the retrieval engine can't distinguish between an employee name
826
00:28:55,560 --> 00:28:57,040
and a salary band value.
827
00:28:57,040 --> 00:28:59,240
The extraction must preserve sheet boundaries.
828
00:28:59,240 --> 00:29:01,880
It must include column headers in every data row context,
829
00:29:01,880 --> 00:29:04,400
and it must skip empty rows and hidden sheets
830
00:29:04,400 --> 00:29:06,400
that contain no meaningful content.
831
00:29:06,400 --> 00:29:09,520
Powerpoint extraction faces the opposite problem of Excel.
832
00:29:09,520 --> 00:29:11,720
Each slide is already a self-contained unit,
833
00:29:11,720 --> 00:29:14,400
but slides contain title text, body bullets, speaker
834
00:29:14,400 --> 00:29:15,880
notes, and embedded charts.
835
00:29:15,880 --> 00:29:18,920
The title and bullets are usually the content you want to index.
836
00:29:18,920 --> 00:29:21,040
Speaker notes might contain presenter guidance
837
00:29:21,040 --> 00:29:22,920
that's irrelevant to document retrieval.
838
00:29:22,920 --> 00:29:25,880
Charts contain data that might be useful if extracted as text,
839
00:29:25,880 --> 00:29:27,400
but is usually stored as images.
840
00:29:27,400 --> 00:29:30,320
A good PowerPoint extractor pulls title and bullet text
841
00:29:30,320 --> 00:29:32,000
while optionally including speaker notes
842
00:29:32,000 --> 00:29:33,720
if your use case requires them.
843
00:29:33,720 --> 00:29:36,200
Error handling at the extraction layer is non-trivial.
844
00:29:36,200 --> 00:29:38,640
Documents in SharePoint aren't always well-formed.
845
00:29:38,640 --> 00:29:40,080
A word file might be corrupted.
846
00:29:40,080 --> 00:29:41,960
A PDF might be password protected.
847
00:29:41,960 --> 00:29:44,560
An Excel sheet might contain circular references
848
00:29:44,560 --> 00:29:46,200
that crash the parser.
849
00:29:46,200 --> 00:29:48,560
Your ingestion service must handle these gracefully.
850
00:29:48,560 --> 00:29:51,440
Log the failure, skip the document, alert the administrator,
851
00:29:51,440 --> 00:29:52,880
and continue with the rest.
852
00:29:52,880 --> 00:29:54,720
A single bad document shouldn't stop the indexing
853
00:29:54,720 --> 00:29:56,000
of 10,000 good ones.
854
00:29:56,000 --> 00:29:58,040
Retri logic with exponential back-off protects
855
00:29:58,040 --> 00:29:59,280
against transient failures.
856
00:29:59,280 --> 00:30:02,560
If SharePoint returns HTTP 500 or 503,
857
00:30:02,560 --> 00:30:05,480
wait 10 seconds and retry, then 20 seconds, then 40.
858
00:30:05,480 --> 00:30:08,080
If it still fails after three retreats, log and move on.
859
00:30:08,080 --> 00:30:10,920
If the vector database is temporarily unreachable,
860
00:30:10,920 --> 00:30:13,200
queue the vectors in local storage and retry.
861
00:30:13,200 --> 00:30:15,000
If the embedding model is overloaded,
862
00:30:15,000 --> 00:30:17,440
pause the batch and wait for GPU memory to free up.
863
00:30:17,440 --> 00:30:19,640
Resilience isn't a feature you add later.
864
00:30:19,640 --> 00:30:21,800
It's a property you design in from the start.
865
00:30:21,800 --> 00:30:23,720
Delta handling is critical for production.
866
00:30:23,720 --> 00:30:26,640
You don't want to re-index 10,000 documents every night.
867
00:30:26,640 --> 00:30:29,600
You want to detect what changed and process only that.
868
00:30:29,600 --> 00:30:31,280
SharePoint provides a changes endpoint
869
00:30:31,280 --> 00:30:34,800
that returns items modified since a specific timestamp.
870
00:30:34,800 --> 00:30:36,680
Your ingestion service stores a watermark,
871
00:30:36,680 --> 00:30:38,360
the last processed timestamp,
872
00:30:38,360 --> 00:30:40,760
and queries for changes since that watermark.
873
00:30:40,760 --> 00:30:44,040
New documents get extracted, chunked, embedded and stored.
874
00:30:44,040 --> 00:30:46,240
Modified documents get their old vectors deleted
875
00:30:46,240 --> 00:30:47,600
and new vectors inserted.
876
00:30:47,600 --> 00:30:49,960
Deleted documents trigger vector deletion.
877
00:30:49,960 --> 00:30:51,920
Security considerations at the ingestion layer
878
00:30:51,920 --> 00:30:54,120
are straightforward but non-negotiable.
879
00:30:54,120 --> 00:30:56,800
The ingestion service must run inside your perimeter.
880
00:30:56,800 --> 00:30:59,440
It should have no outbound internet connectivity
881
00:30:59,440 --> 00:31:02,640
except to Microsoft 365 if you're using SharePoint online.
882
00:31:02,640 --> 00:31:04,720
It should log every documented processes,
883
00:31:04,720 --> 00:31:08,080
every error it encounters, and every API call it makes.
884
00:31:08,080 --> 00:31:10,520
Logs stay local, the service should fail closed.
885
00:31:10,520 --> 00:31:12,400
If authentication fails, it stops.
886
00:31:12,400 --> 00:31:14,600
If the vector database is unreachable, it stops.
887
00:31:14,600 --> 00:31:17,040
If a document can't be processed, it logs the error
888
00:31:17,040 --> 00:31:18,520
and continues with the rest.
889
00:31:18,520 --> 00:31:20,360
Rate limiting is a practical concern.
890
00:31:20,360 --> 00:31:22,200
SharePoint online enforces throttling.
891
00:31:22,200 --> 00:31:25,040
If your ingestion service makes too many requests too quickly,
892
00:31:25,040 --> 00:31:28,160
SharePoint returns HTTP 429 and backs off.
893
00:31:28,160 --> 00:31:30,280
Your service must implement exponential back off.
894
00:31:30,280 --> 00:31:32,840
Start with a modest request rate, increase it gradually,
895
00:31:32,840 --> 00:31:34,320
monitor for throttling responses,
896
00:31:34,320 --> 00:31:37,160
and schedule full re-indexing during off-peak hours.
897
00:31:37,160 --> 00:31:40,480
The output of the ingestion layer is clean text with metadata.
898
00:31:40,480 --> 00:31:42,160
Document source, library name, author,
899
00:31:42,160 --> 00:31:44,520
last modified date, version number, permission level.
900
00:31:44,520 --> 00:31:47,520
This text and metadata feed into the chunking engine,
901
00:31:47,520 --> 00:31:50,480
and the chunking engine is where most implementations fail.
902
00:31:50,480 --> 00:31:54,280
Chunking and embedding strategy, bad chunking destroys rag.
903
00:31:54,280 --> 00:31:57,600
I want to be explicit about this because I have seen it happen repeatedly.
904
00:31:57,600 --> 00:32:00,120
An organization builds a beautiful ingestion pipeline,
905
00:32:00,120 --> 00:32:03,920
deploys an expensive GPU server and configures a sleek chat interface.
906
00:32:03,920 --> 00:32:06,400
Then they ask a question and get an answer that's half write,
907
00:32:06,400 --> 00:32:08,600
half invented and completely unsighted.
908
00:32:08,600 --> 00:32:10,160
The problem is almost never the model.
909
00:32:10,160 --> 00:32:11,520
It's the chunks.
910
00:32:11,520 --> 00:32:14,040
Chunking is the process of breaking extracted text
911
00:32:14,040 --> 00:32:16,760
into semantically meaningful pieces that can be embedded
912
00:32:16,760 --> 00:32:18,320
and retrieved individually.
913
00:32:18,320 --> 00:32:20,920
The goal is to create chunks that are self-contained enough
914
00:32:20,920 --> 00:32:24,040
to answer questions, but specific enough to avoid dilution.
915
00:32:24,040 --> 00:32:24,880
This is attention.
916
00:32:24,880 --> 00:32:26,600
You can't satisfy both perfectly.
917
00:32:26,600 --> 00:32:29,320
You optimize for your document types and your query patterns.
918
00:32:29,320 --> 00:32:32,320
For word documents, the best approach is heading aware chunking.
919
00:32:32,320 --> 00:32:35,280
Pass the document structure, identify headings and subheadings,
920
00:32:35,280 --> 00:32:37,240
group paragraphs under their nearest heading.
921
00:32:37,240 --> 00:32:38,640
Each group becomes a chunk.
922
00:32:38,640 --> 00:32:40,840
If a group is too large, split it at natural boundaries
923
00:32:40,840 --> 00:32:41,880
like paragraph breaks.
924
00:32:41,880 --> 00:32:45,360
If a group is too small, merge it with the next group under the same heading.
925
00:32:45,360 --> 00:32:47,560
The result is chunks that carry semantic context.
926
00:32:47,560 --> 00:32:51,480
A chunk from the termination policy section includes the heading termination policy
927
00:32:51,480 --> 00:32:52,960
and the paragraphs beneath it.
928
00:32:52,960 --> 00:32:54,720
When a user asks about termination,
929
00:32:54,720 --> 00:32:56,680
the retrieval engine finds this chunk
930
00:32:56,680 --> 00:32:59,000
because the heading is embedded along with the content.
931
00:32:59,000 --> 00:33:00,880
For PDFs, the challenge is layout.
932
00:33:00,880 --> 00:33:03,000
A research paper has clear sections.
933
00:33:03,000 --> 00:33:04,320
A scanned contract doesn't.
934
00:33:04,320 --> 00:33:06,240
A brochure mixes text and images.
935
00:33:06,240 --> 00:33:09,280
The chunking strategy must detect the document structure.
936
00:33:09,280 --> 00:33:12,600
For structured PDFs, use section headers as chunk boundaries.
937
00:33:12,600 --> 00:33:16,000
For unstructured PDFs, use fixed size chunking with overlap
938
00:33:16,000 --> 00:33:17,880
but include page numbers in the metadata
939
00:33:17,880 --> 00:33:20,640
so retrieval can sight sources accurately.
940
00:33:20,640 --> 00:33:23,000
For image-heavy PDFs, consider OCR
941
00:33:23,000 --> 00:33:25,720
if the documents contain critical text in images.
942
00:33:25,720 --> 00:33:27,520
But OCR adds complexity and error.
943
00:33:27,520 --> 00:33:28,960
Use it only when necessary.
944
00:33:28,960 --> 00:33:31,440
For Excel spreadsheets, chunk by row groups
945
00:33:31,440 --> 00:33:33,160
include column headers in every chunk.
946
00:33:33,160 --> 00:33:39,240
A chunk that says row 45, 5,000 approved 2025, 06, 01 is meaningless
947
00:33:39,240 --> 00:33:41,960
without knowing that the columns are budget, status and date.
948
00:33:41,960 --> 00:33:49,040
The chunk should read budget 5,000, status approved, date 2025, 06, 01.
949
00:33:49,040 --> 00:33:51,480
And it should include the sheet name and file name.
950
00:33:51,480 --> 00:33:53,240
If a spreadsheet has multiple sheets,
951
00:33:53,240 --> 00:33:55,560
each sheet becomes a separate chunking context.
952
00:33:55,560 --> 00:33:57,920
For PowerPoint text, chunk at the slide level.
953
00:33:57,920 --> 00:34:00,600
Each slide is designed as a self-contained unit.
954
00:34:00,600 --> 00:34:02,320
Extract the slide title, the bullet points
955
00:34:02,320 --> 00:34:04,280
and the speaker notes if available.
956
00:34:04,280 --> 00:34:06,720
Combine them into a single chunk per slide.
957
00:34:06,720 --> 00:34:09,000
If a slide is dense, split it into two chunks
958
00:34:09,000 --> 00:34:11,080
but preserve the slide number in metadata,
959
00:34:11,080 --> 00:34:13,520
so citations point back to the correct source.
960
00:34:13,520 --> 00:34:15,400
The chunk size depends on your embedding model
961
00:34:15,400 --> 00:34:17,320
and your LLM context window.
962
00:34:17,320 --> 00:34:19,840
A common starting point is 500 tokens per chunk
963
00:34:19,840 --> 00:34:21,320
with a 50 token overlap.
964
00:34:21,320 --> 00:34:23,120
The overlap ensures that sentences split
965
00:34:23,120 --> 00:34:25,240
across chunk boundaries appear in both chunks,
966
00:34:25,240 --> 00:34:27,800
reducing the chance that a critical connection gets lost.
967
00:34:27,800 --> 00:34:29,760
But this is a starting point, not a rule.
968
00:34:29,760 --> 00:34:32,720
If your documents are dense legal contracts with cross references,
969
00:34:32,720 --> 00:34:34,160
you might need larger chunks.
970
00:34:34,160 --> 00:34:36,160
If they're FAQ documents with short answers,
971
00:34:36,160 --> 00:34:37,800
you might need smaller chunks.
972
00:34:37,800 --> 00:34:39,960
Let me walk you through what good chunking looks like
973
00:34:39,960 --> 00:34:41,400
for a specific document.
974
00:34:41,400 --> 00:34:44,400
Imagine a word document titled Corporate Travel Policy.
975
00:34:44,400 --> 00:34:45,240
DocuX.
976
00:34:45,240 --> 00:34:47,360
It has sections for booking, expense limits,
977
00:34:47,360 --> 00:34:49,640
approval workflow and reimbursement.
978
00:34:49,640 --> 00:34:52,280
A heading aware chunker passes the document structure.
979
00:34:52,280 --> 00:34:53,480
It creates a chunk for booking
980
00:34:53,480 --> 00:34:55,760
that contains the heading and all paragraphs under it.
981
00:34:55,760 --> 00:34:57,200
It creates a chunk for expense limits
982
00:34:57,200 --> 00:34:58,400
that contains the heading,
983
00:34:58,400 --> 00:35:00,040
the paragraph about daily limits
984
00:35:00,040 --> 00:35:02,680
and the table showing per-dium rates by city.
985
00:35:02,680 --> 00:35:04,760
It creates a chunk for approval workflow
986
00:35:04,760 --> 00:35:06,160
that contains the heading,
987
00:35:06,160 --> 00:35:08,280
the paragraph about manager approval
988
00:35:08,280 --> 00:35:11,800
and the paragraph about executive approval for international travel.
989
00:35:11,800 --> 00:35:13,440
Each chunk is self-contained.
990
00:35:13,440 --> 00:35:15,520
A user asking about per-dium rates in New York
991
00:35:15,520 --> 00:35:17,200
gets the expense limits chunk.
992
00:35:17,200 --> 00:35:19,320
A user asking who approves international travel
993
00:35:19,320 --> 00:35:21,240
gets the approval workflow chunk.
994
00:35:21,240 --> 00:35:23,400
The retrieval is precise because the chunk boundaries
995
00:35:23,400 --> 00:35:25,080
match with semantic boundaries.
996
00:35:25,080 --> 00:35:27,520
Now imagine the same document with bad chunking.
997
00:35:27,520 --> 00:35:31,000
A fixed size chunker breaks the document every 500 tokens.
998
00:35:31,000 --> 00:35:33,800
The first chunk ends in the middle of the expense limits section.
999
00:35:33,800 --> 00:35:35,880
It contains half the daily limits paragraph
1000
00:35:35,880 --> 00:35:37,480
and the beginning of the per-dium table,
1001
00:35:37,480 --> 00:35:38,800
but not the table headers.
1002
00:35:38,800 --> 00:35:41,360
The second chunk starts with the rest of the per-dium table,
1003
00:35:41,360 --> 00:35:42,760
but not the section heading.
1004
00:35:42,760 --> 00:35:44,520
When a user asks about per-dium rates,
1005
00:35:44,520 --> 00:35:46,440
neither chunk fully answers the question.
1006
00:35:46,440 --> 00:35:48,080
The first chunk lacks the table headers.
1007
00:35:48,080 --> 00:35:49,560
The second chunk lacks the context
1008
00:35:49,560 --> 00:35:51,040
that this is about travel policy.
1009
00:35:51,040 --> 00:35:53,040
The retrieval engine returns both chunks
1010
00:35:53,040 --> 00:35:54,800
because they are topically related,
1011
00:35:54,800 --> 00:35:57,200
but the LLM can't synthesize a coherent answer
1012
00:35:57,200 --> 00:35:59,120
because the information is fragmented.
1013
00:35:59,120 --> 00:36:02,920
This is how bad chunking destroys a rag accuracy silently.
1014
00:36:02,920 --> 00:36:06,200
For Excel spreadsheets, chunking must preserve row relationships.
1015
00:36:06,200 --> 00:36:08,040
Consider an employee directory with columns
1016
00:36:08,040 --> 00:36:11,480
for name, department, manager, office location, and start date.
1017
00:36:11,480 --> 00:36:14,680
A naive chunker might create chunks of five rows each.
1018
00:36:14,680 --> 00:36:16,240
Row one through five get one chunk,
1019
00:36:16,240 --> 00:36:17,880
row six through ten get another.
1020
00:36:17,880 --> 00:36:20,240
But if a user asks who manages the engineering team,
1021
00:36:20,240 --> 00:36:23,080
the relevant rows might be scattered across multiple chunks.
1022
00:36:23,080 --> 00:36:25,720
A better approach is to chunk by department group.
1023
00:36:25,720 --> 00:36:27,600
All engineering rows become one chunk,
1024
00:36:27,600 --> 00:36:29,200
all sales rows become another.
1025
00:36:29,200 --> 00:36:31,160
This way, a query about engineering managers
1026
00:36:31,160 --> 00:36:32,920
retrieves a single coherent chunk
1027
00:36:32,920 --> 00:36:35,360
containing all engineering employees and their managers.
1028
00:36:35,360 --> 00:36:38,360
For PowerPoint decks, slide-level chunking is usually correct,
1029
00:36:38,360 --> 00:36:39,720
but some slides are dense.
1030
00:36:39,720 --> 00:36:42,280
A quarterly review slide might contain six bullet points
1031
00:36:42,280 --> 00:36:43,600
with detailed metrics.
1032
00:36:43,600 --> 00:36:45,760
If you put the entire slide into one chunk,
1033
00:36:45,760 --> 00:36:47,040
the embedding might dilute
1034
00:36:47,040 --> 00:36:49,600
because the vector must represent six different ideas.
1035
00:36:49,600 --> 00:36:51,840
In this case, split the slide into two chunks.
1036
00:36:51,840 --> 00:36:54,040
The first chunk contains the first three bullets.
1037
00:36:54,040 --> 00:36:55,640
The second contains the last three.
1038
00:36:55,640 --> 00:36:58,560
Both chunks carry the same slide title in their metadata,
1039
00:36:58,560 --> 00:37:01,280
so the retrieval engine knows they came from the same source.
1040
00:37:01,280 --> 00:37:03,960
Metadata preservation during chunking is critical.
1041
00:37:03,960 --> 00:37:06,280
Every chunk must carry its source document URL,
1042
00:37:06,280 --> 00:37:08,480
its position in the document, its heading hierarchy,
1043
00:37:08,480 --> 00:37:10,560
its document type, its library name, its author,
1044
00:37:10,560 --> 00:37:13,200
its last modified date, and its permission level.
1045
00:37:13,200 --> 00:37:14,960
This metadata doesn't get embedded.
1046
00:37:14,960 --> 00:37:17,320
It gets stored as payload data alongside the vector
1047
00:37:17,320 --> 00:37:18,240
in the database.
1048
00:37:18,240 --> 00:37:21,240
During retrieval, the metadata is returned with the vector.
1049
00:37:21,240 --> 00:37:24,240
During answer generation, the metadata becomes the citation.
1050
00:37:24,240 --> 00:37:26,920
Without metadata, an answer is unverifiable.
1051
00:37:26,920 --> 00:37:28,960
And an unverifiable answer is worthless
1052
00:37:28,960 --> 00:37:30,560
in an enterprise context.
1053
00:37:30,560 --> 00:37:33,320
The embedding model converts each chunk into a vector.
1054
00:37:33,320 --> 00:37:35,640
As I mentioned earlier, run this model locally.
1055
00:37:35,640 --> 00:37:38,920
Popular choices include all Mini-LML6, V2 for speed
1056
00:37:38,920 --> 00:37:40,840
and BGE-large and for accuracy.
1057
00:37:40,840 --> 00:37:42,320
Both are available through hugging phase
1058
00:37:42,320 --> 00:37:43,760
and run on local hardware.
1059
00:37:43,760 --> 00:37:47,040
The all-mini-LM model is 384 dimensions.
1060
00:37:47,040 --> 00:37:50,160
The BGE-large model is 1,024 dimensions.
1061
00:37:50,160 --> 00:37:52,560
Higher dimensions capture more nuance,
1062
00:37:52,560 --> 00:37:55,560
but require more storage and more compute during search.
1063
00:37:55,560 --> 00:37:58,880
For most SharePoint deployments, all Mini-LML is sufficient.
1064
00:37:58,880 --> 00:38:00,720
If retrieval accuracy is critical
1065
00:38:00,720 --> 00:38:03,120
and your document base is under 10,000 chunks,
1066
00:38:03,120 --> 00:38:05,120
BGE-large is worth the overhead.
1067
00:38:05,120 --> 00:38:07,200
The embedding step is batched for efficiency.
1068
00:38:07,200 --> 00:38:09,280
Send multiple chunks to the model at once,
1069
00:38:09,280 --> 00:38:10,680
rather than one by one.
1070
00:38:10,680 --> 00:38:12,200
Modern embedding models handle batches
1071
00:38:12,200 --> 00:38:14,720
of 32 or 64 chunks in parallel.
1072
00:38:14,720 --> 00:38:17,760
This reduces GPU idle time and speeds up indexing.
1073
00:38:17,760 --> 00:38:20,080
A batch of 1,000 chunks might take a few seconds
1074
00:38:20,080 --> 00:38:21,560
on a modern GPU.
1075
00:38:21,560 --> 00:38:23,240
A batch of 10,000 might take a minute.
1076
00:38:23,240 --> 00:38:24,920
Schedule this during maintenance windows
1077
00:38:24,920 --> 00:38:26,600
if your document base is large.
1078
00:38:26,600 --> 00:38:29,760
Metadata preservation is as important as the chunk itself.
1079
00:38:29,760 --> 00:38:32,320
Every vector in your database must carry metadata
1080
00:38:32,320 --> 00:38:34,840
that answers three critical questions about its origin,
1081
00:38:34,840 --> 00:38:37,120
its currency and its access restrictions.
1082
00:38:37,120 --> 00:38:39,040
The source URL lets the query interface
1083
00:38:39,040 --> 00:38:40,240
cite the document.
1084
00:38:40,240 --> 00:38:42,920
The last modified date helps the ingestion service detect
1085
00:38:42,920 --> 00:38:43,840
staleness.
1086
00:38:43,840 --> 00:38:45,880
The permission level enables filtered retrieval,
1087
00:38:45,880 --> 00:38:48,600
so users only see content they're authorized to access.
1088
00:38:48,600 --> 00:38:50,960
Langchain and Lama Index provide document loaders
1089
00:38:50,960 --> 00:38:53,480
and chunking utilities that handle many of these concerns.
1090
00:38:53,480 --> 00:38:56,040
Langchain's recursive character text splitter tries splitting
1091
00:38:56,040 --> 00:38:58,160
on paragraphs, then sentences, then words.
1092
00:38:58,160 --> 00:38:59,680
It's a good default for mixed content.
1093
00:38:59,680 --> 00:39:01,280
Yama Index provides node passes
1094
00:39:01,280 --> 00:39:03,080
that preserve hierarchical structure,
1095
00:39:03,080 --> 00:39:05,240
both integrate with SharePoint through custom loaders
1096
00:39:05,240 --> 00:39:06,600
or the SharePoint REST API.
1097
00:39:06,600 --> 00:39:08,800
You don't need to write chunking logic from scratch,
1098
00:39:08,800 --> 00:39:10,240
but you do need to configure it correctly
1099
00:39:10,240 --> 00:39:11,480
for your document types.
1100
00:39:11,480 --> 00:39:13,040
Once chunked and embedded, everything
1101
00:39:13,040 --> 00:39:14,680
lands in the vector database.
1102
00:39:14,680 --> 00:39:16,800
And the vector database configuration determines
1103
00:39:16,800 --> 00:39:19,640
whether your retrieval is fast, accurate and secure.
1104
00:39:19,640 --> 00:39:21,400
Vector database configuration.
1105
00:39:21,400 --> 00:39:23,880
The vector database is your AI's long term memory.
1106
00:39:23,880 --> 00:39:26,920
It must handle real-time updates, permission-aware filtering,
1107
00:39:26,920 --> 00:39:28,560
and high query throughput.
1108
00:39:28,560 --> 00:39:30,960
A bad configuration here means slow searches
1109
00:39:30,960 --> 00:39:33,880
in accurate results or unauthorized data exposure.
1110
00:39:33,880 --> 00:39:35,360
These aren't hypothetical risks,
1111
00:39:35,360 --> 00:39:37,560
they're configuration mistakes that happen in production.
1112
00:39:37,560 --> 00:39:39,560
QueueDrand is my recommended starting point
1113
00:39:39,560 --> 00:39:41,440
for air-gapped SharePoint deployments.
1114
00:39:41,440 --> 00:39:43,280
It's written in Rust, it's fast.
1115
00:39:43,280 --> 00:39:45,360
It supports rich metadata filtering,
1116
00:39:45,360 --> 00:39:48,320
and it runs on-premises via a single Docker container.
1117
00:39:48,320 --> 00:39:49,720
You can start a QueueDrand instance
1118
00:39:49,720 --> 00:39:53,040
with a Docker run command pointing to a local data directory.
1119
00:39:53,040 --> 00:39:56,600
It exposes a REST API on port 6333.
1120
00:39:56,600 --> 00:39:58,000
And it stores collections of vectors
1121
00:39:58,000 --> 00:39:59,800
with attached payload metadata.
1122
00:39:59,800 --> 00:40:01,560
A collection in QueueDrand is like a table
1123
00:40:01,560 --> 00:40:03,000
in a relational database.
1124
00:40:03,000 --> 00:40:05,640
You create one collection for your SharePoint index.
1125
00:40:05,640 --> 00:40:09,640
You define the vector size 384 for all mini-lem or 1024
1126
00:40:09,640 --> 00:40:10,720
for BG large.
1127
00:40:10,720 --> 00:40:12,160
You configure the distance metric,
1128
00:40:12,160 --> 00:40:14,720
cosine similarity is standard for text embeddings,
1129
00:40:14,720 --> 00:40:16,920
and you set up the index type HNSW,
1130
00:40:16,920 --> 00:40:19,480
which stands for hierarchical navigable small world,
1131
00:40:19,480 --> 00:40:22,320
is the default index type for approximate nearest neighbor
1132
00:40:22,320 --> 00:40:24,280
search, it builds a graph structure
1133
00:40:24,280 --> 00:40:26,920
where each vector connects to its nearest neighbors.
1134
00:40:26,920 --> 00:40:29,160
Search traverses this graph to find closed matches
1135
00:40:29,160 --> 00:40:31,200
without comparing the query against every vector
1136
00:40:31,200 --> 00:40:32,120
in the database.
1137
00:40:32,120 --> 00:40:35,000
This makes search fast even with millions of vectors.
1138
00:40:35,000 --> 00:40:37,480
Two parameters control the speed accuracy trade off.
1139
00:40:37,480 --> 00:40:39,360
EF construction determines how thoroughly
1140
00:40:39,360 --> 00:40:40,960
the graph is built during indexing.
1141
00:40:40,960 --> 00:40:44,000
Higher values produce better graphs, but slower indexing.
1142
00:40:44,000 --> 00:40:46,200
EF determines how thoroughly the graph is searched
1143
00:40:46,200 --> 00:40:47,480
during query time.
1144
00:40:47,480 --> 00:40:49,160
Higher values produce more accurate results,
1145
00:40:49,160 --> 00:40:50,360
but slower queries.
1146
00:40:50,360 --> 00:40:55,480
For initial setup, use EF construction of 128 and EF of 64.
1147
00:40:55,480 --> 00:40:57,760
Tune these based on your observed query latency
1148
00:40:57,760 --> 00:40:58,880
and retrieval accuracy.
1149
00:40:58,880 --> 00:41:01,840
Metadata filtering is where QueueDrand shines for our use case.
1150
00:41:01,840 --> 00:41:04,240
When you insert a vector, you attach a JSON payload.
1151
00:41:04,240 --> 00:41:05,520
That payload might look like this.
1152
00:41:05,520 --> 00:41:07,880
Source URL pointing to the SharePoint document,
1153
00:41:07,880 --> 00:41:12,000
library name like legal or HR, author email, last modified timestamp,
1154
00:41:12,000 --> 00:41:14,040
and permission level like executive or standard.
1155
00:41:14,040 --> 00:41:16,960
At query time, you can filter the search to only vectors
1156
00:41:16,960 --> 00:41:19,000
where permission level equals standard or lower.
1157
00:41:19,000 --> 00:41:21,280
This prevents the retrieval engine from finding chunks
1158
00:41:21,280 --> 00:41:22,680
the user can't access.
1159
00:41:22,680 --> 00:41:24,160
WeV8 is a strong alternative.
1160
00:41:24,160 --> 00:41:26,920
It offers a GraphQL interface, native multimodal support,
1161
00:41:26,920 --> 00:41:29,000
and built in vectorization if you want to delegate
1162
00:41:29,000 --> 00:41:30,400
embedding to the database.
1163
00:41:30,400 --> 00:41:32,760
For our architecture, we keep embedding separate
1164
00:41:32,760 --> 00:41:34,440
because we want control over the embedding model
1165
00:41:34,440 --> 00:41:36,680
and batching, but WeV8's GraphQL interface
1166
00:41:36,680 --> 00:41:38,880
is elegant for complex filtered queries.
1167
00:41:38,880 --> 00:41:40,800
If your team prefers GraphQL overrest,
1168
00:41:40,800 --> 00:41:42,600
WeV8 is worth evaluating.
1169
00:41:42,600 --> 00:41:44,640
Milvus is designed for cloud native scaling.
1170
00:41:44,640 --> 00:41:45,760
It runs on Kubernetes.
1171
00:41:45,760 --> 00:41:48,120
It supports billion scale vector search,
1172
00:41:48,120 --> 00:41:49,800
and it has a sophisticated architecture
1173
00:41:49,800 --> 00:41:51,880
with separated storage and compute.
1174
00:41:51,880 --> 00:41:54,040
For a single tenant air-gaped deployment,
1175
00:41:54,040 --> 00:41:57,680
Milvus is overkill unless you expect tens of millions of vectors.
1176
00:41:57,680 --> 00:41:59,360
If you do, it's the right choice.
1177
00:41:59,360 --> 00:42:01,800
But most SharePoint deployments don't reach that scale.
1178
00:42:01,800 --> 00:42:03,240
Chroma is the lightweight option.
1179
00:42:03,240 --> 00:42:05,400
It stores vectors in SQLite by default.
1180
00:42:05,400 --> 00:42:06,880
It requires no server setup,
1181
00:42:06,880 --> 00:42:08,720
and it's ideal for prototyping.
1182
00:42:08,720 --> 00:42:11,680
But for production with multiple users, concurrent queries
1183
00:42:11,680 --> 00:42:13,360
and permission filtering, Chroma
1184
00:42:13,360 --> 00:42:15,800
lacks the strongness of QDrand or WeV8.
1185
00:42:15,800 --> 00:42:17,400
Use it to validate your pipeline.
1186
00:42:17,400 --> 00:42:19,360
Then migrate to QDrand for production.
1187
00:42:19,360 --> 00:42:21,440
Real-time synchronization between SharePoint
1188
00:42:21,440 --> 00:42:23,760
and the vector database is a workflow problem,
1189
00:42:23,760 --> 00:42:25,200
not a database problem.
1190
00:42:25,200 --> 00:42:27,760
Your ingestion service detects changes in SharePoint.
1191
00:42:27,760 --> 00:42:29,280
It extracts modified documents.
1192
00:42:29,280 --> 00:42:30,840
It rechunks and reambeds them.
1193
00:42:30,840 --> 00:42:32,560
It updates the vectors in QDrand,
1194
00:42:32,560 --> 00:42:35,280
and it deletes vectors for removed documents.
1195
00:42:35,280 --> 00:42:37,800
QDrand supports point updates and deletes by ID.
1196
00:42:37,800 --> 00:42:40,800
You store the vector ID as a hash of the document URL
1197
00:42:40,800 --> 00:42:42,040
and chunk index.
1198
00:42:42,040 --> 00:42:44,400
When a document changes, you know exactly which vectors
1199
00:42:44,400 --> 00:42:45,360
to replace.
1200
00:42:45,360 --> 00:42:47,640
Query latency is your user experience metric.
1201
00:42:47,640 --> 00:42:50,600
The user asks the question, the query interface embeds it.
1202
00:42:50,600 --> 00:42:52,160
The vector database searches.
1203
00:42:52,160 --> 00:42:54,040
The top matches get sent to the LLM.
1204
00:42:54,040 --> 00:42:55,360
The LLM generates an answer.
1205
00:42:55,360 --> 00:42:56,760
The total time from question to answer
1206
00:42:56,760 --> 00:42:59,200
should be under five seconds for a good experience.
1207
00:42:59,200 --> 00:43:02,160
Vector search itself should take under 100 milliseconds.
1208
00:43:02,160 --> 00:43:05,160
If it takes longer, increase EF or add query replicas.
1209
00:43:05,160 --> 00:43:08,160
If accuracy is poor, increase EF or rebuild the index
1210
00:43:08,160 --> 00:43:10,120
with higher EF construction.
1211
00:43:10,120 --> 00:43:12,120
Metadata filtering examples show why
1212
00:43:12,120 --> 00:43:14,680
QDrand is powerful for our use case.
1213
00:43:14,680 --> 00:43:17,040
A user asks about remote work policies.
1214
00:43:17,040 --> 00:43:19,680
The query interface constructs a search with two conditions.
1215
00:43:19,680 --> 00:43:22,040
The vector must be semantically close to the question.
1216
00:43:22,040 --> 00:43:24,840
And the payload must have library equal to HR or legal
1217
00:43:24,840 --> 00:43:28,240
and permission tier less than or equal to the user's tier.
1218
00:43:28,240 --> 00:43:30,800
QDrand evaluates both conditions simultaneously.
1219
00:43:30,800 --> 00:43:33,600
It searches only vectors that match the filter,
1220
00:43:33,600 --> 00:43:35,920
then ranks them by vector similarity.
1221
00:43:35,920 --> 00:43:38,280
This is far more efficient than retrieving all vectors
1222
00:43:38,280 --> 00:43:39,880
and filtering afterward.
1223
00:43:39,880 --> 00:43:42,200
Another filtering pattern is date-based exclusion.
1224
00:43:42,200 --> 00:43:44,480
A user asks about the current vacation policy.
1225
00:43:44,480 --> 00:43:46,880
The query interface adds a filter for last modified date
1226
00:43:46,880 --> 00:43:49,160
greater than January 1, 2025.
1227
00:43:49,160 --> 00:43:52,800
This excludes outdated policy documents from 2023 or 2024.
1228
00:43:52,800 --> 00:43:55,560
The answer reflects the current rules, not superseded ones.
1229
00:43:55,560 --> 00:43:57,400
This pattern requires your ingestion service
1230
00:43:57,400 --> 00:44:00,120
to keep the last modified date accurate in the metadata.
1231
00:44:00,120 --> 00:44:03,040
If the date is wrong, the filter excludes the wrong documents.
1232
00:44:03,040 --> 00:44:06,680
Monitor the database, track query latency, index size, memory
1233
00:44:06,680 --> 00:44:08,120
usage, and error rates.
1234
00:44:08,120 --> 00:44:10,240
QDrand exposes Prometheus metrics,
1235
00:44:10,240 --> 00:44:12,400
scrape them with your local monitoring stack, alert
1236
00:44:12,400 --> 00:44:13,520
on anomalies.
1237
00:44:13,520 --> 00:44:15,840
A vector database that silently degrades
1238
00:44:15,840 --> 00:44:18,600
will produce poor answers without anyone noticing.
1239
00:44:18,600 --> 00:44:21,600
Backup and recovery for the vector database is often overlooked.
1240
00:44:21,600 --> 00:44:23,640
Could you run stores data in a local directory
1241
00:44:23,640 --> 00:44:25,360
that you mount as a Docker volume?
1242
00:44:25,360 --> 00:44:28,360
Backup this directory using your standard backup infrastructure,
1243
00:44:28,360 --> 00:44:31,280
snapshot before major changes like re-indexing or collection
1244
00:44:31,280 --> 00:44:32,200
rebuilds.
1245
00:44:32,200 --> 00:44:34,000
Test your restore procedure quarterly.
1246
00:44:34,000 --> 00:44:36,120
A corrupted vector index with no backup
1247
00:44:36,120 --> 00:44:39,040
means re-indexing your entire document base from scratch.
1248
00:44:39,040 --> 00:44:41,640
For a 10,000 document library that might take hours,
1249
00:44:41,640 --> 00:44:44,400
for a 100,000 document library it might take days.
1250
00:44:44,400 --> 00:44:46,360
Capacity planning starts with sizing.
1251
00:44:46,360 --> 00:44:50,400
A single vector of 384 dimensions at single precision
1252
00:44:50,400 --> 00:44:53,120
takes roughly 1.5 kilobytes of storage.
1253
00:44:53,120 --> 00:44:55,880
A million vectors take roughly 1.5 gigabytes.
1254
00:44:55,880 --> 00:44:58,280
Add payload metadata and index overhead,
1255
00:44:58,280 --> 00:45:01,440
and the total might be three to five gigabytes per million vectors.
1256
00:45:01,440 --> 00:45:04,200
For a typical enterprise with 50,000 sharepoint documents
1257
00:45:04,200 --> 00:45:06,480
chunked into 200,000 vectors, your database
1258
00:45:06,480 --> 00:45:09,640
needs roughly one terabyte of fast SSD storage, not massive,
1259
00:45:09,640 --> 00:45:10,800
but not trivial either.
1260
00:45:10,800 --> 00:45:12,600
Plan for three to five years of growth.
1261
00:45:12,600 --> 00:45:14,000
Memory sizing is equally important.
1262
00:45:14,000 --> 00:45:17,320
QDrand keeps the H and SW index in memory for fast search.
1263
00:45:17,320 --> 00:45:19,560
The index size depends on vector count, dimensions,
1264
00:45:19,560 --> 00:45:21,000
and graph connectivity.
1265
00:45:21,000 --> 00:45:24,520
A million vectors of 384 dimensions with H and SW
1266
00:45:24,520 --> 00:45:26,760
might consume two to four gigabytes of RAM.
1267
00:45:26,760 --> 00:45:29,000
Add the OS, the container overhead, and headroom
1268
00:45:29,000 --> 00:45:30,360
for concurrent queries.
1269
00:45:30,360 --> 00:45:32,600
A server with 16 gigabytes of RAM is comfortable
1270
00:45:32,600 --> 00:45:33,760
for most deployments.
1271
00:45:33,760 --> 00:45:35,680
32 gigabytes provides room to grow.
1272
00:45:35,680 --> 00:45:37,000
That's the memory layer.
1273
00:45:37,000 --> 00:45:38,200
Now for the brain.
1274
00:45:38,200 --> 00:45:39,680
The local Yamaha runtime.
1275
00:45:39,680 --> 00:45:41,600
The LLM is the reasoning engine.
1276
00:45:41,600 --> 00:45:44,000
It takes the user's question and the retrieved document chunks
1277
00:45:44,000 --> 00:45:46,080
and synthesizes a coherent answer.
1278
00:45:46,080 --> 00:45:48,120
But it must run entirely inside your perimeter
1279
00:45:48,120 --> 00:45:49,840
with no cloud dependency.
1280
00:45:49,840 --> 00:45:52,440
Every byte of the model weight sits on your local disk.
1281
00:45:52,440 --> 00:45:54,680
Every inference runs on your local GPU,
1282
00:45:54,680 --> 00:45:57,560
and every response leaves through your local network.
1283
00:45:57,560 --> 00:45:59,800
Olamma is the simplest way to achieve this.
1284
00:45:59,800 --> 00:46:03,320
It's a cross-platform runtime for LLM and other open models.
1285
00:46:03,320 --> 00:46:05,000
You install it on your GPU server.
1286
00:46:05,000 --> 00:46:08,040
You pull a model with a command like Olamma pull LLM3.
1287
00:46:08,040 --> 00:46:11,800
And you get a local REST API at localhost port 1143-4,
1288
00:46:11,800 --> 00:46:14,040
query it with a simple HTTP post.
1289
00:46:14,040 --> 00:46:15,800
Send the model name, the system prompt,
1290
00:46:15,800 --> 00:46:18,000
the user message, and the retrieved context.
1291
00:46:18,000 --> 00:46:19,280
Get back a streaming response.
1292
00:46:19,280 --> 00:46:21,400
The system prompt is critical for SharePoint Rack.
1293
00:46:21,400 --> 00:46:23,000
It tells the model what its role is,
1294
00:46:23,000 --> 00:46:24,360
what the context format is,
1295
00:46:24,360 --> 00:46:25,920
and what constraints to follow.
1296
00:46:25,920 --> 00:46:28,320
A good system prompt for this architecture looks like this.
1297
00:46:28,320 --> 00:46:29,480
You're a knowledgeable assistant
1298
00:46:29,480 --> 00:46:32,440
that answers questions based on the provided document context.
1299
00:46:32,440 --> 00:46:35,120
Use only the information in the context to answer.
1300
00:46:35,120 --> 00:46:38,120
If the context doesn't contain the answer, say you don't know.
1301
00:46:38,120 --> 00:46:40,280
Site the source document for every claim you make.
1302
00:46:40,280 --> 00:46:42,480
Don't speculate, don't use outside knowledge.
1303
00:46:42,480 --> 00:46:44,600
This prompt enforces three behaviors.
1304
00:46:44,600 --> 00:46:47,000
Grounding, the model must use the retrieved context.
1305
00:46:47,000 --> 00:46:48,240
Honestly, the model must admit
1306
00:46:48,240 --> 00:46:50,440
when the answer isn't in the context, citations,
1307
00:46:50,440 --> 00:46:52,160
the model must reference sources.
1308
00:46:52,160 --> 00:46:54,720
These constraints reduce hallucination and increased trust.
1309
00:46:54,720 --> 00:46:56,720
They don't eliminate hallucination entirely.
1310
00:46:56,720 --> 00:46:58,640
No prompt does, but they push the model
1311
00:46:58,640 --> 00:47:00,200
toward the behavior you want.
1312
00:47:00,200 --> 00:47:01,680
Temperature controls randomness.
1313
00:47:01,680 --> 00:47:03,960
A low temperature like 0.1 or 0.2
1314
00:47:03,960 --> 00:47:06,000
makes the model deterministic and conservative.
1315
00:47:06,000 --> 00:47:08,040
It sticks closely to the retrieved text.
1316
00:47:08,040 --> 00:47:09,800
A high temperature like 0.7
1317
00:47:09,800 --> 00:47:11,880
makes the model creative and exploratory.
1318
00:47:11,880 --> 00:47:14,040
For factual retrieval from SharePoint documents,
1319
00:47:14,040 --> 00:47:15,200
keep temperature low.
1320
00:47:15,200 --> 00:47:17,800
You want accurate synthesis, not creative writing.
1321
00:47:17,800 --> 00:47:19,640
For brainstorming or summarization tasks,
1322
00:47:19,640 --> 00:47:21,360
you might raise temperature slightly.
1323
00:47:21,360 --> 00:47:24,360
But the default for most enterprise queries should be low.
1324
00:47:24,360 --> 00:47:26,800
Quantization reduces model size and memory usage
1325
00:47:26,800 --> 00:47:28,520
at a small-costing quality.
1326
00:47:28,520 --> 00:47:32,000
Models are originally stored as 16-bit floating point numbers.
1327
00:47:32,000 --> 00:47:35,720
Quantization converts them to 8-bit, 4-bit, or even lower precision.
1328
00:47:35,720 --> 00:47:37,840
Q4KM is a common quantization format
1329
00:47:37,840 --> 00:47:39,440
that balances quality and size.
1330
00:47:39,440 --> 00:47:42,520
A 70-billion parameter model quantized to Q4KM
1331
00:47:42,520 --> 00:47:44,560
fits in roughly 40 gigabytes of disk space
1332
00:47:44,560 --> 00:47:47,520
and loads into roughly 40 gigabytes of VRM.
1333
00:47:47,520 --> 00:47:50,040
That's manageable on a single NVIDIA A100
1334
00:47:50,040 --> 00:47:53,840
with 80 gigabytes of VRM or on dual RTX 4090 cards
1335
00:47:53,840 --> 00:47:55,360
with 24 gigabytes each.
1336
00:47:55,360 --> 00:48:00,040
For smaller deployments, consider the LAMMA 3.38 billion parameter model.
1337
00:48:00,040 --> 00:48:01,840
It quantizes to under five gigabytes.
1338
00:48:01,840 --> 00:48:05,200
It runs on a single RTX 4090 with room to spare.
1339
00:48:05,200 --> 00:48:08,320
And with good rag, it answers most enterprise questions adequately.
1340
00:48:08,320 --> 00:48:11,280
It won't write poetry as well as the 70-billion model.
1341
00:48:11,280 --> 00:48:13,520
But it will tell you what your vacation policy says.
1342
00:48:13,520 --> 00:48:14,560
And that's the job.
1343
00:48:14,560 --> 00:48:17,560
LAMMA 4 is now available from Meta as their flagship family.
1344
00:48:17,560 --> 00:48:20,560
Deployment paths exist for both cloud and local scenarios.
1345
00:48:20,560 --> 00:48:22,200
For our air-gapped architecture, you
1346
00:48:22,200 --> 00:48:24,640
pull the open weights, quantize them for your hardware,
1347
00:48:24,640 --> 00:48:26,080
and serve them through LAMMA.
1348
00:48:26,080 --> 00:48:28,520
Expect higher hardware requirements than LAMMA 3,
1349
00:48:28,520 --> 00:48:30,520
plan for an A100 or newer if you want
1350
00:48:30,520 --> 00:48:32,920
the largest LAMMA 4 variant, unquantized.
1351
00:48:32,920 --> 00:48:35,520
For quantized deployment, a dual GPU setup
1352
00:48:35,520 --> 00:48:38,200
or a single high memory card should suffice.
1353
00:48:38,200 --> 00:48:40,840
The exact requirements will depend on the specific variant
1354
00:48:40,840 --> 00:48:43,400
and quantization level you choose.
1355
00:48:43,400 --> 00:48:45,000
Performance tuning for local inference
1356
00:48:45,000 --> 00:48:46,520
involves several knobs.
1357
00:48:46,520 --> 00:48:48,520
Context window size determines how much text
1358
00:48:48,520 --> 00:48:50,320
the model can process in one call.
1359
00:48:50,320 --> 00:48:52,880
With rag, your context window must fit the system prompt,
1360
00:48:52,880 --> 00:48:55,000
the retrieved chunks, and the user question.
1361
00:48:55,000 --> 00:48:58,920
Five retrieved chunks of 500 tokens each is 2,500 tokens.
1362
00:48:58,920 --> 00:49:02,240
Plus the system prompt plus the user question plus the response.
1363
00:49:02,240 --> 00:49:04,600
A 4,000 token context window is tight.
1364
00:49:04,600 --> 00:49:06,320
An 8,000 token window is comfortable.
1365
00:49:06,320 --> 00:49:08,560
A 16,000 token window is generous.
1366
00:49:08,560 --> 00:49:10,600
Larger context windows require more VRM.
1367
00:49:10,600 --> 00:49:12,160
Balance this against your hardware.
1368
00:49:12,160 --> 00:49:14,360
Batching at the LLM level is different from batching
1369
00:49:14,360 --> 00:49:15,720
at the embedding level.
1370
00:49:15,720 --> 00:49:17,520
LLM inference is harder to batch,
1371
00:49:17,520 --> 00:49:20,960
because each user query is independent and latency sensitive.
1372
00:49:20,960 --> 00:49:23,600
For a chat interface serving 10 concurrent users,
1373
00:49:23,600 --> 00:49:25,920
you might run a small batch of 2 to 4 requests
1374
00:49:25,920 --> 00:49:27,440
if your GPU supports it.
1375
00:49:27,440 --> 00:49:29,200
But most local deployments process queries
1376
00:49:29,200 --> 00:49:30,880
sequentially or with minimal batching.
1377
00:49:30,880 --> 00:49:33,000
The throughput is lower than cloud APIs.
1378
00:49:33,000 --> 00:49:35,160
The latency is acceptable for interactive use.
1379
00:49:35,160 --> 00:49:37,560
GPU utilization is your infrastructure metric.
1380
00:49:37,560 --> 00:49:41,440
A GPU sitting at 10% utilization is wasted money.
1381
00:49:41,440 --> 00:49:44,440
A GPU at 90% utilization is near capacity.
1382
00:49:44,440 --> 00:49:46,600
Monitor utilization during peak hours.
1383
00:49:46,600 --> 00:49:49,160
If you consistently hit 80% or higher,
1384
00:49:49,160 --> 00:49:51,160
add a second GPU or upgrade.
1385
00:49:51,160 --> 00:49:53,880
If you sit at 20%, you have headroom for a larger model
1386
00:49:53,880 --> 00:49:54,920
or more users.
1387
00:49:54,920 --> 00:49:57,200
Let me talk about context window sizing in more detail
1388
00:49:57,200 --> 00:50:00,520
because this is where many local deployments fail silently.
1389
00:50:00,520 --> 00:50:02,920
You retrieve five chunks of 500 tokens each.
1390
00:50:02,920 --> 00:50:04,920
That's 2,500 tokens of context.
1391
00:50:04,920 --> 00:50:06,720
Your system prompt is 200 tokens.
1392
00:50:06,720 --> 00:50:08,440
Your user question is 50 tokens.
1393
00:50:08,440 --> 00:50:11,560
The model needs a few hundred tokens to generate the answer.
1394
00:50:11,560 --> 00:50:14,040
Total context is roughly 3,000 tokens.
1395
00:50:14,040 --> 00:50:17,240
If your model has a 4,000 token context window, this fits.
1396
00:50:17,240 --> 00:50:20,320
But it leaves no room for longer documents or more retrieved chunks.
1397
00:50:20,320 --> 00:50:23,320
If you want to retrieve 10 chunks or process longer contracts,
1398
00:50:23,320 --> 00:50:25,840
you need an 8,000 token window or larger.
1399
00:50:25,840 --> 00:50:28,040
Larger context windows require more VRM,
1400
00:50:28,040 --> 00:50:30,680
the KV cache, which stores the main and value matrices
1401
00:50:30,680 --> 00:50:32,920
for each token during attention computation,
1402
00:50:32,920 --> 00:50:35,560
grows linearly with context length.
1403
00:50:35,560 --> 00:50:37,560
A model with 70 billion parameters
1404
00:50:37,560 --> 00:50:41,200
and a 4,000 token context might need 30 gigabytes of VRM.
1405
00:50:41,200 --> 00:50:45,480
The same model with a 16,000 token context might need 45 gigabytes.
1406
00:50:45,480 --> 00:50:48,720
This is why hardware planning must account for your expected context size,
1407
00:50:48,720 --> 00:50:50,040
not just the model weights.
1408
00:50:50,040 --> 00:50:52,360
Quantization affects context window capacity.
1409
00:50:52,360 --> 00:50:56,080
The Q4 quantized model uses 4 bits per weight instead of 16.
1410
00:50:56,080 --> 00:50:58,760
This reduces VRM usage by roughly half for the weights.
1411
00:50:58,760 --> 00:51:01,040
But the KV cache isn't quantized by default.
1412
00:51:01,040 --> 00:51:02,280
It remains in full precision.
1413
00:51:02,280 --> 00:51:04,080
So even with aggressive quantization,
1414
00:51:04,080 --> 00:51:06,920
the context window is still the limiting factor for memory.
1415
00:51:06,920 --> 00:51:10,240
Some advanced inference engines now support KV cache quantization,
1416
00:51:10,240 --> 00:51:12,000
which can reduce memory further.
1417
00:51:12,000 --> 00:51:14,600
But this is bleeding edge and may affect accuracy,
1418
00:51:14,600 --> 00:51:16,120
test thoroughly before deploying.
1419
00:51:16,120 --> 00:51:17,680
For production, I recommend starting
1420
00:51:17,680 --> 00:51:19,920
with an 8,000 token context window.
1421
00:51:19,920 --> 00:51:23,360
It provides enough room for five to seven chunks of 500 tokens each,
1422
00:51:23,360 --> 00:51:25,360
plus system prompt and user question.
1423
00:51:25,360 --> 00:51:27,080
If your documents are unusually long,
1424
00:51:27,080 --> 00:51:29,640
or your queries require cross-document synthesis,
1425
00:51:29,640 --> 00:51:31,160
increase to 16,000.
1426
00:51:31,160 --> 00:51:33,160
But don't increase beyond what your hardware can serve
1427
00:51:33,160 --> 00:51:34,640
with acceptable latency.
1428
00:51:34,640 --> 00:51:37,200
The LLM runtime is the crown jewel of your architecture.
1429
00:51:37,200 --> 00:51:38,960
It's also the most resource-hungry,
1430
00:51:38,960 --> 00:51:40,880
plan your hardware around it, everything else,
1431
00:51:40,880 --> 00:51:42,880
the ingestion service, the chunking engine,
1432
00:51:42,880 --> 00:51:44,600
the vector database, the embedding model,
1433
00:51:44,600 --> 00:51:47,280
can run on CPU or share GPU resources.
1434
00:51:47,280 --> 00:51:50,280
The LLM needs dedicated VRM and fast memory bandwidth
1435
00:51:50,280 --> 00:51:52,240
don't under-provision it.
1436
00:51:52,240 --> 00:51:55,120
Now for the interface that brings it all together.
1437
00:51:55,120 --> 00:51:56,400
The query interface.
1438
00:51:56,400 --> 00:51:59,360
A local brain without a face is just an API endpoint.
1439
00:51:59,360 --> 00:52:01,760
Your team members need to ask questions in natural language
1440
00:52:01,760 --> 00:52:05,120
and get grounded, cited answers, without technical friction.
1441
00:52:05,120 --> 00:52:08,240
The query interface is where sovereignty meets usability.
1442
00:52:08,240 --> 00:52:09,920
Build a minimalist web interface.
1443
00:52:09,920 --> 00:52:11,280
It doesn't need to be elaborate.
1444
00:52:11,280 --> 00:52:14,400
A text input, a submit button, a response area,
1445
00:52:14,400 --> 00:52:16,520
and citation links authenticate users
1446
00:52:16,520 --> 00:52:19,400
through Microsoft Entra ID using the same credentials
1447
00:52:19,400 --> 00:52:20,760
they use for SharePoint.
1448
00:52:20,760 --> 00:52:23,240
This provides single sign-on and ensures
1449
00:52:23,240 --> 00:52:25,880
that user identity is known for permission filtering.
1450
00:52:25,880 --> 00:52:27,240
The query flow is straightforward.
1451
00:52:27,240 --> 00:52:28,400
The user types a question.
1452
00:52:28,400 --> 00:52:30,880
The interface sends the question to your local embedding model.
1453
00:52:30,880 --> 00:52:32,880
The embedding model returns a vector.
1454
00:52:32,880 --> 00:52:35,200
The interface queries queuedrand with that vector
1455
00:52:35,200 --> 00:52:37,160
filtered by the user's permission level.
1456
00:52:37,160 --> 00:52:39,880
Queuedrand returns a top five most relevant chunks.
1457
00:52:39,880 --> 00:52:41,280
The interface constructs a prompt
1458
00:52:41,280 --> 00:52:43,760
containing the system instructions, the retrieved chunks,
1459
00:52:43,760 --> 00:52:44,760
and the user question.
1460
00:52:44,760 --> 00:52:46,400
It sends this prompt to Olamma.
1461
00:52:46,400 --> 00:52:47,840
Olamma generates a response.
1462
00:52:47,840 --> 00:52:49,480
The interface displays the response
1463
00:52:49,480 --> 00:52:53,240
alongside citations linking back to the specific SharePoint documents.
1464
00:52:53,240 --> 00:52:54,400
Citations aren't optional.
1465
00:52:54,400 --> 00:52:56,680
They're the mechanism by which users verify answers
1466
00:52:56,680 --> 00:52:58,480
and auditors trace decisions.
1467
00:52:58,480 --> 00:53:00,600
Every answer must show the source document name,
1468
00:53:00,600 --> 00:53:03,360
the library name, and the last modified date.
1469
00:53:03,360 --> 00:53:06,480
Ideally, the citation is a clickable link to the SharePoint document.
1470
00:53:06,480 --> 00:53:08,160
If the document is in SharePoint online,
1471
00:53:08,160 --> 00:53:09,800
the link opens in the browser.
1472
00:53:09,800 --> 00:53:12,840
If it's on-premises, the link opens in the local SharePoint interface.
1473
00:53:12,840 --> 00:53:15,560
The user can verify that the answer matches the source.
1474
00:53:15,560 --> 00:53:17,520
Permission enforcement happens at two points.
1475
00:53:17,520 --> 00:53:19,320
First, the vector database filters
1476
00:53:19,320 --> 00:53:21,200
by permission level during retrieval.
1477
00:53:21,200 --> 00:53:23,360
If a chunk requires executive access,
1478
00:53:23,360 --> 00:53:26,320
and the user is standard, queuedrand doesn't return it.
1479
00:53:26,320 --> 00:53:28,440
Second, the query interface should verify
1480
00:53:28,440 --> 00:53:31,280
the user's group membership before constructing the prompt.
1481
00:53:31,280 --> 00:53:32,600
This is defense in depth.
1482
00:53:32,600 --> 00:53:34,480
Even if the vector database is misconfigured
1483
00:53:34,480 --> 00:53:36,360
and returned an unauthorized chunk,
1484
00:53:36,360 --> 00:53:38,000
the interface would discard it.
1485
00:53:38,000 --> 00:53:39,920
Immutile's guidance on securing RAC systems
1486
00:53:39,920 --> 00:53:42,240
emphasizes this three-layer security model.
1487
00:53:42,240 --> 00:53:45,240
The storage tier is SharePoint with its native access controls.
1488
00:53:45,240 --> 00:53:48,240
The data tier is the vector database with metadata filtering.
1489
00:53:48,240 --> 00:53:49,960
The prompt tier is the query interface
1490
00:53:49,960 --> 00:53:52,600
with user authentication and output validation.
1491
00:53:52,600 --> 00:53:54,680
At each layer, organizations must enforce
1492
00:53:54,680 --> 00:53:57,520
changing access controls, monitor and audit queries
1493
00:53:57,520 --> 00:54:01,160
and validate outputs to avoid both data leakage and hallucinations.
1494
00:54:01,160 --> 00:54:02,640
Output validation is worth mentioning.
1495
00:54:02,640 --> 00:54:05,760
The LLM can still hallucinate even with perfect retrieval.
1496
00:54:05,760 --> 00:54:07,360
It might misinterpret a chunk,
1497
00:54:07,360 --> 00:54:10,240
synthesize two unrelated chunks into a false connection,
1498
00:54:10,240 --> 00:54:12,360
or ignore the system prompt and speculate.
1499
00:54:12,360 --> 00:54:15,200
The query interface should implement basic sanity checks.
1500
00:54:15,200 --> 00:54:17,560
If the answer contains unsupported phrases like,
1501
00:54:17,560 --> 00:54:20,040
"I think" or "it seems" "flagged."
1502
00:54:20,040 --> 00:54:22,640
If the answer contradicts a retrieved chunk, "flagged."
1503
00:54:22,640 --> 00:54:25,600
These checks aren't foolproof, but they catch obvious errors.
1504
00:54:25,600 --> 00:54:27,480
The interface should also log every query,
1505
00:54:27,480 --> 00:54:29,920
every retrieval result, every generated answer
1506
00:54:29,920 --> 00:54:31,160
and every user action.
1507
00:54:31,160 --> 00:54:33,320
Logs stay local, they're your audit trail.
1508
00:54:33,320 --> 00:54:35,600
If a user claims the AI gave them bad advice,
1509
00:54:35,600 --> 00:54:37,960
you can reconstruct exactly what chunks were retrieved
1510
00:54:37,960 --> 00:54:39,200
and what prompt was sent.
1511
00:54:39,200 --> 00:54:41,680
This is governance, and governance is what makes a demo
1512
00:54:41,680 --> 00:54:43,560
into a production system.
1513
00:54:43,560 --> 00:54:46,240
Microsoft 365 co-pilot search API,
1514
00:54:46,240 --> 00:54:48,600
currently in preview, offers a useful benchmark.
1515
00:54:48,600 --> 00:54:50,600
It performs hybrid semantic and lexical search
1516
00:54:50,600 --> 00:54:53,520
over work content and returns relevant documents.
1517
00:54:53,520 --> 00:54:55,160
You can compare your local rag results
1518
00:54:55,160 --> 00:54:57,960
against co-pilot search to evaluate coverage.
1519
00:54:57,960 --> 00:55:00,280
If your rag finds documents that co-pilot misses,
1520
00:55:00,280 --> 00:55:01,240
you have an advantage.
1521
00:55:01,240 --> 00:55:03,560
If co-pilot finds documents your rag misses,
1522
00:55:03,560 --> 00:55:04,880
you have a tuning problem.
1523
00:55:04,880 --> 00:55:06,640
Use this comparison during development.
1524
00:55:06,640 --> 00:55:09,640
Disable it in production because it calls a cloud API.
1525
00:55:09,640 --> 00:55:12,920
The query interface is where users experience the architecture.
1526
00:55:12,920 --> 00:55:14,400
If it's slow, they won't use it.
1527
00:55:14,400 --> 00:55:16,240
If it's inaccurate, they won't trust it.
1528
00:55:16,240 --> 00:55:19,160
If it's ugly, they will tolerate it if the answers are good.
1529
00:55:19,160 --> 00:55:21,360
Focus on latency and accuracy first.
1530
00:55:21,360 --> 00:55:22,680
Polish the interface later.
1531
00:55:22,680 --> 00:55:25,520
Let me describe what a good query interface looks like in practice.
1532
00:55:25,520 --> 00:55:27,240
The user opens an internal web page.
1533
00:55:27,240 --> 00:55:29,400
They see a simple text box with placeholder text
1534
00:55:29,400 --> 00:55:32,040
like ask about our policies, procedures, or documentation.
1535
00:55:32,040 --> 00:55:32,920
They type a question.
1536
00:55:32,920 --> 00:55:35,200
What is the procedure for requesting remote work?
1537
00:55:35,200 --> 00:55:36,040
They hit enter.
1538
00:55:36,040 --> 00:55:38,400
Within two seconds, they see a loading indicator.
1539
00:55:38,400 --> 00:55:40,040
Within five seconds, they see an answer.
1540
00:55:40,040 --> 00:55:41,440
The answer isn't a wall of text.
1541
00:55:41,440 --> 00:55:42,560
It's a short paragraph.
1542
00:55:42,560 --> 00:55:45,440
Remote work requests must be submitted through the HR portal
1543
00:55:45,440 --> 00:55:47,480
at least 10 business days in advance.
1544
00:55:47,480 --> 00:55:49,160
Your manager must approve the request.
1545
00:55:49,160 --> 00:55:51,320
If approved for more than three consecutive days,
1546
00:55:51,320 --> 00:55:53,720
the request requires director-level sign-off.
1547
00:55:53,720 --> 00:55:55,560
Below the answer are citations.
1548
00:55:55,560 --> 00:55:57,200
Source remote work policy.
1549
00:55:57,200 --> 00:56:01,400
Doc X, HR library, last modified March 15, 2026.
1550
00:56:01,400 --> 00:56:04,760
Source, manager handbook.x, leadership library,
1551
00:56:04,760 --> 00:56:07,600
last modified January 8, 2026.
1552
00:56:07,600 --> 00:56:09,280
The user can click any citation
1553
00:56:09,280 --> 00:56:11,280
to open the source document in SharePoint.
1554
00:56:11,280 --> 00:56:14,040
This is the user experience that makes adoption happen.
1555
00:56:14,040 --> 00:56:16,280
It's fast, it's grounded, it's verifiable,
1556
00:56:16,280 --> 00:56:18,600
and it respects the user's existing SharePoint permissions.
1557
00:56:18,600 --> 00:56:21,080
If the user doesn't have access to the leadership library,
1558
00:56:21,080 --> 00:56:23,000
the manager handbook citation doesn't appear.
1559
00:56:23,000 --> 00:56:24,120
The answer is still useful
1560
00:56:24,120 --> 00:56:27,200
because the remote work policy chunk contains enough information.
1561
00:56:27,200 --> 00:56:28,960
But the user can't access material
1562
00:56:28,960 --> 00:56:30,600
they're not authorized to see.
1563
00:56:30,600 --> 00:56:33,440
Error handling in the query interface must be graceful.
1564
00:56:33,440 --> 00:56:36,400
If the vector database is down, show a message like,
1565
00:56:36,400 --> 00:56:38,800
the knowledge base is temporarily unavailable.
1566
00:56:38,800 --> 00:56:40,360
Please try again in a few minutes.
1567
00:56:40,360 --> 00:56:41,760
Don't expose stack traces.
1568
00:56:41,760 --> 00:56:43,680
Don't expose internal service names.
1569
00:56:43,680 --> 00:56:45,680
Don't expose the fact that you're running Q-drand
1570
00:56:45,680 --> 00:56:48,560
on a server named GPU server 01.
1571
00:56:48,560 --> 00:56:50,920
These details help attackers and confuse users.
1572
00:56:50,920 --> 00:56:53,680
If the LLM runtime is overloaded, implement a Q.
1573
00:56:53,680 --> 00:56:55,120
The user submits a question.
1574
00:56:55,120 --> 00:56:56,960
The interface shows position in Q.
1575
00:56:56,960 --> 00:56:59,160
When the GPU is free, the query processes.
1576
00:56:59,160 --> 00:57:01,680
For most organizations, this Q is rarely needed
1577
00:57:01,680 --> 00:57:03,880
because local GPU inference is fast enough
1578
00:57:03,880 --> 00:57:04,800
for interactive use.
1579
00:57:04,800 --> 00:57:06,440
But if you have 100 concurrent users,
1580
00:57:06,440 --> 00:57:08,600
Qing prevents the system from crashing.
1581
00:57:08,600 --> 00:57:11,400
If the LLM generates an answer that fails validation,
1582
00:57:11,400 --> 00:57:13,640
for example, it contains speculative language,
1583
00:57:13,640 --> 00:57:16,360
unsupported by the context, flag it for review,
1584
00:57:16,360 --> 00:57:18,320
show the user the answer with a disclaimer.
1585
00:57:18,320 --> 00:57:19,840
This answer may contain information
1586
00:57:19,840 --> 00:57:21,520
not found in the source documents.
1587
00:57:21,520 --> 00:57:22,960
Please verify before acting.
1588
00:57:22,960 --> 00:57:24,560
This isn't ideal, but it's better
1589
00:57:24,560 --> 00:57:26,760
than presenting hallucinations as facts.
1590
00:57:26,760 --> 00:57:29,120
The query interface should also support feedback.
1591
00:57:29,120 --> 00:57:30,800
Thumbs up, thumbs down.
1592
00:57:30,800 --> 00:57:33,520
A text box for explaining why the answer was wrong.
1593
00:57:33,520 --> 00:57:36,080
This feedback feeds into your evaluation pipeline.
1594
00:57:36,080 --> 00:57:38,240
You review thumbs down responses weekly.
1595
00:57:38,240 --> 00:57:40,880
You identify common failure modes, bad chunking,
1596
00:57:40,880 --> 00:57:43,360
missing documents, hallucinated citations,
1597
00:57:43,360 --> 00:57:44,360
and you fix them.
1598
00:57:44,360 --> 00:57:46,960
This feedback loop is how the system improves over time
1599
00:57:46,960 --> 00:57:48,400
without retraining models.
1600
00:57:48,400 --> 00:57:50,800
But here is what most proof of concepts ignore.
1601
00:57:50,800 --> 00:57:52,000
They build a working pipeline,
1602
00:57:52,000 --> 00:57:54,560
they demonstrate a good answer, and they declare victory.
1603
00:57:54,560 --> 00:57:56,000
The real work starts after that.
1604
00:57:56,000 --> 00:57:57,720
Permission tiers and access control.
1605
00:57:57,720 --> 00:58:00,280
SharePoint already has role-based access control.
1606
00:58:00,280 --> 00:58:03,120
Your AI must mirror it exactly, not approximately,
1607
00:58:03,120 --> 00:58:04,800
not eventually, exactly.
1608
00:58:04,800 --> 00:58:06,160
Every library has permissions.
1609
00:58:06,160 --> 00:58:08,480
Every document inherits or overrides them.
1610
00:58:08,480 --> 00:58:10,600
Every user has an effective permission level,
1611
00:58:10,600 --> 00:58:13,480
determined by their group membership, direct grants,
1612
00:58:13,480 --> 00:58:14,800
and denied permissions.
1613
00:58:14,800 --> 00:58:17,520
Your vector database must respect the same matrix.
1614
00:58:17,520 --> 00:58:19,720
The naive approach is to build a single vector index
1615
00:58:19,720 --> 00:58:22,640
for the entire organization and filter at the application layer.
1616
00:58:22,640 --> 00:58:23,480
This fails.
1617
00:58:23,480 --> 00:58:26,880
Because application layer filters can be bypassed by bugs.
1618
00:58:26,880 --> 00:58:29,240
It fails because a single compromised query interface
1619
00:58:29,240 --> 00:58:31,080
exposes the entire index.
1620
00:58:31,080 --> 00:58:33,200
And it fails because it doesn't scale to find
1621
00:58:33,200 --> 00:58:35,520
grained permissions like document level access control.
1622
00:58:35,520 --> 00:58:37,400
The correct approach is to tag every vector
1623
00:58:37,400 --> 00:58:40,080
with its required permission level at ingestion time.
1624
00:58:40,080 --> 00:58:42,920
When the ingestion service processes a document from SharePoint,
1625
00:58:42,920 --> 00:58:45,160
it queries the SharePoint API for the documents
1626
00:58:45,160 --> 00:58:46,400
effective permissions.
1627
00:58:46,400 --> 00:58:48,160
It maps those permissions to a permission tier,
1628
00:58:48,160 --> 00:58:50,600
executive, manager, standard, public,
1629
00:58:50,600 --> 00:58:52,640
or whatever taxonomy your organization uses.
1630
00:58:52,640 --> 00:58:56,000
It stores that tier in the vectors metadata payload.
1631
00:58:56,000 --> 00:58:58,080
At query time, the user's permission tier
1632
00:58:58,080 --> 00:59:01,280
is determined by their Microsoft EntraID group membership.
1633
00:59:01,280 --> 00:59:03,400
The query interface passes this tier to QDRIND
1634
00:59:03,400 --> 00:59:04,560
as a filter condition.
1635
00:59:04,560 --> 00:59:07,520
QDRIND only searches vectors where the permission tier is less
1636
00:59:07,520 --> 00:59:10,480
than or equal to the user's tier.
1637
00:59:10,480 --> 00:59:12,440
A standard user searching for budget information
1638
00:59:12,440 --> 00:59:14,520
doesn't see executive budget documents.
1639
00:59:14,520 --> 00:59:18,200
A manager searching for HR policies sees manager level policies,
1640
00:59:18,200 --> 00:59:20,320
but not executive compensation details.
1641
00:59:20,320 --> 00:59:22,520
NIST defines a role-based access control
1642
00:59:22,520 --> 00:59:24,400
as enforcing three rules.
1643
00:59:24,400 --> 00:59:26,320
Role assignment every user must be assigned a role.
1644
00:59:26,320 --> 00:59:29,240
Role authorization every role must be authorized for the user.
1645
00:59:29,240 --> 00:59:31,080
Permission authorization every permission
1646
00:59:31,080 --> 00:59:32,680
must be authorized for the role.
1647
00:59:32,680 --> 00:59:35,640
In our architecture, role assignment happens in EntraID.
1648
00:59:35,640 --> 00:59:38,160
Role authorization happens when the query interface
1649
00:59:38,160 --> 00:59:40,520
resolves the user's groups to permission tiers.
1650
00:59:40,520 --> 00:59:42,680
Permission authorization happens when QDRIND filters
1651
00:59:42,680 --> 00:59:44,480
by tier during vector search.
1652
00:59:44,480 --> 00:59:45,600
This isn't a new concept.
1653
00:59:45,600 --> 00:59:47,560
It's our back applied to vector databases.
1654
00:59:47,560 --> 00:59:50,000
The novelty is that most drag implementations ignore it.
1655
00:59:50,000 --> 00:59:51,360
They build a single collection.
1656
00:59:51,360 --> 00:59:53,040
They search everything.
1657
00:59:53,040 --> 00:59:55,680
And they hope the LLM doesn't say something sensitive.
1658
00:59:55,680 --> 00:59:56,480
That's not security.
1659
00:59:56,480 --> 00:59:58,160
That's wishful thinking.
1660
00:59:58,160 --> 01:00:00,040
Microsoft purview data loss prevention
1661
01:00:00,040 --> 01:00:01,400
provides an additional layer.
1662
01:00:01,400 --> 01:00:03,560
DLP policies in SharePoint can block documents
1663
01:00:03,560 --> 01:00:05,840
from leaving the organization, but in our architecture,
1664
01:00:05,840 --> 01:00:06,920
documents never leave.
1665
01:00:06,920 --> 01:00:09,280
They're read by the ingestion service inside the perimeter
1666
01:00:09,280 --> 01:00:11,960
and converted into vectors that also stay inside.
1667
01:00:11,960 --> 01:00:14,480
DLP policies should still monitor the ingestion services
1668
01:00:14,480 --> 01:00:16,560
API calls to ensure it doesn't accidentally
1669
01:00:16,560 --> 01:00:18,360
forward content to external endpoints.
1670
01:00:18,360 --> 01:00:19,600
This is belt and suspenders.
1671
01:00:19,600 --> 01:00:21,800
The architecture prevents leakage by design.
1672
01:00:21,800 --> 01:00:23,960
DLP detects leakage if the design fails.
1673
01:00:23,960 --> 01:00:25,560
Audit logging is needed for compliance.
1674
01:00:25,560 --> 01:00:28,000
Every query must be logged with the user identity,
1675
01:00:28,000 --> 01:00:30,680
timestamp, query text, retrieve chunks,
1676
01:00:30,680 --> 01:00:32,760
generated answer and citation list.
1677
01:00:32,760 --> 01:00:35,480
These logs prove that the system is behaving correctly.
1678
01:00:35,480 --> 01:00:37,400
They support investigations if a user claims
1679
01:00:37,400 --> 01:00:39,360
they received an unauthorized answer.
1680
01:00:39,360 --> 01:00:41,200
And they demonstrate compliance to auditors.
1681
01:00:41,200 --> 01:00:44,320
Logs should be stored locally, not in a cloud logging service,
1682
01:00:44,320 --> 01:00:45,720
not in a shared SaaS platform.
1683
01:00:45,720 --> 01:00:47,400
In a local log aggregation system,
1684
01:00:47,400 --> 01:00:49,320
like the elastic stack or grapharnaloki,
1685
01:00:49,320 --> 01:00:52,080
retain them according to your organization's retention policy,
1686
01:00:52,080 --> 01:00:53,880
secure them with the same R-back that governs
1687
01:00:53,880 --> 01:00:54,920
the rest of the system.
1688
01:00:54,920 --> 01:00:56,960
And review them periodically for anomalies.
1689
01:00:56,960 --> 01:00:59,560
Permission synchronization is an operational challenge.
1690
01:00:59,560 --> 01:01:01,240
SharePoint permissions change.
1691
01:01:01,240 --> 01:01:02,920
Users move between departments.
1692
01:01:02,920 --> 01:01:05,840
Groups are reorganized, documents are reclassified.
1693
01:01:05,840 --> 01:01:08,280
Your vector database must reflect these changes.
1694
01:01:08,280 --> 01:01:10,360
The ingestion service should periodically re-scan
1695
01:01:10,360 --> 01:01:12,800
document permissions and update vector metadata.
1696
01:01:12,800 --> 01:01:15,360
The query interface should refresh user group membership
1697
01:01:15,360 --> 01:01:17,400
on every log in or at least every session.
1698
01:01:17,400 --> 01:01:19,800
And you should run a full permission audit quarterly
1699
01:01:19,800 --> 01:01:20,880
to catch drift.
1700
01:01:20,880 --> 01:01:22,880
If a document's permission level increases,
1701
01:01:22,880 --> 01:01:24,560
meaning fewer users should see it,
1702
01:01:24,560 --> 01:01:27,040
the ingestion service must update the vector metadata
1703
01:01:27,040 --> 01:01:28,240
immediately.
1704
01:01:28,240 --> 01:01:30,160
If a document's permission level decreases,
1705
01:01:30,160 --> 01:01:33,160
meaning more users should see it, the update can be batched.
1706
01:01:33,160 --> 01:01:35,680
The risk of temporarily withholding accessible information
1707
01:01:35,680 --> 01:01:38,240
is lower than the risk of temporarily exposing
1708
01:01:38,240 --> 01:01:39,280
restricted information.
1709
01:01:39,280 --> 01:01:40,040
This is governance.
1710
01:01:40,040 --> 01:01:41,120
It's not glamorous.
1711
01:01:41,120 --> 01:01:43,040
But it's what separates a proof of concept
1712
01:01:43,040 --> 01:01:45,160
from a system your legal team will approve.
1713
01:01:45,160 --> 01:01:47,400
Let me give you a specific example of permission mapping
1714
01:01:47,400 --> 01:01:48,240
in practice.
1715
01:01:48,240 --> 01:01:50,200
Your SharePoint tenant has three libraries.
1716
01:01:50,200 --> 01:01:52,000
The public library contains employee handbooks
1717
01:01:52,000 --> 01:01:53,320
and IT guidelines.
1718
01:01:53,320 --> 01:01:55,240
The manager library contains team budgets
1719
01:01:55,240 --> 01:01:56,480
and hiring procedures.
1720
01:01:56,480 --> 01:01:59,520
The executive library contains board minutes and M&A strategy.
1721
01:01:59,520 --> 01:02:01,200
In EntraID, you have three groups.
1722
01:02:01,200 --> 01:02:02,880
All employees contains every user.
1723
01:02:02,880 --> 01:02:05,240
Managers contains users with direct reports.
1724
01:02:05,240 --> 01:02:07,520
Executives contains the C-suite and VPs.
1725
01:02:07,520 --> 01:02:09,480
The ingestion service tags every vector
1726
01:02:09,480 --> 01:02:12,000
from the public library with permission tier standard.
1727
01:02:12,000 --> 01:02:14,560
Every vector from the manager library with permission tier
1728
01:02:14,560 --> 01:02:15,400
manager.
1729
01:02:15,400 --> 01:02:17,160
Every vector from the executive library
1730
01:02:17,160 --> 01:02:19,080
with permission tier executive.
1731
01:02:19,080 --> 01:02:21,920
When a user in the all employees group queries the system,
1732
01:02:21,920 --> 01:02:25,000
the query interface resolves their tier as standard.
1733
01:02:25,000 --> 01:02:27,560
QDrand filters the search to vectors with tier standard.
1734
01:02:27,560 --> 01:02:29,360
The user sees employee handbook answers,
1735
01:02:29,360 --> 01:02:30,960
but not budget details.
1736
01:02:30,960 --> 01:02:32,960
When a user in the managers group queries,
1737
01:02:32,960 --> 01:02:34,960
their tier resolves as manager.
1738
01:02:34,960 --> 01:02:37,920
QDrand searches vectors with tier standard or manager.
1739
01:02:37,920 --> 01:02:39,880
They see handbooks and hiring procedures.
1740
01:02:39,880 --> 01:02:41,880
When an executive queries, QDrand searches
1741
01:02:41,880 --> 01:02:42,720
all tiers.
1742
01:02:42,720 --> 01:02:43,640
They see everything.
1743
01:02:43,640 --> 01:02:45,880
This is simple, auditable, and matched with SharePoints
1744
01:02:45,880 --> 01:02:47,040
native permissions.
1745
01:02:47,040 --> 01:02:48,280
Edge cases exist.
1746
01:02:48,280 --> 01:02:51,000
Consider a user who moves from engineering to sales.
1747
01:02:51,000 --> 01:02:52,880
Their EntraID group membership changes.
1748
01:02:52,880 --> 01:02:55,480
The query interface picks up the new groups on next login.
1749
01:02:55,480 --> 01:02:57,960
But if they had access to sensitive engineering documents
1750
01:02:57,960 --> 01:02:59,800
yesterday and shouldn't see them today,
1751
01:02:59,800 --> 01:03:01,800
the query interface must refresh group membership
1752
01:03:01,800 --> 01:03:03,840
every session, not just at login.
1753
01:03:03,840 --> 01:03:06,280
And it should cache group membership for no more than one hour
1754
01:03:06,280 --> 01:03:08,240
to balance security against performance.
1755
01:03:08,240 --> 01:03:10,200
Consider a document with custom permissions,
1756
01:03:10,200 --> 01:03:11,640
a single file in the public library
1757
01:03:11,640 --> 01:03:13,320
that's restricted to the legal team.
1758
01:03:13,320 --> 01:03:16,040
The ingestion service must detect this exception.
1759
01:03:16,040 --> 01:03:19,440
It queries the SharePoint API for the document's effective permissions.
1760
01:03:19,440 --> 01:03:21,640
It sees that only the legal group has access.
1761
01:03:21,640 --> 01:03:23,080
It tags the vector with permission
1762
01:03:23,080 --> 01:03:24,800
to your legal instead of standard.
1763
01:03:24,800 --> 01:03:26,360
This requires the ingestion service
1764
01:03:26,360 --> 01:03:29,320
to check permissions per document, not just per library.
1765
01:03:29,320 --> 01:03:31,240
It's slower, but it's accurate.
1766
01:03:31,240 --> 01:03:32,760
And accuracy is the point.
1767
01:03:32,760 --> 01:03:34,960
Consider inherited permissions that break.
1768
01:03:34,960 --> 01:03:36,280
A library inherits from the site.
1769
01:03:36,280 --> 01:03:37,760
A document inherits from the library.
1770
01:03:37,760 --> 01:03:39,520
Then someone breaks inheritance on the document
1771
01:03:39,520 --> 01:03:41,600
and grants access to an individual user.
1772
01:03:41,600 --> 01:03:44,080
Your ingestion service must detect the broken inheritance
1773
01:03:44,080 --> 01:03:45,080
and map it correctly.
1774
01:03:45,080 --> 01:03:46,000
This is complex.
1775
01:03:46,000 --> 01:03:48,600
But SharePoint's API returns effective permissions
1776
01:03:48,600 --> 01:03:50,440
that account for inheritance breaks.
1777
01:03:50,440 --> 01:03:55,120
Trust the API, map the result, and audit periodically.
1778
01:03:55,120 --> 01:03:58,120
Permission drift is the silent killer of secure RRAG.
1779
01:03:58,120 --> 01:04:01,120
A document moves from the manager library to the public library.
1780
01:04:01,120 --> 01:04:02,880
The ingestion service detects the move.
1781
01:04:02,880 --> 01:04:04,840
It updates the vector metadata.
1782
01:04:04,840 --> 01:04:06,880
But if the service is down during the move,
1783
01:04:06,880 --> 01:04:09,480
the vector retains its old permission tag,
1784
01:04:09,480 --> 01:04:11,720
a standard user might see manager-level content.
1785
01:04:11,720 --> 01:04:14,440
To prevent this, run a full permission reconciliation weekly.
1786
01:04:14,440 --> 01:04:15,640
Rescan all vectors.
1787
01:04:15,640 --> 01:04:17,840
Compare their permission tags against current SharePoint
1788
01:04:17,840 --> 01:04:21,600
permissions, flag mismatches, and fix them.
1789
01:04:21,600 --> 01:04:23,920
But the document base never stays small.
1790
01:04:23,920 --> 01:04:25,800
Scaling and performance tuning.
1791
01:04:25,800 --> 01:04:27,480
A few hundred documents is trivial.
1792
01:04:27,480 --> 01:04:29,360
10,000 documents with daily updates
1793
01:04:29,360 --> 01:04:30,840
is a different problem.
1794
01:04:30,840 --> 01:04:33,360
The architecture scales, but only if you tune it.
1795
01:04:33,360 --> 01:04:35,480
Untuned systems degrade gracefully at first,
1796
01:04:35,480 --> 01:04:38,160
and then suddenly query latency creeps up.
1797
01:04:38,160 --> 01:04:41,280
Embedding throughput drops, GPU memory fills.
1798
01:04:41,280 --> 01:04:43,960
And users start complaining that the AI is slow.
1799
01:04:43,960 --> 01:04:45,880
GPU batching for embedding generation
1800
01:04:45,880 --> 01:04:47,520
is your first optimization.
1801
01:04:47,520 --> 01:04:49,640
Embedding models process chunks in parallel.
1802
01:04:49,640 --> 01:04:53,280
A batch of 64 chunks isn't 64 times slower than a batch of one.
1803
01:04:53,280 --> 01:04:54,760
It's perhaps 10 times slower.
1804
01:04:54,760 --> 01:04:57,360
Batching amortizes the overhead of loading the model
1805
01:04:57,360 --> 01:04:59,560
and transferring data to GPU memory.
1806
01:04:59,560 --> 01:05:02,520
Use the largest batch size that fits in your GPU memory.
1807
01:05:02,520 --> 01:05:04,960
For all Mini-LM on a 24 gigabyte GPU,
1808
01:05:04,960 --> 01:05:06,800
you can batch thousands of chunks.
1809
01:05:06,800 --> 01:05:11,680
For BGE large, the batch size is smaller, experiment and measure.
1810
01:05:11,680 --> 01:05:14,680
Incremental Delta indexing replaces full re-indexing.
1811
01:05:14,680 --> 01:05:16,000
After the initial index is built,
1812
01:05:16,000 --> 01:05:17,880
you only process change documents.
1813
01:05:17,880 --> 01:05:20,680
SharePoints changes API returns modified items
1814
01:05:20,680 --> 01:05:21,800
since a timestamp.
1815
01:05:21,800 --> 01:05:23,520
Your ingestion service stores a watermark
1816
01:05:23,520 --> 01:05:25,320
and processes only the Delta.
1817
01:05:25,320 --> 01:05:27,640
This keeps indexing time proportional to change volume,
1818
01:05:27,640 --> 01:05:28,880
not total volume.
1819
01:05:28,880 --> 01:05:31,760
A 10,000 document library with 50 daily changes
1820
01:05:31,760 --> 01:05:34,080
takes minutes to update, not hours.
1821
01:05:34,080 --> 01:05:36,600
Multi-collection sharding by department or document type
1822
01:05:36,600 --> 01:05:38,400
reduces index size per collection.
1823
01:05:38,400 --> 01:05:41,880
QDrand searches a single collection in parallel across segments.
1824
01:05:41,880 --> 01:05:45,680
But if a collection grows too large, search slows.
1825
01:05:45,680 --> 01:05:48,280
Splitting into smaller collections, one per department
1826
01:05:48,280 --> 01:05:51,200
or one per major library keeps each collection fast.
1827
01:05:51,200 --> 01:05:54,240
The query interface roots the search to the appropriate collection
1828
01:05:54,240 --> 01:05:56,360
based on the user's query context.
1829
01:05:56,360 --> 01:05:58,560
Or it searches multiple collections in parallel
1830
01:05:58,560 --> 01:05:59,800
and merges the results.
1831
01:05:59,800 --> 01:06:02,200
Caching frequent queries reduces redundant work.
1832
01:06:02,200 --> 01:06:04,040
Some questions get asked repeatedly.
1833
01:06:04,040 --> 01:06:05,400
What is our vacation policy?
1834
01:06:05,400 --> 01:06:07,080
How do I submit an expense report?
1835
01:06:07,080 --> 01:06:08,960
Cache the query vector, the retrieved chunks
1836
01:06:08,960 --> 01:06:10,280
and the generated answer.
1837
01:06:10,280 --> 01:06:14,080
Serve the cached answer for identical or near identical queries.
1838
01:06:14,080 --> 01:06:16,880
Invalidate the cache when the source documents change.
1839
01:06:16,880 --> 01:06:19,280
This can eliminate 50% or more of LLM calls
1840
01:06:19,280 --> 01:06:20,680
for FAQ style queries.
1841
01:06:20,680 --> 01:06:22,440
Monitor vector DB query latency.
1842
01:06:22,440 --> 01:06:25,640
QDrand exposes Prometheus metrics, track P50, P95,
1843
01:06:25,640 --> 01:06:26,920
and P99 latency.
1844
01:06:26,920 --> 01:06:30,040
If P95 exceeds 200 milliseconds, investigate.
1845
01:06:30,040 --> 01:06:32,960
Increase EF and rebuild the index with higher EF construction
1846
01:06:32,960 --> 01:06:35,520
add query replicas or chart the collection.
1847
01:06:35,520 --> 01:06:37,400
If P99 spikes, you may have a hot chart
1848
01:06:37,400 --> 01:06:39,040
where one segment is overloaded.
1849
01:06:39,040 --> 01:06:41,040
QDrand can rebalance segments automatically
1850
01:06:41,040 --> 01:06:43,040
but you should verify it's doing so.
1851
01:06:43,040 --> 01:06:45,320
For embedding throughput, schedule re-indexing
1852
01:06:45,320 --> 01:06:46,520
during off-peak hours.
1853
01:06:46,520 --> 01:06:48,360
Users query during business hours.
1854
01:06:48,360 --> 01:06:51,080
The ingestion service indexes during nights and weekends.
1855
01:06:51,080 --> 01:06:53,440
This separation prevents resource contention.
1856
01:06:53,440 --> 01:06:55,360
If your organization operates globally,
1857
01:06:55,360 --> 01:06:57,640
define off-peak by region or run ingestion
1858
01:06:57,640 --> 01:06:59,080
on dedicated hardware.
1859
01:06:59,080 --> 01:07:02,480
CPU fallback for embedding is viable if GPU is saturated.
1860
01:07:02,480 --> 01:07:04,480
Modern sentence transformers run well on CPU.
1861
01:07:04,480 --> 01:07:06,520
An AMD EPIC or Intel Zeon processor
1862
01:07:06,520 --> 01:07:09,440
with many cores can embed hundreds of chunks per second.
1863
01:07:09,440 --> 01:07:12,240
It's slower than GPU but cheaper and more available.
1864
01:07:12,240 --> 01:07:15,120
If your GPU is fully utilized by LLM inference,
1865
01:07:15,120 --> 01:07:16,800
move embedding to CPU.
1866
01:07:16,800 --> 01:07:19,600
The latency increase is acceptable for background indexing.
1867
01:07:19,600 --> 01:07:21,680
Hardware scaling follows a simple pattern.
1868
01:07:21,680 --> 01:07:24,240
Start with one GPU server handling everything.
1869
01:07:24,240 --> 01:07:26,200
When LLM inference saturates the GPU,
1870
01:07:26,200 --> 01:07:28,280
add a second GPU dedicated to serving.
1871
01:07:28,280 --> 01:07:31,960
When embedding saturates add CPU workers or a second GPU,
1872
01:07:31,960 --> 01:07:33,960
when the vector database becomes a bottleneck,
1873
01:07:33,960 --> 01:07:36,640
run Q-drand on its own server with fast SSDs
1874
01:07:36,640 --> 01:07:37,760
and plenty of RAM.
1875
01:07:37,760 --> 01:07:39,800
The architecture is horizontally modular.
1876
01:07:39,800 --> 01:07:41,880
Each component can scale independently.
1877
01:07:41,880 --> 01:07:43,480
Cost reality is worth stating.
1878
01:07:43,480 --> 01:07:45,480
Local hardware has high upfront cost.
1879
01:07:45,480 --> 01:07:48,800
A server with an Nvidia A-164 gigabytes of RAM
1880
01:07:48,800 --> 01:07:52,240
and fast storage might cost $15,000 or more.
1881
01:07:52,240 --> 01:07:54,080
But the operating cost is predictable.
1882
01:07:54,080 --> 01:07:57,000
Electricity, maintenance, occasional upgrades,
1883
01:07:57,000 --> 01:08:00,080
there's no per token pricing, there's no usage surprise.
1884
01:08:00,080 --> 01:08:02,800
For an organization processing thousands of queries daily,
1885
01:08:02,800 --> 01:08:04,920
the break-even point against cloud API costs
1886
01:08:04,920 --> 01:08:07,320
often arrives within 12 to 18 months.
1887
01:08:07,320 --> 01:08:10,080
The 2026 total cost of ownership analysis
1888
01:08:10,080 --> 01:08:11,560
supports this directionally.
1889
01:08:11,560 --> 01:08:12,960
Beyond certain usage thresholds,
1890
01:08:12,960 --> 01:08:15,560
running open source models on dedicated GPU servers
1891
01:08:15,560 --> 01:08:18,360
becomes more cost-effective than per token API fees.
1892
01:08:18,360 --> 01:08:20,800
The exact threshold depends on your query volume,
1893
01:08:20,800 --> 01:08:23,280
model size, and hardware choices.
1894
01:08:23,280 --> 01:08:25,960
But the economics favor local deployment at scale.
1895
01:08:25,960 --> 01:08:28,200
Let me walk you through a capacity planning example.
1896
01:08:28,200 --> 01:08:30,760
Your organization has 50,000 SharePoint documents.
1897
01:08:30,760 --> 01:08:33,880
They generate roughly 200,000 chunks after chunking.
1898
01:08:33,880 --> 01:08:37,400
Each chunk is embedded into a 384-dimensional vector.
1899
01:08:37,400 --> 01:08:39,800
The total vector storage is roughly one gigabyte.
1900
01:08:39,800 --> 01:08:43,120
With Q-drand overhead and metadata, call it three gigabytes.
1901
01:08:43,120 --> 01:08:45,080
This fits comfortably on a single server.
1902
01:08:45,080 --> 01:08:48,040
Your users submit roughly 1,000 queries per day.
1903
01:08:48,040 --> 01:08:50,360
Each query embeds the question, searches Q-drand
1904
01:08:50,360 --> 01:08:51,560
and calls Olamma.
1905
01:08:51,560 --> 01:08:53,000
Embedding takes 10 milliseconds.
1906
01:08:53,000 --> 01:08:55,160
Q-drand search takes 50 milliseconds.
1907
01:08:55,160 --> 01:08:57,280
LLM generation takes three seconds.
1908
01:08:57,280 --> 01:09:00,120
Total latency is roughly 3.1 seconds per query.
1909
01:09:00,120 --> 01:09:03,000
1,000 queries per day is 42 queries per hour.
1910
01:09:03,000 --> 01:09:04,720
Your GPU is idle most of the time.
1911
01:09:04,720 --> 01:09:06,600
Now scale to 10,000 queries per day.
1912
01:09:06,600 --> 01:09:10,280
That's 417 per hour, still manageable on a single GPU.
1913
01:09:10,280 --> 01:09:12,040
But at 50,000 queries per day,
1914
01:09:12,040 --> 01:09:15,280
you're processing 2,000 per hour during business hours.
1915
01:09:15,280 --> 01:09:17,720
The GPU hits 80% utilization.
1916
01:09:17,720 --> 01:09:20,400
Query latency degrades from three seconds to six seconds.
1917
01:09:20,400 --> 01:09:21,520
Users notice, at this point,
1918
01:09:21,520 --> 01:09:24,520
you add a second GPU server dedicated to LLM inference.
1919
01:09:24,520 --> 01:09:25,920
You root queries round robin.
1920
01:09:25,920 --> 01:09:27,720
Latency returns to three seconds.
1921
01:09:27,720 --> 01:09:28,920
You have scaled horizontally.
1922
01:09:28,920 --> 01:09:31,520
Vector database scaling follows a different curve.
1923
01:09:31,520 --> 01:09:34,360
Search latency depends on vector count and index quality.
1924
01:09:34,360 --> 01:09:37,520
200,000 vectors search in under 100 milliseconds.
1925
01:09:37,520 --> 01:09:40,000
Two million vectors might take 200 milliseconds.
1926
01:09:40,000 --> 01:09:41,480
10 million might take half a second.
1927
01:09:41,480 --> 01:09:43,640
If your document base grows to a million documents,
1928
01:09:43,640 --> 01:09:45,320
consider shouting by department.
1929
01:09:45,320 --> 01:09:47,400
The legal collection contains only legal documents.
1930
01:09:47,400 --> 01:09:50,200
The HR collection contains only HR documents.
1931
01:09:50,200 --> 01:09:52,640
The query interface roots based on the user's department
1932
01:09:52,640 --> 01:09:54,640
or searches all collections in parallel.
1933
01:09:54,640 --> 01:09:56,600
Each collection stays small and fast.
1934
01:09:56,600 --> 01:09:58,400
Network bandwidth is rarely the bottleneck
1935
01:09:58,400 --> 01:09:59,400
in local deployments.
1936
01:09:59,400 --> 01:10:01,320
Your ingestion service talks to SharePoint
1937
01:10:01,320 --> 01:10:02,720
over your corporate network.
1938
01:10:02,720 --> 01:10:05,200
It talks to the vector database over the local LAN.
1939
01:10:05,200 --> 01:10:06,480
The latency is milliseconds.
1940
01:10:06,480 --> 01:10:07,800
The bandwidth is gigabits.
1941
01:10:07,800 --> 01:10:09,920
The bottleneck is compute, not network.
1942
01:10:09,920 --> 01:10:12,400
Focus your scaling budget on GPU and CPU,
1943
01:10:12,400 --> 01:10:13,840
not on network switches.
1944
01:10:13,840 --> 01:10:15,600
Storage scaling is predictable.
1945
01:10:15,600 --> 01:10:18,160
Model weights don't grow unless you upgrade models.
1946
01:10:18,160 --> 01:10:20,040
Vector storage grows with document count.
1947
01:10:20,040 --> 01:10:23,080
Lock storage grows with query volume, plan for log rotation.
1948
01:10:23,080 --> 01:10:25,720
Keep 30 days of detailed logs for troubleshooting.
1949
01:10:25,720 --> 01:10:27,800
Archive older logs to cold storage.
1950
01:10:27,800 --> 01:10:30,160
For a busy system, logs might consume hundreds
1951
01:10:30,160 --> 01:10:31,320
of gigabytes per month.
1952
01:10:31,320 --> 01:10:32,680
Don't let them fill your disk.
1953
01:10:32,680 --> 01:10:34,800
Scaling isn't about buying bigger hardware.
1954
01:10:34,800 --> 01:10:36,760
It's about understanding where the bottlenecks are
1955
01:10:36,760 --> 01:10:38,840
and eliminating them systematically.
1956
01:10:38,840 --> 01:10:41,280
Monitor everything, profile before optimizing,
1957
01:10:41,280 --> 01:10:43,280
and scale one component at a time.
1958
01:10:43,280 --> 01:10:45,720
This methodical approach prevents overspending on hardware
1959
01:10:45,720 --> 01:10:47,720
you don't need while ensuring you never hit a wall
1960
01:10:47,720 --> 01:10:48,800
you can't climb.
1961
01:10:48,800 --> 01:10:50,960
Load testing validates your capacity assumptions
1962
01:10:50,960 --> 01:10:52,440
before production deployment.
1963
01:10:52,440 --> 01:10:56,080
User tool like Locust or K6 to simulate concurrent users.
1964
01:10:56,080 --> 01:10:59,240
Start with 10 virtual users submitting queries every 30 seconds.
1965
01:10:59,240 --> 01:11:01,440
Monitor latency and GPU utilization
1966
01:11:01,440 --> 01:11:04,000
gradually increased to 50, then 100 users.
1967
01:11:04,000 --> 01:11:05,360
Identify the breaking point.
1968
01:11:05,360 --> 01:11:07,040
If the system degrades at 40 users,
1969
01:11:07,040 --> 01:11:09,760
you know your single GPU setup handles 30 comfortably.
1970
01:11:09,760 --> 01:11:12,080
Plan your hardware so your monitoring dashboard
1971
01:11:12,080 --> 01:11:14,440
should show four main metrics at a glance.
1972
01:11:14,440 --> 01:11:18,360
Query latency over the last hour with P50, P95, and P99 lines.
1973
01:11:18,360 --> 01:11:21,760
GPU utilization percentage with a red threshold at 80%.
1974
01:11:21,760 --> 01:11:25,120
Vector database query throughput in queries per second.
1975
01:11:25,120 --> 01:11:27,400
An ingestion Q-depth showing how many documents are waiting
1976
01:11:27,400 --> 01:11:28,440
to be processed.
1977
01:11:28,440 --> 01:11:31,040
These four metrics tell you whether your system is healthy,
1978
01:11:31,040 --> 01:11:32,480
stressed, or failing.
1979
01:11:32,480 --> 01:11:34,320
Everything else is detailed you drill into
1980
01:11:34,320 --> 01:11:36,160
when one of these four looks wrong.
1981
01:11:36,160 --> 01:11:38,880
Alerting should be proactive, not reactive.
1982
01:11:38,880 --> 01:11:42,160
Alert when P95 latency exceeds five seconds for 10 minutes.
1983
01:11:42,160 --> 01:11:45,600
Alert when GPU utilization exceeds 80% for 15 minutes.
1984
01:11:45,600 --> 01:11:48,520
Alert when ingestion Q-depth exceeds 1000 documents
1985
01:11:48,520 --> 01:11:49,560
for 30 minutes.
1986
01:11:49,560 --> 01:11:52,160
And alert when any component logs an error
1987
01:11:52,160 --> 01:11:54,920
more than 10 times in five minutes.
1988
01:11:54,920 --> 01:11:57,680
These alerts catch problems before users complain.
1989
01:11:57,680 --> 01:12:00,120
Troubleshooting follows a simple decision tree.
1990
01:12:00,120 --> 01:12:03,080
If query latency is high, check GPU utilization first.
1991
01:12:03,080 --> 01:12:04,920
If the GPU is saturated, scale it.
1992
01:12:04,920 --> 01:12:08,160
If the GPU is idle, check Vector database latency.
1993
01:12:08,160 --> 01:12:11,280
If Q-drand is slow, check index size and EF parameters.
1994
01:12:11,280 --> 01:12:13,840
If Q-drand is fast, check the embedding model.
1995
01:12:13,840 --> 01:12:16,800
If embedding is slow, check batch size and GPU memory.
1996
01:12:16,800 --> 01:12:19,240
If everything is fast, but the answer quality is poor,
1997
01:12:19,240 --> 01:12:21,800
check chunking strategy and retrieval accuracy.
1998
01:12:21,800 --> 01:12:24,080
Quality problems usually trace back to chunking.
1999
01:12:24,080 --> 01:12:26,200
And here is the part that makes this future proof.
2000
01:12:26,200 --> 01:12:27,920
Modularity and model evolution.
2001
01:12:27,920 --> 01:12:29,720
Yamaha 4 won't be the last open model.
2002
01:12:29,720 --> 01:12:31,320
Next year there will be Lama 5,
2003
01:12:31,320 --> 01:12:33,160
or a competitor with better reasoning
2004
01:12:33,160 --> 01:12:35,920
or a smaller model with equivalent performance.
2005
01:12:35,920 --> 01:12:37,920
Your architecture must allow swapping models
2006
01:12:37,920 --> 01:12:39,560
without rebuilding the pipeline.
2007
01:12:39,560 --> 01:12:41,200
The rack pattern is model agnostic.
2008
01:12:41,200 --> 01:12:44,320
The Vector database doesn't care which LLM generates the answer.
2009
01:12:44,320 --> 01:12:45,720
The chunking logic doesn't care.
2010
01:12:45,720 --> 01:12:47,040
The query interface doesn't care.
2011
01:12:47,040 --> 01:12:49,680
They care about the prompt format and the API endpoint,
2012
01:12:49,680 --> 01:12:53,240
standardized on the Olama API or an open AI compatible local API
2013
01:12:53,240 --> 01:12:56,640
like the one provided by VLLM or text generation inference.
2014
01:12:56,640 --> 01:13:00,160
These APIs accept a model name, messages array and parameters.
2015
01:13:00,160 --> 01:13:02,280
When a new model arrives, you pull it,
2016
01:13:02,280 --> 01:13:04,400
update the model name in your configuration,
2017
01:13:04,400 --> 01:13:05,760
and restart the service.
2018
01:13:05,760 --> 01:13:07,800
A/B testing is responsible model migration.
2019
01:13:07,800 --> 01:13:10,080
Maintain a test suite of representative queries
2020
01:13:10,080 --> 01:13:11,920
with expected answer characteristics.
2021
01:13:11,920 --> 01:13:14,720
Run the old model and the new model against the same queries.
2022
01:13:14,720 --> 01:13:18,120
Compare latency accuracy, citation quality and hallucination rate.
2023
01:13:18,120 --> 01:13:19,880
Only promote the new model to production
2024
01:13:19,880 --> 01:13:22,760
when it matches or exceeds the old model on your metrics.
2025
01:13:22,760 --> 01:13:24,160
This prevents regressions.
2026
01:13:24,160 --> 01:13:25,880
Embedding models also evolve.
2027
01:13:25,880 --> 01:13:29,360
A new sentence transformer might offer better accuracy for your domain.
2028
01:13:29,360 --> 01:13:33,120
But switching embedding models requires re-embedding your entire corpus.
2029
01:13:33,120 --> 01:13:36,200
The old vectors and new vectors exist in different semantic spaces.
2030
01:13:36,200 --> 01:13:37,200
You can't mix them.
2031
01:13:37,200 --> 01:13:39,200
Plan this as a scheduled maintenance task,
2032
01:13:39,200 --> 01:13:42,960
build the new index in parallel, validated against a test query set,
2033
01:13:42,960 --> 01:13:44,840
then swap the collection names automatically.
2034
01:13:44,840 --> 01:13:46,120
The downtime is seconds.
2035
01:13:46,120 --> 01:13:47,800
Model quantization improves over time.
2036
01:13:47,800 --> 01:13:51,640
A new quantization algorithm might reduce VRM usage with less quality loss.
2037
01:13:51,640 --> 01:13:53,520
Requantize your existing model weights.
2038
01:13:53,520 --> 01:13:55,600
Test the quantized model against your benchmark.
2039
01:13:55,600 --> 01:13:58,640
If it passes deploy, if it fails, keep the old quantization.
2040
01:13:58,640 --> 01:14:01,280
This is continuous improvement, not big bang replacement.
2041
01:14:01,280 --> 01:14:03,640
The Microsoft 365 ecosystem also evolves.
2042
01:14:03,640 --> 01:14:05,160
SharePoint APIs change.
2043
01:14:05,160 --> 01:14:07,080
Microsoft Graph adds new endpoints.
2044
01:14:07,080 --> 01:14:09,320
Enter ID updates, its authentication flows.
2045
01:14:09,320 --> 01:14:11,760
Your ingestion service must handle API versioning.
2046
01:14:11,760 --> 01:14:14,760
Use the API version parameter in SharePoint rest calls.
2047
01:14:14,760 --> 01:14:17,880
Subscribe to Microsoft Graph change notifications were supported
2048
01:14:17,880 --> 01:14:20,400
and monitor Microsoft's deprecation announcements.
2049
01:14:20,400 --> 01:14:24,680
An ingestion service that breaks because an API changed isn't a technology failure.
2050
01:14:24,680 --> 01:14:26,240
It's an operational failure.
2051
01:14:26,240 --> 01:14:29,880
Modularity means each component has a well-defined interface.
2052
01:14:29,880 --> 01:14:32,680
The ingestion service outputs text chunks with metadata.
2053
01:14:32,680 --> 01:14:35,320
The chunking engine consumes text and outputs chunks.
2054
01:14:35,320 --> 01:14:38,040
The embedding service consumes chunks and outputs vectors.
2055
01:14:38,040 --> 01:14:41,280
The vector database consumes vectors and outputs search results.
2056
01:14:41,280 --> 01:14:44,160
The LLM runtime consumes prompts and outputs text.
2057
01:14:44,160 --> 01:14:47,480
The query interface consumes user input and orchestrates the rest.
2058
01:14:47,480 --> 01:14:49,720
Change one component, the others stay the same.
2059
01:14:49,720 --> 01:14:52,440
This is how you future-proof, not by predicting the future.
2060
01:14:52,440 --> 01:14:55,120
But by building interfaces that don't care about the future,
2061
01:14:55,120 --> 01:14:57,680
let me talk about testing and continuous integration
2062
01:14:57,680 --> 01:14:59,960
because a production rag system without tests
2063
01:14:59,960 --> 01:15:01,800
is a liability waiting to happen.
2064
01:15:01,800 --> 01:15:03,960
Your test suite should cover three layers.
2065
01:15:03,960 --> 01:15:07,120
Unit tests for the chunking engine feed it a known document.
2066
01:15:07,120 --> 01:15:09,040
Verify the chunks match with headings.
2067
01:15:09,040 --> 01:15:10,600
Verify metadata is preserved.
2068
01:15:10,600 --> 01:15:12,600
Verify chunks size stays within bounds.
2069
01:15:12,600 --> 01:15:14,680
Integration tests for the ingestion pipeline
2070
01:15:14,680 --> 01:15:16,400
pointed at a test sharepoint library.
2071
01:15:16,400 --> 01:15:19,040
Verify it authenticates, downloads extracts, chunks,
2072
01:15:19,040 --> 01:15:20,720
embeds and stores correctly.
2073
01:15:20,720 --> 01:15:22,600
End-to-end tests for the query flow.
2074
01:15:22,600 --> 01:15:25,160
Submit a known question against a known document base.
2075
01:15:25,160 --> 01:15:27,480
Verify the answer contains expected content.
2076
01:15:27,480 --> 01:15:28,920
Verify citations.com.s.
2077
01:15:28,920 --> 01:15:32,920
Verify permission filtering excludes unauthorized content.
2078
01:15:32,920 --> 01:15:34,960
Automate these tests in your CI pipeline.
2079
01:15:34,960 --> 01:15:36,400
Run unit tests on every commit.
2080
01:15:36,400 --> 01:15:38,680
Run integration tests on every pull request.
2081
01:15:38,680 --> 01:15:41,800
Run end-to-end tests nightly against a staging environment
2082
01:15:41,800 --> 01:15:43,240
that mirrors production.
2083
01:15:43,240 --> 01:15:45,200
This catches regressions before they reach users.
2084
01:15:45,200 --> 01:15:47,600
It gives you confidence to upgrade components.
2085
01:15:47,600 --> 01:15:50,360
And it documents the expected behavior for new team members.
2086
01:15:50,360 --> 01:15:52,880
Deployment patterns for this architecture are straightforward.
2087
01:15:52,880 --> 01:15:55,480
The ingestion service, chunking engine, embedding model,
2088
01:15:55,480 --> 01:15:57,760
and vector database are all containerized.
2089
01:15:57,760 --> 01:15:59,520
You deploy them with Docker compose
2090
01:15:59,520 --> 01:16:02,640
for simple setups or Kubernetes for complex ones.
2091
01:16:02,640 --> 01:16:06,880
The LLM runtime runs on bare metal or in a GPU enabled container.
2092
01:16:06,880 --> 01:16:09,360
The query interface is a standard web application
2093
01:16:09,360 --> 01:16:11,840
deployed behind your corporate reverse proxy.
2094
01:16:11,840 --> 01:16:13,840
Blue-green deployment minimizes downtime.
2095
01:16:13,840 --> 01:16:15,800
You stand up a new version of the query interface
2096
01:16:15,800 --> 01:16:16,920
alongside the old.
2097
01:16:16,920 --> 01:16:19,200
You root 10% of traffic to the new version.
2098
01:16:19,200 --> 01:16:21,040
You monitor error rates and latency.
2099
01:16:21,040 --> 01:16:23,680
If metrics look good, you root 100%.
2100
01:16:23,680 --> 01:16:26,640
If metrics degrade, you root back to the old version.
2101
01:16:26,640 --> 01:16:28,440
This pattern works for the query interface,
2102
01:16:28,440 --> 01:16:30,920
the ingestion service, and the vector database.
2103
01:16:30,920 --> 01:16:34,040
It doesn't work for the LLM runtime if you only have one GPU.
2104
01:16:34,040 --> 01:16:36,640
In that case, deploy during maintenance windows.
2105
01:16:36,640 --> 01:16:38,680
Database migrations for QDrand are minimal.
2106
01:16:38,680 --> 01:16:40,280
Collections are created once.
2107
01:16:40,280 --> 01:16:43,040
Vectors are inserted and updated by the ingestion service.
2108
01:16:43,040 --> 01:16:45,960
You don't run schema migrations in the traditional sense.
2109
01:16:45,960 --> 01:16:47,680
But if you change the embedding model,
2110
01:16:47,680 --> 01:16:49,040
you must rebuild the collection,
2111
01:16:49,040 --> 01:16:51,600
create a new collection with the new vector dimensions,
2112
01:16:51,600 --> 01:16:54,600
reindex all documents into it, validate query accuracy
2113
01:16:54,600 --> 01:16:56,000
against your test suite.
2114
01:16:56,000 --> 01:16:57,920
Then, atomically swap the collection names.
2115
01:16:57,920 --> 01:16:59,200
The downtime is seconds.
2116
01:16:59,200 --> 01:17:01,200
Model version control is worth implementing.
2117
01:17:01,200 --> 01:17:03,400
Store model weights in a local artifact repository
2118
01:17:03,400 --> 01:17:05,280
or on network attached storage.
2119
01:17:05,280 --> 01:17:08,720
Tag each model with version, quantization level, and deployment date.
2120
01:17:08,720 --> 01:17:11,960
When you upgrade, keep the previous version available for rollback.
2121
01:17:11,960 --> 01:17:14,200
A new model that performs worse on your test suite
2122
01:17:14,200 --> 01:17:15,960
can be rolled back in minutes by pointing
2123
01:17:15,960 --> 01:17:17,600
or lamer at the previous weights.
2124
01:17:17,600 --> 01:17:19,360
Let us put this all together.
2125
01:17:19,360 --> 01:17:21,240
The complete architecture blueprint.
2126
01:17:21,240 --> 01:17:23,080
The full pipeline has seven layers.
2127
01:17:23,080 --> 01:17:24,720
Each layer runs inside your perimeter.
2128
01:17:24,720 --> 01:17:26,320
Each layer has a specific job.
2129
01:17:26,320 --> 01:17:27,880
And each layer connects to the next
2130
01:17:27,880 --> 01:17:29,280
through a well-defined interface.
2131
01:17:29,280 --> 01:17:32,080
Layer one is SharePoint Online or SharePoint on premises.
2132
01:17:32,080 --> 01:17:33,320
This is the storage tier.
2133
01:17:33,320 --> 01:17:35,960
It holds your documents, enforces access controls,
2134
01:17:35,960 --> 01:17:38,760
manages versions, and applies retention policies.
2135
01:17:38,760 --> 01:17:39,920
It's the source of truth.
2136
01:17:39,920 --> 01:17:42,960
Nothing in the AI layer overrides SharePoint governance.
2137
01:17:42,960 --> 01:17:45,160
Layer two is the ingestion service.
2138
01:17:45,160 --> 01:17:48,600
It authenticates via Microsoft EntraID using OAuth 2.0.
2139
01:17:48,600 --> 01:17:50,920
It enumerates document libraries using the SharePoint
2140
01:17:50,920 --> 01:17:53,160
REST API or Microsoft Graph.
2141
01:17:53,160 --> 01:17:56,320
It extracts text from Word documents, PDFs, Excel sheets,
2142
01:17:56,320 --> 01:17:57,440
and PowerPoint text.
2143
01:17:57,440 --> 01:17:59,360
It detects changes using Delta Sync.
2144
01:17:59,360 --> 01:18:02,280
And it outputs clean text chunks with structural metadata.
2145
01:18:02,280 --> 01:18:04,080
Layer three is the chunking engine.
2146
01:18:04,080 --> 01:18:05,280
It detects document type.
2147
01:18:05,280 --> 01:18:07,280
It applies heading aware chunking for Word.
2148
01:18:07,280 --> 01:18:09,000
Page aware chunking for PDFs.
2149
01:18:09,000 --> 01:18:10,560
Row group chunking for Excel.
2150
01:18:10,560 --> 01:18:12,280
Slide level chunking for PowerPoint.
2151
01:18:12,280 --> 01:18:14,120
It preserves metadata at every step.
2152
01:18:14,120 --> 01:18:15,200
Source URL.
2153
01:18:15,200 --> 01:18:16,320
Document title.
2154
01:18:16,320 --> 01:18:17,080
Author.
2155
01:18:17,080 --> 01:18:18,280
Last modified date.
2156
01:18:18,280 --> 01:18:19,280
Permission level.
2157
01:18:19,280 --> 01:18:21,480
It outputs chunks ready for embedding.
2158
01:18:21,480 --> 01:18:23,440
Layer four is the local embedding model.
2159
01:18:23,440 --> 01:18:26,880
It runs a sentence transformer like all mini-LML6V2
2160
01:18:26,880 --> 01:18:29,960
or BGE large N on your local GPU or CPU.
2161
01:18:29,960 --> 01:18:31,320
It processes chunks and batches.
2162
01:18:31,320 --> 01:18:33,880
It converts each chunk into a dense vector.
2163
01:18:33,880 --> 01:18:35,960
And it outputs vectors with metadata payloads.
2164
01:18:35,960 --> 01:18:38,680
Layer five is the vector database, QDrand or VV8
2165
01:18:38,680 --> 01:18:40,760
running on your local network via Docker.
2166
01:18:40,760 --> 01:18:44,120
It stores vectors in a collection with HNSW indexing.
2167
01:18:44,120 --> 01:18:46,000
It supports metadata filtering for permission
2168
01:18:46,000 --> 01:18:46,920
or wear retrieval.
2169
01:18:46,920 --> 01:18:48,640
It handles point updates for Delta Sync.
2170
01:18:48,640 --> 01:18:52,200
And it returns the top-k nearest neighbors in under 100 milliseconds.
2171
01:18:52,200 --> 01:18:54,640
Layer six is the local YAMA runtime.
2172
01:18:54,640 --> 01:18:57,480
Olamar serving a quantized YAMA three or LAMA four model
2173
01:18:57,480 --> 01:18:58,600
on your GPU server.
2174
01:18:58,600 --> 01:19:00,320
It exposes a local REST API.
2175
01:19:00,320 --> 01:19:02,520
It receives prompts containing system instructions,
2176
01:19:02,520 --> 01:19:04,240
retrieve chunks and user questions.
2177
01:19:04,240 --> 01:19:07,640
It generates answers with low temperature for factual grounding.
2178
01:19:07,640 --> 01:19:10,200
And it streams responses back to the query interface.
2179
01:19:10,200 --> 01:19:11,840
Layer seven is the query interface.
2180
01:19:11,840 --> 01:19:13,360
A minimalist web application
2181
01:19:13,360 --> 01:19:15,800
authenticated through Microsoft Enter ID.
2182
01:19:15,800 --> 01:19:17,240
It embeds user questions.
2183
01:19:17,240 --> 01:19:19,880
It queries the vector database with permission filtering.
2184
01:19:19,880 --> 01:19:20,880
It constructs prompts.
2185
01:19:20,880 --> 01:19:21,960
It calls Olamar.
2186
01:19:21,960 --> 01:19:24,520
It displays generated answers with clickable citations
2187
01:19:24,520 --> 01:19:25,840
back to SharePoint documents.
2188
01:19:25,840 --> 01:19:27,200
It logs every interaction.
2189
01:19:27,200 --> 01:19:29,520
And it never exposes data to external APIs.
2190
01:19:29,520 --> 01:19:30,600
And that's the architecture.
2191
01:19:30,600 --> 01:19:31,520
Seven layers.
2192
01:19:31,520 --> 01:19:33,400
All local, all under your control.
2193
01:19:33,400 --> 01:19:35,080
Now let me address common concerns.
2194
01:19:35,080 --> 01:19:39,360
If the SharePoint API changes your ingestion service uses versioned API calls.
2195
01:19:39,360 --> 01:19:42,360
You test against preview versions in a staging environment.
2196
01:19:42,360 --> 01:19:45,280
You migrate to new versions on your schedule, not Microsofts.
2197
01:19:45,280 --> 01:19:48,680
If the model hallucinates despite rag, you lower the temperature.
2198
01:19:48,680 --> 01:19:49,920
You tighten the system prompt.
2199
01:19:49,920 --> 01:19:52,320
You add output validation in the query interface.
2200
01:19:52,320 --> 01:19:55,600
You implement human feedback loops where users flag bad answers.
2201
01:19:55,600 --> 01:19:57,080
And you monitor logs for patterns.
2202
01:19:57,080 --> 01:19:58,760
Illucination isn't eliminated.
2203
01:19:58,760 --> 01:19:59,720
It's managed.
2204
01:19:59,720 --> 01:20:01,480
If a user tries prompt injection, they
2205
01:20:01,480 --> 01:20:04,920
craft a query designed to make the model ignore its instructions.
2206
01:20:04,920 --> 01:20:06,160
You sanitize inputs.
2207
01:20:06,160 --> 01:20:09,920
You enforce the system prompt at the API level, not just in the application.
2208
01:20:09,920 --> 01:20:13,920
You validate that retrieved chunks match the query semantically before including them.
2209
01:20:13,920 --> 01:20:15,360
And you log suspicious patterns.
2210
01:20:15,360 --> 01:20:18,320
Prompt injection is an attack vector for any LLM system.
2211
01:20:18,320 --> 01:20:20,160
Local deployment doesn't eliminate it.
2212
01:20:20,160 --> 01:20:22,360
But it contains the blast radius to your perimeter.
2213
01:20:22,360 --> 01:20:24,880
If hardware fails, you run Q-drand with replication.
2214
01:20:24,880 --> 01:20:26,760
You snapshot the vector index nightly.
2215
01:20:26,760 --> 01:20:28,720
You keep model weights on network storage.
2216
01:20:28,720 --> 01:20:30,680
And you document the recovery procedure.
2217
01:20:30,680 --> 01:20:33,000
Local infrastructure requires operational discipline.
2218
01:20:33,000 --> 01:20:34,360
The reward is control.
2219
01:20:34,360 --> 01:20:36,400
If a user asks a question in German or French,
2220
01:20:36,400 --> 01:20:40,520
while your documents are in English, multilingual embedding models like multilingual E5
2221
01:20:40,520 --> 01:20:41,960
large handle this scenario.
2222
01:20:41,960 --> 01:20:46,040
They map semantically equivalent sentences in different languages to nearby vectors.
2223
01:20:46,040 --> 01:20:50,200
A user asking for Urlobs Richtlinien in German retrieves the English vacation policy chunk
2224
01:20:50,200 --> 01:20:52,520
because the embeddings are close in vector space.
2225
01:20:52,520 --> 01:20:55,040
The LLM then generates the answer in the user's language.
2226
01:20:55,040 --> 01:20:57,600
This isn't machine translation in the traditional sense.
2227
01:20:57,600 --> 01:21:00,840
It's cross-lingual retrieval followed by monolingual generation.
2228
01:21:00,840 --> 01:21:03,680
And it works surprisingly well with modern multilingual models.
2229
01:21:03,680 --> 01:21:07,360
If a document contains a table that the extraction process fails to pass,
2230
01:21:07,360 --> 01:21:09,440
the chunk contains garbled text.
2231
01:21:09,440 --> 01:21:11,120
The embedding represents noise.
2232
01:21:11,120 --> 01:21:15,680
When a user asks about the table content, the retrieval engine returns the noisy chunk.
2233
01:21:15,680 --> 01:21:19,120
The LLM generates an answer based on partial or incorrect information.
2234
01:21:19,120 --> 01:21:22,160
This is a data quality problem, not a model problem.
2235
01:21:22,160 --> 01:21:24,760
The fix is better extraction, not better prompting.
2236
01:21:24,760 --> 01:21:28,200
Invest in PDF table extractors like Camelot or Tabular Pi.
2237
01:21:28,200 --> 01:21:30,280
Test them on your actual document corpus.
2238
01:21:30,280 --> 01:21:34,480
And fall back to manual review for documents that automated extraction can't handle.
2239
01:21:34,480 --> 01:21:38,480
If the LLM refuses to answer a question because the system prompt is too restrictive,
2240
01:21:38,480 --> 01:21:42,200
this happens when users ask about topics that are adjacent to sensitive areas.
2241
01:21:42,200 --> 01:21:44,400
A user asks about employee benefits.
2242
01:21:44,400 --> 01:21:46,760
The system prompt says only use provided context.
2243
01:21:46,760 --> 01:21:49,000
The retrieved chunks contain benefits information.
2244
01:21:49,000 --> 01:21:53,800
But the LLM interprets the question as potentially asking about other employees and refuses.
2245
01:21:53,800 --> 01:21:55,240
This is over refusal.
2246
01:21:55,240 --> 01:21:57,480
The fix is to tune the system prompt carefully.
2247
01:21:57,480 --> 01:22:00,320
Allow answers that are clearly supported by the context.
2248
01:22:00,320 --> 01:22:03,480
Only refuse when the context genuinely doesn't contain the answer.
2249
01:22:03,480 --> 01:22:08,240
And monitor refusal rates, a system that refuses 50% of queries isn't useful.
2250
01:22:08,240 --> 01:22:14,400
If a document contains outdated information, a policy from 2024 might be superseded in 2025.
2251
01:22:14,400 --> 01:22:18,680
Both versions exist in SharePoint because the old version is retained for legal reasons.
2252
01:22:18,680 --> 01:22:20,360
The ingestion service indexes both.
2253
01:22:20,360 --> 01:22:21,880
The retrieval engine returns both.
2254
01:22:21,880 --> 01:22:25,760
The LLM synthesizes an answer that mixes old and new rules.
2255
01:22:25,760 --> 01:22:27,280
This is a version control problem.
2256
01:22:27,280 --> 01:22:28,480
The fix is metadata.
2257
01:22:28,480 --> 01:22:31,720
Tag every chunk with effective date and superseded status.
2258
01:22:31,720 --> 01:22:36,440
Filter out superseded documents at query time unless the user explicitly asks for historical
2259
01:22:36,440 --> 01:22:40,040
versions and train users to check the last modified date incitations.
2260
01:22:40,040 --> 01:22:44,560
Let me walk through a complete query from start to finish so you see how the layers interact.
2261
01:22:44,560 --> 01:22:47,280
Sarah, a project manager, opens the query interface.
2262
01:22:47,280 --> 01:22:51,920
She types what is the approval process for vendor contracts over $50,000?
2263
01:22:51,920 --> 01:22:55,920
The interface authenticates her via enter ID and determines she belongs to the manager's
2264
01:22:55,920 --> 01:22:58,200
group, giving her a manager permission tier.
2265
01:22:58,200 --> 01:23:00,720
The interface sends her question to the local embedding model.
2266
01:23:00,720 --> 01:23:03,960
The model converts it into a 384-dimensional vector.
2267
01:23:03,960 --> 01:23:07,440
The interface sends this vector to Q-drand with a filter for permission tier less than
2268
01:23:07,440 --> 01:23:08,680
or equal to manager.
2269
01:23:08,680 --> 01:23:13,320
Q-drand searches 200,000 vectors and returns the top five matches in 80 milliseconds.
2270
01:23:13,320 --> 01:23:17,640
The matches include chunks from the procurement policy, the finance handbook and the delegation
2271
01:23:17,640 --> 01:23:18,840
of authority document.
2272
01:23:18,840 --> 01:23:20,640
The interface constructs a prompt.
2273
01:23:20,640 --> 01:23:24,480
System, your unknowledgeable assistant that answers based on provided context, use only
2274
01:23:24,480 --> 01:23:26,120
the information in the context.
2275
01:23:26,120 --> 01:23:29,120
Side sources, context chunk one from procurement policy.
2276
01:23:29,120 --> 01:23:34,600
Doc X contracts exceeding $50,000 require procurement team review and CFO approval.
2277
01:23:34,600 --> 01:23:36,800
Context chunk two from finance handbook.
2278
01:23:36,800 --> 01:23:41,320
Doc X, vendor selection must follow the three-bit process documented in section four,
2279
01:23:41,320 --> 01:23:43,680
chunk three from delegation of authority.
2280
01:23:43,680 --> 01:23:48,600
Doc X, the CFO retains approval authority for all contracts above the departmental threshold.
2281
01:23:48,600 --> 01:23:52,920
User question, what is the approval process for vendor contracts over $50,000?
2282
01:23:52,920 --> 01:23:56,800
The interface sends this prompt to Olama running Lama 370BQ4.
2283
01:23:56,800 --> 01:23:58,840
Olama generates the answer in four seconds.
2284
01:23:58,840 --> 01:24:03,880
For vendor contracts over $50,000, the procurement team must first conduct a three-bit process.
2285
01:24:03,880 --> 01:24:05,800
The results are reviewed by procurement.
2286
01:24:05,800 --> 01:24:09,600
Finally, approval requires the CFO as per the delegation of authority policy.
2287
01:24:09,600 --> 01:24:14,560
Resources, procurement policy.doc X, finance handbook.doc X, delegation of authority.doc X,
2288
01:24:14,560 --> 01:24:17,280
the interface displays the answer with clickable citations.
2289
01:24:17,280 --> 01:24:21,680
Sarah clicks the procurement policy citation and opens the document in SharePoint.
2290
01:24:21,680 --> 01:24:23,800
Total time from question to answer five seconds.
2291
01:24:23,800 --> 01:24:25,560
Total outbound data packets zero.
2292
01:24:25,560 --> 01:24:27,360
This isn't a theoretical architecture.
2293
01:24:27,360 --> 01:24:28,720
It's a production pattern.
2294
01:24:28,720 --> 01:24:29,720
Seven layers.
2295
01:24:29,720 --> 01:24:30,720
Each layer has one job.
2296
01:24:30,720 --> 01:24:33,560
Each layer passes structured data to the next.
2297
01:24:33,560 --> 01:24:36,360
And the entire pipeline stays inside your perimeter.
2298
01:24:36,360 --> 01:24:38,080
This architecture isn't a product you buy.
2299
01:24:38,080 --> 01:24:39,080
It's a stance you take.
2300
01:24:39,080 --> 01:24:41,720
A stance says your data is too useful to delegate.
2301
01:24:41,720 --> 01:24:44,080
Your governance is too specific to outsource.
2302
01:24:44,080 --> 01:24:47,560
And your AI capabilities should serve your perimeter not someone else's.
2303
01:24:47,560 --> 01:24:49,880
That's sovereign intelligence, not sovereign cloud.
2304
01:24:49,880 --> 01:24:52,200
Sovereignty isn't about rejecting AI.
2305
01:24:52,200 --> 01:24:56,280
It's about rejecting the assumption that intelligence must live in someone else's cloud.
2306
01:24:56,280 --> 01:24:59,720
You now have the blueprint to turn your SharePoint into a private brain.
2307
01:24:59,720 --> 01:25:00,720
Your data stays local.
2308
01:25:00,720 --> 01:25:01,880
Your model stays local.
2309
01:25:01,880 --> 01:25:04,720
And your answers stay grounded in documents you already own.
2310
01:25:04,720 --> 01:25:06,280
Share this with your security team.
2311
01:25:06,280 --> 01:25:08,380
for more architecture that respects your data.









