Your SharePoint Data is a Liability: Fixing the Metadata Gap


SharePoint has become the backbone of information management for countless organizations, storing everything from contracts and policies to invoices, project documentation, and business-critical records. Yet beneath the surface of many Microsoft 365 environments lies a hidden problem that continues to grow with every uploaded file. The issue is not storage capacity, search performance, or even user adoption. The real problem is the metadata gap.
In this episode, we explore why poorly classified and unstructured SharePoint content has become one of the biggest obstacles to productivity, governance, compliance, and AI readiness. We examine how organizations unknowingly create massive information liabilities when documents lack proper metadata and why this challenge becomes even more critical as Microsoft 365 Copilot and AI-powered experiences become embedded into everyday work.
WHY SHAREPOINT DATA BECOMES A LIABILITY
Many organizations continue to organize content using folder structures designed for a very different era of work. While folders may seem familiar, they fail to provide the context modern businesses need to locate, govern, and automate information effectively.
When files lack meaningful metadata, organizations face challenges such as:
- Poor search relevance and content discoverability
- Duplicate documents and inconsistent versions
- Increased compliance and audit risks
- Reduced effectiveness of Microsoft 365 Copilot
THE CRITICAL ROLE OF METADATA
Metadata is far more than simply data about data. It provides the context that allows systems and people to understand, classify, govern, and act upon information. Proper metadata enables organizations to transform document repositories into intelligent knowledge platforms.
During this conversation, we discuss how metadata supports:
- Enterprise search and content discovery
- Records management and retention policies
- Compliance and eDiscovery requirements
- AI-powered content retrieval and automation
COPILOT READINESS STARTS WITH CONTENT QUALITY
Many organizations assume that deploying Microsoft 365 Copilot automatically unlocks the value of their knowledge estate. In reality, AI systems are only as effective as the data they consume.
We explore how missing metadata directly impacts semantic search, retrieval-augmented generation, document grounding, and AI-generated responses. Listeners will learn why poor information architecture creates inconsistent Copilot experiences and how metadata quality influences trust in AI-generated answers.
INTELLIGENT DOCUMENT PROCESSING EXPLAINED
Modern AI technologies make it possible to automatically classify documents, extract business information, and populate metadata at scale. Intelligent Document Processing combines OCR, machine learning, natural language processing, and AI-powered classification to turn unstructured content into structured business assets.
Topics include:
- Structured versus unstructured documents
- Entity extraction and document classification
- Automated metadata generation
- Business process automation through AI
THE EVOLUTION OF MICROSOFT SYNTEX AND SHAREPOINT PREMIUM
Microsoft's content AI journey has undergone multiple transformations over the past several years. From Project Cortex to SharePoint Syntex, Microsoft Syntex, SharePoint Premium, and now Document Processing for Microsoft 365, the platform continues to evolve.
In this episode, we break down:
- The history of Microsoft's content AI platform
- Current licensing and service positioning
- Microsoft's strategic investments for the future
- What existing Syntex customers should know
- Designing a scalable metadata taxonomy
- Selecting training documents
- Creating entity extractors
- Measuring model accuracy
- Deploying models into production environments
AI AGENTS, SKILLS, AND THE FUTURE OF SHAREPOINT
The latest generation of SharePoint AI capabilities introduces agents, skills, autofill columns, and conversational automation experiences. These technologies dramatically lower the barrier to implementing content intelligence while introducing new governance considerations.
Listeners will learn how AI agents can:
- Automate metadata enrichment
- Improve content quality
- Create workflows using natural language
- Support knowledge discovery across Microsoft 365
FROM DOCUMENT REPOSITORY TO KNOWLEDGE PLATFORM
The ultimate goal is not simply better metadata. The goal is transforming SharePoint from a passive file repository into an active business system that supports decision-making, compliance, automation, and AI-driven productivity.
Organizations that successfully close the metadata gap gain significant advantages in search, governance, security, compliance, and AI readiness. They can answer business questions faster, automate repetitive processes, reduce operational risk, and unlock the full value of their Microsoft 365 investments.
FINAL THOUGHTS
Your SharePoint environment may appear organized on the surface, but without consistent metadata, it remains vulnerable to inefficiency, compliance challenges, and AI performance limitations. As Microsoft continues integrating AI into every aspect of the digital workplace, metadata is becoming the foundation that determines success or failure.
If your organization is planning a Copilot rollout, reviewing governance strategies, modernizing information management practices, or exploring intelligent document processing, this episode provides practical guidance and real-world insights into closing the metadata gap and preparing your content for the AI era.
Tune in to learn why your SharePoint data may already be a liability, and what you can do today to transform it into a strategic asset.
Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.
1
00:00:00,000 --> 00:00:02,560
You've been organizing SharePoint like it's 2015,
2
00:00:02,560 --> 00:00:05,480
and that filing logic is about to become a liability,
3
00:00:05,480 --> 00:00:07,400
but in reality, it's not a storage problem.
4
00:00:07,400 --> 00:00:09,720
It's a structural failure that kills search,
5
00:00:09,720 --> 00:00:13,120
breaks compliance, and undermines every Copilot rollout.
6
00:00:13,120 --> 00:00:15,640
Industry audits consistently find the same pattern.
7
00:00:15,640 --> 00:00:17,360
Organizations only discover this gap
8
00:00:17,360 --> 00:00:19,920
after a failed audit or a Copilot hallucination.
9
00:00:19,920 --> 00:00:22,880
Custom AI document models close that gap automatically,
10
00:00:22,880 --> 00:00:24,960
and in 2026, the tool that does it
11
00:00:24,960 --> 00:00:27,840
just got renamed for the fourth time in six years.
12
00:00:27,840 --> 00:00:29,720
Your SharePoint data is a liability.
13
00:00:29,720 --> 00:00:31,760
Your IT team probably thinks your SharePoint estate
14
00:00:31,760 --> 00:00:35,440
has a storage problem: more files, more sites, more noise.
15
00:00:35,440 --> 00:00:37,480
But that diagnosis misses the actual disease.
16
00:00:37,480 --> 00:00:39,200
The real problem is that most of those files
17
00:00:39,200 --> 00:00:41,000
carry no business metadata at all.
18
00:00:41,000 --> 00:00:43,280
They're unstructured, untagged, and invisible
19
00:00:43,280 --> 00:00:45,480
to every system that needs to act on them.
20
00:00:45,480 --> 00:00:47,880
Organizations create over two million SharePoint sites
21
00:00:47,880 --> 00:00:49,040
every single day.
22
00:00:49,040 --> 00:00:51,120
They upload more than two billion files daily,
23
00:00:51,120 --> 00:00:53,160
and most of those uploads land with nothing more
24
00:00:53,160 --> 00:00:54,960
than a file name and a timestamp.
25
00:00:54,960 --> 00:00:57,120
No contract type, no expiration date,
26
00:00:57,120 --> 00:00:59,800
no sensitivity label, no retention category,
27
00:00:59,800 --> 00:01:01,600
just a document sitting in a folder,
28
00:01:01,600 --> 00:01:04,360
hoping a human remembers what it is and where it belongs.
29
00:01:04,360 --> 00:01:05,280
That isn't messy.
30
00:01:05,280 --> 00:01:07,360
That's a liability hiding in plain sight.
31
00:01:07,360 --> 00:01:09,080
When documents lack consistent metadata,
32
00:01:09,080 --> 00:01:11,800
employees spend hours hunting for the latest approved version
33
00:01:11,800 --> 00:01:14,680
or recreating information they cannot find.
34
00:01:14,680 --> 00:01:16,080
Analysts have measured this friction
35
00:01:16,080 --> 00:01:17,880
as a direct productivity loss.
36
00:01:17,880 --> 00:01:19,440
In distributed organizations,
37
00:01:19,440 --> 00:01:21,400
multiple teams maintain parallel versions
38
00:01:21,400 --> 00:01:23,280
of the same materials in siloed sites
39
00:01:23,280 --> 00:01:25,000
or personal OneDrive accounts.
40
00:01:25,000 --> 00:01:26,720
The same contract lives in three places,
41
00:01:26,720 --> 00:01:27,960
each with different edits,
42
00:01:27,960 --> 00:01:29,960
and nobody knows which one is authoritative.
43
00:01:29,960 --> 00:01:31,640
Somebody rewrites the work from scratch
44
00:01:31,640 --> 00:01:33,120
because they couldn't find the original.
45
00:01:33,120 --> 00:01:35,480
That's not laziness, that's a metadata failure.
46
00:01:35,480 --> 00:01:38,120
The folder-based mental model is the root cause.
47
00:01:38,120 --> 00:01:40,520
Folders assume that people know what they're looking for
48
00:01:40,520 --> 00:01:41,640
and where it lives.
49
00:01:41,640 --> 00:01:43,160
They assume a single hierarchy
50
00:01:43,160 --> 00:01:44,680
can represent every possible way
51
00:01:44,680 --> 00:01:46,360
a document might be discovered.
52
00:01:46,360 --> 00:01:48,520
But modern work doesn't start with navigation.
53
00:01:48,520 --> 00:01:49,760
It starts with context.
54
00:01:49,760 --> 00:01:51,720
A project manager needs every contract
55
00:01:51,720 --> 00:01:53,440
that expires in the next 90 days,
56
00:01:53,440 --> 00:01:55,000
regardless of which site stores it.
57
00:01:55,000 --> 00:01:57,200
A legal team needs every document containing
58
00:01:57,200 --> 00:01:59,920
a liability clause above a specific dollar threshold
59
00:01:59,920 --> 00:02:01,720
without opening each file manually.
60
00:02:01,720 --> 00:02:04,280
Folders can't answer those questions. Metadata can.
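
To make that concrete, here is a minimal Python sketch of the "contracts expiring in the next 90 days, regardless of site" question. Everything in it is an assumption for illustration: the `DocumentRecord` fields, the site paths, and the sample data are hypothetical, and a real estate would be queried through SharePoint search or a similar service rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DocumentRecord:
    """Hypothetical flattened view of one library item and its metadata."""
    site: str
    name: str
    content_type: Optional[str] = None
    expiration_date: Optional[date] = None


def expiring_contracts(docs, today, window_days=90):
    """Return contracts expiring within the window, regardless of site."""
    cutoff = today + timedelta(days=window_days)
    return [
        d for d in docs
        if d.content_type == "Contract"
        and d.expiration_date is not None
        and today <= d.expiration_date <= cutoff
    ]


docs = [
    DocumentRecord("/sites/legal", "acme-msa.pdf", "Contract", date(2026, 3, 1)),
    DocumentRecord("/sites/sales", "globex-sow.docx", "Contract", date(2026, 9, 30)),
    # No metadata at all: invisible to the query, wherever it is filed.
    DocumentRecord("/sites/legal/Contracts2024/Q2", "ContractFinalV2.docx"),
]

hits = expiring_contracts(docs, today=date(2026, 1, 15))
print([d.name for d in hits])
```

The untagged file is skipped even if it really is a contract about to expire, which is the point: the folder path tells the query nothing it can act on.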
61
00:02:04,280 --> 00:02:07,320
But the problem runs deeper than lost time and duplicated work.
62
00:02:07,320 --> 00:02:10,560
Microsoft 365 Copilot and other generative AI tools
63
00:02:10,560 --> 00:02:12,440
rely on underlying content signals
64
00:02:12,440 --> 00:02:14,720
to return relevant, trustworthy responses.
65
00:02:14,720 --> 00:02:16,440
When documents lack accurate metadata
66
00:02:16,440 --> 00:02:18,280
or live in haphazard locations,
67
00:02:18,280 --> 00:02:20,960
semantic search and AI-driven retrieval struggle
68
00:02:20,960 --> 00:02:23,160
to surface the right context.
69
00:02:23,160 --> 00:02:25,400
Every missing or incorrect metadata field
70
00:02:25,400 --> 00:02:27,560
degrades the signal that AI systems use
71
00:02:27,560 --> 00:02:29,000
to reason about your content.
72
00:02:29,000 --> 00:02:31,520
Your Copilot investment becomes inconsistent
73
00:02:31,520 --> 00:02:32,800
and difficult to trust,
74
00:02:32,800 --> 00:02:34,280
not because the model is broken
75
00:02:34,280 --> 00:02:35,800
but because the ground beneath it is.
76
00:02:35,800 --> 00:02:38,280
The metadata gap is also a genuine compliance
77
00:02:38,280 --> 00:02:39,360
and risk concern.
78
00:02:39,360 --> 00:02:41,800
Regulators and courts expect defensible retention,
79
00:02:41,800 --> 00:02:44,280
disposition, and eDiscovery practices.
80
00:02:44,280 --> 00:02:46,440
Retention policies and labels in Microsoft Purview
81
00:02:46,440 --> 00:02:48,600
are designed to help organizations retain content
82
00:02:48,600 --> 00:02:51,360
for required periods or delete it when it is no longer needed.
83
00:02:51,360 --> 00:02:53,560
But these controls rely on being able to identify
84
00:02:53,560 --> 00:02:56,160
which content falls into which category.
85
00:02:56,160 --> 00:02:57,640
When documents lack classification
86
00:02:57,640 --> 00:02:59,640
that differentiates records from working copies
87
00:02:59,640 --> 00:03:01,880
or regulated data from non-critical content,
88
00:03:01,880 --> 00:03:04,520
retention settings may be applied broadly or not at all.
89
00:03:04,520 --> 00:03:06,280
That creates both over-retention risk
90
00:03:06,280 --> 00:03:08,160
and premature deletion risk.
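
A small Python sketch of why classification matters here. The schedule, the category names, and the `retention_decision` function are hypothetical and are not how Microsoft Purview is configured; the sketch only shows that a rule keyed on a category has nothing to act on when the category is missing.

```python
from typing import Optional

# Hypothetical retention schedule: category -> years to retain.
RETENTION_YEARS = {"Contract": 7, "HR Record": 7, "Working Copy": 1}


def retention_decision(category: Optional[str], age_years: float) -> str:
    """Decide what a category-based rule could do with a document."""
    if category not in RETENTION_YEARS:
        # Without a category no specific rule applies: the document is either
        # kept indefinitely (over-retention) or caught by a broad default rule.
        return "unclassified"
    return "delete" if age_years >= RETENTION_YEARS[category] else "retain"


print(retention_decision("Contract", 3))
print(retention_decision("Working Copy", 2))
print(retention_decision(None, 10))
```

The third call is the metadata gap in miniature: a ten-year-old file that no targeted rule can keep or dispose of with confidence.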
91
00:03:08,160 --> 00:03:10,080
In eDiscovery and litigation holds,
92
00:03:10,080 --> 00:03:12,400
metadata becomes even more critical.
93
00:03:12,400 --> 00:03:15,200
Metadata provides creation and modification timestamps,
94
00:03:15,200 --> 00:03:17,600
authorship, file paths and message headers,
95
00:03:17,600 --> 00:03:19,840
all of which are needed to reconstruct events
96
00:03:19,840 --> 00:03:22,240
and demonstrate chain of custody.
97
00:03:22,240 --> 00:03:24,360
Best practices for defensible eDiscovery
98
00:03:24,360 --> 00:03:27,720
emphasize preserving metadata using forensic-grade tools,
99
00:03:27,720 --> 00:03:29,360
documenting the chain of custody
100
00:03:29,360 --> 00:03:32,440
and auditing metadata before and after processing.
101
00:03:32,440 --> 00:03:34,880
If SharePoint content spreads across uncontrolled sites,
102
00:03:34,880 --> 00:03:37,920
personal OneDrive locations, and shadow repositories,
103
00:03:37,920 --> 00:03:40,600
and if the metadata is inconsistent or incomplete,
104
00:03:40,600 --> 00:03:43,360
legal teams may struggle to identify, preserve
105
00:03:43,360 --> 00:03:46,520
and produce relevant materials in a defensible manner.
106
00:03:46,520 --> 00:03:49,240
Security and privacy risks are similarly amplified.
107
00:03:49,240 --> 00:03:52,280
SharePoint and OneDrive provide strong security capabilities,
108
00:03:52,280 --> 00:03:54,560
including encryption in transit and at rest,
109
00:03:54,560 --> 00:03:56,640
two-factor authentication, conditional access
110
00:03:56,640 --> 00:03:58,400
and data loss prevention policies.
111
00:03:58,400 --> 00:04:00,360
However, DLP rules, sensitivity labels
112
00:04:00,360 --> 00:04:02,720
and conditional access rely on understanding
113
00:04:02,720 --> 00:04:04,920
which documents contain sensitive information.
114
00:04:04,920 --> 00:04:06,480
Without reliable classification,
115
00:04:06,480 --> 00:04:09,160
organizations may either over-restrict access
116
00:04:09,160 --> 00:04:12,640
and hamper collaboration or under-protect sensitive documents
117
00:04:12,640 --> 00:04:14,560
and lead to accidental exposure.
118
00:04:14,560 --> 00:04:16,400
In an environment with weak metadata,
119
00:04:16,400 --> 00:04:18,760
even powerful tooling cannot reliably enforce
120
00:04:18,760 --> 00:04:20,120
the principle of least privilege.
121
00:04:20,120 --> 00:04:21,880
Hybrid work has fundamentally altered
122
00:04:21,880 --> 00:04:24,200
how teams create and share information
123
00:04:24,200 --> 00:04:26,480
and traditional folder hierarchies no longer align
124
00:04:26,480 --> 00:04:29,120
with how people work in Teams, SharePoint,
125
00:04:29,120 --> 00:04:31,240
and other collaborative tools.
126
00:04:31,240 --> 00:04:33,720
Modern collaboration centers on projects, channels
127
00:04:33,720 --> 00:04:36,320
or topics that cut across organizational structures,
128
00:04:36,320 --> 00:04:38,640
leading to documents stored in multiple contexts
129
00:04:38,640 --> 00:04:40,040
simultaneously.
130
00:04:40,040 --> 00:04:42,120
Users expect to find information through search,
131
00:04:42,120 --> 00:04:44,160
recommendations or AI assistance,
132
00:04:44,160 --> 00:04:46,960
rather than working through deeply nested folder trees.
133
00:04:46,960 --> 00:04:48,720
This shift exposes the limitations
134
00:04:48,720 --> 00:04:50,800
of purely location-based organization
135
00:04:50,800 --> 00:04:54,160
and shows why content must carry rich descriptive metadata
136
00:04:54,160 --> 00:04:56,640
that allows discovery and reuse regardless
137
00:04:56,640 --> 00:04:58,880
of where it physically resides.
138
00:04:58,880 --> 00:05:02,120
At the same time, AI is becoming embedded in everyday work
139
00:05:02,120 --> 00:05:05,600
via Copilot experiences in Microsoft 365,
140
00:05:05,600 --> 00:05:07,560
which use an intelligence layer sometimes
141
00:05:07,560 --> 00:05:10,000
described as Work IQ to connect individual
142
00:05:10,000 --> 00:05:11,640
and organizational knowledge.
143
00:05:11,640 --> 00:05:14,560
This intelligence layer leverages existing metadata,
144
00:05:14,560 --> 00:05:17,800
signals and user interactions to provide relevant suggestions,
145
00:05:17,800 --> 00:05:19,080
summaries and answers.
146
00:05:19,080 --> 00:05:21,240
If that underlying content is poorly governed
147
00:05:21,240 --> 00:05:23,920
with inconsistent labels or incomplete metadata,
148
00:05:23,920 --> 00:05:27,280
AI systems may surface obsolete or inappropriate documents,
149
00:05:27,280 --> 00:05:29,680
undermining user trust and potentially exposing
150
00:05:29,680 --> 00:05:30,880
sensitive information.
151
00:05:30,880 --> 00:05:32,720
The metadata gap becomes a barrier
152
00:05:32,720 --> 00:05:36,160
to realizing the promise of AI-accelerated hybrid work,
153
00:05:36,160 --> 00:05:39,240
reinforcing the argument that unmanaged SharePoint data
154
00:05:39,240 --> 00:05:41,280
represents not only a latent liability,
155
00:05:41,280 --> 00:05:43,040
but also an opportunity cost.
156
00:05:43,040 --> 00:05:45,200
And one level deeper, the problem isn't even the absence
157
00:05:45,200 --> 00:05:45,960
of metadata.
158
00:05:45,960 --> 00:05:47,800
It's the absence of a metadata strategy.
159
00:05:47,800 --> 00:05:49,720
Most SharePoint environments have columns.
160
00:05:49,720 --> 00:05:51,080
They have content types.
161
00:05:51,080 --> 00:05:53,720
They have term sets that somebody built three years ago
162
00:05:53,720 --> 00:05:54,760
and abandoned.
163
00:05:54,760 --> 00:05:57,160
What they don't have is a living, governed, business-aligned
164
00:05:57,160 --> 00:06:00,040
taxonomy that connects document content to automated action.
165
00:06:00,040 --> 00:06:03,400
Without that strategy, any AI tool you deploy will be guessing.
166
00:06:03,400 --> 00:06:05,240
And guessing at scale is expensive.
167
00:06:05,240 --> 00:06:07,960
Consider what happens when a new employee joins your legal team.
168
00:06:07,960 --> 00:06:10,760
They need to find every active contract with a specific vendor.
169
00:06:10,760 --> 00:06:12,760
They open SharePoint, go to the legal site
170
00:06:12,760 --> 00:06:15,400
and find a folder called Contracts2024.
171
00:06:15,400 --> 00:06:17,760
Inside that folder are subfolders for each quarter.
172
00:06:17,760 --> 00:06:21,440
Inside Q2 are files named ContractFinalV2.docx,
173
00:06:21,440 --> 00:06:25,280
ContractFinalV2_Mike.docx, and
174
00:06:25,280 --> 00:06:26,400
ContractFinalSigned.pdf.
175
00:06:26,400 --> 00:06:29,080
None of these files have metadata columns populated.
176
00:06:29,080 --> 00:06:32,640
The new employee opens each one, reads through pages of legal text,
177
00:06:32,640 --> 00:06:34,560
and manually compiles a list.
178
00:06:34,560 --> 00:06:36,200
This process takes four hours.
179
00:06:36,200 --> 00:06:37,440
It should take four minutes.
180
00:06:37,440 --> 00:06:40,080
The same scenario plays out in finance when month-end close
181
00:06:40,080 --> 00:06:43,000
requires reconciling invoices against purchase orders.
182
00:06:43,000 --> 00:06:45,640
The accounts payable team searches through email attachments,
183
00:06:45,640 --> 00:06:47,560
Teams chats, and personal OneDrive folders
184
00:06:47,560 --> 00:06:50,920
because the SharePoint library has no vendor name, no PO number,
185
00:06:50,920 --> 00:06:53,080
and no approval status in the metadata.
186
00:06:53,080 --> 00:06:55,680
They open each PDF, read the header, and copy numbers
187
00:06:55,680 --> 00:06:56,560
into a spreadsheet.
188
00:06:56,560 --> 00:06:58,920
The process is not just slow, it is fragile.
189
00:06:58,920 --> 00:07:01,280
One wrong copy-paste creates a reconciliation error
190
00:07:01,280 --> 00:07:03,000
that takes days to untangle.
191
00:07:03,000 --> 00:07:04,920
IT teams often respond to these complaints
192
00:07:04,920 --> 00:07:06,960
by adding more storage or faster search,
193
00:07:06,960 --> 00:07:08,920
but faster search of bad metadata still
194
00:07:08,920 --> 00:07:10,840
returns bad results.
195
00:07:10,840 --> 00:07:13,720
A search for "contract" in a 10-terabyte SharePoint estate
196
00:07:13,720 --> 00:07:16,640
returns 10,000 files, most of them irrelevant.
197
00:07:16,640 --> 00:07:19,680
The user adds more keywords, narrows the date range,
198
00:07:19,680 --> 00:07:22,280
and still cannot find the specific document they need.
199
00:07:22,280 --> 00:07:24,480
Search relevance depends on metadata density.
200
00:07:24,480 --> 00:07:27,960
Without metadata, search is just a faster way to browse garbage.
201
00:07:27,960 --> 00:07:29,440
The cost compounds over time.
202
00:07:29,440 --> 00:07:31,400
A contract that auto-renews because nobody
203
00:07:31,400 --> 00:07:33,640
tracked the expiration date costs real money.
204
00:07:33,640 --> 00:07:36,680
A sensitive document that leaks because it was never classified
205
00:07:36,680 --> 00:07:39,080
costs reputation and regulatory fines.
206
00:07:39,080 --> 00:07:41,360
A Copilot response that cites an obsolete policy
207
00:07:41,360 --> 00:07:44,560
because the current version lacked metadata costs user trust.
208
00:07:44,560 --> 00:07:45,880
These are not edge cases.
209
00:07:45,880 --> 00:07:48,200
They are the daily operating reality of organizations
210
00:07:48,200 --> 00:07:49,920
that treat SharePoint as a filing cabinet
211
00:07:49,920 --> 00:07:51,120
instead of a knowledge system.
212
00:07:51,120 --> 00:07:54,160
The opportunity cost is harder to measure, but just as real.
213
00:07:54,160 --> 00:07:55,920
Organizations that govern their metadata
214
00:07:55,920 --> 00:07:58,520
can query their document estate like a database.
215
00:07:58,520 --> 00:08:01,200
They can ask which vendors represent the highest liability
216
00:08:01,200 --> 00:08:03,680
exposure. They can ask which contracts expire
217
00:08:03,680 --> 00:08:06,280
in the next quarter and require renewal decisions.
218
00:08:06,280 --> 00:08:08,960
They can ask which policies have not been reviewed in two years
219
00:08:08,960 --> 00:08:10,120
and need updating.
220
00:08:10,120 --> 00:08:12,080
These questions transform document management
221
00:08:12,080 --> 00:08:14,760
from a cost center into a strategic capability.
222
00:08:14,760 --> 00:08:17,320
The metadata gap is what prevents most organizations
223
00:08:17,320 --> 00:08:19,000
from making that transformation.
224
00:08:19,000 --> 00:08:21,440
It keeps SharePoint in the filing cabinet era
225
00:08:21,440 --> 00:08:24,480
while competitors move into the intelligence era.
226
00:08:24,480 --> 00:08:25,840
The metadata foundation.
227
00:08:25,840 --> 00:08:27,520
Before you can fix the structure,
228
00:08:27,520 --> 00:08:30,000
you need to understand what the structure actually is.
229
00:08:30,000 --> 00:08:32,800
Most people define metadata as data about data.
230
00:08:32,800 --> 00:08:35,760
That definition is technically accurate and practically useless.
231
00:08:35,760 --> 00:08:37,680
For governance and automation purposes,
232
00:08:37,680 --> 00:08:40,520
metadata is better understood as the structured descriptors
233
00:08:40,520 --> 00:08:42,600
that let systems and people understand,
234
00:08:42,600 --> 00:08:44,960
organize, and act on information.
235
00:08:44,960 --> 00:08:47,880
For a document, this includes simple attributes like title,
236
00:08:47,880 --> 00:08:49,200
author, and creation date,
237
00:08:49,200 --> 00:08:51,880
as well as business-specific properties like contract type,
238
00:08:51,880 --> 00:08:55,680
client name, region, sensitivity level, or retention category.
239
00:08:55,680 --> 00:08:58,120
Metadata provides the context that allows documents
240
00:08:58,120 --> 00:09:00,440
to be grouped, searched, filtered, and processed
241
00:09:00,440 --> 00:09:02,640
in a consistent way, especially when stored at scale
242
00:09:02,640 --> 00:09:04,480
in repositories like SharePoint.
243
00:09:04,480 --> 00:09:07,920
Without a coherent metadata strategy, content becomes opaque.
244
00:09:07,920 --> 00:09:10,120
Even basic questions become impossible to answer
245
00:09:10,120 --> 00:09:11,440
without manual review.
246
00:09:11,440 --> 00:09:13,800
Which contracts expire in the next 90 days?
247
00:09:13,800 --> 00:09:16,920
Which proposals contain governing law clauses for California?
248
00:09:16,920 --> 00:09:19,880
Which HR files are subject to a seven-year retention rule?
249
00:09:19,880 --> 00:09:22,680
Folders can't answer those questions. Only metadata can.
250
00:09:22,680 --> 00:09:24,560
There are many ways to classify metadata,
251
00:09:24,560 --> 00:09:27,120
but one useful distinction is between technical metadata
252
00:09:27,120 --> 00:09:28,800
and business metadata.
253
00:09:28,800 --> 00:09:31,320
Technical metadata describes system-level properties
254
00:09:31,320 --> 00:09:32,720
like file size or encoding,
255
00:09:32,720 --> 00:09:35,200
and it is often generated automatically by systems.
256
00:09:35,200 --> 00:09:36,920
Business metadata captures meaning
257
00:09:36,920 --> 00:09:38,880
from the perspective of business processes
258
00:09:38,880 --> 00:09:40,400
and regulatory obligations,
259
00:09:40,400 --> 00:09:43,240
and it typically requires explicit design and governance,
260
00:09:43,240 --> 00:09:45,400
including agreed vocabularies and rules for how
261
00:09:45,400 --> 00:09:47,320
and when to apply labels.
262
00:09:47,320 --> 00:09:49,120
Standard schemas, such as Dublin Core,
263
00:09:49,120 --> 00:09:51,760
offer generic elements that can serve as a starting point,
264
00:09:51,760 --> 00:09:54,320
but organizations usually need custom schemas
265
00:09:54,320 --> 00:09:57,040
that reflect their specific use cases, industries,
266
00:09:57,040 --> 00:09:59,120
and regulatory requirements.
267
00:09:59,120 --> 00:10:02,440
This is where SharePoint's managed metadata infrastructure
268
00:10:02,440 --> 00:10:05,160
and intelligent document processing create real value.
269
00:10:05,160 --> 00:10:08,600
SharePoint has long provided a rich set of metadata mechanisms,
270
00:10:08,600 --> 00:10:10,800
including site columns, content types,
271
00:10:10,800 --> 00:10:12,520
and the managed metadata service,
272
00:10:12,520 --> 00:10:14,240
which allows organizations to define
273
00:10:14,240 --> 00:10:17,080
and centrally manage vocabularies of terms.
274
00:10:17,080 --> 00:10:19,440
Managed metadata enables the creation of term sets
275
00:10:19,440 --> 00:10:21,320
that can be shared across site collections,
276
00:10:21,320 --> 00:10:24,200
providing consistent choices for fields like department,
277
00:10:24,200 --> 00:10:26,640
product line, or document category.
278
00:10:26,640 --> 00:10:29,320
This shared taxonomy not only improves user experience
279
00:10:29,320 --> 00:10:32,000
when tagging content but also supports more reliable search
280
00:10:32,000 --> 00:10:34,800
refiners, navigation, and policy targeting.
281
00:10:34,800 --> 00:10:37,200
Content types extend this model by bundling together
282
00:10:37,200 --> 00:10:39,560
a set of columns, templates, and workflows
283
00:10:39,560 --> 00:10:42,360
for a specific type of content such as a customer contract
284
00:10:42,360 --> 00:10:44,040
or standard operating procedure,
285
00:10:44,040 --> 00:10:47,160
which can then be reused across libraries and sites.
286
00:10:47,160 --> 00:10:49,120
Taxonomy governance is critical to ensuring
287
00:10:49,120 --> 00:10:53,200
that these structures remain usable and relevant over time.
288
00:10:53,200 --> 00:10:54,800
Frameworks for taxonomy governance
289
00:10:54,800 --> 00:10:56,960
emphasize the need for cross-functional involvement
290
00:10:56,960 --> 00:11:00,040
from business units, IT and information management,
291
00:11:00,040 --> 00:11:02,600
clear ownership of term sets, and defined processes
292
00:11:02,600 --> 00:11:05,600
for adding, modifying, or deprecating terms.
293
00:11:05,600 --> 00:11:07,760
If term sets grow organically without oversight,
294
00:11:07,760 --> 00:11:09,840
they can become cluttered and inconsistent,
295
00:11:09,840 --> 00:11:11,840
eroding the benefits of structured metadata
296
00:11:11,840 --> 00:11:13,640
and confusing end-users.
297
00:11:13,640 --> 00:11:17,040
Conversely, a well-governed taxonomy aligned to business domains
298
00:11:17,040 --> 00:11:19,640
can form the backbone of automated classification
299
00:11:19,640 --> 00:11:21,560
and retention policies, particularly
300
00:11:21,560 --> 00:11:23,480
when paired with AI-based systems
301
00:11:23,480 --> 00:11:27,040
that can infer metadata from document content.
302
00:11:27,040 --> 00:11:27,960
But here's the problem.
303
00:11:27,960 --> 00:11:30,320
Most organizations skip the taxonomy step
304
00:11:30,320 --> 00:11:31,840
and jump straight to the AI.
305
00:11:31,840 --> 00:11:33,800
They think a machine learning model will somehow
306
00:11:33,800 --> 00:11:35,560
invent their business logic from scratch.
307
00:11:35,560 --> 00:11:36,320
It won't.
308
00:11:36,320 --> 00:11:38,000
AI can recognize patterns in documents,
309
00:11:38,000 --> 00:11:40,280
but it needs a target vocabulary to populate.
310
00:11:40,280 --> 00:11:42,040
It can extract an expiration date,
311
00:11:42,040 --> 00:11:44,240
but it needs to know which column to drop it into.
312
00:11:44,240 --> 00:11:46,120
It can classify a document as sensitive,
313
00:11:46,120 --> 00:11:48,760
but it needs to know which sensitivity label your organization
314
00:11:48,760 --> 00:11:49,440
uses.
315
00:11:49,440 --> 00:11:50,960
The model isn't the hard part.
316
00:11:50,960 --> 00:11:52,800
The taxonomy is. And if your taxonomy is broken,
317
00:11:52,800 --> 00:11:54,000
your model will be too.
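
Here is a rough Python sketch of what "a target vocabulary to populate" means in practice. The term set, its synonyms, and the `to_term` helper are hypothetical; the raw labels stand in for whatever a classification model might emit.

```python
# Hypothetical term set: preferred term -> accepted synonyms (lowercase).
CONTRACT_TYPES = {
    "Master Services Agreement": {"msa", "master service agreement"},
    "Non-Disclosure Agreement": {"nda", "confidentiality agreement"},
    "Statement of Work": {"sow"},
}


def to_term(raw_label: str):
    """Map a model's free-text label onto a governed term, or None."""
    key = raw_label.strip().lower()
    for term, synonyms in CONTRACT_TYPES.items():
        if key == term.lower() or key in synonyms:
            return term
    return None  # no matching term: route to human review, do not invent one


print(to_term("NDA"))
print(to_term("Vendor paper"))
```

The model can say "NDA" all day long, but only the taxonomy decides that the column value is "Non-Disclosure Agreement" and that "Vendor paper" is not a valid answer.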
318
00:11:54,000 --> 00:11:56,000
Managed metadata service is the engine that
319
00:11:56,000 --> 00:11:58,440
makes taxonomy scalable across your tenant.
320
00:11:58,440 --> 00:12:01,120
It stores term sets in a central service application,
321
00:12:01,120 --> 00:12:03,160
making them available to any site collection that
322
00:12:03,160 --> 00:12:03,960
subscribes to them.
323
00:12:03,960 --> 00:12:06,840
When your legal team defines a term set for contract types,
324
00:12:06,840 --> 00:12:08,880
every SharePoint site can use those same terms
325
00:12:08,880 --> 00:12:10,120
without recreating them.
326
00:12:10,120 --> 00:12:12,120
When your HR team updates a department list,
327
00:12:12,120 --> 00:12:13,600
the change propagates automatically
328
00:12:13,600 --> 00:12:15,360
to every library that references it.
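
A toy Python model of that propagation, under the assumption that items store a stable term ID and look up the label centrally. The `TermStore` class, the IDs, and the library data are hypothetical and greatly simplified compared with the real service.

```python
from dataclasses import dataclass, field


@dataclass
class TermStore:
    """Toy central term store: items hold term IDs, not label text."""
    labels: dict = field(default_factory=dict)  # term_id -> current label

    def add(self, term_id: str, label: str) -> None:
        self.labels[term_id] = label

    def rename(self, term_id: str, new_label: str) -> None:
        self.labels[term_id] = new_label


store = TermStore()
store.add("dept-001", "Human Resources")

# Two libraries in different sites tag items with the same term ID.
legal_library = [{"name": "policy.docx", "department": "dept-001"}]
finance_library = [{"name": "budget.xlsx", "department": "dept-001"}]

# One central change, made once by the owning team.
store.rename("dept-001", "People and Culture")


def display(item):
    """Resolve the label an end user would see for this item."""
    return store.labels[item["department"]]


print(display(legal_library[0]), "|", display(finance_library[0]))
```

Because neither library stores the label text, there is nothing to fix site by site after the rename.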
329
00:12:15,360 --> 00:12:17,440
This centralization is what prevents
330
00:12:17,440 --> 00:12:19,960
the organic sprawl of inconsistent labels
331
00:12:19,960 --> 00:12:22,240
that destroys metadata quality over time.
332
00:12:22,240 --> 00:12:24,360
Content types add another layer of structure.
333
00:12:24,360 --> 00:12:26,360
A content type bundles columns, templates,
334
00:12:26,360 --> 00:12:28,440
and workflows into a reusable package.
335
00:12:28,440 --> 00:12:30,400
When you define a customer contract content type
336
00:12:30,400 --> 00:12:32,760
with columns for counterparty, effective date, and governing
337
00:12:32,760 --> 00:12:34,680
law, you can attach that content type
338
00:12:34,680 --> 00:12:36,360
to any library in your tenant.
339
00:12:36,360 --> 00:12:38,360
Users see the same column structure,
340
00:12:38,360 --> 00:12:41,200
the same document template, and the same retention workflow
341
00:12:41,200 --> 00:12:43,160
regardless of which site they are in.
342
00:12:43,160 --> 00:12:45,240
This consistency is what makes enterprise governance
343
00:12:45,240 --> 00:12:46,080
possible.
344
00:12:46,080 --> 00:12:47,960
Without content types, every site owner
345
00:12:47,960 --> 00:12:50,480
invents their own column names, their own data formats,
346
00:12:50,480 --> 00:12:51,920
and their own business rules.
347
00:12:51,920 --> 00:12:53,640
The result is a metadata Tower of Babel
348
00:12:53,640 --> 00:12:56,440
where vendor in one site means counterparty in another,
349
00:12:56,440 --> 00:12:58,720
and no system can correlate them.
350
00:12:58,720 --> 00:13:00,680
Site columns are the individual fields
351
00:13:00,680 --> 00:13:03,280
that make up a content type or library schema.
352
00:13:03,280 --> 00:13:06,200
Each site column has a data type, a default value,
353
00:13:06,200 --> 00:13:07,880
and a set of allowed values.
354
00:13:07,880 --> 00:13:09,760
When you define a site column for contract value
355
00:13:09,760 --> 00:13:11,560
as currency with a minimum of 0,
356
00:13:11,560 --> 00:13:14,160
you prevent users from entering text or negative numbers.
357
00:13:14,160 --> 00:13:16,680
When you define a site column for contract type
358
00:13:16,680 --> 00:13:20,040
as a managed metadata field tied to a closed term set,
359
00:13:20,040 --> 00:13:22,560
you prevent users from inventing new contract types
360
00:13:22,560 --> 00:13:23,480
on the fly.
361
00:13:23,480 --> 00:13:25,400
These constraints are not limitations.
362
00:13:25,400 --> 00:13:27,640
They are the guardrails that keep metadata
363
00:13:27,640 --> 00:13:29,120
clean enough for automation.
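The guardrails described here can be sketched in a few lines. This is a minimal illustration in plain Python, not a SharePoint API call: the class names `CurrencyColumn` and `ManagedMetadataColumn`, the column names, and the sample contract types are all hypothetical stand-ins for a currency column with a minimum of 0 and a managed metadata column tied to a closed term set.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class CurrencyColumn:
    """Hypothetical stand-in for a currency site column with a minimum value."""
    name: str
    minimum: Decimal = Decimal("0")

    def validate(self, raw: str) -> Decimal:
        # Reject anything that is not a finite number (free text, NaN, infinity).
        try:
            value = Decimal(raw)
        except InvalidOperation:
            raise ValueError(f"{self.name}: {raw!r} is not a number") from None
        if not value.is_finite():
            raise ValueError(f"{self.name}: {raw!r} is not a finite number")
        # Reject values below the configured minimum (negative amounts by default).
        if value < self.minimum:
            raise ValueError(f"{self.name}: {value} is below {self.minimum}")
        return value


@dataclass(frozen=True)
class ManagedMetadataColumn:
    """Hypothetical stand-in for a column bound to a closed term set."""
    name: str
    term_set: frozenset

    def validate(self, raw: str) -> str:
        # A closed term set means users cannot add new terms on the fly.
        if raw not in self.term_set:
            raise ValueError(f"{self.name}: {raw!r} is not in the term set")
        return raw


contract_value = CurrencyColumn("ContractValue")
contract_type = ManagedMetadataColumn("ContractType", frozenset({"NDA", "MSA", "SOW"}))
```

With these definitions, `contract_value.validate("1500.50")` returns a `Decimal`, while `contract_value.validate("-1")` and `contract_type.validate("Handshake deal")` both raise `ValueError`.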
364
00:13:29,120 --> 00:13:31,160
The relationship between these elements matters.
365
00:13:31,160 --> 00:13:32,920
Term sets define the vocabulary.
366
00:13:32,920 --> 00:13:35,160
Site columns define the field structure.
367
00:13:35,160 --> 00:13:37,880
Content types bundle them into reusable packages.
368
00:13:37,880 --> 00:13:39,880
Libraries apply them to real documents
369
00:13:39,880 --> 00:13:42,160
and Syntex models populate them automatically.
370
00:13:42,160 --> 00:13:45,000
If any layer is missing or broken, the whole stack fails.
371
00:13:45,000 --> 00:13:46,880
A beautiful model with no target columns
372
00:13:46,880 --> 00:13:48,760
has nowhere to write its extractions.
373
00:13:48,760 --> 00:13:50,240
A perfect taxonomy with no model
374
00:13:50,240 --> 00:13:52,040
has no way to populate itself.
375
00:13:52,040 --> 00:13:53,800
A governance policy with no classification
376
00:13:53,800 --> 00:13:54,840
has nothing to enforce.
377
00:13:54,840 --> 00:13:57,320
These four layers, taxonomy, columns, content types,
378
00:13:57,320 --> 00:13:59,520
and models must be designed together.
379
00:13:59,520 --> 00:14:02,000
Unfortunately, most organizations design them separately
380
00:14:02,000 --> 00:14:03,480
if they design them at all.
381
00:14:03,480 --> 00:14:05,840
The SharePoint admin creates generic libraries
382
00:14:05,840 --> 00:14:08,000
with default document content types.
383
00:14:08,000 --> 00:14:10,680
The business user uploads files without tagging them.
384
00:14:10,680 --> 00:14:12,760
The compliance officer writes retention policies
385
00:14:12,760 --> 00:14:15,320
that apply to everything because there is no classification
386
00:14:15,320 --> 00:14:16,040
to target.
387
00:14:16,040 --> 00:14:17,920
The AI team deploys Copilot and wonders
388
00:14:17,920 --> 00:14:19,800
why it returns irrelevant results.
389
00:14:19,800 --> 00:14:22,040
Each group is doing their job in isolation
390
00:14:22,040 --> 00:14:25,280
and the isolation is what creates the gap.
391
00:14:25,280 --> 00:14:28,040
Governance, compliance, and Copilot readiness.
392
00:14:28,040 --> 00:14:30,800
The metadata gap isn't only a productivity issue.
393
00:14:30,800 --> 00:14:33,040
It is a genuine compliance and risk concern
394
00:14:33,040 --> 00:14:35,400
that sits at the intersection of governance trends
395
00:14:35,400 --> 00:14:36,640
and AI readiness.
396
00:14:36,640 --> 00:14:39,240
Regulators and courts increasingly expect organizations
397
00:14:39,240 --> 00:14:41,760
to demonstrate defensible retention, disposition,
398
00:14:41,760 --> 00:14:44,000
and e-discovery practices, particularly
399
00:14:44,000 --> 00:14:45,840
in regulated industries or jurisdictions
400
00:14:45,840 --> 00:14:51,320
governed by frameworks like GDPR, CCPA, and ISO 27701.
401
00:14:51,320 --> 00:14:53,680
Retention policies and labels in Microsoft Purview
402
00:14:53,680 --> 00:14:56,160
are designed to help organizations retain content
403
00:14:56,160 --> 00:14:59,680
for required periods or delete it when it is no longer needed,
404
00:14:59,680 --> 00:15:02,720
but these controls rely on being able to identify which
405
00:15:02,720 --> 00:15:04,920
content falls into which category.
406
00:15:04,920 --> 00:15:06,440
When documents lack classification
407
00:15:06,440 --> 00:15:08,680
that differentiates records from working copies
408
00:15:08,680 --> 00:15:11,160
or regulated data from non-critical content,
409
00:15:11,160 --> 00:15:14,240
retention settings may be applied broadly or not at all.
410
00:15:14,240 --> 00:15:17,000
That increases both over-retention and premature deletion risk.
411
00:15:17,000 --> 00:15:19,640
In the context of e-discovery and litigation holds,
412
00:15:19,640 --> 00:15:21,840
metadata becomes even more critical.
413
00:15:21,840 --> 00:15:24,760
Metadata provides details such as creation and modification
414
00:15:24,760 --> 00:15:27,960
timestamps, authorship, file paths, and message headers,
415
00:15:27,960 --> 00:15:29,680
which are needed to reconstruct events
416
00:15:29,680 --> 00:15:31,720
and demonstrate chain of custody.
417
00:15:31,720 --> 00:15:33,600
Best practices for defensible e-discovery
418
00:15:33,600 --> 00:15:37,120
emphasize preserving metadata using forensic-grade tools,
419
00:15:37,120 --> 00:15:39,000
documenting the chain of custody,
420
00:15:39,000 --> 00:15:41,880
and auditing metadata before and after processing
421
00:15:41,880 --> 00:15:44,160
to avoid challenges to evidence integrity.
422
00:15:44,160 --> 00:15:46,920
If SharePoint content is spread across uncontrolled sites,
423
00:15:46,920 --> 00:15:50,000
personal OneDrive locations and shadow repositories,
424
00:15:50,000 --> 00:15:52,480
and if the metadata is inconsistent or incomplete,
425
00:15:52,480 --> 00:15:54,840
legal teams may struggle to identify, preserve,
426
00:15:54,840 --> 00:15:58,000
and produce relevant materials in a defensible manner.
427
00:15:58,000 --> 00:16:01,280
So the lack of structured metadata increases litigation risk
428
00:16:01,280 --> 00:16:03,360
and the cost of discovery exercises.
429
00:16:03,360 --> 00:16:06,080
It also amplifies security and privacy risks.
430
00:16:06,080 --> 00:16:09,240
SharePoint and OneDrive provide strong security capabilities,
431
00:16:09,240 --> 00:16:12,640
but DLP rules, sensitivity labels, and conditional access
432
00:16:12,640 --> 00:16:16,640
rely on understanding which documents contain sensitive information.
433
00:16:16,640 --> 00:16:18,040
Without reliable classification,
434
00:16:18,040 --> 00:16:19,960
organizations may either over-restrict access
435
00:16:19,960 --> 00:16:23,000
and hamper collaboration or under-protect sensitive documents
436
00:16:23,000 --> 00:16:24,680
and risk accidental exposure.
437
00:16:24,680 --> 00:16:27,600
Governance practices in the Microsoft 365 environment
438
00:16:27,600 --> 00:16:30,520
have been evolving to address both the scale of content
439
00:16:30,520 --> 00:16:33,520
and the emergence of AI-powered experiences.
440
00:16:33,520 --> 00:16:35,680
Emerging trends include more granular controls
441
00:16:35,680 --> 00:16:37,920
for regulating which content can be surfaced
442
00:16:37,920 --> 00:16:40,240
by AI assistants and tenant-wide search,
443
00:16:40,240 --> 00:16:42,040
reflecting both privacy concerns
444
00:16:42,040 --> 00:16:44,480
and the need to respect regulatory boundaries.
445
00:16:44,480 --> 00:16:47,520
One notable feature is restricted content discovery,
446
00:16:47,520 --> 00:16:49,600
which allows administrators to prevent content
447
00:16:49,600 --> 00:16:51,640
from specified SharePoint sites from appearing
448
00:16:51,640 --> 00:16:54,600
in organization-wide search and Microsoft 365
449
00:16:54,600 --> 00:16:55,920
Copilot Business Chat,
450
00:16:55,920 --> 00:16:58,560
while still permitting users to access documents they own
451
00:16:58,560 --> 00:17:00,400
or have recently interacted with.
452
00:17:00,400 --> 00:17:02,720
This capability reflects a broader recognition
453
00:17:02,720 --> 00:17:06,040
that AI experiences introduce new governance dimensions.
454
00:17:06,040 --> 00:17:08,320
Organizations must manage not only who can access
455
00:17:08,320 --> 00:17:09,600
a document directly,
456
00:17:09,600 --> 00:17:12,200
but also how content can be discovered or summarized
457
00:17:12,200 --> 00:17:13,880
indirectly by AI tools.
458
00:17:13,880 --> 00:17:16,560
Governance strategies must cover both traditional controls
459
00:17:16,560 --> 00:17:19,920
around permissions and newer controls around AI visibility,
460
00:17:19,920 --> 00:17:22,280
ensuring that sensitive or highly regulated content
461
00:17:22,280 --> 00:17:25,160
is not surfaced in unintended contexts.
462
00:17:25,160 --> 00:17:27,440
At the same time, Microsoft's messaging around hybrid work
463
00:17:27,440 --> 00:17:29,920
and AI emphasizes that organizations
464
00:17:29,920 --> 00:17:32,440
which invest in governing and organizing their information
465
00:17:32,440 --> 00:17:34,280
will gain a competitive advantage
466
00:17:34,280 --> 00:17:36,080
as they can more effectively use AI
467
00:17:36,080 --> 00:17:38,720
to drive productivity and innovation.
468
00:17:38,720 --> 00:17:41,200
The metadata gap thus sits at the intersection
469
00:17:41,200 --> 00:17:44,080
of governance trends and AI readiness.
470
00:17:44,080 --> 00:17:46,160
It is the single biggest reason your Copilot rollout
471
00:17:46,160 --> 00:17:48,440
will underwhelm, not because the AI is weak,
472
00:17:48,440 --> 00:17:51,120
but because the ground beneath it is incomplete.
473
00:17:51,120 --> 00:17:53,520
Copilot relies on metadata, permissions,
474
00:17:53,520 --> 00:17:55,840
and information architecture to return relevant,
475
00:17:55,840 --> 00:17:57,200
trustworthy responses.
476
00:17:57,200 --> 00:17:59,280
When documents lack accurate metadata
477
00:17:59,280 --> 00:18:01,280
or are stored in haphazard locations,
478
00:18:01,280 --> 00:18:03,680
semantic search and AI-driven retrieval struggle
479
00:18:03,680 --> 00:18:05,280
to surface the right context.
480
00:18:05,280 --> 00:18:06,760
The metadata gap becomes a barrier
481
00:18:06,760 --> 00:18:09,840
to realizing the promise of AI-accelerated hybrid work
482
00:18:09,840 --> 00:18:13,080
and unmanaged SharePoint data quietly becomes a liability
483
00:18:13,080 --> 00:18:14,840
that most boards do not see coming.
484
00:18:14,840 --> 00:18:17,040
There's a secret weapon Microsoft built for this
485
00:18:17,040 --> 00:18:18,720
and most tenants haven't turned it on yet.
486
00:18:18,720 --> 00:18:20,760
Restricted content discovery is one example
487
00:18:20,760 --> 00:18:24,040
of how governance is evolving to address AI-specific risks.
488
00:18:24,040 --> 00:18:26,040
Administrators can configure this feature
489
00:18:26,040 --> 00:18:27,920
to prevent specified SharePoint sites
490
00:18:27,920 --> 00:18:29,800
from appearing in organization-wide search
491
00:18:29,800 --> 00:18:33,000
and Copilot Business Chat while still allowing direct access
492
00:18:33,000 --> 00:18:36,200
to users who own or have recently interacted with the content.
493
00:18:36,200 --> 00:18:37,800
This is not about blocking access,
494
00:18:37,800 --> 00:18:39,320
it is about controlling discovery.
495
00:18:39,320 --> 00:18:41,320
A site containing merger negotiations
496
00:18:41,320 --> 00:18:43,240
should not surface in a Copilot query
497
00:18:43,240 --> 00:18:45,800
from an intern who happens to have read permissions.
498
00:18:45,800 --> 00:18:47,240
The metadata and permissions layer
499
00:18:47,240 --> 00:18:49,480
must govern not just who can open a document,
500
00:18:49,480 --> 00:18:51,200
but who can discover it through AI.
501
00:18:51,200 --> 00:18:55,040
ISO 27701, GDPR and CCPA all impose obligations
502
00:18:55,040 --> 00:18:57,800
on data minimization, purpose limitation, and transparency.
503
00:18:57,800 --> 00:18:59,440
These frameworks require organizations
504
00:18:59,440 --> 00:19:01,120
to know what personal data they hold,
505
00:19:01,120 --> 00:19:03,320
where it resides, and how long they keep it.
506
00:19:03,320 --> 00:19:06,600
A SharePoint environment with no metadata cannot answer those questions.
507
00:19:06,600 --> 00:19:09,160
When a data subject requests erasure under GDPR,
508
00:19:09,160 --> 00:19:11,400
the organization must identify every document
509
00:19:11,400 --> 00:19:13,320
containing that person's information.
510
00:19:13,320 --> 00:19:15,760
Without metadata tagging for personal data types,
511
00:19:15,760 --> 00:19:18,560
this becomes a manual search across millions of files.
512
00:19:18,560 --> 00:19:22,320
The cost of compliance scales directly with the metadata gap.
513
00:19:22,320 --> 00:19:24,400
The same logic applies to records management.
514
00:19:24,400 --> 00:19:26,360
A record is not just any old document,
515
00:19:26,360 --> 00:19:28,880
it is evidence of a business transaction or decision
516
00:19:28,880 --> 00:19:31,200
that must be preserved for a defined period
517
00:19:31,200 --> 00:19:33,480
and disposed of according to a legal authority.
518
00:19:33,480 --> 00:19:35,320
Records management requires classification
519
00:19:35,320 --> 00:19:37,520
that distinguishes records from working copies,
520
00:19:37,520 --> 00:19:40,200
transitory communications, and reference materials.
521
00:19:40,200 --> 00:19:44,040
Without that classification, organizations either over-retain everything,
522
00:19:44,040 --> 00:19:46,600
accumulating storage cost and litigation risk
523
00:19:46,600 --> 00:19:49,320
or under-retain, destroying evidence they later need.
524
00:19:49,320 --> 00:19:52,440
Metadata is the mechanism that makes records management precise,
525
00:19:52,440 --> 00:19:54,200
defensible, and auditable.
526
00:19:54,200 --> 00:19:57,800
Even basic information security relies on metadata.
527
00:19:57,800 --> 00:19:59,560
Conditional access policies can enforce
528
00:19:59,560 --> 00:20:02,320
multi-factor authentication for sensitive documents,
529
00:20:02,320 --> 00:20:05,440
but only if the system knows which documents are sensitive.
530
00:20:05,440 --> 00:20:08,800
Encryption policies can apply stronger algorithms to regulated data,
531
00:20:08,800 --> 00:20:11,880
but only if the system can identify regulated data.
532
00:20:11,880 --> 00:20:13,600
The principle of least privilege,
533
00:20:13,600 --> 00:20:16,320
granting users the minimum access necessary for their role,
534
00:20:16,320 --> 00:20:19,760
depends on understanding what each document contains and who should see it.
535
00:20:19,760 --> 00:20:22,360
Metadata makes those distinctions machine-readable.
536
00:20:22,360 --> 00:20:24,840
Without it, security policies become blanket rules
537
00:20:24,840 --> 00:20:28,320
that either block legitimate work or leave gaps that attackers exploit.
538
00:20:28,320 --> 00:20:32,400
Copilot readiness is the final and most urgent reason to close the metadata gap.
539
00:20:32,400 --> 00:20:36,600
Microsoft has enhanced Copilot's ability to reason over SharePoint metadata
540
00:20:36,600 --> 00:20:38,880
as of Ignite 2025.
541
00:20:38,880 --> 00:20:41,240
Copilot can now distinguish between similar documents
542
00:20:41,240 --> 00:20:42,880
by understanding their classification,
543
00:20:42,880 --> 00:20:46,080
department, project associations, and custom tags.
544
00:20:46,080 --> 00:20:49,000
When grounded on libraries with populated metadata,
545
00:20:49,000 --> 00:20:52,920
organizations see tangible improvements in response quality and relevance.
546
00:20:52,920 --> 00:20:57,240
For example, asking Copilot to show the latest client proposals for the healthcare vertical
547
00:20:57,240 --> 00:21:00,520
now returns accurate results because it understands both the content
548
00:21:00,520 --> 00:21:04,840
and the metadata tags indicating industry vertical, status, and date.
549
00:21:04,840 --> 00:21:08,360
Without metadata, Copilot is just a fast way to find the wrong answer.
550
00:21:08,360 --> 00:21:11,080
What intelligent document processing actually means.
551
00:21:11,080 --> 00:21:13,920
The weapon isn't Copilot itself, it's the layer underneath.
552
00:21:13,920 --> 00:21:16,200
Intelligent document processing, or IDP,
553
00:21:16,200 --> 00:21:20,120
refers to software capabilities that capture, transform, and process data
554
00:21:20,120 --> 00:21:23,600
from documents using AI techniques such as computer vision,
555
00:21:23,600 --> 00:21:26,760
optical character recognition, natural language processing,
556
00:21:26,760 --> 00:21:28,240
and machine learning.
557
00:21:28,240 --> 00:21:31,400
IDP solutions ingest documents in various formats including emails,
558
00:21:31,400 --> 00:21:34,840
PDFs, Word files, and scanned images, and convert them into structured data
559
00:21:34,840 --> 00:21:39,440
that can be analyzed, categorized, and integrated into downstream systems and workflows.
560
00:21:39,440 --> 00:21:42,240
A key goal is to automate the extraction of relevant information
561
00:21:42,240 --> 00:21:45,560
such as invoice numbers, contract dates, or customer details
562
00:21:45,560 --> 00:21:48,520
and to classify documents according to type or business process
563
00:21:48,520 --> 00:21:52,160
which reduces manual data entry and classification work.
564
00:21:52,160 --> 00:21:55,600
At a technical level, IDP systems combine several components.
565
00:21:55,600 --> 00:22:00,080
Optical character recognition transforms images of text into machine-readable text,
566
00:22:00,080 --> 00:22:04,560
enabling the processing of scanned documents or photographs of physical forms.
567
00:22:04,560 --> 00:22:08,640
Natural language processing techniques interpret the semantics and structure of the text,
568
00:22:08,640 --> 00:22:12,160
identifying entities, relationships, and topics that help classify documents
569
00:22:12,160 --> 00:22:14,280
or extract key-value pairs.
570
00:22:14,280 --> 00:22:16,960
Machine learning models, often trained on labeled examples,
571
00:22:16,960 --> 00:22:19,920
learn to recognize patterns in document layouts and content,
572
00:22:19,920 --> 00:22:24,000
allowing them to handle structured, semi-structured, and unstructured documents.
573
00:22:24,000 --> 00:22:26,880
Over time, models can be tuned to specific domains,
574
00:22:26,880 --> 00:22:30,160
improving accuracy for particular types of documents or industries.
575
00:22:30,160 --> 00:22:32,720
Understanding the nature of the documents being processed
576
00:22:32,720 --> 00:22:35,240
is needed to design effective IDP solutions.
577
00:22:35,240 --> 00:22:39,600
Structured documents such as standardized forms and invoices with consistent layouts
578
00:22:39,600 --> 00:22:41,840
are relatively straightforward for extraction engines
579
00:22:41,840 --> 00:22:45,280
because fields appear in predictable locations and formats.
580
00:22:45,280 --> 00:22:49,760
For these, structured document processing models can map specific regions of a page
581
00:22:49,760 --> 00:22:54,320
to named fields, achieving high accuracy with relatively little training data.
582
00:22:54,320 --> 00:22:57,360
Semi-structured documents combine fixed and variable elements.
583
00:22:57,360 --> 00:23:00,000
Examples include purchase orders where line items vary,
584
00:23:00,000 --> 00:23:03,600
but the header structure is consistent, or invoices from multiple vendors
585
00:23:03,600 --> 00:23:06,640
that follow similar patterns but differ in detail.
586
00:23:06,640 --> 00:23:10,000
Unstructured documents such as contracts, policies, memos, and correspondence
587
00:23:10,000 --> 00:23:13,440
lack fixed layouts and rely heavily on linguistic cues and context,
588
00:23:13,440 --> 00:23:16,880
making them more challenging to parse reliably.
589
00:23:16,880 --> 00:23:21,600
The business case for intelligent document processing is grounded in both cost savings and risk reduction.
590
00:23:21,600 --> 00:23:26,080
By automating repetitive tasks such as data entry from invoices or manual tagging of documents,
591
00:23:26,080 --> 00:23:29,200
IDP reduces human effort and speeds up business processes
592
00:23:29,200 --> 00:23:31,440
from accounts payable to contract onboarding.
593
00:23:31,440 --> 00:23:34,960
AI-driven classification and extraction can also improve accuracy
594
00:23:34,960 --> 00:23:38,320
relative to manual processes, which are prone to errors,
595
00:23:38,320 --> 00:23:40,960
especially when volume and complexity are high.
596
00:23:40,960 --> 00:23:43,920
Over time, the data extracted from documents becomes a valuable source
597
00:23:43,920 --> 00:23:45,840
for business intelligence and analytics,
598
00:23:45,840 --> 00:23:50,320
enabling organizations to derive insights from previously inaccessible unstructured content.
599
00:23:50,320 --> 00:23:55,040
Studies have shown how AI-driven automation can transform employee productivity
600
00:23:55,040 --> 00:23:58,000
and accelerate business results by freeing knowledge workers
601
00:23:58,000 --> 00:23:59,760
to focus on higher value tasks.
602
00:23:59,760 --> 00:24:02,400
However, these benefits are not automatic,
603
00:24:02,400 --> 00:24:04,000
and there are important caveats.
604
00:24:04,000 --> 00:24:07,360
AI systems must be trained and validated on representative data
605
00:24:07,360 --> 00:24:10,720
and their outputs need to be monitored for accuracy and bias,
606
00:24:10,720 --> 00:24:15,200
particularly when used for decisions that carry financial, legal, or ethical implications.
607
00:24:15,920 --> 00:24:18,960
Organizations cannot simply assume AI outputs are correct.
608
00:24:18,960 --> 00:24:22,480
Best practice is to treat AI-generated classifications and extractions
609
00:24:22,480 --> 00:24:25,520
as draft work that should be reviewed by subject matter experts,
610
00:24:25,520 --> 00:24:27,280
especially during initial deployment.
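One way to picture the "draft work" idea is a simple review queue. The sketch below is an assumption-laden illustration, not a Syntex feature: the `Extraction` record, the `route` function, the 0.9 threshold, and the `review_all` switch for the initial deployment period are all hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """Hypothetical record for one AI-extracted metadata value."""
    field: str
    value: str
    confidence: float  # assumed to be a model-reported score from 0.0 to 1.0


def route(extractions, threshold=0.9, review_all=False):
    """Split extractions into auto-accepted values and values for expert review.

    review_all=True models the initial deployment period, when every
    AI-generated value is treated as a draft regardless of confidence.
    """
    accepted, needs_review = [], []
    for item in extractions:
        if review_all or item.confidence < threshold:
            needs_review.append(item)
        else:
            accepted.append(item)
    return accepted, needs_review


batch = [
    Extraction("ExpirationDate", "2026-03-31", 0.97),
    Extraction("GoverningLaw", "Delaware", 0.62),
]
accepted, needs_review = route(batch)
```

Here the expiration date is accepted automatically and the governing law value goes to a subject matter expert. The threshold itself is something to tune against reviewed samples rather than a fixed recommendation.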
611
00:24:27,280 --> 00:24:30,800
Overreliance on AI without proper governance may introduce new risks,
612
00:24:30,800 --> 00:24:34,480
such as misclassification of sensitive data or incorrect metadata
613
00:24:34,480 --> 00:24:37,040
that propagates through retention and DLP policies.
614
00:24:37,040 --> 00:24:41,440
IDP does not eliminate the need for clear process and information architecture.
615
00:24:41,440 --> 00:24:45,520
If business processes are poorly defined or content is created in a chaotic manner,
616
00:24:45,520 --> 00:24:48,080
even the best models will struggle to impose order.
617
00:24:48,080 --> 00:24:50,160
AI amplifies whatever patterns exist.
618
00:24:50,160 --> 00:24:51,680
If those patterns are inconsistent,
619
00:24:51,680 --> 00:24:53,840
the generated metadata will be inconsistent
620
00:24:53,840 --> 00:24:56,240
and governance outcomes will remain unreliable.
621
00:24:56,240 --> 00:24:58,720
IDP should be positioned not as a silver bullet,
622
00:24:58,720 --> 00:25:01,120
but as a powerful enabler within a broader program
623
00:25:01,120 --> 00:25:02,960
that includes taxonomy design,
624
00:25:02,960 --> 00:25:05,600
governance frameworks, and change management.
625
00:25:05,600 --> 00:25:07,440
Within the Microsoft product suite,
626
00:25:07,440 --> 00:25:10,080
IDP capabilities are available across several products
627
00:25:10,080 --> 00:25:11,840
that serve different layers of the stack.
628
00:25:11,840 --> 00:25:16,240
Microsoft Syntex, now integrated into SharePoint Premium and Document Processing,
629
00:25:16,240 --> 00:25:18,960
operates directly on documents stored in SharePoint,
630
00:25:18,960 --> 00:25:20,960
enabling organizations to understand,
631
00:25:20,960 --> 00:25:25,280
classify and extract information from their content within Microsoft 365.
632
00:25:25,280 --> 00:25:28,560
Syntex uses document processing models that can be configured
633
00:25:28,560 --> 00:25:31,680
to recognize structured forms, semi-structured documents,
634
00:25:31,680 --> 00:25:33,200
or unstructured content,
635
00:25:33,200 --> 00:25:36,880
and it integrates with SharePoint libraries to automatically apply metadata,
636
00:25:36,880 --> 00:25:40,640
content types, retention labels, and other governance artifacts.
637
00:25:40,640 --> 00:25:43,760
It also supports related capabilities such as content assembly,
638
00:25:43,760 --> 00:25:47,040
which can generate standardized documents from templates and data sources.
639
00:25:47,040 --> 00:25:50,080
Azure Document Intelligence, now part of Azure Content Understanding
640
00:25:50,080 --> 00:25:53,680
in Foundry Tools, provides IDP capabilities as a cloud service
641
00:25:53,680 --> 00:25:57,200
that can process documents at scale and integrate with custom applications.
642
00:25:57,200 --> 00:26:00,640
It applies advanced AI models to extract text,
643
00:26:00,640 --> 00:26:05,040
key value pairs, tables, and document structure from files such as PDFs,
644
00:26:05,040 --> 00:26:09,200
images and forms with strong performance on structured and templated documents.
645
00:26:09,200 --> 00:26:12,640
The service exposes REST APIs for developers to analyze documents
646
00:26:12,640 --> 00:26:16,080
using pre-built analyzers such as a general-purpose document analyzer
647
00:26:16,080 --> 00:26:20,000
that extracts text and layout elements like paragraphs and tables.
648
00:26:20,000 --> 00:26:22,880
This makes it suitable for scenarios where document processing needs
649
00:26:22,880 --> 00:26:25,680
to be tightly integrated with line of business applications,
650
00:26:25,680 --> 00:26:27,760
data pipelines, or external storage.
651
00:26:27,760 --> 00:26:30,720
Azure Content Understanding extends these capabilities
652
00:26:30,720 --> 00:26:32,960
beyond documents to other modalities,
653
00:26:32,960 --> 00:26:35,440
including images, audio, and video,
654
00:26:35,440 --> 00:26:38,400
providing pre-built analyzers for extracting transcripts,
655
00:26:38,400 --> 00:26:40,160
keyframes, and descriptions.
656
00:26:40,160 --> 00:26:43,600
A developer can send a file URL to a Content Understanding endpoint,
657
00:26:43,600 --> 00:26:48,080
specify an analyzer such as pre-built document analyzer or pre-built video analyzer,
658
00:26:48,080 --> 00:26:50,160
and then poll for analysis results,
659
00:26:50,160 --> 00:26:53,200
integrating the extracted data into downstream workflows.
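The submit-then-poll flow can be sketched without committing to any particular SDK. In the minimal example below, `poll_until_done`, the `get_status` callable, and the status strings "Running", "Succeeded", and "Failed" are assumptions made for illustration. A real client would wrap the HTTP call that checks the analysis operation, and the actual status values should be taken from the service documentation.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], dict],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_status repeatedly until the operation finishes or times out.

    get_status is a hypothetical callable that returns the current state of
    the analysis operation as a dict with a "status" key.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = get_status()
        status = response.get("status")
        if status == "Succeeded":  # assumed terminal success value
            return response
        if status == "Failed":  # assumed terminal failure value
            raise RuntimeError(f"analysis failed: {response}")
        if time.monotonic() >= deadline:
            raise TimeoutError("analysis did not finish before the timeout")
        sleep(interval_s)


# Usage with a fake operation that finishes on the third check.
_fake_responses = iter([
    {"status": "Running"},
    {"status": "Running"},
    {"status": "Succeeded", "result": {"pages": 2}},
])
final = poll_until_done(lambda: next(_fake_responses), sleep=lambda _: None)
```

Passing the `sleep` function in as a parameter keeps the loop easy to test, and the timeout keeps a stuck operation from blocking a downstream workflow indefinitely.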
660
00:26:53,200 --> 00:26:57,120
This positions Content Understanding as a general multimodal extraction layer
661
00:26:57,120 --> 00:26:58,880
within an AI architecture,
662
00:26:58,880 --> 00:27:03,120
complementing Syntex's focus on SharePoint content and user-facing automation.
663
00:27:03,120 --> 00:27:07,040
Azure Databricks in turn offers an intelligent document processing pattern
664
00:27:07,040 --> 00:27:10,640
built on its lakehouse platform using natively composable AI functions
665
00:27:10,640 --> 00:27:13,440
to implement end-to-end IDP pipelines.
666
00:27:13,440 --> 00:27:16,240
This approach involves ingesting raw documents into the lakehouse,
667
00:27:16,240 --> 00:27:20,480
parsing them into structured representations using functions like ai_parse_document,
668
00:27:20,480 --> 00:27:24,560
enriching them via functions such as ai_extract and ai_classify,
669
00:27:24,560 --> 00:27:28,480
and then leveraging the results for analytics, retrieval-augmented generation,
670
00:27:28,480 --> 00:27:30,080
or agent workflows.
671
00:27:30,080 --> 00:27:32,640
Because each stage of the pipeline, ingestion, parsing,
672
00:27:32,640 --> 00:27:35,120
enrichment, and analysis is unified on the lakehouse,
673
00:27:35,120 --> 00:27:38,800
organizations can avoid complex integration or data movement,
674
00:27:38,800 --> 00:27:41,360
while benefiting from Databricks' scalability
675
00:27:41,360 --> 00:27:43,760
and advanced analytics tooling.
676
00:27:43,760 --> 00:27:46,000
For most SharePoint-centric organizations,
677
00:27:46,000 --> 00:27:47,760
the choice is not between these platforms,
678
00:27:47,760 --> 00:27:49,520
but about how they complement each other.
679
00:27:49,520 --> 00:27:51,600
Syntex handles the SharePoint-native automation.
680
00:27:51,600 --> 00:27:54,960
Azure Document Intelligence handles the API-driven bulk processing.
681
00:27:54,960 --> 00:27:57,920
Content Understanding handles the multimodal enrichment.
682
00:27:57,920 --> 00:28:01,840
Databricks handles the enterprise-scale analytics and machine learning refinement.
683
00:28:01,840 --> 00:28:04,800
The metadata schema is the common thread that ties them together.
684
00:28:04,800 --> 00:28:06,800
When every platform writes to the same taxonomy,
685
00:28:06,800 --> 00:28:08,960
the document becomes portable across systems,
686
00:28:08,960 --> 00:28:12,080
and the metadata becomes the single source of truth.
687
00:28:12,080 --> 00:28:15,600
A practical integration pattern starts with Syntex as the intake layer.
688
00:28:15,600 --> 00:28:19,360
Documents arrive in SharePoint and Syntex extracts the core metadata.
689
00:28:19,360 --> 00:28:23,440
For simple automation, Power Automate reacts to the extracted values directly.
690
00:28:23,440 --> 00:28:27,520
For complex analytics, the metadata and document references flow to Azure Data Lake,
691
00:28:27,520 --> 00:28:30,560
via Power BI dataflows or Azure Synapse Link.
692
00:28:30,560 --> 00:28:32,480
For advanced machine learning refinement,
693
00:28:32,480 --> 00:28:34,480
Databricks reads the labeled documents,
694
00:28:34,480 --> 00:28:37,120
compares model predictions against human corrections,
695
00:28:37,120 --> 00:28:40,720
and generates improved training sets that feed back into Syntex.
696
00:28:40,720 --> 00:28:41,840
The loop is continuous.
697
00:28:41,840 --> 00:28:43,360
The document stays in SharePoint.
698
00:28:43,360 --> 00:28:44,960
The metadata serves every system.
699
00:28:44,960 --> 00:28:47,680
The taxonomy ensures consistency across the pipeline.
700
00:28:47,680 --> 00:28:51,120
The choice of processing platform also depends on latency requirements.
701
00:28:51,120 --> 00:28:53,520
Syntex processes documents synchronously
702
00:28:53,520 --> 00:28:55,120
as they arrive in SharePoint,
703
00:28:55,120 --> 00:28:57,280
making it ideal for real-time automation.
704
00:28:57,440 --> 00:29:00,800
Azure Document Intelligence can process large batches asynchronously,
705
00:29:00,800 --> 00:29:02,960
making it ideal for backlogs and migrations.
706
00:29:02,960 --> 00:29:05,200
Databricks handles scheduled analytics jobs
707
00:29:05,200 --> 00:29:06,960
that don't need real-time results.
708
00:29:06,960 --> 00:29:09,120
Architects who understand these latency profiles
709
00:29:09,120 --> 00:29:12,400
can design pipelines that use each platform for what it does best,
710
00:29:12,400 --> 00:29:15,040
rather than forcing one tool to handle every scenario.
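The latency profiles just described can be written down as a simple lookup. This is a minimal sketch, not a Microsoft API: the `Latency` enum, the `PLATFORM_BY_LATENCY` table, and `choose_platform` are hypothetical names made up for illustration, and the mapping only restates the three pairings mentioned in the episode.

```python
from enum import Enum


class Latency(Enum):
    """Hypothetical latency classes for a document workload."""
    REAL_TIME = "real_time"   # act as the document arrives
    BATCH = "batch"           # backlogs and migrations
    SCHEDULED = "scheduled"   # analytics that can wait for a job run


# Pairings as described in the episode; illustrative only.
PLATFORM_BY_LATENCY = {
    Latency.REAL_TIME: "Syntex",
    Latency.BATCH: "Azure Document Intelligence",
    Latency.SCHEDULED: "Databricks",
}


def choose_platform(latency: Latency) -> str:
    """Return the platform suited to a workload's latency requirement."""
    return PLATFORM_BY_LATENCY[latency]
```

A migration backlog would call `choose_platform(Latency.BATCH)`, while an upload-triggered approval would use `Latency.REAL_TIME`.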
711
00:29:15,040 --> 00:29:18,960
Syntex, SharePoint Premium, and Document Processing,
712
00:29:18,960 --> 00:29:20,240
the rebrand story.
713
00:29:20,240 --> 00:29:24,960
Microsoft has been building IDP capabilities directly into SharePoint for years.
714
00:29:24,960 --> 00:29:27,200
The problem is that they keep renaming the product,
715
00:29:27,200 --> 00:29:28,480
and if you're keeping score,
716
00:29:28,480 --> 00:29:31,200
this marks the fourth rebrand in just six years.
717
00:29:31,200 --> 00:29:33,120
Back in 2019, while still in development,
718
00:29:33,120 --> 00:29:34,800
it was called Project Cortex.
719
00:29:34,800 --> 00:29:37,280
A year later, it became SharePoint Syntex,
720
00:29:37,280 --> 00:29:39,920
focused on AI-powered document understanding.
721
00:29:39,920 --> 00:29:42,640
Then in 2022, it shifted to Microsoft Syntex
722
00:29:42,640 --> 00:29:45,040
as the company pushed content AI across workloads,
723
00:29:45,040 --> 00:29:46,240
not just SharePoint.
724
00:29:46,240 --> 00:29:48,320
At Ignite in November 2023,
725
00:29:48,320 --> 00:29:50,720
Jeff Teper introduced SharePoint Premium,
726
00:29:50,720 --> 00:29:54,080
a suite designed to help you manage, ground, and use content for AI.
727
00:29:54,080 --> 00:29:55,200
The idea was simple.
728
00:29:55,200 --> 00:29:56,800
One brand everyone could recognize,
729
00:29:56,800 --> 00:29:59,760
bringing together all the extra features on top of base SharePoint.
730
00:29:59,760 --> 00:30:02,320
SharePoint Premium was organized into three pillars,
731
00:30:02,320 --> 00:30:04,640
Experiences, Processes, and Governance.
732
00:30:04,640 --> 00:30:07,120
For content experiences, it included brand new features
733
00:30:07,120 --> 00:30:10,400
like the Agreements app in Teams, SharePoint e-signature,
734
00:30:10,400 --> 00:30:12,720
and the Documents Hub for Customers and Partners.
735
00:30:12,720 --> 00:30:15,280
The content processing pillar carried forward Syntex classics
736
00:30:15,280 --> 00:30:16,640
like Autofill columns,
737
00:30:16,640 --> 00:30:18,480
taxonomy tagging, content query,
738
00:30:18,480 --> 00:30:20,880
translation for both documents and videos,
739
00:30:20,880 --> 00:30:22,880
PDF annotations, and more.
740
00:30:22,880 --> 00:30:25,040
This is where you transform content with AI,
741
00:30:25,040 --> 00:30:26,880
not just to improve user experience,
742
00:30:26,880 --> 00:30:29,280
but also to prepare your content for Copilot.
743
00:30:29,280 --> 00:30:30,640
The better your content quality,
744
00:30:30,640 --> 00:30:32,720
the better your Copilot rollout will be.
745
00:30:32,720 --> 00:30:35,360
The governance pillar introduced advanced admin capabilities
746
00:30:35,360 --> 00:30:37,040
like SharePoint Advanced Management,
747
00:30:37,040 --> 00:30:40,720
Microsoft 365 Archive, and Microsoft 365 Backup.
748
00:30:40,720 --> 00:30:42,960
But now, almost two years later, here we are again.
749
00:30:42,960 --> 00:30:44,960
The funny part is that the rename from Syntex
750
00:30:44,960 --> 00:30:47,360
to SharePoint Premium wasn't even fully completed.
751
00:30:47,360 --> 00:30:48,640
As of late 2025,
752
00:30:48,640 --> 00:30:50,800
you could still see Syntex mentioned everywhere
753
00:30:50,800 --> 00:30:53,200
in the documentation and inside the admin center.
754
00:30:53,200 --> 00:30:54,960
Microsoft decided to split things up.
755
00:30:54,960 --> 00:30:57,280
SharePoint Advanced Management, Backup, and Archive
756
00:30:57,280 --> 00:30:58,800
are now standalone products.
757
00:30:58,800 --> 00:31:00,080
They're no longer under an umbrella,
758
00:31:00,080 --> 00:31:01,120
they're their own thing.
759
00:31:01,120 --> 00:31:03,120
Everything that used to be Syntex is now bundled
760
00:31:03,120 --> 00:31:05,360
under a new umbrella called Document Processing
761
00:31:05,360 --> 00:31:06,720
for Microsoft 365,
762
00:31:06,720 --> 00:31:09,360
and yes, Microsoft is very particular about branding.
763
00:31:09,360 --> 00:31:11,120
In the documentation, you'll notice it's written
764
00:31:11,120 --> 00:31:13,360
with a lowercase d and p because it's a category,
765
00:31:13,360 --> 00:31:14,560
not a product name.
766
00:31:14,560 --> 00:31:16,400
If you check Microsoft Learn right now,
767
00:31:16,400 --> 00:31:19,600
you'll see that this rename helps clarify their priorities.
768
00:31:19,600 --> 00:31:22,960
At the Microsoft 365 Conference in May 2026,
769
00:31:22,960 --> 00:31:26,000
they named four services that will get the most attention
770
00:31:26,000 --> 00:31:26,800
going forward.
771
00:31:26,800 --> 00:31:29,840
Autofill columns, document translation, OCR, and e-signatures.
772
00:31:29,840 --> 00:31:31,360
These are their priorities.
773
00:31:31,360 --> 00:31:33,440
Meanwhile, older features from Syntex,
774
00:31:33,440 --> 00:31:36,080
like document processing models, content assembly,
775
00:31:36,080 --> 00:31:38,400
taxonomy tagging, and image tagging,
776
00:31:38,400 --> 00:31:40,960
have been pushed into the past generations bucket.
777
00:31:40,960 --> 00:31:42,640
In the documentation, they're tucked away
778
00:31:42,640 --> 00:31:44,960
under other document processing services.
779
00:31:44,960 --> 00:31:46,800
Microsoft clearly doesn't want these to be seen
780
00:31:46,800 --> 00:31:47,840
as main services anymore.
781
00:31:47,840 --> 00:31:50,880
They're still available, but they're not part of the future roadmap.
782
00:31:50,880 --> 00:31:53,200
To be clear, Microsoft has not announced any plans
783
00:31:53,200 --> 00:31:55,600
to deprecate or stop supporting these older features.
784
00:31:55,600 --> 00:31:57,440
So if you're using them in production,
785
00:31:57,440 --> 00:31:59,840
don't worry, there's no emergency to move off them.
786
00:31:59,840 --> 00:32:02,000
But it's obvious which features will get the most love
787
00:32:02,000 --> 00:32:04,000
and investment moving forward.
788
00:32:04,000 --> 00:32:06,640
If you're relying heavily on older machine learning-based features
789
00:32:06,640 --> 00:32:09,040
like unstructured document processing models
790
00:32:09,040 --> 00:32:10,000
or content assembly,
791
00:32:10,000 --> 00:32:12,400
this is a good moment to start planning for the future.
792
00:32:12,400 --> 00:32:15,120
Look into whether newer tools like Autofill columns
793
00:32:15,120 --> 00:32:17,120
or Copilot can replace those features.
794
00:32:17,120 --> 00:32:19,680
It's not urgent, but moving in the same direction as Microsoft
795
00:32:19,680 --> 00:32:21,280
usually means you'll get newer features,
796
00:32:21,280 --> 00:32:24,240
a smoother experience, and maybe even cost savings.
797
00:32:24,240 --> 00:32:25,840
The operating model is changing,
798
00:32:25,840 --> 00:32:27,840
and architects need to understand what that means
799
00:32:27,840 --> 00:32:29,280
for existing investments.
800
00:32:29,280 --> 00:32:31,280
If you've already invested in Microsoft Syntex
801
00:32:31,280 --> 00:32:33,040
and built models, bound them to libraries
802
00:32:33,040 --> 00:32:35,280
and wired Power Automate flows around them,
803
00:32:35,280 --> 00:32:36,320
you're not behind.
804
00:32:36,320 --> 00:32:38,720
You're actually in a better spot than tenants starting from scratch
805
00:32:38,720 --> 00:32:41,280
because the work you did is still valid and can be converted.
806
00:32:41,280 --> 00:32:43,680
But the operating model for that work is changing,
807
00:32:43,680 --> 00:32:46,720
and you need to develop a plan for what you can do moving forward.
808
00:32:46,720 --> 00:32:48,240
Nothing you built is wasted,
809
00:32:48,240 --> 00:32:49,920
but nothing you built is finished either.
810
00:32:49,920 --> 00:32:52,320
Treat this as a refresh, not a re-platforming.
811
00:32:52,320 --> 00:32:54,320
The first step in that plan is inventory.
812
00:32:54,320 --> 00:32:56,800
Pull a list of every Syntex model in your tenant,
813
00:32:56,800 --> 00:32:57,840
what it is bound to,
814
00:32:57,840 --> 00:32:59,760
and what depends on its downstream output,
815
00:32:59,760 --> 00:33:02,480
flows, custom apps, and third-party integrations.
816
00:33:02,480 --> 00:33:04,320
Microsoft's re-use story is real,
817
00:33:04,320 --> 00:33:06,240
but it is a re-use story for the model.
818
00:33:06,240 --> 00:33:08,800
The shift from a human-trained model will not be one-to-one
819
00:33:08,800 --> 00:33:11,280
when moving to large-language-model-based classification
820
00:33:11,280 --> 00:33:12,160
and extraction.
821
00:33:12,160 --> 00:33:14,240
You need to know which models are business critical,
822
00:33:14,240 --> 00:33:15,360
which are experimental,
823
00:33:15,360 --> 00:33:16,640
and which have been abandoned.
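One way to make the inventory step concrete is a small record per model plus a triage rule. This is a sketch under stated assumptions: `ModelRecord`, its fields, and the `triage` heuristic (no bindings or no recent activity means abandoned, downstream dependents mean business-critical) are hypothetical and not something Microsoft provides. You would fill the records from whatever export or manual audit you have.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Hypothetical inventory row for one model in the tenant."""
    name: str
    bound_libraries: list[str]
    # Flows, custom apps, and third-party integrations that consume the output.
    dependents: list[str] = field(default_factory=list)
    files_processed_last_90_days: int = 0


def triage(record: ModelRecord) -> str:
    """Classify a model using an illustrative heuristic, not an official rule."""
    if not record.bound_libraries or record.files_processed_last_90_days == 0:
        return "abandoned"
    if record.dependents:
        return "business-critical"
    return "experimental"
```

The 90-day window is an arbitrary choice for the sketch; pick whatever period matches how often your libraries actually receive documents.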
824
00:33:16,640 --> 00:33:18,400
The second step is governance review.
825
00:33:18,400 --> 00:33:20,720
The libraries where you turned Syntex on were,
826
00:33:20,720 --> 00:33:24,000
by definition, libraries with high-value, structured content.
827
00:33:24,000 --> 00:33:26,160
Those are now the libraries Copilot and AI
828
00:33:26,160 --> 00:33:28,560
in SharePoint will reason over by default.
829
00:33:28,560 --> 00:33:31,280
Re-ask the questions that were used to build those models.
830
00:33:31,280 --> 00:33:33,680
Decide whether there are better AI in SharePoint features
831
00:33:33,680 --> 00:33:35,600
to execute the same governance actions.
832
00:33:35,600 --> 00:33:39,280
A model that classifies contracts might be replaced by an autofill column;
833
00:33:39,280 --> 00:33:42,320
a model that extracts invoice data might be replaced by a skill.
834
00:33:42,320 --> 00:33:45,280
The function might survive even if the implementation changes.
835
00:33:45,280 --> 00:33:48,000
The third step is mapping by feature.
836
00:33:48,000 --> 00:33:50,560
Use the one-to-one mapping from Microsoft documentation
837
00:33:50,560 --> 00:33:54,000
to figure out what you need to enable and learn with AI in SharePoint.
838
00:33:54,000 --> 00:33:57,600
Syntex Autofill columns map to the new create Autofill columns experience.
839
00:33:57,600 --> 00:33:59,920
Document translation remains a metered content service
840
00:33:59,920 --> 00:34:02,720
with Copilot providing conversational entry points.
841
00:34:02,720 --> 00:34:06,400
OCR becomes part of the baseline AI-ready content pipeline.
842
00:34:06,400 --> 00:34:10,080
Content assembly shifts to template-driven generation in content AI.
843
00:34:10,080 --> 00:34:13,600
Image tagging and taxonomy tagging are internalized into the broader
844
00:34:13,600 --> 00:34:18,640
AI-ready content concept. Pre-built, structured, freeform, and unstructured models
845
00:34:18,640 --> 00:34:22,160
all form the processing substrate for higher-level agent actions.
846
00:34:22,160 --> 00:34:25,360
Understanding this mapping prevents you from rebuilding what you already have.
847
00:34:25,360 --> 00:34:28,000
The fourth step is piloting skills somewhere boring.
848
00:34:28,000 --> 00:34:28,960
Skills are interesting.
849
00:34:28,960 --> 00:34:32,400
Agent-driven orchestration is genuinely new and the agent assets library
850
00:34:32,400 --> 00:34:36,800
deserves a real governance review before you let business-critical workflows live there.
851
00:34:36,800 --> 00:34:40,080
Pick a low-stakes site, an internal team that will not be hurt
852
00:34:40,080 --> 00:34:43,440
if something gets edited, deleted or executed unexpectedly
853
00:34:43,440 --> 00:34:44,720
and learn there.
854
00:34:44,720 --> 00:34:47,120
The interesting skills are the ones that touch content.
855
00:34:47,120 --> 00:34:49,840
The interesting failures are the ones that touch the wrong content.
856
00:34:49,840 --> 00:34:51,680
The fifth step is keeping the receipts.
857
00:34:51,680 --> 00:34:55,040
Don't tear down your Microsoft Syntex configuration documentation,
858
00:34:55,040 --> 00:34:57,840
your model training notes or your cost baselines yet.
859
00:34:57,840 --> 00:35:00,240
We're in the middle of a transition, the meter is still billing,
860
00:35:00,240 --> 00:35:04,720
and "Microsoft moved my cheese" is going to be a recurring conversation with your finance partner.
861
00:35:04,720 --> 00:35:07,680
The documentation you have now is an asset you'll want later.
862
00:35:07,680 --> 00:35:11,280
It proves what you built, how much it cost and what it delivered.
863
00:35:11,280 --> 00:35:14,240
That evidence is what secures budget for the next phase.
864
00:35:14,240 --> 00:35:17,440
Model types: structured, freeform, and unstructured.
865
00:35:17,440 --> 00:35:20,480
Behind all the branding changes, the core technology hasn't changed.
866
00:35:20,480 --> 00:35:21,520
It's still models.
867
00:35:21,520 --> 00:35:25,360
Microsoft defines several model types tuned to different classes of documents.
868
00:35:25,360 --> 00:35:29,760
Understanding which type fits your content is the first real decision you need to make.
869
00:35:29,760 --> 00:35:33,840
Structured document processing is for forms and layouts with predictable fields.
870
00:35:33,840 --> 00:35:38,640
Invoices, purchase orders, tax forms, and standardized applications all fall into this category.
871
00:35:38,640 --> 00:35:41,520
Because the fields appear in consistent locations and formats,
872
00:35:41,520 --> 00:35:45,600
structured models can map specific regions of a page to named columns,
873
00:35:45,600 --> 00:35:49,120
achieving high accuracy with relatively little training data.
874
00:35:49,120 --> 00:35:52,400
You upload a few sample documents, draw boxes around the fields you want,
875
00:35:52,400 --> 00:35:55,280
and the model learns to find those same fields on new documents.
876
00:35:55,280 --> 00:35:57,280
This is the fastest path to value.
877
00:35:57,280 --> 00:36:00,080
If your documents are mostly structured, start here.
878
00:36:00,080 --> 00:36:02,080
Freeform document processing sits in the middle.
879
00:36:02,080 --> 00:36:06,560
These are documents that contain fields in variable positions but with recognizable patterns.
880
00:36:06,560 --> 00:36:10,400
A statement of work might list deliverables, timelines and budgets in different places
881
00:36:10,400 --> 00:36:13,440
depending on the client, but the language patterns are consistent.
882
00:36:13,440 --> 00:36:18,160
A health insurance claim might mix structured tables with freeform provider notes.
883
00:36:18,160 --> 00:36:21,760
Freeform models are backed by AI Builder and use a combination of layout
884
00:36:21,760 --> 00:36:23,360
understanding and language recognition.
885
00:36:23,360 --> 00:36:26,080
They require more training examples than structured models,
886
00:36:26,080 --> 00:36:29,200
but handle variability that rigid templates cannot capture.
887
00:36:29,200 --> 00:36:33,360
Unstructured document processing is the most complex and the most valuable for governance.
888
00:36:33,360 --> 00:36:35,840
Contracts, policies, memos, and correspondence
889
00:36:35,840 --> 00:36:39,360
lack fixed layouts and rely heavily on linguistic cues and context.
890
00:36:39,360 --> 00:36:41,600
An effective date might appear after the phrase
891
00:36:41,600 --> 00:36:46,000
"this agreement is effective as of" in one contract and after "commencement date" in another.
892
00:36:46,000 --> 00:36:48,800
A liability cap might be expressed as a dollar amount,
893
00:36:48,800 --> 00:36:51,760
a percentage of revenue or a multiple of fees.
894
00:36:51,760 --> 00:36:55,040
Unstructured models use natural language processing and machine learning
895
00:36:55,040 --> 00:36:57,680
to infer meaning from context, not just location.
896
00:36:57,680 --> 00:37:01,920
They need more training examples, more careful field definition and more iterative refinement,
897
00:37:01,920 --> 00:37:05,200
but they are the ones that transform document dumps into governed,
898
00:37:05,200 --> 00:37:06,880
searchable, actionable libraries.
899
00:37:06,880 --> 00:37:11,440
For common business scenarios, Microsoft provides pre-built models that do not require training.
900
00:37:11,440 --> 00:37:13,440
These include models for contract processing,
901
00:37:13,440 --> 00:37:16,960
invoices, receipts, sensitive information and simple documents.
902
00:37:16,960 --> 00:37:21,120
These pre-built models aim to accelerate value by addressing everyday document types
903
00:37:21,120 --> 00:37:24,400
without requiring organizations to label their own training sets,
904
00:37:24,400 --> 00:37:27,520
while still allowing customization where domain-specific needs arise.
905
00:37:27,520 --> 00:37:30,160
If your documents match one of these types,
906
00:37:30,160 --> 00:37:33,600
the pre-built model can get you live in hours instead of weeks.
907
00:37:33,600 --> 00:37:37,840
From a machine learning standpoint, unstructured document classification and extraction
908
00:37:37,840 --> 00:37:42,800
often use natural language processing models that encode text into numerical representations,
909
00:37:42,800 --> 00:37:46,960
using techniques such as term frequency or modern transformer-based embeddings.
910
00:37:46,960 --> 00:37:51,520
Models including naive Bayes, support vector machines,
911
00:37:51,520 --> 00:37:54,240
logistic regression and deep learning architectures,
912
00:37:54,240 --> 00:37:57,040
like transformers or convolutional neural networks,
913
00:37:57,040 --> 00:38:02,160
can then be trained to map these representations to labels such as document type or sensitivity
914
00:38:02,160 --> 00:38:04,960
or to extract spans corresponding to entities.
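To make the term-frequency idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier over raw word counts, written in plain Python. It is an illustration of the general technique only. It is not how Syntex or AI Builder is implemented, and the class name, the whitespace tokenizer, and the tiny training set are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; real pipelines tokenize more carefully."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial naive Bayes over term counts with add-one smoothing."""

    def __init__(self) -> None:
        self.term_counts = defaultdict(Counter)  # label -> term frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts: list[str], labels: list[str]) -> None:
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.term_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(n_docs / total_docs)
            for token in tokenize(text):
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayesClassifier()
clf.fit(
    [
        "invoice total amount due payment",
        "invoice number vendor total",
        "agreement effective date liability",
        "agreement termination clause governing law",
    ],
    ["invoice", "invoice", "contract", "contract"],
)
```

With four training sentences this is a toy, but the shape is the same at scale: text becomes counts, counts become probabilities, and the highest-scoring label wins.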
915
00:38:04,960 --> 00:38:09,360
As organizations accumulate labeled examples from human review or Syntex feedback,
916
00:38:09,360 --> 00:38:11,520
they can iteratively improve these models,
917
00:38:11,520 --> 00:38:15,520
whether through built-in training capabilities or external machine learning platforms.
918
00:38:15,520 --> 00:38:17,040
The choice depends on your documents.
919
00:38:17,040 --> 00:38:19,920
Start with pre-built if your documents match a standard type,
920
00:38:19,920 --> 00:38:22,080
move to structured if your forms are consistent,
921
00:38:22,080 --> 00:38:24,560
use freeform for semi-structured variability,
922
00:38:24,560 --> 00:38:28,640
and reserve unstructured for the complex narrative documents that carry the most business risk.
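That ordering reads naturally as a short decision function. The sketch below is a hypothetical helper, not a Microsoft tool: the three boolean questions and the returned labels are assumptions that simply restate the rule of thumb above.

```python
def choose_model_type(
    matches_prebuilt: bool,
    fixed_layout: bool,
    recognizable_fields: bool,
) -> str:
    """Pick a model type by asking the cheapest question first."""
    if matches_prebuilt:
        return "prebuilt"      # standard document type, no training needed
    if fixed_layout:
        return "structured"    # fields sit in consistent positions
    if recognizable_fields:
        return "freeform"      # positions vary, patterns are recognizable
    return "unstructured"      # narrative documents, meaning comes from language
```

Invoices that arrive in many vendor layouts would answer no, no, yes and land on freeform; contracts with no consistent layout or field pattern fall through to unstructured.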
923
00:38:28,640 --> 00:38:30,640
And remember, you don't need a data science degree.
924
00:38:30,640 --> 00:38:35,200
The no-code interfaces in Syntex and AI Builder let subject matter experts label documents,
925
00:38:35,200 --> 00:38:37,920
define fields, and train models without writing code.
926
00:38:37,920 --> 00:38:40,720
The model learns from your expertise, not the other way around.
927
00:38:40,720 --> 00:38:43,200
Let's walk through a concrete decision scenario.
928
00:38:43,200 --> 00:38:47,280
Your finance department receives 500 invoices per month from 40 different vendors.
929
00:38:47,280 --> 00:38:49,200
Some vendors use their own templates.
930
00:38:49,200 --> 00:38:52,080
Others send PDFs generated from accounting software.
931
00:38:52,080 --> 00:38:54,240
A few still send scanned paper invoices.
932
00:38:54,240 --> 00:38:55,920
The header information is always present,
933
00:38:55,920 --> 00:38:58,320
but the line items vary in number and layout.
934
00:38:58,320 --> 00:39:00,720
The total amount might appear at the bottom right,
935
00:39:00,720 --> 00:39:03,600
in a summary box, or in a payment instructions block.
936
00:39:03,600 --> 00:39:05,280
This is a semi-structured problem.
937
00:39:05,280 --> 00:39:08,000
A structured model would fail on the vendor-specific variations.
938
00:39:08,000 --> 00:39:10,960
An unstructured model would be overkill and slower to train.
939
00:39:10,960 --> 00:39:14,400
A freeform model, possibly starting from the pre-built invoice processor,
940
00:39:14,400 --> 00:39:15,360
is the right fit.
941
00:39:15,360 --> 00:39:16,880
Now consider your legal department.
942
00:39:16,880 --> 00:39:18,720
They manage 2,000 active contracts,
943
00:39:18,720 --> 00:39:20,400
including master service agreements,
944
00:39:20,400 --> 00:39:23,200
statements of work, amendments, and non-disclosure agreements.
945
00:39:23,200 --> 00:39:25,280
Each law firm uses its own template.
946
00:39:25,280 --> 00:39:26,880
Some contracts are 50 pages.
947
00:39:26,880 --> 00:39:27,760
Others are 5.
948
00:39:27,760 --> 00:39:30,000
Governing law might appear in a dedicated section,
949
00:39:30,000 --> 00:39:33,440
in a footer, or in a governing law rider attached as a separate document.
950
00:39:33,440 --> 00:39:36,480
There are no fixed layouts, no consistent field positions.
951
00:39:36,480 --> 00:39:39,200
The language varies between plain English and dense legalese.
952
00:39:39,200 --> 00:39:40,720
This is an unstructured problem.
953
00:39:40,720 --> 00:39:44,160
Only an unstructured document processing model can extract effective dates,
954
00:39:44,160 --> 00:39:46,320
termination clauses, liability caps,
955
00:39:46,320 --> 00:39:48,640
and renewal terms across this variability.
956
00:39:48,640 --> 00:39:51,360
The pre-built contract model gives you a head start,
957
00:39:51,360 --> 00:39:55,680
but you will need to customize it heavily for your organization's specific contract language.
958
00:39:55,680 --> 00:39:59,920
The machine learning behind these models is more accessible than most people assume.
959
00:39:59,920 --> 00:40:01,920
For structured and freeform documents,
960
00:40:01,920 --> 00:40:04,400
AI Builder uses a form recognizer
961
00:40:04,400 --> 00:40:07,120
that learns layout patterns from your labeled examples.
962
00:40:07,120 --> 00:40:10,080
It creates a geometric map of where fields typically appear
963
00:40:10,080 --> 00:40:12,480
and refines that map as it sees more documents.
964
00:40:12,480 --> 00:40:15,680
For unstructured documents, Syntex uses natural language processing models
965
00:40:15,680 --> 00:40:18,240
that encode text into numerical representations.
966
00:40:18,240 --> 00:40:21,680
These representations capture semantic meaning, not just keyword matching.
967
00:40:21,680 --> 00:40:24,320
The model learns that "effective date" and "commencement date"
968
00:40:24,320 --> 00:40:27,760
and "start date" are semantically equivalent in contract contexts.
969
00:40:27,760 --> 00:40:30,880
It learns that a dollar amount following "liability shall not exceed"
970
00:40:30,880 --> 00:40:32,160
is a liability cap,
971
00:40:32,160 --> 00:40:35,920
while a dollar amount following "total amount due" is an invoice total.
972
00:40:35,920 --> 00:40:39,120
This semantic understanding is what makes unstructured extraction powerful
973
00:40:39,120 --> 00:40:42,320
and it is why the training data must include linguistic diversity
974
00:40:42,320 --> 00:40:43,680
not just layout diversity.
975
00:40:43,680 --> 00:40:46,960
Model accuracy follows a predictable curve.
976
00:40:46,960 --> 00:40:51,120
With structured documents, you might reach 80% accuracy with 20 labeled examples
977
00:40:51,120 --> 00:40:52,800
and 95% with 50.
978
00:40:52,800 --> 00:40:54,880
With freeform documents, the curve is flatter.
979
00:40:54,880 --> 00:40:59,040
You might need 100 examples to reach 80% and 300 to reach 95.
980
00:40:59,040 --> 00:41:01,920
With unstructured documents, the curve is steeper and longer.
981
00:41:01,920 --> 00:41:05,040
You might need 200 examples to reach 70% accuracy
982
00:41:05,040 --> 00:41:07,760
and continuous iteration to push past 85.
983
00:41:07,760 --> 00:41:10,240
The organizations that succeed plan for this curve.
984
00:41:10,240 --> 00:41:12,880
They allocate time for multiple training cycles.
985
00:41:12,880 --> 00:41:15,680
They build feedback loops where users correct extractions
986
00:41:15,680 --> 00:41:17,520
and those corrections feed back into the model.
987
00:41:17,520 --> 00:41:19,680
They measure accuracy per field, not per document
988
00:41:19,680 --> 00:41:21,680
because some fields are harder than others.
989
00:41:21,680 --> 00:41:24,720
A model that extracts invoice totals at 98% accuracy
990
00:41:24,720 --> 00:41:28,800
but vendor names at 70% accuracy needs more training on vendor name variants
991
00:41:28,800 --> 00:41:30,320
not more training on totals.
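Measuring accuracy per field is straightforward once you have model output and human-corrected values side by side. This is a minimal sketch that assumes each document is represented as a dictionary of field name to value. The function name, the field names, and the sample values are hypothetical.

```python
from collections import defaultdict


def per_field_accuracy(
    predictions: list[dict[str, str]],
    corrections: list[dict[str, str]],
) -> dict[str, float]:
    """Compare model output to human-corrected values, field by field."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, truth in zip(predictions, corrections):
        for field_name, true_value in truth.items():
            total[field_name] += 1
            if predicted.get(field_name) == true_value:
                correct[field_name] += 1
    return {name: correct[name] / total[name] for name in total}


predicted = [
    {"total": "100.00", "vendor": "Contoso"},
    {"total": "250.00", "vendor": "Fabrikam Inc"},
]
corrected = [
    {"total": "100.00", "vendor": "Contoso"},
    {"total": "250.00", "vendor": "Fabrikam"},
]
scores = per_field_accuracy(predicted, corrected)
```

Here the totals match on both documents while the vendor name matches on only one, which points the next training cycle at vendor name variants rather than totals.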
992
00:41:30,320 --> 00:41:33,040
AI agents, skills, and the new operating model.
993
00:41:33,040 --> 00:41:35,120
But choosing the model type is just the warm-up.
994
00:41:35,120 --> 00:41:36,800
The real work starts when you train it
995
00:41:36,800 --> 00:41:40,160
and in 2026 the training environment itself just changed.
996
00:41:40,160 --> 00:41:42,480
The architectural shift that Microsoft is pushing
997
00:41:42,480 --> 00:41:45,200
is a move from feature-by-feature Syntex configuration
998
00:41:45,200 --> 00:41:48,080
to conversational, intent-based AI agents.
999
00:41:48,080 --> 00:41:49,840
Every SharePoint site now has an agent.
1000
00:41:49,840 --> 00:41:52,800
The SharePoint admin agent monitors governance automatically,
1001
00:41:52,800 --> 00:41:56,000
identifying inactive sites, flagging overshared content,
1002
00:41:56,000 --> 00:41:59,040
tracking permission sprawl and highlighting high-activity sites
1003
00:41:59,040 --> 00:42:02,000
that may need additional attention as Copilot adoption grows.
1004
00:42:02,000 --> 00:42:04,640
Rather than requiring manual intervention,
1005
00:42:04,640 --> 00:42:06,880
the admin agent automatically applies policies
1006
00:42:06,880 --> 00:42:08,720
such as archiving inactive sites
1007
00:42:08,720 --> 00:42:11,840
or adjusting access permissions to reduce security risks.
1008
00:42:11,840 --> 00:42:15,840
Site-specific agents create instant knowledge bases for every SharePoint site.
1009
00:42:15,840 --> 00:42:19,360
These agents understand the content, structure and purpose of their respective sites
1010
00:42:19,360 --> 00:42:22,160
allowing employees to quickly tap into relevant information
1011
00:42:22,160 --> 00:42:24,320
without complex searches or navigation.
1012
00:42:24,320 --> 00:42:26,000
Site agents can answer questions like
1013
00:42:26,000 --> 00:42:28,480
"What are the key deliverables for the current quarter?"
1014
00:42:28,480 --> 00:42:32,000
by synthesizing information from documents, lists, and pages within their site.
1015
00:42:32,000 --> 00:42:35,120
Organizations can also build custom agents directly
1016
00:42:35,120 --> 00:42:37,600
within SharePoint using Copilot Studio integration.
1017
00:42:37,600 --> 00:42:39,440
These agents connect to SharePoint lists,
1018
00:42:39,440 --> 00:42:42,080
document libraries and pages as knowledge sources,
1019
00:42:42,080 --> 00:42:44,400
providing real-time access to current data.
1020
00:42:44,400 --> 00:42:47,440
Custom agents handle specific business processes such as employee
1021
00:42:47,440 --> 00:42:49,680
onboarding assistance, policy question answering,
1022
00:42:49,680 --> 00:42:52,240
project status updates, or equipment request management,
1023
00:42:52,240 --> 00:42:55,040
all without requiring extensive development resources.
1024
00:42:55,040 --> 00:42:56,480
Skills are the new automation layer.
1025
00:42:56,480 --> 00:43:00,160
A skill is a reusable AI workflow stored as a markdown definition
1026
00:43:00,160 --> 00:43:02,240
in the site's agent-assets library.
1027
00:43:02,240 --> 00:43:05,840
Skills describe repeatable AI workflows over the site's content,
1028
00:43:05,840 --> 00:43:08,560
such as review, enrich, or route operations.
1029
00:43:08,560 --> 00:43:10,240
You write a skill in natural language,
1030
00:43:10,240 --> 00:43:11,840
describing what you want to happen,
1031
00:43:11,840 --> 00:43:13,840
and the agent executes it across the site.
1032
00:43:13,840 --> 00:43:16,160
For example, you might write a skill that says
1033
00:43:16,160 --> 00:43:18,720
"For every invoice uploaded to the finance library,
1034
00:43:18,720 --> 00:43:21,920
extract the vendor name, invoice number, and total amount,
1035
00:43:21,920 --> 00:43:24,560
then send a notification to the accounts payable team
1036
00:43:24,560 --> 00:43:26,320
if the total exceeds $10,000."
1037
00:43:26,320 --> 00:43:29,440
The skill chains extraction, metadata update,
1038
00:43:29,440 --> 00:43:32,640
list operation and notification into a single automated workflow.
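For readers who think in code, the logic that example skill describes can be sketched as follows. This is not the skill format and not an API the agent exposes. It is a hypothetical Python equivalent of the rule, with made-up names (`ExtractedInvoice`, `run_invoice_skill`, the column names), useful mainly as a reference when you check what the skill actually did on a test library.

```python
from dataclasses import dataclass

NOTIFY_THRESHOLD = 10_000.0  # dollars, taken from the example skill above


@dataclass
class ExtractedInvoice:
    """Fields the example skill asks the agent to extract."""
    vendor_name: str
    invoice_number: str
    total_amount: float


def run_invoice_skill(invoice: ExtractedInvoice) -> dict:
    """Return the metadata update and whether accounts payable is notified."""
    return {
        "metadata": {
            # Column names are illustrative; use your library's own columns.
            "VendorName": invoice.vendor_name,
            "InvoiceNumber": invoice.invoice_number,
            "TotalAmount": invoice.total_amount,
        },
        "notify_accounts_payable": invoice.total_amount > NOTIFY_THRESHOLD,
    }
```

Writing the rule down this way also exposes the edge case the natural-language version leaves open: an invoice of exactly $10,000 does not exceed the threshold, so it triggers no notification.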
1039
00:43:32,640 --> 00:43:35,120
This sounds like the future and in many ways it is,
1040
00:43:35,120 --> 00:43:37,120
but there are drawbacks nobody is talking about.
1041
00:43:37,120 --> 00:43:39,200
Skills have no version control and no rollback.
1042
00:43:39,200 --> 00:43:40,960
They are stored as Markdown in SharePoint:
1043
00:43:40,960 --> 00:43:43,200
edit and save, and it's live immediately.
1044
00:43:43,200 --> 00:43:46,080
There is no git history, no undo and no sandbox environment.
1045
00:43:46,080 --> 00:43:49,280
You write the skill, submit it, and it runs on real data.
1046
00:43:49,280 --> 00:43:51,680
One mistake and you've modified thousands of records.
1047
00:43:51,680 --> 00:43:53,600
And by default, everyone can create a skill.
1048
00:43:53,600 --> 00:43:55,440
That is a governance problem waiting to happen.
1049
00:43:55,440 --> 00:43:58,000
Autofill columns represent another major shift.
1050
00:43:58,000 --> 00:43:59,840
In the Syntex era, you built a model,
1051
00:43:59,840 --> 00:44:02,400
bound it to a library and configured extractors.
1052
00:44:02,400 --> 00:44:06,080
In the AI era, users just ask Copilot to fill library columns:
1053
00:44:06,080 --> 00:44:08,400
no model galleries, no complex binding,
1054
00:44:08,400 --> 00:44:10,000
just a natural language prompt.
1055
00:44:10,000 --> 00:44:12,000
But Autofill has its own limitations.
1056
00:44:12,000 --> 00:44:13,920
Configurations live in your document library
1057
00:44:13,920 --> 00:44:15,840
and can't be reused in another library.
1058
00:44:15,840 --> 00:44:18,560
Each library requires its own separate setup.
1059
00:44:18,560 --> 00:44:20,560
Testing is limited to one file at a time;
1060
00:44:20,560 --> 00:44:22,560
with thousands of files, you're flying blind.
1061
00:44:22,560 --> 00:44:24,240
And when you run batch prompt testing,
1062
00:44:24,240 --> 00:44:26,320
there's no preview of what's actually happening.
1063
00:44:26,320 --> 00:44:28,160
You submit the batch and hope it works.
1064
00:44:28,160 --> 00:44:30,400
No visibility into errors or extraction quality
1065
00:44:30,400 --> 00:44:31,600
until it's too late.
1066
00:44:31,600 --> 00:44:33,600
There is also a critical architectural gap.
1067
00:44:33,600 --> 00:44:36,320
There is no native ability to perform sensitivity
1068
00:44:36,320 --> 00:44:39,520
and retention labeling with AI in SharePoint currently.
1069
00:44:39,520 --> 00:44:41,440
The agents and skills can extract metadata,
1070
00:44:41,440 --> 00:44:43,440
classify documents, and update columns,
1071
00:44:43,440 --> 00:44:45,120
but they do not natively apply
1072
00:44:45,120 --> 00:44:48,080
Purview sensitivity labels or retention policies.
1073
00:44:48,080 --> 00:44:50,480
That means organizations still need SharePoint's
1074
00:44:50,480 --> 00:44:53,360
native governance controls, Power Automate orchestration,
1075
00:44:53,360 --> 00:44:55,200
or custom integrations to bridge the gap
1076
00:44:55,200 --> 00:44:58,240
between extracted metadata and enforced compliance.
1077
00:44:58,240 --> 00:45:01,280
The AI layer handles extraction and classification,
1078
00:45:01,280 --> 00:45:03,280
Purview handles enforcement.
1079
00:45:03,280 --> 00:45:04,960
And the connection between them is still manual.
1080
00:45:04,960 --> 00:45:07,120
The one-to-one mapping from Syntex to AI
1081
00:45:07,120 --> 00:45:08,960
in SharePoint is helpful for architects
1082
00:45:08,960 --> 00:45:10,480
trying to manage the transition.
1083
00:45:10,480 --> 00:45:12,240
Syntex Autofill columns map directly
1084
00:45:12,240 --> 00:45:14,240
to the new create Autofill columns experience
1085
00:45:14,240 --> 00:45:15,760
in AI-enabled libraries.
1086
00:45:15,760 --> 00:45:18,240
Document translation remains a metered content service
1087
00:45:18,240 --> 00:45:20,480
with Copilot providing conversational entry points
1088
00:45:20,480 --> 00:45:21,840
for ad hoc translation.
1089
00:45:21,840 --> 00:45:25,360
OCR becomes part of the baseline AI-ready content pipeline,
1090
00:45:25,360 --> 00:45:27,520
triggered when users initiate interactions
1091
00:45:27,520 --> 00:45:30,000
rather than running automatically on every upload.
1092
00:45:30,000 --> 00:45:32,960
Content assembly shifts to template driven document generation
1093
00:45:32,960 --> 00:45:36,560
in content AI combined with Copilot-driven content generation.
1094
00:45:36,560 --> 00:45:39,040
Image tagging and taxonomy tagging are internalized
1095
00:45:39,040 --> 00:45:41,280
into the broader AI ready content concept,
1096
00:45:41,280 --> 00:45:43,520
surfaced through improved search and Autofill
1097
00:45:43,520 --> 00:45:45,600
rather than standalone configurations.
1098
00:45:45,600 --> 00:45:48,640
Prebuilt, structured, freeform, and unstructured models
1099
00:45:48,640 --> 00:45:50,240
all form the processing substrate
1100
00:45:50,240 --> 00:45:51,920
for higher level agent actions.
1101
00:45:51,920 --> 00:45:54,800
What this means in practice is that your existing Syntex models
1102
00:45:54,800 --> 00:45:55,920
are not wasted.
1103
00:45:55,920 --> 00:45:58,160
They still bill through Azure pay-as-you-go meters.
1104
00:45:58,160 --> 00:46:00,160
They still run, they still extract,
1105
00:46:00,160 --> 00:46:02,720
but their role is shifting from primary user interface
1106
00:46:02,720 --> 00:46:04,560
to back-end substrate for agent actions.
1107
00:46:04,560 --> 00:46:07,120
The user-facing experience is now conversational.
1108
00:46:07,120 --> 00:46:10,160
The configuration experience is now intent-based
1109
00:46:10,160 --> 00:46:12,080
and the governance experience is now distributed
1110
00:46:12,080 --> 00:46:13,840
across every site in your tenant.
1111
00:46:13,840 --> 00:46:16,400
That is a structural change, not a cosmetic one.
1112
00:46:16,400 --> 00:46:18,080
Architects who understand this shift
1113
00:46:18,080 --> 00:46:20,800
will design solutions that survive the transition.
1114
00:46:20,800 --> 00:46:23,120
Those who don't will find their carefully trained models
1115
00:46:23,120 --> 00:46:26,480
orphaned behind a user interface that no longer exists.
1116
00:46:26,480 --> 00:46:28,880
The knowledge agent is another major piece of this puzzle
1117
00:46:28,880 --> 00:46:31,600
that entered public preview in September 2025
1118
00:46:31,600 --> 00:46:34,480
and reached general availability in early 2026.
1119
00:46:34,480 --> 00:46:37,360
It is included with Microsoft 365 Copilot licenses
1120
00:46:37,360 --> 00:46:39,040
at no extra cost.
1121
00:46:39,040 --> 00:46:40,960
Knowledge agent uses artificial intelligence
1122
00:46:40,960 --> 00:46:44,000
to automatically tag and classify documents with metadata,
1123
00:46:44,000 --> 00:46:46,080
eliminating hours of manual data entry.
1124
00:46:46,080 --> 00:46:48,400
When you upload financial reports or project plans,
1125
00:46:48,400 --> 00:46:51,920
the agent analyzes content and suggests relevant metadata columns
1126
00:46:51,920 --> 00:46:55,360
ensuring consistent organization across libraries and sites.
1127
00:46:55,360 --> 00:46:56,720
Early adopters like Mars Inc.
1128
00:46:56,720 --> 00:46:59,120
called the Autofill metadata capability
1129
00:46:59,120 --> 00:47:02,160
a total game-changer, drastically reducing time spent
1130
00:47:02,160 --> 00:47:03,600
on manual categorization.
1131
00:47:03,600 --> 00:47:06,160
The agent also continuously monitors SharePoint environments
1132
00:47:06,160 --> 00:47:07,760
to maintain content quality.
1133
00:47:07,760 --> 00:47:10,800
It identifies broken links, flags outdated pages,
1134
00:47:10,800 --> 00:47:12,400
detects duplicate content,
1135
00:47:12,400 --> 00:47:15,120
and locates missing information that users are searching for,
1136
00:47:15,120 --> 00:47:16,720
but cannot find.
1137
00:47:16,720 --> 00:47:18,640
Rather than waiting for manual audits,
1138
00:47:18,640 --> 00:47:20,800
Knowledge Agent proactively keeps your SharePoint
1139
00:47:20,800 --> 00:47:22,400
instance clean and current.
1140
00:47:22,400 --> 00:47:25,200
A floating button appears throughout SharePoint surfaces,
1141
00:47:25,200 --> 00:47:27,200
providing role-specific actions based
1142
00:47:27,200 --> 00:47:28,320
on what you're working on.
1143
00:47:28,320 --> 00:47:31,200
Site owners see options to improve their sites and fill content gaps
1144
00:47:31,200 --> 00:47:32,960
while content creators receive suggestions
1145
00:47:32,960 --> 00:47:35,600
for metadata enrichment and automated workflows.
1146
00:47:35,600 --> 00:47:39,200
Automated workflow creation is where the agent gets genuinely powerful.
1147
00:47:39,200 --> 00:47:41,040
You describe what you need in natural language,
1148
00:47:41,040 --> 00:47:44,320
such as identifying new invoices exceeding a specific threshold
1149
00:47:44,320 --> 00:47:47,200
and knowledge agent builds the workflow automatically.
1150
00:47:47,200 --> 00:47:49,440
This removes the technical barrier to automation,
1151
00:47:49,440 --> 00:47:52,000
allowing business users to create sophisticated processes
1152
00:47:52,000 --> 00:47:53,200
without coding knowledge.
1153
00:47:53,200 --> 00:47:55,280
But this power comes with the same governance warnings
1154
00:47:55,280 --> 00:47:56,560
that apply to skills.
1155
00:47:56,560 --> 00:47:59,920
A workflow built by an agent is still a workflow running on real data.
1156
00:47:59,920 --> 00:48:02,480
If the description is ambiguous, the result might be wrong.
1157
00:48:02,480 --> 00:48:04,560
If the threshold is misstated, the notification
1158
00:48:04,560 --> 00:48:05,760
goes to the wrong person.
1159
00:48:05,760 --> 00:48:08,160
Natural language is fast but imprecise.
1160
00:48:08,160 --> 00:48:10,000
Precision requires governance.
1161
00:48:10,000 --> 00:48:11,920
The SharePoint page agent enables users
1162
00:48:11,920 --> 00:48:14,320
with Microsoft 365 Copilot licenses
1163
00:48:14,320 --> 00:48:17,760
to create and refine SharePoint pages using natural language.
1164
00:48:17,760 --> 00:48:19,840
Rather than manually building page layouts
1165
00:48:19,840 --> 00:48:23,200
and adding web parts, content creators describe what they need,
1166
00:48:23,200 --> 00:48:25,920
and the agent generates professionally designed pages
1167
00:48:25,920 --> 00:48:28,720
that match organizational branding and standards.
1168
00:48:28,720 --> 00:48:30,880
This accelerates content creation dramatically
1169
00:48:30,880 --> 00:48:34,560
but it also raises questions about content quality and consistency.
1170
00:48:34,560 --> 00:48:36,400
An AI-generated page might look correct
1171
00:48:36,400 --> 00:48:38,480
while containing outdated policy language
1172
00:48:38,480 --> 00:48:40,480
or incorrect contact information.
1173
00:48:40,480 --> 00:48:42,800
Human review remains needed, especially for pages
1174
00:48:42,800 --> 00:48:45,440
that serve as authoritative sources for Copilot grounding.
1175
00:48:45,440 --> 00:48:48,720
Custom no-code agents built through Copilot Studio integration
1176
00:48:48,720 --> 00:48:51,520
represent the most flexible layer of this new architecture.
1177
00:48:51,520 --> 00:48:54,640
Organizations can create agents that connect to SharePoint lists,
1178
00:48:54,640 --> 00:48:57,440
document libraries and pages as knowledge sources,
1179
00:48:57,440 --> 00:48:59,760
providing real-time access to current data.
1180
00:48:59,760 --> 00:49:01,840
These agents handle specific business processes
1181
00:49:01,840 --> 00:49:04,640
such as employee onboarding assistance, policy question answering,
1182
00:49:04,640 --> 00:49:07,360
project status updates, or equipment request management.
1183
00:49:07,360 --> 00:49:09,520
Because they live inside the SharePoint environment,
1184
00:49:09,520 --> 00:49:10,880
they can reference current data
1185
00:49:10,880 --> 00:49:13,120
without the latency of external integrations.
1186
00:49:13,120 --> 00:49:15,600
But they also inherit the permissions and visibility rules
1187
00:49:15,600 --> 00:49:16,960
of the sites they access.
1188
00:49:16,960 --> 00:49:19,520
An agent that answers questions about salary ranges
1189
00:49:19,520 --> 00:49:22,400
must respect the permissions of the HR site it queries.
1190
00:49:22,400 --> 00:49:24,560
An agent that summarizes project status
1191
00:49:24,560 --> 00:49:26,960
must not expose confidential client information
1192
00:49:26,960 --> 00:49:28,480
from a restricted library.
1193
00:49:28,480 --> 00:49:30,160
Agent governance is site governance
1194
00:49:30,160 --> 00:49:32,480
and the two cannot be separated.
1195
00:49:32,480 --> 00:49:34,960
Building your first custom document processing model.
1196
00:49:34,960 --> 00:49:37,280
Building a custom model that survives this transition
1197
00:49:37,280 --> 00:49:38,880
is less about machine learning theory
1198
00:49:38,880 --> 00:49:40,720
and more about document archaeology.
1199
00:49:40,720 --> 00:49:43,680
You need to understand your content before the machine can.
1200
00:49:43,680 --> 00:49:45,760
Step one is auditing your documents.
1201
00:49:45,760 --> 00:49:48,160
Pull a representative sample of every document type
1202
00:49:48,160 --> 00:49:49,120
you want to process.
1203
00:49:49,120 --> 00:49:50,640
If you're starting with contracts,
1204
00:49:50,640 --> 00:49:52,400
gather master service agreements,
1205
00:49:52,400 --> 00:49:53,600
statements of work,
1206
00:49:53,600 --> 00:49:56,000
non-disclosure agreements, and amendment documents.
1207
00:49:56,000 --> 00:49:57,840
Include the good scans, the bad scans,
1208
00:49:57,840 --> 00:49:59,520
the digital Word files, and the PDFs
1209
00:49:59,520 --> 00:50:01,440
that somebody printed and scanned back in.
1210
00:50:01,440 --> 00:50:02,880
The model needs to see the mess.
1211
00:50:02,880 --> 00:50:04,400
Not just the clean examples.
1212
00:50:04,400 --> 00:50:06,080
Diversity matters more than volume.
1213
00:50:06,080 --> 00:50:09,440
A hundred varied documents teach more than a thousand identical ones.
1214
00:50:09,440 --> 00:50:11,440
Step two is designing the taxonomy first.
1215
00:50:11,440 --> 00:50:14,240
Before you open the model builder, open a whiteboard.
1216
00:50:14,240 --> 00:50:16,240
List every column your library needs.
1217
00:50:16,240 --> 00:50:18,560
Contract type, counterparty name, effective date,
1218
00:50:18,560 --> 00:50:20,560
expiration date, governing law jurisdiction,
1219
00:50:20,560 --> 00:50:22,560
liability cap, auto renewal flag,
1220
00:50:22,560 --> 00:50:24,320
termination notice period.
1221
00:50:24,320 --> 00:50:26,400
Each of these fields maps to a SharePoint column
1222
00:50:26,400 --> 00:50:28,960
and each column needs a data type and a vocabulary.
1223
00:50:28,960 --> 00:50:30,640
If you're using managed metadata,
1224
00:50:30,640 --> 00:50:32,160
define your term sets now.
1225
00:50:32,160 --> 00:50:34,560
If you're using choice columns, list the options.
1226
00:50:34,560 --> 00:50:35,760
The model isn't the hard part.
1227
00:50:35,760 --> 00:50:36,800
The taxonomy is.
1228
00:50:36,800 --> 00:50:39,440
And if your taxonomy changes after you've trained the model,
1229
00:50:39,440 --> 00:50:40,400
you're retraining.
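The whiteboard exercise can be captured as data before any model is opened. The following is a minimal sketch, assuming a hypothetical `ColumnSpec` structure; the internal column names and the choice options shown are illustrative assumptions, not values from any Microsoft schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnSpec:
    """One planned library column (hypothetical structure for planning only)."""
    name: str
    data_type: str  # "choice", "text", "date", "currency", "boolean", "number"
    vocabulary: Optional[Tuple[str, ...]] = None  # allowed values for choice columns


# The fields listed in the episode; choice options are assumed for illustration.
CONTRACT_TAXONOMY = [
    ColumnSpec("ContractType", "choice", ("MSA", "SOW", "NDA", "Amendment")),
    ColumnSpec("CounterpartyName", "text"),
    ColumnSpec("EffectiveDate", "date"),
    ColumnSpec("ExpirationDate", "date"),
    ColumnSpec("GoverningLaw", "text"),
    ColumnSpec("LiabilityCap", "currency"),
    ColumnSpec("AutoRenewal", "boolean"),
    ColumnSpec("TerminationNoticeDays", "number"),
]


def undefined_vocabularies(taxonomy):
    """Return the names of choice columns that still have no options listed."""
    return [c.name for c in taxonomy if c.data_type == "choice" and not c.vocabulary]


def is_allowed(column, value):
    """A value is allowed if the column has no vocabulary or the value is in it."""
    return column.vocabulary is None or value in column.vocabulary
```

Running `undefined_vocabularies` before training is one way to catch a choice column whose options were never agreed on.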
1230
00:50:40,400 --> 00:50:43,200
Step three is selecting and uploading your training set.
1231
00:50:43,200 --> 00:50:46,880
In the Syntex model builder or AI Builder interface,
1232
00:50:46,880 --> 00:50:49,680
you upload your sample documents and begin labeling.
1233
00:50:49,680 --> 00:50:51,040
For structured and freeform models,
1234
00:50:51,040 --> 00:50:54,080
you draw bounding boxes around the fields you want to extract.
1235
00:50:54,080 --> 00:50:55,360
For unstructured models,
1236
00:50:55,360 --> 00:50:58,240
you highlight text spans and assign them to named entities.
1237
00:50:58,240 --> 00:51:00,560
The interface is visual and requires no code.
1238
00:51:00,560 --> 00:51:03,120
A contract manager or legal analyst can do this work.
1239
00:51:03,120 --> 00:51:04,880
The model learns from their expertise.
1240
00:51:04,880 --> 00:51:07,520
Step four is defining your entities with precision.
1241
00:51:07,520 --> 00:51:09,840
An expiration date is not just any date on the page.
1242
00:51:09,840 --> 00:51:11,120
It's the date after the phrase,
1243
00:51:11,120 --> 00:51:12,880
"This agreement shall terminate on"
1244
00:51:12,880 --> 00:51:14,560
or the date in the term section.
1245
00:51:14,560 --> 00:51:17,120
A liability cap is not just any dollar amount.
1246
00:51:17,120 --> 00:51:21,200
It's the number after "liability shall not exceed" or "cap on damages."
1247
00:51:21,200 --> 00:51:24,800
The more precisely you define the linguistic context around each field,
1248
00:51:24,800 --> 00:51:27,200
the more accurately the model will extract it.
1249
00:51:27,200 --> 00:51:29,600
Vague field definitions produce vague results.
1250
00:51:29,600 --> 00:51:31,920
Step five is training, testing and validating.
1251
00:51:31,920 --> 00:51:36,240
The model builder splits your labeled documents into training and test sets automatically.
1252
00:51:36,240 --> 00:51:39,360
After training, you review the confidence scores for each extraction.
1253
00:51:39,360 --> 00:51:44,400
A confidence score above 80% is generally reliable for automated processing.
1254
00:51:44,400 --> 00:51:47,440
A score between 50 and 80% should be flagged for human review.
1255
00:51:47,440 --> 00:51:51,440
Below 50%, you need more training examples or better field definitions.
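Those three bands translate into a small triage rule. This is a sketch only: the function name and the returned labels are hypothetical, and the cut-offs are the rule-of-thumb values quoted above, not settings exposed by the model builder.

```python
def triage(confidence: float) -> str:
    """Map a 0-100 confidence score to one of the three bands described above."""
    if not 0.0 <= confidence <= 100.0:
        raise ValueError("confidence must be between 0 and 100")
    if confidence > 80.0:
        return "auto_process"      # above 80%: generally reliable
    if confidence >= 50.0:
        return "human_review"      # 50 to 80%: flag for a person to check
    return "improve_model"         # below 50%: more examples or better definitions
```

A score of exactly 80 lands in the review band here, which is the conservative reading of "above 80%".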
1256
00:51:51,440 --> 00:51:53,760
Test on documents that weren't in your training set.
1257
00:51:53,760 --> 00:51:54,960
Test on the edge cases.
1258
00:51:54,960 --> 00:51:57,280
Test on the blurry scans and the odd formatting.
1259
00:51:57,280 --> 00:51:59,600
If the model fails there, it will fail in production.
1260
00:51:59,600 --> 00:52:01,040
The no-code promise is real.
1261
00:52:01,040 --> 00:52:04,080
Subject matter experts can build these models without writing Python
1262
00:52:04,080 --> 00:52:06,160
or understanding neural network architecture.
1263
00:52:06,160 --> 00:52:08,720
But the no-code label is misleading in one important way.
1264
00:52:08,720 --> 00:52:09,920
It doesn't mean no expertise.
1265
00:52:09,920 --> 00:52:11,760
It means the expertise is domain expertise.
1266
00:52:11,760 --> 00:52:12,720
Not data science.
1267
00:52:12,720 --> 00:52:15,680
The person labeling contracts needs to understand contract structure.
1268
00:52:15,680 --> 00:52:19,760
The person defining entity extractors needs to know where expiration dates hide
1269
00:52:19,760 --> 00:52:21,280
in their organization's documents.
1270
00:52:21,280 --> 00:52:23,120
The model learns from human pattern recognition.
1271
00:52:23,120 --> 00:52:25,760
If the human doesn't recognize the pattern, the machine won't either.
1272
00:52:25,760 --> 00:52:30,320
Microsoft provides pre-built models that accelerate this process for common scenarios.
1273
00:52:30,320 --> 00:52:34,480
The pre-built contract processing model can identify contract type, parties,
1274
00:52:34,480 --> 00:52:38,320
dates and other standard elements without requiring extensive training.
1275
00:52:38,320 --> 00:52:41,360
The pre-built invoice and receipt models extract totals,
1276
00:52:41,360 --> 00:52:43,040
dates and vendor information.
1277
00:52:43,040 --> 00:52:44,880
If your documents match these patterns,
1278
00:52:44,880 --> 00:52:47,520
start with the pre-built model and customize from there.
1279
00:52:47,520 --> 00:52:51,200
You'll get to production faster and you'll have a baseline for accuracy
1280
00:52:51,200 --> 00:52:52,960
that you can improve iteratively.
1281
00:52:52,960 --> 00:52:56,560
Pay as you go billing means you only pay for what you process.
1282
00:52:56,560 --> 00:53:00,880
Each page, sheet, slide or file transaction is metered through your attached
1283
00:53:00,880 --> 00:53:02,160
Azure subscription.
1284
00:53:02,160 --> 00:53:04,720
This makes it feasible to run proof-of-concept projects
1285
00:53:04,720 --> 00:53:06,960
without committing to large upfront licenses.
1286
00:53:06,960 --> 00:53:09,440
Process a few hundred documents, measure the accuracy,
1287
00:53:09,440 --> 00:53:12,640
calculate the time savings and build the business case from real data.
1288
00:53:12,640 --> 00:53:15,600
The meters continue to exist behind the scenes in the AI era
1289
00:53:15,600 --> 00:53:18,720
and Azure Cost Management lets you monitor consumption as you scale.
1290
00:53:18,720 --> 00:53:21,040
There is a critical limitation you need to plan for.
1291
00:53:21,040 --> 00:53:24,320
Autofill configurations in the AI era live in your document library.
1292
00:53:24,320 --> 00:53:26,160
You can't reuse them in another library.
1293
00:53:26,160 --> 00:53:28,400
Each library requires its own separate setup.
1294
00:53:28,400 --> 00:53:30,320
There is no sharing, no inheritance.
1295
00:53:30,320 --> 00:53:32,800
If you have 50 contract libraries across your tenant,
1296
00:53:32,800 --> 00:53:34,800
you may need 50 separate configurations.
1297
00:53:34,800 --> 00:53:35,920
That is not a scaling model.
1298
00:53:35,920 --> 00:53:37,280
It is a point solution.
1299
00:53:37,280 --> 00:53:40,240
For enterprise deployment, you need to think about whether the new
1300
00:53:40,240 --> 00:53:42,880
autofill approach fits your architecture or whether you should stick
1301
00:53:42,880 --> 00:53:47,280
with classic Syntex model binding that can be applied across libraries via content types.
1302
00:53:47,280 --> 00:53:51,680
Power Automate remains the connective tissue between extracted metadata
1303
00:53:51,680 --> 00:53:52,960
and downstream action.
1304
00:53:52,960 --> 00:53:55,440
When a Syntex model extracts an expiration date,
1305
00:53:55,440 --> 00:53:57,840
a Power Automate flow can read that column value,
1306
00:53:57,840 --> 00:54:01,120
calculate the 90-day warning threshold, create a planner task
1307
00:54:01,120 --> 00:54:03,040
and send an email to the contract owner.
1308
00:54:03,040 --> 00:54:06,000
When a model classifies a document as containing personal data,
1309
00:54:06,000 --> 00:54:08,560
a flow can trigger a sensitivity label application,
1310
00:54:08,560 --> 00:54:10,640
log the event to a compliance register,
1311
00:54:10,640 --> 00:54:12,480
and notify the data protection officer.
1312
00:54:12,480 --> 00:54:15,360
These orchestrations are not optional extras.
1313
00:54:15,360 --> 00:54:18,720
They are what turns extracted metadata into business process automation.
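The date arithmetic behind the expiration-date flow is simple enough to check by hand. The sketch below shows it in Python rather than in a Power Automate expression; the function names are hypothetical, and the 90-day window is the figure mentioned above.

```python
from datetime import date, timedelta


def renewal_warning_date(expiration: date, warning_days: int = 90) -> date:
    """Date on which the warning task should be raised."""
    return expiration - timedelta(days=warning_days)


def in_warning_window(expiration: date, today: date, warning_days: int = 90) -> bool:
    """True when today falls between the warning date and the expiration date."""
    return renewal_warning_date(expiration, warning_days) <= today <= expiration
```

For a contract expiring on 31 December 2025, `renewal_warning_date` returns 2 October 2025, which is the day the planner task and email would be due to fire.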
1314
00:54:18,720 --> 00:54:23,280
Error handling in production workflows is often overlooked during pilot deployments.
1315
00:54:23,280 --> 00:54:26,080
Your production flows need branches for model failures,
1316
00:54:26,080 --> 00:54:29,840
low-confidence scores, corrupted files, and password-protected PDFs.
1317
00:54:29,840 --> 00:54:32,800
A missing expiration date should not create a blank task.
1319
00:54:35,360 --> 00:54:38,640
A low-confidence extraction should route to a human reviewer queue.
1320
00:54:38,640 --> 00:54:41,280
A processing failure should log to an admin dashboard,
1321
00:54:41,280 --> 00:54:42,960
not silently drop the document.
1322
00:54:42,960 --> 00:54:45,120
Build these exception paths before you scale,
1323
00:54:45,120 --> 00:54:47,600
because exceptions become common at volume.
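The exception paths can be written down as a single routing decision. This is a sketch under stated assumptions: `ExtractionResult`, the route labels, and the 80% review threshold are all hypothetical names and values chosen for illustration, not part of any Power Automate or Syntex API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionResult:
    """Outcome of processing one file (hypothetical shape)."""
    file_name: str
    processed: bool                  # False for corrupted or unreadable files
    expiration_date: Optional[str]   # None when the model found no value
    confidence: float                # 0-100


def route(result: ExtractionResult, review_threshold: float = 80.0) -> str:
    """Pick the branch a production flow should take for this result."""
    if not result.processed:
        return "log_to_admin_dashboard"   # never drop the document silently
    if not result.expiration_date:
        return "flag_missing_value"       # do not create a blank task
    if result.confidence < review_threshold:
        return "human_review_queue"
    return "create_task"
```

The order of the checks matters: a failed file has no meaningful confidence score, so the failure branch is tested first.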
1324
00:54:47,600 --> 00:54:51,040
Document quality is another production reality that pilots often ignore.
1325
00:54:51,040 --> 00:54:53,920
A training set of pristine digital PDFs produces a model
1326
00:54:53,920 --> 00:54:56,080
that fails on the fax-scanned, coffee-stained,
1327
00:54:56,080 --> 00:54:58,560
sideways-oriented documents that arrive in production.
1328
00:54:58,560 --> 00:55:00,720
Your intake process needs quality gates.
1329
00:55:00,720 --> 00:55:03,120
Reject scans below 300 dots per inch.
1330
00:55:03,120 --> 00:55:05,040
Flag documents with missing pages.
1331
00:55:05,040 --> 00:55:08,000
Alert users when a file is encrypted or password protected.
1332
00:55:08,000 --> 00:55:10,480
The model cannot process what it cannot read
1333
00:55:10,480 --> 00:55:13,600
and failed processing costs money without delivering value.
1334
00:55:13,600 --> 00:55:18,480
A small amount of intake validation prevents a large amount of downstream rework.
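A quality gate of this kind could look like the sketch below. It assumes the intake step already knows each file's scan resolution, page counts, and encryption status; how those are obtained is outside the sketch, and `IntakeFile` and the issue labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IntakeFile:
    """Facts about an incoming file, gathered before the model runs."""
    name: str
    scan_dpi: Optional[int]        # None for born-digital files
    page_count: int
    expected_pages: Optional[int]  # None when the expected count is unknown
    encrypted: bool


def intake_issues(f: IntakeFile, min_dpi: int = 300) -> List[str]:
    """Return the reasons a file should be held back; empty means it may proceed."""
    issues = []
    if f.encrypted:
        issues.append("encrypted_or_password_protected")
    if f.scan_dpi is not None and f.scan_dpi < min_dpi:
        issues.append("scan_below_min_dpi")
    if f.expected_pages is not None and f.page_count < f.expected_pages:
        issues.append("missing_pages")
    return issues
```

Returning a list rather than a single pass/fail flag lets the intake flow tell the uploader everything that is wrong in one message.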
1335
00:55:18,480 --> 00:55:22,080
Integration with external systems is where the architecture gets interesting.
1336
00:55:22,080 --> 00:55:27,280
Documents arrive through email, third-party portals, migration batches, and API uploads.
1337
00:55:27,280 --> 00:55:30,400
Each of these channels needs to land the file in a SharePoint library,
1338
00:55:30,400 --> 00:55:33,600
trigger the model, and handle the metadata output consistently.
1339
00:55:33,600 --> 00:55:36,960
A common pattern is to create a staging library for incoming documents,
1340
00:55:36,960 --> 00:55:39,680
run the model there, and then move the classified document
1341
00:55:39,680 --> 00:55:43,280
to its final destination library based on the extracted metadata.
1342
00:55:43,280 --> 00:55:45,200
This decouples intake from storage,
1343
00:55:45,200 --> 00:55:47,360
allowing you to apply different processing rules
1344
00:55:47,360 --> 00:55:51,200
to different document sources without complicating the destination libraries.
1345
00:55:51,200 --> 00:55:54,720
Monitoring and telemetry complete the production picture.
1346
00:55:54,720 --> 00:55:57,280
Azure Application Insights can track model performance,
1347
00:55:57,280 --> 00:55:59,520
processing latency, and error rates.
1348
00:55:59,520 --> 00:56:01,840
SharePoint audit logs record who uploaded what,
1349
00:56:01,840 --> 00:56:04,240
when the model ran, and what metadata was applied.
1350
00:56:04,240 --> 00:56:08,720
Power BI dashboards can visualize throughput, accuracy trends, and cost per document.
1351
00:56:08,720 --> 00:56:11,120
These monitoring layers are not just for troubleshooting,
1352
00:56:11,120 --> 00:56:12,640
they are for continuous improvement.
1353
00:56:12,640 --> 00:56:14,560
When accuracy drops, the dashboard tells you.
1354
00:56:14,560 --> 00:56:16,320
When costs spike, the alert fires.
1355
00:56:16,320 --> 00:56:18,480
When a new document type appears in the intake,
1356
00:56:18,480 --> 00:56:21,040
the pattern detection flags it for model retraining.
1357
00:56:21,040 --> 00:56:23,920
Entity extraction deep dive: the technical core.
1358
00:56:23,920 --> 00:56:26,800
Entity extractors are where beginners become professionals.
1359
00:56:26,800 --> 00:56:29,680
Simple field mapping finds text in a fixed location.
1360
00:56:29,680 --> 00:56:33,200
Entity extraction finds text based on context, pattern, and meaning.
1361
00:56:33,200 --> 00:56:36,080
That difference is everything when you're processing contracts,
1362
00:56:36,080 --> 00:56:37,680
policies, and legal documents.
1363
00:56:37,680 --> 00:56:39,360
Let's start with a concrete example.
1364
00:56:39,360 --> 00:56:42,400
A master service agreement might contain an effective date.
1365
00:56:42,400 --> 00:56:47,760
In one document, it appears after "This agreement is effective as of January 15, 2024."
1366
00:56:47,760 --> 00:56:51,760
In another, it appears after "Commencement date: 15th January 2024."
1367
00:56:51,760 --> 00:56:55,920
In a third, it appears in a table labeled "Key Terms" with a row for "Start Date."
1368
00:56:55,920 --> 00:56:58,880
A simple field map would miss two of these three instances.
1369
00:56:58,880 --> 00:57:03,360
An entity extractor trained on the linguistic patterns around effective dates would catch all three.
1370
00:57:03,360 --> 00:57:07,120
The extractor works by learning the relationship between a label and a value.
1371
00:57:07,120 --> 00:57:10,960
You highlight multiple examples of the same entity across different documents
1372
00:57:10,960 --> 00:57:12,640
and the model generalizes the pattern.
1373
00:57:12,640 --> 00:57:15,520
It learns that effective dates often follow phrases like
1374
00:57:15,520 --> 00:57:20,080
"effective as of," "commences on," "start date," or "term begins."
1375
00:57:20,080 --> 00:57:23,040
It learns that expiration dates follow "terminates on,"
1376
00:57:23,040 --> 00:57:25,680
"end date," "expiration," or "renewal deadline."
1377
00:57:26,640 --> 00:57:30,480
The more examples you provide, the stronger the pattern recognition becomes.
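To make the idea of "context around a value" concrete, here is a rule-based stand-in. It is not how the trained extractor works: the model learns these cues from labeled examples, whereas this sketch hard-codes a few cue phrases and two date formats. All names are hypothetical.

```python
import re
from typing import Optional

# Cue phrases of the kind mentioned above, hard-coded for illustration.
CUE_PHRASES = ["effective as of", "commences on", "commencement date",
               "start date", "term begins"]

# Two date shapes only: "January 15, 2024" and "15th January 2024".
_DATE = r"[A-Za-z]+ \d{1,2}, \d{4}|\d{1,2}(?:st|nd|rd|th)? [A-Za-z]+ \d{4}"

_PATTERN = re.compile(
    "(?:" + "|".join(re.escape(p) for p in CUE_PHRASES) + r")\W{0,3}(" + _DATE + ")",
    re.IGNORECASE,
)


def find_effective_date(text: str) -> Optional[str]:
    """Return the date string that follows a cue phrase, or None if there is none."""
    match = _PATTERN.search(text)
    return match.group(1) if match else None
```

A date with no cue phrase in front of it, such as a signature date, is deliberately ignored, which is the behavior a fixed-location field map cannot provide.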
1378
00:57:30,480 --> 00:57:33,360
Liability clauses are another high value extraction target.
1379
00:57:33,360 --> 00:57:35,840
A liability cap might be expressed as a fixed dollar amount,
1380
00:57:35,840 --> 00:57:38,800
a percentage of fees paid, a multiple of annual charges,
1381
00:57:38,800 --> 00:57:40,240
or an insurance coverage limit.
1382
00:57:40,240 --> 00:57:44,320
The extractor needs to understand that "liability shall not exceed $500,000"
1383
00:57:44,320 --> 00:57:49,920
and "total liability capped at $500K" and "damages limited to the greater of $500,000
1384
00:57:49,920 --> 00:57:53,840
or 12 months of fees" all represent the same conceptual entity.
1385
00:57:53,840 --> 00:57:58,960
You train this by labeling each variant and letting the model learn the semantic equivalence.
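Once the model has found the clause, the extracted amount still has to land in one column in one format. The sketch below is a downstream normalizer, not part of the model: it assumes amounts are written in dollars with an optional K or M suffix, and the function name is hypothetical.

```python
import re
from decimal import Decimal
from typing import List

# A dollar sign, digits with optional thousands separators, optional K/M suffix.
_AMOUNT = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)\s*([KkMm])?\b")
_MULTIPLIER = {"k": 1_000, "m": 1_000_000}


def dollar_amounts(text: str) -> List[int]:
    """Return every dollar amount in the text as a whole number of dollars."""
    amounts = []
    for number, suffix in _AMOUNT.findall(text):
        value = Decimal(number.replace(",", ""))
        if suffix:
            value *= _MULTIPLIER[suffix.lower()]
        amounts.append(int(value))
    return amounts
```

With this in place, "$500,000" and "$500K" both store as 500000, so filtering and sorting on the liability cap column behaves consistently.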
1386
00:57:58,960 --> 00:58:01,760
Termination clauses require similar semantic understanding.
1387
00:58:01,760 --> 00:58:03,760
Some contracts terminate on a fixed date,
1388
00:58:03,760 --> 00:58:05,520
others terminate after a notice period,
1389
00:58:05,520 --> 00:58:08,400
others terminate upon a material breach, others auto-renew,
1390
00:58:08,400 --> 00:58:10,560
unless one party provides written notice.
1391
00:58:10,560 --> 00:58:13,120
Each of these represents a different termination pattern
1392
00:58:13,120 --> 00:58:16,240
and your taxonomy needs columns that capture the distinction.
1393
00:58:16,240 --> 00:58:18,080
A single end date column is insufficient.
1394
00:58:18,080 --> 00:58:20,880
You need termination type, notice period,
1395
00:58:20,880 --> 00:58:24,160
auto-renewal flag, and renewal notice deadline.
1396
00:58:24,160 --> 00:58:27,680
The richness of your taxonomy determines the richness of your extraction.
1397
00:58:27,680 --> 00:58:30,160
Confidence scores are your quality control mechanism.
1398
00:58:30,160 --> 00:58:31,680
When the model extracts a value,
1399
00:58:31,680 --> 00:58:34,720
it assigns a confidence score between 0 and 100.
1400
00:58:34,720 --> 00:58:36,960
You need to set thresholds that match your risk tolerance.
1401
00:58:36,960 --> 00:58:40,480
For low-risk extractions like document type classification,
1402
00:58:40,480 --> 00:58:42,720
an 80% threshold might be acceptable.
1403
00:58:42,720 --> 00:58:45,600
For high-risk extractions like liability caps or governing law,
1404
00:58:45,600 --> 00:58:50,400
you might want 95% confidence before auto-accepting. Anything below that gets flagged for human review.
1405
00:58:50,400 --> 00:58:52,240
Don't set one threshold for everything.
1406
00:58:52,240 --> 00:58:55,280
Match the threshold to the business consequence of an error.
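Per-field thresholds fit naturally in a lookup table. In this sketch the field names are hypothetical, the 80 and 95 values are the examples given above, and treating unlisted fields as high risk is an assumption made here, not a product default.

```python
# Threshold per field, on the 0-100 scale used for confidence scores.
FIELD_THRESHOLDS = {
    "DocumentType": 80.0,   # low risk: a wrong class is cheap to fix
    "LiabilityCap": 95.0,   # high risk: a wrong value has legal consequences
    "GoverningLaw": 95.0,
}

# Assumption: any field not listed is treated as high risk.
DEFAULT_THRESHOLD = 95.0


def auto_accept(field: str, confidence: float) -> bool:
    """True if the extraction can be accepted without human review."""
    return confidence >= FIELD_THRESHOLDS.get(field, DEFAULT_THRESHOLD)
```

The same 88% score is accepted for a document type and sent to review for a liability cap, which is the point of matching the threshold to the consequence.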
1407
00:58:55,280 --> 00:58:58,560
Handling variability is where most models fail in production.
1408
00:58:58,560 --> 00:59:01,040
Documents that look clean in training arrive in production
1409
00:59:01,040 --> 00:59:03,360
with handwritten annotations, stamp overlays,
1410
00:59:03,360 --> 00:59:05,360
fax headers or rotated pages.
1411
00:59:05,360 --> 00:59:07,840
The model needs to see these variants during training.
1412
00:59:07,840 --> 00:59:09,840
Include at least 10% of your training set
1413
00:59:09,840 --> 00:59:11,600
as edge cases and bad examples.
1414
00:59:11,600 --> 00:59:13,360
If you only train on pristine documents,
1415
00:59:13,360 --> 00:59:15,440
your model will be a pristine failure.
1416
00:59:15,440 --> 00:59:17,600
Scanned PDFs present a specific challenge.
1417
00:59:17,600 --> 00:59:19,920
Optical character recognition transforms the image
1418
00:59:19,920 --> 00:59:22,240
into text before the model processes it.
1419
00:59:22,240 --> 00:59:24,480
But OCR quality varies with scan resolution,
1420
00:59:24,480 --> 00:59:26,480
font clarity and page rotation.
1421
00:59:26,480 --> 00:59:28,400
A document scanned at 200 dots per inch
1422
00:59:28,400 --> 00:59:31,440
might produce garbled text where a 300 dots per inch scan
1423
00:59:31,440 --> 00:59:32,800
produces perfect extraction.
1424
00:59:32,800 --> 00:59:36,400
If your document pipeline includes a lot of scanned paper,
1425
00:59:36,400 --> 00:59:39,280
test your OCR layer separately from your extraction layer.
1426
00:59:39,280 --> 00:59:41,360
Bad OCR produces bad extractions,
1427
00:59:41,360 --> 00:59:43,600
and the extraction model gets blamed for a problem
1428
00:59:43,600 --> 00:59:44,800
that happened upstream.
1429
00:59:44,800 --> 00:59:47,440
Common training mistakes follow predictable patterns.
1430
00:59:47,440 --> 00:59:51,280
Too few edge cases produce a model that only works on ideal documents.
1431
00:59:51,280 --> 00:59:53,920
Overfitting to one template produces a model that fails
1432
00:59:53,920 --> 00:59:56,000
when a new vendor sends a different layout.
1433
00:59:56,000 --> 00:59:57,520
Ignoring negative examples,
1434
00:59:57,520 --> 00:59:59,920
documents that don't contain the entity at all
1435
00:59:59,920 --> 01:00:02,880
produces a model that hallucinates extractions when none exist.
1436
01:00:02,880 --> 01:00:05,680
The cure for all of these is more representative training data
1437
01:00:05,680 --> 01:00:07,440
and iterative validation.
1438
01:00:07,440 --> 01:00:08,960
Iteration is not a sign of failure.
1439
01:00:08,960 --> 01:00:10,000
It is the process.
1440
01:00:10,000 --> 01:00:11,760
Deploy your model to a pilot library,
1441
01:00:11,760 --> 01:00:13,520
process a few hundred real documents,
1442
01:00:13,520 --> 01:00:15,760
review the extractions and feed the corrections
1443
01:00:15,760 --> 01:00:17,200
back into the training set.
1444
01:00:17,200 --> 01:00:20,080
Every corrected extraction makes the model smarter.
1445
01:00:20,080 --> 01:00:22,480
Every false positive teaches it what to avoid.
1446
01:00:22,480 --> 01:00:25,520
Plan for at least three training cycles
1447
01:00:25,520 --> 01:00:27,360
before you call a model production ready.
1448
01:00:27,360 --> 01:00:29,360
The organizations that get the best results
1449
01:00:29,360 --> 01:00:32,400
treat model training as a continuous improvement process,
1450
01:00:32,400 --> 01:00:34,160
not a one-time setup task.
1451
01:00:34,160 --> 01:00:36,400
Multi-page documents present extraction challenges
1452
01:00:36,400 --> 01:00:38,160
that single-page models do not.
1453
01:00:38,160 --> 01:00:40,640
A contract might have the effective date on page one,
1454
01:00:40,640 --> 01:00:42,240
the liability clause on page eight,
1455
01:00:42,240 --> 01:00:44,160
and the signature block on page 12.
1456
01:00:44,160 --> 01:00:46,000
The model needs to understand that entities
1457
01:00:46,000 --> 01:00:47,760
can appear anywhere in the document,
1458
01:00:47,760 --> 01:00:49,040
not just on the first page.
1459
01:00:49,040 --> 01:00:50,720
When you label training examples,
1460
01:00:50,720 --> 01:00:53,040
make sure to label entities on every page,
1461
01:00:53,040 --> 01:00:55,760
not just the page where they most commonly appear.
1462
01:00:55,760 --> 01:00:58,080
A model trained only on page one extractions
1463
01:00:58,080 --> 01:01:00,080
will miss page eight data in production.
1464
01:01:00,080 --> 01:01:01,600
Tables are another common pitfall.
1465
01:01:01,600 --> 01:01:03,200
Contracts often include schedules,
1466
01:01:03,200 --> 01:01:05,600
pricing tables and responsibility matrices.
1467
01:01:05,600 --> 01:01:08,640
A simple entity extractor might see a table cell
1468
01:01:08,640 --> 01:01:10,800
containing a date and incorrectly label it
1469
01:01:10,800 --> 01:01:11,920
as the effective date
1470
01:01:11,920 --> 01:01:14,880
when it is actually a delivery milestone from a schedule.
1471
01:01:14,880 --> 01:01:17,760
You need to train the model to recognize table boundaries
1472
01:01:17,760 --> 01:01:20,000
and distinguish between header rows, data rows,
1473
01:01:20,000 --> 01:01:21,200
and footer summaries.
1474
01:01:21,200 --> 01:01:23,840
Some extraction problems are better solved by table extraction
1475
01:01:23,840 --> 01:01:25,280
than by entity extraction.
1476
01:01:25,280 --> 01:01:27,840
If your document contains a recurring table structure,
1477
01:01:27,840 --> 01:01:29,520
such as a monthly payment schedule,
1478
01:01:29,520 --> 01:01:31,680
consider whether a structured or freeform model
1479
01:01:31,680 --> 01:01:33,840
would capture that table more reliably
1480
01:01:33,840 --> 01:01:35,920
than an unstructured entity extractor
1481
01:01:35,920 --> 01:01:37,920
hunting for isolated values.
1482
01:01:37,920 --> 01:01:39,600
Signatures and handwritten annotations
1483
01:01:39,600 --> 01:01:41,520
break many extraction pipelines.
1484
01:01:41,520 --> 01:01:44,640
A scanned signature might cover a date field, making OCR fail.
1485
01:01:44,640 --> 01:01:46,160
A handwritten note in the margin
1486
01:01:46,160 --> 01:01:47,840
might contain a critical amendment
1487
01:01:47,840 --> 01:01:51,360
that the model ignores because it only processes printed text.
1488
01:01:51,360 --> 01:01:54,480
If your document pipeline includes heavily annotated documents,
1489
01:01:54,480 --> 01:01:57,600
you need to decide whether to extract the annotation layer separately
1490
01:01:57,600 --> 01:02:00,160
or to train the model to recognize handwriting.
1491
01:02:00,160 --> 01:02:01,600
Neither approach is trivial
1492
01:02:01,600 --> 01:02:03,680
and both require higher quality scans
1493
01:02:03,680 --> 01:02:05,760
than pure printed text extraction.
1494
01:02:05,760 --> 01:02:07,680
Cross-document entity resolution
1495
01:02:07,680 --> 01:02:10,640
is the advanced challenge that separates basic extraction
1496
01:02:10,640 --> 01:02:12,560
from true document intelligence.
1497
01:02:12,560 --> 01:02:14,320
Suppose you have a master service agreement
1498
01:02:14,320 --> 01:02:16,240
and five associated statements of work.
1499
01:02:16,240 --> 01:02:19,600
The MSA contains the governing law and the liability cap.
1500
01:02:19,600 --> 01:02:23,280
Each SOW contains the project scope, timeline, and budget.
1501
01:02:23,280 --> 01:02:26,720
Your taxonomy needs to capture the relationship between these documents.
1502
01:02:26,720 --> 01:02:29,600
A standalone extraction of the SOW budget is useful,
1503
01:02:29,600 --> 01:02:32,880
but an extraction that links the SOW to its parent MSA
1504
01:02:32,880 --> 01:02:36,240
and inherits the governing law from the MSA is far more powerful.
1505
01:02:36,240 --> 01:02:38,320
This requires not just document level extraction
1506
01:02:38,320 --> 01:02:39,840
but document relationship modeling,
1507
01:02:39,840 --> 01:02:42,240
which is beyond the current native capabilities of Syntex
1508
01:02:42,240 --> 01:02:46,160
and requires custom orchestration through Power Automate or Azure Logic Apps.
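The parent-child inheritance described here can be sketched in a few lines. This is a minimal illustration, not how Syntex, Power Automate, or Logic Apps represent documents: the `ContractRecord` class, its field names, and the `resolve_governing_law` helper are all hypothetical, and the sample IDs and values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ContractRecord:
    """Hypothetical metadata record for one extracted document."""
    doc_id: str
    doc_type: str                      # e.g. "MSA" or "SOW"
    parent_id: Optional[str] = None    # link from a SOW to its MSA
    governing_law: Optional[str] = None
    budget: Optional[float] = None


def resolve_governing_law(record: ContractRecord,
                          index: Dict[str, ContractRecord]) -> Optional[str]:
    """Return the record's own governing law, or inherit it from a parent."""
    seen = set()
    current: Optional[ContractRecord] = record
    while current is not None and current.doc_id not in seen:
        if current.governing_law:
            return current.governing_law
        seen.add(current.doc_id)       # guard against circular links
        current = index.get(current.parent_id) if current.parent_id else None
    return None


msa = ContractRecord("MSA-001", "MSA", governing_law="England and Wales")
sow = ContractRecord("SOW-001", "SOW", parent_id="MSA-001", budget=50000.0)
index = {r.doc_id: r for r in (msa, sow)}
print(resolve_governing_law(sow, index))
```

The point of the sketch is that the link itself (`parent_id`) is a piece of metadata the orchestration layer has to populate; the extraction model alone only sees one document at a time.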
1509
01:02:46,160 --> 01:02:49,440
Language and jurisdiction variations add another layer of complexity.
1510
01:02:49,440 --> 01:02:51,520
A global organization might process contracts
1511
01:02:51,520 --> 01:02:53,920
in English, German, Japanese, and Portuguese.
1512
01:02:53,920 --> 01:02:56,000
The pre-built models handle some of these languages,
1513
01:02:56,000 --> 01:02:58,560
but custom models need language-specific training.
1514
01:02:58,560 --> 01:03:01,760
A termination clause in English uses different linguistic patterns
1515
01:03:01,760 --> 01:03:03,040
than a clause in German.
1516
01:03:03,040 --> 01:03:04,960
You cannot train a model on English contracts
1517
01:03:04,960 --> 01:03:07,920
and expect it to extract accurately from Japanese contracts.
1518
01:03:07,920 --> 01:03:09,920
You need separate models for each language
1519
01:03:09,920 --> 01:03:12,320
or you need to use Azure Document Intelligence's
1520
01:03:12,320 --> 01:03:14,880
multilingual capabilities as a pre-processing layer
1521
01:03:14,880 --> 01:03:16,480
before Syntex extraction.
1522
01:03:16,480 --> 01:03:19,760
The metadata schema, however, can be shared across languages.
1523
01:03:19,760 --> 01:03:21,600
The contract type column means the same thing
1524
01:03:21,600 --> 01:03:24,080
whether the document is in English or Japanese.
1525
01:03:24,080 --> 01:03:26,720
This is why taxonomy design must be language-agnostic
1526
01:03:26,720 --> 01:03:29,040
even when model training is language-specific.
1527
01:03:29,040 --> 01:03:31,040
Testing protocols for production readiness
1528
01:03:31,040 --> 01:03:33,120
should include a formal validation suite.
1529
01:03:33,120 --> 01:03:36,560
Select 100 documents that represent your full production mix,
1530
01:03:36,560 --> 01:03:38,400
including clean digital files,
1531
01:03:38,400 --> 01:03:42,080
scanned paper, multi-page contracts, tables, signatures,
1532
01:03:42,080 --> 01:03:43,280
and edge cases.
1533
01:03:43,280 --> 01:03:44,880
Run these documents through the model
1534
01:03:44,880 --> 01:03:48,160
and compare the extractions against a human-verified ground truth.
1535
01:03:48,160 --> 01:03:51,360
Calculate precision, recall, and F1 score for each field.
1536
01:03:51,360 --> 01:03:53,440
Publish the results to the governance council.
1537
01:03:53,440 --> 01:03:56,400
If accuracy is below your threshold, do not deploy.
1538
01:03:56,400 --> 01:03:59,600
Retrain and retest until the numbers meet your standard.
1539
01:03:59,600 --> 01:04:03,360
This discipline separates experiments from production systems.
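As a rough sketch of that per-field scoring, assuming each document's extractions and ground truth are available as simple dictionaries (the data shape and the function name `field_metrics` are illustrative assumptions, not part of any Microsoft tooling):

```python
from collections import defaultdict


def field_metrics(predictions, ground_truth):
    """Per-field precision, recall, and F1.

    predictions / ground_truth: one dict per document, mapping a field
    name to the extracted value, or None when the field is absent.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, truth in zip(predictions, ground_truth):
        for field in set(pred) | set(truth):
            p, t = pred.get(field), truth.get(field)
            if p is not None and p == t:
                counts[field]["tp"] += 1
            else:
                if p is not None:
                    counts[field]["fp"] += 1   # wrong or hallucinated value
                if t is not None:
                    counts[field]["fn"] += 1   # real value missed or wrong

    report = {}
    for field, c in counts.items():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[field] = {"precision": precision, "recall": recall, "f1": f1}
    return report
```

Counting a wrong value as both a false positive and a false negative is one common convention; whatever convention you pick, keep it fixed between training cycles so the numbers stay comparable.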
1540
01:04:03,360 --> 01:04:05,920
Deploying to live libraries, autofill, and skills.
1541
01:04:05,920 --> 01:04:07,040
Once the model is trained,
1542
01:04:07,040 --> 01:04:08,800
you need to connect it to live workflows.
1543
01:04:08,800 --> 01:04:10,800
In the classic Syntex model, you publish the model
1544
01:04:10,800 --> 01:04:12,800
and bind it to a SharePoint library.
1545
01:04:12,800 --> 01:04:15,120
Every time a document is uploaded to that library,
1546
01:04:15,120 --> 01:04:16,800
the model runs automatically.
1547
01:04:16,800 --> 01:04:19,600
It classifies the document type, extracts the fields,
1548
01:04:19,600 --> 01:04:21,360
and populates the library columns.
1549
01:04:21,360 --> 01:04:22,880
The process happens in seconds,
1550
01:04:22,880 --> 01:04:26,000
and the user sees the metadata appear as if by magic.
1551
01:04:26,000 --> 01:04:28,000
That binding mechanism still works.
1552
01:04:28,000 --> 01:04:29,680
And for multi-library deployments,
1553
01:04:29,680 --> 01:04:31,520
it is still the most scalable approach.
1554
01:04:31,520 --> 01:04:33,680
In the AI era, the experience is different.
1555
01:04:33,680 --> 01:04:35,200
Users open a document library,
1556
01:04:35,200 --> 01:04:36,720
select create autofill columns,
1557
01:04:36,720 --> 01:04:38,720
and describe what they want in natural language.
1558
01:04:38,720 --> 01:04:41,840
Copilot processes the documents and suggests column values.
1559
01:04:41,840 --> 01:04:44,480
There are no model galleries, no complex binding,
1560
01:04:44,480 --> 01:04:46,800
no extractors to configure, just a prompt.
1561
01:04:46,800 --> 01:04:49,040
The trade-off is flexibility for simplicity.
1562
01:04:49,040 --> 01:04:52,640
You lose the ability to reuse the configuration across libraries.
1563
01:04:52,640 --> 01:04:54,320
You lose batch testing visibility,
1564
01:04:54,320 --> 01:04:56,240
but you gain speed and accessibility.
1565
01:04:56,240 --> 01:04:59,280
Real-time classification means that the moment a file hits the library,
1566
01:04:59,280 --> 01:05:00,880
the metadata is generated.
1567
01:05:00,880 --> 01:05:04,320
For high-volume libraries processing hundreds of documents per day,
1568
01:05:04,320 --> 01:05:06,320
this automatic processing is critical.
1569
01:05:06,320 --> 01:05:08,800
Documents can also be processed via API,
1570
01:05:08,800 --> 01:05:11,440
which means that files uploaded through third-party systems,
1571
01:05:11,440 --> 01:05:13,280
email attachments, or migration tools,
1572
01:05:13,280 --> 01:05:15,440
can still trigger the same extraction pipeline.
1573
01:05:15,440 --> 01:05:17,040
The metadata schema is consistent
1574
01:05:17,040 --> 01:05:19,040
regardless of how the document arrived.
1575
01:05:19,040 --> 01:05:21,440
Multimodal inputs are increasingly common.
1576
01:05:21,440 --> 01:05:24,160
Your contract library might contain digital Word files,
1577
01:05:24,160 --> 01:05:26,320
scanned PDFs, image attachments,
1578
01:05:26,320 --> 01:05:28,880
and even photographs of signed documents taken on a phone.
1579
01:05:28,880 --> 01:05:32,640
The OCR layer handles the image-to-text conversion
1580
01:05:32,640 --> 01:05:35,200
and the extraction model processes the resulting text.
1581
01:05:35,200 --> 01:05:36,800
The user experience is the same.
1582
01:05:36,800 --> 01:05:40,000
The underlying pipeline adapts to the input type automatically.
1583
01:05:40,000 --> 01:05:43,040
What matters is that the extraction quality depends on OCR accuracy
1584
01:05:43,040 --> 01:05:44,160
for scanned inputs,
1585
01:05:44,160 --> 01:05:46,800
so monitor your scan quality as part of the deployment.
1586
01:05:46,800 --> 01:05:49,760
Connecting extracted metadata to views and filters
1587
01:05:49,760 --> 01:05:52,960
transforms the library from a file dump into a business dashboard.
1588
01:05:52,960 --> 01:05:56,160
Create views that show contracts expiring in the next 90 days,
1589
01:05:56,160 --> 01:05:58,720
create views that filter by governing law jurisdiction,
1590
01:05:58,720 --> 01:06:01,120
create views that flag auto-renewal contracts
1591
01:06:01,120 --> 01:06:02,640
approaching their notice deadline.
1592
01:06:02,640 --> 01:06:04,640
These views turn metadata into action.
1593
01:06:04,640 --> 01:06:06,080
A legal team can open SharePoint
1594
01:06:06,080 --> 01:06:07,760
and see exactly what needs attention today
1595
01:06:07,760 --> 01:06:09,600
without running a single manual search.
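The logic behind a "contracts expiring in the next 90 days" view is just a date-range filter on the extracted column. A minimal sketch, assuming the library items have already been read into dictionaries with an `expiration_date` key (that key and the function name are hypothetical; in SharePoint itself this would be a view filter rather than code):

```python
from datetime import date, timedelta


def expiring_within(contracts, today, days=90):
    """Return contracts whose expiration date falls in [today, today + days]."""
    cutoff = today + timedelta(days=days)
    return [
        c for c in contracts
        if c.get("expiration_date") is not None
        and today <= c["expiration_date"] <= cutoff
    ]


contracts = [
    {"name": "Vendor A", "expiration_date": date(2025, 3, 1)},
    {"name": "Vendor B", "expiration_date": date(2025, 6, 1)},
    {"name": "Vendor C", "expiration_date": None},  # extraction found no date
]
for c in expiring_within(contracts, today=date(2025, 1, 1)):
    print(c["name"])
```

Items with no extracted date drop out of the result, which is exactly why an empty column is a risk: the contract never appears in the view that was supposed to catch it.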
1596
01:06:09,600 --> 01:06:11,040
Skills take this a step further
1597
01:06:11,040 --> 01:06:13,600
by automating workflows across the entire site.
1598
01:06:13,600 --> 01:06:14,640
A skill might say,
1599
01:06:14,640 --> 01:06:16,960
for every contract uploaded to the legal library,
1600
01:06:16,960 --> 01:06:19,360
extract the counterparty name and expiration date,
1601
01:06:19,360 --> 01:06:22,320
create a calendar event for 90 days before expiration
1602
01:06:22,320 --> 01:06:23,760
and notify the contract manager.
1603
01:06:23,760 --> 01:06:26,400
The skill is written in natural language,
1604
01:06:26,400 --> 01:06:28,720
stored as markdown in the agent assets library
1605
01:06:28,720 --> 01:06:30,320
and executed by the site agent.
1606
01:06:30,320 --> 01:06:32,320
It chains extraction, metadata update,
1607
01:06:32,320 --> 01:06:34,080
list operation and notification
1608
01:06:34,080 --> 01:06:36,000
into a single automated workflow
1609
01:06:36,000 --> 01:06:37,760
that runs without human intervention.
1610
01:06:37,760 --> 01:06:40,800
But production deployment requires caution.
1611
01:06:40,800 --> 01:06:42,400
Skills have no version control.
1612
01:06:42,400 --> 01:06:44,800
Edit the markdown, save it, and it's live.
1613
01:06:44,800 --> 01:06:47,120
There is no rollback. There is no sandbox for testing.
1614
01:06:47,120 --> 01:06:48,560
You write the skill, submit it,
1615
01:06:48,560 --> 01:06:50,000
and it runs on real data.
1616
01:06:50,000 --> 01:06:50,960
One syntax error,
1617
01:06:50,960 --> 01:06:52,960
and you've modified thousands of records.
1618
01:06:52,960 --> 01:06:54,960
And by default, everyone can create a skill.
1619
01:06:54,960 --> 01:06:56,400
Before you enable skills
1620
01:06:56,400 --> 01:06:57,520
in a production site,
1621
01:06:57,520 --> 01:06:59,840
restrict creation to designated owners.
1622
01:06:59,840 --> 01:07:02,240
Before you deploy a skill to a high-stakes library,
1623
01:07:02,240 --> 01:07:03,840
test it on a copy of the data.
1624
01:07:03,840 --> 01:07:05,920
The speed of skills is a double-edged sword.
1625
01:07:05,920 --> 01:07:07,840
Fast execution means fast mistakes.
1626
01:07:07,840 --> 01:07:09,360
You also need to monitor costs.
1627
01:07:09,360 --> 01:07:10,640
Azure pay-as-you-go meters
1628
01:07:10,640 --> 01:07:12,000
track every page processed,
1629
01:07:12,000 --> 01:07:13,120
every translation character,
1630
01:07:13,120 --> 01:07:14,480
every signature request.
1631
01:07:14,480 --> 01:07:16,160
The meters are granular and transparent,
1632
01:07:16,160 --> 01:07:18,080
but they accumulate quickly at scale.
1633
01:07:18,080 --> 01:07:20,800
Set up Azure Cost Management alerts before you go live.
1634
01:07:20,800 --> 01:07:22,240
Process a pilot batch,
1635
01:07:22,240 --> 01:07:23,840
measure the meter cost per document
1636
01:07:23,840 --> 01:07:25,840
and extrapolate to your full volume.
1637
01:07:25,840 --> 01:07:28,480
A surprise Azure bill is not the kind of production feedback
1638
01:07:28,480 --> 01:07:30,320
your finance team wants.
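The pilot-batch extrapolation is simple arithmetic, sketched below. The figures in the example call are placeholders, not real Azure prices; take the actual pilot cost from your own Azure Cost Management data.

```python
def projected_monthly_cost(pilot_cost, pilot_documents, monthly_documents):
    """Extrapolate a pilot batch's metered cost to a full monthly volume."""
    if pilot_documents <= 0:
        raise ValueError("pilot_documents must be positive")
    cost_per_document = pilot_cost / pilot_documents
    return cost_per_document * monthly_documents


# Placeholder numbers: a 200-document pilot that cost 25.00 in meter charges,
# projected onto 20,000 documents per month.
print(projected_monthly_cost(25.00, 200, 20_000))
```

A linear projection like this assumes the pilot batch has the same page counts and document mix as production, so choose the pilot documents accordingly.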
1639
01:07:30,320 --> 01:07:32,640
Search configuration is an often overlooked benefit
1640
01:07:32,640 --> 01:07:34,240
of populated metadata.
1641
01:07:34,240 --> 01:07:36,480
SharePoint Search uses crawled properties
1642
01:07:36,480 --> 01:07:38,560
and managed properties to index content.
1643
01:07:38,560 --> 01:07:41,200
When your model populates a managed metadata column,
1644
01:07:41,200 --> 01:07:42,800
that value becomes a managed property
1645
01:07:42,800 --> 01:07:45,040
that Search can filter, refine, and query.
1646
01:07:45,040 --> 01:07:46,800
Users can search for contract type: MSA
1647
01:07:46,800 --> 01:07:49,840
and find every master service agreement in the tenant.
1648
01:07:49,840 --> 01:07:52,960
They can search for expiration date from now to now plus 90 days
1649
01:07:52,960 --> 01:07:55,760
and find every contract expiring in the next 90 days.
1650
01:07:55,760 --> 01:07:57,280
These are not advanced queries.
1651
01:07:57,280 --> 01:07:58,800
They are standard search syntax
1652
01:07:58,800 --> 01:08:01,120
that becomes powerful when metadata exists.
1653
01:08:01,120 --> 01:08:02,960
Without metadata, Search is limited
1654
01:08:02,960 --> 01:08:05,840
to full-text keyword matching, which returns noise.
1655
01:08:05,840 --> 01:08:08,880
With metadata, Search becomes precision retrieval.
1656
01:08:08,880 --> 01:08:11,200
The user experience of automated metadata population
1657
01:08:11,200 --> 01:08:12,240
also drives adoption.
1658
01:08:12,240 --> 01:08:13,680
When a user uploads a contract
1659
01:08:13,680 --> 01:08:15,280
and sees the expiration date,
1660
01:08:15,280 --> 01:08:17,920
counterparty name, and contract type appear automatically
1661
01:08:17,920 --> 01:08:19,120
in the library columns,
1662
01:08:19,120 --> 01:08:21,040
they understand the value immediately.
1663
01:08:21,040 --> 01:08:23,680
When they see empty columns that require manual tagging,
1664
01:08:23,680 --> 01:08:24,960
they skip the step.
1665
01:08:24,960 --> 01:08:27,440
The model's accuracy directly affects user trust.
1666
01:08:27,440 --> 01:08:30,640
If the model populates columns correctly nine times out of 10,
1667
01:08:30,640 --> 01:08:32,560
users learn to trust the 10th.
1668
01:08:32,560 --> 01:08:34,160
If it fails five times out of 10,
1669
01:08:34,160 --> 01:08:35,680
users learn to ignore it.
1670
01:08:35,680 --> 01:08:37,520
Accuracy is not just a technical metric.
1671
01:08:37,520 --> 01:08:38,800
It is an adoption metric.
1672
01:08:38,800 --> 01:08:42,320
Migration scenarios are another place where deployment planning matters.
1673
01:08:42,320 --> 01:08:44,560
If you are migrating documents from a file share
1674
01:08:44,560 --> 01:08:47,680
or a legacy document management system into SharePoint,
1675
01:08:47,680 --> 01:08:49,920
you have a one-time opportunity to apply metadata
1676
01:08:49,920 --> 01:08:51,040
during the migration.
1677
01:08:51,040 --> 01:08:52,960
Tools like SharePoint Migration Manager
1678
01:08:52,960 --> 01:08:55,520
can map source properties to destination columns.
1679
01:08:55,520 --> 01:08:57,920
But most source systems have no structured metadata.
1680
01:08:57,920 --> 01:08:59,600
They have folder paths and file names.
1681
01:08:59,600 --> 01:09:01,600
You can use the folder path as a hint,
1682
01:09:01,600 --> 01:09:02,960
mapping a file share path such as Contracts/2024
1683
01:09:02,960 --> 01:09:05,600
to a contract year column.
1684
01:09:05,600 --> 01:09:08,000
But the real value comes from running the Syntex model
1685
01:09:08,000 --> 01:09:10,240
on the migrated documents after they land in SharePoint.
1686
01:09:10,240 --> 01:09:12,000
Plan the migration in two waves.
1687
01:09:12,000 --> 01:09:13,360
First, move the files.
1688
01:09:13,360 --> 01:09:14,880
Second, process them with the model.
1689
01:09:14,880 --> 01:09:16,240
The model will populate columns
1690
01:09:16,240 --> 01:09:17,840
that the migration could not.
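Using the folder path as a hint during the first wave can be as simple as pulling a four-digit year out of the source path. A small sketch, where the function name and the sample paths are hypothetical and the year range (19xx or 20xx) is an assumption:

```python
import re
from typing import Optional

# A standalone four-digit year, not embedded in a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def contract_year_from_path(path: str) -> Optional[int]:
    """Return the first year-like folder or file-name token in a path, if any."""
    match = YEAR_PATTERN.search(path)
    return int(match.group(0)) if match else None


print(contract_year_from_path(r"\\fileshare\Contracts\2024\acme-msa.pdf"))
print(contract_year_from_path(r"\\fileshare\Contracts\misc\acme-msa.pdf"))
```

Paths without a recognizable year return `None`, which leaves the column empty for the second wave, where the model fills it from the document content.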
1691
01:09:17,840 --> 01:09:19,840
Disaster recovery and business continuity
1692
01:09:19,840 --> 01:09:21,440
also depend on metadata.
1693
01:09:21,440 --> 01:09:24,400
If a site is accidentally deleted and restored from backup,
1694
01:09:24,400 --> 01:09:27,200
the documents return, but the model bindings might not.
1695
01:09:27,200 --> 01:09:29,760
If the model was bound to a library that no longer exists,
1696
01:09:29,760 --> 01:09:32,080
the restored library might not have the same binding.
1697
01:09:32,080 --> 01:09:34,160
Document your model-library relationships
1698
01:09:34,160 --> 01:09:36,800
and include them in your backup and recovery runbooks.
1699
01:09:36,800 --> 01:09:38,960
Test restoration in a sandbox environment.
1700
01:09:38,960 --> 01:09:41,520
A model that worked perfectly before a disaster
1701
01:09:41,520 --> 01:09:43,520
and failed silently after recovery
1702
01:09:43,520 --> 01:09:45,280
is not a resilient system.
1703
01:09:45,280 --> 01:09:47,680
Metadata governance includes disaster planning,
1704
01:09:47,680 --> 01:09:49,760
not just day-to-day operations.
1705
01:09:49,760 --> 01:09:51,920
From metadata to governed action.
1706
01:09:51,920 --> 01:09:53,840
Extracted metadata is just data
1707
01:09:53,840 --> 01:09:55,600
until you enforce something with it.
1708
01:09:55,600 --> 01:09:56,960
The real transformation happens
1709
01:09:56,960 --> 01:09:58,480
when you connect those library columns
1710
01:09:58,480 --> 01:10:00,960
to policies, workflows, and compliance controls.
1711
01:10:00,960 --> 01:10:02,800
Retention labels in Microsoft Purview
1712
01:10:02,800 --> 01:10:04,240
are the first place to connect.
1713
01:10:04,240 --> 01:10:06,160
When your model extracts an expiration date
1714
01:10:06,160 --> 01:10:07,840
and writes it to a SharePoint column,
1715
01:10:07,840 --> 01:10:09,600
you can configure a retention policy
1716
01:10:09,600 --> 01:10:11,600
that triggers six months before that date.
1717
01:10:11,600 --> 01:10:13,200
The document is flagged for review,
1718
01:10:13,200 --> 01:10:14,480
the owner is notified,
1719
01:10:14,480 --> 01:10:16,400
and the organization makes a conscious decision
1720
01:10:16,400 --> 01:10:18,480
to renew, renegotiate or dispose.
1721
01:10:18,480 --> 01:10:21,200
Without that connection, contracts expire silently,
1722
01:10:21,200 --> 01:10:22,560
auto-renew by default,
1723
01:10:22,560 --> 01:10:25,200
or sit in storage long past their useful life.
1724
01:10:25,200 --> 01:10:27,680
Retention without metadata is a blunt instrument.
1725
01:10:27,680 --> 01:10:30,640
Retention with extracted metadata is a precision tool.
1726
01:10:30,640 --> 01:10:32,560
Sensitivity labels work the same way.
1727
01:10:32,560 --> 01:10:34,240
When your model classifies a document
1728
01:10:34,240 --> 01:10:35,600
as containing personal data,
1729
01:10:35,600 --> 01:10:37,360
financial records, or intellectual property,
1730
01:10:37,360 --> 01:10:39,760
that classification can trigger a sensitivity label
1731
01:10:39,760 --> 01:10:41,680
that restricts external sharing,
1732
01:10:41,680 --> 01:10:44,320
encrypts the file, or applies a watermark.
1733
01:10:44,320 --> 01:10:46,560
The classification happens automatically at upload;
1734
01:10:46,560 --> 01:10:48,160
the protection follows immediately.
1735
01:10:48,160 --> 01:10:50,560
Users don't need to remember which documents are sensitive;
1736
01:10:50,560 --> 01:10:51,840
the model remembers for them.
1737
01:10:51,840 --> 01:10:54,080
Data loss prevention policies benefit enormously
1738
01:10:54,080 --> 01:10:55,760
from accurate classification.
1739
01:10:55,760 --> 01:10:58,480
DLP rules detect and protect sensitive information
1740
01:10:58,480 --> 01:11:00,880
across Microsoft 365 locations.
1741
01:11:00,880 --> 01:11:03,680
They use predefined or custom sensitive information types
1742
01:11:03,680 --> 01:11:04,880
and can take actions,
1743
01:11:04,880 --> 01:11:06,560
such as blocking external sharing,
1744
01:11:06,560 --> 01:11:08,400
sending alerts, or requiring justification
1745
01:11:08,400 --> 01:11:09,600
for policy overrides.
1746
01:11:09,600 --> 01:11:12,800
But DLP is only as accurate as the classification feeding it.
1747
01:11:12,800 --> 01:11:15,360
When documents carry reliable metadata generated
1748
01:11:15,360 --> 01:11:18,640
by trained models, DLP policies trigger on the right documents
1749
01:11:18,640 --> 01:11:20,800
and ignore the false positives.
1750
01:11:20,800 --> 01:11:22,480
When metadata is missing or wrong,
1751
01:11:22,480 --> 01:11:25,200
DLP becomes a nuisance that users bypass.
1752
01:11:25,200 --> 01:11:27,520
Legal hold scenarios demonstrate the full value
1753
01:11:27,520 --> 01:11:28,960
of governed metadata.
1754
01:11:28,960 --> 01:11:30,720
When litigation is anticipated,
1755
01:11:30,720 --> 01:11:32,720
legal teams need to identify, preserve,
1756
01:11:32,720 --> 01:11:34,560
and produce all relevant documents.
1757
01:11:34,560 --> 01:11:36,320
Without metadata, that means manual review
1758
01:11:36,320 --> 01:11:37,760
of every file in every site.
1759
01:11:37,760 --> 01:11:40,560
With metadata, it means filtering by document type,
1760
01:11:40,560 --> 01:11:43,360
date range, counterparty, and custodian.
1761
01:11:43,360 --> 01:11:46,400
A well-classified SharePoint environment reduces eDiscovery
1762
01:11:46,400 --> 01:11:49,520
from weeks of manual labor to hours of targeted export.
1763
01:11:49,520 --> 01:11:52,000
The chain of custody is documented automatically.
1764
01:11:52,000 --> 01:11:53,920
The metadata integrity is preserved.
1765
01:11:53,920 --> 01:11:55,280
The defensibility is stronger.
1766
01:11:55,280 --> 01:11:58,160
Dashboard visibility closes the loop for business stakeholders.
1767
01:11:58,160 --> 01:12:01,440
Power BI can connect directly to SharePoint lists and libraries,
1768
01:12:01,440 --> 01:12:03,920
visualizing the extracted metadata in real time.
1769
01:12:03,920 --> 01:12:06,240
Executives see contract renewal pipelines.
1770
01:12:06,240 --> 01:12:07,360
Compliance officers
1771
01:12:07,360 --> 01:12:09,680
see retention status across departments.
1772
01:12:09,680 --> 01:12:12,320
Security teams see sensitivity label coverage.
1773
01:12:12,320 --> 01:12:14,320
The documents that were once invisible
1774
01:12:14,320 --> 01:12:16,880
become the data source for strategic decisions.
1775
01:12:16,880 --> 01:12:18,640
The metadata gap that was a liability
1776
01:12:18,640 --> 01:12:20,320
becomes a competitive advantage.
1777
01:12:20,320 --> 01:12:21,920
And the digital trail is complete.
1778
01:12:21,920 --> 01:12:23,680
Every extraction, every column update,
1779
01:12:23,680 --> 01:12:24,960
every policy application,
1780
01:12:24,960 --> 01:12:27,920
and every user access is logged in SharePoint audit records.
1781
01:12:27,920 --> 01:12:30,720
When an auditor asks how you knew which documents to retain,
1782
01:12:30,720 --> 01:12:33,200
you show them the model that extracted the expiration date
1783
01:12:33,200 --> 01:12:34,960
and the policy that applied the label.
1784
01:12:34,960 --> 01:12:37,680
When a regulator asks how you protected sensitive data,
1785
01:12:37,680 --> 01:12:39,520
you show them the classification model
1786
01:12:39,520 --> 01:12:41,520
and the sensitivity label automation.
1787
01:12:41,520 --> 01:12:43,520
The metadata doesn't just organize your content,
1788
01:12:43,520 --> 01:12:44,880
it proves your governance.
1789
01:12:44,880 --> 01:12:46,880
Workflow automation triggered by metadata
1790
01:12:46,880 --> 01:12:49,520
is where the operational value becomes tangible.
1791
01:12:49,520 --> 01:12:51,280
When a model extracts a contract value
1792
01:12:51,280 --> 01:12:52,960
and writes it to a SharePoint column,
1793
01:12:52,960 --> 01:12:55,120
a Power Automate flow can evaluate that value
1794
01:12:55,120 --> 01:12:56,720
against approval thresholds.
1795
01:12:56,720 --> 01:12:59,360
A contract under $10,000 might route
1796
01:12:59,360 --> 01:13:01,120
to a department manager for approval.
1797
01:13:01,120 --> 01:13:04,480
A contract over $100,000 might route to legal, finance,
1798
01:13:04,480 --> 01:13:06,080
and the CFO in sequence.
1799
01:13:06,080 --> 01:13:08,400
The routing logic is driven by the extracted metadata,
1800
01:13:08,400 --> 01:13:10,160
not by manual form submissions.
1801
01:13:10,160 --> 01:13:12,560
The metadata becomes the workflow engine.
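The threshold routing just described could look like this in outline. It is a sketch of the decision logic only, not a Power Automate flow; the approver labels are illustrative, and the mid-range route is a placeholder assumption because only the under-10,000 and over-100,000 cases are stated.

```python
def approval_route(contract_value):
    """Return the ordered list of approvers for an extracted contract value."""
    if contract_value < 10_000:
        return ["Department Manager"]
    if contract_value > 100_000:
        # Sequential approval: each approver acts only after the previous one.
        return ["Legal", "Finance", "CFO"]
    # Mid-range contracts are not specified; this route is an assumption.
    return ["Department Manager", "Finance"]


for value in (5_000, 50_000, 250_000):
    print(value, approval_route(value))
```

Because the route depends only on the extracted column value, a wrong extraction sends the contract to the wrong approvers, which is one more reason field-level accuracy matters before wiring metadata into workflows.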
1802
01:13:12,560 --> 01:13:15,440
Document lifecycle management also improves dramatically.
1803
01:13:15,440 --> 01:13:18,480
Most organizations have a loose concept of document lifecycle,
1804
01:13:18,480 --> 01:13:20,720
but they lack the mechanism to enforce it.
1805
01:13:20,720 --> 01:13:23,040
A contract might have six lifecycle stages,
1806
01:13:23,040 --> 01:13:26,800
draft, under review, approved, executed, active, and expired.
1807
01:13:26,800 --> 01:13:28,480
Each stage has different permissions,
1808
01:13:28,480 --> 01:13:31,440
different retention rules, and different notification requirements.
1809
01:13:31,440 --> 01:13:33,360
With metadata-driven lifecycle management,
1810
01:13:33,360 --> 01:13:35,040
the stage is a SharePoint column
1811
01:13:35,040 --> 01:13:37,520
populated by the model or by workflow logic.
1812
01:13:37,520 --> 01:13:39,920
When the model detects signatures on the final page,
1813
01:13:39,920 --> 01:13:41,840
it updates the stage to executed.
1814
01:13:41,840 --> 01:13:43,440
When the expiration date passes,
1815
01:13:43,440 --> 01:13:45,520
the workflow updates the stage to expired
1816
01:13:45,520 --> 01:13:47,040
and triggers a retention review.
1817
01:13:47,040 --> 01:13:49,680
The document moves through its lifecycle automatically,
1818
01:13:49,680 --> 01:13:51,680
guided by the metadata it carries.
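Those two automatic transitions can be written down as a small state function. This is a sketch under stated assumptions: the `Stage` enum mirrors the six stages named above, while the function name, its parameters, and the rule that only pre-signature stages can move to executed are illustrative choices.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Stage(Enum):
    DRAFT = "Draft"
    UNDER_REVIEW = "Under review"
    APPROVED = "Approved"
    EXECUTED = "Executed"
    ACTIVE = "Active"
    EXPIRED = "Expired"


# Assumption: a detected signature only advances stages that precede signing.
PRE_SIGNATURE = {Stage.DRAFT, Stage.UNDER_REVIEW, Stage.APPROVED}


def next_stage(current: Stage, signatures_detected: bool,
               expiration_date: Optional[date], today: date) -> Stage:
    """Apply the two metadata-driven transitions; otherwise keep the stage."""
    if expiration_date is not None and today > expiration_date:
        return Stage.EXPIRED
    if signatures_detected and current in PRE_SIGNATURE:
        return Stage.EXECUTED
    return current


print(next_stage(Stage.APPROVED, True, date(2026, 1, 1), date(2025, 6, 1)))
print(next_stage(Stage.ACTIVE, True, date(2025, 1, 1), date(2025, 6, 1)))
```

The other transitions (draft to under review, executed to active, and so on) would come from workflow logic or human action and are left out here.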
1819
01:13:51,680 --> 01:13:53,760
External collaboration scenarios benefit
1820
01:13:53,760 --> 01:13:55,760
from metadata-driven access control.
1821
01:13:55,760 --> 01:13:57,840
When you share a document with an external law firm
1822
01:13:57,840 --> 01:14:00,640
or audit partner, you need to know what you are sharing.
1823
01:14:00,640 --> 01:14:03,680
A document tagged with client confidential and legal privilege
1824
01:14:03,680 --> 01:14:05,200
requires different sharing controls
1825
01:14:05,200 --> 01:14:08,080
than a document tagged with public reference material.
1826
01:14:08,080 --> 01:14:12,000
Metadata-driven sharing policies can enforce these distinctions automatically.
1827
01:14:12,000 --> 01:14:14,400
The model classifies the document at upload.
1828
01:14:14,400 --> 01:14:16,960
The sensitivity label applies the sharing restriction.
1829
01:14:16,960 --> 01:14:19,920
The external recipient receives only what they are supposed to receive
1830
01:14:19,920 --> 01:14:21,840
and the organization maintains an audit trail
1831
01:14:21,840 --> 01:14:24,560
of what was shared, why, and when.
1832
01:14:24,560 --> 01:14:27,520
Version control and document history also improve with metadata.
1833
01:14:27,520 --> 01:14:29,760
When a contract goes through five revision cycles,
1834
01:14:29,760 --> 01:14:31,440
the version history shows what changed,
1835
01:14:31,440 --> 01:14:32,880
but it doesn't explain why.
1836
01:14:32,880 --> 01:14:35,120
A metadata column for approval stage can show
1837
01:14:35,120 --> 01:14:37,520
that version two was the legal review draft,
1838
01:14:37,520 --> 01:14:39,520
version three included finance feedback,
1839
01:14:39,520 --> 01:14:42,720
and version four was the final executive approval copy.
1840
01:14:42,720 --> 01:14:44,480
The version history tells you when.
1841
01:14:44,480 --> 01:14:46,000
The metadata tells you why.
1842
01:14:46,000 --> 01:14:48,240
Together, they create a complete narrative
1843
01:14:48,240 --> 01:14:51,040
of the document lifecycle that satisfies auditors,
1844
01:14:51,040 --> 01:14:53,040
protects the organization in disputes,
1845
01:14:53,040 --> 01:14:56,080
and educates new team members about how decisions were made.
1846
01:14:56,080 --> 01:14:59,840
Training and compliance reporting also rely on metadata.
1847
01:14:59,840 --> 01:15:02,400
Regulatory frameworks often require evidence
1848
01:15:02,400 --> 01:15:04,800
that employees have read and acknowledged policies.
1849
01:15:04,800 --> 01:15:06,640
A policy document tagged with acknowledgement required
1850
01:15:06,640 --> 01:15:09,520
and employee handbook can trigger a training workflow
1851
01:15:09,520 --> 01:15:11,360
that assigns the document to every new hire
1852
01:15:11,360 --> 01:15:12,800
and tracks their acknowledgement.
1853
01:15:12,800 --> 01:15:14,720
The metadata drives the assignment list.
1854
01:15:14,720 --> 01:15:16,800
The workflow tracks the completion status.
1855
01:15:16,800 --> 01:15:20,080
The audit report shows who has acknowledged which policies and when.
1856
01:15:20,080 --> 01:15:22,560
Without metadata, this becomes a manual mailing list
1857
01:15:22,560 --> 01:15:24,400
and a spreadsheet that nobody updates.
1858
01:15:24,400 --> 01:15:26,080
The cumulative effect of these connections
1859
01:15:26,080 --> 01:15:28,320
is that SharePoint stops being a file repository
1860
01:15:28,320 --> 01:15:30,080
and starts being a business system.
1861
01:15:30,080 --> 01:15:31,680
Documents are not passive storage.
1862
01:15:31,680 --> 01:15:33,760
They are active records that trigger workflows
1863
01:15:33,760 --> 01:15:36,880
enforce policies, generate reports, and drive decisions.
1864
01:15:36,880 --> 01:15:39,360
The metadata is what makes this transition possible
1865
01:15:39,360 --> 01:15:41,760
and the model is what generates the metadata at scale.
1866
01:15:41,760 --> 01:15:44,400
Scaling AI transformation across the enterprise.
1867
01:15:44,400 --> 01:15:46,000
One library is proof of concept.
1868
01:15:46,000 --> 01:15:47,680
The real return comes when you scale,
1869
01:15:47,680 --> 01:15:50,720
but scaling requires more than copying the model to another site.
1870
01:15:50,720 --> 01:15:53,280
It requires a strategy, a governance framework
1871
01:15:53,280 --> 01:15:54,800
and a change management plan.
1872
01:15:54,800 --> 01:15:56,640
Start by measuring what matters.
1873
01:15:56,640 --> 01:15:57,680
Before you expand,
1874
01:15:57,680 --> 01:15:59,840
establish baseline metrics for your pilot.
1875
01:15:59,840 --> 01:16:01,280
How many documents were processed?
1876
01:16:01,280 --> 01:16:03,520
What was the average extraction accuracy?
1877
01:16:03,520 --> 01:16:06,000
How many hours did staff spend on manual data entry
1878
01:16:06,000 --> 01:16:07,840
before the model and how many after?
1879
01:16:07,840 --> 01:16:11,120
How many compliance incidents were prevented by automated retention?
1880
01:16:11,120 --> 01:16:13,280
These numbers build the business case for expansion.
1881
01:16:13,280 --> 01:16:16,400
Without them, you're asking for budget based on faith.
1882
01:16:16,400 --> 01:16:19,920
The ROI calculation combines time saved, error reduction,
1883
01:16:19,920 --> 01:16:21,280
and risk avoidance.
1884
01:16:21,280 --> 01:16:24,000
Industry benchmarks suggest that automated document processing
1885
01:16:24,000 --> 01:16:28,000
can reduce processing time by 70% and error rates by 80%.
1886
01:16:28,000 --> 01:16:30,000
But these are assumptions, not guarantees.
1887
01:16:30,000 --> 01:16:32,240
Your actual results depend on document quality,
1888
01:16:32,240 --> 01:16:34,640
model accuracy and process maturity.
1889
01:16:34,640 --> 01:16:35,920
Measure your own results.
1890
01:16:35,920 --> 01:16:38,480
Use the pilot data to build a custom ROI model
1891
01:16:38,480 --> 01:16:40,480
that reflects your organization's reality,
1892
01:16:40,480 --> 01:16:42,400
not a vendor's best-case scenario.
1893
01:16:42,400 --> 01:16:45,120
Expanding from legal to finance, HR and operations
1894
01:16:45,120 --> 01:16:46,160
follows a pattern.
1895
01:16:46,160 --> 01:16:48,400
Identify the next document type with high volume,
1896
01:16:48,400 --> 01:16:50,640
high manual effort and high business risk.
1897
01:16:50,640 --> 01:16:52,960
Finance might process invoices and purchase orders,
1898
01:16:52,960 --> 01:16:56,400
HR might process employee agreements and benefits forms,
1899
01:16:56,400 --> 01:16:58,480
operations might process safety reports
1900
01:16:58,480 --> 01:17:00,400
and vendor certifications.
1901
01:17:00,400 --> 01:17:02,720
Each domain has its own document patterns,
1902
01:17:02,720 --> 01:17:05,600
its own taxonomy and its own compliance requirements.
1903
01:17:05,600 --> 01:17:08,160
Don't assume the contract model transfers directly.
1904
01:17:08,160 --> 01:17:10,000
Train a new model for each document type
1905
01:17:10,000 --> 01:17:11,920
using domain experts from that department.
1906
01:17:11,920 --> 01:17:14,800
Change management is the most underestimated part of scaling.
1907
01:17:14,800 --> 01:17:16,080
The technology is the easy part.
1908
01:17:16,080 --> 01:17:17,760
The hard part is getting people to trust it.
1909
01:17:17,760 --> 01:17:20,080
Contract managers need to see that the model's extractions
1910
01:17:20,080 --> 01:17:23,440
are accurate before they stop manually checking every field.
1911
01:17:23,440 --> 01:17:26,400
Legal teams need to understand that the retention policy is defensible
1912
01:17:26,400 --> 01:17:28,320
before they rely on it for eDiscovery.
1913
01:17:28,320 --> 01:17:31,760
IT teams need to know that the model won't generate surprise
1914
01:17:31,760 --> 01:17:34,320
Azure bills before they let it run unattended.
1915
01:17:34,320 --> 01:17:36,000
Trust is built through transparency.
1916
01:17:36,000 --> 01:17:37,520
Show people the confidence scores,
1917
01:17:37,520 --> 01:17:39,200
show them the correction workflow,
1918
01:17:39,200 --> 01:17:41,280
let them override the model when it's wrong
1919
01:17:41,280 --> 01:17:43,840
and feed those corrections back into training.
1920
01:17:43,840 --> 01:17:45,760
Governance councils become critical at scale.
1921
01:17:45,760 --> 01:17:48,560
A single model in one library can be managed by one owner.
1922
01:17:48,560 --> 01:17:51,440
50 models across 10 departments need coordination.
1923
01:17:51,440 --> 01:17:53,440
Establish a cross-functional governance council
1924
01:17:53,440 --> 01:17:54,960
with representatives from IT,
1925
01:17:54,960 --> 01:17:56,720
legal, compliance, and business units.
1926
01:17:56,720 --> 01:17:58,640
Define who owns the term sets,
1927
01:17:58,640 --> 01:18:00,000
who approves new models,
1928
01:18:00,000 --> 01:18:02,960
who monitors accuracy and who handles exceptions.
1929
01:18:02,960 --> 01:18:04,240
Without this structure,
1930
01:18:04,240 --> 01:18:06,320
models proliferate inconsistently,
1931
01:18:06,320 --> 01:18:07,760
taxonomies diverge,
1932
01:18:07,760 --> 01:18:10,720
and the metadata gap reappears at a larger scale.
1933
01:18:10,720 --> 01:18:11,920
The roadmap is simple.
1934
01:18:11,920 --> 01:18:13,200
Audit your current workflows
1935
01:18:13,200 --> 01:18:15,520
and identify the highest impact document type.
1936
01:18:15,520 --> 01:18:17,840
Pilot a model for that type in a single library.
1937
01:18:17,840 --> 01:18:20,000
Measure accuracy, cost, and time savings.
1938
01:18:20,000 --> 01:18:21,440
Iterate on the training data
1939
01:18:21,440 --> 01:18:23,120
until the model is production ready.
1940
01:18:23,120 --> 01:18:24,640
Connect the extracted metadata
1941
01:18:24,640 --> 01:18:27,520
to retention, sensitivity and DLP policies,
1942
01:18:27,520 --> 01:18:28,960
measure compliance outcomes,
1943
01:18:28,960 --> 01:18:30,800
then expand to the next document type
1944
01:18:30,800 --> 01:18:31,760
and the next,
1945
01:18:31,760 --> 01:18:34,320
until the entire document lifecycle is governed.
1946
01:18:34,320 --> 01:18:35,920
This is not a big bang transformation,
1947
01:18:35,920 --> 01:18:37,280
it is a series of deliberate,
1948
01:18:37,280 --> 01:18:39,200
measured steps that compound over time.
1949
01:18:39,200 --> 01:18:41,680
Department-specific expansion requires different planning
1950
01:18:41,680 --> 01:18:42,720
for each domain.
1951
01:18:42,720 --> 01:18:44,880
Finance departments typically start with invoices
1952
01:18:44,880 --> 01:18:46,000
and purchase orders
1953
01:18:46,000 --> 01:18:47,920
because these documents have high volume,
1954
01:18:47,920 --> 01:18:48,960
consistent structure,
1955
01:18:48,960 --> 01:18:50,400
and direct cost impact.
1956
01:18:50,400 --> 01:18:52,080
The ROI is easy to calculate:
1957
01:18:52,080 --> 01:18:53,840
multiply the number of invoices per month
1958
01:18:53,840 --> 01:18:55,600
by the minutes saved per invoice
1959
01:18:55,600 --> 01:18:57,440
by the fully loaded cost per minute.
1960
01:18:57,440 --> 01:18:58,960
The result is a monthly savings number
1961
01:18:58,960 --> 01:19:00,720
that justifies the model investment.
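As a rough illustration of the invoice ROI arithmetic just described, here is a minimal Python sketch. The function name and the sample figures (2,000 invoices a month, 3 minutes saved per invoice, a fully loaded cost of 60 per hour) are hypothetical assumptions for illustration, not numbers from any pilot.

```python
def monthly_invoice_savings(invoices_per_month: int,
                            minutes_saved_per_invoice: float,
                            loaded_cost_per_hour: float) -> float:
    """Monthly savings = invoices x minutes saved x fully loaded cost per minute."""
    # Convert the hourly fully loaded cost into a per-minute cost.
    cost_per_minute = loaded_cost_per_hour / 60
    return invoices_per_month * minutes_saved_per_invoice * cost_per_minute


# Hypothetical inputs; replace them with measurements from your own pilot.
savings = monthly_invoice_savings(
    invoices_per_month=2000,
    minutes_saved_per_invoice=3,
    loaded_cost_per_hour=60.0,
)
print(f"Estimated monthly savings: {savings:.2f}")
```

Compare the resulting monthly figure against what the model costs to run and maintain before treating it as justification for the investment.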
1962
01:19:00,720 --> 01:19:02,960
HR departments often start with employee agreements
1963
01:19:02,960 --> 01:19:04,320
and benefits enrollment forms
1964
01:19:04,320 --> 01:19:06,240
because these carry compliance risk.
1965
01:19:06,240 --> 01:19:07,680
A misfiled I-9 form
1966
01:19:07,680 --> 01:19:09,440
or an unsigned non-compete agreement
1967
01:19:09,440 --> 01:19:11,760
creates legal exposure that is easy to quantify.
1968
01:19:11,760 --> 01:19:13,200
Operations departments might start
1969
01:19:13,200 --> 01:19:14,480
with safety incident reports
1970
01:19:14,480 --> 01:19:15,760
or vendor qualification documents
1971
01:19:15,760 --> 01:19:17,360
because these tie directly to insurance
1972
01:19:17,360 --> 01:19:18,640
and audit requirements.
1973
01:19:18,640 --> 01:19:20,240
Each department has its own pain point,
1974
01:19:20,240 --> 01:19:21,440
its own document type
1975
01:19:21,440 --> 01:19:22,960
and its own success metric.
1976
01:19:22,960 --> 01:19:24,640
Don't impose a single template.
1977
01:19:24,640 --> 01:19:26,240
Let each department solve its own
1978
01:19:26,240 --> 01:19:28,240
highest priority problem first.
1979
01:19:28,240 --> 01:19:30,320
Technology adoption curves vary by department
1980
01:19:30,320 --> 01:19:31,120
and by role.
1981
01:19:31,120 --> 01:19:32,800
Early adopters in IT and legal
1982
01:19:32,800 --> 01:19:34,800
might embrace the model immediately.
1983
01:19:34,800 --> 01:19:36,560
Skeptics in finance and operations
1984
01:19:36,560 --> 01:19:38,400
might resist until they see proof.
1985
01:19:38,400 --> 01:19:39,600
Plan for this variation.
1986
01:19:39,600 --> 01:19:41,520
Identify your champions in each department,
1987
01:19:41,520 --> 01:19:43,840
give them early access to the pilot results,
1988
01:19:43,840 --> 01:19:45,680
let them become internal advocates.
1989
01:19:45,680 --> 01:19:47,440
When a contract manager tells her colleagues
1990
01:19:47,440 --> 01:19:49,600
that the model saved her four hours last week,
1991
01:19:49,600 --> 01:19:50,960
that testimony is more powerful
1992
01:19:50,960 --> 01:19:52,720
than any business case presentation.
1993
01:19:52,720 --> 01:19:54,960
Peer influence drives adoption more effectively
1994
01:19:54,960 --> 01:19:56,400
than top-down mandates.
1995
01:19:56,400 --> 01:19:58,560
Training programs need to cover three audiences.
1996
01:19:58,560 --> 01:20:00,720
End users need to know how to upload documents
1997
01:20:00,720 --> 01:20:02,720
and how to verify automated extractions.
1998
01:20:02,720 --> 01:20:05,280
They need to understand what the confidence score means
1999
01:20:05,280 --> 01:20:06,960
and when to override the model.
2000
01:20:06,960 --> 01:20:08,720
Site owners need to know how to configure
2001
01:20:08,720 --> 01:20:11,200
libraries, bind models and create views.
2002
01:20:11,200 --> 01:20:13,200
They need to understand taxonomy basics
2003
01:20:13,200 --> 01:20:15,760
so they don't break the model by changing column names.
2004
01:20:15,760 --> 01:20:18,080
Administrators need to know how to monitor costs,
2005
01:20:18,080 --> 01:20:20,800
manage model versions, and enforce governance rules.
2006
01:20:20,800 --> 01:20:23,120
Each audience needs a different depth of training
2007
01:20:23,120 --> 01:20:24,480
and each needs hands-on practice
2008
01:20:24,480 --> 01:20:26,640
with real documents from their own department.
2009
01:20:26,640 --> 01:20:28,800
Generic training on sample documents fails
2010
01:20:28,800 --> 01:20:31,440
because it doesn't reflect the messiness of real-world content.
2011
01:20:31,440 --> 01:20:34,160
Executive sponsorship is the single biggest predictor
2012
01:20:34,160 --> 01:20:35,520
of scaling success.
2013
01:20:35,520 --> 01:20:36,880
Without a senior leader who believes
2014
01:20:36,880 --> 01:20:38,800
in the value of metadata governance,
2015
01:20:38,800 --> 01:20:40,800
the project becomes an IT experiment
2016
01:20:40,800 --> 01:20:43,120
that dies when the initial budget runs out.
2017
01:20:43,120 --> 01:20:44,480
With executive sponsorship,
2018
01:20:44,480 --> 01:20:46,320
the project becomes a strategic initiative
2019
01:20:46,320 --> 01:20:48,560
that survives reorganizations, budget cuts,
2020
01:20:48,560 --> 01:20:49,760
and staff turnover.
2021
01:20:49,760 --> 01:20:51,520
The sponsor does not need to understand
2022
01:20:51,520 --> 01:20:53,600
transformer models or Azure meters.
2023
01:20:53,600 --> 01:20:55,600
They need to understand that unmanaged documents
2024
01:20:55,600 --> 01:20:58,080
are a liability, that AI can fix it,
2025
01:20:58,080 --> 01:21:01,440
and that the fix requires sustained investment in taxonomy,
2026
01:21:01,440 --> 01:21:03,120
training, and governance.
2027
01:21:03,120 --> 01:21:05,200
Their job is to protect the program long enough
2028
01:21:05,200 --> 01:21:07,840
for the compounding benefits to become visible.
2029
01:21:07,840 --> 01:21:10,640
Metrics and reporting sustain the program over time.
2030
01:21:10,640 --> 01:21:12,880
After the pilot, establish a governance dashboard
2031
01:21:12,880 --> 01:21:15,360
that tracks the metrics that matter to your organization.
2032
01:21:15,360 --> 01:21:17,120
Document volume processed per month.
2033
01:21:17,120 --> 01:21:19,200
Average extraction accuracy by field.
2034
01:21:19,200 --> 01:21:20,880
Cost per document processed.
2035
01:21:20,880 --> 01:21:22,320
Hours saved per department.
2036
01:21:22,320 --> 01:21:23,920
Compliance incidents prevented.
2037
01:21:23,920 --> 01:21:25,360
Audit findings resolved.
2038
01:21:25,360 --> 01:21:26,880
Share this dashboard monthly
2039
01:21:26,880 --> 01:21:28,160
with the governance council
2040
01:21:28,160 --> 01:21:30,000
and quarterly with executive leadership.
2041
01:21:30,000 --> 01:21:32,080
When the numbers improve, celebrate the progress.
2042
01:21:32,080 --> 01:21:33,200
When the numbers stall,
2043
01:21:33,200 --> 01:21:34,560
diagnose the problem.
2044
01:21:34,560 --> 01:21:36,080
Visibility creates accountability
2045
01:21:36,080 --> 01:21:37,760
and accountability creates improvement.
2046
01:21:37,760 --> 01:21:39,520
The organizations that succeed treat this
2047
01:21:39,520 --> 01:21:41,040
not as a technology deployment
2048
01:21:41,040 --> 01:21:42,320
but as a capability build.
2049
01:21:42,320 --> 01:21:43,440
Technology changes.
2050
01:21:43,440 --> 01:21:46,080
Models get renamed, interfaces evolve.
2051
01:21:46,080 --> 01:21:47,360
The capability that endures
2052
01:21:47,360 --> 01:21:49,840
is the organization's ability to classify its content,
2053
01:21:49,840 --> 01:21:51,040
govern its metadata,
2054
01:21:51,040 --> 01:21:53,120
and automate its document processes.
2055
01:21:53,120 --> 01:21:54,160
Build that capability
2056
01:21:54,160 --> 01:21:56,240
and you will survive every rebrand.
2057
01:21:56,240 --> 01:21:57,840
Build only a point solution
2058
01:21:57,840 --> 01:21:58,960
and you will be rebuilding
2059
01:21:58,960 --> 01:22:00,720
when the next rename arrives.
2060
01:22:00,720 --> 01:22:02,000
Finally, consider the platform
2061
01:22:02,000 --> 01:22:03,520
around your SharePoint deployment.
2062
01:22:03,520 --> 01:22:06,560
Microsoft 365 is not a collection of separate products.
2063
01:22:06,560 --> 01:22:07,840
It is an integrated platform
2064
01:22:07,840 --> 01:22:10,160
where SharePoint, Teams, Outlook, OneDrive,
2065
01:22:10,160 --> 01:22:10,960
Power Platform,
2066
01:22:10,960 --> 01:22:13,760
and Purview share data, permissions, and metadata.
2067
01:22:13,760 --> 01:22:15,920
When you fix the metadata gap in SharePoint,
2068
01:22:15,920 --> 01:22:18,080
you improve the signal for Copilot in Teams,
2069
01:22:18,080 --> 01:22:19,440
for Power BI in Outlook,
2070
01:22:19,440 --> 01:22:21,600
and for Purview across the entire tenant.
2071
01:22:21,600 --> 01:22:24,400
The metadata you extract from a contract in SharePoint today
2072
01:22:24,400 --> 01:22:27,840
becomes the data that drives a Power Apps approval workflow tomorrow,
2073
01:22:27,840 --> 01:22:30,160
a Power BI risk dashboard next quarter,
2074
01:22:30,160 --> 01:22:32,640
and a Copilot answer six months from now.
2075
01:22:32,640 --> 01:22:34,320
The investment in taxonomy and governance
2076
01:22:34,320 --> 01:22:37,520
pays dividends across every workload in Microsoft 365.
2077
01:22:37,520 --> 01:22:40,320
That is why the metadata gap is not just a SharePoint problem.
2078
01:22:40,320 --> 01:22:41,520
It is a platform problem
2079
01:22:41,520 --> 01:22:43,440
and fixing it is not just an IT project,
2080
01:22:43,440 --> 01:22:44,880
it is a business transformation.
2081
01:22:44,880 --> 01:22:46,640
And remember, nothing you built is wasted,
2082
01:22:46,640 --> 01:22:48,480
but nothing you built is finished either.
2083
01:22:48,480 --> 01:22:49,680
The Syntex era has ended,
2084
01:22:49,680 --> 01:22:52,720
but its capabilities live on as part of AI in SharePoint.
2085
01:22:52,720 --> 01:22:54,240
Your existing models still bill,
2086
01:22:54,240 --> 01:22:55,760
still run, and still extract,
2087
01:22:55,760 --> 01:22:57,280
but their behavior has changed.
2088
01:22:57,280 --> 01:22:59,040
The user interface is conversational.
2089
01:22:59,040 --> 01:23:00,800
The configuration is intent-based.
2090
01:23:00,800 --> 01:23:03,760
The governance is distributed across agents and skills.
2091
01:23:03,760 --> 01:23:05,600
Treat this as a refresh, not a swap,
2092
01:23:05,600 --> 01:23:06,880
validate before production,
2093
01:23:06,880 --> 01:23:08,560
plan for continuous iteration,
2094
01:23:08,560 --> 01:23:09,840
and build your architecture
2095
01:23:09,840 --> 01:23:11,600
so that it survives the next rename
2096
01:23:11,600 --> 01:23:12,720
because there will be one.
2097
01:23:13,920 --> 01:23:16,160
The metadata gap is not a technology problem.
2098
01:23:16,160 --> 01:23:18,160
It is a structural problem that technology can fix
2099
01:23:18,160 --> 01:23:19,760
if you build the foundation first.
2100
01:23:19,760 --> 01:23:21,280
Custom AI document models
2101
01:23:21,280 --> 01:23:23,360
turn your SharePoint liability into a governed,
2102
01:23:23,360 --> 01:23:24,880
searchable, Copilot-ready asset.
2103
01:23:24,880 --> 01:23:27,600
But the model is only as good as the taxonomy it feeds,
2104
01:23:27,600 --> 01:23:29,920
and the taxonomy is only as good as the governance
2105
01:23:29,920 --> 01:23:30,720
that maintains it.
2106
01:23:30,720 --> 01:23:32,400
If this changed how you think about SharePoint,
2107
01:23:32,400 --> 01:23:33,680
follow me on LinkedIn.
2108
01:23:33,680 --> 01:23:35,120
And if you want the next detailed look
2109
01:23:35,120 --> 01:23:36,560
at Power Automate orchestration
2110
01:23:36,560 --> 01:23:37,920
for extracted metadata,
2111
01:23:37,920 --> 01:23:38,960
that's coming next.









