June 10, 2026

How to Architect Low-Cost AI Agents in the Microsoft Cloud

How to Architect Low-Cost AI Agents in the Microsoft Cloud
How to Architect Low-Cost AI Agents in the Microsoft Cloud
M365 FM Podcast
How to Architect Low-Cost AI Agents in the Microsoft Cloud

In this episode, we explore how to design and operate low-cost AI agents in the Microsoft Cloud without sacrificing quality, security, or scalability.

Many organizations assume that building AI solutions automatically means high Azure OpenAI costs. In reality, the biggest savings often come from architectural decisions rather than model selection alone. The discussion focuses on choosing the right model for the right task, reducing unnecessary token consumption, and avoiding expensive processing patterns that provide little business value.

Listeners will learn how to combine Microsoft 365, Azure OpenAI, Copilot Studio, and Power Platform services to build efficient AI agents that deliver measurable outcomes while keeping cloud spending under control. The episode covers practical techniques such as prompt optimization, retrieval-based architectures, intelligent orchestration, caching strategies, and workload segmentation between large and small language models.

The conversation also highlights when organizations should use premium AI capabilities and when simpler automation or rule-based approaches can achieve the same result at a fraction of the cost. Governance, monitoring, and cost visibility are discussed as essential components of any successful AI deployment.

By the end of the episode, you'll understand how to architect AI agents that balance performance, security, and operational costs, helping your organization move from AI experimentation to sustainable, production-ready solutions in the Microsoft Cloud.

Apple Podcasts podcast player iconSpotify podcast player iconYoutube Music podcast player iconSpreaker podcast player iconPodchaser podcast player iconAmazon Music podcast player icon

You can architect low-cost ai agents in azure by making strategic choices. Many organizations overlook hidden costs like licensing, API calls, and message processing when building an agent. Tools like copilot and azure ai foundry help you design efficient solutions. You need to monitor resources, automate shutdowns, and review support plans to manage cost. With microsoft 365 copilot now widely available, it is vital to audit your ai agent expenses and prepare for the changes coming in November 2026.

Key Takeaways

  • Understand the difference between AI agents and chatbots. AI agents handle complex tasks, while chatbots manage simple interactions.
  • Be aware of hidden costs like context, reasoning, and autonomy when deploying AI agents. These factors can significantly impact your budget.
  • Choose the right billing model in Azure. Use pre-purchase plans for predictable savings and better cost management.
  • Implement semantic caching to reduce API calls. This strategy can lower costs and improve response times for your AI agents.
  • Use prompt compression to minimize token usage. Shorter prompts save money and speed up processing.
  • Monitor resource usage regularly. Set up alerts for spending spikes to keep your AI project within budget.
  • Consider using multi-agent systems for better scalability. Assign specific tasks to each agent to optimize performance and cost.
  • Prepare for upcoming changes in Azure billing by reviewing your usage and setting up billing policies to control expenses.

Low-Cost AI Agents: Key Concepts

AI Agents vs. Chatbots

You may wonder how ai agents differ from chatbots, especially when considering cost. Chatbots handle basic tasks like answering FAQs or providing simple customer support. Their cost usually ranges from $2,000 to $20,000, depending on how much you want to integrate them with other systems. Ai agents, on the other hand, manage complex workflows and make proactive decisions. This means you need to budget more for their development and integration.

  • Chatbots work best for straightforward interactions.
  • Ai agents support advanced scenarios, such as automating business processes or analyzing data for insights.

For small and medium businesses, you might spend $50 to $500 each month for basic ai agent solutions. Enterprises often pay $100 to $500 per user monthly, with extra costs for support and implementation. When building agents in azure, you must consider these differences to avoid surprises in your cost management plan.

Hidden Taxes: Context, Reasoning, Autonomous

When you deploy ai agents in azure, you face three hidden taxes that can impact your budget. These taxes come from the context, reasoning, and autonomy features that make agents powerful.

  • Context Tax: Each time your agent processes more information or longer conversations, you pay for extra tokens and compute time.
  • Reasoning Tax: If your agent needs to make complex decisions, you may need premium models or higher service levels, which increase costs.
  • Autonomous Tax: As your agent becomes more independent, vendors may charge more for advanced integration and support.

Every LLM call is a gamble on both cost and result. Use the Small-Model-First pattern. Start with a smaller model to classify intent, and only escalate to a larger model like GPT-4o for complex tasks. This approach can reduce your FinOps overhead by up to 80% without losing intelligence.

You should always review how these hidden taxes affect your ai agent expenses in azure ai foundry.

Billing Layers and Consumption Costs

Understanding billing layers is key to controlling the cost of low-cost ai agents in azure. Each service in azure ai foundry has its own billing model. You need to know how these models work to manage your budget.

ComponentDescriptionCost Example
Token ConsumptionCosts for input/output tokens across models and agents.$0.01 per token
User ConcurrencyCosts for the number of users and their sessions.Variable based on usage
Agent Logging CostsCosts for logging and observability.Variable based on logging volume
Microsoft Agent Pre-Purchase PlanUnified plan with discounts on ai services.$19,000 for 20,000 ACUs
Provisioned Throughput Units (PTU)Costs for throughput in foundry.$1 per PTU

You can choose a pre-purchase plan for predictable savings and easier cost management. This plan covers over 30 services and offers a single governance path for your ai projects. By understanding these billing layers, you can make better decisions when building agents and keep your ai costs under control.

Azure Architecture for Cost Efficiency

Azure Architecture for Cost Efficiency

Model Selection and Hosting

Choosing the right model is the most important step for cost-efficient ai agent deployment in azure. You need to match the model size and complexity to your task. Smaller models often handle simple tasks well and cost less. Larger models like GPT-4o offer advanced reasoning but increase expenses. You should review your model choices often because azure updates its catalog with new options that can improve performance and reduce costs.

Azure OpenAI vs. Custom Models

Azure OpenAI gives you access to powerful models for ai agent development. You can use pre-built models for common tasks or create custom models for unique needs. Custom models require more development time and resources. You must consider the balance between flexibility and cost. Azure OpenAI works well for most scenarios, especially when you want to scale quickly. Custom models fit best when you need specialized ai features or want to optimize for specific business goals.

Tip: Assign the right model to each agent based on task complexity. This approach helps you avoid unnecessary spending and supports cost management.

Serverless and Consumption-Based Options

Azure offers serverless and consumption-based hosting for ai agents. Serverless options let you pay only for what you use. You do not need to manage infrastructure. Consumption-based pricing charges you for each invocation, token, or transaction. You can control costs by right-sizing context windows and throughput settings. You should monitor token consumption to spot expensive agents and optimize orchestration runs.

  • Serverless hosting supports scalability and reduces operational overhead.
  • Consumption-based pricing gives you flexibility for agent development and testing.

Orchestration and Workflow

Orchestration patterns shape how agents interact and how much you pay. You can choose sequential, concurrent, or magnetic patterns. Sequential orchestration limits resource usage but may increase the number of invocations. Concurrent orchestration boosts throughput but can spike resource consumption. Magnetic orchestration uses iterative planning and may lead to variable costs.

Copilot Studio Integration

Copilot Studio helps you build and orchestrate ai agents in azure. You can integrate agents with microsoft 365 copilot and other applications. Copilot Studio supports best practices for agent development, including monitoring token usage and applying context compaction. You can use Copilot Studio to manage workflows, automate tasks, and improve integration across your data estate.

Logic Apps and Durable Functions

Azure Logic Apps and Durable Functions provide workflow automation for ai agents. Logic Apps connect agents to external systems and automate business processes. Durable Functions enable long-running workflows and stateful orchestration. You can use these tools to build scalable solutions and reduce manual intervention. They help you optimize agent development and support cost-efficient architecture.

Orchestration PatternCost Implications
SequentialLimits concurrent resource usage, accumulates cost across steps
ConcurrentIncreases throughput but may spike resource consumption
MagneticHighly variable costs due to iterative planning by the manager agent

Note: Monitor token consumption and apply context compaction to reduce token volume passed through orchestration. This practice helps you control expenses and improve cost management.

Data and State Management

Managing data and state is essential for low-cost ai agent solutions in azure. You need to secure your data estate and choose storage options that fit your budget.

Secure, Governed Data Estate

You must protect your data and follow governance standards. Azure ai foundry offers tools for securing data and managing access. You can use tagging strategies to track costs and maintain financial sustainability. A governed data estate supports compliance and reduces risk for agent development.

Low-Cost Storage Choices

Azure provides several storage options for ai agents. You can use prompt caching and semantic caching to cut repeated processing. Batching jobs with Azure OpenAI Batch API gives you discounts for delayed tasks. Routing traffic to cheaper models saves costs without losing quality.

StrategyDescriptionCost Impact
CachingUse prompt caching and semantic caching to reduce repeated processing.Cuts inference cost by 60-80%
BatchingUtilize Azure OpenAI Batch API for jobs that can wait, offering a 50% discount.Reduces costs for delayed jobs
RoutingImplement a routing mechanism to direct traffic to cheaper models when appropriate.Saves costs without quality loss

Tip: Use caching and batching to minimize inference costs. These strategies help you build agents that scale efficiently and stay within budget.

You can leverage azure landing zones and reference architectures for cost-optimized deployments. Compare single-agent and multi-agent systems to find the most cost-effective approach for your ai projects. By following these best practices, you ensure your ai agents deliver value without overspending.

Cost Optimization Strategies for AI Agents

Semantic Caching

Semantic caching is one of the most effective ways to reduce operational costs for low-cost ai agents in azure. When you use semantic caching, your ai agent stores previous prompts and responses. The agent then uses vector similarity search to find and reuse answers for similar queries. This method lowers the number of calls to large language models, which are often the most expensive part of ai applications.

You can follow these best practices to get the most out of semantic caching:

  • Implement smart caching strategies to improve performance and lower cost.
  • Store results of expensive ai queries and reuse them for similar questions.
  • Use retrieval caching for information fetched from databases to avoid repeated queries.
  • Apply standard web caching for static content.
  • Monitor cache hit rates and adjust your strategy as needed.
  • Invalidate caches when your data changes to prevent outdated answers.

By using semantic caching, you can cut down on api calls, which leads to lower costs and faster response times. This approach also helps your ai agents scale efficiently in azure ai foundry.

Tip: Always monitor your cache hit rates. High hit rates mean your caching strategy works well and saves you money.

Prompt Compression

Prompt compression helps you control the size and cost of each ai model call. When you compress prompts, you remove unnecessary words and focus only on the key information. This reduces the number of tokens sent to the model, which directly lowers your cost in azure.

You can use prompt compression in several ways:

  • Summarize user input before sending it to the ai model.
  • Remove repeated or irrelevant context from prompts.
  • Use templates to standardize and shorten prompts.
  • Apply automated tools in azure ai foundry to compress prompts during agent development.

Prompt compression not only saves money but also speeds up response times. You can combine this with semantic caching for even greater savings. Many organizations use prompt compression as a quick win when building low-cost ai agents in azure.

Note: Regularly review your prompt templates. Shorter, clearer prompts lead to better performance and lower costs.

Intelligent Model Routing

Intelligent model routing lets you choose the best ai model for each task. You can route simple queries to smaller, cheaper models and send complex tasks to advanced models like those in azure ai foundry. This strategy helps you balance quality, speed, and cost for your ai agent.

The table below shows how different routing modes affect cost and latency:

ModeCost SavingsAvg Latency (Router)Avg Latency (Standard)
Balanced~4.5%~7,800 ms~7,700 ms
Cost-Optimised~4.7%~7,800 ms~7,300 ms
Quality-Optimised~14.2%~6,800 ms~8,300 ms

You gain several benefits from intelligent model routing:

  • Achieve measurable cost savings across all routing modes.
  • Switch between modes without redeploying your ai agent.
  • Automatically use new models as they become available in azure ai foundry.
  • Improve scalability and maintain high-quality results.

You can integrate intelligent model routing with copilot and microsoft 365 copilot for seamless agent development and deployment. This approach supports both cost control and high performance in your ai projects.

Tip: Use intelligent model routing to future-proof your ai agent. As new models launch in azure, your agent will always use the best option for each task.

By combining semantic caching, prompt compression, and intelligent model routing, you can build low-cost ai agents that deliver value, scale efficiently, and stay within budget. These strategies form the foundation for sustainable ai development and integration in azure.

Quick Wins for Reducing Cost

You can achieve significant cost savings for your AI projects in Azure by focusing on a few high-impact actions. These quick wins help you control expenses while supporting scalability and efficient agent development.

  • Target repeatable, time-consuming tasks first. For example, use Microsoft Copilot to summarize meetings or draft routine communications. This approach reduces manual effort and lowers the number of AI model calls, which cuts costs quickly.
  • Lock in discounts for predictable workloads. Analyze your usage patterns in Azure AI Foundry to identify stable resources, such as virtual machines, SQL databases, or storage with consistent demand. Purchase Azure Reservations or Savings Plans for these workloads. Monitor your commitment utilization every week and adjust as your needs change. Rebalance your commitments each quarter by exchanging underused reservations or increasing your Savings Plan commitments as your usage grows.
  • Eliminate idle resources. Review your Azure environment for unattached disks, unused storage accounts, or idle virtual machines. Use Azure Advisor to get cost recommendations, such as rightsizing VMs, reserving capacity, or optimizing storage tiers. Schedule non-production resources to shut down outside business hours. Move infrequently accessed data to cool or archive tiers and delete old snapshots. Regularly check for over-provisioned resources, like oversized VMs or excessive backup retention, and remove them.

Tip: Automate these reviews using Azure AI Foundry tools. Automation ensures you do not miss hidden costs and supports ongoing cost optimization.

You should also apply best practices for agent development. Use prompt compression and semantic caching in Azure AI Foundry to reduce the number of model calls and lower data processing costs. Integrate these strategies early in your development process to maximize savings. Strong integration between your agents and Azure services improves efficiency and helps you scale without overspending.

By following these quick wins, you can keep your AI costs under control, improve your return on investment, and build a sustainable foundation for future development.

Multi-Agent Patterns and Scaling

Multi-Agent Patterns and Scaling

Single vs. Multi-Agent Design

When you design an ai agent in azure, you face a choice between single-agent and multi-agent systems. A single-agent system can seem simple at first. As you add more features, the prompts get longer and the logic grows more complex. This complexity can raise operational and compliance risks. You may also see higher costs because the system becomes less efficient. Multi-agent systems offer a different path. You can assign each agent a specific task, which allows for specialization and better scalability. This approach helps you manage cost, but you need to watch for unpredictable expenses. Each agent may call a different model, and the number of model invocations can grow quickly.

You should also consider how orchestration patterns affect your budget. Sequential patterns limit how many resources run at once, but costs can add up over several steps. Concurrent patterns let you process more tasks at the same time, which increases throughput. However, this can also lead to higher resource use if many agents run together. In azure, the pricing structure depends on model size, token count, and how often you use tools. If you do not manage these factors, costs can rise fast, especially with dynamic prompts or long reasoning loops.

Collaboration and Cost Impact

When agents work together, you need to balance performance and budget. Collaboration lets you optimize resource use and agent activity. You can also control how many tools connect to each agent. This helps you keep your ai project financially sustainable.

AspectDescription
Cost OptimizationResource usage, agent activity, and tool connections are evaluated for cost impact.
Performance vs BudgetOptimization balances performance and budget, helping maintain financial sustainability.

You gain more benefits when you separate concerns. Each agent works within clear boundaries, which improves reliability and reduces costs. An orchestrator agent can coordinate the workflow. This setup increases accuracy and efficiency.

FeatureBenefit
Separation of ConcernsEach agent operates within defined boundaries, improving system reliability and reducing costs.
Workflow CoordinationAn orchestrator agent coordinates the workflow, enhancing accuracy and efficiency.

Tip: Assign clear roles to each agent and use an orchestrator to manage complex workflows. This method supports both cost control and high performance.

Scaling AI Agents in Azure

Scaling ai agents in azure requires careful planning. You need to monitor how each agent uses resources and how often they call a model. Azure gives you tools to track token usage, model invocations, and data flow. You can use these insights to adjust your architecture as your ai project grows.

Start by scaling agents that handle the most important tasks. Use azure’s monitoring features to spot bottlenecks or spikes in usage. If you see one agent using too many resources, consider splitting its tasks or adding another agent. This approach helps you keep your ai system efficient and cost-effective.

You should also automate scaling where possible. Azure supports auto-scaling for many services. This feature lets your ai agents handle more requests during busy times and scale down when demand drops. You save money and keep your system responsive.

Note: Regularly review your scaling strategy. As your ai development evolves, your needs may change. Stay flexible and adjust your agents to match your goals.

Governance and Monitoring for AI Cost Control

Compliance and Security

You must address compliance and security from the start of your ai journey in azure. These requirements protect your organization and help you avoid unexpected expenses. You should review both regulatory and corporate standards before you deploy agents. The table below outlines the main requirements you need to consider:

Requirement TypeDescription
Regulatory ComplianceAll agents must comply with regulations and standards, including data protection laws and industry certifications.
Corporate ComplianceAgents must align with Responsible AI policies, ensuring fairness, reliability, safety, privacy, security, inclusiveness, transparency, and accountability.
Baseline Security RequirementsAI agents must meet baseline security requirements to mitigate risks such as data leakage and credential theft.
Cost Tracking and AllocationEstablish a unified view of agent usage and costs, applying cost center tags and setting up real-time alerts to manage spending effectively.

You should always secure your data and restrict access to sensitive information. Assign clear ownership for ai outcomes and use tagging to track spending. These steps help you stay compliant and keep your ai development on track.

Usage Monitoring and Alerts

You need strong monitoring tools to control spending and keep your ai projects efficient in azure. Start with Microsoft Cost Management + Billing to track your spending and set budgets. Azure Advisor gives you recommendations to save money by analyzing your resource usage. Azure Monitor provides real-time insights into how your agents use resources. You can also use third-party tools like Finout and Sedai for advanced monitoring features.

  • Use resource tagging to organize and track your ai workloads.
  • Set up cost threshold alerts with platforms like PagerDuty or Grafana.
  • Enable budget overrun notifications in Azure Cost Management.
  • Apply anomaly detection tools such as Evidently AI to catch unusual spending patterns.

You should review your monitoring setup often. Early alerts help you fix problems before they grow. This approach keeps your ai development sustainable and prevents budget surprises.

Responsible AI Guardrails

Responsible ai practices protect your organization and support cost-effective operations in azure. You must align all agents with internal governance policies. Isolate confidential data and restrict access so agents only use what they need. Standardize your knowledge and tool integrations to reduce duplication and simplify maintenance. Always make it clear when an ai agent is involved in a process.

Follow these steps to strengthen your guardrails:

  1. Use Azure API Management as a gateway for authentication and tracing.
  2. Implement Role-Based Access Control to manage permissions for cost data.
  3. Automate compliance reporting with Azure Policy to ensure you meet regulatory standards.

You should enforce fairness, inclusiveness, and accountability in every ai project. Assign clear roles for each agent and review outcomes regularly. These guardrails help you build trust and keep your ai development efficient.

Tip: Responsible ai practices not only protect your organization but also help you control costs and scale your solutions with confidence.

Design Tradeoffs and Pitfalls

Model Complexity vs. Cost

When you design ai agents, you must balance model complexity with your budget. Complex models can handle more advanced tasks, but they also use more resources. If you increase the context window size, your input processing costs will rise. Adding multimodal inputs, such as images or audio, means your system needs extra steps for tokenization and preprocessing. Advanced reasoning features require more compute power, which adds to your expenses.

Here is a table that shows how different factors affect your ai deployment:

FactorImpact on Cost
Context window sizeLarger windows increase input processing costs.
Multimodal inputsAdds preprocessing and tokenization overhead.
Reasoning capabilitiesIntroduces additional compute cost beyond output.

You should review your model’s features and only use what you need. This approach helps you avoid unnecessary spending and keeps your ai agents efficient.

Managed vs. Custom Deployments

You have two main options for deploying ai agents in Microsoft Azure: managed and custom. Managed deployments use Azure AI Foundry, which does not charge a licensing fee. You pay only for the Azure services you use. Custom deployments give you more control, but they require more setup and maintenance.

Consider these tradeoffs:

  1. Platform Cost: Azure AI Foundry has no licensing fee; you pay for the services you consume.
  2. Billing Models:
    • Standard (Pay-as-You-Go): You pay per token, which works well for changing workloads.
    • Provisioned Throughput Units (PTUs): You reserve compute capacity for a fixed rate, which is better for high-volume tasks.
  3. PTU Costs:
    • You must commit to at least 15 PTUs at about $1.00 per hour.
    • A monthly reservation for 15 PTUs costs around $260, and you can save about 15% per year.
  4. Cost Optimization Strategies:
    • Use model routing to select lighter models when possible.
    • Apply prompt caching to cut costs by up to 75%.
    • Use the Batch API for jobs that do not need instant results.
    • Move to PTU reservations after you know your baseline needs.

You should choose the deployment method that matches your workload and budget.

Real-Time vs. Batch Processing

You must also decide between real-time and batch processing for your ai agents. Real-time processing gives instant results, but it is much more expensive. You need dedicated compute resources that stay active, even when not in use. This setup leads to low GPU utilization, often around 14%, but you still pay for full capacity. Real-time applications may also need premium hardware, such as NVIDIA H100 GPUs, which cost more than other options.

Batch processing works differently. You process requests when resources are available, which boosts GPU utilization to 80-95%. This method lets you handle more tasks with the same hardware and saves money. Batch processing is best for jobs that do not need immediate results.

  • Real-time ai agents can cost 3-10 times more than batch processing.
  • Batch processing increases throughput and uses resources more efficiently.
  • Real-time setups require expensive hardware and always-on infrastructure.

You should match your processing approach to your business needs. If you do not need instant answers, batch processing can help you control spending and scale your ai solutions.

Common Pitfalls and How to Avoid Them

When you build low-cost ai agents in Microsoft Azure, you may run into several common pitfalls. Knowing these challenges helps you avoid wasted time and unexpected expenses. Here are the most frequent issues and how you can address them:

  • Latency stacking
    In multi-agent systems, each agent may wait for another to finish before starting its own task. This can make your ai agents slow, especially when they call each other many times. To fix this, use caching to store results from previous runs. You can also use lightweight reasoning agents to handle simple routing tasks. Limit how many times agents can delegate work to each other to keep response times fast.

  • Cost unpredictability
    Azure charges you based on model size, token count, and how often your ai agents use tools. If you do not track these details, your costs can rise quickly. Always log token usage for each session. This helps you see where your money goes and lets you forecast future expenses. Set up alerts for spending spikes so you can act before costs get out of control.

  • Debugging opacity
    Sometimes, ai agents make decisions in ways that are hard to trace. This makes debugging difficult when things go wrong. Enable structured reasoning logs for your agents. Use tools like Azure AI Foundry with OpenTelemetry to visualize how your agents run, which tools they call, and how they make decisions. Clear logs help you spot problems and fix them faster.

  • Version drift
    Over time, small changes to prompts or policies can change how your ai agents behave. This can lead to inconsistent results across different environments. Always version every instruction set, prompt, and model pairing. This practice keeps your ai agents reproducible and stable, even as you update your system.

Tip: Review your ai agent architecture regularly. Small changes can have a big impact on performance and cost. Stay proactive to keep your solutions efficient.

By understanding these pitfalls, you can design ai agents that are reliable, cost-effective, and easy to maintain in Azure.

Roadmap for Sustainable AI Architecture

90-Day Audit Plan

You can start your journey toward sustainable ai architecture in azure with a focused 90-day audit plan. This plan helps you understand the shift from assist to execute, which means moving from simple support tasks to autonomous agent operations. You should use the 5×5 diagnostic to assess your current state across five capability drivers. This diagnostic gives you a clear picture of your strengths and gaps. Next, define the right Center of Excellence model. You can choose a centralized or federated approach based on your organization’s needs.

Follow these steps to execute your audit:

  1. Select the best pattern for your ai agent deployment.
  2. Assign ownership for each initiative.
  3. Identify your scale-breakers—these are the tasks or processes that could cause costs to spike.
  4. Begin execution and track progress.

Tip: A well-structured audit plan helps you spot hidden costs and optimize your azure environment for ai.

Preparing for November 2026 Changes

You need to prepare for major changes in azure billing and architecture by November 2026. Microsoft 365 Copilot billing will stay per-user for tasks like typing in Word or summarizing emails. Agent work will shift to a consumption-based model measured in Copilot Credits. You can choose pay-as-you-go pricing at $0.01 per Copilot Credit or prepaid packs for lower costs if your usage is predictable.

To activate billing for agent work, you must enable it in the Microsoft Admin Center. Set billing policies, usage caps, and alert configurations to control spending. Microsoft IQ will become critical infrastructure for agents, improving reasoning and planning. Work IQ APIs will be available at no extra cost for in-Copilot scenarios if you have a Microsoft 365 Copilot add-on license. For other scenarios, you will need Copilot Credits. Web IQ will provide structured context, such as permissions and relationships, which helps your ai agents ground their responses and reduces hallucinations.

  • Review your billing policies and usage caps.
  • Monitor Copilot Credit consumption.
  • Use Web IQ to enhance agent grounding and reliability.

Long-Term Transformation Steps

You can build a sustainable ai architecture in azure by following key transformation steps. Establish an AI Center of Excellence to ensure your ai agents use ai-ready data and undergo comprehensive reviews. Embed ai into every part of your operations and culture. Use structured mechanisms like a Kaizen funnel to crowdsource and prioritize ideas for ai initiatives.

Strengthen governance to address challenges such as responsible scaling and mitigation of ai hallucinations. Implement continuous improvement practices, like ‘Fix, Hack, Learn’ weeks, to encourage innovation and boost effectiveness.

Transformation StepDescription
Center of ExcellenceEnsures agents use ai-ready data and strong governance.
Kaizen FunnelCrowdsources and prioritizes ai ideas.
Continuous ImprovementDrives innovation and organizational effectiveness.

Note: Sustainable ai architecture requires ongoing review and adaptation. You should always look for ways to improve your azure environment and agent performance.


You can build low-cost AI agents in Azure by focusing on high-impact use cases, optimizing resource allocation, and automating routine tasks. Regular audits and quick wins help you control expenses. Long-term planning ensures your architecture stays efficient. Keep monitoring costs with Azure tools and adapt your strategy as needs change. For deeper learning, explore these resources:

Stay proactive to maximize value and keep your AI investments sustainable.

FAQ

What is the fastest way to reduce AI agent costs in Azure?

You can start by using semantic caching and prompt compression. These methods lower the number of model calls. You should also monitor token usage and automate shutdowns for idle resources.

How do Copilot Studio and Azure AI Foundry help with cost control?

Copilot Studio lets you build and manage AI agents with efficient workflows. Azure AI Foundry offers tools for monitoring, caching, and routing. You gain better visibility and can optimize resource allocation.

What are Copilot Credits, and how do they affect billing?

Copilot Credits measure agent work in Microsoft 365 Copilot. You pay per credit or buy prepaid packs for discounts. You should track credit consumption to avoid budget surprises.

Can I use batch processing for all AI agent tasks?

Batch processing works best for jobs that do not need instant results. You should use real-time processing only when immediate answers are required. Batch jobs save money and improve resource utilization.

How do I monitor AI agent spending in Azure?

You can use Microsoft Cost Management + Billing to track expenses. Set up alerts for spending spikes. Tag resources for easy tracking. Review usage reports weekly to stay on budget.

What steps should I take to ensure compliance and security?

You must follow regulatory standards and use role-based access control. Secure your data estate. Automate compliance checks with Azure Policy. Assign clear ownership for agent outcomes.

🚀 Want to be part of m365.fm?

Then stop just listening… and start showing up.

👉 Connect with me on LinkedIn and let’s make something happen:

  • 🎙️ Be a podcast guest and share your story
  • 🎧 Host your own episode (yes, seriously)
  • 💡 Pitch topics the community actually wants to hear
  • 🌍 Build your personal brand in the Microsoft 365 space

This isn’t just a podcast — it’s a platform for people who take action.

🔥 Most people wait. The best ones don’t.

👉 Connect with me on LinkedIn and send me a message:
"I want in"

Let’s build something awesome 👊

1
00:00:00,000 --> 00:00:04,280
There is a reason your Azure Open AI bill keeps climbing even though usage looks flat.

2
00:00:04,280 --> 00:00:07,040
Three hidden taxes are embedded in every agent call.

3
00:00:07,040 --> 00:00:09,760
The context tags inflates prompts with irrelevant documents.

4
00:00:09,760 --> 00:00:12,600
The reasoning tags send simple questions to the most expensive model.

5
00:00:12,600 --> 00:00:15,320
And the autonomous tags burn credits while nobody is watching.

6
00:00:15,320 --> 00:00:16,760
Here's how to find them.

7
00:00:16,760 --> 00:00:18,560
The promise vices the invoice.

8
00:00:18,560 --> 00:00:22,360
Microsoft sold you AI as part of your Microsoft 365 subscription.

9
00:00:22,360 --> 00:00:24,360
Copilot sits in Teams Outlook and Word.

10
00:00:24,360 --> 00:00:27,280
It feels included, but that inclusion has a hard boundary.

11
00:00:27,280 --> 00:00:30,280
The moment you build a custom agent in Copilot Studio,

12
00:00:30,280 --> 00:00:33,760
attach a knowledge source from SharePoint or call an Azure Open AI model

13
00:00:33,760 --> 00:00:35,280
through a power automate flow,

14
00:00:35,280 --> 00:00:37,920
you have left the license and entered consumption billing.

15
00:00:37,920 --> 00:00:40,200
And consumption billing does not care about your seat count.

16
00:00:40,200 --> 00:00:43,480
Most finance teams discover this when the first Azure invoice arrives.

17
00:00:43,480 --> 00:00:47,480
They see a line item for Azure Open AI and assume it is a trial or a mistake.

18
00:00:47,480 --> 00:00:48,240
It is not.

19
00:00:48,240 --> 00:00:51,040
The M365 Copilot license covers classic answers and

20
00:00:51,040 --> 00:00:53,000
generative responses for license uses,

21
00:00:53,000 --> 00:00:56,680
but custom agents, external facing bots and autonomous actions run on separate

22
00:00:56,680 --> 00:00:57,680
meters.

23
00:00:57,680 --> 00:00:59,400
Those meters build in tokens, not seats.

24
00:00:59,400 --> 00:01:02,840
And tokens are priced per million with input and output charged separately.

25
00:01:02,840 --> 00:01:04,480
Here is where the math gets painful.

26
00:01:04,480 --> 00:01:08,200
As your Open AI token price is matched the direct Open AI API,

27
00:01:08,200 --> 00:01:12,720
but your total as your cost runs 15 to 40% higher once you add support plans,

28
00:01:12,720 --> 00:01:17,920
networking, data egress, storage for logs and vectors, and the observability stack.

29
00:01:17,920 --> 00:01:22,280
A prompt that costs one dollar in raw tokens can easily cost one dollar 40 in reality.

30
00:01:22,280 --> 00:01:24,120
That overhead is not a rounding error.

31
00:01:24,120 --> 00:01:27,560
At scale, it is the difference between a pilot and a budget crisis.

32
00:01:27,560 --> 00:01:29,240
The billing layers stack like this.

33
00:01:29,240 --> 00:01:32,840
At the top, you have Microsoft 365 Copilot licenses.

34
00:01:32,840 --> 00:01:37,400
For internal users, these absorb many Copilot studio interactions without extra credits.

35
00:01:37,400 --> 00:01:40,440
Below that, Copilot Studio uses Copilot credits for agent actions,

36
00:01:40,440 --> 00:01:42,480
generative answers, and graph grounding.

37
00:01:42,480 --> 00:01:47,520
Each credit is roughly one cent on pay as you go or eight tenths of a cent if you buy capacity packs.

38
00:01:47,520 --> 00:01:51,320
Below that again, any custom Azure Open AI deployment builds per million tokens with

39
00:01:51,320 --> 00:01:53,240
separate rates for input and output.

40
00:01:53,240 --> 00:01:58,400
And underneath everything, Azure infrastructure charges for compute, bandwidth, logging, and security.

41
00:01:58,400 --> 00:02:01,960
Most organizations never mapped these layers before they started building.

42
00:02:01,960 --> 00:02:06,160
They assumed the license covered the stack in reality, the license covers the top floor,

43
00:02:06,160 --> 00:02:08,480
and every floor below it has its own meter.

44
00:02:08,480 --> 00:02:13,840
If you are running a customer facing agent, that queries SharePoint calls GPT-5 for summarization.

45
00:02:13,840 --> 00:02:18,480
And logs every interaction to application insights, you are paying across all four layers.

46
00:02:18,480 --> 00:02:21,720
The agent is not expensive because your users are greedy.

47
00:02:21,720 --> 00:02:24,960
It is expensive because the architecture was never designed for cost visibility.

48
00:02:24,960 --> 00:02:26,560
This is the first structural floor.

49
00:02:26,560 --> 00:02:28,480
You cannot control what you cannot see.

50
00:02:28,480 --> 00:02:32,640
And most Microsoft environments were built for productivity, not for token accounting.

51
00:02:32,640 --> 00:02:35,400
The tools were designed to make AI easy to deploy.

52
00:02:35,400 --> 00:02:40,000
Nobody told finance that easy deployment comes with invisible consumption until the invoice shows up.

53
00:02:40,000 --> 00:02:42,400
That is why the starting point is not a cost-cutting exercise.

54
00:02:42,400 --> 00:02:43,400
It is an audit.

55
00:02:43,400 --> 00:02:47,800
Before you optimize a single prompt, you need to know which layer is bleeding and how much.

56
00:02:47,800 --> 00:02:52,360
Because the fixes we will build later only work if you apply them at the right layer.

57
00:02:52,360 --> 00:02:56,080
A cache saves Azure Open AI tokens, but it does not save co-pilot credits.

58
00:02:56,080 --> 00:02:59,360
A cheaper model saves token cost, but it does not fix retrieval bloat.

59
00:02:59,360 --> 00:03:02,200
You need the map before you need the medicine.

60
00:03:02,200 --> 00:03:03,480
The context tax.

61
00:03:03,480 --> 00:03:07,240
The context tax is the biggest hidden cost in most Microsoft AI deployments.

62
00:03:07,240 --> 00:03:10,600
It hits when your retrieval pipeline pulls too many documents and stuffs them into the

63
00:03:10,600 --> 00:03:12,440
prompt without filtering.

64
00:03:12,440 --> 00:03:16,440
In a typical co-pilot studio rag setup, the agent queries a SharePoint library or an Azure

65
00:03:16,440 --> 00:03:21,320
AI search index gets back a set of chunks and depends them to the user question before sending

66
00:03:21,320 --> 00:03:23,080
everything to the LLM.

67
00:03:23,080 --> 00:03:25,680
Most teams never question how many chunks or how large they are.

68
00:03:25,680 --> 00:03:27,000
Let me show you the math.

69
00:03:27,000 --> 00:03:31,080
Suppose your retrieval step returns 50 chunks, each averaging 500 tokens.

70
00:03:31,080 --> 00:03:35,480
That is 25,000 input tokens before the user question is even added.

71
00:03:35,480 --> 00:03:41,160
If the user asks a 10-word question that adds maybe 20 tokens, so you are sending a 25,020

72
00:03:41,160 --> 00:03:44,360
token prompt to answer a 10-word question.

73
00:03:44,360 --> 00:03:49,560
The GPT-5 global input rates, that prompt costs you over 30 dollars per call in context alone,

74
00:03:49,560 --> 00:03:51,760
but the real damage is not the per call cost.

75
00:03:51,760 --> 00:03:52,760
It is the compounding.

76
00:03:52,760 --> 00:03:57,000
If that agent handles a thousand queries per day, you are burning 30,000 dollars per day

77
00:03:57,000 --> 00:03:58,440
on retrieval context.

78
00:03:58,440 --> 00:04:01,400
Over a month that is 600,000 dollars in input tokens.

79
00:04:01,400 --> 00:04:03,680
And most of those chunks were never relevant to the question.

80
00:04:03,680 --> 00:04:07,480
They were retrieved by a loose semantic search and included because the pipeline was built

81
00:04:07,480 --> 00:04:09,200
for recall, not precision.

82
00:04:09,200 --> 00:04:10,560
The problem is not rag itself.

83
00:04:10,560 --> 00:04:11,560
Rag is the right pattern.

84
00:04:11,560 --> 00:04:13,040
The problem is lazy rag.

85
00:04:13,040 --> 00:04:17,880
Lazy rag retrieves broadly, does not re-rank aggressively and concatenates everything into

86
00:04:17,880 --> 00:04:19,520
the prompt because it feels safer.

87
00:04:19,520 --> 00:04:22,760
The assumption is that more context reduces hallucinations.

88
00:04:22,760 --> 00:04:27,840
In reality, more context increases cost, slows response time and often confuses the model

89
00:04:27,840 --> 00:04:30,800
with contradictory information from unrelated documents.

90
00:04:30,800 --> 00:04:35,120
Microsoft co-pilot studio and Azure AI search give you the tools to fix this, but most

91
00:04:35,120 --> 00:04:36,800
implementations skip the tuning.

92
00:04:36,800 --> 00:04:38,480
They use default chunk sizes.

93
00:04:38,480 --> 00:04:41,040
They do not enable hybrid search with semantic ranking.

94
00:04:41,040 --> 00:04:42,400
They do not set top-k limits.

95
00:04:42,400 --> 00:04:43,760
They do not filter by metadata.

96
00:04:43,760 --> 00:04:48,440
So the retrieval layer returns noise and the LLM pays for every token of that noise.

97
00:04:48,440 --> 00:04:50,600
The fixes engineered retrieval, not more retrieval.

98
00:04:50,600 --> 00:04:52,240
You need to retrieve fewer better chunks.

99
00:04:52,240 --> 00:04:55,760
That means smaller chunk sizes aligned with document structure, hybrid search combining

100
00:04:55,760 --> 00:04:59,800
vector similarity with keyword filtering and metadata filters that restrict the search

101
00:04:59,800 --> 00:05:02,920
scope before the LLM ever sees a token.

102
00:05:02,920 --> 00:05:05,480
Top-k should be 3 to 7 chunks, not 50.

103
00:05:05,480 --> 00:05:08,760
And every chunk should be scored for relevance before it enters the prompt.

104
00:05:08,760 --> 00:05:10,160
There is also a deeper layer.

105
00:05:10,160 --> 00:05:13,640
Even with good retrieval, the prompt itself often contains redundant system instructions,

106
00:05:13,640 --> 00:05:16,680
repeated formatting rules and bloated conversation history.

107
00:05:16,680 --> 00:05:18,840
Every token in the system prompt is sent on every call.

108
00:05:18,840 --> 00:05:23,480
If your system prompt is 500 tokens and you send 10,000 calls per day, that is 5 million

109
00:05:23,480 --> 00:05:24,480
tokens of overhead.

110
00:05:24,480 --> 00:05:28,520
At GPT-5 rates, that is over $6 per day in instructions alone.

111
00:05:28,520 --> 00:05:31,680
Over a year, that is $2,000 for text the model already knows.

112
00:05:31,680 --> 00:05:32,680
This is the context text.

113
00:05:32,680 --> 00:05:33,960
It is not a model problem.

114
00:05:33,960 --> 00:05:35,280
It is a pipeline problem.

115
00:05:35,280 --> 00:05:39,760
And it is the first place you should look when your bill is higher than your usage suggests.

116
00:05:39,760 --> 00:05:40,960
The reasoning tax.

117
00:05:40,960 --> 00:05:44,880
The reasoning tax is what you pay when you send every question to the most capable, most

118
00:05:44,880 --> 00:05:46,040
expensive model.

119
00:05:46,040 --> 00:05:51,320
Most teams default to GPT-5 global or GPT-4O because they want the best possible answer.

120
00:05:51,320 --> 00:05:53,480
That is understandable, but it is not architecture.

121
00:05:53,480 --> 00:05:54,720
It is convenience.

122
00:05:54,720 --> 00:05:57,920
And convenience at scale is the fastest way to double a token bill.

123
00:05:57,920 --> 00:06:00,680
Look at the 2026 Azure Open AI pricing.

124
00:06:00,680 --> 00:06:06,760
GPT-5 global charges $1.25 per million input tokens and $10.00 per million output tokens.

125
00:06:06,760 --> 00:06:11,040
GPT-5 mini charges, $25.00 per million input and $2.00 per million output.

126
00:06:11,040 --> 00:06:15,040
GPT-5 nano charges, $5.00 per million input and $0.40 per million output.

127
00:06:15,040 --> 00:06:19,120
The gap between nano and global is 25 times on input and 25 times on output.

128
00:06:19,120 --> 00:06:24,160
If you are using global to classify an email or answer a FAQ, you are paying 25 times

129
00:06:24,160 --> 00:06:25,560
more than necessary.

130
00:06:25,560 --> 00:06:27,360
The mistake is not using a powerful model.

131
00:06:27,360 --> 00:06:29,320
The mistake is using one model for everything.

132
00:06:29,320 --> 00:06:33,520
In a well-architected system, the task should choose the model, not the other way around.

133
00:06:33,520 --> 00:06:38,920
Simple classification, sentiment analysis, entity extraction and FAQ matching do not need

134
00:06:38,920 --> 00:06:39,920
frontier reasoning.

135
00:06:39,920 --> 00:06:45,080
They need fast, cheap inference, complex multi-step reasoning, legal analysis, creative drafting

136
00:06:45,080 --> 00:06:47,800
and cross-document synthesis need the flagship.

137
00:06:47,800 --> 00:06:51,040
Most agent workloads are 80% simple and 20% complex.

138
00:06:51,040 --> 00:06:55,560
If you root everything to global, you are paying the flagship rate for the entire workload.

139
00:06:55,560 --> 00:06:58,080
This is where model routing changes the economics.

140
00:06:58,080 --> 00:07:02,080
Azure AI Foundry now offers a model router that is itself a trained language model.

141
00:07:02,080 --> 00:07:05,400
It sits between your application and the underlying LLM pool.

142
00:07:05,400 --> 00:07:06,840
Your code calls a single endpoint.

143
00:07:06,840 --> 00:07:11,080
The router reads the prompt, estimates the complexity and sends the request to the cheapest

144
00:07:11,080 --> 00:07:12,680
model that can handle it.

145
00:07:12,680 --> 00:07:14,520
Simple questions go to nano or mini.

146
00:07:14,520 --> 00:07:16,640
Hard questions go to global or GPT-5 Pro.

147
00:07:16,640 --> 00:07:18,520
You do not hard-code model selection.

148
00:07:18,520 --> 00:07:19,920
The architecture does it for you.

149
00:07:19,920 --> 00:07:24,160
The documented savings from intelligent routing sit around 60% for mixed workloads.

150
00:07:24,160 --> 00:07:25,640
That is not a minor optimization.

151
00:07:25,640 --> 00:07:29,520
It is the difference between a pilot that gets funded and a pilot that gets cancelled.

152
00:07:29,520 --> 00:07:32,960
And the router is deployed like any other model in Azure AI Foundry.

153
00:07:32,960 --> 00:07:36,720
You browse the model catalog, deploy the router and call it through the same Azure Open

154
00:07:36,720 --> 00:07:38,800
AI SDK you already use.

155
00:07:38,800 --> 00:07:42,240
The integration cost is low because the API surface does not change.

156
00:07:42,240 --> 00:07:43,600
Of course, there are constraints.

157
00:07:43,600 --> 00:07:47,320
You cannot force the router to use a specific model via API parameters.

158
00:07:47,320 --> 00:07:51,400
You cannot add your own fine-tuned models or external APIs to the pool.

159
00:07:51,400 --> 00:07:54,360
And the routing decision is opaque on a per request basis.

160
00:07:54,360 --> 00:07:58,320
If your compliance team requires deterministic model choice for ordered reasons, you should

161
00:07:58,320 --> 00:08:02,640
deploy specific models directly rather than routing through the abstraction.

162
00:08:02,640 --> 00:08:06,560
But for most internal and customer-facing agents, the trade-off favors the savings.

163
00:08:06,560 --> 00:08:09,200
The other half of the reasoning text is output length.

164
00:08:09,200 --> 00:08:13,720
Output tokens cost roughly 8 times input tokens for GPT-5 global.

165
00:08:13,720 --> 00:08:16,920
That means a verbose response is far more expensive than a long prompt.

166
00:08:16,920 --> 00:08:20,960
If your agent generates a 500 token answer when 100 tokens would suffice, you have just

167
00:08:20,960 --> 00:08:23,360
quadrupled the most expensive part of the bill.

168
00:08:23,360 --> 00:08:27,920
Connecting max tokens limits, constraining response formats, and training the model to be concise

169
00:08:27,920 --> 00:08:29,520
are not quality compromises.

170
00:08:29,520 --> 00:08:34,760
They are cost controls and at scale, they matter more than almost any other single setting.

171
00:08:34,760 --> 00:08:37,360
The autonomous tax and the 2026 cliff.

172
00:08:37,360 --> 00:08:40,880
The autonomous tax is the cost of agents that run without human throttle.

173
00:08:40,880 --> 00:08:45,120
In co-pilot studio, an autonomous agent can trigger actions, query knowledge sources, and

174
00:08:45,120 --> 00:08:49,680
initiate power automate flows based on events rather than direct user requests.

175
00:08:49,680 --> 00:08:51,160
That sounds powerful because it is.

176
00:08:51,160 --> 00:08:55,080
But every trigger consumes co-pilot credits, and credits consumed while your team is asleep

177
00:08:55,080 --> 00:08:56,840
still show up on the invoice.

178
00:08:56,840 --> 00:08:58,360
The variability is the problem.

179
00:08:58,360 --> 00:09:00,400
A quiet month might use a few thousand credits.

180
00:09:00,400 --> 00:09:05,280
A busy month, where an agent triggers on every new SharePoint file or every incoming email,

181
00:09:05,280 --> 00:09:06,640
might use a hundred times more.

182
00:09:06,640 --> 00:09:09,040
That burst pattern makes budgeting nearly impossible.

183
00:09:09,040 --> 00:09:11,200
Finance teams plan for steady spend.

184
00:09:11,200 --> 00:09:15,560
Autonomous agents deliver spikes, and because co-pilot studio builds credits, per action,

185
00:09:15,560 --> 00:09:19,440
not per user session, the spike can happen without any increase in human users.

186
00:09:19,440 --> 00:09:21,720
This gets worse on November 1, 2026.

187
00:09:21,720 --> 00:09:25,480
That is the date Microsoft removes seeded AI builder credits from all tenants.

188
00:09:25,480 --> 00:09:30,900
If your organization is currently using PowerApps premium or Dynamics 365 licenses, those

189
00:09:30,900 --> 00:09:34,320
licenses include a monthly pool of seeded AI builder credits.

190
00:09:34,320 --> 00:09:39,080
Those credits have been covering many co-pilot studio and AI builder experiments at low or

191
00:09:39,080 --> 00:09:40,680
zero marginal cost.

192
00:09:40,680 --> 00:09:42,160
After November 1, they vanish.

193
00:09:42,160 --> 00:09:43,840
There is no automatic conversion.

194
00:09:43,840 --> 00:09:45,720
Your agents do not switch to a free tier.

195
00:09:45,720 --> 00:09:47,200
They start billing immediately.

196
00:09:47,200 --> 00:09:50,680
If you have not bought co-pilot studio capacity packs or set up payers you go billing in

197
00:09:50,680 --> 00:09:52,920
Azure, your agents will simply stop working.

198
00:09:52,920 --> 00:09:57,200
Or worse, if you have set up PAYG but not capped it, they will keep running and billing

199
00:09:57,200 --> 00:09:58,200
without limit.

200
00:09:58,200 --> 00:10:03,600
Capacity packs cost $200 per month and provide 25,000 credits that works out to 8/10 of a

201
00:10:03,600 --> 00:10:04,840
cent per credit.

202
00:10:04,840 --> 00:10:07,120
Payers you go charges of full cent per credit.

203
00:10:07,120 --> 00:10:11,920
An agent that uses 10 credits per interaction will cost 8 to 10 cents per conversation.

204
00:10:11,920 --> 00:10:15,960
At 10,000 conversations per month, that is $800 to $1000.

205
00:10:15,960 --> 00:10:20,040
At 100,000 conversations it is 8,000 to $10,000 per month.

206
00:10:20,040 --> 00:10:24,720
And that is before you add Azure Open AI token costs for any custom model calls.

207
00:10:24,720 --> 00:10:29,960
The same agent design can cost $8 per month or $800 per month depending on how many actions

208
00:10:29,960 --> 00:10:34,960
it triggers, how many knowledge sources it queries and whether it uses graph grounding.

209
00:10:34,960 --> 00:10:37,680
Graph grounding alone consumes roughly 10 credits per query.

210
00:10:37,680 --> 00:10:42,840
If your agent grounds every answer in Microsoft 365 data, you have just added 10 cents to

211
00:10:42,840 --> 00:10:46,320
every response before the LLM token meter even starts spinning.

212
00:10:46,320 --> 00:10:51,120
This is why the 2026 cliff is not just a licensing change, it is an architectural reckoning.

213
00:10:51,120 --> 00:10:54,920
Every agent built during the seeded credit period was built without cost pressure.

214
00:10:54,920 --> 00:10:57,200
Citizen developers could experiment freely.

215
00:10:57,200 --> 00:11:01,720
That freedom produced valuable prototypes, but it also produced inefficient architectures.

216
00:11:01,720 --> 00:11:04,560
After November, those inefficiencies become real money.

217
00:11:04,560 --> 00:11:08,480
And the organizations that survive the transition are the ones that ordered their agents before

218
00:11:08,480 --> 00:11:11,280
the deadline, not after the invoice arrives.

219
00:11:11,280 --> 00:11:13,120
The Azure Cost Management deep dive.

220
00:11:13,120 --> 00:11:14,840
You cannot fix what you cannot measure.

221
00:11:14,840 --> 00:11:17,760
And in Microsoft AI environments, most organizations cannot measure.

222
00:11:17,760 --> 00:11:22,440
They see a single Azure Open AI line item on a consolidated invoice and treat it as

223
00:11:22,440 --> 00:11:23,480
a black box.

224
00:11:23,480 --> 00:11:27,520
The first step in breaking that box open is Azure Cost Management and the most underused

225
00:11:27,520 --> 00:11:31,840
feature inside it is natural language querying through Azure Co-Pilot.

226
00:11:31,840 --> 00:11:35,720
Most finance teams export cost data into spreadsheets and build pivot tables.

227
00:11:35,720 --> 00:11:39,720
That works for static reporting, but it is too slow for operational decision making.

228
00:11:39,720 --> 00:11:43,040
Azure Co-Pilot lets you ask questions directly against your cost data.

229
00:11:43,040 --> 00:11:48,800
You can say, summarize my Azure Open AI cost for the last 90 days broken down by model deployment.

230
00:11:48,800 --> 00:11:53,160
You can ask, what are the top three services driving my AI spend in East US?

231
00:11:53,160 --> 00:11:54,920
You can even run scenarios.

232
00:11:54,920 --> 00:11:59,960
How much would my cost change if GPT-5 usage increased 20% next quarter?

233
00:11:59,960 --> 00:12:02,960
The power of these prompts depends on your tagging discipline.

234
00:12:02,960 --> 00:12:08,040
If your resource groups are named RG Prod001 and your deployments carry no metadata,

235
00:12:08,040 --> 00:12:10,160
your Co-Pilot cannot segment your spend.

236
00:12:10,160 --> 00:12:13,920
It will return aggregate numbers that tell you nothing about which agent, which team, or

237
00:12:13,920 --> 00:12:15,600
which workload is responsible.

238
00:12:15,600 --> 00:12:18,680
Before you ask a single cost question, you need clean tags.

239
00:12:18,680 --> 00:12:23,360
At minimum, every Azure Open AI resource should carry tags for environment, department,

240
00:12:23,360 --> 00:12:25,280
agent name, and cost class.

241
00:12:25,280 --> 00:12:30,200
Every Co-Pilot studio environment should map to a billing scope in the power platform admin center.

242
00:12:30,200 --> 00:12:32,840
Once your tags are clean, build a query rhythm.

243
00:12:32,840 --> 00:12:36,080
Weekly, ask for a trend summary by tagged agent.

244
00:12:36,080 --> 00:12:38,480
Monthly, ask for a breakdown by model SKU in region.

245
00:12:38,480 --> 00:12:41,360
Quartally, runner-what-if analysis against projected usage growth.

246
00:12:41,360 --> 00:12:42,440
The goal is not a report.

247
00:12:42,440 --> 00:12:47,960
The goal is a feedback loop where every spike triggers a question within 48 hours, not 48 days.

248
00:12:47,960 --> 00:12:50,880
There is also a second layer, most teams miss.

249
00:12:50,880 --> 00:12:53,080
As your monitor diagnostic logs.

250
00:12:53,080 --> 00:12:57,120
Every Azure Open AI deployment emits logs for every request, including input token count,

251
00:12:57,120 --> 00:12:59,160
output token count, model name, and latency.

252
00:12:59,160 --> 00:13:03,640
You can stream these logs to a log analytics workspace and query them with KQL.

253
00:13:03,640 --> 00:13:08,280
A simple KQL query can show you which agent consumes the most tokens per call, which hour of

254
00:13:08,280 --> 00:13:12,400
the day drives peak usage and which user segments generate the longest prompts.

255
00:13:12,400 --> 00:13:16,720
This is not billing data, it is operational telemetry, and it is far more actionable than

256
00:13:16,720 --> 00:13:18,040
an invoice line item.

257
00:13:18,040 --> 00:13:20,360
The audit phase should produce three artifacts.

258
00:13:20,360 --> 00:13:22,160
First, a cost baseline.

259
00:13:22,160 --> 00:13:25,440
Total AI spend by layer by week for the last 90 days.

260
00:13:25,440 --> 00:13:31,040
Second, a token flow map, average input tokens, average output tokens, and peak concurrency

261
00:13:31,040 --> 00:13:32,200
for each agent.

262
00:13:32,200 --> 00:13:33,880
Third, an anomaly log.

263
00:13:33,880 --> 00:13:38,200
Any week where spend jumped more than 20% above trend with a hypothesis for why.

264
00:13:38,200 --> 00:13:41,520
If you cannot produce these three artifacts, you are not auditing.

265
00:13:41,520 --> 00:13:42,600
You are guessing.

266
00:13:42,600 --> 00:13:46,560
And guessing is how the context tax and the reasoning tax hide in plain sight.

267
00:13:46,560 --> 00:13:48,800
The co-pilot studio consumption estimator.

268
00:13:48,800 --> 00:13:53,280
As your cost management covers the infrastructure layer, but co-pilot studio has its own meter,

269
00:13:53,280 --> 00:13:56,960
and that meter is not measured in tokens, it is measured in co-pilot credits.

270
00:13:56,960 --> 00:14:00,760
Understanding the credit model is essential, because the same agent can cost $8 per month

271
00:14:00,760 --> 00:14:04,160
or $800 per month, depending entirely on design choices.

272
00:14:04,160 --> 00:14:07,920
Microsoft provides an agent consumption estimator tool specifically for this purpose.

273
00:14:07,920 --> 00:14:12,040
You input your expected session volume, the number of generative answers per session, the

274
00:14:12,040 --> 00:14:17,040
frequency of graph grounding calls, and any power automate actions triggered by the agent.

275
00:14:17,040 --> 00:14:20,840
The estimator then projects your monthly credit consumption and compares prepaid capacity

276
00:14:20,840 --> 00:14:22,800
packs against payers you go billing.

277
00:14:22,800 --> 00:14:24,600
This comparison is not trivial.

278
00:14:24,600 --> 00:14:28,720
Capacity packs pool across the tenant and expire monthly, so over provisioning waists

279
00:14:28,720 --> 00:14:29,720
money.

280
00:14:29,720 --> 00:14:33,600
The UIG scales without limit, but costs 25% more per credit.

281
00:14:33,600 --> 00:14:37,820
The wrong choice can add hundreds of dollars per month in either unused packs or premium

282
00:14:37,820 --> 00:14:39,640
PAYG rates.

283
00:14:39,640 --> 00:14:42,400
Credit consumption varies dramatically by agent behavior.

284
00:14:42,400 --> 00:14:46,480
A classic scripted FAQ answer consumes roughly one credit per response.

285
00:14:46,480 --> 00:14:50,040
A generative answer from a knowledge source consumes roughly two credits.

286
00:14:50,040 --> 00:14:54,800
Graph grounding, which queries Microsoft 365 data, through the Microsoft graph, consumes

287
00:14:54,800 --> 00:14:56,920
roughly ten credits per query.

288
00:14:56,920 --> 00:15:00,880
Contrary reasoning chains and autonomous actions can multiply the base cost by a factor

289
00:15:00,880 --> 00:15:02,120
of ten or more.

290
00:15:02,120 --> 00:15:05,640
If your agent triggers a power automate flow with a hundred actions, that adds roughly

291
00:15:05,640 --> 00:15:06,640
thirteen credits.

292
00:15:06,640 --> 00:15:11,080
A single user conversation that touches all of these layers can easily consume 50 to 100

293
00:15:11,080 --> 00:15:12,080
credits.

294
00:15:12,080 --> 00:15:16,040
At one cent per credit on PAYG, a hundred credit conversation costs one dollar.

295
00:15:16,040 --> 00:15:21,040
At eight tenths of a cent per credit via capacity packs, it costs 80 cents.

296
00:15:21,040 --> 00:15:23,480
Multiply that by ten thousand conversations per month.

297
00:15:23,480 --> 00:15:27,200
And the difference between good design and bad design is two thousand dollars per month.

298
00:15:27,200 --> 00:15:31,120
Multiply by a hundred thousand conversations and it is twenty thousand dollars per month.

299
00:15:31,120 --> 00:15:32,840
The unit economics are not abstract.

300
00:15:32,840 --> 00:15:36,440
They are the difference between a sustainable agent and a budget review.

301
00:15:36,440 --> 00:15:40,080
The estimator also helps you size capacity packs correctly.

302
00:15:40,080 --> 00:15:44,600
If your projected monthly need is ninety thousand credits, you need four capacity packs providing

303
00:15:44,600 --> 00:15:46,120
a hundred thousand credits.

304
00:15:46,120 --> 00:15:48,720
That leaves a ten thousand credit buffer for spikes.

305
00:15:48,720 --> 00:15:52,760
If you buy only three packs, you cover seventy five thousand credits and the remaining fifteen

306
00:15:52,760 --> 00:15:55,360
thousand bill at the higher PAYG rate.

307
00:15:55,360 --> 00:15:57,680
The optimization is not about eliminating buffer.

308
00:15:57,680 --> 00:16:01,640
It is about sizing buffer accurately, so you do not pay premium rates for predictable

309
00:16:01,640 --> 00:16:04,520
demand or waste money on unused capacity.

310
00:16:04,520 --> 00:16:06,080
Mapping your token flows.

311
00:16:06,080 --> 00:16:09,160
Cost control without architecture visibility is guesswork.

312
00:16:09,160 --> 00:16:14,040
A token flow map traces every request from the user through the agent, through retrieval,

313
00:16:14,040 --> 00:16:16,600
through the LLM, and back to the response.

314
00:16:16,600 --> 00:16:20,000
It exposes exactly where tokens are born and where they die.

315
00:16:20,000 --> 00:16:23,160
And it reveals which layer of your pipeline is the most expensive.

316
00:16:23,160 --> 00:16:24,760
Start with the user interaction layer.

317
00:16:24,760 --> 00:16:28,900
Check how many turns a typical session lasts and how much chat history is retained and

318
00:16:28,900 --> 00:16:30,840
resent with every new question.

319
00:16:30,840 --> 00:16:35,200
A ten turn conversation where each prior turn is included in the context window can multiply

320
00:16:35,200 --> 00:16:37,520
token consumption by a factor of five or more.

321
00:16:37,520 --> 00:16:41,720
If you are not compressing or summarizing conversation history, you are paying for the same text

322
00:16:41,720 --> 00:16:42,720
on every turn.

323
00:16:42,720 --> 00:16:44,440
Next is the retrieval layer.

324
00:16:44,440 --> 00:16:48,200
How many chunks your rag pipeline returns, what the chunk size is and where the hybrid

325
00:16:48,200 --> 00:16:49,680
search is enabled.

326
00:16:49,680 --> 00:16:54,360
Most co-pilot studio agents use default settings here and the defaults are not cost optimized.

327
00:16:54,360 --> 00:16:57,720
Default chunk sizes and share point indexing are often larger than necessary for question

328
00:16:57,720 --> 00:16:58,720
answering.

329
00:16:58,720 --> 00:17:02,240
Default top-k values in Azure AI search are often higher than needed.

330
00:17:02,240 --> 00:17:06,440
And hybrid search, which combines vector similarity with keyword filtering, is frequently

331
00:17:06,440 --> 00:17:09,200
disabled because it requires extra configuration.

332
00:17:09,200 --> 00:17:13,120
Without hybrid search, your vector query returns broadly related chunks that share semantic

333
00:17:13,120 --> 00:17:15,200
neighbors, but not topical relevance.

334
00:17:15,200 --> 00:17:18,680
Those chunks inflate the prompt and degrade answer quality at the same time.

335
00:17:18,680 --> 00:17:21,880
The prompt layer is where the hidden overhead lives.

336
00:17:21,880 --> 00:17:25,320
System prompts, formatting instructions, safety matter prompts and grounding metadata are

337
00:17:25,320 --> 00:17:26,920
all propended to every call.

338
00:17:26,920 --> 00:17:31,440
If your system prompt is 400 tokens and you send 5,000 calls per day, you are sending

339
00:17:31,440 --> 00:17:33,800
2 million tokens of instructions daily.

340
00:17:33,800 --> 00:17:38,960
At GPT-5 global rates, that is $2.50 per day in system prompt overhead alone.

341
00:17:38,960 --> 00:17:42,400
Over a year, that is over $900 for text that never changes.

342
00:17:42,400 --> 00:17:46,800
Catching the system prompt or moving static instructions into the model deployment configuration

343
00:17:46,800 --> 00:17:48,200
can eliminate most of this.

344
00:17:48,200 --> 00:17:52,360
Finally, the model layer, check which SKU is handling the request, what the max token setting

345
00:17:52,360 --> 00:17:56,200
is, and whether output length is capped or allowed to drift.

346
00:17:56,200 --> 00:18:00,600
Output tokens dominate cost for GPT-5 because output is 8 times more expensive than input.

347
00:18:00,600 --> 00:18:05,720
A model setting that allows 500 token responses when 100 would suffice is not a quality setting.

348
00:18:05,720 --> 00:18:09,240
It is a 5-fold cost multiplier on your most expensive meter.

349
00:18:09,240 --> 00:18:10,760
Build this map for each agent.

350
00:18:10,760 --> 00:18:15,440
Use KQL queries against Azure OpenAI diagnostic logs to extract real numbers.

351
00:18:15,440 --> 00:18:18,440
Use the co-pilot Studio Analytics dashboard for credit data.

352
00:18:18,440 --> 00:18:22,560
And use Azure API management trace logs if you have a gateway in front of your endpoints.

353
00:18:22,560 --> 00:18:24,280
The map does not need to be beautiful.

354
00:18:24,280 --> 00:18:25,560
It needs to be true.

355
00:18:25,560 --> 00:18:29,640
And once it is true, you will almost always find that one layer accounts for 70% of your

356
00:18:29,640 --> 00:18:30,640
waste.

357
00:18:30,640 --> 00:18:32,560
Building your baseline and cost classes.

358
00:18:32,560 --> 00:18:35,840
Once your token flow map is complete, you need a baseline.

359
00:18:35,840 --> 00:18:40,080
A baseline is a snapshot of current cost per interaction, tokens per interaction, and

360
00:18:40,080 --> 00:18:41,840
peak latency for each agent.

361
00:18:41,840 --> 00:18:42,840
It is not a guess.

362
00:18:42,840 --> 00:18:46,920
It is a measured starting point against which every future optimization is judged.

363
00:18:46,920 --> 00:18:49,320
Without it, you cannot prove that a change saved money.

364
00:18:49,320 --> 00:18:50,480
You can only hope.

365
00:18:50,480 --> 00:18:53,200
Store your baseline metrics in an Azure Monitor workbook.

366
00:18:53,200 --> 00:18:55,640
Track weekly averages and monthly P95 peaks.

367
00:18:55,640 --> 00:18:56,640
Watch for drift.

368
00:18:56,640 --> 00:19:00,440
If your tokens per interaction creep up over time, that usually means your retrieval pipeline

369
00:19:00,440 --> 00:19:04,000
is degrading or your conversation history is growing unchecked.

370
00:19:04,000 --> 00:19:08,720
If your cost per interaction spikes on specific days, that usually means a new feature or a marketing

371
00:19:08,720 --> 00:19:12,760
campaign drove unusual traffic to an unoptimized path.

372
00:19:12,760 --> 00:19:15,320
With the baseline in place, introduce cost classes.

373
00:19:15,320 --> 00:19:17,960
This is the framework that prevents waste before it happens.

374
00:19:17,960 --> 00:19:21,560
Every agent workload gets classified into one of three tiers before deployment.

375
00:19:21,560 --> 00:19:24,560
Gold is customer facing, SLA backed, and complex.

376
00:19:24,560 --> 00:19:28,600
It gets the best model, the highest latency budget, and the most rigorous retrieval.

377
00:19:28,600 --> 00:19:30,960
Silver is internal tools and moderate complexity.

378
00:19:30,960 --> 00:19:35,280
It gets a mid-tier model, cashed responses where possible and standard retrieval.

379
00:19:35,280 --> 00:19:38,480
Bronze is FAQ, classification, extraction, and experimentation.

380
00:19:38,480 --> 00:19:42,880
It gets the cheapest model, aggressive caching, strict max tokens limits, and minimal retrieval.

381
00:19:42,880 --> 00:19:46,960
The classification happens before the agent is built not after the invoice arrives.

382
00:19:46,960 --> 00:19:48,720
Platform teams publish the criteria.

383
00:19:48,720 --> 00:19:52,720
App teams justify any gold classification with an SLA and a business case.

384
00:19:52,720 --> 00:19:53,720
Silver is the default.

385
00:19:53,720 --> 00:19:58,000
Bronze is the sandbox, and no team can override the model assignment or cash policy without

386
00:19:58,000 --> 00:19:59,360
a platform review.

387
00:19:59,360 --> 00:20:03,320
This sounds bureaucratic, but it is the only way to prevent 10 teams from independently choosing

388
00:20:03,320 --> 00:20:05,800
GPT-5 global because it feels safer.

389
00:20:05,800 --> 00:20:10,400
Gold is not a model choice, safety is a pipeline choice, and cost classes make that explicit.

390
00:20:10,400 --> 00:20:14,240
The baseline and the classification framework together give you something most organizations

391
00:20:14,240 --> 00:20:17,600
lack, a shared language for AI cost.

392
00:20:17,600 --> 00:20:19,760
Finance can ask why an agent is gold.

393
00:20:19,760 --> 00:20:23,960
Engineering can show that gold status reduced escalation rates by 40%.

394
00:20:23,960 --> 00:20:27,680
Platform teams can monitor whether silver agents are creeping toward gold behavior, and

395
00:20:27,680 --> 00:20:29,560
everyone can see the same dashboard.

396
00:20:29,560 --> 00:20:33,920
That visibility is the foundation for everything that comes next.

397
00:20:33,920 --> 00:20:35,320
The architecture pivot.

398
00:20:35,320 --> 00:20:38,520
At this point, most cost guides tell you to monitor and hope.

399
00:20:38,520 --> 00:20:40,960
They give you dashboards, alerts, and reports.

400
00:20:40,960 --> 00:20:44,680
Then they recommend that you review your spending monthly and optimize where you can.

401
00:20:44,680 --> 00:20:47,240
That is not architecture, that is accounting.

402
00:20:47,240 --> 00:20:50,840
And accounting will not save you when your autonomous agents hit the co-pilot credit meter

403
00:20:50,840 --> 00:20:52,560
at midnight on a weekend.

404
00:20:52,560 --> 00:20:57,000
The shift you need to make is from reactive cost-cutting to structural cost prevention.

405
00:20:57,000 --> 00:20:59,560
You do not want to spend less on the same pipeline.

406
00:20:59,560 --> 00:21:03,240
You want to design a pipeline where the expensive path is never taken unless it is genuinely

407
00:21:03,240 --> 00:21:04,240
required.

408
00:21:04,240 --> 00:21:06,360
The token should be justified before it is sent.

409
00:21:06,360 --> 00:21:10,600
Every model call should pass through a filter that asks whether a cheaper path exists.

410
00:21:10,600 --> 00:21:14,640
And every response should be cashed, so the next identical question costs nothing.

411
00:21:14,640 --> 00:21:16,760
There are four engineering levers that make this possible.

412
00:21:16,760 --> 00:21:18,000
They are not isolated tips.

413
00:21:18,000 --> 00:21:21,840
They are a system, and when you deploy them together, they compound, but when you deploy them

414
00:21:21,840 --> 00:21:23,440
alone, they underperform.

415
00:21:23,440 --> 00:21:25,520
The first lever is semantic caching.

416
00:21:25,520 --> 00:21:29,520
If a user asks the same question twice or a question that is semantically equivalent,

417
00:21:29,520 --> 00:21:31,720
the system should reuse the previous answer.

418
00:21:31,720 --> 00:21:32,840
It should not regenerate.

419
00:21:32,840 --> 00:21:37,560
This eliminates redundant LLM calls at the gateway layer before the token meter ever spins.

420
00:21:37,560 --> 00:21:39,640
The second lever is prompt compression.

421
00:21:39,640 --> 00:21:43,920
Most enterprise prompts are bloated with retrieved documents, conversation history, and system

422
00:21:43,920 --> 00:21:46,360
instructions that are longer than necessary.

423
00:21:46,360 --> 00:21:49,960
Compressing the prompt before it reaches the LLM reduces input tokens without changing

424
00:21:49,960 --> 00:21:51,040
the answer quality.

425
00:21:51,040 --> 00:21:54,480
At scale, this cuts cost by a factor of 2 to 5.

426
00:21:54,480 --> 00:21:56,320
The third lever is model routing.

427
00:21:56,320 --> 00:21:58,320
Not every question needs a flagship model.

428
00:21:58,320 --> 00:22:02,280
A cheap model can handle FAQ, classification, and simple extraction.

429
00:22:02,280 --> 00:22:06,840
An expensive model should only see the tasks that genuinely need reasoning, synthesis, or

430
00:22:06,840 --> 00:22:07,840
creativity.

431
00:22:07,840 --> 00:22:11,840
Rooting the task to the cheapest adequate model is the single biggest per request cost

432
00:22:11,840 --> 00:22:13,560
win available in 2026.

433
00:22:13,560 --> 00:22:15,600
The fourth lever is capacity planning.

434
00:22:15,600 --> 00:22:19,520
Once your per request cost is optimized, you need to buy capacity correctly.

435
00:22:19,520 --> 00:22:23,960
Pay as you go is the right model for variable experimental workloads.

436
00:22:23,960 --> 00:22:27,600
Provisioned throughput units are the right model for steady high volume production.

437
00:22:27,600 --> 00:22:31,920
Buying the wrong capacity commitment turns your optimized per token cost into a stranded

438
00:22:31,920 --> 00:22:33,080
investment.

439
00:22:33,080 --> 00:22:35,000
These four levers work in sequence.

440
00:22:35,000 --> 00:22:37,520
Caching catches repeats before they reach the model.

441
00:22:37,520 --> 00:22:39,240
Compression shrinks what remains.

442
00:22:39,240 --> 00:22:42,840
Routing sends the shrunken prompt to the cheapest capable model, and capacity planning

443
00:22:42,840 --> 00:22:46,000
ensures you are buying that model's output at the right rate.

444
00:22:46,000 --> 00:22:48,000
Together they form a cost architecture.

445
00:22:48,000 --> 00:22:50,160
And that is what we will build now.

446
00:22:50,160 --> 00:22:53,320
Semantic catching via Azure API management.

447
00:22:53,320 --> 00:22:56,080
Organizations regenerate the same answers thousands of times per day.

448
00:22:56,080 --> 00:23:00,520
A policy question about vacation days, a troubleshooting step for a common error, and a status

449
00:23:00,520 --> 00:23:04,000
update on a recurring process are not novel reasoning tasks.

450
00:23:04,000 --> 00:23:08,200
They are lookups disguised as conversations, and every time an LLM regenerates the answer,

451
00:23:08,200 --> 00:23:11,040
you pay for tokens that were already computed yesterday.

452
00:23:11,040 --> 00:23:13,120
Semantic caching solves this at the gateway layer.

453
00:23:13,120 --> 00:23:16,920
Instead of matching identical text strings, it embeds the incoming prompt into a vector

454
00:23:16,920 --> 00:23:20,400
and checks whether a semantically similar prompt was answered recently.

455
00:23:20,400 --> 00:23:24,960
If the similarity score is high enough, it returns the cache response with no LLM call,

456
00:23:24,960 --> 00:23:27,400
zero tokens consumed, and millisecond latency.

457
00:23:27,400 --> 00:23:30,400
Azure API management implements this through two policies.

458
00:23:30,400 --> 00:23:34,200
On the inbound side you add an Azure OpenAI semantic cache lookup policy.

459
00:23:34,200 --> 00:23:38,440
This policy sends the prompt to an Azure OpenAI embedding deployment, converts it to a vector,

460
00:23:38,440 --> 00:23:40,400
and queries an external cache.

461
00:23:40,400 --> 00:23:44,400
On the outbound side you add an Azure OpenAI semantic cache store policy.

462
00:23:44,400 --> 00:23:49,080
This policy embeds the prompt, stores the vector as a cache key, and stores the LLM response

463
00:23:49,080 --> 00:23:51,760
as the value with a configurable time to live.

464
00:23:51,760 --> 00:23:57,000
The external cache is typically Azure cache for reddys with the ready search module enabled.

465
00:23:57,000 --> 00:24:00,320
Reddys handles the vector similarity search at high speed.

466
00:24:00,320 --> 00:24:04,880
The similarity threshold is the tuning parameter that determines whether a cached answer is returned.

467
00:24:04,880 --> 00:24:09,000
A threshold of 0.92 to 0.96 is typical in production.

468
00:24:09,000 --> 00:24:12,680
Below that you risk false positives where a vaguely related question gets an incorrect

469
00:24:12,680 --> 00:24:14,080
cached answer.

470
00:24:14,080 --> 00:24:17,800
Above that you may miss valid cache hits because the threshold is too strict.

471
00:24:17,800 --> 00:24:19,680
The deployment steps are straightforward.

472
00:24:19,680 --> 00:24:25,160
First, deploy an embeddings model such as text embedding ADA 0.02 in Azure OpenAI.

473
00:24:25,160 --> 00:24:28,880
Second, provision in Azure cache for reddys instance with ready search enabled.

474
00:24:28,880 --> 00:24:32,000
Third, configure an APM backend pointing to the embeddings endpoint.

475
00:24:32,000 --> 00:24:38,520
Fourth, add the semantic cache lookup policy to your inbound processing and the semantic cache store policy to your outbound processing.

476
00:24:38,520 --> 00:24:42,480
Fifth, test with two similar prompts and confirm that the second one hits the cache.

477
00:24:42,480 --> 00:24:43,680
The validation is critical.

478
00:24:43,680 --> 00:24:48,920
Use the APM test tab and trace logs to verify that the first call reaches the LLM and stores the response,

479
00:24:48,920 --> 00:24:52,640
while the second similar call short circuits and returns from cache.

480
00:24:52,640 --> 00:25:00,160
If you do not validate this, you may think you have caching when you actually have a broken policy that silently falls through to the LLM on every call.

481
00:25:00,160 --> 00:25:00,920
There are risks.

482
00:25:00,920 --> 00:25:02,440
Stale cache is the obvious one.

483
00:25:02,440 --> 00:25:08,960
If your policy answer changes because of a new regulation, the cache response may be wrong for hours or days until the time to live expires.

484
00:25:08,960 --> 00:25:14,960
Mitigate this with shorter TTLs for dynamic content and explicit cache invalidation hooks when source documents change.

485
00:25:14,960 --> 00:25:16,600
Privacy is another concern.

486
00:25:16,600 --> 00:25:26,840
Cache prompts may contain sensitive user data, use per tenant cache partitions, encrypt data at rest, and consider redacting personally identifiable information before embedding.

487
00:25:26,840 --> 00:25:28,920
False positives are the most subtle risk.

488
00:25:28,920 --> 00:25:35,560
If your threshold is too loose, a user asking about health insurance might get a cached answer about dental coverage because the vectors are close.

489
00:25:35,560 --> 00:25:37,280
This is not just a cost problem.

490
00:25:37,280 --> 00:25:38,440
It is a quality problem.

491
00:25:38,440 --> 00:25:46,560
The fix is conservative thresholds, human evaluation of cache hit quality during rollout, and continuous monitoring of user satisfaction scores after deployment.

492
00:25:46,560 --> 00:25:49,240
When semantic caching works, the impact is immediate.

493
00:25:49,240 --> 00:25:54,680
Cache hits return in milliseconds instead of hundreds of milliseconds or seconds, they consume zero LLM tokens.

494
00:25:54,680 --> 00:26:00,520
And they reduce concurrency pressure on your Azure OpenAI deployment, which lowers the risk of throttling during peak hours.

495
00:26:00,520 --> 00:26:06,120
For high volume agents with repetitive question patterns, cache hit rates of 40 to 60% are common.

496
00:26:06,120 --> 00:26:07,720
That is not a marginal saving.

497
00:26:07,720 --> 00:26:10,000
That is a structural change in your cost curve.

498
00:26:10,000 --> 00:26:11,960
Prompt compression with LLM lingua.

499
00:26:11,960 --> 00:26:13,960
Caching eliminates redundant work.

500
00:26:13,960 --> 00:26:16,000
But most of your agent calls are not redundant.

501
00:26:16,000 --> 00:26:19,840
They are unique questions that require a live LLM response.

502
00:26:19,840 --> 00:26:24,200
The question is whether those unique questions need the full prompt you are sending.

503
00:26:24,200 --> 00:26:30,920
In a typical enterprise rag pipeline, the prompt contains the user question, a system instruction, and a block of retrieved documents.

504
00:26:30,920 --> 00:26:33,360
That block is often 20,000 tokens or more.

505
00:26:33,360 --> 00:26:34,440
And much of it is noise.

506
00:26:34,440 --> 00:26:41,840
Redundant sentences, irrelevant paragraphs, formatting overhead, and metadata that the model does not need to answer the specific question.

507
00:26:41,840 --> 00:26:49,280
You are paying for every one of those tokens, and you are paying eight times more for the output tokens that the model generates in response to all that noise.

508
00:26:49,280 --> 00:26:52,800
LLM lingua is a prompt compression framework from Microsoft Research.

509
00:26:52,800 --> 00:26:58,840
It uses a lightweight encoder model to score every token or sentence in the retrieved context for relevance to the task.

510
00:26:58,840 --> 00:27:01,560
It keeps the important parts and discards the rest.

511
00:27:01,560 --> 00:27:03,760
Compression ratios of two to ten times are typical.

512
00:27:03,760 --> 00:27:10,640
With some configurations reaching 20 times on long, redundant corpora, the quality losses minimal because the compression is task aware.

513
00:27:10,640 --> 00:27:12,000
It does not blindly truncate.

514
00:27:12,000 --> 00:27:14,600
It preserves the sentences that carry the signal.

515
00:27:14,600 --> 00:27:18,680
In Azure, LLM lingua two models are available in the Azure AI model catalog.

516
00:27:18,680 --> 00:27:23,840
You can deploy them into an Azure AI Studio project and use them as a preprocessing step in prompt flow.

517
00:27:23,840 --> 00:27:25,080
The architecture is simple.

518
00:27:25,080 --> 00:27:26,880
Retrieval returns a set of documents.

519
00:27:26,880 --> 00:27:30,320
The LLM lingua node compresses them to a token budget you specify.

520
00:27:30,320 --> 00:27:34,400
The compressed context is then passed to the main LLM along with the user query.

521
00:27:34,400 --> 00:27:37,960
The main LLMC's less text generates faster and costs fewer tokens.

522
00:27:37,960 --> 00:27:39,840
But it still receives the information it needs.

523
00:27:39,840 --> 00:27:41,280
There are three integration patterns.

524
00:27:41,280 --> 00:27:43,320
The first is prompt flow preprocessing.

525
00:27:43,320 --> 00:27:46,720
You insert an LLM lingua node between your retrieval step and your LLM step.

526
00:27:46,720 --> 00:27:51,720
This is the easiest pattern if you are already using Azure AI Studio and prompt flow for orchestration.

527
00:27:51,720 --> 00:27:53,440
The second is back end integration.

528
00:27:53,440 --> 00:27:58,320
Your application calls a deployed LLM lingua endpoint before calling Azure Open AI.

529
00:27:58,320 --> 00:28:02,640
This pattern works for custom applications as your functions and containerized services.

530
00:28:02,640 --> 00:28:04,280
The third is batch pre-compression.

531
00:28:04,280 --> 00:28:08,120
You compress your knowledge-based documents offline before embedding and indexing them.

532
00:28:08,120 --> 00:28:12,920
This reduces both storage and retrieval overhead, though it requires reprocessing when documents change.

533
00:28:12,920 --> 00:28:14,440
The tuning process matters.

534
00:28:14,440 --> 00:28:18,880
Start with conservative compression targeting a 20 to 40% reduction in prompt length.

535
00:28:18,880 --> 00:28:22,800
Run an evaluation using Azure AI Studios built in evaluations feature.

536
00:28:22,800 --> 00:28:28,640
Compare answer accuracy, completeness and user satisfaction between the compressed and uncompressed paths.

537
00:28:28,640 --> 00:28:30,720
If quality holds, increase compression.

538
00:28:30,720 --> 00:28:34,280
If quality drops back off, different tasks tolerate different levels of compression.

539
00:28:34,280 --> 00:28:37,520
Classification and FAQ can handle aggressive compression.

540
00:28:37,520 --> 00:28:40,680
Legal and medical compliance answers need conservative compression.

541
00:28:40,680 --> 00:28:42,680
Do not use a single ratio for all agents.

542
00:28:42,680 --> 00:28:44,200
Safety is also a consideration.

543
00:28:44,200 --> 00:28:47,480
LLM lingua compresses content, but it does not validate it.

544
00:28:47,480 --> 00:28:51,160
If your retrieved documents contain sensitive data, compression does not remove it.

545
00:28:51,160 --> 00:28:53,760
And compression should never be your primary security control.

546
00:28:53,760 --> 00:28:56,760
Run Azure AI content safety on the final prompt and output.

547
00:28:56,760 --> 00:29:01,360
Apply the same R-back, private networking and encryption standards to the LLM lingua deployment

548
00:29:01,360 --> 00:29:02,920
that you apply to your main LLM.

549
00:29:02,920 --> 00:29:05,920
The compressed pipeline is only as secure as its weakest node.

550
00:29:05,920 --> 00:29:09,600
For co-pilot studio environments, LLM lingua is not a native feature.

551
00:29:09,600 --> 00:29:11,240
You cannot simply toggle it on.

552
00:29:11,240 --> 00:29:16,000
Instead, you need to build a custom RAC orchestrator using an Azure Function or a custom API.

553
00:29:16,000 --> 00:29:19,760
Co-pilot studio calls the orchestrator instead of using its built-in knowledge source.

554
00:29:19,760 --> 00:29:24,320
The orchestrator retrieves documents from Azure AI Search, runs LLM lingua compression, calls

555
00:29:24,320 --> 00:29:26,600
Azure OpenAI and returns the answer.

556
00:29:26,600 --> 00:29:31,000
This requires more engineering than a native co-pilot studio agent, but for high volume scenarios,

557
00:29:31,000 --> 00:29:32,880
the token savings justify the build.

558
00:29:32,880 --> 00:29:37,080
The combined impact of compression and caching is where the architecture starts to pay off.

559
00:29:37,080 --> 00:29:38,600
Caching handles the repeats.

560
00:29:38,600 --> 00:29:40,640
Compression shrinks the unique requests.

561
00:29:40,640 --> 00:29:45,480
Together they can reduce your input token volume by 50% to 70% before the model ever sees

562
00:29:45,480 --> 00:29:46,480
the prompt.

563
00:29:46,480 --> 00:29:51,080
And because output cost is driven by input complexity, shorter prompts often produce shorter,

564
00:29:51,080 --> 00:29:52,200
more focused answers.

565
00:29:52,200 --> 00:29:55,640
The savings cascade through both sides of the token equation.

566
00:29:55,640 --> 00:29:57,680
Model routing with Azure AI Foundry.

567
00:29:57,680 --> 00:30:01,960
Hard coding, GPT-5 global for every query is the fastest way to overpay.

568
00:30:01,960 --> 00:30:03,720
Most teams do it because it feels safer.

569
00:30:03,720 --> 00:30:07,800
They do not want to explain to a user why a cheap model gave a bad answer, so they send

570
00:30:07,800 --> 00:30:10,320
everything to the flagship and absorb the cost.

571
00:30:10,320 --> 00:30:15,040
At low volume that cost is invisible, but at high volume it is catastrophic, so the alternative

572
00:30:15,040 --> 00:30:16,200
is model routing.

573
00:30:16,200 --> 00:30:20,480
Azure AI Foundry now provides a model router that acts as a trained meta model between your

574
00:30:20,480 --> 00:30:22,880
application and the underlying LLM pool.

575
00:30:22,880 --> 00:30:24,520
Your code calls one endpoint.

576
00:30:24,520 --> 00:30:28,440
The router reads the prompt, estimates the complexity, and sends the request to the

577
00:30:28,440 --> 00:30:30,240
cheapest model that can handle it.

578
00:30:30,240 --> 00:30:35,240
Global classification tasks go to GPT-5 Nano, standard Q&A and summarization go to GPT-5

579
00:30:35,240 --> 00:30:36,240
Mini.

580
00:30:36,240 --> 00:30:41,000
Complex reasoning, multi-document synthesis and creative drafting go to GPT-5 global or GPT-5

581
00:30:41,000 --> 00:30:42,000
Pro.

582
00:30:42,000 --> 00:30:44,680
You do not hard code these choices because the architecture makes them for you.

583
00:30:44,680 --> 00:30:46,400
The deployment is straightforward.

584
00:30:46,400 --> 00:30:51,200
In your Azure AI Foundry project, browse the model catalog and search for model router.

585
00:30:51,200 --> 00:30:54,800
Deploy it like any other model, giving it a name such as model router prod.

586
00:30:54,800 --> 00:30:59,120
You receive an endpoint URL, an API key, and a compatible API version.

587
00:30:59,120 --> 00:31:03,960
Your application then calls this endpoint using the standard Azure OpenAI SDK, passing the

588
00:31:03,960 --> 00:31:06,360
router deployment name as the model parameter.

589
00:31:06,360 --> 00:31:09,600
The payload looks identical to a normal chat completion request.

590
00:31:09,600 --> 00:31:11,760
The router handles the selection internally.

591
00:31:11,760 --> 00:31:13,080
The savings are substantial.

592
00:31:13,080 --> 00:31:17,880
For workloads that are 80% simple and 20% complex, intelligent routing can reduce average

593
00:31:17,880 --> 00:31:20,000
model cost by roughly 60%.

594
00:31:20,000 --> 00:31:21,600
That is not a minor optimization.

595
00:31:21,600 --> 00:31:26,520
It is the difference between a pilot that scales and a pilot that gets shut down by finance.

596
00:31:26,520 --> 00:31:30,880
The integration cost is minimal because you are already using the Azure OpenAI SDK.

597
00:31:30,880 --> 00:31:33,120
You change the deployment name and you are done.

598
00:31:33,120 --> 00:31:34,760
There are important constraints.

599
00:31:34,760 --> 00:31:37,000
The router pool is managed by Microsoft.

600
00:31:37,000 --> 00:31:39,120
You cannot add your own fine-tuned models to it.

601
00:31:39,120 --> 00:31:42,640
You cannot add external APIs or self-hosted open source models.

602
00:31:42,640 --> 00:31:46,240
And you cannot force a specific model choice via API parameters.

603
00:31:46,240 --> 00:31:50,480
If your compliance or safety team requires deterministic model selection for ordered reasons,

604
00:31:50,480 --> 00:31:51,920
the router is not the right tool.

605
00:31:51,920 --> 00:31:55,560
You should deploy specific models directly and build your own routing logic.

606
00:31:55,560 --> 00:31:58,680
Also the routing decision is opaque on a per request basis.

607
00:31:58,680 --> 00:32:01,800
You cannot look at a single response and know which model generated it.

608
00:32:01,800 --> 00:32:03,720
You can only see aggregate metrics.

609
00:32:03,720 --> 00:32:08,840
That means your monitoring strategy must focus on overall cost per request, average latency,

610
00:32:08,840 --> 00:32:12,120
and downstream quality scores rather than per call attribution.

611
00:32:12,120 --> 00:32:16,200
If you need per call model logs for regulatory purposes, you will need custom orchestration

612
00:32:16,200 --> 00:32:17,200
instead.

613
00:32:17,200 --> 00:32:19,840
For most business agents, these constraints are acceptable.

614
00:32:19,840 --> 00:32:24,560
The router is ideal for general purpose co-pilots, developer productivity tools, and

615
00:32:24,560 --> 00:32:28,800
multi-tenant SaaS products where query complexity varies widely.

616
00:32:28,800 --> 00:32:33,160
It is less ideal for regulated environments that require fixed model versions or for use

617
00:32:33,160 --> 00:32:36,640
cases where a specific model has been certified for a specific task.

618
00:32:36,640 --> 00:32:38,400
The monitoring strategy matters.

619
00:32:38,400 --> 00:32:42,360
Because you cannot see individual routing decisions, you should define aggregate KPIs

620
00:32:42,360 --> 00:32:43,760
before deployment.

621
00:32:43,760 --> 00:32:46,880
Target a cost per request below a specific threshold.

622
00:32:46,880 --> 00:32:49,440
Target a latency P95 below a specific ceiling.

623
00:32:49,440 --> 00:32:53,600
Target a user satisfaction or escalation rate that matches or beats your current direct

624
00:32:53,600 --> 00:32:54,800
model baseline.

625
00:32:54,800 --> 00:32:58,840
If any of these KPIs degrades after routing, you can tune the prompt, adjust the generation

626
00:32:58,840 --> 00:33:02,520
parameters, or fall back to direct model deployment for that workload.

627
00:33:02,520 --> 00:33:04,720
Routing also pairs naturally with the other levers.

628
00:33:04,720 --> 00:33:09,800
A cashed response never reaches the router, so cash hit rate directly reduces routing volume.

629
00:33:09,800 --> 00:33:14,160
A compressed prompt is what the router evaluates, so shorter prompts may be classified as simpler

630
00:33:14,160 --> 00:33:16,520
and routed to cheaper models more often.

631
00:33:16,520 --> 00:33:20,400
And capacity planning, which we will cover next, determines how you pay for the model the

632
00:33:20,400 --> 00:33:22,360
router ultimately selects.

633
00:33:22,360 --> 00:33:25,880
Ptu, first PAYG, the capacity decision.

634
00:33:25,880 --> 00:33:28,360
Model routing optimizes per request cost.

635
00:33:28,360 --> 00:33:32,600
But how you buy capacity determines your total spend regardless of per request efficiency.

636
00:33:32,600 --> 00:33:37,280
This is the strategic decision that separates experimental workloads from production commitments.

637
00:33:37,280 --> 00:33:41,160
And most organizations get it wrong because they focus on sticker price rather than utilization

638
00:33:41,160 --> 00:33:42,160
math.

639
00:33:42,160 --> 00:33:45,880
As your open AI offers two billing models, pay as you go charges per million tokens with

640
00:33:45,880 --> 00:33:48,000
no commitment.

641
00:33:48,000 --> 00:33:53,160
You're visioned through put units reserve a fixed amount of capacity and charge per Ptu hour

642
00:33:53,160 --> 00:33:55,040
with monthly or yearly discounts.

643
00:33:55,040 --> 00:34:00,040
The advertised savings for Ptu can reach 70% compared to PAYG list prices, but those savings

644
00:34:00,040 --> 00:34:02,800
only materialize at very high utilization.

645
00:34:02,800 --> 00:34:10,000
For GPT-5 Global in 2026, a yearly Ptu reservation breaks even against PAYG at roughly 86% sustained

646
00:34:10,000 --> 00:34:11,000
utilization.

647
00:34:11,000 --> 00:34:16,440
That means you need to keep the reserved capacity busy at 86% or higher, 24 hours a day, 7 days

648
00:34:16,440 --> 00:34:17,440
a week.

649
00:34:17,440 --> 00:34:20,760
Average utilization is 70% PAYG is cheaper.

650
00:34:20,760 --> 00:34:25,600
If your traffic is spiky, with peaks at noon and troughs at midnight, sustaining 86% average

651
00:34:25,600 --> 00:34:27,000
is nearly impossible.

652
00:34:27,000 --> 00:34:28,880
Hourly Ptu never wins on cost.

653
00:34:28,880 --> 00:34:32,160
It is useful only for short term benchmarking or load testing.

654
00:34:32,160 --> 00:34:36,960
Monthly Ptu rarely wins because the break even point is near 100% utilization, which is

655
00:34:36,960 --> 00:34:41,240
practically unattainable in real workloads with any seasonality or downtime.

656
00:34:41,240 --> 00:34:46,720
Only yearly Ptu can beat PAYG and only for stable, high volume always on production workloads.

657
00:34:46,720 --> 00:34:48,520
The practical framework is simple.

658
00:34:48,520 --> 00:34:50,520
Start every new workload on PAYG.

659
00:34:50,520 --> 00:34:53,400
Collect 30 to 90 days of real token throughput data.

660
00:34:53,400 --> 00:34:56,480
Compute your peak average and off peak utilization.

661
00:34:56,480 --> 00:34:59,960
Then model the Ptu cost at your actual utilization, not at ideal utilization.

662
00:34:59,960 --> 00:35:04,360
If your sustained average is above 85%, and your workload is mission critical with strict

663
00:35:04,360 --> 00:35:06,680
latency requirements, consider yearly Ptu.

664
00:35:06,680 --> 00:35:08,400
Otherwise, stay on PAYG.

665
00:35:08,400 --> 00:35:10,600
There is a nuance that many calculators miss.

666
00:35:10,600 --> 00:35:14,440
Azure OpenAI introduced prompt prefix caching for PAYG workloads.

667
00:35:14,440 --> 00:35:18,480
If your prompts share a common prefix such as a system instruction or a structured template,

668
00:35:18,480 --> 00:35:21,400
repeated calls are built at a much lower cache rate.

669
00:35:21,400 --> 00:35:26,520
With cache hit rates above 50%, PAYG becomes even more attractive, and the Ptu break even

670
00:35:26,520 --> 00:35:28,160
threshold moves even higher.

671
00:35:28,160 --> 00:35:32,360
If you have built caching and compression correctly, your PAYG bill may already be lower

672
00:35:32,360 --> 00:35:34,360
than a Ptu reservation would deliver.

673
00:35:34,360 --> 00:35:36,000
Some teams adopt a hybrid approach.

674
00:35:36,000 --> 00:35:38,560
They reserve a yearly Ptu for their baseline load.

675
00:35:38,560 --> 00:35:40,800
The steady traffic they know will always be there.

676
00:35:40,800 --> 00:35:43,600
Then they use PAYG for bursts above the baseline.

677
00:35:43,600 --> 00:35:47,880
This improves average utilization of the Ptu and avoids over provisioning, but it adds

678
00:35:47,880 --> 00:35:49,200
operational complexity.

679
00:35:49,200 --> 00:35:53,320
You need two deployments, two monitoring dashboards, and logic to root traffic between them.

680
00:35:53,320 --> 00:35:56,680
Only do this if your baseline is large enough to justify the overhead.

681
00:35:56,680 --> 00:35:59,120
The hidden costs around Ptu also matter.

682
00:35:59,120 --> 00:36:04,200
Support plans, networking, data egress, and logging scale with your overall Azure footprint.

683
00:36:04,200 --> 00:36:08,120
Ptu deployments tend to be permanent and always on, which increases the share of these

684
00:36:08,120 --> 00:36:09,120
fixed overheads.

685
00:36:09,120 --> 00:36:13,880
PAYG lets you shut down environments faster when demand drops, which can shrink the surrounding

686
00:36:13,880 --> 00:36:15,120
infrastructure bill.

687
00:36:15,120 --> 00:36:17,120
The token price is only part of the equation.

688
00:36:17,120 --> 00:36:19,280
Fiscal guardrails and budget alerts.

689
00:36:19,280 --> 00:36:21,080
Engineers optimized for speed and quality.

690
00:36:21,080 --> 00:36:24,040
They do not optimize for cost unless cost is enforced.

691
00:36:24,040 --> 00:36:27,520
That is why the fourth layer of this architecture is not a technical lever.

692
00:36:27,520 --> 00:36:28,600
It is a governance lever.

693
00:36:28,600 --> 00:36:32,680
You need automated guardrails that make the cheapest path the default path and that alert

694
00:36:32,680 --> 00:36:34,600
you when someone overrides it.

695
00:36:34,600 --> 00:36:36,520
Start with Azure cost management budgets.

696
00:36:36,520 --> 00:36:41,480
Get a monthly budget for your total AI spend across Azure OpenAI, Copilot Studio, and related

697
00:36:41,480 --> 00:36:42,480
infrastructure.

698
00:36:42,480 --> 00:36:44,160
Configure three alert thresholds.

699
00:36:44,160 --> 00:36:47,720
At 50% of budget, send a warning email to the platform team.

700
00:36:47,720 --> 00:36:51,480
At 80% trigger a logic app that posts to your operations channel and tags the workload

701
00:36:51,480 --> 00:36:52,480
owner.

702
00:36:52,480 --> 00:36:57,440
At 100% block new deployments or throttle non-gold workloads until the next budget cycle.

703
00:36:57,440 --> 00:36:59,960
Do not rely on humans to notice a dashboard.

704
00:36:59,960 --> 00:37:02,280
Automated alerts are the only guardrail that scales.

705
00:37:02,280 --> 00:37:06,760
The Copilot Studio set per agent credit caps in the power platform admin center.

706
00:37:06,760 --> 00:37:10,240
Every agent should have a maximum monthly credit allocation based on its cost class.

707
00:37:10,240 --> 00:37:11,760
Bronze agents get tight caps.

708
00:37:11,760 --> 00:37:13,400
Silver agents get moderate caps.

709
00:37:13,400 --> 00:37:16,120
Gold agents get higher caps but still have a ceiling.

710
00:37:16,120 --> 00:37:20,040
When an agent hits its cap, it stops responding or falls back to a static FAQ.

711
00:37:20,040 --> 00:37:24,160
This prevents runaway agents from consuming unlimited credits during a traffic spike or

712
00:37:24,160 --> 00:37:25,160
a loop bug.

713
00:37:25,160 --> 00:37:27,960
Azure API management provides the technical enforcement layer.

714
00:37:27,960 --> 00:37:32,040
You can apply rate limiting and quota policies by cost class at the gateway.

715
00:37:32,040 --> 00:37:34,800
The workloads get higher tokens per minute limits.

716
00:37:34,800 --> 00:37:38,000
Bronze workloads get throttled first when the gateway is under pressure.

717
00:37:38,000 --> 00:37:42,360
You can also enforce max tokens limits by policy, preventing any agent from generating

718
00:37:42,360 --> 00:37:45,400
unexpectedly long responses that blow the output budget.

719
00:37:45,400 --> 00:37:47,160
These policies are not suggestions.

720
00:37:47,160 --> 00:37:49,400
They are rules that the infrastructure enforces.

721
00:37:49,400 --> 00:37:53,560
The combination of budgets, caps and rate limits creates a layer defense.

722
00:37:53,560 --> 00:37:54,800
Finance owns the budget.

723
00:37:54,800 --> 00:37:56,360
Platform teams own the gateway policies.

724
00:37:56,360 --> 00:37:58,000
App teams own the agent design.

725
00:37:58,000 --> 00:38:00,560
When a spike happens, the budget alert fires first.

726
00:38:00,560 --> 00:38:02,920
If the spike continues, the credit cap kicks in.

727
00:38:02,920 --> 00:38:06,680
If the spike is a sustained change in demand, the rate limits smooth the load.

728
00:38:06,680 --> 00:38:10,400
And if none of these catch it, the cost class classification ensures that the most expensive

729
00:38:10,400 --> 00:38:13,360
workloads were already the smallest segment of your traffic.

730
00:38:13,360 --> 00:38:15,160
These guardrails also change behavior.

731
00:38:15,160 --> 00:38:19,000
When app teams know that Bronze agents have a 200 token max response limit, they design

732
00:38:19,000 --> 00:38:20,240
for conciseness.

733
00:38:20,240 --> 00:38:24,720
When they know that silver agents must use cashed responses for the top 10 FAQ questions,

734
00:38:24,720 --> 00:38:26,760
they build caching into the flow from day one.

735
00:38:26,760 --> 00:38:28,320
Governance is not a constraint.

736
00:38:28,320 --> 00:38:30,120
It is a design specification.

737
00:38:30,120 --> 00:38:34,840
And when it is enforced automatically, it becomes part of the architecture rather than an

738
00:38:34,840 --> 00:38:36,840
afterthought.

739
00:38:36,840 --> 00:38:40,200
The 90-day implementation roadmap theory is useful.

740
00:38:40,200 --> 00:38:41,720
But what matters is execution.

741
00:38:41,720 --> 00:38:45,200
The following roadmap breaks the entire framework into three 30-day sprints.

742
00:38:45,200 --> 00:38:49,080
Each sprint has a clear, deliverable exit criteria and rollback triggers.

743
00:38:49,080 --> 00:38:50,440
Do not skip the audit sprint.

744
00:38:50,440 --> 00:38:52,640
Do not deploy caching without validation.

745
00:38:52,640 --> 00:38:56,360
And do not commit to Ptu before you have 30 days of real telemetry.

746
00:38:56,360 --> 00:39:00,080
The organizations that fail this transition are the ones that treat it as a configuration.

747
00:39:00,080 --> 00:39:04,240
So the configuration change rather than an architecture program.

748
00:39:04,240 --> 00:39:07,080
Days 1 through 30, the audit sprint.

749
00:39:07,080 --> 00:39:10,960
The goal of the first sprint is visibility because you cannot optimize what you cannot

750
00:39:10,960 --> 00:39:11,960
see.

751
00:39:11,960 --> 00:39:14,160
And right now most organizations are flying blind.

752
00:39:14,160 --> 00:39:18,480
Week 1 is tagging clean up every Azure OpenAI resource, every co-pilot studio environment

753
00:39:18,480 --> 00:39:23,160
and every related infrastructure component gets tagged with four mandatory fields.

754
00:39:23,160 --> 00:39:26,520
Environment, department, agent name and cost class.

755
00:39:26,520 --> 00:39:31,200
If a resource already exists without tags, retroactively apply them based on ownership records

756
00:39:31,200 --> 00:39:32,360
or deployment logs.

757
00:39:32,360 --> 00:39:33,360
This is tedious.

758
00:39:33,360 --> 00:39:34,760
It is also non-negotiable.

759
00:39:34,760 --> 00:39:39,160
Every cost query, every KQL analysis and every charge back report depends on these tags

760
00:39:39,160 --> 00:39:40,520
being clean and consistent.

761
00:39:40,520 --> 00:39:41,920
Week 2 is logging enablement.

762
00:39:41,920 --> 00:39:46,480
Turn on diagnostic logging for every Azure OpenAI deployment and stream the logs to a log

763
00:39:46,480 --> 00:39:47,920
analytics workspace.

764
00:39:47,920 --> 00:39:51,760
If you have Azure API management in front of your endpoints, enable trace logging there

765
00:39:51,760 --> 00:39:53,000
as well.

766
00:39:53,000 --> 00:39:57,800
In co-pilot studio, export the built-in analytics to a dataverse table or a power BI data

767
00:39:57,800 --> 00:39:58,800
set.

768
00:39:58,800 --> 00:40:02,640
The specific telemetry you need is input tokens per request, output tokens per request,

769
00:40:02,640 --> 00:40:05,480
model SKU, latency and user session ID.

770
00:40:05,480 --> 00:40:09,360
Without session ID, you cannot attribute cost to specific conversation patterns.

771
00:40:09,360 --> 00:40:11,320
Week 3 is baseline construction.

772
00:40:11,320 --> 00:40:15,920
Run your first KQL query against the log analytics workspace to compute average tokens per

773
00:40:15,920 --> 00:40:20,200
interaction, peak tokens per interaction and total interactions per day for each tagged

774
00:40:20,200 --> 00:40:21,200
agent.

775
00:40:21,200 --> 00:40:24,840
Co-pilot studio consumption estimator for every active agent.

776
00:40:24,840 --> 00:40:29,100
Record the projected credit need versus current pack size and run your first Azure Co-pilot

777
00:40:29,100 --> 00:40:31,200
natural language cost query.

778
00:40:31,200 --> 00:40:34,920
Summarize my AI spend by service and tag for the last 30 days.

779
00:40:34,920 --> 00:40:39,320
Store all of this in an Azure monitor workbook that the platform team, finance and app teams

780
00:40:39,320 --> 00:40:40,320
can all access.

781
00:40:40,320 --> 00:40:42,800
Week 4 is a normally identification.

782
00:40:42,800 --> 00:40:45,680
Compare the last 30 days of spend against the prior 30 days.

783
00:40:45,680 --> 00:40:50,520
Flag any agent whose cost per interaction increased more than 20% week over week.

784
00:40:50,520 --> 00:40:55,960
The top three drivers, common culprits are a new feature that increased retrieval scope,

785
00:40:55,960 --> 00:41:00,160
a marketing campaign that drove traffic to an unoptimized agent or a code change that removed

786
00:41:00,160 --> 00:41:01,920
a max tokens limit.

787
00:41:01,920 --> 00:41:03,680
Document the root cause for each anomaly.

788
00:41:03,680 --> 00:41:06,880
These become your first optimization targets in sprint 2.

789
00:41:06,880 --> 00:41:09,040
The exit criteria for sprint 1 are simple.

790
00:41:09,040 --> 00:41:11,960
You must have clean tags on 100% of resources.

791
00:41:11,960 --> 00:41:15,440
You must have 30 days of diagnostic logs in a queryable workspace.

792
00:41:15,440 --> 00:41:19,240
You must have a baseline workbook showing cost per interaction, tokens per interaction

793
00:41:19,240 --> 00:41:21,480
and P95 latency for every agent.

794
00:41:21,480 --> 00:41:24,720
And you must have an anomaly log with at least three investigated spikes.

795
00:41:24,720 --> 00:41:28,080
If you cannot produce these artifacts, do not proceed to sprint 2.

796
00:41:28,080 --> 00:41:29,880
You are not ready to optimize.

797
00:41:29,880 --> 00:41:31,560
Days 31.

798
00:41:31,560 --> 00:41:33,600
Through 60, the quick wind sprint.

799
00:41:33,600 --> 00:41:38,400
The goal of the second sprint is 20% to 30% cost reduction with zero quality degradation.

800
00:41:38,400 --> 00:41:42,160
These are the low hanging fruits that require minimal architectural change but deliver immediate

801
00:41:42,160 --> 00:41:43,160
impact.

802
00:41:43,160 --> 00:41:45,160
Week 5 is semantic caching deployment.

803
00:41:45,160 --> 00:41:48,240
Identify your top 5 most repeated query patterns.

804
00:41:48,240 --> 00:41:51,920
These are usually FAQ questions, policy lookups and status checks.

805
00:41:51,920 --> 00:41:55,720
For each pattern, deploy semantic caching through Azure API management using the policy

806
00:41:55,720 --> 00:41:57,320
pair we discussed earlier.

807
00:41:57,320 --> 00:42:00,280
Set an initial similarity threshold of 0.94.

808
00:42:00,280 --> 00:42:03,040
Test with real user queries and measure cache hit rate.

809
00:42:03,040 --> 00:42:07,960
If hit rate is below 30% after one week, lower the threshold to 0.92.

810
00:42:07,960 --> 00:42:10,800
If false positives appear, raise it back.

811
00:42:10,800 --> 00:42:14,640
Target a cache hit rate of 40% or higher for these top 5 patterns that alone eliminates

812
00:42:14,640 --> 00:42:17,800
40% of LLM calls for your most common traffic.

813
00:42:17,800 --> 00:42:20,000
Week 6 is max tokens enforcement.

814
00:42:20,000 --> 00:42:23,280
Review every agent and enforce a max tokens limit based on cost class.

815
00:42:23,280 --> 00:42:25,560
Bronze agents get a 100 token output limit.

816
00:42:25,560 --> 00:42:27,760
Silver agents get a 200 token limit.

817
00:42:27,760 --> 00:42:33,000
Gold agents get a 500 token limit with a requirement that the app team justifies any exception.

818
00:42:33,000 --> 00:42:34,800
This is not about restricting users.

819
00:42:34,800 --> 00:42:36,920
It is about forcing agents to be concise.

820
00:42:36,920 --> 00:42:41,000
If an answer genuinely needs more than 200 tokens, the agent should offer a summary and

821
00:42:41,000 --> 00:42:43,120
invite the user to ask for details.

822
00:42:43,120 --> 00:42:45,360
Most users prefer concise answers anyway.

823
00:42:45,360 --> 00:42:47,080
Week 7 is retrieval tuning.

824
00:42:47,080 --> 00:42:51,040
For every rag-based agent, reduce the top-carage retrieval count from its current default

825
00:42:51,040 --> 00:42:53,000
to 7 chunks maximum.

826
00:42:53,000 --> 00:42:56,560
Enable hybrid search with semantic ranking if it is not already active.

827
00:42:56,560 --> 00:43:00,440
Reach-chunk long documents using paragraph boundaries rather than fixed token counts.

828
00:43:00,440 --> 00:43:04,560
And add metadata filters so that queries about HR policy only search HR documents not

829
00:43:04,560 --> 00:43:06,200
the entire corporate knowledge base.

830
00:43:06,200 --> 00:43:10,000
These changes reduce input tokens before they ever reach compression or caching.

831
00:43:10,000 --> 00:43:11,840
Week 8 is budget alert activation.

832
00:43:11,840 --> 00:43:16,000
Set as your cost management budgets at the tenant level and at the department level.

833
00:43:16,000 --> 00:43:18,080
Consider the 50-8100 alert ladder.

834
00:43:18,080 --> 00:43:21,440
Set co-pilot studio credit caps for every bronze and silver agent.

835
00:43:21,440 --> 00:43:25,800
And add API-M rate limit policies that throttle bronze workloads when gateway capacity reaches

836
00:43:25,800 --> 00:43:26,800
80%.

837
00:43:26,800 --> 00:43:30,520
Test the alert chain by simulating a traffic spike in your staging environment.

838
00:43:30,520 --> 00:43:34,560
Verify that alerts fire, logic apps trigger and throttling engages at the correct threshold.

839
00:43:34,560 --> 00:43:37,360
The exit criteria for sprint 2 are measurable.

840
00:43:37,360 --> 00:43:41,600
You must demonstrate a 20% reduction in average tokens per interaction.

841
00:43:41,600 --> 00:43:45,120
You must demonstrate a 15% reduction in cost per interaction.

842
00:43:45,120 --> 00:43:48,400
Dash-hit rate must exceed 40% for the top 5 patterns.

843
00:43:48,400 --> 00:43:51,600
And no alert or cap must have fired accidentally during normal operations.

844
00:43:51,600 --> 00:43:55,680
If you hit these numbers, you have proven that cost control does not require quality sacrifice.

845
00:43:55,680 --> 00:43:58,720
If you do not hit these numbers, do not proceed.

846
00:43:58,720 --> 00:43:59,720
Investigate why.

847
00:43:59,720 --> 00:44:03,040
Days 61 through 90, the architecture sprint.

848
00:44:03,040 --> 00:44:05,720
The goal of the third sprint is structural transformation.

849
00:44:05,720 --> 00:44:08,440
The quick wins are exhausted, so now you redesign the pipeline.

850
00:44:08,440 --> 00:44:11,080
Starting with LLM-Lingua deployment in week 9.

851
00:44:11,080 --> 00:44:14,680
Choose one high-volume silver or gold agent with long retrieved documents.

852
00:44:14,680 --> 00:44:20,560
Compare LLM-Lingua 2 from the Azure AI model catalog as a pre-processing step in prompt flow.

853
00:44:20,560 --> 00:44:22,760
Start with a 30% compression target.

854
00:44:22,760 --> 00:44:26,040
Run an A/B test using Azure AI Studio evaluations.

855
00:44:26,040 --> 00:44:29,880
Compare the compressed path against the uncompress path on three metrics.

856
00:44:29,880 --> 00:44:32,840
Answer accuracy, user satisfaction and token cost.

857
00:44:32,840 --> 00:44:36,320
If accuracy drops more than 2%, reduce compression.

858
00:44:36,320 --> 00:44:39,200
If accuracy holds, increase to 50%.

859
00:44:39,200 --> 00:44:42,440
Document the optimal compression ratio and apply it to all similar agents.

860
00:44:42,440 --> 00:44:44,400
Week 10 is model routing deployment.

861
00:44:44,400 --> 00:44:47,120
In Azure AI Foundry, deploy the model router.

862
00:44:47,120 --> 00:44:50,720
Redirect one non-regulated bronze or silver agent to the router endpoint.

863
00:44:50,720 --> 00:44:55,880
Monitor aggregate KPIs for one week, cost per request, P95 latency and escalation rate.

864
00:44:55,880 --> 00:44:59,920
Compare against the prior week when the agent used GPT-5 global directly.

865
00:44:59,920 --> 00:45:04,480
If cost per request drops by 40% or more with no latency or quality degradation, expand

866
00:45:04,480 --> 00:45:05,880
routing to additional agents.

867
00:45:05,880 --> 00:45:09,760
If KPIs degrade, investigate whether the router is sending complex queries to a model

868
00:45:09,760 --> 00:45:11,000
that is too small.

869
00:45:11,000 --> 00:45:14,840
You may need to tune the prompt or add task type hints to improve routing accuracy.

870
00:45:14,840 --> 00:45:16,960
Week 11 is Ptu evaluation.

871
00:45:16,960 --> 00:45:21,520
Using the telemetry from the last 30 days of Sprint 2 and Sprint 3, compute your sustained

872
00:45:21,520 --> 00:45:25,200
utilization for any agent that is a candidate for yearly Ptu.

873
00:45:25,200 --> 00:45:26,200
The formula is simple.

874
00:45:26,200 --> 00:45:30,440
Divide your actual average tokens per minute by the Ptu capacity you would reserve.

875
00:45:30,440 --> 00:45:35,720
If the result is 86% or higher and the workload is mission critical with stable demand, evaluate

876
00:45:35,720 --> 00:45:37,800
a yearly Ptu commitment.

877
00:45:37,800 --> 00:45:40,680
If the result is below 80%, stay on PAYG.

878
00:45:40,680 --> 00:45:43,600
Do not let procurement pressure you into a commitment to get a discount.

879
00:45:43,600 --> 00:45:47,280
An underutilised Ptu is more expensive than PAYG with caching.

880
00:45:47,280 --> 00:45:48,920
Week 12 is governance hardening.

881
00:45:48,920 --> 00:45:52,840
Publish the cost class criteria in your internal Wiki or developer portal.

882
00:45:52,840 --> 00:45:56,440
Create an onboarding checklist that every new agent must complete before deployment.

883
00:45:56,440 --> 00:46:01,560
The checklist includes cost class justification, max token setting, cache configuration, model

884
00:46:01,560 --> 00:46:04,520
assignment or router endpoint and credit cap.

885
00:46:04,520 --> 00:46:06,480
Train citizen developers on the framework.

886
00:46:06,480 --> 00:46:10,160
Most cost overruns come from well-meaning developers who simply do not know that a cheaper

887
00:46:10,160 --> 00:46:13,080
model exists or that retrieval scope matters.

888
00:46:13,080 --> 00:46:14,960
Education is cheaper than remediation.

889
00:46:14,960 --> 00:46:17,520
The exit criteria for Sprint 3 are architectural.

890
00:46:17,520 --> 00:46:21,960
You must have at least one agent running with semantic caching and validated hit rates.

891
00:46:21,960 --> 00:46:26,560
You must have at least one agent running with LLM lingua compression and proven accuracy.

892
00:46:26,560 --> 00:46:30,520
You must have at least one agent routed through the Azure AI Foundry model router with stable

893
00:46:30,520 --> 00:46:31,520
KPIs.

894
00:46:31,520 --> 00:46:34,800
You must have a published cost class framework and an onboarding checklist.

895
00:46:34,800 --> 00:46:38,040
And you must have a Ptu evaluation for every gold candidate.

896
00:46:38,040 --> 00:46:41,520
One if the conclusion is to stay on PAYG.

897
00:46:41,520 --> 00:46:43,320
Rollback triggers and risk mitigation.

898
00:46:43,320 --> 00:46:45,120
Every Sprint includes a rollback plan.

899
00:46:45,120 --> 00:46:50,000
If cache hit rates drop below 20% after deployment, disable the cache and investigate.

900
00:46:50,000 --> 00:46:55,440
If LLM lingua compression degrades user satisfaction by more than 5%, revert to the uncompressed path

901
00:46:55,440 --> 00:46:56,440
and retune.

902
00:46:56,440 --> 00:47:00,720
If model routing increases escalation rates, fall back to direct model deployment and engage

903
00:47:00,720 --> 00:47:03,160
Microsoft support to review the router behavior.

904
00:47:03,160 --> 00:47:07,160
If a budget alert fires because of a genuine business search rather than waste, temporarily

905
00:47:07,160 --> 00:47:09,440
raise the cap and schedule a capacity review.

906
00:47:09,440 --> 00:47:11,880
The biggest risk is organizational fatigue.

907
00:47:11,880 --> 00:47:16,880
A 90 day program requires dedicated time from platform engineering, app teams and finance.

908
00:47:16,880 --> 00:47:21,640
If your organization cannot sustain the pace, extend the timeline, but do not skip steps.

909
00:47:21,640 --> 00:47:23,680
An audit without action is useless.

910
00:47:23,680 --> 00:47:25,960
Quick wins without architecture are temporary.

911
00:47:25,960 --> 00:47:29,800
An architecture without governance will erode within six months as new teams join and

912
00:47:29,800 --> 00:47:31,240
old habits return.

913
00:47:31,240 --> 00:47:33,320
The second biggest risk is scope creep.

914
00:47:33,320 --> 00:47:36,520
This roadmap is about cost architecture, not feature development.

915
00:47:36,520 --> 00:47:40,720
Do not add new agent capabilities during the 90 days unless they are specifically designed

916
00:47:40,720 --> 00:47:42,280
to test the cost framework.

917
00:47:42,280 --> 00:47:46,360
Do not redesign your knowledge base unless the retrieval tuning in week 7 requires it.

918
00:47:46,360 --> 00:47:47,360
Stay focused.

919
00:47:47,360 --> 00:47:51,440
The organizations that succeed treat this as a capital program with a defined budget, a

920
00:47:51,440 --> 00:47:54,440
defined timeline and a defined return on investment.

921
00:47:54,440 --> 00:47:55,440
The governance model.

922
00:47:55,440 --> 00:48:00,160
A 90 day roadmap delivers a cost-optimized architecture, but architecture erodes without

923
00:48:00,160 --> 00:48:01,160
governance.

924
00:48:01,160 --> 00:48:06,200
As new teams join, citizen developers experiment and procurement renews licenses without

925
00:48:06,200 --> 00:48:08,080
checking utilization.

926
00:48:08,080 --> 00:48:12,960
Within six months, the agents you optimised are surrounded by new agents built on old habits.

927
00:48:12,960 --> 00:48:15,440
Governance is the operating model that prevents this erosion.

928
00:48:15,440 --> 00:48:18,960
It is not bureaucracy, it is the system that keeps the architecture alive.

929
00:48:18,960 --> 00:48:20,560
The governance model has three layers.

930
00:48:20,560 --> 00:48:24,240
The platform team owns the infrastructure and the standards, the app teams own the agents

931
00:48:24,240 --> 00:48:28,720
and the user experience, and finance owns the budget and the charge back.

932
00:48:28,720 --> 00:48:33,240
Each layer has clear responsibilities, clear metrics and clear escalation paths.

933
00:48:33,240 --> 00:48:36,720
When these layers are aligned, cost control becomes self-sustaining.

934
00:48:36,720 --> 00:48:41,480
When they are not aligned, every optimisation is eventually undone.

935
00:48:41,480 --> 00:48:42,480
The platform team.

936
00:48:42,480 --> 00:48:44,160
The platform team is the centre of gravity.

937
00:48:44,160 --> 00:48:48,960
They own the Azure OpenAI deployments, the Azure API management instance, the model router,

938
00:48:48,960 --> 00:48:52,280
the LLM-lingua endpoints and the Azure Monitor workbooks.

939
00:48:52,280 --> 00:48:56,000
They publish a service catalogue that defines what every app team can use and how to request

940
00:48:56,000 --> 00:48:57,000
it.

941
00:48:57,000 --> 00:48:58,640
The service catalogue has four entries.

942
00:48:58,640 --> 00:49:03,840
Number one is bronze hosting. This includes a shared Azure OpenAI deployment on PAYG, APIM

943
00:49:03,840 --> 00:49:08,040
with rate limiting, a cached FAQ back end and a model router endpoint.

944
00:49:08,040 --> 00:49:10,720
Bronze workloads are automatically throttled and capped.

945
00:49:10,720 --> 00:49:12,120
Entry 2 is silver hosting.

946
00:49:12,120 --> 00:49:17,560
This includes a dedicated PAYG deployment, semantic caching, LLM-lingua compression and higher

947
00:49:17,560 --> 00:49:18,720
rate limits.

948
00:49:18,720 --> 00:49:21,160
Silver workloads are reviewed quarterly for cost drift.

949
00:49:21,160 --> 00:49:22,560
Entry 3 is gold hosting.

950
00:49:22,560 --> 00:49:28,080
This includes a dedicated PTU or high capacity PAYG deployment, custom retrieval tuning, direct

951
00:49:28,080 --> 00:49:30,920
model assignment and SLA-backed latency.

952
00:49:30,920 --> 00:49:35,040
Gold workloads require a business case, a named SLA and a cost justification that is renewed

953
00:49:35,040 --> 00:49:36,720
every six months.

954
00:49:36,720 --> 00:49:38,040
Entry 4 is the sandbox.

955
00:49:38,040 --> 00:49:40,960
This is the only environment where app teams can experiment freely.

956
00:49:40,960 --> 00:49:45,320
It uses the cheapest models, the strictest rate limits and a weekly credit cap that resets

957
00:49:45,320 --> 00:49:46,640
every Monday.

958
00:49:46,640 --> 00:49:50,880
Experiments that outgrow the sandbox must graduate to bronze, silver or gold via a platform

959
00:49:50,880 --> 00:49:51,880
review.

960
00:49:51,880 --> 00:49:53,600
Nothing lives in the sandbox permanently.

961
00:49:53,600 --> 00:49:56,000
The platform team also owns the cost dashboard.

962
00:49:56,000 --> 00:50:00,280
This dashboard shows real-time spent by cost class, by department and by agent.

963
00:50:00,280 --> 00:50:02,000
It highlights anomalies automatically.

964
00:50:02,000 --> 00:50:06,480
It tracks cache, hit rates, compression ratios, routing accuracy and PTU utilization.

965
00:50:06,480 --> 00:50:09,600
And it is the single source of truth for every cost review meeting.

966
00:50:09,600 --> 00:50:13,440
If an app team disputes a charge, the platform team points to the dashboard.

967
00:50:13,440 --> 00:50:16,760
If finance questions a spike, the platform team pulls the telemetry.

968
00:50:16,760 --> 00:50:20,160
There are no debates about who used what because the data is public.

969
00:50:20,160 --> 00:50:21,160
The app teams.

970
00:50:21,160 --> 00:50:22,920
App teams do not choose their infrastructure.

971
00:50:22,920 --> 00:50:26,320
They choose their use case and the use case determines the cost class.

972
00:50:26,320 --> 00:50:29,600
Before any new agent is built, the app team completes a brief intake form.

973
00:50:29,600 --> 00:50:31,160
The form asks five questions.

974
00:50:31,160 --> 00:50:32,440
What is the user facing goal?

975
00:50:32,440 --> 00:50:34,760
What is the expected monthly interaction volume?

976
00:50:34,760 --> 00:50:36,480
What knowledge sources are required?

977
00:50:36,480 --> 00:50:38,240
What is the acceptable latency?

978
00:50:38,240 --> 00:50:41,800
And what is the business impact if the agent fails or returns a wrong answer?

979
00:50:41,800 --> 00:50:43,400
The answer is determined the cost class.

980
00:50:43,400 --> 00:50:47,640
If the agent is customer facing and failure costs revenue, it is a gold candidate.

981
00:50:47,640 --> 00:50:51,200
If it is internal and failure causes minor inconvenience, it is silver.

982
00:50:51,200 --> 00:50:53,960
If it is experimental or low stakes, it is bronze.

983
00:50:53,960 --> 00:50:57,240
The platform team reviews the intake and assigns the infrastructure here.

984
00:50:57,240 --> 00:50:59,120
App teams can appeal but they need data.

985
00:50:59,120 --> 00:51:03,920
If a bronze agent is missing its latency target, the app team can request silver resources.

986
00:51:03,920 --> 00:51:08,560
But they must show that the latency problem is infrastructure bound, not design bound.

987
00:51:08,560 --> 00:51:10,680
Most latency problems are design bound.

988
00:51:10,680 --> 00:51:12,840
App teams also own user experience quality.

989
00:51:12,840 --> 00:51:16,960
They monitor escalation rates, user satisfaction scores and answer accuracy.

990
00:51:16,960 --> 00:51:20,960
If a bronze agent has high escalation rates, the app team must improve the retrieval

991
00:51:20,960 --> 00:51:24,840
tuning or the prompt design before requesting a more expensive model.

992
00:51:24,840 --> 00:51:27,920
Throwing GPT-5 global at a bad prompt does not fix the prompt.

993
00:51:27,920 --> 00:51:29,640
It just makes the bad prompt expensive.

994
00:51:29,640 --> 00:51:31,840
The intake form also includes a cost estimate.

995
00:51:31,840 --> 00:51:35,680
Using the co-pilot studio consumption estimator and the Azure pricing calculator,

996
00:51:35,680 --> 00:51:38,000
the app team projects their first month cost.

997
00:51:38,000 --> 00:51:41,000
This projection is compared against actual spend after 30 days.

998
00:51:41,000 --> 00:51:46,240
If actual spend exceeds the projection by more than 50%, the platform team initiates a review.

999
00:51:46,240 --> 00:51:50,400
This prevents the all-too-common pattern where an experiment is approved as low cost

1000
00:51:50,400 --> 00:51:53,520
and then silently balloons into a production dependency.

1001
00:51:53,520 --> 00:51:54,720
Finance and chargeback.

1002
00:51:54,720 --> 00:51:56,880
Finance owns the budget and the chargeback model.

1003
00:51:56,880 --> 00:51:58,320
The chargeback model is simple.

1004
00:51:58,320 --> 00:52:01,800
Every app team pays for the tokens and credits their agents consume.

1005
00:52:01,800 --> 00:52:05,000
There is no shared pool that hides individual accountability.

1006
00:52:05,000 --> 00:52:11,040
If the HR team builds an agent that consumes $10,000 per month, the HR budget sees $10,000.

1007
00:52:11,040 --> 00:52:16,800
If the support team builds an agent that consumes $500 per month, the support budget sees $500.

1008
00:52:16,800 --> 00:52:19,680
This sounds harsh, but it is the only way to align incentives.

1009
00:52:19,680 --> 00:52:24,160
When costs are hidden in a central platform budget, app teams have no reason to optimize.

1010
00:52:24,160 --> 00:52:28,000
When costs are visible in their own P&L optimization becomes a priority.

1011
00:52:28,000 --> 00:52:33,280
And the platform team supports this by providing detailed chargeback reports from the cost dashboard.

1012
00:52:33,280 --> 00:52:36,920
Every line item is tagged with agent name, department and cost class.

1013
00:52:36,920 --> 00:52:38,200
Finance does not need to guess.

1014
00:52:38,200 --> 00:52:39,400
They just need to read.

1015
00:52:39,400 --> 00:52:41,600
Finance also owns the annual planning cycle.

1016
00:52:41,600 --> 00:52:46,680
Every year, each app team submits a forecast for agent growth, model upgrades, and capacity changes.

1017
00:52:46,680 --> 00:52:50,640
The platform team translates these forecasts into Azure Open AIPTU needs,

1018
00:52:50,640 --> 00:52:53,880
co-pilot studio pack sizes, and infrastructure scaling plans.

1019
00:52:53,880 --> 00:52:57,600
Finance reviews the total and either approves the budget or challenges the forecasts.

1020
00:52:57,600 --> 00:52:59,120
This is standard capital planning.

1021
00:52:59,120 --> 00:53:03,480
The only difference is that the inputs are tokens and credits instead of servers and storage.

1022
00:53:03,480 --> 00:53:06,280
The quarterly business review is where all three layers meet.

1023
00:53:06,280 --> 00:53:07,720
Platform presents the dashboard.

1024
00:53:07,720 --> 00:53:09,600
App teams present their quality metrics.

1025
00:53:09,600 --> 00:53:11,120
Finance presents the budget variance.

1026
00:53:11,120 --> 00:53:15,120
Together, they decide whether any gold workloads should be reclassified to silver,

1027
00:53:15,120 --> 00:53:20,200
whether any silver workloads should graduate to gold, and whether any bronze experiments should be retired.

1028
00:53:20,200 --> 00:53:22,040
These decisions are data driven in public.

1029
00:53:22,040 --> 00:53:24,200
There are no hidden negotiations.

1030
00:53:24,200 --> 00:53:25,480
Antipetants to avoid.

1031
00:53:25,480 --> 00:53:28,560
There are four common governance failures that undo cost architecture.

1032
00:53:28,560 --> 00:53:29,880
The first is the hero team.

1033
00:53:29,880 --> 00:53:33,200
A high performing app team builds a successful agent, gets promoted,

1034
00:53:33,200 --> 00:53:38,480
and leaves behind a mess of untagged resources, undocumented prompts and hard-coded model references.

1035
00:53:38,480 --> 00:53:39,360
The fix is simple.

1036
00:53:39,360 --> 00:53:42,760
No agent graduates from sandbox to production without platform review.

1037
00:53:42,760 --> 00:53:45,040
Success does not exempt a team from standards.

1038
00:53:45,040 --> 00:53:46,960
The second is the finance override.

1039
00:53:46,960 --> 00:53:50,120
Finance sees a high bill and demands immediate cuts.

1040
00:53:50,120 --> 00:53:53,520
They cap all agents indiscriminately or cancel all PTO reservations.

1041
00:53:53,520 --> 00:53:56,520
This destroys user experience and damages business value.

1042
00:53:56,520 --> 00:53:58,480
The fix is the cost class framework.

1043
00:53:58,480 --> 00:54:00,800
Finance can challenge gold classifications,

1044
00:54:00,800 --> 00:54:04,360
but they cannot throttle gold workloads that have approved SLAs,

1045
00:54:04,360 --> 00:54:09,040
and they cannot cancel PTO reservations for stable production workloads without a migration plan.

1046
00:54:09,040 --> 00:54:10,480
The third is the platform monopoly.

1047
00:54:10,480 --> 00:54:14,280
The platform team becomes so controlling that app teams bypass them entirely

1048
00:54:14,280 --> 00:54:19,840
and build shadow agents in personal as your subscriptions or unmanaged co-pilot studio environments.

1049
00:54:19,840 --> 00:54:23,920
This is the most expensive outcome of all because shadow agents have zero visibility.

1050
00:54:23,920 --> 00:54:25,600
The fix is service catalog speed.

1051
00:54:25,600 --> 00:54:27,880
Platform reviews must happen within 48 hours.

1052
00:54:27,880 --> 00:54:30,200
Sandbox provisioning must be self-service,

1053
00:54:30,200 --> 00:54:34,680
and the published standards must be clear enough that app teams do not need to negotiate every detail.

1054
00:54:34,680 --> 00:54:36,160
The fourth is the permanent pilot.

1055
00:54:36,160 --> 00:54:39,720
An agent is built as a pilot, consumes seeded credits, and never gets reviewed

1056
00:54:39,720 --> 00:54:41,920
because it is just an experiment.

1057
00:54:41,920 --> 00:54:45,680
After November, 2026, these permanent pilots become permanent bills.

1058
00:54:45,680 --> 00:54:47,400
The fix is a sunset clause.

1059
00:54:47,400 --> 00:54:52,160
Every sandbox agent expires after 90 days unless it is promoted to bronze or higher.

1060
00:54:52,160 --> 00:54:54,320
Every bronze agent is reviewed annually,

1061
00:54:54,320 --> 00:54:57,240
and every agent without an owner is deleted.

1062
00:54:57,240 --> 00:54:58,440
Success metrics.

1063
00:54:58,440 --> 00:55:00,400
Governance success is measured by drift.

1064
00:55:00,400 --> 00:55:04,800
Cost drift is the percentage increase in spend per interaction month over month.

1065
00:55:04,800 --> 00:55:08,360
A well-governed environment shows flat or declining cost per interaction.

1066
00:55:08,360 --> 00:55:12,880
Architecture drift is the percentage of agents that deviate from their assigned cost class.

1067
00:55:12,880 --> 00:55:16,320
A well-governed environment shows near zero architecture drift.

1068
00:55:16,320 --> 00:55:20,400
And behavior drift is the percentage of new agents that bypass the intake process.

1069
00:55:20,400 --> 00:55:23,200
A well-governed environment shows zero behavior drift.

1070
00:55:23,200 --> 00:55:26,880
If all three drift metrics are near zero, your governance model is working.

1071
00:55:26,880 --> 00:55:28,920
If any metrics bikes, you have a gap.

1072
00:55:28,920 --> 00:55:30,840
Find it, fix it, and move on.

1073
00:55:30,840 --> 00:55:32,920
Governance is not a one-time project.

1074
00:55:32,920 --> 00:55:35,960
It is a continuous process of detection and correction.

1075
00:55:35,960 --> 00:55:37,640
And like the architecture it protects,

1076
00:55:37,640 --> 00:55:39,800
it is only as strong as the attention you give it.

1077
00:55:39,800 --> 00:55:42,280
The 2026 calendar and what to watch.

1078
00:55:42,280 --> 00:55:44,600
Architecture and governance are not static.

1079
00:55:44,600 --> 00:55:48,000
Microsoft changes pricing, models, and credit policies.

1080
00:55:48,000 --> 00:55:50,360
Your environment changes as new agents are built,

1081
00:55:50,360 --> 00:55:53,000
old ones retire, and user behavior shifts.

1082
00:55:53,000 --> 00:55:55,040
The final piece of this framework is a calendar,

1083
00:55:55,040 --> 00:55:58,360
a set of recurring checks that keep your cost architecture current.

1084
00:55:58,360 --> 00:56:00,040
The November 2026 deadline.

1085
00:56:00,040 --> 00:56:03,120
The most urgent calendar item is November 1, 2026.

1086
00:56:03,120 --> 00:56:07,560
On that date, Microsoft removes seeded AI-builder credits from all tenants.

1087
00:56:07,560 --> 00:56:10,000
Every organization that has been relying on power apps,

1088
00:56:10,000 --> 00:56:14,200
premium or dynamic 365 licenses to cover AI builder and co-pilot studio usage

1089
00:56:14,200 --> 00:56:15,960
will see an immediate cost increase.

1090
00:56:15,960 --> 00:56:19,320
Your preparation should start no later than August 1, 2026.

1091
00:56:19,320 --> 00:56:22,480
By then, you must have a complete inventory of every agent,

1092
00:56:22,480 --> 00:56:26,200
every flow, and every experiment that currently consumes seeded credits.

1093
00:56:26,200 --> 00:56:29,800
For each one, decide whether to migrate to co-pilot studio capacity packs,

1094
00:56:29,800 --> 00:56:32,400
migrate to Azure AI architectures or retire,

1095
00:56:32,400 --> 00:56:33,800
do not wait until October.

1096
00:56:33,800 --> 00:56:36,960
Procurement lead times for capacity packs can be several weeks.

1097
00:56:36,960 --> 00:56:42,160
And re-architecting an agent from co-pilot studio to Azure AI is not a weekend project.

1098
00:56:42,160 --> 00:56:44,960
By September 1, you should have purchased the initial capacity packs

1099
00:56:44,960 --> 00:56:47,080
and set up PAYG overflow billing.

1100
00:56:47,080 --> 00:56:50,640
Test the billing chain by running synthetic traffic through a non-production agent

1101
00:56:50,640 --> 00:56:53,560
and verifying that credits are consumed in the correct order,

1102
00:56:53,560 --> 00:56:55,960
seeded first, then packs, then PAYG.

1103
00:56:55,960 --> 00:56:58,880
If the order is wrong, open a support ticket immediately.

1104
00:56:58,880 --> 00:57:02,080
Billing misconfiguration is easier to fix before the deadline than after.

1105
00:57:02,080 --> 00:57:06,000
By October 1, communicate the change to all citizen developers and app teams.

1106
00:57:06,000 --> 00:57:07,920
Show them the new cost per interaction.

1107
00:57:07,920 --> 00:57:11,160
Show them the credit caps and show them the retirement plan for any agents

1108
00:57:11,160 --> 00:57:12,760
that do not justify the new cost.

1109
00:57:12,760 --> 00:57:14,200
Some experiments will not survive.

1110
00:57:14,200 --> 00:57:15,040
That is the point.

1111
00:57:15,040 --> 00:57:18,160
The free period created value, but it also created waste.

1112
00:57:18,160 --> 00:57:20,200
November is when you separate the two.

1113
00:57:20,200 --> 00:57:21,320
Quarterly reviews.

1114
00:57:21,320 --> 00:57:25,400
After the November transition, run a formal cost architecture review every quarter.

1115
00:57:25,400 --> 00:57:29,640
The review has a fixed agenda and should take no more than 90 minutes.

1116
00:57:29,640 --> 00:57:32,720
First 15 minutes, platform team presents the cost dashboard.

1117
00:57:32,720 --> 00:57:38,000
Show spent by layer by cost class and by department, highlight any month over month variance above 10%.

1118
00:57:38,000 --> 00:57:41,760
Call out cash, hit rate trends, compression ratios and routing accuracy.

1119
00:57:41,760 --> 00:57:45,480
Next 30 minutes, app teams present agent quality metrics.

1120
00:57:45,480 --> 00:57:48,840
Show escalation rates, user satisfaction and accuracy scores.

1121
00:57:48,840 --> 00:57:51,040
Compare these against the prior quarter.

1122
00:57:51,040 --> 00:57:53,800
If quality improved while cost declined, celebrate.

1123
00:57:53,800 --> 00:57:56,520
If quality declined while cost declined, investigate.

1124
00:57:56,520 --> 00:57:58,000
You may have over optimized.

1125
00:57:58,000 --> 00:58:02,320
Next 30 minutes, finance presents budget variance and chargeback reconciliation.

1126
00:58:02,320 --> 00:58:04,840
Confirm that every department is built correctly.

1127
00:58:04,840 --> 00:58:11,240
Identify any shadow spending or untagged resources and update the annual forecast based on actual growth.

1128
00:58:11,240 --> 00:58:12,600
Final 15 minutes.

1129
00:58:12,600 --> 00:58:13,520
Decisions.

1130
00:58:13,520 --> 00:58:17,320
Promote any silver agent that has outgrown its tier and justified the business case.

1131
00:58:17,320 --> 00:58:21,480
Demote any gold agent that is missing its SLA or running below target utilization.

1132
00:58:21,480 --> 00:58:24,760
Retire any bronze agent that has not been used in 60 days.

1133
00:58:24,760 --> 00:58:28,200
And update the service catalog if new Azure features change the economics.

1134
00:58:28,200 --> 00:58:29,320
Monthly monitoring.

1135
00:58:29,320 --> 00:58:32,600
Between quarterly reviews, run a lightweight monthly check.

1136
00:58:32,600 --> 00:58:35,240
It takes 30 minutes and covers five questions.

1137
00:58:35,240 --> 00:58:39,280
Check whether any agent is consuming more than 120% of its credit cap.

1138
00:58:39,280 --> 00:58:44,720
Whether any Azure open AI deployment is showing utilization above 80% or below 20%

1139
00:58:44,720 --> 00:58:48,960
where the cash hit rate is stable, whether LLM lingua compression ratio is stable,

1140
00:58:48,960 --> 00:58:53,960
and whether any new model releases in the Azure AI Foundry catalog could change routing economics.

1141
00:58:53,960 --> 00:58:55,600
The last question is often overlooked.

1142
00:58:55,600 --> 00:58:58,840
Microsoft adds new models to the router pool and the model catalog

1143
00:58:58,840 --> 00:58:59,840
without fanfare.

1144
00:58:59,840 --> 00:59:05,120
A new model that is cheaper than GPT-5 mini for your specific task could drop your per request

1145
00:59:05,120 --> 00:59:06,960
cost by another 20%.

1146
00:59:06,960 --> 00:59:08,680
But you will only know if you are watching.

1147
00:59:08,680 --> 00:59:10,560
Subscribe to the Azure AI release notes.

1148
00:59:10,560 --> 00:59:14,680
Review the model catalog monthly and test promising new models in the sandbox before promoting

1149
00:59:14,680 --> 00:59:16,120
them to production.

1150
00:59:16,120 --> 00:59:17,120
The annual cycle.

1151
00:59:17,120 --> 00:59:21,240
Once per year, typically in January of February, run a full architecture refresh.

1152
00:59:21,240 --> 00:59:23,160
Review every agent from scratch.

1153
00:59:23,160 --> 00:59:25,680
Revalidate the cost class of every production workload.

1154
00:59:25,680 --> 00:59:30,740
Run the Ptu versus Pau IG analysis for all gold candidates using the last 12 months of

1155
00:59:30,740 --> 00:59:31,740
telemetry.

1156
00:59:31,740 --> 00:59:35,640
Update the baseline workbook with new P95 values and refresh the training materials for

1157
00:59:35,640 --> 00:59:36,940
citizen developers.

1158
00:59:36,940 --> 00:59:41,080
This annual cycle is also when you negotiate capacity pack sizes and Ptu reservations for

1159
00:59:41,080 --> 00:59:42,280
the coming year.

1160
00:59:42,280 --> 00:59:46,440
Use your actual consumption data, not your projected growth as the starting point if you

1161
00:59:46,440 --> 00:59:49,840
grew 20% last year, plan for 20% growth next year.

1162
00:59:49,840 --> 00:59:53,720
Do not let vendor forecasts or executive ambition inflate your commitments.

1163
00:59:53,720 --> 00:59:56,760
An over committed Ptu reservation is a stranded asset.

1164
00:59:56,760 --> 01:00:00,880
And an under committed capacity pack plan leads to expensive Pau IG overflow.

1165
01:00:00,880 --> 01:00:03,080
The final annual task is a vendor check.

1166
01:00:03,080 --> 01:00:08,520
Review Microsoft's published pricing for Azure Open AI, co-pilot studio and Azure AI Foundry.

1167
01:00:08,520 --> 01:00:11,400
Note any changes that took effect in the prior year.

1168
01:00:11,400 --> 01:00:14,440
Model prices, credit costs and Ptu rates all change.

1169
01:00:14,440 --> 01:00:18,880
If you are not tracking these changes, your baseline is wrong and your forecasts are useless.

1170
01:00:18,880 --> 01:00:23,320
Update your cost dashboard with the current rates before you build next year's budget.

1171
01:00:23,320 --> 01:00:25,680
Note to watch in 2027 and beyond.

1172
01:00:25,680 --> 01:00:28,880
The AI cost landscape will not stabilize in 2026.

1173
01:00:28,880 --> 01:00:32,760
Model prices will continue to fall as competition increases and hardware improves.

1174
01:00:32,760 --> 01:00:35,880
Context windows will grow, which could increase prompt costs unless compression and caching

1175
01:00:35,880 --> 01:00:36,880
keep pace.

1176
01:00:36,880 --> 01:00:40,760
An autonomous agents will become more capable, which could increase consumption unless governance

1177
01:00:40,760 --> 01:00:41,760
caps keep pace.

1178
01:00:41,760 --> 01:00:46,320
The organizations that thrive are the ones that treat cost architecture as a core competency,

1179
01:00:46,320 --> 01:00:47,560
not a one time project.

1180
01:00:47,560 --> 01:00:51,360
They have teams, processes and calendars dedicated to monitoring and optimization.

1181
01:00:51,360 --> 01:00:53,560
They do not wait for the invoice to surprise them.

1182
01:00:53,560 --> 01:00:57,280
They see the trend in their dashboard and adjust before the trend becomes a crisis.

1183
01:00:57,280 --> 01:00:58,760
That is the final principle.

1184
01:00:58,760 --> 01:01:00,680
Cost architecture is not a destination.

1185
01:01:00,680 --> 01:01:01,960
It is a discipline.

1186
01:01:01,960 --> 01:01:06,240
And in the Microsoft Cloud, where the boundary between licensed and consumed is blurry and

1187
01:01:06,240 --> 01:01:11,480
moving, discipline is the only protection against runaway spend.

1188
01:01:11,480 --> 01:01:13,960
Cost architecture beats cost accounting every time.

1189
01:01:13,960 --> 01:01:17,760
Start your audit this week because November is coming and every experiment you build for

1190
01:01:17,760 --> 01:01:20,920
free is about to start charging.

1191
01:01:20,920 --> 01:01:26,360
Case study rebuilding the Contoso support agent theory is clearer with an example.

1192
01:01:26,360 --> 01:01:30,120
This case study is a composite based on patterns I have seen across multiple Microsoft

1193
01:01:30,120 --> 01:01:31,120
environments.

1194
01:01:31,120 --> 01:01:33,680
The company the numbers and the timeline are representative.

1195
01:01:33,680 --> 01:01:35,640
The decisions are real.

1196
01:01:35,640 --> 01:01:39,880
The starting point Contoso has a customer support agent built in co-pilot studio.

1197
01:01:39,880 --> 01:01:44,200
It answers questions about product features, troubleshooting steps and warranty status.

1198
01:01:44,200 --> 01:01:48,240
The agent is connected to a SharePoint knowledge base with roughly 5000 articles.

1199
01:01:48,240 --> 01:01:54,040
It uses Azure AI search for retrieval and Azure open AI GPT-5 global for response generation.

1200
01:01:54,040 --> 01:01:57,200
The agent handles roughly 50,000 conversations per month.

1201
01:01:57,200 --> 01:02:01,960
In January 2026, the support team receives the first consolidated Azure invoice that includes

1202
01:02:01,960 --> 01:02:03,360
the full agent cost.

1203
01:02:03,360 --> 01:02:06,000
The total is $48,000 for the month.

1204
01:02:06,000 --> 01:02:07,920
The support director assumes it is a mistake.

1205
01:02:07,920 --> 01:02:10,000
It is not.

1206
01:02:10,000 --> 01:02:13,720
Breaking down the $48,000 reveals the three hidden taxes in action.

1207
01:02:13,720 --> 01:02:16,480
The context tax accounts for roughly $22,000.

1208
01:02:16,480 --> 01:02:20,560
The agent retrieves 30 chunks per query at an average of 600 tokens each.

1209
01:02:20,560 --> 01:02:25,640
With 50,000 conversations averaging three turns each that is 150,000 retrievals per month.

1210
01:02:25,640 --> 01:02:30,760
At GPT-5 global input pricing, the retrieval context alone consumes $22,000 before the user

1211
01:02:30,760 --> 01:02:32,560
question is even processed.

1212
01:02:32,560 --> 01:02:35,600
The reasoning tax accounts for roughly $18,000.

1213
01:02:35,600 --> 01:02:40,560
Every query from simple FAQ to complex troubleshooting is routed to GPT-5 global.

1214
01:02:40,560 --> 01:02:44,920
Roughly 70% of the queries are simple enough for GPT-5 mini or GPT-5 nano.

1215
01:02:44,920 --> 01:02:48,560
But the agent has no routing logic, so everything runs at the flagship rate.

1216
01:02:48,560 --> 01:02:52,760
The output tokens, which are eight times more expensive than input tokens, drive most of

1217
01:02:52,760 --> 01:02:57,760
this cost because the model generates long detailed responses even for one sentence questions.

1218
01:02:57,760 --> 01:03:00,760
The autonomous tax accounts for the remaining $8,000.

1219
01:03:00,760 --> 01:03:05,240
The agent is configured to proactively check order status for every returning customer, triggering

1220
01:03:05,240 --> 01:03:09,000
a power automate flow that queries dataverse and the Microsoft graph.

1221
01:03:09,000 --> 01:03:10,840
This feature seemed helpful during design.

1222
01:03:10,840 --> 01:03:14,200
In practice, it consumes 10 co-pilot credits per pro-active check.

1223
01:03:14,200 --> 01:03:18,240
And with 20,000 returning customers per month, that is 200,000 credits.

1224
01:03:18,240 --> 01:03:21,360
At one cent per credit on PAYG, that is $2,000.

1225
01:03:21,360 --> 01:03:26,240
The remaining $6,000 comes from Azure infrastructure, logging and data egress that were never modeled

1226
01:03:26,240 --> 01:03:27,800
during the pilot phase.

1227
01:03:27,800 --> 01:03:28,960
The audit phase.

1228
01:03:28,960 --> 01:03:31,960
The support team engages the platform team to run the audit sprint.

1229
01:03:31,960 --> 01:03:35,880
They clean up tags, enable diagnostic logging, and build the token flow map.

1230
01:03:35,880 --> 01:03:38,080
The map confirms what the invoice suggested.

1231
01:03:38,080 --> 01:03:41,360
The retrieval layer is the biggest problem, followed by the model layer followed by the

1232
01:03:41,360 --> 01:03:42,600
proactive automation.

1233
01:03:42,600 --> 01:03:46,840
The baseline workbook shows an average cost per interaction of 96 cents.

1234
01:03:46,840 --> 01:03:51,280
For comparison, a human support agent costs roughly $8 per interaction when fully loaded,

1235
01:03:51,280 --> 01:03:53,520
so the agent is still cheaper than humans.

1236
01:03:53,520 --> 01:03:56,160
But it is 10 times more expensive than it needs to be.

1237
01:03:56,160 --> 01:04:00,560
And with planned growth to 200,000 conversations per month, the projected annual cost would exceed

1238
01:04:00,560 --> 01:04:01,560
$1 million.

1239
01:04:01,560 --> 01:04:03,040
That is no longer a support tool.

1240
01:04:03,040 --> 01:04:06,160
That is a budget line item that competes with headcount.

1241
01:04:06,160 --> 01:04:07,760
Sprint 2, the quick wins.

1242
01:04:07,760 --> 01:04:11,560
In February, the team deploy semantic caching for the top 10 FAQ questions.

1243
01:04:11,560 --> 01:04:14,840
These questions account for roughly 45% of all queries.

1244
01:04:14,840 --> 01:04:19,880
The cache is implemented in Azure API management with a similarity threshold of 0.94.

1245
01:04:19,880 --> 01:04:23,200
Within two weeks, cache hit rate stabilizes at 52%.

1246
01:04:23,200 --> 01:04:26,240
That eliminates roughly 26,000 LLM calls per month.

1247
01:04:26,240 --> 01:04:28,080
The immediate saving is $12,000.

1248
01:04:28,080 --> 01:04:31,600
At the same time, the team enforces max tokens limits.

1249
01:04:31,600 --> 01:04:36,280
Bronze class queries, which include all FAQ and simple status checks, are capped at 120

1250
01:04:36,280 --> 01:04:37,280
tokens.

1251
01:04:37,280 --> 01:04:41,520
Silver class queries, which include standard troubleshooting, are capped at 250 tokens.

1252
01:04:41,520 --> 01:04:46,920
Only gold class queries, which require multi-step reasoning across documents, are allowed 500 tokens.

1253
01:04:46,920 --> 01:04:51,240
The average output length drops from 300 tokens to 180 tokens.

1254
01:04:51,240 --> 01:04:54,680
This saves roughly $6,000 per month in output tokens alone.

1255
01:04:54,680 --> 01:04:56,000
The team also tunes retrieval.

1256
01:04:56,000 --> 01:04:58,520
Top-K is reduced from 30 chunks to 5.

1257
01:04:58,520 --> 01:05:00,760
Hybrid search with semantic ranking is enabled.

1258
01:05:00,760 --> 01:05:05,040
And metadata filters are added so that warranty queries only search warranty articles and

1259
01:05:05,040 --> 01:05:07,760
troubleshooting queries only search troubleshooting guides.

1260
01:05:07,760 --> 01:05:12,800
The average retrieved context drops from 18,000 tokens per query to 4,000 tokens.

1261
01:05:12,800 --> 01:05:15,160
This saves roughly $9,000 per month.

1262
01:05:15,160 --> 01:05:20,160
By the end of February, the monthly cost has dropped from $48,000 to $21,000.

1263
01:05:20,160 --> 01:05:23,680
The cost per interaction has dropped from 96 cents to 42 cents.

1264
01:05:23,680 --> 01:05:27,600
And user satisfaction has actually improved slightly because the shorter answers are more

1265
01:05:27,600 --> 01:05:32,280
focused and the tuned retrieval returns more relevant context.

1266
01:05:32,280 --> 01:05:34,360
Sprint 3, the architecture shift.

1267
01:05:34,360 --> 01:05:39,160
In March, the team deploys LLM lingua compression on the silver and gold retrieval paths.

1268
01:05:39,160 --> 01:05:41,080
Compression targets 40% on average.

1269
01:05:41,080 --> 01:05:45,440
Some dense technical documents compressed by 60%, some sparse policy documents compressed

1270
01:05:45,440 --> 01:05:47,000
by only 25%.

1271
01:05:47,000 --> 01:05:51,520
The A/B test shows no measurable degradation in answer accuracy or user satisfaction.

1272
01:05:51,520 --> 01:05:55,000
The compression saves roughly $3,000 per month in input tokens.

1273
01:05:55,000 --> 01:05:58,400
The team also deploys the Azure AI Foundry model router.

1274
01:05:58,400 --> 01:06:00,320
Bronze queries are routed to GPT-5 Nano.

1275
01:06:00,320 --> 01:06:02,760
Silver queries are routed to GPT-5 Mini.

1276
01:06:02,760 --> 01:06:05,200
Gold queries reach GPT-5 Global.

1277
01:06:05,200 --> 01:06:08,720
The router adds a small latency overhead of roughly 50 milliseconds per call, which is

1278
01:06:08,720 --> 01:06:09,720
invisible to users.

1279
01:06:09,720 --> 01:06:12,720
The cost saving is $7,000 per month.

1280
01:06:12,720 --> 01:06:16,920
Combined with the other optimizations, the total monthly cost is now $11,000.

1281
01:06:16,920 --> 01:06:19,560
The team evaluates PTO for the gold workload.

1282
01:06:19,560 --> 01:06:24,920
The gold queries represent roughly 5% of traffic, but still run on GPT-5 Global.

1283
01:06:24,920 --> 01:06:28,520
The average utilization of a candidate PTO would be 72%.

1284
01:06:28,520 --> 01:06:30,680
That is below the 86% break even threshold.

1285
01:06:30,680 --> 01:06:32,040
The team stays on PAYG.

1286
01:06:32,040 --> 01:06:36,000
They also enable prompt prefix caching for the gold system prompt, which reduces the

1287
01:06:36,000 --> 01:06:39,880
effective input cost by roughly 15% of the governance lock-in.

1288
01:06:39,880 --> 01:06:43,640
By April, the architecture is optimised, but the team knows that without governance,

1289
01:06:43,640 --> 01:06:45,160
the costs will creep back.

1290
01:06:45,160 --> 01:06:46,600
They implement the full framework.

1291
01:06:46,600 --> 01:06:50,040
The support agent is classified as silver with gold escalation.

1292
01:06:50,040 --> 01:06:51,560
The cost dashboard is published.

1293
01:06:51,560 --> 01:06:56,560
The app team reviews it weekly and finances a chargeback of $11,000 per month instead

1294
01:06:56,560 --> 01:06:57,560
of $48,000.

1295
01:06:57,560 --> 01:07:01,760
The annual projection is now $132,000 instead of $1,000,000.

1296
01:07:01,760 --> 01:07:05,000
The support director no longer sees the agent as a budget risk.

1297
01:07:05,000 --> 01:07:06,560
They see it as a scalable asset.

1298
01:07:06,560 --> 01:07:11,280
And the platform team uses the contoso case as the template for every new agent request

1299
01:07:11,280 --> 01:07:12,560
in the organisation.

1300
01:07:12,560 --> 01:07:14,600
This is what cost architecture delivers.

1301
01:07:14,600 --> 01:07:17,640
Not incremental savings, but structural transformation.

1302
01:07:17,640 --> 01:07:23,640
From $48,000 to $11,000, from guessing to measuring, and from reactive panic to proactive design.

1303
01:07:23,640 --> 01:07:25,640
The anti-patents most teams miss.

1304
01:07:25,640 --> 01:07:28,280
Even teams that follow the framework make mistakes.

1305
01:07:28,280 --> 01:07:33,040
Here are the five anti-patents I see most often, and how to avoid them.

1306
01:07:33,040 --> 01:07:34,280
Anti-patent one.

1307
01:07:34,280 --> 01:07:35,880
The forever context window.

1308
01:07:35,880 --> 01:07:39,880
Some teams think that keeping the entire conversation history improves user experience.

1309
01:07:39,880 --> 01:07:43,080
So they append every prior turn to every new prompt.

1310
01:07:43,080 --> 01:07:46,120
After 10 turns, the prompt is longer than the retrieval context.

1311
01:07:46,120 --> 01:07:49,120
And the model is paying for the same chat history on every call.

1312
01:07:49,120 --> 01:07:51,200
The fix is conversation summarisation.

1313
01:07:51,200 --> 01:07:55,320
After every three turns, summarise the conversation into a single paragraph of key facts and

1314
01:07:55,320 --> 01:07:56,480
user intent.

1315
01:07:56,480 --> 01:07:57,480
Then drop the raw turns.

1316
01:07:57,480 --> 01:08:01,800
The model still knows what the user wants, but the token count drops by 80%.

1317
01:08:01,800 --> 01:08:06,240
This is especially important for multi-turn support agents and interview style bots.

1318
01:08:06,240 --> 01:08:07,240
Anti-patent 2.

1319
01:08:07,240 --> 01:08:08,840
The Vanitymetric Dashboard.

1320
01:08:08,840 --> 01:08:12,920
Teams love to build dashboards that show total requests, total users, and total conversations.

1321
01:08:12,920 --> 01:08:13,920
These are Vanitymetrics.

1322
01:08:13,920 --> 01:08:15,800
They tell you that the agent is popular.

1323
01:08:15,800 --> 01:08:17,840
They do not tell you whether it is efficient.

1324
01:08:17,840 --> 01:08:22,880
The metrics that matter are cost per interaction, tokens per interaction, cash hit rate, compression

1325
01:08:22,880 --> 01:08:26,040
ratio, rooting accuracy, and cost drift.

1326
01:08:26,040 --> 01:08:30,320
If your dashboard does not show these six numbers for every agent, it is a popularity contest

1327
01:08:30,320 --> 01:08:31,960
not a cost control tool.

1328
01:08:31,960 --> 01:08:32,960
Fix it.

1329
01:08:32,960 --> 01:08:33,960
Anti-patent 3.

1330
01:08:33,960 --> 01:08:35,680
The Model Upgrade Trap.

1331
01:08:35,680 --> 01:08:37,800
Microsoft releases new models every quarter.

1332
01:08:37,800 --> 01:08:41,840
Teams immediately upgrade to the latest flagship because it scores better on benchmarks.

1333
01:08:41,840 --> 01:08:45,320
They do not ask whether their specific tasks need the improvement.

1334
01:08:45,320 --> 01:08:49,960
For most FAQ classification and extraction workloads, the difference between GPT-4 and

1335
01:08:49,960 --> 01:08:54,600
GPT-5 is invisible to users, but the cost difference is very visible to finance.

1336
01:08:54,600 --> 01:08:56,000
The rule is simple.

1337
01:08:56,000 --> 01:08:59,880
Only upgrade the model if your evaluation suite shows a measurable improvement in your

1338
01:08:59,880 --> 01:09:03,320
specific task, not on general benchmarks, on your task.

1339
01:09:03,320 --> 01:09:09,080
If the improvement is less than 5%, and the cost increases 25%, the upgrade is a bad deal.

1340
01:09:09,080 --> 01:09:10,160
Anti-patent 4.

1341
01:09:10,160 --> 01:09:11,160
The Capacity Pack.

1342
01:09:11,160 --> 01:09:12,160
Guessing Game.

1343
01:09:12,160 --> 01:09:15,160
Teams buy co-pilot studio Capacity Packs based on fear.

1344
01:09:15,160 --> 01:09:18,560
They buy 10 packs because they are afraid of running out, then use only 4.

1345
01:09:18,560 --> 01:09:23,000
The unused credits expire every month, that is 6 packs of waste, or $1200 per month in

1346
01:09:23,000 --> 01:09:24,000
dead money.

1347
01:09:24,000 --> 01:09:27,000
The fix is to start with PAYG for the first 30 days.

1348
01:09:27,000 --> 01:09:32,040
Measure actual consumption, then size your packs to cover 120% of your observed average.

1349
01:09:32,040 --> 01:09:34,240
Not your projected peak, your observed average.

1350
01:09:34,240 --> 01:09:35,760
Use PAYG for the rest.

1351
01:09:35,760 --> 01:09:38,960
This gives you a predictable baseline and a variable overflow.

1352
01:09:38,960 --> 01:09:43,320
It is the same principle as PT-U plus PAYG in Azure OpenAI.

1353
01:09:43,320 --> 01:09:44,400
Anti-patent 5.

1354
01:09:44,400 --> 01:09:45,800
The governance bypass.

1355
01:09:45,800 --> 01:09:49,840
A platform team builds a beautiful service catalog and a rigorous intake process.

1356
01:09:49,840 --> 01:09:52,720
Then a senior executive wants a quick demo for a board meeting.

1357
01:09:52,720 --> 01:09:56,920
The executive's assistant spins up a new agent in a personal sandbox, connects it to a production

1358
01:09:56,920 --> 01:09:59,280
knowledge base, and shows it at the meeting.

1359
01:09:59,280 --> 01:10:02,320
The agent is never reviewed, never classified, and never capped.

1360
01:10:02,320 --> 01:10:05,000
Six months later, it is still running and consuming credits.

1361
01:10:05,000 --> 01:10:07,440
The fix is technical, not procedural.

1362
01:10:07,440 --> 01:10:10,800
Every production knowledge base should require platform authentication.

1363
01:10:10,800 --> 01:10:14,320
Every Azure OpenAI deployment should be behind APIM with rate limits.

1364
01:10:14,320 --> 01:10:18,400
And every co-pilot studio environment should have a tenant level credit cap that no individual

1365
01:10:18,400 --> 01:10:19,680
user can override.

1366
01:10:19,680 --> 01:10:23,200
If the architecture enforces the rules, the bypass is impossible.

1367
01:10:23,200 --> 01:10:27,040
If the architecture allows the bypass, no policy will stop it.

1368
01:10:27,040 --> 01:10:29,600
The final anti-patent, perfection paralysis.

1369
01:10:29,600 --> 01:10:32,320
Some teams never deploy because they are afraid of getting it wrong.

1370
01:10:32,320 --> 01:10:36,840
They want to model every scenario, forecast every edge case, and build the perfect architecture

1371
01:10:36,840 --> 01:10:38,320
before they launch a single agent.

1372
01:10:38,320 --> 01:10:42,000
They spend six months planning and never save a dollar because nothing is live.

1373
01:10:42,000 --> 01:10:43,880
The framework is designed to avoid this.

1374
01:10:43,880 --> 01:10:45,880
The second one is an audit, not a design.

1375
01:10:45,880 --> 01:10:48,040
Sprint 2 delivers quick wins in 30 days.

1376
01:10:48,040 --> 01:10:51,200
Sprint 3 builds the architecture on top of a working baseline.

1377
01:10:51,200 --> 01:10:52,520
You do not need perfection.

1378
01:10:52,520 --> 01:10:53,680
You need progress.

1379
01:10:53,680 --> 01:10:55,000
Start the audit this week.

1380
01:10:55,000 --> 01:10:57,440
Deploy caching next week, the rest follows.

1381
01:10:57,440 --> 01:11:01,120
Advanced scenarios, multi-tenant, multi-region, and hybrid workloads.

1382
01:11:01,120 --> 01:11:04,560
The core framework applies to single-tenant, single-region deployments.

1383
01:11:04,560 --> 01:11:07,160
But most enterprise-microsoft environments are more complex.

1384
01:11:07,160 --> 01:11:11,040
They serve multiple business units, span multiple geographies, and combine low-code

1385
01:11:11,040 --> 01:11:14,120
and co-pilot studio agents with pro-code Azure architectures.

1386
01:11:14,120 --> 01:11:16,000
These complexities do not change the principles.

1387
01:11:16,000 --> 01:11:17,960
They change the implementation.

1388
01:11:17,960 --> 01:11:19,560
Multi-tenant cost isolation.

1389
01:11:19,560 --> 01:11:23,600
In a multi-tenant SaaS product, each customer shares the same underlying infrastructure,

1390
01:11:23,600 --> 01:11:25,600
but should see separate cost accounting.

1391
01:11:25,600 --> 01:11:29,920
Without isolation, one customer's search becomes everyone else's bill, and without chargeback,

1392
01:11:29,920 --> 01:11:32,160
you cannot price your product correctly.

1393
01:11:32,160 --> 01:11:34,440
The solution is per-tenant routing at the gateway.

1394
01:11:34,440 --> 01:11:38,760
Azure API management supports tenant-specific API keys and subscription quotas.

1395
01:11:38,760 --> 01:11:42,840
Each tenant gets a key that maps to a dedicated rate limit, a dedicated cash partition,

1396
01:11:42,840 --> 01:11:44,200
and a dedicated cost tag.

1397
01:11:44,200 --> 01:11:46,840
The logs show exactly which tenant consumed which tokens.

1398
01:11:46,840 --> 01:11:48,440
The chargeback is automatic.

1399
01:11:48,440 --> 01:11:51,000
For caching, each tenant needs its own cash partition.

1400
01:11:51,000 --> 01:11:55,640
A semantic cache that mixes tenant-a's healthcare queries with tenant-b's financial queries

1401
01:11:55,640 --> 01:11:57,720
is a privacy risk and a quality risk.

1402
01:11:57,720 --> 01:12:00,120
Use redis-cache keys that include the tenant-id.

1403
01:12:00,120 --> 01:12:03,640
The router still works, the compressor still works, but the cash responses are never shared

1404
01:12:03,640 --> 01:12:04,640
across tenants.

1405
01:12:04,640 --> 01:12:08,000
For model routing, you may want tenant-specific routing profiles.

1406
01:12:08,000 --> 01:12:11,720
A healthcare tenant might require GPT-5 global for compliance reasons.

1407
01:12:11,720 --> 01:12:14,280
A retail tenant might be fine with GPT-5 mini.

1408
01:12:14,280 --> 01:12:18,600
The Azure AI Foundry model router does not support tenant-specific pools today, so you

1409
01:12:18,600 --> 01:12:20,720
may need a custom orchestrator layer.

1410
01:12:20,720 --> 01:12:24,080
The orchestrator inspects the tenant-id and routes to the appropriate deployment.

1411
01:12:24,080 --> 01:12:29,480
This adds complexity, but it is the only way to satisfy mixed compliance requirements.

1412
01:12:29,480 --> 01:12:31,440
Multi-region latency and cost-trade-offs.

1413
01:12:31,440 --> 01:12:33,640
Azure OpenI pricing varies by region.

1414
01:12:33,640 --> 01:12:37,560
EastUS is often cheaper than West Europe, but your users in Germany do not want their

1415
01:12:37,560 --> 01:12:39,600
data processed in Virginia.

1416
01:12:39,600 --> 01:12:43,760
Data residency requirements, latency requirements, and cost requirements often conflict.

1417
01:12:43,760 --> 01:12:46,120
The standard pattern is a hub and spoke architecture.

1418
01:12:46,120 --> 01:12:50,800
The primary region hosts the gold workloads, the centralized cash and the governance dashboard.

1419
01:12:50,800 --> 01:12:54,040
Secondary regions host bronze and silver workloads close to the users.

1420
01:12:54,040 --> 01:12:58,360
Azure Front Door routes traffic by geography, sending European users to West Europe and Asian

1421
01:12:58,360 --> 01:12:59,920
users to Southeast Asia.

1422
01:12:59,920 --> 01:13:04,280
Each region has its own APRM instance, its own cash, and its own rate limits.

1423
01:13:04,280 --> 01:13:07,760
The cost optimization is regional capacity planning.

1424
01:13:07,760 --> 01:13:11,480
If your West Europe traffic is steady and high, a yearly PTO in West Europe may make sense

1425
01:13:11,480 --> 01:13:12,640
for gold workloads.

1426
01:13:12,640 --> 01:13:16,880
If your Southeast Asia traffic is spiky and unpredictable, PAYG is better.

1427
01:13:16,880 --> 01:13:19,600
Do not apply the same capacity strategy to every region.

1428
01:13:19,600 --> 01:13:21,160
Each region gets its own analysis.

1429
01:13:21,160 --> 01:13:24,200
The telemetry aggregation is harder in multi-region setups.

1430
01:13:24,200 --> 01:13:29,040
You need a central log analytics workspace that ingests logs from all regional deployments.

1431
01:13:29,040 --> 01:13:33,080
Use Azure Monitor Cross Resource queries to compute global cost per interaction, regional

1432
01:13:33,080 --> 01:13:36,680
cost per interaction, and interregion data transfer costs.

1433
01:13:36,680 --> 01:13:41,960
Data egress between regions is often overlooked and can add 5 to 15% to the total bill.

1434
01:13:41,960 --> 01:13:43,760
Hybrid low-code and Procode.

1435
01:13:43,760 --> 01:13:48,400
Most large Microsoft environments are not purely Azure AI Foundry or purely CoPilot Studio.

1436
01:13:48,400 --> 01:13:49,920
They are hybrid.

1437
01:13:49,920 --> 01:13:52,720
CoPilot Studio handles the low-code business-led agents.

1438
01:13:52,720 --> 01:13:57,880
Azure AI Foundry handles the Procode engineering-led agents and Power Automate connects the two.

1439
01:13:57,880 --> 01:14:00,520
The cost danger in hybrid environments is double billing.

1440
01:14:00,520 --> 01:14:04,720
A CoPilot Studio agent that calls a custom Azure Function which then calls Azure OpenAI

1441
01:14:04,720 --> 01:14:06,520
might be billed in both places.

1442
01:14:06,520 --> 01:14:10,640
The CoPilot Studio agent consumes credits for the action call, the Azure Function consumes

1443
01:14:10,640 --> 01:14:12,600
Azure Compute and Storage.

1444
01:14:12,600 --> 01:14:15,080
And the Azure OpenAI deployment consumes tokens.

1445
01:14:15,080 --> 01:14:18,200
If you are not tagging the Azure Function and correlating its logs with the CoPilot

1446
01:14:18,200 --> 01:14:22,320
Studio session, you will see two separate line items and never know they belong to the

1447
01:14:22,320 --> 01:14:23,600
same user request.

1448
01:14:23,600 --> 01:14:25,760
The fixes end-to-end correlation IDs.

1449
01:14:25,760 --> 01:14:28,280
Every user session gets a unique correlation ID.

1450
01:14:28,280 --> 01:14:32,160
The CoPilot Studio agent passes it to the Azure Function, the Azure Function passes it

1451
01:14:32,160 --> 01:14:33,840
to Azure OpenAI.

1452
01:14:33,840 --> 01:14:36,040
Every log entry includes the correlation ID.

1453
01:14:36,040 --> 01:14:40,160
Then your KQL queries can join the logs across services and compute the true end-to-end

1454
01:14:40,160 --> 01:14:41,640
cost per interaction.

1455
01:14:41,640 --> 01:14:45,480
Without correlation IDs, hybrid cost analysis is impossible.

1456
01:14:45,480 --> 01:14:47,400
Another hybrid danger is model inconsistency.

1457
01:14:47,400 --> 01:14:52,240
A CoPilot Studio agent might use GPT-5 global through the built-in generative answers feature.

1458
01:14:52,240 --> 01:14:57,080
A Procode agent in the same organization might use GPT-5 mini through Azure AI Foundry.

1459
01:14:57,080 --> 01:15:00,800
They serve the same user base with the same questions, but one costs eight times more

1460
01:15:00,800 --> 01:15:01,800
per token.

1461
01:15:01,800 --> 01:15:04,040
The governance fix is the service catalog.

1462
01:15:04,040 --> 01:15:07,640
Every agent, regardless of platform, must register its model assignment.

1463
01:15:07,640 --> 01:15:11,960
The platform team reviews quarterly and consolidates where possible.

1464
01:15:11,960 --> 01:15:13,760
Seasonal and event-driven spikes.

1465
01:15:13,760 --> 01:15:18,240
Some workloads are not steady because retail agent spike in November, tax agent spike in

1466
01:15:18,240 --> 01:15:23,040
March, and event-driven agent spike when a product launches or a crisis hits.

1467
01:15:23,040 --> 01:15:25,520
Steady state architecture fails for these patterns.

1468
01:15:25,520 --> 01:15:29,040
The solution is elastic throttling combined with burst PAYG.

1469
01:15:29,040 --> 01:15:33,240
Maintain a small Ptu or capacity pack for baseline demand, then use PAYG for the spikes.

1470
01:15:33,240 --> 01:15:37,680
In Azure API management, configure burst policies that allow temporary traffic increases

1471
01:15:37,680 --> 01:15:41,720
of 200 to 500% for a limited time window.

1472
01:15:41,720 --> 01:15:45,000
After the window, traffic falls back to the baseline rate limit.

1473
01:15:45,000 --> 01:15:47,480
Also pre-warm your cache before predictable spikes.

1474
01:15:47,480 --> 01:15:51,560
If you know the Black Friday will generate a surge of the same 20 questions, pre-populate

1475
01:15:51,560 --> 01:15:53,760
the semantic cache with those answers.

1476
01:15:53,760 --> 01:15:58,080
You can do this by running a script that sends the top 20 queries through the APM endpoint

1477
01:15:58,080 --> 01:15:59,760
a few hours before the spike.

1478
01:15:59,760 --> 01:16:05,120
The cache is warm, and the real user traffic hits cache responses instead of the LLM.

1479
01:16:05,120 --> 01:16:08,800
For unpredictable spikes, the safeguard is the budget alert at 50%.

1480
01:16:08,800 --> 01:16:13,080
If a surprise event drives traffic up, the alert fires before the bill explodes.

1481
01:16:13,080 --> 01:16:17,360
The platform team can then temporarily promote the agent to a higher tier, increase the rate

1482
01:16:17,360 --> 01:16:20,440
limit, or add emergency PAYG capacity.

1483
01:16:20,440 --> 01:16:24,560
The key is that the response is triggered by data, not by an angry finance email two weeks

1484
01:16:24,560 --> 01:16:25,560
later.

1485
01:16:25,560 --> 01:16:26,880
The governance scaling rule.

1486
01:16:26,880 --> 01:16:32,080
As your environment grows from 10 agents to 100 agents, governance must scale without creating

1487
01:16:32,080 --> 01:16:33,280
a bottleneck.

1488
01:16:33,280 --> 01:16:36,480
The 100 agent milestone is where many organizations fail.

1489
01:16:36,480 --> 01:16:38,880
The platform team cannot review every agent personally.

1490
01:16:38,880 --> 01:16:43,160
The finance team cannot negotiate every capacity pack individually, and the app teams cannot

1491
01:16:43,160 --> 01:16:45,320
wait two weeks for sandbox approval.

1492
01:16:45,320 --> 01:16:47,640
The scaling rule is automation and self-service.

1493
01:16:47,640 --> 01:16:52,280
The intake form becomes a web app that validates cost class automatically based on the five questions.

1494
01:16:52,280 --> 01:16:56,200
The service catalog becomes a terraform or bicep template repository that provisions approved

1495
01:16:56,200 --> 01:16:57,720
infrastructure in minutes.

1496
01:16:57,720 --> 01:17:01,640
The cost dashboard becomes a power BI report that auto refreshes.

1497
01:17:01,640 --> 01:17:06,200
And the quarterly review becomes a 30 minute stand-up where only exceptions are discussed.

1498
01:17:06,200 --> 01:17:08,120
What does not scale is personal approval.

1499
01:17:08,120 --> 01:17:11,640
If every agent requires a meeting, the system breaks at 50 agents.

1500
01:17:11,640 --> 01:17:14,480
Design your governance for 100 agents on day one.

1501
01:17:14,480 --> 01:17:18,040
Even if you only have five agents today, the processes you build now determine whether

1502
01:17:18,040 --> 01:17:19,520
you can grow later.

1503
01:17:19,520 --> 01:17:20,960
The emergency break.

1504
01:17:20,960 --> 01:17:21,960
When costs spiral.

1505
01:17:21,960 --> 01:17:26,760
No matter how well you architect, surprises happen, such as an agent going viral on social

1506
01:17:26,760 --> 01:17:32,040
media, a marketing campaign, driving 10 times expected traffic, or a code bug creating an

1507
01:17:32,040 --> 01:17:35,160
infinite loop that burns tokens at midnight.

1508
01:17:35,160 --> 01:17:38,760
You need an emergency break that stops the spend before it becomes a career event.

1509
01:17:38,760 --> 01:17:42,200
The first break is the Azure API management circuit breaker.

1510
01:17:42,200 --> 01:17:45,680
There are a policy that detects error rates or latency spikes above a threshold and automatically

1511
01:17:45,680 --> 01:17:48,080
returns a static fallback response.

1512
01:17:48,080 --> 01:17:52,280
If your agent normally responds in two seconds and suddenly starts timing out at 30 seconds,

1513
01:17:52,280 --> 01:17:53,280
something is wrong.

1514
01:17:53,280 --> 01:17:56,160
The circuit breaker stops the bleeding and alerts the platform team.

1515
01:17:56,160 --> 01:17:59,440
The second break is the co-pilot studio emergency cap.

1516
01:17:59,440 --> 01:18:04,560
In addition to the standard monthly credit cap, set a daily cap at 50% of the monthly budget.

1517
01:18:04,560 --> 01:18:09,360
If an agent consumes its entire daily allocation by noon, it shuts down for the rest of the day.

1518
01:18:09,360 --> 01:18:13,080
Users see a polite message explaining that the service is temporarily unavailable.

1519
01:18:13,080 --> 01:18:17,600
This is not ideal user experience, but it is far better than a $10,000 surprise.

1520
01:18:17,600 --> 01:18:19,840
The third break is the Azure budget hard limit.

1521
01:18:19,840 --> 01:18:24,000
While most budgets are set to alert, you can configure some budgets to trigger automation.

1522
01:18:24,000 --> 01:18:29,320
A logic app can receive the 100% budget alert and execute a runbook that disables non-essential

1523
01:18:29,320 --> 01:18:34,480
agent deployments, scales down Azure open AI endpoints, or redirects traffic to a static

1524
01:18:34,480 --> 01:18:35,800
FAQ page.

1525
01:18:35,800 --> 01:18:39,760
This is the nuclear option, so use it only for truly catastrophic scenarios, but know

1526
01:18:39,760 --> 01:18:40,760
that it exists.

1527
01:18:40,760 --> 01:18:42,480
The fourth break is human judgment.

1528
01:18:42,480 --> 01:18:47,160
Every platform team needs an on-call rotation with the authority to throttle or disable agents.

1529
01:18:47,160 --> 01:18:50,640
The authority must include the ability to override app team objections in real time.

1530
01:18:50,640 --> 01:18:55,160
If a bug is burning $5,000 per hour, the platform engineer must be able to pull the plug without

1531
01:18:55,160 --> 01:18:57,640
waiting for a change advisory board meeting.

1532
01:18:57,640 --> 01:19:01,800
Document this authority, train the engineers, and practice the scenario in a tabletop

1533
01:19:01,800 --> 01:19:03,800
exercise before it happens for real.

1534
01:19:03,800 --> 01:19:08,840
The final break is post-incident learning, after every cost spike run a blameless post-mortem.

1535
01:19:08,840 --> 01:19:12,600
Ask what happened, why the monitoring did not catch it sooner, and what architectural

1536
01:19:12,600 --> 01:19:14,440
change would prevent a recurrence.

1537
01:19:14,440 --> 01:19:18,880
Update the runbook, update the alert thresholds, and share the learnings across all app teams.

1538
01:19:18,880 --> 01:19:22,640
A spike that teaches the organization is expensive, but valuable.

1539
01:19:22,640 --> 01:19:26,560
A spike that repeats because nobody learned is just waste.

1540
01:19:26,560 --> 01:19:28,520
The weekly measurement checklist.

1541
01:19:28,520 --> 01:19:31,360
Cost architecture is not a set and forget system.

1542
01:19:31,360 --> 01:19:35,680
It requires weekly attention for the first 90 days, then monthly attention thereafter.

1543
01:19:35,680 --> 01:19:39,040
The following checklist takes 15 minutes and should be completed by the platform team

1544
01:19:39,040 --> 01:19:40,520
lead every Monday morning.

1545
01:19:40,520 --> 01:19:42,360
First check the cost dashboard.

1546
01:19:42,360 --> 01:19:45,400
Verify that total AI spend is within 5% of the prior week.

1547
01:19:45,400 --> 01:19:47,320
If it is higher, drill down by agent.

1548
01:19:47,320 --> 01:19:50,680
If a single agent cause the jump, open a ticket with the app team owner.

1549
01:19:50,680 --> 01:19:54,640
If the jump is broad-based, check whether a new model or a new feature was deployed.

1550
01:19:54,640 --> 01:19:56,520
Second check cash hit rates.

1551
01:19:56,520 --> 01:19:59,640
Semantic caching should deliver 40% or higher for mature agents.

1552
01:19:59,640 --> 01:20:03,360
If hit rate drops, investigate whether the user question patterns have changed.

1553
01:20:03,360 --> 01:20:07,520
A new product launch or a policy change can shift the question distribution and invalidate

1554
01:20:07,520 --> 01:20:08,920
your cashed answers.

1555
01:20:08,920 --> 01:20:11,560
Third check LLM lingua compression ratios.

1556
01:20:11,560 --> 01:20:14,600
The ratio should be stable within 10% week over week.

1557
01:20:14,600 --> 01:20:17,520
If compression drops, the input documents may have changed.

1558
01:20:17,520 --> 01:20:22,760
New document types, longer paragraphs or denser technical content can reduce compressibility,

1559
01:20:22,760 --> 01:20:24,800
retune the compression target if needed.

1560
01:20:24,800 --> 01:20:26,760
Fourth check model routing accuracy.

1561
01:20:26,760 --> 01:20:29,920
Review the aggregate cost per request and the P95 latency.

1562
01:20:29,920 --> 01:20:34,080
If cost per request rises while volume stays flat, the router may be sending more queries

1563
01:20:34,080 --> 01:20:35,760
to expensive models than before.

1564
01:20:35,760 --> 01:20:39,280
This can happen if user questions become more complex or if a prompt change made the

1565
01:20:39,280 --> 01:20:41,240
queries look more difficult to the router.

1566
01:20:41,240 --> 01:20:43,680
Fifth check credit caps and rate limits.

1567
01:20:43,680 --> 01:20:46,320
Verify that no agent hit its cap unintentionally.

1568
01:20:46,320 --> 01:20:50,680
If an agent hit its cap three weeks in a row, either the cap is too low or the agent is

1569
01:20:50,680 --> 01:20:51,680
growing organically.

1570
01:20:51,680 --> 01:20:53,520
Either way, schedule a review.

1571
01:20:53,520 --> 01:20:56,160
Sixth check for new resources that lack tags.

1572
01:20:56,160 --> 01:21:00,600
Any Azure open AI deployment, any co-pilot studio environment or any APIM instance created

1573
01:21:00,600 --> 01:21:03,280
in the last week should have the mandatory tags.

1574
01:21:03,280 --> 01:21:06,400
Untagged resources are invisible to charge back and cost control.

1575
01:21:06,400 --> 01:21:09,440
Tag them immediately or delete them if they are unauthorized.

1576
01:21:09,440 --> 01:21:11,880
Seventh review the upcoming calendar.

1577
01:21:11,880 --> 01:21:15,460
Check for any product launches, marketing campaigns or seasonal events this week that could

1578
01:21:15,460 --> 01:21:16,460
drive traffic.

1579
01:21:16,460 --> 01:21:21,240
If there are, verify that the cache is pre-warmed, the rate limits are adjusted and the budget

1580
01:21:21,240 --> 01:21:24,400
alert thresholds are temporarily lowered for closer monitoring.

1581
01:21:24,400 --> 01:21:28,620
Eighth verify that no new models were added to the Azure AI Foundry catalog that could

1582
01:21:28,620 --> 01:21:30,120
change your routing economics.

1583
01:21:30,120 --> 01:21:34,520
A new mid-tier model that is cheaper than GPT-5 Mini for your specific task profile could

1584
01:21:34,520 --> 01:21:38,160
drop your per request cost by another 10 to 20%.

1585
01:21:38,160 --> 01:21:40,640
Test it in the sandbox before promoting it to production.

1586
01:21:40,640 --> 01:21:45,720
Ninth check the cost class compliance of any agent deployed in the last seven days.

1587
01:21:45,720 --> 01:21:49,480
New agents should not graduate from sandbox to production without platform review, but

1588
01:21:49,480 --> 01:21:50,880
sometimes teams rush.

1589
01:21:50,880 --> 01:21:55,200
Verify that every new production agent has the correct tags, the correct model assignment

1590
01:21:55,200 --> 01:21:56,600
and the correct credit cap.

1591
01:21:56,600 --> 01:22:00,560
If an agent is missing any of these, flag it and require remediation before the next billing

1592
01:22:00,560 --> 01:22:01,560
cycle.

1593
01:22:01,560 --> 01:22:03,680
Tenth share the weekly summary.

1594
01:22:03,680 --> 01:22:06,440
Post a three-bullet update to your internal team channel.

1595
01:22:06,440 --> 01:22:08,160
Total spend this week versus last week.

1596
01:22:08,160 --> 01:22:10,000
Any anomalies found and their status.

1597
01:22:10,000 --> 01:22:14,000
And any upcoming changes like model tests or cache reconfigurations.

1598
01:22:14,000 --> 01:22:17,440
This transparency keeps every stakeholder aligned without requiring a meeting.

1599
01:22:17,440 --> 01:22:18,640
This checklist is not optional.

1600
01:22:18,640 --> 01:22:21,800
It is the operational heartbeat of your cost architecture.

1601
01:22:21,800 --> 01:22:24,760
Skip it for two weeks and small problems become big problems.

1602
01:22:24,760 --> 01:22:26,880
Skip it for a month and you are back to guessing.

1603
01:22:26,880 --> 01:22:30,640
The organizations that sustain their cost savings are the ones that treat this 15-minute

1604
01:22:30,640 --> 01:22:35,280
ritual with the same discipline as their security patching cycle or their backup verification.

1605
01:22:35,280 --> 01:22:36,480
It is maintenance.

1606
01:22:36,480 --> 01:22:41,440
And maintenance is what separates professional architecture from experimental hobby projects.

1607
01:22:41,440 --> 01:22:43,080
The architecture at a glance.

1608
01:22:43,080 --> 01:22:46,720
If you take nothing else from this episode, take the four levels.

1609
01:22:46,720 --> 01:22:51,120
Everything repeats with semantic caching, shrinking prompts with LLM-lingua compression, matching

1610
01:22:51,120 --> 01:22:56,040
tasks to models with Azure AI Foundry Roating and buying capacity correctly with PAYG for

1611
01:22:56,040 --> 01:23:00,000
variable workloads and PTO only for steady high volume production.

1612
01:23:00,000 --> 01:23:03,560
Deploy these four levers behind a governance model that classifies every agent before it

1613
01:23:03,560 --> 01:23:07,720
is built, audits every deployment after it is live and enforces budgets automatically

1614
01:23:07,720 --> 01:23:11,040
through Azure API management and co-pilot studio caps.

1615
01:23:11,040 --> 01:23:12,320
That is the complete framework.

1616
01:23:12,320 --> 01:23:15,960
It is not theoretical and it is running in production environments today and it can be running

1617
01:23:15,960 --> 01:23:18,080
in yours within 90 days.

1618
01:23:18,080 --> 01:23:20,560
Cost architecture beats cost accounting every time.

1619
01:23:20,560 --> 01:23:25,040
The four levers, the three cost classes and the 90-day roadmap are your structural defense

1620
01:23:25,040 --> 01:23:26,600
against runaway spend.

1621
01:23:26,600 --> 01:23:30,760
Build the framework, run the audit and deploy the levers before your next billing cycle.

1622
01:23:30,760 --> 01:23:32,520
Start this week because November is coming.

1623
01:23:32,520 --> 01:23:35,160
And every experiment you build for free is about to start charging.

Mirko Peters Profile Photo

Founder of m365.fm, m365.show and m365con.net

Mirko Peters is a Microsoft 365 expert, content creator, and founder of m365.fm, a platform dedicated to sharing practical insights on modern workplace technologies. His work focuses on Microsoft 365 governance, security, collaboration, and real-world implementation strategies.

Through his podcast and written content, Mirko provides hands-on guidance for IT professionals, architects, and business leaders navigating the complexities of Microsoft 365. He is known for translating complex topics into clear, actionable advice, often highlighting common mistakes and overlooked risks in real-world environments.

With a strong emphasis on community contribution and knowledge sharing, Mirko is actively building a platform that connects experts, shares experiences, and helps organizations get the most out of their Microsoft 365 investments.