The Problem: Azure's 4,000 Role Assignment Limit
Each Azure subscription has a hard limit of 4,000 Azure RBAC role assignments. For enterprise environments running hundreds of application services across multiple regions and lifecycles, this limit becomes a significant constraint.
In our case, we manage hundreds of Key Vaults supporting custom App Services, Function Apps, and Application Gateway listeners. Each service is deployed across multiple regions (typically 2-4) and four lifecycle environments (dev, test, staging, production). Following least privilege principles, each application component requires its own Key Vault with isolated access controls. You can't share one vault across everything when different teams need different access scopes.
The math compounds quickly. A single application with components in 3 regions across 4 lifecycles generates 12+ Key Vaults, each requiring multiple role assignments for managed identities, service principals, and automation accounts. Multiply this across dozens of applications, and RBAC assignments accumulate faster than expected.
The failure mode is abrupt. Deployments succeed until you hit 4,000 assignments, then pipelines fail with generic permission errors. By the time the limit surfaces, you're already debugging production issues instead of planning capacity.
The Solution: A Weekly RBAC Quota Check
I built a Python Azure Function that runs every Tuesday morning to monitor RBAC quota usage proactively:
- Crawls all subscriptions in our management group hierarchy
- Counts RBAC role assignments in each subscription (excluding inherited assignments from parent management groups, which don't count against the limit)
- Alerts at 80% usage (3,200 assignments out of 4,000)
- Creates Azure DevOps tickets with subscription details and remediation guidance
The 80% threshold provides operational headroom. When a subscription approaches this limit, we still have capacity to add RBAC assignments to existing Key Vaults as requirements evolve, or deploy a handful of additional vaults before hitting the hard ceiling. This buffer is critical because our landing zone repository deploys foundational infrastructure by looping through Key Vaults with a single provider configuration per environment. Routing specific vaults to different subscriptions mid-deployment requires significant refactoring of Terraform modules and iteration logic—tedious work that's better avoided through capacity planning.
The function runs automatically without manual intervention, alerting only when thresholds are exceeded.
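The threshold check itself is simple arithmetic. Here is a minimal sketch using the 4,000 limit and 80% threshold from above. The `QuotaStatus` and `evaluate_quota` names are illustrative, not taken from the actual function, and treating exactly 80% as alert-worthy is an assumption.

```python
from dataclasses import dataclass

# Values from the text: 4,000-assignment limit, alert at 80%.
ROLE_ASSIGNMENT_LIMIT = 4000
ALERT_THRESHOLD_PERCENT = 80


@dataclass
class QuotaStatus:
    count: int
    percent_used: float
    remaining: int
    should_alert: bool


def evaluate_quota(count: int,
                   limit: int = ROLE_ASSIGNMENT_LIMIT,
                   threshold_percent: int = ALERT_THRESHOLD_PERCENT) -> QuotaStatus:
    """Compare a subscription's role assignment count against the alert threshold."""
    return QuotaStatus(
        count=count,
        percent_used=round(100 * count / limit, 1),
        remaining=limit - count,
        # Integer comparison avoids floating-point edge cases at the boundary.
        # Assumption: reaching the threshold exactly (3,200) already alerts.
        should_alert=count * 100 >= threshold_percent * limit,
    )
```

For a subscription with 3,200 assignments, `evaluate_quota(3200)` reports 80.0% used, 800 remaining, and `should_alert=True`.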
Avoiding Ticket Spam
A monitoring system that generates duplicate alerts creates noise instead of value. The implementation includes safeguards to prevent ticket spam:
Tag-Based Tracking
Every ticket created by the function receives the tag ccoetools-rbac-quota-usage. Before creating a new ticket, the function queries Azure DevOps using WIQL (Work Item Query Language) to check existing work items:
- If an open ticket already exists for the subscription → Skip creation
- If a ticket was closed within the last 7 days → Skip creation (cooldown period)
This prevents duplicate tickets when a subscription already has active remediation work in progress. The 7-day cooldown covers the case where a ticket is closed after cleanup briefly drops usage below the threshold, but usage then climbs back above it. Holding off for a week avoids re-alerting before teams have finished their remediation strategy.
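The two skip rules can be sketched as a pure decision function plus the WIQL text that feeds it. This is a sketch under assumptions: it assumes the subscription ID appears in the ticket title, that each work item has been reduced to a dict with `state` and `closed_at` keys, and that the listed states count as closed (this depends on the process template). None of those details come from the actual implementation.

```python
from datetime import datetime, timedelta, timezone

TICKET_TAG = "ccoetools-rbac-quota-usage"      # tag named in the text
COOLDOWN = timedelta(days=7)                   # cooldown period from the text
CLOSED_STATES = {"Closed", "Done", "Removed"}  # assumption: varies by process template


def build_wiql(subscription_id: str) -> str:
    """WIQL for tagged work items that mention one subscription.

    Assumption: the subscription ID is part of the ticket title.
    """
    return (
        "SELECT [System.Id] FROM WorkItems "
        f"WHERE [System.Tags] CONTAINS '{TICKET_TAG}' "
        f"AND [System.Title] CONTAINS '{subscription_id}'"
    )


def should_create_ticket(work_items: list, now: datetime) -> bool:
    """Return False if an open ticket exists or one was closed within the cooldown."""
    for item in work_items:
        if item["state"] not in CLOSED_STATES:
            return False  # open ticket: remediation is already in progress
        closed_at = item.get("closed_at")
        if closed_at is not None and now - closed_at <= COOLDOWN:
            return False  # recently closed: still inside the cooldown window
    return True


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    recent = [{"state": "Closed", "closed_at": now - timedelta(days=3)}]
    print(should_create_ticket(recent, now))
```

Keeping the decision separate from the HTTP call makes the rules easy to unit test without touching Azure DevOps.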
The Exclusion List
Some subscriptions intentionally operate near capacity or are already under active remediation planning. The EXCLUDED_SUBSCRIPTIONS environment variable allows specific subscriptions to be excluded from monitoring, preventing ticket generation during planned maintenance windows or known remediation periods.
Configuration Through Environment Variables
As organizational needs evolve, the function's scope adapts without code changes. Two key environment variables control monitoring behavior:
- MANAGEMENT_GROUP_IDS - Defines which management group hierarchies to scan. Updating this variable shifts monitoring scope to different organizational units without redeploying function code.
- EXEMPT_SUBSCRIPTIONS - Lists subscriptions permanently excluded from monitoring. When subscriptions reach maximum capacity indefinitely (such as legacy environments being phased out), adding them to this list prevents ongoing alerts without removing the underlying resources.
This environment-driven configuration separates operational decisions from application logic, allowing platform teams to adjust monitoring scope through configuration updates rather than code deployments.
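Here is a rough sketch of how this configuration might be read. The variable names come from this post, but the comma-separated format and the `parse_id_list`, `load_config`, and `is_monitored` helpers are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def parse_id_list(raw: Optional[str]) -> set:
    """Split a comma-separated value into a set of trimmed, non-empty IDs.

    Assumption: the variables hold comma-separated lists.
    """
    if not raw:
        return set()
    return {part.strip() for part in raw.split(",") if part.strip()}


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the monitoring scope from environment variables."""
    return {
        "management_group_ids": parse_id_list(env.get("MANAGEMENT_GROUP_IDS")),
        "exempt_subscriptions": parse_id_list(env.get("EXEMPT_SUBSCRIPTIONS")),
        "excluded_subscriptions": parse_id_list(env.get("EXCLUDED_SUBSCRIPTIONS")),
    }


def is_monitored(subscription_id: str, config: dict) -> bool:
    """A subscription is monitored unless it appears in either skip list."""
    skipped = config["exempt_subscriptions"] | config["excluded_subscriptions"]
    # Subscription IDs are GUIDs, so compare case-insensitively.
    return subscription_id.lower() not in {s.lower() for s in skipped}
```

Passing the environment in as a mapping keeps the parsing testable with a plain dict instead of real process variables.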
Automatic Lifecycle Detection
One neat trick: the function figures out whether a subscription is Dev, Test, Staging, or Production by looking at its management group hierarchy. If the subscription lives under a management group with "Production" in the name, the ticket gets tagged as Prd. This helps route tickets to the right people and set appropriate priority.
lifecycle_map = {
    "development": "Dev",
    "test": "Tst",
    "staging": "Stg",
    "production": "Prd",
}
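One way the lookup could work is sketched below. The `detect_lifecycle` helper is hypothetical, and it assumes the caller supplies the subscription's ancestor management group names ordered from nearest to furthest.

```python
from typing import Iterable, Optional

lifecycle_map = {
    "development": "Dev",
    "test": "Tst",
    "staging": "Stg",
    "production": "Prd",
}


def detect_lifecycle(management_group_names: Iterable[str]) -> Optional[str]:
    """Return the lifecycle tag for the first ancestor whose name contains a keyword.

    Assumption: names are ordered nearest ancestor first.
    """
    for name in management_group_names:
        lowered = name.lower()  # case-insensitive, so "Production" matches "production"
        for keyword, tag in lifecycle_map.items():
            if keyword in lowered:
                return tag
    return None  # no lifecycle keyword anywhere in the hierarchy


if __name__ == "__main__":
    print(detect_lifecycle(["Payments-Production", "Tenant Root Group"]))
```

Substring matching is loose: a group named "Latest" would match "test". If group names are less tidy than ours, matching on whole words is the safer choice.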
The Tech Stack
- Azure Functions (Python v2 programming model) - Timer trigger, runs on a cron schedule
- Azure Resource Graph - For discovering subscriptions in management groups
- Azure Management API - For counting role assignments (with pagination handling)
- Azure DevOps REST API - For searching existing tickets and creating new ones
- Managed Identity - No credentials to manage, the function just uses its own identity
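The counting step, including pagination and the inherited-assignment filter, might look roughly like this. It is a sketch, not the actual function: `get_json` is a hypothetical callable that performs an authenticated GET and returns parsed JSON, and the API version shown is an assumption.

```python
from typing import Callable

API_VERSION = "2022-04-01"  # assumption: any supported roleAssignments API version works


def count_direct_assignments(subscription_id: str,
                             get_json: Callable[[str], dict]) -> int:
    """Count role assignments scoped at or below the subscription.

    Assignments inherited from management groups (or root scope "/") are
    skipped because they do not count against the subscription limit.
    """
    prefix = f"/subscriptions/{subscription_id}".lower()
    url = (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/providers/Microsoft.Authorization/roleAssignments?api-version={API_VERSION}"
    )
    count = 0
    while url:
        page = get_json(url)
        for assignment in page.get("value", []):
            scope = assignment["properties"]["scope"].lower()
            if scope == prefix or scope.startswith(prefix + "/"):
                count += 1
        # Follow the continuation link until the service stops returning one.
        url = page.get("nextLink")
    return count
```

Injecting `get_json` keeps token handling (for example via the function's managed identity) out of the counting logic and lets the loop run against canned pages in tests.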
What a Ticket Looks Like
When the function does create a ticket, it's not just "hey there's a problem." It includes:
- Subscription name and ID
- Current assignment count
- Percentage used
- Specific remediation steps (we have a script that cleans up orphaned assignments from deleted service principals)
- Link to update the enterprise application templates if we need to rotate to a new subscription
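Azure DevOps creates work items from a JSON Patch document, so the ticket content above could be assembled along these lines. The title wording, description layout, and `build_work_item_patch` name are illustrative assumptions. The real ticket also carries links that are omitted here.

```python
from html import escape
from typing import Optional

TICKET_TAG = "ccoetools-rbac-quota-usage"  # tag named in the text


def build_work_item_patch(subscription_name: str,
                          subscription_id: str,
                          count: int,
                          limit: int = 4000,
                          lifecycle: Optional[str] = None) -> list:
    """Build the JSON Patch body for a new work item (illustrative field content)."""
    percent_used = round(100 * count / limit, 1)
    title = f"RBAC quota at {percent_used}% for {subscription_name} ({subscription_id})"
    # System.Description is rendered as HTML, so escape user-controlled values.
    description = (
        f"<p>Subscription: {escape(subscription_name)} ({escape(subscription_id)})</p>"
        f"<p>Role assignments: {count} of {limit} ({percent_used}% used)</p>"
        "<p>Remediation: clean up orphaned assignments from deleted service "
        "principals, or plan a rotation to a new subscription.</p>"
    )
    tags = [TICKET_TAG] + ([lifecycle] if lifecycle else [])
    return [
        {"op": "add", "path": "/fields/System.Title", "value": title},
        {"op": "add", "path": "/fields/System.Description", "value": description},
        # Work item tags are sent as a single semicolon-separated string.
        {"op": "add", "path": "/fields/System.Tags", "value": "; ".join(tags)},
    ]
```

The resulting list is sent as the request body with the `application/json-patch+json` content type.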
Dry Run Mode
The DRY_RUN environment variable enables testing without creating actual tickets. When set to true, the function logs all actions it would take without modifying Azure DevOps work items. This is useful for validating subscription scoping, threshold calculations, and ticket content before enabling production alerting.
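A minimal sketch of the dry-run gate follows. The `is_dry_run` and `submit_ticket` helpers are hypothetical, `create_work_item` stands in for whatever actually calls the Azure DevOps API, and defaulting to a live run when the variable is unset is an assumption.

```python
import logging
import os
from typing import Callable, Mapping, Optional

logger = logging.getLogger("rbac_quota")


def is_dry_run(env: Mapping[str, str] = os.environ) -> bool:
    """Treat DRY_RUN=true (any casing) as enabled; assumption: default is off."""
    return env.get("DRY_RUN", "false").strip().lower() == "true"


def submit_ticket(patch: list,
                  create_work_item: Callable[[list], int],
                  dry_run: bool) -> Optional[int]:
    """Create the work item, or only log what would have been created."""
    if dry_run:
        logger.info("DRY_RUN: would create work item with payload: %s", patch)
        return None
    return create_work_item(patch)
```

Everything upstream of `submit_ticket` (scoping, counting, threshold math, ticket content) still runs in dry-run mode, which is what makes it useful for validation.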
Scheduling: Every Tuesday at 9 AM UTC
0 0 9 * * 2
The function runs weekly on Tuesday mornings. This cadence provides regular monitoring without excessive noise, and timing early in the week allows teams to address issues before end-of-week deployments.
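Azure Functions timer schedules use six-field NCRONTAB expressions (second, minute, hour, day, month, day of week), which is why the expression above has a leading seconds field. The small checker below is only an illustration of how the fields read, not part of the function. It handles plain numbers and `*` and ignores ranges, lists, and step values.

```python
from datetime import datetime, timezone

FIELD_NAMES = ("second", "minute", "hour", "day", "month", "day_of_week")


def parse_ncrontab(expression: str) -> dict:
    """Label the six whitespace-separated fields of an NCRONTAB expression."""
    parts = expression.split()
    if len(parts) != 6:
        raise ValueError("expected six fields: sec min hour day month day-of-week")
    return dict(zip(FIELD_NAMES, parts))


def matches(expression: str, moment: datetime) -> bool:
    """Check a datetime against an expression made of numbers and '*' only."""
    actual = {
        "second": moment.second,
        "minute": moment.minute,
        "hour": moment.hour,
        "day": moment.day,
        "month": moment.month,
        # Cron counts Sunday as 0; Python's weekday() counts Monday as 0.
        "day_of_week": (moment.weekday() + 1) % 7,
    }
    fields = parse_ncrontab(expression)
    return all(spec == "*" or int(spec) == actual[name]
               for name, spec in fields.items())


if __name__ == "__main__":
    tuesday_9am = datetime(2024, 1, 2, 9, 0, 0, tzinfo=timezone.utc)
    print(matches("0 0 9 * * 2", tuesday_9am))
```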
Results
The function has successfully identified multiple subscriptions approaching the RBAC limit before reaching critical thresholds. Early detection provides time for capacity planning—whether that means cleaning up orphaned role assignments, consolidating resource access patterns, or provisioning new subscriptions for additional Key Vault deployments.
Proactive monitoring converted a reactive firefighting problem into a planned capacity management task.
Built with Python, Azure Functions, and a healthy fear of hitting arbitrary cloud limits at the worst possible time.