How I Built an Azure Function to Stop RBAC Quota Fires Before They Start

The Problem: Azure's 4,000 Role Assignment Limit

Azure subscriptions have a hard limit of 4,000 RBAC role assignments. For enterprise environments running hundreds of application services across multiple regions and lifecycles, this limit becomes a significant constraint.

In our case, we manage hundreds of Key Vaults supporting custom App Services, Function Apps, and Application Gateway listeners. Each service is deployed across multiple regions (typically 2-4) and four lifecycle environments (dev, test, staging, production). Following least privilege principles, each application component requires its own Key Vault with isolated access policies—you can't share one vault across everything when different teams need different access scopes.

The math compounds quickly. A single application with components in 3 regions across 4 lifecycles generates 12+ Key Vaults, each requiring multiple role assignments for managed identities, service principals, and automation accounts. Multiply this across dozens of applications, and RBAC assignments accumulate faster than expected.
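To make the arithmetic concrete, here is a rough back-of-the-envelope estimate. The application count and the number of assignments per vault are illustrative assumptions, not figures from our environment, and the sketch assumes every vault lands in the same subscription.

```python
REGIONS = 3
LIFECYCLES = 4
SUBSCRIPTION_LIMIT = 4_000  # role assignments per subscription


def estimate_assignments(apps: int, assignments_per_vault: int) -> int:
    """Estimate total role assignments if each app gets one vault per region and lifecycle."""
    vaults_per_app = REGIONS * LIFECYCLES  # 12 vaults for a single-component app
    return apps * vaults_per_app * assignments_per_vault


# Hypothetical inputs: 40 applications, 5 assignments per vault.
total = estimate_assignments(apps=40, assignments_per_vault=5)
print(total, f"{total / SUBSCRIPTION_LIMIT:.0%}")  # 2400 60%
```

Even with these modest made-up numbers, more than half the quota is gone before any multi-component application is counted.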

The failure mode is abrupt. Deployments succeed until you hit 4,000 assignments, then pipelines fail with generic permission errors. By the time the limit surfaces, you're already debugging production issues instead of planning capacity.

The Solution: A Weekly RBAC Quota Check

I built a Python Azure Function that runs every Tuesday morning to monitor RBAC quota usage proactively:

Crawls all subscriptions in our management group hierarchy
Counts RBAC role assignments in each subscription (excluding inherited assignments from parent management groups, which don't count against the limit)
Alerts at 80% usage (3,200 assignments out of 4,000)
Creates Azure DevOps tickets with subscription details and remediation guidance
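The threshold step in the list above can be sketched roughly as follows. The names `QuotaStatus` and `evaluate_quota` are hypothetical and chosen for illustration; only the 4,000 limit and the 80% threshold come from the setup described here.

```python
from dataclasses import dataclass

ROLE_ASSIGNMENT_LIMIT = 4_000
ALERT_THRESHOLD_PERCENT = 80


@dataclass
class QuotaStatus:
    subscription_id: str
    count: int
    percent_used: float
    over_threshold: bool


def evaluate_quota(
    subscription_id: str,
    count: int,
    limit: int = ROLE_ASSIGNMENT_LIMIT,
    threshold_percent: int = ALERT_THRESHOLD_PERCENT,
) -> QuotaStatus:
    """Compare a subscription's assignment count with the alert threshold."""
    # Integer arithmetic avoids floating-point surprises right at the boundary.
    alert_at = limit * threshold_percent // 100  # 3,200 with the defaults
    return QuotaStatus(
        subscription_id=subscription_id,
        count=count,
        percent_used=round(count * 100 / limit, 1),
        over_threshold=count >= alert_at,
    )
```

With the defaults, a subscription holding exactly 3,200 assignments is flagged, while one with 3,199 is not.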

The 80% threshold provides operational headroom. When a subscription approaches this limit, we still have capacity to add RBAC assignments to existing Key Vaults as requirements evolve, or deploy a handful of additional vaults before hitting the hard ceiling. This buffer is critical because our landing zone repository deploys foundational infrastructure by looping through Key Vaults with a single provider configuration per environment. Routing specific vaults to different subscriptions mid-deployment requires significant refactoring of Terraform modules and iteration logic—tedious work that's better avoided through capacity planning.

The function runs automatically without manual intervention, alerting only when thresholds are exceeded.

Avoiding Ticket Spam

A monitoring system that generates duplicate alerts creates noise instead of value. The implementation includes safeguards to prevent ticket spam:

Tag-Based Tracking

Every ticket created by the function receives the tag ccoetools-rbac-quota-usage. Before creating a new ticket, the function queries Azure DevOps using WIQL (Work Item Query Language) to check existing work items:

If an open ticket already exists for the subscription → Skip creation
If a ticket was closed within the last 7 days → Skip creation (cooldown period)

This prevents duplicate tickets when a subscription already has active remediation work in progress. The 7-day cooldown handles cases where cleanup temporarily drops usage below the threshold while the overall trend is still upward, which avoids re-alerting before teams complete their remediation strategy.
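A minimal sketch of the two skip rules, under some assumptions: the subscription ID is assumed to appear in the ticket title, the set of closed state names depends on your Azure DevOps process template, and each work item is represented as a simple dict with `state` and `closed_date` keys (in practice a WIQL query returns only IDs, so the fields have to be fetched in a follow-up request). The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TRACKING_TAG = "ccoetools-rbac-quota-usage"
COOLDOWN = timedelta(days=7)
CLOSED_STATES = {"Closed", "Done", "Removed"}  # assumption: varies by process template


def build_wiql(subscription_id: str) -> str:
    """Build a WIQL query for tagged work items that mention the subscription."""
    # Assumption: the ticket title contains the subscription ID (a GUID, so no quoting issues).
    return (
        "SELECT [System.Id] FROM WorkItems "
        f"WHERE [System.Tags] CONTAINS '{TRACKING_TAG}' "
        f"AND [System.Title] CONTAINS '{subscription_id}'"
    )


def should_create_ticket(existing_items: list, now: datetime) -> bool:
    """Return False if an open ticket exists or one was closed within the cooldown."""
    for item in existing_items:
        if item["state"] not in CLOSED_STATES:
            return False  # open ticket: remediation already in progress
        closed = item.get("closed_date")
        if closed is not None and now - closed < COOLDOWN:
            return False  # recently closed: still inside the cooldown window
    return True


now = datetime.now(timezone.utc)
recent = [{"state": "Closed", "closed_date": now - timedelta(days=3)}]
print(should_create_ticket(recent, now))  # False
```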

The Exclusion List

Some subscriptions intentionally operate near capacity or are already under active remediation planning. The EXCLUDED_SUBSCRIPTIONS environment variable allows specific subscriptions to be excluded from monitoring, preventing ticket generation during planned maintenance windows or known remediation periods.

Configuration Through Environment Variables

As organizational needs evolve, the function's scope adapts without code changes. Two key environment variables control monitoring behavior:

MANAGEMENT_GROUP_IDS - Defines which management group hierarchies to scan. Updating this variable shifts monitoring scope to different organizational units without redeploying function code.
EXEMPT_SUBSCRIPTIONS - Lists subscriptions permanently excluded from monitoring. When subscriptions reach maximum capacity indefinitely (such as legacy environments being phased out), adding them to this list prevents ongoing alerts without removing the underlying resources.

This environment-driven configuration separates operational decisions from application logic, allowing platform teams to adjust monitoring scope through configuration updates rather than code deployments.
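Reading these settings might look something like the sketch below. The comma-separated format and the helper names `parse_list` and `in_scope` are assumptions for illustration; the variable names are the ones described above.

```python
import os
from typing import List, Mapping


def parse_list(name: str, environ: Mapping[str, str] = os.environ) -> List[str]:
    """Read a comma-separated environment variable into a clean list."""
    raw = environ.get(name, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def in_scope(subscription_id: str, exempt: List[str]) -> bool:
    """Return True if the subscription should be monitored (case-insensitive match)."""
    return subscription_id.lower() not in {s.lower() for s in exempt}


# Example with a fake environment instead of real app settings.
fake_env = {
    "MANAGEMENT_GROUP_IDS": "mg-platform, mg-apps",
    "EXEMPT_SUBSCRIPTIONS": "AAAA-1111",
}
groups = parse_list("MANAGEMENT_GROUP_IDS", fake_env)
exempt = parse_list("EXEMPT_SUBSCRIPTIONS", fake_env)
print(groups, in_scope("aaaa-1111", exempt))
```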

Automatic Lifecycle Detection

One neat trick: the function figures out whether a subscription is Dev, Test, Staging, or Production by looking at its management group hierarchy. If the subscription lives under a management group with "Production" in the name, the ticket gets tagged as Prd. This helps route tickets to the right people and set appropriate priority.

lifecycle_map = {
    "development": "Dev",
    "test": "Tst",
    "staging": "Stg",
    "production": "Prd",
}
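The lookup against that map could be sketched like this. It assumes the function already has the names of the subscription's ancestor management groups, nearest first; `detect_lifecycle` and the `"Unknown"` fallback are hypothetical.

```python
from typing import Iterable

lifecycle_map = {
    "development": "Dev",
    "test": "Tst",
    "staging": "Stg",
    "production": "Prd",
}


def detect_lifecycle(management_group_names: Iterable[str], default: str = "Unknown") -> str:
    """Return the tag for the first ancestor whose name contains a known keyword."""
    for name in management_group_names:
        lowered = name.lower()  # case-insensitive, so "Production" matches "production"
        for keyword, tag in lifecycle_map.items():
            if keyword in lowered:
                return tag
    return default


print(detect_lifecycle(["Corp-Production", "Tenant Root Group"]))  # Prd
```

Substring matching is deliberately loose, so a group named something like "Latest" would also match "test". Stricter matching may be worth it if your naming convention allows collisions.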

The Tech Stack

Azure Functions (Python v2 programming model) - Timer trigger, runs on a cron schedule
Azure Resource Graph - For discovering subscriptions in management groups
Azure Resource Manager (ARM) REST API - For counting role assignments (with pagination handling)
Azure DevOps REST API - For searching existing tickets and creating new ones
Managed Identity - No credentials to manage; the function just uses its own identity
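The pagination handling deserves a closer look. ARM list responses return results in a `value` array and, when more pages exist, a `nextLink` URL. The sketch below follows those links and counts only assignments scoped at or below the subscription, skipping ones inherited from management groups. The `fetch_page` callable is a hypothetical stand-in for the authenticated HTTP call, so the example runs without network access.

```python
from typing import Callable, Optional


def count_role_assignments(
    subscription_id: str,
    fetch_page: Callable[[Optional[str]], dict],
) -> int:
    """Count role assignments scoped at or below the subscription across all pages."""
    prefix = f"/subscriptions/{subscription_id}".lower()
    count = 0
    next_link = None  # None means "request the first page"
    while True:
        page = fetch_page(next_link)
        for assignment in page.get("value", []):
            scope = assignment.get("properties", {}).get("scope", "").lower()
            # Inherited assignments have a management group (or root) scope.
            if scope == prefix or scope.startswith(prefix + "/"):
                count += 1
        next_link = page.get("nextLink")
        if not next_link:
            return count


# Fake two-page response for illustration.
pages = {
    None: {
        "value": [
            {"properties": {"scope": "/subscriptions/sub-1"}},
            {"properties": {"scope": "/providers/Microsoft.Management/managementGroups/root"}},
        ],
        "nextLink": "page2",
    },
    "page2": {"value": [{"properties": {"scope": "/subscriptions/sub-1/resourceGroups/rg"}}]},
}
print(count_role_assignments("sub-1", lambda link: pages[link]))  # 2
```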

What a Ticket Looks Like

When the function does create a ticket, it's not just "hey there's a problem." It includes:

Subscription name and ID
Current assignment count
Percentage used
Specific remediation steps (we have a script that cleans up orphaned assignments from deleted service principals)
Link to update the enterprise application templates if we need to rotate to a new subscription
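As a rough illustration, the ticket payload could be assembled like this. Azure DevOps creates work items from a JSON Patch document, and multiple tags go into `System.Tags` separated by semicolons. The title wording, description layout, and the `build_ticket_patch` helper are assumptions, not the exact content our function produces.

```python
from typing import List, Optional

TRACKING_TAG = "ccoetools-rbac-quota-usage"


def build_ticket_patch(
    subscription_name: str,
    subscription_id: str,
    count: int,
    limit: int = 4_000,
    lifecycle_tag: Optional[str] = None,
) -> List[dict]:
    """Build a JSON Patch document for creating an Azure DevOps work item."""
    percent = count * 100 / limit
    title = f"RBAC quota at {percent:.1f}% for {subscription_name} ({subscription_id})"
    description = (
        f"Subscription: {subscription_name} ({subscription_id})<br>"
        f"Role assignments: {count} of {limit} ({percent:.1f}%)<br>"
        "Remediation: clean up orphaned assignments, or plan a new subscription."
    )
    tags = [TRACKING_TAG] + ([lifecycle_tag] if lifecycle_tag else [])
    return [
        {"op": "add", "path": "/fields/System.Title", "value": title},
        {"op": "add", "path": "/fields/System.Description", "value": description},
        {"op": "add", "path": "/fields/System.Tags", "value": "; ".join(tags)},
    ]


patch = build_ticket_patch("app-prod-01", "sub-1", 3300, lifecycle_tag="Prd")
print(patch[0]["value"])
```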

Dry Run Mode

The DRY_RUN environment variable enables testing without creating actual tickets. When set to true, the function logs all actions it would take without modifying Azure DevOps work items. This is useful for validating subscription scoping, threshold calculations, and ticket content before enabling production alerting.
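One way to wire this up is to gate the single side-effecting call behind the flag, as sketched below. Treating an unset variable as `false` is an assumption (defaulting to dry run would be the more cautious choice), and `file_ticket` with its `create_ticket` callback is hypothetical.

```python
import logging
from typing import Callable, Mapping

logger = logging.getLogger("rbac_quota")


def is_dry_run(environ: Mapping[str, str]) -> bool:
    """Return True when DRY_RUN is set to 'true' (case-insensitive)."""
    # Assumption: an unset variable means real tickets are created.
    return environ.get("DRY_RUN", "false").strip().lower() == "true"


def file_ticket(
    subscription_id: str,
    create_ticket: Callable[[str], None],
    dry_run: bool,
) -> bool:
    """Create the ticket, or only log the intent in dry-run mode. Returns True if created."""
    if dry_run:
        logger.info("DRY_RUN: would create ticket for %s", subscription_id)
        return False
    create_ticket(subscription_id)
    return True


created = []
file_ticket("sub-1", created.append, dry_run=is_dry_run({"DRY_RUN": "true"}))
print(created)  # []
```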

Scheduling: Every Tuesday at 9 AM UTC

0 0 9 * * 2

The function runs weekly on Tuesday mornings. This cadence provides regular monitoring without excessive noise, and timing early in the week allows teams to address issues before end-of-week deployments.
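Azure Functions timer triggers use six-field NCRONTAB expressions, which add a leading seconds field to the usual five cron fields, and day-of-week 2 is Tuesday (0 is Sunday). The sketch below labels the fields and checks a timestamp against the expression. It is a simplified illustration that handles only literal numbers and `*`, not ranges or steps, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

FIELDS = ("second", "minute", "hour", "day", "month", "day_of_week")


def parse_ncrontab(expression: str) -> dict:
    """Split a six-field NCRONTAB expression into named fields."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def matches(expression: str, when: datetime) -> bool:
    """Check whether a timestamp matches the expression (literals and '*' only)."""
    actual = {
        "second": when.second,
        "minute": when.minute,
        "hour": when.hour,
        "day": when.day,
        "month": when.month,
        "day_of_week": when.isoweekday() % 7,  # convert to 0 = Sunday ... 6 = Saturday
    }
    fields = parse_ncrontab(expression)
    return all(value == "*" or int(value) == actual[name] for name, value in fields.items())


tuesday_9am = datetime(2024, 1, 2, 9, 0, 0, tzinfo=timezone.utc)  # a Tuesday
print(parse_ncrontab("0 0 9 * * 2")["day_of_week"], matches("0 0 9 * * 2", tuesday_9am))
```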

Results

The function has identified multiple subscriptions approaching the RBAC limit before they reached it. Early detection leaves time for capacity planning, whether that means cleaning up orphaned role assignments, consolidating resource access patterns, or provisioning new subscriptions for additional Key Vault deployments.

Proactive monitoring converted a reactive firefighting problem into a planned capacity management task.

Built with Python, Azure Functions, and a healthy fear of hitting arbitrary cloud limits at the worst possible time.