No, not the Azure Batch service; we're talking about an undocumented API endpoint called /batch.
Every once in a while you might want to call the Azure API for hundreds of resources at once, or seemingly at once anyway. There are a lot of different ways to do that, and as always the one you choose probably depends on your use case. But what if you wanted to retrieve something like the power state of a virtual machine for hundreds of VMs across multiple subscriptions and regions? Sure, you could write some sort of Python script to call the API hundreds of times in seconds, but depending on your frequency you might chip away at your subscription/tenant limits by doing that.
Why not use a poorly documented (or frankly totally undocumented endpoint) to do so instead? After all, it’s what Microsoft uses themselves in Azure Portal and what Ansible uses to manage Azure inventories too!
Background
My use case is pretty simple: I want to query all of the VMs in our tenant for their power state at 5-minute intervals for an internal monitoring request. Because we're querying across the entire tenant and not a particular subscription, you might expect to be limited to the documented 12,000 individual calls, but the actual number (for the tenant we're querying, at least) is 15,000 read operations per hour based on the response header. Let's do some math:
1,000 VMs x 12 (5-minute query intervals per hour) = 12,000 reads/hour
We’re way under our limit already! But what if we want to change the query interval to 1 minute? What if we deploy more VMs across our existing tenant?
1,200 VMs x 60 (1-minute query intervals per hour) = 72,000 reads/hour
Oops.
But what if I told you that posting a request to that /batch endpoint with more than 20 requests would only count as 1 operation? And that you could send up to 500 requests in this single /batch request? Seems well worth figuring out, yeah?
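Since a single /batch call tops out at 500 requests, a fleet larger than that needs to be split into multiple batch payloads. A minimal sketch of that chunking, assuming a hypothetical subscription, resource group, and VM naming scheme:

```python
# Split a long list of per-resource GET requests into /batch-sized payloads.
# BATCH_LIMIT reflects the 500-request ceiling described above.
BATCH_LIMIT = 500

def chunk_requests(urls, limit=BATCH_LIMIT):
    """Yield /batch request bodies holding at most `limit` requests each."""
    for i in range(0, len(urls), limit):
        yield {
            "requests": [
                {"httpMethod": "GET", "url": u, "name": f"req-{i + n}"}
                for n, u in enumerate(urls[i:i + limit])
            ]
        }

# Placeholder resource URLs: 1,200 VMs become 3 batch calls instead of
# 1,200 individual read operations.
vm_urls = [
    f"/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group"
    f"/providers/Microsoft.Compute/virtualMachines/vm{n}?api-version=2023-03-01"
    for n in range(1200)
]
payloads = list(chunk_requests(vm_urls))
```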
Proof of Concept
First things first, let's take a look at an example request using cURL, then break it down to understand what we need and why.
curl -iX POST "https://management.azure.com/batch?api-version=2015-11-01" \
-H "Authorization: Bearer $AZURE_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @batch_request.json
cat batch_request.json
{
  "requests": [
    {
      "httpMethod": "GET",
      "url": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/vm1?api-version=2023-03-01",
      "name": "vm1-request"
    },
    {
      "httpMethod": "GET",
      "url": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group/providers/Microsoft.Network/networkInterfaces/nic1?api-version=2023-04-01",
      "name": "nic1-request"
    },
    {
      "httpMethod": "GET",
      "url": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group/providers/Microsoft.Storage/storageAccounts/store1?api-version=2023-01-01",
      "name": "storage1-request"
    }
  ]
}
In our example we're using a bearer token that we've already generated, and we're passing the request body as a file. With cURL you'll run into limits on the size of the request you can send in a single command, and storing the request in a file makes sense anyway since you're probably generating the request programmatically. If we examine the contents of the batch_request.json file we can see how it's formatted, and that we're pulling 3 different resource types in a single request.
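Generating that file programmatically is straightforward. A sketch that produces the batch_request.json shown above (SUBSCRIPTION_ID, the resource group, and the resource names are placeholders):

```python
import json

# Each entry: (provider path, api-version, request name). These mirror the
# three resources in the example file above.
resources = [
    ("Microsoft.Compute/virtualMachines/vm1", "2023-03-01", "vm1-request"),
    ("Microsoft.Network/networkInterfaces/nic1", "2023-04-01", "nic1-request"),
    ("Microsoft.Storage/storageAccounts/store1", "2023-01-01", "storage1-request"),
]

body = {
    "requests": [
        {
            "httpMethod": "GET",
            "url": (
                "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group"
                f"/providers/{path}?api-version={ver}"
            ),
            "name": name,
        }
        for path, ver, name in resources
    ]
}

# Write the payload out so it can be passed to cURL with -d @batch_request.json
with open("batch_request.json", "w") as f:
    json.dump(body, f, indent=2)
```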
When we send the request there are two potential outcomes. The first is that our results are returned to us immediately, which is almost always the case for smaller batch requests (in my experience, asking for the data on approximately 20 virtual machines). The second outcome is what you are more likely to run into when requesting several hundred machines at a time: you receive a 202 Accepted response, which tells you that the request was good but the data is not ready yet. If you look at the response headers you will see a Location field containing a URL that looks something like https://management.azure.com/batch/REDACTED, where REDACTED is some long string of characters. If you follow that URL you will get a response much like you'd expect, with the different resources you asked for. When your initial request is rather long (think hundreds of different resources) you will experience further pagination while ARM processes your requests. In that case you would likely want to sleep for a few seconds between retries of the Location URL.
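The polling loop described above can be sketched with nothing but the standard library. This is a hypothetical helper, not an official client; the retry delay and attempt cap are guesses at polite defaults:

```python
import json
import time
import urllib.request

def follow_batch(location_url, token, max_attempts=30, delay=5):
    """Poll the Location URL from a 202 Accepted response until results arrive.

    Sketch only: assumes the 202/Location behaviour described above, and that
    the caller already holds a valid bearer token.
    """
    for _ in range(max_attempts):
        req = urllib.request.Request(
            location_url,
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            # Each poll costs one read; the remaining budget is reported in
            # the x-ms-ratelimit-remaining-tenant-reads response header.
            if resp.status == 202:
                # ARM is still assembling the batch; back off and retry.
                time.sleep(delay)
                continue
            return json.load(resp)
    raise TimeoutError("batch results never became available")
```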
Also available in the response headers is an x-ms-ratelimit-remaining-tenant-reads field that gives you the number of API read calls you can still issue over the next hour. What's neat about this is that each call to the URL found in the Location field counts as one read, no matter the number of resources returned. I think the reasoning is that you are consuming a single read operation each time, but it doesn't really explain why each request inside the batch call (which is a GET request itself) isn't consuming an additional read. Probably because enforcing it that way would break Azure.