No, not the Azure Batch service; we're talking about an undocumented API endpoint called /batch.
Every once in a while you might want to call the Azure API for hundreds of resources at once, or seemingly at once anyway. There are a lot of different ways to do that, and as always the one you choose probably depends on your use case. But what if you wanted to retrieve something like the power state of a virtual machine for hundreds of VMs across multiple subscriptions and regions? Sure, you could write some sort of Python script to call the API hundreds of times in seconds, but depending on your frequency you might chip away at your subscription/tenant limits by doing that.
Why not use a poorly documented (or frankly totally undocumented endpoint) to do so instead? After all, it’s what Microsoft uses themselves in Azure Portal and what Ansible uses to manage Azure inventories too!
Background
My use case is pretty simple: I want to query all of the VMs in our tenant for their power state at 5-minute intervals for an internal monitoring request. Because we're querying across the entire tenant and not a particular subscription, you might expect to be limited to the documented 12,000 individual calls, but the actual number (for the tenant we're querying, at least) is 15,000 read operations per hour based on the response header. Let's do some math:
1,000 VMs x 12 (5-minute query intervals per hour) = 12,000 reads/hour
We’re way under our limit already! But what if we want to change the query interval to 1 minute? What if we deploy more VMs across our existing tenant?
1,200 VMs x 60 (1-minute query intervals per hour) = 72,000 reads/hour
Oops.
But what if I told you that posting a request to that /batch endpoint with more than 20 requests would only count as 1 operation? And that you could send up to 500 requests in this single /batch request? Seems well worth figuring out, yeah?
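Since a single /batch call tops out at 500 requests, a fleet larger than that needs to be split into multiple batch payloads. A minimal sketch of that chunking, assuming a hypothetical subscription, resource group, and VM naming scheme:

```python
# Split a long list of per-resource GET requests into /batch-sized payloads.
# BATCH_LIMIT reflects the 500-request ceiling described above.
BATCH_LIMIT = 500

def chunk_requests(urls, limit=BATCH_LIMIT):
    """Yield /batch request bodies holding at most `limit` requests each."""
    for i in range(0, len(urls), limit):
        yield {
            "requests": [
                {"httpMethod": "GET", "url": u, "name": f"req-{i + n}"}
                for n, u in enumerate(urls[i:i + limit])
            ]
        }

# Placeholder resource URLs: 1,200 VMs become 3 batch calls instead of
# 1,200 individual read operations.
vm_urls = [
    f"/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group"
    f"/providers/Microsoft.Compute/virtualMachines/vm{n}?api-version=2023-03-01"
    for n in range(1200)
]
payloads = list(chunk_requests(vm_urls))
```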
Proof of Concept
First things first, let's take a look at an example request using cURL, then break it down to understand what we need and why.
curl -iX POST "https://management.azure.com/batch?api-version=2015-11-01" \
-H "Authorization: Bearer $AZURE_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @batch_request.json
cat batch_request.json
{
  "requests": [
    {
      "httpMethod": "GET",
      "url": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/vm1?api-version=2023-03-01",
      "name": "vm1-request"
    },
    {
      "httpMethod": "GET",
      "url": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group/providers/Microsoft.Network/networkInterfaces/nic1?api-version=2023-04-01",
      "name": "nic1-request"
    },
    {
      "httpMethod": "GET",
      "url": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group/providers/Microsoft.Storage/storageAccounts/store1?api-version=2023-01-01",
      "name": "storage1-request"
    }
  ]
}
In our example we're using a bearer token that we've already generated, and we're passing the request body as a file. With cURL you'll run into limits on the size of the request you can send in a single command, and storing the request in a file makes sense anyway since you're probably generating the request programmatically. If we examine the contents of the batch_request.json file we can see how it's formatted, and that we're pulling 3 different resource types in a single request.
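Generating that file programmatically is straightforward. A sketch that produces the batch_request.json shown above (SUBSCRIPTION_ID, the resource group, and the resource names are placeholders):

```python
import json

# Each entry: (provider path, api-version, request name). These mirror the
# three resources in the example file above.
resources = [
    ("Microsoft.Compute/virtualMachines/vm1", "2023-03-01", "vm1-request"),
    ("Microsoft.Network/networkInterfaces/nic1", "2023-04-01", "nic1-request"),
    ("Microsoft.Storage/storageAccounts/store1", "2023-01-01", "storage1-request"),
]

body = {
    "requests": [
        {
            "httpMethod": "GET",
            "url": (
                "/subscriptions/SUBSCRIPTION_ID/resourceGroups/my-resource-group"
                f"/providers/{path}?api-version={ver}"
            ),
            "name": name,
        }
        for path, ver, name in resources
    ]
}

# Write the payload out so it can be passed to cURL with -d @batch_request.json
with open("batch_request.json", "w") as f:
    json.dump(body, f, indent=2)
```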
When we send the request there are two potential outcomes. The first is that our results are returned to us immediately, which is almost always the case for smaller batch requests (in my experience, asking for the data on approximately 20 virtual machines). The second outcome is what you are more likely to run into when requesting several hundred machines at a time: you receive a 202 Accepted response, which tells you that the request was good but the data is not ready yet. If you look at the response headers you will see a Location field containing a URL that looks something like https://management.azure.com/batch/REDACTED, where REDACTED is some long string of characters. If you follow that URL you will get a response much like you'd expect, with the different resources you asked for. When your initial request is rather long (think hundreds of different resources) you will experience further pagination while ARM processes your requests. In that case you would likely want to sleep for a few seconds between retries of the Location URL.
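The polling loop described above can be sketched with nothing but the standard library. This is a hypothetical helper, not an official client; the retry delay and attempt cap are guesses at polite defaults:

```python
import json
import time
import urllib.request

def follow_batch(location_url, token, max_attempts=30, delay=5):
    """Poll the Location URL from a 202 Accepted response until results arrive.

    Sketch only: assumes the 202/Location behaviour described above, and that
    the caller already holds a valid bearer token.
    """
    for _ in range(max_attempts):
        req = urllib.request.Request(
            location_url,
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            # Each poll costs one read; the remaining budget is reported in
            # the x-ms-ratelimit-remaining-tenant-reads response header.
            if resp.status == 202:
                # ARM is still assembling the batch; back off and retry.
                time.sleep(delay)
                continue
            return json.load(resp)
    raise TimeoutError("batch results never became available")
```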
Also available in the response headers is an x-ms-ratelimit-remaining-tenant-reads field that gives you the number of API read calls you can still issue over the next hour. What's neat about this is that each call to the URL found in the Location field counts as one read, no matter the number of resources returned. I think the reasoning is that you are consuming a single read operation each time, but it doesn't really explain why each request inside the batch call (which is a GET request itself) isn't consuming an additional read. Probably because enforcing it that way would break Azure.