High volume processing on Azure: the need for marginal gains

Disclaimer: in most business cases you should focus on maintainability, testability, good data structures and so on instead of going for the last marginal gains. Saving a few dollars in running costs will cost you thousands of dollars in extra research and implementation hours, possible bugs, and more. But once you enter the league of big numbers, marginal gains can save you a lot of money. When do you reach that point? Open the Azure calculator with your volume estimates and you'll know when you're there.

Azure is a very scalable platform that is most likely able to solve most, if not all, of your challenges. With a good architecture you can scale out far past the limits of a single resource instance. Limits in IT are often defined by the amount of money you want to throw at the problem. The same goes for Azure, but the ability to scale at the snap of a finger doesn't mean you should do so without thinking twice about it.

I’ve been able to process and store millions of records per hour for less than $2000 a month, but I have also seen solutions struggle to pass 10k records (of similar volume) per hour at the same cost. In the latter case the architecture was less optimal, but even on the same solution architecture I’ve seen differences of a factor of 10 and higher based on implementation details. Once you hit very high volumes, every percentage gained in performance and/or cost makes a huge difference.

The challenge

Today’s example is about processing large files with millions of records and the impact of marginal gains on performance and cost. The key actors in this solution are Azure Functions and Cosmos DB. Of course there are other Azure services involved, but measurements showed that most of the time and money went into these two services. For this blog post we’ll discuss a C# implementation, but you can apply the principles to other supported programming languages.

The documentation

Azure Functions: Depending on the plan you choose, Azure Functions scale pretty well. If necessary you can hit 100-200 parallel instances of a Function app, with multiple threads per instance. That amounts to ‘very big number processing’, unless you do very time-intensive work or make critical async programming mistakes.

Cosmos DB: Azure Cosmos DB is a globally distributed, highly available NoSQL store with near-infinite horizontal scaling. If you partition correctly and provision the required throughput, you should be able to achieve the 10ms latency SLA on both reads and writes, again at large volumes.

Combining both means you should have plenty of processing and storage bandwidth available to solve the challenge. The next step is probably opening the Azure Cost Calculator and trying to estimate the cost. The more you use these resources, the better you’ll be able to estimate what you need. At first it might be challenging to get a good estimate of, for example, the amount of Cosmos DB RUs needed.
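To make that concrete with a hypothetical back-of-the-envelope calculation (the numbers are placeholders, not from a real workload): if an insert costs around 8 RU and you need to write 1 million records per hour, that is roughly 8,000,000 RU per hour, or 8,000,000 / 3600 ≈ 2,200 RU/s of sustained write throughput, before accounting for reads, retries or peak load.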

Marginal gains

Azure Functions

If you have written code before, you’ve most likely already profiled your code with profiling tools or more basic time measurements like the Stopwatch class. Quite often programmers have focused on marginal gains (e.g. object serialization libraries) even when they shouldn’t, so I’ll skip this part for now.

Cosmos DB

Everything starts with RU

The basis of Cosmos DB is the Request Unit (RU), which expresses the cost of every database operation. Definitely have a look at the linked documentation for everything that impacts the RU cost if you plan to use Cosmos DB.

The value of RU

So why are RUs so important? You provision and pay for a given amount of RUs per second. If you exceed the amount of provisioned RUs, your requests get throttled (HTTP 429). The good news is that the .NET Cosmos SDK handles these errors by retrying throttled requests without extra code. The bad news, of course, is that not knowing how many RUs you need might either bring your application to its knees or cost you a lot of money for unused RUs. So measure, monitor and modify the provisioned RUs; I’ll coin it the ’three M’s of Cosmos DB’ ™️😉.
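As a minimal sketch of how that retry behaviour can be tuned (assuming the v3 .NET SDK; the endpoint and key are placeholders), you can set the retry limits when creating the client:

```csharp
using System;
using Microsoft.Azure.Cosmos;

// Sketch: the SDK retries HTTP 429 (throttled) responses on its own.
// These two options control how many retries it attempts and how long
// it is willing to wait before the exception surfaces to your code.
CosmosClient cosmosClient = new CosmosClient(
    "https://<your-account>.documents.azure.com:443/",  // placeholder endpoint
    "<your-account-key>",                               // placeholder key
    new CosmosClientOptions
    {
        MaxRetryAttemptsOnRateLimitedRequests = 9,                       // SDK default
        MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30) // SDK default
    });
```

Leaving the defaults in place is usually fine; the point is that throttling is handled for you, at the price of extra latency while the SDK waits and retries.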

Measuring RU usage

Every item operation in the .NET SDK returns an ItemResponse<T> object. This object has a RequestCharge property which contains the cost of your request in RUs.

```csharp
// Create the document and log the RU charge reported by the SDK.
ItemResponse<MyEntity> createDocumentResponse = await MyContainer.CreateItemAsync<MyEntity>(entity, new PartitionKey(key));
_log.LogInformation($"Document CreateItemAsync {key}: {createDocumentResponse.RequestCharge} RU");
```

So one of the first tasks when deciding to use Cosmos DB for a given solution is to measure all your basic queries. The results will be impacted by the size of your model. Below are some tests where I also distinguish between a ‘metadata-only’ document and the actual data document, to show that every change has an impact.

| Action | RU cost |
| --- | --- |
| Point read | 1 |
| Point read - 404 | 1.24 |
| Meta document insert | 8 |
| Data document insert | 40 |
| Data document insert with id-only index | 8.38 |
| Data document update with id-only index | 12 |
| Data document delete with id-only index | 8.38 |

If you went through the list of factors impacting RU cost, you might have noticed the index. By default Cosmos DB indexes all fields in your document, but often you only need a limited number of fields in your index. Looking at the case above, my cost went down by roughly a factor of 5 just by tweaking the index. And it is not only the RU cost that goes down: smaller indexes have a clear impact on the data storage costs as well.
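One way to get close to such an ‘id-only’ setup is to exclude all paths from the indexing policy, so that point reads by id and partition key stay cheap while no per-field indexes are maintained. The sketch below shows this with the v3 .NET SDK; the container name, partition key path and throughput value are placeholders, not necessarily the exact policy behind the numbers above.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Sketch: create a container whose indexing policy excludes every path,
// so no per-field indexes are maintained. Point reads (id + partition key)
// are unaffected by this and remain cheap.
public static async Task<Container> CreateMinimalIndexContainerAsync(Database database)
{
    ContainerProperties containerProperties = new ContainerProperties(
        id: "records",                    // placeholder container name
        partitionKeyPath: "/partitionKey" // placeholder partition key path
    );

    containerProperties.IndexingPolicy.IndexingMode = IndexingMode.Consistent;
    containerProperties.IndexingPolicy.IncludedPaths.Clear();
    containerProperties.IndexingPolicy.ExcludedPaths.Clear();
    containerProperties.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/*" });

    ContainerResponse response = await database.CreateContainerIfNotExistsAsync(
        containerProperties,
        throughput: 4000); // placeholder throughput

    return response.Container;
}
```

The trade-off is obvious: any query that filters on a non-indexed field becomes expensive, so only do this when your access pattern really is point reads (or a handful of explicitly included paths).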

The impact of lower RU queries

So why do I focus this much on the cost of a single document, where an insert is a measly 8.38 RU or 40 RU, when I might provision 50,000 RU/s? Because the impact might be bigger than you think:

  • Lower RU means more queries (or inserts) on the same provisioned throughput, so you can either process more or go cheaper on Cosmos DB.
  • Even though we have a guaranteed latency SLA of 10ms, lowering the RU cost does have an impact on the processing speed, and thus in this case on our Azure Function as well.

Let’s look at the raw numbers from my test case, in which I insert records one by one:

| Performance run | Standard Index | Minimal Index |
| --- | --- | --- |
| 4000 RU / 1000 records, 1 Function | 23.1s | 13.5s |
| 4000 RU / 1000 records, RU consumption % | 73% | 32% |
| 4000 RU / 10000 records, 1 Function | 225s | 129s |
| 20000 RU / 10000 records, 5 Functions in parallel | N/A | 130s |

The first thing you notice is that for a single Function processing the same number of records, the standard (all fields) index uses a larger share of my provisioned RUs and takes almost twice as long as the minimal (id-only) index. If my query needs are covered by this minimal index, I can easily halve the cost by halving the provisioned RUs. Or I can use the spare throughput for extra parallel processing power or read queries.

Scaling up the number of records to be processed changes the duration by the same factor, as we have a stable throughput. Also interesting to notice is that Cosmos DB scales out perfectly: I increased the provisioned RUs by a factor of 5 to make sure there were plenty of RUs available, and each of the 5 Functions running in parallel achieved the same throughput as a single Function running alone.
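For illustration, a measurement loop along these lines can produce this kind of number (a minimal sketch; MyEntity, its PartitionKey property, the container and the logger are placeholders, not the exact code behind the table above):

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Logging;

// Sketch: insert records one by one, summing the RU charge per insert and
// timing the whole run with a Stopwatch.
public static async Task InsertOneByOneAsync(
    Container container, IReadOnlyList<MyEntity> entities, ILogger log)
{
    double totalRequestUnits = 0;
    Stopwatch stopwatch = Stopwatch.StartNew();

    foreach (MyEntity entity in entities)
    {
        ItemResponse<MyEntity> response = await container.CreateItemAsync(
            entity, new PartitionKey(entity.PartitionKey));
        totalRequestUnits += response.RequestCharge;
    }

    stopwatch.Stop();
    log.LogInformation(
        $"Inserted {entities.Count} documents in {stopwatch.Elapsed.TotalSeconds:F1}s " +
        $"for a total of {totalRequestUnits:F0} RU");
}
```

Running several instances of such a loop in parallel (one per Function) is then simply a matter of provisioning enough RUs so that the per-instance throughput stays constant.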

Bonus: In a later run I tweaked the data model to store the same data in another format, trimming the volume down by half. This had an impact on the RU cost for inserts, but of course also on the bandwidth required between my Function app and Cosmos DB, further reducing the processing time by another 35% and freeing up more RUs.

Conclusion

So even though these individual Azure resources have insane scaling limits and come at a very fair price, things can get expensive and take a ’long’ time to process once we go big. At some point it’s worth the time to start optimizing and go for marginal gains.

In a follow-up post we’ll cover some extra aspects of high-volume data processing.

Licensed under CC BY-NC-SA 4.0; code samples licensed under MIT.