Real-life Azure mistakes that cost someone their Christmas presents!

Note: This post is part of the Festive Tech Calendar 2023. Make sure to have a look at the list of other topics provided by the community. This year the Festive Tech Calendar Team is raising money for the Raspberry Pi Foundation.

Luckily, I have so far been spared excessive cloud bills myself. But over the past handful of years, I’ve had quite a few conversations with both colleagues and customers about cloud cost, particularly when their bill was unexpected and the result of a mistake rather than just the volume of usage.

Often the conversation starts with a question like “Why is my Azure bill this month so high?” or “How do I prevent expensive bills in the future?”. And maybe the best one: “Can Microsoft take part of this cost?”. While these are valid questions, they are often not the right ones to ask. Maybe the question should be “How did we get here in the first place?”.

Let’s look at a few cases I’ve seen over the years and see what we can learn from them. While the title of this post might seem dramatic, it might not be far from the truth in some cases. For good reason, I will not mention names or include screenshots, but the cases are clear enough to learn from.

The junior’s first steps into Azure

The very first case is one of my first Azure projects. The team was looking into different options for capturing and processing IoT sensor data, including Azure and open-source alternatives. During the PoC stage of the project, one of the team members decided to try out Azure Stream Analytics to capture and store data.

Keeping the explanation short: back in those days a single ASA instance cost around $200 per month (if I remember correctly). For some unknown reason, the developer decided to create 3 instances “so all messages would certainly be processed” and simply forwarded incoming data to a storage account. This setup was kept around for several months, resulting in a bill of a few thousand dollars (or euros in our case). Number of messages processed during the PoC: fewer than 100. Price per message: more than $20.
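The arithmetic behind that price per message can be sketched in a few lines of Python. The instance price and count come from the story above; the number of months is an assumption on my side (the story only says “several months”), so treat the result as an order of magnitude rather than the actual bill.

```python
# Rough cost-per-message arithmetic for an always-on service.
INSTANCE_PRICE_PER_MONTH = 200   # USD, approximate figure from the story
INSTANCE_COUNT = 3               # from the story
MONTHS_RUNNING = 4               # assumption: the story only says "several months"
MESSAGES_PROCESSED = 100         # upper bound from the story


def cost_per_message(price_per_month, instances, months, messages):
    """Return (total cost, cost per message) for a fixed-price service."""
    total = price_per_month * instances * months
    return total, total / messages


total, per_message = cost_per_message(
    INSTANCE_PRICE_PER_MONTH, INSTANCE_COUNT, MONTHS_RUNNING, MESSAGES_PROCESSED
)
print(f"Total: ${total}, per message: ${per_message:.2f}")
```

The point of the sketch: with a fixed monthly price, the cost per message is driven almost entirely by how few messages you send, not by the service itself.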

Lessons learned

While it was a rather small project, it was a good learning experience for the team. A quick summary of the lessons learned:

Look at the documentation to estimate the scale you need when it is the first time using a specific service.
Use the Azure Pricing Calculator to validate the estimated cost of your solution up front.
Finally, keep a close eye on the cost. You can even proactively scale down to see whether your solution still performs well enough.

And maybe most importantly:

Validate whether the service you are using is the right one for the job. In this case, Azure Stream Analytics was not the right fit: the data volume was too low to justify the cost, and the team was simply forwarding the data to a storage account.

Note: Azure Stream Analytics has moved to a new pricing model since then, making it a lot cheaper to run. But of course, the lesson learned is still valid.

Large numbers matter

Often, we try to optimize for performance and/or cost, but end up spending days or weeks of work to save a few dollars or gain a minimal performance improvement. I tell my teams to write ‘decent’ code and that should be good enough for everyday use. It’s only when you’re having millions of hits (API endpoint, data volume or load on your Azure service, …) that you should start looking into optimizing.

The next case is one where I drew up the initial architecture and immediately noticed a possible issue: the system had to process millions of records per hour. The request for proposal stated that the system should process and store each record as soon as possible (and individually) to minimize data loss in case of a failure. Data would be stored both in an operational data store and on Azure Blob Storage for long-term persistence for audit reasons.

One of the first things I did was open the Azure Pricing Calculator and estimate the cost of each aspect of the solution, as this would be my first time storing millions of records per hour and several terabytes of data per month (with up to 10 years of persistence). While the volume itself brought a certain expected cost, I was struck by the cost of write operations: more than $10,000 per month to fulfill the request of storing each record right away. It was clear to me that we needed to find a different solution (hint: batching to optimize for the block size). In the end, the complete solution (compute, operational and long-term storage, network excluded) for 6 environments ended up costing about half of what the write operations for production alone would have cost us.
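To get a feeling for why write operations dominate at this scale, here is a minimal estimate in Python. The record rate, the batch size and the price per 10,000 writes are all placeholders I made up for illustration, not the figures from the project; plug in the numbers from the pricing calculator for your own region and tier.

```python
import math

HOURS_PER_MONTH = 730  # average month length, as used by the pricing calculator


def monthly_write_cost(records_per_hour, records_per_write, price_per_10k_writes):
    """Estimate the monthly cost of write operations only.

    records_per_write = 1 models "store each record individually";
    a larger value models batching several records into one write.
    """
    writes_per_hour = math.ceil(records_per_hour / records_per_write)
    writes_per_month = writes_per_hour * HOURS_PER_MONTH
    return writes_per_month / 10_000 * price_per_10k_writes


# Hypothetical inputs, replace with your own figures.
RECORDS_PER_HOUR = 5_000_000
PRICE_PER_10K_WRITES = 0.05  # USD, placeholder

individual = monthly_write_cost(RECORDS_PER_HOUR, 1, PRICE_PER_10K_WRITES)
batched = monthly_write_cost(RECORDS_PER_HOUR, 1_000, PRICE_PER_10K_WRITES)
print(f"individual writes: ${individual:,.2f} per month")
print(f"batches of 1,000:  ${batched:,.2f} per month")
```

Whatever the real unit price is, the cost scales linearly with the number of write operations, so writing 1,000 records at once cuts this part of the bill by roughly a factor of 1,000.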

Lessons learned

Don’t fall into the trap of premature optimization, but when you’re dealing with large numbers, look into the impact of your design choices.
Use the Azure Pricing Calculator to validate the estimated cost of your solution up front. Details matter with large numbers.
Run a PoC to validate your assumptions. You can view the cost of each resource (or resource group) per day under cost management (with a delay of up to 24 hours). We ran tests with millions of messages over the timespan of an hour at least once each sprint, to validate performance and cost from the start of the project until go live.
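The batching idea hinted at in this case could look roughly like the sketch below. It is not the code we used in the project: `RecordBatcher` and `write_batch` are hypothetical names, and `write_batch` stands in for whatever performs the actual storage write.

```python
class RecordBatcher:
    """Buffer records and hand them to `write_batch` in groups.

    `write_batch` is a hypothetical callable; each call represents
    one billable write operation.
    """

    def __init__(self, write_batch, max_records=1000):
        self._write_batch = write_batch
        self._max_records = max_records
        self._buffer = []

    def add(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()

    def flush(self):
        """Write whatever is buffered; call this on shutdown or on a timer."""
        if self._buffer:
            self._write_batch(list(self._buffer))
            self._buffer.clear()


writes = []  # each entry represents one write operation
batcher = RecordBatcher(writes.append, max_records=1000)
for i in range(2500):
    batcher.add({"id": i})
batcher.flush()
print(len(writes), [len(batch) for batch in writes])  # 3 [1000, 1000, 500]
```

The trade-off is that records sitting in the buffer are lost if the process crashes before a flush, so the batch size (or a flush timer) has to be weighed against the requirement to minimize data loss.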

Integration tests costing a luxury car

One of my colleague consultants came to me in quite a panic. Out of nowhere, the Azure bill of his client jumped more than tenfold. Luckily, they caught it within a few days and the overage cost was ‘limited’ to almost 6 figures! This could easily have been a lot worse (500k to 1m+) if they had only seen it when the bill came in at the end of the month. The culprit? A combination of a few things:

One team had a lot of very detailed logging in their service to aid troubleshooting, as debugging in the cloud is not always easy. Each API call generated between 10 and 30 log messages.
Another team ran their integration tests against the development environment of the first team, of course without communicating. That week, they added a bunch of tests and also introduced a load test (which in itself is a good idea).
All logging was sent to Azure Application Insights, but also forwarded to Azure Sentinel for security monitoring, roughly doubling the cost.
All logging from all environments was combined in a single workspace.

While they could partially intervene and try to limit the damage, it took several weeks to get the cost back within reasonable numbers.

Lessons learned

Set up cost management alerts to get notified when your cost is going up. This particular issue was caught within days, but could have cost a fortune.
Separate logging per environment. This allows you to set a daily cap on each environment, so you can limit the damage if something goes wrong. In this case they were unable to do so, as a cap on the shared workspace would have made them miss issues in production.
Use feature flags or configuration to enable/disable detailed logging. Most issues are reproducible, so you can have another go with logging enabled. It’s important to catch critical issues, but keeping detailed logging on will result in sampling, which may cause you to miss important logs.
Communication within an organization is key. If you’re running integration tests against a service, make sure to communicate with the team owning the service.
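Toggling detailed logging through configuration can be as simple as the sketch below, which uses Python’s standard `logging` module. The environment variable name `DETAILED_LOGGING` and the logger name `orders` are made up for the example; a real setup might read the flag from a feature-flag service or app configuration instead.

```python
import logging
import os


def configure_logging(env=os.environ):
    """Set the log level from a flag instead of hard-coding verbose output.

    DETAILED_LOGGING is a hypothetical environment variable name.
    """
    detailed = env.get("DETAILED_LOGGING", "false").lower() == "true"
    level = logging.DEBUG if detailed else logging.WARNING
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(level)
    return logger


logger = configure_logging({"DETAILED_LOGGING": "false"})
logger.debug("payload received")    # dropped: below WARNING
logger.warning("retrying request")  # emitted
```

With the flag off, the 10 to 30 debug messages per API call never leave the process, so they are never ingested or billed. When an issue needs investigating, you flip the flag, reproduce, and flip it back.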

When the cost is NOT on your Azure bill

For the last case, I won’t touch the typical “We forgot to turn off the VM” cases. Instead, I want to focus on a case where the cost was not on the Azure bill, but on the bottom line of the company itself. One might even say that the Azure bill was significantly lower that particular month.

It all started with another call for help from a colleague. His customer had deleted an Azure Blob Storage account with a significant number of business-critical files. As you can guess, there was no soft-delete enabled nor any backup. The question from them was “Can Microsoft restore this storage account?”, but I guess you know the answer to that question. Imagine the cost for them to take backups of everything in Azure and keep it for weeks/months, just in case someone deletes something. There is a reason why you have to explicitly enable soft-delete and backups (and pay for it).

Lessons learned

We’re human and thus can make mistakes, either directly or through bugs in our system.
Use the platform features to help you prevent mistakes, in this case soft-delete and resource locks.
Make sure to have a backup strategy in place for your business-critical data. This is not only true for Azure, but for any system you’re using.

Note: None of the redundant storage choices (from LRS to GZRS) can be used as a backup. Redundancy means being able to recover from file corruption or mechanical failure thanks to multiple copies. Deleting a file/account will propagate to all copies.
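The difference between redundancy and backup can be illustrated with a toy model in Python. This is not how Azure Storage works internally and it does not use any Azure API; `ReplicatedStore` is a made-up class that only mimics the behavior described in the note.

```python
import copy


class ReplicatedStore:
    """Toy model: every write and delete is applied to all replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]

    def put(self, name, data):
        for replica in self.replicas:
            replica[name] = data

    def delete(self, name):
        for replica in self.replicas:
            replica.pop(name, None)

    def get(self, name):
        # Any surviving replica can serve the read.
        for replica in self.replicas:
            if name in replica:
                return replica[name]
        return None


store = ReplicatedStore()
store.put("invoice.pdf", b"important")

backup = copy.deepcopy(store.replicas[0])  # independent point-in-time copy

store.replicas[0].clear()                  # simulate losing one replica
print(store.get("invoice.pdf"))            # still readable from another replica

store.delete("invoice.pdf")                # human error
print(store.get("invoice.pdf"))            # None: gone from every replica
print(backup.get("invoice.pdf"))           # only the backup still has it
```

Replicas protect against losing a copy; only something that does not follow the delete (a backup, or a soft-delete retention window) protects against the delete itself.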

Wrapping up

I hope these stories have given you some insight into how to avoid costly mistakes. I’m sure there are many more stories out there, so feel free to share yours in the comments below for others to read and learn. If you run into an issue and are looking for help, you can always reach out on X (Twitter).