A CloudWatch Bill Adventure

AWS bills are hard to dig through. My team stumbled upon another weird artefact we can't make sense of: during the first five days of each month, our CloudWatch costs are twice as high as during the rest of the month.

Spoiler: We still don't know what's going on. If you have an enlightening idea, contact me.

Cavernous Cost Explorer

Let's use the tools given to us by dear Jeff and drill down on this thing. Grouping by usage type gives us a hint. The cost originates from “EUC1-CW:MetricMonitorUsage”. The name very clearly points us in the right direction: We accumulate costs through custom CloudWatch metrics. These are metrics outside of the AWS namespace.

We get a better view of things if we group by API operation. We see “MetricStorage” doubling and, if you squint enough, “MetricStorage:AWS/EC2” also roughly doubles. This led me to believe that we have a custom metric which we write more often at the start of the month. If so, the sample count of that metric should be higher for these first few days.
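The same grouping can also be pulled out of the Cost Explorer API instead of the console. Below is a minimal sketch, assuming Python with boto3 and a client created via `boto3.client("ce")`. The function name `daily_cost_by_operation` is made up for illustration; the request fields are those of the `GetCostAndUsage` operation.

```python
from collections import defaultdict


def daily_cost_by_operation(ce_client, start, end, usage_type):
    """Return {date: {operation: cost}} for a single usage type.

    ce_client is assumed to be a boto3 Cost Explorer client.
    start and end are ISO dates ("YYYY-MM-DD"); end is exclusive.
    """
    request = {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        # Same as "group by API operation" in the console.
        "GroupBy": [{"Type": "DIMENSION", "Key": "OPERATION"}],
        # Only look at the usage type that stood out.
        "Filter": {"Dimensions": {"Key": "USAGE_TYPE", "Values": [usage_type]}},
    }
    costs = defaultdict(dict)
    while True:
        response = ce_client.get_cost_and_usage(**request)
        for day in response["ResultsByTime"]:
            date = day["TimePeriod"]["Start"]
            for group in day["Groups"]:
                operation = group["Keys"][0]
                amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
                costs[date][operation] = amount
        # Results may be paginated; follow the token until it is gone.
        token = response.get("NextPageToken")
        if not token:
            return dict(costs)
        request["NextPageToken"] = token
```

Called with the usage type “EUC1-CW:MetricMonitorUsage” and a date range spanning a month boundary, the result can be compared day by day to see which operations jump during the first five days.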

Seditious Sample Count

We want to check the sample count of each CloudWatch metric. A little script lists all metrics, then fetches and accumulates their sample counts. With a CSV in hand, I headed over to ObservableHQ to do some data exploration. A heat map should highlight whether we write a given metric more often on certain days. Here is a small excerpt:
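The script itself is not shown here, so what follows is only a minimal sketch of what such a script could look like, assuming Python with boto3 and a client created via `boto3.client("cloudwatch")`. The function names and the CSV columns are made up for illustration. It lists every metric, asks for the daily `SampleCount` statistic, and writes one row per metric and day.

```python
import csv


def collect_sample_counts(cw_client, start, end):
    """Yield (namespace, metric, dimensions, date, sample_count) tuples.

    cw_client is assumed to be a boto3 CloudWatch client.
    start and end are timezone-aware datetimes.
    """
    paginator = cw_client.get_paginator("list_metrics")
    for page in paginator.paginate():
        for metric in page["Metrics"]:
            stats = cw_client.get_metric_statistics(
                Namespace=metric["Namespace"],
                MetricName=metric["MetricName"],
                Dimensions=metric["Dimensions"],
                StartTime=start,
                EndTime=end,
                Period=86400,  # one datapoint per day
                Statistics=["SampleCount"],
            )
            # Flatten the dimensions into one CSV-friendly string.
            dims = ";".join(
                f'{d["Name"]}={d["Value"]}' for d in metric["Dimensions"]
            )
            # Datapoints are not guaranteed to arrive in time order.
            points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
            for point in points:
                yield (
                    metric["Namespace"],
                    metric["MetricName"],
                    dims,
                    point["Timestamp"].date().isoformat(),
                    int(point["SampleCount"]),
                )


def write_csv(rows, out):
    """Write the rows to an open text file, with a header line."""
    writer = csv.writer(out)
    writer.writerow(["namespace", "metric", "dimensions", "date", "sample_count"])
    writer.writerows(rows)
```

One row per metric and day is the "long" format that a heat map (metric on one axis, date on the other, sample count as colour) can consume directly. Note that this makes one `GetMetricStatistics` call per metric, so it can take a while on accounts with many metrics.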

There are a few funny patterns, but nothing pointing to the first five days of the month. I also tried using small multiples, but they show pretty much the same picture. Which isn't all that surprising, of course.

Closing Thoughts

That's where I stopped. We still don't have any idea what's going on here. Even if you look at an hourly billing chart, the costs double exactly for the first five days, from 00:00 on day one until 23:59 on day five. That contradicts the assumption of a volume discount kicking in a few days into the month. The extra cost does not even account for 1% of our full bill, so it doesn't make sense to invest more time into this investigation. :-(
