Make it local
2020-05-19
So you have a service which does some data processing. It receives input messages, maybe via HTTP, maybe from something like Kafka. And it produces output messages, maybe as an HTTP response, maybe into another Kafka topic, or a database. The input messages are very heterogeneous: some big, some small. Some messages can be dropped entirely, some require heavy lifting to process.
This service has been running for quite some time, but recently it started to slow down. Maybe the data volume increased, or those messages requiring serious compute power arrive more frequently now. Either way, it’s time to dig in and figure out how to speed things up, explore those bottlenecks and drill them wide open. You could just throw more hardware at the issue and be done with it, for a while at least. But there will always be that nagging feeling, that stinging sensation: how can we make this fast?
That’s the moment you fire up your IDE and, for the twentieth time this year, figure out how this fricking profiler works again. And that’s when you hit the hurdle: the service can’t be run locally. Things are bad when you find yourself in a situation like this. No corners should have been cut when it comes to the local setup. Your service needs special IAM permissions? Figure out how you can get those on your machine. The service needs database access? Figure out how you can connect to your development database from your machine.
You might think some metrics pushed into a Prometheus cluster will help you with your performance issues. They might, on a high level and over the long term. But unless you write an insane amount of metrics, which might be a performance issue in itself, or instrument the hell out of your service with tracing, nothing is going to beat starting the service locally and attaching an actual profiler to it. With flame graphs. And call trees.
I discovered that it’s worth going beyond just starting a service locally. It’s worth a lot to be able to isolate your hot code, your core pipeline, whatever does the processing of those messages, as a piece of standalone code. And if that piece of code is agnostic of the input format, you’re golden. Dumping a bunch of input messages straight from a file into that pipeline over and over again, while making some fixes here and some tweaks there, is an empowering concept. You not only get reproducible results, you might even be able to use data resembling your actual live traffic. That’s when you really get to know the nooks and crannies of the implementation. That seemingly innocuous regex, which definitely should be super fast? Yep, that’s eating up 10% of your CPU time, because the service keeps recompiling it multiple times for each message although it’s actually a constant expression. (Yes, that happened. I’m not proud.)
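To make that concrete, here’s a minimal Python sketch of the idea — the message format, the `process_*` functions, and the replay loop are all illustrative assumptions, not the actual service: feed the same captured messages through the pipeline in a loop so a profiler gets reproducible numbers, and note the regex pitfall in both its broken and fixed form. (Python’s `re` module caches compiled patterns internally, so the per-call cost there is a cache lookup; in many languages and regex libraries the compilation really is repeated in full.)

```python
import re
import time

def process_slow(msg):
    # The trap: building the regex inside the hot path, once per message,
    # even though the pattern is a constant expression.
    pattern = re.compile(r"status=(\d+)")
    m = pattern.search(msg)
    return m.group(1) if m else None

# The fix: hoist the constant pattern out of the hot path, compiled once.
STATUS_RE = re.compile(r"status=(\d+)")

def process_fast(msg):
    m = STATUS_RE.search(msg)
    return m.group(1) if m else None

def run_pipeline(messages, process, rounds=100):
    """Replay the same batch of messages repeatedly and return elapsed
    seconds -- a standalone harness a profiler can be attached to."""
    start = time.perf_counter()
    for _ in range(rounds):
        for msg in messages:
            process(msg)
    return time.perf_counter() - start

if __name__ == "__main__":
    # In practice you would dump these from live traffic into a file
    # and read them back here.
    messages = ["user=alice status=200 bytes=5120",
                "user=bob status=404 bytes=0"] * 500
    print("slow:", run_pipeline(messages, process_slow))
    print("fast:", run_pipeline(messages, process_fast))
```

Because the harness takes the processing function as a parameter, you can diff two implementations on identical input without touching Kafka, HTTP, or the database.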
The point is: invest in being able to run your code locally. The cloud is fun and all, but your provider is going to be really happy if you buy bigger machines instead of making your code more performant.