
The Senior Dev Made Our System 30% Faster While I Watched in Disbelief (Kafka + Elasticsearch)

Backend · September 8, 2024 · 8 min read

Our data ingestion was basically trash. Like, embarrassingly bad. We'd been processing messages one by one like it's 2010, and our Elasticsearch was crying every time we sent it a single document. The whole team knew it was slow, but we just... kept shipping features on top of this mess.

Then one of our senior engineers took a look. Took him maybe 10 minutes to spot what we'd all missed for months.

The problem was staring us in the face the whole time. Every single message from Kafka was getting processed individually. Every document was hitting Elasticsearch solo. Like ordering one item at a time from Amazon instead of filling up your cart. We were literally doing the most inefficient thing possible.

Here's what he found:

  1. Kafka Consumption: Concurrency set to 1. Why? Nobody knew. We were processing messages like we were afraid of parallelism.
  2. Elasticsearch Writes: Single document writes everywhere. The bulk API was just sitting there, unused, probably judging us.
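To make the cost of the anti-pattern concrete, here's a toy simulation (no real Kafka or Elasticsearch involved, and the function names are made up for illustration) that just counts network round trips: one request per document versus one bulk request per batch.

```python
def index_one_by_one(docs):
    """Simulate single-document writes: one round trip per document."""
    round_trips = 0
    for _ in docs:
        round_trips += 1  # one HTTP request per document
    return round_trips

def index_in_batches(docs, batch_size=10_000):
    """Simulate bulk writes: one round trip per full-or-partial batch."""
    return -(-len(docs) // batch_size)  # ceiling division

print(index_one_by_one(range(10_000_000)))  # -> 10000000
print(index_in_batches(range(10_000_000)))  # -> 1000
```

Same 10 million documents, four orders of magnitude fewer requests. Every one of those requests carries fixed overhead (connection handling, request parsing, an index refresh cycle), which is exactly what the bulk API exists to amortize.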

I watched him open the Kafka and Elasticsearch docs, spend maybe 20 minutes reading, then just... fix it. No drama, no complicated refactor. Just did what the docs said we should've done from day one:

  • Kafka side: Bumped up to 10k messages per batch, cranked concurrency to 10. Suddenly processing 22k messages/second per pod.
  • Elasticsearch side: Built a proper batching layer that collects documents before sending them to ES using the bulk API. You know, what it's actually designed for.
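The batching layer itself is conceptually tiny. Here's a minimal sketch of the idea, with assumptions labeled: the class name, the 10k threshold, and the injected `flush` callable are all illustrative. In production the flush callable would wrap Elasticsearch's `_bulk` endpoint (e.g. via `elasticsearch.helpers.bulk`); injecting it here lets the buffering logic stand on its own.

```python
from typing import Any, Callable

class BulkBuffer:
    """Collects documents and flushes them in batches.

    The flush callable receives the full batch; in real use it would
    send one _bulk request to Elasticsearch. Threshold is illustrative.
    """

    def __init__(self, flush: Callable[[list[Any]], None], max_docs: int = 10_000):
        self._flush = flush
        self._max_docs = max_docs
        self._buf: list[Any] = []

    def add(self, doc: Any) -> None:
        self._buf.append(doc)
        if len(self._buf) >= self._max_docs:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, including a partial final batch."""
        if self._buf:
            self._flush(self._buf)
            self._buf = []

# Usage with a stub sink: 25,000 docs at batch size 10,000 -> 3 bulk calls.
batches: list[list[Any]] = []
buf = BulkBuffer(batches.append, max_docs=10_000)
for i in range(25_000):
    buf.add({"id": i})
buf.flush()  # drain the partial final batch
print([len(b) for b in batches])  # -> [10000, 10000, 5000]
```

One real-world wrinkle this sketch skips: you'd also want a time-based flush (e.g. every few seconds) so a slow trickle of messages doesn't sit in the buffer forever waiting for the count threshold.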

The results made me feel stupid and impressed at the same time:

  • 10 million records went from 12+ hours to under 1 hour
  • System throughput up 30%+ across the board
  • Infrastructure costs down because we went from 64 pods to 16 (finance team actually sent a thank you email)

The part that stung? This wasn't some genius optimization or obscure trick. It's literally in the first chapter of both docs. "Use batching for high throughput." We'd just never bothered to read it properly. We were so busy building features, we never stopped to ask if we were doing the basics right.

Watching someone senior work is humbling. They don't do magic - they just actually read the documentation and understand the fundamentals. The solution was right under our noses the whole time. We just needed someone to point at it and say "hey, maybe try doing it the way it's designed to be used?"

Now whenever something feels slow, the first thing I do is check if I'm batching. Usually I'm not. Usually it's obvious once someone points it out.
