
How one bad message can crash your entire Kafka pipeline — and how to stop it

2024 · 5 min read

I found this out the hard way. During development of ARGUS, I sent a malformed JSON payload to the ingestion endpoint — a missing closing brace, something trivial. The Spring Kafka Streams consumer threw a deserialization exception, logged the error, and retried. Then retried again. Then again. The bad message sat at the head of the partition, and Kafka's delivery guarantee meant it would keep retrying forever — blocking every valid event behind it.

This is a Kafka Poison Pill. It's one of the most common ways a stream processing application becomes completely useless in production.

Why Kafka's guarantees work against you here

Kafka guarantees at-least-once delivery. If your consumer throws an exception processing a message, Kafka assumes the message wasn't processed and redelivers it. This is exactly what you want for transient failures — a network timeout, a momentary database hiccup. But for a permanently malformed message, it's a trap. The message will never be processable. The exception will always be thrown. The consumer will be stuck in an infinite crash loop.
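The crash loop is easy to reproduce without a broker. The sketch below is illustrative, not the Kafka API: it models a partition as a list and a consumer that only advances its offset after a successful process, which is the essence of at-least-once semantics. After a thousand polls it is still parked on the bad record.

```java
import java.util.List;

public class PoisonPillDemo {
    // Stand-in for deserialization: reject anything without a closing brace,
    // like the malformed payload that triggered this in the first place.
    public static void process(String payload) {
        if (!payload.startsWith("{") || !payload.endsWith("}")) {
            throw new IllegalArgumentException("malformed: " + payload);
        }
    }

    // At-least-once semantics: the offset only advances after a successful
    // process(), so a permanently bad record is redelivered on every poll.
    public static int runWithRetries(List<String> partition, int maxPolls) {
        int offset = 0;
        for (int poll = 0; poll < maxPolls && offset < partition.size(); poll++) {
            try {
                process(partition.get(offset));
                offset++; // "commit": move past this record
            } catch (IllegalArgumentException e) {
                // no commit, so the same record comes back on the next poll
            }
        }
        return offset;
    }

    public static void main(String[] args) {
        List<String> partition = List.of("{\"a\":1}", "{broken", "{\"b\":2}");
        System.out.println("stuck at offset " + runWithRetries(partition, 1000));
    }
}
```

Note that the valid record behind the bad one never gets processed: the poison pill doesn't just lose one message, it halts the whole partition.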

The naive fix that doesn't work

The obvious instinct is to wrap the consumer in a try-catch and swallow exceptions:

// DON'T do this
try {
    StreamEvent event = objectMapper.readValue(rawBytes, StreamEvent.class);
    processEvent(event);
} catch (JsonProcessingException e) {
    log.error("Bad message, skipping", e);
    // silently move on
}

This stops the crash loop, but you've now silently dropped data. In a telemetry platform, silent data loss is worse than a crash: at least a crash is visible.

The right approach: Dead Letter Queue

The correct solution is to route bad messages somewhere you can inspect them later, not discard them:

KStream<String, String> rawStream = builder.stream("raw-events");

rawStream
    .split()
    .branch(
        // malformed payloads are routed with their original bytes intact,
        // so the DLQ record can actually be inspected and replayed later
        (key, value) -> !isValidEvent(value),
        Branched.withConsumer(s -> s.to("raw-events-dlq"))
    )
    .defaultBranch(
        Branched.withConsumer(s -> s.to("valid-events"))
    );

private boolean isValidEvent(String value) {
    try {
        objectMapper.readValue(value, StreamEvent.class);
        return true;
    } catch (Exception e) {
        return false;
    }
}

Malformed messages go to raw-events-dlq. Valid messages proceed to the analytical topology. The primary pipeline never sees the bad data. The DLQ gives you a record of everything that failed — you can inspect it, fix the upstream producer, and replay the messages when the issue is resolved.
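The replay itself can be a one-off job: read every record from raw-events-dlq, re-validate it, and produce anything that now parses back onto raw-events. The filtering step is a pure function, sketched below. The parse check here is a stand-in brace-balance test rather than the real objectMapper.readValue, the method names are illustrative, and the actual Kafka consumer/producer wiring is left as comments; treat it as a sketch, not production code.

```java
import java.util.ArrayList;
import java.util.List;

public class DlqReplay {
    // Stand-in for objectMapper.readValue(value, StreamEvent.class):
    // treat a payload as valid if it starts with '{' and its braces balance.
    // A real job should reuse the same deserializer as the main pipeline.
    public static boolean nowParses(String payload) {
        int depth = 0;
        for (char c : payload.toCharArray()) {
            if (c == '{') depth++;
            if (c == '}') depth--;
            if (depth < 0) return false;
        }
        return depth == 0 && payload.startsWith("{");
    }

    // In the real job: consume raw-events-dlq with a KafkaConsumer, then
    // send each replayable payload back to raw-events with a KafkaProducer.
    public static List<String> replayable(List<String> dlq) {
        List<String> fixed = new ArrayList<>();
        for (String payload : dlq) {
            if (nowParses(payload)) fixed.add(payload);
        }
        return fixed;
    }
}
```

Records that still fail validation stay in the DLQ for a human to look at; only the ones the fixed producer would now emit correctly rejoin the pipeline.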

What this taught me about distributed systems

Failure modes in distributed systems are rarely dramatic. They're usually subtle — one bad actor degrading the whole system, a retry loop consuming resources invisibly, silent data loss that only shows up in dashboards weeks later. The discipline is designing every component with an explicit answer to the question: what happens when the input is wrong? Crashing is sometimes acceptable. Silently dropping data is rarely acceptable. Routing to a DLQ and continuing is almost always the right call.

Written by Basit Tijani. Find me on GitHub or LinkedIn.