Kafka in Production: Scaling, Failures, and Real Systems

Part 3 — Partitions, producers, consumers, retries, monitoring, and production mistakes to avoid.

1. Topic & Partition Strategy (Most Important Decision)

Partitioning is the most important design decision in Kafka, because:

  • Partition = unit of parallelism
  • Partition = unit of ordering
  • Partition = unit of scaling

How to choose partition count

You are deciding: max consumer parallelism, future scale, ordering guarantees, open file handles, and rebalance time.

A practical formula:

Partitions = max throughput needed / throughput per partition
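As a worked example of the formula (the numbers are illustrative, not a recommendation):

```go
package main

import "fmt"

// partitionsNeeded applies the sizing formula from the text:
// max throughput needed divided by throughput per partition.
func partitionsNeeded(targetMBps, perPartitionMBps float64) int {
	return int(targetMBps / perPartitionMBps)
}

func main() {
	// Illustrative numbers: 100 MB/s target, ~10 MB/s sustained per partition.
	needed := partitionsNeeded(100, 10)
	fmt.Println("partitions needed:", needed) // 10

	// Leave headroom for growth; doubling is a common rule of thumb.
	fmt.Println("with headroom:", needed*2) // 20
}
```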

When you are still sizing a topic, these bands are a reasonable place to start:

  • Medium traffic: 12–24 partitions
  • High throughput: 50–200 partitions
  • Very high throughput: 500+ partitions

More partitions are not free. Each extra partition adds:

  • Rebalance cost — more assignments to shuffle when consumers join, leave, or crash
  • File handles — more segment files for brokers to manage

So partition count is always a trade-off: more scalability and parallelism on one side, more operational complexity on the other.

Key-based partitioning

If you need ordering per user, order, or account:

partition = hash(key) % num_partitions

Examples: user_id → ordering per user; order_id → per order; account_id → per account. Without a key, records are spread across partitions (round robin or sticky batching): better load balancing, but no ordering guarantee.
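To make the formula concrete, here is a minimal sketch of key-based partitioning. Real clients use their own hash function (the Java client uses murmur2), so `fnv` here is purely illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor illustrates partition = hash(key) % num_partitions.
// The same key always maps to the same partition, which is what
// gives Kafka its per-key ordering guarantee.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}

func main() {
	for _, key := range []string{"user-123", "user-456", "user-123"} {
		fmt.Printf("%s -> partition %d\n", key, partitionFor(key, 12))
	}
}
```

Note that changing num_partitions changes the mapping for every key, which is why growing a keyed topic later breaks ordering assumptions.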

Hot partition problem

If one key is very frequent, one partition gets all traffic, one consumer is overloaded, and lag increases.

Solutions: composite key (user_id + shard), more partitions, or key bucketing.
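A sketch of key bucketing for a hot key. Bucket count and the key format are assumptions; note the trade-off: ordering now holds only within a bucket, not across all of a user's messages.

```go
package main

import (
	"fmt"
	"math/rand"
)

// bucketKey spreads one hot key across numBuckets partition keys by
// appending a random shard suffix. Trade-off: per-user ordering is
// weakened to per-bucket ordering.
func bucketKey(userID string, numBuckets int) string {
	return fmt.Sprintf("%s-%d", userID, rand.Intn(numBuckets))
}

func main() {
	// The same hot user now produces several distinct partition keys,
	// so its traffic lands on several partitions instead of one.
	for i := 0; i < 4; i++ {
		fmt.Println(bucketKey("user-123", 8))
	}
}
```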

2. Producer Design (This Is Where Many Systems Break)

Producer configuration determines data loss, duplicates, throughput, and latency.

  • acks = all: ensure replication
  • retries = high: handle transient failures
  • enable.idempotence = true: avoid duplicates
  • linger.ms = 5–20ms: enable batching
  • compression = snappy/lz4: reduce network usage
  • batch.size = larger: improve throughput
  • max.in.flight = 1–5: ordering vs throughput

Go producer example (franz-go)

Install: go get github.com/twmb/franz-go/pkg/kgo

package main

import (
    "context"
    "log"
    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.ProducerBatchCompression(kgo.SnappyCompression()),
        kgo.RequiredAcks(kgo.AllISRAcks()),
        // Idempotency is enabled by default in franz-go; it can be
        // disabled with kgo.DisableIdempotentWrite().
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    topic := "orders"

    record := &kgo.Record{
        Topic: topic,
        Key:   []byte("user-123"),
        Value: []byte("order-created"),
    }

    client.Produce(context.Background(), record, func(_ *kgo.Record, err error) {
        if err != nil {
            log.Println("produce error:", err)
        } else {
            log.Println("message produced")
        }
    })

    client.Flush(context.Background())
}

Important: the idempotent producer prevents duplicates caused by retries; acks=all ensures replication; compression improves throughput; the key ensures per-user ordering.

3. Consumer Design (Where Most Bugs Happen)

Consumers must handle crashes, rebalancing, duplicate messages, long processing, retries, and DLQs.

  • At most once: may lose messages
  • At least once: may duplicate messages
  • Exactly once: very complex, rarely needed

Most real systems use at least once + idempotent consumer — the most practical design.
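"Idempotent consumer" can be as simple as remembering which message IDs have already been processed. A minimal in-memory sketch (a real system would use a persistent store, e.g. a database unique constraint, so state survives restarts):

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup tracks already-processed message IDs so redelivered messages
// (at-least-once delivery) affect application state only once.
type Dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDedup() *Dedup {
	return &Dedup{seen: make(map[string]bool)}
}

// Process runs fn only the first time id is seen; duplicates are skipped.
func (d *Dedup) Process(id string, fn func()) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[id] {
		return false // duplicate delivery: skip
	}
	d.seen[id] = true
	fn()
	return true
}

func main() {
	d := NewDedup()
	for _, id := range []string{"order-1", "order-2", "order-1"} {
		if !d.Process(id, func() { fmt.Println("processing", id) }) {
			fmt.Println("skipping duplicate", id)
		}
	}
}
```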

Go consumer example (franz-go)

package main

import (
    "context"
    "log"
    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.ConsumerGroup("order-processors"),
        kgo.ConsumeTopics("orders"),
        // Auto-commit is enabled by default; disable it so offsets are
        // committed manually, only after processing.
        kgo.DisableAutoCommit(),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    ctx := context.Background()

    for {
        fetches := client.PollFetches(ctx)

        if errs := fetches.Errors(); len(errs) > 0 {
            for _, e := range errs {
                log.Println("fetch error:", e)
            }
            continue
        }

        fetches.EachRecord(func(record *kgo.Record) {
            log.Printf("topic=%s partition=%d offset=%d value=%s\n",
                record.Topic, record.Partition, record.Offset, string(record.Value))
            // Process message here
        })

        client.CommitUncommittedOffsets(ctx)
    }
}

Flow: poll → process → commit. Commit after processing — otherwise you can lose messages.

4. Multiple Consumer Groups

One of Kafka’s most powerful features:

  • order-processors: processes orders
  • analytics: builds analytics
  • fraud-detection: detects fraud
  • notifications: sends notifications

Each group reads the same topic, has its own offsets, and works independently. That is why Kafka is not just a queue.

5. Scaling Consumers (Kubernetes / ECS)

This is where real-world systems struggle.

  • Consumers ≤ partitions: consumers beyond the partition count sit idle
  • Partitions > consumers: good; leaves headroom to scale
  • Too many consumers: rebalance storms
  • Frequent scaling: constant rebalancing, little processing

The rebalance problem

When a new consumer joins, Kafka pauses consumers, reassigns partitions, then resumes. If autoscaling happens frequently, the system spends more time rebalancing than processing.

Solutions: static group membership, cooperative rebalancing, lag-based autoscaling (not CPU-based).

Kafka consumers should scale on lag, not CPU.
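With franz-go, the first two mitigations are client options. A configuration sketch, reusing the broker and group names from the earlier examples (the instance ID value is an example; in Kubernetes the pod name is a common choice):

```go
package main

import (
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumerGroup("order-processors"),
		kgo.ConsumeTopics("orders"),
		// Static membership: a stable instance ID lets a restarted
		// consumer reclaim its partitions without a full rebalance.
		kgo.InstanceID("order-processor-0"),
		// Cooperative rebalancing: only reassigned partitions pause,
		// instead of the whole group stopping.
		kgo.Balancers(kgo.CooperativeStickyBalancer()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```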

6. Retry & DLQ Pattern

Do not spin on the same topic forever. Give failures a dedicated path so the primary topic keeps moving and you never wedge a partition on poison messages.

A common pattern (topic names are examples):

orders → orders.retry → orders.dlq

What happens:

  • Message fails on the primary consumer → send a copy to the retry topic
  • Retry worker applies backoff and a cap — still bad? → send to the DLQ
  • DLQ is where humans or tools inspect, fix data, or replay without blocking live traffic

What you avoid: blocking partitions, tight retry loops on the hot path, and unbounded replays that starve healthy work.
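The routing decision itself is simple. A sketch with illustrative topic names and limits (neither is a convention Kafka enforces):

```go
package main

import (
	"fmt"
	"time"
)

const maxRetries = 3 // illustrative cap

// nextTopic decides where a failed message goes next:
// the retry topic while attempts remain, the DLQ afterwards.
func nextTopic(attempts int) string {
	if attempts < maxRetries {
		return "orders.retry"
	}
	return "orders.dlq"
}

// backoff returns an exponential delay per attempt: 1s, 2s, 4s, ...
func backoff(attempt int) time.Duration {
	return time.Duration(1<<attempt) * time.Second
}

func main() {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		fmt.Printf("attempt %d -> %s (wait %s)\n",
			attempt, nextTopic(attempt), backoff(attempt))
	}
}
```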

7. Monitoring & Alerts (Production Must Have)

  • Consumer lag: the most important metric
  • Broker disk usage: prevent disk-full outages
  • Under-replicated partitions: signals broker failure
  • ISR shrink: replication issue
  • Rebalance count: scaling problem
  • Messages in/out: traffic baseline
  • Request latency: broker health

If you monitor only one thing, monitor consumer lag. Lag tells you consumers are slow, crashed, under-provisioned, hitting a hot partition, or downstream is slow.
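Lag per partition is just an offset difference; a minimal sketch:

```go
package main

import "fmt"

// lag reports how many messages a consumer group still has to read on one
// partition: the broker's log-end offset minus the group's committed offset.
func lag(logEndOffset, committedOffset int64) int64 {
	return logEndOffset - committedOffset
}

func main() {
	fmt.Println("lag:", lag(1_000_000, 999_950)) // 50 messages behind
}
```

Total group lag is this value summed over all partitions; a single partition with outsized lag is the signature of a hot key.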

8. Common Production Mistakes

  • Too few partitions: cannot scale out
  • Too many partitions: slow rebalances
  • Auto commit enabled: data loss on crash
  • No idempotency: duplicate data
  • Large messages: slow brokers
  • Retrying in the same topic: infinite loops
  • CPU-based autoscaling: rebalance storms
  • No DLQ: stuck system
  • No monitoring: surprise failures

9. Production Checklist

  • Topic: partition count and replication factor chosen deliberately
  • Producer: acks=all, idempotence, retries
  • Consumer: manual commits after processing
  • Retry: dedicated retry topic
  • DLQ: present
  • Monitoring: lag, ISR, disk
  • Scaling: based on lag, not CPU
  • Ordering: key-based partitioning
  • Load test: done
  • Alerts: configured

Conclusion — Running Kafka at Scale

Running Kafka in production is not just about producing and consuming messages. It is about understanding partitions, scaling, failures, and consumer behavior.

Kafka is simple in concept: producers write to a log; consumers read from a log; the log is partitioned and replicated.

But running Kafka at scale requires understanding partition strategy, rebalancing, idempotency, retry patterns, consumer scaling, and monitoring.

If designed correctly, Kafka becomes the data backbone of your company. If designed poorly, Kafka becomes a distributed system that fails in creative ways.

Kafka is not hard because of Kafka. Kafka is hard because distributed systems are hard.