Kafka in Production: Scaling, Failures, and Real Systems

Part 3 — Partitions, producers, consumers, retries, monitoring, and production mistakes to avoid.

1. Topic & Partition Strategy (Most Important Decision)

Partitioning is the most important design decision in Kafka, because:

  • Partition = unit of parallelism
  • Partition = unit of ordering
  • Partition = unit of scaling

How to choose partition count

You are deciding: max consumer parallelism, future scale, ordering guarantees, open file handles, and rebalance time.

A practical formula:

Partitions = max throughput needed / throughput per partition
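As a worked example of the formula (the numbers are illustrative, not a recommendation):

```go
package main

import "fmt"

// partitionsNeeded applies the sizing formula from the text:
// max throughput needed divided by throughput per partition.
func partitionsNeeded(targetMBps, perPartitionMBps float64) int {
	return int(targetMBps / perPartitionMBps)
}

func main() {
	// Illustrative numbers: 100 MB/s target, ~10 MB/s sustained per partition.
	needed := partitionsNeeded(100, 10)
	fmt.Println("partitions needed:", needed) // 10

	// Leave headroom for growth; doubling is a common rule of thumb.
	fmt.Println("with headroom:", needed*2) // 20
}
```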

When you are still sizing a topic, these bands are a reasonable place to start:

  • Medium traffic: 12–24 partitions
  • High throughput: 50–200 partitions
  • Very high throughput: 500+ partitions

More partitions are not free. Each extra partition adds:

  • Rebalance cost — more assignments to shuffle when consumers join, leave, or crash
  • File handles — more segment files for brokers to manage

So partition count is always a trade-off: more scalability and parallelism on one side, more operational complexity on the other.

Key-based partitioning

If you need ordering per user, order, or account:

partition = hash(key) % num_partitions

Examples: user_id → ordering per user; order_id → per order; account_id → per account. Without a key, records are spread across partitions (round robin or sticky batching): better load balancing, but no ordering guarantee.
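To make the formula concrete, here is a minimal sketch of key-based partitioning. Real clients use their own hash function (the Java client uses murmur2), so `fnv` here is purely illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor illustrates partition = hash(key) % num_partitions.
// The same key always maps to the same partition, which is what
// gives Kafka its per-key ordering guarantee.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}

func main() {
	for _, key := range []string{"user-123", "user-456", "user-123"} {
		fmt.Printf("%s -> partition %d\n", key, partitionFor(key, 12))
	}
}
```

Note that changing num_partitions changes the mapping for every key, which is why growing a keyed topic later breaks ordering assumptions.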

Hot partition problem

If one key is very frequent, one partition gets all traffic, one consumer is overloaded, and lag increases.

Solutions: composite key (user_id + shard), more partitions, or key bucketing.
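A sketch of key bucketing for a hot key. Bucket count and the key format are assumptions; note the trade-off: ordering now holds only within a bucket, not across all of a user's messages.

```go
package main

import (
	"fmt"
	"math/rand"
)

// bucketKey spreads one hot key across numBuckets partition keys by
// appending a random shard suffix. Trade-off: per-user ordering is
// weakened to per-bucket ordering.
func bucketKey(userID string, numBuckets int) string {
	return fmt.Sprintf("%s-%d", userID, rand.Intn(numBuckets))
}

func main() {
	// The same hot user now produces several distinct partition keys,
	// so its traffic lands on several partitions instead of one.
	for i := 0; i < 4; i++ {
		fmt.Println(bucketKey("user-123", 8))
	}
}
```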

2. Producer Design (This Is Where Many Systems Break)

Producer configuration determines data loss, duplicates, throughput, and latency.

  • acks = all: ensure replication
  • retries = high: handle transient failures
  • enable.idempotence = true: avoid duplicates
  • linger.ms = 5–20ms: enable batching
  • compression = snappy/lz4: reduce network usage
  • batch.size = larger: improve throughput
  • max.in.flight = 1–5: ordering vs throughput

Go producer example (franz-go)

Install: go get github.com/twmb/franz-go/pkg/kgo

package main

import (
    "context"
    "log"
    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.ProducerBatchCompression(kgo.SnappyCompression()),
        kgo.RequiredAcks(kgo.AllISRAcks()),
        // Idempotency is enabled by default in franz-go; it can be
        // disabled with kgo.DisableIdempotentWrite().
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    topic := "orders"

    record := &kgo.Record{
        Topic: topic,
        Key:   []byte("user-123"),
        Value: []byte("order-created"),
    }

    client.Produce(context.Background(), record, func(_ *kgo.Record, err error) {
        if err != nil {
            log.Println("produce error:", err)
        } else {
            log.Println("message produced")
        }
    })

    client.Flush(context.Background())
}

Important: the idempotent producer prevents duplicates caused by retries; acks=all ensures replication; compression improves throughput; the key ensures per-user ordering.

3. Consumer Design (Where Most Bugs Happen)

Consumers must handle crashes, rebalancing, duplicate messages, long processing, retries, and DLQs.

  • At most once: may lose messages
  • At least once: may duplicate messages
  • Exactly once: very complex, rarely needed

Most real systems use at least once + idempotent consumer — the most practical design.
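"Idempotent consumer" can be as simple as remembering which message IDs have already been processed. A minimal in-memory sketch (a real system would use a persistent store, e.g. a database unique constraint, so state survives restarts):

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup tracks already-processed message IDs so redelivered messages
// (at-least-once delivery) affect application state only once.
type Dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDedup() *Dedup {
	return &Dedup{seen: make(map[string]bool)}
}

// Process runs fn only the first time id is seen; duplicates are skipped.
func (d *Dedup) Process(id string, fn func()) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[id] {
		return false // duplicate delivery: skip
	}
	d.seen[id] = true
	fn()
	return true
}

func main() {
	d := NewDedup()
	for _, id := range []string{"order-1", "order-2", "order-1"} {
		if !d.Process(id, func() { fmt.Println("processing", id) }) {
			fmt.Println("skipping duplicate", id)
		}
	}
}
```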

Go consumer example (franz-go)

package main

import (
    "context"
    "log"
    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.ConsumerGroup("order-processors"),
        kgo.ConsumeTopics("orders"),
        // Auto-commit is enabled by default; disable it so offsets are
        // committed manually, only after processing.
        kgo.DisableAutoCommit(),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    ctx := context.Background()

    for {
        fetches := client.PollFetches(ctx)

        if errs := fetches.Errors(); len(errs) > 0 {
            for _, e := range errs {
                log.Println("fetch error:", e)
            }
            continue
        }

        fetches.EachRecord(func(record *kgo.Record) {
            log.Printf("topic=%s partition=%d offset=%d value=%s\n",
                record.Topic, record.Partition, record.Offset, string(record.Value))
            // Process message here
        })

        client.CommitUncommittedOffsets(ctx)
    }
}

Flow: poll → process → commit. Commit after processing — otherwise you can lose messages.

4. Multiple Consumer Groups

One of Kafka’s most powerful features:

  • order-processors: processes orders
  • analytics: builds analytics
  • fraud-detection: detects fraud
  • notifications: sends notifications

Each group reads the same topic, has its own offsets, and works independently. That is why Kafka is not just a queue.

5. Scaling Consumers (Kubernetes / ECS)

This is where real-world systems struggle.

  • Consumers ≤ partitions: consumers beyond the partition count sit idle
  • Partitions > consumers: good; leaves headroom to scale
  • Too many consumers: rebalance storms
  • Frequent scaling: constant rebalancing, little processing

The rebalance problem

When a new consumer joins, Kafka pauses consumers, reassigns partitions, then resumes. If autoscaling happens frequently, the system spends more time rebalancing than processing.

Solutions: static group membership, cooperative rebalancing, lag-based autoscaling (not CPU-based).

Kafka consumers should scale on lag, not CPU.
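With franz-go, the first two mitigations are client options. A configuration sketch, reusing the broker and group names from the earlier examples (the instance ID value is an example; in Kubernetes the pod name is a common choice):

```go
package main

import (
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumerGroup("order-processors"),
		kgo.ConsumeTopics("orders"),
		// Static membership: a stable instance ID lets a restarted
		// consumer reclaim its partitions without a full rebalance.
		kgo.InstanceID("order-processor-0"),
		// Cooperative rebalancing: only reassigned partitions pause,
		// instead of the whole group stopping.
		kgo.Balancers(kgo.CooperativeStickyBalancer()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```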

6. Retry & DLQ Pattern

Do not spin on the same topic forever. Give failures a dedicated path so the primary topic keeps moving and you never wedge a partition on poison messages.

A common pattern (topic names are examples):

orders → orders.retry → orders.dlq

What happens:

  • Message fails on the primary consumer → send a copy to the retry topic
  • Retry worker applies backoff and a cap — still bad? → send to the DLQ
  • DLQ is where humans or tools inspect, fix data, or replay without blocking live traffic

What you avoid: blocking partitions, tight retry loops on the hot path, and unbounded replays that starve healthy work.
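The routing decision itself is simple. A sketch with illustrative topic names and limits (neither is a convention Kafka enforces):

```go
package main

import (
	"fmt"
	"time"
)

const maxRetries = 3 // illustrative cap

// nextTopic decides where a failed message goes next:
// the retry topic while attempts remain, the DLQ afterwards.
func nextTopic(attempts int) string {
	if attempts < maxRetries {
		return "orders.retry"
	}
	return "orders.dlq"
}

// backoff returns an exponential delay per attempt: 1s, 2s, 4s, ...
func backoff(attempt int) time.Duration {
	return time.Duration(1<<attempt) * time.Second
}

func main() {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		fmt.Printf("attempt %d -> %s (wait %s)\n",
			attempt, nextTopic(attempt), backoff(attempt))
	}
}
```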

7. Monitoring & Alerts (Production Must Have)

  • Consumer lag: the most important metric
  • Broker disk usage: prevent disk-full outages
  • Under-replicated partitions: signals broker failure
  • ISR shrink: replication issue
  • Rebalance count: scaling problem
  • Messages in/out: traffic baseline
  • Request latency: broker health

If you monitor only one thing, monitor consumer lag. Lag tells you consumers are slow, crashed, under-provisioned, hitting a hot partition, or downstream is slow.
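Lag per partition is just an offset difference; a minimal sketch:

```go
package main

import "fmt"

// lag reports how many messages a consumer group still has to read on one
// partition: the broker's log-end offset minus the group's committed offset.
func lag(logEndOffset, committedOffset int64) int64 {
	return logEndOffset - committedOffset
}

func main() {
	fmt.Println("lag:", lag(1_000_000, 999_950)) // 50 messages behind
}
```

Total group lag is this value summed over all partitions; a single partition with outsized lag is the signature of a hot key.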

8. Common Production Mistakes

  • Too few partitions: cannot scale out
  • Too many partitions: slow rebalances
  • Auto commit enabled: data loss on crash
  • No idempotency: duplicate data
  • Large messages: slow brokers
  • Retrying in the same topic: infinite loops
  • CPU-based autoscaling: rebalance storms
  • No DLQ: stuck system
  • No monitoring: surprise failures

9. Production Checklist

  • Topic: partition count and replication factor chosen deliberately
  • Producer: acks=all, idempotence, retries
  • Consumer: manual commits after processing
  • Retry: dedicated retry topic
  • DLQ: present
  • Monitoring: lag, ISR, disk
  • Scaling: based on lag, not CPU
  • Ordering: key-based partitioning
  • Load test: done
  • Alerts: configured

Conclusion — Running Kafka at Scale

Running Kafka in production is not just about producing and consuming messages. It is about understanding partitions, scaling, failures, and consumer behavior.

Kafka is simple in concept: producers write to a log; consumers read from a log; the log is partitioned and replicated.

But running Kafka at scale requires understanding partition strategy, rebalancing, idempotency, retry patterns, consumer scaling, and monitoring.

If designed correctly, Kafka becomes the data backbone of your company. If designed poorly, Kafka becomes a distributed system that fails in creative ways.

Kafka is not hard because of Kafka. Kafka is hard because distributed systems are hard.