Kafka in Production: Scaling, Failures, and Real Systems
Part 3 — Partitions, producers, consumers, retries, monitoring, and production mistakes to avoid.
1. Topic & Partition Strategy (Most Important Decision)
Partitioning is the most important design decision in Kafka, because:
- Partition = unit of parallelism
- Partition = unit of ordering
- Partition = unit of scaling
How to choose partition count
You are deciding: max consumer parallelism, future scale, ordering guarantees, open file handles, and rebalance time.
A practical formula:
Partitions = max throughput needed / throughput per partition
For example, a target of 300 MB/s with partitions that each sustain roughly 10 MB/s suggests about 30 partitions, plus headroom for growth.
When you are still sizing a topic, these bands are a reasonable place to start:
| Workload | Typical starting range |
|---|---|
| Medium traffic | 12–24 partitions |
| High throughput | 50–200 partitions |
| Very high throughput | 500+ partitions |
More partitions are not free. Each extra partition adds:
- Rebalance cost — more assignments to shuffle when consumers join, leave, or crash
- File handles — more segment files for brokers to manage
- Failover work — more partition leaders to re-elect when a broker dies
So partition count is always a trade-off: more scalability and parallelism on one side, more operational complexity on the other.
Key-based partitioning
If you need ordering per user, order, or account:
partition = hash(key) % num_partitions
Examples: user_id → ordering per user; order_id → per order; account_id → per account. Without a key → round robin → better load balancing but no ordering.
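The key-to-partition mapping can be sketched as follows. This is a simplified illustration using FNV-1a; real Kafka clients use murmur2 hashing, but the property that matters is the same: equal keys always land on the same partition.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor mimics key-based partitioning: the same key always maps
// to the same partition, which is what preserves per-key ordering.
// (Illustrative only: real clients hash with murmur2, not FNV-1a.)
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	// Same key → same partition on every call.
	fmt.Println(partitionFor("user-123", 12))
	fmt.Println(partitionFor("user-123", 12)) // identical to the line above
}
```

Note the consequence: changing `num_partitions` later remaps keys to different partitions, which is one reason to size the topic generously up front.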
Hot partition problem
If one key is very frequent, one partition gets all traffic, one consumer is overloaded, and lag increases.
Solutions: composite key (user_id + shard), more partitions, or key bucketing.
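Key bucketing can be sketched as a small helper (the key format and bucket count here are illustrative): a hot key is split into N derived keys, spreading its traffic over N partitions at the cost of strict per-key ordering.

```go
package main

import (
	"fmt"
	"math/rand"
)

// bucketedKey spreads a single hot key across numBuckets derived keys by
// appending a random shard suffix. Load is balanced, but messages for the
// original key may now interleave across partitions, so use this only
// where strict per-key ordering is not required.
func bucketedKey(key string, numBuckets int) string {
	return fmt.Sprintf("%s-%d", key, rand.Intn(numBuckets))
}

func main() {
	// "user-123" now produces to one of 8 derived keys, e.g. "user-123-3".
	fmt.Println(bucketedKey("user-123", 8))
}
```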
2. Producer Design (This Is Where Many Systems Break)
Producer configuration determines data loss, duplicates, throughput, and latency.
| Setting | Value | Why |
|---|---|---|
| acks | all | Wait for the full ISR before acknowledging |
| retries | high | Survive transient broker failures |
| enable.idempotence | true | Avoid duplicates from producer retries |
| linger.ms | 5–20 | Enable batching |
| compression.type | snappy / lz4 | Reduce network usage |
| batch.size | larger than default | Improve throughput |
| max.in.flight.requests.per.connection | ≤ 5 | Preserves ordering when idempotence is enabled |
Go producer example (franz-go)
Install: go get github.com/twmb/franz-go/pkg/kgo
```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ProducerBatchCompression(kgo.SnappyCompression()),
		kgo.RequiredAcks(kgo.AllISRAcks()),
		// franz-go enables idempotent production by default;
		// no extra option is needed as long as acks=all is kept.
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	record := &kgo.Record{
		Topic: "orders",
		Key:   []byte("user-123"), // key → ordering per user
		Value: []byte("order-created"),
	}

	client.Produce(context.Background(), record, func(_ *kgo.Record, err error) {
		if err != nil {
			log.Println("produce error:", err)
		} else {
			log.Println("message produced")
		}
	})
	client.Flush(context.Background())
}
```

Important: the idempotent producer avoids duplicates from retries; acks=all ensures replication; compression improves throughput; the key ensures ordering per user.
3. Consumer Design (Where Most Bugs Happen)
Consumers must handle crashes, rebalancing, duplicate messages, long processing, retries, and DLQs.
| Type | Meaning |
|---|---|
| At most once | May lose messages |
| At least once | May duplicate messages |
| Exactly once | Very complex, rare |
Most real systems use at-least-once delivery combined with an idempotent consumer: duplicates are accepted on the wire and neutralized at processing time. It is the most practical design.
Go consumer example (franz-go)
```go
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumerGroup("order-processors"),
		kgo.ConsumeTopics("orders"),
		kgo.DisableAutoCommit(), // commit manually, only after processing
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	for {
		fetches := client.PollFetches(ctx)
		if errs := fetches.Errors(); len(errs) > 0 {
			for _, e := range errs {
				log.Println("fetch error:", e)
			}
			continue
		}
		fetches.EachRecord(func(record *kgo.Record) {
			log.Printf("topic=%s partition=%d offset=%d value=%s",
				record.Topic, record.Partition, record.Offset, string(record.Value))
			// Process the message here.
		})
		if err := client.CommitUncommittedOffsets(ctx); err != nil {
			log.Println("commit error:", err)
		}
	}
}
```

Flow: poll → process → commit. Commit only after processing succeeds; committing first (or relying on auto-commit) acknowledges messages you may never process, so a crash loses them.
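At-least-once delivery means the processing step must tolerate replays. A minimal in-memory dedupe sketch follows; in production the seen-set would live in a database or cache keyed by a business ID, not in process memory.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentProcessor remembers which message IDs it has already handled,
// so a redelivered message becomes a no-op instead of a duplicate side effect.
type IdempotentProcessor struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewIdempotentProcessor() *IdempotentProcessor {
	return &IdempotentProcessor{seen: make(map[string]bool)}
}

// Process runs handle exactly once per msgID. It returns true if the
// message was handled, false if it was a duplicate delivery.
func (p *IdempotentProcessor) Process(msgID string, handle func()) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.seen[msgID] {
		return false // already processed: skip side effects
	}
	handle()
	p.seen[msgID] = true
	return true
}

func main() {
	proc := NewIdempotentProcessor()
	count := 0
	proc.Process("order-1", func() { count++ })
	proc.Process("order-1", func() { count++ }) // redelivery: ignored
	fmt.Println(count)                          // 1
}
```

With this in place, the at-least-once commit-after-processing loop above can redeliver freely without corrupting downstream state.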
4. Multiple Consumer Groups
One of Kafka’s most powerful features:
| Consumer group | What it does |
|---|---|
| order-processors | Processes orders |
| analytics | Builds analytics |
| fraud-detection | Detects fraud |
| notifications | Sends notifications |
Each group reads the same topic, has its own offsets, and works independently. That is why Kafka is not just a queue.
5. Scaling Consumers (Kubernetes / ECS)
This is where real-world systems struggle.
| Rule | Why |
|---|---|
| Consumers ≤ partitions | Extra consumers beyond the partition count sit idle |
| Partitions > consumers | Leaves headroom to scale out later |
| Too many consumers | Triggers rebalance storms |
| Frequent scaling | Constant rebalances, little processing |
The rebalance problem
When a new consumer joins, Kafka pauses consumers, reassigns partitions, then resumes. If autoscaling happens frequently, the system spends more time rebalancing than processing.
Solutions: static group membership, cooperative rebalancing, lag-based autoscaling (not CPU-based).
Kafka consumers should scale on lag, not CPU.
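Lag itself is simple arithmetic: per partition, the log-end offset (latest produced) minus the group's committed offset. A sketch of how an autoscaler might aggregate it (the offset values are hypothetical):

```go
package main

import "fmt"

// PartitionLag is how far a consumer group is behind on one partition:
// log-end offset minus committed offset, floored at zero.
func PartitionLag(logEndOffset, committedOffset int64) int64 {
	lag := logEndOffset - committedOffset
	if lag < 0 {
		return 0 // a stale end-offset read can briefly trail the commit
	}
	return lag
}

// TotalLag sums per-partition lag across a topic. A lag-based autoscaler
// compares this number against a threshold instead of looking at CPU.
func TotalLag(endOffsets, committed map[int32]int64) int64 {
	var total int64
	for partition, end := range endOffsets {
		total += PartitionLag(end, committed[partition])
	}
	return total
}

func main() {
	end := map[int32]int64{0: 1000, 1: 500}
	committed := map[int32]int64{0: 900, 1: 500}
	fmt.Println(TotalLag(end, committed)) // 100
}
```

In practice the offsets come from the broker (e.g. via an admin client or an exporter feeding your metrics system); the scaling decision is the same comparison.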
6. Retry & DLQ Pattern
Do not spin on the same topic forever. Give failures a dedicated path so the primary topic keeps moving and you never wedge a partition on poison messages.
A common pattern (topic names are examples):
```
┌─────────────┐   fail   ┌─────────────────┐   fail   ┌─────────────┐
│   orders    │ ───────► │  orders-retry   │ ───────► │ orders-dlq  │
│  (primary)  │          │ (delay/backoff) │          │ (dead path) │
└─────────────┘          └─────────────────┘          └─────────────┘
```
What happens:
- Message fails on the primary consumer → send a copy to the retry topic
- Retry worker applies backoff and a cap — still bad? → send to the DLQ
- DLQ is where humans or tools inspect, fix data, or replay without blocking live traffic
What you avoid: blocking partitions, tight retry loops on the hot path, and unbounded replays that starve healthy work.
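The routing decision in this pattern reduces to a pure function over a retry counter (typically carried as a record header). Topic names and the max-attempts cap below are illustrative:

```go
package main

import "fmt"

const maxRetries = 3 // illustrative cap before giving up

// nextTopic decides where a failed message goes next: back to the retry
// topic while attempts remain, otherwise to the DLQ for human inspection.
func nextTopic(attempt int) string {
	if attempt < maxRetries {
		return "orders-retry"
	}
	return "orders-dlq"
}

func main() {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		fmt.Printf("attempt %d → %s\n", attempt, nextTopic(attempt))
	}
	// attempts 0–2 route to orders-retry; attempt 3 routes to orders-dlq
}
```

The retry worker increments the counter each time it re-produces the message, so the cap is enforced even across consumer restarts.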
7. Monitoring & Alerts (Production Must Have)
| Metric | Why |
|---|---|
| Consumer lag | Most important |
| Broker disk usage | Prevent disk full |
| Under-replicated partitions | Broker failure |
| ISR shrink | Replication issue |
| Rebalance count | Scaling problem |
| Messages in/out | Traffic |
| Request latency | Broker health |
If you monitor only one thing, monitor consumer lag. Lag tells you consumers are slow, crashed, under-provisioned, hitting a hot partition, or downstream is slow.
8. Common Production Mistakes
| Mistake | Result |
|---|---|
| Too few partitions | Cannot scale |
| Too many partitions | Rebalance slow |
| Auto commit enabled | Data loss |
| No idempotency | Duplicate data |
| Large messages | Broker slow |
| Retry in same topic | Infinite loop |
| CPU-based autoscaling | Rebalance storms |
| No DLQ | Stuck system |
| No monitoring | Surprise failures |
9. Production Checklist
| Area | Checklist |
|---|---|
| Topic | Partitions, replication factor |
| Producer | acks=all, idempotent, retries |
| Consumer | Manual commit |
| Retry | Retry topic |
| DLQ | Present |
| Monitoring | Lag, ISR, disk |
| Scaling | Based on lag |
| Ordering | Key-based partition |
| Load test | Done |
| Alerts | Configured |
Conclusion — Running Kafka at Scale
Running Kafka in production is not just about producing and consuming messages. It is about understanding partitions, scaling, failures, and consumer behavior.
Kafka is simple in concept: producers write to a log; consumers read from a log; the log is partitioned and replicated.
But running Kafka at scale requires understanding partition strategy, rebalancing, idempotency, retry patterns, consumer scaling, and monitoring.
If designed correctly, Kafka becomes the data backbone of your company. If designed poorly, Kafka becomes a distributed system that fails in creative ways.
Kafka is not hard because of Kafka. Kafka is hard because distributed systems are hard.