diff --git a/.gitignore b/.gitignore
index b92b781b..8eac2024 100644
--- a/.gitignore
+++ b/.gitignore
@@ -7,4 +7,9 @@ vendor
 *.iml
 */bootstrap/
 _config-overrides.yml
-.ruby-version
\ No newline at end of file
+.ruby-version
+.DS_Store
+.op/
+_data/documentation/*-SNAPSHOT.yaml
+_data/release/*-SNAPSHOT.yaml
+download/*-SNAPSHOT/
\ No newline at end of file
diff --git a/_posts/2026-05-28-benchmarking-the-proxy.md b/_posts/2026-05-28-benchmarking-the-proxy.md
new file mode 100644
index 00000000..869babc0
--- /dev/null
+++ b/_posts/2026-05-28-benchmarking-the-proxy.md
@@ -0,0 +1,195 @@
+---
+layout: post
+title: "Does my proxy look big in this cluster?"
+date: 2026-05-28 02:30:00 +0000
+author: "Sam Barker"
+author_url: "https://github.com/SamBarker"
+categories: benchmarking performance
+---
+
+Every good benchmarking story starts with a hunch. Mine was that Kroxylicious is cheap to run (I'd stake my career on it, in fact), but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.
+
+There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which translates, from polite engineering into plain English, as: "is this thing going to slow down my Kafka?" We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us.
+
+So we stopped saying "it depends". We built something you can run **yourselves**, on your own infrastructure with your own workload, and used it to measure the proxy. Here are some representative numbers from our cluster.
+
+**TL;DR**:
+- A passthrough proxy adds negligible overhead: publish latency impact is below measurement noise, E2E adds ~2 ms at moderate topic counts, and throughput is unaffected
+- Add record encryption and expect a ~25% throughput reduction; at comfortable rates, E2E latency stays within measurement noise and publish latency adds up to ~10 ms
+- The throughput ceiling scales linearly with CPU: budget ~25 mc (millicores) per MB/s of total proxy traffic (conservative; a companion post, coming soon, has the full sizing formula)
+- The full benchmark harness is open source: run it on your own cluster for numbers that reflect your workload
+
+## What we measured
+
+We ran three scenarios against the same Apache Kafka® cluster on the same hardware:
+
+- **Baseline**: producers and consumers talking directly to Kafka, no proxy in the path
+- **Passthrough proxy**: traffic routed through Kroxylicious with no filter chain configured
+- **Record encryption**: traffic through Kroxylicious with AES-256-GCM record encryption enabled, using HashiCorp Vault as the KMS
+
+We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in a companion engineering post, coming soon.
+
+## Test environment
+
+No, we didn't run this on a laptop. It's a realistic deployment: an 11-node OpenShift cluster (8 workers, 3 masters) on Fyre, IBM's internal cloud platform, which gives us a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit. The cluster is sized so that the Kafka brokers, the proxy, and the benchmark workers each run on separate nodes, ensuring traffic crosses real network links rather than looping back on the same host.
+
+| Component | Details |
+|-----------|---------|
+| CPU | AMD EPYC-Rome, 2 GHz |
+| Memory | 16 GiB per node |
+| Cluster | 11-node OpenShift 4.21 (8 workers, 3 masters), RHCOS 9.6 |
+| Kafka | 3-broker Strimzi 0.51.0 (Kafka 3.9) cluster, replication factor 3 |
+| Kroxylicious | 0.21.0, single proxy pod |
+| KMS | HashiCorp Vault 2.0.0 (in-cluster) |
+
+The primary workload used 1 topic, 1 partition, 1 KB messages. We chose single-partition deliberately: it concentrates all traffic on one broker, so you hit ceilings quickly and any proxy overhead is easy to isolate. We also ran 10-topic and 100-topic workloads to make sure the results hold when load is spread more realistically across brokers.
+
+One important caveat: this Kafka cluster is deliberately untuned. We're not trying to squeeze every message-per-second out of Kafka; we're using it as a fixed baseline to measure what the proxy adds on top. Kafka experts will find obvious headroom to improve on our baseline numbers; that's fine and expected. The deltas are what matter here, not the absolutes.
+
+---
+
+## The passthrough proxy: negligible overhead
+
+Good news first. The proxy itself, with no filter chain and just routing traffic, adds almost nothing. The tables below show the baseline and the passthrough proxy side by side.
+
+A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within, meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters.
+
+Two latency metrics appear in the tables. **Publish latency** is measured from the record's intended send time, as dictated by the target producer rate, to when the producer receives the broker's acknowledgement. That means it captures any producer-side delay (backpressure, client queuing, batch accumulation) alongside the network round-trip and ISR replication (we run with `acks=all`).
**End-to-end (E2E) latency** is measured from that same intended send time to when the consumer receives the record, adding consumer-side fetch batching on top of everything publish latency already covers.
+
+### 10 topics, 1 partition each, 1 KB messages at 50,000 msg/s (50 MB/s)
+
+At moderate topic counts, traffic is concentrated enough that proxy overhead is more visible.
+
+| Metric | Baseline | Proxy (no filters) |
+|--------|----------|--------------------|
+| Publish latency avg | 4.3 ms | 4.5 ms (+0.2 ms) |
+| Publish latency p99 | 22.4 ms | 19.6 ms (−2.7 ms) |
+| E2E latency avg | 96.9 ms | 99.0 ms (+2.1 ms) |
+| E2E latency p99 | 193 ms | 190 ms (−3 ms) |
+| Throughput | 50,000 msg/s | 50,000 msg/s |
+
+*Negative deltas for publish latency are within measurement noise: they indicate the proxy is indistinguishable from baseline, not that it improves latency.*
+
+The passthrough proxy is not adding measurable per-record overhead at this rate. The E2E average overhead of +2.1 ms is statistically significant (p<0.001) but practically negligible for any sizing decision.
+
+### 100 topics, 1 partition each, 1 KB messages at 50,000 msg/s (50 MB/s)
+
+At higher topic counts, the same total load is spread across more partitions and brokers. The proxy does identical work per record regardless, because there is no cross-partition coordination. The point of this table is simply to confirm the pattern holds when load is distributed more broadly.
+
+| Metric | Baseline | Proxy (no filters) |
+|--------|----------|--------------------|
+| Publish latency avg | 2.9 ms | 4.1 ms (+1.2 ms) |
+| Publish latency p99 | 6.4 ms | 8.1 ms (+1.7 ms) |
+| E2E latency avg | 256.7 ms | 254.6 ms (−2.1 ms) |
+| E2E latency p99 | 502 ms | 501 ms (−1 ms) |
+| Throughput | 50,000 msg/s | 50,000 msg/s |
+
+Publish latency overhead is statistically significant at 100 topics (proxy-no-filters p99 +27%, p<0.001).
But publish latency at 500 msg/s per topic is a small fraction of E2E, and the E2E picture is what operators care about: differences are within measurement noise.
+
+### 1 topic, 1 partition, 1 KB messages at 10,100 msg/s (10 MB/s)
+
+With all traffic on a single topic and partition, Kafka is under the most concentrated load: every record contends for the same broker, the same partition, and the same ISR replication round-trip. The proxy still doesn't register.
+
+| Metric | Baseline | Proxy (no filters) |
+|--------|----------|--------------------|
+| Publish latency avg | 7.2 ms | 6.7 ms (−0.5 ms) |
+| Publish latency p99 | 12.6 ms | 11.8 ms (−0.7 ms) |
+| E2E latency avg | 13.0 ms | 13.5 ms (+0.5 ms) |
+| E2E latency p99 | 21.0 ms | 21.0 ms (0 ms) |
+| Throughput | 10,100 msg/s | 10,100 msg/s |
+
+**The headline: negligible passthrough overhead, throughput unaffected.**
+
+What did I take away from this? We replaced a hunch with data. The remarkable part: the proxy is doing this at Layer 7. Most proxies operate on Kafka at Layer 4, shuffling bytes without ever understanding what those bytes mean. Kroxylicious works at Layer 7, parsing every Kafka message, yet still adds only a few milliseconds at the E2E average. That's the design working.
+
+The overhead staying flat across 1, 10, and 100 topics makes sense for the same reason: the proxy doesn't contend between topics. Think of the proxy as independent circuits on a distribution board: switching the breaker for lights doesn't cut power to the fridge. A Kafka broker is more like the mains supply itself: every circuit draws from the same source, so heavy load anywhere reduces what's available everywhere. In the proxy, topics don't contend for shared resources: proxy overhead scales linearly across them, and this data supports it.
+
+---
+
+## Record encryption: now we're doing real work
+
+Ok, so let's make the proxy smarter and have it do something people actually care about!
[Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to parse each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact on latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful, because it's dominated by the throughput limits and their erratic impact on the latency of individual requests (a big hello to batching and buffering effects).
+
+### Latency at sub-saturation rates
+
+So we know encryption is doing a lot of work, but to find out the real impact we need to compare it to a plain Kafka cluster (and yes, people do run Kroxylicious without filters, for TLS termination, stable client endpoints, or virtual clusters, but that's a different post). The table below tells us that above a certain inflection point the numbers get really, really noisy, especially in the p99 range.
+
+**1 topic, 1 KB messages, baseline vs encryption (selected rates from the rate sweep):**
+
+| Rate | Metric | Baseline | Encryption | Delta |
+|------|--------|----------|------------|-------|
+| 14,300 msg/s | Publish avg | 5.4 ms | 7.6 ms | +2.2 ms (+41%) |
+| 14,300 msg/s | Publish p99 | 16.3 ms | 19.2 ms | +2.9 ms (+18%) |
+| 17,100 msg/s | Publish avg | 6.3 ms | 8.9 ms | +2.6 ms (+41%) |
+| 17,100 msg/s | Publish p99 | 12.5 ms | 21.9 ms | +9.4 ms (+75%) |
+| 18,500 msg/s | Publish avg | 10.5 ms | 13.7 ms | +3.2 ms (+30%) |
+| 18,500 msg/s | Publish p99 | 22.0 ms | 106.0 ms | +84.0 ms (+382%) |
+
+The table shows encryption's p99 spiking sharply at 18,500 msg/s, but that ~18k figure is roughly where the forwarding proxy itself saturates (close to the bare Kafka baseline of ~19,400). Encryption gives out earlier. The rate sweep finds exactly where.
+
+### Throughput ceiling
+
+A rate sweep is exactly what it sounds like: pick a starting rate, let OMB run long enough to get a stable measurement, then step up by a fixed increment and repeat until the system can't keep up. We defined "can't keep up" as the sustained throughput dropping by more than 5% below the target rate; at that point, something has saturated.
+
+We stepped up from 8k to 22k msg/s in 700 msg/s increments. The results:
+
+- **Baseline**: sustained up to ~19,400 msg/s (the ceiling at RF=3 on our test cluster)
+- **Encryption**: sustained up to **~14,600 msg/s**, then started intermittently saturating
+- **Cost: approximately 25% fewer messages per second per partition**
+
+The transition wasn't a clean cliff edge: the proxy alternated between sustaining and saturating in a narrow band just above the ceiling. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Stay below 14k and you're fine.
Creep above it and you'll notice. The numbers are not absolute: they are just what we measured on our cluster, and your mileage **will vary**.
+
+### The ceiling scales with CPU budget
+
+The fact that the proxy adds little latency didn't surprise me, but this did, and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.
+
+The single-producer ceiling at RF=3 is Kafka-limited, not proxy-limited: the ISR replication round-trip caps single-partition throughput regardless of how much CPU the proxy has. The proxy still had meaningful headroom: we ran four producers and aggregate throughput climbed higher, while proxy CPU sat at 570m/1000m. The proxy wasn't the constraint.
+
+To find the proxy's real ceiling, you need a workload that doesn't hit the Kafka partition limit first: RF=1, spread across multiple topics. With that workload, the ceiling is squarely in the proxy, and it scales linearly with CPU. The mechanism: the CPU limit controls `availableProcessors()`, which controls how many Netty event loop threads the proxy creates. More threads, more concurrent connections handled in parallel, higher aggregate ceiling.
+
+**The practical implication**: the throughput ceiling is not a fixed number; it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. A companion engineering post, coming soon, has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits.
+
+---
+
+## Sizing guidance
+
+Numbers without guidance aren't very useful, so here's how to translate these results into pod specs.
+
+**Passthrough proxy**: size your Kafka cluster as you normally would.
The proxy won't be the bottleneck, but if you want to verify that on your own hardware, the rate sweep (which steps the producer rate up incrementally until the system can't keep up) is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers.
+
+**With filters (record encryption is the representative example here):**
+
+1. **Throughput budget**: record encryption, among the most CPU-intensive filters we can imagine, imposes a CPU-driven throughput ceiling. As a planning formula:
+
+   > **`CPU (mc) = k × (P + N × C)`**
+   >
+   > where *mc* = millicores (the Kubernetes CPU unit; 1,000 mc = 1 core), *k* = sizing coefficient (mc/MB/s), *P* = produce throughput (MB/s), *N* = number of consumer groups, *C* = consume throughput per group (MB/s)
+
+   On our hardware (AMD EPYC-Rome 2 GHz with AES-NI), we measured *k* = 25 mc/MB/s on a 10-topic workload with record encryption. That is a conservative estimate: more realistic deployments with 100+ topics show *k* = 4–8 mc/MB/s, roughly 3–6× lower. Simpler filters will be cheaper still. *k* is measured from real workloads, so measure your own throughput and validate the coefficient on your own hardware. The companion post (coming soon) has the full coefficient grid across topic counts and core allocations.
+
+   *1:1 (100k msg/s at 1 KB, 1 consumer group)*: k=25, P=100, N=1, C=100 → 25 × (100 + 1 × 100) = 5,000m (~5 cores)
+
+   *Fan-out (same rate, 3 consumer groups)*: k=25, P=100, N=3, C=100 → 25 × (100 + 3 × 100) = 10,000m (~10 cores)
+
+   Not running on Kubernetes? Divide the result by 1,000 to get the number of cores to allocate to the proxy process.
+
+2. **Latency budget**: well below saturation, expect 2–3 ms additional average publish latency and up to ~15 ms additional p99. The overhead scales with how hard you're pushing: give yourself headroom and you'll barely notice it.
+
+3. **Scaling**: set `requests` equal to `limits` in your pod spec. This makes the CPU budget deterministic, which makes the throughput ceiling predictable. To increase throughput, raise the CPU limit. For redundancy, add proxy pods.
+
+4. **KMS overhead**: DEK caching means Vault isn't on the hot path for every record. Our tests triggered only 5–19 DEK generation calls per benchmark run. The KMS is not the thing to worry about.
+
+---
+
+## Caveats and next steps
+
+These are real results from real hardware, but they don't tell the story of your workload. A few things worth knowing before you put these numbers in a slide deck:
+
+- **Sub-saturation assumed**: all results assume the system is operating below its throughput ceiling, meaning both the proxy's ceiling and Kafka's own replication limits. Above either, queueing and batching effects dominate and the numbers in this post no longer apply. A companion post, coming soon, explains how to identify where those ceilings are.
+- **Message size**: all results use 1 KB messages. The coefficient is message-size-dependent; encryption overhead as a percentage is likely lower for larger messages.
+- **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod horizontal scaling hasn't been measured but is expected to follow the same coefficient.
+- **Memory**: the workloads tested here are CPU-bound before they become memory-bound. We kept container memory settings consistent across all runs (2 Gi request / 4 Gi limit at the pod level) and memory was never the constraint. If you're running larger messages or larger batches, revisit this assumption.
+
+The engineering story (why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way) is in a companion post, coming soon.
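To make the 5% rule from the throughput-ceiling section concrete, here is a minimal Python sketch of the sweep logic. It is an illustration under stated assumptions, not the harness's actual code: `find_ceiling` and the `measure` callback are hypothetical names, and `measure` stands in for a full OMB run at one target rate. The sketch stops at the first saturated step, whereas our real runs alternated between sustaining and saturating near the limit.

```python
from typing import Callable, Optional


def find_ceiling(
    measure: Callable[[int], float],
    start: int = 8_000,
    stop: int = 22_000,
    step: int = 700,
    tolerance: float = 0.05,
) -> Optional[int]:
    """Return the highest target rate (msg/s) sustained before the first
    saturated step, or None if even the starting rate saturates.

    `measure` is a hypothetical callback: it takes a target rate and
    returns the achieved throughput in msg/s.
    """
    last_sustained = None
    for target in range(start, stop + 1, step):
        achieved = measure(target)
        # Saturated: achieved throughput is more than `tolerance` below target.
        if achieved < target * (1 - tolerance):
            break
        last_sustained = target
    return last_sustained


# Toy stand-in for a real benchmark run: a system with a hard cap.
def capped_at_14600(target: int) -> float:
    return float(min(target, 14_600))


print(find_ceiling(capped_at_14600))  # 15000
```

The toy run reports 15,000 rather than 14,600 because the 15,000 msg/s step still achieves more than 95% of its target. That is a property of the 5% tolerance and the 700 msg/s step size, so treat a sweep result as a band rather than an exact figure.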
+
+The full benchmark suite, quickstart guide, and sizing reference are in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious).
diff --git a/_sass/kroxylicious.scss b/_sass/kroxylicious.scss
index 88ba940a..7edfcbdb 100644
--- a/_sass/kroxylicious.scss
+++ b/_sass/kroxylicious.scss
@@ -355,6 +355,25 @@ b.conum * {
   margin-bottom: calc(var(--#{$prefix}card-title-spacer-y) / 2);
 }
 
+.card-text {
+  table {
+    width: auto;
+    max-width: 100%;
+    margin-bottom: 1rem;
+    border-collapse: collapse;
+
+    th, td {
+      padding: 0.5rem 1.25rem;
+      border: 1px solid var(--bs-border-color);
+    }
+
+    thead tr {
+      border-bottom: 2px solid var(--bs-border-color);
+      background-color: var(--bs-tertiary-bg);
+    }
+  }
+}
+
 // Documentation page filtering
 .doc-filters {
   display: flex;
diff --git a/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html b/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html
new file mode 100644
index 00000000..89215a71
--- /dev/null
+++ b/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html
@@ -0,0 +1,28030 @@

+[Interactive async-profiler flame graph omitted (HTML page and embedded profile data). Title: `encryption/1topic-1kb_2026-04-20T11:38:12Z`. Legend: Kernel, Native, C++ (VM), Java compiled, Java compiled by C1, Inlined, Interpreted.]
diff --git a/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html b/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html
new file mode 100644
index 00000000..c921470d
--- /dev/null
+++ b/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html
@@ -0,0 +1,15382 @@

+[Interactive async-profiler flame graph omitted (HTML page and embedded profile data). Title: `proxy-no-filters/1topic-1kb_2026-04-15T21:44:15Z`. Legend: Kernel, Native, C++ (VM), Java compiled, Java compiled by C1, Inlined, Interpreted.]
diff --git a/overview.markdown b/overview.markdown
index 8af9ae22..b6b42abb 100644
--- a/overview.markdown
+++ b/overview.markdown
@@ -66,5 +66,10 @@ Kroxylicious is careful to decode only the Kafka RPCs that the filters actually
 interested in a particular RPC, its bytes will pass straight through Kroxylicious. This approach helps keep Kroxylicious fast.
-The actual performance overhead of using Kroxylicious depends on the particular use-case.
+The actual performance overhead of using Kroxylicious depends on the particular use-case. As a guide:
+
+- **Passthrough proxy (no filters)**: ~0.2 ms additional average publish latency, no throughput impact
+- **Record encryption (AES-256-GCM)**: ~25% throughput reduction per partition; 15–40 ms additional p99 latency at sub-saturation rates
+
+See the [performance reference page]({{ '/performance/' | absolute_url }}) for full benchmark results, methodology, and sizing guidance.
diff --git a/performance.markdown b/performance.markdown
new file mode 100644
index 00000000..b6f24869
--- /dev/null
+++ b/performance.markdown
@@ -0,0 +1,105 @@
+---
+layout: overview
+title: Performance
+permalink: /performance/
+toc: true
+---
+
+This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop. It's a realistic deployment: an 8-node OpenShift cluster (5 workers, 3 masters) on Fyre, IBM's internal cloud platform, which gives us a controlled environment.
+
+## Test environment
+
+| Component | Details |
+|-----------|---------|
+| CPU | AMD EPYC-Rome, 2 GHz |
+| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 |
+| Kafka | 3-broker Strimzi cluster, replication factor 3 |
+| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
+| KMS | HashiCorp Vault (in-cluster) |
+
+All primary results used 1 KB messages on a single partition. Multi-topic workloads (10 and 100 topics) confirmed that overhead characteristics hold when load is distributed.
+
+---
+
+## Passthrough proxy (no filters)
+
+The proxy layer itself adds negligible overhead. At sub-saturation rates the additional latency is sub-millisecond on average, with no measurable throughput impact.
+
+**10 topics, 1 KB messages (5,000 msg/s per topic):**
+
+| Metric | Baseline | Proxy | Delta |
+|--------|----------|-------|-------|
+| Publish latency avg | 2.62 ms | 2.79 ms | +0.17 ms (+7%) |
+| Publish latency p99 | 14.09 ms | 15.17 ms | +1.08 ms (+8%) |
+| E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) |
+| Publish rate | 5,002 msg/s | 5,002 msg/s | no change |
+
+**100 topics, 1 KB messages (500 msg/s per topic):**
+
+| Metric | Baseline | Proxy | Delta |
+|--------|----------|-------|-------|
+| Publish latency avg | 2.66 ms | 2.82 ms | +0.16 ms (+6%) |
+| Publish latency p99 | 5.54 ms | 6.07 ms | +0.53 ms (+10%) |
+| Publish rate | 500 msg/s | 500 msg/s | no change |
+
+---
+
+## Record encryption (AES-256-GCM)
+
+Encryption adds measurable but predictable overhead. The cost scales with producer rate: well below saturation the overhead is small; approaching the saturation point, latency rises sharply.
+
+### Latency at sub-saturation rates
+
+**1 topic, 1 KB messages, baseline vs encryption:**
+
+| Rate | Metric | Baseline | Encryption | Delta |
+|------|--------|----------|------------|-------|
+| 34,000 msg/s | Publish avg | 8.00 ms | 8.19 ms | +0.19 ms (+2%) |
+| 34,000 msg/s | Publish p99 | 48.65 ms | 64.01 ms | +15.35 ms (+32%) |
+| 36,000 msg/s | Publish avg | 9.38 ms | 10.46 ms | +1.08 ms (+12%) |
+| 36,000 msg/s | Publish p99 | 63.92 ms | 88.98 ms | +25.06 ms (+39%) |
+| 37,200 msg/s | Publish avg | 9.12 ms | 12.19 ms | +3.07 ms (+34%) |
+| 37,200 msg/s | Publish p99 | 74.88 ms | 113.15 ms | +38.27 ms (+51%) |
+
+### Throughput ceiling
+
+| Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) |
+|----------|------------------------------------------------|
+| Baseline (direct Kafka) | ~19,400 msg/s |
+| Encryption (proxy + AES-256-GCM) | ~14,600 msg/s |
+| **Cost** | **~25% fewer messages per second per partition** |
+
+---
+
+## Sizing guidance
+
+Numbers without guidance aren't very useful, so here's how to translate these results into pod specs.
+
+**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy will not be the bottleneck.
+
+**With record encryption:**
+
+- **Throughput**: use `CPU (mc) = 10 × total proxy throughput (MB/s)` where total = produce MB/s + each consumer group's consume MB/s. For 1:1 produce:consume this simplifies to `20 × produce MB/s`. Add ×1.3 headroom. Measured on AMD EPYC-Rome 2 GHz with AES-NI; calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB, 1 consumer group = 200 MB/s total → 2000m + headroom → ~2600m.
+- **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate
+- **Scaling**: set `requests` equal to `limits` in your pod spec to make the CPU budget, and therefore the throughput ceiling, deterministic. Increase the CPU limit to raise throughput; add proxy pods for redundancy.
+- **KMS**: DEK caching means the KMS is not on the hot path. In testing, each benchmark run triggered only 5–19 DEK generation calls, so the KMS is not a bottleneck
+
+---
+
+## Caveats
+
+These numbers come from a single proxy pod, 1 KB messages, and single-pass measurements. A few things that matter when applying them to your workload:
+
+- **Message size**: the sizing coefficient is message-size-dependent; encryption overhead as a percentage is likely lower for larger messages
+- **Replication factor**: the 1-topic latency and ceiling results ran at RF=3; at that replication factor Kafka's ISR replication creates a per-partition ceiling that sits close to where proxy CPU saturates. The sizing coefficient was derived from RF=1 multi-topic workloads to isolate proxy CPU
+- **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod scaling hasn't been measured but is expected to follow the same coefficient
+
+The [engineering post](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) has the full methodology detail.
+
+---
+
+## Further reading
+
+- [Operator guide: results, methodology, and sizing recommendations](/blog/2026/05/21/benchmarking-the-proxy/): the full benchmark story for operators
+- [How hard can it be??? Maxing out a Kroxylicious instance](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/): how we measured it, where the CPU goes, and what surprised us
+- [Benchmark quickstart](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks/QUICKSTART.md): run the benchmarks yourself
\ No newline at end of file
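The throughput rule under "Sizing guidance" on the performance page can be sketched as a small Python helper. This is a planning aid under stated assumptions, not part of Kroxylicious: the function name `proxy_cpu_millicores` and its parameter names are hypothetical, and the coefficient of 10 mc per MB/s and the ×1.3 headroom are the values quoted on that page for its test hardware, so recalibrate both for your own cluster.

```python
def proxy_cpu_millicores(
    k: float,
    produce_mb_s: float,
    consumer_groups: int,
    consume_mb_s_per_group: float,
    headroom: float = 1.0,
) -> int:
    """Estimate the proxy CPU budget in millicores.

    Total proxy traffic is the produce rate plus each consumer group's
    consume rate; the budget is k (mc per MB/s) times that total,
    optionally scaled by a headroom factor.
    """
    total_mb_s = produce_mb_s + consumer_groups * consume_mb_s_per_group
    return round(k * total_mb_s * headroom)


# Worked example from the page: 100k msg/s at 1 KB, one consumer group.
base = proxy_cpu_millicores(10, 100, 1, 100)
with_headroom = proxy_cpu_millicores(10, 100, 1, 100, headroom=1.3)
print(base, with_headroom)  # 2000 2600
```

Setting both the pod's CPU `requests` and `limits` to the resulting figure keeps the budget deterministic, as the scaling bullet recommends. Each additional consumer group adds its full consume rate to the total, so fan-out workloads grow the budget quickly.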