Routing API by tombentley · Pull Request #70 · kroxylicious/design

tombentley · 2025-06-07T00:34:41Z

Proposes a Router plugin API for Kroxylicious that enables routing requests to multiple upstream Kafka clusters, unlocking use cases like union clusters, principal-aware routing, and topic splicing that the Filter API alone cannot address.

Key points:

Router / RouterFactory: new top-level plugin type (per-connection instances, ServiceLoader-discovered), analogous to Filter / FilterFactory
Routes: named pathways from a router to a target (cluster or another router), each with optional filter chains. Routes and routers form a validated DAG
Virtual node IDs: runtime-managed mapping that resolves the node ID collision problem when multiple clusters are presented as one. Routers work exclusively with virtual IDs
RouterContext: provides bootstrapNodeId() and sendRequestToNode() for issuing requests
RouterResult: sealed type (Completed, CompletedNoResponse, Disconnect) encoding outcomes as values rather than side-effects
Static routes: performance optimisation allowing opaque frame forwarding for API keys that don't need inspection
Configuration: new top-level clusterDefinitions and routerDefinitions lists; virtual clusters gain a target property (discriminated union of cluster/router); existing targetCluster deprecated but still supported
Fully backwards compatible — no impact on the Filter API or existing configurations

gunnarmorling

This looks great overall! A few comments and suggestions inline.

gunnarmorling · 2025-06-14T09:00:06Z

+This will prevent the possibility of a request getting stuck in a router loop. 
+
+In order for non-trivial router graphs to be useful, `Router` authors will need to follow the same principle of composition as `Filter` authors.
+That is, a `Router` implementation should only talk to its receivers, and not, for example, make direct connections of its own to a backing cluster.


I agree that routers should not make "out of bands" requests, but this doesn't follow from the previous "in order to be useful" to me. I mean, out of bands requests can be useful, but we're discouraging them for the sake of validation, error handling, etc., I suppose.

The idea is that all communication with the cluster should be exclusively through the router/filter graph. Routers and filters should not make their own TCP connections directly to brokers.
Routers and filters may originate their own requests through the FilterContext/RouterContext API.

Will reword.

I think this comment has been resolved already.

prabhamanepalli

I really like this proposal and ability to present multiple backing Kafka clusters as one single cluster to clients.

Can you please look into different types of routing other than identity/principal based routing.

Routing based on topic name, some embedded attribute in the header or payload itself.

I would like to know if the API being proposed is extensible for other types of routing other than principal based routing.

tombentley · 2025-07-16T05:01:26Z

@prabhamanepalli thanks for taking a look!

Can you please look into different types of routing other than identity/principal based routing.
Routing based on topic name, some embedded attribute in the header or payload itself.

Those should absolutely be possible, and I consider them in-scope for this proposal. The "Union cluster" idea alluded to in the text is really a special case of this, where the mapping between topics (or more generally named broker-side entities like consumer groups) is defined statically in the configuration file. More generally, that mapping could be supplied over the network (e.g. from a database, metadata store or whatever).

It's up to the router how each request gets routed. Most common would be routing a request to a single target cluster or next router. But I don't think the API should enforce that. If a router implementation wanted to route a produce request, for example, to multiple clusters then that should be possible at the API level, even if dual writes are not usually a good idea.

What's not in scope for this proposal currently is making the set of routers, or the set of target clusters, dynamic during the lifetime of the proxy process. Dynamic configuration update is a separate item on the backlog.

This introduces a proposal index (README.md) for the proposals directory to solve the problem of colliding proposal numbers. Currently, multiple open PRs have chosen proposal numbers that conflict with already-merged proposals. This creates confusion and makes it difficult to maintain stable identifiers for proposals - which is essential for productive discussion on the mailing list. The new process: 1. Authors create proposals with temporary filenames (nnn-*) 2. A separate PR to this README allocates the next sequential number 3. Authors rename their proposal once the number is allocated 4. The index is updated again when proposals are accepted This ensures: - Proposal numbers are allocated chronologically - No number collisions occur - Each proposal gets a stable identifier before mailing list discussion - Clear visibility of open vs. accepted proposals Current collisions resolved by this index: - PR kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#88, kroxylicious#89 all have numbers conflicting with accepted proposals and have been allocated new sequential numbers (016-024) Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>

This introduces a proposal index (README.md) for the proposals directory to solve the problem of colliding proposal numbers. Currently, multiple open PRs have chosen proposal numbers that conflict with already-merged proposals. This creates confusion and makes it difficult to maintain stable identifiers for proposals - which is essential for productive discussion on the mailing list. The new process: 1. Authors create proposals with temporary filenames (nnn-*) 2. A separate PR to this README allocates the next sequential number 3. Authors rename their proposal once the number is allocated 4. The index is updated again when proposals are accepted This ensures: - Proposal numbers are allocated chronologically - No number collisions occur - Each proposal gets a stable identifier before mailing list discussion - Clear visibility of open vs. accepted proposals Current collisions resolved by this index: - PR kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#88, kroxylicious#89 all have numbers conflicting with accepted proposals and have been allocated new sequential numbers (016-024) Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org>

This change simplifies the proposal numbering system by using PR numbers as proposal identifiers, eliminating number collisions and removing the need for a separate allocation process. Changes: - Simplified proposals/README.md to focus on author workflow - Removed index tables (directory listing serves as the index) - Streamlined instructions for creating and renaming proposals - Updated proposal template with workflow instructions - Require PR number in title format: # <PR#> - <Title> - Moved workflow instructions into comment block - Added GitHub workflow to automatically check proposal numbering - Validates both filename and title format - Updates PR description when proposal files don't match PR number - Provides exact commands to fix naming issues - Removes warning once corrected - Handles both added and renamed files - Runs on all PRs (ready for mandatory status check) - Added notification script for existing open PRs - After merge, run notify-open-prs.sh to ask authors to rebase - Workflow will automatically guide them through renaming - Updated with all current open proposal PRs (kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#85, kroxylicious#88, kroxylicious#93, kroxylicious#94, kroxylicious#96, kroxylicious#98, kroxylicious#99, kroxylicious#100, kroxylicious#101, kroxylicious#103) Proposals 001-019 retain their original numbers. Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org>

This change simplifies the proposal numbering system by using PR numbers as proposal identifiers, eliminating number collisions and removing the need for a separate allocation process. Changes: - Simplified proposals/README.md to focus on author workflow - Removed index tables (directory listing serves as the index) - Streamlined instructions for creating and renaming proposals - Updated proposal template with workflow instructions - Require PR number in title format: # <PR#> - <Title> - Moved workflow instructions into comment block - Added GitHub workflow to automatically check proposal numbering - Validates both filename and title format - Updates PR description when proposal files don't match PR number - Provides exact commands to fix naming issues - Removes warning once corrected - Handles both added and renamed files - Runs on all PRs (ready for mandatory status check) - Added notification script for existing open PRs - After merge, run notify-open-prs.sh to ask authors to rebase - Workflow will automatically guide them through renaming - Updated with all current open proposal PRs (#70, #82, #83, #85, #88, #93, #94, #96, #98, #99, #100, #101, #103) Proposals 001-019 retain their original numbers. Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org>

kidpollo · 2026-04-30T07:06:23Z

This feature is looking attractive for us so we can simplify our efforts around splitting clusters. Has work in this gone beyond this stage?

k-wall · 2026-05-18T16:44:20Z

I think this proposal is heading in the right direction.

Signed-off-by: Tom Bentley <tbentley@redhat.com>

Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com> Signed-off-by: Tom Bentley <tombentley@users.noreply.github.com>

Signed-off-by: Tom Bentley <tbentley@redhat.com>

Replace namedTargetCluster/router fields on VirtualCluster with a nested Target (reusing RouteDefinition.Target). The deprecated inline targetCluster field is retained for backward compatibility. Aligns with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Robert Young <robertyoungnz@gmail.com>

Signed-off-by: Tom Bentley <tbentley@redhat.com>

tombentley · 2026-06-03T00:32:28Z

@SamBarker

The pattern is: any fan-out router must understand every stateful protocol feature that might traverse it. The protocol's statefulness prevents the router from being the simple composed thing the DAG model promises.

That's what I mean by "half the model." The DAG handles where things go. It doesn't handle what the protocol requires when things go to multiple places simultaneously. If every fan-out router must absorb the full protocol's statefulness to be correct, the composability benefit is theoretical — the router is a monolith with a DAG-shaped hat on.

That's a fair criticism: In general you cannot compose arbitrary routers into a topology and be guaranteed to get something that works with the full breadth of the protocol.

That's simply unavoidable, not least because the protocol is not set in stone. Someone might propose a KIP which adds or changes a Kafka API in a way which 'breaks' a filter or a router. The API version negotiation part of the protocol gives us an escape hatch to avoid actually breaking protocol guarantees, at the cost of missing out on newer protocol versions.

Because it's unavoidable, I don't see this as a problem with the API being proposed here, but rather with some of the words being used to advocate for that API: I will tone it down a bit.

Perhaps there are alternatives where the runtime takes more responsibility of "knowing about" the Kafka protocol, and exposes narrower APIs to plugins, which in turn have less of an opportunity to break the protocol. But it's not clear to me what those APIs would be, and whether a good range of useful functionality could be expressed in terms of those APIs. In the absence of a concrete alternative, I think solving "just" the routing problem, and not the composability problem, is a reasonable compromise.

But the UX point you're making stands: how can a user compose some plugins and have some assurance that the end result is actually going to work?

There's an obvious analogy with code: how can a computer programmer compose the functions available in any given programming language and have some reassurance that the end result is actually going to work?

Following through on that analogy, we can ask "is there some way to annotate plugins so that their compatibility can be automatically validated by a computation on the annotations?" In other words, "can we have some kind of type checker for plugin composition?". Clearly the answer is "yes". It's merely a question of how quickly, and how far, we want to go down that road. In a Claude code session last week I came up with this, but that's just one point in a huge space of possible mitigations. A word of caution, however, Rice's theorem means that anything we do is unlikely to be entirely satisfactory. We can mitigate, but we cannot solve this problem.

Signed-off-by: Tom Bentley <tbentley@redhat.com>

robobario · 2026-06-03T02:03:32Z

+When a router presents multiple clusters to the client as a single virtual cluster, there is a collision problem: the client might receive node ID 0 in a `METADATA` response from one route and node ID 0 from another, with no way to distinguish them.
+
+The solution is _virtual node IDs_.
+The runtime assigns each `(route, target-cluster node ID)` pair a unique virtual node ID.


The runtime assigns each (route, target-cluster node ID) pair a unique virtual node ID.

I think these definitions might have existed before introduction of a Router DAG. I want to clarify what the virtual node IDs should be if we have.

cluster A with node ids 0,1 cluster B with node ids 0,1 router-inner with routes [route-A(0) -> cluster A, routeB(1) -> cluster B] router-outer with routes [route-C(2) -> route A, routeD(3) -> cluster A, routeE(4) -> cluster B] virtual cluster targets router-outer

Does each nested router have it's own virtual(downstream) to real(upstream) id mapping? So router-inner would map node ids [0-3] to the nodes of cluster A and cluster B, and router-outer would map the node ids emitted from inner-router [0-3] plus the nodes of cluster A and cluster B which it routes to directly. So it would emit virtual node ids [0-7]?

Aligns with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Robert Young <robertyoungnz@gmail.com>

Replace the marker RouterResult interface with a sealed type having three variants: Completed(Response), CompletedNoResponse(), and Disconnect(). Response delivery and disconnect move from side-effects on RouterContext to return values from Router.onRequest(). Reshape RouterContext: add bootstrapNodeId(route) for initial broker discovery, add route parameter to sendRequestToNode, remove sendRequest (no "send to any broker"), sendResponse, and disconnect. Add fromVirtual(route, virtualNodeId) overload to NodeIdMapping to support the route parameter on sendRequestToNode. Pre-register bootstrap virtual node IDs in RouterDispatchHandler so bootstrapNodeId returns a real virtual node ID that forwardToNode can resolve. Aligns with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Robert Young <robertyoungnz@gmail.com>

Aligns the top-level config property name with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Robert Young <robertyoungnz@gmail.com>

Replace flat targetCluster/router fields on RouteDefinition with a nested Target record containing a cluster/router discriminated union. Aligns route config with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Robert Young <robertyoungnz@gmail.com>

Replace namedTargetCluster/router fields on VirtualCluster with a nested Target (reusing RouteDefinition.Target). The deprecated inline targetCluster field is retained for backward compatibility. Aligns with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Robert Young <robertyoungnz@gmail.com>

k-wall · 2026-06-03T10:36:14Z


-**`bootstrapNodeId(route)`** returns the virtual node ID of the bootstrap broker for a named route.
-This is used to send the initial requests (e.g. `METADATA`) before the router has discovered the cluster's broker topology.
+**`bootstrapNodeId(route)`** returns the virtual node ID of a broker on the named route's cluster.


"returns the virtual node ID of a broker on the named route's cluster."

should this read?

returns a virtual node ID of a broker on the named route's cluster.

There's no need to imply that there is only one node which can function as a bootstrap, so yes.

I suppose we might want to think about whether this should return a single id (where presumably the runtime can randomise which one is returned in the case it knows of multiple bootstraps), or just return a list and let the router pick. I don't have a use case where letting the router pick is necessary, and giving router authors the opportunity to not randomize/round robin seems like it's worse than having the runtime do this.

k-wall · 2026-06-03T10:44:03Z

+**`sendRequestToNode(route, virtualNodeId, ...)`** sends a request to a specific broker, identified by the route and a virtual node ID.
+The runtime uses the route to determine which target cluster, and resolves the virtual node ID to a specific upstream broker address, opening a new connection if necessary.
+
+The router passes both the route (which it knows from its routing decision) and the virtual node ID (which it knows from METADATA or `bootstrapNodeId()`).


What about requests that aren't topic centric (ones where metadata doesn't help)? How will the router implement decide which virtual node ID to use.

Say I have a union-router. LIST_TOPICS needs to be fanned out to both routes. How does a router chose which virtual node? It could just call bootstrapNodeId() for each finger, but that may create a hotspot on the upstream brokers.

It's up to the router.

The API is providing bootstrapNodeId(), which functions like the bootstrap.servers configuration for a normal Kafka client.

The router may (but is not forced to) make an initial METADATA request (or it may simply observe a METADATA request coming from the client) to discover the other brokers in the cluster. For requests which can be handled by any broker it's at liberty to select any of those brokers (like the Admin client does)

Easier, code-wise, would be for a router to just direct those kinds of request to the bootstrapNodeId(). This could indeed result in one broker handling all those requests.

The runtime could be more opinionated about routing those requests, but with a cost -- it would need to know about which requests can be routed anywhere, and which must be routed somewhere specific. In other words you'd lose the clean separation about what's making the routing decision -- it's no longer always the router. We'd also have to figure out whether this is something that the runtime did unilaterally (by interpreting the requests being emitted by the router), or whether it was a bilateral things (by having a separate method in the RouterContext for sending these kinds of request).

k-wall · 2026-06-03T11:29:25Z

+
+The route parameter is essential for supporting different `NodeIdMapping` strategies (see _Node ID mapping implementation_ below).
+With a dedicated (one-to-one) mapping, each virtual node belongs to exactly one route, so the route is redundant but serves as a cross-check.
+With a shared (many-to-one) mapping — for example, if proxy instances were presented to clients as broker nodes — a single virtual node may serve brokers from multiple routes, and the route parameter is the only way for the runtime to determine which target cluster the request should reach.


I struggled a bit getting my head many-to-one. Understanding a reason why you'd do it helped me. I think:

A reason you'd want to do this is simplifying the broker topology that is exposed to the client. This would reduce the number of connections that a Kafka client needs to make, which helps lower network resources.

Say I'm using the proxy to present a union of two clusters, each with say a dozen brokers each.

one-to-one mapping would mean that each client potentially open a 24 + 1 connections.

many-to-one could mean you shrink the client topology down to say 3 + 1 connections.

many-to-one could have benefit when proxying a single cluster.

Yes, what you describe is one concrete motivation.

More abstractly, using virtual node ids (and also including route in addition to the virtual node I when calling context.sendRequest) should mean that routers are not coupled to the networking model we currently have.

The networking model we have is one where proxy instances are interchangable -- each proxy instance can function as any of the brokers. That has a few benefits (it means you can freely scale the number of proxy instances), but also comes with down sides (it's not possible to certain things in the proxy which depend on broker identity, like routing all consumers in a consumer group through a single proxy instance).

Other networking models are conceptually possible.

If your proxy instances formed a cluster, which membership and a leader to assign ownership of topic partitions (i.e. the same architecture as Kafka itself) then you can have free scaling (though the scaling operation is no longer trivial), and also do smarter things with the protocol.

If you adopt a 1:1 proxy:broker model then you give away the free-scaling, but each proxy instance can be told which broker it's in front of. In this model there isn't any need for routing though.

The API in this proposal is aiming to allow the same routers to work in 1. Maybe we will never implement a networking model like option 1. But I think an API which doesn't couple us to just our current model, (while not making the API significantly harder to use) is probably better than one which ties our hands.

That said, it would still be possible to implement routers using this API which were incompatible with 1. The API can't guarantee that all possible uses of it would work with all possible networking models. If that bothers you from the PoV of "how would a user know if a given router implementation is compatible with their desired networking model?" then I think the answer is basically what I wrote to Sam yesterday.

k-wall · 2026-06-03T11:40:15Z

+
+```java
+interface Router {
+    CompletionStage<RouterResult> onRequest(short apiVersion,


Observation - the Router has no way to know the nodeId to which the client sent the request. Right?
That's a deliberate choice. You've done that to allow the many-to-one use-case.

I don't think this is a deliberate choice I made, actually.

I think it's more a case that for the topic router I didn't end up needing to know. I strongly suspect that if we did want to support other networking models then routers would want/need to know. And possibly it doesn't even depend on the networking models idea at all, and use cases will emerge for a router to know this even with the existing networking.

From an API design PoV we could make the virtual node id a parameter of onRequest(), but we could also expose it via the RouterContext. Doing the latter means we don't have to make a decision at this point, because adding an accessor on RouterContext would be a binary compatible change.

k-wall · 2026-06-03T12:18:45Z

@tombentley Can you undraft the PR?
Otherwise I believe the proposal and I'd be happy to mark it approved.

hrishabhg · 2026-06-03T07:15:57Z

+`RouterContext` is passed to `Router.onRequest()` and provides methods for issuing requests to routes and delivering responses to the client.
+
+```java
+interface RouterContext {


should it also carry this and BrokerAddressFilter should trigger this so that RouterContext can build access to cluster topologies (specially topic -> cluster mapping).

default void onMetadataResponse(MetadataResponse response, String clusterName) {}

hrishabhg · 2026-06-03T07:44:43Z

+          cluster: cluster-b
+
+virtualClusters:
+  - name: my-vc


naive question.

What is the philosophy of having multiple gateways in a single vc if the filters are at vc level? (essentially, why do someone want to have multiple gateways with everything being same? To me, if filters would be at gateway level, then it makes more sense)

What is the philosophy of having multiple gateways in a single vc if the filters are at vc level?

I'm struggling to recall if there was a deep reason, or if we did this simply to avoid users having to repeat the same list of filter names for each gateway. Maybe @SamBarker can recall?

Did you have a use case for per-gateway filter chains @hrishabhg? Because assuming there are no gotchas we could support it in the configuration (e.g. "a gateway filterchain overrides the vc one", or "gateway filter chains are mutually exclusive with filter chains one the vc"). But if without a good use case I'd rather avoid supporting this just because it's additional complexity.

I think it was a separation of networking concerns, bind to port X or present this SNI address, from filter concerns. We wanted to configure the virtual cluster once and present it to both clients on the same kubernetes cluster (using services) and external clients, using ingress, and thus very different addressing answers.

hrishabhg · 2026-06-03T08:18:48Z

+```java
+sealed interface RouterResult {
+    record Completed(Response response) implements RouterResult {}
+    record CompletedNoResponse() implements RouterResult {}


for forwarding to rest of router chain, will one use CompletedNoResponse?

CompletedNoResponse is intended for 0-ack Produce request only. I thought it was better to have an explicit case for this -- it makes it slightly easier for the runtime to log a warning for a buggy router which hasn't coded for 0-ack Produce requests correctly.

For a 0-ack Produce a router impl would typically call sendRequestToNode() and return a CompletedNoResponse immediately. Perhaps we could make the runtime a bit clever and throw an exception if a developer tries to chain a handler onto the CompletionStage returned from sendRequestToNode() when the request it was passed was a 0-ack Produce.

However, all of this is a kind of best-effort by the runtime to alert router developers that their router is buggy for not considering 0-ack Produce. There's nothing we can do to force router devs to remember to code for it.

hrishabhg · 2026-06-03T08:24:55Z

+routerDefinitions:
+  - name: my-router
+    type: MyRouter
+    config: ...
+    routes:
+      - name: foo
+        id: 0
+        filters: 
+          - my-first-filter
+          - my-second-filter
+        target:
+          cluster: cluster-a
+      - name: bar
+        id: 1
+        filters: 
+          - my-third-filter
+        target:
+          router: my-other-router
+  - name: my-other-router
+    type: AnotherRouter
+    config: ...
+    routes:
+      - name: baz
+        id: 0
+        target:
+          cluster: cluster-b


ClientAuthRoute only authenticates client.
Forwards to SubjectRouter

If Subject is not present

Resolved Subject is mapped to a cluster => it forwards to PlainBasedClusterAuthenticatorRouter which targets clusterA or it forwards to OAuthBasedClusterAuthenticatorRouter which targets clusterB

- name: routerA type: StaticRouter routes: - filters: - PlainAuthFilter target: cluster: clusterA - name: routerB type: StaticRouter routes: - filters: - OAuthFilter target: cluster: clusterA - name: SubjectRouter type: SubjectRouter config: - subA: routerA subB: routerB default: TopicRouter routes: []

It is not a comment. Just trying to make sense. It is looking like routes being a list of 0 or 1 item only.

Does it make more sense to have optional config.routes with filters moving at router level.

- name: routerA type: StaticRouter filters: - PlainAuthFilter config: target: cluster: clusterA - name: routerB type: StaticRouter filters: - OAuthFilter config: target: cluster: clusterA - name: SubjectRouter type: SubjectRouter config: default: target: router: MyTopicBasedRouter routes: - subjects: - subA filters: - PlainAuthFilter target: cluster: clusterA - subjects: - subB - subC filters: - OAuthFilter target: # or targetCluster to reduce nesting cluster: clusterA

OR routes having capability to define its own config (@tombentley demo used similar approach -https://github.com/tombentley/kproxy/blob/routing/kroxylicious-router-topic/README.md#configuration)

- name: HybridRouter # If subject present and mapped to cluster Route to a cluster, else try topic based routing type: SubjectRouter config: default: target: router: MyTopicBasedRouter # if subject based fails routes: - subjects: - subA filters: - PlainAuthFilter target: cluster: clusterA - subjects: - subB - subC filters: - OAuthFilter target: # or targetCluster to reduce nesting cluster: clusterA

With proposed approach, it seems config should also have entry route.

# Router => SubjectForwarder config: subjectRoutes: sub1: routeA sub2: routeB entryRoute: ClientAuthRoute # or the first route in the list.

Aligns with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Replace the marker RouterResult interface with a sealed type having three variants: Completed(Response), CompletedNoResponse(), and Disconnect(). Response delivery and disconnect move from side-effects on RouterContext to return values from Router.onRequest(). Reshape RouterContext: add bootstrapNodeId(route) for initial broker discovery, add route parameter to sendRequestToNode, remove sendRequest (no "send to any broker"), sendResponse, and disconnect. Add fromVirtual(route, virtualNodeId) overload to NodeIdMapping to support the route parameter on sendRequestToNode. Pre-register bootstrap virtual node IDs in RouterDispatchHandler so bootstrapNodeId returns a real virtual node ID that forwardToNode can resolve. Aligns with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Aligns the top-level config property name with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Replace flat targetCluster/router fields on RouteDefinition with a nested Target record containing a cluster/router discriminated union. Aligns route config with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

Replace namedTargetCluster/router fields on VirtualCluster with a nested Target (reusing RouteDefinition.Target). The deprecated inline targetCluster field is retained for backward compatibility. Aligns with design proposal kroxylicious/design#70. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Tom Bentley <tbentley@redhat.com>

SamBarker · 2026-06-03T22:03:47Z

Thanks Tom. I think my point got misread, so let me restate it.

I'm not arguing that arbitrary router compositions should guarantee full protocol correctness — of course they can't. That was never the question.

The question is: does the DAG earn its keep?

The DAG's value proposition is composability — simple routers combined into powerful topologies. But the compositions where the DAG is genuinely expressive (fan-out: duplication, live/DR) are exactly where protocol statefulness forces each router to become a protocol-aware monolith. A DuplicationRouter that handles transactions isn't a simple composed thing — it's a full protocol engine that happens to send to two paths.

For the compositions where protocol semantics are manageable (linear chains: principal → topic → clusters), building blocks inside a single router could achieve the same reuse and testability without the DAG infrastructure (nested NodeIdMappings, per-connection router caching, virtual ID space composition).

I think there might be a missing level of abstraction between "full Router that owns the entire request lifecycle" and "static route that forwards opaque frames." The protocol-aware building blocks that every non-trivial router needs (scatter-gather, session management, response composition) aren't part of the model. Each router rebuilds them from scratch. That's not a DAG problem per se — but it does make me wonder whether the DAG is solving the right problem.

Keith's use cases (hot DR switch, conditional filter flow) are compelling and suggest the DAG does carry its weight for stateless compositions. I'm less convinced for stateful fan-out — but I'd welcome concrete use cases that test that boundary.

SamBarker · 2026-06-03T22:04:16Z

+* `fromVirtual(route, virtualNodeId)` — used when the router calls `sendRequestToNode(route, virtualNodeId, ...)`, returning the target-cluster node ID.
+
+The mapping must satisfy: `fromVirtual(route, toVirtual(route, t))` returns `t` for any route and target node ID `t`.
+Note that the mapping does not need to recover the route from the virtual node ID alone — the route is always supplied by the router.


The proposal does a good job of recognising node ID collision as a runtime concern — NodeIdMapping is sealed, transparent to the router author, and handled at the boundary. But I think there's a related class of problem the proposal should acknowledge: protocol identifier mapping.

Node IDs are genuinely virtualised — the runtime invents IDs that don't exist in any cluster. But other cluster-scoped identifiers in the protocol aren't virtualised, they're mapped: the real values from real clusters need to be tracked and injected into the right requests on the right routes.

Producer IDs (PIDs): Each cluster assigns its own via InitProducerId. A multi-cluster router must maintain a mapping of downstream PID → (route, cluster-PID) and rewrite PIDs in Produce requests per-route.

Fetch session IDs: Broker-scoped, need to be tracked per-route.

Topic IDs (UUIDs): Cluster-scoped. The router isn't inventing new UUIDs, but needs to know which cluster's UUID to inject where.

Node IDs are a natural fit for runtime ownership because they're structural — they're how routes address brokers. These protocol identifiers are message content, so it's a longer bow to draw to "routing." But the effect is the same: without runtime-assisted mapping, every multi-cluster router rebuilds this machinery independently.

I'm not saying the proposal needs to solve all of these now. But I think it should recognise protocol identifier mapping as a class of problem rather than handling one instance (node IDs) and treating the rest as pure implementation detail. Naming the problem shapes how router authors approach their implementations and signals where the runtime boundary might move in future.

The runtime concerns itself with virtual node ids because it has to:

the concept of nodes pervades the protocol

nodes have network addresses

the proxy runtime it responsible for the networking (a router is not)

This is not true for those other things you list, so I don't think the runtime is forced to concern itself with them in the same way. They can be a responsibility of a router implementation, which may choose to map them, or do something else.

Now that sendRequestToNode() takes route explicitly, the runtime no longer needs to recover the route from the virtual node ID. What makes node ID mapping a runtime concern rather than a router concern, given that the route parameter already tells the runtime where to send the request?

If the answer is "the runtime hides METADATA response rewriting so routers don't all reimplement it" — that's the same argument for runtime-assisted PID mapping, fetch session tracking, etc. The route parameter removed the architectural distinction between node IDs and other cluster-scoped identifiers.

What makes node ID mapping a runtime concern rather than a router concern, given that the route parameter already tells the runtime where to send the request?

The runtime needs to open sockets for all the nodes/know the mapping from SNI host name to node.

The upstream connection machinery in the POC works with virtual node IDs throughout — forwardToNode() takes a virtual node ID, resolves it to a HostPort via the address cache, and opens the socket. fromVirtual() is never called in the request path. The real node ID is never recovered. The socket layer treats virtual node IDs as opaque keys to look up addresses.

So the entire value of NodeIdMapping is response rewriting — translating real node IDs to virtual ones in METADATA, PRODUCE, FIND_COORDINATOR, and DESCRIBE_CLUSTER responses before the router sees them. That's not a networking concern. It's an identifier translation concern. And it's exactly the same layer where PID mapping, fetch session ID tracking, and topic UUID translation would live.

The runtime already does this translation across four response types because it's the right design — the topic router just works with virtual IDs and never thinks about mapping. Why should every multi-cluster router independently reimplement the same translation for PIDs and fetch session IDs?

That's not a networking concern. It's an identifier translation concern.

It is a networking concern, because it's the runtime that knows how those ids relate to network addresses. Sure node ids are not network addresses, but they directly associated. It seems necessary that there's some way for the router and the runtime to have a common way of identifying them, and the virtual node id is the way I'm proposing here.

And it's exactly the same layer where PID mapping, fetch session ID tracking, and topic UUID translation would live.

I'm not saying that there's some fundamental reason why the runtime could not understand these things in a deeper way. But it doesn't seem to buy much. It does not seem to be necessary for the runtime to know.

Moreover you've not yet pointed to a compelling reason why having the runtime handle these things is objectively better. Abstractly it's slightly displeasing that some identifiers are in a different class than others. But concretely why does that matter?

I think it might be more productive, at this point, if you proposed an alternative API which treats all these identifiers in whatever way you're advocating for. Having that would allow everyone to start evaluating the pros and cons of each API.

tombentley · 2026-06-03T23:29:30Z

A DuplicationRouter that handles transactions isn't a simple composed thing

No it's not, but it fundamentally cannot be AFAICS. Transactions are quite a hard area of the protocol. There are two distinct flavours (KIP-98 and KIP-890). It could evolve more in the future for all we know. Something somewhere in the system has to embed deep knowledge of the Kafka protocol to do non-trival things. By non-trivial I mean any kind of routing that it's not approximately "sticky, based on subject".

That complexity must live either:

In the runtime
In some combination of the runtime and a plugin
In a plugin

The risk of embedding lots of deep protocol knowledge in the runtime is that you make too many assumptions about the kind of behaviours you want to support. Eventually you find that there are use cases which your runtime doesn't handle. This is an argument for "some of the behaviour has to live in a plugin".

The problem of a combination is figuring out a sane API that balances flexibility of behaviour, with understandability and maintainability.

The risk of doing it in a plugin is that each plugin you write might end up with deep protocol knowledge. So there's the risk of duplication.

This proposal is definitely in the camp of loading the protocol complexity into plugins. To me that feels a reasonable thing to do. I think there are a few interesting behaviours which we'd have to write router implementations for. Not lots, and not 1. Let's say half a dozen for the sake of argument. I think the project could sustain writing at least some of these, by adopting the last strategy we're leaving to door open for 3rd parties to write their own if they had niche requirements.

I think there might be a missing level of abstraction between "full Router that owns the entire request lifecycle" and "static route that forwards opaque frames."

I agree, actually. But this is something we've seen for filters too. We don't have a HashicorpRecordEncryption filter and a FortanixDsmRecordEncryption filter. We have one pluggable ("higher level") RecordEncryption filter, which supports multiple KMSes. Likewise we have one pluggable Authorization filter which can support multiple Authorizers.

Generally speaking if you have the right low level API you can build a more specific thing on top. We can have a router that implements "static route that forwards opaque frames." based on Subject and/or client id (and likely other things which are known early enough in the conversation between client and broker). Those actually exist in the POC for testing purposes, as it happens.

This was referenced Jun 7, 2025

Client certificate principal propagation to filter context. kroxylicious/kroxylicious#2270

Closed

Client ssl principal propagation kroxylicious/kroxylicious#2294

Closed

gunnarmorling reviewed Jun 14, 2025

View reviewed changes

tombentley mentioned this pull request Jul 4, 2025

Authentication API #71

Closed

prabhamanepalli reviewed Jul 15, 2025

View reviewed changes