Skip to content

Routing API#70

Open
tombentley wants to merge 13 commits into
kroxylicious:mainfrom
tombentley:routing-api
Open

Routing API#70
tombentley wants to merge 13 commits into
kroxylicious:mainfrom
tombentley:routing-api

Conversation

@tombentley
Copy link
Copy Markdown
Member

@tombentley tombentley commented Jun 7, 2025

Proposes a Router plugin API for Kroxylicious that enables routing requests to multiple upstream Kafka clusters, unlocking use cases like union clusters, principal-aware routing, and topic splicing that the Filter API alone cannot address.

Key points:

  • Router / RouterFactory: new top-level plugin type (per-connection instances, ServiceLoader-discovered), analogous to Filter / FilterFactory
  • Routes: named pathways from a router to a target (cluster or another router), each with optional filter chains. Routes and routers form a validated DAG
  • Virtual node IDs: runtime-managed mapping that resolves the node ID collision problem when multiple clusters are presented as one. Routers work exclusively with virtual IDs
  • RouterContext: provides bootstrapNodeId() and sendRequestToNode() for issuing requests
  • RouterResult: sealed type (Completed, CompletedNoResponse, Disconnect) encoding outcomes as values rather than side-effects
  • Static routes: performance optimisation allowing opaque frame forwarding for API keys that don't need inspection
  • Configuration: new top-level clusterDefinitions and routerDefinitions lists; virtual clusters gain a target property (discriminated union of cluster/router); existing targetCluster deprecated but still supported
  • Fully backwards compatible — no impact on the Filter API or existing configurations

Copy link
Copy Markdown
Collaborator

@gunnarmorling gunnarmorling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great overall! A few comments and suggestions inline.

Comment thread proposals/004-routing-api.md Outdated
Comment thread proposals/004-routing-api.md Outdated
Comment thread proposals/004-routing-api.md Outdated
Comment thread proposals/070-routing-api.md Outdated
Comment thread proposals/004-routing-api.md Outdated
Comment thread proposals/004-routing-api.md Outdated
This will prevent the possibility of a request getting stuck in a router loop.

In order for non-trivial router graphs to be useful, `Router` authors will need to follow the same principle of composition as `Filter` authors.
That is, a `Router` implementation should only talk to its receivers, and not, for example, make direct connections of its own to a backing cluster.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that routers should not make "out of bands" requests, but this doesn't follow from the previous "in order to be useful" to me. I mean, out of bands requests can be useful, but we're discouraging them for the sake of validation, error handling, etc., I suppose.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that all communication with the cluster should be exclusively through the router/filter graph. Routers and filters should not make their own TCP connections directly to brokers.
Routers and filters may originate their own requests through the FilterContext/RouterContext API.

Will reword.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment has been resolved already.

Comment thread proposals/070-routing-api.md Outdated
Comment thread proposals/070-routing-api.md Outdated
@tombentley tombentley mentioned this pull request Jul 4, 2025
Comment thread proposals/070-routing-api.md Outdated
Copy link
Copy Markdown

@prabhamanepalli prabhamanepalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this proposal and ability to present multiple backing Kafka clusters as one single cluster to clients.

Can you please look into different types of routing other than identity/principal based routing.

Routing based on topic name, some embedded attribute in the header or payload itself.

I would like to know if the API being proposed is extensible for other types of routing other than principal based routing.

@tombentley
Copy link
Copy Markdown
Member Author

@prabhamanepalli thanks for taking a look!

Can you please look into different types of routing other than identity/principal based routing.
Routing based on topic name, some embedded attribute in the header or payload itself.

Those should absolutely be possible, and I consider them in-scope for this proposal. The "Union cluster" idea alluded to in the text is really a special case of this, where the mapping between topics (or more generally named broker-side entities like consumer groups) is defined statically in the configuration file. More generally, that mapping could be supplied over the network (e.g. from a database, metadata store or whatever).

It's up to the router how each request gets routed. Most common would be routing a request to a single target cluster or next router. But I don't think the API should enforce that. If a router implementation wanted to route a produce request, for example, to multiple clusters then that should be possible at the API level, even if dual writes are not usually a good idea.

What's not in scope for this proposal currently is making the set of routers, or the set of target clusters, dynamic during the lifetime of the proxy process. Dynamic configuration update is a separate item on the backlog.

Comment thread proposals/070-routing-api.md Outdated
k-wall added a commit to k-wall/design that referenced this pull request Apr 9, 2026
This introduces a proposal index (README.md) for the proposals directory
to solve the problem of colliding proposal numbers.

Currently, multiple open PRs have chosen proposal numbers that conflict
with already-merged proposals. This creates confusion and makes it
difficult to maintain stable identifiers for proposals - which is
essential for productive discussion on the mailing list.

The new process:
1. Authors create proposals with temporary filenames (nnn-*)
2. A separate PR to this README allocates the next sequential number
3. Authors rename their proposal once the number is allocated
4. The index is updated again when proposals are accepted

This ensures:
- Proposal numbers are allocated chronologically
- No number collisions occur
- Each proposal gets a stable identifier before mailing list discussion
- Clear visibility of open vs. accepted proposals

Current collisions resolved by this index:
- PR kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#88, kroxylicious#89 all have numbers conflicting with accepted
  proposals and have been allocated new sequential numbers (016-024)

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
k-wall added a commit to k-wall/design that referenced this pull request Apr 9, 2026
This introduces a proposal index (README.md) for the proposals directory
to solve the problem of colliding proposal numbers.

Currently, multiple open PRs have chosen proposal numbers that conflict
with already-merged proposals. This creates confusion and makes it
difficult to maintain stable identifiers for proposals - which is
essential for productive discussion on the mailing list.

The new process:
1. Authors create proposals with temporary filenames (nnn-*)
2. A separate PR to this README allocates the next sequential number
3. Authors rename their proposal once the number is allocated
4. The index is updated again when proposals are accepted

This ensures:
- Proposal numbers are allocated chronologically
- No number collisions occur
- Each proposal gets a stable identifier before mailing list discussion
- Clear visibility of open vs. accepted proposals

Current collisions resolved by this index:
- PR kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#88, kroxylicious#89 all have numbers conflicting with accepted
  proposals and have been allocated new sequential numbers (016-024)

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Keith Wall <kwall@apache.org>
k-wall added a commit to k-wall/design that referenced this pull request Apr 9, 2026
This introduces a proposal index (README.md) for the proposals directory
to solve the problem of colliding proposal numbers.

Currently, multiple open PRs have chosen proposal numbers that conflict
with already-merged proposals. This creates confusion and makes it
difficult to maintain stable identifiers for proposals - which is
essential for productive discussion on the mailing list.

The new process:
1. Authors create proposals with temporary filenames (nnn-*)
2. A separate PR to this README allocates the next sequential number
3. Authors rename their proposal once the number is allocated
4. The index is updated again when proposals are accepted

This ensures:
- Proposal numbers are allocated chronologically
- No number collisions occur
- Each proposal gets a stable identifier before mailing list discussion
- Clear visibility of open vs. accepted proposals

Current collisions resolved by this index:
- PR kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#88, kroxylicious#89 all have numbers conflicting with accepted
  proposals and have been allocated new sequential numbers (016-024)

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Keith Wall <kwall@apache.org>
k-wall added a commit to k-wall/design that referenced this pull request Apr 24, 2026
This introduces a proposal index (README.md) for the proposals directory
to solve the problem of colliding proposal numbers.

Currently, multiple open PRs have chosen proposal numbers that conflict
with already-merged proposals. This creates confusion and makes it
difficult to maintain stable identifiers for proposals - which is
essential for productive discussion on the mailing list.

The new process:
1. Authors create proposals with temporary filenames (nnn-*)
2. A separate PR to this README allocates the next sequential number
3. Authors rename their proposal once the number is allocated
4. The index is updated again when proposals are accepted

This ensures:
- Proposal numbers are allocated chronologically
- No number collisions occur
- Each proposal gets a stable identifier before mailing list discussion
- Clear visibility of open vs. accepted proposals

Current collisions resolved by this index:
- PR kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#88, kroxylicious#89 all have numbers conflicting with accepted
  proposals and have been allocated new sequential numbers (016-024)

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Keith Wall <kwall@apache.org>
k-wall added a commit to k-wall/design that referenced this pull request Apr 28, 2026
This change simplifies the proposal numbering system by using PR numbers
as proposal identifiers, eliminating number collisions and removing the
need for a separate allocation process.

Changes:
- Simplified proposals/README.md to focus on author workflow
  - Removed index tables (directory listing serves as the index)
  - Streamlined instructions for creating and renaming proposals
- Updated proposal template with workflow instructions
  - Require PR number in title format: # <PR#> - <Title>
  - Moved workflow instructions into comment block
- Added GitHub workflow to automatically check proposal numbering
  - Validates both filename and title format
  - Updates PR description when proposal files don't match PR number
  - Provides exact commands to fix naming issues
  - Removes warning once corrected
  - Handles both added and renamed files
  - Runs on all PRs (ready for mandatory status check)
- Added notification script for existing open PRs
  - After merge, run notify-open-prs.sh to ask authors to rebase
  - Workflow will automatically guide them through renaming
  - Updated with all current open proposal PRs (kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#85,
    kroxylicious#88, kroxylicious#93, kroxylicious#94, kroxylicious#96, kroxylicious#98, kroxylicious#99, kroxylicious#100, kroxylicious#101, kroxylicious#103)

Proposals 001-019 retain their original numbers.

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Keith Wall <kwall@apache.org>
k-wall added a commit that referenced this pull request Apr 28, 2026
This change simplifies the proposal numbering system by using PR numbers
as proposal identifiers, eliminating number collisions and removing the
need for a separate allocation process.

Changes:
- Simplified proposals/README.md to focus on author workflow
  - Removed index tables (directory listing serves as the index)
  - Streamlined instructions for creating and renaming proposals
- Updated proposal template with workflow instructions
  - Require PR number in title format: # <PR#> - <Title>
  - Moved workflow instructions into comment block
- Added GitHub workflow to automatically check proposal numbering
  - Validates both filename and title format
  - Updates PR description when proposal files don't match PR number
  - Provides exact commands to fix naming issues
  - Removes warning once corrected
  - Handles both added and renamed files
  - Runs on all PRs (ready for mandatory status check)
- Added notification script for existing open PRs
  - After merge, run notify-open-prs.sh to ask authors to rebase
  - Workflow will automatically guide them through renaming
  - Updated with all current open proposal PRs (#70, #82, #83, #85,
    #88, #93, #94, #96, #98, #99, #100, #101, #103)

Proposals 001-019 retain their original numbers.

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Keith Wall <kwall@apache.org>
@kidpollo
Copy link
Copy Markdown

This feature is looking attractive for us so we can simplify our efforts around splitting clusters. Has work in this gone beyond this stage?

Comment thread proposals/070-routing-api.md Outdated
Comment thread proposals/070-routing-api.md Outdated
Comment thread proposals/070-routing-api.md Outdated
Comment thread proposals/070-routing-api.md
Comment thread proposals/070-routing-api.md Outdated
Comment thread proposals/070-routing-api.md Outdated
@k-wall
Copy link
Copy Markdown
Member

k-wall commented May 18, 2026

I think this proposal is heading in the right direction.

tombentley and others added 4 commits May 21, 2026 16:31
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com>
Signed-off-by: Tom Bentley <tombentley@users.noreply.github.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 2, 2026
Replace namedTargetCluster/router fields on VirtualCluster with a
nested Target (reusing RouteDefinition.Target). The deprecated inline
targetCluster field is retained for backward compatibility.

Aligns with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Young <robertyoungnz@gmail.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
@tombentley
Copy link
Copy Markdown
Member Author

@SamBarker

The pattern is: any fan-out router must understand every stateful protocol feature that might traverse it. The protocol's statefulness prevents the router from being the simple composed thing the DAG model promises.

That's what I mean by "half the model." The DAG handles where things go. It doesn't handle what the protocol requires when things go to multiple places simultaneously. If every fan-out router must absorb the full protocol's statefulness to be correct, the composability benefit is theoretical — the router is a monolith with a DAG-shaped hat on.

That's a fair criticism: In general you cannot compose arbitrary routers into a topology and be guaranteed to get something that works with the full breadth of the protocol.

That's simply unavoidable, not least because the protocol is not set in stone. Someone might propose a KIP which adds or changes a Kafka API in a way which 'breaks' a filter or a router. The API version negotiation part of the protocol gives us an escape hatch to avoid actually breaking protocol guarantees, at the cost of missing out on newer protocol versions.

Because it's unavoidable, I don't see this as a problem with the API being proposed here, but rather with some of the words being used to advocate for that API: I will tone it down a bit.

Perhaps there are alternatives where the runtime takes more responsibility of "knowing about" the Kafka protocol, and exposes narrower APIs to plugins, which in turn have less of an opportunity to break the protocol. But it's not clear to me what those APIs would be, and whether a good range of useful functionality could be expressed in terms of those APIs. In the absence of a concrete alternative, I think solving "just" the routing problem, and not the composability problem, is a reasonable compromise.

But the UX point you're making stands: how can a user compose some plugins and have some assurance that the end result is actually going to work?

There's an obvious analogy with code: how can a computer programmer compose the functions available in any given programming language and have some reassurance that the end result is actually going to work?

Following through on that analogy, we can ask "is there some way to annotate plugins so that their compatibility can be automatically validated by a computation on the annotations?" In other words, "can we have some kind of type checker for plugin composition?". Clearly the answer is "yes". It's merely a question of how quickly, and how far, we want to go down that road. In a Claude code session last week I came up with this, but that's just one point in a huge space of possible mitigations. A word of caution, however, Rice's theorem means that anything we do is unlikely to be entirely satisfactory. We can mitigate, but we cannot solve this problem.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
When a router presents multiple clusters to the client as a single virtual cluster, there is a collision problem: the client might receive node ID 0 in a `METADATA` response from one route and node ID 0 from another, with no way to distinguish them.

The solution is _virtual node IDs_.
The runtime assigns each `(route, target-cluster node ID)` pair a unique virtual node ID.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime assigns each (route, target-cluster node ID) pair a unique virtual node ID.

I think these definitions might have existed before introduction of a Router DAG. I want to clarify what the virtual node IDs should be if we have.

cluster A with node ids 0,1
cluster B with node ids 0,1
router-inner with routes [route-A(0) -> cluster A, routeB(1) -> cluster B]
router-outer with routes [route-C(2) -> route A, routeD(3) -> cluster A, routeE(4) -> cluster B]
virtual cluster targets router-outer

Does each nested router have it's own virtual(downstream) to real(upstream) id mapping? So router-inner would map node ids [0-3] to the nodes of cluster A and cluster B, and router-outer would map the node ids emitted from inner-router [0-3] plus the nodes of cluster A and cluster B which it routes to directly. So it would emit virtual node ids [0-7]?

robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Aligns with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Young <robertyoungnz@gmail.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Replace the marker RouterResult interface with a sealed type having
three variants: Completed(Response), CompletedNoResponse(), and
Disconnect(). Response delivery and disconnect move from side-effects
on RouterContext to return values from Router.onRequest().

Reshape RouterContext: add bootstrapNodeId(route) for initial broker
discovery, add route parameter to sendRequestToNode, remove
sendRequest (no "send to any broker"), sendResponse, and disconnect.

Add fromVirtual(route, virtualNodeId) overload to NodeIdMapping to
support the route parameter on sendRequestToNode.

Pre-register bootstrap virtual node IDs in RouterDispatchHandler so
bootstrapNodeId returns a real virtual node ID that forwardToNode
can resolve.

Aligns with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Young <robertyoungnz@gmail.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Aligns the top-level config property name with design proposal
kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Young <robertyoungnz@gmail.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Replace flat targetCluster/router fields on RouteDefinition with a
nested Target record containing a cluster/router discriminated union.
Aligns route config with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Young <robertyoungnz@gmail.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Replace namedTargetCluster/router fields on VirtualCluster with a
nested Target (reusing RouteDefinition.Target). The deprecated inline
targetCluster field is retained for backward compatibility.

Aligns with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Young <robertyoungnz@gmail.com>

**`bootstrapNodeId(route)`** returns the virtual node ID of the bootstrap broker for a named route.
This is used to send the initial requests (e.g. `METADATA`) before the router has discovered the cluster's broker topology.
**`bootstrapNodeId(route)`** returns the virtual node ID of a broker on the named route's cluster.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"returns the virtual node ID of a broker on the named route's cluster."

should this read?

returns a virtual node ID of a broker on the named route's cluster.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to imply that there is only one node which can function as a bootstrap, so yes.

I suppose we might want to think about whether this should return a single id (where presumably the runtime can randomise which one is returned in the case it knows of multiple bootstraps), or just return a list and let the router pick. I don't have a use case where letting the router pick is necessary, and giving router authors the opportunity to not randomize/round robin seems like it's worse than having the runtime do this.

**`sendRequestToNode(route, virtualNodeId, ...)`** sends a request to a specific broker, identified by the route and a virtual node ID.
The runtime uses the route to determine which target cluster, and resolves the virtual node ID to a specific upstream broker address, opening a new connection if necessary.

The router passes both the route (which it knows from its routing decision) and the virtual node ID (which it knows from METADATA or `bootstrapNodeId()`).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about requests that aren't topic centric (ones where metadata doesn't help)? How will the router implement decide which virtual node ID to use.

Say I have a union-router. LIST_TOPICS needs to be fanned out to both routes. How does a router chose which virtual node? It could just call bootstrapNodeId() for each finger, but that may create a hotspot on the upstream brokers.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to the router.

The API is providing bootstrapNodeId(), which functions like the bootstrap.servers configuration for a normal Kafka client.

  • The router may (but is not forced to) make an initial METADATA request (or it may simply observe a METADATA request coming from the client) to discover the other brokers in the cluster. For requests which can be handled by any broker it's at liberty to select any of those brokers (like the Admin client does)
  • Easier, code-wise, would be for a router to just direct those kinds of request to the bootstrapNodeId(). This could indeed result in one broker handling all those requests.

The runtime could be more opinionated about routing those requests, but with a cost -- it would need to know about which requests can be routed anywhere, and which must be routed somewhere specific. In other words you'd lose the clean separation about what's making the routing decision -- it's no longer always the router. We'd also have to figure out whether this is something that the runtime did unilaterally (by interpreting the requests being emitted by the router), or whether it was a bilateral things (by having a separate method in the RouterContext for sending these kinds of request).


The route parameter is essential for supporting different `NodeIdMapping` strategies (see _Node ID mapping implementation_ below).
With a dedicated (one-to-one) mapping, each virtual node belongs to exactly one route, so the route is redundant but serves as a cross-check.
With a shared (many-to-one) mapping — for example, if proxy instances were presented to clients as broker nodes — a single virtual node may serve brokers from multiple routes, and the route parameter is the only way for the runtime to determine which target cluster the request should reach.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggled a bit getting my head many-to-one. Understanding a reason why you'd do it helped me. I think:

A reason you'd want to do this is simplifying the broker topology that is exposed to the client. This would reduce the number of connections that a Kafka client needs to make, which helps lower network resources.

Say I'm using the proxy to present a union of two clusters, each with say a dozen brokers each.

  • one-to-one mapping would mean that each client potentially open a 24 + 1 connections.
  • many-to-one could mean you shrink the client topology down to say 3 + 1 connections.

many-to-one could have benefit when proxying a single cluster.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what you describe is one concrete motivation.

More abstractly, using virtual node ids (and also including route in addition to the virtual node I when calling context.sendRequest) should mean that routers are not coupled to the networking model we currently have.

The networking model we have is one where proxy instances are interchangable -- each proxy instance can function as any of the brokers. That has a few benefits (it means you can freely scale the number of proxy instances), but also comes with down sides (it's not possible to certain things in the proxy which depend on broker identity, like routing all consumers in a consumer group through a single proxy instance).

Other networking models are conceptually possible.

  1. If your proxy instances formed a cluster, which membership and a leader to assign ownership of topic partitions (i.e. the same architecture as Kafka itself) then you can have free scaling (though the scaling operation is no longer trivial), and also do smarter things with the protocol.
  2. If you adopt a 1:1 proxy:broker model then you give away the free-scaling, but each proxy instance can be told which broker it's in front of. In this model there isn't any need for routing though.

The API in this proposal is aiming to allow the same routers to work in 1. Maybe we will never implement a networking model like option 1. But I think an API which doesn't couple us to just our current model, (while not making the API significantly harder to use) is probably better than one which ties our hands.

That said, it would still be possible to implement routers using this API which were incompatible with 1. The API can't guarantee that all possible uses of it would work with all possible networking models. If that bothers you from the PoV of "how would a user know if a given router implementation is compatible with their desired networking model?" then I think the answer is basically what I wrote to Sam yesterday.


```java
interface Router {
CompletionStage<RouterResult> onRequest(short apiVersion,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation - the Router has no way to know the nodeId to which the client sent the request. Right?
That's a deliberate choice. You've done that to allow the many-to-one use-case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a deliberate choice I made, actually.

I think it's more a case that for the topic router I didn't end up needing to know. I strongly suspect that if we did want to support other networking models then routers would want/need to know. And possibly it doesn't even depend on the networking models idea at all, and use cases will emerge for a router to know this even with the existing networking.

From an API design PoV we could make the virtual node id a parameter of onRequest(), but we could also expose it via the RouterContext. Doing the latter means we don't have to make a decision at this point, because adding an accessor on RouterContext would be a binary compatible change.

@k-wall
Copy link
Copy Markdown
Member

k-wall commented Jun 3, 2026

@tombentley Can you undraft the PR?
Otherwise I believe the proposal and I'd be happy to mark it approved.

@k-wall k-wall closed this Jun 3, 2026
@k-wall k-wall reopened this Jun 3, 2026
`RouterContext` is passed to `Router.onRequest()` and provides methods for issuing requests to routes and delivering responses to the client.

```java
interface RouterContext {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it also carry this and BrokerAddressFilter should trigger this so that RouterContext can build access to cluster topologies (specially topic -> cluster mapping).

default void onMetadataResponse(MetadataResponse response, String clusterName) {}

cluster: cluster-b

virtualClusters:
- name: my-vc
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naive question.

What is the philosophy of having multiple gateways in a single vc if the filters are at vc level? (essentially, why do someone want to have multiple gateways with everything being same? To me, if filters would be at gateway level, then it makes more sense)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the philosophy of having multiple gateways in a single vc if the filters are at vc level?

I'm struggling to recall if there was a deep reason, or if we did this simply to avoid users having to repeat the same list of filter names for each gateway. Maybe @SamBarker can recall?

Did you have a use case for per-gateway filter chains @hrishabhg? Because assuming there are no gotchas we could support it in the configuration (e.g. "a gateway filterchain overrides the vc one", or "gateway filter chains are mutually exclusive with filter chains one the vc"). But if without a good use case I'd rather avoid supporting this just because it's additional complexity.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was a separation of networking concerns, bind to port X or present this SNI address, from filter concerns. We wanted to configure the virtual cluster once and present it to both clients on the same kubernetes cluster (using services) and external clients, using ingress, and thus very different addressing answers.

```java
sealed interface RouterResult {
record Completed(Response response) implements RouterResult {}
record CompletedNoResponse() implements RouterResult {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for forwarding to rest of router chain, will one use CompletedNoResponse?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompletedNoResponse is intended for 0-ack Produce request only. I thought it was better to have an explicit case for this -- it makes it slightly easier for the runtime to log a warning for a buggy router which hasn't coded for 0-ack Produce requests correctly.

For a 0-ack Produce a router impl would typically call sendRequestToNode() and return a CompletedNoResponse immediately. Perhaps we could make the runtime a bit clever and throw an exception if a developer tries to chain a handler onto the CompletionStage returned from sendRequestToNode() when the request it was passed was a 0-ack Produce.

However, all of this is a kind of best-effort by the runtime to alert router developers that their router is buggy for not considering 0-ack Produce. There's nothing we can do to force router devs to remember to code for it.

Comment on lines +274 to +299
routerDefinitions:
- name: my-router
type: MyRouter
config: ...
routes:
- name: foo
id: 0
filters:
- my-first-filter
- my-second-filter
target:
cluster: cluster-a
- name: bar
id: 1
filters:
- my-third-filter
target:
router: my-other-router
- name: my-other-router
type: AnotherRouter
config: ...
routes:
- name: baz
id: 0
target:
cluster: cluster-b
Copy link
Copy Markdown

@hrishabhg hrishabhg Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClientAuthRoute only authenticates client.
Forwards to SubjectRouter

  1. If Subject is not present
  2. Resolved Subject is mapped to a cluster => it forwards to PlainBasedClusterAuthenticatorRouter which targets clusterA or it forwards to OAuthBasedClusterAuthenticatorRouter which targets clusterB
- name: routerA
  type: StaticRouter
  routes:
    - filters:
        - PlainAuthFilter
      target:
        cluster: clusterA
- name: routerB
  type: StaticRouter
  routes:
    - filters:
        - OAuthFilter
      target:
        cluster: clusterA
- name: SubjectRouter
  type: SubjectRouter
  config:
    - subA: routerA
      subB: routerB
      default: TopicRouter
  routes: []

It is not a comment. Just trying to make sense. It is looking like routes being a list of 0 or 1 item only.


Does it make more sense to have optional config.routes with filters moving at router level.

- name: routerA
  type: StaticRouter
  filters:
    - PlainAuthFilter
  config:
    target:
      cluster: clusterA
- name: routerB
  type: StaticRouter
  filters:
    - OAuthFilter
  config:
    target:
      cluster: clusterA
- name: SubjectRouter
  type: SubjectRouter
  config:
    default:
      target:
        router: MyTopicBasedRouter
    routes:
      - subjects:
          - subA
        filters:
          - PlainAuthFilter
        target:
          cluster: clusterA
      - subjects:
          - subB
          - subC
        filters:
            - OAuthFilter
        target: # or targetCluster to reduce nesting
          cluster: clusterA

OR routes having capability to define its own config (@tombentley demo used similar approach -https://github.com/tombentley/kproxy/blob/routing/kroxylicious-router-topic/README.md#configuration)

- name: HybridRouter # If subject present and mapped to cluster Route to a cluster, else try topic based routing
  type: SubjectRouter
  config:
    default:
      target:
        router: MyTopicBasedRouter # if subject based fails
    routes:
      - subjects:
          - subA
        filters:
          - PlainAuthFilter
        target:
          cluster: clusterA
      - subjects:
          - subB
          - subC
        filters:
            - OAuthFilter
        target: # or targetCluster to reduce nesting
          cluster: clusterA

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With proposed approach, it seems config should also have entry route.

# Router => SubjectForwarder
config:
      subjectRoutes:
            sub1: routeA
            sub2: routeB
     entryRoute: ClientAuthRoute # or the first route in the list.

@tombentley tombentley marked this pull request as ready for review June 3, 2026 18:39
@tombentley tombentley requested a review from a team as a code owner June 3, 2026 18:39
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Aligns with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Replace the marker RouterResult interface with a sealed type having
three variants: Completed(Response), CompletedNoResponse(), and
Disconnect(). Response delivery and disconnect move from side-effects
on RouterContext to return values from Router.onRequest().

Reshape RouterContext: add bootstrapNodeId(route) for initial broker
discovery, add route parameter to sendRequestToNode, remove
sendRequest (no "send to any broker"), sendResponse, and disconnect.

Add fromVirtual(route, virtualNodeId) overload to NodeIdMapping to
support the route parameter on sendRequestToNode.

Pre-register bootstrap virtual node IDs in RouterDispatchHandler so
bootstrapNodeId returns a real virtual node ID that forwardToNode
can resolve.

Aligns with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Aligns the top-level config property name with design proposal
kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Replace flat targetCluster/router fields on RouteDefinition with a
nested Target record containing a cluster/router discriminated union.
Aligns route config with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
robobario pushed a commit to robobario/kroxylicious that referenced this pull request Jun 3, 2026
Replace namedTargetCluster/router fields on VirtualCluster with a
nested Target (reusing RouteDefinition.Target). The deprecated inline
targetCluster field is retained for backward compatibility.

Aligns with design proposal kroxylicious/design#70.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
@SamBarker
Copy link
Copy Markdown
Member

Thanks Tom. I think my point got misread, so let me restate it.

I'm not arguing that arbitrary router compositions should guarantee full protocol correctness — of course they can't. That was never the question.

The question is: does the DAG earn its keep?

The DAG's value proposition is composability — simple routers combined into powerful topologies. But the compositions where the DAG is genuinely expressive (fan-out: duplication, live/DR) are exactly where protocol statefulness forces each router to become a protocol-aware monolith. A DuplicationRouter that handles transactions isn't a simple composed thing — it's a full protocol engine that happens to send to two paths.

For the compositions where protocol semantics are manageable (linear chains: principal → topic → clusters), building blocks inside a single router could achieve the same reuse and testability without the DAG infrastructure (nested NodeIdMappings, per-connection router caching, virtual ID space composition).

I think there might be a missing level of abstraction between "full Router that owns the entire request lifecycle" and "static route that forwards opaque frames." The protocol-aware building blocks that every non-trivial router needs (scatter-gather, session management, response composition) aren't part of the model. Each router rebuilds them from scratch. That's not a DAG problem per se — but it does make me wonder whether the DAG is solving the right problem.

Keith's use cases (hot DR switch, conditional filter flow) are compelling and suggest the DAG does carry its weight for stateless compositions. I'm less convinced for stateful fan-out — but I'd welcome concrete use cases that test that boundary.

* `fromVirtual(route, virtualNodeId)` — used when the router calls `sendRequestToNode(route, virtualNodeId, ...)`, returning the target-cluster node ID.

The mapping must satisfy: `fromVirtual(route, toVirtual(route, t))` returns `t` for any route and target node ID `t`.
Note that the mapping does not need to recover the route from the virtual node ID alone — the route is always supplied by the router.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal does a good job of recognising node ID collision as a runtime concern — NodeIdMapping is sealed, transparent to the router author, and handled at the boundary. But I think there's a related class of problem the proposal should acknowledge: protocol identifier mapping.

Node IDs are genuinely virtualised — the runtime invents IDs that don't exist in any cluster. But other cluster-scoped identifiers in the protocol aren't virtualised, they're mapped: the real values from real clusters need to be tracked and injected into the right requests on the right routes.

  • Producer IDs (PIDs): Each cluster assigns its own via InitProducerId. A multi-cluster router must maintain a mapping of downstream PID → (route, cluster-PID) and rewrite PIDs in Produce requests per-route.
  • Fetch session IDs: Broker-scoped, need to be tracked per-route.
  • Topic IDs (UUIDs): Cluster-scoped. The router isn't inventing new UUIDs, but needs to know which cluster's UUID to inject where.

Node IDs are a natural fit for runtime ownership because they're structural — they're how routes address brokers. These protocol identifiers are message content, so it's a longer bow to draw to "routing." But the effect is the same: without runtime-assisted mapping, every multi-cluster router rebuilds this machinery independently.

I'm not saying the proposal needs to solve all of these now. But I think it should recognise protocol identifier mapping as a class of problem rather than handling one instance (node IDs) and treating the rest as pure implementation detail. Naming the problem shapes how router authors approach their implementations and signals where the runtime boundary might move in future.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime concerns itself with virtual node ids because it has to:

  • the concept of nodes pervades the protocol
  • nodes have network addresses
  • the proxy runtime it responsible for the networking (a router is not)

This is not true for those other things you list, so I don't think the runtime is forced to concern itself with them in the same way. They can be a responsibility of a router implementation, which may choose to map them, or do something else.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that sendRequestToNode() takes route explicitly, the runtime no longer needs to recover the route from the virtual node ID. What makes node ID mapping a runtime concern rather than a router concern, given that the route parameter already tells the runtime where to send the request?

If the answer is "the runtime hides METADATA response rewriting so routers don't all reimplement it" — that's the same argument for runtime-assisted PID mapping, fetch session tracking, etc. The route parameter removed the architectural distinction between node IDs and other cluster-scoped identifiers.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes node ID mapping a runtime concern rather than a router concern, given that the route parameter already tells the runtime where to send the request?

The runtime needs to open sockets for all the nodes/know the mapping from SNI host name to node.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream connection machinery in the POC works with virtual node IDs throughout — forwardToNode() takes a virtual node ID, resolves it to a HostPort via the address cache, and opens the socket. fromVirtual() is never called in the request path. The real node ID is never recovered. The socket layer treats virtual node IDs as opaque keys to look up addresses.

So the entire value of NodeIdMapping is response rewriting — translating real node IDs to virtual ones in METADATA, PRODUCE, FIND_COORDINATOR, and DESCRIBE_CLUSTER responses before the router sees them. That's not a networking concern. It's an identifier translation concern. And it's exactly the same layer where PID mapping, fetch session ID tracking, and topic UUID translation would live.

The runtime already does this translation across four response types because it's the right design — the topic router just works with virtual IDs and never thinks about mapping. Why should every multi-cluster router independently reimplement the same translation for PIDs and fetch session IDs?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a networking concern. It's an identifier translation concern.

It is a networking concern, because it's the runtime that knows how those ids relate to network addresses. Sure node ids are not network addresses, but they directly associated. It seems necessary that there's some way for the router and the runtime to have a common way of identifying them, and the virtual node id is the way I'm proposing here.

And it's exactly the same layer where PID mapping, fetch session ID tracking, and topic UUID translation would live.

I'm not saying that there's some fundamental reason why the runtime could not understand these things in a deeper way. But it doesn't seem to buy much. It does not seem to be necessary for the runtime to know.

Moreover you've not yet pointed to a compelling reason why having the runtime handle these things is objectively better. Abstractly it's slightly displeasing that some identifiers are in a different class than others. But concretely why does that matter?

I think it might be more productive, at this point, if you proposed an alternative API which treats all these identifiers in whatever way you're advocating for. Having that would allow everyone to start evaluating the pros and cons of each API.

@tombentley
Copy link
Copy Markdown
Member Author

A DuplicationRouter that handles transactions isn't a simple composed thing

No it's not, but it fundamentally cannot be AFAICS. Transactions are quite a hard area of the protocol. There are two distinct flavours (KIP-98 and KIP-890). It could evolve more in the future for all we know. Something somewhere in the system has to embed deep knowledge of the Kafka protocol to do non-trival things. By non-trivial I mean any kind of routing that it's not approximately "sticky, based on subject".

That complexity must live either:

  • In the runtime
  • In some combination of the runtime and a plugin
  • In a plugin

The risk of embedding lots of deep protocol knowledge in the runtime is that you make too many assumptions about the kind of behaviours you want to support. Eventually you find that there are use cases which your runtime doesn't handle. This is an argument for "some of the behaviour has to live in a plugin".

The problem of a combination is figuring out a sane API that balances flexibility of behaviour, with understandability and maintainability.

The risk of doing it in a plugin is that each plugin you write might end up with deep protocol knowledge. So there's the risk of duplication.

This proposal is definitely in the camp of loading the protocol complexity into plugins. To me that feels a reasonable thing to do. I think there are a few interesting behaviours which we'd have to write router implementations for. Not lots, and not 1. Let's say half a dozen for the sake of argument. I think the project could sustain writing at least some of these, by adopting the last strategy we're leaving to door open for 3rd parties to write their own if they had niche requirements.

I think there might be a missing level of abstraction between "full Router that owns the entire request lifecycle" and "static route that forwards opaque frames."

I agree, actually. But this is something we've seen for filters too. We don't have a HashicorpRecordEncryption filter and a FortanixDsmRecordEncryption filter. We have one pluggable ("higher level") RecordEncryption filter, which supports multiple KMSes. Likewise we have one pluggable Authorization filter which can support multiple Authorizers.

Generally speaking if you have the right low level API you can build a more specific thing on top. We can have a router that implements "static route that forwards opaque frames." based on Subject and/or client id (and likely other things which are known early enough in the conversation between client and broker). Those actually exist in the POC for testing purposes, as it happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants