[client] Fix stale metadata on readOnlyGateway by adding RetryableGatewayClientProxy #3390

Open
loserwang1024 wants to merge 2 commits into apache:main from loserwang1024:retry-with-retry

@loserwang1024
Contributor

Purpose

Linked issue: close #3389

Brief change log

Tests

API and Format

Documentation

@loserwang1024
Contributor Author

CC @swuferhong @wuchong @fresh-borzoni

Member

@fresh-borzoni fresh-borzoni left a comment

@loserwang1024 Thank you for this very important PR. I left some comments, PTAL.

private final AdminReadOnlyGateway readOnlyGateway;
private final MetadataUpdater metadataUpdater;

private static final int READ_ONLY_GATEWAY_MAX_RETRIES = 3;
Member

With maxRetries=3, bootstrap reinit needs 4 refreshes. You only get 3 per request.
Shall we loop inside updateMetadata until either success or null-triggered bootstrap?
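
The budget concern above can be illustrated with a toy model. This is only a sketch: `RetryBudgetSketch` and its methods are hypothetical names, not code from this PR, and the "refresh" is reduced to an action that succeeds on its Nth call. It shows that a fixed budget of 3 never reaches a 4th refresh, while a loop that runs until success does.

```java
import java.util.function.BooleanSupplier;

/** Toy model (hypothetical, not the PR's code): a refresh that only succeeds on its Nth call. */
final class RetryBudgetSketch {

    /** Returns a refresh action that fails until it has been invoked {@code succeedOn} times. */
    static BooleanSupplier refreshSucceedingOn(int succeedOn) {
        int[] calls = {0};
        return () -> ++calls[0] >= succeedOn;
    }

    /** Bounded variant: gives up after {@code maxRefreshes} refresh attempts. */
    static boolean refreshBounded(BooleanSupplier refresh, int maxRefreshes) {
        for (int i = 0; i < maxRefreshes; i++) {
            if (refresh.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    /** Looping variant: refreshes until one attempt succeeds and returns the attempts used. */
    static int refreshUntilSuccess(BooleanSupplier refresh) {
        int used = 1;
        while (!refresh.getAsBoolean()) {
            used++;
        }
        return used;
    }
}
```

In a real client the looping variant would still need a terminating condition, such as the bootstrap fallback mentioned in the comment.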

cause);
// Run metadata refresh and retry on a separate thread to avoid
// blocking Netty IO threads that may complete the failed future.
CompletableFuture.runAsync(
Member

Do we want some backoff?
I mean, 3 retries fire within milliseconds, which seems wasteful on slow DNS or restarting pods.
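
For reference, one common shape for such a backoff is a capped exponential delay. The sketch below is an assumption about what it could look like, not something this PR implements; `BackoffSketch`, `delayMillis`, and the 100 ms / 1 s values used in the note are all hypothetical.

```java
/** Hypothetical capped exponential backoff; not part of this PR. */
final class BackoffSketch {

    /**
     * Delay before retry number {@code attempt} (0-based): the base delay doubled
     * once per previous attempt, never exceeding {@code maxMillis}.
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        // Stop doubling as soon as the cap is reached, which also keeps the value from overflowing
        // for any realistic cap.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

With a 100 ms base and a 1 s cap, the first retries would wait 100, 200, 400, and 800 ms, and every later retry would wait 1 s.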

Comment thread fluss-rpc/src/main/java/org/apache/fluss/rpc/RetryableGatewayClientProxy.java Outdated
Comment thread fluss-client/src/main/java/org/apache/fluss/client/admin/FlussAdmin.java Outdated
Comment thread fluss-rpc/src/main/java/org/apache/fluss/rpc/RetryableGatewayClientProxy.java Outdated
Comment thread fluss-rpc/src/main/java/org/apache/fluss/rpc/RetryableGatewayClientProxy.java Outdated
@loserwang1024
Contributor Author

@fresh-borzoni, I've revised the design: instead of retrying 3 times, it now rebuilds metadata via refreshClusterUntilAvailable until either some IP becomes available or it falls back to bootstrap. There is no backoff for now: I'd like to keep refreshClusterUntilAvailable purely "loop until available or bootstrap" to avoid over-engineering. The two existing layers (connection timeout and bootstrap exponential backoff) already provide sufficient throttling.
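
The "loop until available or bootstrap" behaviour described above might be modelled roughly as follows. This is a simplified sketch under assumptions, not the actual refreshClusterUntilAvailable: the class, the `Result` type, the string addresses, and the `reachable` predicate are all hypothetical stand-ins for the real metadata and connection logic.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Toy model of "loop until available or bootstrap"; names and types are hypothetical. */
final class RefreshUntilAvailableSketch {

    /** Which server answered, and whether the bootstrap list had to be used. */
    static final class Result {
        final String server;
        final boolean fromBootstrap;

        Result(String server, boolean fromBootstrap) {
            this.server = server;
            this.fromBootstrap = fromBootstrap;
        }
    }

    /**
     * Tries every cached server once; if none is reachable, falls back to the
     * bootstrap servers. Returns empty only if bootstrap is unreachable too.
     */
    static Optional<Result> refresh(
            List<String> cachedServers,
            List<String> bootstrapServers,
            Predicate<String> reachable) {
        // The loop is bounded by the size of the cached cluster, not by a fixed retry count.
        for (String server : cachedServers) {
            if (reachable.test(server)) {
                return Optional.of(new Result(server, false));
            }
        }
        // Every cached address was stale: re-initialize from bootstrap.
        for (String server : bootstrapServers) {
            if (reachable.test(server)) {
                return Optional.of(new Result(server, true));
            }
        }
        return Optional.empty();
    }
}
```

In this model the number of attempts scales with the number of cached servers, so a cluster of any size either finds a live address or reaches the bootstrap fallback within a single request.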

Successfully merging this pull request may close these issues.

[client] Admin readOnlyGateway never refreshes metadata on network errors, causing permanent RPC failure during server rolling upgrades

3 participants