[server] fix stop replica deletion stuck when TabletServer is offline by gyang94 · Pull Request #3391 · apache/fluss

gyang94 · 2026-05-27T12:56:07Z

Purpose

Linked issue: close #3357

Brief change log

Summary

When a stopReplica RPC fails due to transient network issues or a TabletServer crash, the Coordinator has no reliable retry mechanism. This causes replicas to get stuck and table deletion to never complete, resulting in the tableCount metric never decreasing.

This PR introduces a per-TabletServer sender thread model (aligned with Kafka's ControllerChannelManager / RequestSendThread) and a new ReplicaDeletionIneligible state. These changes provide robust retry and pause/resume semantics for replica deletion.

Changes

Core: Per-TS Sender Thread (`ControlRequestSendThread`)

Dedicated Sender Thread: Each TabletServer gets a dedicated sender thread with a FIFO queue.
New Replica State: Introduced ReplicaDeletionIneligible, a state for replicas whose deletion cannot proceed (e.g., TS offline or returned a business error).
Resume Logic: TableManager.resumeDeletions() implements 3-step logic:
1. Complete if all replicas succeeded.
2. Retry previously-ineligible replicas on alive TSes.
3. Re-fire eligible tables.
Auto-Resume on Reconnect: processNewTabletServer() clears ineligible marks and triggers resumeDeletions(), so paused deletions automatically resume when a TS reconnects.
Handle Dead TS: processDeadTabletServer() transitions in-flight deletion replicas to ineligible.
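
For readers who want to see the shape of the three-step resume logic listed above, here is a minimal sketch. It is not the code in this PR: the class, the `Outcome` enum, the state names other than the ineligible one, and the reading of "re-fire" as "the table has no blocked replica left" are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a three-step resume pass; not the actual TableManager code. */
final class ResumeDeletionsSketch {

    /** Simplified replica states; only the ineligible state is named in the PR description. */
    enum State { DELETION_STARTED, DELETION_SUCCESSFUL, DELETION_INELIGIBLE }

    /** Result of one resume pass for a single table (assumed for this sketch). */
    enum Outcome { COMPLETED, REFIRED, PAUSED }

    /** One replica of a table that is being deleted (mutable for brevity). */
    static final class Replica {
        final int serverId;
        State state;

        Replica(int serverId, State state) {
            this.serverId = serverId;
            this.state = state;
        }
    }

    /**
     * Runs one resume pass. Replicas that need a new stop-replica request are
     * appended to {@code toSend}; the caller is expected to enqueue them.
     */
    static Outcome resume(List<Replica> replicas, Set<Integer> aliveServers, List<Replica> toSend) {
        // Step 1: every replica is deleted, so the table deletion can complete.
        boolean allDone = true;
        for (Replica r : replicas) {
            allDone &= r.state == State.DELETION_SUCCESSFUL;
        }
        if (allDone) {
            return Outcome.COMPLETED;
        }

        // Step 2: retry previously ineligible replicas whose server is alive again.
        boolean stillBlocked = false;
        for (Replica r : replicas) {
            if (r.state != State.DELETION_INELIGIBLE) {
                continue;
            }
            if (aliveServers.contains(r.serverId)) {
                r.state = State.DELETION_STARTED;
                toSend.add(r);
            } else {
                stillBlocked = true;
            }
        }

        // Step 3: a table with no blocked replica is eligible again and is re-fired;
        // otherwise it stays paused until the missing server reconnects.
        return stillBlocked ? Outcome.PAUSED : Outcome.REFIRED;
    }
}
```

In this sketch a table whose only unfinished replica lives on an offline server returns PAUSED and sends nothing; once that server is added to `aliveServers`, the same call moves the replica back to the started state and returns REFIRED.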

Config

coordinator.request.retry.backoff: Backoff between retries (default: 100ms).
coordinator.request.timeout: RPC timeout per attempt (default: 30s).
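
To illustrate how a per-TabletServer FIFO queue, a retry backoff, and a per-attempt timeout could fit together, here is a minimal sketch. It is not `ControlRequestSendThread` itself: the class name, the `Function`-based RPC stand-in, the string requests, and the choice to treat every failed or timed-out attempt as transient are assumptions for illustration only.

```java
import java.time.Duration;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical per-server sender: one FIFO queue, one thread, retry until delivered. */
final class SenderThreadSketch extends Thread {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Stand-in for the real RPC client; returns a future that completes when the server replies.
    private final Function<String, CompletableFuture<Void>> rpc;
    private final Duration retryBackoff;
    private final Duration requestTimeout;
    private volatile boolean running = true;

    SenderThreadSketch(
            Function<String, CompletableFuture<Void>> rpc,
            Duration retryBackoff,
            Duration requestTimeout) {
        super("sender-sketch");
        this.rpc = rpc;
        this.retryBackoff = retryBackoff;
        this.requestTimeout = requestTimeout;
        setDaemon(true);
    }

    /** Appends a request to the tail of the queue. */
    void enqueue(String request) {
        queue.add(request);
    }

    @Override
    public void run() {
        try {
            while (running) {
                String request = queue.take();
                // Retry the head request until it succeeds, so later requests never overtake it.
                while (running && !trySend(request)) {
                    Thread.sleep(retryBackoff.toMillis());
                }
            }
        } catch (InterruptedException e) {
            // shutdown() interrupts a blocked take(), get(), or sleep(); exit the loop.
        }
    }

    /** Sends once and waits up to the request timeout; false means "retry after backoff". */
    private boolean trySend(String request) throws InterruptedException {
        try {
            rpc.apply(request).get(requestTimeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException | TimeoutException e) {
            return false; // assumption: every such failure is treated as transient here
        }
    }

    /** Stops the thread and waits for it to exit. */
    void shutdown() {
        running = false;
        interrupt();
        try {
            join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the thread keeps retrying the head of the queue, a request enqueued later is only sent after the earlier one has been delivered, which is the ordering property a FIFO queue per server is meant to give.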

What was removed

retryDeleteAndSuccessDeleteReplicas(): The old "retry-N-then-force-success" mechanism.
failDeleteNumbers tracking and DELETE_TRY_TIMES constant.
Direct RPC calls from CoordinatorRequestBatch (replaced by queue-based dispatch).

Tests

API and Format

Documentation

swuferhong

Hi @gyang94, thanks for your contribution. It's an important feature. I left some comments:

swuferhong · 2026-06-02T02:02:18Z

+    }
+
+    @Nullable
+    public Set<TableBucketReplica> getDeletionReplicas() {


Neither this method nor this field is used. I think they should be removed.

swuferhong · 2026-06-02T02:08:28Z

+ * coordinator → tablet-server RPCs (e.g., {@code NOTIFY_LEADER_AND_ISR}) once they are migrated to
+ * the same sender-thread retry layer.
+ */
+public enum ApiKey {


I don't think we need to introduce this new enum. We already have org.apache.fluss.rpc.protocol.ApiKeys in fluss-rpc, which is the canonical enum for all wire-protocol APIs and already includes STOP_REPLICA — along with all the other control-plane RPCs (NOTIFY_LEADER_AND_ISR, UPDATE_METADATA, NOTIFY_REMOTE_LOG_OFFSETS, etc.).

The new ApiKey only has a single member today, and as we migrate more control-plane RPCs onto the sender thread, its members would just duplicate the ones already in ApiKeys. That gives us two parallel "ApiKey" concepts to keep in sync, which seems unnecessary.

There's also no dependency concern: fluss-server already depends on fluss-rpc and uses ApiKeys elsewhere.

swuferhong · 2026-06-02T02:12:33Z

                    .withDescription("The amount of time to sleep when fetch bucket error occurs.")
                    .withFallbackKeys("log.replica.fetch-backoff-interval");

+    public static final ConfigOption<Duration> COORDINATOR_REQUEST_RETRY_BACKOFF =


Why are the two coordinator-related config options placed together with the log-related ones? Could we move them up to the server module section instead? Also, please add documentation for these options in configuration.md.

swuferhong · 2026-06-02T02:22:39Z

                    .withFallbackKeys("log.replica.fetch-backoff-interval");

+    public static final ConfigOption<Duration> COORDINATOR_REQUEST_RETRY_BACKOFF =
+            key("coordinator.request-retry.backoff-interval")


The two new options use inconsistent key structures:
coordinator.request-retry.backoff-interval — uses a request-retry segment
coordinator.request.timeout — uses a request segment

These are both about the same thing (control-plane requests from the coordinator), so they should share a common coordinator.request. prefix.

Right now request-retry introduces a separate sub-namespace that doesn't line up with request.timeout. Suggested rename to keep them under one prefix:
coordinator.request-retry.backoff-interval → coordinator.request.retry-backoff
coordinator.request.timeout → keep as is

swuferhong · 2026-06-02T02:24:22Z

+                    .withDescription(
+                            "The backoff duration the coordinator waits before retrying a "
+                                    + "control-plane request to a tablet server after a "
+                                    + "transient RPC-layer failure. Mirrors Kafka's "


"Mirrors Kafka's ControllerChannelManager retry backoff (hardcoded 100ms)": I suggest removing this part. There's no need to mention it here.

swuferhong · 2026-06-02T05:37:16Z

    }

+    @VisibleForTesting
+    Map<Integer, TabletServerChannelState> getChannelStates() {


This is unused. Please remove it.

swuferhong · 2026-06-02T05:40:46Z

+        assertThat(getGaugeValue(metricGroup, MetricNames.SENDER_ALIVE)).isEqualTo(1);
+    }
+
+    @SuppressWarnings("unchecked")


Redundant suppression; it can be removed.

swuferhong · 2026-06-02T05:47:16Z

+        this.metricGroup = metricGroup;
+    }
+
+    public int getTabletServerId() {


No callers. Please remove it.

swuferhong · 2026-06-02T05:47:26Z

+        return tabletServerId;
+    }
+
+    public ServerNode getServerNode() {


No callers. Please remove it.

swuferhong · 2026-06-02T05:56:54Z

    public void removeTabletServer(Integer serverId) {
+        synchronized (channelLock) {
+            TabletServerChannelState state = channelStates.remove(serverId);
+            teardownChannelState(serverId, state);


Maybe the teardown should happen outside of the lock. ShutdownableThread.shutdown() blocks until the thread is fully joined, and we call it inside the synchronized (channelLock) block. The same lock guards three hot paths:

the SENDER_TOTAL_QUEUE_SIZE gauge lambda, which is read by the metric-reporter thread;

sendStopBucketReplicaRequest, when it looks up the channel state;

getChannelStates().

So for as long as the join takes, metric reporting and control-plane sends to all other tablet servers are blocked on a single TS teardown.
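
One way to address this is sketched below: remove the entry while holding the lock, then do the blocking shutdown after the lock is released. This is only an illustration, not the PR's code. It uses a plain `Thread` with `interrupt()` and `join()` as a stand-in for `ShutdownableThread.shutdown()`, and the class and method names other than `removeTabletServer` and `channelLock` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical channel manager showing teardown outside the lock. */
final class ChannelManagerSketch {
    private final Object channelLock = new Object();
    private final Map<Integer, Thread> senders = new HashMap<>();

    void addTabletServer(int serverId, Thread sender) {
        synchronized (channelLock) {
            senders.put(serverId, sender);
        }
        sender.start();
    }

    void removeTabletServer(int serverId) {
        Thread sender;
        synchronized (channelLock) {
            // Only the map mutation needs the lock.
            sender = senders.remove(serverId);
        }
        if (sender == null) {
            return;
        }
        // The blocking join runs with the lock released, so readers of the map
        // (metrics, sends to other servers) are not stalled by this teardown.
        sender.interrupt();
        try {
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Example of a reader that shares the lock, similar to a queue-size gauge. */
    int channelCount() {
        synchronized (channelLock) {
            return senders.size();
        }
    }
}
```

The trade-off is that the removed sender may still be running for a short time after it has disappeared from the map, so any code that looks up a channel must tolerate a missing entry.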

gyang94 added 3 commits May 27, 2026 18:42

fix: stop-replica-failed

5fac1ab

fix: stop-replica-failed

3e192c3

fix: stop-replica-failed

463f728

gyang94 mentioned this pull request May 27, 2026

[server] Fix table deletion stuck when StopReplicaRequest send fails #3359

Open

gyang94 force-pushed the per-sender-retry branch from 87952c7 to caaaebb Compare May 28, 2026 07:59

swuferhong added the priority=blocker label Jun 2, 2026

swuferhong reviewed Jun 2, 2026


feat: per tablet server sender thread

a490b4c

gyang94 force-pushed the per-sender-retry branch from caaaebb to a490b4c Compare June 2, 2026 10:35

gyang94 wants to merge 4 commits into apache:main from gyang94:per-sender-retry

2 participants
