Skip to content

Subscription: add topic owner epoch fencing#17780

Open
VGalaxies wants to merge 1 commit into
masterfrom
subscription-topic-owner-fencing
Open

Subscription: add topic owner epoch fencing#17780
VGalaxies wants to merge 1 commit into
masterfrom
subscription-topic-owner-fencing

Conversation

@VGalaxies
Copy link
Copy Markdown
Contributor

Summary

  • Add topic-level owner epoch metadata for subscription fencing.
  • Propagate owner id and epoch from subscription consumers.
  • Reject stale owners during heartbeat, subscribe, poll, and commit.
  • Add focused tests for serialization and owner transfer fencing.

Tests

  • mvn test -pl iotdb-core/relational-grammar,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode -Dtest=SubscriptionReceiverV1Test -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons -Dtest=TopicDeSerTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false

@Caideyipi
Copy link
Copy Markdown
Collaborator

Findings

  • P1: TopicMeta 的新尾部字段破坏旧格式兼容。deserialize(InputStream) 用 inputStream.available() 判断是否还有 owner
    字段,但 TopicMeta 在 snapshot 里是连续写入的;旧版本数据没有 owner flag,多个 topic 时这里会把下一个 topic
    的首字节当成 owner flag 消费,导致后续反序列化错位。ByteBuffer.hasRemaining() 在旧 procedure/plan
    列表里也有同类问题。见 TopicMeta.java#L260

(

if (inputStream.available() > 0 && ReadWriteIOUtils.readBool(inputStream)) {
)
和 TopicMetaKeeper.java#L112

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-core/node-commons/src/main/java/org/apache/iotdb/commons/subscription/meta/topic/TopicMetaKeeper.java#L112)。建议改成版本化或长度前缀格式,或只从
config 属性恢复 owner,不在非自描述对象尾部追加可选字段。

  • P1: owner epoch 单调性没有在真实元数据替换路径上保证。transferOwner() 只在同一个对象上检查递增,但 AlterTopicPlan /
    metadata replace 路径直接 remove + add 新 TopicMeta,新对象从 -1 初始化,所以较小 epoch 也能覆盖当前 epoch,使旧
    owner 重新合法。见 TopicMeta.java#L135

(

if (isOwnerFencingEnabled() && ownerEpoch <= this.ownerEpoch) {
)
和 SubscriptionInfo.java#L340

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/persistence/subscription/SubscriptionInfo.java#L340)。建议在
ConfigNode alter/handle meta change 时对比 existing owner epoch 并拒绝回退。

  • P2: 新增的 owner status code 没有接入客户端错误分类。1913-1916 会落到 default,变成 generic critical
    exception;heartbeat 还会把 provider 标成 unavailable,导致 stale owner 场景反复重连/重试且错误信息不稳定。见
    TSStatusCode.java#L319

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-client/service-rpc/src/main/java/org/apache/iotdb/rpc/TSStatusCode.java#L319)、AbstractSubscriptionProvider.java#L445

(https://github.com/apache/iotdb/blob/76f2a882411ba790f7db290740f0232a1e5c4022/iotdb-client/subscription/src/main/java/org/apache/iotdb/session/subscription/consumer/base/AbstractSubscriptionProvider.java#L445)。建议显式处理
owner fenced/required/lease expired,映射成明确的不可重试业务异常。

@VGalaxies VGalaxies force-pushed the subscription-topic-owner-fencing branch from 76f2a88 to b6086b2 Compare May 28, 2026 10:35
@VGalaxies
Copy link
Copy Markdown
Contributor Author

Thanks @Caideyipi, addressed the three findings in b6086b2.

  • P1 serialization compatibility: removed the optional owner fields from the non-self-describing TopicMeta serialization tail. Owner state is now restored from topic attributes after deserialization, so sequential TopicMeta snapshot/procedure streams are not consumed out of boundary. Added a sequential deserialization regression test.
  • P1 epoch rollback: added TopicMeta.validateOwnerProgression(...) and applied it on both ConfigNode alter-topic metadata replacement and DataNode topic-meta update handling, rejecting owner clears/rollbacks/stale same-epoch owner changes. Added ConfigNode rollback coverage and extended the network-partition old-SN test to verify stale topic meta cannot make the old owner valid again.
  • P2 client classification: mapped 1913-1917 explicitly to SubscriptionOwnerFencedException, a SubscriptionRuntimeNonCriticalException subclass, so stale-owner/business fencing errors no longer fall into the generic critical default path.

Local verification passed:

  • mvn spotless:apply -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode,iotdb-core/confignode -DskipTests
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons -Dtest=TopicDeSerTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/confignode -Dtest=SubscriptionInfoTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-core/relational-grammar,iotdb-client/service-rpc,iotdb-client/subscription,iotdb-core/node-commons,iotdb-core/datanode -Dtest=SubscriptionReceiverV1Test -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • mvn test -pl iotdb-client/service-rpc,iotdb-client/subscription -Dtest=TSStatusCodeTest -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  • git diff --check

The Sonar duplication check is queued again on the new commit; I will follow up if the rerun still fails.

@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
5.2% Duplication on New Code (required ≤ 5%)

See analysis details on SonarQube Cloud

@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 44.61538% with 144 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.82%. Comparing base (0c25e53) to head (b6086b2).
⚠️ Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
.../subscription/receiver/SubscriptionReceiverV1.java 0.00% 40 Missing ⚠️
...tdb/commons/subscription/meta/topic/TopicMeta.java 81.72% 17 Missing ⚠️
...on/consumer/base/AbstractSubscriptionConsumer.java 0.00% 10 Missing ⚠️
...umer/base/AbstractSubscriptionConsumerBuilder.java 0.00% 10 Missing ⚠️
...on/consumer/base/AbstractSubscriptionProvider.java 0.00% 10 Missing ⚠️
...on/exception/SubscriptionOwnerFencedException.java 0.00% 8 Missing ⚠️
...on/consumer/tree/SubscriptionTreePullConsumer.java 0.00% 8 Missing ⚠️
...on/consumer/tree/SubscriptionTreePushConsumer.java 0.00% 8 Missing ⚠️
.../db/subscription/agent/SubscriptionTopicAgent.java 76.47% 8 Missing ⚠️
...er/table/SubscriptionTablePullConsumerBuilder.java 0.00% 6 Missing ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17780      +/-   ##
============================================
+ Coverage     40.58%   40.82%   +0.23%     
- Complexity     2575     2616      +41     
============================================
  Files          5181     5185       +4     
  Lines        350404   351103     +699     
  Branches      44801    44925     +124     
============================================
+ Hits         142225   143334    +1109     
+ Misses       208179   207769     -410     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants