Add Tablet.serializedSize() and comprehensive size validation tests. #824

Merged
jt2594838 merged 3 commits into apache:develop from luoluoyuyu:tablet-serialized-size
Jun 3, 2026

@luoluoyuyu
Member

Pre-allocate serialization buffer using exact size estimation, support OBJECT type in tablet serialize/deserialize path, and consolidate serializedSize tests.

@luoluoyuyu luoluoyuyu marked this pull request as draft May 26, 2026 09:02
@Caideyipi
Contributor

I found a functional issue.

Tablet.serializedSize() claims to return the exact serialized byte size, but it uses
ReadWriteIOUtils.sizeToWrite(insertTargetName) to calculate string sizes. That helper uses s.getBytes(), which
depends on the platform default charset. The actual serialization path uses ReadWriteIOUtils.write(String, ...),
which encodes strings with TSFileConfig.STRING_CHARSET (UTF-8).

So when the device/table name, measurement name, or schema properties contain non-ASCII characters, serializedSize()
can differ from the real serialized size if the process default charset is not UTF-8.

This is probably not an issue when TsFile is used through IoTDB, because IoTDB startup sets the default charset. But
TsFile can also be used independently, and in standalone usage this can make the size estimate incorrect and break the
“exact size” guarantee.

Suggested fix: make ReadWriteIOUtils.sizeToWrite(String) use TSFileConfig.STRING_CHARSET, consistent with the
write path, and add a non-ASCII name test.
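
To illustrate the mismatch, here is a minimal sketch. The `sizeToWrite(String, Charset)` helper below is a hypothetical stand-in, not the real `ReadWriteIOUtils` signature, and it assumes a string is written as a 4-byte length prefix followed by the encoded bytes. ISO-8859-1 stands in for a non-UTF-8 platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetSizeSketch {
  // Hypothetical helper (assumption): 4-byte length prefix plus the encoded bytes.
  static int sizeToWrite(String s, Charset charset) {
    return Math.addExact(Integer.BYTES, s.getBytes(charset).length);
  }

  public static void main(String[] args) {
    // Two CJK characters, written as escapes so the source file stays ASCII.
    String name = "\u6e29\u5ea6";
    // Size as the write path would produce it when it encodes with UTF-8.
    int written = sizeToWrite(name, StandardCharsets.UTF_8);
    // Size as an estimate would come out under a non-UTF-8 default charset.
    int estimated = sizeToWrite(name, StandardCharsets.ISO_8859_1);
    System.out.println("written=" + written + ", estimated=" + estimated);
  }
}
```

For ASCII-only names both charsets give the same byte count, which is why a test with a non-ASCII name is needed to catch the difference.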

There is also a CodeQL alert for integer narrowing/overflow in serializedSizeOfTimes(). Since this method is
intended to return an exact byte size, that should probably be handled as well.
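
A possible way to handle the overflow is sketched below. The actual body of serializedSizeOfTimes() is not shown in this thread, so the method names and the "row count times Long.BYTES" shape are assumptions made only for illustration.

```java
final class ExactSizeSketch {
  // Assumed shape: one 8-byte timestamp per row. Throws ArithmeticException on int overflow.
  static int timesSize(int rowCount) {
    return Math.multiplyExact(rowCount, Long.BYTES);
  }

  // The pattern a narrowing alert warns about: compute in long, then cast back to int.
  static int narrowingTimesSize(int rowCount) {
    return (int) ((long) rowCount * Long.BYTES);
  }

  public static void main(String[] args) {
    System.out.println(timesSize(3));
    // 2^29 rows * 8 bytes = 2^32, which does not fit in an int.
    System.out.println(narrowingTimesSize(1 << 29));
    try {
      timesSize(1 << 29);
    } catch (ArithmeticException e) {
      System.out.println("overflow detected: " + e.getMessage());
    }
  }
}
```

With the exact variant, an oversized tablet fails loudly instead of reporting a wrapped size.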

@luoluoyuyu luoluoyuyu marked this pull request as ready for review May 28, 2026 10:30
Comment thread java/tsfile/src/main/java/org/apache/tsfile/write/record/Tablet.java Outdated
Comment on lines 813 to +815
  size = Math.addExact(size, Integer.BYTES);
- size =
-     Math.addExact(
-         size,
-         ReadWriteIOUtils.sizeToWrite(
-             new Binary(bitMaps[i].getTruncatedByteArray(rowSize))));
+ size = Math.addExact(size, Integer.BYTES);
+ size = Math.addExact(size, BitMap.getSizeOfBytes(rowSize));
Contributor

Why two integers?

Member Author

The two Integer.BYTES entries represent two different fields in the serialized format.

In writeBitMaps(), a non-empty bitmap is serialized as:

  1. hasBitMap flag: 1 byte
  2. rowSize: 4 bytes
  3. Binary length prefix: 4 bytes
  4. bitmap bytes: BitMap.getSizeOfBytes(rowSize)

The previous code used:

ReadWriteIOUtils.sizeToWrite(new Binary(bitMaps[i].getTruncatedByteArray(rowSize)))

That includes the Binary length prefix plus the actual bitmap bytes.

So the new code:

size = Math.addExact(size, Integer.BYTES); // rowSize
size = Math.addExact(size, Integer.BYTES); // Binary length prefix
size = Math.addExact(size, BitMap.getSizeOfBytes(rowSize)); // bitmap bytes

is equivalent to the old calculation. The two integers are not duplicates: one is the bitmap logical size (rowSize), and the other is the length prefix
written by ReadWriteIOUtils.write(Binary, stream).
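
The four-field layout described above can be checked with a small standalone sketch. It does not call the real writeBitMaps() or BitMap classes. The class and method names are hypothetical, the bitmap is passed in as a plain byte array, and the integers are assumed to be written as 4-byte values as listed in the reply.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class BitMapSizeSketch {
  // Size estimate that mirrors the four fields of a non-empty bitmap entry.
  static int serializedSize(int bitmapByteCount) {
    int size = Byte.BYTES; // hasBitMap flag
    size = Math.addExact(size, Integer.BYTES); // rowSize
    size = Math.addExact(size, Integer.BYTES); // Binary length prefix
    size = Math.addExact(size, bitmapByteCount); // bitmap bytes
    return size;
  }

  // Writes the same four fields so the estimate can be compared with real output.
  static byte[] write(int rowSize, byte[] bitmapBytes) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream data = new DataOutputStream(out);
      data.writeBoolean(true); // hasBitMap flag, 1 byte
      data.writeInt(rowSize); // rowSize, 4 bytes
      data.writeInt(bitmapBytes.length); // length prefix, 4 bytes
      data.write(bitmapBytes); // bitmap bytes
      data.flush();
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] bitmapBytes = new byte[2]; // enough bits for 10 rows
    byte[] written = write(10, bitmapBytes);
    System.out.println(written.length == serializedSize(bitmapBytes.length));
  }
}
```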

@jt2594838 jt2594838 merged commit 86ec4b9 into apache:develop Jun 3, 2026
14 checks passed
4 participants