Skip to content

C++: optimize batch read/write paths and aligned table null handling#823

Open
ColinLeeo wants to merge 1 commit into
developfrom
colin_all_opts
Open

C++: optimize batch read/write paths and aligned table null handling#823
ColinLeeo wants to merge 1 commit into
developfrom
colin_all_opts

Conversation

@ColinLeeo
Copy link
Copy Markdown
Contributor

@ColinLeeo ColinLeeo commented May 26, 2026

Summary

This PR optimizes the C++ TsFile read/write paths for batch and columnar workloads, and fixes several aligned table null-handling issues uncovered while validating the optimized path.

It consolidates the batch decode/write work from the long-lived optimization branch into a reviewable change for develop.

Supersedes #749, #754, and #774.

Main Changes

  • Add batch decode APIs to the C++ Decoder hierarchy and implement batch paths for PLAIN, TS2DIFF, and Gorilla.
  • Add TsBlock-level batch read paths in ChunkReader and AlignedChunkReader, including shared timestamp decoding for aligned multi-value reads.
  • Add columnar tablet write helpers and batch page writes for fixed-length and string columns.
  • Add optional parallel page decode/write infrastructure behind the existing C++ configuration.
  • Add SIMD-oriented fast paths for TS2DIFF, PLAIN byte swapping, and selected statistics updates.
  • Improve ByteStream, compressor, and page/chunk writer internals used by the optimized paths.
  • Simplify the C wrapper surface by removing unused metadata/tag-filter export APIs.

Correctness Fixes

  • Fix null TAG device boundary detection so null string offsets are not read as valid values.
  • Fix null string FIELD writes so null rows do not read uninitialized string offsets.
  • Preserve all-null aligned value pages by counting page rows separately from non-null statistic counts.
  • Treat missing aligned measurements as sparse columns instead of failing the whole device read.
  • Add a time-only aligned read path for devices whose queried FIELD columns are all null/missing.
  • Fix repeated logical devices inside one tablet by aggregating their segments before writing aligned chunks.
  • Fix ValuePageWriter::reset() so row count and null bitmap state are reset together.
  • Remove unused temporary TS2DIFF prototype headers that were not part of the production implementation.

Compatibility Notes

  • Existing C++ APIs remain additive for the optimized read/write paths.
  • The public C wrapper removes unused metadata export/tag-filter handles. Downstream wrappers should sanity-check that they do not depend on those removed symbols.
  • cpp/third_party/ is left at develop state so existing platform compatibility fixes are preserved.

Verification

  • cmake --build cpp/target/build --target TsFile_Test -j1
  • ctest --test-dir cpp/target/build/test --output-on-failure -R '^TsFileTableReaderTest\.TestNullInTable4$'
  • ctest --test-dir cpp/target/build/test --output-on-failure -j4
  • cd cpp && mvn spotless:check
  • cd cpp && mvn apache-rat:check

Current full C++ test result: 496/496 tests pass.

@ColinLeeo ColinLeeo force-pushed the colin_all_opts branch 2 times, most recently from 69e1658 to 7db1a3a Compare May 28, 2026 09:28
@ColinLeeo ColinLeeo changed the title TsFile C++ batch read/write optimization C++: optimize batch read/write paths and aligned table null handling May 28, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 28, 2026

Codecov Report

❌ Patch coverage is 37.90524% with 2241 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.28%. Comparing base (2a864c5) to head (b77ad62).

Files with missing lines Patch % Lines
cpp/src/reader/aligned_chunk_reader.cc 8.14% 814 Missing and 32 partials ⚠️
cpp/src/cwrapper/tsfile_cwrapper.cc 39.69% 309 Missing and 45 partials ⚠️
cpp/src/encoding/ts2diff_decoder.h 31.70% 127 Missing and 13 partials ⚠️
...p/src/reader/block/single_device_tsblock_reader.cc 61.35% 71 Missing and 26 partials ⚠️
cpp/src/writer/tsfile_writer.cc 53.65% 72 Missing and 23 partials ⚠️
cpp/src/reader/chunk_reader.cc 31.66% 36 Missing and 46 partials ⚠️
cpp/src/reader/filter/time_operator.cc 16.45% 66 Missing ⚠️
cpp/src/encoding/decoder.h 0.00% 60 Missing ⚠️
cpp/src/common/statistic.h 67.42% 30 Missing and 27 partials ⚠️
cpp/src/file/tsfile_io_reader.cc 41.57% 35 Missing and 17 partials ⚠️
... and 41 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #823      +/-   ##
===========================================
- Coverage    61.57%   58.28%   -3.30%     
===========================================
  Files          731      733       +2     
  Lines        45874    47652    +1778     
  Branches      6880     7407     +527     
===========================================
- Hits         28249    27774     -475     
- Misses       16614    18684    +2070     
- Partials      1011     1194     +183     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR consolidates a large optimization branch into develop: it adds batch decode/write APIs across the C++ Decoder/Encoder hierarchy, multi-value aligned read paths with optional parallel decode, columnar tablet write helpers, SIMD fast paths, and a set of correctness fixes for aligned/table null handling (null TAG segments, null FIELD writes, all-null value pages, sparse aligned columns, repeated logical devices, ValuePageWriter::reset state). It also trims the C wrapper API (drops unused metadata export/tag-filter symbols, then re-adds tag-filter helpers in a different section) and removes several regression tests for behaviors it claims to fix.

Changes:

  • Add batch decode/encode/write paths through Decoder/Encoder/page/chunk writers and a MultiAlignedTimeseriesIndex plus single-device aligned fast-path reader.
  • Several aligned table fixes (null TAG/FIELD, all-null pages, single-device tablet flag, ValuePageWriter::reset, double-free of first-page buffers via release_cur_page_data).
  • Build/infra: SIMD option, optional BUILD_EXAMPLES, mem-stat counters widened to 64-bit, new BlockingQueue, removal of several existing regression tests, license-header punctuation churn in multiple CMake files.

Reviewed changes

Copilot reviewed 118 out of 119 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
cpp/src/reader/filter/{and,or}_filter.h, filter.h, time_operator.h Adds satisfy_batch_time (uses fixed 129-element stack buffer — flagged).
cpp/src/encoding/{plain,decoder,encoder,plain_decoder,dictionary_encoder}.h Batch encode/decode API + dictionary index assignment change (flagged).
cpp/src/writer/{value_,time_,}{chunk,page}_writer.{h,cc} Batch write paths, first-page ownership transfer, larger page buffers.
cpp/src/writer/tsfile_table_writer.{h,cc}, tsfile_writer.h Memoized lowercasing, idempotent close, optional parallel write pool.
cpp/src/reader/* Aligned multi-value batch path, bloom-filter contains, table result-set lifecycle.
cpp/src/file/tsfile_io_writer.{h,cc}, restorable_tsfile_io_writer.cc Recovery cleanup simplified; conditional sync_on_close_ (flagged); chunk-group index for O(1) lookup.
cpp/src/file/tsfile_io_reader.h Device-node cache + multi-SSI alloc.
cpp/src/common/allocator/byte_stream.h, alloc_base.h, mem_alloc.cc Page-mask bitwise modulo + power-of-2 rounding for wrapped buffers (flagged), 64-bit stat counters.
cpp/src/common/{tablet,schema,path,global,thread_pool}.* Single-device flag, string-column uint32_t offsets, Path inlined, config knobs reshuffled.
cpp/src/common/container/{bit_map,blocking_queue,byte_buffer}.* New BlockingQueue, BitMap::may_have_set_bits, bounds asserts.
cpp/src/compress/{snappy,lz4,uncompressed}_compressor.* Safer after_compress ownership handling; Uncompressed now copies.
cpp/src/cwrapper/{tsfile_cwrapper.h,arrow_c.cc} Tag-filter API moved, sliced-Arrow handling reverted (loses prior bug-fix paths).
cpp/test/** Deletes several regression tests (deep path, missing measurement, aligned NULL boundary, dictionary RLE run counts, Arrow slice-with-offset, etc.) and adds new batch/page-boundary tests.
python/tsfile/dataset/reader.py + tests Switches row reads to read_arrow_batch().
cpp/{CMakeLists.txt,examples/**,src/CMakeLists.txt,src/common/CMakeLists.txt,test/CMakeLists.txt} Build flags, SIMD option, Arrow/Parquet-dependent examples, license-header punctuation regressions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +53 to +63
int satisfy_batch_time(const int64_t* times, int count, bool* mask) {
bool mask_right[129];
left_->satisfy_batch_time(times, count, mask);
right_->satisfy_batch_time(times, count, mask_right);
int pass = 0;
for (int i = 0; i < count; ++i) {
mask[i] = mask[i] || mask_right[i];
if (mask[i]) ++pass;
}
return pass;
}
Comment on lines +53 to +63
int satisfy_batch_time(const int64_t* times, int count, bool* mask) {
bool mask_right[129];
left_->satisfy_batch_time(times, count, mask);
right_->satisfy_batch_time(times, count, mask_right);
int pass = 0;
for (int i = 0; i < count; ++i) {
mask[i] = mask[i] && mask_right[i];
if (mask[i]) ++pass;
}
return pass;
}
index_entry_.push_back(value);
map_size_ = map_size_ + value.length();
entry_index_[value] = static_cast<int>(index_entry_.size()) - 1;
entry_index_[value] = entry_index_.size();
Comment on lines +295 to 303
// page_mask_ is used as a bitmask and only works correctly for
// power-of-2 page sizes. Round up to the next power-of-2 so that
// (read_pos_ & page_mask_) gives the correct within-page offset and
// the page-crossing check doesn't misfire on arbitrary buffer sizes.
uint32_t ps = 1;
while (ps < (uint32_t)buf_len) ps <<= 1;
page_size_ = ps;
page_mask_ = ps - 1;
head_.store(&wrapped_page_);
} else if (RET_FAIL(write_file_footer())) {
std::cout << "writer file footer error, ret = " << ret << std::endl;
} else if (RET_FAIL(sync_file())) {
} else if (g_config_value_.sync_on_close_ && RET_FAIL(sync_file())) {
Comment thread cpp/src/CMakeLists.txt
@@ -1,5 +1,5 @@
#[[
Licensed to the Apache Software Foundation (ASF) under one
Licensed to the Apache Software Foundation(ASF) under one

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
software distributed under the LICENSE is distributed on an
file = write_file_new("test/", &error_no);
ASSERT_TRUE(error_no == RET_FILRET_OPEN_ERR ||
error_no == RET_ALREADY_EXIST);
ASSERT_EQ(RET_FILRET_OPEN_ERR, error_no);
Comment on lines 1105 to +1115
@@ -1320,7 +1112,7 @@ TEST_F(TreeQueryByRowTest, DISABLED_QueryByRowFasterThanManualNext) {
write_test_file(devices, measurements, num_rows);

const int num_iters = 5;
const double tolerance = 0.2;
const double tolerance = 0.05;
Comment on lines +119 to +128
uint32_t cur_points = value_page_writer_.get_point_numer();
uint32_t page_remaining =
common::g_config_value_.page_writer_max_point_num_ - cur_points;
if (page_remaining == 0) {
if (RET_FAIL(seal_cur_page(false))) {
return ret;
}
page_remaining =
common::g_config_value_.page_writer_max_point_num_;
}
Squashed PR snapshot of the long-lived `final` work, rebased on top of
current develop (2a864c5).  Combines the original "TsFile C++ batch
read/write optimization" (5f12115) snapshot with subsequent build /
platform fixes and a follow-up read-path optimization commit
(c902b2b).

═════════════════════════════════════════════════════════════════════
Read path
═════════════════════════════════════════════════════════════════════
- Decoder base gains batch APIs (read_batch_int32 / int64 / float /
  double, skip_*); PLAIN, TS2DIFF, Gorilla decoders implement them.
  TS2DIFF has block-level peeking so time filters can skip blocks
  without decoding.  Gorilla adds a raw-pointer GorillaBitReader that
  bypasses ByteStream overhead.
- ChunkReader / AlignedChunkReader add *_DECODE_TV_BATCH methods that
  decode time + value into a TsBlock in one pass, applying batch time
  filters before append.
- AlignedChunkReader supports a multi-value mode: one time chunk + N
  value chunks decoded in a single pass, sharing the decoded timestamps
  and filter mask.  SingleDeviceTsBlockReader auto-detects same-device
  measurements via VectorMeasurementColumnContext.
- Optional page-level parallel decompression via a DecodeThreadPool when
  ENABLE_THREADS is set.  Page-plan classification
  (SKIP / FULL_PASS / BOUNDARY) lets a scatter-free memcpy fast path
  fire when every row passes and no column has nulls.

Additional optimizations (from c902b2b, ported from `final`):

- Aligned fast path: enable_dense_aligned_fast_path defaults true and
  compute_dense_row_count falls back to the TimeseriesIndex top-level
  statistic for single-chunk timeseries (chunk-level stat is omitted
  during serialization for those).  Re-enables the bulk-copy SSI -->
  caller path that was defensively disabled.
- Chunk-level parallel decode: per-column tasks own all that column's
  pages and write into a per-(col,page) PageDecodedState slot; one
  wait_all per chunk amortizes thread-pool overhead.  Hybrid dispatch
  in get_next_page_multi — chunk-level for narrow chunks (<= 6 value
  columns), 4/6 thesis path otherwise to avoid cache thrash.
- Per-worker time decoder/compressor pool (via ThreadPool::
  current_worker_id) parallelizes the previously-serial time-page
  decode loop.
- Pre-decode int64/float/double values in the parallel worker into
  ValueColumnState::pending_decoded_values; multi_DECODE_TV_BATCH then
  memcpys the per-batch slice instead of calling the decoder inline.
- Partial-page bulk scatter: bulk-memcpy path now copies
  min(budget, remaining_in_page) rows from page_time_cursor_ so the
  tail page of every SSI tsblock takes the memcpy fast path instead of
  bleeding into the row-by-row scatter loop.
- tsblock_max_memory_ 64KB -> 2MB so a 10K-row page fits in one SSI
  tsblock and bulk_copy_into doesn't fragment into many tiny batches.

═════════════════════════════════════════════════════════════════════
Write path
═════════════════════════════════════════════════════════════════════
- ValuePageWriter gains write_batch / write_string_batch that take
  timestamp + value + nullness arrays directly, removing the per-value
  append loop.  Tablet exposes set_timestamps / set_column_values /
  set_column_string_repeated / reset for bulk reuse and switches
  StringColumn to an Arrow-compatible offset+buffer layout.
- TS2DIFFEncoder::flush packs all deltas with a single pack_bits_msb +
  write_buf instead of per-value write_bits, falling back to the scalar
  path for the rare bit_width > 56 case.
- Int64Statistic::update_batch (NEON-accelerated min/max/sum).

═════════════════════════════════════════════════════════════════════
Encoding / SIMD
═════════════════════════════════════════════════════════════════════
- TS2DIFF batch decode adds AVX2 helpers via SIMDe (already on develop)
  for both i32 and i64; scalar fallback unchanged.
- PLAIN byte-swap path uses ARM NEON (vrev64q_u8 / vrev32q_u8) when
  available, falling back to __builtin_bswap.
- CMakeLists adds ENABLE_SIMD; Release builds turn on -O3 -march=native
  -flto (off when ASan is on or on Windows/MinGW).

═════════════════════════════════════════════════════════════════════
Allocator / ByteStream / ThreadPool
═════════════════════════════════════════════════════════════════════
- ByteStream caches page_mask_ (= page_size - 1) so the hot path uses a
  bitmask instead of modulo; wrap_from rounds buffer sizes up to a
  power of two for correctness.
- common::ThreadPool gets a thread_local current_worker_id() accessor
  (set by worker_loop) and a num_threads() getter, letting callers
  attach per-worker state without contention.

═════════════════════════════════════════════════════════════════════
Build / platform
═════════════════════════════════════════════════════════════════════
- Linux Release: -march=native + -flto by default, automatically
  dropped under ASan to keep leak detection accurate.
- MSVC / MinGW: replace GCC-only intrinsics, restore lost includes,
  disable LTO + -march=native there.
- Restore tag_filter_create/between, metadata test, and segment
  behavior; restore cwrapper metadata + tag_filter/batch_size args on
  table query C APIs that the batch-opt snapshot had dropped.
- Disable QueryByRowPerformanceTest and the flaky
  QueryByRowFasterThanManualNext test.

═════════════════════════════════════════════════════════════════════
Python binding
═════════════════════════════════════════════════════════════════════
- read_series_by_row: pull TsBlocks via Arrow IPC instead of the
  row-by-row Python loop.  Aligns reader query plumbing with develop
  so the binding sees the same parameter set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants