Add tsfile-cli: inspect and import .tsfile from the command line #829
SpriCoder wants to merge 41 commits into
Read-only inspect/export verbs (ls/schema/stats/head/cat/select) as a single multi-call `tsfile` binary, backed by the existing C++ reader API.
9 TDD tasks (CMake scaffold -> arg parser -> formatters -> ResultSet pump -> ls/schema/stats/head/cat/select -> install/verify). Spec tweaked to match confirmed C++ APIs (table-model schema blanks encoding/compression; stats = count + time range).
Pull request overview
Adds a new C++ command-line tool (tsfile-cli) under cpp/tools/ for inspecting/exporting .tsfile contents and importing CSV/TSV into a new table-model .tsfile, with accompanying formatters, command implementations, and a new tool-focused test suite integrated into the existing CMake test target.
Changes:
- Introduces `tsfile-cli` executable + CLI dispatch/arg parsing, read commands (ls/schema/meta/stats/count/head/cat/sample), and CSV/TSV import (write).
- Adds output/input formatting helpers (CSV/TSV/NDJSON/table) and statistics aggregation helpers for metadata-driven commands.
- Integrates tool build + tests into CMake, and adjusts `ReadFile::open()` diagnostics to go to stderr.
Reviewed changes
Copilot reviewed 37 out of 38 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| cpp/tools/tools_main.cc | Adds main() entrypoint that forwards argv to run_cli. |
| cpp/tools/CMakeLists.txt | Adds tsfile-cli build target and install rule. |
| cpp/tools/README.md | Documents build/usage/options and examples for tsfile-cli. |
| cpp/tools/skills/tsfile-cli/SKILL.md | Adds a machine-readable skill doc describing CLI usage. |
| cpp/tools/cli/cli_args.h | Defines ParsedArgs struct for CLI arguments. |
| cpp/tools/cli/cli_args.cc | Implements argument parsing for all commands/options. |
| cpp/tools/cli/exit_codes.h | Defines standardized exit codes for CLI. |
| cpp/tools/cli/run_cli.h | Declares run_cli entrypoint for CLI execution. |
| cpp/tools/cli/run_cli.cc | Implements top-level dispatch, validation, and help/version handling. |
| cpp/tools/commands/commands.h | Declares per-command handlers and shared helpers. |
| cpp/tools/commands/cmd_ls.cc | Implements ls for devices/tables. |
| cpp/tools/commands/cmd_schema.cc | Implements schema listing for tree/table models. |
| cpp/tools/commands/cmd_meta.cc | Implements file-level summary output (meta). |
| cpp/tools/commands/cmd_stats.cc | Implements per-series statistics output (stats). |
| cpp/tools/commands/cmd_count.cc | Implements per-series counts + total (count). |
| cpp/tools/commands/cmd_head.cc | Implements head via shared row query helper. |
| cpp/tools/commands/cmd_cat.cc | Implements cat via shared row query helper. |
| cpp/tools/commands/cmd_sample.cc | Implements reservoir sampling (sample). |
| cpp/tools/commands/row_query.cc | Implements shared querying logic used by row-returning commands. |
| cpp/tools/commands/cmd_write.cc | Implements CSV/TSV import into new table-model .tsfile (write). |
| cpp/tools/commands/stat_table.h | Defines structures/helpers for stats + meta summary extraction. |
| cpp/tools/commands/stat_table.cc | Implements metadata/statistics collection used by stats/meta/count. |
| cpp/tools/format/output_format.h | Defines output format enum and RowWriter. |
| cpp/tools/format/output_format.cc | Implements CSV/TSV/NDJSON/table formatting and escaping. |
| cpp/tools/format/input_format.h | Defines column-spec parsing and delimited-line parsing helpers. |
| cpp/tools/format/input_format.cc | Implements --columns parsing, CSV quote splitting, bool parsing. |
| cpp/tools/format/result_set_format.h | Declares helpers for converting result sets to output. |
| cpp/tools/format/result_set_format.cc | Implements result-set streaming and reservoir-sampled output. |
| cpp/test/CMakeLists.txt | Wires tool tests into TsFile_Test when BUILD_TOOLS is on; adjusts GTest include handling. |
| cpp/test/tools/cli_test_util.h | Adds helpers for CLI fixture generation and temp file naming. |
| cpp/test/tools/cli_args_test.cc | Adds unit tests for arg parsing and CLI-level validation. |
| cpp/test/tools/input_format_test.cc | Adds unit tests for column spec parsing and CSV/TSV splitting. |
| cpp/test/tools/output_format_test.cc | Adds unit tests for escaping, formatting, and table alignment. |
| cpp/test/tools/stat_table_test.cc | Adds unit tests for statistic-to-cell conversion logic. |
| cpp/test/tools/command_e2e_test.cc | Adds in-process E2E tests including a write→read round-trip. |
| cpp/src/file/read_file.cc | Routes ReadFile::open() diagnostics to stderr instead of stdout. |
| cpp/CMakeLists.txt | Adds BUILD_TOOLS option and includes cpp/tools subdirectory. |
| .gitignore | Ignores local AI tooling dirs and test-run artifacts. |
    std::cerr << "open file " << file_path << " error :" << fd_
              << std::endl;
    std::cout << "open error" << errno << " " << strerror(errno)
    std::cerr << "open error" << errno << " " << strerror(errno)
              << std::endl;
Fixed in 7bb6b62. ReadFile::open() no longer prints fd_ (always -1 on failure); it now reports strerror(errno) plus the numeric errno.
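For readers following the thread, the shape of the revised diagnostic can be sketched roughly as below. This is only an illustration: `describe_open_error` is a hypothetical helper, not a function from the PR, and the exact message wording is an assumption.

```cpp
#include <cerrno>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the PR): builds the text a failed open()
// would send to stderr. It reports the path, the strerror() text and the
// numeric errno, and leaves out the file descriptor since that is always -1
// on failure.
std::string describe_open_error(const std::string& file_path, int err) {
    std::ostringstream msg;
    msg << "open file " << file_path << " error: " << std::strerror(err)
        << " (errno " << err << ")";
    return msg.str();
}
```

A caller would capture `errno` immediately after the failed `open()` and write the returned string to `std::cerr`, which keeps stdout free for data.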
    if (p.command == "help" || p.command == "--help" || p.command == "-h" ||
        (p.help && p.file.empty())) {
        print_usage(out);
        return kExitOk;
    }
Fixed in 7bb6b62. Help now triggers whenever -h/--help is present regardless of a positional file, so tsfile-cli <cmd> --help <file> prints usage.
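A minimal sketch of the revised check, assuming a trimmed-down stand-in for the PR's `ParsedArgs` (the struct and function names here are hypothetical):

```cpp
#include <string>

// Hypothetical, trimmed-down stand-in for the PR's ParsedArgs struct.
struct ParsedArgsSketch {
    std::string command;
    std::string file;
    bool help = false;
};

// Help is requested when the command itself is a help word or when the
// -h/--help flag was seen, whether or not a positional file is present.
bool wants_help(const ParsedArgsSketch& p) {
    return p.help || p.command == "help" || p.command == "--help" ||
           p.command == "-h";
}
```

The difference from the quoted hunk is that `p.file.empty()` no longer gates the flag.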
    " ls list devices (tree) or tables (table)\n"
    " schema per-measurement data type/encoding/compression\n"
    " meta file metadata summary\n"
    " stats per-series row count and time range\n"
Fixed in 7bb6b62. The stats usage line now reads per-series count, time range, min/max/first/last/sum.
    RowWriter w(out, fmt,
                {"file", "model", "version", "device_count", "table_count",
                 "series_count", "start_time", "end_time", "bloom_filter",
                 "file_size_bytes"},
                {common::STRING, common::STRING, common::STRING, common::INT64,
Fixed in 7bb6b62 by dropping the version and bloom_filter columns. The public TsFileReader API exposes neither the file version nor the bloom filter, so rather than emit perpetually-null columns I removed them (docs updated to match).
    |---|---|
    | `ls` | List devices (tree model) or tables (table model), one name per line |
    | `schema` | Per-series `target, measurement, datatype, encoding, compression` |
    | `meta` | File summary: model, version, device/table/series counts, time range, file size |
Fixed in 7bb6b62: the README meta row no longer lists version (the column was removed from the command).
    std::vector<DataRow> rows;
    while (std::getline(*in, line)) {
        ++line_no;
        strip_cr(line);
        if (line.empty()) {
Fixed in 7bb6b62. write now streams rows into fixed 1024-row Tablet batches and flushes as each batch fills, so memory stays flat regardless of input size — the full file is never buffered. Verified with a 2500-row import (3 batches) round-tripping to count=2500.
    case common::INT32: {
        long v = std::strtol(cell.c_str(), &e, 10);
        if (e == nullptr || *e != '\0') {
            error = "bad INT32 '" + cell + "'";
            return false;
Fixed in 7bb6b62. Numeric parsing now clears errno and rejects ERANGE for INT32/INT64/FLOAT/DOUBLE; INT32 additionally range-checks against INT32_MIN/MAX (parsed via strtoll). The timestamp parse rejects ERANGE too. e.g. importing 3000000000 into an INT32 column now errors (INT32 out of range) instead of silently truncating.
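A minimal sketch of the INT32 path described in this reply, assuming a standalone helper named `parse_int32` (hypothetical; the PR's function and exact error strings may differ). It clears `errno`, parses with `strtoll`, rejects trailing garbage, then applies both the `ERANGE` check and the explicit 32-bit range check.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <string>

// Parses `cell` as a base-10 INT32. Returns false and fills `error` when the
// text is not a complete integer or does not fit in 32 bits.
bool parse_int32(const std::string& cell, std::int32_t& out,
                 std::string& error) {
    errno = 0;  // strtoll only sets errno on failure, so clear it first
    char* end = nullptr;
    const long long v = std::strtoll(cell.c_str(), &end, 10);
    // No digits consumed (covers the empty cell) or trailing characters.
    if (end == cell.c_str() || *end != '\0') {
        error = "bad INT32 '" + cell + "'";
        return false;
    }
    // ERANGE catches values beyond long long; the explicit comparison
    // catches values that fit in long long but not in int32_t.
    if (errno == ERANGE || v < std::numeric_limits<std::int32_t>::min() ||
        v > std::numeric_limits<std::int32_t>::max()) {
        error = "INT32 out of range '" + cell + "'";
        return false;
    }
    out = static_cast<std::int32_t>(v);
    return true;
}
```

Parsing into `long long` first is what makes the range check meaningful on platforms where `long` is itself 32 bits wide.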
    under the License.
    ]]

    message("Running in tools directory")
Fixed in 7bb6b62: removed the unconditional message("Running in tools directory").
    void RowWriter::write(const std::vector<std::string>& cells,
                          const std::vector<bool>& is_null) {
        if (fmt_ == OutputFormat::kTable) {
            rows_.push_back(cells);
            rows_null_.push_back(is_null);
            return;
        }
Left as-is by design. Table rendering must read all rows to compute column widths (same as column -t / csvlook / psql aligned mode), so it can't both stream and align. The unbounded/large path is already covered: when stdout is not a TTY, resolve_format returns tsv, which streams row-by-row with zero buffering — so piped exports never buffer. table is only the default on an interactive TTY, where output is bounded by the terminal anyway; a user dumping millions of rows would pass -f tsv|csv. Rearchitecting table output to stream isn't possible without dropping column alignment, so I'd prefer to keep the current behavior.
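To illustrate why aligned output needs every row before it can print the first one, here is a small two-pass sketch. It is not the PR's `RowWriter`: `render_aligned` is a hypothetical function, the two-space column gap is an assumption, and widths are measured in bytes, so it is not Unicode-aware.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Pass 1 scans all rows to find the widest cell per column; pass 2 pads each
// cell to that width. The first pass is why the rows must be buffered.
std::string render_aligned(const std::vector<std::string>& header,
                           const std::vector<std::vector<std::string>>& rows) {
    std::vector<std::size_t> width(header.size());
    for (std::size_t c = 0; c < header.size(); ++c) width[c] = header[c].size();
    for (const auto& row : rows) {
        for (std::size_t c = 0; c < row.size() && c < width.size(); ++c) {
            width[c] = std::max(width[c], row[c].size());
        }
    }

    std::string out;
    auto emit = [&](const std::vector<std::string>& cells) {
        for (std::size_t c = 0; c < width.size(); ++c) {
            const std::string cell = c < cells.size() ? cells[c] : "";
            out += cell;
            // Pad every column except the last, plus a two-space gap.
            if (c + 1 < width.size()) {
                out.append(width[c] - cell.size() + 2, ' ');
            }
        }
        out += '\n';
    };
    emit(header);
    for (const auto& row : rows) emit(row);
    return out;
}
```

A streaming format such as TSV has no first pass, which is why it can write each row as it arrives.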
    set_target_properties(${GTEST_TARGET} PROPERTIES SYSTEM OFF)
    target_include_directories(${GTEST_TARGET} BEFORE PRIVATE
        ${googletest_SOURCE_DIR}/googletest/include
        ${googletest_SOURCE_DIR}/googletest
        ${googletest_SOURCE_DIR}/googlemock/include
        ${googletest_SOURCE_DIR}/googlemock)
Addressed in 7bb6b62. Removed the no-op set_target_properties(... SYSTEM OFF) (there is no boolean SYSTEM target property) and added the vendored GTest include dirs BEFORE PRIVATE to TsFile_Test itself, where the test sources are compiled and header resolution actually matters. The Apple -iquote/-I + -std=c++14 options on the gtest targets remain since they govern GTest's own compilation. Verified the AppleClang Debug build still resolves the FetchContent 1.12.1 headers and all tool tests pass.
When handling exceptions in the CLI, we should return readable error messages instead of generic descriptions or error codes.
- row_query/sample: translate storage error codes to readable phrases instead of emitting a bare numeric code
- read_file: drop the always -1 fd_ value from open() diagnostic; keep strerror(errno)
- run_cli: honor --help even with a positional file; correct the stats usage text (min/max/first/last/sum)
- meta: remove the always-empty version/bloom_filter columns (the public reader API exposes neither); update README and SKILL accordingly
- write: stream rows into fixed 1024-row Tablet batches so memory stays bounded regardless of input size
- write: reject numeric overflow (ERANGE for int/float/double, plus an INT32 range check)
- tools CMake: remove the noisy unconditional configure message
- test CMake: drop the no-op SYSTEM target property and force the vendored GTest headers ahead on TsFile_Test where header resolution matters
@ColinLeeo Addressed in 7bb6b62. The CLI now translates storage-engine error codes into readable phrases via a
Resolve cpp/test/CMakeLists.txt conflict: keep develop's GTest acquisition (tar-extract + add_subdirectory instead of FetchContent) and re-apply the AppleClang include-order fix on top, using GTEST_SRC_ROOT. Reformat cmd_meta.cc to satisfy clang-format.
Strict-review follow-up to PR apache#829:
- write: reject non-strictly-increasing timestamps per device (tag tuple) with a located message; refuse --output equal to the input file; remove the partial output on any failure so no corrupt .tsfile is left behind
- write/query/sample failures now print a human-readable cause via error_code_message() instead of a bare numeric code; the helper lives in the output_format layer so read and write share it
- schema: report real encoding/compression for table-model columns instead of always-empty cells
- columns spec: reject duplicate column names
- reject flags that do not apply to a command (write-only flags on read commands, row/range flags on metadata commands, --header-match with --no-header), and give a clear error when an option precedes the command
- rename the read-output helpers to emit_result_set* and the JSON predicate to emits_json_bare so the names match what they do
- docs: document the per-device timestamp ordering rule and drop the unimplemented "help <command>" form
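The per-device ordering rule in this commit can be sketched as a small stateful checker. This is an illustration under assumptions: `TimestampOrderChecker` is a hypothetical class, and the tag tuple is modelled as a vector of strings used as a map key.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Tracks the last accepted timestamp per device, where a device is
// identified by its tuple of tag values.
class TimestampOrderChecker {
public:
    // Returns true when `ts` is strictly greater than the last timestamp
    // accepted for this tag tuple; a rejected timestamp leaves state as-is.
    bool accept(const std::vector<std::string>& tags, std::int64_t ts) {
        auto it = last_.find(tags);
        if (it != last_.end() && ts <= it->second) return false;
        last_[tags] = ts;
        return true;
    }

private:
    std::map<std::vector<std::string>, std::int64_t> last_;
};
```

Because the checker outlives any single batch, an out-of-order row is caught even when the previous row for that device was already flushed.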
The file holds generic statistics helpers (collect_series_stats, collect_file_summary, statistic_value_cells) used by stats/count/meta for both the tree and table models. "table" wrongly implied the table model; "statistics" describes what it actually provides.
Covers per-device timestamp-order rejection (including across batch flushes), --output anti-alias and unlink-on-failure, large streaming round-trip, numeric overflow detection, duplicate-column rejection, flag-applicability errors, the leading-option error, error_code_message mapping, --help with a positional file, and table-model schema encoding/compression.
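As a rough picture of the duplicate-column rejection covered by these tests, the sketch below parses a `name:TYPE:tag|field` spec. The names `ColumnSpecSketch` and `parse_columns` and the error strings are hypothetical, and type names are kept as plain strings rather than validated, since the accepted set is not listed in this thread.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one parsed column of the --columns option.
struct ColumnSpecSketch {
    std::string name;
    std::string type;  // kept verbatim; not validated in this sketch
    bool is_tag;
};

// Parses "name:TYPE:tag|field" items separated by commas. Fails on a
// malformed item, an unknown role, or a repeated column name.
bool parse_columns(const std::string& spec, std::vector<ColumnSpecSketch>& out,
                   std::string& error) {
    const std::size_t npos = std::string::npos;
    out.clear();
    std::set<std::string> seen;
    std::istringstream items(spec);
    std::string item;
    while (std::getline(items, item, ',')) {
        const std::size_t a = item.find(':');
        const std::size_t b = (a == npos) ? npos : item.find(':', a + 1);
        // Require exactly two colons with a non-empty name and type.
        if (a == npos || b == npos || a == 0 || b == a + 1 ||
            item.find(':', b + 1) != npos) {
            error = "bad column spec '" + item + "'";
            return false;
        }
        ColumnSpecSketch col;
        col.name = item.substr(0, a);
        col.type = item.substr(a + 1, b - a - 1);
        const std::string role = item.substr(b + 1);
        if (role == "tag") {
            col.is_tag = true;
        } else if (role == "field") {
            col.is_tag = false;
        } else {
            error = "bad column role '" + role + "'";
            return false;
        }
        if (!seen.insert(col.name).second) {
            error = "duplicate column '" + col.name + "'";
            return false;
        }
        out.push_back(col);
    }
    if (out.empty()) {
        error = "empty column spec";
        return false;
    }
    return true;
}
```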
Summary
Adds `tsfile-cli`, a single pipe-friendly C++ command-line tool (`cpp/tools/`) for working with `.tsfile` files from the shell. Built entirely on the existing `storage::TsFileReader` / `TsFileTableWriter` APIs; no storage-engine changes.

Inspect / export (read-only)
- `ls`: list devices (tree model) or tables (table model)
- `schema`: per-series datatype / encoding / compression
- `meta`: file-level summary (model, counts, global time range, file size)
- `stats`: per-series `count, start, end, min, max, first, last, sum` (from statistics, no page scan)
- `count`: per-series row counts + total
- `head` / `cat`: preview / stream rows, with projection (`-m`), time range (`--start` / `--end`), `--offset` / `-n`
- `sample`: deterministic reservoir sample (`--seed`)
- `csv|tsv|json|table` (TTY-adaptive); data → stdout, diagnostics → stderr; exit codes `0/1/2/3`

Import (write)
- `write`: import CSV/TSV rows into a new table-model `.tsfile`
- `--columns name:TYPE:tag|field` schema (no type inference); first input column is the timestamp
- `-`); `-o` output; `-f csv|tsv`; optional `--no-header` / `--header-match`
- `-v` prints a one-line summary), Unix-style

Other
- `ReadFile::open` errors now go to stderr (were stdout) so read output stays pipe-clean

Test plan
- `cd cpp && bash build.sh -t=Debug` builds `bin/tsfile-cli` + `TsFile_Test` (add `--disable-antlr4` on CMake >= 4)
- `InputFormatTest`, `ParseArgsTest`, `RunCliTest`, `CliE2E`, `RowWriterTest`, `StatTableTest`
- `tsfile-cli meta|ls|schema|stats|count <file.tsfile>`
- `printf 'time,id,v\n0,d,1\n' | tsfile-cli write --table t --columns "id:STRING:tag,v:INT64:field" -o out.tsfile -` then `tsfile-cli count -f tsv out.tsfile`
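For context on the deterministic reservoir sample listed under `sample` in the summary, here is a minimal sketch of a seeded reservoir (Algorithm R). It is an assumption that the PR uses this variant; `ReservoirSampler` is a hypothetical class, and the choice of `std::mt19937_64` is illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Keeps a uniform random sample of up to k items from a stream of unknown
// length, using a fixed seed so repeated runs pick the same items.
template <typename T>
class ReservoirSampler {
public:
    ReservoirSampler(std::size_t k, std::uint64_t seed) : k_(k), rng_(seed) {}

    void offer(const T& item) {
        ++seen_;
        if (sample_.size() < k_) {  // fill phase: keep the first k items
            sample_.push_back(item);
            return;
        }
        // Replace phase: the new item takes a random slot with
        // probability k / seen_.
        std::uniform_int_distribution<std::uint64_t> pick(0, seen_ - 1);
        const std::uint64_t j = pick(rng_);
        if (j < k_) sample_[static_cast<std::size_t>(j)] = item;
    }

    const std::vector<T>& sample() const { return sample_; }

private:
    std::size_t k_;
    std::mt19937_64 rng_;
    std::uint64_t seen_ = 0;
    std::vector<T> sample_;
};
```

Memory stays at k items regardless of stream length. Note that `std::uniform_int_distribution` is not guaranteed to produce the same sequence across standard library implementations, so a given seed is reproducible on one toolchain but not necessarily across toolchains.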