
Add tsfile-cli: inspect and import .tsfile from the command line #829

Open
SpriCoder wants to merge 41 commits into apache:develop from SpriCoder:feat/tsfile-cli

Conversation

Contributor

@SpriCoder SpriCoder commented Jun 3, 2026

Summary

Adds tsfile-cli, a single pipe-friendly C++ command-line tool (cpp/tools/) for working with .tsfile files from the shell. Built entirely on the existing storage::TsFileReader / TsFileTableWriter APIs; no storage-engine changes.

Inspect / export (read-only)

  • ls — list devices (tree model) or tables (table model)
  • schema — per-series datatype / encoding / compression
  • meta — file-level summary: model, counts, global time range, file size
  • stats — per-series count, start, end, min, max, first, last, sum (from statistics, no page scan)
  • count — per-series row counts + total
  • head / cat — preview / stream rows, with projection (-m), time range (--start/--end), --offset / -n
  • sample — deterministic reservoir sample (--seed)
  • Output csv|tsv|json|table (TTY-adaptive); data → stdout, diagnostics → stderr; exit codes 0/1/2/3
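The deterministic reservoir sample mentioned for `sample` can be sketched as follows. This is a minimal illustration of Algorithm R with a seeded engine, not the PR's implementation; `reservoir_sample` and its signature are hypothetical, and it works on an in-memory vector rather than a streamed result set.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): Algorithm R reservoir sampling.
// Keeps the first k rows, then lets row i (0-based) replace a random slot
// with probability k / (i + 1).
std::vector<std::string> reservoir_sample(const std::vector<std::string>& rows,
                                          std::size_t k, std::uint64_t seed) {
    std::vector<std::string> sample;
    if (k == 0) return sample;
    std::mt19937_64 rng(seed);  // same seed -> same sequence of draws
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (i < k) {
            sample.push_back(rows[i]);
            continue;
        }
        // Raw engine output modulo (i + 1). This has a tiny modulo bias, but
        // unlike std::uniform_int_distribution its result does not depend on
        // the standard library in use, so the sample stays reproducible.
        std::uint64_t j = rng() % (i + 1);
        if (j < k) sample[static_cast<std::size_t>(j)] = rows[i];
    }
    return sample;
}
```

Calling it twice with the same rows, `k`, and seed returns the same sample, which is the property `--seed` is meant to provide.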

Import (write)

  • write — import CSV/TSV rows into a new table-model .tsfile
  • Explicit --columns name:TYPE:tag|field schema (no type inference); first input column is the timestamp
  • Reads a file or stdin (-); -o output; -f csv|tsv; optional --no-header / --header-match
  • Silent on success (-v prints a one-line summary), Unix-style
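To make the `--columns name:TYPE:tag|field` form concrete, here is a rough sketch of how such a spec could be split into entries. `ColumnSpec` and `parse_columns` are hypothetical names, the type string is left unvalidated, and the duplicate-name check mirrors a rule added later in this PR; the real parser in cpp/tools/format/input_format.cc may differ.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types and names, for illustration only.
struct ColumnSpec {
    std::string name;
    std::string type;  // e.g. "STRING" or "INT64"; not validated in this sketch
    bool is_tag;       // true for "tag", false for "field"
};

// Parses comma-separated "name:TYPE:tag|field" entries.
bool parse_columns(const std::string& spec, std::vector<ColumnSpec>& out,
                   std::string& error) {
    out.clear();
    std::set<std::string> seen;
    std::istringstream entries(spec);
    std::string entry;
    while (std::getline(entries, entry, ',')) {
        std::istringstream parts(entry);
        std::string name, type, role, extra;
        // Require three parts and reject a fourth non-empty part.
        if (!std::getline(parts, name, ':') || !std::getline(parts, type, ':') ||
            !std::getline(parts, role, ':') || std::getline(parts, extra, ':') ||
            name.empty() || type.empty()) {
            error = "bad column spec '" + entry + "'";
            return false;
        }
        if (role != "tag" && role != "field") {
            error = "column '" + name + "' must be tag or field";
            return false;
        }
        if (!seen.insert(name).second) {
            error = "duplicate column '" + name + "'";
            return false;
        }
        out.push_back(ColumnSpec{name, type, role == "tag"});
    }
    if (out.empty()) {
        error = "empty column spec";
        return false;
    }
    return true;
}
```

With the spec from the test plan, `"id:STRING:tag,v:INT64:field"`, this yields two entries: a tag column `id` and a field column `v`.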

Other

  • ReadFile::open errors now go to stderr (were stdout) so read output stays pipe-clean
  • Unit + in-process E2E tests: arg parsing, formatters, statistics, CSV/TSV parsing, and a write→read round-trip

Test plan

  • cd cpp && bash build.sh -t=Debug builds bin/tsfile-cli + TsFile_Test (add --disable-antlr4 on CMake >= 4)
  • CLI suites pass: InputFormatTest, ParseArgsTest, RunCliTest, CliE2E, RowWriterTest, StatTableTest
  • tsfile-cli meta|ls|schema|stats|count <file.tsfile>
  • round-trip: printf 'time,id,v\n0,d,1\n' | tsfile-cli write --table t --columns "id:STRING:tag,v:INT64:field" -o out.tsfile - (the trailing - reads from stdin), then tsfile-cli count -f tsv out.tsfile

SpriCoder added 30 commits June 1, 2026 16:09
Read-only inspect/export verbs (ls/schema/stats/head/cat/select) as a
single multi-call `tsfile` binary, backed by the existing C++ reader API.
9 TDD tasks (CMake scaffold -> arg parser -> formatters -> ResultSet pump
-> ls/schema/stats/head/cat/select -> install/verify). Spec tweaked to
match confirmed C++ APIs (table-model schema blanks encoding/compression;
stats = count + time range).
@SpriCoder SpriCoder marked this pull request as draft June 3, 2026 10:06
@SpriCoder SpriCoder marked this pull request as ready for review June 3, 2026 17:48
Contributor

Copilot AI left a comment

Pull request overview

Adds a new C++ command-line tool (tsfile-cli) under cpp/tools/ for inspecting/exporting .tsfile contents and importing CSV/TSV into a new table-model .tsfile, with accompanying formatters, command implementations, and a new tool-focused test suite integrated into the existing CMake test target.

Changes:

  • Introduces tsfile-cli executable + CLI dispatch/arg parsing, read commands (ls/schema/meta/stats/count/head/cat/sample), and CSV/TSV import (write).
  • Adds output/input formatting helpers (CSV/TSV/NDJSON/table) and statistics aggregation helpers for metadata-driven commands.
  • Integrates tool build + tests into CMake, and adjusts ReadFile::open() diagnostics to go to stderr.

Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 11 comments.

| File | Description |
|---|---|
| cpp/tools/tools_main.cc | Adds main() entrypoint that forwards argv to run_cli. |
| cpp/tools/CMakeLists.txt | Adds tsfile-cli build target and install rule. |
| cpp/tools/README.md | Documents build/usage/options and examples for tsfile-cli. |
| cpp/tools/skills/tsfile-cli/SKILL.md | Adds a machine-readable skill doc describing CLI usage. |
| cpp/tools/cli/cli_args.h | Defines ParsedArgs struct for CLI arguments. |
| cpp/tools/cli/cli_args.cc | Implements argument parsing for all commands/options. |
| cpp/tools/cli/exit_codes.h | Defines standardized exit codes for CLI. |
| cpp/tools/cli/run_cli.h | Declares run_cli entrypoint for CLI execution. |
| cpp/tools/cli/run_cli.cc | Implements top-level dispatch, validation, and help/version handling. |
| cpp/tools/commands/commands.h | Declares per-command handlers and shared helpers. |
| cpp/tools/commands/cmd_ls.cc | Implements ls for devices/tables. |
| cpp/tools/commands/cmd_schema.cc | Implements schema listing for tree/table models. |
| cpp/tools/commands/cmd_meta.cc | Implements file-level summary output (meta). |
| cpp/tools/commands/cmd_stats.cc | Implements per-series statistics output (stats). |
| cpp/tools/commands/cmd_count.cc | Implements per-series counts + total (count). |
| cpp/tools/commands/cmd_head.cc | Implements head via shared row query helper. |
| cpp/tools/commands/cmd_cat.cc | Implements cat via shared row query helper. |
| cpp/tools/commands/cmd_sample.cc | Implements reservoir sampling (sample). |
| cpp/tools/commands/row_query.cc | Implements shared querying logic used by row-returning commands. |
| cpp/tools/commands/cmd_write.cc | Implements CSV/TSV import into new table-model .tsfile (write). |
| cpp/tools/commands/stat_table.h | Defines structures/helpers for stats + meta summary extraction. |
| cpp/tools/commands/stat_table.cc | Implements metadata/statistics collection used by stats/meta/count. |
| cpp/tools/format/output_format.h | Defines output format enum and RowWriter. |
| cpp/tools/format/output_format.cc | Implements CSV/TSV/NDJSON/table formatting and escaping. |
| cpp/tools/format/input_format.h | Defines column-spec parsing and delimited-line parsing helpers. |
| cpp/tools/format/input_format.cc | Implements --columns parsing, CSV quote splitting, bool parsing. |
| cpp/tools/format/result_set_format.h | Declares helpers for converting result sets to output. |
| cpp/tools/format/result_set_format.cc | Implements result-set streaming and reservoir-sampled output. |
| cpp/test/CMakeLists.txt | Wires tool tests into TsFile_Test when BUILD_TOOLS is on; adjusts GTest include handling. |
| cpp/test/tools/cli_test_util.h | Adds helpers for CLI fixture generation and temp file naming. |
| cpp/test/tools/cli_args_test.cc | Adds unit tests for arg parsing and CLI-level validation. |
| cpp/test/tools/input_format_test.cc | Adds unit tests for column spec parsing and CSV/TSV splitting. |
| cpp/test/tools/output_format_test.cc | Adds unit tests for escaping, formatting, and table alignment. |
| cpp/test/tools/stat_table_test.cc | Adds unit tests for statistic-to-cell conversion logic. |
| cpp/test/tools/command_e2e_test.cc | Adds in-process E2E tests including a write→read round-trip. |
| cpp/src/file/read_file.cc | Routes ReadFile::open() diagnostics to stderr instead of stdout. |
| cpp/CMakeLists.txt | Adds BUILD_TOOLS option and includes cpp/tools subdirectory. |
| .gitignore | Ignores local AI tooling dirs and test-run artifacts. |


Comment thread cpp/src/file/read_file.cc Outdated
Comment on lines 54 to 57
  std::cerr << "open file " << file_path << " error :" << fd_
            << std::endl;
- std::cout << "open error" << errno << " " << strerror(errno)
+ std::cerr << "open error" << errno << " " << strerror(errno)
            << std::endl;
Contributor Author

Fixed in 7bb6b62. ReadFile::open() no longer prints fd_ (always -1 on failure); it now reports strerror(errno) plus the numeric errno.
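As a rough sketch of the diagnostic described above (strerror text plus the numeric errno, without the file descriptor): `describe_open_error` is a hypothetical helper, and the exact wording of the real message in read_file.cc is not shown in this thread.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: builds the message for a failed open().
// `err` is the errno value captured immediately after the failure.
std::string describe_open_error(const std::string& path, int err) {
    return "open file " + path + " error: " + std::strerror(err) +
           " (errno " + std::to_string(err) + ")";
}
```

A caller would write the result to `std::cerr`, for example `std::cerr << describe_open_error(file_path, errno) << std::endl;`, so that stdout carries only data.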

Comment thread cpp/tools/cli/run_cli.cc
Comment on lines +145 to +149
if (p.command == "help" || p.command == "--help" || p.command == "-h" ||
(p.help && p.file.empty())) {
print_usage(out);
return kExitOk;
}
Contributor Author

Fixed in 7bb6b62. Help now triggers whenever -h/--help is present regardless of a positional file, so tsfile-cli <cmd> --help <file> prints usage.

Comment thread cpp/tools/cli/run_cli.cc Outdated
" ls list devices (tree) or tables (table)\n"
" schema per-measurement data type/encoding/compression\n"
" meta file metadata summary\n"
" stats per-series row count and time range\n"
Contributor Author

Fixed in 7bb6b62. The stats usage line now reads per-series count, time range, min/max/first/last/sum.

Comment thread cpp/tools/commands/cmd_meta.cc Outdated
Comment on lines +31 to +35
RowWriter w(out, fmt,
{"file", "model", "version", "device_count", "table_count",
"series_count", "start_time", "end_time", "bloom_filter",
"file_size_bytes"},
{common::STRING, common::STRING, common::STRING, common::INT64,
Contributor Author

Fixed in 7bb6b62 by dropping the version and bloom_filter columns. The public TsFileReader API exposes neither the file version nor the bloom filter, so rather than emit perpetually-null columns I removed them (docs updated to match).

Comment thread cpp/tools/README.md Outdated
|---|---|
| `ls` | List devices (tree model) or tables (table model), one name per line |
| `schema` | Per-series `target, measurement, datatype, encoding, compression` |
| `meta` | File summary: model, version, device/table/series counts, time range, file size |
Contributor Author

Fixed in 7bb6b62: the README meta row no longer lists version (the column was removed from the command).

Comment thread cpp/tools/commands/cmd_write.cc Outdated
Comment on lines +166 to +170
std::vector<DataRow> rows;
while (std::getline(*in, line)) {
++line_no;
strip_cr(line);
if (line.empty()) {
Contributor Author

Fixed in 7bb6b62. write now streams rows into fixed 1024-row Tablet batches and flushes as each batch fills, so memory stays flat regardless of input size — the full file is never buffered. Verified with a 2500-row import (3 batches) round-tripping to count=2500.
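The fixed-size batching described here can be illustrated with a small stand-in. `BatchWriter` is a hypothetical class that buffers plain strings and calls a flush callback; the actual code fills storage Tablets, whose API is not reproduced here. The sketch assumes a capacity greater than zero.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a fixed-capacity Tablet: buffers up to
// `capacity` rows and hands each full batch to `flush`.
class BatchWriter {
public:
    using Flush = std::function<void(const std::vector<std::string>&)>;

    BatchWriter(std::size_t capacity, Flush flush)
        : capacity_(capacity), flush_(std::move(flush)) {
        batch_.reserve(capacity_);
    }

    // Adds one row and flushes as soon as the batch is full.
    void add(std::string row) {
        batch_.push_back(std::move(row));
        if (batch_.size() == capacity_) flush_now();
    }

    // Call once at end of input to write the final partial batch.
    void finish() {
        if (!batch_.empty()) flush_now();
    }

private:
    void flush_now() {
        flush_(batch_);
        batch_.clear();  // keeps the allocation, so memory use stays flat
    }

    std::size_t capacity_;
    Flush flush_;
    std::vector<std::string> batch_;
};
```

With a capacity of 1024, feeding 2500 rows and then calling `finish()` produces three flushes (1024 + 1024 + 452), matching the 2500-row check mentioned above.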

Comment on lines +71 to +75
case common::INT32: {
long v = std::strtol(cell.c_str(), &e, 10);
if (e == nullptr || *e != '\0') {
error = "bad INT32 '" + cell + "'";
return false;
Contributor Author

Fixed in 7bb6b62. Numeric parsing now clears errno and rejects ERANGE for INT32/INT64/FLOAT/DOUBLE; INT32 additionally range-checks against INT32_MIN/MAX (parsed via strtoll). The timestamp parse rejects ERANGE too. e.g. importing 3000000000 into an INT32 column now errors (INT32 out of range) instead of silently truncating.
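A minimal sketch of the INT32 path as described: clear errno, parse with strtoll, reject trailing characters and ERANGE, then range-check. The function name, signature, and message text are assumptions modeled on the snippet above, and this version also rejects an empty cell, which the original `e == nullptr` check would not catch.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical helper: strict INT32 parsing with overflow detection.
bool parse_int32(const std::string& cell, std::int32_t& out,
                 std::string& error) {
    errno = 0;  // strtoll only sets errno on failure, so clear it first
    char* end = nullptr;
    long long v = std::strtoll(cell.c_str(), &end, 10);
    // No digits consumed, or trailing characters after the number.
    if (end == cell.c_str() || *end != '\0') {
        error = "bad INT32 '" + cell + "'";
        return false;
    }
    // ERANGE covers values outside long long; the explicit comparison
    // covers values that fit in long long but not in 32 bits.
    if (errno == ERANGE ||
        v < std::numeric_limits<std::int32_t>::min() ||
        v > std::numeric_limits<std::int32_t>::max()) {
        error = "INT32 out of range '" + cell + "'";
        return false;
    }
    out = static_cast<std::int32_t>(v);
    return true;
}
```

With this check, `parse_int32("3000000000", ...)` fails with an out-of-range error instead of storing a truncated value.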

Comment thread cpp/tools/CMakeLists.txt Outdated
under the License.
]]

message("Running in tools directory")
Contributor Author

Fixed in 7bb6b62: removed the unconditional message("Running in tools directory").

Comment on lines +235 to +241
void RowWriter::write(const std::vector<std::string>& cells,
const std::vector<bool>& is_null) {
if (fmt_ == OutputFormat::kTable) {
rows_.push_back(cells);
rows_null_.push_back(is_null);
return;
}
Contributor Author

Left as-is by design. Table rendering must read all rows to compute column widths (same as column -t / csvlook / psql aligned mode), so it can't both stream and align. The unbounded/large path is already covered: when stdout is not a TTY, resolve_format returns tsv, which streams row-by-row with zero buffering — so piped exports never buffer. table is only the default on an interactive TTY, where output is bounded by the terminal anyway; a user dumping millions of rows would pass -f tsv|csv. Rearchitecting table output to stream isn't possible without dropping column alignment, so I'd prefer to keep the current behavior.
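The TTY-adaptive default can be sketched as below. `resolve_format` and `OutputFormat::kTable` are named in this thread, but the signature, the other enumerators, and the explicit `stdout_is_tty` parameter are assumptions made so the logic is testable; on POSIX the flag would typically come from `isatty(fileno(stdout))`.

```cpp
#include <string>

// Enumerators other than kTable are assumed names.
enum class OutputFormat { kCsv, kTsv, kJson, kTable };

// `requested` is the -f value, or empty when the flag was not given.
// Returns false for an unknown format name.
bool resolve_format(const std::string& requested, bool stdout_is_tty,
                    OutputFormat& out) {
    if (requested.empty()) {
        // Aligned table for interactive use; streaming TSV when piped.
        out = stdout_is_tty ? OutputFormat::kTable : OutputFormat::kTsv;
        return true;
    }
    if (requested == "csv") { out = OutputFormat::kCsv; return true; }
    if (requested == "tsv") { out = OutputFormat::kTsv; return true; }
    if (requested == "json") { out = OutputFormat::kJson; return true; }
    if (requested == "table") { out = OutputFormat::kTable; return true; }
    return false;
}
```

Under this scheme only an explicit `-f table` can buffer rows when stdout is a pipe; the default piped path always streams.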

Comment thread cpp/test/CMakeLists.txt Outdated
Comment on lines +70 to +75
set_target_properties(${GTEST_TARGET} PROPERTIES SYSTEM OFF)
target_include_directories(${GTEST_TARGET} BEFORE PRIVATE
${googletest_SOURCE_DIR}/googletest/include
${googletest_SOURCE_DIR}/googletest
${googletest_SOURCE_DIR}/googlemock/include
${googletest_SOURCE_DIR}/googlemock)
Contributor Author

Addressed in 7bb6b62. Removed the no-op set_target_properties(... SYSTEM OFF) (there is no boolean SYSTEM target property) and added the vendored GTest include dirs BEFORE PRIVATE to TsFile_Test itself, where the test sources are compiled and header resolution actually matters. The Apple -iquote/-I + -std=c++14 options on the gtest targets remain since they govern GTest's own compilation. Verified the AppleClang Debug build still resolves the FetchContent 1.12.1 headers and all tool tests pass.

@ColinLeeo
Contributor

When handling exceptions in the CLI, we should return readable error messages instead of generic descriptions or error codes.

- row_query/sample: translate storage error codes to readable phrases
  instead of emitting a bare numeric code
- read_file: drop the always -1 fd_ value from open() diagnostic; keep
  strerror(errno)
- run_cli: honor --help even with a positional file; correct the stats
  usage text (min/max/first/last/sum)
- meta: remove the always-empty version/bloom_filter columns (the public
  reader API exposes neither); update README and SKILL accordingly
- write: stream rows into fixed 1024-row Tablet batches so memory stays
  bounded regardless of input size
- write: reject numeric overflow (ERANGE for int/float/double, plus an
  INT32 range check)
- tools CMake: remove the noisy unconditional configure message
- test CMake: drop the no-op SYSTEM target property and force the vendored
  GTest headers ahead on TsFile_Test where header resolution matters
@SpriCoder
Contributor Author

@ColinLeeo Addressed in 7bb6b62. The CLI now translates storage-engine error codes into readable phrases via a query_error_text() helper — e.g. Error: query failed: table does not exist instead of query failed (code 49) — used by both the row-query and sample paths, and the result-read failure path now prints a readable message too. The other diagnostics (file open/corrupt, bad field count, type/overflow, header mismatch) were already human-readable. If a specific path is still surfacing a raw code, point me at it and I'll map it.
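For illustration, a code-to-phrase mapping of this kind might look like the sketch below. Only the code 49 entry is taken from the example in this comment; the signature, the `format_query_error` wrapper, and the fallback wording are assumptions, and the real helper presumably covers many more codes.

```cpp
#include <string>

// Hypothetical signature; maps a storage error code to a readable phrase.
std::string query_error_text(int code) {
    switch (code) {
        case 49:
            return "table does not exist";  // the mapping quoted above
        default:
            // Assumed fallback so an unmapped code is still reported.
            return "storage error code " + std::to_string(code);
    }
}

// Hypothetical wrapper that builds the full diagnostic line.
std::string format_query_error(int code) {
    return "Error: query failed: " + query_error_text(code);
}
```

The design point is that every failure path goes through one table, so a newly surfaced raw code is fixed by adding one case.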

SpriCoder added 4 commits June 5, 2026 20:12
Resolve cpp/test/CMakeLists.txt conflict: keep develop's GTest acquisition
(tar-extract + add_subdirectory instead of FetchContent) and re-apply the
AppleClang include-order fix on top, using GTEST_SRC_ROOT. Reformat
cmd_meta.cc to satisfy clang-format.

Strict-review follow-up to PR apache#829:
- write: reject non-strictly-increasing timestamps per device (tag tuple)
  with a located message; refuse --output equal to the input file; remove
  the partial output on any failure so no corrupt .tsfile is left behind
- write/query/sample failures now print a human-readable cause via
  error_code_message() instead of a bare numeric code; the helper lives in
  the output_format layer so read and write share it
- schema: report real encoding/compression for table-model columns instead
  of always-empty cells
- columns spec: reject duplicate column names
- reject flags that do not apply to a command (write-only flags on read
  commands, row/range flags on metadata commands, --header-match with
  --no-header), and give a clear error when an option precedes the command
- rename the read-output helpers to emit_result_set* and the JSON predicate
  to emits_json_bare so the names match what they do
- docs: document the per-device timestamp ordering rule and drop the
  unimplemented "help <command>" form

The file holds generic statistics helpers (collect_series_stats,
collect_file_summary, statistic_value_cells) used by stats/count/meta for
both the tree and table models. "table" wrongly implied the table model;
"statistics" describes what it actually provides.

Covers per-device timestamp-order rejection (including across batch
flushes), --output anti-alias and unlink-on-failure, large streaming
round-trip, numeric overflow detection, duplicate-column rejection,
flag-applicability errors, the leading-option error, error_code_message
mapping, --help with a positional file, and table-model schema
encoding/compression.
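The per-device timestamp-order rejection covered by these commits could be sketched as follows. `TimestampOrderChecker` is a hypothetical class that keys the last accepted timestamp by tag tuple; because the state lives outside any batch, the check also holds across batch flushes. The message format is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: enforces strictly increasing timestamps per device,
// where a device is identified by its tuple of tag values.
class TimestampOrderChecker {
public:
    // Returns true and records `time` if it is strictly greater than the
    // last timestamp accepted for `tags`; otherwise fills `error` with the
    // offending input line number and returns false.
    bool accept(const std::vector<std::string>& tags, std::int64_t time,
                std::size_t line_no, std::string& error) {
        auto it = last_.find(tags);
        if (it != last_.end() && time <= it->second) {
            error = "line " + std::to_string(line_no) + ": timestamp " +
                    std::to_string(time) + " is not greater than previous " +
                    std::to_string(it->second) + " for this device";
            return false;
        }
        last_[tags] = time;
        return true;
    }

private:
    std::map<std::vector<std::string>, std::int64_t> last_;
};
```

Rows for different devices may interleave freely; only a repeated or decreasing timestamp within the same tag tuple is rejected.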

3 participants