add time to first token in OnnxDiscrepancyCheck by xadupre · Pull Request #2535 · microsoft/Olive

xadupre · 2026-06-22T12:26:13Z

Describe your changes

Add two latency metrics to OnnxDiscrepancyCheck: time to first token and time to first 5 tokens.
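A minimal sketch of how the two metrics could be measured over a streaming token source. This is not the code from the PR: `measure_token_latency` and `fake_stream` are hypothetical names, and the fake stream stands in for a real Transformers or ORT GenAI generation loop.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_token_latency(tokens: Iterable[int], first_n: int = 5) -> dict[str, Optional[float]]:
    """Consume a token stream; record elapsed seconds at token 1 and at token first_n."""
    start = time.perf_counter()
    result: dict[str, Optional[float]] = {
        "time_to_first_token": None,
        "time_to_first_n_tokens": None,
    }
    for count, _ in enumerate(tokens, start=1):
        elapsed = time.perf_counter() - start
        if count == 1:
            result["time_to_first_token"] = elapsed
        if count == first_n:
            result["time_to_first_n_tokens"] = elapsed
            break  # nothing more to measure after the n-th token
    return result


def fake_stream(n_tokens: int, delay: float = 0.001) -> Iterator[int]:
    """Hypothetical stand-in for a model's token-by-token generation loop."""
    for token_id in range(n_tokens):
        time.sleep(delay)
        yield token_id


metrics = measure_token_latency(fake_stream(8), first_n=5)
# A stream that yields no tokens leaves both metrics as None.
short = measure_token_latency(fake_stream(0), first_n=5)
```

Leaving a metric as `None` when the stream ends early is one possible convention for the case where fewer than `first_n` tokens are generated.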

Copilot

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Adds time-to-first-token and time-to-first-N-tokens latency metrics to OnnxDiscrepancyCheck’s generation comparison output, and updates tests accordingly.

Changes:

- Introduces a new time_to_first_n_tokens config option and surfaces latency metrics for both Transformers and ORT GenAI generation.
- Changes compare_generation to return a results dictionary (instead of an int) and updates result aggregation/logging.
- Updates ONNX discrepancy check tests to validate the new return shape and metric keys.
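To illustrate the move from an int to a results dictionary, here is a rough sketch of a comparison helper that returns structured output. The function name and every dictionary key below are assumptions made for illustration; the actual keys returned by `compare_generation` are not shown in this conversation.

```python
from typing import Optional, Sequence


def compare_token_sequences(
    reference: Sequence[int],
    candidate: Sequence[int],
    reference_ttft: Optional[float] = None,
    candidate_ttft: Optional[float] = None,
) -> dict:
    """Return a structured comparison instead of a bare count (hypothetical keys)."""
    # Index of the first differing token, or the shorter length if no token differs.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(reference, candidate)) if a != b),
        min(len(reference), len(candidate)),
    )
    return {
        "matching_prefix_length": first_diff,
        "fully_matching": list(reference) == list(candidate),
        "reference_time_to_first_token": reference_ttft,
        "candidate_time_to_first_token": candidate_ttft,
    }


result = compare_token_sequences([1, 2, 3, 4], [1, 2, 9, 4], 0.12, 0.05)
same = compare_token_sequences([1, 2], [1, 2])
```

A dictionary lets callers and tests check for specific metric keys, which fits the test changes described above.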

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
test/passes/onnx/test_discrepancy_check.py	Adjusts assertions for the new dict return type and validates presence of new latency metric fields.
olive/passes/onnx/discrepancy_check.py	Adds latency measurement/reporting for TTFT/TTFN and changes generation comparison to return structured results.

… (with dedicated GGUF conversion pass) (#2548)

## Describe your changes

Merges #2536, #2535, #2534. Additionally adds llama.cpp integration and other improvements to `OnnxDiscrepancyCheck` and test-mode workflow handling:

- **New `llama_cpp` flag** (`bool`, default `False`) on `OnnxDiscrepancyCheck`: when enabled, compares inference with llama.cpp.
- **New `llama_cpp_env_path` parameter** (`Optional[str]`): path to the `llama_env` virtual environment where `llama-cpp-python` and `convert_hf_to_gguf.py` are installed (defaults to `"llama_env"` relative to cwd).
- **New `--test_llama_path` CLI option**: specifies the path to the `llama_env` virtual environment when running with `--test`. Using `--test_llama_path` without `--test` emits a warning.
- **New `ConvertHfToGGUF` pass** (`olive/passes/pytorch/convert_hf_to_gguf.py`): injected when `--test_llama_path` is provided. This pass converts the test HF model to GGUF ahead of discrepancy checking and stores the GGUF path in model attributes for downstream reuse.
- **`compare_llama_cpp()` updates**: now reuses a preconverted GGUF when available; otherwise it falls back to in-method HF→GGUF conversion. llama.cpp comparison failures are captured in discrepancy results (status/failures) instead of aborting the whole run, so ONNX generation can still complete.
- **Improved `--test_metrics` parsing**: now accepts both space-separated (`--test_metrics mae speedup`) and comma-separated (`--test_metrics mae,speedup`) forms.
- **Fixed `add_discrepancy_check_pass` update-in-place**: existing discrepancy-pass config generated by dry-run is updated in-place so current `--test_metrics`, `--output_path`, and llama settings are applied.
- **Fixed test model persistence across engine cache hits**: `ModelBuilder` stores a reference HF copy (`reference_hf_model/`) alongside cached ONNX outputs; discrepancy check falls back to this copy if the original test model path is missing.
- **New `SaveTestModelConfig` pass** (`olive/passes/pytorch/save_test_model_config.py`): injected at the start of passes for `--test`; ensures test model config/marker (and random test model persistence path usage) is set up before downstream passes.
- **CI workflow** (`test-model-fast.yml`): includes setup of a llama environment and llama.cpp conversion script dependencies.
- **Updated documentation** (`cli-fast-test.md`): clarifies where layer reduction happens, when test-model directories are created, cache fallback behavior, and llama.cpp test flow including the dedicated GGUF conversion pass.

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

## (Optional) Issue link

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
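The commit message says `--test_metrics` accepts both space-separated and comma-separated values. A small sketch of one way to get that behavior with `argparse` follows. It is an assumption about the approach, not Olive's actual parser, and `parse_metrics` is a hypothetical helper.

```python
import argparse


def parse_metrics(values: list[str]) -> list[str]:
    """Flatten ["mae,speedup"] or ["mae", "speedup"] into ["mae", "speedup"]."""
    metrics: list[str] = []
    for value in values:
        # Split each argument on commas and drop empty fragments such as "mae,".
        metrics.extend(part.strip() for part in value.split(",") if part.strip())
    return metrics


parser = argparse.ArgumentParser()
# nargs="+" collects one or more space-separated values into a list.
parser.add_argument("--test_metrics", nargs="+", default=[])

spaced = parse_metrics(parser.parse_args(["--test_metrics", "mae", "speedup"]).test_metrics)
comma = parse_metrics(parser.parse_args(["--test_metrics", "mae,speedup"]).test_metrics)
```

Both invocations yield the same list, so downstream code does not need to know which form the user typed.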

add time to first token in OnnxDiscrepancyCheck

bf0a978

xadupre requested a review from Copilot June 22, 2026 12:27

Copilot started reviewing on behalf of xadupre June 22, 2026 12:34

Copilot AI reviewed Jun 22, 2026

Comment thread olive/passes/onnx/discrepancy_check.py Outdated

Comment thread olive/passes/onnx/discrepancy_check.py

Comment thread olive/passes/onnx/discrepancy_check.py Outdated

Comment thread test/passes/onnx/test_discrepancy_check.py

Copilot started work on behalf of xadupre June 22, 2026 12:49

Copilot finished work on behalf of xadupre June 22, 2026 12:55

Copilot started work on behalf of xadupre June 22, 2026 12:56

Add latency key assertions to fully matching discrepancy test

1bdee25

Copilot finished work on behalf of xadupre June 22, 2026 13:03

Copilot finished work on behalf of xadupre June 22, 2026 13:06

Copilot started work on behalf of xadupre June 22, 2026 13:38

Handle zero max_new_tokens in generation metrics

142ddea

Copilot finished work on behalf of xadupre June 22, 2026 13:45

Use single measured transformers generation for latency metrics

39cac1c

Copilot finished work on behalf of xadupre June 22, 2026 13:48

xadupre marked this pull request as ready for review June 22, 2026 14:36

xadupre added 2 commits June 25, 2026 09:45

Merge branch 'main' into xadupre/tts

141f35a

Merge branch 'main' into xadupre/tts

9c7365d

xadupre mentioned this pull request Jun 29, 2026

Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration #2546

Open

5 tasks

Copilot AI mentioned this pull request Jul 1, 2026

Merge 3 existing PRs for OnnxDiscrepancyCheck + llama.cpp integration (with dedicated GGUF conversion pass) #2548

Merged

5 tasks

xadupre wants to merge 6 commits into main from xadupre/tts

3 participants