Experiment/code review al skill by WaelAbuSeada · Pull Request #645 · microsoft/BC-Bench

WaelAbuSeada · 2026-05-20T19:37:09Z

No description provided.

…C-Bench into category/code-review

…display and refactoring comment parsing logic

…egory/code-review

This reverts commit 394d005.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reverts the diagnostics feature that captured tool_path in hooks and parsed it to verify whether the AL code-review skill and its domain instruction files were read during execution. Removes: - skill_read_diagnostics field on AgentMetrics - parse_skill_read_diagnostics_from_hooks/_from_session_log - tool_path capture in log-tool-usage.ps1 - Diagnostic copilot-cli.log dump and stdout capture in copilot agent - Associated imports, exports, and tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Renames src/bcbench/evaluate/comment_judge.py to codereview_judge.py since it is only used by the code-review pipeline. Also collapses subprocess stdout/stderr=PIPE into capture_output=True per ruff UP022. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ndings

…ings (001,003,005,009,011,012,016)

Replace the flat bullet list under '## Result Summary' with four grouped tables (Comment counts, Micro, Macro, Quality), proper percent units, the Greek beta symbol, and a collapsible 'How to read these metrics' block that defines Micro/Macro, matched comments (including the LLM-judge confirmation step), F1, the Fbeta formula, severity MAE, and the valid review output rate. The same treatment is applied to the local Rich console summary via four Rich tables and a Panel with compact explanations. Scope is strictly code-review: - New render_github_metrics_markdown / render_console_metrics methods have default implementations on EvaluationResultSummary that produce byte-identical output to the previous bullet rendering. - Only CodeReviewResultSummary overrides them; bug-fix, test-generation and judge-based summaries are unchanged. - display_summary() output is untouched, so the JSON artifact, leaderboard JSON, and bceval upload all see the same shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ings All 15 positive entries (synthetic__performance-018..032) now produce zero out-of-domain findings while preserving in-domain performance anchors.

Add a polymorphic `sort_key` property on `BaseEvaluationResult` so the Detailed Results table renders in a stable, human-friendly order: - Default sort key uses natural-sorted instance_id, so `feature-2` precedes `feature-10` and zero-padded ids like `-001/-002/-010/-011` sort numerically. - `CodeReviewResult` overrides `sort_key` to group by domain first, then by natural-sorted instance_id, so the table reads as domain blocks. Applies to both the GitHub job summary and the Rich console summary. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

All 9 false-positive performance entries (-001, -003, -005, -007, -009, -011, -012, -014, -016) previously had expected_comments=[]. They are deliberate negative tests: the patches contain code that LOOKS like an AL perf anti-pattern but is correct (e.g., CalcFields in OnValidate, Get on singleton setup, RecordRef on bounded tables). This change PRESERVES the false-positive bait code unchanged and APPENDS a new AL file per entry containing one or more genuine, isolated perf anti-patterns drawn from performance.md. Each new finding is added to the entry's expected_comments. Result: - Every perf entry now has >= 1 expected_comment (24/24) - Comment count distribution: 6 with 1, 12 with 2, 6 with 3 (was 9 with 0, 12 with 1, 3 with 2+) - performance.md rule coverage: 64.3% -> 89.3% (25 / 28 rules) - New rule categories now exercised: FindFirst-with-full-PK -> Get, SetCurrentKey/key selection, RecordRef in hot loops, DB op in event subscribers w/o cheap guard, FindSet(true) on read-only iteration, removed SourceTableTemporary on API page Entries now test both precision (don't flag the FP bait) AND recall (catch the new real issues) instead of precision only. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After the perf enrichment commit the dataset had odd gaps in the numeric suffix of instance_id: - performance: was missing 002, 004, 006, 008, 010, 013, 015, 017 (24 entries spanning 001..032) - privacy: was missing 007 (15 entries spanning 001..016) Renumber within each domain so the suffix is contiguous 001..N, preserving relative order. Only the instance_id field is touched; descriptions, patches, expected_comments, etc. are unchanged. No tracked source/tests/docs reference these IDs so this is a safe rename. All 81 entries still pass the CodeReviewEntry Pydantic schema and all 482 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two issues caused every job in run #27378333674 to emit: ##[warning] Result for synthetic__<id> missing metrics: execution_time, llm_duration, prompt_tokens, completion_tokens, tool_usage 1. Copilot CLI v1.0.61 changed its summary output: - 'Requests 0.33 Premium (1m 45s)' became 'AI Credits 58.4 (1m 14s)' - 'Tokens up 317.5k . down 4.3k . 255.0k (cached)' became 'Tokens up 413.9k (368.1k cached) . down 4.5k (500 reasoning)' The bullet-anchored token regex would no longer match because of the inline (cached)/(reasoning) annotations between the two values, and there was no parser for 'AI Credits'. Added an AI-Credits regex and split the tokens parse into separate up/down lookups on the Tokens line. 2. The Copilot hook config in setup_hooks() only sets the 'powershell' field, but per docs.github.com/en/copilot/reference/hooks-reference Linux Copilot CLI runs honor only the 'bash' or 'command' field. As a result tool_usage was always empty on GH Actions Linux runners. Added a 'bash' field that invokes pwsh (preinstalled on ubuntu-24.04 runners) to run the same .ps1 hook script; kept 'powershell' for Windows. After the fix only llm_duration remains None for new runs - the CLI no longer reports API time separately from total session time, which is a real upstream change worth keeping visible in the warning. Added 3 new metrics tests covering the v1.0.61 format and asserted the bash field on the Copilot hook config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Last fix added a bash field calling pwsh on Linux, but tool_usage was still null on the next run. pwsh-on-Linux + the .ps1 stdin/env-var path was failing silently somewhere (Copilot's process-*.log isn't uploaded, so we couldn't see why). Switch the Linux hook to Python instead - python3 is preinstalled on every GH runner, has trivial JSON/stdin handling, and removes pwsh as a dependency on Linux. Changes: - New cross-platform hook script: src/bcbench/agent/shared/hooks/log_tool_usage.py mirrors the existing .ps1 behavior (extracts tool_name/toolName, expands lsp:<operation>, appends one JSON line per call to BCBENCH_TOOL_LOG, swallows all errors with exit 0). - hooks_operations._setup_copilot_hooks now writes both fields: bash invokes python3 with the new .py script (used on Linux/macOS), powershell keeps invoking powershell.exe with the .ps1 (used on Windows). The .ps1 path is unchanged so Windows self-hosted runners that already work (bug-fix, test-generation) are untouched. - PathConfig gains hook_script_path_py alongside hook_script_path. - copilot-evaluation.yml artifact upload now includes *.log files so the next investigation can see Copilot's debug session log directly. - 8 new tests cover the Python hook (happy path, camelCase, lsp expansion, lsp args as JSON string, missing tool_name, missing env, malformed stdin) and the hooks_operations test asserts python3 and log_tool_usage.py are in the bash field. All 493 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Probe harness materializes patch into throwaway repo, runs al-code-review skill, and validates expected_comments recall + out-of-domain (OOD) finding discipline. - probe_codereview_case.py: single-entry probe (patch -> tmp dir -> copilot skill -> review.json -> match check) - probe_codereview_batch.py: parallel runner with ThreadPoolExecutor + per-entry reports - apply_enrichment.py: ingests tmp/enrichment-design-<domain>.json, appends new-file diffs + expected_comments - unindent_bait_files.py: reformats pre-existing FP-bait files from 4-space to 2-space (eliminates style-domain OOD on non-style entries) - fix_enrichment_iteration_1.py / _2.py: targeted post-probe corrections for individual entries Verified: 28/28 zero-expected entries now have in-domain expected findings with 0 OOD. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds true-positive expected_comments to entries where expected_comments was empty, covering all four non-performance domains (mirrors prior perf enrichment in 23ce854). Per-entry approach: - Append a short new .al file to the patch that introduces 1-2 isolated in-domain anti-patterns - Reformat pre-existing FP-bait files from 4-space to 2-space indent (eliminates style-domain OOD) - Add matching expected_comments anchored to the violating lines Counts: - security: 7 entries (object IDs 50301-50307) - privacy: 8 entries (object IDs 50321-50328) - style: 9 entries (object IDs 50341-50349) - upgrade: 4 entries (object IDs 50361-50364) Verified locally via tools/probe_codereview_batch.py against the al-code-review skill (claude-opus-4.8): 28/28 produce matching in-domain findings with 0 OOD. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Auto-fix + manual cleanup so CI pre-commit (ruff check + ruff format + ty) passes: - log_tool_usage.py: SIM105 (try/except/pass -> contextlib.suppress); move import to top - dump_entries.py: ANN201, PTH207 (glob -> Path.rglob), SIM115 (use context manager), E741 (l -> line) - ood_worklist.py: same modernizations + remove unused os/glob imports - run_entry.py: ruff format reflow (no behavior change) - test_codereview.py: ruff format reflow (single-line; no behavior change) Verified locally: ruff check passes, ruff format clean, 8/8 hook tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t branch Snapshots all non-experiment files from experiment/code-review-al-skill onto category/code-review. Experiment-specific assets (al-code-review skill and custom instructions under microsoft-BCApps/instructions/*.md) remain only on the experiment branch. Highlights: - Dataset: enriched 28 zero-expected entries (security/privacy/style/upgrade) with in-domain expected_comments; cleaned up OOD bait across pre-existing entries; renumbered performance and privacy entries to be contiguous. - Eval: domain-aware code-review evaluation, codereview_judge for LLM-confirmed matches, improved review parsing, grouped per-domain summary layout. - Results: domain-split metrics, leaderboard refresh, severity_mae, macro_f1. - Tooling: probe_codereview_case/batch harness for local skill testing, apply_enrichment + unindent_bait_files + fix_enrichment_iteration_{1,2} scripts used to produce the dataset enrichment, dump_entries, ood_worklist, run_entry helpers. - Hooks: Python log_tool_usage hook (Linux-compatible) with process log capture and unit tests. - Workflow: copilot-evaluation.yml updates for category routing and metrics. - Lint: ruff/ty cleanups across tools/, tests/, and shared hooks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

The /al-code-review skill prompt, custom instructions, and skills are experiment-specific. Revert config.yaml to the defaults (/review prompt, instructions/skills disabled). Experiment branch keeps its own version. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

# Conflicts: # src/bcbench/agent/shared/config.yaml

haoranpb and others added 30 commits April 8, 2026 13:57

few more udpates for new categories

db388bd

Refactor evaluation and dataset operations for improved workspace setup

57c004e

enable skipping container setup in action

8e2f216

fix missing implementation for MockEvaluationPipeline

69a8db8

Refactor evaluation result classes to be more generic

7549d92

Merge branch 'main' into fix/more-ready-for-categories

f32dd00

Improve readabilty of GitHub Action summary

a4089b9

fix failing tests

99af6b2

Code Review POC

e1b0b93

Merge branch 'main' into category/code-review

1a68d78

fix merge conflict resolution mistake

3ec10a0

Merge branch 'main' into category/code-review

4e52832

Make container parameters optional in evaluate and run commands

a9f59d9

Merge branch 'category/code-review' of https://github.com/microsoft/B…

065e1aa

…C-Bench into category/code-review

Enhance code review functionality by adding expected review comments …

4ad4bd9

…display and refactoring comment parsing logic

better hanlding container for not required categories

92951c4

Merge branch 'main' into category/code-review

7902610

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

dad9289

…egory/code-review

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

f1c4894

…egory/code-review

prefer copilot.exe executable

aa48a29

Normalize code-review dataset and preserve eval outputs

a244503

Fix code-review branch setup and workflow wiring

9f6c353

Require review.json and add log-based recovery fallback

1a58e44

Harden code-review prompt for Windows copilot.cmd parsing

d0e8076

Experiment: use al-code-review skill template

0b764ef

Add skip-container-setup option to evaluation workflows

83a4b28

Fix codereview lint issues in pipeline helpers

394d005

Revert "Fix codereview lint issues in pipeline helpers"

ae0f1d2

This reverts commit 394d005.

Expand code-review detailed table metrics

143075b

Expand code-review detailed table metrics

7411246

WaelAbuSeada added 3 commits June 10, 2026 11:26

Reduce OOD in style positive entries 012/014/016/017

f5bece9

Eliminate OOD in style positive entry 011

10a6d4b

Eliminate OOD in style positive entry 013

603bfe6

github-code-quality Bot found potential problems Jun 10, 2026

View reviewed changes

Comment thread tools/dump_entries.py Fixed

Comment thread tools/dump_entries.py Fixed

Comment thread tools/ood_worklist.py Fixed

Comment thread src/bcbench/evaluate/comment_judge.py Fixed

Comment thread tools/dump_entries.py Fixed

WaelAbuSeada and others added 15 commits June 10, 2026 12:26

Clean up diagnostic logging from comment judge

d027ae8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix code-review upgrade entries 001-009 to eliminate out-of-domain fi…

8560fcc

…ndings

Fix code-review performance negatives to eliminate out-of-domain find…

02c0836

…ings (001,003,005,009,011,012,016)

Fix code-review performance positives to eliminate out-of-domain find…

7154f36

…ings All 15 positive entries (synthetic__performance-018..032) now produce zero out-of-domain findings while preserving in-domain performance anchors.

Remove temporary batch_verify.py eval driver script

41ec36a

WIP: Pass-1 OOD cleanup for non-perf code-review entries (16)

0f510f1

github-code-quality Bot found potential problems Jun 12, 2026

View reviewed changes

Comment thread tools/unindent_bait_files.py Fixed

Comment thread tools/probe_codereview_case.py Fixed

Comment thread tools/probe_codereview_batch.py Fixed

WaelAbuSeada force-pushed the experiment/code-review-al-skill branch from 281c231 to 68e81c4 Compare June 12, 2026 04:18

WaelAbuSeada and others added 8 commits June 11, 2026 22:25

Potential fix for pull request finding 'Variable defined multiple times'

f5825f2

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Merge branch 'category/code-review' into experiment/code-review-al-skill

740a31a

Merge branch 'category/code-review' into experiment/code-review-al-skill

3efaba1

Include domain and suggestedCode in code-review finding schema

e77c93b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge branch 'category/code-review' into experiment/code-review-al-skill

8f4cc63

# Conflicts: # src/bcbench/agent/shared/config.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiment/code review al skill#645

Experiment/code review al skill#645
WaelAbuSeada wants to merge 123 commits into
mainfrom
experiment/code-review-al-skill

WaelAbuSeada commented May 20, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

WaelAbuSeada commented May 20, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants