Skip to content

Experiment/code review al skill#645

Open
WaelAbuSeada wants to merge 123 commits into
mainfrom
experiment/code-review-al-skill
Open

Experiment/code review al skill#645
WaelAbuSeada wants to merge 123 commits into
mainfrom
experiment/code-review-al-skill

Conversation

@WaelAbuSeada

Copy link
Copy Markdown
Member

No description provided.

haoranpb and others added 30 commits April 8, 2026 13:57
…display and refactoring comment parsing logic
Comment thread tools/dump_entries.py Fixed
Comment thread tools/dump_entries.py Fixed
Comment thread tools/ood_worklist.py Fixed
Comment thread src/bcbench/evaluate/comment_judge.py Fixed
Comment thread tools/dump_entries.py Fixed
WaelAbuSeada and others added 15 commits June 10, 2026 12:26
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reverts the diagnostics feature that captured tool_path in hooks and parsed it to verify whether the AL code-review skill and its domain instruction files were read during execution. Removes:

- skill_read_diagnostics field on AgentMetrics

- parse_skill_read_diagnostics_from_hooks/_from_session_log

- tool_path capture in log-tool-usage.ps1

- Diagnostic copilot-cli.log dump and stdout capture in copilot agent

- Associated imports, exports, and tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Renames src/bcbench/evaluate/comment_judge.py to codereview_judge.py since it is only used by the code-review pipeline. Also collapses subprocess stdout/stderr=PIPE into capture_output=True per ruff UP022.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the flat bullet list under '## Result Summary' with four grouped

tables (Comment counts, Micro, Macro, Quality), proper percent units,

the Greek beta symbol, and a collapsible 'How to read these metrics'

block that defines Micro/Macro, matched comments (including the LLM-judge

confirmation step), F1, the Fbeta formula, severity MAE, and the valid

review output rate.

The same treatment is applied to the local Rich console summary via four

Rich tables and a Panel with compact explanations.

Scope is strictly code-review:

- New render_github_metrics_markdown / render_console_metrics methods

  have default implementations on EvaluationResultSummary that produce

  byte-identical output to the previous bullet rendering.

- Only CodeReviewResultSummary overrides them; bug-fix, test-generation

  and judge-based summaries are unchanged.

- display_summary() output is untouched, so the JSON artifact,

  leaderboard JSON, and bceval upload all see the same shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ings

All 15 positive entries (synthetic__performance-018..032) now produce zero out-of-domain findings while preserving in-domain performance anchors.
Add a polymorphic `sort_key` property on `BaseEvaluationResult` so the

Detailed Results table renders in a stable, human-friendly order:

- Default sort key uses natural-sorted instance_id, so `feature-2` precedes

  `feature-10` and zero-padded ids like `-001/-002/-010/-011` sort numerically.

- `CodeReviewResult` overrides `sort_key` to group by domain first, then

  by natural-sorted instance_id, so the table reads as domain blocks.

Applies to both the GitHub job summary and the Rich console summary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All 9 false-positive performance entries (-001, -003, -005, -007, -009, -011, -012, -014, -016) previously had expected_comments=[]. They are deliberate negative tests: the patches contain code that LOOKS like an AL perf anti-pattern but is correct (e.g., CalcFields in OnValidate, Get on singleton setup, RecordRef on bounded tables).

This change PRESERVES the false-positive bait code unchanged and APPENDS a new AL file per entry containing one or more genuine, isolated perf anti-patterns drawn from performance.md. Each new finding is added to the entry's expected_comments. Result:

- Every perf entry now has >= 1 expected_comment (24/24)

- Comment count distribution: 6 with 1, 12 with 2, 6 with 3 (was 9 with 0, 12 with 1, 3 with 2+)

- performance.md rule coverage: 64.3% -> 89.3% (25 / 28 rules)

- New rule categories now exercised: FindFirst-with-full-PK -> Get, SetCurrentKey/key selection, RecordRef in hot loops, DB op in event subscribers w/o cheap guard, FindSet(true) on read-only iteration, removed SourceTableTemporary on API page

Entries now test both precision (don't flag the FP bait) AND recall (catch the new real issues) instead of precision only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After the perf enrichment commit the dataset had odd gaps in the numeric suffix of instance_id:

- performance: was missing 002, 004, 006, 008, 010, 013, 015, 017 (24 entries spanning 001..032)

- privacy: was missing 007 (15 entries spanning 001..016)

Renumber within each domain so the suffix is contiguous 001..N, preserving relative order. Only the instance_id field is touched; descriptions, patches, expected_comments, etc. are unchanged. No tracked source/tests/docs reference these IDs so this is a safe rename. All 81 entries still pass the CodeReviewEntry Pydantic schema and all 482 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two issues caused every job in run #27378333674 to emit:

  ##[warning] Result for synthetic__<id> missing metrics: execution_time, llm_duration, prompt_tokens, completion_tokens, tool_usage

1. Copilot CLI v1.0.61 changed its summary output:

   - 'Requests 0.33 Premium (1m 45s)' became 'AI Credits 58.4 (1m 14s)'

   - 'Tokens  up 317.5k . down 4.3k . 255.0k (cached)' became 'Tokens  up 413.9k (368.1k cached) . down 4.5k (500 reasoning)'

   The bullet-anchored token regex would no longer match because of the inline (cached)/(reasoning) annotations between the two values, and there was no parser for 'AI Credits'. Added an AI-Credits regex and split the tokens parse into separate up/down lookups on the Tokens line.

2. The Copilot hook config in setup_hooks() only sets the 'powershell' field, but per docs.github.com/en/copilot/reference/hooks-reference Linux Copilot CLI runs honor only the 'bash' or 'command' field. As a result tool_usage was always empty on GH Actions Linux runners. Added a 'bash' field that invokes pwsh (preinstalled on ubuntu-24.04 runners) to run the same .ps1 hook script; kept 'powershell' for Windows.

After the fix only llm_duration remains None for new runs - the CLI no longer reports API time separately from total session time, which is a real upstream change worth keeping visible in the warning.

Added 3 new metrics tests covering the v1.0.61 format and asserted the bash field on the Copilot hook config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Last fix added a bash field calling pwsh on Linux, but tool_usage was still null on the next run. pwsh-on-Linux + the .ps1 stdin/env-var path was failing silently somewhere (Copilot's process-*.log isn't uploaded, so we couldn't see why). Switch the Linux hook to Python instead - python3 is preinstalled on every GH runner, has trivial JSON/stdin handling, and removes pwsh as a dependency on Linux.

Changes:

- New cross-platform hook script: src/bcbench/agent/shared/hooks/log_tool_usage.py mirrors the existing .ps1 behavior (extracts tool_name/toolName, expands lsp:<operation>, appends one JSON line per call to BCBENCH_TOOL_LOG, swallows all errors with exit 0).

- hooks_operations._setup_copilot_hooks now writes both fields: bash invokes python3 with the new .py script (used on Linux/macOS), powershell keeps invoking powershell.exe with the .ps1 (used on Windows). The .ps1 path is unchanged so Windows self-hosted runners that already work (bug-fix, test-generation) are untouched.

- PathConfig gains hook_script_path_py alongside hook_script_path.

- copilot-evaluation.yml artifact upload now includes *.log files so the next investigation can see Copilot's debug session log directly.

- 8 new tests cover the Python hook (happy path, camelCase, lsp expansion, lsp args as JSON string, missing tool_name, missing env, malformed stdin) and the hooks_operations test asserts python3 and log_tool_usage.py are in the bash field.

All 493 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Probe harness materializes patch into throwaway repo, runs al-code-review skill,
and validates expected_comments recall + out-of-domain (OOD) finding discipline.

- probe_codereview_case.py: single-entry probe (patch -> tmp dir -> copilot skill -> review.json -> match check)
- probe_codereview_batch.py: parallel runner with ThreadPoolExecutor + per-entry reports
- apply_enrichment.py: ingests tmp/enrichment-design-<domain>.json, appends new-file diffs + expected_comments
- unindent_bait_files.py: reformats pre-existing FP-bait files from 4-space to 2-space (eliminates style-domain OOD on non-style entries)
- fix_enrichment_iteration_1.py / _2.py: targeted post-probe corrections for individual entries

Verified: 28/28 zero-expected entries now have in-domain expected findings with 0 OOD.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread tools/unindent_bait_files.py Fixed
Comment thread tools/probe_codereview_case.py Fixed
Comment thread tools/probe_codereview_batch.py Fixed
Adds true-positive expected_comments to entries where expected_comments was empty,
covering all four non-performance domains (mirrors prior perf enrichment in 23ce854).

Per-entry approach:
- Append a short new .al file to the patch that introduces 1-2 isolated in-domain anti-patterns
- Reformat pre-existing FP-bait files from 4-space to 2-space indent (eliminates style-domain OOD)
- Add matching expected_comments anchored to the violating lines

Counts:
- security: 7 entries (object IDs 50301-50307)
- privacy:  8 entries (object IDs 50321-50328)
- style:    9 entries (object IDs 50341-50349)
- upgrade:  4 entries (object IDs 50361-50364)

Verified locally via tools/probe_codereview_batch.py against the al-code-review
skill (claude-opus-4.8): 28/28 produce matching in-domain findings with 0 OOD.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WaelAbuSeada WaelAbuSeada force-pushed the experiment/code-review-al-skill branch from 281c231 to 68e81c4 Compare June 12, 2026 04:18
WaelAbuSeada and others added 8 commits June 11, 2026 22:25
Auto-fix + manual cleanup so CI pre-commit (ruff check + ruff format + ty) passes:
- log_tool_usage.py: SIM105 (try/except/pass -> contextlib.suppress); move import to top
- dump_entries.py: ANN201, PTH207 (glob -> Path.rglob), SIM115 (use context manager), E741 (l -> line)
- ood_worklist.py: same modernizations + remove unused os/glob imports
- run_entry.py: ruff format reflow (no behavior change)
- test_codereview.py: ruff format reflow (single-line; no behavior change)

Verified locally: ruff check passes, ruff format clean, 8/8 hook tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t branch

Snapshots all non-experiment files from experiment/code-review-al-skill onto
category/code-review. Experiment-specific assets (al-code-review skill and
custom instructions under microsoft-BCApps/instructions/*.md) remain only on
the experiment branch.

Highlights:
- Dataset: enriched 28 zero-expected entries (security/privacy/style/upgrade)
  with in-domain expected_comments; cleaned up OOD bait across pre-existing
  entries; renumbered performance and privacy entries to be contiguous.
- Eval: domain-aware code-review evaluation, codereview_judge for LLM-confirmed
  matches, improved review parsing, grouped per-domain summary layout.
- Results: domain-split metrics, leaderboard refresh, severity_mae, macro_f1.
- Tooling: probe_codereview_case/batch harness for local skill testing,
  apply_enrichment + unindent_bait_files + fix_enrichment_iteration_{1,2}
  scripts used to produce the dataset enrichment, dump_entries, ood_worklist,
  run_entry helpers.
- Hooks: Python log_tool_usage hook (Linux-compatible) with process log capture
  and unit tests.
- Workflow: copilot-evaluation.yml updates for category routing and metrics.
- Lint: ruff/ty cleanups across tools/, tests/, and shared hooks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
The /al-code-review skill prompt, custom instructions, and skills are
experiment-specific. Revert config.yaml to the defaults (/review prompt,
instructions/skills disabled). Experiment branch keeps its own version.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	src/bcbench/agent/shared/config.yaml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants