Skip to content

feat(benchmarks): Support local Claude UI benchmark suites#429

Open
cameroncooke wants to merge 5 commits into
mainfrom
cam/feat/configurable-claude-ui-benchmark-tools
Open

feat(benchmarks): Support local Claude UI benchmark suites#429
cameroncooke wants to merge 5 commits into
mainfrom
cam/feat/configurable-claude-ui-benchmark-tools

Conversation

@cameroncooke
Copy link
Copy Markdown
Collaborator

@cameroncooke cameroncooke commented May 26, 2026

Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP.

This keeps the committed suite focused on first-party XcodeBuildMCP coverage while allowing private/local benchmark suites to live under a generic ignored benchmarks/claude-ui/local/ tree. The harness records observed benchmark data rather than treating metric deltas as regression failures, writes per-run artifacts, and reports task completion separately from measured tool usage and timing.

The committed baselines were refreshed from three canonical full-suite runs. Local/private suite results are intentionally not committed.

This PR is stacked on #427.

Copy link
Copy Markdown
Collaborator Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/benchmarks/claude-ui/harness.ts
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Old sequence config key silently ignored instead of rejected
    • Added 'sequence' key to rejectRemovedConfigKeys with migration message 'removed; sequence checking is now observational only'

Create PR

Or push these changes by commenting:

@cursor push 8dbf5edd5a
Preview (8dbf5edd5a)
diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -247,6 +247,7 @@
     allowedVariance: 'removed; baselines are observed data only',
     expectedFailures: 'removed; benchmark stumbles are observed data',
     expectedToolSequence: 'renamed to baselineToolSequence',
+    sequence: 'removed; sequence checking is now observational only',
   };
   for (const [key, message] of Object.entries(removedKeys)) {
     if (raw[key] !== undefined) throw new Error(`${source}.${key}: ${message}`);

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit ee9adf9. Configure here.

Comment thread src/benchmarks/claude-ui/config.ts
Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts
Comment thread src/benchmarks/claude-ui/harness.ts Outdated
Comment thread src/benchmarks/claude-ui/transcript.ts
Comment thread src/benchmarks/claude-ui/transcript.ts
Comment thread src/benchmarks/claude-ui/simulator-deletion.ts Outdated
Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/benchmarks/claude-ui/transcript.ts
Comment thread src/benchmarks/claude-ui/preflight-commands.ts Outdated
@cameroncooke cameroncooke changed the title feat(benchmarks): Add Claude UI benchmark harness feat(benchmarks): Support local Claude UI benchmark suites May 26, 2026
@cameroncooke cameroncooke changed the base branch from cam/feat/claude-ui-benchmark-harness to graphite-base/429 May 26, 2026 11:23
cameroncooke and others added 4 commits May 26, 2026 11:23
Add a local Claude Code benchmark harness for measuring UI task runs
against XcodeBuildMCP and optional local tool surfaces. The harness now
records observed baselines, writes per-run artifacts, supports private
local suites, and reports completion separately from benchmark metrics.

Keep vendor/private suites out of tracked source by discovering ignored
local benchmark suites from a generic local directory. Refresh the
committed first-party baselines from the latest canonical benchmark runs.

Co-Authored-By: OpenAI Codex <noreply@openai.com>
Treat configured failure pattern matches as incomplete benchmark runs so
CI exits non-zero for explicitly declared failure conditions. Reject
activateSkill configs without skillDirs during suite parsing to avoid
late failures after expensive setup.

Co-Authored-By: OpenAI Codex <noreply@openai.com>
Reject the old sequence suite config key with an explicit migration
message so migrated benchmarks do not silently drop sequence checks.

Co-Authored-By: OpenAI Codex <noreply@openai.com>
Validate activated benchmark skills before setup, preserve authoritative Claude stream results only for harness-terminated runs, and make transcript failure accounting robust for missing or duplicate Bash tool results.

Co-Authored-By: OpenAI Codex <noreply@openai.com>
@cameroncooke cameroncooke force-pushed the cam/feat/configurable-claude-ui-benchmark-tools branch from 9c53757 to 2082db2 Compare May 26, 2026 11:23
@graphite-app graphite-app Bot changed the base branch from graphite-base/429 to main May 26, 2026 11:24
Detach timed Claude commands before process-group termination and fix RocketSim preflight launch detection for direct app/path commands.

Co-Authored-By: OpenAI Codex <noreply@openai.com>
@cameroncooke cameroncooke force-pushed the cam/feat/configurable-claude-ui-benchmark-tools branch from 2082db2 to cd53c71 Compare May 26, 2026 11:24
}),
)
.catch(reject);
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: runCommand lacks an error handler on child.stdin. A premature child process exit could cause an unhandled EPIPE error, crashing the harness.
Severity: MEDIUM

Suggested Fix

Attach an error handler to the child.stdin stream to gracefully handle potential EPIPE errors, preventing them from crashing the process. A standard pattern is child.stdin.on('error', (err) => { if (err.code !== 'EPIPE') reject(err); }); which ignores pipe errors but rejects other stream errors.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/benchmarks/claude-ui/harness.ts#L353

Potential issue: The `runCommand` function in `src/benchmarks/claude-ui/harness.ts`
writes to a child process's standard input using `child.stdin.end()` but does not attach
an error handler to the `child.stdin` stream. If the child process exits before
consuming all the piped input (e.g., due to a corrupted binary, permission errors, or
invalid arguments), the stream can emit an `EPIPE` error. Because this error is
unhandled, it will propagate and crash the entire harness process. While unlikely during
normal operation, this defensive gap leaves the harness vulnerable to failures in child
process startup.

Comment on lines 313 to 316
const entryType = asString(entry.type);
const lineText = rawLine.length > 600 ? `${rawLine.slice(0, 600)}…` : rawLine;

for (const matcher of patternMatchers) {
if (matcher.regex.test(rawLine)) {
patternFailures.push({ pattern: matcher.pattern, line, excerpt: lineText });
}
}

if (entryType === 'result') {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The matching behavior for failurePatterns has silently changed, now defaulting to only scan commands and toolResults, which may break existing benchmark configurations.
Severity: MEDIUM

Suggested Fix

To prevent silent failures in existing benchmark suites, provide a migration path. Either add a warning for configurations that use failurePatterns without the new failurePatternTargets field, or introduce a mechanism that allows configurations to explicitly opt-in to the old behavior of scanning the entire transcript.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/benchmarks/claude-ui/transcript.ts#L313-L316

Potential issue: The logic for matching `failurePatterns` in benchmark configurations
has changed. Previously, patterns were matched against the entire transcript, but now
they are only matched against command inputs and tool results by default. Existing
configurations that use `failurePatterns` are not notified of this change. This could
cause suites that rely on matching patterns in other parts of the transcript (like
assistant messages) to silently stop detecting their intended failure conditions, as the
new behavior is adopted without any warning or migration guidance.

Also affects:

  • src/benchmarks/claude-ui/transcript.ts:335~348
  • src/benchmarks/claude-ui/transcript.ts:396~404

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant