feat(benchmarks): Support local Claude UI benchmark suites by cameroncooke · Pull Request #429 · getsentry/XcodeBuildMCP

cameroncooke · 2026-05-26T01:28:51Z

Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP.

This keeps the committed suite focused on first-party XcodeBuildMCP coverage while allowing private/local benchmark suites to live under a generic ignored benchmarks/claude-ui/local/ tree. The harness records observed benchmark data rather than treating metric deltas as regression failures, writes per-run artifacts, and reports task completion separately from measured tool usage and timing.

The committed baselines were refreshed from three canonical full-suite runs. Local/private suite results are intentionally not committed.

This PR is stacked on #427.

cameroncooke · 2026-05-26T01:29:06Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Old sequence config key silently ignored instead of rejected
- Added 'sequence' key to rejectRemovedConfigKeys with migration message 'removed; sequence checking is now observational only'

Or push these changes by commenting:

@cursor push 8dbf5edd5a

Preview (8dbf5edd5a)

diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -247,6 +247,7 @@
     allowedVariance: 'removed; baselines are observed data only',
     expectedFailures: 'removed; benchmark stumbles are observed data',
     expectedToolSequence: 'renamed to baselineToolSequence',
+    sequence: 'removed; sequence checking is now observational only',
   };
   for (const [key, message] of Object.entries(removedKeys)) {
     if (raw[key] !== undefined) throw new Error(`${source}.${key}: ${message}`);

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit ee9adf9. Configure here.}

Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP and optional local tool surfaces. The harness now records observed baselines, writes per-run artifacts, supports private local suites, and reports completion separately from benchmark metrics. Keep vendor/private suites out of tracked source by discovering ignored local benchmark suites from a generic local directory. Refresh the committed first-party baselines from the latest canonical benchmark runs. Co-Authored-By: OpenAI Codex <noreply@openai.com>

Treat configured failure pattern matches as incomplete benchmark runs so CI exits non-zero for explicitly declared failure conditions. Reject activateSkill configs without skillDirs during suite parsing to avoid late failures after expensive setup. Co-Authored-By: OpenAI Codex <noreply@openai.com>

Reject the old sequence suite config key with an explicit migration message so migrated benchmarks do not silently drop sequence checks. Co-Authored-By: OpenAI Codex <noreply@openai.com>

Validate activated benchmark skills before setup, preserve authoritative Claude stream results only for harness-terminated runs, and make transcript failure accounting robust for missing or duplicate Bash tool results. Co-Authored-By: OpenAI Codex <noreply@openai.com>

Detach timed Claude commands before process-group termination and fix RocketSim preflight launch detection for direct app/path commands. Co-Authored-By: OpenAI Codex <noreply@openai.com>

sentry · 2026-05-26T11:33:23Z

+          }),
+        )
        .catch(reject);
    });


Bug: runCommand lacks an error handler on child.stdin. A premature child process exit could cause an unhandled EPIPE error, crashing the harness.
_{Severity: MEDIUM}

Suggested Fix

Attach an error handler to the child.stdin stream to gracefully handle potential EPIPE errors, preventing them from crashing the process. A standard pattern is child.stdin.on('error', (err) => { if (err.code !== 'EPIPE') reject(err); }); which ignores pipe errors but rejects other stream errors.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/benchmarks/claude-ui/harness.ts#L353 Potential issue: The `runCommand` function in `src/benchmarks/claude-ui/harness.ts` writes to a child process's standard input using `child.stdin.end()` but does not attach an error handler to the `child.stdin` stream. If the child process exits before consuming all the piped input (e.g., due to a corrupted binary, permission errors, or invalid arguments), the stream can emit an `EPIPE` error. Because this error is unhandled, it will propagate and crash the entire harness process. While unlikely during normal operation, this defensive gap leaves the harness vulnerable to failures in child process startup.

sentry · 2026-05-26T11:33:23Z

    const entryType = asString(entry.type);
    const lineText = rawLine.length > 600 ? `${rawLine.slice(0, 600)}…` : rawLine;

-    for (const matcher of patternMatchers) {
-      if (matcher.regex.test(rawLine)) {
-        patternFailures.push({ pattern: matcher.pattern, line, excerpt: lineText });
-      }
-    }
-
    if (entryType === 'result') {


Bug: The matching behavior for failurePatterns has silently changed, now defaulting to only scan commands and toolResults, which may break existing benchmark configurations.
_{Severity: MEDIUM}

Suggested Fix

To prevent silent failures in existing benchmark suites, provide a migration path. Either add a warning for configurations that use failurePatterns without the new failurePatternTargets field, or introduce a mechanism that allows configurations to explicitly opt-in to the old behavior of scanning the entire transcript.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/benchmarks/claude-ui/transcript.ts#L313-L316 Potential issue: The logic for matching `failurePatterns` in benchmark configurations has changed. Previously, patterns were matched against the entire transcript, but now they are only matched against command inputs and tool results by default. Existing configurations that use `failurePatterns` are not notified of this change. This could cause suites that rely on matching patterns in other parts of the transcript (like assistant messages) to silently stop detecting their intended failure conditions, as the new behavior is adopted without any warning or migration guidance.

Also affects:

src/benchmarks/claude-ui/transcript.ts:335~348

src/benchmarks/claude-ui/transcript.ts:396~404

cameroncooke mentioned this pull request May 26, 2026

feat(ui-automation): Add rs/1 runtime automation parity #416

Merged

cameroncooke mentioned this pull request May 26, 2026

feat(benchmarks): Add Claude UI benchmark harness #427

Merged