feat(benchmarks): Support local Claude UI benchmark suites#429
feat(benchmarks): Support local Claude UI benchmark suites#429cameroncooke wants to merge 5 commits into
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Old
sequenceconfig key silently ignored instead of rejected- Added 'sequence' key to rejectRemovedConfigKeys with migration message 'removed; sequence checking is now observational only'
Or push these changes by commenting:
@cursor push 8dbf5edd5a
Preview (8dbf5edd5a)
diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -247,6 +247,7 @@
allowedVariance: 'removed; baselines are observed data only',
expectedFailures: 'removed; benchmark stumbles are observed data',
expectedToolSequence: 'renamed to baselineToolSequence',
+ sequence: 'removed; sequence checking is now observational only',
};
for (const [key, message] of Object.entries(removedKeys)) {
if (raw[key] !== undefined) throw new Error(`${source}.${key}: ${message}`);You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit ee9adf9. Configure here.
Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP and optional local tool surfaces. The harness now records observed baselines, writes per-run artifacts, supports private local suites, and reports completion separately from benchmark metrics. Keep vendor/private suites out of tracked source by discovering ignored local benchmark suites from a generic local directory. Refresh the committed first-party baselines from the latest canonical benchmark runs. Co-Authored-By: OpenAI Codex <noreply@openai.com>
Treat configured failure pattern matches as incomplete benchmark runs so CI exits non-zero for explicitly declared failure conditions. Reject activateSkill configs without skillDirs during suite parsing to avoid late failures after expensive setup. Co-Authored-By: OpenAI Codex <noreply@openai.com>
Reject the old sequence suite config key with an explicit migration message so migrated benchmarks do not silently drop sequence checks. Co-Authored-By: OpenAI Codex <noreply@openai.com>
Validate activated benchmark skills before setup, preserve authoritative Claude stream results only for harness-terminated runs, and make transcript failure accounting robust for missing or duplicate Bash tool results. Co-Authored-By: OpenAI Codex <noreply@openai.com>
989ab76 to
3eaed16
Compare
9c53757 to
2082db2
Compare
Detach timed Claude commands before process-group termination and fix RocketSim preflight launch detection for direct app/path commands. Co-Authored-By: OpenAI Codex <noreply@openai.com>
2082db2 to
cd53c71
Compare
| }), | ||
| ) | ||
| .catch(reject); | ||
| }); |
There was a problem hiding this comment.
Bug: runCommand lacks an error handler on child.stdin. A premature child process exit could cause an unhandled EPIPE error, crashing the harness.
Severity: MEDIUM
Suggested Fix
Attach an error handler to the child.stdin stream to gracefully handle potential EPIPE errors, preventing them from crashing the process. A standard pattern is child.stdin.on('error', (err) => { if (err.code !== 'EPIPE') reject(err); }); which ignores pipe errors but rejects other stream errors.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/benchmarks/claude-ui/harness.ts#L353
Potential issue: The `runCommand` function in `src/benchmarks/claude-ui/harness.ts`
writes to a child process's standard input using `child.stdin.end()` but does not attach
an error handler to the `child.stdin` stream. If the child process exits before
consuming all the piped input (e.g., due to a corrupted binary, permission errors, or
invalid arguments), the stream can emit an `EPIPE` error. Because this error is
unhandled, it will propagate and crash the entire harness process. While unlikely during
normal operation, this defensive gap leaves the harness vulnerable to failures in child
process startup.
| const entryType = asString(entry.type); | ||
| const lineText = rawLine.length > 600 ? `${rawLine.slice(0, 600)}…` : rawLine; | ||
|
|
||
| for (const matcher of patternMatchers) { | ||
| if (matcher.regex.test(rawLine)) { | ||
| patternFailures.push({ pattern: matcher.pattern, line, excerpt: lineText }); | ||
| } | ||
| } | ||
|
|
||
| if (entryType === 'result') { |
There was a problem hiding this comment.
Bug: The matching behavior for failurePatterns has silently changed, now defaulting to only scan commands and toolResults, which may break existing benchmark configurations.
Severity: MEDIUM
Suggested Fix
To prevent silent failures in existing benchmark suites, provide a migration path. Either add a warning for configurations that use failurePatterns without the new failurePatternTargets field, or introduce a mechanism that allows configurations to explicitly opt-in to the old behavior of scanning the entire transcript.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/benchmarks/claude-ui/transcript.ts#L313-L316
Potential issue: The logic for matching `failurePatterns` in benchmark configurations
has changed. Previously, patterns were matched against the entire transcript, but now
they are only matched against command inputs and tool results by default. Existing
configurations that use `failurePatterns` are not notified of this change. This could
cause suites that rely on matching patterns in other parts of the transcript (like
assistant messages) to silently stop detecting their intended failure conditions, as the
new behavior is adopted without any warning or migration guidance.
Also affects:
src/benchmarks/claude-ui/transcript.ts:335~348src/benchmarks/claude-ui/transcript.ts:396~404



Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP.
This keeps the committed suite focused on first-party XcodeBuildMCP coverage while allowing private/local benchmark suites to live under a generic ignored
benchmarks/claude-ui/local/tree. The harness records observed benchmark data rather than treating metric deltas as regression failures, writes per-run artifacts, and reports task completion separately from measured tool usage and timing.The committed baselines were refreshed from three canonical full-suite runs. Local/private suite results are intentionally not committed.
This PR is stacked on #427.