@tangle-network/agent-eval

Evaluate and improve AI agents from the runs they already produce.

agent-eval turns agent outputs, traces, judge scores, and production feedback into a decision packet: did this change help, what failed, what should ship, and what needs more data?

Use it when you need to:

  • compare a candidate agent/prompt/model against a baseline,
  • turn production traces or human feedback into eval results,
  • run a gated self-improvement loop,
  • explain failures by cluster, cost, judge disagreement, and release risk.

It is a library and does not require a SaaS. TypeScript is first-class; Python can call the same wire protocol through agent-eval-rpc.


Install

pnpm add @tangle-network/agent-eval

Python clients can use the RPC package:

pip install agent-eval-rpc

Quick start

1. Analyze runs you already have

Start here if you already have production logs, benchmark rows, human ratings, or agent run records.

import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs, // RunRecord[]
  baselineRuns,
})

console.log(report.recommendations)
console.log(report.lift)
console.log(report.failureClusters)

The output includes score distributions, lift confidence intervals, failure modes, cost-quality tradeoffs, judge agreement, contamination checks, and release recommendations when the input supports them.
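To make "lift confidence intervals" concrete, here is a minimal sketch of a paired percentile bootstrap over per-scenario score differences. It is an illustration under assumptions, not the library's implementation of report.lift: the function name pairedBootstrapLift, the LiftInterval shape, the resample count, and the seed are all hypothetical.

```typescript
// Hypothetical sketch: NOT how analyzeRuns() computes report.lift.
// Small seeded PRNG (mulberry32) so the interval is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface LiftInterval {
  mean: number
  lower: number
  upper: number
}

// baseline[i] and candidate[i] must be scores for the same scenario.
function pairedBootstrapLift(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): LiftInterval {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('expected two non-empty score arrays of equal length')
  }
  const diffs = candidate.map((score, i) => score - (baseline[i] ?? 0))
  const n = diffs.length
  const mean = diffs.reduce((sum, d) => sum + d, 0) / n

  // Resample the paired differences with replacement and record each mean.
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)] ?? 0
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)

  // Percentile interval: 2.5th and 97.5th percentiles of the resampled means.
  const at = (q: number): number =>
    means[Math.min(resamples - 1, Math.floor(q * resamples))] ?? mean
  return { mean, lower: at(0.025), upper: at(0.975) }
}

const lift = pairedBootstrapLift([0.5, 0.6, 0.4, 0.7], [0.6, 0.8, 0.5, 0.9])
console.log(lift)
```

An interval whose lower bound stays above zero is the kind of evidence a release recommendation can lean on; an interval that straddles zero usually means more paired runs are needed.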

2. Run a gated improvement loop

Use this when you have scenarios, a runnable agent, and judges.

import { selfImprove } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios,
  dispatch: async ({ scenario }) => myAgent.run(scenario),
  judges: [myJudge],
  baselineSurface: { systemPrompt: currentPrompt },
})

console.log(result.gateDecision)
console.log(result.winnerSurface)
console.log(result.insight.recommendations)

selfImprove() evaluates candidates on held-out scenarios before recommending a winner.
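The held-out idea can be sketched as a deterministic split: each scenario is assigned to a search set or a held-out set by hashing a stable identifier, so the assignment does not change between runs. This is an assumption-laden illustration, not what selfImprove() does internally; splitScenarios, the id field, and the 30% default are hypothetical.

```typescript
// Hypothetical sketch: NOT the library's split logic.
// FNV-1a-style 32-bit hash over UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

interface SplitResult<T> {
  search: T[] // scenarios candidates may be tuned against
  heldOut: T[] // scenarios reserved for the final comparison
}

// Assumes each scenario carries a stable string `id` (an assumption here).
function splitScenarios<T extends { id: string }>(
  scenarios: T[],
  heldOutFraction = 0.3,
): SplitResult<T> {
  const search: T[] = []
  const heldOut: T[] = []
  for (const scenario of scenarios) {
    // Map the hash to [0, 1) and compare it with the held-out fraction.
    const bucket = fnv1a(scenario.id) / 4294967296
    if (bucket < heldOutFraction) {
      heldOut.push(scenario)
    } else {
      search.push(scenario)
    }
  }
  return { search, heldOut }
}

const demo = splitScenarios([{ id: 'refund-01' }, { id: 'refund-02' }, { id: 'login-01' }])
console.log(demo.search.length, demo.heldOut.length)
```

Hashing the identifier, rather than shuffling, keeps a scenario on the same side of the split when new scenarios are added later.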

3. Adapt existing data

import { analyzeRuns, fromFeedbackTable, fromOtelSpans } from '@tangle-network/agent-eval/contract'

const { runs, raterScores } = fromFeedbackTable({
  ratings: parseYourFeedbackTable(),
})

const traceRuns = fromOtelSpans({ spans: yourOtelSpans })

await analyzeRuns({ runs: [...runs, ...traceRuns], raterScores })
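Before handing a feedback table to an adapter, it can help to see what multi-rater data looks like once grouped. The sketch below groups ratings per run and reports the mean score, the number of distinct raters, and the score spread. The Rating and RunSummary shapes and the summarizeRatings helper are hypothetical and are not the input or output types of fromFeedbackTable.

```typescript
// Hypothetical shapes: NOT the types fromFeedbackTable() accepts or returns.
interface Rating {
  runId: string
  rater: string
  score: number
}

interface RunSummary {
  runId: string
  meanScore: number
  raters: number // distinct raters who scored this run
  spread: number // max score minus min score, a crude disagreement signal
}

function summarizeRatings(ratings: Rating[]): RunSummary[] {
  // Group ratings by run, preserving first-seen order.
  const byRun = new Map<string, Rating[]>()
  for (const rating of ratings) {
    const group = byRun.get(rating.runId) ?? []
    group.push(rating)
    byRun.set(rating.runId, group)
  }

  const summaries: RunSummary[] = []
  byRun.forEach((group, runId) => {
    const scores = group.map((r) => r.score)
    const meanScore = scores.reduce((sum, s) => sum + s, 0) / scores.length
    summaries.push({
      runId,
      meanScore,
      raters: new Set(group.map((r) => r.rater)).size,
      spread: Math.max(...scores) - Math.min(...scores),
    })
  })
  return summaries
}

console.log(
  summarizeRatings([
    { runId: 'run-a', rater: 'ana', score: 4 },
    { runId: 'run-a', rater: 'ben', score: 2 },
    { runId: 'run-b', rater: 'ana', score: 5 },
  ]),
)
```

Runs with a large spread are the ones where raters disagree, which is where per-rater scores are most informative.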

Core concepts

  • RunRecord: the durable row for one agent run: model, prompt/config hashes, split, cost, tokens, outcome.
  • Scenario: one task or case the agent attempts.
  • Judge: a scoring function, rule-based or model-based.
  • InsightReport: the decision packet returned by analyzeRuns() and embedded in selfImprove().
  • Gate: the policy that decides ship, hold, or need_more_data.
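As an illustration of the Gate concept, the sketch below maps a lift confidence interval and a sample count to one of the three decisions named above. It is a toy policy under stated assumptions, not the library's Gate type or any of its shipped gates: GateInput, GatePolicy, decide, and the default thresholds are hypothetical.

```typescript
// Hypothetical sketch: NOT the library's Gate interface or default policy.
type GateDecision = 'ship' | 'hold' | 'need_more_data'

interface GateInput {
  liftLower: number // lower bound of the lift confidence interval
  liftUpper: number // upper bound of the lift confidence interval
  pairedSamples: number // number of paired baseline/candidate runs
}

interface GatePolicy {
  minSamples: number // below this, refuse to decide
  minLift: number // lift the candidate must clearly exceed
}

function decide(
  input: GateInput,
  policy: GatePolicy = { minSamples: 30, minLift: 0 },
): GateDecision {
  // Too few runs: no decision is trustworthy yet.
  if (input.pairedSamples < policy.minSamples) return 'need_more_data'
  // Whole interval above the threshold: the improvement is credible.
  if (input.liftLower > policy.minLift) return 'ship'
  // Whole interval at or below the threshold: keep the baseline.
  if (input.liftUpper <= policy.minLift) return 'hold'
  // Interval straddles the threshold: collect more evidence.
  return 'need_more_data'
}

console.log(decide({ liftLower: 0.02, liftUpper: 0.11, pairedSamples: 80 }))
```

Keeping need_more_data as an explicit outcome avoids forcing an ambiguous result into ship or hold.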

Examples

| Journey | Example | Who it's for |
| --- | --- | --- |
| Closed loop: improve a prompt under statistical confidence | examples/selfimprove-quickstart/ | Teams with scenarios + judges + agent in hand |
| Multi-rater feedback corpus: turn Obsidian/Sheets/CSV ratings into actionable insights | examples/customer-feedback-loop/ | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
| Production OTel traces: analyze logs you already have, no closed loop required | examples/customer-otel-traces/ | Teams running agents in prod with observability, no eval discipline yet |

Each example consists of a README.md and a single index.ts that runs via pnpm tsx and prints the resulting InsightReport to stdout.


Subpath entry points

| Subpath | What it gives you |
| --- | --- |
| …/contract | The headline, frozen surface; new code starts here. selfImprove, analyzeRuns, runEval, runCampaign, runImprovementLoop, diffRuns; intake adapters (fromFeedbackTable, fromOtelSpans); drivers (gepaDriver, evolutionaryDriver); gates (defaultProductionGate, heldOutGate, paretoSignificanceGate, composeGate); the deployment-outcome store; storage; and the five core types Scenario / Dispatch / JudgeConfig / Mutator / Gate. |
| …/hosted | createHostedClient / hostedClientFromEnv + the wire types to ship eval-run events + trace spans to a hosted orchestrator (ours or your own implementation of the spec) |
| …/adapters/otel | createOtelBridge: forwards OpenTelemetry-shape spans into the hosted-tier ingest, no @opentelemetry/* dependency |
| …/adapters/langchain | Wrap any LangChain Runnable as a Dispatch (or JudgeConfig), no @langchain/core peer dep |
| …/adapters/http | httpDispatch + runDispatchServer: run a campaign's worker on another machine (multi-region, driver-as-a-service) |
| …/campaign | The measurement + improvement engine (@experimental): runProfileMatrix, compareDrivers, every driver (gepaDriver, haloDriver, skillOptDriver, aceDriver, memoryCurationDriver, …), the gates, storage backends, and loop provenance. /contract re-exports the stable subset. |
| …/rl | RL bridge from eval artifacts to training signal: verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, plus the durable corpus + buildRlDataset / datasheet bundle |
| …/reporting | Release-decision statistics: pairedBootstrap, benjaminiHochberg, anytime-valid sequential e-values, evaluateReleaseConfidence, and the report renderers |
| …/analyst | The trace-analyst surface: AnalystRegistry + buildDefaultAnalystRegistry (run the failure-clustering panel), FindingsStore, and the LLM chat transports |
| …/traces | Trace stores + emitters, OTLP-JSONL deterministic replay, analyzeTraces, and the traceAnalystOnRunComplete hook |
| …/control | Agent control loop: runAgentControlLoop (observe → validate → decide → act), action policy, propose/review |
| …/matrix | runAgentMatrix: an N-axis cartesian over caller-supplied substrate values, per-axis pass/score/cost/duration |
| …/multishot | N-shot persona × shot matrix runner (runMultishot / runMultishotMatrix) |
| …/wire | The cross-language HTTP/RPC server + Zod schemas (the source-of-truth protocol the Python client speaks) + the built-in rubric registry |
| …/benchmarks | BenchmarkAdapter contract + deterministicSplit + the bundled routing reference benchmark |
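The …/reporting row names benjaminiHochberg among its release-decision statistics. For readers unfamiliar with the procedure, here is a self-contained sketch of the standard Benjamini-Hochberg adjustment, which controls the false discovery rate when several comparisons are tested at once. It is written from the textbook definition; the function is deliberately named bhAdjust to make clear it is not the library's export, whose signature is not documented here.

```typescript
// Textbook Benjamini-Hochberg adjusted p-values.
// Hypothetical helper: NOT the library's benjaminiHochberg export.
function bhAdjust(pValues: number[]): number[] {
  const m = pValues.length
  // Sort p-values ascending while remembering their original positions.
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p)

  const adjusted = new Array<number>(m).fill(1)
  let running = 1
  // Walk from the largest p-value down, enforcing monotonicity:
  // adjusted(rank) = min over ranks >= rank of p * m / rank, capped at 1.
  for (let rank = m; rank >= 1; rank--) {
    const entry = order[rank - 1]
    if (entry === undefined) continue
    running = Math.min(running, (entry.p * m) / rank)
    adjusted[entry.index] = running
  }
  return adjusted
}

// Four comparisons; reject those whose adjusted p-value is below the target FDR.
const adjustedP = bhAdjust([0.01, 0.04, 0.03, 0.005])
console.log(adjustedP.map((p) => p < 0.05))
```

Adjusted values are returned in the same order as the inputs, so they can be zipped back onto the comparisons they came from.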

Specialized surfaces (subpath-only): …/prm (process-reward grading + best-of-N), …/meta-eval (judge calibration + the deployment-outcome store), …/pipelines (trace-diagnostic views: budget breach, failure cluster, stuck loop, …), …/governance (EU AI Act / NIST AI RMF / SOC2 reports), …/knowledge (knowledge-readiness gating before a run), …/builder-eval (code-generator three-layer eval), …/storyboard (trace → watchable replay), …/authenticity (anti-Goodhart "real or convincing BS" scorer over produced files), …/workflow (workflow-trace eval + partner export), …/telemetry (Workers-safe telemetry client).

The root export remains available for backward compatibility; new code should prefer the focused subpaths above — /contract first.


Composition with the stack

agent-eval sits at the bottom of the layering: its consumers depend on it, and it depends on none of them.

agent-runtime    Runs agents (chat turns, one-shot tasks, multi-attempt loops), captures every
                 run as a trace, and calls optimizePrompt / runImprovementLoop. Produces the
                 RunRecords + traces agent-eval scores. Depends on agent-eval.

agent-eval       selfImprove, analyzeRuns, runCampaign + drivers (gepaDriver, …), the gates
   (this repo)   (heldOutGate, defaultProductionGate, paretoSignificanceGate), the InsightReport
                 decision packet, the RL bridge, the wire protocol. Depends on neither consumer.

agent-knowledge  proposeKnowledgeWrites / applyKnowledgeWriteBlocks. agent-eval's analyst findings
                 feed it; the knowledge gate consumes them. Depends on agent-eval.

sandbox          AgentProfile, Sandbox.create, streamPrompt. The execution surface the runtime's
                 loops run on; agent-eval scores what comes back.

The rule: agent-eval has zero upward dependencies on a consumer. A concept that makes sense without a running agent loop — a verdict, a run record, a scenario, a judge score — is substrate and lives here; a runtime-shaped one (a sandbox profile, a validation context with an abort signal) lives in agent-runtime. When in doubt, lean substrate.


Concepts + design

The .claude/skills/agent-eval/SKILL.md skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.


Hosted tier

Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:

await selfImprove({
  scenarios, dispatch, judges, baselineSurface,
  hostedTenant: {
    endpoint: 'https://intelligence.tangle.tools',
    apiKey: process.env.TANGLE_API_KEY!,
    tenantId: 'your-tenant',
  },
})

The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data — never sent. Spec at docs/hosted-ingest-spec.md; reference receiver at examples/hosted-ingest-server/.
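The "never sent" boundary can be pictured as an allow-list projection: the outbound event is built field by field from a local result, so raw inputs and outputs cannot leak by accident. The sketch below is purely illustrative. Every type and field name in it (LocalRunResult, OutboundEvent, toOutboundEvent, and their fields) is hypothetical; the actual wire shape is defined in docs/hosted-ingest-spec.md.

```typescript
// Hypothetical shapes: the real event schema lives in docs/hosted-ingest-spec.md.
interface LocalRunResult {
  runId: string
  scenarioId: string
  scenarioInput: string // raw scenario text, stays local
  agentOutput: string // raw agent output, stays local
  score: number
  costUsd: number
}

interface OutboundEvent {
  runId: string
  scenarioId: string
  score: number
  costUsd: number
}

// Build the event from an explicit allow-list instead of spreading the
// local object, so fields added to LocalRunResult later are not sent by default.
function toOutboundEvent(result: LocalRunResult): OutboundEvent {
  return {
    runId: result.runId,
    scenarioId: result.scenarioId,
    score: result.score,
    costUsd: result.costUsd,
  }
}

const event = toOutboundEvent({
  runId: 'run-1',
  scenarioId: 'refund-01',
  scenarioInput: 'customer asks for a refund on order 1234',
  agentOutput: 'refund issued',
  score: 0.9,
  costUsd: 0.004,
})
console.log(JSON.stringify(event))
```

An allow-list is chosen over a deny-list here because it fails closed: anything not named explicitly stays in the process.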


Development

Run an example:

pnpm tsx examples/selfimprove-quickstart/index.ts
pnpm tsx examples/customer-feedback-loop/index.ts
pnpm tsx examples/customer-otel-traces/index.ts

Run the test suite:

pnpm install
pnpm build
pnpm test

Stability + versioning

The /contract surface is the stability contract: its barrel freezes the API — a 0.x minor only adds; nothing there changes shape or disappears. Depend on /contract (and the documented subpaths) rather than the root barrel.

In the deeper subpaths, @stable / @experimental JSDoc markers (visible in IDE hover + .d.ts) call out what may still move — most granularly in /rl (tagged per export) and /campaign (whole barrel @experimental, since /contract re-exports only its settled subset).

| Tag | Meaning |
| --- | --- |
| @stable | API frozen at this major. Breaking changes require a major bump. |
| @experimental | Interface may evolve before becoming @stable. Pin the patch version if you depend on it. |
| @internal | Not part of the public contract. Use the documented subpath instead. |

CHANGELOG.md tracks every release with what's new / additive / breaking.


License

MIT. See LICENSE.
