Research Overview
ForgeTrace: A Failure-Aware, Provenance-Native Trajectory Dataset for Software Agents
ForgeTrace proposes a stricter way to evaluate software agents: not just whether they finish with a correct patch, but whether they can operate safely over persistent state, recover from failure, obey policy, coordinate with other agents, and explain what changed with verifiable provenance.
This page summarizes the ForgeTrace technical specification (preprint) without reproducing the full paper inline.
# What ForgeTrace Is
ForgeTrace is a trajectory dataset and benchmark suite for software agents operating through Model Context Protocol toolchains in realistic, deterministic environments. Instead of treating the final patch as the whole product, it treats the manifest-backed trajectory bundle as the benchmark artifact: tool invocations, failures, recoveries, version transitions, audit proofs, policy interactions, and evaluation outputs tied to rights metadata and release controls.
The specification is built around a simple claim: software agents should be evaluated on stateful competence, not only end-state correctness.
# Benchmark Families
ForgeTrace defines six benchmark families that separate clean repair work from recovery, governance, collaboration, and evidence-backed reasoning.
**Repair.** Classic issue-to-patch tasks focused on deterministic software repair. Measures repository exploration, file edits, version-aware repair, and end-state correctness.

**Recover.** Repair-style tasks with deterministic failure injected into the workflow. Measures failure detection, diagnosis, strategy shifts, and minimal-destructive recovery.

**Resume.** Interrupted tasks where the agent must continue from persisted workspace artifacts alone. Measures externalized-state quality, resume fidelity, action continuity, and unnecessary rework.

**Govern.** Policy- and safety-centered tasks where correct behavior matters as much as task completion. Measures policy literacy, least-privilege behavior, secret hygiene, and compliant recovery paths.

**Collaborate.** Shared-state tasks involving multiple agents, locks, handoffs, and concurrent-mutation hazards. Measures lock etiquette, handoff quality, conflict avoidance, and consistent attribution.

**Forensics.** Post-hoc explanation and verification tasks grounded in audit, versions, and evidence. Measures provenance reasoning, audit interpretation, version-diff literacy, and evidence-backed explanation.
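The six-family taxonomy can be written down as a small enum. This is an illustrative sketch, not the specification's schema; the grouping constant at the end is an assumption for readability.

```python
from enum import Enum

class Family(str, Enum):
    """The six ForgeTrace benchmark families described above."""
    REPAIR = "repair"            # clean issue-to-patch repair
    RECOVER = "recover"          # repair with deterministic fault injection
    RESUME = "resume"            # continuation from persisted workspace state
    GOVERN = "govern"            # policy- and safety-centered behavior
    COLLABORATE = "collaborate"  # shared-state, multi-agent coordination
    FORENSICS = "forensics"     # post-hoc explanation and verification

# Assumed grouping: families whose scoring depends on disruption to the
# clean repair path, rather than on end-state correctness alone.
DISRUPTION_FAMILIES = {Family.RECOVER, Family.RESUME}
```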
# Why This Benchmark Is Different
**Provenance-Native Evaluation.** ForgeTrace treats version history, audit verification, manifests, and reconciled traces as first-class benchmark evidence. The goal is not only to know that a result happened, but to verify how it happened.

**Externalized Memory Only.** Resume ability is measured from observable workspace state: files, logs, handoff notes, and structured artifacts. Hidden provider memory is intentionally excluded from the benchmark’s core logic.
**Capability-Scoped Tasks.** Each task declares its tool surface, policy overlay, and fault profile instead of exposing every capability by default. This keeps the benchmark aligned with the specific skill being measured.
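A per-task declaration in that style might look like the sketch below. The field and tool names are assumptions for illustration, not the spec's schema; the key property is deny-by-default outside the declared surface.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class TaskSpec:
    """One benchmark task with an explicitly scoped surface (illustrative schema)."""
    task_id: str
    family: str                            # e.g. "repair", "recover", ...
    tool_surface: FrozenSet[str]           # MCP tools the agent may call
    policy_overlay: Optional[str] = None   # named policy applied on top of defaults
    fault_profile: Optional[str] = None    # deterministic fault plan, if any

    def allows(self, tool: str) -> bool:
        """Capability-scoped check: anything outside the declared surface is denied."""
        return tool in self.tool_surface

task = TaskSpec(
    task_id="repair-0001",
    family="repair",
    tool_surface=frozenset({"fs.read", "fs.write", "fs.search", "vcs.diff"}),
)
```

A harness built this way denies by default: `task.allows("secrets.read")` is false unless the task's surface names that tool.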
**Deterministic-First Scoring.** Where possible, ForgeTrace scores with tests, diffs, version history, policy outcomes, and manifest completeness. LLM judging is reserved for residual dimensions such as explanation quality.
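Deterministic-first scoring can be sketched as a pipeline that exhausts mechanical checks before anything reaches a judge. The check names and record fields below are illustrative; a real harness would derive them from tests, diffs, version history, policy outcomes, and manifest completeness.

```python
def score_run(run: dict) -> dict:
    """Score deterministically where possible; flag only the residue for judging."""
    deterministic = {
        "tests_pass": bool(run["tests_pass"]),
        "diff_clean": bool(run["diff_clean"]),
        "policy_clean": not run["policy_violations"],
        "manifest_complete": bool(run["manifest_complete"]),
    }
    # Only dimensions with no mechanical oracle are left for an LLM judge.
    residual_for_judge = ["explanation_quality"]
    return {"deterministic": deterministic, "judge": residual_for_judge}

result = score_run({
    "tests_pass": True,
    "diff_clean": True,
    "policy_violations": [],
    "manifest_complete": True,
})
```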
# Three-Plane Architecture
The specification separates authoritative state from execution. Undisk MCP and Cloudflare hold benchmark-critical state; E2B provides execution capacity only.
ForgeTrace also keeps a strict split between the agent plane and the harness plane: evaluated agents see MCP tools only; fault injection, forking, trace export, and orchestration stay harness-only.
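The agent/harness split reduces to a visibility rule: evaluated agents reach MCP tool surfaces only, never harness controls. A minimal sketch of that rule, with invented operation names (the split itself is from the spec, the names are not):

```python
# Operations grouped by plane; names are illustrative.
AGENT_PLANE = {"fs.read", "fs.write", "vcs.diff", "locks.acquire"}
HARNESS_PLANE = {"fault.inject", "env.fork", "trace.export", "orchestrate.step"}

def visible_to_agent(op: str) -> bool:
    """Evaluated agents see MCP tools only; harness controls are never exposed."""
    return op in AGENT_PLANE and op not in HARNESS_PLANE
```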
# Capability Profiles
A major part of the v7 specification is explicit task-level capability scoping. The profiles below describe the intended benchmark surface at a high level.
| Profile | Purpose | Typically Allowed |
|---|---|---|
| P1 | Workspace core tasks | File read/write/search, version history, diff, restore, audit access. |
| P2 | Concurrency tasks | P1 plus collaboration surfaces for locks, handoffs, and optimistic coordination. |
| P3 | Governance tasks | P1 plus policy, secret management, and staged upload workflows. |
| P4 | Execution tasks | P1–P3 plus stateless execution surfaces for build, test, and runtime checks. |
| P5 | Forensics tasks | P1 plus deeper provenance and audit access for investigation and explanation. |
| P6 | Release tasks | P5 plus release-oriented outputs tied to canonical storage and manifests. |
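Since the table describes P2-P6 as additive over base profiles, profile resolution can be sketched as a set union over base links. The capability names here are shorthand inventions; only the base-profile relationships come from the table.

```python
# Illustrative capability sets keyed by profile: (base profiles, additions).
PROFILES = {
    "P1": ([], {"fs", "versions", "diff", "restore", "audit"}),
    "P2": (["P1"], {"locks", "handoffs"}),
    "P3": (["P1"], {"policy", "secrets", "staged_upload"}),
    "P4": (["P1", "P2", "P3"], {"exec"}),       # P1-P3 plus execution surfaces
    "P5": (["P1"], {"provenance_deep", "audit_deep"}),
    "P6": (["P5"], {"release", "manifests"}),
}

def resolve(profile: str) -> set:
    """Expand a profile to its full allowed surface by following base links."""
    bases, extras = PROFILES[profile]
    out = set(extras)
    for base in bases:
        out |= resolve(base)
    return out
```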
# Trajectory Bundle, Rights, And Release
**Manifest-Backed Trajectory Bundles.** ForgeTrace’s benchmark product is the reconciled trajectory bundle rather than a patch alone. That bundle joins agent interaction traces, workspace provenance, execution telemetry, harness control traces, and evaluation outputs.
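A manifest-backed bundle implies a completeness check: every component named in the manifest must be present and integrity-verified before the bundle counts as benchmark evidence. A sketch under that assumption, using SHA-256 digests and invented artifact names:

```python
import hashlib

def verify_bundle(manifest: dict, artifacts: dict) -> bool:
    """Return True iff every manifest entry exists and its SHA-256 matches.

    `manifest` maps artifact name -> expected hex digest; `artifacts` holds
    the raw bytes actually present in the bundle.
    """
    for name, expected in manifest.items():
        data = artifacts.get(name)
        if data is None:
            return False  # missing component trace: bundle is incomplete
        if hashlib.sha256(data).hexdigest() != expected:
            return False  # corrupted or tampered component
    return True

trace = b'{"tool": "fs.write", "ok": true}'
manifest = {"agent_trace.json": hashlib.sha256(trace).hexdigest()}
```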
**Rights Before Release.** No task, run, or artifact enters a releasable split without rights metadata, usage restrictions, and a release decision. If rights are unclear, the specification keeps the artifact internal only.
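The rights-before-release rule is effectively a gate that defaults to internal-only. The field names in this sketch (`rights_cleared`, `release_decision`, `usage_restrictions`) are assumptions, not the spec's schema:

```python
def release_split(artifact: dict) -> str:
    """Decide the release split for an artifact; unclear rights stay internal."""
    if not artifact.get("rights_cleared"):
        return "internal"                 # unclear rights are never released
    if artifact.get("release_decision") != "approved":
        return "internal"                 # an explicit release decision is required
    if artifact.get("usage_restrictions"):
        return "gated"                    # releasable, but under controlled access
    return "public"
```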
**ForgeTrace-Bench.** Hosted, hidden evaluation for controlled benchmarking and certification-style use.

**ForgeTrace-Lite.** Public, redacted, open release built around rights-cleared and provenance-linked artifacts.

**ForgeTrace-Train.** Richer, gated training and evaluation corpus for licensed and controlled downstream use.
# Closing View
ForgeTrace is not framed as a benchmark about tool-count breadth. It is framed as a benchmark about whether an agent can work safely and competently over persistent software state, recover when things go wrong, and prove what it did afterward.