Research Overview

ForgeTrace: A Failure-Aware, Provenance-Native Trajectory Dataset for Software Agents

ForgeTrace proposes a stricter way to evaluate software agents: not just whether they finish with a correct patch, but whether they can operate safely over persistent state, recover from failure, obey policy, coordinate with other agents, and explain what changed with verifiable provenance.

Version 7.0 · Technical specification (preprint) · Permanent URL: mcp.undisk.app/forgetrace

This page summarizes the ForgeTrace technical specification (preprint) without reproducing the full paper inline.


# What ForgeTrace Is

ForgeTrace is a trajectory dataset and benchmark suite for software agents operating through Model Context Protocol toolchains in realistic, deterministic environments. Instead of treating the final patch as the whole product, it treats the manifest-backed trajectory bundle as the benchmark artifact: tool invocations, failures, recoveries, version transitions, audit proofs, policy interactions, and evaluation outputs tied to rights metadata and release controls.

The specification is built around a simple claim: software agents should be evaluated on stateful competence, not only end-state correctness.
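
The bundle-as-artifact idea can be sketched as a record type. This is an illustrative assumption, not the ForgeTrace schema; every field name here (`tool_invocations`, `audit_proofs`, `rights`, and so on) is a hypothetical placeholder derived from the components listed above.

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryBundle:
    """Hypothetical sketch of a manifest-backed trajectory bundle."""
    task_id: str
    tool_invocations: list = field(default_factory=list)    # ordered MCP calls
    failures: list = field(default_factory=list)            # injected and organic failures
    recoveries: list = field(default_factory=list)          # recovery actions taken
    version_transitions: list = field(default_factory=list)
    audit_proofs: list = field(default_factory=list)
    evaluation_outputs: dict = field(default_factory=dict)
    rights: dict = field(default_factory=dict)              # rights metadata, usage restrictions
    release_decision: str = "internal"                      # default: not released

    def is_releasable(self) -> bool:
        # A bundle leaves internal status only with rights metadata
        # and an explicit release decision.
        return bool(self.rights) and self.release_decision == "released"

bundle = TrajectoryBundle(task_id="repair-0042")
print(bundle.is_releasable())  # False: no rights metadata, no release decision
```

The point of the sketch is that the final patch is just one entry in `evaluation_outputs`; the surrounding trace and rights state are part of the same artifact.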


# Benchmark Families

ForgeTrace defines six benchmark families that separate clean repair work from recovery, governance, collaboration, and evidence-backed reasoning.

Repair

Classic issue-to-patch tasks focused on deterministic software repair.

Measures repository exploration, file edits, version-aware repair, and end-state correctness.

Recover

Repair-style tasks with deterministic failure injected into the workflow.

Measures failure detection, diagnosis, strategy shifts, and minimal-destructive recovery.

Resume

Interrupted tasks where the agent must continue from persisted workspace artifacts alone.

Measures externalized-state quality, resume fidelity, action continuity, and unnecessary rework.

Govern

Policy- and safety-centered tasks where correct behavior matters as much as task completion.

Measures policy literacy, least-privilege behavior, secret hygiene, and compliant recovery paths.

Collaborate

Shared-state tasks involving multiple agents, locks, handoffs, and concurrent mutation hazards.

Measures lock etiquette, handoff quality, conflict avoidance, and consistent attribution.

Forensics

Post-hoc explanation and verification tasks grounded in audit, versions, and evidence.

Measures provenance reasoning, audit interpretation, version-diff literacy, and evidence-backed explanation.
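
The six families and their scored dimensions can be collected into a small registry; the structure below is an illustrative assumption, with dimension names taken verbatim from the descriptions above.

```python
# Illustrative registry of the six ForgeTrace benchmark families and the
# dimensions each one measures. The dict layout is an assumption; the
# family and dimension names come from the overview text.
FAMILIES = {
    "Repair": ["repository exploration", "file edits",
               "version-aware repair", "end-state correctness"],
    "Recover": ["failure detection", "diagnosis",
                "strategy shifts", "minimal-destructive recovery"],
    "Resume": ["externalized-state quality", "resume fidelity",
               "action continuity", "unnecessary rework"],
    "Govern": ["policy literacy", "least-privilege behavior",
               "secret hygiene", "compliant recovery paths"],
    "Collaborate": ["lock etiquette", "handoff quality",
                    "conflict avoidance", "consistent attribution"],
    "Forensics": ["provenance reasoning", "audit interpretation",
                  "version-diff literacy", "evidence-backed explanation"],
}

for family, dims in FAMILIES.items():
    print(f"{family}: {len(dims)} scored dimensions")
```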


# Why This Benchmark Is Different

Provenance-Native Evaluation

ForgeTrace treats version history, audit verification, manifests, and reconciled traces as first-class benchmark evidence.

The goal is not only to know that a result happened, but to verify how it happened.

Externalized Memory Only

Resume capability is measured from observable workspace state: files, logs, handoff notes, and structured artifacts.

Hidden provider memory is intentionally excluded from the benchmark’s core logic.

Capability-Scoped Tasks

Each task declares its tool surface, policy overlay, and fault profile instead of exposing every capability by default.

This keeps the benchmark aligned with the specific skill being measured.

Deterministic-First Scoring

Where possible, ForgeTrace scores with tests, diffs, version history, policy outcomes, and manifest completeness.

LLM judging is reserved for residual dimensions like explanation quality.
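
A minimal sketch of deterministic-first scoring, assuming hypothetical check results: the deterministic dimensions are scored directly, and only a residual dimension such as explanation quality falls back to an LLM judge. All names and the scoring shape are illustrative.

```python
def score_run(tests_passed: bool, diff_minimal: bool,
              policy_clean: bool, manifest_complete: bool,
              llm_judge=None) -> dict:
    """Score deterministically first; judge only residual dimensions."""
    scores = {
        "tests": 1.0 if tests_passed else 0.0,        # test outcomes
        "diff": 1.0 if diff_minimal else 0.0,         # diff / version history checks
        "policy": 1.0 if policy_clean else 0.0,       # policy outcomes
        "manifest": 1.0 if manifest_complete else 0.0,  # manifest completeness
    }
    # LLM judging is reserved for dimensions deterministic checks cannot cover,
    # e.g. explanation quality.
    if llm_judge is not None:
        scores["explanation"] = llm_judge()
    return scores

print(score_run(tests_passed=True, diff_minimal=True,
                policy_clean=False, manifest_complete=True))
```

The design choice this illustrates: LLM judgment never overrides a deterministic signal; it only fills in dimensions that have no deterministic ground truth.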


# Three-Plane Architecture

The specification separates authoritative state from execution. Undisk MCP and Cloudflare hold benchmark-critical state; E2B is execution capacity only.

| Plane | Role |
| --- | --- |
| Undisk MCP | Authoritative workspace surface for immutable file versioning, audit trails, restore semantics, policy controls, secrets, and collaboration state. |
| Cloudflare | Control plane for ingress, routing, authentication, metadata, release handling, queues, D1, Durable Objects, and canonical artifact storage in R2. |
| E2B | Ephemeral execution only: create sandbox, run workload, stream artifacts, write manifests, terminate compute. It is not authoritative storage. |

ForgeTrace also keeps a strict split between the agent plane and the harness plane: evaluated agents see MCP tools only; fault injection, forking, trace export, and orchestration stay harness-only.
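
The agent/harness split can be sketched as a strict partition of tool surfaces. The tool names below are assumptions for illustration, not the ForgeTrace tool list; the invariant is the point.

```python
# Illustrative partition of tool surfaces between the agent plane and the
# harness plane. Tool names are hypothetical placeholders.
AGENT_TOOLS = {"file_read", "file_write", "search",
               "version_history", "diff", "restore"}
HARNESS_TOOLS = {"inject_fault", "fork_workspace",
                 "export_trace", "orchestrate_run"}

def visible_tools(plane: str) -> set:
    """Evaluated agents see MCP tools only; harness controls stay harness-only."""
    if plane == "agent":
        return set(AGENT_TOOLS)
    if plane == "harness":
        return AGENT_TOOLS | HARNESS_TOOLS
    raise ValueError(f"unknown plane: {plane}")

# The strict split: no harness control is ever visible to the evaluated agent.
assert not (visible_tools("agent") & HARNESS_TOOLS)
```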


# Capability Profiles

A major part of the v7 specification is explicit task-level capability scoping. The profiles below describe the intended benchmark surface at a high level.

| Profile | Purpose | Typically Allowed |
| --- | --- | --- |
| P1 | Workspace core tasks | File read/write/search, version history, diff, restore, audit access. |
| P2 | Concurrency tasks | P1 plus collaboration surfaces for locks, handoffs, and optimistic coordination. |
| P3 | Governance tasks | P1 plus policy, secret management, and staged upload workflows. |
| P4 | Execution tasks | P1–P3 plus stateless execution surfaces for build, test, and runtime checks. |
| P5 | Forensics tasks | P1 plus deeper provenance and audit access for investigation and explanation. |
| P6 | Release tasks | P5 plus release-oriented outputs tied to canonical storage and manifests. |
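
The profile composition can be sketched as unions of tool groups, following the "P1 plus ..." pattern above. Group and tool names are illustrative placeholders, not the ForgeTrace capability vocabulary.

```python
# Hypothetical tool groups; names are placeholders for illustration.
GROUPS = {
    "workspace_core": {"read", "write", "search", "history",
                       "diff", "restore", "audit"},
    "collaboration": {"lock", "handoff", "compare_and_set"},
    "governance": {"policy", "secrets", "staged_upload"},
    "execution": {"build", "test", "run"},
    "provenance_deep": {"audit_deep", "trace_query"},
    "release": {"release_output", "canonical_store"},
}

# Profiles compose groups the way the table describes:
# P4 is P1-P3 plus execution, P6 is P5 plus release outputs.
PROFILES = {
    "P1": ["workspace_core"],
    "P2": ["workspace_core", "collaboration"],
    "P3": ["workspace_core", "governance"],
    "P4": ["workspace_core", "collaboration", "governance", "execution"],
    "P5": ["workspace_core", "provenance_deep"],
    "P6": ["workspace_core", "provenance_deep", "release"],
}

def surface(profile: str) -> set:
    """Resolve a profile to the union of its allowed tool groups."""
    return set().union(*(GROUPS[g] for g in PROFILES[profile]))

assert surface("P1") <= surface("P4")  # P4 contains the P1-P3 surfaces
```

Declaring the surface per task, rather than exposing everything, is what keeps a governance task from being solvable by brute-force execution.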

# Trajectory Bundle, Rights, And Release

Manifest-Backed Trajectory Bundles

ForgeTrace’s benchmark product is the reconciled trajectory bundle rather than a patch alone.

That bundle joins agent interaction traces, workspace provenance, execution telemetry, harness control traces, and evaluation outputs.

Rights Before Release

No task, run, or artifact enters a releasable split without rights metadata, usage restrictions, and a release decision.

If rights are unclear, the specification keeps the artifact internal only.
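
A minimal sketch of the rights-before-release gate, under assumed field names: without rights metadata, usage restrictions, and an explicit release decision, the artifact defaults to internal.

```python
def release_split(artifact: dict) -> str:
    """Route an artifact to a release split only when rights are fully cleared.

    Field names (rights_metadata, usage_restrictions, release_decision) and
    split names are illustrative assumptions, not the ForgeTrace schema.
    """
    has_rights = bool(artifact.get("rights_metadata"))
    has_restrictions = "usage_restrictions" in artifact
    decision = artifact.get("release_decision")
    if has_rights and has_restrictions and decision in {"bench", "lite", "train"}:
        return decision
    return "internal"  # rights unclear: keep the artifact internal only

# Rights metadata alone is not enough; the decision is still missing.
print(release_split({"rights_metadata": {"license": "CC-BY-4.0"}}))  # internal
```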

ForgeTrace-Bench

Hosted hidden evaluation for controlled benchmarking and certification-style use.

ForgeTrace-Lite

Public, redacted, open release built around rights-cleared and provenance-linked artifacts.

ForgeTrace-Train

Richer gated training and evaluation corpus for licensed and controlled downstream use.


# Closing View

ForgeTrace is not framed as a benchmark about tool-count breadth. It is framed as a benchmark about whether an agent can work safely and competently over persistent software state, recover when things go wrong, and prove what it did afterward.