Scope and Framing

This review covers visible chain-of-thought (CoT), scratchpad outputs, and agent trajectories — not hidden activations or internal representations. The question it addresses: given two reasoning traces over the same problem, what is the state of the art for structurally comparing them, localizing divergences, and determining which divergences matter?

As of March 2026, there is no single canonical method analogous to git diff for reasoning traces. The best current systems perform a semantic, dependency-aware diff: they decompose traces into atomic reasoning units, infer which earlier units each step depends on, align the two traces structurally, and then apply a verifier or process reward model (PRM) to judge whether a given divergence is valid, useful, or harmful.

This shift happened because the field now broadly recognizes that answer accuracy alone is insufficient. Correct answers can emerge from incorrect or unfaithful traces, and the evaluation literature still reflects no consensus on a single metric beyond broad criteria: factuality, validity, coherence, and utility (Zhang et al., 2025).

Reference-Based Trace Comparison

When a gold or human-preferred trace is available, the most direct recent proposal is Alignment Score (Gao et al., 2025), which compares a model CoT to a reference trace using semantic-entropy matrices over intermediate steps. The method is notable because it compares reasoning structure semantically rather than through string overlap, but it remains an emerging metric rather than a field default. The more established direction is to convert traces into graphs of semantically coherent steps and compare those graphs rather than raw text.

Step-Level Error Localization Without a Reference

The most mature subproblem is step-level error localization on self-contained domains, particularly mathematics and STEM.

ProcessBench (Zheng et al., 2024) provides 3,400 math problems annotated at the step level. On this benchmark, critic-style judges outperform most existing PRMs. A PRM fine-tuned on PRM800K generalizes better than many synthetic-label PRMs, while o1-mini leads the reported results over QwQ-32B and GPT-4o.

VerifyBench extends verifier evaluation to 3,989 expert-level problems across mathematics, physics, chemistry, and biology. Responses average 4,553 tokens with inter-annotator agreement of 0.88–0.92. Specialized verifiers generally lead in accuracy, while general-purpose LLM judges serve better as high-recall filters.

Premise-Aware and Graph-Based Differencing

For actual trace differencing — comparing the structure of how reasoning unfolds — the strongest recent work is premise-aware and graph-based.

PARC (Wang et al., 2025) transforms a linear chain-of-thought into a premise-linked directed acyclic graph (DAG). Even open models reach approximately 90% recall in premise identification, and verifying steps only against their identified premises improves error identification by 6–16 absolute points over full-prefix verification. The EMNLP survey explicitly highlights this partial-context strategy as both cheaper and sometimes more accurate, because it removes distractors and separates direct errors from accumulated errors.

CoTJudger pushes this further by atomizing free-form CoT, building dependency graphs, extracting a Shortest Effective Path, and measuring redundancy structurally rather than by token count alone. If any current system approximates a genuine “reasoning diff,” this is probably the closest.

Reasoning Graphs as Model Fingerprints

A related line of work treats reasoning graphs as model-level signatures. “Mapping the Minds of LLMs” (Srivastava et al., 2025) clusters verbose traces into semantically coherent steps, assigns logical relations between steps, and finds that richer branching and convergence structure correlates with higher accuracy. This matters for differencing because it suggests the informative comparison is often not “which sentence changed” but “did this model branch, backtrack, merge, or over-linearize at this point?”

Trajectory-Level Reward Signals

On the reward-model side, the frontier is moving from purely local step labels to local + global trajectory supervision.

ReasonFlux-PRM is built specifically for trajectory-response pairs and trains with both step-level and trajectory-level signals. MCNIG (Luo et al., 2026) proposes a cheaper automatic labeling method by estimating how much each step changes the probability of the correct answer, and shows those labels drive best-of-N selection across math, Python, SQL, and scientific QA.

The local/global combination is currently the strongest recipe when the goal is a diff that says not only where two traces diverged, but which divergence mattered for the final outcome.

Causal Attribution and White-Box Analysis

For causal differencing — identifying which divergence changed the computation rather than just the surface text — the frontier is sentence-level attribution and white-box graph analysis.

Thought Anchors (Chen et al., 2026) analyzes traces at the sentence level with counterfactual rollouts, attention-based sentence interaction, and causal attention suppression to find sentences that disproportionately shape later reasoning.

CRV goes deeper and treats attribution graphs as execution traces, arguing that flawed CoT has a distinct computational structure detectable from those graphs.

These methods target “which difference changed the computation” rather than “which text changed,” but both remain closer to the research frontier than the verifier/benchmark stack above.

Open Problems

Long, Self-Correcting Traces

The biggest weak spot is long, self-correcting, open-domain traces. DeltaBench (Gao et al., 2025), built for long o1-style CoTs, reports high redundancy and substantial useless reflection. Its best critique model reaches only 40.8 F1. The survey corroborates this: current evaluators struggle with traces that include backtracking and self-correction, and evaluation for tasks requiring external knowledge remains underdeveloped.

Faithfulness of Visible Traces

Visible traces are not always faithful. The survey notes that logically wrong steps can still be useful and yield the correct answer. A 2026 legibility study found that some of the strongest reasoning models produce some of the least legible traces. A 2026 classifier-sensitivity study showed that reported faithfulness rates swing substantially depending on which judge or classifier is used. Any differencing of visible CoT should therefore be treated as analysis of the reported reasoning, not necessarily the model’s full internal computation.

Agent Trajectories

In agent settings, the same structural pattern holds but with environment-grounded labels.

AgentDiagnose scores competencies including backtracking, task decomposition, observation reading, self-verification, and objective quality. AgentProcessBench (Cao et al., 2025) provides 1,000 trajectories with 8,509 human-labeled steps and evaluates both overall step accuracy and first-error localization. TRACE evaluates whole trajectories with a utility function combining efficiency, evidence grounding, robustness, and scaffolded capability.

In agent evaluation, “diff” is moving away from exact step matching and toward multi-dimensional trajectory diagnostics.

Summary

The strongest 2026 stack for reasoning-trace comparison is:

  1. Premise-aware / graph-structured semantic alignment — decompose into steps, infer dependencies, align structurally
  2. Step-level verification — PRM or critic-based error localization at each reasoning unit
  3. Trajectory-level utility and redundancy metrics — global signals that contextualize local divergences
  4. Causal / white-box attribution (frontier) — identifying which divergences actually changed the downstream computation

Plain text diff remains useful for formatting and debugging, but it is no longer the serious baseline for reasoning-trace comparison.


References

  1. Zhang et al. (2025). “A Survey on the Evaluation of Chain-of-Thought Reasoning.” Findings of EMNLP 2025. PDF
  2. Gao et al. (2025). “Alignment Score: Evaluating Chain-of-Thought via Semantic Alignment.” arXiv. 2511.06168
  3. Zheng et al. (2024). “ProcessBench: Identifying Process Errors in Mathematical Reasoning.” arXiv. 2412.06559
  4. Wang et al. (2025). “PARC: Premise-Aware Chain-of-Thought Reasoning Evaluation.” arXiv. 2502.02362
  5. Srivastava et al. (2025). “Mapping the Minds of Large Language Models.” EMNLP 2025. PDF
  6. Luo et al. (2026). “MCNIG: Monte Carlo N-gram Importance for Step-Level Reward.” arXiv. 2506.18896
  7. Chen et al. (2026). “Thought Anchors: Sentence-Level Attribution in Chain-of-Thought.” arXiv. 2506.19143
  8. Gao et al. (2025). “DeltaBench: Evaluating Long Chain-of-Thought Reasoning.” ACL 2025. PDF
  9. Cao et al. (2025). “AgentProcessBench: Step-Level Evaluation of Agent Trajectories.” EMNLP 2025 Demos. PDF