ConeReplay
U.S. Provisional 64/043,722 · Private beta

Ship AI agents with confidence.

Pre-deploy regression testing for multi-agent LLM systems. Replay proposed prompt, model, and agent changes against your recorded production traces — byte-identical where it doesn't need to change, ~85% cheaper than re-running live.

The counterfactual gap.

Agent regressions surface in production. Faithful replay answers “what happened.” Naive full re-execution reintroduces sampling noise and burns tokens. Neither tells you whether your proposed prompt change would have broken customer 27 in yesterday's traces.

How it works.

  1.

    Record.

    Wrap your LangGraph app. ConeReplay captures every LLM call, tool invocation, and cross-agent message — each tagged with a per-turn causal vector clock.

  2.

    Replay with a twist.

    Propose a change. ConeReplay computes the causal cone of affected events. Out-of-cone events are served byte-identically from the recording. In-cone events re-execute with deterministic, content-hash-derived seeds.

  3.

    Review.

    Get a divergence report per trace — or aggregated across a corpus — straight into your PR. Ship only changes that do what you intended.
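The cone computation in step 2 can be sketched with plain vector-clock comparison. This is an illustrative model, not the ConeReplay API: `Event`, `happened_before`, and `causal_cone` are hypothetical names, and the clocks are simplified to one logical timestamp per agent.

```python
# Hypothetical sketch: deciding which recorded events fall inside the
# causal cone of a changed event. Names are illustrative, not the SDK.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    id: str
    clock: dict = field(default_factory=dict)  # agent name -> logical timestamp

def happened_before(a: Event, b: Event) -> bool:
    """Vector-clock order: a -> b iff a.clock <= b.clock componentwise,
    with strict inequality in at least one component."""
    keys = set(a.clock) | set(b.clock)
    le = all(a.clock.get(k, 0) <= b.clock.get(k, 0) for k in keys)
    lt = any(a.clock.get(k, 0) < b.clock.get(k, 0) for k in keys)
    return le and lt

def causal_cone(changed: Event, trace: list[Event]) -> set[str]:
    """Events causally downstream of the change must re-execute;
    everything else is served byte-identically from the recording."""
    return {e.id for e in trace if e is changed or happened_before(changed, e)}

trace = [
    Event("planner-1", {"planner": 1}),
    Event("coder-1", {"planner": 1, "coder": 1}),  # saw planner-1's output
    Event("critic-1", {"critic": 1}),              # concurrent branch, never saw it
]
print(sorted(causal_cone(trace[0], trace)))  # ['coder-1', 'planner-1']
```

Because `critic-1` is concurrent with the changed `planner-1` event, it stays out-of-cone and replays from the recording untouched. The linear scan above is per-event; the sub-second figure quoted below for a 100-turn trace is consistent with this kind of lightweight clock comparison.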

<1s
Cone computation for a 100-turn trace, single core
~85%
Token reduction vs naive full re-execution
100%
Reproducibility across repeated replays

Quickstart.

A rough sketch of the SDK. Final shape locks in when the first design partner signs on.

# Install (quote the extra so shells like zsh don't glob the brackets)
pip install "conereplay[langgraph]"

# Record production traces
from conereplay import Recorder, attach_to_langgraph

recorder = Recorder(store_path="./traces.db")
attach_to_langgraph(my_app, recorder)

# Replay a proposed prompt change against the last 100 traces
$ conereplay run --diff new_prompt.md --traces last-100 --fail-on outcome-change:1%
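The "deterministic, content-hash-derived seeds" that in-cone events re-execute with can be sketched as a pure function of the trace, the event, and the (possibly edited) prompt. This is a hypothetical illustration of the idea, not the ConeReplay implementation; `derive_seed` and its inputs are assumptions.

```python
# Hypothetical sketch of content-hash-derived seeding for in-cone replay.
# Identical inputs -> identical seed, so repeated replays are reproducible;
# the seed changes only where the content it hashes actually changed.
import hashlib

def derive_seed(trace_id: str, event_id: str, prompt: str) -> int:
    """Derive a 64-bit sampling seed from the event's content."""
    digest = hashlib.sha256(f"{trace_id}|{event_id}|{prompt}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Same trace, same event, same prompt -> same seed on every replay.
s1 = derive_seed("trace-42", "llm-call-7", "You are a helpful planner.")
s2 = derive_seed("trace-42", "llm-call-7", "You are a helpful planner.")
# An edited prompt hashes to a different seed, so the call re-samples.
s3 = derive_seed("trace-42", "llm-call-7", "You are a terse planner.")
print(s1 == s2, s1 == s3)  # True False
```

Passing such a seed to the sampler is what makes the 100% reproducibility claim above mechanically checkable: two replays of the same change hash to the same seeds and therefore the same outputs.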

Request beta access.

We're onboarding design partners running LangGraph multi-agent apps in production. Send a note about your setup and we'll be in touch within a day.

Email us →