Ship AI agents with confidence.
Pre-deploy regression testing for multi-agent LLM systems. Replay proposed prompt, model, and agent changes against your recorded production traces: unaffected calls replay byte-identically from the recording, making replay roughly 85% cheaper than re-running live.
The counterfactual gap.
Agent regressions surface in production. Faithful replay answers “what happened.” Naive full re-execution reintroduces sampling noise and burns tokens. Neither tells you whether your proposed prompt change would have broken customer 27 in yesterday's traces.
How it works.
1. Record. Wrap your LangGraph app. ConeReplay captures every LLM call, tool invocation, and cross-agent message, each tagged with a per-turn causal vector clock.
2. Replay with a twist. Propose a change. ConeReplay computes the causal cone of affected events. Out-of-cone events are served byte-identically from the recording. In-cone events re-execute with deterministic, content-hash-derived seeds.
3. Review. Get a divergence report per trace, or aggregated across a corpus, straight into your PR. Ship only changes that do what you intended.
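The cone computation in step 2 can be sketched with standard vector-clock semantics: an event must re-execute only if the changed event happened-before it; everything else replays verbatim. A minimal sketch under that assumption — the names `Event`, `happened_before`, and `causal_cone` are illustrative, not the ConeReplay API:

```python
from dataclasses import dataclass

@dataclass
class Event:
    agent: str
    clock: dict  # agent name -> logical time at this event

def happened_before(a: Event, b: Event) -> bool:
    # Vector-clock order: a -> b iff a's clock is <= b's componentwise
    # and strictly less in at least one component.
    keys = set(a.clock) | set(b.clock)
    le = all(a.clock.get(k, 0) <= b.clock.get(k, 0) for k in keys)
    lt = any(a.clock.get(k, 0) < b.clock.get(k, 0) for k in keys)
    return le and lt

def causal_cone(changed: Event, trace: list[Event]) -> list[Event]:
    # The changed event and everything causally downstream re-executes;
    # concurrent and earlier events are served from the recording.
    return [e for e in trace if e is changed or happened_before(changed, e)]
```

Note that a concurrent event (neither before nor after the change) stays out of the cone, which is what lets an edit to one agent leave an unrelated agent's turns byte-identical.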
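The "deterministic, content-hash-derived seeds" in step 2 can be approximated by hashing each in-cone call's resolved inputs: identical inputs yield the identical seed on every replay, while a changed prompt yields a new but still reproducible one. A sketch of that idea — `replay_seed` and its parameters are hypothetical, not the shipped SDK:

```python
import hashlib

def replay_seed(prompt: str, model: str, turn: int) -> int:
    # Derive a 64-bit seed from the call's content. Unchanged in-cone
    # calls hash to the same seed, so sampling stays stable across
    # replays instead of reintroducing fresh randomness each run.
    digest = hashlib.sha256(f"{model}\n{turn}\n{prompt}".encode()).digest()
    return int.from_bytes(digest[:8], "big")
```

In practice such a seed would be passed to the provider's sampling-seed parameter where one exists; providers without seeded sampling can only be made reproducible best-effort.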
Quickstart.
A rough sketch of the SDK. Final shape locks in when the first design partner signs on.
# Install
pip install conereplay[langgraph]

# Record production traces
from conereplay import Recorder, attach_to_langgraph

recorder = Recorder(store_path="./traces.db")
attach_to_langgraph(my_app, recorder)

# Replay a proposed prompt change against the last 100 traces
$ conereplay run --diff new_prompt.md --traces last-100 --fail-on outcome-change:1%
Request beta access.
We're onboarding design partners running LangGraph multi-agent apps in production. Send a note about your setup and we'll be in touch within a day.
Email us →