ChronoPath Eval Report
Generated: 2026-04-29T11:09:29.277Z
Methodology
10 prompts × 2 generator models = 20 generations. Each generation is scored on 4 dimensions, summed to a total out of 20, by a held-out Claude judge (a different model from the generator):
1. Factual accuracy — every claim traces to a provided source
2. Persona fit — narrative matches the persona's prompting directive
3. Cultural sensitivity — no orientalism, condescension, or flattening of caste/class/religious nuance
4. Source-bias awareness — uses sources critically, not interchangeably
Cross-model judging within the Claude family. Sonnet generations are judged by Haiku; Haiku generations are judged by Sonnet. The judge model is never the same as the generator. This reduces, but does not eliminate, self-preference bias, since both judges share the same model family.
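The pairing rule above can be sketched as a small lookup. This is a minimal illustration, not the harness's actual code: the identifiers `sonnet` and `haiku`, the `JUDGE_FOR` table, and `pickJudge` are all hypothetical names.

```typescript
// Hypothetical model identifiers; the real harness may use different names.
type ModelId = "sonnet" | "haiku";

// Each generator is judged by the other model in the family.
const JUDGE_FOR: Record<ModelId, ModelId> = {
  sonnet: "haiku",
  haiku: "sonnet",
};

// Returns the judge for a generator and guards against self-judging,
// in case the table above is ever edited incorrectly.
function pickJudge(generator: ModelId): ModelId {
  const judge = JUDGE_FOR[generator];
  if (judge === generator) {
    throw new Error("judge must differ from generator: " + generator);
  }
  return judge;
}
```

Keeping the pairing in one table makes it straightforward to add a third, cross-family judge later without touching the call sites.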
Results
| Generator | n | Factual | Persona | Sensitivity | Bias-aware | Total /20 | Revisions triggered |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 10 | 2.50 | 4.50 | 3.40 | 2.90 | 13.30 | 10/10 |
| Claude Haiku 4.5 | 10 | 3.10 | 4.20 | 3.50 | 3.30 | 14.10 | 6/10 |
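As a sanity check on the table, the sketch below assumes the total is the unweighted sum of the four dimension means, which the published figures are consistent with. The `Row` shape and function names are hypothetical; only the numbers come from the table.

```typescript
// Per-dimension means copied from the results table above.
interface Row {
  generator: string;
  n: number;
  factual: number;
  persona: number;
  sensitivity: number;
  biasAware: number;
  total: number;
  revisions: number;
}

const rows: Row[] = [
  { generator: "Claude Sonnet 4.5", n: 10, factual: 2.5, persona: 4.5, sensitivity: 3.4, biasAware: 2.9, total: 13.3, revisions: 10 },
  { generator: "Claude Haiku 4.5", n: 10, factual: 3.1, persona: 4.2, sensitivity: 3.5, biasAware: 3.3, total: 14.1, revisions: 6 },
];

// Sum of the four dimension means, rounded to two decimals to avoid float noise.
function dimensionSum(r: Row): number {
  const sum = r.factual + r.persona + r.sensitivity + r.biasAware;
  return Math.round(sum * 100) / 100;
}

// Share of generations that triggered a revision.
function revisionRate(r: Row): number {
  return r.revisions / r.n;
}
```

Under that assumption both rows reconcile: the dimension means sum to 13.30 and 14.10, and the revision rates are 100% for Sonnet and 60% for Haiku.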
Per-prompt detail
| Prompt | Stop | Persona | Generator | Total |
|---|---|---|---|---|
| p01 | shaniwar-wada | italian-tourist | Claude Sonnet 4.5 | 11/20 |
| p01 | shaniwar-wada | italian-tourist | Claude Haiku 4.5 | 16/20 |
| p02 | shaniwar-wada | schoolkid | Claude Sonnet 4.5 | 10/20 |
| p02 | shaniwar-wada | schoolkid | Claude Haiku 4.5 | 11/20 |
| p03 | shaniwar-wada | historian | Claude Sonnet 4.5 | 17/20 |
| p03 | shaniwar-wada | historian | Claude Haiku 4.5 | 18/20 |
| p04 | lal-mahal | first-timer | Claude Sonnet 4.5 | 13/20 |
| p04 | lal-mahal | first-timer | Claude Haiku 4.5 | 14/20 |
| p05 | lal-mahal | historian | Claude Sonnet 4.5 | 12/20 |
| p05 | lal-mahal | historian | Claude Haiku 4.5 | 18/20 |
| p06 | kasba-ganpati | italian-tourist | Claude Sonnet 4.5 | 12/20 |
| p06 | kasba-ganpati | italian-tourist | Claude Haiku 4.5 | 11/20 |
| p07 | vishrambaug-wada | historian | Claude Sonnet 4.5 | 16/20 |
| p07 | vishrambaug-wada | historian | Claude Haiku 4.5 | 20/20 |
| p08 | vishrambaug-wada | schoolkid | Claude Sonnet 4.5 | 13/20 |
| p08 | vishrambaug-wada | schoolkid | Claude Haiku 4.5 | 9/20 |
| p09 | phule-wada | italian-tourist | Claude Sonnet 4.5 | 14/20 |
| p09 | phule-wada | italian-tourist | Claude Haiku 4.5 | 12/20 |
| p10 | phule-wada | first-timer | Claude Sonnet 4.5 | 15/20 |
| p10 | phule-wada | first-timer | Claude Haiku 4.5 | 12/20 |
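The headline means can be recomputed from the per-prompt totals, and the same data shows where the two generators diverge most. A minimal sketch: the totals are copied from the table above, while the `PromptResult` shape and variable names are hypothetical.

```typescript
// Totals (out of 20) copied from the per-prompt table.
interface PromptResult {
  prompt: string;
  sonnet: number;
  haiku: number;
}

const results: PromptResult[] = [
  { prompt: "p01", sonnet: 11, haiku: 16 },
  { prompt: "p02", sonnet: 10, haiku: 11 },
  { prompt: "p03", sonnet: 17, haiku: 18 },
  { prompt: "p04", sonnet: 13, haiku: 14 },
  { prompt: "p05", sonnet: 12, haiku: 18 },
  { prompt: "p06", sonnet: 12, haiku: 11 },
  { prompt: "p07", sonnet: 16, haiku: 20 },
  { prompt: "p08", sonnet: 13, haiku: 9 },
  { prompt: "p09", sonnet: 14, haiku: 12 },
  { prompt: "p10", sonnet: 15, haiku: 12 },
];

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

const sonnetMean = mean(results.map((r) => r.sonnet)); // 13.3
const haikuMean = mean(results.map((r) => r.haiku)); // 14.1

// Haiku minus Sonnet per prompt, ordered by absolute gap, largest first.
const byDivergence = results
  .map((r) => ({ prompt: r.prompt, gap: r.haiku - r.sonnet }))
  .sort((a, b) => Math.abs(b.gap) - Math.abs(a.gap));

// Number of prompts on which Haiku outscored Sonnet.
const haikuWins = results.filter((r) => r.haiku > r.sonnet).length; // 6
```

The largest gap is p05 (Haiku ahead by 6), followed by p01 (Haiku ahead by 5). Haiku scores higher on 6 of the 10 prompts and Sonnet on the other 4, with no ties.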
Limitations (stated openly)
- LLM-as-judge is useful for relative model ranking, not absolute scoring.
- All judges are Claude-family models. Cross-family validation (GPT, Gemini as third judges) is future work.
- Judges inherit family-level biases — particularly toward verbose, hedged outputs.
- Corpus is small (~14 sources, 5 stops). Absolute scores are not stable across runs; relative ordering is.
- Each output was judged once. Production-grade evals would judge each output multiple times and average to reduce variance.
- The eval set itself was authored alongside the system; an independently authored eval set would be a stronger benchmark.
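The repeated-judging idea from the list above can be sketched as follows. This assumes the harness has already collected several judge scores for the same output; `summarizeRepeats` and its return shape are hypothetical, and the current report does not do this.

```typescript
// Summary of repeated judge scores for a single output.
interface RepeatSummary {
  mean: number;
  stdDev: number; // sample standard deviation; 0 when there is only one score
}

// Averages repeated scores and reports their spread, so a noisy judgment
// shows up as a large stdDev instead of silently shifting the total.
function summarizeRepeats(scores: number[]): RepeatSummary {
  if (scores.length < 1) {
    throw new Error("need at least one score");
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const squaredDiffs = scores.reduce((a, s) => a + (s - mean) * (s - mean), 0);
  const variance = scores.length > 1 ? squaredDiffs / (scores.length - 1) : 0;
  return { mean: mean, stdDev: Math.sqrt(variance) };
}
```

For example, three judgments of 12, 14, and 16 for one output would be reported as a mean of 14 with a standard deviation of 2, which makes it easier to tell a real gap between generators from judge noise.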
How to read this
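The score table is the headline. The per-prompt detail surfaces specific cases where the models diverge sharply: for example, Haiku leads on p05 (18 vs 12) and p01 (16 vs 11), while Sonnet leads on p08 (13 vs 9).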
Reproducing
Run `npm run eval` to regenerate this report.