ChronoPath Eval Report

Generated: 2026-04-29T11:09:29.277Z

Methodology

10 prompts × 2 generator models, scored on 4 dimensions by a held-out Claude judge (a different model from the generator):

1. Factual accuracy — every claim traces to a provided source

2. Persona fit — narrative matches the persona's prompting directive

3. Cultural sensitivity — no orientalism, condescension, or flattening of caste/class/religious nuance

4. Source-bias awareness — uses sources critically, not interchangeably

Judging is cross-model within the Claude family: Sonnet generations are judged by Haiku, and Haiku generations by Sonnet, so the judge is never the same model as the generator. This reduces, but does not eliminate, self-preference bias, since both judges share the same model family.
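The judge-assignment rule above can be sketched as a small lookup. This is an illustrative sketch, not the actual eval harness; the model identifiers are assumed names.

```typescript
// Hypothetical sketch of the cross-model judge assignment described above.
// Model IDs are illustrative stand-ins, not the harness's real identifiers.
type Model = "claude-sonnet-4.5" | "claude-haiku-4.5";

const judgeFor: Record<Model, Model> = {
  "claude-sonnet-4.5": "claude-haiku-4.5", // Sonnet generations judged by Haiku
  "claude-haiku-4.5": "claude-sonnet-4.5", // Haiku generations judged by Sonnet
};

function pickJudge(generator: Model): Model {
  const judge = judgeFor[generator];
  // Invariant from the methodology: judge is never the generator.
  if (judge === generator) throw new Error("judge must differ from generator");
  return judge;
}
```

A fixed pairing like this keeps the assignment deterministic across runs, which matters when comparing relative rankings between reports.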

Results

| Generator | n | Factual | Persona | Sensitivity | Bias-aware | Total /20 | Revisions triggered |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 10 | 2.50 | 4.50 | 3.40 | 2.90 | 13.30 | 10/10 |
| Claude Haiku 4.5 | 10 | 3.10 | 4.20 | 3.50 | 3.30 | 14.10 | 6/10 |

Per-prompt detail

| Prompt | Stop | Persona | Generator | Total |
|---|---|---|---|---|
| p01 | shaniwar-wada | italian-tourist | Claude Sonnet 4.5 | 11/20 |
| p01 | shaniwar-wada | italian-tourist | Claude Haiku 4.5 | 16/20 |
| p02 | shaniwar-wada | schoolkid | Claude Sonnet 4.5 | 10/20 |
| p02 | shaniwar-wada | schoolkid | Claude Haiku 4.5 | 11/20 |
| p03 | shaniwar-wada | historian | Claude Sonnet 4.5 | 17/20 |
| p03 | shaniwar-wada | historian | Claude Haiku 4.5 | 18/20 |
| p04 | lal-mahal | first-timer | Claude Sonnet 4.5 | 13/20 |
| p04 | lal-mahal | first-timer | Claude Haiku 4.5 | 14/20 |
| p05 | lal-mahal | historian | Claude Sonnet 4.5 | 12/20 |
| p05 | lal-mahal | historian | Claude Haiku 4.5 | 18/20 |
| p06 | kasba-ganpati | italian-tourist | Claude Sonnet 4.5 | 12/20 |
| p06 | kasba-ganpati | italian-tourist | Claude Haiku 4.5 | 11/20 |
| p07 | vishrambaug-wada | historian | Claude Sonnet 4.5 | 16/20 |
| p07 | vishrambaug-wada | historian | Claude Haiku 4.5 | 20/20 |
| p08 | vishrambaug-wada | schoolkid | Claude Sonnet 4.5 | 13/20 |
| p08 | vishrambaug-wada | schoolkid | Claude Haiku 4.5 | 9/20 |
| p09 | phule-wada | italian-tourist | Claude Sonnet 4.5 | 14/20 |
| p09 | phule-wada | italian-tourist | Claude Haiku 4.5 | 12/20 |
| p10 | phule-wada | first-timer | Claude Sonnet 4.5 | 15/20 |
| p10 | phule-wada | first-timer | Claude Haiku 4.5 | 12/20 |
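The headline averages in the summary table can be recomputed from the per-prompt totals above. A minimal sketch (totals copied from the table; the function name is illustrative):

```typescript
// Per-prompt totals (out of 20) copied from the per-prompt detail table.
const totals: Record<string, number[]> = {
  "Claude Sonnet 4.5": [11, 10, 17, 13, 12, 12, 16, 13, 14, 15],
  "Claude Haiku 4.5":  [16, 11, 18, 14, 18, 11, 20, 9, 12, 12],
};

// Mean of the per-prompt totals, i.e. the "Total /20" column in the summary.
function meanTotal(scores: number[]): number {
  const sum = scores.reduce((a, b) => a + b, 0);
  return sum / scores.length;
}
// meanTotal(totals["Claude Sonnet 4.5"]) → 13.3
// meanTotal(totals["Claude Haiku 4.5"])  → 14.1
```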

Limitations (stated openly)

  • LLM-as-judge is useful for relative model ranking, not absolute scoring.
  • All judges are Claude-family models. Cross-family validation (GPT, Gemini as third judges) is future work.
  • Judges inherit family-level biases — particularly toward verbose, hedged outputs.
  • Corpus is small (~14 sources, 5 stops). Absolute scores are not stable across runs; relative ordering is.
  • Each prompt judged once. Production-grade evals would judge each output multiple times and average to reduce variance.
  • The eval set itself was authored alongside the system; an independently authored eval set would be a stronger benchmark.
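The single-judgment limitation above has a straightforward fix: judge each output several times and average. A hedged sketch of that pattern, where `judgeOnce` is a hypothetical stand-in for one LLM judge call:

```typescript
// Sketch of "judge each output multiple times and average" from the
// limitations list. judgeOnce is a hypothetical stand-in for one judge call
// that returns a total score out of 20; repeats=5 is an arbitrary choice.
function judgeWithRepeats(judgeOnce: () => number, repeats = 5): number {
  const scores: number[] = [];
  for (let i = 0; i < repeats; i++) {
    scores.push(judgeOnce());
  }
  // Averaging reduces per-call variance roughly by a factor of sqrt(repeats).
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```

Repeated judging multiplies API cost by the repeat count, which is likely why this report judges each output once.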

How to read this

The score table is the headline. The per-prompt detail surfaces specific cases where models diverge sharply.

Reproducing

Run `npm run eval` to regenerate this report.