Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

Most RAG demos answer “what’s the right chunk?” Very few can answer the two questions a regulator or an auditor will actually ask:

Replay this decision — show me the exact, complete record of how
this answer was produced.

Reconstruct the past — what did your system know at the moment it
answered, not what it knows now?

I got tired of hand-waving at both, so I shipped two pre-registered, deterministic benchmarks alongside JAMES, my local-first, audit-native Graph-RAG. Pre-registered means the metrics, scenarios, and decision rules were locked before the numbers came in: no post-hoc story-fitting.
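One way to make "locked before the numbers came in" checkable by a third party is to hash the benchmark specification and publish the digest before any run. The sketch below is an assumption about how that could work, not a description of how JAMES does it; the fields in `spec` and both function names are hypothetical.

```python
import hashlib
import json


def spec_digest(spec: dict) -> str:
    """Return a SHA-256 digest of a benchmark spec in canonical JSON form."""
    # sort_keys and fixed separators make the serialization independent of dict order.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_preregistration(current_spec: dict, published_digest: str) -> bool:
    """True if the spec used for a run hashes to the digest published earlier."""
    return spec_digest(current_spec) == published_digest


# Hypothetical spec; the real metric and scenario definitions live in the repo.
spec = {
    "benchmark": "RAB",
    "metrics": ["AC", "RF", "PC"],
    "scenarios": ["S1"],
    "decision_rule": "report all three metrics for every system",
}

locked = spec_digest(spec)  # publish this value before running anything
```

Any later change to a metric, scenario, or decision rule produces a different digest, so a reader can tell whether the reported run used the locked specification.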

RAB — Replayable-Audit Benchmark

RAB measures whether your audit trail is good enough to replay a decision, with three deterministic metrics:

Metric                     What it checks                                     EU AI Act
AC (Audit Completeness)    Is every decision-relevant event logged?           Art. 10
RF (Replay Fidelity)       Can you re-derive the answer from the log alone?   Art. 12
PC (Provenance Coverage)   Does every claim trace to a source?                Art. 19

The three metrics map directly to EU AI Act Articles 10, 12, and 19, record-keeping obligations that apply from 2026-08-02 (per Article 113).
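To make the three metrics concrete, here is a minimal sketch that assumes each one is a plain ratio over a set of decisions. The pre-registered definitions in the repo may differ; the `Decision` class, its fields, and the event names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Decision:
    """One answered query, as seen by a hypothetical audit harness."""
    required_events: set            # events the scenario says must be logged
    logged_events: set              # events actually found in the audit log
    original_answer: str
    replayed_answer: Optional[str]  # None if the log was insufficient to replay
    claims: list = field(default_factory=list)
    claim_sources: dict = field(default_factory=dict)  # claim -> source id


def audit_completeness(decisions):
    """AC: share of required events that appear in the log."""
    required = sum(len(d.required_events) for d in decisions)
    logged = sum(len(d.required_events & d.logged_events) for d in decisions)
    return logged / required if required else 1.0


def replay_fidelity(decisions):
    """RF: share of decisions whose replayed answer equals the original."""
    ok = sum(d.replayed_answer == d.original_answer for d in decisions)
    return ok / len(decisions) if decisions else 1.0


def provenance_coverage(decisions):
    """PC: share of claims that map to a recorded source."""
    total = sum(len(d.claims) for d in decisions)
    traced = sum(1 for d in decisions for c in d.claims if c in d.claim_sources)
    return traced / total if total else 1.0


# A decision logged the way default application logging might capture it.
d = Decision(
    required_events={"query", "retrieval", "prompt", "generation"},
    logged_events={"query"},
    original_answer="42",
    replayed_answer=None,
    claims=["c1"],
)
print(audit_completeness([d]), replay_fidelity([d]), provenance_coverage([d]))
# prints: 0.25 0.0 0.0
```

Under this reading, a system can score above zero on AC while scoring zero on RF and PC, because logging some events does not imply the log is sufficient to re-derive the answer or trace its claims.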

Scenario S1 result:

System       AC      RF      PC
JAMES        1.000   1.000   1.000
Baseline-0   0.275   0.000   0.000   (vanilla default logging)


The gap is the whole point. “We have logs” (AC 0.275) is not the same as “we can replay the decision” (RF 0). Default application logging gets you a partial event trail and zero replay or provenance, which is exactly the failure mode an Article 12 audit would surface.

LRB — Lifecycle Retrieval Benchmark

RAG facts go stale. A policy is superseded, a price changes, a spec is revised. LRB asks: when you query as of a point in time, do you retrieve the fact that was valid then, or whatever overwrote it?

Three systems compared:

V (Vanilla): no time handling.

N (Naive-supersede): newest fact wins.

J (JAMES): validity-window retrieval (reconstruct_graph_at).

The R@1 ordering V < N < J holds across the tested scale points (a 12.5× scale span): time-aware retrieval beats both naive overwrite and no time handling at every scale, not just one lucky cell.
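The difference between the three strategies can be shown on a toy fact store. This is a sketch under stated assumptions, not the `reconstruct_graph_at` implementation: the `Fact` class, the half-open validity window, and all three function names are hypothetical, and insertion order stands in for whatever a time-blind similarity ranking would return.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    topic: str                  # what the fact is about
    value: str
    valid_from: date
    valid_to: Optional[date]    # None means "still valid"


def vanilla(facts, topic, as_of):
    """V: ignores time; returns the first stored match."""
    return next((f for f in facts if f.topic == topic), None)


def naive_supersede(facts, topic, as_of):
    """N: newest fact wins, regardless of the query time."""
    matches = [f for f in facts if f.topic == topic]
    return max(matches, key=lambda f: f.valid_from, default=None)


def validity_window(facts, topic, as_of):
    """J-style: return the fact whose [valid_from, valid_to) window contains as_of."""
    for f in facts:
        still_open = f.valid_to is None or as_of < f.valid_to
        if f.topic == topic and f.valid_from <= as_of and still_open:
            return f
    return None


facts = [
    Fact("refund_policy", "30 days", date(2023, 1, 1), date(2024, 6, 1)),
    Fact("refund_policy", "14 days", date(2024, 6, 1), None),
]

past, present = date(2023, 12, 1), date(2025, 1, 1)
for name, fn in [("V", vanilla), ("N", naive_supersede), ("J", validity_window)]:
    print(name, fn(facts, "refund_policy", past).value,
          fn(facts, "refund_policy", present).value)
```

On this toy data, only the validity-window lookup is right for both query times: the time-blind lookup returns the old policy for a present-day query, and newest-wins returns the current policy for a query about the past.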

At publication scale (S3):

System   R@1
V        0.502
N        0.721
J        0.845
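For readers who want to check the scoring, here is a minimal sketch of R@1, assuming it has its usual meaning: the fraction of queries whose top-ranked result is the expected fact. The function name and the input shapes are hypothetical; the runners in the repo may compute it differently.

```python
def recall_at_1(retrieved, expected):
    """Fraction of queries whose top-ranked result equals the expected fact id.

    retrieved: one ranked list of fact ids per query (may be empty)
    expected:  the single correct fact id per query, in the same order
    """
    if len(retrieved) != len(expected):
        raise ValueError("retrieved and expected must have the same length")
    if not expected:
        return 0.0
    # An empty result list counts as a miss.
    hits = sum(1 for r, e in zip(retrieved, expected) if r and r[0] == e)
    return hits / len(expected)


# Three queries: first hit at rank 1, second wrong at rank 1, third returned nothing.
score = recall_at_1([["a", "b"], ["c", "x"], []], ["a", "x", "y"])
```

Note that the second query does not count even though the correct id appears at rank 2, which is what makes R@1 strict about supersession errors.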


How to run it yourself

Everything is local: Ollama (gemma4:e4b by default), BAAI/bge-m3 embeddings, and ChromaDB. No cloud LLM account needed.

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)


Honest framing

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a scenario I designed is a starting line, not proof of general superiority. The value is that the scenarios, metrics, and baselines are public and deterministic, so you can run them, disagree, and beat the numbers.

Feedback I’d value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping hold up under your reading of the text? (b) is “newest wins” the right Naive-supersede baseline for LRB, or is there a stronger one I should add?
