SSM-AI – Empirical Validation & Mini Benchmarks — Results (tiny tables; reproduce from stamps) (6.4)

Tiny, replayable tables from stamped runs

Purpose. Present compact, stamp-replayable results for each chosen task. Replace the example numbers with your stamped CSV replays; keep tables minimal and comparable across vendors.

Decoding rerank (short answers; beam or candidates)
Dataset: 500 prompts, temperature 0.7, beam=5. Selector A: argmax(prob). Selector B: RSI (or RSI_env) bands.

Metric                                  Baseline       SSM-AI         Δ (B−A)
First-pass correctness (%)              61.8           66.4           +4.6 pts
Over-confident errors in A+/A++ (%)     22.3           8.7            −13.6 pts
Mean retries per prompt                 0.42           0.29           −31% (relative)
Latency p50 / p95 (s)                   1.8 / 3.9      1.7 / 3.7      −0.1 / −0.2
Band histogram (A++/A+/A0/A−/A--, %)    9/28/52/8/3    12/34/47/5/2   n/a
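
The Δ column mixes absolute differences (B − A) with relative changes (percent of the baseline). A minimal sketch of both calculations follows, using the example numbers from the table; the function name `paired_delta` is an illustrative assumption, not part of SSM-AI.

```python
def paired_delta(baseline: float, treatment: float) -> tuple[float, float]:
    """Return (absolute delta B - A, relative delta as a percent of A)."""
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    absolute = treatment - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# Example numbers copied from the table above (replace with stamped replays).
abs_correct, _ = paired_delta(61.8, 66.4)    # first-pass correctness (%)
_, rel_retries = paired_delta(0.42, 0.29)    # mean retries per prompt

print(f"correctness: {abs_correct:+.1f} pts, retries: {rel_retries:+.0f}%")
```

Reporting the absolute difference for percentage metrics and the relative change for counts and costs keeps both readable, provided each row states which one it uses.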

RAG QA (top-k docs + cite integrity)
Selector A: baseline retrieval score. Selector B: RSI pooling of doc alignments (support − penalties).

Metric                          Baseline       SSM-AI         Δ (B−A)
Exact Match / F1 (%)            48.2 / 63.1    50.7 / 65.0    +2.5 / +1.9 pts
Valid citations (%)             71.0           79.6           +8.6 pts
Off-topic responses (%)         11.4           7.2            −4.2 pts
Tokens per solved task (k)      8.9            7.6            −15% (relative)

Tool loop (agent micro-workflow: parse → call → verify)
Policy: act on bands of RSI_env (A++/A+/A0/A−/A--).

Metric                          Baseline    SSM-AI (band policy)    Δ (B−A, relative)
Bad escalations per 1000        14.1        8.9                     −37%
Time-to-first-correct (s)       12.4        10.2                    −18%
Retries per task                0.61        0.44                    −28%
Calls per solved task           2.8         2.3                     −18%
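
A band policy of this kind can be sketched as a mapping from RSI_env to a band label and from the band to an action. This section does not state the band cut points or the actions, so every threshold and action name below is a hypothetical placeholder. Take the real values from your own manifest. Band labels are written in ASCII (`A-`, `A--`).

```python
from collections import Counter

# HYPOTHETICAL cut points and actions, for illustration only.
BAND_CUTS = [(0.9, "A++"), (0.6, "A+"), (-0.6, "A0"), (-0.9, "A-")]
BAND_ORDER = ["A++", "A+", "A0", "A-", "A--"]
ACTIONS = {"A++": "proceed", "A+": "proceed", "A0": "verify",
           "A-": "retry", "A--": "escalate"}


def to_band(rsi_env: float) -> str:
    """Map a bounded score in (-1, 1) to a band label."""
    if not -1.0 < rsi_env < 1.0:
        raise ValueError("RSI_env must satisfy |RSI_env| < 1")
    for lower, label in BAND_CUTS:
        if rsi_env >= lower:
            return label
    return "A--"


def band_histogram(scores):
    """Rounded percentage of scores per band, in A++ .. A-- order."""
    if not scores:
        raise ValueError("no scores to summarise")
    counts = Counter(to_band(s) for s in scores)
    return [round(100 * counts[b] / len(scores)) for b in BAND_ORDER]


# Made-up scores, only to show the call shape.
scores = [0.95, 0.7, 0.0, -0.7]
print([ACTIONS[to_band(s)] for s in scores])
print(band_histogram(scores))
```

Keeping the cut points in one table makes the policy easy to stamp alongside the run and to vary between experiments.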

Reporting notes (must hold).
Collapse parity: phi((m,a)) = m everywhere (classical values unchanged).
Boundedness: |a|<1, |RSI|<1, |RSI_env|<1.
Determinism: replaying from stamps with the same manifest reproduces the tables exactly, up to dtype tolerance.
Paired A/B: identical inputs; only selector/policy differs.
Band transparency: publish band histogram for RSI or RSI_env.
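
The first three notes can be checked mechanically before a table is published. The sketch below is one possible form, not the SSM-AI reference implementation: the function names, the data layout (lists of values and `(m, a)` pairs, tables as dicts) and the default tolerance are all assumptions.

```python
import math


def phi(pair):
    """Collapse map stated in the notes: phi((m, a)) = m."""
    m, _a = pair
    return m


def check_invariants(classical, pairs, scores):
    """Raise ValueError if collapse parity or boundedness is violated.

    classical: values m from the baseline run
    pairs:     (m, a) pairs from the SSM-AI run on the same inputs
    scores:    RSI or RSI_env values
    """
    if len(classical) != len(pairs):
        raise ValueError("paired A/B requires identical inputs")
    for m, pair in zip(classical, pairs):
        if phi(pair) != m:
            raise ValueError(f"collapse parity broken: {pair} vs {m}")
        if not abs(pair[1]) < 1:
            raise ValueError(f"alignment out of bounds: {pair[1]}")
    for s in scores:
        if not abs(s) < 1:
            raise ValueError(f"score out of bounds: {s}")


def tables_match(original, replay, rel_tol=1e-9):
    """Determinism: same metrics, equal within an assumed tolerance."""
    return original.keys() == replay.keys() and all(
        math.isclose(original[k], replay[k], rel_tol=rel_tol)
        for k in original
    )


# Made-up values, only to show the call shape.
check_invariants([0.8, 0.1], [(0.8, 0.35), (0.1, -0.6)], [0.42, -0.17])
print(tables_match({"retries": 0.29}, {"retries": 0.29}))
```

The `not abs(x) < 1` form also rejects NaN, which a plain `abs(x) >= 1` test would let through.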

One-line takeaway. With fixed manifests and stamped logs, the lane raises correctness and reduces waste (retries, tokens, calls) while keeping classical outputs identical via phi((m,a)) = m: small, audit-ready wins that anyone can replay.

