SSM-AI – Empirical Validation & Mini Benchmarks — Metrics (calculator-fast; no retraining) (6.2)

Calculator-fast, model-agnostic checks

Purpose. Define small, reproducible metrics that show quality and efficiency lifts when selecting by bounded alignment (RSI, RSI_env) — while keeping classical outputs intact via phi((m,a)) = m.

Core quality & efficiency (report Δ = SSM-AI − Baseline).
Retries ↓ — mean retries per task.
Time-to-first-correct ↓ — median steps to first correct output (lower is better).
Over-confidence exposure ↑ — fraction of incorrect items landing in A-/A-- (should increase; surfacing risk earlier).
Band distribution — histogram of RSI or RSI_env across A++ … A--.
Stability (order/shard invariance) — |RSI_batch − RSI_stream| within numeric epsilon.
Correlation — Spearman/Pearson between RSI (or RSI_env) and correctness.
OPEX proxy — tokens and tool-calls per solved task.

Minimal formulas (copy-paste).

# Per item i, with gold correctness y_i ∈ {0,1}
retry_mean      := mean(retries_i)
t_first_correct := median(steps_to_first_correct_i)

# Over-confidence exposure: share of incorrect items that land in low bands
# (denominator is the incorrect items only, guarded against zero)
exposed_risk := count( (y_i == 0) and (band_i in {"A-","A--"}) ) / max( count(y_i==0), 1 )

# Stability across execution modes (expect ~0)
stability_eps := mean( abs(RSI_batch_i - RSI_stream_i) )

# Correlation with correctness
pearson_r := corr(RSI_env_i, y_i)
spearman_r:= rank_corr(RSI_env_i, y_i)

# OPEX proxies
tokens_per_solved := sum(tokens_i) / max( count(y_i==1), 1 )
calls_per_solved  := sum(tool_calls_i) / max( count(y_i==1), 1 )

Band histogram (crisp visibility).

# Count and share per band on RSI_env (or RSI)
count_App := count(band=="A++")
count_Ap  := count(band=="A+")
count_A0  := count(band=="A0")
count_Am  := count(band=="A-")
count_Amm := count(band=="A--")
share_X   := count_X / total_items
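As a small illustration of the counts and shares above, the histogram can be built with a standard-library Counter. The function name band_histogram is hypothetical, and the sketch assumes band labels have already been assigned upstream; it does not derive bands from RSI values.

```python
from collections import Counter

# Band labels in display order, highest alignment first.
BANDS = ["A++", "A+", "A0", "A-", "A--"]

def band_histogram(bands):
    """Return {band: {"count": n, "share": n / total}} for every known band."""
    unknown = set(bands) - set(BANDS)
    if unknown:
        raise ValueError(f"unknown band labels: {sorted(unknown)}")
    counts = Counter(bands)
    total = len(bands)
    return {
        b: {"count": counts.get(b, 0),
            "share": counts.get(b, 0) / total if total else 0.0}
        for b in BANDS
    }
```

Empty bands are reported with a zero count rather than omitted, so baseline and SSM-AI histograms always have the same five rows and can be compared side by side.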

QA invariants (must hold for every run).

# Collapse parity (never break classical values)
phi((m,a)) = m

# Boundedness
|a| < 1; |RSI| < 1; |RSI_env| < 1

# Order/shard invariance (within dtype epsilon)
RSI_batch ≈ RSI_stream ≈ RSI_shuffled

# Zero-evidence guard
if W_in == 0 → RSI := 0; band := "A0"; reason := "insufficient_evidence"
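The invariants above can be turned into per-run checks along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, W_in is treated as an opaque total input weight, and the aggregate argument stands in for whatever function produces RSI from a list of items (this section does not define it, so the caller must supply it).

```python
import math
import random

def phi(pair):
    """Collapse parity: phi((m, a)) = m, the classical value passes through."""
    m, _a = pair
    return m

def check_bounded(*values):
    """Boundedness: every alignment-like value must satisfy |v| < 1."""
    return all(abs(v) < 1 for v in values)

def guard_zero_evidence(w_in, rsi, band):
    """Zero-evidence guard: with W_in == 0, report a neutral result and a reason."""
    if w_in == 0:
        return 0.0, "A0", "insufficient_evidence"
    return rsi, band, None

def order_invariant(aggregate, items, trials=5, tol=1e-9, seed=0):
    """Check that aggregate(items) is unchanged (within tol) under shuffling.

    aggregate is a caller-supplied stand-in for the RSI computation.
    """
    rng = random.Random(seed)
    reference = aggregate(items)
    for _ in range(trials):
        shuffled = list(items)
        rng.shuffle(shuffled)
        if not math.isclose(aggregate(shuffled), reference, abs_tol=tol):
            return False
    return True
```

A shuffle test only samples a few orderings, so a pass is evidence rather than proof of invariance; the tolerance should be chosen to match the dtype actually used in the pipeline.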

Reporting template (one table per task).

Metric                      Baseline     SSM-AI (RSI_env)   Δ (SSM-AI − Baseline)
First-pass correctness (%)    ...            ...             ...
Retries per task              ...            ...             ...
Time-to-first-correct (steps) ...            ...             ...
Over-confidence exposure      ...            ...             ...
Tokens per solved task        ...            ...             ...
Tool calls per solved task    ...            ...             ...
Stability epsilon             ...            ...             ...
Pearson / Spearman            ...            ...             ...
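One way to fill the template is to compute the Δ column mechanically from two metric dictionaries, using Δ = SSM-AI − Baseline as stated in the metric list. The function names delta_rows and format_report are hypothetical, and the sketch assumes both dictionaries share the same numeric metric keys.

```python
def delta_rows(baseline, ssm_ai):
    """Pair up metrics by name and compute delta = SSM-AI - Baseline."""
    rows = []
    for name, b in baseline.items():
        s = ssm_ai[name]  # raises KeyError if a metric is missing on one side
        rows.append((name, b, s, s - b))
    return rows

def format_report(rows):
    """Render rows as a fixed-width text table, one metric per line."""
    lines = [f"{'Metric':<30}{'Baseline':>12}{'SSM-AI':>12}{'Delta':>12}"]
    for name, b, s, d in rows:
        lines.append(f"{name:<30}{b:>12.3f}{s:>12.3f}{d:>+12.3f}")
    return "\n".join(lines)
```

The signed Delta column makes direction explicit: a negative value is an improvement for retries, tokens, and tool calls, while a positive value is an improvement for first-pass correctness and over-confidence exposure.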

One-line takeaway. Use tiny, stamp-replayable metrics to prove real lifts — quality↑, waste↓, stability↔ — while Shunyaya collapse parity phi((m,a)) = m guarantees classical outputs remain unchanged.

