Calculator-fast, model-agnostic checks
Purpose. Define small, reproducible metrics that show quality and efficiency lifts when selecting by bounded alignment (RSI, RSI_env) — while keeping classical outputs intact via phi((m,a)) = m.
Core quality & efficiency (report Δ = SSM-AI − Baseline).
• Retries ↓: mean retries per task.
• Time-to-first-correct ↓: median steps to first correct output (lower is better).
• Over-confidence exposure ↑: fraction of incorrect items landing in A-/A-- (should increase, because risk is surfaced earlier).
• Band distribution: histogram of RSI or RSI_env across A++ … A--.
• Stability (order/shard invariance): |RSI_batch − RSI_stream| within numeric epsilon.
• Correlation: Spearman/Pearson between RSI (or RSI_env) and correctness.
• OPEX proxy: tokens and tool-calls per solved task.
Minimal formulas (copy-paste).
# Per item i, with gold correctness y_i ∈ {0,1}
retry_mean := mean(retries_i)
t_first_correct := median(steps_to_first_correct_i)
# Over-confidence exposure: share of incorrect items that land in the low bands
exposed_risk := count( (y_i == 0) and (band_i in {"A-","A--"}) ) / max( count(y_i==0), 1 )
# Stability across execution modes (expect ~0)
stability_eps := mean( abs(RSI_batch_i - RSI_stream_i) )
# Correlation with correctness
pearson_r := corr(RSI_env_i, y_i)
spearman_r := rank_corr(RSI_env_i, y_i)
# OPEX proxies
tokens_per_solved := sum(tokens_i) / max( count(y_i==1), 1 )
calls_per_solved := sum(tool_calls_i) / max( count(y_i==1), 1 )
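The formulas above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the per-item record layout (keys such as "retries", "rsi_env", "steps_to_first_correct") and the helper names are assumptions made for this example. exposed_risk is computed as the fraction of incorrect items in A-/A--, matching the metric's description in the list of core metrics, and items that never reach a correct output are assumed to carry None for steps_to_first_correct.

```python
from statistics import mean, median

LOW_BANDS = {"A-", "A--"}

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def summarize(items):
    """Compute the per-run metrics from a list of per-item dicts (hypothetical layout)."""
    solved = sum(1 for it in items if it["y"] == 1)
    wrong = [it for it in items if it["y"] == 0]
    first = [it["steps_to_first_correct"] for it in items
             if it["steps_to_first_correct"] is not None]
    rsi = [it["rsi_env"] for it in items]
    y = [it["y"] for it in items]
    return {
        "retry_mean": mean(it["retries"] for it in items),
        "t_first_correct": median(first) if first else None,
        "exposed_risk": sum(it["band"] in LOW_BANDS for it in wrong) / max(len(wrong), 1),
        "stability_eps": mean(abs(it["rsi_batch"] - it["rsi_stream"]) for it in items),
        "pearson_r": pearson(rsi, y),
        "spearman_r": spearman(rsi, y),
        "tokens_per_solved": sum(it["tokens"] for it in items) / max(solved, 1),
        "calls_per_solved": sum(it["tool_calls"] for it in items) / max(solved, 1),
    }
```

Running summarize once on the baseline items and once on the SSM-AI items gives the two columns needed for a Δ comparison.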
Band histogram (crisp visibility).
# Count and share per band on RSI_env (or RSI)
count_App := count(band=="A++")
count_Ap := count(band=="A+")
count_A0 := count(band=="A0")
count_Am := count(band=="A-")
count_Amm := count(band=="A--")
share_X := count_X / total_items
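A short Python sketch of the same count-and-share computation, assuming the band labels arrive as a flat list of strings. The function name band_histogram and the empty-input guard (shares of 0.0 when there are no items) are choices made for this example, not part of the original formulas.

```python
from collections import Counter

# Band order from strongest to weakest alignment, as listed in the text.
BANDS = ["A++", "A+", "A0", "A-", "A--"]

def band_histogram(bands):
    """Return count and share per band; bands with no items report zero."""
    counts = Counter(bands)
    total = len(bands)
    return {
        b: {"count": counts.get(b, 0), "share": counts.get(b, 0) / max(total, 1)}
        for b in BANDS
    }
```

Reporting every band, including the empty ones, keeps histograms from different runs directly comparable.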
QA invariants (must hold for every run).
# Collapse parity (never break classical values)
phi((m,a)) = m
# Boundedness
|a| < 1; |RSI| < 1; |RSI_env| < 1
# Order/shard invariance (within dtype epsilon)
RSI_batch ≈ RSI_stream ≈ RSI_shuffled
# Zero-evidence guard
if W_in == 0 → RSI := 0; band := "A0"; reason := "insufficient_evidence"
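The invariants above can be turned into a per-record check along the following lines. This is a hedged sketch: the record keys ("classical_m", "pair", "rsi_batch", and so on), the function names, and the default tolerance eps=1e-9 are all assumptions for illustration; the text only requires agreement "within dtype epsilon", so the tolerance should be set to suit the dtype actually in use.

```python
def phi(pair):
    """Collapse map: recover the classical value m from a pair (m, a)."""
    m, _a = pair
    return m

def zero_evidence_guard(w_in, rsi, band):
    """Apply the zero-evidence rule; returns (rsi, band, reason)."""
    if w_in == 0:
        return 0.0, "A0", "insufficient_evidence"
    return rsi, band, None

def check_invariants(record, eps=1e-9):
    """Return the names of violated invariants (empty list means all hold)."""
    failures = []
    # Collapse parity: the pair must collapse to the classical output.
    if phi(record["pair"]) != record["classical_m"]:
        failures.append("collapse_parity")
    # Boundedness: strict |x| < 1; a NaN also fails this comparison.
    _m, a = record["pair"]
    for name, value in (("a", a), ("rsi", record["rsi"]), ("rsi_env", record["rsi_env"])):
        if not abs(value) < 1:
            failures.append("unbounded_" + name)
    # Order/shard invariance: all execution modes agree within eps.
    modes = [record["rsi_batch"], record["rsi_stream"], record["rsi_shuffled"]]
    if max(modes) - min(modes) > eps:
        failures.append("order_variance")
    return failures
```

Returning a list of failure names instead of raising on the first violation lets a run log every broken invariant at once.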
Reporting template (one table per task).
Metric | Baseline | SSM-AI (RSI_env) | Δ (SSM-AI − Baseline)
First-pass correctness (%) | ... | ... | ...
Retries per task | ... | ... | ...
Time-to-first-correct (steps) | ... | ... | ...
Over-confidence exposure | ... | ... | ...
Tokens per solved task | ... | ... | ...
Tool calls per solved task | ... | ... | ...
Stability epsilon | ... | ... | ...
Pearson / Spearman | ... | ... | ...
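One way to fill the template is to compute the Δ column mechanically from two metric dictionaries, using the Δ = SSM-AI − Baseline convention stated at the top of this section. The sketch below is illustrative only: the function names and the choice to leave Δ empty when either side is missing are assumptions.

```python
def delta_rows(baseline, ssm_ai):
    """Build (metric, baseline, ssm_ai, delta) rows with delta = SSM-AI - Baseline."""
    rows = []
    for metric, b in baseline.items():
        s = ssm_ai.get(metric)
        # Leave the delta empty when either value is unavailable.
        delta = None if b is None or s is None else s - b
        rows.append((metric, b, s, delta))
    return rows

def render(rows):
    """Render rows as pipe-separated plain text, one metric per line."""
    lines = ["Metric | Baseline | SSM-AI (RSI_env) | Delta"]
    for metric, b, s, d in rows:
        cells = [metric] + ["n/a" if v is None else str(v) for v in (b, s, d)]
        lines.append(" | ".join(cells))
    return "\n".join(lines)
```

For example, render(delta_rows({"retry_mean": 2.0}, {"retry_mean": 1.5})) produces a header line followed by a single row for retry_mean with a negative delta, meaning fewer retries.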
One-line takeaway. Use tiny, stamp-replayable metrics to prove real lifts — quality↑, waste↓, stability↔ — while Shunyaya collapse parity phi((m,a)) = m guarantees classical outputs remain unchanged.