SSM-AI – Empirical Validation & Mini Benchmarks — Task Suite & Protocol (6.1)

Tiny, public, replayable A/Bs

Purpose. Show measurable gains in quality and efficiency from adding the alignment lane while leaving classical outputs unchanged: the collapse phi((m,a)) = m must return the classical value m in every run. Keep runs lightweight, limited to public datasets, and replayable from stamps.

Tasks (pick any 2–3 to start).
• Decoding rerank (LLM): compare baseline argmax(prob) vs selector by RSI_env.
• RAG QA (top-k + cite): rank candidates/docs by RSI; keep retrieval m intact; measure answer correctness and cite integrity.
• Tool loop (agent micro-workflow): one or two API calls + parse; decide retry/escalate using bands on RSI_env.

Freeze a tiny manifest (reuse across runs).
Keys to publish once: eps_a, eps_w, c (lens gain), weights_policy (e.g., w := |m|^gamma, gamma = 1), bands, gate_mode ("mul" or "u_scale"), division_policy, lens_id, Unit, dtype.
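
As an illustration, the manifest could be frozen as a small dictionary and fingerprinted once. This is a minimal sketch, not a prescribed format: every value below is a placeholder, and deriving knobs_hash as a SHA-256 over canonical JSON is an assumption (the stamp lists a knobs_hash field, but this section does not say how it is computed).

```python
import hashlib
import json

# Every value here is a placeholder chosen for illustration only.
MANIFEST = {
    "eps_a": 1e-6,
    "eps_w": 1e-12,
    "c": 1.0,                                   # lens gain
    "weights_policy": "w := |m|^gamma, gamma = 1",
    "bands": "declared-once",                   # placeholder; real thresholds go here
    "gate_mode": "mul",                         # or "u_scale"
    "division_policy": "example-policy",        # hypothetical name
    "lens_id": "example-lens-v1",               # hypothetical name
    "Unit": "example-unit",                     # hypothetical name
    "dtype": "float64",
}


def knobs_hash(manifest: dict) -> str:
    """Assumed scheme: SHA-256 of canonical JSON (sorted keys, no whitespace)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()
```

Sorting the keys makes the hash independent of the order in which the manifest was written, so two runs that share the same knobs produce the same fingerprint.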

Per-decision stamp (one ASCII line). Log exactly the fields needed for perfect replay:

ts, run_id, item_id, U, W, RSI, RSI_env, band, g_t,
F,D,L,E,V,Q, lens_id, Unit, c, eps_a, eps_w,
weights_policy, division_policy, combine_policy, dtype,
knobs_hash, file_sha256_in, file_sha256_out
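
One possible way to emit such a stamp is sketched below. The field order follows the list above; the single-line JSON encoding is an assumption, since the section only requires one ASCII line. The helper name format_stamp is hypothetical.

```python
import json

# Field order copied from the stamp list above.
STAMP_FIELDS = [
    "ts", "run_id", "item_id", "U", "W", "RSI", "RSI_env", "band", "g_t",
    "F", "D", "L", "E", "V", "Q", "lens_id", "Unit", "c", "eps_a", "eps_w",
    "weights_policy", "division_policy", "combine_policy", "dtype",
    "knobs_hash", "file_sha256_in", "file_sha256_out",
]


def format_stamp(record: dict) -> str:
    """Serialize one decision as a single ASCII line (assumed encoding: JSON).

    Raises KeyError if any stamp field is missing, so incomplete
    records never reach the log.
    """
    ordered = {name: record[name] for name in STAMP_FIELDS}
    # ensure_ascii=True escapes any non-ASCII character; no newlines are emitted.
    return json.dumps(ordered, ensure_ascii=True, separators=(",", ":"))
```

Python's json module writes floats with their shortest round-trip representation, so parsing the line back yields the same float values that were logged.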

Selector math (fixed across tasks).

# lens → align:
a_in  := tanh(-c*e_in)
a_out := tanh(+c*e_out)

# chooser (two-channel form):
U_in := Σ w*atanh(a_in)
V_out := Σ w*atanh(a_out)
W_in := Σ w
RSI  := tanh( (V_out - U_in) / max(W_in, eps_w) )

# gate (alignment-only):
RSI_env := g_t * RSI            # mode "mul"
# or
RSI_env := tanh(g_t * atanh(RSI))  # mode "u_scale"
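
The selector math above can be sketched in a few lines of Python. Two points are assumptions rather than statements from this section: each evidence item is taken to carry a triple (e_in, e_out, w), and eps_a is used as a clamp margin that keeps alignments strictly inside (-1, +1) before atanh. The zero-evidence rule comes from the QA invariants.

```python
import math


def clamp_align(a: float, eps_a: float) -> float:
    """Keep an alignment inside [-1 + eps_a, 1 - eps_a] (assumed role of eps_a)."""
    lim = 1.0 - eps_a
    return max(-lim, min(lim, a))


def rsi(items, c: float, eps_a: float, eps_w: float) -> float:
    """items: iterable of (e_in, e_out, w) triples (assumed item shape)."""
    u_in = v_out = w_in = 0.0
    for e_in, e_out, w in items:
        a_in = clamp_align(math.tanh(-c * e_in), eps_a)    # lens -> align
        a_out = clamp_align(math.tanh(+c * e_out), eps_a)
        u_in += w * math.atanh(a_in)                       # two-channel chooser
        v_out += w * math.atanh(a_out)
        w_in += w
    if w_in == 0:
        return 0.0                                         # insufficient evidence
    return math.tanh((v_out - u_in) / max(w_in, eps_w))


def gate(rsi_value: float, g_t: float, mode: str, eps_a: float) -> float:
    """Alignment-only gate in either declared mode."""
    if mode == "mul":
        return g_t * rsi_value
    if mode == "u_scale":
        return math.tanh(g_t * math.atanh(clamp_align(rsi_value, eps_a)))
    raise ValueError(f"unknown gate_mode: {mode!r}")
```

With c = 1 and items [(0.2, 0.5, 1.0), (0.1, 0.3, 2.0)], the fused argument is (1.0*0.7 + 2.0*0.4) / 3.0 = 0.5, so RSI = tanh(0.5), up to floating-point rounding.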

Evaluation windows (paired A/B, identical inputs).
• A (baseline): classical selector (e.g., argmax(prob), existing heuristic).
• B (SSM-AI): choose by RSI_env (or advisory bands).
Ensure identical prompts, seeds, slices; only the selector differs.
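
A paired run can be as small as the sketch below, which feeds the same candidate lists to both selectors so that only the selection rule differs. The candidate shape (a dict with "prob" and "rsi_env" keys) and the function names are assumptions made for illustration.

```python
def select_baseline(candidates):
    """Arm A: classical selector, here argmax over probability."""
    return max(candidates, key=lambda cand: cand["prob"])


def select_ssm(candidates):
    """Arm B: choose the candidate with the highest RSI_env."""
    return max(candidates, key=lambda cand: cand["rsi_env"])


def paired_ab(items):
    """Run both selectors on identical inputs.

    items: list of {"item_id": ..., "candidates": [...]} (assumed shape).
    Returns one (item_id, pick_A, pick_B) tuple per item.
    """
    rows = []
    for item in items:
        pick_a = select_baseline(item["candidates"])
        pick_b = select_ssm(item["candidates"])
        rows.append((item["item_id"], pick_a, pick_b))
    return rows
```

Because both arms read the same list object per item, any difference between pick_A and pick_B is attributable to the selector alone.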

Primary metrics (report Δ = B − A).
• Decoding: first-correct ↑, hallucination-rate ↓, retries ↓, p50/p95 latency ↔/↓.
• RAG: answer EM/F1 ↑, cite-integrity ↑, off-topic ↓.
• Tools: successful-completion ↑, bad-escalations ↓, retries ↓, tokens/calls ↓.
Always include a band histogram of RSI_env and the count of actions gated by bands A-/A--.
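
The reporting step might look like the sketch below. The band thresholds are hypothetical placeholders (real cut points come from the frozen manifest), and treating an action as gated whenever its band is A- or A-- is an assumption about how gating is counted.

```python
from collections import Counter

# Hypothetical lower bounds, highest band first; anything below the last is "A--".
BANDS = [("A++", 0.90), ("A+", 0.60), ("A0", -0.60), ("A-", -0.90)]


def to_band(rsi_env: float) -> str:
    """Map an RSI_env value to a band label using the placeholder thresholds."""
    for label, lower in BANDS:
        if rsi_env >= lower:
            return label
    return "A--"


def delta(a_values, b_values) -> float:
    """Delta = mean(B) - mean(A) for one metric measured on paired items."""
    return sum(b_values) / len(b_values) - sum(a_values) / len(a_values)


def band_report(rsi_env_values):
    """Return (band histogram, number of decisions landing in A- or A--)."""
    hist = Counter(to_band(x) for x in rsi_env_values)
    gated = hist["A-"] + hist["A--"]
    return dict(hist), gated
```

For a metric where lower is better (hallucination rate, retries), a negative delta is the improvement, so report the sign convention next to each number.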

QA invariants (must pass).

phi((m,a)) = m                  # collapse parity everywhere
|a| < 1, |RSI| < 1, |RSI_env| < 1
stream == batch == shuffled     # within dtype epsilon
if W_in == 0 ⇒ RSI := 0, band := "A0", reason := insufficient_evidence
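
These invariants lend themselves to an automated check. The sketch below uses a small streaming accumulator so that batch, stream, and shuffled orders can be compared; the class name Fuser, the tolerance of 1e-12 standing in for "dtype epsilon", and the use of eps_a as a clamp margin are all assumptions.

```python
import math
import random


def phi(pair):
    """Collapse: return the classical value m from (m, a)."""
    m, _a = pair
    return m


class Fuser:
    """Streaming accumulator for (e_in, e_out, w) items (assumed item shape)."""

    def __init__(self, c, eps_a, eps_w):
        self.c, self.eps_a, self.eps_w = c, eps_a, eps_w
        self.u_in = self.v_out = self.w_in = 0.0

    def _align(self, x):
        lim = 1.0 - self.eps_a                      # keeps |a| < 1 in floating point
        return max(-lim, min(lim, math.tanh(x)))

    def add(self, e_in, e_out, w):
        self.u_in += w * math.atanh(self._align(-self.c * e_in))
        self.v_out += w * math.atanh(self._align(+self.c * e_out))
        self.w_in += w

    def rsi(self):
        if self.w_in == 0:
            return 0.0                              # band "A0", insufficient_evidence
        return math.tanh((self.v_out - self.u_in) / max(self.w_in, self.eps_w))


def rsi_of(items, **knobs):
    fuser = Fuser(**knobs)
    for item in items:
        fuser.add(*item)
    return fuser.rsi()


def check_invariants(items, tol=1e-12, seed=0, **knobs):
    batch = rsi_of(items, **knobs)
    fuser, stream = Fuser(**knobs), 0.0
    for item in items:                              # read RSI after every item
        fuser.add(*item)
        stream = fuser.rsi()
        assert abs(stream) < 1.0
    shuffled_items = list(items)
    random.Random(seed).shuffle(shuffled_items)
    shuffled = rsi_of(shuffled_items, **knobs)
    assert abs(batch - stream) <= tol and abs(batch - shuffled) <= tol
    assert rsi_of([], **knobs) == 0.0               # W_in == 0 => RSI := 0
    assert phi((3.5, batch)) == 3.5                 # collapse parity
    return True
```

Order invariance holds only up to floating-point rounding because summation order changes the low bits of U_in, V_out, and W_in, which is why the comparison uses a tolerance rather than strict equality.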

Minimal replay protocol (5 lines).

1) Load manifest.
2) Recompute a_in := tanh(-c*e_in), a_out := tanh(+c*e_out).
3) Fuse U_in += w*atanh(a_in), V_out += w*atanh(a_out), W_in += w.
4) RSI := tanh((V_out - U_in)/max(W_in, eps_w)).
5) Gate → RSI_env; band with hysteresis if declared.
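
The five steps map almost one-to-one onto code. The sketch below is an illustration under stated assumptions: the manifest is already parsed into a dict, evidence arrives as (e_in, e_out, w) triples, eps_a acts as a clamp margin, and banding with hysteresis is left out because this section does not define the thresholds. The function names are hypothetical.

```python
import math


def replay(manifest: dict, evidence, g_t: float) -> dict:
    # 1) Load manifest (passed in here as an already-parsed dict).
    c, eps_a, eps_w = manifest["c"], manifest["eps_a"], manifest["eps_w"]
    lim = 1.0 - eps_a
    u_in = v_out = w_in = 0.0
    for e_in, e_out, w in evidence:
        # 2) Recompute alignments from the lens.
        a_in = max(-lim, min(lim, math.tanh(-c * e_in)))
        a_out = max(-lim, min(lim, math.tanh(+c * e_out)))
        # 3) Fuse both channels.
        u_in += w * math.atanh(a_in)
        v_out += w * math.atanh(a_out)
        w_in += w
    # 4) RSI, with the zero-evidence rule from the QA invariants.
    rsi = 0.0 if w_in == 0 else math.tanh((v_out - u_in) / max(w_in, eps_w))
    # 5) Gate -> RSI_env (banding omitted in this sketch).
    if manifest["gate_mode"] == "mul":
        rsi_env = g_t * rsi
    elif manifest["gate_mode"] == "u_scale":
        rsi_env = math.tanh(g_t * math.atanh(max(-lim, min(lim, rsi))))
    else:
        raise ValueError("unknown gate_mode")
    return {"RSI": rsi, "RSI_env": rsi_env}


def matches_stamp(recomputed: dict, stamp: dict) -> bool:
    """Exact comparison of every recomputed field against the logged stamp."""
    return all(recomputed[key] == stamp[key] for key in recomputed)
```

Exact equality is only realistic when the replay processes evidence in the logged order with the same dtype; a replay that reorders items would need a tolerance-based comparison instead.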

Success criteria (greenlight to publish).
At least one task shows clear lift on ≥2 primary metrics, all QA invariants hold, and replay matches stamps exactly.

One-line takeaway. Freeze a tiny manifest, run paired A/B on fixed tasks, stamp ASCII logs with U/W/RSI/RSI_env/band, and publish deltas — small, auditable wins that anyone can replay.
