SSM-AI – Scalability: HW Parity & Performance (7.4, 7.5)

Match semantics across CPU/GPU/ASIC; keep it O(N) with O(1) memory.

7.4 Software–Hardware Parity (fixed-point notes)
Identical semantics across targets. The sequence clamp → atanh → weighted accumulation of u → divide → tanh must match (within dtype tolerance). This guarantees collapse parity everywhere: phi((m,a)) = m, and batch == stream == shuffled.
Range planning (summary). Keep internal u in a symmetric fixed-point range [−Umax, +Umax] such that tanh(Umax) ≈ 0.999; pick Umax for your hardware to avoid saturation (see the short calculation after these notes).
Quantized clamps. Enforce |a| < 1 via quantized eps_a/eps_w constants in firmware/microcode; carry them through the manifest.
Saturating adds (if required). Use saturating addition for U in fixed-point, then invert once with tanh at readout.
Golden vectors. Ship a tiny cross-target pack for float32/float64/fixed-point; run in CI to assert parity.
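
As a rough aid to the range-planning note, a minimal sketch in Python (the 0.999 saturation target and the Q4.12 format below are illustrative assumptions, not mandated values):

import math

# Pick Umax so that tanh(Umax) ~= 0.999 (illustrative saturation target).
Umax = math.atanh(0.999)            # ~= 3.8002
print(f"Umax ~= {Umax:.4f}")

# Hypothetical fixed-point choice: signed Q4.12 spans [-8, +8) with step
# 2**-12 ~= 2.4e-4, leaving headroom above Umax before saturation.
frac_bits = 12
print(f"u step ~= {2**-frac_bits:.2e}")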

# Canonical sequence (must match across targets)
a_c := clamp(a, -1+eps_a, +1-eps_a)
u    := atanh(a_c)
U   += w*u
W   += w
a_out := tanh( U / max(W, eps_w) )

# Fixed-point sketch (implementation detail; same math)
# 1) quantize a_c into Qn.m; 2) LUT/CORDIC for atanh; 3) saturating add on U; 4) single tanh at readout
# Note: semantics unchanged; only representation differs.
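
For concreteness, a minimal float reference of the canonical sequence in Python, usable as a seed for a golden-vector pack (the clamp constants, test values, and tolerance are illustrative, not a fixed API):

import math
import random

EPS_A, EPS_W = 1e-6, 1e-12            # illustrative clamp constants

def collapse(pairs):
    # pairs: iterable of (a, w); clamp -> atanh -> weighted sum -> divide -> tanh
    U = W = 0.0
    for a, w in pairs:
        a_c = min(max(a, -1 + EPS_A), 1 - EPS_A)   # clamp
        U += w * math.atanh(a_c)                   # weighted accumulate in u
        W += w
    return math.tanh(U / max(W, EPS_W))            # single tanh at readout

pairs = [(0.3, 1.0), (-0.7, 2.0), (0.95, 0.5)]
batch = collapse(pairs)
shuffled = collapse(random.sample(pairs, len(pairs)))
assert abs(batch - shuffled) < 1e-12               # order-free within tolerance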

Hardware acceptance quicklist

  • Clamp discipline: quantized clamp yields |a_c| < 1 under all lanes/gates.
  • U/W additivity: shard merges by U := SUM U_k, W := SUM W_k reproduce batch results (see the sketch after this list).
  • Fixed-point span: chosen Umax avoids premature saturation for your worst-case path length.
  • Golden parity: fixed-point outputs match float within tolerance on the vector pack.
  • Determinism: fixed manifest ⇒ identical results across boards/runners (within dtype tolerance).
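
To make the U/W additivity check concrete, a small sketch (Python; shard contents, clamp constants, and tolerance are illustrative), assuming the same collapse math as the canonical sequence:

import math

def partials(pairs, eps_a=1e-6):
    # Per-shard accumulation: returns (U_k, W_k) for a list of (a, w) pairs.
    U = W = 0.0
    for a, w in pairs:
        a_c = min(max(a, -1 + eps_a), 1 - eps_a)
        U += w * math.atanh(a_c)
        W += w
    return U, W

shards = [[(0.2, 1.0), (0.5, 1.0)], [(-0.4, 2.0)], [(0.9, 0.5)]]
U = sum(partials(s)[0] for s in shards)            # U := SUM U_k
W = sum(partials(s)[1] for s in shards)            # W := SUM W_k
merged = math.tanh(U / max(W, 1e-12))

U_b, W_b = partials([p for s in shards for p in s])
batch = math.tanh(U_b / max(W_b, 1e-12))
assert abs(merged - batch) < 1e-12                 # shard merge reproduces batch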

7.5 Performance Considerations (big-O and memory)
Per item: O(1) math (clamp, atanh, add, optional band).
Per stream: O(N) time, O(1) memory (carry only U, W).
Vectorization: Apply clamp/atanh elementwise; reduce via weighted sums; apply the single divide and tanh only once at readout (see the vectorized sketch after the throughput knobs).
Throughput knobs:

  • LUTs/approximations for tanh/atanh on accelerators (validated against golden vectors).
  • Batch atanh calls when |a| arrays are available; cache small-|a| regimes if profiling shows wins.
  • Kahan/pairwise accumulation for U on very long paths; W via standard sum (see the compensated-sum sketch after the micro-optimizations).
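
A minimal vectorized sketch of the reduction in Python/NumPy (array contents and constants are illustrative):

import numpy as np

eps_a, eps_w = 1e-6, 1e-12
a = np.array([0.3, -0.7, 0.95, 0.1])       # per-item a values
w = np.array([1.0, 2.0, 0.5, 1.0])         # per-item weights

a_c = np.clip(a, -1 + eps_a, 1 - eps_a)    # elementwise clamp
U = np.dot(w, np.arctanh(a_c))             # batched atanh + weighted-sum reduce
W = w.sum()
a_out = np.tanh(U / max(W, eps_w))         # single divide + tanh at readout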

# O(1) per item; O(N) per stream
for (a, w) in stream:
  a_c := clamp(a, -1+eps_a, +1-eps_a)
  U  += w*atanh(a_c)
  W  += w
# Finalize once
a_out := tanh( U / max(W, eps_w) )

# Micro-optimizations (profile-driven):
# - vector_atanh(a_c_vec)  # batched transcendental
# - kahan_add(&U, w*atanh(a_c))  # optional numerical guard for long paths
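
Expanding the kahan_add hint above, a minimal compensated-summation sketch in Python (written as a pure function rather than the in-place form hinted at; values and constants are illustrative):

import math

def kahan_add(total, comp, term):
    # Compensated add: returns updated (total, compensation) absorbing rounding error.
    y = term - comp
    t = total + y
    return t, (t - total) - y

U, U_comp, W = 0.0, 0.0, 0.0
for a, w in [(0.3, 1.0), (-0.7, 2.0), (0.95, 0.5)]:
    a_c = min(max(a, -1 + 1e-6), 1 - 1e-6)
    U, U_comp = kahan_add(U, U_comp, w * math.atanh(a_c))
    W += w                                  # W via standard sum
a_out = math.tanh(U / max(W, 1e-12))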

One-line takeaway. Keep semantics identical across software and hardware by adhering to the canonical clamp→atanh→sum→divide→tanh flow, verified with golden vectors; run streams in O(N) time / O(1) memory by carrying just (U,W).


Navigation
Previous: SSM-AI – Scalability & Precision: Long Paths + Dtype Guardrails (7.1-7.3)
Next: SSM-AI – Robustness & Troubleshooting (7.6, 7.7)


Directory of Pages
SSM-AI — Table of Contents