Match semantics across CPU/GPU/ASIC; keep it O(N) with O(1) memory.
7.4 Software–Hardware Parity (fixed-point notes)
• Identical semantics across targets. The sequence clamp → atanh → add in u → divide → tanh must match (within dtype tolerance). This guarantees collapse parity everywhere: phi((m,a)) = m, and batch == stream == shuffled.
• Range planning (summary). Keep internal u in a symmetric fixed-point range [−Umax, +Umax] such that tanh(Umax) ≈ 0.999; pick Umax for your hardware to avoid saturation.
• Quantized clamps. Enforce |a| < 1 via quantized eps_a/eps_w constants in firmware/microcode; carry them through manifest.
• Saturating adds (if required). Use saturating addition for U in fixed-point, then invert once with tanh at readout.
• Golden vectors. Ship a tiny cross-target pack for float32/float64/fixed-point; run in CI to assert parity.
# Canonical sequence (must match across targets)
a_c := clamp(a, -1+eps_a, +1-eps_a)
u := atanh(a_c)
U += w*u
W += w
a_out := tanh( U / max(W, eps_w) )
# Fixed-point sketch (implementation detail; same math)
# 1) quantize a_c into Qn.m; 2) LUT/cordic for atanh; 3) saturating add on U; 4) single tanh at readout
# Note: semantics unchanged; only representation differs.
Hardware acceptance quicklist
- Clamp discipline: quantized clamp yields
|a_c| < 1under all lanes/gates. - U/W additivity: shard merges by
U := SUM U_k,W := SUM W_kreproduce batch results. - Fixed-point span: chosen
Umaxavoids premature saturation for your worst-case path length. - Golden parity: fixed-point outputs match float within tolerance on the vector pack.
- Determinism: fixed manifest ⇒ identical results across boards/runners (within dtype tolerance).
7.5 Performance Considerations (big-O and memory)
• Per item: O(1) math (clamp, atanh, add, optional band).
• Per stream: O(N) time, O(1) memory (carry only U, W).
• Vectorization: Apply clamp/atanh/tanh elementwise; reduce via weighted sums; fuse in u only once at readout.
• Throughput knobs:
- LUTs/approximations for
tanh/atanhon accelerators (validated against golden vectors). - Batch atanh calls when
|a|arrays are available; cache small-|a| regimes if profiling shows wins. - Kahan/pairwise accumulation for
Uon very long paths;Wvia standard sum.
# O(1) per item; O(N) per stream
for item in stream:
a_c := clamp(a, -1+eps_a, +1-eps_a)
U += w*atanh(a_c)
W += w
# Finalize once
a_out := tanh( U / max(W, eps_w) )
# Micro-optimizations (profile-driven):
# - vector_atanh(a_c_vec) # batched transcendental
# - kahan_add(&U, w*atanh(a_c)) # optional numerical guard for long paths
One-line takeaway. Keep semantics identical across software and hardware by adhering to the canonical clamp→atanh→sum→divide→tanh flow, verified with golden vectors; run streams in O(N) time / O(1) memory by carrying just (U,W).
Navigation
Previous: SSM-AI – Scalability & Precision: Long Paths + Dtype Guardrails (7.1-7.3)
Next: SSM-AI – Robustness & Troubleshooting (7.6, 7.7)
Directory of Pages
SSM-AI — Table of Contents