Fixed-point hardware lane: 1/cycle, software-parity, stamp-ready.
Purpose. Map the lane math to a tiny, deterministic hardware substrate without changing semantics. Classical values remain untouched: phi((m,a)) = m. The lane uses the same kernel: a_c := clamp(a, -1+eps_a, +1-eps_a), u := atanh(a_c), streaming fuse U += w*u, W += w, and a_out := tanh( U / max(W, eps_w) ). Optional gate: RSI_env := g * RSI or RSI_env := tanh( g * atanh(RSI) ).
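A minimal float64 sketch of the kernel above, useful as the software reference that a hardware lane is compared against. The names `lane_fuse`, `EPS_A`, and `EPS_W` are illustrative and not part of the specification; the guard values are the manifest constants listed in G7.

```python
import math

EPS_A = 1e-6   # clamp guard (manifest value from G7)
EPS_W = 1e-12  # division guard (manifest value from G7)

def lane_fuse(pairs):
    """Fuse (a, w) pairs: clamp, map to u = atanh(a_c), accumulate, map back."""
    U = 0.0
    W = 0.0
    for a, w in pairs:
        a_c = max(-1.0 + EPS_A, min(1.0 - EPS_A, a))  # a_c := clamp(a, -1+eps_a, +1-eps_a)
        U += w * math.atanh(a_c)                       # U += w*u
        W += w                                         # W += w
    return math.tanh(U / max(W, EPS_W))                # a_out := tanh(U / max(W, eps_w))

a_out = lane_fuse([(math.tanh(0.2), 1.0), (math.tanh(0.4), 1.0)])
print(round(a_out, 6))  # 0.291313
```

Only the alignment value a passes through this function; the classical magnitude m is never an argument, which mirrors phi((m,a)) = m.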
G4) Pipeline and latency (one sample per cycle, typical)
Target: 1 result/cycle after pipe fill; deterministic latency across builds.
• Stage counts (indicative): CLAMP: 1 → ATANH_LUT: 2–3 → W_MUL: 1 → ACC_UW: 1 → DIV_SAT: 6–12 → TANH_LUT: 2–3
Total: ~13–21 cycles latency (sum of the stage minima and maxima); throughput: 1/cycle after fill.
• Shard merge: add-only on (U, W) (no LUTs involved).
• Flush rule: publish only at boundaries:
u_bar := U / max(W, eps_w)
a_out := tanh(u_bar)
• Back-pressure: stall at DIV_SAT only; keep ACC_UW single-cycle.
• Clock targets: 100–300 MHz on mid-range FPGAs; higher on ASICs (lane-local).
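The shard-merge and flush rules can be illustrated with integer accumulators. This is a sketch that assumes Q6.26 values and uniform weights (w = 1); the helper names `to_fx`, `accumulate`, `merge`, and `flush` are hypothetical. Because integer addition is associative and commutative, the merged (U, W) does not depend on how the stream was ordered or split.

```python
import math
import random

FRAC = 26  # fractional bits (Q6.26, assumed for this sketch)

def to_fx(x):
    """Quantize a real value to an integer with FRAC fractional bits."""
    return round(x * (1 << FRAC))

def accumulate(u_values):
    """ACC_UW for one shard with uniform weights w = 1: returns integer (U, W)."""
    return sum(to_fx(u) for u in u_values), len(u_values)

def merge(shards):
    """Shard merge: add-only on (U, W), no LUT involved."""
    return sum(U for U, _ in shards), sum(W for _, W in shards)

def flush(U, W):
    """Publish at a boundary. With an integer W the guard reduces to a W == 0 check."""
    if W == 0:
        return 0.0
    return math.tanh((U / (1 << FRAC)) / W)

random.seed(0)
stream = [random.uniform(-2.0, 2.0) for _ in range(1000)]
whole = accumulate(stream)

shuffled = stream[:]
random.shuffle(shuffled)
parts = merge([accumulate(shuffled[:300]), accumulate(shuffled[300:])])

print(whole == parts)  # True
```

A float accumulator would not give this guarantee bit-for-bit, since floating-point addition is not associative; that is one reason to keep (U, W) in fixed point.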
G5) Gate and bands in hardware
Goal: environment/policy scaling without touching magnitudes.
• Gate “mul” (cheap):
RSI_env := g * RSI
• Gate “u_scale” (bounded even at edges):
RSI_env := tanh( g * atanh(RSI) )
• Bands (defaults, fixed-point domain): thresholds at +0.90, +0.60, -0.60, -0.90.
Hysteresis: use h_up, h_dn offsets to de-chatter labels.
Purity: alignment-only; classical m never enters this datapath (phi((m,a)) = m).
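The two gate modes and the default band thresholds can be sketched in float as follows. The function names are illustrative; the clamp inside `gate_u_scale` reuses the kernel's eps_a guard and is an assumption, and the band labels follow the copy-code summary on this page. Hysteresis (h_up, h_dn) is left out because its exact semantics are not given here.

```python
import math

def gate_mul(rsi, g):
    """Gate "mul": RSI_env := g * RSI."""
    return g * rsi

def gate_u_scale(rsi, g, eps_a=1e-6):
    """Gate "u_scale": RSI_env := tanh(g * atanh(RSI)); clamp is an assumed guard."""
    r = max(-1.0 + eps_a, min(1.0 - eps_a, rsi))
    return math.tanh(g * math.atanh(r))

def band(a):
    """Default bands: thresholds at +0.90, +0.60, -0.60, -0.90."""
    if a >= 0.90:
        return "A++"
    if a >= 0.60:
        return "A+"
    if a > -0.60:
        return "A0"
    if a > -0.90:
        return "A-"
    return "A--"

rsi = math.tanh(0.7)
print(f"{gate_mul(rsi, 0.81):.6f}")       # 0.489538
print(f"{gate_u_scale(0.70, 0.81):.6f}")  # 0.605961
print(band(gate_mul(rsi, 0.81)), band(rsi))  # A0 A+
```

The "u_scale" mode stays inside (-1, +1) for any finite g, while "mul" stays bounded only when |g| <= 1.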
G6) Determinism & parity tests (must pass)
Golden vectors (fixed-point must match float within tolerance):
1) a1 := tanh(0.2), a2 := tanh(0.4), w1=w2=1
Expect a_out := tanh( (0.2+0.4)/2 ) = tanh(0.3) ≈ 0.291313
2) RSI := tanh(0.7) ≈ 0.604368 → U=0.700000, W=1 → replay equals input
3) Lane mul/div (M2):
a_mul := tanh(0.5 + 0.2) ≈ 0.604368
a_div := tanh(0.5 - 0.2) ≈ 0.291313
4) Gate “mul” with g=0.81:
RSI_env := 0.81 * 0.604368 ≈ 0.489538
5) Gate “u_scale” with g=0.81, RSI=0.70:
RSI_env := tanh( 0.81 * atanh(0.70) ) = tanh( 0.81 * 0.867301 ) ≈ 0.605961
6) Order/shard invariance:
permute/shard a stream; final a_out identical when (U,W) merged
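Vectors 2 and 3 can be reproduced in float64 with a few lines. The sketch assumes that lane mul and lane div add and subtract in the u domain, as the expressions tanh(0.5 + 0.2) and tanh(0.5 - 0.2) suggest; `lane_mul` and `lane_div` are illustrative names.

```python
import math

def lane_mul(a1, a2):
    """Lane multiply (M2): add in the u domain."""
    return math.tanh(math.atanh(a1) + math.atanh(a2))

def lane_div(a1, a2):
    """Lane divide (M2): subtract in the u domain."""
    return math.tanh(math.atanh(a1) - math.atanh(a2))

# Vector 2: a single sample with W = 1 replays to itself.
rsi = math.tanh(0.7)
U, W = math.atanh(rsi), 1.0
replay = math.tanh(U / W)
print(f"{U:.6f} {replay:.6f}")  # 0.700000 0.604368

# Vector 3: inputs whose u values are 0.5 and 0.2.
a1, a2 = math.tanh(0.5), math.tanh(0.2)
print(f"{lane_mul(a1, a2):.6f} {lane_div(a1, a2):.6f}")  # 0.604368 0.291313
```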
Tolerance targets:
16-bit path : |delta_RSI| <= 5e-4, band decisions identical
32-bit path : |delta_RSI| <= 5e-7, bit-exact bands and stamps
Acceptance recipe (portable):
• Quantize inputs → run HW → dequantize → compare to float64 reference
• Assert: (a_out_fp within tolerance) AND (band_fp == band_float)
• Assert: stamps identical (or differ only in declared Q-format rounding fields)
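The acceptance recipe can be prototyped in software before any RTL exists. In the sketch below `math.atanh` and `math.tanh` stand in for the LUTs, with their outputs rounded to the lane format, so it models quantization effects only; a real check would drive the hardware or a bit-accurate model. All function names are hypothetical, and integer weights are assumed.

```python
import math

def to_fx(x, frac_bits):
    return round(x * (1 << frac_bits))

def from_fx(q, frac_bits):
    return q / (1 << frac_bits)

def band(a):
    if a >= 0.90: return "A++"
    if a >= 0.60: return "A+"
    if a > -0.60: return "A0"
    if a > -0.90: return "A-"
    return "A--"

def lane_float(pairs, eps_a=1e-6):
    """float64 reference."""
    U = sum(w * math.atanh(max(-1 + eps_a, min(1 - eps_a, a))) for a, w in pairs)
    W = sum(w for _, w in pairs)
    return math.tanh(U / W) if W else 0.0

def lane_fx(pairs, frac_bits, eps_a=1e-6):
    """Quantized model: every stage output is rounded to frac_bits."""
    q = lambda x: from_fx(to_fx(x, frac_bits), frac_bits)
    U = W = 0                                        # integer accumulators
    for a, w in pairs:
        a_c = max(-1 + eps_a, min(1 - eps_a, q(a)))  # quantize input, then clamp
        U += w * to_fx(math.atanh(a_c), frac_bits)   # LUT stand-in, rounded
        W += w
    if W == 0:
        return 0.0
    u_bar = q(from_fx(U, frac_bits) / W)
    return q(math.tanh(u_bar))                       # dequantized a_out

def accept(pairs, frac_bits, tol):
    ref, hw = lane_float(pairs), lane_fx(pairs, frac_bits)
    return abs(hw - ref) <= tol and band(hw) == band(ref)

vec = [(math.tanh(0.2), 1), (math.tanh(0.4), 1)]
print(accept(vec, 12, 5e-4), accept(vec, 26, 5e-7))  # True True
```

Passing one vector says little about worst-case error; the same `accept` call would normally be swept over the full golden set and values near the band thresholds.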
G7) Fixed-point arithmetic details
Multiply (widen-then-round): Qp.q * Qp.q -> Q(2p).(2q), then round/saturate back to lane format.
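Accumulators: keep wider to avoid overflow. For 1e6 items with w=1, W_acc needs ≥ 20 integer bits (2^20 = 1,048,576 > 1e6), and U_acc needs its fractional bits on top of the same growth → choose 32-bit or 40-bit for (U,W) to suit the lane format.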
Division guard:
u_bar := U / max(W, eps_w_fx) # if W=0, emit a_out := 0
Saturation: never wrap; saturate to nearest representable endpoint.
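The multiply, saturation, and accumulator-sizing rules above can be sketched with Python integers standing in for hardware registers. The helper names are hypothetical, and round-half-up is one possible rounding choice rather than a mandated one.

```python
def saturate(q, total_bits):
    """Clamp to the signed range of total_bits; never wrap."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q))

def to_fx(x, frac_bits, total_bits):
    return saturate(round(x * (1 << frac_bits)), total_bits)

def fx_mul(a_q, b_q, frac_bits, total_bits):
    """Widen-then-round multiply: Qp.q * Qp.q -> Q(2p).(2q) -> lane format."""
    wide = a_q * b_q                                        # full-width product
    rounded = (wide + (1 << (frac_bits - 1))) >> frac_bits  # round half up
    return saturate(rounded, total_bits)

# Q4.12 examples (16-bit lane)
p = fx_mul(to_fx(1.5, 12, 16), to_fx(2.0, 12, 16), 12, 16)
print(p / 4096)  # 3.0

s = fx_mul(to_fx(7.0, 12, 16), to_fx(7.0, 12, 16), 12, 16)
print(s)  # 32767 (49 does not fit in [-8, +8), so the result saturates)

# Integer bits needed to count 1e6 items with w = 1
print((10**6).bit_length())  # 20
```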
Recommended Q-formats (lane-only):
u, U/W, w*u:
Q4.12 (16-bit, low cost) : range [-8, +8), step ≈ 2.44e-4
Q6.26 (32-bit, comfort) : range [-32,+32), step ≈ 1.49e-8
weights w:
uniform → w := 1 (int)
strength → w := |m|^gamma, quantize to Q6.10 or Q8.8 per spread
Guard constants (manifest):
eps_a = 1e-6
eps_w = 1e-12
eps_a_fx := to_fx(eps_a)
eps_w_fx := to_fx(eps_w)
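One detail worth checking when implementing to_fx for the guards: a value smaller than half an LSB rounds to zero under plain round-to-nearest, which would disable the guard. The sketch below shows this for the two recommended formats and applies a floor of one LSB; that floor is an assumption for illustration, not a rule stated on this page, and `to_fx_guard` is a hypothetical helper.

```python
def to_fx(x, frac_bits):
    """Plain round-to-nearest quantization."""
    return round(x * (1 << frac_bits))

def to_fx_guard(eps, frac_bits):
    """Quantize a positive guard constant, never below 1 LSB (assumed policy)."""
    return max(1, to_fx(eps, frac_bits))

for name, frac in (("Q4.12", 12), ("Q6.26", 26)):
    print(name,
          to_fx(1e-6, frac), to_fx(1e-12, frac),
          to_fx_guard(1e-6, frac), to_fx_guard(1e-12, frac))
# Q4.12 0 0 1 1
# Q6.26 67 0 67 1
```

Whichever policy is chosen, recording the resulting eps_a_fx and eps_w_fx integers in the manifest keeps software and hardware on the same guard values.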
Monotone LUTs (strict): piecewise LUT + short polynomial with strict monotonicity for both atanh and tanh to preserve ordering.
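A small model helps check the monotonicity requirement before committing a table to hardware. The sketch below builds a tanh table in Q6.26 and uses linear interpolation (a degree-1 polynomial) between entries; the table range [-4, +4], the 1/64 segment width, and all names are illustrative assumptions, and a production LUT would likely use a tighter polynomial and cover the full clamped range.

```python
import math

FRAC = 26                    # Q6.26 fractional bits
SEG_SHIFT = 20               # segment width = 2**20 LSB = 1/64 in real units
LO, HI = -4.0, 4.0           # illustrative table range (assumption)
LO_Q = int(LO * (1 << FRAC))
N_SEG = int((HI - LO) * (1 << FRAC)) >> SEG_SHIFT  # 512 segments

# Table entries: tanh at each segment boundary, rounded to Q6.26.
TABLE = [round(math.tanh(LO + i / 64) * (1 << FRAC)) for i in range(N_SEG + 1)]

def tanh_lut(u_q):
    """Integer-only lookup with linear interpolation; saturates outside the table."""
    off = u_q - LO_Q
    if off <= 0:
        return TABLE[0]
    if off >= (N_SEG << SEG_SHIFT):
        return TABLE[-1]
    i = off >> SEG_SHIFT                  # segment index
    frac = off & ((1 << SEG_SHIFT) - 1)   # position inside the segment
    y0, y1 = TABLE[i], TABLE[i + 1]
    return y0 + (((y1 - y0) * frac) >> SEG_SHIFT)

# Table entries strictly increase, so interpolated outputs never decrease.
print(all(b > a for a, b in zip(TABLE, TABLE[1:])))  # True
sweep = [tanh_lut(q) for q in range(LO_Q - 5000, -LO_Q + 5000, 7919)]
print(all(b >= a for a, b in zip(sweep, sweep[1:])))  # True
```

Because the interpolation only ever adds a non-negative fraction of a positive step, ordering of inputs is preserved at the output, which is the property the band logic relies on.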
Copy-code
# === PIPELINE (indicative stages) ===
# CLAMP(1) -> ATANH_LUT(2-3) -> W_MUL(1) -> ACC_UW(1) -> DIV_SAT(6-12) -> TANH_LUT(2-3)
# Latency ~13–21 cycles, throughput 1/cycle after fill.
# === GATE & BANDS ===
# Gate (mul): RSI_env := g * RSI
# Gate (u_scale): RSI_env := tanh( g * atanh(RSI) )
# Bands: A++: a>=0.90; A+: [0.60,0.90); A0: (-0.60,0.60); A-: (-0.90,-0.60]; A--: a<=-0.90
# Hysteresis: use h_up, h_dn (fixed-point)
# Purity: magnitude m never enters; phi((m,a)) = m
# === GOLDEN VECTORS & TOLERANCE ===
# 1) a_out ≈ tanh(0.3) ≈ 0.291313 # from tanh((0.2+0.4)/2)
# 2) RSI := tanh(0.7) ≈ 0.604368 → U=0.700000; W=1 → replay equals input
# 3) M2: a_mul ≈ 0.604368; a_div ≈ 0.291313
# 4) Gate mul (g=0.81): RSI_env ≈ 0.489538
# 5) Gate u_scale: tanh( 0.81 * atanh(0.70) ) ≈ 0.605961
# 6) Order/shard invariance: identical a_out after merge
# Tolerances: 16-bit |delta_RSI|<=5e-4; 32-bit |delta_RSI|<=5e-7
# === FIXED-POINT DETAILS ===
# Multiply: widen then round (e.g., Q4.12 * Q4.12 -> Q8.24 -> round/saturate back to Q4.12)
# Accumulators: U_acc,W_acc wider (e.g., 32 or 40 bits)
# Division guard: u_bar := U / max(W, eps_w_fx); if W==0 → a_out := 0
# Saturation: no wrap; saturate to endpoints.
# Suggested Q formats:
# u, U/W, w*u:
# Q4.12 (16b) range [-8,+8), step ≈ 2.44e-4
# Q6.26 (32b) range [-32,+32), step ≈ 1.49e-8
# weights w:
# uniform → 1
# strength → |m|^gamma, quantize Q6.10 or Q8.8
# Guards:
# eps_a = 1e-6; eps_w = 1e-12; use eps_a_fx, eps_w_fx in RTL
One-line takeaway. The lane is a tiny streaming MAC with two monotone LUTs (atanh in, tanh out) and an (U,W) accumulator. With a format like Q6.26, you get deterministic, order-invariant, stamp-ready results identical to software semantics, while phi((m,a)) = m keeps all classical numbers pristine.
Stamp example (hardware snapshot). SSMCLOCK1|iso_utc|svc=lane_hw|U=0.700000|W=1.000000|RSI=0.604368|g=0.81|RSI_env=0.489538|band=A0|fx=Q6.26|manifest=knobs_hash
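A stamp with this field layout can be assembled with a short formatter. The sketch infers the field order and the six-decimal formatting from the example above; `make_stamp` is a hypothetical helper, and `iso_utc` and `knobs_hash` are kept as the same placeholders the example uses.

```python
import math

def make_stamp(iso_utc, U, W, rsi, g, rsi_env, band, fx, manifest):
    """Join stamp fields with '|' in the order shown in the example."""
    fields = [
        "SSMCLOCK1", iso_utc, "svc=lane_hw",
        f"U={U:.6f}", f"W={W:.6f}", f"RSI={rsi:.6f}",
        f"g={g:.2f}", f"RSI_env={rsi_env:.6f}",
        f"band={band}", f"fx={fx}", f"manifest={manifest}",
    ]
    return "|".join(fields)

rsi = math.tanh(0.7)
stamp = make_stamp("iso_utc", 0.7, 1.0, rsi, 0.81, 0.81 * rsi,
                   "A0", "Q6.26", "knobs_hash")
print(stamp)
# SSMCLOCK1|iso_utc|svc=lane_hw|U=0.700000|W=1.000000|RSI=0.604368|g=0.81|RSI_env=0.489538|band=A0|fx=Q6.26|manifest=knobs_hash
```

Fixed-width decimal formatting keeps stamps comparable as strings, which is what the "stamps identical" acceptance check needs.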