PFMay 27
Range, Not Precision: Block-Floating-Point Half-Precision FFT and SAR Imaging on Apple SiliconMohamed Amine Bergach
Half precision (FP16) promises to double FFT throughput on GPUs, but the prevailing view is that its 10-bit mantissa makes it unsuitable for radar-grade signal processing. We show this framing is wrong on Apple Silicon: the binding constraint for FFT and Synthetic Aperture Radar (SAR) is not mantissa \emph{precision} but the 5-bit exponent's \emph{dynamic range}. We first measure that an FP16 FFT is mantissa-limited at 56--61~dB signal-to-quantization-noise ratio (SQNR) -- comfortably radar-usable -- yet a naïve FP16 SAR pipeline produces \emph{only} \texttt{NaN}, because the conjugate--FFT--conjugate inverse transform grows magnitudes by a factor of $N$, and the matched-filter product ($\sim\!5\times10^6$ at $N\!=\!4096$) overflows FP16's 65{,}504 ceiling. We resolve this with a fixed-shift \emph{block-floating-point} (BFP) schedule: a single $1/N$ scale applied before each inverse transform bounds every intermediate below 4096. A cascade follows: range-compression output becomes $O(1)$ instead of $O(N)$, which in turn keeps the downstream azimuth-FFT output FP16-loadable instead of overflowing at $O(N^2)$. The result is the first quality-preserving FP16 SAR pipeline: peak/integrated sidelobe ratios, target SNR, and resolution match the FP32 reference to within $0.1$~dB at $42$~dB end-to-end SQNR, while a radix-8 FP16 FFT reaches 306~GFLOPS -- $2.2\times$ over the 139~GFLOPS FP32 baseline -- on a fanless Apple~M1. Finally, we measure that FP8 (E4M3/E5M2) collapses to 14--20~dB SQNR, making FP16 \emph{today's} precision floor for FFT-based radar -- one that future precision-recovery methods may yet lower -- and showing that the lever for low precision here is range management, not mantissa bits.
PFMay 7
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple SiliconMohamed Amine Bergach
KV-cache quantization is framed as a quality--latency trade-off. We show it is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $λ$ $+$ per-group abs-max $+$ int4 nibble pack), exposed as a HuggingFace \texttt{Cache} subclass, runs \emph{faster than fp16} across $256$--$4096$-token prefixes on Gemma-3 1B ($-3$ to $-8\%$ ms/tok) and at short context on Qwen2.5-1.5B ($-0.7$ to $-2.6\%$ through $1$K), with $3\times$ persistent memory compression and quality preserved ($\dPPL = 0.000$ Qwen short-prompt; $+3.6$ hook $\dPPL$ Gemma). The kernel's $\sim\!25$\,ns/vec overhead is below the bandwidth savings from $3\times$ compression. The fused kernel also closes Qwen's 4-bit per-token catastrophe ($\dPPL = +7975 \to +638.6$, $12.5\times$ reduction) at $182$\,GFLOPS / $D{=}128$. Supporting findings: $\SRFT$ and $\SRHT$ are statistically indistinguishable for KV quality (we pick $\SRFT$ for mixed-radix and matrix-multiply alignment); a learned-rotation ablation surfaces a regularization role for the fixed random SRFT base (learning $R+λ$ without SRFT lowers calibration MSE $84.9\%$ vs $50.3\%$ but yields worse PPL); Householder rotations at $k{=}d/2$ reflectors are effectively lossless at $d{=}256$.
DCMar 29
Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory DecompositionMohamed Amine Bergach
We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45~GFLOPS for $N\!=\!4096$ complex single-precision transforms -- a 29\% improvement over Apple's highly optimized vDSP/Accelerate baseline (107~GFLOPS). Our approach is grounded in a \emph{two-tier local memory model} that formally characterizes the Apple GPU's 208~KiB register file as the primary data-resident tier and the 32~KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's \texttt{simdgroup\_matrix} 8$\times$8 hardware MMA for FFT butterfly computation and report the counter-intuitive finding that on Apple GPU, threadgroup memory barriers are inexpensive ($\sim$2 cycles) while scattered threadgroup access patterns are the true bottleneck. Our multi-size implementation supports $N\!=\!256$ through $N\!=\!16384$ using a four-step decomposition for sizes exceeding the 32~KiB threadgroup memory limit. All kernels are validated against vDSP reference outputs.
PFApr 4
From 8 Seconds to 370ms: Kernel-Fused SAR Imaging on Apple Silicon via Single-Dispatch FFT PipelinesMohamed Amine Bergach
We present the first kernel-fused SAR Range Doppler pipeline on any GPU platform. By fusing FFT, matched-filter multiply, and IFFT into a single Metal compute dispatch -- keeping all intermediate data in 32\,KiB on-chip memory -- we process a $4096\!\times\!4096$ complex SAR scene in \textbf{370\,ms} on an Apple M1 GPU, a \textbf{22$\times$} speedup over the multi-dispatch baseline (8.16\,s). We further report the first FFT to exploit Apple's \texttt{simdgroup\_matrix} 8$\times$8 hardware MMA, enabled by an in-place Cooley--Tukey decimation-in-frequency formulation that halves the memory footprint versus Stockham. Radar image quality is preserved: all five point targets show 0.0\,dB SNR deviation from the unfused FP32 reference.
PFApr 1
Dual-Select FMA Butterfly for FFT: Eliminating Twiddle Factor Singularities with Bounded Precomputed RatiosMohamed Amine Bergach
The fused multiply-add (FMA) instruction enables the radix-2 FFT butterfly to be computed in 6~FMA operations -- the proven minimum. The classical factorization by Linzer and Feig~\cite{linzer1993} precomputes the ratio $\cotθ= \cosθ/\sinθ$, which is singular when the twiddle factor is $W^0 = 1$ (i.e., $\sinθ= 0$). Standard practice clamps $\sinθ$ to a small epsilon, degrading numerical precision. We observe that an alternative factorization using $\cosθ$ as the outer multiplier (precomputing $\tanθ$) avoids this particular singularity but introduces a new one at $W^{N/4}$. We then propose a \emph{dual-select} strategy that chooses, per twiddle factor, whichever factorization yields $|\text{ratio}| \leq 1$. This eliminates all singularities, requires no epsilon clamping, and bounds the precomputed ratio to unity for all twiddle factors. For $N = 1024$, the worst-case ratio drops from 163 (Linzer-Feig) to exactly~1.0 (dual-select), yielding a $235\times$ tighter error bound in FP16 arithmetic over 10~FFT passes. The strategy adds zero computational overhead -- only the precomputed twiddle table changes.
PFApr 6
Training Transformers in Cosine Coefficient SpaceMohamed Amine Bergach
We parameterize the weight matrices of a transformer in the two-dimensional discrete cosine transform (DCT) domain, retaining only the lowest-frequency coefficients. At each forward pass the full weight matrix is reconstructed via the inverse DCT; gradients propagate through the reconstruction to update the spectral coefficients directly. On character-level language modeling (Shakespeare, 1M characters), a 4-layer transformer trained from scratch in this representation matches the perplexity of the standard parameterization (6.1 vs.\ 6.1) while storing 52\% of the parameters. At 4$\times$ compression (29\% of parameters), the model reaches perplexity 6.9 -- outperforming a low-rank baseline (perplexity 8.8 at 21\% of parameters) at a comparable reduction. The method requires no architectural changes, no pre-trained checkpoint, and no auxiliary loss. It reduces to replacing each \texttt{nn.Linear} with a drop-in spectral layer that stores $K$ DCT coefficients instead of $n \times m$ weights.
PFApr 5
Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph SearchMohamed Amine Bergach
An $N$-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the \emph{context-free} model, nodes represent computation stages and edge weights are independently measured instruction costs. In the \emph{context-aware} model, nodes are expanded to encode the \emph{predecessor edge type}, so that edge weights capture inter-operation correlations such as cache warming -- the cost of operation~B depends on which operation~A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW \citep{FrigoJohnson1998}: that optimal-substructure assumptions break down ``because of the different states of the cache.'' Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1~GFLOPS (74\% of optimal). The context-aware Dijkstra discovers $\text{R4} \to \text{R2} \to \text{R4} \to \text{R4} \to \text{Fused-8}$ at 29.8~GFLOPS -- a $5.2\times$ improvement over pure radix-2 and 34\% faster than the context-free result. This arrangement includes a radix-2 pass \emph{sandwiched between} radix-4 passes, exploiting cache residuals that only exist in context. No context-free search can discover this.