Aaron R. Flouro

LG
h-index: 3
6 papers
7 citations
Novelty: 55%
AI Score: 51

6 Papers

17.5 · SP · Apr 20
Safety-Certified CRT Sparse FFT: $\Omega(k^2)$ Lower Bound and $O(N \log N)$ Worst-Case

Aaron R. Flouro, Shawn P. Chadwick

Computing Fourier transforms of k-sparse signals, where only k of N frequencies are non-zero, is fundamental in compressed sensing, radar, and medical imaging. While the Fast Fourier Transform (FFT) evaluates all N frequencies in $O(N \log N)$ time, sufficiently sparse signals should admit sub-linear complexity in N. Existing sparse FFT algorithms using Chinese Remainder Theorem (CRT) reconstruction rely on moduli selection choices whose worst-case implications have not been fully characterized. This paper makes two contributions. First, we establish an $\Omega(k^2)$ adversarial lower bound on candidate growth for CRT-based sparse FFT when moduli are not pairwise coprime (specifically when $m_3 \mid m_1 m_2$), implying an $O(k^2 N)$ worst-case validation cost that can exceed dense FFT time. This vulnerability is practically relevant, since moduli must often divide N to avoid spectral leakage, in which case non-pairwise-coprime configurations can be unavoidable. Pairwise coprime moduli avoid the proven attack; whether analogous constructions exist for such moduli remains an open question. Second, we present a robustness framework that wraps a 3-view CRT sparse front end with lightweight certificates (bucket occupancy, candidate count) and an adaptive dense FFT fallback. For signals passing the certificates, the sparse path achieves $O(\sqrt{N} \log N + k N)$ complexity; when certificates detect collision risk, the algorithm reverts to $O(N \log N)$ dense FFT, guaranteeing worst-case performance matching the classical bound.
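
The candidate growth described in this abstract can be illustrated with a toy construction. The sketch below is not taken from the paper: the function names, the moduli (16, 15, 12), and the frequency set are all assumptions chosen for illustration. It pairs residues from two coprime views via CRT and filters the candidates with a third view; when $m_3 \mid m_1 m_2$ and the frequencies are chosen adversarially, the third view rejects nothing and all $k^2$ candidates survive.

```python
from itertools import product

def crt_pair(r1, m1, r2, m2):
    """Unique x in [0, m1*m2) with x = r1 (mod m1) and x = r2 (mod m2).
    Assumes m1 and m2 are coprime."""
    inv = pow(m1, -1, m2)  # modular inverse of m1 mod m2 (Python 3.8+)
    return r1 + m1 * (((r2 - r1) * inv) % m2)

def surviving_candidates(freqs, m1, m2, m3):
    """Hypothetical 3-view candidate formation: combine every residue pair
    from views 1 and 2, then keep candidates consistent with view 3."""
    res1 = {f % m1 for f in freqs}
    res2 = {f % m2 for f in freqs}
    res3 = {f % m3 for f in freqs}
    candidates = {crt_pair(r1, m1, r2, m2) for r1, r2 in product(res1, res2)}
    return {c for c in candidates if c % m3 in res3}

# Illustrative adversarial input: k = 4 frequencies, all multiples of 12.
# Here m3 = 12 divides m1 * m2 = 240, so view 3 carries no new information.
freqs = [0, 12, 24, 36]
bad = surviving_candidates(freqs, 16, 15, 12)   # all k^2 pairs survive
good = surviving_candidates(freqs, 16, 15, 7)   # m3 coprime to m1 and m2
```

In this toy case every candidate is a multiple of 12, so the check against view 3 passes for all 16 residue pairs. Swapping in a third modulus coprime to the first two prunes some of the false candidates while keeping every true frequency.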

LG · Jan 14
Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation

Aaron R. Flouro, Shawn P. Chadwick

Building on the probability-domain distillation framework of Sparse-KD, we develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators, encompassing convexity, positivity, continuity, weight monotonicity, and temperature coherence. We prove the existence and non-uniqueness of operator families satisfying these axioms, establishing that multiple distinct aggregation mechanisms conform to the same foundational principles. Within this framework, we establish operator-agnostic guarantees showing that multi-teacher aggregation reduces both stochastic variance and systematic supervisory bias under heterogeneous teachers, while providing Jensen-type bounds, log-loss guarantees, and safety attenuation properties. For aggregation operators linear in teacher weights, we further establish classical ensemble variance-reduction results under standard independence assumptions, with extensions to correlated-error regimes. The framework provides theoretical grounding for multi-teacher distillation from diverse frontier models while admitting multiple valid implementation strategies.
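
The non-uniqueness claim can be made concrete with two familiar pooling rules. This is a minimal sketch under stated assumptions: the abstract prescribes no specific operator, so `arithmetic_pool`, `geometric_pool`, and `log_loss` are hypothetical names, and the two operators are standard textbook choices rather than the paper's. Both map strictly positive teacher distributions and convex weights to a valid, strictly positive distribution, yet they generally disagree. For the arithmetic mixture, Jensen's inequality bounds its log-loss by the weighted average of the teachers' log-losses.

```python
import math

def arithmetic_pool(teachers, weights):
    """Convex combination of teacher distributions (linear in the weights)."""
    n = len(teachers[0])
    return [sum(w * p[j] for w, p in zip(weights, teachers)) for j in range(n)]

def geometric_pool(teachers, weights):
    """Normalized weighted geometric mean (log-linear pooling).
    Assumes every teacher probability is strictly positive."""
    n = len(teachers[0])
    unnorm = [
        math.exp(sum(w * math.log(p[j]) for w, p in zip(weights, teachers)))
        for j in range(n)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def log_loss(p, label):
    """Negative log-probability assigned to the true label index."""
    return -math.log(p[label])

# Illustrative teachers over three classes, equally weighted.
teachers = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
weights = [0.5, 0.5]
mix_a = arithmetic_pool(teachers, weights)
mix_g = geometric_pool(teachers, weights)
```

The arithmetic pool is the linear-in-weights case for which the abstract mentions classical variance-reduction results; the geometric pool is shown only as a second operator that is also convex-weighted and positivity-preserving.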

LG · Jan 30
Post-Training Probability Manifold Correction via Structured SVD Pruning and Self-Referential Distillation

Aaron R. Flouro, Shawn P. Chadwick

Large language models are expensive to deploy. We introduce Sparse Knowledge Distillation (SparseKD), a post-training method that compresses transformer models by combining structured SVD pruning with self-referential knowledge distillation. The key insight is simple: instead of using an external teacher, the model teaches itself by matching its own probability distribution from before compression. This self-referential setup enables surprisingly strong quality recovery after aggressive pruning. Our experiments reveal an unexpected finding: self-referential distillation alone, applied post-training under an identical objective and fixed calibration dataset, improves model quality by 39% relative to the original converged checkpoint. When combined with structured pruning, SparseKD achieves 15-65% parameter reduction with acceptable quality trade-offs. Kernel profiling shows that speedups arise entirely from reduced dense matrix multiplication in feed-forward layers while attention remains unchanged, making this approach complementary to attention optimizations. We validate across two model families (0.6B and 3.8B parameters) with multi-seed experiments confirming high reproducibility. SparseKD requires no external super-teacher, no architectural changes, and no custom inference kernels, making it immediately deployable with existing infrastructure.
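
The two ingredients named in this abstract can be sketched numerically. The abstract does not spell out the factorization or the loss, so the following is an assumption-laden illustration: it supposes a weight matrix is truncated to rank `r` and stored as two factors, and that the self-referential objective is a KL divergence from the pre-compression distribution to the post-compression one. The function names and matrix sizes are hypothetical.

```python
import math

def low_rank_param_reduction(m, n, r):
    """Fraction of parameters removed when an m x n matrix is replaced by
    rank-r factors of shapes (m x r) and (r x n). Assumed storage scheme."""
    return 1.0 - (r * (m + n)) / (m * n)

def self_distillation_kl(p_before, p_after):
    """KL(p_before || p_after): the frozen pre-compression model acts as
    the teacher for its own compressed version. Assumed loss form."""
    return sum(
        pb * math.log(pb / pa)
        for pb, pa in zip(p_before, p_after)
        if pb > 0
    )

# Illustrative feed-forward projection of size 1024 x 4096 truncated to rank 384.
reduction = low_rank_param_reduction(1024, 4096, 384)  # 0.53125

# Illustrative next-token distributions before and after compression.
p_before = [0.6, 0.3, 0.1]
p_after = [0.5, 0.35, 0.15]
loss = self_distillation_kl(p_before, p_after)
```

Under this storage assumption the factorization only saves parameters when `r < m * n / (m + n)`, which is why the achievable reduction depends on how aggressively the rank is cut.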

74.4 · SP · May 5
Deterministic Sparse FFT via Keyed Multi-View Gating with $O(\sqrt{N} \log k)$ Expected Time

Aaron R. Flouro, Shawn P. Chadwick

We introduce a deterministic sparse Fourier transform framework based on a keyed multi-view gating mechanism that leverages 2-of-3 Chinese Remainder Theorem (CRT) agreement to reduce candidate frequency pairs from $O(k^2)$ to $\Theta(k)$ under sparse-regime assumptions. Unlike prior approaches that rely on randomized bucketization for candidate formation, the proposed method provides deterministic structure with probabilistic guarantees arising only from assumptions on frequency placement and independence of affine hashing across views. The algorithm is realized through a peeling-based recovery procedure that extracts frequencies directly from singleton bins without explicit pair enumeration. A recursive self-reduction eliminates the $O(\sqrt{N} \log N)$ preprocessing floor, yielding $O(\sqrt{N} \log k)$ expected identification time while maintaining an $O(N \log N)$ worst-case bound via deterministic dense-FFT fallback. A multi-view verification framework combining Parseval energy consistency and bin-wise residual checks ensures bounded failure probability and no false negatives under correct verification. This establishes a framework combining deterministic candidate reduction, sublinear expected complexity, and worst-case safety guarantees within a CRT-based sparse FFT architecture.
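
The Parseval energy consistency check mentioned in this abstract rests on the identity $\sum_t |x[t]|^2 = \frac{1}{N} \sum_f |X[f]|^2$ for the unnormalized DFT. The sketch below shows only that identity used as an acceptance test. It is not the paper's verification procedure: `synthesize` and `parseval_ok` are hypothetical helpers, the signal is illustrative, and the time-domain energy is computed over all N samples for clarity, which a sublinear algorithm would avoid.

```python
import cmath

def synthesize(coeffs, n):
    """Inverse DFT of a sparse spectrum given as {frequency: coefficient}:
    x[t] = (1/n) * sum_f X[f] * exp(2*pi*i*f*t/n)."""
    return [
        sum(c * cmath.exp(2j * cmath.pi * f * t / n) for f, c in coeffs.items()) / n
        for t in range(n)
    ]

def parseval_ok(x, recovered, tol=1e-9):
    """Accept a recovered sparse spectrum only if its energy matches the
    time-domain energy of the signal, up to a relative tolerance."""
    n = len(x)
    time_energy = sum(abs(v) ** 2 for v in x)
    freq_energy = sum(abs(c) ** 2 for c in recovered.values()) / n
    return abs(time_energy - freq_energy) <= tol * max(1.0, time_energy)

# Illustrative 3-sparse spectrum on N = 64 points.
spectrum = {3: 2 + 1j, 17: -1j, 40: 4}
signal = synthesize(spectrum, 64)

complete = parseval_ok(signal, spectrum)                 # all tones recovered
missing = parseval_ok(signal, {3: 2 + 1j, 17: -1j})      # one tone dropped
```

A recovery that misses a tone leaves an energy deficit, so the check fails and a dense-FFT fallback could be triggered. An energy match alone does not prove the recovered frequencies are the right ones, which is presumably why the abstract pairs it with bin-wise residual checks.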

LG · Jan 19
Recursive Meta-Distillation: An Axiomatic Framework for Iterative Knowledge Refinement

Aaron R. Flouro, Shawn P. Chadwick

Recent work in probability-domain knowledge distillation has established axiomatic frameworks for temperature scaling, multi-teacher aggregation, and bias-variance trade-offs in single-stage settings. However, the mathematical behavior of recursive or multi-generation distillation remains poorly understood, with prior approaches relying primarily on empirical heuristics. In this work, we introduce an axiomatic and operator-theoretic framework for recursive meta-distillation, formalizing iterative knowledge distillation as a sequence of probability-distribution operators with explicit anchoring to base teachers. We define structural axioms for valid meta-teacher construction and prove the existence of non-trivial operator families satisfying these axioms without specifying particular algorithms or loss functions. Under mild realizability and convexity assumptions, we show that anchored recursive distillation induces contraction in KL divergence, yielding geometric convergence to base teacher distributions and a unique, globally attractive fixed point. The contribution is foundational rather than algorithmic: the framework characterizes when recursive distillation is mathematically well-posed and convergent rather than error-accumulating, independent of model architecture, optimization details, or specific operator instantiations. These results provide a theoretical basis for understanding stability, bias-variance behavior, and failure modes in iterative and multi-teacher distillation under capacity constraints.
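
The contraction behavior described in this abstract can be seen in the simplest anchored update. The abstract deliberately fixes no operator, so the linear rule below, $p_{t+1} = (1-\alpha)\,q + \alpha\,p_t$ with base teacher $q$ and $\alpha \in [0, 1)$, is an assumption made for illustration, as are the function names. Convexity of KL divergence in its second argument gives $\mathrm{KL}(q \,\|\, p_{t+1}) \le \alpha\, \mathrm{KL}(q \,\|\, p_t)$, so the divergence shrinks at least geometrically and the iterates converge to $q$.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def anchored_step(base, current, alpha):
    """One hypothetical meta-teacher update: mix the current distribution
    with the base teacher, keeping weight (1 - alpha) on the anchor."""
    return [(1 - alpha) * b + alpha * c for b, c in zip(base, current)]

def run_recursion(base, start, alpha, steps):
    """Iterate the anchored update and record KL(base || p_t) at each step."""
    dist = start
    history = [kl(base, dist)]
    for _ in range(steps):
        dist = anchored_step(base, dist, alpha)
        history.append(kl(base, dist))
    return dist, history

# Illustrative base teacher and a poorly matched starting distribution.
base = [0.7, 0.2, 0.1]
start = [0.1, 0.3, 0.6]
final, history = run_recursion(base, start, alpha=0.5, steps=20)
```

Setting `alpha` to 1 removes the anchor entirely, and the update then leaves the starting distribution unchanged, so nothing pulls the iterates back toward the base teacher. That is the unanchored regime the abstract contrasts with.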

LG · Jan 25
Adaptive Weighting in Knowledge Distillation: An Axiomatic Framework for Multi-Scale Teacher Ensemble Optimization

Aaron R. Flouro, Shawn P. Chadwick

Knowledge distillation with multiple teachers is increasingly used to improve robustness, efficiency, and safety, yet existing approaches rely largely on heuristic or implementation-specific weighting schemes. This paper develops an operator-agnostic axiomatic framework for adaptive weighting in multi-teacher knowledge distillation across three complementary scales: token, task, and context. We formalize structural conditions under which adaptive weighting operators are well-defined, admit multiple non-equivalent implementations, and can be hierarchically composed via product-structure normalization. Within this framework, we establish existence and non-uniqueness of conforming operators, characterize convergence of gradient-based optimization under standard assumptions, analyze stability and perturbation robustness, and provide an abstract formulation of safety-constrained distillation. The results decouple theoretical guarantees from specific weighting formulas, enabling principled analysis of adaptive distillation methods under heterogeneity, distribution shift, and safety constraints.
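
One plausible reading of "hierarchically composed via product-structure normalization" is that per-teacher weights from each scale are multiplied and then renormalized. The sketch below follows that reading only. The abstract gives no formula, so `compose_weights` and the example weights are hypothetical, and other conforming compositions may exist.

```python
def compose_weights(token_w, task_w, context_w):
    """Combine per-teacher weights from three scales by elementwise product,
    then renormalize so the result sums to 1. Assumed composition rule."""
    products = [a * b * c for a, b, c in zip(token_w, task_w, context_w)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one teacher needs positive weight at every scale")
    return [p / total for p in products]

# Illustrative two-teacher case: the scales agree except at the task level.
token_w = [0.5, 0.5]
task_w = [0.8, 0.2]
context_w = [0.5, 0.5]
combined = compose_weights(token_w, task_w, context_w)  # approximately [0.8, 0.2]
```

Under this rule a teacher given zero weight at any one scale is excluded from the combined weighting, and uniform weights at a scale leave the other scales' preferences unchanged.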