AR · Mar 6
A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA
Neelesh Gupta, Peter Wang, Rajgopal Kannan et al.
Gated DeltaNet (GDN) is a linear attention mechanism that replaces the growing KV cache with a fixed-size recurrent state. Hybrid LLMs like Qwen3-Next use 75% GDN layers and achieve accuracy competitive with attention-only models. However, at batch-1, GDN decode is memory-bound on GPUs since the full recurrent state must be round-tripped through HBM every token. We show that this bottleneck is architectural, not algorithmic, as all subquadratic sequence models exhibit arithmetic intensities below 1 FLOP/B at decode time, making them more memory-bound than standard Transformers. We present an FPGA accelerator that eliminates this bottleneck by holding the full 2 MB recurrent state persistently in on-chip BRAM, converting the workload from memory-bound to compute-bound. Our design fuses the GDN recurrence into a five-phase pipelined datapath that performs only one read and one write pass over each state matrix per token, exploits Grouped Value Attention for paired-head parallelism, and overlaps preparation, computation, and output storage via dataflow pipelining. We explore four design points on an AMD Alveo U55C using Vitis HLS, varying head-level parallelism from 2 to 16 value-heads per iteration. Our fastest configuration achieves 63 $\mu$s per token, 4.5$\times$ faster than the GPU reference on NVIDIA H100 PCIe. Post-implementation power analysis reports 9.96 W on-chip, yielding up to 60$\times$ greater energy efficiency per token decoded.
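As a back-of-envelope reading of the figures quoted in this abstract, the sketch below derives the per-token energy on the FPGA and the GPU numbers those figures imply. It assumes, which the abstract does not state, that the 63 µs latency, the 9.96 W power, the 4.5× speedup, and the "up to 60×" efficiency gain all describe the same configuration. The variable names are illustrative only.

```python
# Figures quoted in the abstract.
fpga_latency_s = 63e-6       # seconds per token, fastest configuration
fpga_power_w = 9.96          # on-chip power from post-implementation analysis
speedup_vs_gpu = 4.5         # FPGA vs. the H100 PCIe reference
energy_gain_vs_gpu = 60.0    # "up to" figure, treated as exact here (assumption)

# Energy per token is power multiplied by time per token.
fpga_energy_j = fpga_power_w * fpga_latency_s
fpga_tokens_per_s = 1.0 / fpga_latency_s

# Quantities implied for the GPU reference under the assumptions above.
gpu_latency_s = fpga_latency_s * speedup_vs_gpu
gpu_energy_j = fpga_energy_j * energy_gain_vs_gpu
gpu_power_w = gpu_energy_j / gpu_latency_s

print(f"FPGA energy per token: {fpga_energy_j * 1e6:.1f} uJ")
print(f"FPGA throughput:       {fpga_tokens_per_s:.0f} tokens/s")
print(f"Implied GPU latency:   {gpu_latency_s * 1e6:.1f} us")
print(f"Implied GPU power:     {gpu_power_w:.1f} W")
```

Under these assumptions the FPGA spends about 627 µJ per token, and the implied average GPU power is 9.96 × 60 / 4.5 ≈ 133 W. These are derived estimates, not numbers reported in the abstract.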
IV · Sep 28, 2025 · Code
Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study
Kiran Nijjer, Ryan Bui, Derek Jiu et al.
SkinGPT-4, a large vision-language model, leverages annotated skin disease images to augment clinical workflows in underserved communities. However, its training dataset predominantly represents lighter skin tones, limiting diagnostic accuracy for darker tones. Here, we evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases, including eczema, allergic-contact dermatitis, and psoriasis using the open-sourced SCIN dataset. We leveraged the SkinGPT-4 backbone to develop finetuned models for custom skin disease classification tasks and explored bias mitigation strategies. Clinical evaluation by board-certified dermatologists on six relevant skin diseases from 300 SCIN cases assessed images for diagnostic accuracy, informativity, physician utility, and patient utility. Model fairness metrics, including demographic parity and equalized odds, were calculated across skin tones. SkinGPT-4 achieved an average demographic parity of 0.10 across Fitzpatrick types, with notable differences of 0.10-0.15 between lightest and darkest tones across evaluation metrics. Model hallucinations in artifacts and anatomy occurred at a rate of 17.8. Our customized models achieved average F1, precision, and AUROC of 0.75, 0.78, and 0.78 across visually similar disease pairs. Fairness analysis showed an average demographic parity of 0.75, with a maximum disparity of 0.21 across skin tones. The best model achieved parity scores of 0.83, 0.83, 0.76, 0.89, 0.90, and 0.90 for Fitzpatrick I-VI, indicating robust fairness. Large language models such as SkinGPT-4 showed weaker performance on darker tones. Model biases exist across evaluation criteria, and hallucinations may affect diagnostic efficacy. These findings demonstrate the efficacy of training accurate, fair models using existing backbones for custom skin disease classification.
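The abstract does not define how its demographic parity and equalized odds figures are computed, so the following is only a minimal sketch of one common formulation: compare per-group rates and report the largest gap. The toy labels, the two group names, and the helper names `group_rates` and `gap` are all hypothetical.

```python
from collections import defaultdict

def group_rates(y_pred, groups, keep=None):
    """Positive-prediction rate per group, optionally restricted to
    the samples where keep[i] is True."""
    pos, total = defaultdict(int), defaultdict(int)
    for i, (pred, g) in enumerate(zip(y_pred, groups)):
        if keep is not None and not keep[i]:
            continue
        total[g] += 1
        pos[g] += int(pred == 1)
    return {g: pos[g] / total[g] for g in total}

def gap(rates):
    """Largest difference between any two groups' rates."""
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: two skin-tone groups, binary disease label.
groups = ["I", "I", "I", "I", "VI", "VI", "VI", "VI"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Demographic parity: positive-prediction rates should match across groups.
dp_gap = gap(group_rates(y_pred, groups))

# Equalized odds: true-positive and false-positive rates should both match.
tpr_gap = gap(group_rates(y_pred, groups, keep=[t == 1 for t in y_true]))
fpr_gap = gap(group_rates(y_pred, groups, keep=[t == 0 for t in y_true]))
eo_gap = max(tpr_gap, fpr_gap)

print(dp_gap, tpr_gap, fpr_gap, eo_gap)
```

A gap of 0 means the groups are treated identically under that criterion. The scores reported in the abstract may follow a different convention, for example a ratio or one minus the gap.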
LG · Dec 28, 2025
Enabling Long FFT Convolutions on Memory-Constrained FPGAs via Chunking
Peter Wang, Neelesh Gupta, Viktor Prasanna
The need for long-context reasoning has led to alternative neural network architectures besides Transformers and self-attention, a popular model being Hyena, which employs causal 1D-convolutions implemented with FFTs. Long convolutions enable efficient global context mixing, but storage requirements for intermediate results exceed the 2-3 MB Block RAM capacity of FPGAs. We present a chunked FFT convolution approach enabling convolutions of a 450K-length sequence with a 450K-length filter on an Alveo U200 FPGA with 2.8 MB BRAM through chunking and overlap-add reconstruction. We find that throughput scales proportionally with chunk size while degrading by only 7% for our longest sequences, demonstrating that careful memory management enables deployment of long-context primitives on edge FPGAs without sacrificing performance.
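The chunking and overlap-add reconstruction named in this abstract can be illustrated in software. The sketch below is a NumPy analogue, not the authors' FPGA design: it assumes both the sequence and the filter are split into blocks of the same length `chunk`, so that every FFT has a fixed, small size regardless of the total lengths. The function name `chunked_fft_conv` is hypothetical.

```python
import numpy as np

def chunked_fft_conv(x, h, chunk):
    """Linear convolution of x with h using only FFTs sized for one
    chunk-by-chunk product, combined by overlap-add."""
    # Smallest power of two that holds the linear convolution of two chunks.
    n_fft = 1
    while n_fft < 2 * chunk - 1:
        n_fft *= 2

    y = np.zeros(len(x) + len(h) - 1)
    for i in range(0, len(x), chunk):
        xb = x[i:i + chunk]
        X = np.fft.rfft(xb, n_fft)          # zero-padded FFT of the input block
        for j in range(0, len(h), chunk):
            hb = h[j:j + chunk]
            seg = np.fft.irfft(X * np.fft.rfft(hb, n_fft), n_fft)
            seg_len = len(xb) + len(hb) - 1
            # The partial result for blocks (i, j) starts at output offset i + j;
            # overlapping tails from neighbouring blocks are summed.
            y[i + j:i + j + seg_len] += seg[:seg_len]
    return y

# Check against direct convolution on random data.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(300)
max_err = np.max(np.abs(chunked_fft_conv(x, h, 128) - np.convolve(x, h)))
print(max_err)
```

The working set per step is a few buffers of length `n_fft`, which is the property that makes the approach attractive when on-chip memory is the limit. The price is more block products as `chunk` shrinks.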
IM · Sep 10, 2019
Photometric light curves classification with machine learning
Tatiana Gabruseva, Sergey Zlobin, Peter Wang
The Large Synoptic Survey Telescope will complete its survey in 2022 and produce terabytes of imaging data each night. To work with this massive influx of data, automated algorithms to classify astronomical light curves are crucial. Here, we present a method for automated classification of photometric light curves for a range of astronomical objects. Our approach is based on the gradient boosting of decision trees, feature extraction and selection, and augmentation. The solution was developed in the context of The Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC) and achieved one of the top results in the challenge.
AI · Sep 4, 2019
Fractals2019: Combinatorial Optimisation with Dynamic Constraint Annealing
Mikhail Prokopenko, Peter Wang
Fractals2019 started as a new experimental entry in the RoboCup Soccer 2D Simulation League, based on Gliders2d code base, and advanced to become a RoboCup-2019 champion. We employ combinatorial optimisation methods, within the framework of Guided Self-Organisation, with the search guided by local constraints. We present examples of several tactical tasks based on the Gliders2d code (version v2), including the search for an optimal assignment of heterogeneous player types, as well as blocking behaviours, offside trap, and attacking formations. We propose a new method, Dynamic Constraint Annealing, for solving dynamic constraint satisfaction problems, and apply it to optimise thermodynamic potential of collective behaviours, under dynamically induced constraints.
CV · Jun 27, 2017
Hierarchical Model for Long-term Video Prediction
Peter Wang, Zhongxia Yan, Jeff Zhang
Video prediction has been an active topic of research in the past few years. Many algorithms focus on pixel-level predictions, which generate results that blur and disintegrate within a few frames. In this project, we use a hierarchical approach for long-term video prediction. We aim at estimating high-level structure in the input frame first, then predict how that structure grows in the future. Finally, we use an image analogy network to recover a realistic image from the predicted structure. Our method is largely adopted from the work by Villegas et al. The method is built with a combination of LSTMs and analogy-based convolutional auto-encoder networks. Additionally, in order to generate more realistic frame predictions, we also adopt adversarial loss. We evaluate our method on the Penn Action dataset, and demonstrate good results on high-level long-term structure prediction.
RO · Dec 18, 2014
Simulation leagues: Enabling replicable and robust investigation of complex robotic systems
David M Budden, Peter Wang, Oliver Obst et al.
Physically-realistic simulated environments are powerful platforms for enabling measurable, replicable and statistically-robust investigation of complex robotic systems. Such environments are epitomised by the RoboCup simulation leagues, which have been successfully utilised to conduct massively-parallel experiments in topics including: optimisation of bipedal locomotion, self-localisation from noisy perception data and planning complex multi-agent strategies without direct agent-to-agent communication. Many of these systems are later transferred to physical robots, making the simulation leagues invaluable well beyond the scope of simulated soccer matches. In this study, we provide an overview of the RoboCup simulation leagues and describe their properties as they pertain to replicable and robust robotics research. To demonstrate their utility directly, we leverage the ability to run parallelised experiments to evaluate different competition formats (e.g. round robin) for the RoboCup 2D simulation league. Our results demonstrate that a previously-proposed hybrid format minimises fluctuations from 'true' (statistically-significant) team performance rankings within the time constraints of the RoboCup world finals. Our experimental analysis would be impossible with physical robots alone, and we encourage other researchers to explore the potential for enriching their experimental pipelines with simulated components, both to minimise experimental costs and enable others to replicate and expand upon their results in a hardware-independent manner.
MA · Mar 17, 2014
Simulation leagues: Analysis of competition formats
David Budden, Peter Wang, Oliver Obst et al.
The selection of an appropriate competition format is critical for both the success and credibility of any competition, both real and simulated. In this paper, the automated parallelism offered by the RoboCupSoccer 2D simulation league is leveraged to conduct a 28,000 game round-robin between the top 8 teams from RoboCup 2012 and 2013. A proposed new competition format is found to reduce variation from the resultant statistically significant team performance rankings by 75% and 67%, when compared to the actual competition results from RoboCup 2012 and 2013 respectively. These results are statistically validated by generating 10,000 random tournaments for each of the three considered formats and comparing the respective distributions of ranking discrepancy.
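To give a sense of scale for the 28,000-game round-robin, the short sketch below counts the distinct pairings among 8 teams and the games each pairing would play. It assumes, which the abstract does not state, that the games are spread evenly across pairings within a single 8-team field.

```python
from itertools import combinations

n_teams = 8            # "top 8 teams" from the abstract
total_games = 28_000   # size of the round-robin from the abstract

# Every unordered pair of teams meets in a round-robin.
pairings = list(combinations(range(n_teams), 2))

# Assumption: games are divided evenly among the pairings.
games_per_pairing = total_games // len(pairings)

print(len(pairings), games_per_pairing)
```

With 8 teams there are 28 pairings, so an even split gives 1,000 games per pairing, a sample large enough to estimate each head-to-head outcome with little noise.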