69.4DCMay 29
HeLoCo: Efficient asynchronous low-communication training under data and device heterogeneityAbdullah Al Asif, Patrick Diem, Juan Pablo Muñoz et al.
Distributed Low-Communication (DiLoCo) training reduces communication overhead by allowing workers to perform multiple local optimization steps before sending pseudo-gradients to a global outer update. Its asynchronous variant further improves hardware utilization by removing synchronization barriers, but at the cost of stale pseudo-gradients computed from outdated model states. As a result, these updates can become misaligned with the current global optimization direction, particularly in heterogeneous systems. This issue becomes even more pronounced when data are non-IID, a setting that has not been well studied in asynchronous low-communication training. To address this limitation, we propose \textbf{HeLoCo}, a direction-aware correction method for asynchronous low-communication training that uses outer momentum as a reference for the current optimization trajectory and selectively adjusts incoming pseudo-gradients before the outer update. Updates that remain aligned are preserved, while directionally conflicting components are corrected. On multilingual language-model training with heterogeneous workers and non-IID data, HeLoCo consistently improves validation loss. It outperforms existing asynchronous DiLoCo-based baselines by up to 7.5\% at a fixed token budget, exceeds asynchronous momentum look-ahead by up to 3.3\% at a fixed wall-clock budget, and surpasses the synchronous baseline by up to 22.1\% under severe system heterogeneity. Our analysis further shows how staleness, worker speed, and data heterogeneity shape update quality and convergence in highly decentralized and heterogeneous training setups.
64.7DCMay 26
SuperSFL: Resource-Heterogeneous Federated Split Learning with Weight-Sharing Super-NetworksAbdullah Al Asif, Sixing Yu, Juan Pablo Munoz et al.
SplitFed Learning (SFL) combines federated learning and split learning to enable collaborative training across distributed edge devices; however, it faces significant challenges in heterogeneous environments with diverse computational and communication capabilities. This paper proposes \textit{SuperSFL}, a federated split learning framework that leverages a weight-sharing super-network to dynamically generate resource-aware client-specific subnetworks, effectively mitigating device heterogeneity. SuperSFL introduces Three-Phase Gradient Fusion (TPGF), an optimization mechanism that coordinates local updates, server-side computation, and gradient fusion to accelerate convergence. In addition, a fault-tolerant client-side classifier and collaborative client--server aggregation enable uninterrupted training under intermittent communication failures. Experimental results on CIFAR-10 and CIFAR-100 with up to 100 heterogeneous clients show that SuperSFL converges $2$--$5\times$ faster in terms of communication rounds than baseline SFL while achieving higher accuracy, resulting in up to $20\times$ lower total communication cost and $13\times$ shorter training time. SuperSFL also demonstrates improved energy efficiency compared to baseline methods, making it a practical solution for federated learning in heterogeneous edge environments.
LGNov 28, 2025
PerfMamba: Performance Analysis and Pruning of Selective State Space ModelsAbdullah Al Asif, Mobina Kashaniyan, Sixing Yu et al.
Recent advances in sequence modeling have introduced selective SSMs as promising alternatives to Transformer architectures, offering theoretical computational efficiency and sequence processing advantages. A comprehensive understanding of selective SSMs in runtime behavior, resource utilization patterns, and scaling characteristics still remains unexplored, thus obstructing their optimal deployment and further architectural improvements. This paper presents a thorough empirical study of Mamba-1 and Mamba-2, systematically profiled for performance to assess the design principles that contribute to their efficiency in state-space modeling. A detailed analysis of computation patterns, memory access, I/O characteristics, and scaling properties was performed for sequence lengths ranging from 64 to 16384 tokens. Our findings show that the SSM component, a central part of the selective SSM architecture, demands a significant portion of computational resources compared to other components in the Mamba block. Based on these insights, we propose a pruning technique that selectively removes low-activity states within the SSM component, achieving measurable throughput and memory gains while maintaining accuracy within a moderate pruning regime. This approach results in performance improvements across varying sequence lengths, achieving a 1.14x speedup and reducing memory usage by 11.50\%. These results offer valuable guidance for designing more efficient SSM architectures that can be applied to a wide range of real-world applications.
CVNov 8, 2024
A Real-time Face Mask Detection and Social Distancing System for COVID-19 using Attention-InceptionV3 ModelAbdullah Al Asif, Farhana Chowdhury Tisha
One of the deadliest pandemics is now happening in the current world due to COVID-19. This contagious virus is spreading like wildfire around the whole world. To minimize the spreading of this virus, World Health Organization (WHO) has made protocols mandatory for wearing face masks and maintaining 6 feet physical distance. In this paper, we have developed a system that can detect the proper maintenance of that distance and people are properly using masks or not. We have used the customized attention-inceptionv3 model in this system for the identification of those two components. We have used two different datasets along with 10,800 images including both with and without Face Mask images. The training accuracy has been achieved 98% and validation accuracy 99.5%. The system can conduct a precision value of around 98.2% and the frame rate per second (FPS) was 25.0. So, with this system, we can identify high-risk areas with the highest possibility of the virus spreading zone. This may help authorities to take necessary steps to locate those risky areas and alert the local people to ensure proper precautions in no time.
CLMar 31, 2022
Bangla hate speech detection on social media using attention-based recurrent neural networkAmit Kumar Das, Abdullah Al Asif, Anik Paul et al.
Hate speech has spread more rapidly through the daily use of technology and, most notably, by sharing your opinions or feelings on social media in a negative aspect. Although numerous works have been carried out in detecting hate speeches in English, German, and other languages, very few works have been carried out in the context of the Bengali language. In contrast, millions of people communicate on social media in Bengali. The few existing works that have been carried out need improvements in both accuracy and interpretability. This article proposed encoder decoder based machine learning model, a popular tool in NLP, to classify user's Bengali comments on Facebook pages. A dataset of 7,425 Bengali comments, consisting of seven distinct categories of hate speeches, was used to train and evaluate our model. For extracting and encoding local features from the comments, 1D convolutional layers were used. Finally, the attention mechanism, LSTM, and GRU based decoders have been used for predicting hate speech categories. Among the three encoder decoder algorithms, the attention-based decoder obtained the best accuracy (77%).