LGJul 2, 2024
Agreement-Based Cascading for Efficient InferenceSteven Kolawole, Don Dennis, Ameet Talwalkar et al.
Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.
CLMar 29, 2023
Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African LanguagesColin Leong, Herumb Shandilya, Bonaventure F. P. Dossou et al. · mila
Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, access to high computational resources added to the issue of data scarcity of African languages constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapters in the context of this low-resource double-bind. We intend to answer the following question: do language adapters allow those who are doubly bound by data and compute to practically build useful models? Through fine-tuning experiments on African languages, we evaluate their effectiveness as cost-effective approaches to low-resource African NLP. Using solely free compute resources, our results show that language adapters achieve comparable performances to massive pre-trained language models which are heavy on computational resources. This opens the door to further experimentation and exploration on full-extent of language adapters capacities.
LGFeb 8, 2024
Everybody Prune Now: Structured Pruning of LLMs with only Forward PassesLucio Dery, Steven Kolawole, Jean-François Kagy et al.
Structured pruning is a promising approach to create smaller, faster LLMs. However, existing methods typically rely on backward passes, which can inflate memory requirements and compute costs. In this work we introduce Bonsai, a gradient-free structured pruning method that eliminates the need for backpropagation, significantly reducing memory requirements and compute costs while achieving state-of-the-art pruning performance. Bonsai uses forward-pass-only perturbative pruning to enable efficient compression of large models on a broader range of hardware configurations. Unlike existing structured pruning approaches, Bonsai not only achieves better compression with fewer resources, but also produces models that are twice as fast as those generated by semi-structured pruning. As a concrete demonstration, we use Bonsai to prune an 8B LLaMA-3 model to 50% sparsity on a single A6000 GPU -- a task infeasible with backprop-based methods, which require 2-3x memory. Our results show that removing backprop as a requirement not only enables pruning larger models on constrained hardware but can also lead to state-of-the-art efficiency and performance.
LGJun 23, 2025
PARALLELPROMPT: Extracting Parallelism from Large Language Model QueriesSteven Kolawole, Keshav Santhanam, Virginia Smith et al.
LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism--decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation. To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation. By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.
CLApr 6, 2024
What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language ModelsBusayo Awobade, Mardiyyah Oduwole, Steven Kolawole
Compression techniques have been crucial in advancing machine learning by enabling efficient training and deployment of large-scale language models. However, these techniques have received limited attention in the context of low-resource language models, which are trained on even smaller amounts of data and under computational constraints, a scenario known as the "low-resource double-bind." This paper investigates the effectiveness of pruning, knowledge distillation, and quantization on an exclusively low-resourced, small-data language model, AfriBERTa. Through a battery of experiments, we assess the effects of compression on performance across several metrics beyond accuracy. Our study provides evidence that compression techniques significantly improve the efficiency and effectiveness of small-data language models, confirming that the prevailing beliefs regarding the effects of compression on large, heavily parameterized models hold true for less-parameterized, small-data models.
CRFeb 21
PrivacyBench: Privacy Isn't Free in Hybrid Privacy-Preserving Vision SystemsNnaemeka Obiefuna, Samuel Oyeneye, Similoluwa Odunaiya et al.
Privacy preserving machine learning deployments in sensitive deep learning applications; from medical imaging to autonomous systems; increasingly require combining multiple techniques. Yet, practitioners lack systematic guidance to assess the synergistic and non-additive interactions of these hybrid configurations, relying instead on isolated technique analysis that misses critical system level interactions. We introduce PrivacyBench, a benchmarking framework that reveals striking failures in privacy technique combinations with severe deployment implications. Through systematic evaluation across ResNet18 and ViT models on medical datasets, we uncover that FL + DP combinations exhibit severe convergence failure, with accuracy dropping from 98% to 13% while compute costs and energy consumption substantially increase. In contrast, FL + SMPC maintains near-baseline performance with modest overhead. Our framework provides the first systematic platform for evaluating privacy-utility-cost trade-offs through automated YAML configuration, resource monitoring, and reproducible experimental protocols. PrivacyBench enables practitioners to identify problematic technique interactions before deployment, moving privacy-preserving computer vision from ad-hoc evaluation toward principled systems design. These findings demonstrate that privacy techniques cannot be composed arbitrarily and provide critical guidance for robust deployment in resource-constrained environments.
CLSep 26, 2025
Semantic Agreement Enables Efficient Open-Ended LLM CascadesDuncan Soiffer, Steven Kolawole, Virginia Smith
Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement -- meaning-level consensus between ensemble outputs -- as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
CLSep 9, 2025
From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine TranslationMardiyyah Oduwole, Oluwatosin Olajide, Jamiu Suleiman et al.
The linguistic diversity across the African continent presents different challenges and opportunities for machine translation. This study explores the effects of data augmentation techniques in improving translation systems in low-resource African languages. We focus on two data augmentation techniques: sentence concatenation with back translation and switch-out, applying them across six African languages. Our experiments show significant improvements in machine translation performance, with a minimum increase of 25\% in BLEU score across all six languages. We provide a comprehensive analysis and highlight the potential of these techniques to improve machine translation systems for low-resource languages, contributing to the development of more robust translation systems for under-resourced languages.
CVMay 30, 2023
Vision Transformers for Mobile Applications: A Short SurveyNahid Alam, Steven Kolawole, Simardeep Sethi et al.
Vision Transformers (ViTs) have demonstrated state-of-the-art performance on many Computer Vision Tasks. Unfortunately, deploying these large-scale ViTs is resource-consuming and impossible for many mobile devices. While most in the community are building for larger and larger ViTs, we ask a completely opposite question: How small can a ViT be within the tradeoffs of accuracy and inference latency that make it suitable for mobile deployment? We look into a few ViTs specifically designed for mobile applications and observe that they modify the transformer's architecture or are built around the combination of CNN and transformer. Recent work has also attempted to create sparse ViT networks and proposed alternatives to the attention module. In this paper, we study these architectures, identify the challenges and analyze what really makes a vision transformer suitable for mobile applications. We aim to serve as a baseline for future research direction and hopefully lay the foundation to choose the exemplary vision transformer architecture for your application running on mobile devices.
CVNov 1, 2021
Sign-to-Speech Model for Sign Language Understanding: A Case Study of Nigerian Sign LanguageSteven Kolawole, Opeyemi Osakuade, Nayan Saxena et al.
Through this paper, we seek to reduce the communication barrier between the hearing-impaired community and the larger society who are usually not familiar with sign language in the sub-Saharan region of Africa with the largest occurrences of hearing disability cases, while using Nigeria as a case study. The dataset is a pioneer dataset for the Nigerian Sign Language and was created in collaboration with relevant stakeholders. We pre-processed the data in readiness for two different object detection models and a classification model and employed diverse evaluation metrics to gauge model performance on sign-language to text conversion tasks. Finally, we convert the predicted sign texts to speech and deploy the best performing model in a lightweight application that works in real-time and achieves impressive results converting sign words/phrases to text and subsequently, into speech.