CVNov 1, 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and OpportunitiesMd Farhan Ishmam, Md Sakib Hossain Shovon, M. F. Mridha et al.
The multimodal task of Visual Question Answering (VQA) encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted the early VQA approaches relying on feature extraction and fusion schemes to vision language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the VLP challenges in the lens of VQA haven't been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain of VQA that delves into the intricacies of VQA datasets and methods over the field's history, introduces a detailed taxonomy to categorize the facets of VQA, and highlights the recent trends, challenges, and scopes for improvement. We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation. The work aims to navigate both beginners and experts by shedding light on the potential avenues of research and expanding the boundaries of the field.
CLAug 16, 2024
BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment AnalysisSadia Alam, Md Farhan Ishmam, Navid Hasin Alvee et al.
The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
CLAug 16, 2024
FourierKAN outperforms MLP on Text Classification Head Fine-tuningAbdullah Al Imran, Md Farhan Ishmam
In resource constraint settings, adaptation to downstream classification tasks involves fine-tuning the final layer of a classifier (i.e. classification head) while keeping rest of the model weights frozen. Multi-Layer Perceptron (MLP) heads fine-tuned with pre-trained transformer backbones have long been the de facto standard for text classification head fine-tuning. However, the fixed non-linearity of MLPs often struggles to fully capture the nuances of contextual embeddings produced by pre-trained models, while also being computationally expensive. In our work, we investigate the efficacy of KAN and its variant, Fourier KAN (FR-KAN), as alternative text classification heads. Our experiments reveal that FR-KAN significantly outperforms MLPs with an average improvement of 10% in accuracy and 11% in F1-score across seven pre-trained transformer models and four text classification tasks. Beyond performance gains, FR-KAN is more computationally efficient and trains faster with fewer parameters. These results underscore the potential of FR-KAN to serve as a lightweight classification head, with broader implications for advancing other Natural Language Processing (NLP) tasks.
CVJul 3, 2024
Visual Robustness Benchmark for Visual Question Answering (VQA)Md Farhan Ishmam, Ishmam Tashdeed, Talukder Asir Saadat et al.
Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects e.g. image blur, which can be detrimental in sensitive applications, such as medical VQA? While linguistic or textual robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on the visual robustness of VQA models. We propose the first large-scale benchmark comprising 213,000 augmented images, challenging the visual robustness of multiple VQA models and assessing the strength of realistic visual corruptions. Additionally, we have designed several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness with the visual corruptions. Our benchmark highlights the need for a balanced approach in model development that considers model performance without compromising the robustness.
93.6CLApr 19
DORA Explorer: Improving the Exploration Ability of LLMs Without TrainingPriya Gurjar, Md Farhan Ishmam, Kenneth Marino
Despite the rapid progress, LLMs for sequential decision-making (i.e., LLM agents) still struggle to produce diverse outputs. This leads to insufficient exploration, convergence to sub-optimal solutions, and becoming stuck in loops. Such limitations can be problematic in environments that require active exploration to gather information and make decisions. Sampling methods such as temperature scaling introduce token-level randomness but fail to produce enough diversity at the sequence level. We analyze LLM exploration in the classic Multi-Armed Bandit (MAB) setting and the Text Adventure Learning Environment Suite (TALES). We find that current decoding strategies and prompting methods like Chain-of-Thought and Tree-of-Thought are insufficient for robust exploration. To address this, we introduce DORA Explorer (Diversity-Oriented Ranking of Actions), a training-free framework for improving exploration in LLM agents. DORA generates diverse action candidates, scores them using token log-probabilities, and selects actions using a tunable exploration parameter. DORA achieves UCB-competitive performance on MAB and consistent gains across TALES, e.g., improving Qwen2.5-7B's performance from 29.2% to 45.5% in TextWorld. Our project is available at: https://dora-explore.github.io/.
CLOct 18, 2025Code
FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token AttributionSyed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran et al.
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: https://github.com/Starscream-11813/Frugal-ICL.
CLOct 17, 2024
BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated BanglaFabiha Haider, Fariha Tanjim Shifat, Md Farhan Ishmam et al.
The proliferation of transliterated texts in digital spaces has emphasized the need for detecting and classifying hate speech in languages beyond English, particularly in low-resource languages. As online discourse can perpetuate discrimination based on target groups, e.g. gender, religion, and origin, multi-label classification of hateful content can help in comprehending hate motivation and enhance content moderation. While previous efforts have focused on monolingual or binary hate classification tasks, no work has yet addressed the challenge of multi-label hate speech classification in transliterated Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples. The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups, reflecting the regional demographic. We establish novel transformer encoder-based baselines by further pre-training on transliterated Bangla corpus. We also propose a novel translation-based LLM prompting strategy for transliterated text. Experiments reveal that our further pre-trained encoders are achieving state-of-the-art performance on the BanTH dataset, while our translation-based prompting outperforms other strategies in the zero-shot setting. The introduction of BanTH not only fills a critical gap in hate speech research for Bangla but also sets the stage for future exploration into code-mixed and multi-label classification challenges in underrepresented languages.
AIMar 5
TimeWarp: Evaluating Web Agents by Revisiting the PastMd Farhan Ishmam, Kenneth Marino
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: $20.4\%\rightarrow37.7\%$ for Qwen-3 4B and $0\%\rightarrow27.0\%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.
HCAug 22, 2025
Prompting with Sign Parameters for Low-resource Sign Language Instruction GenerationMd Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna et al.
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.
CVOct 19, 2024
ChitroJera: A Regionally Relevant Visual Question Answering Dataset for BanglaDeeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Fahim et al.
Visual Question Answer (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA due to the lack of proper benchmarks, challenging models known to be performant in other languages. Furthermore, existing Bangla VQA datasets offer little regional relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset, ChitroJera, totaling over 15k samples from diverse and locally relevant data sources. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pre-trained dual-encoders outperform other models of their scale. We also evaluate the performance of current large vision language models (LVLMs) using prompt-based techniques, achieving the overall best performance. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of Vision-Language tasks in Bangla.