Zohaib Khan

CL
h-index17
10papers
27citations
Novelty50%
AI Score55

10 Papers

CVAug 15, 2024Code
Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

Zohaib Khan, Muhammad Khaquan, Omer Tafveez et al.

The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.

75.5CLApr 8Code
To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

Zohaib Khan, Mustafa Dogan, Ifeoma Okoh et al.

Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower-resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross-lingual gaps, and retrieval-augmented fact-checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: https://github.com/zohaib-khan5040/globallies

LGMar 7Code
Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Muhammad Khalifa, Zohaib Khan, Omer Tafveez et al.

Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1\% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib-khan5040/Countdown-Code.

CLMar 3
Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Angana Borah, Zohaib Khan, Rada Mihalcea et al.

Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility accuracy and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with accuracy up to 92%.

CLOct 16, 2024Code
With a Grain of SALT: Are LLMs Fair Across Social Dimensions?

Samee Arif, Zohaib Khan, Maaidah Kaleem et al.

This paper presents a systematic analysis of biases in open-source Large Language Models (LLMs), across gender, religion, and race. Our study evaluates bias in smaller-scale Llama and Gemma models using the SALT ($\textbf{S}$ocial $\textbf{A}$ppropriateness in $\textbf{L}$LM-Generated $\textbf{T}$ext) dataset, which incorporates five distinct bias triggers: General Debate, Positioned Debate, Career Advice, Problem Solving, and CV Generation. To quantify bias, we measure win rates in General Debate and the assignment of negative roles in Positioned Debate. For real-world use cases, such as Career Advice, Problem Solving, and CV Generation, we anonymize the outputs to remove explicit demographic identifiers and use DeepSeek-R1 as an automated evaluator. We also address inherent biases in LLM-based evaluation, including evaluation bias, positional bias, and length bias, and validate our results through human evaluations. Our findings reveal consistent polarization across models, with certain demographic groups receiving systematically favorable or unfavorable treatment. By introducing SALT, we provide a comprehensive benchmark for bias analysis and underscore the need for robust bias mitigation strategies in the development of equitable AI systems.

CLOct 6, 2025
Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Tarek Naous, Anagha Savit, Carlos Rafael Catalan et al.

As Large Language Models (LLMs) gain stronger multilingual capabilities, their ability to handle culturally diverse entities becomes crucial. Prior work has shown that LLMs often favor Western-associated entities in Arabic, raising concerns about cultural fairness. Due to the lack of multilingual benchmarks, it remains unclear if such biases also manifest in different non-Western languages. In this paper, we introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures. Camellia includes 19,530 entities manually annotated for association with the specific Asian or Western culture, as well as 2,173 naturally occurring masked contexts for entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLM families across various tasks such as cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show a struggle by LLMs at cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally-relevant data. We further observe that different LLM families hold their distinct biases, differing in how they associate cultures with particular sentiments. Lastly, we find that LLMs struggle with context understanding in Asian languages, creating performance gaps between cultures in entity extraction.

SISep 10, 2025
Scaling Truth: The Confidence Paradox in AI Fact-Checking

Ihsan A. Qazi, Zohaib Khan, Abdullah Ghani et al.

The rise of misinformation underscores the need for scalable and reliable fact-checking solutions. Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Our methodology tests model generalizability on claims postdating training cutoffs and four prompting strategies mirroring both citizen and professional fact-checker interactions, with over 240,000 human annotations as ground truth. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence. This risks systemic bias in information verification, as resource-constrained organizations typically use smaller models. Performance gaps are most pronounced for non-English languages and claims originating from the Global South, threatening to widen existing information inequalities. These results establish a multilingual benchmark for future research and provide an evidence base for policy aimed at ensuring equitable access to trustworthy, AI-assisted fact-checking.

CVApr 17, 2019
Converting a Common Document Scanner to a Multispectral Scanner

Zohaib Khan, Faisal Shafait, Ajmal Mian

We propose the construction of a prototype scanner designed to capture multispectral images of documents. A standard sheet-feed scanner is modified by disconnecting its internal light source and connecting an external multispectral light source comprising of narrow band light emitting diodes (LED). A document is scanned by illuminating the scanner light guide successively with different LEDs and capturing a scan of the document. The system is portable and can be used for potential applications in verification of questioned documents, cheques, receipts and bank notes.

CVJul 29, 2014
Hyperspectral Imaging and Analysis for Sparse Reconstruction and Recognition

Zohaib Khan

This thesis proposes spatio-spectral techniques for hyperspectral image analysis. Adaptive spatio-spectral support and variable exposure hyperspectral imaging is demonstrated to improve spectral reflectance recovery from hyperspectral images. Novel spectral dimensionality reduction techniques have been proposed from the perspective of spectral only and spatio-spectral information preservation. It was found that the joint sparse and joint group sparse hyperspectral image models achieve lower reconstruction error and higher recognition accuracy using only a small subset of bands. Hyperspectral image databases have been developed and made publicly available for further research in compressed hyperspectral imaging, forensic document analysis and spectral reflectance recovery.

CVFeb 6, 2014
Multispectral Palmprint Encoding and Recognition

Zohaib Khan, Faisal Shafait, Yiqun Hu et al.

Palmprints are emerging as a new entity in multi-modal biometrics for human identification and verification. Multispectral palmprint images captured in the visible and infrared spectrum not only contain the wrinkles and ridge structure of a palm, but also the underlying pattern of veins; making them a highly discriminating biometric identifier. In this paper, we propose a feature encoding scheme for robust and highly accurate representation and matching of multispectral palmprints. To facilitate compact storage of the feature, we design a binary hash table structure that allows for efficient matching in large databases. Comprehensive experiments for both identification and verification scenarios are performed on two public datasets -- one captured with a contact-based sensor (PolyU dataset), and the other with a contact-free sensor (CASIA dataset). Recognition results in various experimental setups show that the proposed method consistently outperforms existing state-of-the-art methods. Error rates achieved by our method (0.003% on PolyU and 0.2% on CASIA) are the lowest reported in literature on both dataset and clearly indicate the viability of palmprint as a reliable and promising biometric. All source codes are publicly available.