Jaisidh Singh

CV
h-index: 63
Papers: 4
Citations: 46
Novelty: 53%
AI Score: 45

4 Papers

LG, Feb 6
Explaining Grokking in Transformers through the Lens of Inductive Bias

Jaisidh Singh, Diganta Misra, Antonio Orvieto

We investigate grokking in transformers through the lens of inductive bias: dispositions arising from architecture or optimization that let the network prefer one solution over another. We first show that architectural choices such as the position of Layer Normalization (LN) strongly modulate grokking speed. This modulation is explained by isolating how LN on specific pathways shapes shortcut-learning and attention entropy. Subsequently, we study how different optimization settings modulate grokking, inducing distinct interpretations of previously proposed controls such as readout scale. In particular, we find that using readout scale as a control for lazy training can be confounded by learning rate and weight decay in our setting. Accordingly, we show that features evolve continuously throughout training, suggesting that grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show how generalization predictably emerges with feature compressibility in grokking, across different modulators of inductive bias. Our code is released at https://tinyurl.com/y52u3cad.
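
As an illustration of the attention entropy mentioned in this abstract, here is a minimal sketch that assumes the usual definition: the Shannon entropy of one query's softmax attention distribution over keys. The function names are illustrative and are not taken from the paper's released code.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores):
    """Shannon entropy (in nats) of one query's attention distribution."""
    probs = softmax(scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Attention spread evenly over 4 keys versus concentrated on one key.
uniform = attention_entropy([0.0, 0.0, 0.0, 0.0])
peaked = attention_entropy([10.0, 0.0, 0.0, 0.0])
```

Under this definition, evenly spread attention over 4 keys gives the maximum entropy log(4), and a sharply peaked row gives a value close to zero.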

CV, Mar 29, 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations

Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa et al.

Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negation. This paper highlights the limitations of popular VLMs such as CLIP in understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions, and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in a 3.85% average gain in top-1 accuracy for zero-shot image classification across 8 datasets. Further, CoN-CLIP outperforms CLIP on challenging compositionality benchmarks such as SugarCREPE by 4.4%, showcasing emergent compositional understanding of objects, relations, and attributes in text. Overall, our work addresses a crucial limitation of VLMs by introducing a dataset and framework that strengthens semantic associations between images and text. It demonstrates improved large-scale foundation models at significantly reduced computational cost, promoting efficiency and accessibility.
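
The abstract does not spell out how the contrastive loss is modified, so the following is only a hypothetical illustration and not the CoN-CLIP objective. It assumes an InfoNCE-style image-to-text loss in which the negated caption is appended as one extra negative. The function names, the toy embeddings, and the temperature value are all assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def image_to_text_loss(image, true_caption, other_captions,
                       negated_caption=None, temperature=0.07):
    """Cross-entropy over cosine similarities for a single image.

    Hypothetical illustration: the negated caption, when given,
    is simply added to the list of negatives.
    """
    img = normalize(image)
    candidates = [true_caption] + list(other_captions)
    if negated_caption is not None:
        candidates.append(negated_caption)
    logits = [dot(img, normalize(c)) / temperature for c in candidates]
    # log-sum-exp computed stably; the true caption sits at index 0.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[0]

# Toy 2-D embeddings: the negated caption lies close to the true one,
# since the two texts differ only by a negation word.
image, true_cap = [1.0, 0.0], [0.9, 0.1]
others, negated = [[0.0, 1.0]], [0.8, 0.3]
loss_without = image_to_text_loss(image, true_cap, others)
loss_with = image_to_text_loss(image, true_cap, others, negated)
```

In this toy setup the loss is larger when the negated caption is included, because a near-duplicate negative competes with the true caption for probability mass.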

CV, Jul 14, 2025
(Almost) Free Modality Stitching of Foundation Models

Jaisidh Singh, Diganta Misra, Boris Knyazev et al.

Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best performing uni-modal model pair by $10\times$, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
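
To make the idea of predicting connectors for $N \times M$ model pairs concrete, here is a toy sketch under strong assumptions; it is not the Hyma implementation. It assumes each uni-modal model has an identity embedding, that a single linear hypernetwork maps a pair of embeddings to the weights of a linear connector, and that all connectors share one shape. All names and sizes are hypothetical, and the weights are random rather than trained.

```python
import random

random.seed(0)

N_IMAGE_MODELS, N_TEXT_MODELS = 2, 3   # hypothetical N and M
EMB_DIM = 4                            # size of each model-identity embedding
D_IN, D_OUT = 3, 2                     # toy connector maps D_IN features to D_OUT

def rand_vec(n):
    return [random.gauss(0.0, 0.1) for _ in range(n)]

# One identity embedding per uni-modal model.
image_emb = [rand_vec(EMB_DIM) for _ in range(N_IMAGE_MODELS)]
text_emb = [rand_vec(EMB_DIM) for _ in range(N_TEXT_MODELS)]

# Hypernetwork: one linear layer from the concatenated embeddings
# to the flattened connector weight matrix.
hyper_w = [rand_vec(2 * EMB_DIM) for _ in range(D_OUT * D_IN)]

def predict_connector(i, j):
    """Return a D_OUT x D_IN connector matrix for image model i, text model j."""
    cond = image_emb[i] + text_emb[j]  # list concatenation
    flat = [sum(w * c for w, c in zip(row, cond)) for row in hyper_w]
    return [flat[r * D_IN:(r + 1) * D_IN] for r in range(D_OUT)]

def apply_connector(weights, features):
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

# A single set of hypernetwork weights yields a connector for every pair.
connectors = {(i, j): predict_connector(i, j)
              for i in range(N_IMAGE_MODELS) for j in range(N_TEXT_MODELS)}
projected = apply_connector(connectors[(0, 1)], [1.0, 2.0, 3.0])
```

The point of the sketch is that only `hyper_w` and the embeddings would be trained, so all pairs share parameters instead of each pair needing its own separately trained connector.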

CV, Nov 16, 2024
Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation

Jaisidh Singh, Sonam Singh, Amit Arvind Kale et al.

This paper presents a novel method for discovering systematic errors in segmentation models. For instance, a systematic error in a segmentation model can be a sufficiently large number of instances of the target class pedestrian that the model misclassifies as parking meter. With the rapid deployment of these models in critical applications such as autonomous driving, it is vital to detect and interpret these systematic errors. However, the key challenge is automatically discovering such failures on unlabelled data and forming interpretable semantic sub-groups for intervention. For this, we leverage multimodal foundation models to retrieve errors and use conceptual linkage, together with the nature of each error, to study how systematic these errors are. We demonstrate that such errors are present in SOTA segmentation models (UperNet ConvNeXt and UperNet Swin) trained on the Berkeley Deep Drive dataset and benchmark the approach qualitatively and quantitatively, showing its effectiveness by discovering coherent systematic errors for these models. Our work opens an avenue for model analysis and intervention that has so far been underexplored in semantic segmentation.