LGJun 3
Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease ClassificationYogesh Kumar, Vrushank Ahire, Mudasir Ganaie
Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
CRMay 26
Analyzing Linear Layers in Related-Differential CryptanalysisYogesh Kumar, Akshay Ankush Yadav, Susanta Samanta
In AES-like ciphers, diffusion layers are commonly instantiated using MDS matrices, since their optimal branch number yields strong diffusion guarantees and underpins classical resistance arguments against differential and linear cryptanalysis. However, Daemen and Rijmen (2009) showed that linear layers may still exhibit related-differential structure beyond what the MDS criterion captures, and Bardeh and Rijmen (2022) demonstrated that this phenomenon can be exploited in attacks on reduced-round AES. In this work, we systematically investigate the conditions under which linear layers avoid or exhibit these differentials, identifying matrix classes for which such structure is unavoidable. We first prove that every non-MDS matrix admits a nontrivial pair of related differentials, showing that the MDS property is necessary for avoiding them. We then establish that every odd-order symmetric MDS matrix admits related differentials, which rules out broad families of Cauchy-based constructions. We also substantially strengthen the circulant case by proving that related differentials are unavoidable for every circulant matrix of order $n$ with $n \not\equiv \pm 2 \pmod{12}$. Finally, we revisit the characterization of $3 \times 3$ MDS matrices over $\mathbb{F}_{2^m}$ for the absence of related differentials, and derive an explicit necessary and sufficient criterion in terms of $15$ polynomial constraints.
OCNov 3, 2018
Optimal multiplexing of sparse controllers for linear systemsYogesh Kumar, Sukumar Srikant, Debasish Chatterjee
This article treats three problems of sparse and optimal multiplexing a finite ensemble of linear control systems. Given an ensemble of linear control systems, multiplexing of the controllers consists of an algorithm that selects, at each time \(t\), only one from the ensemble of linear systems is actively controlled whereas the other systems evolve in open-loop. The first problem treated here is a ballistic reachability problem where the control signals are required to be maximally sparse and multiplexed, the second concerns sparse and optimally multiplexed linear quadratic control, and the third is a sparse and optimally multiplexed Mayer problem. Numerical experiments are provided to demonstrate the efficacy of the techniques developed here.
CVNov 16, 2025Code
Temporal Object-Aware Vision Transformer for Few-Shot Video Object DetectionYogesh Kumar, Anand Mishra
Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals, which are often computationally expensive and require task-specific training. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in a few-shot setting. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Our approach achieves performance gains, with AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5 (VidVRD) in the 5-shot setting. Further results demonstrate improvements in 1-shot, 3-shot, and 10-shot configurations. We make the code public at: https://github.com/yogesh-iitj/fs-video-vit
LGDec 24, 2025
A Unified Framework for EEG Seizure Detection Using Universum-Integrated Generalized Eigenvalues Proximal Support Vector MachineYogesh Kumar, Vrushank Ahire, M. A. Ganaie
The paper presents novel Universum-enhanced classifiers: the Universum Generalized Eigenvalue Proximal Support Vector Machine (U-GEPSVM) and the Improved U-GEPSVM (IU-GEPSVM) for EEG signal classification. Using the computational efficiency of generalized eigenvalue decomposition and the generalization benefits of Universum learning, the proposed models address critical challenges in EEG analysis: non-stationarity, low signal-to-noise ratio, and limited labeled data. U-GEPSVM extends the GEPSVM framework by incorporating Universum constraints through a ratio-based objective function, while IU-GEPSVM enhances stability through a weighted difference-based formulation that provides independent control over class separation and Universum alignment. The models are evaluated on the Bonn University EEG dataset across two binary classification tasks: (O vs S)-healthy (eyes closed) vs seizure, and (Z vs S)-healthy (eyes open) vs seizure. IU-GEPSVM achieves peak accuracies of 85% (O vs S) and 80% (Z vs S), with mean accuracies of 81.29% and 77.57% respectively, outperforming baseline methods.
CVMar 15, 2024
Improving Medical Multi-modal Contrastive Learning with Expert AnnotationsYogesh Kumar, Pekka Marttinen
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
CVJan 29, 2025
Towards Making Flowchart Images Machine InterpretableShreya Shukla, Prajwal Gatti, Yogesh Kumar et al.
Computer programming textbooks and software documentations often contain flowcharts to illustrate the flow of an algorithm or procedure. Modern OCR engines often tag these flowcharts as graphics and ignore them in further processing. In this paper, we work towards making flowchart images machine-interpretable by converting them to executable Python codes. To this end, inspired by the recent success in natural language to code generation literature, we present a novel transformer-based framework, namely FloCo-T5. Our model is well-suited for this task,as it can effectively learn semantics, structure, and patterns of programming languages, which it leverages to generate syntactically correct code. We also used a task-specific pre-training objective to pre-train FloCo-T5 using a large number of logic-preserving augmented code samples. Further, to perform a rigorous study of this problem, we introduce theFloCo dataset that contains 11,884 flowchart images and their corresponding Python codes. Our experiments show promising results, and FloCo-T5 clearly outperforms related competitive baselines on code generation metrics. We make our dataset and implementation publicly available.
CVAug 21, 2025
Aligning Moments in Time using Video QueriesYogesh Kumar, Uday Agarwal, Manish Gupta et al.
Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU on an absolute scale over strong baselines.
CVAug 25, 2025
Language-Guided Temporal Token Pruning for Efficient VideoLLM ProcessingYogesh Kumar
Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Unlike uniform pruning or keyframe selection, LGTTP retains higher token density in temporally relevant segments. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit temporal markers and remains effective across general video understanding tasks.
CVMay 3, 2025
VideoLLM Benchmarks and Evaluation: A SurveyYogesh Kumar
The rapid development of Large Language Models (LLMs) has catalyzed significant advancements in video understanding technologies. This survey provides a comprehensive analysis of benchmarks and evaluation methodologies specifically designed or used for Video Large Language Models (VideoLLMs). We examine the current landscape of video understanding benchmarks, discussing their characteristics, evaluation protocols, and limitations. The paper analyzes various evaluation methodologies, including closed-set, open-set, and specialized evaluations for temporal and spatiotemporal understanding tasks. We highlight the performance trends of state-of-the-art VideoLLMs across these benchmarks and identify key challenges in current evaluation frameworks. Additionally, we propose future research directions to enhance benchmark design, evaluation metrics, and protocols, including the need for more diverse, multimodal, and interpretability-focused benchmarks. This survey aims to equip researchers with a structured understanding of how to effectively evaluate VideoLLMs and identify promising avenues for advancing the field of video understanding with large language models.
MLJan 31, 2022
Deconfounded Representation Similarity for Comparison of Neural NetworksTianyu Cui, Yogesh Kumar, Pekka Marttinen et al.
Similarity metrics such as representational similarity analysis (RSA) and centered kernel alignment (CKA) have been used to compare layer-wise representations between neural networks. However, these metrics are confounded by the population structure of data items in the input space, leading to spuriously high similarity for even completely random neural networks and inconsistent domain relations in transfer learning. We introduce a simple and generally applicable fix to adjust for the confounder with covariate adjustment regression, which retains the intuitive invariance properties of the original similarity measures. We show that deconfounding the similarity metrics increases the resolution of detecting semantically similar neural networks. Moreover, in real-world applications, deconfounding improves the consistency of representation similarities with domain similarities in transfer learning, and increases correlation with out-of-distribution accuracy.
LGAug 31, 2021
SANSformers: Self-Supervised Forecasting in Electronic Health Records with Attention-Free ModelsYogesh Kumar, Alexander Ilin, Henri Salo et al.
Despite the proven effectiveness of Transformer neural networks across multiple domains, their performance with Electronic Health Records (EHR) can be nuanced. The unique, multidimensional sequential nature of EHR data can sometimes make even simple linear models with carefully engineered features more competitive. Thus, the advantages of Transformers, such as efficient transfer learning and improved scalability are not always fully exploited in EHR applications. Addressing these challenges, we introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data. In this work, we aim to forecast the demand for healthcare services, by predicting the number of patient visits to healthcare facilities. The challenge amplifies when dealing with divergent patient subgroups, like those with rare diseases, which are characterized by unique health trajectories and are typically smaller in size. To address this, we employ a self-supervised pretraining strategy, Generative Summary Pretraining (GSP), which predicts future summary statistics based on past health records of a patient. Our models are pretrained on a health registry of nearly one million patients, then fine-tuned for specific subgroup prediction tasks, showcasing the potential to handle the multifaceted nature of EHR data. In evaluation, SANSformer consistently surpasses robust EHR baselines, with our GSP pretraining method notably amplifying model performance, particularly within smaller patient subgroups. Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.