CVMay 1, 2023Code
SelfDocSeg: A Self-Supervised vision-based Approach towards Document SegmentationSubhajit Maity, Sanket Biswas, Siladittya Manna et al.
Document layout analysis is a known problem to the documents research community and has been vastly explored yielding a multitude of solutions ranging from text mining, and recognition to graph-based representation, visual feature extraction, etc. However, most of the existing works have ignored the crucial fact regarding the scarcity of labeled data. With growing internet connectivity to personal life, an enormous amount of documents had been available in the public domain and thus making data annotation a tedious task. We address this challenge using self-supervision and unlike, the few existing self-supervised document segmentation approaches which use text mining and textual labels, we use a complete vision-based approach in pre-training without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images to pre-train an image encoder to learn the document object representation and localization in a self-supervised framework before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs at par with the existing methods and the supervised counterparts, if not outperforms. The code is made publicly available at: https://github.com/MaitySubhajit/SelfDocSeg
CVJul 10, 2025
Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint DetectionSubhajit Maity, Ayan Kumar Bhunia, Subhadeep Koley et al.
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
CVMay 29, 2025
Sketch Down the FLOPs: Towards Efficient Networks for Human SketchAneeshan Sain, Subhajit Maity, Pinaki Nath Chowdhury et al.
As sketch research has collectively matured over time, its adaptation for at-mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on the efficient inference specifically designed for sketch data. In this paper, we first demonstrate existing state-of-the-art efficient light-weight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-n-play manner on any photo efficient network to adapt them to work on sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator as the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing photo efficient networks to be compatible with sketch, which brings down number of FLOPs and model parameters by 97.96% percent and 84.89% respectively. We then exploit the abstract trait of sketch to introduce a RL-based canvas selector that dynamically adjusts to the abstraction level which further cuts down number of FLOPs by two thirds. The end result is an overall reduction of 99.37% of FLOPs (from 40.18G to 0.254G) when compared with a full network, while retaining the accuracy (33.03% vs 32.77%) -- finally making an efficient network for the sparse sketch data that exhibit even fewer FLOPs than the best photo counterpart.
LGMar 13, 2025
Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?Subhajit Maity, Killian Hitsman, Xin Li et al.
Kolmogorov-Arnold networks (KANs) are a remarkable innovation that consists of learnable activation functions, with the potential to capture more complex relationships from data. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep networks, including advanced architectures such as vision Transformers (ViTs). This work asks whether KAN could learn token interactions. In this paper, we design the first learnable attention called Kolmogorov-Arnold Attention (KArAt) for ViTs that can operate on any basis, ranging from Fourier, Wavelets, Splines, to Rational Functions. However, learnable activations in the attention cause a memory explosion. To remedy this, we propose a modular version of KArAt that uses a low-rank approximation. By adopting the Fourier basis, Fourier-KArAt and its variants, in some cases, outperform their traditional softmax counterparts, or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K. We also deploy Fourier KArAt to ConViT and Swin-Transformer, and use it in detection and segmentation with ViT-Det. We dissect the performance of these architectures by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and transferability to other datasets. KArAt's learnable activation yields a better attention score across all ViTs, indicating improved token-to-token interactions and contributing to enhanced inference. Still, its generalizability does not scale with larger ViTs. However, many factors, including the present computing interface, affect the relative performance of parameter- and memory-heavy KArAts. We note that the goal of this paper is not to produce efficient attention or challenge the traditional activations; by designing KArAt, we are the first to show that attention can be learned and encourage researchers to explore KArAt in conjunction with more advanced architectures.
CVJun 27, 2024
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across HeadsAli K. Rahimian, Manish K. Govind, Subhajit Maity et al.
Vision Transformers and their variants have achieved remarkable success in diverse visual perception tasks. Despite their effectiveness, they suffer from two significant limitations. First, the quadratic computational complexity of multi-head self-attention (MHSA), which restricts scalability to large token counts, and second, a high dependency on large-scale training data to attain competitive performance. In this paper, to address these challenges, we propose a novel sparse self-attention mechanism named Fibottention. Fibottention employs structured sparsity patterns derived from the Wythoff array, enabling an $\mathcal{O}(N \log N)$ computational complexity in self-attention. By design, its sparsity patterns vary across attention heads, which provably reduces redundant pairwise interactions while ensuring sufficient and diverse coverage. This leads to an \emph{inception-like functional diversity} in the attention heads, and promotes more informative and disentangled representations. We integrate Fibottention into standard Transformer architectures and conduct extensive experiments across multiple domains, including image classification, video understanding, and robot learning. Results demonstrate that models equipped with Fibottention either significantly outperform or achieve on-par performance with their dense MHSA counterparts, while leveraging only $2\%$ of all pairwise interactions across self-attention heads in typical settings, $2-6\%$ of the pairwise interactions in self-attention heads, resulting in substantial computational savings. Moreover, when compared to existing sparse attention mechanisms, Fibottention consistently achieves superior results on a FLOP-equivalency basis. Finally, we provide an in-depth analysis of the enhanced feature diversity resulting from our attention design and discuss its implications for efficient representation learning.
CVJun 12, 2024
DistilDoc: Knowledge Distillation for Visually-Rich Document ApplicationsJordy Van Landeghem, Subhajit Maity, Ayan Banerjee et al.
This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness.
CVMay 28, 2023
Image Hash Minimization for Tamper DetectionSubhajit Maity, Ram Kumar Karsh
Tamper detection using image hash is a very common problem of modern days. Several research and advancements have already been done to address this problem. However, most of the existing methods lack the accuracy of tamper detection when the tampered area is low, as well as requiring long image hashes. In this paper, we propose a novel method objectively to minimize the hash length while enhancing the performance at low tampered area.
CVOct 24, 2018
Fault Area Detection in Leaf Diseases using k-means ClusteringSubhajit Maity, Sujan Sarkar, Avinaba Tapadar et al.
With increasing population the crisis of food is getting bigger day by day.In this time of crisis,the leaf disease of crops is the biggest problem in the food industry.In this paper, we have addressed that problem and proposed an efficient method to detect leaf disease.Leaf diseases can be detected from sample images of the leaf with the help of image processing and segmentation.Using k-means clustering and Otsu's method the faulty region in a leaf is detected which helps to determine proper course of action to be taken.Further the ratio of normal and faulty region if calculated would be able to predict if the leaf can be cured at all.