Puskal Khadka

CV
h-index: 5
3 papers
8 citations
Novelty: 40%
AI Score: 44

3 Papers

21.1 | CV | Mar 20 | Code
MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models

Puskal Khadka, KC Santosh

State Space Models (SSMs), especially the recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches rely primarily on applying various traversal strategies to the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, alongside architectural enhancements, we incorporate an adaptive weighting mechanism to fuse the outputs of multiple scans. MFil-Mamba outperforms existing state-of-the-art models across benchmarks covering image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.
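
The abstract does not say how the adaptive weighting is computed. As a rough illustration only, the sketch below assumes one learnable score (logit) per scan, normalized with a softmax and used to take a weighted sum of the scan outputs. The function names `softmax` and `fuse_scans`, and the choice of softmax itself, are hypothetical and are not taken from the MFil-Mamba code.

```python
import math


def softmax(scores):
    """Turn raw per-scan scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scans(scan_outputs, logits):
    """Weighted sum of several scan outputs (hypothetical fusion rule).

    scan_outputs: list of equal-length feature lists, one per scan.
    logits: one raw score per scan; in a real model these would be learned.
    """
    if len(scan_outputs) != len(logits):
        raise ValueError("need exactly one logit per scan output")
    weights = softmax(logits)
    length = len(scan_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, scan_outputs))
        for i in range(length)
    ]


if __name__ == "__main__":
    # Two toy "scans" over the same two positions; equal logits give a plain average.
    print(fuse_scans([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

With equal logits the fusion reduces to a plain average; as one logit grows, the fused output moves toward that scan's features.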

CV | Sep 10, 2025 | Code
CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision

Puskal Khadka, Rodrigue Rizk, Longwei Wang et al.

Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction on small datasets, particularly because they lack key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks, including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights are available at https://github.com/puskal-khadka/coswin

IR | Aug 16, 2023
Content-based Recommendation Engine for Video Streaming Platform

Puskal Khadka, Prabhav Lamichhane

Recommendation engines use machine learning algorithms to suggest content, products, or services to users. This paper proposes a content-based recommendation engine that provides personalized video suggestions based on users' previous interactions and preferences. The engine uses the TF-IDF (Term Frequency-Inverse Document Frequency) text vectorization technique to weigh the relevance of words in video descriptions, then computes the cosine similarity between content items to determine how similar they are. The system's performance is evaluated using precision, recall, and F1-score metrics. Experimental results demonstrate the effectiveness of content-based filtering in delivering relevant and personalized video recommendations to users. This approach can enhance user engagement on video streaming platforms and reduce search time, providing a more intuitive, preference-based viewing experience.
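
The pipeline described in this abstract (TF-IDF vectors over video descriptions, cosine similarity between items, then precision, recall, and F1) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the toy `CATALOG`, the smoothed IDF formula, the tokenizer, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical toy catalog of video descriptions (not data from the paper).
CATALOG = [
    "space documentary about planets and stars",
    "cooking show with pasta recipes",
    "documentary exploring stars and galaxies in space",
    "pasta and pizza cooking tutorial",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_index, docs, k=2):
    """Indices of the k items most similar to the item the user liked."""
    vecs = tfidf_vectors(docs)
    scored = [
        (cosine(vecs[liked_index], v), i)
        for i, v in enumerate(vecs)
        if i != liked_index
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [i for _, i in scored[:k]]


def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall, and F1 for one recommendation list."""
    rec, rel = set(recommended), set(relevant)
    hits = len(rec & rel)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(rel) if rel else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    top = recommend(0, CATALOG, k=2)
    print(top, precision_recall_f1(top, relevant=[2]))
```

In this toy catalog, a user who liked the first space documentary is steered to the other space documentary first, since the two descriptions share the most heavily weighted terms. In practice a library vectorizer would likely replace the hand-rolled TF-IDF.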