Sajid Javed

CV
h-index36
42papers
1,570citations
Novelty39%
AI Score57

42 Papers

CVAug 30, 2023Code
Improving Underwater Visual Tracking With a Large Scale Dataset and Image Enhancement

Basit Alawode, Fayaz Ali Dharejo, Mehnaz Ummar et al.

This paper presents a new dataset and general tracker enhancement method for Underwater Visual Object Tracking (UVOT). Despite its significance, underwater tracking has remained unexplored due to data inaccessibility. It poses distinct challenges; the underwater environment exhibits non-uniform lighting conditions, low visibility, lack of sharpness, low contrast, camouflage, and reflections from suspended particles. Performance of traditional tracking methods designed primarily for terrestrial or open-air scenarios drops in such conditions. We address the problem by proposing a novel underwater image enhancement algorithm designed specifically to boost tracking quality. The method has resulted in a significant performance improvement, of up to 5.0% AUC, of state-of-the-art (SOTA) visual trackers. To develop robust and accurate UVOT methods, large-scale datasets are required. To this end, we introduce a large-scale UVOT benchmark dataset consisting of 400 video segments and 275,000 manually annotated frames enabling underwater training and evaluation of deep trackers. The videos are labelled with several underwater-specific tracking attributes including watercolor variation, target distractors, camouflage, target relative size, and low visibility conditions. The UVOT400 dataset, tracking results, and the code are publicly available on: https://github.com/BasitAlawode/UWVOT400.

CVFeb 28, 2023Code
DFR-FastMOT: Detection Failure Resistant Tracker for Fast Multi-Object Tracking Based on Sensor Fusion

Mohamed Nagy, Majid Khonji, Jorge Dias et al.

Persistent multi-object tracking (MOT) allows autonomous vehicles to navigate safely in highly dynamic environments. One of the well-known challenges in MOT is object occlusion when an object becomes unobservant for subsequent frames. The current MOT methods store objects information, like objects' trajectory, in internal memory to recover the objects after occlusions. However, they retain short-term memory to save computational time and avoid slowing down the MOT method. As a result, they lose track of objects in some occlusion scenarios, particularly long ones. In this paper, we propose DFR-FastMOT, a light MOT method that uses data from a camera and LiDAR sensors and relies on an algebraic formulation for object association and fusion. The formulation boosts the computational time and permits long-term memory that tackles more occlusion scenarios. Our method shows outstanding tracking performance over recent learning and non-learning benchmarks with about 3% and 4% margin in MOTA, respectively. Also, we conduct extensive experiments that simulate occlusion phenomena by employing detectors with various distortion levels. The proposed solution enables superior performance under various distortion levels in detection over current state-of-art methods. Our framework processes about 7,763 frames in 1.48 seconds, which is seven times faster than recent benchmarks. The framework will be available at https://github.com/MohamedNagyMostafa/DFR-FastMOT.

CVAug 22, 2022Code
Learning Branched Fusion and Orthogonal Projection for Face-Voice Association

Muhammad Saad Saeed, Shah Nawaz, Muhammad Haris Khan et al.

Recent years have seen an increased interest in establishing association between faces and voices of celebrities leveraging audio-visual information from YouTube. Prior works adopt metric learning methods to learn an embedding space that is amenable for associated matching and verification tasks. Albeit showing some progress, such formulations are, however, restrictive due to dependency on distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched representation coupled with an effective yet efficient supervision is important towards realizing a discriminative joint embedding space for face-voice association tasks. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints. We coin our proposed mechanism as fusion and orthogonal projection (FOP) and instantiate in a two-stream network. The overall resulting framework is evaluated on VoxCeleb1 and MAV-Celeb datasets with a multitude of tasks, including cross-modal verification and matching. Results reveal that our method performs favourably against the current state-of-the-art methods and our proposed formulation of supervision is more effective and efficient than the ones employed by the contemporary methods. In addition, we leverage cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association. Code is available: \url{https://github.com/msaadsaeed/FOP}

CVDec 10, 2025Code
Cytoplasmic Strings Analysis in Human Embryo Time-Lapse Videos using Deep Learning Framework

Anabia Sohail, Mohamad Alansari, Ahmed Abughali et al.

Infertility is a major global health issue, and while in-vitro fertilization has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic Strings, thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and severely affected by detection and subtle visual appearance. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters. NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: https://github.com/HamadYA/CS_Detection.

CVDec 27, 2025Code
Rethinking Memory Design in SAM-Based Visual Object Tracking

Mohamad Alansari, Muzammal Naseer, Hasan Al Marzouqi et al.

\noindent Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: https://github.com/HamadYA/SAM3_Tracking_Zoo. \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}

CVAug 8, 2023
EFaR 2023: Efficient Face Recognition Competition

Jan Niklas Kolf, Fadi Boutros, Jurek Elliesen et al.

This paper presents the summary of the Efficient Face Recognition Competition (EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB 2023). The competition received 17 submissions from 6 different teams. To drive further development of efficient face recognition models, the submitted solutions are ranked based on a weighted score of the achieved verification accuracies on a diverse set of benchmarks, as well as the deployability given by the number of floating-point operations and model size. The evaluation of submissions is extended to bias, cross-quality, and large-scale recognition benchmarks. Overall, the paper gives an overview of the achieved performance values of the submitted solutions as well as a diverse set of baselines. The submitted solutions use small, efficient network architectures to reduce the computational cost, some solutions apply model quantization. An outlook on possible techniques that are underrepresented in current solutions is given as well.

CVJul 22, 2024Code
Predicting the Best of N Visual Trackers

Basit Alawode, Sajid Javed, Arif Mahmood et al.

We observe that the performance of SOTA visual trackers surprisingly strongly varies across different video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence, we predict the "Best of the N Trackers", called the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects a predicted best performing visual tracker for the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker which keeps predicting best performer after regular temporal intervals. The TP2N is based on self-supervised learning architectures MocoV2, SwAv, BT, and DINO; experiments show that the DINO with ViT-S as a backbone performs the best. The video-level BofN meta-tracker outperforms, by a large margin, existing SOTA trackers on nine standard benchmarks - LaSOT, TrackingNet, GOT-10K, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. Further improvement is achieved by the frame-level BofN meta-tracker effectively handling variations in the tracking scenarios within long sequences. For instance, on GOT-10k, BofN meta-tracker average overlap is 88.7% and 91.1% with video and frame-level settings respectively. The best performing tracker, RTS, achieves 85.20% AO. On VOT2022, BofN expected average overlap is 67.88% and 70.98% with video and frame level settings, compared to the best performing ARTrack, 64.12%. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available on https://github.com/BasitAlawode/Best_of_N_Trackers.

CVJul 13, 2022
Graph CNN for Moving Object Detection in Complex Environments from Unseen Videos

Jhony H. Giraldo, Sajid Javed, Naoufel Werghi et al.

Moving Object Detection (MOD) is a fundamental step for many computer vision applications. MOD becomes very challenging when a video sequence captured from a static or moving camera suffers from the challenges: camouflage, shadow, dynamic backgrounds, and lighting variations, to name a few. Deep learning methods have been successfully applied to address MOD with competitive performance. However, in order to handle the overfitting problem, deep learning methods require a large amount of labeled data which is a laborious task as exhaustive annotations are always not available. Moreover, some MOD deep learning methods show performance degradation in the presence of unseen video sequences because the testing and training splits of the same sequences are involved during the network learning process. In this work, we pose the problem of MOD as a node classification problem using Graph Convolutional Neural Networks (GCNNs). Our algorithm, dubbed as GraphMOD-Net, encompasses instance segmentation, background initialization, feature extraction, and graph construction. GraphMOD-Net is tested on unseen videos and outperforms state-of-the-art methods in unsupervised, semi-supervised, and supervised learning in several challenges of the Change Detection 2014 (CDNet2014) and UCSD background subtraction datasets.

IVApr 8, 2022
Underwater Image Enhancement Using Pre-trained Transformer

Abderrahmene Boudiaf, Yuhang Guo, Adarsh Ghimire et al.

The goal of this work is to apply a denoising image transformer to remove the distortion from underwater images and compare it with other similar approaches. Automatic restoration of underwater images plays an important role since it allows to increase the quality of the images, without the need for more expensive equipment. This is a critical example of the important role of the machine learning algorithms to support marine exploration and monitoring, reducing the need for human intervention like the manual processing of the images, thus saving time, effort, and cost. This paper is the first application of the image transformer-based approach called "Pre-Trained Image Processing Transformer" to underwater images. This approach is tested on the UFO-120 dataset, containing 1500 images with the corresponding clean images.

CVMay 17Code
SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed

Here's a trimmed version under 1920 characters: Open-vocabulary segmentation models such as SAM3 achieve strong performance through concept-level text prompting, yet degrade when the target class is visually underrepresented in pretraining data or when its appearance departs from canonical depictions. Text prompts provide no spatial signal to resolve such ambiguity. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with spatially precise, class-specific point prompts derived from a curated DINOv3 feature bank. During an offline stage, patch-level descriptors are extracted from annotated reference images using a frozen DINOv3 ViT-L/16 backbone and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape between the query image and retrieved prototypes, identifies spatially coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. These point prompts are delivered to SAM3 alongside the class-name text in a single joint grounding pass, enabling the mask decoder to resolve semantic intent and spatial evidence together. SegRAG requires no task-specific training and no synthetic data. On four open-vocabulary benchmarks it achieves consistent gains over the SAM3 text-only baseline, with improvements of up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks representing a zero-shot domain transfer setting, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablation studies confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at https://github.com/boudiafA/SegRAG.

CVMar 14Code
AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding

Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed

The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade-off: the existing literature lacks the large-scale agricultural datasets required for robust model development and evaluation, while current state-of-the-art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision-to-Verified-Knowledge (V2VK) pipeline, a novel generative AI-driven annotation framework that integrates visual captioning with web-augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine-grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that presents broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat's superior performance over other open-source models, including internal and external benchmarks. The results validate that preserving visual detail combined with web-verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at https://github.com/boudiafA/AgriChat .

LGFeb 21, 2023
Higher-order Sparse Convolutions in Graph Neural Networks

Jhony H. Giraldo, Sajid Javed, Arif Mahmood et al.

Graph Neural Networks (GNNs) have been applied to many problems in computer sciences. Capturing higher-order relationships between nodes is crucial to increase the expressive power of GNNs. However, existing methods to capture these relationships could be infeasible for large-scale graphs. In this work, we introduce a new higher-order sparse convolution based on the Sobolev norm of graph signals. Our Sparse Sobolev GNN (S-SobGNN) computes a cascade of filters on each layer with increasing Hadamard powers to get a more diverse set of functions, and then a linear combination layer weights the embeddings of each filter. We evaluate S-SobGNN in several applications of semi-supervised learning. S-SobGNN shows competitive performance in all applications as compared to several state-of-the-art methods.

CVMar 25Code
MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding

Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi et al.

Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce \textbf{MLLM-HWSI}, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales, cell as word, patch as phrase, region as sentence, and WSI as paragraph to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per-patch using a lightweight \textit{Cell-Cell Attention Fusion (CCAF)} transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available at: \href{https://github.com/BasitAlawode/HWSI-MLLM}{GitHub}.

CVSep 27, 2023
Learning Spatial-Temporal Regularized Tensor Sparse RPCA for Background Subtraction

Basit Alawode, Sajid Javed

Video background subtraction is one of the fundamental problems in computer vision that aims to segment all moving objects. Robust principal component analysis has been identified as a promising unsupervised paradigm for background subtraction tasks in the last decade thanks to its competitive performance in a number of benchmark datasets. Tensor robust principal component analysis variations have improved background subtraction performance further. However, because moving object pixels in the sparse component are treated independently and do not have to adhere to spatial-temporal structured-sparsity constraints, performance is reduced for sequences with dynamic backgrounds, camouflaged, and camera jitter problems. In this work, we present a spatial-temporal regularized tensor sparse RPCA algorithm for precise background subtraction. Within the sparse component, we impose spatial-temporal regularizations in the form of normalized graph-Laplacian matrices. To do this, we build two graphs, one across the input tensor spatial locations and the other across its frontal slices in the time domain. While maximizing the objective function, we compel the tensor sparse component to serve as the spatiotemporal eigenvectors of the graph-Laplacian matrices. The disconnected moving object pixels in the sparse component are preserved by the proposed graph-based regularizations since they both comprise of spatiotemporal subspace-based structure. Additionally, we propose a unique objective function that employs batch and online-based optimization methods to jointly maximize the background-foreground and spatial-temporal regularization components. Experiments are performed on six publicly available background subtraction datasets that demonstrate the superior performance of the proposed algorithm compared to several existing methods. Our source code will be available very soon.

CVNov 17, 2023
3D-TexSeg: Unsupervised Segmentation of 3D Texture using Mutual Transformer Learning

Iyyakutti Iyappan Ganapathi, Fayaz Ali, Sajid Javed et al.

Analysis of the 3D Texture is indispensable for various tasks, such as retrieval, segmentation, classification, and inspection of sculptures, knitted fabrics, and biological tissues. A 3D texture is a locally repeated surface variation independent of the surface's overall shape and can be determined using the local neighborhood and its characteristics. Existing techniques typically employ computer vision techniques that analyze a 3D mesh globally, derive features, and then utilize the obtained features for retrieval or classification. Several traditional and learning-based methods exist in the literature, however, only a few are on 3D texture, and nothing yet, to the best of our knowledge, on the unsupervised schemes. This paper presents an original framework for the unsupervised segmentation of the 3D texture on the mesh manifold. We approach this problem as binary surface segmentation, partitioning the mesh surface into textured and non-textured regions without prior annotation. We devise a mutual transformer-based system comprising a label generator and a cleaner. The two models take geometric image representations of the surface mesh facets and label them as texture or non-texture across an iterative mutual learning scheme. Extensive experiments on three publicly available datasets with diverse texture patterns demonstrate that the proposed framework outperforms standard and SOTA unsupervised techniques and competes reasonably with supervised methods.

CVMay 21, 2022
Robot Person Following in Uniform Crowd Environment

Adarsh Ghimire, Xiaoxiong Zhang, Sajid Javed et al.

Person-tracking robots have many applications, such as in security, elderly care, and socializing robots. Such a task is particularly challenging when the person is moving in a Uniform crowd. Also, despite significant progress of trackers reported in the literature, state-of-the-art trackers have hardly addressed person following in such scenarios. In this work, we focus on improving the perceptivity of a robot for a person following task by developing a robust and real-time applicable object tracker. We present a new robot person tracking system with a new RGB-D tracker, Deep Tracking with RGB-D (DTRD) that is resilient to tricky challenges introduced by the uniform crowd environment. Our tracker utilizes transformer encoder-decoder architecture with RGB and depth information to discriminate the target person from similar distractors. A substantial amount of comprehensive experiments and results demonstrate that our tracker has higher performance in two quantitative evaluation metrics and confirms its superiority over other SOTA trackers.

CVSep 2, 2022
Person Monitoring by Full Body Tracking in Uniform Crowd Environment

Zhibo Zhang, Omar Alremeithi, Maryam Almheiri et al.

Full body trackers are utilized for surveillance and security purposes, such as person-tracking robots. In the Middle East, uniform crowd environments are the norm which challenges state-of-the-art trackers. Despite tremendous improvements in tracker technology documented in the past literature, these trackers have not been trained using a dataset that captures these environments. In this work, we develop an annotated dataset with one specific target per video in a uniform crowd environment. The dataset was generated in four different scenarios where mainly the target was moving alongside the crowd, sometimes occluding with them, and other times the camera's view of the target is blocked by the crowd for a short period. After the annotations, it was used in evaluating and fine-tuning a state-of-the-art tracker. Our results have shown that the fine-tuned tracker performed better on the evaluation dataset based on two quantitative evaluation metrics, compared to the initial pre-trained tracker.

CVApr 19, 2022
Real-Time Face Recognition System

Adarsh Ghimire, Naoufel Werghi, Sajid Javed et al.

Over the past few decades, interest in algorithms for face recognition has been growing rapidly and has even surpassed human-level performance. Despite their accomplishments, their practical integration with a real-time performance-hungry system is not feasible due to high computational costs. So in this paper, we explore the recent, fast, and accurate face recognition system that can be easily integrated with real-time devices, and tested the algorithms on robot hardware platforms to confirm their robustness and speed.

CVApr 26, 2025Code
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Shahad Albastaki, Anabia Sohail, Iyyakutti Iyappan Ganapathi et al.

In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Our model aims to capture a broader range of information, supported by novel loss functions, enriches feature representation, improves discriminative ability, and enhances generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP

CVMar 12Code
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Mohamad Alansari, Naufal Suryanto, Divya Velayudhan et al.

Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW

CVMay 29, 2025Code
CLDTracker: A Comprehensive Language Description for Visual Tracking

Mohamad Alansari, Sajid Javed, Iyyakutti Iyappan Ganapathi et al.

VOT remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in VLMs have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain, leading to a disconnect between the initial description and the object's subsequent visual changes. To bridge these gaps and unlock the full potential of VLMs for VOT, we propose CLDTracker, a novel Comprehensive Language Description framework for robust visual Tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. In the textual branch, we construct a rich bag of textual descriptions derived by harnessing the powerful VLMs such as CLIP and GPT-4V, enriched with semantic and contextual cues to address the lack of rich textual representation. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance, validating the effectiveness of leveraging robust and temporally-adaptive vision-language representations for tracking. Code and models are publicly available at: https://github.com/HamadYA/CLDTracker

IVMar 26, 2025Code
Underwater Image Enhancement by Convolutional Spiking Neural Networks

Vidya Sudevan, Fakhreddine Zayer, Rizwana Kausar et al.

Underwater image enhancement (UIE) is fundamental for marine applications, including autonomous vision-based navigation. Deep learning methods using convolutional neural networks (CNN) and vision transformers advanced UIE performance. Recently, spiking neural networks (SNN) have gained attention for their lightweight design, energy efficiency, and scalability. This paper introduces UIE-SNN, the first SNN-based UIE algorithm to improve visibility of underwater images. UIE-SNN is a 19- layered convolutional spiking encoder-decoder framework with skip connections, directly trained using surrogate gradient-based backpropagation through time (BPTT) strategy. We explore and validate the influence of training datasets on energy reduction, a unique advantage of UIE-SNN architecture, in contrast to the conventional learning-based architectures, where energy consumption is model-dependent. UIE-SNN optimizes the loss function in latent space representation to reconstruct clear underwater images. Our algorithm performs on par with its non-spiking counterpart methods in terms of PSNR and structural similarity index (SSIM) at reduced timesteps ($T=5$) and energy consumption of $85\%$. The algorithm is trained on two publicly available benchmark datasets, UIEB and EUVP, and tested on unseen images from UIEB, EUVP, LSUI, U45, and our custom UIE dataset. The UIE-SNN algorithm achieves PSNR of \(17.7801~dB\) and SSIM of \(0.7454\) on UIEB, and PSNR of \(23.1725~dB\) and SSIM of \(0.7890\) on EUVP. UIE-SNN achieves this algorithmic performance with fewer operators (\(147.49\) GSOPs) and energy (\(0.1327~J\)) compared to its non-spiking counterpart (GFLOPs = \(218.88\) and Energy=\(1.0068~J\)). Compared with existing SOTA UIE methods, UIE-SNN achieves an average of \(6.5\times\) improvement in energy efficiency. The source code is available at \href{https://github.com/vidya-rejul/UIE-SNN.git}{UIE-SNN}.

CVJun 9, 2025Code
Spatio-Temporal State Space Model For Efficient Event-Based Optical Flow

Muhammad Ahmed Humais, Xiaoqian Huang, Hussain Sajwani et al.

Event cameras unlock new frontiers that were previously unthinkable with standard frame-based cameras. One notable example is low-latency motion estimation (optical flow), which is critical for many real-time applications. In such applications, the computational efficiency of algorithms is paramount. Although recent deep learning paradigms such as CNN, RNN, or ViT have shown remarkable performance, they often lack the desired computational efficiency. Conversely, asynchronous event-based methods including SNNs and GNNs are computationally efficient; however, these approaches fail to capture sufficient spatio-temporal information, a powerful feature required to achieve better performance for optical flow estimation. In this work, we introduce Spatio-Temporal State Space Model (STSSM) module along with a novel network architecture to develop an extremely efficient solution with competitive performance. Our STSSM module leverages state-space models to effectively capture spatio-temporal correlations in event data, offering higher performance with lower complexity compared to ViT, CNN-based architectures in similar settings. Our model achieves 4.5x faster inference and 8x lower computations compared to TMA and 2x lower computations compared to EV-FlowNet with competitive performance on the DSEC benchmark. Our code will be available at https://github.com/AhmedHumais/E-STMFlow

CVJun 3, 2025Code
MVTD: A Benchmark Dataset for Maritime Visual Object Tracking

Ahsan Baidar Bakht, Muhayy Ud Din, Sajid Javed et al.

Visual Object Tracking (VOT) is a fundamental task with widespread applications in autonomous navigation, surveillance, and maritime robotics. Despite significant advances in generic object tracking, maritime environments continue to present unique challenges, including specular water reflections, low-contrast targets, dynamically changing backgrounds, and frequent occlusions. These complexities significantly degrade the performance of state-of-the-art tracking algorithms, highlighting the need for domain-specific datasets. To address this gap, we introduce the Maritime Visual Tracking Dataset (MVTD), a comprehensive and publicly available benchmark specifically designed for maritime VOT. MVTD comprises 182 high-resolution video sequences, totaling approximately 150,000 frames, and includes four representative object classes: boat, ship, sailboat, and unmanned surface vehicle (USV). The dataset captures a diverse range of operational conditions and maritime scenarios, reflecting the real-world complexities of maritime environments. We evaluated 14 recent SOTA tracking algorithms on the MVTD benchmark and observed substantial performance degradation compared to their performance on general-purpose datasets. However, when fine-tuned on MVTD, these models demonstrate significant performance gains, underscoring the effectiveness of domain adaptation and the importance of transfer learning in specialized tracking contexts. The MVTD dataset fills a critical gap in the visual tracking community by providing a realistic and challenging benchmark for maritime scenarios. Dataset and Source Code can be accessed here "https://github.com/AhsanBaidar/MVTD".

CVJan 20, 2025
Anomaly Detection for Industrial Applications, Its Challenges, Solutions, and Future Directions: A Review

Abdelrahman Alzarooni, Ehtesham Iqbal, Samee Ullah Khan et al.

Anomaly detection from images captured using camera sensors is one of the mainstream applications at the industrial level. Particularly, it maintains the quality and optimizes the efficiency in production processes across diverse industrial tasks, including advanced manufacturing and aerospace engineering. Traditional anomaly detection workflow is based on a manual inspection by human operators, which is a tedious task. Advances in intelligent automated inspection systems have revolutionized the Industrial Anomaly Detection (IAD) process. Recent vision-based approaches can automatically extract, process, and interpret features using computer vision and align with the goals of automation in industrial operations. In light of the shift in inspection methodologies, this survey reviews studies published since 2019, with a specific focus on vision-based anomaly detection. The components of an IAD pipeline that are overlooked in existing surveys are presented, including areas related to data acquisition, preprocessing, learning mechanisms, and evaluation. In addition to the collected publications, several scientific and industry-related challenges and their perspective solutions are highlighted. Popular and relevant industrial datasets are also summarized, providing further insight into inspection applications. Finally, future directions of vision-based IAD are discussed, offering researchers insight into the state-of-the-art of industrial inspection.

CVFeb 3, 2025
AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Basit Alawode, Iyyakutti Iyappan Ganapathi, Sajid Javed et al.

The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a 2 million underwater image-text paired dataset using heterogeneous resources, including YouTube, Netflix, NatGeo, etc. To fine-tune AquaticCLIP, we propose a prompt-guided vision encoder that progressively aggregates patch features via learnable prompts, while a vision-guided mechanism enhances the language encoder by incorporating visual context. The model is optimized through a contrastive pretraining loss to align visual and textual modalities. AquaticCLIP achieves notable performance improvements in zero-shot settings across multiple underwater computer vision tasks, outperforming existing methods in both robustness and interpretability. Our model sets a new benchmark for vision-language applications in underwater environments. The code and dataset for AquaticCLIP are publicly available on GitHub at xxx.

CVApr 16, 2024
Neuromorphic Vision-based Motion Segmentation with Graph Transformer Neural Network

Yusra Alkendi, Rana Azzam, Sajid Javed et al.

Moving object segmentation is critical to interpret scene dynamics for robotic navigation systems in challenging environments. Neuromorphic vision sensors are tailored for motion perception due to their asynchronous nature, high temporal resolution, and reduced power consumption. However, their unconventional output requires novel perception paradigms to leverage their spatially sparse and temporally dense nature. In this work, we propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs by a series of nonlinear transformations to unveil local and global spatiotemporal correlations between events. Based on these correlations, events belonging to moving objects are segmented from the background without prior knowledge of the dynamic scene geometry. The algorithm is trained on publicly available datasets including MOD, EV-IMO, and \textcolor{black}{EV-IMO2} using the proposed training scheme to facilitate efficient training on extensive datasets. Moreover, we introduce the Dynamic Object Mask-aware Event Labeling (DOMEL) approach for generating approximate ground-truth labels for event-based motion segmentation datasets. We use DOMEL to label our own recorded Event dataset for Motion Segmentation (EMS-DOMEL), which we release to the public for further research and benchmarking. Rigorous experiments are conducted on several unseen publicly-available datasets where the results revealed that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities. GTNN achieves significant performance gains with an average increase of 9.4% and 4.5% in terms of motion segmentation accuracy (IoU%) and detection rate (DR%), respectively.

CVJan 2, 2024
Accurate and Efficient Urban Street Tree Inventory with Deep Learning on Mobile Phone Imagery

Asim Khan, Umair Nawaz, Anwaar Ulhaq et al.

Deforestation, a major contributor to climate change, poses detrimental consequences such as agricultural sector disruption, global warming, flash floods, and landslides. Conventional approaches to urban street tree inventory suffer from inaccuracies and necessitate specialised equipment. To overcome these challenges, this paper proposes an innovative method that leverages deep learning techniques and mobile phone imaging for urban street tree inventory. Our approach utilises a pair of images captured by smartphone cameras to accurately segment tree trunks and compute the diameter at breast height (DBH). Compared to traditional methods, our approach exhibits several advantages, including superior accuracy, reduced dependency on specialised equipment, and applicability in hard-to-reach areas. We evaluated our method on a comprehensive dataset of 400 trees and achieved a DBH estimation accuracy with an error rate of less than 2.5%. Our method holds significant potential for substantially improving forest management practices. By enhancing the accuracy and efficiency of tree inventory, our model empowers urban management to mitigate the adverse effects of deforestation and climate change.

CVApr 13, 2025
snnTrans-DHZ: A Lightweight Spiking Neural Network Architecture for Underwater Image Dehazing

Vidya Sudevan, Fakhreddine Zayer, Rizwana Kausar et al.

Underwater image dehazing is critical for vision-based marine operations because light scattering and absorption can severely reduce visibility. This paper introduces snnTrans-DHZ, a lightweight Spiking Neural Network (SNN) specifically designed for underwater dehazing. By leveraging the temporal dynamics of SNNs, snnTrans-DHZ efficiently processes time-dependent raw image sequences while maintaining low power consumption. Static underwater images are first converted into time-dependent sequences by repeatedly inputting the same image over user-defined timesteps. These RGB sequences are then transformed into LAB color space representations and processed concurrently. The architecture features three key modules: (i) a K estimator that extracts features from multiple color space representations; (ii) a Background Light Estimator that jointly infers the background light component from the RGB-LAB images; and (iii) a soft image reconstruction module that produces haze-free, visibility-enhanced outputs. The snnTrans-DHZ model is directly trained using a surrogate gradient-based backpropagation through time (BPTT) strategy alongside a novel combined loss function. Evaluated on the UIEB benchmark, snnTrans-DHZ achieves a PSNR of 21.68 dB and an SSIM of 0.8795, and on the EUVP dataset, it yields a PSNR of 23.46 dB and an SSIM of 0.8439. With only 0.5670 million network parameters, and requiring just 7.42 GSOPs and 0.0151 J of energy, the algorithm significantly outperforms existing state-of-the-art methods in terms of efficiency. These features make snnTrans-DHZ highly suitable for deployment in underwater robotics, marine exploration, and environmental monitoring.

IVOct 18, 2024
Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review

Ahmad Obeid, Said Boumaraf, Anabia Sohail et al.

Recent years witnessed remarkable progress in computational histopathology, largely fueled by deep learning. This brought the clinical adoption of deep learning-based tools within reach, promising significant benefits to healthcare, offering a valuable second opinion on diagnoses, streamlining complex tasks, and mitigating the risks of inconsistency and bias in clinical decisions. However, a well-known challenge is that deep learning models may contain up to billions of parameters; supervising their training effectively would require vast labeled datasets to achieve reliable generalization and noise resilience. In medical imaging, particularly histopathology, amassing such extensive labeled data collections places additional demands on clinicians and incurs higher costs, which hinders the art's progress. Addressing this challenge, researchers devised various strategies for leveraging deep learning with limited data and annotation availability. In this paper, we present a comprehensive review of deep learning applications in histopathology, with a focus on the challenges posed by data scarcity over the past decade. We systematically categorize and compare various approaches, evaluate their distinct contributions using benchmarking tables, and highlight their respective advantages and limitations. Additionally, we address gaps in existing reviews and identify underexplored research opportunities, underscoring the potential for future advancements in this field.

CVFeb 20
MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method

Ahsan Baidar Bakht, Mohamad Alansari, Muhayy Ud Din et al.

Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB only, limiting robustness under severe color distortion, turbidity, and low visibility conditions. We introduce MUOT_3M, the first pseudo multimodal UOT benchmark comprising 3 million frames from 3,030 videos (27.8h) annotated with 32 tracking attributes, 677 fine grained classes, and synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal to unimodal tracker featuring visual geometric alignment, vision language fusion, and four level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.

CVJun 7, 2024
CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment

Sajid Javed, Arif Mahmood, Iyyakutti Iyappan Ganapathi et al.

This paper proposes Comprehensive Pathology Language Image Pre-training (CPLIP), a new unsupervised technique designed to enhance the alignment of images and text in histopathology for tasks such as classification and segmentation. This methodology enriches vision-language models by leveraging extensive data without needing ground truth annotations. CPLIP involves constructing a pathology-specific dictionary, generating textual descriptions for images using language models, and retrieving relevant images for each text snippet via a pre-trained model. The model is then fine-tuned using a many-to-many contrastive learning method to align complex interrelated concepts across both modalities. Evaluated across multiple histopathology tasks, CPLIP shows notable improvements in zero-shot learning scenarios, outperforming existing methods in both interpretability and robustness and setting a higher benchmark for the application of vision-language models in the field. To encourage further research and replication, the code for CPLIP is available on GitHub at https://cplip.github.io/

CVMay 3, 2023
Unsupervised Mutual Transformer Learning for Multi-Gigapixel Whole Slide Image Classification

Sajid Javed, Arif Mahmood, Talha Qaiser et al.

Classification of gigapixel Whole Slide Images (WSIs) is an important prediction task in the emerging area of computational pathology. There has been a surge of research in deep learning models for WSI classification with clinical applications such as cancer detection or prediction of molecular mutations from WSIs. Most methods require expensive and labor-intensive manual annotations by expert pathologists. Weakly supervised Multiple Instance Learning (MIL) methods have recently demonstrated excellent performance; however, they still require large slide-level labeled training datasets that need a careful inspection of each slide by an expert pathologist. In this work, we propose a fully unsupervised WSI classification algorithm based on mutual transformer learning. Instances from gigapixel WSI (i.e., image patches) are transformed into a latent space and then inverse-transformed to the original space. Using the transformation loss, pseudo-labels are generated and cleaned using a transformer label-cleaner. The proposed transformer-based pseudo-label generation and cleaning modules mutually train each other iteratively in an unsupervised manner. A discriminative learning mechanism is introduced to improve normal versus cancerous instance labeling. In addition to unsupervised classification, we demonstrate the effectiveness of the proposed framework for weak supervision for cancer subtype classification as downstream analysis. Extensive experiments on four publicly available datasets show excellent performance compared to the state-of-the-art methods. We intend to make the source code of our algorithm publicly available soon.

CVDec 17, 2021
Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers

Yusra Alkendi, Rana Azzam, Abdulla Ayyad et al.

Neuromorphic vision is a bio-inspired technology that has triggered a paradigm shift in the computer-vision community and is serving as a key-enabler for a multitude of applications. This technology has offered significant advantages including reduced power consumption, reduced processing needs, and communication speed-ups. However, neuromorphic cameras suffer from significant amounts of measurement noise. This noise deteriorates the performance of neuromorphic event-based perception and navigation algorithms. In this paper, we propose a novel noise filtration algorithm to eliminate events which do not represent real log-intensity variations in the observed scene. We employ a Graph Neural Network (GNN)-driven transformer algorithm, called GNN-Transformer, to classify every active event pixel in the raw stream into real-log intensity variation or noise. Within the GNN, a message-passing framework, called EventConv, is carried out to reflect the spatiotemporal correlation among the events, while preserving their asynchronous nature. We also introduce the Known-object Ground-Truth Labeling (KoGTL) approach for generating approximate ground truth labels of event streams under various illumination conditions. KoGTL is used to generate labeled datasets, from experiments recorded in chalenging lighting conditions. These datasets are used to train and extensively test our proposed algorithm. When tested on unseen datasets, the proposed algorithm outperforms existing methods by 8.8% in terms of filtration accuracy. Additional tests are also conducted on publicly available datasets to demonstrate the generalization capabilities of the proposed algorithm in the presence of illumination variations and different motion dynamics. Compared to existing solutions, qualitative results verified the superior capability of the proposed algorithm to eliminate noise while preserving meaningful scene events.

CVDec 6, 2021
Visual Object Tracking with Discriminative Filters and Siamese Networks: A Survey and Outlook

Sajid Javed, Martin Danelljan, Fahad Shahbaz Khan et al.

Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems. It entails estimating the trajectory of the target in an image sequence, given only its initial location, and segmentation, or its rough approximation in the form of a bounding box. Discriminative Correlation Filters (DCFs) and deep Siamese Networks (SNs) have emerged as dominating tracking paradigms, which have led to significant progress. Following the rapid evolution of visual object tracking in the last decade, this survey presents a systematic and thorough review of more than 90 DCFs and Siamese trackers, based on results in nine tracking benchmarks. First, we present the background theory of both the DCF and Siamese tracking core formulations. Then, we distinguish and comprehensively review the shared as well as specific open research challenges in both these tracking paradigms. Furthermore, we thoroughly analyze the performance of DCF and Siamese trackers on nine benchmarks, covering different experimental aspects of visual tracking: datasets, evaluation metrics, performance, and speed comparisons. We finish the survey by presenting recommendations and suggestions for distinguished open challenges based on our analysis.

CVDec 6, 2018
Handcrafted and Deep Trackers: Recent Visual Object Tracking Approaches and Trends

Mustansar Fiaz, Arif Mahmood, Sajid Javed et al.

In recent years visual object tracking has become a very active research area. An increasing number of tracking algorithms are being proposed each year. It is because tracking has wide applications in various real world problems such as human-computer interaction, autonomous vehicles, robotics, surveillance and security just to name a few. In the current study, we review latest trends and advances in the tracking area and evaluate the robustness of different trackers based on the feature extraction methods. The first part of this work comprises a comprehensive survey of the recently proposed trackers. We broadly categorize trackers into Correlation Filter based Trackers (CFTs) and Non-CFTs. Each category is further classified into various types based on the architecture and the tracking mechanism. In the second part, we experimentally evaluated 24 recent trackers for robustness, and compared handcrafted and deep feature based trackers. We observe that trackers using deep features performed better, though in some cases a fusion of both increased performance significantly. In order to overcome the drawbacks of the existing benchmarks, a new benchmark Object Tracking and Temple Color (OTTC) has also been proposed and used in the evaluation of different algorithms. We analyze the performance of trackers over eleven different challenges in OTTC, and three other benchmarks. Our study concludes that Discriminative Correlation Filter (DCF) based trackers perform better than the others. Our study also reveals that inclusion of different types of regularizations over DCF often results in boosted tracking performance. Finally, we sum up our study by pointing out some insights and indicating future trends in visual object tracking field.

CVNov 13, 2018
Deep Neural Network Concepts for Background Subtraction: A Systematic Review and Comparative Evaluation

Thierry Bouwmans, Sajid Javed, Maryam Sultana et al.

Conventional neural networks show a powerful framework for background subtraction in video acquired by static cameras. Indeed, the well-known SOBS method and its variants based on neural networks were the leader methods on the largescale CDnet 2012 dataset during a long time. Recently, convolutional neural networks which belong to deep learning methods were employed with success for background initialization, foreground detection and deep learned features. Currently, the top current background subtraction methods in CDnet 2014 are based on deep neural networks with a large gap of performance in comparison on the conventional unsupervised approaches based on multi-features or multi-cues strategies. Furthermore, a huge amount of papers was published since 2016 when Braham and Van Droogenbroeck published their first work on CNN applied to background subtraction providing a regular gain of performance. In this context, we provide the first review of deep neural network concepts in background subtraction for novices and experts in order to analyze this success and to provide further directions. For this, we first surveyed the methods used background initialization, background subtraction and deep learned features. Then, we discuss the adequacy of deep neural networks for background subtraction. Finally, experimental results are presented on the CDnet 2014 dataset.

CVNov 5, 2018
Unsupervised RGBD Video Object Segmentation Using GANs

Maryam Sultana, Arif Mahmood, Sajid Javed et al.

Video object segmentation is a fundamental step in many advanced vision applications. Most existing algorithms are based on handcrafted features such as HOG, super-pixel segmentation or texture-based techniques, while recently deep features have been found to be more efficient. Existing algorithms observe performance degradation in the presence of challenges such as illumination variations, shadows, and color camouflage. To handle these challenges we propose a fusion based moving object segmentation algorithm which exploits color as well as depth information using GAN to achieve more accuracy. Our goal is to segment moving objects in the presence of challenging background scenes, in real environments. To address this problem, GAN is trained in an unsupervised manner on color and depth information independently with challenging video sequences. During testing, the trained GAN generates backgrounds similar to that in the test sample. The generated background samples are then compared with the test sample to segment moving objects. The final result is computed by fusion of object boundaries in both modalities, RGB and the depth. The comparison of our proposed algorithm with five state-of-the-art methods on publicly available dataset has shown the strength of our algorithm for moving object segmentation in videos in the presence of challenging real scenarios.

CVMay 21, 2018
Unsupervised Deep Context Prediction for Background Foreground Separation

Maryam Sultana, Arif Mahmood, Sajid Javed et al.

In many advanced video based applications background modeling is a pre-processing step to eliminate redundant data, for instance in tracking or video surveillance applications. Over the past years background subtraction is usually based on low level or hand-crafted features such as raw color components, gradients, or local binary patterns. The background subtraction algorithms performance suffer in the presence of various challenges such as dynamic backgrounds, photometric variations, camera jitters, and shadows. To handle these challenges for the purpose of accurate background modeling we propose a unified framework based on the algorithm of image inpainting. It is an unsupervised visual feature learning hybrid Generative Adversarial algorithm based on context prediction. We have also presented the solution of random region inpainting by the fusion of center region inpaiting and random region inpainting with the help of poisson blending technique. Furthermore we also evaluated foreground object detection with the fusion of our proposed method and morphological operations. The comparison of our proposed method with 12 state-of-the-art methods shows its stability in the application of background estimation and foreground detection.

CVJan 29, 2018
Comparative Study of ECO and CFNet Trackers in Noisy Environment

Mustansar Fiaz, Sajid Javed, Arif Mahmood et al.

Object tracking is one of the most challenging task and has secured significant attention of computer vision researchers in the past two decades. Recent deep learning based trackers have shown good performance on various tracking challenges. A tracking method should track objects in sequential frames accurately in challenges such as deformation, low resolution, occlusion, scale and light variations. Most trackers achieve good performance on specific challenges instead of all tracking problems, hence there is a lack of general purpose tracking algorithms that can perform well in all conditions. Moreover, performance of tracking techniques has not been evaluated in noisy environments. Visual object tracking has real world applications and there is good chance that noise may get added during image acquisition in surveillance cameras. We aim to study the robustness of two state of the art trackers in the presence of noise including Efficient Convolutional Operators (ECO) and Correlation Filter Network (CFNet). Our study demonstrates that the performance of these trackers degrades as the noise level increases, which demonstrate the need to design more robust tracking algorithms.

ITNov 26, 2017
Robust Subspace Learning: Robust PCA, Robust Subspace Tracking, and Robust Subspace Recovery

Namrata Vaswani, Thierry Bouwmans, Sajid Javed et al.

PCA is one of the most widely used dimension reduction techniques. A related easier problem is "subspace learning" or "subspace estimation". Given relatively clean data, both are easily solved via singular value decomposition (SVD). The problem of subspace learning or PCA in the presence of outliers is called robust subspace learning or robust PCA (RPCA). For long data sequences, if one tries to use a single lower dimensional subspace to represent the data, the required subspace dimension may end up being quite large. For such data, a better model is to assume that it lies in a low-dimensional subspace that can change over time, albeit gradually. The problem of tracking such data (and the subspaces) while being robust to outliers is called robust subspace tracking (RST). This article provides a magazine-style overview of the entire field of robust subspace learning and tracking. In particular solutions for three problems are discussed in detail: RPCA via sparse+low-rank matrix decomposition (S+LR), RST via S+LR, and "robust subspace recovery (RSR)". RSR assumes that an entire data vector is either an outlier or an inlier. The S+LR formulation instead assumes that outliers occur on only a few data vector indices and hence are well modeled as sparse corruptions.

CVNov 4, 2015
Decomposition into Low-rank plus Additive Matrices for Background/Foreground Separation: A Review for a Comparative Evaluation with a Large-Scale Dataset

Thierry Bouwmans, Andrews Sobral, Sajid Javed et al.

Recent research on problem formulations based on decomposition into low-rank plus sparse matrices shows a suitable framework to separate moving objects from the background. The most representative problem formulation is the Robust Principal Component Analysis (RPCA) solved via Principal Component Pursuit (PCP) which decomposes a data matrix in a low-rank matrix and a sparse matrix. However, similar robust implicit or explicit decompositions can be made in the following problem formulations: Robust Non-negative Matrix Factorization (RNMF), Robust Matrix Completion (RMC), Robust Subspace Recovery (RSR), Robust Subspace Tracking (RST) and Robust Low-Rank Minimization (RLRM). The main goal of these similar problem formulations is to obtain explicitly or implicitly a decomposition into low-rank matrix plus additive matrices. In this context, this work aims to initiate a rigorous and comprehensive review of the similar problem formulations in robust subspace learning and tracking based on decomposition into low-rank plus additive matrices for testing and ranking existing algorithms for background/foreground separation. For this, we first provide a preliminary review of the recent developments in the different problem formulations which allows us to define a unified view that we called Decomposition into Low-rank plus Additive Matrices (DLAM). Then, we examine carefully each method in each robust subspace learning/tracking frameworks with their decomposition, their loss functions, their optimization problem and their solvers. Furthermore, we investigate if incremental algorithms and real-time implementations can be achieved for background/foreground separation. Finally, experimental results on a large-scale dataset called Background Models Challenge (BMC 2012) show the comparative performance of 32 different robust subspace learning/tracking methods.