Ali Zia

CV
h-index: 33
24 papers
131 citations
Novelty: 38%
AI Score: 50

24 Papers

LG · Sep 26, 2023 · Code
ICML 2023 Topological Deep Learning Challenge: Design and Results

Mathilde Papillon, Mustafa Hajij, Helen Jenne et al.

This paper presents the computational challenge on topological deep learning that was hosted within the ICML 2023 Workshop on Topology and Geometry in Machine Learning. The competition asked participants to provide open-source implementations of topological neural networks from the literature by contributing to the Python packages TopoNetX (data processing) and TopoModelX (deep learning). The challenge attracted twenty-eight qualifying submissions in its two-month duration. This paper describes the design of the challenge and summarizes its main findings.

LG · Feb 8, 2023
Topological Deep Learning: A Review of an Emerging Paradigm

Ali Zia, Abdelwahed Khamis, James Nichols et al.

Topological data analysis (TDA) provides insight into data shape. The summaries obtained by these methods are principled global descriptions of multi-dimensional data whilst exhibiting stable properties such as robustness to deformation and noise. Such properties are desirable in deep learning pipelines but they are typically obtained using non-TDA strategies. This is partly caused by the difficulty of combining TDA constructs (e.g. barcodes and persistence diagrams) with current deep learning algorithms. Fortunately, we are now witnessing a growth of deep learning applications embracing topologically-guided components. In this survey, we review the nascent field of topological deep learning by first revisiting the core concepts of TDA. We then explore how the use of TDA techniques has evolved over time to support deep learning frameworks, and how they can be integrated into different aspects of deep learning. Furthermore, we touch on TDA usage for analyzing existing deep models, i.e. deep topological analytics. Finally, we discuss the challenges and future prospects of topological deep learning.
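
To make the notion of a persistence diagram concrete, the following is a minimal sketch (not taken from the survey) of 0-dimensional persistence for a point cloud. It assumes a Vietoris-Rips-style filtration in which an edge enters at its Euclidean length, so every component is born at 0 and dies when it merges with another one. The function name `h0_persistence` is hypothetical.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return (birth, death) pairs for connected components.

    Assumption: an edge between two points enters the filtration at
    their Euclidean distance (some libraries use half that value).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise edges from shortest to longest (Kruskal order).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one of them dies at this scale.
            parent[root_i] = root_j
            pairs.append((0.0, length))
    # The last surviving component never dies.
    pairs.append((0.0, math.inf))
    return pairs


diagram = h0_persistence([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)])
```

Long-lived pairs (large death values) correspond to well-separated clusters, which is the kind of stable, global summary the abstract refers to.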

CV · Jun 25, 2023
High Spectral Spatial Resolution Synthetic HyperSpectral Dataset from multi-source fusion

Yajie Sun, Ali Zia, Jun Zhou

This research paper introduces a synthetic hyperspectral dataset that combines high spectral and spatial resolution imaging to achieve a comprehensive, accurate, and detailed representation of observed scenes or objects. Obtaining such desirable qualities is challenging when relying on a single camera. The proposed dataset addresses this limitation by leveraging three modalities: RGB, push-broom visible hyperspectral camera, and snapshot infrared hyperspectral camera, each offering distinct spatial and spectral resolutions. Different camera systems exhibit varying photometric properties, resulting in a trade-off between spatial and spectral resolution. RGB cameras typically offer high spatial resolution but limited spectral resolution, while hyperspectral cameras possess high spectral resolution at the expense of spatial resolution. Moreover, hyperspectral cameras themselves employ different capturing techniques and spectral ranges, further complicating the acquisition of comprehensive data. By integrating the photometric properties of these modalities, a single synthetic hyperspectral image can be generated, facilitating the exploration of broader spectral-spatial relationships for improved analysis, monitoring, and decision-making across various fields. This paper emphasizes the importance of multi-modal fusion in producing a high-quality synthetic hyperspectral dataset with consistent spectral intervals between bands.

CV · Feb 16, 2023
Spectral 3D Computer Vision -- A Review

Yajie Sun, Ali Zia, Vivien Rolland et al.

Spectral 3D computer vision examines both the geometric and spectral properties of objects. It provides a deeper understanding of an object's physical properties by providing information from narrow bands in various regions of the electromagnetic spectrum. Mapping the spectral information onto the 3D model reveals changes in the spectra-structure space or enhances 3D representations with properties such as reflectance, chromatic aberration, and varying defocus blur. This emerging paradigm advances traditional computer vision and opens new avenues of research in 3D structure, depth estimation, motion analysis, and more. It has found applications in areas such as smart agriculture, environment monitoring, building inspection, geological exploration, and digital cultural heritage records. This survey offers a comprehensive overview of spectral 3D computer vision, including a unified taxonomy of methods, key application areas, and future challenges and prospects.

12.9 · CV · Mar 10
WS-Net: Weak-Signal Representation Learning and Gated Abundance Reconstruction for Hyperspectral Unmixing via State-Space and Weak Signal Attention Fusion

Zekun Long, Ali Zia, Guanyiman Fu et al.

Weak spectral responses in hyperspectral images are often obscured by dominant endmembers and sensor noise, resulting in inaccurate abundance estimation. This paper introduces WS-Net, a deep unmixing framework specifically designed to address weak-signal collapse through state-space modelling and Weak Signal Attention fusion. The network features a multi-resolution wavelet-fused encoder that captures both high-frequency discontinuities and smooth spectral variations with a hybrid backbone that integrates a Mamba state-space branch for efficient long-range dependency modelling. It also incorporates a Weak Signal Attention branch that selectively enhances low-similarity spectral cues. A learnable gating mechanism adaptively fuses both representations, while the decoder leverages KL-divergence-based regularisation to enforce separability between dominant and weak endmembers. Experiments on one simulated and two real datasets (synthetic dataset, Samson, and Apex) demonstrate consistent improvements over six state-of-the-art baselines, achieving up to 55% and 63% reductions in RMSE and SAD, respectively. The framework maintains stable accuracy under low-SNR conditions, particularly for weak endmembers, establishing WS-Net as a robust and computationally efficient benchmark for weak-signal hyperspectral unmixing.
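
The two metrics quoted above can be sketched as follows. This is a generic illustration and not the paper's evaluation code: it assumes RMSE is computed between estimated and reference abundance vectors and SAD (spectral angle distance) between estimated and reference endmember spectra, which is a common convention in unmixing work. Function names are hypothetical.

```python
import math


def rmse(estimated, reference):
    """Root-mean-square error between two equal-length sequences."""
    n = len(reference)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimated, reference)) / n)


def spectral_angle(a, b):
    """Angle in radians between two non-zero spectra (0 means same shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


abundance_error = rmse([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
angle = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

Because the spectral angle ignores overall scale, a weak endmember with the right spectral shape but low amplitude still scores well on SAD, while RMSE on abundances penalises the amplitude error.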

CV · Jul 25, 2024
SSTD: Stripe-Like Space Target Detection Using Single-Point Weak Supervision

Zijian Zhu, Ali Zia, Xuesong Li et al.

Stripe-like space target detection (SSTD) plays a key role in enhancing space situational awareness and assessing spacecraft behaviour. This domain faces three challenges: the lack of publicly available datasets, interference from stray light and stars, and the variability of stripe-like targets, which makes manual labeling both inaccurate and labor-intensive. In response, we introduce 'AstroStripeSet', a pioneering dataset designed for SSTD, aiming to bridge the gap in academic resources and advance research in SSTD. Furthermore, we propose a novel teacher-student label evolution framework with single-point weak supervision, providing a new solution to the challenges of manual labeling. This framework starts with generating initial pseudo-labels using the zero-shot capabilities of the Segment Anything Model (SAM) in a single-point setting. After that, the fine-tuned StripeSAM serves as the teacher and the newly developed StripeNet as the student, consistently improving segmentation performance through label evolution, which iteratively refines these labels. We also introduce 'GeoDice', a new loss function customized for the linear characteristics of stripe-like targets. Extensive experiments show that our method matches fully supervised approaches, exhibits strong zero-shot generalization for diverse space-based and ground-based real-world images, and sets a new state-of-the-art (SOTA) benchmark. Our AstroStripeSet dataset and code will be made publicly available.
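
The abstract does not define GeoDice, so the sketch below shows only the standard soft Dice loss that Dice-style segmentation losses build on. It is a baseline illustration under that assumption, not the authors' loss; the function names and the smoothing constant `eps` are hypothetical choices.

```python
def dice_coefficient(pred, target, eps=1e-6):
    """Soft Dice overlap between a predicted mask and a target mask.

    Both inputs are flat sequences of values in [0, 1].
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def dice_loss(pred, target, eps=1e-6):
    """Loss that reaches 0 when prediction and target overlap perfectly."""
    return 1.0 - dice_coefficient(pred, target, eps)


# A thin target (one pixel) and a prediction that is one pixel too wide.
loss = dice_loss([1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```

For thin, stripe-like targets a single misplaced pixel changes the overlap ratio substantially, which is one plausible motivation for a loss tailored to linear shapes.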

CV · Aug 9, 2024
Collaborative Static-Dynamic Teaching: A Semi-Supervised Framework for Stripe-Like Space Target Detection

Zijian Zhu, Ali Zia, Xuesong Li et al.

Stripe-like space target detection (SSTD) is crucial for space situational awareness. Traditional unsupervised methods often fail in scenarios with low signal-to-noise ratios and variable stripe-like space targets, leading to weak generalization. Although fully supervised learning methods improve model generalization, they require extensive pixel-level labels for training. In the SSTD task, manually creating these labels is often inaccurate and labor-intensive. Semi-supervised learning (SSL) methods reduce the need for these labels and enhance model generalizability, but their performance is limited by pseudo-label quality. To address this, we introduce an innovative Collaborative Static-Dynamic Teacher (CSDT) SSL framework, which includes static and dynamic teacher models as well as a student model. This framework employs a customized adaptive pseudo-labeling (APL) strategy, transitioning from initial static teaching to adaptive collaborative teaching, guiding the student model's training. The exponential moving average (EMA) mechanism further enhances this process by feeding new stripe-like knowledge back to the dynamic teacher model through the student model, creating a positive feedback loop that continuously enhances the quality of pseudo-labels. Moreover, we present MSSA-Net, a novel SSTD network featuring a multi-scale dual-path convolution (MDPC) block and a feature map weighted attention (FMWA) block, designed to extract diverse stripe-like features within the CSDT SSL training framework. Extensive experiments verify the state-of-the-art performance of our framework on the AstroStripeSet and various ground-based and space-based real-world datasets.
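
The EMA feedback from student to dynamic teacher can be illustrated with the usual mean-teacher update. This is a sketch under assumptions: parameters are represented as a plain dictionary of floats, and the decay value 0.99 is a hypothetical default, not a setting reported in the abstract.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an exponential moving average.

    teacher, student: dicts mapping parameter names to float values.
    decay: fraction of the old teacher value that is kept (assumed value).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
# After one step the teacher moves a small distance towards the student.
teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which smooths out noisy student updates and tends to give more stable pseudo-labels.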

CV · Jan 28
Test-Time Adaptation for Anomaly Segmentation via Topology-Aware Optimal Transport Chaining

Ali Zia, Usman Ali, Umer Ramzan et al.

Deep topological data analysis (TDA) offers a principled framework for capturing structural invariants such as connectivity and cycles that persist across scales, making it a natural fit for anomaly segmentation (AS). Unlike threshold-based binarisation, which produces brittle masks under distribution shift, TDA allows anomalies to be characterised as disruptions to global structure rather than local fluctuations. We introduce TopoOT, a topology-aware optimal transport (OT) framework that integrates multi-filtration persistence diagrams (PDs) with test-time adaptation (TTA). Our key innovation is Optimal Transport Chaining, which sequentially aligns PDs across thresholds and filtrations, yielding geodesic stability scores that identify features consistently preserved across scales. These stability-aware pseudo-labels supervise a lightweight head trained online with OT-consistency and contrastive objectives, ensuring robust adaptation under domain shift. Across standard 2D and 3D anomaly detection benchmarks, TopoOT achieves state-of-the-art performance, outperforming the most competitive methods by up to +24.1% mean F1 on 2D datasets and +10.2% on 3D AS benchmarks.

20.9 · CV · May 14
Implicit Spatial-Frequency Fusion of Hyperspectral and LiDAR Data via Kolmogorov-Arnold Networks

Zekun Long, Judy X. Yang, Jing Wang et al.

Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen's Kappa, while maintaining an efficient architecture.
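
The three reported metrics (overall accuracy, average accuracy, Cohen's Kappa) can all be derived from a confusion matrix. The sketch below uses the standard definitions and assumes rows are true classes, columns are predicted classes, and every class has at least one sample; the function name is hypothetical and the numbers are made up for illustration.

```python
def classification_scores(confusion):
    """Return (overall accuracy, average accuracy, Cohen's kappa).

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))

    # Overall accuracy: fraction of all samples classified correctly.
    oa = correct / total
    # Average accuracy: mean of the per-class recalls.
    aa = sum(confusion[i][i] / sum(confusion[i]) for i in range(k)) / k
    # Expected agreement by chance, from row and column marginals.
    col_sums = [sum(confusion[r][c] for r in range(k)) for c in range(k)]
    pe = sum(sum(confusion[i]) * col_sums[i] for i in range(k)) / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa


oa, aa, kappa = classification_scores([[40, 10], [5, 45]])
```

Average accuracy weights every class equally, so it is more sensitive than overall accuracy to poor performance on small classes, which are common in land-cover benchmarks.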

CV · Oct 20, 2025 · Code
2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection

Usman Ali, Ali Zia, Abdul Rehman et al.

Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at https://github.com/adabrh/MAFR
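
For readers unfamiliar with the I-AUROC figures quoted above, image-level AUROC can be computed as the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one. The sketch below is a generic pairwise implementation and not the MAFR evaluation code; the scores and labels are made-up illustrative values.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly score per image (higher means more anomalous).
    labels: 1 for anomalous images, 0 for normal images.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical per-image scores, e.g. maximum reconstruction error per image.
value = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The quadratic loop is fine for small test sets; a rank-based formulation gives the same value in O(n log n) when the number of images is large.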

36.3 · LG · May 9
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection

Farid Hazratian, Ali Zia, Hien Duy Nguyen

Out-of-distribution (OOD) robustness is difficult to diagnose when target-domain labels are unavailable. We consider a more restrictive source-only variant of unsupervised accuracy estimation: selecting robust checkpoints using only source-domain representations, with no target samples or target labels. We propose TopoGeoScore, a source-only geometric scorer for label-free OOD checkpoint selection. Given a trained checkpoint, we construct class-conditional mutual k-nearest-neighbour graphs from source embeddings and extract three interpretable signals: a torsion-inspired reduced Laplacian log-determinant for global class-manifold complexity, Ollivier-Ricci curvature for local neighbourhood regularity, and higher-order topological summaries for fragmented connectivity, loops, and global-local inconsistency. Instead of fixing their weights by hand, TopoGeoScore learns a non-negative linear score through a self-supervised objective that enforces invariance under approximately geometry-preserving embedding views and separation from structure-breaking views. The score remains interpretable and uses no target-domain samples or labels. Results across CIFAR-based corruption and distribution-shift benchmarks, ImageNet-C, MNLI-to-HANS transfer, and OGBN-Arxiv suggest that source representations contain measurable global-local-topological evidence of robustness, supporting practical checkpoint selection before deployment under distribution shift.

58.5 · CV · Mar 10
Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection

Ali Zia, Usman Ali, Muhammad Umer Ramzan et al.

Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training-free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.
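
The abstract does not say which model of hyperbolic space MM-VAD uses. As one common option, the sketch below assumes the Poincaré ball and maps a Euclidean embedding into it with the exponential map at the origin, exp_0(v) = tanh(sqrt(c)·||v||) · v / (sqrt(c)·||v||). The function name and the curvature default are hypothetical.

```python
import math


def expmap0(vector, curvature=1.0):
    """Map a Euclidean vector into the Poincare ball of the given curvature.

    Assumption: Poincare ball model with curvature -curvature; the
    source does not specify the hyperbolic model actually used.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # The origin maps to itself.
        return list(vector)
    sqrt_c = math.sqrt(curvature)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in vector]


point = expmap0([3.0, 4.0])
radius = math.sqrt(sum(x * x for x in point))
```

The direction of the input is preserved while its length is squashed into the ball, so embeddings with large norms land near the boundary, where hyperbolic distances grow quickly. That property is what makes the space suited to tree-like hierarchies.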

4.8 · CV · Mar 10
Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

Ali Zia, Muhammad Umer Ramzan, Usman Ali et al.

Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.

LG · Feb 4, 2024
TopoX: A Suite of Python Packages for Machine Learning on Topological Domains

Mustafa Hajij, Mathilde Papillon, Florian Frantzen et al.

We introduce TopoX, a Python software suite that provides reliable and user-friendly building blocks for computing and machine learning on topological domains that extend graphs: hypergraphs, simplicial, cellular, path and combinatorial complexes. TopoX consists of three packages: TopoNetX facilitates constructing and computing on these domains, including working with nodes, edges and higher-order cells; TopoEmbedX provides methods to embed topological domains into vector spaces, akin to popular graph-based embedding algorithms such as node2vec; TopoModelX is built on top of PyTorch and offers a comprehensive toolbox of higher-order message passing functions for neural networks on topological domains. The extensively documented and unit-tested source code of TopoX is available under MIT license at https://pyt-team.github.io/.

CV · Nov 28, 2024
Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement

Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis et al.

This paper presents a novel deep-learning framework that significantly enhances the transformation of rudimentary face sketches into high-fidelity colour images. Employing a Convolutional Block Attention-based Auto-encoder Network (CA2N), our approach effectively captures and enhances critical facial features through a block attention mechanism within an encoder-decoder architecture. Subsequently, the framework utilises a noise-induced conditional Generative Adversarial Network (cGAN) process that allows the system to maintain high performance even on domains unseen during training. These enhancements lead to considerable improvements in image realism and fidelity, with our model outperforming the best prior method by FID margins of 17, 23, and 38 on the CelebAMask-HQ, CUHK, and CUFSF datasets, respectively. The model sets a new state-of-the-art in sketch-to-image generation, can generalize across sketch types, and offers a robust solution for applications such as criminal identification in law enforcement.

CV · Oct 30, 2024
NeFF-BioNet: Crop Biomass Prediction from Point Cloud to Drone Imagery

Xuesong Li, Zeeshan Hayder, Ali Zia et al.

Crop biomass offers crucial insights into plant health and yield, making it essential for crop science, farming systems, and agricultural research. However, current measurement methods, which are labor-intensive, destructive, and imprecise, hinder large-scale quantification of this trait. To address this limitation, we present a biomass prediction network (BioNet), designed for adaptation across different data modalities, including point clouds and drone imagery. Our BioNet, utilizing a sparse 3D convolutional neural network (CNN) and a transformer-based prediction module, processes point clouds and other 3D data representations to predict biomass. To further extend BioNet for drone imagery, we integrate a neural feature field (NeFF) module, enabling 3D structure reconstruction and the transformation of 2D semantic features from vision foundation models into the corresponding 3D surfaces. For the point cloud modality, BioNet demonstrates superior performance on two public datasets, with an approximate 6.1% relative improvement (RI) over the state-of-the-art. In the RGB image modality, the combination of BioNet and NeFF achieves a 7.9% RI. Additionally, the NeFF-based approach utilizes inexpensive, portable drone-mounted cameras, providing a scalable solution for large field applications.

CV · Jan 28, 2025
Determining Mosaic Resilience in Sugarcane Plants using Hyperspectral Images

Ali Zia, Jun Zhou, Muyiwa Olayemi

Sugarcane mosaic disease poses a serious threat to the Australian sugarcane industry, leading to yield losses of up to 30% in susceptible varieties. Existing manual inspection methods for detecting mosaic resilience are inefficient and impractical for large-scale application. This study introduces a novel approach using hyperspectral imaging and machine learning to detect mosaic resilience by leveraging global feature representation from local spectral patches. Hyperspectral data were collected from eight sugarcane varieties under controlled and field conditions. Local spectral patches were analyzed to capture spatial and spectral variations, which were then aggregated into global feature representations using a ResNet18 deep learning architecture. While classical methods like Support Vector Machines struggled to utilize spatial-spectral relationships effectively, the deep learning model achieved high classification accuracy, demonstrating its capacity to identify mosaic resilience from fine-grained hyperspectral data. This approach enhances early detection capabilities, enabling more efficient management of susceptible strains and contributing to sustainable sugarcane production.

LG · Feb 3
Representation Geometry as a Diagnostic for Out-of-Distribution Robustness

Ali Zia, Farid Hazratian

Robust generalization under distribution shift remains difficult to monitor and optimize in the absence of target-domain labels, as models with similar in-distribution accuracy can exhibit markedly different out-of-distribution (OOD) performance. While prior work has focused on training-time regularization and low-order representation statistics, little is known about whether the geometric structure of learned embeddings provides reliable post-hoc signals of robustness. We propose a geometry-based diagnostic framework that constructs class-conditional mutual k-nearest-neighbor graphs from in-distribution embeddings and extracts two complementary invariants: a global spectral complexity proxy based on the reduced log-determinant of the normalized Laplacian, and a local smoothness measure based on Ollivier-Ricci curvature. Across multiple architectures, training regimes, and corruption benchmarks, we find that lower spectral complexity and higher mean curvature consistently predict stronger OOD accuracy across checkpoints. Controlled perturbations and topological analyses further show that these signals reflect meaningful representation structure rather than superficial embedding statistics. Our results demonstrate that representation geometry enables interpretable, label-free robustness diagnosis and supports reliable unsupervised checkpoint selection under distribution shift.
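
The graph construction step can be sketched as follows. In a mutual k-nearest-neighbour graph, two points are connected only if each is among the other's k nearest neighbours. This is an illustrative brute-force version using Euclidean distance, intended to be applied to the embeddings of one class at a time; the function name and the toy coordinates are hypothetical, and the paper's distance metric and value of k are not given in the abstract.

```python
import math


def mutual_knn_edges(points, k):
    """Return the undirected edge set of the mutual k-NN graph.

    points: list of equal-length coordinate tuples (one class's embeddings).
    An edge (i, j) with i < j is kept only if j is among the k nearest
    neighbours of i AND i is among the k nearest neighbours of j.
    """
    n = len(points)
    neighbours = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        neighbours.append(set(others[:k]))
    return {
        (i, j)
        for i in range(n)
        for j in neighbours[i]
        if i < j and i in neighbours[j]
    }


edges = mutual_knn_edges([(0.0,), (1.0,), (2.5,), (10.0,)], k=1)
# edges == {(0, 1)}: only points 0 and 1 pick each other.
```

The mutual condition removes one-sided links from outliers (here, the point at 10.0), so the resulting graph tends to reflect the dense core of each class before Laplacian or curvature statistics are computed on it.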

CV · Oct 20, 2025
Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment

Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis et al.

Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.
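
The entropy-guided split can be pictured with a small sketch. It assumes each pixel comes with a probability vector (for example, a soft assignment over prototypes) and that a fixed threshold separates high-entropy from low-entropy pixels. Both assumptions, and the function names, are illustrative rather than taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_entropy(pixel_probs, threshold):
    """Partition pixel indices into (high_entropy, low_entropy) groups.

    In the split-fuse-transport design, the high-entropy group would go
    to spectral clustering and the low-entropy group to k-means.
    """
    high, low = [], []
    for index, probs in enumerate(pixel_probs):
        if entropy(probs) > threshold:
            high.append(index)
        else:
            low.append(index)
    return high, low


high, low = split_by_entropy(
    [[0.5, 0.5], [0.99, 0.01], [0.6, 0.4]], threshold=0.5
)
```

Ambiguous pixels, which the abstract associates with boundaries, have near-uniform assignments and therefore high entropy, while confident interior pixels fall into the low-entropy group.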

AI · Oct 17, 2025
Hypergraph Contrastive Sensor Fusion for Multimodal Fault Diagnosis in Induction Motors

Usman Ali, Ali Zia, Waqas Ali et al.

Reliable induction motor (IM) fault diagnosis is vital for industrial safety and operational continuity, mitigating costly unplanned downtime. Conventional approaches often struggle to capture complex multimodal signal relationships, are constrained to unimodal data or single fault types, and exhibit performance degradation under noisy or cross-domain conditions. This paper proposes the Multimodal Hypergraph Contrastive Attention Network (MM-HCAN), a unified framework for robust fault diagnosis. To the best of our knowledge, MM-HCAN is the first to integrate contrastive learning within a hypergraph topology specifically designed for multimodal sensor fusion, enabling the joint modelling of intra- and inter-modal dependencies and enhancing generalisation beyond Euclidean embedding spaces. The model facilitates simultaneous diagnosis of bearing, stator, and rotor faults, addressing the engineering need for consolidated diagnostic capabilities. Evaluated on three real-world benchmarks, MM-HCAN achieves up to 99.82% accuracy with strong cross-domain generalisation and resilience to noise, demonstrating its suitability for real-world deployment. An ablation study validates the contribution of each component. MM-HCAN provides a scalable and robust solution for comprehensive multi-fault diagnosis, supporting predictive maintenance and extended asset longevity in industrial environments.

CV · Jun 1, 2025
A Review on Coarse to Fine-Grained Animal Action Recognition

Ali Zia, Renuka Sharma, Abdelwahed Khamis et al.

This review provides an in-depth exploration of the field of animal action recognition, focusing on coarse-grained (CG) and fine-grained (FG) techniques. The primary aim is to examine the current state of research in animal behaviour recognition and to elucidate the unique challenges associated with recognising subtle animal actions in outdoor environments. These challenges differ significantly from those encountered in human action recognition due to factors such as non-rigid body structures, frequent occlusions, and the lack of large-scale, annotated datasets. The review begins by discussing the evolution of human action recognition, a more established field, highlighting how it progressed from broad, coarse actions in controlled settings to the demand for fine-grained recognition in dynamic environments. This shift is particularly relevant for animal action recognition, where behavioural variability and environmental complexity present unique challenges that human-centric models cannot fully address. The review then underscores the critical differences between human and animal action recognition, with an emphasis on high intra-species variability, unstructured datasets, and the natural complexity of animal habitats. Techniques like spatio-temporal deep learning frameworks (e.g., SlowFast) are evaluated for their effectiveness in animal behaviour analysis, along with the limitations of existing datasets. By assessing the strengths and weaknesses of current methodologies and introducing a recently-published dataset, the review outlines future directions for advancing fine-grained action recognition, aiming to improve accuracy and generalisability in behaviour analysis across species.

IV · May 3, 2025
Multi-Scale Target-Aware Representation Learning for Fundus Image Enhancement

Haofan Wu, Yin Huang, Yuqing Wu et al.

High-quality fundus images provide essential anatomical information for clinical screening and ophthalmic disease diagnosis. Yet, due to hardware limitations, operational variability, and patient compliance, fundus images often suffer from low resolution and signal-to-noise ratio. Recent years have witnessed promising progress in fundus image enhancement. However, existing works usually focus on restoring structural details or global characteristics of fundus images, lacking a unified image enhancement framework to recover comprehensive multi-scale information. Moreover, few methods pinpoint the target of image enhancement, e.g., lesions, which is crucial for medical image-based diagnosis. To address these challenges, we propose a multi-scale target-aware representation learning framework (MTRL-FIE) for efficient fundus image enhancement. Specifically, we propose a multi-scale feature encoder (MFE) that employs wavelet decomposition to embed both low-frequency structural information and high-frequency details. Next, we design a structure-preserving hierarchical decoder (SHD) to fuse multi-scale feature embeddings for real fundus image restoration. SHD integrates hierarchical fusion and group attention mechanisms to achieve adaptive feature fusion while retaining local structural smoothness. Meanwhile, a target-aware feature aggregation (TFA) module is used to enhance pathological regions and reduce artifacts. Experimental results on multiple fundus image datasets demonstrate the effectiveness and generalizability of MTRL-FIE for fundus image enhancement. Compared to state-of-the-art methods, MTRL-FIE achieves superior enhancement performance with a more lightweight architecture. Furthermore, our approach generalizes to other ophthalmic image processing tasks without supervised fine-tuning, highlighting its potential for clinical applications.
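
The split into low-frequency structure and high-frequency detail can be illustrated with one level of a 1D Haar transform. The abstract does not name the wavelet family or give implementation details, so Haar and the 1D setting are assumptions made here for brevity; the function name is hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar wavelet transform.

    Returns (approximation, detail): the approximation holds scaled
    pairwise sums (low-frequency structure), the detail holds scaled
    pairwise differences (high-frequency changes).
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approximation = [
        (signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)
    ]
    detail = [
        (signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)
    ]
    return approximation, detail


approximation, detail = haar_step([4.0, 4.0, 2.0, 0.0])
```

The flat pair (4, 4) produces a zero detail coefficient while the edge (2, 0) produces a non-zero one, and the transform preserves total signal energy. Applying the same step along rows and then columns gives the usual 2D decomposition for images.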

CV · Apr 17, 2024
MMCBE: Multi-modality Dataset for Crop Biomass Prediction and Beyond

Xuesong Li, Zeeshan Hayder, Ali Zia et al.

Crop biomass, a critical indicator of plant growth, health, and productivity, is invaluable for crop breeding programs and agronomic research. However, the accurate and scalable quantification of crop biomass remains inaccessible due to limitations in existing measurement methods. One of the obstacles impeding the advancement of current crop biomass prediction methodologies is the scarcity of publicly available datasets. Addressing this gap, we introduce a new dataset in this domain: the Multi-modality dataset for crop biomass estimation (MMCBE). Comprising 216 sets of multi-view drone images, coupled with LiDAR point clouds and hand-labelled ground truth, MMCBE is the first multi-modality dataset in the field. This dataset aims to establish benchmark methods for crop biomass quantification and foster the development of vision-based approaches. We have rigorously evaluated state-of-the-art crop biomass estimation methods using MMCBE and ventured into additional potential applications, such as 3D crop reconstruction from drone imagery and novel-view rendering. With this publication, we are making our comprehensive dataset available to the broader community.

CV · May 26, 2023
CVB: A Video Dataset of Cattle Visual Behaviors

Ali Zia, Renuka Sharma, Reza Arablouei et al.

Existing image/video datasets for cattle behavior recognition are mostly small, lack well-defined labels, or are collected in unrealistic controlled environments. This limits the utility of machine learning (ML) models learned from them. Therefore, we introduce a new dataset, called Cattle Visual Behaviors (CVB), that consists of 502 video clips, each fifteen seconds long, captured in natural lighting conditions, and annotated with eleven visually perceptible behaviors of grazing cattle. We use the Computer Vision Annotation Tool (CVAT) to collect our annotations. To make the procedure more efficient, we perform an initial detection and tracking of cattle in the videos using appropriate pre-trained models. The results are corrected by domain experts along with cattle behavior labeling in CVAT. The pre-hoc detection and tracking step significantly reduces the manual annotation time and effort. Moreover, we convert CVB to the atomic visual action (AVA) format and train and evaluate the popular SlowFast action recognition model on it. The associated preliminary results confirm that we can localize the cattle and recognize their frequently occurring behaviors with confidence. By creating and sharing CVB, our aim is to develop improved models capable of recognizing all important behaviors accurately and to assist other researchers and practitioners in developing and evaluating new ML models for cattle behavior classification using video data.