CVMar 12, 2023Code
Extending global-local view alignment for self-supervised learning with remote sensing imageryXinye Wanyan, Sachith Seneviratne, Shuchang Shen et al.
Since large number of high-quality remote sensing images are readily accessible, exploiting the corpus of images with less manual annotation draws increasing attention. Self-supervised models acquire general feature representations by formulating a pretext task that generates pseudo-labels for massive unlabeled data to provide supervision for training. While prior studies have explored multiple self-supervised learning techniques in remote sensing domain, pretext tasks based on local-global view alignment remain underexplored, despite achieving state-of-the-art results on natural imagery. Inspired by DINO, which employs an effective representation learning structure with knowledge distillation based on global-local view alignment, we formulate two pretext tasks for self-supervised learning on remote sensing imagery (SSLRS). Using these tasks, we explore the effectiveness of positive temporal contrast as well as multi-sized views on SSLRS. We extend DINO and propose DINO-MC which uses local views of various sized crops instead of a single fixed size in order to alleviate the limited variation in object size observed in remote sensing imagery. Our experiments demonstrate that even when pre-trained on only 10% of the dataset, DINO-MC performs on par or better than existing state-of-the-art SSLRS methods on multiple remote sensing tasks, while using less computational resources. All codes, models, and results are released at https://github.com/WennyXY/DINO-MC.
CVMar 8, 2022Code
Self-Supervision, Remote Sensing and Abstraction: Representation Learning Across 3 Million LocationsSachith Seneviratne, Kerry A. Nice, Jasper S. Wijnands et al.
Self-supervision based deep learning classification approaches have received considerable attention in academic literature. However, the performance of such methods on remote sensing imagery domains remains under-explored. In this work, we explore contrastive representation learning methods on the task of imagery-based city classification, an important problem in urban computing. We use satellite and map imagery across 2 domains, 3 million locations and more than 1500 cities. We show that self-supervised methods can build a generalizable representation from as few as 200 cities, with representations achieving over 95\% accuracy in unseen cities with minimal additional training. We also find that the performance discrepancy of such methods, when compared to supervised methods, induced by the domain discrepancy between natural imagery and abstract imagery is significant for remote sensing imagery. We compare all analysis against existing supervised models from academic literature and open-source our models for broader usage and further criticism.
CVMar 13, 2023Code
FireRisk: A Remote Sensing Dataset for Fire Risk Assessment with Benchmarks Using Supervised and Self-supervised LearningShuchang Shen, Sachith Seneviratne, Xinye Wanyan et al.
In recent decades, wildfires, as widespread and extremely destructive natural disasters, have caused tremendous property losses and fatalities, as well as extensive damage to forest ecosystems. Many fire risk assessment projects have been proposed to prevent wildfires, but GIS-based methods are inherently challenging to scale to different geographic areas due to variations in data collection and local conditions. Inspired by the abundance of publicly available remote sensing projects and the burgeoning development of deep learning in computer vision, our research focuses on assessing fire risk using remote sensing imagery. In this work, we propose a novel remote sensing dataset, FireRisk, consisting of 7 fire risk classes with a total of 91872 labelled images for fire risk assessment. This remote sensing dataset is labelled with the fire risk classes supplied by the Wildfire Hazard Potential (WHP) raster dataset, and remote sensing images are collected using the National Agriculture Imagery Program (NAIP), a high-resolution remote sensing imagery program. On FireRisk, we present benchmark performance for supervised and self-supervised representations, with Masked Autoencoders (MAE) pre-trained on ImageNet1k achieving the highest classification accuracy, 65.29%. This remote sensing dataset, FireRisk, provides a new direction for fire risk assessment, and we make it publicly available on https://github.com/CharmonyShen/FireRisk.
CVAug 3, 2022Code
DALLE-URBAN: Capturing the urban design expertise of large text to image transformersSachith Seneviratne, Damith Senanayake, Sanka Rasnayaka et al.
Automatically converting text descriptions into images using transformer architectures has recently received considerable attention. Such advances have implications for many applied design disciplines across fashion, art, architecture, urban planning, landscape design and the future tools available to such disciplines. However, a detailed analysis capturing the capabilities of such models, specifically with a focus on the built environment, has not been performed to date. In this work, we investigate the capabilities and biases of such text-to-image methods as it applies to the built environment in detail. We use a systematic grammar to generate queries related to the built environment and evaluate resulting generated images. We generate 1020 different images and find that text to image transformers are robust at generating realistic images across different domains for this use-case. Generated imagery can be found at the github: https://github.com/sachith500/DALLEURBAN
CVSep 18, 2023Code
Scalable Label-efficient Footpath Network Generation Using Remote Sensing Data and Self-supervised LearningXinye Wanyan, Sachith Seneviratne, Kerry Nice et al.
Footpath mapping, modeling, and analysis can provide important geospatial insights to many fields of study, including transport, health, environment and urban planning. The availability of robust Geographic Information System (GIS) layers can benefit the management of infrastructure inventories, especially at local government level with urban planners responsible for the deployment and maintenance of such infrastructure. However, many cities still lack real-time information on the location, connectivity, and width of footpaths, and/or employ costly and manual survey means to gather this information. This work designs and implements an automatic pipeline for generating footpath networks based on remote sensing images using machine learning models. The annotation of segmentation tasks, especially labeling remote sensing images with specialized requirements, is very expensive, so we aim to introduce a pipeline requiring less labeled data. Considering supervised methods require large amounts of training data, we use a self-supervised method for feature representation learning to reduce annotation requirements. Then the pre-trained model is used as the encoder of the U-Net for footpath segmentation. Based on the generated masks, the footpath polygons are extracted and converted to footpath networks which can be loaded and visualized by geographic information systems conveniently. Validation results indicate considerable consistency when compared to manually collected GIS layers. The footpath network generation pipeline proposed in this work is low-cost and extensible, and it can be applied where remote sensing images are available. Github: https://github.com/WennyXY/FootpathSeg.
CVOct 6, 2022
IJCB 2022 Mobile Behavioral Biometrics Competition (MobileB2C)Giuseppe Stragapede, Ruben Vera-Rodriguez, Ruben Tolosana et al.
This paper describes the experimental framework and results of the IJCB 2022 Mobile Behavioral Biometrics Competition (MobileB2C). The aim of MobileB2C is benchmarking mobile user authentication systems based on behavioral biometric traits transparently acquired by mobile devices during ordinary Human-Computer Interaction (HCI), using a novel public database, BehavePassDB, and a standard experimental protocol. The competition is divided into four tasks corresponding to typical user activities: keystroke, text reading, gallery swiping, and tapping. The data are composed of touchscreen data and several background sensor data simultaneously acquired. "Random" (different users with different devices) and "skilled" (different user on the same device attempting to imitate the legitimate one) impostor scenarios are considered. The results achieved by the participants show the feasibility of user authentication through behavioral biometrics, although this proves to be a non-trivial challenge. MobileB2C will be established as an on-going competition.
CVMar 23Code
Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image ClassificationJayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge
Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter efficient prompt tuning method by scaling and shifting features in text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements up-to 10.9%, 7.8%, and 13.8% respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.
CRAug 15, 2022
Self-Supervised Vision Transformers for Malware DetectionSachith Seneviratne, Ridwan Shariffdeen, Sanka Rasnayaka et al.
Malware detection plays a crucial role in cyber-security with the increase in malware growth and advancements in cyber-attacks. Previously unseen malware which is not determined by security vendors are often used in these attacks and it is becoming inevitable to find a solution that can self-learn from unlabeled sample data. This paper presents SHERLOCK, a self-supervision based deep learning model to detect malware based on the Vision Transformer (ViT) architecture. SHERLOCK is a novel malware detection method which learns unique features to differentiate malware from benign programs with the use of image-based binary representation. Experimental results using 1.2 million Android applications across a hierarchy of 47 types and 696 families, shows that self-supervised learning can achieve an accuracy of 97% for the binary classification of malware which is higher than existing state-of-the-art techniques. Our proposed model is also able to outperform state-of-the-art techniques for multi-class malware classification of types and family with macro-F1 score of .497 and .491 respectively.
CVAug 17, 2022
Urban feature analysis from aerial remote sensing imagery using self-supervised and semi-supervised computer visionSachith Seneviratne, Jasper S. Wijnands, Kerry Nice et al.
Analysis of overhead imagery using computer vision is a problem that has received considerable attention in academic literature. Most techniques that operate in this space are both highly specialised and require expensive manual annotation of large datasets. These problems are addressed here through the development of a more generic framework, incorporating advances in representation learning which allows for more flexibility in analysing new categories of imagery with limited labeled data. First, a robust representation of an unlabeled aerial imagery dataset was created based on the momentum contrast mechanism. This was subsequently specialised for different tasks by building accurate classifiers with as few as 200 labeled images. The successful low-level detection of urban infrastructure evolution over a 10-year period from 60 million unlabeled images, exemplifies the substantial potential of our approach to advance quantitative urban research.
LGAug 27, 2024
GINN-KAN: Interpretability pipelining with applications in Physics Informed Neural NetworksNisal Ranasinghe, Yu Xia, Sachith Seneviratne et al.
Neural networks are powerful function approximators, yet their ``black-box" nature often renders them opaque and difficult to interpret. While many post-hoc explanation methods exist, they typically fail to capture the underlying reasoning processes of the networks. A truly interpretable neural network would be trained similarly to conventional models using techniques such as backpropagation, but additionally provide insights into the learned input-output relationships. In this work, we introduce the concept of interpretability pipelineing, to incorporate multiple interpretability techniques to outperform each individual technique. To this end, we first evaluate several architectures that promise such interpretability, with a particular focus on two recent models selected for their potential to incorporate interpretability into standard neural network architectures while still leveraging backpropagation: the Growing Interpretable Neural Network (GINN) and Kolmogorov Arnold Networks (KAN). We analyze the limitations and strengths of each and introduce a novel interpretable neural network GINN-KAN that synthesizes the advantages of both models. When tested on the Feynman symbolic regression benchmark datasets, GINN-KAN outperforms both GINN and KAN. To highlight the capabilities and the generalizability of this approach, we position GINN-KAN as an alternative to conventional black-box networks in Physics-Informed Neural Networks (PINNs). We expect this to have far-reaching implications in the application of deep learning pipelines in the natural sciences. Our experiments with this interpretable PINN on 15 different partial differential equations demonstrate that GINN-KAN augmented PINNs outperform PINNs with black-box networks in solving differential equations and surpass the capabilities of both GINN and KAN.
CVMar 17
MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image EditingZhaoyuan Qiu, Ken Chen, Xiangwei Wang et al.
Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address this challenge, we propose MSRAMIE, a training-free agent framework built on Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handle multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing steps which enable state transitions, cross-step information aggregation, and original input recall, which enables systematic exploration of the image editing space and flexible progressive output refinement. The visualizable inference topology further provides interpretable and controllable decision pathways. Experiments show that as the instruction complexity increases, MSRAMIE can improve instruction following over 15% and increases the probability of finishing all modifications in a single run over 100%, while preserving perceptual quality and maintaining visual consistency.
LGMar 15
From Specification to Architecture: A Theory Compiler for Knowledge-Guided Machine LearningAsela Hevapathige, Yu Xia, Sachith Seneviratne et al.
Theory-guided machine learning has demonstrated that including authentic domain knowledge directly into model design improves performance, sample efficiency and out-of-distribution generalisation. Yet the process by which a formal domain theory is translated into architectural constraints remains entirely manual, specific to each domain formalism, and devoid of any formal correctness guarantee. This translation is non-transferable between domains, not verified, and does not scale. We propose the Theory Compiler: a system that accepts a typed, machine-readable domain theory as input and automatically produces an architecture whose function space is provably constrained to be consistent with that theory by construction, not by regularisation. We identify three foundational open problems whose resolution defines our research agenda: (1) designing a universal theory formalisation language with decidable type-checking; (2) constructing a compositionally correct compilation algorithm from theory primitives to architectural modules; and (3) establishing soundness and completeness criteria for formal verification. We further conjecture that compiled architectures match or exceed manually-designed counterparts in generalisation performance while requiring substantially less training data, a claim we ground in classical statistical learning theory. We argue that recent advances in formal machine learning theory, large language models, and the growth of an interdisciplinary research community have made this paradigm achievable for the first time.
CVMar 11, 2025Code
SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning ModelsNadarasar Bahavan, Sachith Seneviratne, Saman Halgamuge
The widespread use of deep learning classifiers necessitates Open-set recognition (OSR), which enables the identification of input data not only from classes known during training but also from unknown classes that might be present in test data. Many existing OSR methods are computationally expensive due to the reliance on complex generative models or suffer from high training costs. We investigate OSR from a representation-learning perspective, specifically through spherical embeddings. We introduce SphOR, a computationally efficient representation learning method that models the feature space as a mixture of von Mises-Fisher distributions. This approach enables the use of semantically ambiguous samples during training, to improve the detection of samples from unknown classes. We further explore the relationship between OSR performance and key representation learning properties which influence how well features are structured in high-dimensional space. Extensive experiments on multiple OSR benchmarks demonstrate the effectiveness of our method, producing state-of-the-art results, with improvements up-to 6% that validate its performance. Code at https://github.com/nadarasarbahavan/SpHOR
CVNov 18, 2025Code
Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood MappingSun Han Neo, Sachith Seneviratne, Herath Mudiyanselage Viraj Vidura Herath et al.
Flood prediction is critical for emergency planning and response to mitigate human and economic losses. Traditional physics-based hydrodynamic models generate high-resolution flood maps using numerical methods requiring fine-grid discretization; which are computationally intensive and impractical for real-time large-scale applications. While recent studies have applied convolutional neural networks for flood map super-resolution with good accuracy and speed, they suffer from limited generalizability to unseen areas. In this paper, we propose a novel approach that leverages latent diffusion models to perform super-resolution on coarse-grid flood maps, with the objective of achieving the accuracy of fine-grid flood maps while significantly reducing inference time. Experimental results demonstrate that latent diffusion models substantially decrease the computational time required to produce high-fidelity flood maps without compromising on accuracy, enabling their use in real-time flood risk management. Moreover, diffusion models exhibit superior generalizability across different physical locations, with transfer learning further accelerating adaptation to new geographic regions. Our approach also incorporates physics-informed inputs, addressing the common limitation of black-box behavior in machine learning, thereby enhancing interpretability. Code is available at https://github.com/neosunhan/flood-diff.
CVMay 11, 2023Code
Undercover Deepfakes: Detecting Fake Segments in VideosSanjay Saha, Rashindrie Perera, Sachith Seneviratne et al.
The recent renaissance in generative models, driven primarily by the advent of diffusion models and iterative improvement in GAN methods, has enabled many creative applications. However, each advancement is also accompanied by a rise in the potential for misuse. In the arena of the deepfake generation, this is a key societal issue. In particular, the ability to modify segments of videos using such generative techniques creates a new paradigm of deepfakes which are mostly real videos altered slightly to distort the truth. This paradigm has been under-explored by the current deepfake detection methods in the academic literature. In this paper, we present a deepfake detection method that can address this issue by performing deepfake prediction at the frame and video levels. To facilitate testing our method, we prepared a new benchmark dataset where videos have both real and fake frame sequences with very subtle transitions. We provide a benchmark on the proposed dataset with our detection method which utilizes the Vision Transformer based on Scaling and Shifting to learn spatial features, and a Timeseries Transformer to learn temporal features of the videos to help facilitate the interpretation of possible deepfakes. Extensive experiments on a variety of deepfake generation methods show excellent results by the proposed method on temporal segmentation and classical video-level predictions as well. In particular, the paradigm we address will form a powerful tool for the moderation of deepfakes, where human oversight can be better targeted to the parts of videos suspected of being deepfakes. All experiments can be reproduced at: github.com/rgb91/temporal-deepfake-segmentation.
CVDec 15, 2021Code
Does a Face Mask Protect my Privacy?: Deep Learning to Predict Protected Attributes from Masked Face ImagesSachith Seneviratne, Nuran Kasthuriarachchi, Sanka Rasnayaka et al.
Contactless and efficient systems are implemented rapidly to advocate preventive methods in the fight against the COVID-19 pandemic. Despite the positive benefits of such systems, there is potential for exploitation by invading user privacy. In this work, we analyse the privacy invasiveness of face biometric systems by predicting privacy-sensitive soft-biometrics using masked face images. We train and apply a CNN based on the ResNet-50 architecture with 20,003 synthetic masked images and measure the privacy invasiveness. Despite the popular belief of the privacy benefits of wearing a mask among people, we show that there is no significant difference to privacy invasiveness when a mask is worn. In our experiments we were able to accurately predict sex (94.7%),race (83.1%) and age (MAE 6.21 and RMSE 8.33) from masked face images. Our proposed approach can serve as a baseline utility to evaluate the privacy-invasiveness of artificial intelligence systems that make use of privacy-sensitive information. We open-source all contributions for re-producibility and broader use by the research community.
CVJun 10, 2021Code
Multi-Dataset Benchmarks for Masked Identification using Contrastive Representation LearningSachith Seneviratne, Nuran Kasthuriaarachchi, Sanka Rasnayaka
The COVID-19 pandemic has drastically changed accepted norms globally. Within the past year, masks have been used as a public health response to limit the spread of the virus. This sudden change has rendered many face recognition based access control, authentication and surveillance systems ineffective. Official documents such as passports, driving license and national identity cards are enrolled with fully uncovered face images. However, in the current global situation, face matching systems should be able to match these reference images with masked face images. As an example, in an airport or security checkpoint it is safer to match the unmasked image of the identifying document to the masked person rather than asking them to remove the mask. We find that current facial recognition techniques are not robust to this form of occlusion. To address this unique requirement presented due to the current circumstance, we propose a set of re-purposed datasets and a benchmark for researchers to use. We also propose a contrastive visual representation learning based pre-training workflow which is specialized to masked vs unmasked face matching. We ensure that our method learns robust features to differentiate people across varying data collection scenarios. We achieve this by training over many different datasets and validating our result by testing on various holdout datasets. The specialized weights trained by our method outperform standard face recognition features for masked to unmasked face matching. We believe the provided synthetic mask generating code, our novel training approach and the trained weights from the masked face models will help in adopting existing face recognition systems to operate in the current global environment. We open-source all contributions for broader use by the research community.
LGDec 18, 2023
GINN-LP: A Growing Interpretable Neural Network for Discovering Multivariate Laurent Polynomial EquationsNisal Ranasinghe, Damith Senanayake, Sachith Seneviratne et al.
Traditional machine learning is generally treated as a black-box optimization problem and does not typically produce interpretable functions that connect inputs and outputs. However, the ability to discover such interpretable functions is desirable. In this work, we propose GINN-LP, an interpretable neural network to discover the form and coefficients of the underlying equation of a dataset, when the equation is assumed to take the form of a multivariate Laurent Polynomial. This is facilitated by a new type of interpretable neural network block, named the "power-term approximator block", consisting of logarithmic and exponential activation functions. GINN-LP is end-to-end differentiable, making it possible to use backpropagation for training. We propose a neural network growth strategy that will enable finding the suitable number of terms in the Laurent polynomial that represents the data, along with sparsity regularization to promote the discovery of concise equations. To the best of our knowledge, this is the first model that can discover arbitrary multivariate Laurent polynomial terms without any prior information on the order. Our approach is first evaluated on a subset of data used in SRBench, a benchmark for symbolic regression. We first show that GINN-LP outperforms the state-of-the-art symbolic regression methods on datasets generated using 48 real-world equations in the form of multivariate Laurent polynomials. Next, we propose an ensemble method that combines our method with a high-performing symbolic regression method, enabling us to discover non-Laurent polynomial equations. We achieve state-of-the-art results in equation discovery, showing an absolute improvement of 7.1% over the best contender, by applying this ensemble method to 113 datasets within SRBench with known ground-truth equations.
LGDec 6, 2024
Rethinking Time Series Forecasting with LLMs via Nearest Neighbor Contrastive LearningJayanie Bogahawatte, Sachith Seneviratne, Maneesha Perera et al.
Adapting Large Language Models (LLMs) that are extensively trained on abundant text data, and customizing the input prompt to enable time series forecasting has received considerable attention. While recent work has shown great potential for adapting the learned prior of LLMs, the formulation of the prompt to finetune LLMs remains challenging as prompt should be aligned with time series data. Additionally, current approaches do not effectively leverage word token embeddings which embody the rich representation space learned by LLMs. This emphasizes the need for a robust approach to formulate the prompt which utilizes the word token embeddings while effectively representing the characteristics of the time series. To address these challenges, we propose NNCL-TLLM: Nearest Neighbor Contrastive Learning for Time series forecasting via LLMs. First, we generate time series compatible text prototypes such that each text prototype represents both word token embeddings in its neighborhood and time series characteristics via end-to-end finetuning. Next, we draw inspiration from Nearest Neighbor Contrastive Learning to formulate the prompt while obtaining the top-$k$ nearest neighbor time series compatible text prototypes. We then fine-tune the layer normalization and positional embeddings of the LLM, keeping the other layers intact, reducing the trainable parameters and decreasing the computational cost. Our comprehensive experiments demonstrate that NNCL-TLLM outperforms in few-shot forecasting while achieving competitive or superior performance over the state-of-the-art methods in long-term and short-term forecasting tasks.
SPMay 18, 2025
FRAME-C: A knowledge-augmented deep learning pipeline for classifying multi-electrode array electrophysiological signalsNisal Ranasinghe, Dzung Do-Ha, Simon Maksour et al.
Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disorder characterized by motor neuron degeneration, with alterations in neural excitability serving as key indicators. Recent advancements in induced pluripotent stem cell (iPSC) technology have enabled the generation of human iPSC-derived neuronal cultures, which, when combined with multi-electrode array (MEA) electrophysiology, provide rich spatial and temporal electrophysiological data. Traditionally, MEA data is analyzed using handcrafted features based on potentially imperfect domain knowledge, which while useful may not fully capture all useful characteristics inherent in the data. Machine learning, particularly deep learning, has the potential to automatically learn relevant characteristics from raw data without solely relying on handcrafted feature extraction. However, handcrafted features remain critical for encoding domain knowledge and improving interpretability, especially with limited or noisy data. This study introduces FRAME-C, a knowledge-augmented machine learning pipeline that combines domain knowledge, raw spike waveform data, and deep learning techniques to classify MEA signals and identify ALS-specific phenotypes. FRAME-C leverages deep learning to learn important features from spike waveforms while incorporating handcrafted features such as spike amplitude, inter-spike interval, and spike duration, preserving key spatial and temporal information. We validate FRAME-C on both simulated and real MEA data from human iPSC-derived neuronal cultures, demonstrating superior performance over existing classification methods. FRAME-C shows over 11% improvement on real data and up to 25% on simulated data. We also show FRAME-C can evaluate handcrafted feature importance, providing insights into ALS phenotypes.
LGMar 28, 2025
Arch-LLM: Taming LLMs for Neural Architecture Generation via Unsupervised Discrete Representation LearningDeshani Geethika Poddenige, Sachith Seneviratne, Damith Senanayake et al.
Unsupervised representation learning has been widely explored across various modalities, including neural architectures, where it plays a key role in downstream applications like Neural Architecture Search (NAS). These methods typically learn an unsupervised representation space before generating/ sampling architectures for the downstream search. A common approach involves the use of Variational Autoencoders (VAEs) to map discrete architectures onto a continuous representation space, however, sampling from these spaces often leads to a high percentage of invalid or duplicate neural architectures. This could be due to the unnatural mapping of inherently discrete architectural space onto a continuous space, which emphasizes the need for a robust discrete representation of these architectures. To address this, we introduce a Vector Quantized Variational Autoencoder (VQ-VAE) to learn a discrete latent space more naturally aligned with the discrete neural architectures. In contrast to VAEs, VQ-VAEs (i) map each architecture into a discrete code sequence and (ii) allow the prior to be learned by any generative model rather than assuming a normal distribution. We then represent these architecture latent codes as numerical sequences and train a text-to-text model leveraging a Large Language Model to learn and generate sequences representing architectures. We experiment our method with Inception/ ResNet-like cell-based search spaces, namely NAS-Bench-101 and NAS-Bench-201. Compared to VAE-based methods, our approach improves the generation of valid and unique architectures by over 80% on NASBench-101 and over 8% on NASBench-201. Finally, we demonstrate the applicability of our method in NAS employing a sequence-modeling-based NAS algorithm.
CVMar 11, 2025
Unmasking the Unknown: Facial Deepfake Detection in the Open-Set ParadigmNadarasar Bahavan, Sanjay Saha, Ken Chen et al.
Facial forgery methods such as deepfakes can be misused for identity manipulation and spreading misinformation. They have evolved alongside advancements in generative AI, leading to new and more sophisticated forgery techniques that diverge from existing 'known' methods. Conventional deepfake detection methods use the closedset paradigm, thus limiting their applicability to detecting forgeries created using methods that are not part of the training dataset. In this paper, we propose a shift from the closed-set paradigm for deepfake detection. In the open-set paradigm, models are designed not only to identify images created by known facial forgery methods but also to identify and flag those produced by previously unknown methods as 'unknown' and not as unforged/real/unmanipulated. In this paper, we propose an open-set deepfake classification algorithm based on supervised contrastive learning. The open-set paradigm used in our model allows it to function as a more robust tool capable of handling emerging and unseen deepfake techniques, enhancing reliability and confidence, and complementing forensic analysis. In open-set paradigm, we identify three groups including the "unknown group that is neither considered known deepfake nor real. We investigate deepfake open-set classification across three scenarios, classifying deepfakes from unknown methods not as real, distinguishing real images from deepfakes, and classifying deepfakes from known methods, using the FaceForensics++ dataset as a benchmark. Our method achieves state of the art results in the first two tasks and competitive results in the third task.
CVDec 10, 2024
TT-MPD: Test Time Model Pruning and DistillationHaihang Wu, Wei Wang, Tamasha Malepathirana et al.
Pruning can be an effective method of compressing large pre-trained models for inference speed acceleration. Previous pruning approaches rely on access to the original training dataset for both pruning and subsequent fine-tuning. However, access to the training data can be limited due to concerns such as data privacy and commercial confidentiality. Furthermore, with covariate shift (disparities between test and training data distributions), pruning and finetuning with training datasets can hinder the generalization of the pruned model to test data. To address these issues, pruning and finetuning the model with test time samples becomes essential. However, test-time model pruning and fine-tuning incur additional computation costs and slow down the model's prediction speed, thus posing efficiency issues. Existing pruning methods are not efficient enough for test time model pruning setting, since finetuning the pruned model is needed to evaluate the importance of removable components. To address this, we propose two variables to approximate the fine-tuned accuracy. We then introduce an efficient pruning method that considers the approximated finetuned accuracy and potential inference latency saving. To enhance fine-tuning efficiency, we propose an efficient knowledge distillation method that only needs to generate pseudo labels for a small set of finetuning samples one time, thereby reducing the expensive pseudo-label generation cost. Experimental results demonstrate that our method achieves a comparable or superior tradeoff between test accuracy and inference latency, with a 32% relative reduction in pruning and finetuning time compared to the best existing method.
CVJun 19, 2024
AniFaceDiff: Animating Stylized Avatars via Parametric Conditioned Diffusion ModelsKen Chen, Sachith Seneviratne, Wei Wang et al.
Animating stylized avatars with dynamic poses and expressions has attracted increasing attention for its broad range of applications. Previous research has made significant progress by training controllable generative models to synthesize animations based on reference characteristics, pose, and expression conditions. However, the mechanisms used in these methods to control pose and expression often inadvertently introduce unintended features from the target motion, while also causing a loss of expression-related details, particularly when applied to stylized animation. This paper proposes a new method based on Stable Diffusion, called AniFaceDiff, incorporating a new conditioning module for animating stylized avatars. First, we propose a refined spatial conditioning approach by Facial Alignment to prevent the inclusion of identity characteristics from the target motion. Then, we introduce an Expression Adapter that incorporates additional cross-attention layers to address the potential loss of expression-related information. Our approach effectively preserves pose and expression from the target video while maintaining input image consistency. Extensive experiments demonstrate that our method achieves state-of-the-art results, showcasing superior image quality, preservation of reference features, and expression accuracy, particularly for out-of-domain animation across diverse styles, highlighting its versatility and strong generalization capabilities. This work aims to enhance the quality of virtual stylized animation for positive applications. To promote responsible use in virtual environments, we contribute to the advancement of detection for generative content by evaluating state-of-the-art detectors, highlighting potential areas for improvement, and suggesting solutions.
CVMay 5, 2023
Semantic Segmentation using Vision Transformers: A surveyHans Thisanke, Chamli Deshan, Kavindu Chamith et al.
Semantic segmentation has a broad range of applications in a variety of domains including land coverage analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNN) and Vision Transformers (ViTs) provide the architecture models for semantic segmentation. Even though ViTs have proven success in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection since ViT is not a general purpose backbone due to its patch partitioning scheme. In this survey, we discuss some of the different ViT architectures that can be used for semantic segmentation and how their evolution managed the above-stated challenge. The rise of ViT and its performance with a high success rate motivated the community to slowly replace the traditional convolutional neural networks in various computer vision tasks. This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets. This will be worthwhile for the community to yield knowledge regarding the implementations carried out in semantic segmentation and to discover more efficient methodologies using ViTs.
CVJun 29, 2021
MFR 2021: Masked Face Recognition CompetitionFadi Boutros, Naser Damer, Jan Niklas Kolf et al.
This paper presents a summary of the Masked Face Recognition Competitions (MFR) held within the 2021 International Joint Conference on Biometrics (IJCB 2021). The competition attracted a total of 10 participating teams with valid submissions. The affiliations of these teams are diverse and associated with academia and industry in nine different countries. These teams successfully submitted 18 valid solutions. The competition is designed to motivate solutions aiming at enhancing the face recognition accuracy of masked faces. Moreover, the competition considered the deployability of the proposed solutions by taking the compactness of the face recognition models into account. A private dataset representing a collaborative, multi-session, real masked, capture scenario is used to evaluate the submitted solutions. In comparison to one of the top-performing academic face recognition solutions, 10 out of the 18 submitted solutions did score higher masked face verification accuracy.