Chee Seng Chan

CV
h-index33
55papers
4,190citations
Novelty48%
AI Score61

55 Papers

IVApr 1, 2022
Extremely Low-light Image Enhancement with Scene Text Restoration

Pohao Hsu, Che-Tsung Lin, Chun Chet Ng et al. · microsoft-research

Deep learning-based methods have made impressive progress in enhancing extremely low-light images - the image quality of the reconstructed images has generally improved. However, we found out that most of these methods could not sufficiently recover the image details, for instance, the texts in the scene. In this paper, a novel image enhancement framework is proposed to precisely restore the scene texts, as well as the overall quality of the image simultaneously under extremely low-light images conditions. Mainly, we employed a self-regularised attention map, an edge map, and a novel text detection loss. In addition, leveraging synthetic low-light images is beneficial for image enhancement on the genuine ones in terms of text detection. The quantitative and qualitative experimental results have shown that the proposed model outperforms state-of-the-art methods in image restoration, text detection, and text spotting on See In the Dark and ICDAR15 datasets.

CVFeb 15, 2023Code
Unsupervised Hashing with Similarity Distribution Calibration

Kam Woh Ng, Xiatian Zhu, Jiun Tian Hoe et al.

Unsupervised hashing methods typically aim to preserve the similarity between data points in a feature space by mapping them to binary hash codes. However, these methods often overlook the fact that the similarity between data points in the continuous feature space may not be preserved in the discrete hash code space, due to the limited similarity range of hash codes. The similarity range is bounded by the code length and can lead to a problem known as similarity collapse. That is, the positive and negative pairs of data points become less distinguishable from each other in the hash space. To alleviate this problem, in this paper a novel Similarity Distribution Calibration (SDC) method is introduced. SDC aligns the hash code similarity distribution towards a calibration distribution (e.g., beta distribution) with sufficient spread across the entire similarity range, thus alleviating the similarity collapse problem. Extensive experiments show that our SDC outperforms significantly the state-of-the-art alternatives on coarse category-level and instance-level image retrieval. Code is available at https://github.com/kamwoh/sdc.

CLMay 24Code
They Are Not the Same: Direct Causes Are Not Grounded Emotion Explanations

Zhuangzhuang Pan, Yan Xia, Chee Seng Chan

Emotion-Cause Pair Extraction (ECPE) was introduced to explain why an emotion occurs, but this goal is now often reduced to binary pair/non-pair prediction. This proxy is useful for direct-cause extraction, yet easy to over-read as evidence grounded emotion explanation. We show that this interpretation is only partially valid. In IEMO-MECP, 90.9% of original positives remain emo-cause and 95.0% of original negatives remain non-pair, confirming that the binary ECPE task is largely preserved. The problem is that direct triggers alone do not constitute a grounded explanation. Emo-context, an utterance that helps interpret a target emotion without directly causing it, appears on both sides of the original boundary and is enriched near binary uncertainty, showing that the binary boundary has no stable place for such discourse evidence. Across evaluated ECPE models, direct triggers are recovered more reliably than contextual support. Under shortcut pressure, this imbalance becomes consequential. Binary-trained models assign higher pair scores to nearby lexically similar non-pair candidates than to evidence supported but structurally harder emo-cause and emo-context pairs. Thus, pair scores can reward convenient attributions over grounded explanations. High binary ECPE performance indicates that a model can identify direct triggers; it does not indicate that the model has explained the emotion. Code is publicly available at https://github.com/panzhzh/ECPExsame.

CLMay 20Code
Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin et al.

Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs

CRAug 31, 2023
Everyone Can Attack: Repurpose Lossy Compression as a Natural Backdoor Attack

Sze Jue Yang, Quang Nguyen, Chee Seng Chan et al. · baidu

The vulnerabilities to backdoor attacks have recently threatened the trustworthiness of machine learning models in practical applications. Conventional wisdom suggests that not everyone can be an attacker since the process of designing the trigger generation algorithm often involves significant effort and extensive experimentation to ensure the attack's stealthiness and effectiveness. Alternatively, this paper shows that there exists a more severe backdoor threat: anyone can exploit an easily-accessible algorithm for silent backdoor attacks. Specifically, this attacker can employ the widely-used lossy image compression from a plethora of compression tools to effortlessly inject a trigger pattern into an image without leaving any noticeable trace; i.e., the generated triggers are natural artifacts. One does not require extensive knowledge to click on the "convert" or "save as" button while using tools for lossy image compression. Via this attack, the adversary does not need to design a trigger generator as seen in prior works and only requires poisoning the data. Empirically, the proposed attack consistently achieves 100% attack success rate in several benchmark datasets such as MNIST, CIFAR-10, GTSRB and CelebA. More significantly, the proposed attack can still achieve almost 100% attack success rate with very small (approximately 10%) poisoning rates in the clean label setting. The generated trigger of the proposed attack using one lossy compression algorithm is also transferable across other related compression algorithms, exacerbating the severity of this backdoor threat. This work takes another crucial step toward understanding the extensive risks of backdoor attacks in practice, urging practitioners to investigate similar attacks and relevant backdoor mitigation methods.

CVApr 15Code
OneHOI: Unifying Human-Object Interaction Generation and Editing

Jiun Tian Hoe, Weipeng Hu, Xudong Jiang et al.

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.

CLOct 3, 2022Code
An Embarrassingly Simple Approach for Intellectual Property Rights Protection on Recurrent Neural Networks

Zhi Qin Tan, Hao Shan Wong, Chee Seng Chan

Capitalise on deep learning models, offering Natural Language Processing (NLP) solutions as a part of the Machine Learning as a Service (MLaaS) has generated handsome revenues. At the same time, it is known that the creation of these lucrative deep models is non-trivial. Therefore, protecting these inventions intellectual property rights (IPR) from being abused, stolen and plagiarized is vital. This paper proposes a practical approach for the IPR protection on recurrent neural networks (RNN) without all the bells and whistles of existing IPR solutions. Particularly, we introduce the Gatekeeper concept that resembles the recurrent nature in RNN architecture to embed keys. Also, we design the model training scheme in a way such that the protected RNN model will retain its original performance iff a genuine key is presented. Extensive experiments showed that our protection scheme is robust and effective against ambiguity and removal attacks in both white-box and black-box protection schemes on different RNN variants. Code is available at https://github.com/zhiqin1998/RecurrentIPR

LGMay 23, 2024Code
Ferrari: Federated Feature Unlearning via Optimizing Feature Sensitivity

Hanlin Gu, Win Kent Ong, Chee Seng Chan et al.

The advent of Federated Learning (FL) highlights the practical necessity for the right to be forgotten for all clients, allowing them to request data deletion from the machine learning models service provider. This necessity has spurred a growing demand for Federated Unlearning (FU). Feature unlearning has gained considerable attention due to its applications in unlearning sensitive, backdoor, and biased features. Existing methods employ the influence function to achieve feature unlearning, which is impractical for FL as it necessitates the participation of other clients, if not all, in the unlearning process. Furthermore, current research lacks an evaluation of the effectiveness of feature unlearning. To address these limitations, we define feature sensitivity in evaluating feature unlearning according to Lipschitz continuity. This metric characterizes the model outputs rate of change or sensitivity to perturbations in the input feature. We then propose an effective federated feature unlearning framework called Ferrari, which minimizes feature sensitivity. Extensive experimental results and theoretical analysis demonstrate the effectiveness of Ferrari across various feature unlearning scenarios, including sensitive, backdoor, and biased features. The code is publicly available at https://github.com/OngWinKent/Federated-Feature-Unlearning

CVApr 22, 2024Code
Text in the Dark: Extremely Low-Light Text Image Enhancement

Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan et al.

Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement. Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow for objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely-used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark.

CVJan 7, 2025Code
Chirpy3D: Creative Fine-grained 3D Object Fabrication via Part Sampling

Kam Woh Ng, Jing Yang, Jia Wei Sii et al.

We present Chirpy3D, a novel approach for fine-grained 3D object generation, tackling the challenging task of synthesizing creative 3D objects in a zero-shot setting, with access only to unposed 2D images of seen categories. Without structured supervision -- such as camera poses, 3D part annotations, or object-specific labels -- the model must infer plausible 3D structures, capture fine-grained details, and generalize to novel objects using only category-level labels from seen categories. To address this, Chirpy3D introduces a multi-view diffusion model that decomposes training objects into anchor parts in an unsupervised manner, representing the latent space of both seen and unseen parts as continuous distributions. This allows smooth interpolation and flexible recombination of parts to generate entirely new objects with species-specific details. A self-supervised feature consistency loss further ensures structural and semantic coherence. The result is the first system capable of generating entirely novel 3D objects with species-specific fine-grained details through flexible part sampling and composition. Our experiments demonstrate that Chirpy3D surpasses existing methods in generating creative 3D objects with higher quality and fine-grained details. Code will be released at https://github.com/kamwoh/chirpy3d.

CVFeb 11, 2022Code
ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning

Jia Huei Tan, Ying Hua Tan, Chee Seng Chan et al.

Recent research that applies Transformer-based architectures to image captioning has resulted in state-of-the-art image captioning performance, capitalising on the success of Transformers on natural language tasks. Unfortunately, though these models work well, one major flaw is their large model sizes. To this end, we present three parameter reduction methods for image captioning Transformers: Radix Encoding, cross-layer parameter sharing, and attention parameter sharing. By combining these methods, our proposed ACORT models have 3.7x to 21.6x fewer parameters than the baseline model without compromising test performance. Results on the MS-COCO dataset demonstrate that our ACORT models are competitive against baselines and SOTA approaches, with CIDEr score >=126. Finally, we present qualitative results and ablation studies to demonstrate the efficacy of the proposed changes further. Code and pre-trained models are publicly available at https://github.com/jiahuei/sparse-image-captioning.

CVOct 7, 2021Code
End-to-End Supermask Pruning: Learning to Prune Image Captioning Models

Jia Huei Tan, Chee Seng Chan, Joon Huang Chuah

With the advancement of deep models, research work on image captioning has led to a remarkable gain in raw performance over the last decade, along with increasing model complexity and computational cost. However, surprisingly works on compression of deep networks for image captioning task has received little to no attention. For the first time in image captioning research, we provide an extensive comparison of various unstructured weight pruning methods on three different popular image captioning architectures, namely Soft-Attention, Up-Down and Object Relation Transformer. Following this, we propose a novel end-to-end weight pruning method that performs gradual sparsification based on weight sensitivity to the training loss. The pruning schemes are then extended with encoder pruning, where we show that conducting both decoder pruning and training simultaneously prior to the encoder pruning provides good overall performance. Empirically, we show that an 80% to 95% sparse network (up to 75% reduction in model size) can either match or outperform its dense counterpart. The code and pre-trained models for Up-Down and Object Relation Transformer that are capable of achieving CIDEr scores >120 on the MS-COCO dataset but with only 8.7 MB and 14.5 MB in model size (size reduction of 96% and 94% respectively against dense versions) are publicly available at https://github.com/jiahuei/sparse-image-captioning.

CVSep 29, 2021Code
One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective

Jiun Tian Hoe, Kam Woh Ng, Tianyu Zhang et al.

A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer and multi-label classification is also straightforward with label smoothing. The result is an one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at https://github.com/kamwoh/orthohash

CVAug 25, 2020Code
Protect, Show, Attend and Tell: Empowering Image Captioning Models with Ownership Protection

Jian Han Lim, Chee Seng Chan, Kam Woh Ng et al.

By and large, existing Intellectual Property (IP) protection on deep neural networks typically i) focus on image classification task only, and ii) follow a standard digital watermarking framework that was conventionally used to protect the ownership of multimedia and video content. This paper demonstrates that the current digital watermarking framework is insufficient to protect image captioning tasks that are often regarded as one of the frontiers AI problems. As a remedy, this paper studies and proposes two different embedding schemes in the hidden memory state of a recurrent neural network to protect the image captioning model. From empirical points, we prove that a forged key will yield an unusable image captioning model, defeating the purpose of infringement. To the best of our knowledge, this work is the first to propose ownership protection on image captioning task. Also, extensive experiments show that the proposed method does not compromise the original image captioning performance on all common captioning metrics on Flickr30k and MS-COCO datasets, and at the same time it is able to withstand both removal and ambiguity attacks. Code is available at https://github.com/jianhanlim/ipr-imagecaptioning

CRSep 16, 2019Code
[Extended version] Rethinking Deep Neural Network Ownership Verification: Embedding Passports to Defeat Ambiguity Attacks

Lixin Fan, Kam Woh Ng, Chee Seng Chan

With substantial amount of time, resources and human (team) efforts invested to explore and develop successful deep neural networks (DNN), there emerges an urgent need to protect these inventions from being illegally copied, redistributed, or abused without respecting the intellectual properties of legitimate owners. Following recent progresses along this line, we investigate a number of watermark-based DNN ownership verification methods in the face of ambiguity attacks, which aim to cast doubts on the ownership verification by forging counterfeit watermarks. It is shown that ambiguity attacks pose serious threats to existing DNN watermarking methods. As remedies to the above-mentioned loophole, this paper proposes novel passport-based DNN ownership verification schemes which are both robust to network modifications and resilient to ambiguity attacks. The gist of embedding digital passports is to design and train DNN models in a way such that, the DNN inference performance of an original task will be significantly deteriorated due to forged passports. In other words, genuine passports are not only verified by looking for the predefined signatures, but also reasserted by the unyielding DNN model inference performances. Extensive experimental results justify the effectiveness of the proposed passport-based DNN ownership verification schemes. Code and models are available at https://github.com/kamwoh/DeepIPR

CVMar 4, 2019Code
COMIC: Towards A Compact Image Captioning Model with Attention

Jia Huei Tan, Chee Seng Chan, Joon Huang Chuah

Recent works in image captioning have shown very promising raw performance. However, we realize that most of these encoder-decoder style networks with attention do not scale naturally to large vocabulary size, making them difficult to be deployed on embedded system with limited hardware resources. This is because the size of word and output embedding matrices grow proportionally with the size of vocabulary, adversely affecting the compactness of these networks. To address this limitation, this paper introduces a brand new idea in the domain of image captioning. That is, we tackle the problem of compactness of image captioning models which is hitherto unexplored. We showed that, our proposed model, named COMIC for COMpact Image Captioning, achieves comparable results in five common evaluation metrics with state-of-the-art approaches on both MS-COCO and InstaPIC-1.1M datasets despite having an embedding vocabulary size that is 39x - 99x smaller. The source code and models are available at: https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention

CVMay 29, 2018Code
Getting to Know Low-light Images with The Exclusively Dark Dataset

Yuen Peng Loh, Chee Seng Chan

Low-light is an inescapable element of our daily surroundings that greatly affects the efficiency of our vision. Research works on low-light has seen a steady growth, particularly in the field of image enhancement, but there is still a lack of a go-to database as benchmark. Besides, research fields that may assist us in low-light environments, such as object detection, has glossed over this aspect even though breakthroughs-after-breakthroughs had been achieved in recent years, most noticeably from the lack of low-light data (less than 2% of the total images) in successful public benchmark dataset such as PASCAL VOC, ImageNet, and Microsoft COCO. Thus, we propose the Exclusively Dark dataset to elevate this data drought, consisting exclusively of ten different types of low-light images (i.e. low, ambient, object, single, weak, strong, screen, window, shadow and twilight) captured in visible light only with image and object level annotations. Moreover, we share insightful findings in regards to the effects of low-light on the object detection task by analyzing visualizations of both hand-crafted and learned features. Most importantly, we found that the effects of low-light reaches far deeper into the features than can be solved by simple "illumination invariance'". It is our hope that this analysis and the Exclusively Dark dataset can encourage the growth in low-light domain researches on different fields. The Exclusively Dark dataset with its annotation is available at https://github.com/cs-chan/Exclusively-Dark-Image-Dataset

CVOct 28, 2017Code
Total-Text: A Comprehensive Dataset for Scene Text Detection and Recognition

Chee Kheng Chng, Chee Seng Chan

Text in curve orientation, despite being one of the common text orientations in real world environment, has close to zero existence in well received scene text datasets such as ICDAR2013 and MSRA-TD500. The main motivation of Total-Text is to fill this gap and facilitate a new research direction for the scene text community. On top of the conventional horizontal and multi-oriented texts, it features curved-oriented text. Total-Text is highly diversified in orientations, more than half of its images have a combination of more than two orientations. Recently, a new breed of solutions that casted text detection as a segmentation problem has demonstrated their effectiveness against multi-oriented text. In order to evaluate its robustness against curved text, we fine-tuned DeconvNet and benchmark it on Total-Text. Total-Text with its annotation is available at https://github.com/cs-chan/Total-Text-Dataset

CVAug 31, 2017Code
Improved ArtGAN for Conditional Synthesis of Natural Image and Artwork

Wei Ren Tan, Chee Seng Chan, Hernan Aguirre et al.

This paper proposes a series of new approaches to improve Generative Adversarial Network (GAN) for conditional image synthesis and we name the proposed model as ArtGAN. One of the key innovation of ArtGAN is that, the gradient of the loss function w.r.t. the label (randomly assigned to each generated image) is back-propagated from the categorical discriminator to the generator. With the feedback from the label information, the generator is able to learn more efficiently and generate image with better quality. Inspired by recent works, an autoencoder is incorporated into the categorical discriminator for additional complementary information. Last but not least, we introduce a novel strategy to improve the image quality. In the experiments, we evaluate ArtGAN on CIFAR-10 and STL-10 via ablation studies. The empirical results showed that our proposed model outperforms the state-of-the-art results on CIFAR-10 in terms of Inception score. Qualitatively, we demonstrate that ArtGAN is able to generate plausible-looking images on Oxford-102 and CUB-200, as well as able to draw realistic artworks based on style, artist, and genre. The source code and models are available at: https://github.com/cs-chan/ArtGAN

CVFeb 24
VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos

Aihua Mao, Kaihang Huang, Yong-Jin Liu et al.

3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be open publicly.

LGFeb 14, 2025
Ten Challenging Problems in Federated Foundation Models

Tao Fan, Hanlin Gu, Xuemei Cao et al.

Federated Foundation Models (FedFMs) represent a distributed learning paradigm that fuses general competences of foundation models as well as privacy-preserving capabilities of federated learning. This combination allows the large foundation models and the small local domain models at the remote clients to learn from each other in a teacher-student learning setting. This paper provides a comprehensive summary of the ten challenging problems inherent in FedFMs, encompassing foundational theory, utilization of private data, continual learning, unlearning, Non-IID and graph data, bidirectional knowledge transfer, incentive mechanism design, game mechanism design, model watermarking, and efficiency. The ten challenging problems manifest in five pivotal aspects: ``Foundational Theory," which aims to establish a coherent and unifying theoretical framework for FedFMs. ``Data," addressing the difficulties in leveraging domain-specific knowledge from private data while maintaining privacy; ``Heterogeneity," examining variations in data, model, and computational resources across clients; ``Security and Privacy," focusing on defenses against malicious attacks and model theft; and ``Efficiency," highlighting the need for improvements in training, communication, and parameter efficiency. For each problem, we offer a clear mathematical definition on the objective function, analyze existing methods, and discuss the key challenges and potential solutions. This in-depth exploration aims to advance the theoretical foundations of FedFMs, guide practical implementations, and inspire future research to overcome these obstacles, thereby enabling the robust, efficient, and privacy-preserving FedFMs in various real-world applications.

CVJan 15, 2025
Yuan: Yielding Unblemished Aesthetics Through A Unified Network for Visual Imperfections Removal in Generated Images

Zhenyu Yu, Chee Seng Chan

Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AI-generated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce \textit{Yuan}, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. \textit{Yuan} uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention -- a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, \textit{Yuan} demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore \textit{Yuan}'s potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.

CVJan 17, 2024
IPR-NeRF: Ownership Verification meets Neural Radiance Field

Win Kent Ong, Kam Woh Ng, Chee Seng Chan et al.

Neural Radiance Field (NeRF) models have gained significant attention in the computer vision community in the recent past with state-of-the-art visual quality and produced impressive demonstrations. Since then, technopreneurs have sought to leverage NeRF models into a profitable business. Therefore, NeRF models make it worth the risk of plagiarizers illegally copying, re-distributing, or misusing those models. This paper proposes a comprehensive intellectual property (IP) protection framework for the NeRF model in both black-box and white-box settings, namely IPR-NeRF. In the black-box setting, a diffusion-based solution is introduced to embed and extract the watermark via a two-stage optimization process. In the white-box setting, a designated digital signature is embedded into the weights of the NeRF model by adopting the sign loss objective. Our extensive experiments demonstrate that not only does our approach maintain the fidelity (\ie, the rendering quality) of IPR-NeRF models, but it is also robust against both ambiguity and removal attacks compared to prior arts.

LGOct 14, 2024
A few-shot Label Unlearning in Vertical Federated Learning

Hanlin Gu, Hong Xi Tae, Chee Seng Chan et al.

This paper addresses the critical challenge of unlearning in Vertical Federated Learning (VFL), an area that has received limited attention compared to horizontal federated learning. We introduce the first approach specifically designed to tackle label unlearning in VFL, focusing on scenarios where the active party aims to mitigate the risk of label leakage. Our method leverages a limited amount of labeled data, utilizing manifold mixup to augment the forward embedding of insufficient data, followed by gradient ascent on the augmented embeddings to erase label information from the models. This combination of augmentation and gradient ascent enables high unlearning effectiveness while maintaining efficiency, completing the unlearning procedure within seconds. Extensive experiments conducted on diverse datasets, including MNIST, CIFAR10, CIFAR100, and ModelNet, validate the efficacy and scalability of our approach. This work represents a significant advancement in federated learning, addressing the unique challenges of unlearning in VFL while preserving both privacy and computational efficiency.

CLOct 9, 2025
Banking Done Right: Redefining Retail Banking with Language-Centric AI

Xin Jie Chua, Jeraelyn Ming Li Tan, Jia Xuan Tan et al.

This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first global regulator-approved deployment worldwide where conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank's infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.

CLAug 7, 2025
MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints

Zhong Ken Hew, Jia Xin Low, Sze Jue Yang et al.

Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.

LGApr 18, 2025
Stratify: Rethinking Federated Learning for Non-IID Data through Balanced Sampling

Hui Yeok Wong, Chee Kau Lim, Chee Seng Chan

Federated Learning (FL) on non-independently and identically distributed (non-IID) data remains a critical challenge, as existing approaches struggle with severe data heterogeneity. Current methods primarily address symptoms of non-IID by applying incremental adjustments to Federated Averaging (FedAvg), rather than directly resolving its inherent design limitations. Consequently, performance significantly deteriorates under highly heterogeneous conditions, as the fundamental issue of imbalanced exposure to diverse class and feature distributions remains unresolved. This paper introduces Stratify, a novel FL framework designed to systematically manage class and feature distributions throughout training, effectively tackling the root cause of non-IID challenges. Inspired by classical stratified sampling, our approach employs a Stratified Label Schedule (SLS) to ensure balanced exposure across labels, significantly reducing bias and variance in aggregated gradients. Complementing SLS, we propose a label-aware client selection strategy, restricting participation exclusively to clients possessing data relevant to scheduled labels. Additionally, Stratify incorporates a fine-grained, high-frequency update scheme, accelerating convergence and further mitigating data heterogeneity. To uphold privacy, we implement a secure client selection protocol leveraging homomorphic encryption, enabling precise global label statistics without disclosing sensitive client information. Extensive evaluations on MNIST, CIFAR-10, CIFAR-100, Tiny-ImageNet, COVTYPE, PACS, and Digits-DG demonstrate that Stratify attains performance comparable to IID baselines, accelerates convergence, and reduces client-side computation compared to state-of-the-art methods, underscoring its practical effectiveness in realistic federated learning scenarios.

GRMar 12, 2025
InteractEdit: Zero-Shot Editing of Human-Object Interactions in Images

Jiun Tian Hoe, Weipeng Hu, Wei Zhou et al.

This paper presents InteractEdit, a novel framework for zero-shot Human-Object Interaction (HOI) editing, addressing the challenging task of transforming an existing interaction in an image into a new, desired interaction while preserving the identities of the subject and object. Unlike simpler image editing scenarios such as attribute manipulation, object replacement or style transfer, HOI editing involves complex spatial, contextual, and relational dependencies inherent in humans-objects interactions. Existing methods often overfit to the source image structure, limiting their ability to adapt to the substantial structural modifications demanded by new interactions. To address this, InteractEdit decomposes each scene into subject, object, and background components, then employs Low-Rank Adaptation (LoRA) and selective fine-tuning to preserve pretrained interaction priors while learning the visual identity of the source image. This regularization strategy effectively balances interaction edits with identity consistency. We further introduce IEBench, the most comprehensive benchmark for HOI editing, which evaluates both interaction editing and identity preservation. Our extensive experiments show that InteractEdit significantly outperforms existing methods, establishing a strong baseline for future HOI editing research and unlocking new possibilities for creative and practical applications. Code will be released upon publication.

CVDec 29, 2024
Protégé: Learn and Generate Basic Makeup Styles with Generative Adversarial Networks (GANs)

Jia Wei Sii, Chee Seng Chan

Makeup is no longer confined to physical application; people now use mobile apps to digitally apply makeup to their photos, which they then share on social media. However, while this shift has made makeup more accessible, designing diverse makeup styles tailored to individual faces remains a challenge. This challenge currently must still be done manually by humans. Existing systems, such as makeup recommendation engines and makeup transfer techniques, offer limitations in creating innovative makeups for different individuals "intuitively" -- significant user effort and knowledge needed and limited makeup options available in app. Our motivation is to address this challenge by proposing Protégé, a new makeup application, leveraging recent generative model -- GANs to learn and automatically generate makeup styles. This is a task that existing makeup applications (i.e., makeup recommendation systems using expert system and makeup transfer methods) are unable to perform. Extensive experiments has been conducted to demonstrate the capability of Protégé in learning and creating diverse makeups, providing a convenient and intuitive way, marking a significant leap in digital makeup technology!

CVApr 22, 2024
Gorgeous: Create Your Desired Character Facial Makeup from Any Ideas

Jia Wei Sii, Chee Seng Chan

Contemporary makeup transfer methods primarily focus on replicating makeup from one face to another, considerably limiting their use in creating diverse and creative character makeup essential for visual storytelling. Such methods typically fail to address the need for uniqueness and contextual relevance, specifically aligning with character and story settings as they depend heavily on existing facial makeup in reference images. This approach also presents a significant challenge when attempting to source a perfectly matched facial makeup style, further complicating the creation of makeup designs inspired by various story elements, such as theme, background, and props that do not necessarily feature faces. To address these limitations, we introduce $Gorgeous$, a novel diffusion-based makeup application method that goes beyond simple transfer by innovatively crafting unique and thematic facial makeup. Unlike traditional methods, $Gorgeous$ does not require the presence of a face in the reference images. Instead, it draws artistic inspiration from a minimal set of three to five images, which can be of any type, and transforms these elements into practical makeup applications directly on the face. Our comprehensive experiments demonstrate that $Gorgeous$ can effectively generate distinctive character facial makeup inspired by the chosen thematic reference images. This approach opens up new possibilities for integrating broader story elements into character makeup, thereby enhancing the narrative depth and visual impact in storytelling.

CVDec 10, 2023
InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models

Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan et al.

Large-scale text-to-image (T2I) diffusion models have showcased incredible capabilities in generating coherent images based on textual descriptions, enabling vast applications in content generation. While recent advancements have introduced control over factors such as object localization, posture, and image contours, a crucial gap remains in our ability to control the interactions between objects in the generated content. Well-controlling interactions in generated images could yield meaningful applications, such as creating realistic scenes with interacting characters. In this work, we study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information, consisting of a triplet label (person, action, object) and corresponding bounding boxes. We propose a pluggable interaction control model, called InteractDiffusion that extends existing pre-trained T2I diffusion models to enable them being better conditioned on interactions. Specifically, we tokenize the HOI information and learn their relationships via interaction embeddings. A conditioning self-attention layer is trained to map HOI tokens to visual tokens, thereby conditioning the visual tokens better in existing T2I diffusion models. Our model attains the ability to control the interaction and location on existing T2I diffusion models, which outperforms existing baselines by a large margin in HOI detection score, as well as fidelity in FID and KID. Project page: https://jiuntian.github.io/interactdiffusion.

CVJul 12, 2021
ICDAR 2021 Competition on Integrated Circuit Text Spotting and Aesthetic Assessment

Chun Chet Ng, Akmalul Khairi Bin Nazaruddin, Yeong Khang Lee et al.

With hundreds of thousands of electronic chip components are being manufactured every day, chip manufacturers have seen an increasing demand in seeking a more efficient and effective way of inspecting the quality of printed texts on chip components. The major problem that deters this area of research is the lacking of realistic text on chips datasets to act as a strong foundation. Hence, a text on chips dataset, ICText is used as the main target for the proposed Robust Reading Challenge on Integrated Circuit Text Spotting and Aesthetic Assessment (RRC-ICText) 2021 to encourage the research on this problem. Throughout the entire competition, we have received a total of 233 submissions from 10 unique teams/individuals. Details of the competition and submission results are presented in this report.

AIMar 16, 2021
Ternary Hashing

Chang Liu, Lixin Fan, Kam Woh Ng et al.

This paper proposes a novel ternary hash encoding for learning to hash methods, which provides a principled more efficient coding scheme with performances better than those of the state-of-the-art binary hashing counterparts. Two kinds of axiomatic ternary logic, Kleene logic and Łukasiewicz logic are adopted to calculate the Ternary Hamming Distance (THD) for both the learning/encoding and testing/querying phases. Our work demonstrates that, with an efficient implementation of ternary logic on standard binary machines, the proposed ternary hashing is compared favorably to the binary hashing methods with consistent improvements of retrieval mean average precision (mAP) ranging from 1\% to 5.9\% as shown in CIFAR10, NUS-WIDE and ImageNet100 datasets.

CRFeb 8, 2021
Protecting Intellectual Property of Generative Adversarial Networks from Ambiguity Attack

Ding Sheng Ong, Chee Seng Chan, Kam Woh Ng et al.

Ever since Machine Learning as a Service (MLaaS) emerges as a viable business that utilizes deep learning models to generate lucrative revenue, Intellectual Property Right (IPR) has become a major concern because these deep learning models can easily be replicated, shared, and re-distributed by any unauthorized third parties. To the best of our knowledge, one of the prominent deep learning models - Generative Adversarial Networks (GANs) which has been widely used to create photorealistic image are totally unprotected despite the existence of pioneering IPR protection methodology for Convolutional Neural Networks (CNNs). This paper therefore presents a complete protection framework in both black-box and white-box settings to enforce IPR protection on GANs. Empirically, we show that the proposed method does not compromise the original GANs performance (i.e. image generation, image super-resolution, style transfer), and at the same time, it is able to withstand both removal and ambiguity attacks against embedded watermarks.

LGJun 20, 2020
Rethinking Privacy Preserving Deep Learning: How to Evaluate and Thwart Privacy Attacks

Lixin Fan, Kam Woh Ng, Ce Ju et al.

This paper investigates capabilities of Privacy-Preserving Deep Learning (PPDL) mechanisms against various forms of privacy attacks. First, we propose to quantitatively measure the trade-off between model accuracy and privacy losses incurred by reconstruction, tracing and membership attacks. Second, we formulate reconstruction attacks as solving a noisy system of linear equations, and prove that attacks are guaranteed to be defeated if condition (2) is unfulfilled. Third, based on theoretical analysis, a novel Secret Polarization Network (SPN) is proposed to thwart privacy attacks, which pose serious challenges to existing PPDL methods. Extensive experiments showed that model accuracies are improved on average by 5-20% compared with baseline mechanisms, in regimes where data privacy are satisfactorily protected.

CVFeb 24, 2020
On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering

Xinyu Wang, Yuliang Liu, Chunhua Shen et al.

Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to generalize. This is visible in the fact that they are vulnerable to learning coincidental correlations in the data rather than deeper relations between image content and ideas expressed in language. We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages, and an evaluation process that co-opts a well understood image-based metric to reflect the method's ability to reason. Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct. The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge. Experiments and analysis are provided that show the value of the dataset.

CVSep 17, 2019
ICDAR 2019 Competition on Large-scale Street View Text with Partial Labeling -- RRC-LSVT

Yipeng Sun, Zihan Ni, Chee-Kheng Chng et al.

Robust text reading from street view images provides valuable information for various applications. Performance improvement of existing methods in such a challenging scenario heavily relies on the amount of fully annotated training data, which is costly and in-efficient to obtain. To scale up the amount of training data while keeping the labeling procedure cost-effective, this competition introduces a new challenge on Large-scale Street View Text with Partial Labeling (LSVT), providing 50, 000 and 400, 000 images in full and weak annotations, respectively. This competition aims to explore the abilities of state-of-the-art methods to detect and recognize text instances from large-scale street view images, closing the gap between research benchmarks and real applications. During the competition period, a total of 41 teams participated in the two proposed tasks with 132 valid submissions, i.e., text detection and end-to-end text spotting. This paper includes dataset descriptions, task definitions, evaluation protocols and results summaries of the ICDAR 2019-LSVT challenge.

CVSep 16, 2019
ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT)

Chee-Kheng Chng, Yuliang Liu, Yipeng Sun et al.

This paper reports the ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT) that consists of three major challenges: i) scene text detection, ii) scene text recognition, and iii) scene text spotting. A total of 78 submissions from 46 unique teams/individuals were received for this competition. The top performing score of each challenge is as follows: i) T1 - 82.65%, ii) T2.1 - 74.3%, iii) T2.2 - 85.32%, iv) T3.1 - 53.86%, and v) T3.2 - 54.91%. Apart from the results, this paper also details the ArT dataset, tasks description, evaluation metrics and participants methods. The dataset, the evaluation kit as well as the results are publicly available at https://rrc.cvc.uab.es/?ch=14

CVAug 28, 2019
Image Captioning with Sparse Recurrent Neural Network

Jia Huei Tan, Chee Seng Chan, Joon Huang Chuah

Recurrent Neural Network (RNN) has been widely used to tackle a wide variety of language generation problems and are capable of attaining state-of-the-art (SOTA) performance. However despite its impressive results, the large number of parameters in the RNN model makes deployment to mobile and embedded devices infeasible. Driven by this problem, many works have proposed a number of pruning methods to reduce the sizes of the RNN model. In this work, we propose an end-to-end pruning method for image captioning models equipped with visual attention. Our proposed method is able to achieve sparsity levels up to 97.5% without significant performance loss relative to the baseline (~ 2% loss at 40x compression after fine-tuning). Our method is also simple to use and tune, facilitating faster development times for neural network practitioners. We perform extensive experiments on the popular MS-COCO dataset in order to empirically validate the efficacy of our proposed method.

CRMay 10, 2019
Digital Passport: A Novel Technological Strategy for Intellectual Property Protection of Convolutional Neural Networks

Lixin Fan, KamWoh Ng, Chee Seng Chan

In order to prevent deep neural networks from being infringed by unauthorized parties, we propose a generic solution which embeds a designated digital passport into a network, and subsequently, either paralyzes the network functionalities for unauthorized usages or maintain its functionalities in the presence of a verified passport. Such a desired network behavior is successfully demonstrated in a number of implementation schemes, which provide reliable, preventive and timely protections against tens of thousands of fake-passport deceptions. Extensive experiments also show that the deep neural network performance under unauthorized usages deteriorate significantly (e.g. with 33% to 82% reductions of CIFAR10 classification accuracies), while networks endorsed with valid passports remain intact.

LGJan 20, 2019
A Universal Logic Operator for Interpretable Deep Convolution Networks

KamWoh Ng, Lixin Fan, Chee Seng Chan

Explaining neural network computation in terms of probabilistic/fuzzy logical operations has attracted much attention due to its simplicity and high interpretability. Different choices of logical operators such as AND, OR and XOR give rise to another dimension for network optimization, and in this paper, we study the open problem of learning a universal logical operator without prescribing to any logical operations manually. Insightful observations along this exploration furnish deep convolution networks with a novel logical interpretation.

CVNov 11, 2017
Phrase-based Image Captioning with Hierarchical LSTM Model

Ying Hua Tan, Chee Seng Chan

Automatic generation of caption to describe the content of an image has been gaining a lot of research interests recently, where most of the existing works treat the image caption as pure sequential data. Natural language, however possess a temporal hierarchy structure, with complex dependencies between each subsequence. In this paper, we propose a phrase-based hierarchical Long Short-Term Memory (phi-LSTM) model to generate image description. In contrast to the conventional solutions that generate caption in a pure sequential manner, our proposed model decodes image caption from phrase to sentence. It consists of a phrase decoder at the bottom hierarchy to decode noun phrases of variable length, and an abbreviated sentence decoder at the upper hierarchy to decode an abbreviated form of the image description. A complete image caption is formed by combining the generated phrases with sentence during the inference stage. Empirically, our proposed model shows a better or competitive result on the Flickr8k, Flickr30k and MS-COCO datasets in comparison to the state-of-the art models. We also show that our proposed model is able to generate more novel captions (not seen in the training data) which are richer in word contents in all these three datasets.

CVSep 1, 2017
A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community

John E. Ball, Derek T. Anderson, Chee Seng Chan

In recent years, deep learning (DL), a re-branding of neural networks (NNs), has risen to the top in numerous areas, namely computer vision (CV), speech recognition, natural language processing, etc. Whereas remote sensing (RS) possesses a number of unique challenges, primarily related to sensors and applications, inevitably RS draws from many of the same theories as CV; e.g., statistics, fusion, and machine learning, to name a few. This means that the RS community should be aware of, if not at the leading edge of, of advancements like DL. Herein, we provide the most comprehensive survey of state-of-the-art RS DL research. We also review recent new developments in the DL field that can be used in DL for RS. Namely, we focus on theories, tools and challenges for the RS community. Specifically, we focus on unsolved challenges and opportunities as it relates to (i) inadequate data sets, (ii) human-understandable solutions for modelling physical phenomena, (iii) Big Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and learning algorithms for spectral, spatial and temporal data, (vi) transfer learning, (vii) an improved theoretical understanding of DL systems, (viii) high barriers to entry, and (ix) training and optimizing the DL.

CVFeb 11, 2017
ArtGAN: Artwork Synthesis with Conditional Categorical GANs

Wei Ren Tan, Chee Seng Chan, Hernan Aguirre et al.

This paper proposes an extension to the Generative Adversarial Networks (GANs), namely as ARTGAN to synthetically generate more challenging and complex images such as artwork that have abstract characteristics. This is in contrast to most of the current solutions that focused on generating natural images such as room interiors, birds, flowers and faces. The key innovation of our work is to allow back-propagation of the loss function w.r.t. the labels (randomly assigned to each generated images) to the generator from the discriminator. With the feedback from the label information, the generator is able to learn faster and achieve better generated image quality. Empirically, we show that the proposed ARTGAN is capable to create realistic artwork, as well as generate compelling real world images that globally look natural with clear shape on CIFAR-10.

CLAug 20, 2016
phi-LSTM: A Phrase-based Hierarchical LSTM Model for Image Captioning

Ying Hua Tan, Chee Seng Chan

A picture is worth a thousand words. Not until recently, however, we noticed some success stories in understanding of visual scenes: a model that is able to detect/name objects, describe their attributes, and recognize their relationships/interactions. In this paper, we propose a phrase-based hierarchical Long Short-Term Memory (phi-LSTM) model to generate image description. The proposed model encodes sentence as a sequence of combination of phrases and words, instead of a sequence of words alone as in those conventional solutions. The two levels of this model are dedicated to i) learn to generate image relevant noun phrases, and ii) produce appropriate image description from the phrases and other words in the corpus. Adopting a convolutional neural network to learn image features and the LSTM to learn the word sequence in a sentence, the proposed model has shown better or competitive results in comparison to the state-of-the-art models on Flickr8k and Flickr30k datasets.

CVNov 20, 2015
Crowd Behavior Analysis: A Review where Physics meets Biology

Ven Jyn Kok, Mei Kuan Lim, Chee Seng Chan

Although the traits emerged in a mass gathering are often non-deliberative, the act of mass impulse may lead to irre- vocable crowd disasters. The two-fold increase of carnage in crowd since the past two decades has spurred significant advances in the field of computer vision, towards effective and proactive crowd surveillance. Computer vision stud- ies related to crowd are observed to resonate with the understanding of the emergent behavior in physics (complex systems) and biology (animal swarm). These studies, which are inspired by biology and physics, share surprisingly common insights, and interesting contradictions. However, this aspect of discussion has not been fully explored. Therefore, this survey provides the readers with a review of the state-of-the-art methods in crowd behavior analysis from the physics and biologically inspired perspectives. We provide insights and comprehensive discussions for a broader understanding of the underlying prospect of blending physics and biology studies in computer vision.

CVJun 28, 2015
Deep-Plant: Plant Identification with convolutional neural networks

Sue Han Lee, Chee Seng Chan, Paul Wilkin et al.

This paper studies convolutional neural networks (CNN) to learn unsupervised feature representations for 44 different plant species, collected at the Royal Botanic Gardens, Kew, England. To gain intuition on the chosen features from the CNN model (opposed to a 'black box' solution), a visualisation technique based on the deconvolutional networks (DN) is utilized. It is found that venations of different order have been chosen to uniquely represent each of the plant species. Experimental results using these CNN features with different classifiers show consistency and superiority compared to the state-of-the art solutions which rely on hand-crafted features.

CVDec 1, 2014
Fuzzy human motion analysis: A review

Chern Hong Lim, Ekta Vats, Chee Seng Chan

Human Motion Analysis (HMA) is currently one of the most popularly active research domains as such significant research interests are motivated by a number of real world applications such as video surveillance, sports analysis, healthcare monitoring and so on. However, most of these real world applications face high levels of uncertainties that can affect the operations of such applications. Hence, the fuzzy set theory has been applied and showed great success in the recent past. In this paper, we aim at reviewing the fuzzy set oriented approaches for HMA, individuating how the fuzzy set may improve the HMA, envisaging and delineating the future perspectives. To the best of our knowledge, there is not found a single survey in the current literature that has discussed and reviewed fuzzy approaches towards the HMA. For ease of understanding, we conceptually classify the human motion into three broad levels: Low-Level (LoL), Mid-Level (MiL), and High-Level (HiL) HMA.

CVOct 15, 2014
Detection of Salient Regions in Crowded Scenes

Mei Kuan Lim, Chee Seng Chan, Dorothy Monekosso et al.

The increasing number of cameras and a handful of human operators to monitor the video inputs from hundreds of cameras leave the system ill equipped to fulfil the task of detecting anomalies. Thus, there is a dire need to automatically detect regions that require immediate attention for a more effective and proactive surveillance. We propose a framework that utilises the temporal variations in the flow field of a crowd scene to automatically detect salient regions, while eliminating the need to have prior knowledge of the scene or training. We deem the flow fields to be a dynamic system and adopt the stability theory of dynamical systems, to determine the motion dynamics within a given area. In the context of this work, salient regions refer to areas with high motion dynamics, where points in a particular region are unstable. Experimental results on public, crowd scenes have shown the effectiveness of the proposed method in detecting salient regions which correspond to unstable flow, occlusions, bottlenecks, entries and exits.

CVOct 14, 2014
Crowd Saliency Detection via Global Similarity Structure

Mei Kuan Lim, Ven Jyn Kok, Chen Change Loy et al.

It is common for CCTV operators to overlook inter- esting events taking place within the crowd due to large number of people in the crowded scene (i.e. marathon, rally). Thus, there is a dire need to automate the detection of salient crowd regions acquiring immediate attention for a more effective and proactive surveillance. This paper proposes a novel framework to identify and localize salient regions in a crowd scene, by transforming low-level features extracted from crowd motion field into a global similarity structure. The global similarity structure representation allows the discovery of the intrinsic manifold of the motion dynamics, which could not be captured by the low-level representation. Ranking is then performed on the global similarity structure to identify a set of extrema. The proposed approach is unsupervised so learning stage is eliminated. Experimental results on public datasets demonstrates the effectiveness of exploiting such extrema in identifying salient regions in various crowd scenarios that exhibit crowding, local irregular motion, and unique motion areas such as sources and sinks.