Yain-Whar Si

CV
h-index: 29
13 papers
40 citations
Novelty: 46%
AI Score: 43

13 Papers

77.0 | CR | Apr 2 | Code
An End-to-End Model for Logits-Based Large Language Models Watermarking

Kahim Wong, Jicheng Zhou, Jiantao Zhou et al.

The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually suffer from high false-positive rates, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and can introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, particularly due to the lack of end-to-end optimization of the encoder and decoder. In this paper, we introduce a novel end-to-end logits perturbation method for watermarking LLM-generated text. Through joint optimization of the encoder and decoder, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method achieves superior robustness, outperforming distortion-free methods by 37-39% under paraphrasing and 17.2% on average, while maintaining text quality on par with these distortion-free methods in terms of text perplexity and downstream tasks. Our method can be easily generalized to different LLMs. Code is available at https://github.com/KahimWong/E2E-LLM-Watermark.

CV | Apr 17, 2024 | Code
FastFace: Fast-converging Scheduler for Large-scale Face Recognition Training with One GPU

Xueyuan Gong, Zhiquan Liu, Yain-Whar Si et al.

Computing power has become a foundational and indispensable resource in deep learning, particularly in tasks such as Face Recognition (FR) model training on large-scale datasets, where multiple GPUs are often a necessity. Recognizing this challenge, some FR methods have started exploring ways to compress the fully-connected layer in FR models. Taking a different approach, we observe that without prompt scheduling of the learning rate (LR) during FR model training, the loss curve tends to exhibit numerous stationary subsequences. To address this issue, we introduce a novel LR scheduler leveraging Exponential Moving Average (EMA) and Haar Convolutional Kernel (HCK) to eliminate stationary subsequences, resulting in a significant reduction in convergence time. However, the proposed scheduler incurs a considerable computational overhead due to its time complexity. To overcome this limitation, we propose FastFace, a fast-converging scheduler with negligible time complexity, i.e., O(1) per iteration, during training. In practice, FastFace is able to accelerate FR model training to a quarter of its original time without sacrificing more than 1% accuracy, making large-scale FR training feasible even with a single GPU in terms of both time and space complexity. Extensive experiments validate the efficiency and effectiveness of FastFace. The code is publicly available at: https://github.com/amoonfana/FastFace
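The idea of spotting a stationary stretch of the loss curve with an EMA and a Haar-style kernel can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the window size, and the threshold are hypothetical, and the sketch does not reproduce the scheduler in the FastFace repository or its O(1) variant.

```python
def ema(values, alpha=0.1):
    """Exponential moving average of a loss sequence (smooths out noise)."""
    smoothed, current = [], values[0]
    for v in values:
        current = alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


def haar_response(values, half_width):
    """Haar-like step kernel over the last 2*half_width points:
    mean of the older half minus mean of the newer half."""
    window = values[-2 * half_width:]
    older, newer = window[:half_width], window[half_width:]
    return sum(older) / half_width - sum(newer) / half_width


def is_stationary(losses, half_width=5, threshold=1e-3, alpha=0.1):
    """Hypothetical test: the smoothed loss has stopped moving when the
    Haar response over the most recent window is close to zero."""
    if len(losses) < 2 * half_width:
        return False
    return abs(haar_response(ema(losses, alpha), half_width)) < threshold
```

A scheduler built on such a test might lower the learning rate whenever `is_stationary` returns true. Recomputing the EMA and the window at every step is what makes this naive form costly compared with a constant-time update.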

CV | Jul 22, 2025 | Code
ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement

Kahim Wong, Jicheng Zhou, Haiwei Wu et al.

The advancement of image editing tools has enabled malicious manipulation of sensitive document images, underscoring the need for robust document image forgery detection. Though forgery detectors for natural images have been extensively studied, they struggle with document images, as the tampered regions can be seamlessly blended into the uniform document background (BG) and structured text. On the other hand, existing document-specific methods lack sufficient robustness against various degradations, which limits their practical deployment. This paper presents ADCD-Net, a robust document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. Specifically, to address the DCT traces' sensitivity to block misalignment, we adaptively modulate the DCT feature contribution based on a predicted alignment score, resulting in much improved resilience to various distortions, including resizing and cropping. In addition, a hierarchical content disentanglement approach is proposed to boost the localization performance by mitigating the text-BG disparities. Furthermore, noting the predominantly pristine nature of BG regions, we construct a pristine prototype capturing traces of untampered regions, which enhances both the localization accuracy and robustness. Our proposed ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79% averaged over 5 types of distortions. The code is available at https://github.com/KAHIMWONG/ACDC-Net.

CV | Apr 4, 2025 | Code
FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge

Kahim Wong, Jicheng Zhou, Kemou Li et al.

The proliferation of AI-generated content raises significant forensic and security concerns such as source tracing and copyright protection, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating a large number of font variants that closely resemble the original font. Furthermore, in the decoder, we employ image-text contrastive learning to reconstruct the embedded bits, which achieves desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at https://github.com/KAHIMWONG/FontGuard.

CV | Dec 8, 2023
X2-Softmax: Margin Adaptive Loss Function for Face Recognition

Jiamu Xu, Xiaoxiang Liu, Xinyuan Zhang et al.

Learning discriminative features of different faces is an important task in face recognition. Extracting face features with a neural network makes it easy to measure the similarity between face images, which makes face recognition possible. To enhance the separability of the learned face features, incorporating an angular margin during training is common practice. The state-of-the-art loss functions CosFace and ArcFace apply fixed margins between class weights to enhance the inter-class separation of face features. Because the distribution of samples in the training set is imbalanced, similarities between different identities are unequal. Therefore, an inappropriately fixed angular margin may make the model hard to converge or leave the face features insufficiently discriminative. It is more intuitive for the margin to be angle-adaptive, increasing as the angle between classes grows. In this paper, we propose a new angular margin loss named X2-Softmax. The X2-Softmax loss has adaptive angular margins that increase with the angle between different classes. The adaptive angular margin ensures model flexibility and effectively improves face recognition performance. We trained a neural network with the X2-Softmax loss on the MS1Mv3 dataset and tested it on several evaluation benchmarks to demonstrate the effectiveness and superiority of our loss function.
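To make the contrast between fixed and angle-adaptive margins concrete, here is a small sketch. CosFace subtracts a constant from the cosine and ArcFace adds a constant to the angle; the `quadratic_logit` function and its parameters `a`, `h`, and `k` are hypothetical stand-ins chosen for illustration, since the abstract does not give the actual X2-Softmax formula. The margin values for CosFace and ArcFace are commonly used defaults, not values taken from this abstract.

```python
import math


def cosface_logit(theta, m=0.35):
    """CosFace: subtract a fixed margin m from cos(theta)."""
    return math.cos(theta) - m


def arcface_logit(theta, m=0.5):
    """ArcFace: add a fixed margin m to the angle itself."""
    return math.cos(theta + m)


def quadratic_logit(theta, a=-0.5, h=0.0, k=1.0):
    """Hypothetical quadratic logit a*(theta - h)^2 + k.
    With these illustrative values it equals 1 - theta^2 / 2."""
    return a * (theta - h) ** 2 + k


def margin(logit_fn, theta):
    """Gap in cosine space between plain softmax and the margin logit."""
    return math.cos(theta) - logit_fn(theta)
```

With the illustrative parameters, the quadratic margin is `cos(theta) - 1 + theta**2 / 2`, which is zero at `theta = 0` and grows as the angle grows, whereas the CosFace margin stays at `m` for every angle.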

CV | Mar 10, 2025
AttFC: Attention Fully-Connected Layer for Large-Scale Face Recognition with One GPU

Zhuowen Zheng, Yain-Whar Si, Xiaochen Yuan et al.

With the advancement of deep neural networks (DNNs) and the availability of large-scale datasets, face recognition (FR) models have achieved exceptional performance. However, the number of parameters in the fully connected (FC) layer depends directly on the number of identities in the dataset. When an FR model is trained on a large-scale dataset, the model therefore becomes excessively large, leading to a substantial demand for computational resources such as time and memory. This paper proposes the attention fully connected (AttFC) layer, which can significantly reduce the computational resources required. AttFC employs an attention loader to generate the generative class center (GCC) and dynamically stores the class centers in a Dynamic Class Container (DCC). The DCC stores only a small subset of all class centers in the FC layer, so its parameter count is substantially smaller than that of the FC layer. In addition, training face recognition models on large-scale datasets with one GPU often runs into out-of-memory (OOM) issues. AttFC overcomes this and achieves performance comparable to state-of-the-art methods.

CV | Mar 8, 2025
MSConv: Multiplicative and Subtractive Convolution for Face Recognition

Si Zhou, Yain-Whar Si, Xiaochen Yuan et al.

In neural networks, there are various methods of feature fusion. Different strategies can significantly affect the effectiveness of feature representation, and consequently the model's ability to extract representative and discriminative features. In the field of face recognition, traditional feature fusion methods include feature concatenation and feature addition. Recently, various attention mechanism-based fusion strategies have emerged. However, we found that these methods primarily focus on the important features in the image, referred to as salient features in this paper, while neglecting another equally important set of features for image recognition tasks, which we term differential features. This may cause the model to overlook critical local differences when dealing with complex facial samples. Therefore, in this paper, we propose an efficient convolution module called MSConv (Multiplicative and Subtractive Convolution), designed to balance the model's learning of salient and differential features. Specifically, we employ multi-scale mixed convolution to capture both local and broader contextual information from face images, and then utilize a Multiplication Operation (MO) and a Subtraction Operation (SO) to extract salient and differential features, respectively. Experimental results demonstrate that by integrating both salient and differential features, MSConv outperforms models that only focus on salient features.

CV | Mar 5, 2025
RVAFM: Re-parameterizing Vertical Attention Fusion Module for Handwritten Paragraph Text Recognition

Jinhui Zheng, Zhiquan Liu, Yain-Whar Si et al.

Handwritten Paragraph Text Recognition (HPTR) is a challenging task in Computer Vision, requiring the transformation of a paragraph text image, rich in handwritten text, into text encoding sequences. One of the most advanced models for this task is the Vertical Attention Network (VAN), which utilizes a Vertical Attention Module (VAM) to implicitly segment paragraph text images into text lines, thereby reducing the difficulty of the recognition task. However, from a network structure perspective, VAM is a single-branch module, which is less effective in learning compared to multi-branch modules. In this paper, we propose a new module, named Re-parameterizing Vertical Attention Fusion Module (RVAFM), which incorporates structural re-parameterization techniques. RVAFM decouples the structure of the module during the training and inference stages. During training, it uses a multi-branch structure for more effective learning, and during inference, it uses a single-branch structure for faster processing. The features learned by the multi-branch structure are fused into the single-branch structure through a special fusion method named Re-parameterization Fusion (RF) without any loss of information. As a result, we achieve a Character Error Rate (CER) of 4.44% and a Word Error Rate (WER) of 14.37% on the IAM paragraph-level test set. Additionally, the inference speed is slightly faster than that of VAN.
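The general principle behind structural re-parameterization, that parallel linear branches can be merged into one equivalent branch, can be illustrated with plain matrices. This is a generic sketch under assumptions: the two linear branches plus an identity shortcut are hypothetical, and the paper's Re-parameterization Fusion (RF) for attention modules may differ in its details.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def identity(n):
    """n x n identity matrix, representing a shortcut branch."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def add_matrices(*mats):
    """Element-wise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]


def multi_branch(W1, W2, x):
    """Training-time form: two linear branches plus an identity shortcut."""
    y1, y2 = matvec(W1, x), matvec(W2, x)
    return [a + b + c for a, b, c in zip(y1, y2, x)]


def fuse(W1, W2):
    """Inference-time form: one matrix equivalent to all three branches."""
    return add_matrices(W1, W2, identity(len(W1)))
```

Because every branch is linear, `matvec(fuse(W1, W2), x)` gives the same output as `multi_branch(W1, W2, x)` while performing a single matrix multiplication, which is why the fused form can be faster at inference without losing information.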

CR | Feb 19, 2022
NFTCert: NFT-Based Certificates With Online Payment Gateway

Xiongfei Zhao, Yain-Whar Si

Academic certificates are still widely issued in paper format. Traditional certificate verification is a lengthy, manually intensive, and sometimes expensive process. In this paper, we propose a novel NFT-based certificate framework called NFTCert, which enables the establishment of links between a legitimate certificate and its owner through a Blockchain. We describe the implementation of the NFTCert framework, including schema definition, minting, verification, and revocation of NFT-based certificates. We also introduce a payment gateway into the minting process, which enables NFTCert to be used by a wider audience. As a result, NFTCert participants do not need to rely on cryptocurrency for transactions. Overall, the proposed framework is designed to achieve usability, authenticity, confidentiality, transparency, and availability when compared to existing Blockchain-based systems.

CR | Feb 19, 2022
Dynamic Transaction Storage Strategies for a Sustainable Blockchain

Xiongfei Zhao, Yain-Whar Si

As the core technology behind Bitcoin, Blockchain's decentralized, tamper-proof, and traceable features make it the preferred platform for organizational innovation. In Bitcoin today, the block reward is halved every four years, and transaction fees are expected to become the majority of miner revenues around 2140. When transaction fees dominate mining rewards, strategic deviations such as Selfish Mining, Undercutting, and Mining Gap could threaten the integrity and security of the Blockchain. This paper proposes a set of Dynamic Transaction Storage (DTS) strategies for maintaining a sustainable Blockchain under the transaction-fee regime. Through systematic simulation, we demonstrate that applying DTS strategies reduces block incentive volatility and avoids strategic deviations. With DTS, public Blockchains such as Bitcoin become sustainable when the mining reward is solely based on transaction fees.
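The halving schedule mentioned above can be worked through in a few lines. The constants below come from the Bitcoin protocol rather than from the abstract: the subsidy starts at 50 BTC, is counted in satoshis, and halves every 210,000 blocks, which is roughly four years at one block per ten minutes. The function name is chosen here for illustration.

```python
HALVING_INTERVAL = 210_000          # blocks between halvings (Bitcoin protocol)
SATOSHIS_PER_BTC = 100_000_000
INITIAL_SUBSIDY = 50 * SATOSHIS_PER_BTC


def block_subsidy(height):
    """Block subsidy in satoshis at a given block height.
    Integer right-shift halves the subsidy and rounds down."""
    halvings = height // HALVING_INTERVAL
    return INITIAL_SUBSIDY >> halvings


# First halving count at which the subsidy rounds down to zero.
first_zero_halving = next(n for n in range(64)
                          if INITIAL_SUBSIDY >> n == 0)
first_zero_height = first_zero_halving * HALVING_INTERVAL
```

Here `first_zero_halving` evaluates to 33 and `first_zero_height` to 6,930,000. At about four years per halving, that is where the date around 2140 comes from; beyond that height miners earn transaction fees only.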

LG | Dec 4, 2021
KDCTime: Knowledge Distillation with Calibration on InceptionTime for Time-series Classification

Xueyuan Gong, Yain-Whar Si, Yongqi Tian et al.

Time-series classification approaches based on deep neural networks tend to overfit on UCR datasets because of the few-shot nature of those datasets. To alleviate overfitting and further improve accuracy, we first propose Label Smoothing for InceptionTime (LSTime), which uses the information in soft labels rather than only hard labels. Next, instead of manually adjusting soft labels as in LSTime, we propose Knowledge Distillation for InceptionTime (KDTime), which automatically generates soft labels with a teacher model. Finally, to rectify incorrectly predicted soft labels from the teacher model, we propose Knowledge Distillation with Calibration for InceptionTime (KDCTime), which offers two optional calibration strategies, i.e., KDC by Translating (KDCT) and KDC by Reordering (KDCR). The experimental results show that the accuracy of KDCTime is promising, while its inference time is two orders of magnitude faster than ROCKET with an acceptable training time overhead.
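The difference between hand-set soft labels and teacher-generated soft labels can be sketched as below. This uses the standard textbook forms of label smoothing and temperature-scaled softmax; the parameter values `epsilon` and `temperature` are illustrative assumptions, and the KDCT and KDCR calibration steps are not shown because the abstract does not describe them.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Label smoothing: move epsilon of the probability mass from the
    true class to a uniform distribution over all classes."""
    off_value = epsilon / num_classes
    return [1.0 - epsilon + off_value if c == true_class else off_value
            for c in range(num_classes)]


def teacher_soft_labels(logits, temperature=2.0):
    """Knowledge distillation: soft labels from a teacher's logits via a
    temperature-scaled softmax (higher temperature gives softer labels)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With label smoothing every wrong class receives the same small probability, whereas a teacher's soft labels can rank the wrong classes by similarity. That ranking can itself be wrong when the teacher mispredicts, which is the case the calibration strategies are meant to address.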

LG | Mar 26, 2021
Multi-source Transfer Learning with Ensemble for Financial Time Series Forecasting

Qi-Qiao He, Patrick Cheong-Iao Pang, Yain-Whar Si

Although transfer learning has proven effective in computer vision and natural language processing applications, it is rarely investigated in forecasting financial time series. The majority of existing works on transfer learning are based on single-source transfer learning due to the availability of open-access large-scale datasets. However, in the financial domain, the lengths of individual time series are relatively short and single-source transfer learning models are less effective. Therefore, in this paper, we investigate multi-source deep transfer learning for financial time series. We propose two multi-source transfer learning methods, namely Weighted Average Ensemble for Transfer Learning (WAETL) and Tree-structured Parzen Estimator Ensemble Selection (TPEES). The effectiveness of our approach is evaluated on financial time series extracted from stock markets. Experimental results reveal that TPEES outperforms other baseline methods on the majority of multi-source transfer tasks.
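A weighted average ensemble over forecasts from several source-specific models can be sketched as follows. This shows only the generic combination step; how WAETL actually chooses its weights is not stated in the abstract, so the `weights` argument here is a hypothetical input supplied by the caller.

```python
def weighted_average_ensemble(predictions, weights):
    """Combine forecasts from several models into one forecast.

    predictions: list of forecasts, one per model, each a list of
                 values over the same horizon.
    weights:     one non-negative weight per model (hypothetical;
                 normalized here so they sum to 1).
    """
    total = sum(weights)
    normalized = [w / total for w in weights]
    horizon = len(predictions[0])
    return [sum(w * p[t] for w, p in zip(normalized, predictions))
            for t in range(horizon)]
```

For example, combining the forecasts `[1.0, 2.0]` and `[3.0, 4.0]` with weights `[1, 3]` gives `[2.5, 3.5]`, since the second model contributes three quarters of each value.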

HC | Sep 20, 2019
An Experimental Comparison of Map-like Visualisations and Treemaps

Patrick Cheong-Iao Pang, Robert P. Biuk-Aghai, Simon Fong et al.

Treemaps have been used in information visualisation for over two decades. They make use of nested filled areas to represent information hierarchies such as file systems and library catalogues. Recent years have witnessed the emergence of visualisations that resemble geographic maps. In this paper we present a study that compares the performance of one such map-like visualisation with the original two forms of the treemap, namely nested and non-nested treemaps. Our study employed a mixed-method evaluation of accuracy, speed and usability (such as ease of use and helpfulness in understanding the information). We found that accuracy was highest for the map-like visualisations, followed by nested treemaps and lastly non-nested treemaps. Task performance was fastest for nested treemaps, followed by non-nested treemaps, and then map-like visualisations. For usability, nested treemaps were considered slightly more helpful than map-like visualisations, while non-nested treemaps performed poorly. We conclude that the results regarding accuracy are promising for the use of map-like visualisations in tasks involving the visualisation of hierarchical information, while non-nested treemaps are favoured in tasks requiring speed.