Si Chen

CV
h-index16
51papers
1,627citations
Novelty57%
AI Score61

51 Papers

CVJun 4
Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

Liangsheng Liu, Si Chen, Jiamin Wu et al.

Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.

CVJul 16, 2022Code
Learn-to-Decompose: Cascaded Decomposition Network for Cross-Domain Few-Shot Facial Expression Recognition

Xinyi Zou, Yan Yan, Jing-Hao Xue et al.

Most existing compound facial expression recognition (FER) methods rely on large-scale labeled compound expression data for training. However, collecting such data is labor-intensive and time-consuming. In this paper, we address the compound FER task in the cross-domain few-shot learning (FSL) setting, which requires only a few samples of compound expressions in the target domain. Specifically, we propose a novel cascaded decomposition network (CDNet), which cascades several learn-to-decompose modules with shared parameters based on a sequential decomposition mechanism, to obtain a transferable feature space. To alleviate the overfitting problem caused by limited base classes in our task, a partial regularization strategy is designed to effectively exploit the best of both episodic training and batch training. By training across similar tasks on multiple basic expression datasets, CDNet learns the ability of learn-to-decompose that can be easily adapted to identify unseen compound expressions. Extensive experiments on both in-the-lab and in-the-wild compound expression datasets demonstrate the superiority of our proposed CDNet against several state-of-the-art FSL methods. Code is available at: https://github.com/zouxinyi0625/CDNet.

IRJun 23, 2022Code
Shilling Black-box Recommender Systems by Learning to Generate Fake User Profiles

Chen Lin, Si Chen, Meifang Zeng et al.

Due to the pivotal role of Recommender Systems (RS) in guiding customers towards the purchase, there is a natural motivation for unscrupulous parties to spoof RS for profits. In this paper, we study Shilling Attack where an adversarial party injects a number of fake user profiles for improper purposes. Conventional Shilling Attack approaches lack attack transferability (i.e., attacks are not effective on some victim RS models) and/or attack invisibility (i.e., injected profiles can be easily detected). To overcome these issues, we present Leg-UP, a novel attack model based on the Generative Adversarial Network. Leg-UP learns user behavior patterns from real users in the sampled ``templates'' and constructs fake user profiles. To simulate real users, the generator in Leg-UP directly outputs discrete ratings. To enhance attack transferability, the parameters of the generator are optimized by maximizing the attack performance on a surrogate RS model. To improve attack invisibility, Leg-UP adopts a discriminator to guide the generator to generate undetectable fake user profiles. Experiments on benchmarks have shown that Leg-UP exceeds state-of-the-art Shilling Attack methods on a wide range of victim RS models. The source code of our work is available at: https://github.com/XMUDM/ShillingAttack.

CVMar 8, 2022
Stage-Aware Feature Alignment Network for Real-Time Semantic Segmentation of Street Scenes

Xi Weng, Yan Yan, Si Chen et al.

Over the past few years, deep convolutional neural network-based methods have made great progress in semantic segmentation of street scenes. Some recent methods align feature maps to alleviate the semantic gap between them and achieve high segmentation accuracy. However, they usually adopt the feature alignment modules with the same network configuration in the decoder and thus ignore the different roles of stages of the decoder during feature aggregation, leading to a complex decoder structure. Such a manner greatly affects the inference speed. In this paper, we present a novel Stage-aware Feature Alignment Network (SFANet) based on the encoder-decoder structure for real-time semantic segmentation of street scenes. Specifically, a Stage-aware Feature Alignment module (SFA) is proposed to align and aggregate two adjacent levels of feature maps effectively. In the SFA, by taking into account the unique role of each stage in the decoder, a novel stage-aware Feature Enhancement Block (FEB) is designed to enhance spatial details and contextual information of feature maps from the encoder. In this way, we are able to address the misalignment problem with a very simple and efficient multi-branch decoder structure. Moreover, an auxiliary training strategy is developed to explicitly alleviate the multi-scale object problem without bringing additional computational costs during the inference phase. Experimental results show that the proposed SFANet exhibits a good balance between accuracy and speed for real-time semantic segmentation of street scenes. In particular, based on ResNet-18, SFANet respectively obtains 78.1% and 74.7% mean of class-wise Intersection-over-Union (mIoU) at inference speeds of 37 FPS and 96 FPS on the challenging Cityscapes and CamVid test datasets by using only a single GTX 1080Ti GPU.

CLApr 15, 2022
Just Fine-tune Twice: Selective Differential Privacy for Large Language Models

Weiyan Shi, Ryan Shea, Si Chen et al.

Protecting large language models from privacy leakage is becoming increasingly crucial with their wide adoption in real-world products. Yet applying differential privacy (DP), a canonical notion with provable privacy guarantees for machine learning models, to those models remains challenging due to the trade-off between model utility and privacy loss. Utilizing the fact that sensitive information in language data tends to be sparse, Shi et al. (2021) formalized a DP notion extension called Selective Differential Privacy (SDP) to protect only the sensitive tokens defined by a policy function. However, their algorithm only works for RNN-based models. In this paper, we develop a novel framework, Just Fine-tune Twice (JFT), that achieves SDP for state-of-the-art large transformer-based models. Our method is easy to implement: it first fine-tunes the model with redacted in-domain data, and then fine-tunes it again with the original in-domain data using a private training mechanism. Furthermore, we study the scenario of imperfect implementation of policy functions that misses sensitive tokens and develop systematic methods to handle it. Experiments show that our method achieves strong utility compared to previous baselines. We also analyze the SDP privacy guarantee empirically with the canary insertion attack.

LGMar 3, 2022
Label-Only Model Inversion Attacks via Boundary Repulsion

Mostafa Kahla, Si Chen, Hoang Anh Just et al.

Recent studies show that the state-of-the-art deep neural networks are vulnerable to model inversion attacks, in which access to a model is abused to reconstruct private training data of any given target class. Existing attacks rely on having access to either the complete target model (whitebox) or the model's soft-labels (blackbox). However, no prior work has been done in the harder but more practical scenario, in which the attacker only has access to the model's predicted label, without a confidence measure. In this paper, we introduce an algorithm, Boundary-Repelling Model Inversion (BREP-MI), to invert private training data using only the target model's predicted labels. The key idea of our algorithm is to evaluate the model's predicted labels over a sphere and then estimate the direction to reach the target class's centroid. Using the example of face recognition, we show that the images reconstructed by BREP-MI successfully reproduce the semantics of the private training data for various datasets and target model architectures. We compare BREP-MI with the state-of-the-art whitebox and blackbox model inversion attacks and the results show that despite assuming less knowledge about the target model, BREP-MI outperforms the blackbox attack and achieves comparable results to the whitebox attack.

IRJun 29, 2023
Multi-Scenario Ranking with Adaptive Feature Learning

Yu Tian, Bofang Li, Si Chen et al.

Recently, Multi-Scenario Learning (MSL) is widely used in recommendation and retrieval systems in the industry because it facilitates transfer learning from different scenarios, mitigating data sparsity and reducing maintenance cost. These efforts produce different MSL paradigms by searching more optimal network structure, such as Auxiliary Network, Expert Network, and Multi-Tower Network. It is intuitive that different scenarios could hold their specific characteristics, activating the user's intents quite differently. In other words, different kinds of auxiliary features would bear varying importance under different scenarios. With more discriminative feature representations refined in a scenario-aware manner, better ranking performance could be easily obtained without expensive search for the optimal network structure. Unfortunately, this simple idea is mainly overlooked but much desired in real-world systems.Further analysis also validates the rationality of adaptive feature learning under a multi-scenario scheme. Moreover, our A/B test results on the Alibaba search advertising platform also demonstrate that Maria is superior in production environments.

LGJul 29, 2024Code
AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

Feiyang Kang, Yifan Sun, Bingbing Wen et al.

Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly applying them at much larger scales. To address this, we propose AutoScale, a two-stage, scale-aware data composition framework. First, AutoScale fits a parametric model that predicts the model's loss under different data compositions, then uses it to find an approximate best allocation at smaller, more manageable budgets. Next, leveraging a novel theoretical analysis of how optimal compositions evolve with scale, AutoScale extrapolates that composition to larger budgets without further retraining. Empirically, AutoScale accelerates convergence and improves downstream performance. For instance, when pre-training GPT-2 Large, it achieves a 28% faster perplexity reduction than baselines and up to a 38% speed-up over unweighted training, while yielding best-average results on various downstream tasks. Overall, our findings illustrate how domain importance shifts with training scale, underscoring the need for scale-dependent data curation in LLM training. Our code is open-sourced.

CVJun 14, 2022
Turning a Curse into a Blessing: Enabling In-Distribution-Data-Free Backdoor Removal via Stabilized Model Inversion

Si Chen, Yi Zeng, Jiachen T. Wang et al.

Many backdoor removal techniques in machine learning models require clean in-distribution data, which may not always be available due to proprietary datasets. Model inversion techniques, often considered privacy threats, can reconstruct realistic training samples, potentially eliminating the need for in-distribution data. Prior attempts to combine backdoor removal and model inversion yielded limited results. Our work is the first to provide a thorough understanding of leveraging model inversion for effective backdoor removal by addressing key questions about reconstructed samples' properties, perceptual similarity, and the potential presence of backdoor triggers. We establish that relying solely on perceptual similarity is insufficient for robust defenses, and the stability of model predictions in response to input and parameter perturbations is also crucial. To tackle this, we introduce a novel bi-level optimization-based framework for model inversion, promoting stability and visual quality. Interestingly, we discover that reconstructed samples from a pre-trained generator's latent space are backdoor-free, even when utilizing signals from a backdoored model. We provide a theoretical analysis to support this finding. Our evaluation demonstrates that our stabilized model inversion technique achieves state-of-the-art backdoor removal performance without clean in-distribution data, matching or surpassing performance using the same amount of clean samples.

CRSep 14, 2022
SEEK: model extraction attack against hybrid secure inference protocols

Si Chen, Junfeng Fan

Security concerns about a machine learning model used in a prediction-as-a-service include the privacy of the model, the query and the result. Secure inference solutions based on homomorphic encryption (HE) and/or multiparty computation (MPC) have been developed to protect all the sensitive information. One of the most efficient type of solution utilizes HE for linear layers, and MPC for non-linear layers. However, for such hybrid protocols with semi-honest security, an adversary can malleate the intermediate features in the inference process, and extract model information more effectively than methods against inference service in plaintext. In this paper, we propose SEEK, a general extraction method for hybrid secure inference services outputing only class labels. This method can extract each layer of the target model independently, and is not affected by the depth of the model. For ResNet-18, SEEK can extract a parameter with less than 50 queries on average, with average error less than $0.03\%$.

CVDec 13, 2023Code
High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-Identification

Liuxiang Qiu, Si Chen, Yan Yan et al.

Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same persons captured by visible (VIS) and infrared (IR) cameras. Existing VI-ReID methods ignore high-order structure information of features while being relatively difficult to learn a reasonable common feature space due to the large modality discrepancy between VIS and IR images. To address the above problems, we propose a novel high-order structure based middle-feature learning network (HOS-Net) for effective VI-ReID. Specifically, we first leverage a short- and long-range feature extraction (SLE) module to effectively exploit both short-range and long-range features. Then, we propose a high-order structure learning (HSL) module to successfully model the high-order relationship across different local features of each person image based on a whitened hypergraph network.This greatly alleviates model collapse and enhances feature representations. Finally, we develop a common feature space learning (CFL) module to learn a discriminative and reasonable common feature space based on middle features generated by aligning features from different modalities and ranges. In particular, a modality-range identity-center contrastive (MRIC) loss is proposed to reduce the distances between the VIS, IR, and middle features, smoothing the training process. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that our HOS-Net achieves superior state-of-the-art performance. Our code is available at \url{https://github.com/Jaulaucoeng/HOS-Net}.

LGOct 25, 2023
Learning to Rank for Active Learning via Multi-Task Bilevel Optimization

Zixin Ding, Si Chen, Ruoxi Jia et al.

Active learning is a promising paradigm to reduce the labeling cost by strategically requesting labels to improve model performance. However, existing active learning methods often rely on expensive acquisition function to compute, extensive modeling retraining and multiple rounds of interaction with annotators. To address these limitations, we propose a novel approach for active learning, which aims to select batches of unlabeled instances through a learned surrogate model for data acquisition. A key challenge in this approach is developing an acquisition function that generalizes well, as the history of data, which forms part of the utility function's input, grows over time. Our novel algorithmic contribution is a bilevel multi-task bilevel optimization framework that predicts the relative utility -- measured by the validation accuracy -- of different training sets, and ensures the learned acquisition function generalizes effectively. For cases where validation accuracy is expensive to evaluate, we introduce efficient interpolation-based surrogate models to estimate the utility function, reducing the evaluation cost. We demonstrate the performance of our approach through extensive experiments on standard active classification benchmarks. By employing our learned utility function, we show significant improvements over traditional techniques, paving the way for more efficient and effective utility maximization in active learning applications.

CLJan 28
Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

Si Chen, Le Huy Khiem, Annalisa Szymanski et al.

Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three applied domains: teaching, dietetics, and caregiving, we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.

GRSep 2, 2025
Fidelity-preserving enhancement of ptychography with foundational text-to-image models

Ming Du, Volker Rose, Junjing Deng et al.

Ptychographic phase retrieval enables high-resolution imaging of complex samples but often suffers from artifacts such as grid pathology and multislice crosstalk, which degrade reconstructed images. We propose a plug-and-play (PnP) framework that integrates physics model-based phase retrieval with text-guided image editing using foundational diffusion models. By employing the alternating direction method of multipliers (ADMM), our approach ensures consensus between data fidelity and artifact removal subproblems, maintaining physics consistency while enhancing image quality. Artifact removal is achieved using a text-guided diffusion image editing method (LEDITS++) with a pre-trained foundational diffusion model, allowing users to specify artifacts for removal in natural language. Demonstrations on simulated and experimental datasets show significant improvements in artifact suppression and structural fidelity, validated by metrics such as peak signal-to-noise ratio (PSNR) and diffraction pattern consistency. This work highlights the combination of text-guided generative models and model-based phase retrieval algorithms as a transferable and fidelity-preserving method for high-quality diffraction imaging.

CRJun 21, 2022
Using EBGAN for Anomaly Intrusion Detection

Yi Cui, Wenfeng Shen, Jian Zhang et al.

As an active network security protection scheme, intrusion detection system (IDS) undertakes the important responsibility of detecting network attacks in the form of malicious network traffic. Intrusion detection technology is an important part of IDS. At present, many scholars have carried out extensive research on intrusion detection technology. However, developing an efficient intrusion detection method for massive network traffic data is still difficult. Since Generative Adversarial Networks (GANs) have powerful modeling capabilities for complex high-dimensional data, they provide new ideas for addressing this problem. In this paper, we put forward an EBGAN-based intrusion detection method, IDS-EBGAN, that classifies network records as normal traffic or malicious traffic. The generator in IDS-EBGAN is responsible for converting the original malicious network traffic in the training set into adversarial malicious examples. This is because we want to use adversarial learning to improve the ability of discriminator to detect malicious traffic. At the same time, the discriminator adopts Autoencoder model. During testing, IDS-EBGAN uses reconstruction error of discriminator to classify traffic records.

AIMar 18
TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to Instructors

Isabel Molnar, Peiyu Li, Si Chen et al.

Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic chatbot advice or non-scalable teaching center human-human consultations. We present TeachingCoach, a pedagogically grounded chatbot designed to support instructor professional development through real-time, conversational guidance. TeachingCoach is built on a data-centric pipeline that extracts pedagogical rules from educational resources and uses synthetic dialogue generation to fine-tune a specialized language model that guides instructors through problem identification, diagnosis, and strategy development. Expert evaluations show TeachingCoach produces clearer, more reflective, and more responsive guidance than a GPT-4o mini baseline, while a user study with higher education instructors highlights trade-offs between conversational depth and interaction efficiency. Together, these results demonstrate that pedagogically grounded, synthetic data driven chatbots can improve instructional support and offer a scalable design approach for future instructional chatbot systems.

CLOct 26, 2025Code
Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

Peiyu Li, Xiuxiu Tang, Si Chen et al.

Large language model evaluation requires thousands of benchmark items, making evaluations expensive and slow. Existing methods compute average accuracy across fixed item sets, treating all items equally despite varying quality and informativeness. We present ATLAS an adaptive testing framework using Item Response Theory (IRT) to estimate model ability through Fisher information-guided item selection. Our analysis of five major benchmarks reveals that 3-6% of items exhibit negative discrimination, indicating annotation errors that corrupt static evaluation. ATLAS achieves 90% item reduction while maintaining measurement precision: on HellaSwag (5,608 items), we match full-benchmark estimates using only 42 items with 0.154 MAE. Our framework maintains item exposure rates below 10% and test overlap at 16-27%, compared to static benchmarks where every model sees all items (100% exposure). Among 4,000+ tested models, IRT ranks differ from accuracy ranks: models with the same accuracy get different IRT scores, and 23-31% of all models shift by more than 10 rank positions. Code and calibrated item banks are available at https://github.com/Peiyu-Georgia-Li/ATLAS.git.

QMApr 12, 2025Code
BioChemInsight: An Open-Source Toolkit for Automated Identification and Recognition of Optical Chemical Structures and Activity Data in Scientific Publications

Zhe Wang, Fangtian Fu, Wei Zhang et al.

Automated extraction of chemical structures and their bioactivity data is crucial for accelerating drug discovery and enabling data-driven pharmaceutical research. Existing optical chemical structure recognition (OCSR) tools fail to autonomously associate molecular structures with their bioactivity profiles, creating a critical bottleneck in structure-activity relationship (SAR) analysis. Here, we present BioChemInsight, an open-source pipeline that integrates: (1) DECIMER Segmentation and MolVec for chemical structure recognition, (2) Qwen2.5-VL-32B for compound identifier association, and (3) PaddleOCR with Gemini-2.0-flash for bioactivity extraction and unit normalization. We evaluated the performance of BioChemInsight on 25 patents and 17 articles. BioChemInsight achieved 95% accuracy for tabular patent data (structure/identifier recognition), with lower accuracy in non-tabular patents (~80% structures, ~75% identifiers), plus 92.2 % bioactivity extraction accuracy. For articles, it attained >99% identifiers and 78-80% structure accuracy in non-tabular formats, plus 97.4% bioactivity extraction accuracy. The system generates ready-to-use SAR datasets, reducing data preprocessing time from weeks to hours while enabling applications in high-throughput screening and ML-driven drug design (https://github.com/dahuilangda/BioChemInsight).

DBMar 31, 2025Code
Text2Schema: Filling the Gap in Designing Database Table Structures based on Natural Language

Qin Wang, Youhuan Li, Yansong Feng et al.

People without a database background usually rely on file systems or tools such as Excel for data management, which often lead to redundancy and data inconsistency. Relational databases possess strong data management capabilities, but require a high level of professional expertise from users. Although there are already many works on Text2SQL to automate the translation of natural language into SQL queries for data manipulation, all of them presuppose that the database schema is pre-designed. In practice, schema design itself demands domain expertise, and research on directly generating schemas from textual requirements remains unexplored. In this paper, we systematically define a new problem, called Text2Schema, to convert a natural language text requirement into a relational database schema. With an effective Text2Schema technique, users can effortlessly create database table structures using natural language, and subsequently leverage existing Text2SQL techniques to perform data manipulations, which significantly narrows the gap between non-technical personnel and highly efficient, versatile relational database systems. We propose SchemaAgent, an LLM-based multi-agent framework for Text2Schema. We emulate the workflow of manual schema design by assigning specialized roles to agents and enabling effective collaboration to refine their respective subtasks. We also incorporate dedicated roles for reflection and inspection, along with an innovative error detection and correction mechanism to identify and rectify issues across various phases. Moreover, we build and open source a benchmark containing 381 pairs of requirement description and schema. Experimental results demonstrate the superiority of our approach over comparative work.

LGOct 8, 2020Code
Knowledge-Enriched Distributional Model Inversion Attacks

Si Chen, Mostafa Kahla, Ruoxi Jia et al.

Model inversion (MI) attacks are aimed at reconstructing training data from model parameters. Such attacks have triggered increasing concerns about privacy, especially given a growing number of online model repositories. However, existing MI attacks against deep neural networks (DNNs) have large room for performance improvement. We present a novel inversion-specific GAN that can better distill knowledge useful for performing attacks on private models from public data. In particular, we train the discriminator to differentiate not only the real and fake samples but the soft-labels provided by the target model. Moreover, unlike previous work that directly searches for a single data point to represent a target class, we propose to model a private data distribution for each target class. Our experiments show that the combination of these techniques can significantly boost the success rate of the state-of-the-art MI attacks by 150%, and generalize better to a variety of datasets and models. Our code is available at https://github.com/SCccc21/Knowledge-Enriched-DMI.

CVAug 28, 2024
Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers

Qian Wang, Zhaoyang Bu, Jiaxuan Mao et al.

Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.

NIApr 28
Chorusing Synchronization Signals for Ambient 5G Backscatter

Yunyun Feng, Chenhong Cao, Si Chen et al.

5G backscatter communication presents an emerging energy-efficient IoT connectivity solution with enhanced availability and data rate advantages over traditional wireless networks. For 5G backscatter, synchronization is crucial as it ensures high-quality transmission. Popular synchronization methods employ autocorrelation and cross-correlation for accurate timing, yet they are constrained by resources. Traditional cross-correlation-based methods for resource utilization optimization also fail in 5G backscatter due to the presence of multiple templates for 5G. A synchronization strategy that supports high accuracy and low power would be highly attractive for wireless backscatter communication. We propose Symmetric Differential (SD)-based Sync, an accurate and resource-efficient synchronization method for 5G backscatter. We have observed that the envelope of the 5G Primary Synchronization Signal (PSS) exhibits a unique mirror symmetry, which enables us to employ differential techniques for low-power PSS detection. We extensively evaluated our design using a testbed of backscatter hardware, SDR gNodeB, and User Equipment (UE). Results show that our SD consumes 3,175 D flip-flops, which is 87x lower than NR fine timing (NFT), 181x lower than symmetry-based semi-template sync (SST), and 30x lower than symmetric autocorrelation (SA)-based sync.

AIApr 2, 2025
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning

Si Chen, Xiao Yu, Ninareh Mehrabi et al.

The exploitation of large language models (LLMs) for malicious purposes poses significant security risks as these models become more powerful and widespread. While most existing red-teaming frameworks focus on single-turn attacks, real-world adversaries typically operate in multi-turn scenarios, iteratively probing for vulnerabilities and adapting their prompts based on threat model responses. In this paper, we propose \AlgName, a novel multi-turn red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions: global tactic-wise learning that accumulates knowledge over time and generalizes to new attack goals, and local prompt-wise learning that refines implementations for specific goals when initial attempts fail. Unlike previous multi-turn approaches that rely on fixed strategy sets, \AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90\% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns, outperforming state-of-the-art baselines. These results highlight the effectiveness of dynamic learning in identifying and exploiting model vulnerabilities in realistic multi-turn scenarios.

CLJun 20, 2025
SocialSim: Towards Socialized Simulation of Emotional Support Conversation

Zhuang Chen, Yaru Cao, Guanqun Bi et al.

Emotional support conversation (ESC) helps reduce people's psychological stress and provide emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus of which quality can even surpass crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.

IRFeb 1, 2025
MIM: Multi-modal Content Interest Modeling Paradigm for User Behavior Modeling

Bencheng Yan, Si Chen, Shichang Jia et al.

Click-Through Rate (CTR) prediction is a crucial task in recommendation systems, online searches, and advertising platforms, where accurately capturing users' real interests in content is essential for performance. However, existing methods heavily rely on ID embeddings, which fail to reflect users' true preferences for content such as images and titles. This limitation becomes particularly evident in cold-start and long-tail scenarios, where traditional approaches struggle to deliver effective results. To address these challenges, we propose a novel Multi-modal Content Interest Modeling paradigm (MIM), which consists of three key stages: Pre-training, Content-Interest-Aware Supervised Fine-Tuning (C-SFT), and Content-Interest-Aware UBM (CiUBM). The pre-training stage adapts foundational models to domain-specific data, enabling the extraction of high-quality multi-modal embeddings. The C-SFT stage bridges the semantic gap between content and user interests by leveraging user behavior signals to guide the alignment of embeddings with user preferences. Finally, the CiUBM stage integrates multi-modal embeddings and ID-based collaborative filtering signals into a unified framework. Comprehensive offline experiments and online A/B tests conducted on the Taobao, one of the world's largest e-commerce platforms, demonstrated the effectiveness and efficiency of MIM method. The method has been successfully deployed online, achieving a significant increase of +14.14% in CTR and +4.12% in RPM, showcasing its industrial applicability and substantial impact on platform performance. To promote further research, we have publicly released the code and dataset at https://pan.quark.cn/s/8fc8ec3e74f3.

CLMay 21, 2024
The 2nd FutureDial Challenge: Dialog Systems with Retrieval Augmented Generation (FutureDial-RAG)

Yucheng Cai, Si Chen, Yuxuan Wu et al.

Recently, increasing research interests have focused on retrieval augmented generation (RAG) to mitigate hallucination for large language models (LLMs). Following this trend, we launch the FutureDial-RAG challenge at SLT 2024, which aims at promoting the study of RAG for dialog systems. The challenge builds upon the MobileCS2 dataset, a real-life customer service datasets with nearly 3000 high-quality dialogs containing annotations for knowledge base query and corresponding results. Over the dataset, we define two tasks, track 1 for knowledge retrieval and track 2 for response generation, which are core research questions in dialog systems with RAG. We build baseline systems for the two tracks and design metrics to measure whether the systems can perform accurate retrieval and generate informative and coherent response. The baseline results show that it is very challenging to perform well on the two tasks, which encourages the participating teams and the community to study how to make better use of RAG for real-life dialog systems.

LGJul 6, 2025
Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

Mahavir Dabas, Si Chen, Charles Fleming et al.

Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. We introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserve overall utility.

CLApr 22, 2024
FASTTRACK: Fast and Accurate Fact Tracing for LLMs

Si Chen, Feiyang Kang, Ning Yu et al.

Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely relevant and those that actually provide supportive evidence for the information sought by the query. This limitation often results in suboptimal effectiveness. Moreover, these approaches necessitate the examination of the similarity of individual training points for each query, imposing significant computational demands and creating a substantial barrier for practical applications. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries and at the same time clusters the training database towards a reduced extent for LLMs to trace facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100\% improvement in F1 score over the state-of-the-art methods while being X33 faster than \texttt{TracIn}.

CVNov 24, 2025
Peregrine: One-Shot Fine-Tuning for FHE Inference of General Deep CNNs

Huaming Ling, Ying Wang, Si Chen et al.

We address two fundamental challenges in adapting general deep CNNs for FHE-based inference: approximating non-linear activations such as ReLU with low-degree polynomials while minimizing accuracy degradation, and overcoming the ciphertext capacity barrier that constrains high-resolution image processing on FHE inference. Our contributions are twofold: (1) a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into FHE-friendly forms using low-degree polynomials, achieving competitive accuracy with minimal training overhead; and (2) a generalized interleaved packing (GIP) scheme that is compatible with feature maps of virtually arbitrary spatial resolutions, accompanied by a suite of carefully designed homomorphic operators that preserve the GIP-form encryption throughout computation. These advances enable efficient, end-to-end FHE inference across diverse CNN architectures. Experiments on CIFAR-10, ImageNet, and MS COCO demonstrate that the FHE-friendly CNNs obtained via our SFT strategy achieve accuracy comparable to baselines using ReLU or SiLU activations. Moreover, this work presents the first demonstration of FHE-based inference for YOLO architectures in object detection leveraging low-degree polynomial activations.

QMOct 29, 2025
Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction

Yilong Lu, Si Chen, Songyan Gao et al.

Traditional non-biological storage media, such as hard drives, face limitations in both storage density and lifespan due to the rapid growth of data in the big data era. Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium due to their high storage density, structural stability, and long lifespan. The sequencing of mirror-image peptides relies on \textit{de-novo} technology. However, its accuracy is limited by the scarcity of tandem mass spectrometry datasets and the challenges that current algorithms encounter when processing these peptides directly. This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences. In this work, we introduce DBond, a deep neural network based model that integrates sequence features, precursor ion properties, and mass spectrometry environmental factors for the prediction of mirror-image peptide bond cleavage. In this process, sequences with a high peptide bond cleavage ratio, which are easy to sequence, are selected. The main contributions of this study are as follows. First, we constructed MiPD513, a tandem mass spectrometry dataset containing 513 mirror-image peptides. Second, we developed the peptide bond cleavage labeling algorithm (PBCLA), which generated approximately 12.5 million labeled data based on MiPD513. Third, we proposed a dual prediction strategy that combines multi-label and single-label classification. On an independent test set, the single-label classification strategy outperformed other methods in both single and multiple peptide bond cleavage prediction tasks, offering a strong foundation for sequence optimization.

CVSep 29, 2025
Accurate Cobb Angle Estimation via SVD-Based Curve Detection and Vertebral Wedging Quantification

Chang Shi, Nan Meng, Yipeng Zhuang et al.

Adolescent idiopathic scoliosis (AIS) is a common spinal deformity affecting approximately 2.2% of boys and 4.8% of girls worldwide. The Cobb angle serves as the gold standard for AIS severity assessment, yet traditional manual measurements suffer from significant observer variability, compromising diagnostic accuracy. Despite prior automation attempts, existing methods use simplified spinal models and predetermined curve patterns that fail to address clinical complexity. We present a novel deep learning framework for AIS assessment that simultaneously predicts both superior and inferior endplate angles with corresponding midpoint coordinates for each vertebra, preserving the anatomical reality of vertebral wedging in progressive AIS. Our approach combines an HRNet backbone with Swin-Transformer modules and biomechanically informed constraints for enhanced feature extraction. We employ Singular Value Decomposition (SVD) to analyze angle predictions directly from vertebral morphology, enabling flexible detection of diverse scoliosis patterns without predefined curve assumptions. Using 630 full-spine anteroposterior radiographs from patients aged 10-18 years with rigorous dual-rater annotation, our method achieved 83.45% diagnostic accuracy and 2.55° mean absolute error. The framework demonstrates exceptional generalization capability on out-of-distribution cases. Additionally, we introduce the Vertebral Wedging Index (VWI), a novel metric quantifying vertebral deformation. Longitudinal analysis revealed VWI's significant prognostic correlation with curve progression while traditional Cobb angles showed no correlation, providing robust support for early AIS detection, personalized treatment planning, and progression monitoring.

HCSep 15, 2025
Exploring Conversational Design Choices in LLMs for Pedagogical Purposes: Socratic and Narrative Approaches for Improving Instructor's Teaching Practice

Si Chen, Isabel R. Molnar, Peiyu Li et al.

Large language models (LLMs) typically generate direct answers, yet they are increasingly used as learning tools. Studying instructors' usage is critical, given their role in teaching and guiding AI adoption in education. We designed and evaluated TeaPT, an LLM for pedagogical purposes that supports instructors' professional development through two conversational approaches: a Socratic approach that uses guided questioning to foster reflection, and a Narrative approach that offers elaborated suggestions to extend externalized cognition. In a mixed-method study with 41 higher-education instructors, the Socratic version elicited greater engagement, while the Narrative version was preferred for actionable guidance. Subgroup analyses further revealed that less-experienced, AI-optimistic instructors favored the Socratic version, whereas more-experienced, AI-cautious instructors preferred the Narrative version. We contribute design implications for LLMs for pedagogical purposes, showing how adaptive conversational approaches can support instructors with varied profiles while highlighting how AI attitudes and experience shape interaction and learning.

AISep 9, 2025
Aligning LLMs for the Classroom with Knowledge-Based Retrieval -- A Comparative RAG Study

Amay Jain, Liu Cui, Si Chen

Large language models like ChatGPT are increasingly used in classrooms, but they often provide outdated or fabricated information that can mislead students. Retrieval Augmented Generation (RAG) improves reliability of LLMs by grounding responses in external resources. We investigate two accessible RAG paradigms, vector-based retrieval and graph-based retrieval to identify best practices for classroom question answering (QA). Existing comparative studies fail to account for pedagogical factors such as educational disciplines, question types, and practical deployment costs. Using a novel dataset, EduScopeQA, of 3,176 questions across academic subjects, we measure performance on various educational query types, from specific facts to broad thematic discussions. We also evaluate system alignment with a dataset of systematically altered textbooks that contradict the LLM's latent knowledge. We find that OpenAI Vector Search RAG (representing vector-based RAG) performs well as a low-cost generalist, especially for quick fact retrieval. On the other hand, GraphRAG Global excels at providing pedagogically rich answers to thematic queries, and GraphRAG Local achieves the highest accuracy with the dense, altered textbooks when corpus integrity is critical. Accounting for the 10-20x higher resource usage of GraphRAG (representing graph-based RAG), we show that a dynamic branching framework that routes queries to the optimal retrieval method boosts fidelity and efficiency. These insights provide actionable guidelines for educators and system designers to integrate RAG-augmented LLMs into learning environments effectively.

AIAug 6, 2025
\textsc{SimInstruct}: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated Novices

Si Chen, Izzy Molnar, Ting Hua et al.

High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding -- the process by which an expert supports a novice's thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and LLM's persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model to be an expert model using the augmented dataset, which outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o's limitations in weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions.

CVJan 18, 2022
When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework

Xinyi Zou, Yan Yan, Jing-Hao Xue et al.

Human emotions involve basic and compound facial expressions. However, current research on facial expression recognition (FER) mainly focuses on basic expressions, and thus fails to address the diversity of human emotions in practical scenarios. Meanwhile, existing work on compound FER relies heavily on abundant labeled compound expression training data, which are often laboriously collected under the professional instruction of psychology. In this paper, we study compound FER in the cross-domain few-shot learning setting, where only a few images of novel classes from the target domain are required as a reference. In particular, we aim to identify unseen compound expressions with the model trained on easily accessible basic expression datasets. To alleviate the problem of limited base classes in our FER task, we propose a novel Emotion Guided Similarity Network (EGS-Net), consisting of an emotion branch and a similarity branch, based on a two-stage learning framework. Specifically, in the first stage, the similarity branch is jointly trained with the emotion branch in a multi-task fashion. With the regularization of the emotion branch, we prevent the similarity branch from overfitting to sampled base classes that are highly overlapped across different episodes. In the second stage, the emotion branch and the similarity branch play a "two-student game" to alternately learn from each other, thereby further improving the inference ability of the similarity branch on unseen compound expressions. Experimental results on both in-the-lab and in-the-wild compound expression datasets demonstrate the superiority of our proposed method against several state-of-the-art methods.

LGNov 24, 2021
ModelPred: A Framework for Predicting Trained Model from Training Data

Yingyan Zeng, Jiachen T. Wang, Si Chen et al.

In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset $S$ as the input and predicts the model obtained by training on $S$. Our work differs from the recent work of Datamodels [1] as we aim for predicting the trained model parameters directly instead of the trained model behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting and we rigorously characterize the expressive power of neural networks (NN) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.

CRNov 24, 2021
SASH: Efficient Secure Aggregation Based on SHPRG For Federated Learning

Zizhen Liu, Si Chen, Jing Ye et al.

To prevent private training data leakage in Fed?erated Learning systems, we propose a novel se?cure aggregation scheme based on seed homomor?phic pseudo-random generator (SHPRG), named SASH. SASH leverages the homomorphic property of SHPRG to simplify the masking and demask?ing scheme, which for each of the clients and for the server, entails an overhead linear w.r.t model size and constant w.r.t number of clients. We prove that even against worst-case colluding adversaries, SASH preserves training data privacy, while being resilient to dropouts without extra overhead. We experimentally demonstrate SASH significantly improves the efficiency to 20 times over baseline, especially in the more realistic case where the numbers of clients and model size become large, and a cer?tain percentage of clients drop out from the system.

LGOct 7, 2021
Adversarial Unlearning of Backdoors via Implicit Hypergradient

Yi Zeng, Si Chen, Won Park et al.

We propose a minimax formulation for removing backdoors from a given poisoned model based on a small set of clean data. This formulation encompasses much of prior work on backdoor removal. We propose the Implicit Bacdoor Adversarial Unlearning (I-BAU) algorithm to solve the minimax. Unlike previous work, which breaks down the minimax into separate inner and outer problems, our algorithm utilizes the implicit hypergradient to account for the interdependence between inner and outer optimization. We theoretically analyze its convergence and the generalizability of the robustness gained by solving minimax on clean data to unseen test data. In our evaluation, we compare I-BAU with six state-of-art backdoor defenses on seven backdoor attacks over two datasets and various attack settings, including the common setting where the attacker targets one class as well as important but underexplored settings where multiple classes are targeted. I-BAU's performance is comparable to and most often significantly better than the best baseline. Particularly, its performance is more robust to the variation on triggers, attack settings, poison ratio, and clean data size. Moreover, I-BAU requires less computation to take effect; particularly, it is more than $13\times$ faster than the most efficient baseline in the single-target attack setting. Furthermore, it can remain effective in the extreme case where the defender can only access 100 clean samples -- a setting where all the baselines fail to produce acceptable results.

LGJul 14, 2021
Zero-Round Active Learning

Si Chen, Tianhao Wang, Ruoxi Jia

Active learning (AL) aims at reducing labeling effort by identifying the most valuable unlabeled data points from a large pool. Traditional AL frameworks have two limitations: First, they perform data selection in a multi-round manner, which is time-consuming and impractical. Second, they usually assume that there are a small amount of labeled data points available in the same domain as the data in the unlabeled pool. Recent work proposes a solution for one-round active learning based on data utility learning and optimization, which fixes the first issue but still requires the initially labeled data points in the same domain. In this paper, we propose $\mathrm{D^2ULO}$ as a solution that solves both issues. Specifically, $\mathrm{D^2ULO}$ leverages the idea of domain adaptation (DA) to train a data utility model which can effectively predict the utility for any given unlabeled data in the target domain once labeled. The trained data utility model can then be used to select high-utility data and at the same time, provide an estimate for the utility of the selected data. Our algorithm does not rely on any feedback from annotators in the target domain and hence, can be used to perform zero-round active learning or warm-start existing multi-round active learning strategies. Our experiments show that $\mathrm{D^2ULO}$ outperforms the existing state-of-the-art AL strategies equipped with domain adaptation over various domain shift settings (e.g., real-to-real data and synthetic-to-real data). Particularly, $\mathrm{D^2ULO}$ is applicable to the scenario where source and target labels have mismatches, which is not supported by the existing works.

LGApr 23, 2021
One-Round Active Learning

Tianhao Wang, Si Chen, Ruoxi Jia

In this work, we initiate the study of one-round active learning, which aims to select a subset of unlabeled data points that achieve the highest model performance after being labeled with only the information from initially labeled data points. The challenge of directly applying existing data selection criteria to the one-round setting is that they are not indicative of model performance when available labeled data is limited. We address the challenge by explicitly modeling the dependence of model performance on the dataset. Specifically, we propose DULO, a data-driven framework for one-round active learning, wherein we learn a model to predict the model performance for a given dataset and then leverage this model to guide the selection of unlabeled data. Our results demonstrate that DULO leads to the state-of-the-art performance on various active learning benchmarks in the one-round setting.

AIFeb 13, 2021
Sequential Recommendation in Online Games with Multiple Sequences, Tasks and User Levels

Si Chen, Yuqiu Qian, Hui Li et al.

Online gaming is growing faster than ever before, with increasing challenges of providing better user experience. Recommender systems (RS) for online games face unique challenges since they must fulfill players' distinct desires, at different user levels, based on their action sequences of various action types. Although many sequential RS already exist, they are mainly single-sequence, single-task, and single-user-level. In this paper, we introduce a new sequential recommendation model for multiple sequences, multiple tasks, and multiple user levels (abbreviated as M$^3$Rec) in Tencent Games platform, which can fully utilize complex data in online games. We leverage Graph Neural Network and multi-task learning to design M$^3$Rec in order to model the complex information in the heterogeneous sequential recommendation scenario of Tencent Games. We verify the effectiveness of M$^3$Rec on three online games of Tencent Games platform, in both offline and online evaluations. The results show that M$^3$Rec successfully addresses the challenges of recommendation in online games, and it generates superior recommendations compared with state-of-the-art sequential recommendation approaches.

IRMay 17, 2020
Attacking Recommender Systems with Augmented User Profiles

Chen Lin, Si Chen, Hui Li et al.

Recommendation Systems (RS) have become an essential part of many online services. Due to its pivotal role in guiding customers towards purchasing, there is a natural motivation for unscrupulous parties to spoof RS for profits. In this paper, we study the shilling attack: a subsistent and profitable attack where an adversarial party injects a number of user profiles to promote or demote a target item. Conventional shilling attack models are based on simple heuristics that can be easily detected, or directly adopt adversarial attack methods without a special design for RS. Moreover, the study on the attack impact on deep learning based RS is missing in the literature, making the effects of shilling attack against real RS doubtful. We present a novel Augmented Shilling Attack framework (AUSH) and implement it with the idea of Generative Adversarial Network. AUSH is capable of tailoring attacks against RS according to budget and complex attack goals, such as targeting a specific user group. We experimentally show that the attack impact of AUSH is noticeable on a wide range of RS including both classic and modern deep learning based RS, while it is virtually undetectable by the state-of-the-art attack detection model.

CVFeb 7, 2020
Adaptive Deep Metric Embeddings for Person Re-Identification under Occlusions

Wanxiang Yang, Yan Yan, Si Chen

Person re-identification (ReID) under occlusions is a challenging problem in video surveillance. Most of existing person ReID methods take advantage of local features to deal with occlusions. However, these methods usually independently extract features from the local regions of an image without considering the relationship among different local regions. In this paper, we propose a novel person ReID method, which learns the spatial dependencies between the local regions and extracts the discriminative feature representation of the pedestrian image based on Long Short-Term Memory (LSTM), dealing with the problem of occlusions. In particular, we propose a novel loss (termed the adaptive nearest neighbor loss) based on the classification uncertainty to effectively reduce intra-class variations while enlarging inter-class differences within the adaptive neighborhood of the sample. The proposed loss enables the deep neural network to adaptively learn discriminative metric embeddings, which significantly improve the generalization capability of recognizing unseen person identities. Extensive comparative evaluations on challenging person ReID datasets demonstrate the significantly improved performance of the proposed method compared with several state-of-the-art methods.

CVFeb 7, 2020
Object-Adaptive LSTM Network for Real-time Visual Tracking with Adversarial Data Augmentation

Yihan Du, Yan Yan, Si Chen et al.

In recent years, deep learning based visual tracking methods have obtained great success owing to the powerful feature representation ability of Convolutional Neural Networks (CNNs). Among these methods, classification-based tracking methods exhibit excellent performance while their speeds are heavily limited by the expensive computation for massive proposal feature extraction. In contrast, matching-based tracking methods (such as Siamese networks) possess remarkable speed superiority. However, the absence of online updating renders these methods unadaptable to significant object appearance variations. In this paper, we propose a novel real-time visual tracking method, which adopts an object-adaptive LSTM network to effectively capture the video sequential dependencies and adaptively learn the object appearance variations. For high computational efficiency, we also present a fast proposal selection strategy, which utilizes the matching-based tracking method to pre-estimate dense proposals and selects high-quality ones to feed to the LSTM network for classification. This strategy efficiently filters out some irrelevant proposals and avoids the redundant computation for feature extraction, which enables our method to operate faster than conventional classification-based tracking methods. In addition, to handle the problems of sample inadequacy and class imbalance during online tracking, we adopt a data augmentation technique based on the Generative Adversarial Network (GAN) to facilitate the training of the LSTM network. Extensive experiments on four visual tracking benchmarks demonstrate the state-of-the-art performance of our method in terms of both tracking accuracy and speed, which exhibits great potentials of recurrent structures for visual tracking.

CVFeb 6, 2020
Joint Deep Learning of Facial Expression Synthesis and Recognition

Yan Yan, Ying Huang, Si Chen et al.

Recently, deep learning based facial expression recognition (FER) methods have attracted considerable attention and they usually require large-scale labelled training data. Nonetheless, the publicly available facial expression databases typically contain a small amount of labelled data. In this paper, to overcome the above issue, we propose a novel joint deep learning of facial expression synthesis and recognition method for effective FER. More specifically, the proposed method involves a two-stage learning procedure. Firstly, a facial expression synthesis generative adversarial network (FESGAN) is pre-trained to generate facial images with different facial expressions. To increase the diversity of the training images, FESGAN is elaborately designed to generate images with new identities from a prior distribution. Secondly, an expression recognition network is jointly learned with the pre-trained FESGAN in a unified framework. In particular, the classification loss computed from the recognition network is used to simultaneously optimize the performance of both the recognition network and the generator of FESGAN. Moreover, in order to alleviate the problem of data bias between the real images and the synthetic images, we propose an intra-class loss with a novel real data-guided back-propagation (RDBP) algorithm to reduce the intra-class variations of images from the same class, which can significantly improve the final performance. Extensive experimental results on public facial expression databases demonstrate the superiority of the proposed method compared with several state-of-the-art FER methods.

OCMar 27, 2019
The Global Convergence Analysis of the Bat Algorithm Using a Markovian Framework and Dynamical System Theory

Si Chen, Guo-Hua Peng, Xing-Shi He et al.

The bat algorithm (BA) has been shown to be effective to solve a wider range of optimization problems. However, there is not much theoretical analysis concerning its convergence and stability. In order to prove the convergence of the bat algorithm, we have built a Markov model for the algorithm and proved that the state sequence of the bat population forms a finite homogeneous Markov chain, satisfying the global convergence criteria. Then, we prove that the bat algorithm can have global convergence. In addition, in order to enhance the convergence performance of the algorithm, we have designed an updated model using the dynamical system theory in terms of a dynamic matrix, and the parameter ranges for the algorithm stability are then obtained. We then use some benchmark functions to demonstrate that BA can indeed achieve global optimality efficiently for these functions.

CVMay 3, 2018
Multi-task Learning of Cascaded CNN for Facial Attribute Classification

Ni Zhuang, Yan Yan, Si Chen et al.

Recently, facial attribute classification (FAC) has attracted significant attention in the computer vision community. Great progress has been made along with the availability of challenging FAC datasets. However, conventional FAC methods usually firstly pre-process the input images (i.e., perform face detection and alignment) and then predict facial attributes. These methods ignore the inherent dependencies among these tasks (i.e., face detection, facial landmark localization and FAC). Moreover, some methods using convolutional neural network are trained based on the fixed loss weights without considering the differences between facial attributes. In order to address the above problems, we propose a novel multi-task learning of cas- caded convolutional neural network method, termed MCFA, for predicting multiple facial attributes simultaneously. Specifically, the proposed method takes advantage of three cascaded sub-networks (i.e., S_Net, M_Net and L_Net corresponding to the neural networks under different scales) to jointly train multiple tasks in a coarse-to-fine manner, which can achieve end-to-end optimization. Furthermore, the proposed method automatically assigns the loss weight to each facial attribute based on a novel dynamic weighting scheme, thus making the proposed method concentrate on predicting the more difficult facial attributes. Experimental results show that the proposed method outperforms several state-of-the-art FAC methods on the challenging CelebA and LFWA datasets.

CVMay 3, 2018
Multi-label Learning Based Deep Transfer Neural Network for Facial Attribute Classification

Ni Zhuang, Yan Yan, Si Chen et al.

Deep Neural Network (DNN) has recently achieved outstanding performance in a variety of computer vision tasks, including facial attribute classification. The great success of classifying facial attributes with DNN often relies on a massive amount of labelled data. However, in real-world applications, labelled data are only provided for some commonly used attributes (such as age, gender); whereas, unlabelled data are available for other attributes (such as attraction, hairline). To address the above problem, we propose a novel deep transfer neural network method based on multi-label learning for facial attribute classification, termed FMTNet, which consists of three sub-networks: the Face detection Network (FNet), the Multi-label learning Network (MNet) and the Transfer learning Network (TNet). Firstly, based on the Faster Region-based Convolutional Neural Network (Faster R-CNN), FNet is fine-tuned for face detection. Then, MNet is fine-tuned by FNet to predict multiple attributes with labelled data, where an effective loss weight scheme is developed to explicitly exploit the correlation between facial attributes based on attribute grouping. Finally, based on MNet, TNet is trained by taking advantage of unsupervised domain adaptation for unlabelled facial attribute classification. The three sub-networks are tightly coupled to perform effective facial attribute classification. A distinguishing characteristic of the proposed FMTNet method is that the three sub-networks (FNet, MNet and TNet) are constructed in a similar network structure. Extensive experimental results on challenging face datasets demonstrate the effectiveness of our proposed method compared with several state-of-the-art methods.

CVMar 25, 2016
Quadratic Projection Based Feature Extraction with Its Application to Biometric Recognition

Yan Yan, Hanzi Wang, Si Chen et al.

This paper presents a novel quadratic projection based feature extraction framework, where a set of quadratic matrices is learned to distinguish each class from all other classes. We formulate quadratic matrix learning (QML) as a standard semidefinite programming (SDP) problem. However, the con- ventional interior-point SDP solvers do not scale well to the problem of QML for high-dimensional data. To solve the scalability of QML, we develop an efficient algorithm, termed DualQML, based on the Lagrange duality theory, to extract nonlinear features. To evaluate the feasibility and effectiveness of the proposed framework, we conduct extensive experiments on biometric recognition. Experimental results on three representative biometric recogni- tion tasks, including face, palmprint, and ear recognition, demonstrate the superiority of the DualQML-based feature extraction algorithm compared to the current state-of-the-art algorithms

CVDec 13, 2015
Deep Tracking: Visual Tracking Using Deep Convolutional Networks

Meera Hahn, Si Chen, Afshin Dehghan

In this paper, we study a discriminatively trained deep convolutional network for the task of visual tracking. Our tracker utilizes both motion and appearance features that are extracted from a pre-trained dual stream deep convolution network. We show that the features extracted from our dual-stream network can provide rich information about the target and this leads to competitive performance against state of the art tracking methods on a visual tracking benchmark.