Chen Liu

CV
h-index: 46
167 papers
8,329 citations
Novelty: 52%
AI Score: 62

167 Papers

CV · Mar 3, 2023 · Code
Diverse 3D Hand Gesture Prediction from Body Dynamics by Bilateral Hand Disentanglement

Xingqun Qi, Chen Liu, Muyi Sun et al.

Predicting natural and diverse 3D hand gestures from the upper body dynamics is a practical yet challenging task in virtual avatar creation. Previous works usually overlook the asymmetric motions between two hands and generate two hands in a holistic manner, leading to unnatural results. In this work, we introduce a novel two-stage 3D hand generation method based on bilateral hand disentanglement to achieve natural and diverse 3D hand prediction from body dynamics. In the first stage, we intend to generate natural hand gestures by two hand-disentanglement branches. Considering the asymmetric gestures and motions of two hands, we introduce a Spatial-Residual Memory (SRM) module to model spatial interaction between the body and each hand by residual learning. To enhance the coordination of two hand motions with respect to body dynamics holistically, we then present a Temporal-Motion Memory (TMM) module. TMM can effectively model the temporal association between body dynamics and two hand motions. The second stage is built upon the insight that 3D hand predictions should be non-deterministic given the sequential body postures. Thus, we further diversify our 3D hand predictions based on the initial output from stage one. Concretely, we propose a Prototypical-Memory Sampling Strategy (PSS) to generate the non-deterministic hand gestures by gradient-based Markov Chain Monte Carlo (MCMC) sampling. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on the B2H dataset and our newly collected TED Hands dataset. The dataset and code are available at https://github.com/XingqunQi-lab/Diverse-3D-Hand-Gesture-Prediction.

LG · Nov 2, 2022 · Code
Behavior Prior Representation learning for Offline Reinforcement Learning

Hongyu Zang, Xin Li, Jie Yu et al.

Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR comes with performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds of the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at https://github.com/bit1029public/offline_bpr.

LG · Jun 6, 2022
Fast Adversarial Training with Adaptive Step Size

Zhichao Huang, Yanbo Fan, Chen Liu et al.

While adversarial training and its variants have been shown to be the most effective algorithms to defend against adversarial attacks, their extremely slow training process makes them hard to scale to large datasets like ImageNet. The key idea of recent works to accelerate adversarial training is to substitute multi-step attacks (e.g., PGD) with single-step attacks (e.g., FGSM). However, these single-step methods suffer from catastrophic overfitting, where the accuracy against PGD attack suddenly drops to nearly 0% during training, destroying the robustness of the networks. In this work, we study the phenomenon from the perspective of training instances. We show that catastrophic overfitting is instance-dependent and that fitting instances with larger gradient norm is more likely to cause catastrophic overfitting. Based on our findings, we propose a simple but effective method, Adversarial Training with Adaptive Step size (ATAS). ATAS learns an instance-wise adaptive step size that is inversely proportional to its gradient norm. The theoretical analysis shows that ATAS converges faster than the commonly adopted non-adaptive counterparts. Empirically, ATAS consistently mitigates catastrophic overfitting and achieves higher robust accuracy on CIFAR10, CIFAR100 and ImageNet when evaluated on various adversarial budgets.
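The inverse relation between step size and gradient norm described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `adaptive_step_sizes`, the scale `gamma`, the stabilizer `c`, and the clipping bounds are all assumptions introduced here.

```python
def adaptive_step_sizes(grad_norms, gamma=0.01, c=0.01, min_step=0.0, max_step=1.0):
    """Return one step size per instance, shrinking as its gradient norm grows.

    All parameters are hypothetical. The abstract only states that the step
    size is inversely proportional to the gradient norm; the stabilizer c and
    the clipping range are added here to keep the sketch well behaved.
    """
    steps = []
    for g in grad_norms:
        if g < 0:
            raise ValueError("gradient norms must be non-negative")
        step = gamma / (c + g)  # larger gradient norm -> smaller step
        steps.append(min(max(step, min_step), max_step))
    return steps


# Three instances with increasing gradient norm get decreasing step sizes.
norms = [0.1, 1.0, 10.0]
steps = adaptive_step_sizes(norms)
```

Under these assumptions, instances that would otherwise dominate the update receive smaller perturbation steps, which is the behaviour the abstract associates with avoiding catastrophic overfitting.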

AI · Jun 1
SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

Kuan Li, Shuo Zhang, Huacan Wang et al.

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.

RO · Mar 10 · Code
Cutting the Cord: System Architecture for Low-Cost, GPU-Accelerated Bimanual Mobile Manipulation

Artemis Shaw, Chen Liu, Justin Costa et al.

We present a bimanual mobile manipulator built on the open-source XLeRobot with integrated onboard compute for less than $1300. Key contributions include: (1) optimized mechanical design maximizing stiffness-to-weight ratio, (2) a Tri-Bus power topology isolating compute from motor-induced voltage transients, and (3) embedded autonomy using NVIDIA Jetson Orin Nano for untethered operation. The platform enables teleoperation, autonomous SLAM navigation, and vision-based manipulation without external dependencies, providing a low-cost alternative for research and education in robotics and robot learning.

AI · May 31
HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

Yi Gu, Huacan Wang, Shuo Zhang et al.

Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Subsequently, Blueprint compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We then optimize the agents via supervised fine-tuning and step-wise RLVE, which facilitates iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate the agent across various smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%. It is worth noting that HomeFlow-RL-8B even surpasses the leading GPT-5.5 by 1.23 percentage points.

CV · Nov 29, 2022
PatchMix Augmentation to Identify Causal Features in Few-shot Learning

Chengming Xu, Chen Liu, Xinwei Sun et al.

The task of few-shot learning (FSL) aims to transfer the knowledge learned from base categories with sufficient labelled data to novel categories with scarce known information. It is currently an important research question and has great practical value in real-world applications. Although extensive efforts have been made on few-shot learning tasks, we emphasize that most existing methods do not take into account the distributional shift caused by sample selection bias in the FSL scenario. Such a selection bias can induce spurious correlation between the semantic causal features, which are causally and semantically related to the class label, and the other non-causal features. Critically, the former should be invariant across changes in distributions, highly related to the classes of interest, and thus well generalizable to novel classes, while the latter are not stable under changes in the distribution. To resolve this problem, we propose a novel data augmentation strategy dubbed PatchMix that can break this spurious dependency by replacing the patch-level information and supervision of the query images with random gallery images from classes different from the query ones. We theoretically show that such an augmentation mechanism, different from existing ones, is able to identify the causal features. To further make these features discriminative enough for classification, we propose Correlation-guided Reconstruction (CGR) and a Hardness-Aware module for instance discrimination and easier discrimination between similar classes. Moreover, such a framework can be adapted to the unsupervised FSL scenario.

CV · Jul 26, 2022
P2ANet: A Dataset and Benchmark for Dense Action Detection from Table Tennis Match Broadcasting Videos

Jiang Bian, Xuhong Li, Tao Wang et al.

While deep learning has been widely used for video analytics, such as video classification and action detection, dense action detection with fast-moving subjects from sports videos is still challenging. In this work, we release yet another sports video benchmark, P2ANet, for Ping Pong-Action detection, which consists of 2,721 video clips collected from the broadcasting videos of professional table tennis matches in World Table Tennis Championships and Olympiads. We work with a crew of table tennis professionals and referees on a specially designed annotation toolbox to obtain fine-grained action labels (in 14 classes) for every ping-pong action that appeared in the dataset, and formulate two sets of action detection problems: action localization and action recognition. We evaluate a number of commonly-seen action recognition (e.g., TSM, TSN, Video SwinTransformer, and Slowfast) and action localization models (e.g., BSN, BSN++, BMN, TCANet), using P2ANet for both problems, under various settings. These models can only achieve 48% area under the AR-AN curve for localization and 82% top-one accuracy for recognition, since the ping-pong actions are dense with fast-moving subjects but the broadcasting videos run at only 25 FPS. The results confirm that P2ANet is still a challenging task and can be used as a special benchmark for dense action detection from videos.

NA · Oct 11, 2016
A finite volume/discontinuous Galerkin method for the advective Cahn-Hilliard equation with degenerate mobility on porous domains stemming from micro-CT imaging

Florian Frank, Chen Liu, Faruk O. Alpak et al.

A numerical method is formulated for the solution of the advective Cahn-Hilliard (CH) equation with constant and degenerate mobility in three-dimensional porous media with non-vanishing velocity on the exterior boundary. The CH equation describes phase separation of an immiscible binary mixture at constant temperature in the presence of a mass constraint and dissipation of free energy. Porous media/pore-scale problems specifically entail high-resolution images of rocks in which the solid matrix and pore spaces are fully resolved. The interior penalty discontinuous Galerkin method is used for the spatial discretization of the CH equation in mixed form, while a semi-implicit convex-concave splitting is utilized for temporal discretization. The spatial approximation order is arbitrary, while it reduces to a finite volume scheme for the choice of elementwise constants. The resulting nonlinear systems of equations are reduced using the Schur complement and solved via Newton's method. The numerical scheme is first validated using numerical convergence tests and then applied to a number of fundamental problems for validation and numerical experimentation purposes including the case of degenerate mobility. First-order physical applicability and robustness of the numerical method are shown in a breakthrough scenario on a voxel set obtained from a micro-CT scan of a real sandstone rock sample.

CV · Jul 24, 2024 · Code
Affective Behaviour Analysis via Progressive Learning

Chen Liu, Wei Zhang, Feng Qiu et al.

Affective Behavior Analysis aims to develop emotionally intelligent technology that can recognize and respond to human emotions. To advance this field, the 7th Affective Behavior Analysis in-the-wild (ABAW) competition holds the Multi-Task Learning Challenge based on the s-Aff-Wild2 database. The participants are required to develop a framework that achieves Valence-Arousal Estimation, Expression Recognition, and AU detection simultaneously. To achieve this goal, we propose a progressive multi-task learning framework that fully leverages the distinct focuses of each task on facial emotion features. Specifically, our method design can be summarized into three main aspects: 1) Separate Training and Joint Training: We first train each task model separately and then perform joint training based on the pre-trained models, fully utilizing the feature focus aspects of each task to improve the overall framework performance. 2) Feature Fusion and Temporal Modeling: We investigate effective strategies for fusing features extracted from each task-specific model and incorporate temporal feature modeling during the joint training phase, which further refines the performance of each task. 3) Joint Training Strategy Optimization: To identify the optimal joint training approach, we conduct a comprehensive strategy search, experimenting with various task combinations and training methodologies to further elevate the overall performance of each task. According to the official results, our team achieves first place in the MTL challenge with a total score of 1.5286 (i.e., AU F-score 0.5580, Expression F-score 0.4286, CCC VA score 0.5420). Our code is publicly available at https://github.com/YenanLiu/ABAW7th.
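The reported total appears to be the plain sum of the three per-task scores. The short check below confirms that arithmetic; the dictionary keys are illustrative names chosen here, and the interpretation of the total as an unweighted sum is inferred from the numbers rather than stated in the abstract.

```python
import math

# Per-task scores as reported in the abstract above.
scores = {
    "au_f_score": 0.5580,
    "expression_f_score": 0.4286,
    "ccc_va_score": 0.5420,
}

# Unweighted sum of the three components (an inference from the numbers).
total = sum(scores.values())
matches_reported = math.isclose(total, 1.5286, abs_tol=1e-9)
```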

CV · Apr 6, 2023
RFAConv: Receptive-Field Attention Convolution for Improving Convolutional Neural Networks

Xin Zhang, Chen Liu, Degang Yang et al.

In the realm of deep learning, spatial attention mechanisms have emerged as a vital method for enhancing the performance of convolutional neural networks. However, these mechanisms possess inherent limitations that cannot be overlooked. This work delves into the mechanism of spatial attention and reveals a new insight. It is that the mechanism essentially addresses the issue of convolutional parameter sharing. By addressing this issue, the convolutional kernel can efficiently extract features by employing varying weights at distinct locations. However, current spatial attention mechanisms focus on shallow attention to spatial features, which is insufficient to address the fundamental challenge of parameter sharing in convolutions involving larger kernels. In response to this challenge, we introduce a novel attention mechanism known as Receptive-Field Attention (RFA). Compared to existing spatial attention methods, RFA not only concentrates on the receptive-field spatial features but also offers effective attention weights for large convolutional kernels. Building upon the RFA concept, a Receptive-Field Attention Convolution (RFAConv) is proposed to supplant the conventional standard convolution. Notably, it offers nearly negligible increment of computational overhead and parameters, while significantly improving network performance. Furthermore, this work reveals that current spatial attention mechanisms require enhanced prioritization of receptive-field spatial features to optimize network performance. To validate the advantages of the proposed methods, we conduct many experiments across several authoritative datasets, including ImageNet, COCO, VOC, and Roboflow...

LG · May 20 · Code
DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

Xuyang Zhong, Qizhang Li, Yiwen Guo et al.

We propose DualOptim+, a novel optimization framework for improving machine unlearning in large language models. It introduces a base state to capture common representations shared by forgetting and retaining objectives and delta states to preserve objective-specific residuals. This architecture allows the optimizer to adaptively bridge shared and decoupled states based on the directional conflict between forgetting and retaining gradients. We further introduce DualOptim+ 8bit, a quantized variant that reduces memory overhead without compromising performance. Extensive experiments across fictitious and real-world unlearning, safety alignment, and multi-task learning tasks demonstrate that DualOptim+ consistently achieves a superior trade-off between different objectives. Codes are available at https://github.com/CityU-MLO/DualOptimPlus.

ML · Sep 5, 2023
Optimal Sample Selection Through Uncertainty Estimation and Its Application in Deep Learning

Yong Lin, Chen Liu, Chenlu Ye et al.

Modern deep learning heavily relies on large labeled datasets, which often come with high costs in terms of both manual labeling and computational resources. To mitigate these challenges, researchers have explored the use of informative subset selection techniques, including coreset selection and active learning. Specifically, coreset selection involves sampling data with both input ($\mathbf{x}$) and output ($\mathbf{y}$), while active learning focuses solely on the input data ($\mathbf{x}$). In this study, we present a theoretically optimal solution for addressing both coreset selection and active learning within the context of linear softmax regression. Our proposed method, COPS (unCertainty based OPtimal Sub-sampling), is designed to minimize the expected loss of a model trained on subsampled data. Unlike existing approaches that rely on explicit calculations of the inverse covariance matrix, which are not easily applicable to deep learning scenarios, COPS leverages the model's logits to estimate the sampling ratio. This sampling ratio is closely associated with model uncertainty and can be effectively applied to deep learning tasks. Furthermore, we address the challenge of model sensitivity to misspecification by incorporating a down-weighting approach for low-density samples, drawing inspiration from previous works. To assess the effectiveness of our proposed method, we conducted extensive empirical experiments using deep neural networks on benchmark datasets. The results consistently showcase the superior performance of COPS compared to baseline methods, reaffirming its efficacy.
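The idea of turning logits into an uncertainty-driven sampling ratio can be illustrated with a generic sketch. The uncertainty score used below (one minus the maximum softmax probability) is a stand-in chosen for illustration and is not the sampling ratio derived for COPS; the function names and the `floor` parameter are likewise assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sampling_probabilities(batch_logits, floor=1e-6):
    """Map per-sample logits to selection probabilities that grow with uncertainty.

    Uncertainty is 1 - max softmax probability, a hypothetical stand-in and
    not the formula from the paper. The floor keeps every sample selectable.
    """
    scores = [max(1.0 - max(softmax(l)), floor) for l in batch_logits]
    total = sum(scores)
    return [s / total for s in scores]


# One confident sample, one nearly uniform sample, one in between.
batch = [[5.0, 0.0, 0.0], [1.0, 0.9, 0.8], [2.0, 1.0, 0.0]]
probs = sampling_probabilities(batch)
```

In this toy batch the nearly uniform sample receives the largest selection probability and the confident one the smallest.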

CV · Nov 30, 2022
Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning

Chengming Xu, Chen Liu, Siqian Yang et al.

Positive-Unlabeled (PU) learning aims to learn a model with rare positive samples and abundant unlabeled samples. Compared with classical binary classification, the task of PU learning is much more challenging due to the existence of many incompletely annotated data instances. Since only part of the most confident positive samples are available and evidence is not enough to categorize the remaining samples, many of these unlabeled data may also be positive samples. Research on this topic is particularly useful and essential to many real-world tasks which demand very expensive labelling cost. For example, the recognition tasks in disease diagnosis, recommendation systems and satellite image recognition may only have a few positive samples that can be annotated by the experts. Existing methods mainly overlook the intrinsic hardness of some unlabeled data, which can result in sub-optimal performance as a consequence of fitting the easy noisy data and not sufficiently utilizing the hard data. In this paper, we focus on improving the commonly-used nnPU with a novel training pipeline. We highlight the intrinsic difference of hardness of samples in the dataset and the proper learning strategies for easy and hard data. By considering this fact, we propose first splitting the unlabeled dataset with an early-stop strategy. The samples that have inconsistent predictions between the temporary and base model are considered as hard samples. Then the model utilizes a noise-tolerant Jensen-Shannon divergence loss for easy data, and a dual-source consistency regularization for hard data which includes a cross-consistency between student and base model for low-level features and self-consistency for high-level features and predictions, respectively.
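Two ingredients named in this abstract, the split by prediction inconsistency and the Jensen-Shannon divergence, can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, predictions are taken to be hard labels, and the divergence shown is the plain Jensen-Shannon divergence rather than the noise-tolerant loss used in the paper.

```python
import math


def split_easy_hard(temp_preds, base_preds):
    """Return (easy, hard) index lists: agree -> easy, disagree -> hard."""
    easy, hard = [], []
    for i, (t, b) in enumerate(zip(temp_preds, base_preds)):
        (easy if t == b else hard).append(i)
    return easy, hard


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical hard-label predictions from the temporary and base models.
easy, hard = split_easy_hard([1, 0, 1, 1], [1, 1, 1, 0])
```

The divergence is symmetric and bounded above by ln 2, which is one reason it is often preferred over cross-entropy when labels are noisy.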

GR · Oct 7, 2022
Learning to Learn and Sample BRDFs

Chen Liu, Michael Fischer, Tobias Ritschel

We propose a method to accelerate the joint process of physically acquiring and learning neural Bi-directional Reflectance Distribution Function (BRDF) models. While BRDF learning alone can be accelerated by meta-learning, acquisition remains slow as it relies on a mechanical process. We show that meta-learning can be extended to optimize the physical sampling pattern, too. After our method has been meta-trained for a set of fully-sampled BRDFs, it is able to quickly train on new BRDFs with up to five orders of magnitude fewer physical acquisition samples at similar quality. Our approach also extends to other linear and non-linear BRDF models, which we show in an extensive evaluation.

CL · Apr 22 · Code
The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models

Yilun Liu, Chunguang Zhao, Mengyao Piao et al.

Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks relying on low-quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark with 182.3k samples, 26 languages and 51 nations/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers. Second, we achieve native-quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and synthesizing cross-cultural test sets for 34 cultures, surpassing prior coverage by up to 111%. Third, we conduct an in-depth diagnostic analysis on 20+ flagship and compact LLMs. Our findings reveal significant geographical performance disparities and distinct gaps between tasks, offering a reliable map for future work. We release the benchmark (https://github.com/lunyiliu/GaoYao).

CV · Sep 23, 2022
CUTS: A Deep Learning and Topological Framework for Multigranular Unsupervised Medical Image Segmentation

Chen Liu, Matthew Amodio, Liangbo L. Shen et al.

Segmenting medical images is critical to facilitating both patient diagnoses and quantitative research. A major limiting factor is the lack of labeled data, as obtaining expert annotations for each new set of imaging data and task can be labor intensive and inconsistent among annotators. We present CUTS, an unsupervised deep learning framework for medical image segmentation. CUTS operates in two stages. For each image, it produces an embedding map via intra-image contrastive learning and local patch reconstruction. Then, these embeddings are partitioned at dynamic granularity levels that correspond to the data topology. CUTS yields a series of coarse-to-fine-grained segmentations that highlight features at various granularities. We applied CUTS to retinal fundus images and two types of brain MRI images to delineate structures and patterns at different scales. When evaluated against predefined anatomical masks, CUTS improved the dice coefficient and Hausdorff distance by at least 10% compared to existing unsupervised methods. Finally, CUTS showed performance on par with Segment Anything Models (SAM, MedSAM, SAM-Med2D) pre-trained on gigantic labeled datasets.
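For readers unfamiliar with the Dice coefficient used in this evaluation, a minimal sketch follows. It operates on flat lists of 0/1 values and is not tied to the CUTS codebase; the function name and the convention for two empty masks are assumptions made here.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for flat binary masks of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    size_sum = sum(pred) + sum(target)
    if size_sum == 0:
        return 1.0  # convention assumed here: two empty masks match perfectly
    return 2.0 * intersection / size_sum


# One of two predicted pixels overlaps one of two target pixels.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_coefficient(pred, target)
```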

LG · May 28
A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning

Ding Chen, Xinwen Cheng, Xuyang Zhong et al.

While Membership Inference Attacks (MIAs) are the prevailing method for identifying training data, their application has expanded into privacy auditing and machine unlearning. Nevertheless, the field lacks a systematic framework for evaluating how different contexts affect MIA efficacy. Without such a characterization, practitioners risk deploying algorithms that perform well on benchmarks but become statistically irrelevant when faced with the nuances of specific, real-world datasets. To bridge this gap and provide actionable insights, we introduce a comprehensive evaluation framework that systematically characterizes privacy risks across the entire machine learning pipeline, spanning data, architectures, algorithms, and post-training modules. Designed to inherently capture diverse operational contexts, our framework rigorously evaluates state-of-the-art MIAs across a broad spectrum of training configurations. To account for varying misclassification costs in real-world deployments, we employ three complementary metrics: Balanced Accuracy for symmetric costs, alongside TPR at low FPR (or TNR at low FNR) for asymmetric scenarios where false alarms or missed detections are strictly penalized. Furthermore, recognizing that existing MIAs assume divergent adversary capabilities, we formalize two standardized threat models and adapt these attacks into corresponding variants to ensure an equitable benchmark. Extensive empirical evaluations demonstrate that the efficacy of specific MIA methodologies is highly sensitive to the assumed threat models and chosen evaluation metrics. Ultimately, we distill these findings into actionable guidelines and provide a ready-to-use auditing toolkit, empowering practitioners to conduct better privacy assessments.
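The two kinds of metric named in this abstract can be sketched in plain Python. This is a generic illustration rather than the authors' toolkit: the function names are hypothetical, labels use 1 for members and 0 for non-members, and both classes are assumed to be present.

```python
def balanced_accuracy(labels, preds):
    """Mean of true-positive rate and true-negative rate for binary 0/1 labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)  # assumes pos > 0 and neg > 0


def tpr_at_fpr(labels, scores, max_fpr):
    """Highest TPR over score thresholds whose FPR does not exceed max_fpr."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        if fp / neg <= max_fpr:
            best = max(best, tp / pos)
    return best


# Made-up attack scores: higher means "more likely a training member".
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.05]
preds = [1 if s >= 0.5 else 0 for s in scores]
```

On this toy data the attack looks moderately good by balanced accuracy, yet at zero tolerated false positives it recovers only two of three members, which shows why the two metrics can rank attacks differently.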

CV · Aug 20, 2023
BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

Chen Liu, Peike Li, Hu Zhang et al.

Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate the audio recordings in real-world scenarios. They impose significant challenges on building a consistent semantic mapping between audio and visual signals for AVS models and thus impede precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework by incorporating multi-modal foundation knowledge. In a nutshell, our BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing the audio-visual correspondences in an explicit manner. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data without being affected by contaminated audio signals. Meanwhile, we also utilize a foundation audio classification model to discern audio semantics. Considering the audio tags provided by the audio foundation model are noisy, associating object masks with audio tags is not trivial. Thus, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sounds and object categories. We then examine the label concurrency between the localized objects and classified audio tags by tracing the audio-visual tree. With AVIS, we can effectively segment real-sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise. Our project website is https://yenanliu.github.io/AVSS.github.io/.

SD · Jul 31, 2023
Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

Chen Liu, Peike Li, Xingqun Qi et al.

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior arts are prone to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish audio-visual semantics correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.

EM · Feb 16, 2023
Deep Learning Enhanced Realized GARCH

Chen Liu, Chao Wang, Minh-Ngoc Tran et al.

We propose a new approach to volatility modeling by combining deep learning (LSTM) and realized volatility measures. This LSTM-enhanced realized GARCH framework incorporates and distills modeling advances from financial econometrics, high frequency trading data and deep learning. Bayesian inference via the Sequential Monte Carlo method is employed for statistical inference and forecasting. The new framework can jointly model the returns and realized volatility measures, has an excellent in-sample fit and superior predictive performance compared to several benchmark models, while being able to adapt well to the stylized facts in volatility. The performance of the new framework is tested using a wide range of metrics, from marginal likelihood and volatility forecasting to tail risk forecasting and option pricing. We report on a comprehensive empirical study using 31 widely traded stock indices over a time period that includes the COVID-19 pandemic.

CVMay 8
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin, Yuhui Zhang et al.

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.

CVNov 15, 2022
Evidence-based Match-status-Aware Gait Recognition for Out-of-Gallery Gait Identification

Heming Du, Chen Liu, Ming Wang et al.

Existing gait recognition methods typically identify individuals based on the similarity between probe and gallery samples. However, these methods often neglect the fact that the gallery may not contain identities corresponding to the probes, leading to incorrect recognition. To identify Out-of-Gallery (OOG) gait queries, we propose an Evidence-based Match-status-Aware Gait Recognition (EMA-GR) framework. Inspired by Evidential Deep Learning (EDL), EMA-GR is designed to quantify the uncertainty associated with the match status of recognition. Thus, EMA-GR identifies whether the probe has a counterpart in the gallery. Specifically, we adopt an evidence collector to gather match status evidence from a recognition result pair and parameterize a Dirichlet distribution over the gathered evidence, following the Dempster-Shafer Theory of Evidence (DST). We measure the uncertainty and predict the match status of the recognition results, and thus determine whether the probe is an OOG query. To the best of our knowledge, our method is the first attempt to tackle OOG queries in gait recognition. Moreover, EMA-GR is agnostic to the underlying gait recognition method and improves the robustness against OOG queries. Extensive experiments demonstrate that our method achieves state-of-the-art performance on datasets with OOG queries, and can also generalize well to other identity-retrieval tasks. Importantly, our method surpasses existing state-of-the-art methods by a substantial margin, achieving a 51.26% improvement when the OOG query rate is around 50% on OUMVLP.
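
The step of parameterizing a Dirichlet distribution over gathered evidence can be illustrated with the standard subjective-logic formulation commonly used in Evidential Deep Learning. This is a minimal sketch: the two-class match / no-match setup and the function name are assumptions for illustration, and EMA-GR's actual evidence collector is not reproduced here.

```python
def dirichlet_uncertainty(evidence):
    """Map non-negative per-class evidence to belief masses and an uncertainty mass.

    Standard EDL / subjective-logic relations:
      alpha_k = e_k + 1,  S = sum(alpha),  b_k = e_k / S,  u = K / S.
    Belief masses and uncertainty always sum to 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


# Hypothetical evidence for (match, no match): strong evidence gives low uncertainty.
belief_strong, u_strong = dirichlet_uncertainty([9.0, 1.0])
# No evidence at all gives maximal uncertainty (u = 1).
belief_none, u_none = dirichlet_uncertainty([0.0, 0.0])
```

A probe whose uncertainty mass stays high could then be flagged for closer inspection; the threshold for such a decision would have to be tuned and is not specified in the abstract.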

AIApr 13Code
SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering

Ningyan Zhu, Huacan Wang, Jie Zhou et al.

The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi-step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering-designing the complete infrastructure necessary to transform unconstrained agents into controllable, auditable, and production-reliable systems. As model capabilities converge, this harness layer is becoming the primary site of architectural differentiation. Second is the evolution of human-agent interaction from discrete tasks toward a persistent, contextually aware collaborative relationship, which demands open, trustworthy and extensible harness infrastructure. We present SemaClaw, an open-source multi-agent application framework that addresses these shifts by taking a step towards general-purpose personal AI agents through harness engineering. Our primary contributions include a DAG-based two-phase hybrid agent team orchestration method, a PermissionBridge behavioral safety system, a three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.

NANov 15, 2018
Numerical analysis of a discontinuous Galerkin method for Cahn-Hilliard-Navier-Stokes equations

Chen Liu, Beatrice Riviere

In this paper, we present a theoretical analysis of an interior penalty discontinuous Galerkin method for solving the Cahn-Hilliard-Navier-Stokes model problem. We prove unconditional unique solvability of the discrete system, obtain an unconditional discrete energy dissipation law, and derive stability bounds with a generalized chemical energy density. Convergence of the method is obtained by proving optimal a priori error estimates. Our analysis of the unique solvability is valid for both symmetric and non-symmetric versions of the discontinuous Galerkin formulation.

LGApr 10, 2023
Deploying Machine Learning Models to Ahead-of-Time Runtime on Edge Using MicroTVM

Chen Liu, Matthias Jobst, Liyuan Guo et al.

In the past few years, more and more AI applications have been applied to edge devices. However, models trained by data scientists with machine learning frameworks, such as PyTorch or TensorFlow, cannot be seamlessly executed on edge devices. In this paper, we develop an end-to-end code generator parsing a pre-trained model to C source libraries for the backend using MicroTVM, a machine learning compiler framework extension addressing inference on bare metal devices. An analysis shows that specific compute-intensive operators can be easily offloaded to the dedicated accelerator with a Universal Modular Accelerator (UMA) interface, while others are processed in the CPU cores. By using the automatically generated ahead-of-time C runtime, we conduct a hand gesture recognition experiment on an ARM Cortex M4F core.

IVMar 15, 2023
Lung Nodule Segmentation and Uncertain Region Prediction with an Uncertainty-Aware Attention Mechanism

Han Yang, Qiuli Wang, Yue Zhang et al.

Radiologists possess diverse training and clinical experiences, leading to variations in the segmentation annotations of lung nodules and resulting in segmentation uncertainty. Conventional methods typically select a single annotation as the learning target or attempt to learn a latent space comprising multiple annotations. However, these approaches fail to leverage the valuable information inherent in the consensus and disagreements among the multiple annotations. In this paper, we propose an Uncertainty-Aware Attention Mechanism (UAAM) that utilizes consensus and disagreements among multiple annotations to facilitate better segmentation. To this end, we introduce the Multi-Confidence Mask (MCM), which combines a Low-Confidence (LC) Mask and a High-Confidence (HC) Mask. The LC mask indicates regions with low segmentation confidence, where radiologists may have different segmentation choices. Following UAAM, we further design an Uncertainty-Guide Multi-Confidence Segmentation Network (UGMCS-Net), which contains three modules: a Feature Extracting Module that captures a general feature of a lung nodule, an Uncertainty-Aware Module that produces three features for the annotations' union, intersection, and annotation set, and an Intersection-Union Constraining Module that uses distances between the three features to balance the predictions of final segmentation and MCM. To comprehensively demonstrate the performance of our method, we propose a Complex Nodule Validation on LIDC-IDRI, which tests UGMCS-Net's segmentation performance on lung nodules that are difficult to segment using common methods. Experimental results demonstrate that our method can significantly improve the segmentation performance on nodules that are difficult to segment using conventional methods.
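
The union and intersection of multiple annotations mentioned above can be sketched in a few lines. The mapping used here (high-confidence region = intersection, low-confidence region = union minus intersection) is an assumption made for illustration; the abstract does not spell out how the HC and LC masks are constructed, and the function name is hypothetical.

```python
def confidence_masks(annotations):
    """Derive union, intersection, and disagreement masks from binary annotations.

    annotations -- list of equally sized binary masks, each a list of rows of 0/1.
    Returns (union, intersection, low_confidence), where low_confidence marks
    pixels labeled by some but not all annotators (assumed LC region).
    """
    rows, cols = len(annotations[0]), len(annotations[0][0])
    union = [[int(any(a[r][c] for a in annotations)) for c in range(cols)]
             for r in range(rows)]
    intersection = [[int(all(a[r][c] for a in annotations)) for c in range(cols)]
                    for r in range(rows)]
    low_confidence = [[union[r][c] - intersection[r][c] for c in range(cols)]
                      for r in range(rows)]
    return union, intersection, low_confidence


# Two toy 1x3 annotations that disagree on the first pixel only.
ann_a = [[1, 1, 0]]
ann_b = [[0, 1, 0]]
union, intersection, low_conf = confidence_masks([ann_a, ann_b])
```

Here the first pixel lands in the disagreement mask, the second in the intersection, and the third in neither.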

GRSep 20, 2024
FreeAvatar: Robust 3D Facial Animation Transfer by Learning an Expression Foundation Model

Feng Qiu, Wei Zhang, Chen Liu et al.

Video-driven 3D facial animation transfer aims to drive avatars to reproduce the expressions of actors. Existing methods have achieved remarkable results by constraining both geometric and perceptual consistency. However, geometric constraints (like those designed on facial landmarks) are insufficient to capture subtle emotions, while expression features trained on classification tasks lack fine granularity for complex emotions. To address this, we propose FreeAvatar, a robust facial animation transfer method that relies solely on our learned expression representation. Specifically, FreeAvatar consists of two main components: the expression foundation model and the facial animation transfer model. In the first component, we initially construct a facial feature space through a face reconstruction task and then optimize the expression feature space by exploring the similarities among different expressions. Benefiting from training on large amounts of unlabeled facial images and a re-collected expression comparison dataset, our model adapts freely and effectively to any in-the-wild input facial images. In the facial animation transfer component, we propose a novel Expression-driven Multi-avatar Animator, which first maps expressive semantics to the facial control parameters of 3D avatars and then imposes perceptual constraints between the input and output images to maintain expression consistency. To make the entire process differentiable, we employ a trained neural renderer to translate rig parameters into corresponding images. Furthermore, unlike previous methods that require separate decoders for each avatar, we propose a dynamic identity injection module that allows for the joint training of multiple avatars within a single network.

CVJan 10, 2023
Leveraging Diffusion For Strong and High Quality Face Morphing Attacks

Zander W. Blasingame, Chen Liu

Face morphing attacks seek to deceive a Face Recognition (FR) system by presenting a morphed image consisting of the biometric qualities from two different identities with the aim of triggering a false acceptance with one of the two identities, thereby presenting a significant threat to biometric systems. The success of a morphing attack is dependent on the ability of the morphed image to represent the biometric characteristics of both identities that were used to create the image. We present a novel morphing attack that uses a Diffusion-based architecture to improve the visual fidelity of the image and the ability of the morphing attack to represent characteristics from both identities. We demonstrate the effectiveness of the proposed attack by evaluating its visual fidelity via the Frechet Inception Distance (FID). Also, extensive experiments are conducted to measure the vulnerability of FR systems to the proposed attack. The ability of a morphing attack detector to detect the proposed attack is measured and compared against two state-of-the-art GAN-based morphing attacks along with two Landmark-based attacks. Additionally, a novel metric to measure the relative strength between different morphing attacks is introduced and evaluated.

EMSep 5, 2023
Global Neural Networks and The Data Scaling Effect in Financial Time Series Forecasting

Chen Liu, Minh-Ngoc Tran, Chao Wang et al.

Neural networks have revolutionized many empirical fields, yet their application to financial time series forecasting remains controversial. In this study, we demonstrate that the conventional practice of estimating models locally in data-scarce environments may underlie the mixed empirical performance observed in prior work. By focusing on volatility forecasting, we employ a dataset comprising over 10,000 global stocks and implement a global estimation strategy that pools information across cross-sections. Our econometric analysis reveals that forecasting accuracy improves markedly as the training dataset becomes larger and more heterogeneous. Notably, even with as little as 12 months of data, globally trained networks deliver robust predictions for individual stocks and portfolios that are not even in the training dataset. Furthermore, our interpretation of the model dynamics shows that these networks not only capture key stylized facts of volatility but also exhibit resilience to outliers and rapid adaptation to market regime changes. These findings underscore the importance of leveraging extensive and diverse datasets in financial forecasting and advocate for a shift from traditional local training approaches to integrated global estimation methods.

QMApr 6
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

Tianyu Liu, Weihao Xuan, Hao Wu et al.

Advances in AI have introduced several strong models in computational pathology to usher it into the era of multi-modal diagnosis, analysis, and interpretation. However, the current pathology-specific visual language models still lack capacities in making the diagnosis with rigorous reasoning paths as well as handling divergent tasks, and thus, challenges of building AI Copilots for real scenarios still exist. Here we introduce TeamPath, an AI system powered by reinforcement learning and router-enhanced solutions based on large-scale histopathology multimodal datasets, to work as a virtual assistant for expert-level disease diagnosis, patch-level information summarization, and cross-modality generation to integrate transcriptomic information for clinical usage. We also collaborate with pathologists from Yale School of Medicine to demonstrate that TeamPath can assist them in working more efficiently by identifying and correcting expert conclusions and reasoning paths. We also discuss the human evaluation results to support the reasoning quality from TeamPath. Overall, TeamPath can flexibly choose the best settings according to the needs, and serve as an innovative and reliable system for information communication across different modalities and experts.

AIOct 9, 2023
Divide and Ensemble: Progressively Learning for the Unknown

Hu Zhang, Xin Shen, Heming Du et al.

In the wheat nutrient deficiencies classification challenge, we present the DividE and EnseMble (DEEM) method for progressive test data predictions. We find that (1) test images are provided in the challenge; (2) samples are equipped with their collection dates; (3) the samples of different dates show notable discrepancies. Based on the findings, we partition the dataset into discrete groups by the dates and train models on each divided group. We then adopt the pseudo-labeling approach to label the test data and incorporate those with high confidence into the training set. In pseudo-labeling, we leverage an ensemble of models with different architectures to enhance the reliability of predictions. The pseudo-labeling and ensembled model training are iteratively conducted until all test samples are labeled. Finally, the separated models for each group are unified to obtain the model for the whole dataset. Our method achieves an average of 93.6% Top-1 test accuracy (94.0% on WW2020 and 93.2% on WR2021) and wins 1st place in the Deep Nutrient Deficiency Challenge (https://cvppa2023.github.io/challenges/).
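
One round of ensemble-based pseudo-labeling, as described above, might look like the following sketch. The averaging rule, the 0.9 confidence threshold, and the toy "models" (plain functions returning class probabilities) are all assumptions for illustration; the abstract does not state how DEEM combines predictions or which threshold it uses.

```python
def ensemble_proba(models, x):
    """Average the class-probability vectors predicted by several models."""
    all_probs = [model(x) for model in models]
    num_classes = len(all_probs[0])
    return [sum(p[i] for p in all_probs) / len(all_probs) for i in range(num_classes)]


def pseudo_label_round(models, unlabeled, threshold=0.9):
    """Split unlabeled samples into confidently labeled ones and the remainder.

    Returns (accepted, remaining): accepted is a list of (sample, label) pairs whose
    top ensemble probability reaches the threshold; remaining holds the other samples.
    """
    accepted, remaining = [], []
    for x in unlabeled:
        probs = ensemble_proba(models, x)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            accepted.append((x, label))
        else:
            remaining.append(x)
    return accepted, remaining


# Two hypothetical models: confident and in agreement on negative inputs only.
model_a = lambda x: [0.9, 0.1] if x < 0 else [0.2, 0.8]
model_b = lambda x: [1.0, 0.0] if x < 0 else [0.4, 0.6]
accepted, remaining = pseudo_label_round([model_a, model_b], [-1.0, 1.0])
```

In an iterative scheme, the accepted pairs would be added to the training set, the models retrained, and the round repeated on `remaining`.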

MLSep 14, 2024
Hyperedge Representations with Hypergraph Wavelets: Applications to Spatial Transcriptomics

Xingzhi Sun, Charles Xu, João F. Rocha et al.

In many data-driven applications, higher-order relationships among multiple objects are essential in capturing complex interactions. Hypergraphs, which generalize graphs by allowing edges to connect any number of nodes, provide a flexible and powerful framework for modeling such higher-order relationships. In this work, we introduce hypergraph diffusion wavelets and describe their favorable spectral and spatial properties. We demonstrate their utility for biomedical discovery in spatially resolved transcriptomics by applying the method to represent disease-relevant cellular niches for Alzheimer's disease.

LGMay 19
BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics

Siddharth Viswanath, Panayiotis Ketonis, Chen Liu et al.

Efficient neural network models that generate brain-like dynamic activity can be a valuable resource for generating synthetic data, analyzing differences in brain transients under conditions such as testing perturbation activity or inferring the underlying generative dynamics. However, large language models (LLMs) or standard recurrent neural networks (RNNs) ignore the anatomical organization and therefore do not produce components that align with brain regions. On the other hand, graph-based networks often have very simple message passing rules that are not sufficiently expressive for brain-like dynamics. To address this, we introduce BrainDyn, a sheaf neural ordinary differential equation (neural ODE) model for continuous-time dynamics on structured brain graphs. BrainDyn encodes the recent activity history of each brain region using a long short-term memory (LSTM) model over a sliding temporal window to produce hidden states, or stalks, that are projected through learnable restriction maps into edge-specific shared spaces. Discrepancies between neighboring nodes in these shared spaces are characterized by a sheaf Laplacian that can facilitate message passing between neuronal units. The output of these messages is then fed to a neural ODE that governs the continuous-time evolution of neuronal activity. We evaluated BrainDyn on resting-state fMRI (PNC dataset), scalp EEG with focal epilepsy (TUSZ dataset), and simulated activity from the NEST spiking network simulator. BrainDyn achieves strong forecasting ability across modalities, and the resulting representations support downstream tasks including in silico perturbation prediction.

IVJun 8, 2022
Dual Windows Are Significant: Learning from Mediastinal Window and Focusing on Lung Window

Qiuli Wang, Xin Tan, Chen Liu

Since the pandemic of COVID-19, several deep learning methods were proposed to analyze the chest Computed Tomography (CT) for diagnosis. In the current situation, the disease course classification is significant for medical personnel to decide the treatment. Most previous deep-learning-based methods extract features observed from the lung window. However, it has been proved that some appearances related to diagnosis can be observed better from the mediastinal window rather than the lung window, e.g., the pulmonary consolidation happens more in severe symptoms. In this paper, we propose a novel Dual Window RCNN Network (DWRNet), which mainly learns the distinctive features from the successive mediastinal window. Regarding the features extracted from the lung window, we introduce the Lung Window Attention Block (LWA Block) to pay additional attention to them for enhancing the mediastinal-window features. Moreover, instead of picking up specific slices from the whole CT slices, we use a Recurrent CNN and analyze successive slices as videos. Experimental results show that the fused and representative features improve the predictions of disease course by reaching the accuracy of 90.57%, against the baseline with an accuracy of 84.86%. Ablation studies demonstrate that combined dual window features are more efficient than lung-window features alone, while paying attention to lung-window features can improve the model's stability.

CVJul 15, 2022
Adversarial Focal Loss: Asking Your Discriminator for Hard Examples

Chen Liu, Xiaomeng Dong, Michael Potter et al.

Focal Loss has reached incredible popularity as it uses a simple technique to identify and utilize hard examples to achieve better performance on classification. However, this method does not easily generalize outside of classification tasks, such as in keypoint detection. In this paper, we propose a novel adaptation of Focal Loss for keypoint detection tasks, called Adversarial Focal Loss (AFL). AFL not only is semantically analogous to Focal loss, but also works as a plug-and-chug upgrade for arbitrary loss functions. While Focal Loss requires output from a classifier, AFL leverages a separate adversarial network to produce a difficulty score for each input. This difficulty score can then be used to dynamically prioritize learning on hard examples, even in absence of a classifier. In this work, we show AFL's effectiveness in enhancing existing methods in keypoint detection and verify its capability to re-weigh examples based on difficulty.
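
For reference, the classifier-based weighting that AFL generalizes can be written down directly. The sketch below implements the standard Focal Loss for a single example, FL(p) = -(1 - p)^gamma * log(p), where p is the predicted probability of the true class. It does not implement AFL itself: the adversarial network and its difficulty score are not specified in the abstract.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Standard Focal Loss for one example.

    p_true -- predicted probability of the correct class, in (0, 1].
    gamma  -- focusing parameter; gamma = 0 recovers plain cross-entropy.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


easy = focal_loss(0.9)   # well-classified example, strongly down-weighted
hard = focal_loss(0.1)   # misclassified example, keeps most of its loss
```

The modulating factor (1 - p)^gamma depends on a classifier output, which is why the method does not carry over directly to tasks such as keypoint detection.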

SEMay 18
Contextualized Code Pretraining for Code Generation

Chen Liu, Qingyuan Liang, Hanwen Zhang et al.

As code generation becomes increasingly central to improving software development efficiency, modern code models are largely trained and evaluated on code with natural-language descriptions. In real projects, developers often implement missing functions under limited project-specific artifacts, while the local call-site context is already available in the surrounding code. This usage context provides actionable cues about expected behavior, but existing models are not explicitly optimized to leverage it reliably, leading to implementations that may not integrate smoothly with surrounding usage in repository settings. In this work, we propose contextualized code pretraining, an invocation-aware framework that integrates calling context into both the training and evaluation of code models. Using static analysis, we automatically extract large-scale caller-callee pairs from real repositories to construct pretraining tasks and benchmarks that condition generation on the calling context. We train CallerGen, the first code models pretrained with invocation-aware objectives spanning multiple sizes, and evaluate them on CallerEval, a new benchmark featuring realistic scenarios. Experiments show that CallerGen outperforms comparable-scale models and remains competitive with larger ones across two benchmarks. Our 220M and 0.5B models achieve 16.58% and 22.81% pass@1, surpassing baselines on CallerEval. These results highlight the importance of calling context in realistic code generation.

CVDec 4, 2023Code
Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy

Danqi Liao, Chen Liu, Benjamin W. Christensen et al.

Entropy and mutual information in neural networks provide rich information on the learning process, but they have proven difficult to compute reliably in high dimensions. Indeed, in noisy and high-dimensional data, traditional estimates in ambient dimensions approach a fixed entropy and are prohibitively hard to compute. To address these issues, we leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures. Specifically, we define diffusion spectral entropy (DSE) in neural representations of a dataset as well as diffusion spectral mutual information (DSMI) between different variables representing data. First, we show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data that outperform classic Shannon entropy, nonparametric estimation, and mutual information neural estimation (MINE). We then study the evolution of representations in classification networks with supervised learning, self-supervision, or overfitting. We observe that (1) DSE of neural representations increases during training; (2) DSMI with the class label increases during generalizable learning but stays stagnant during overfitting; (3) DSMI with the input signal shows differing trends: on MNIST it increases, while on CIFAR-10 and STL-10 it decreases. Finally, we show that DSE can be used to guide better network initialization and that DSMI can be used to predict downstream classification accuracy across 962 models on ImageNet. The official implementation is available at https://github.com/ChenLiu-1996/DiffusionSpectralEntropy.
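
The entropy computation at the heart of a spectral measure like DSE can be sketched as below. This is an illustration under stated assumptions: it takes the eigenvalues of a diffusion matrix as given, raises their magnitudes to a diffusion time t, normalizes them to a probability vector, and returns the Shannon entropy in bits. Building the diffusion matrix from data, and the exact normalization and log base used by the authors, are not covered; the official implementation linked above is the reference.

```python
import math


def spectral_entropy(eigenvalues, t=1):
    """Shannon entropy (in bits) of a normalized, powered eigenvalue spectrum.

    eigenvalues -- eigenvalues of a diffusion matrix (assumed precomputed).
    t           -- diffusion time; larger t suppresses small eigenvalues.
    """
    powered = [abs(lam) ** t for lam in eigenvalues]
    total = sum(powered)
    if total == 0:
        raise ValueError("spectrum must contain a non-zero eigenvalue")
    probs = [p / total for p in powered]
    return -sum(p * math.log2(p) for p in probs if p > 0)


flat = spectral_entropy([1.0, 1.0, 1.0, 1.0])     # evenly spread spectrum
peaked = spectral_entropy([1.0, 0.0, 0.0, 0.0])   # one dominant direction
```

A flat spectrum gives the maximum entropy for its size, while a spectrum dominated by one eigenvalue gives an entropy near zero.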

CVOct 14, 2023
Fast-DiM: Towards Fast Diffusion Morphs

Zander W. Blasingame, Chen Liu

Diffusion Morphs (DiM) are a recent state-of-the-art method for creating high quality face morphs; however, they require a high number of network function evaluations (NFE) to create the morphs. We propose a new DiM pipeline, Fast-DiM, which can create morphs of a similar quality but with fewer NFE. We investigate the ODE solvers used to solve the Probability Flow ODE and the impact they have on the creation of face morphs. Additionally, we employ an alternative method for encoding images into the latent space of the Diffusion model by solving the Probability Flow ODE as time runs forwards. Our experiments show that we can reduce the NFE by upwards of 85% in the encoding process while experiencing only a 1.6% reduction in Mated Morph Presentation Match Rate (MMPMR). Likewise, we showed we could cut NFE, in the sampling process, in half with only a maximal reduction of 0.23% in MMPMR.

LGOct 25, 2023
Towards Control-Centric Representations in Reinforcement Learning from Images

Chen Liu, Hongyu Zang, Xin Li et al.

Image-based Reinforcement Learning is a practical yet challenging task. A major hurdle lies in extracting control-centric representations while disregarding irrelevant information. While approaches that follow the bisimulation principle exhibit the potential in learning state representations to address this issue, they still grapple with the limited expressive capacity of latent dynamics and the inadaptability to sparse reward environments. To address these limitations, we introduce ReBis, which aims to capture control-centric information by integrating reward-free control information alongside reward-specific knowledge. ReBis utilizes a transformer architecture to implicitly model the dynamics and incorporates block-wise masking to eliminate spatiotemporal redundancy. Moreover, ReBis combines bisimulation-based loss with asymmetric reconstruction loss to prevent feature collapse in environments with sparse rewards. Empirical studies on two large benchmarks, including Atari games and DeepMind Control Suite, demonstrate that ReBis has superior performance compared to existing methods, proving its effectiveness.

CVApr 9, 2024Code
Greedy-DiM: Greedy Algorithms for Unreasonably Effective Face Morphs

Zander W. Blasingame, Chen Liu

Morphing attacks are an emerging threat to state-of-the-art Face Recognition (FR) systems, which aim to create a single image that contains the biometric information of multiple identities. Diffusion Morphs (DiM) are a recently proposed morphing attack that has achieved state-of-the-art performance for representation-based morphing attacks. However, none of the existing research on DiMs has leveraged the iterative nature of DiMs; it has instead left the DiM model as a black box, treating it no differently than one would a Generative Adversarial Network (GAN) or Variational AutoEncoder (VAE). We propose a greedy strategy on the iterative sampling process of DiM models which searches for an optimal step guided by an identity-based heuristic function. We compare our proposed algorithm against ten other state-of-the-art morphing algorithms using the open-source SYN-MAD 2022 competition dataset. We find that our proposed algorithm is unreasonably effective, fooling all of the tested FR systems with an MMPMR of 100%, outperforming all other morphing algorithms compared.

SESep 8, 2024
GUI Test Migration via Abstraction and Concretization

Yakun Zhang, Chen Liu, Xiaofei Xie et al.

GUI test migration aims to produce test cases with events and assertions to test specific functionalities of a target app. Existing migration approaches typically focus on the widget-mapping paradigm that maps widgets from source apps to target apps. However, since different apps may implement the same functionality in different ways, direct mapping may result in incomplete or buggy test cases, thus significantly impacting the effectiveness of testing target functionality and the practical applicability of migration approaches. In this paper, we propose a new migration paradigm (i.e., the abstraction-concretization paradigm) that first abstracts the test logic for the target functionality and then utilizes this logic to generate the concrete GUI test case. Furthermore, we introduce MACdroid, the first approach that migrates GUI test cases based on this paradigm. Specifically, we propose an abstraction technique that utilizes source test cases from source apps targeting the same functionality to extract a general test logic for that functionality. Then, we propose a concretization technique that utilizes the general test logic to guide an LLM in generating the corresponding GUI test case (including events and assertions) for the target app. We evaluate MACdroid on two widely-used datasets (including 31 apps, 34 functionalities, and 123 test cases). On the FrUITeR dataset, the test cases generated by MACdroid successfully test 64% of the target functionalities, improving the baselines by 191%. On the Lin dataset, MACdroid successfully tests 75% of the target functionalities, outperforming the baselines by 42%. These results underscore the effectiveness of MACdroid in GUI test migration.

CVMay 23, 2024Code
AdjointDEIS: Efficient Gradients for Diffusion Models

Zander W. Blasingame, Chen Liu

The optimization of the latents and parameters of diffusion models with respect to some differentiable metric defined on the output of the model is a challenging and complex problem. The sampling for diffusion models is done by solving either the probability flow ODE or diffusion SDE wherein a neural network approximates the score function allowing a numerical ODE/SDE solver to be used. However, naive backpropagation techniques are memory intensive, requiring the storage of all intermediate states, and face additional complexity in handling the injected noise from the diffusion term of the diffusion SDE. We propose a novel family of bespoke ODE solvers to the continuous adjoint equations for diffusion models, which we call AdjointDEIS. We exploit the unique construction of diffusion SDEs to further simplify the formulation of the continuous adjoint equations using exponential integrators. Moreover, we provide convergence order guarantees for our bespoke solvers. Significantly, we show that continuous adjoint equations for diffusion SDEs actually simplify to a simple ODE. Lastly, we demonstrate the effectiveness of AdjointDEIS for guided generation with an adversarial attack in the form of the face morphing problem. Our code will be released at https://github.com/zblasingame/AdjointDEIS.

CVSep 19, 2024
GaRField++: Reinforced Gaussian Radiance Fields for Large-Scale 3D Scene Reconstruction

Hanyue Zhang, Zhiliu Yang, Xinhe Zuo et al.

This paper proposes a novel framework for large-scale scene reconstruction based on 3D Gaussian splatting (3DGS) and aims to address the scalability and accuracy challenges faced by existing methods. To tackle the scalability issue, we split the large scene into multiple cells, and the candidate point-cloud and camera views of each cell are correlated through a visibility-based camera selection and a progressive point-cloud extension. To reinforce the rendering quality, three highlighted improvements are made in comparison with vanilla 3DGS: a ray-Gaussian intersection strategy and a novel Gaussian density control for learning efficiency, an appearance decoupling module based on a ConvKAN network to handle uneven lighting conditions in large-scale scenes, and a refined final loss with the color loss, the depth distortion loss, and the normal consistency loss. Finally, a seamless stitching procedure is executed to merge the individual Gaussian radiance fields for novel view synthesis across different cells. Evaluation on the Mill19, Urban3D, and MatrixCity datasets shows that our method consistently generates higher-fidelity rendering results than state-of-the-art methods of large-scale scene reconstruction. We further validate the generalizability of the proposed approach by rendering on self-collected video clips recorded by a commercial drone.

LGJan 30
Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models

Chen Liu, Xingzhi Sun, Xi Xiao et al.

Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in the smaller models. We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in some language models. Through systematic analyses across multiple Transformer families, we show that small models such as GPT2 and Qwen3-0.6B exhibit severe condensation, whereas the larger models such as GPT2-xl and Qwen3-32B are more resistant to this phenomenon. Additional observations show that embedding condensation is not reliably mitigated by knowledge distillation from larger models. To fight against it, we formulate a dispersion loss that explicitly encourages embedding dispersion during training. Experiments demonstrate that it mitigates condensation, recovers dispersion patterns seen in larger models, and yields performance gains across 10 benchmarks. We believe this work offers a principled path toward improving smaller Transformers without additional parameters.
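
The abstract does not give the form of the dispersion loss, so the following is only an illustrative sketch under a stated assumption: condensation is measured as the mean pairwise cosine similarity between token embeddings, and that same quantity is used as a penalty to minimize. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all distinct pairs of rows.

    Values near 1 indicate embeddings packed into a narrow cone;
    values near 0 indicate embeddings spread over many directions.
    """
    x = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length, guarding against zero vectors.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)
    sim = x @ x.T
    n = x.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    off_diag_sum = sim.sum() - np.trace(sim)
    return off_diag_sum / (n * (n - 1))

def dispersion_penalty(embeddings):
    """Assumed penalty: lower is more dispersed (not the paper's definition)."""
    return mean_pairwise_cosine(embeddings)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 64))
# Condensed: every embedding is a small perturbation of one direction.
condensed = base + 0.01 * rng.normal(size=(32, 64))
# Dispersed: embeddings drawn independently in 64 dimensions.
dispersed = rng.normal(size=(32, 64))
```

In a training loop, a penalty of this kind would typically be added to the language modeling loss with a small weight; the weighting scheme is likewise an assumption here.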

LGJul 11, 2024
Controlling the Fidelity and Diversity of Deep Generative Models via Pseudo Density

Shuangqi Li, Chen Liu, Tong Zhang et al.

We introduce an approach to bias deep generative models, such as GANs and diffusion models, towards generating data with either enhanced fidelity or increased diversity. Our approach involves manipulating the distribution of training and generated data through a novel metric for individual samples, named pseudo density, which is based on the nearest-neighbor information from real samples. Our approach offers three distinct techniques to adjust the fidelity and diversity of deep generative models: 1) Per-sample perturbation, enabling precise adjustments for individual samples towards either more common or more unique characteristics; 2) Importance sampling during model inference to enhance either fidelity or diversity in the generated data; 3) Fine-tuning with importance sampling, which guides the generative model to learn an adjusted distribution, thus controlling fidelity and diversity. Furthermore, our fine-tuning method demonstrates the ability to improve the Frechet Inception Distance (FID) for pre-trained generative models with minimal iterations.
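
As a rough illustration of how a nearest-neighbor-based per-sample score could drive importance sampling, the sketch below assumes a simple stand-in for pseudo density: the inverse of the mean distance from a sample to its k nearest real samples. This definition, the function names, and the exponent `alpha` are assumptions made for illustration and may differ from the paper's formulation.

```python
import numpy as np

def knn_pseudo_density(samples, real, k=5):
    """Assumed score: inverse mean distance to the k nearest real samples."""
    samples = np.asarray(samples, dtype=float)
    real = np.asarray(real, dtype=float)
    # Pairwise Euclidean distances, shape (num_samples, num_real).
    dists = np.linalg.norm(samples[:, None, :] - real[None, :, :], axis=2)
    # Keep the k smallest distances for each sample.
    knn = np.sort(dists, axis=1)[:, :k]
    return 1.0 / (knn.mean(axis=1) + 1e-12)

def importance_weights(density, alpha):
    """Normalized sampling weights proportional to density ** alpha.

    alpha > 0 favors high-density samples (fidelity);
    alpha < 0 favors low-density samples (diversity).
    """
    w = np.asarray(density, dtype=float) ** alpha
    return w / w.sum()

rng = np.random.default_rng(0)
real = rng.normal(scale=0.1, size=(100, 2))      # real data clustered at the origin
candidates = np.array([[0.0, 0.0], [3.0, 3.0]])  # one typical, one unusual sample
density = knn_pseudo_density(candidates, real, k=5)
fidelity_w = importance_weights(density, alpha=1.0)
diversity_w = importance_weights(density, alpha=-1.0)
```

Here `fidelity_w` puts most of the mass on the candidate near the real cluster, while `diversity_w` reverses the preference.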

CLApr 30, 2024Code
StablePT: Towards Stable Prompting for Few-shot Learning via Input Separation

Xiaoming Liu, Chen Liu, Zhaohan Zhang et al.

Large language models have shown their ability to become effective few-shot learners with prompting, revolutionizing the paradigm of learning with data scarcity. However, this approach largely depends on the quality of prompt initialization, and always exhibits large variability among different runs. This property makes prompt tuning highly unreliable and vulnerable to poorly constructed prompts, which limits its extension to more real-world applications. To tackle this issue, we propose to treat the hard prompt and soft prompt as separate inputs to mitigate noise brought by the prompt initialization. Furthermore, we optimize soft prompts with contrastive learning to utilize class-aware information in the training process and maintain model performance. Experimental results demonstrate that StablePT outperforms state-of-the-art methods by 6.97% in accuracy and reduces the standard deviation by 1.92 on average. Furthermore, extensive experiments underscore its robustness and stability across 8 datasets covering various tasks. Codes are available at https://github.com/lccc0528/Stable/tree/main.

CVOct 6, 2023
Enhancing the Authenticity of Rendered Portraits with Identity-Consistent Transfer Learning

Luyuan Wang, Yiqian Wu, Yongliang Yang et al.

Despite rapid advances in computer graphics, creating high-quality photo-realistic virtual portraits is prohibitively expensive. Furthermore, the well-known "uncanny valley" effect in rendered portraits has a significant impact on the user experience, especially when the depiction closely resembles a human likeness, where any minor artifacts can evoke feelings of eeriness and repulsiveness. In this paper, we present a novel photo-realistic portrait generation framework that can effectively mitigate the "uncanny valley" effect and improve the overall authenticity of rendered portraits. Our key idea is to employ transfer learning to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits. During the inference stage, the input portrait of an avatar can be directly transferred to a realistic portrait by changing its appearance style while maintaining the facial identity. To this end, we collect a new dataset, Daz-Rendered-Faces-HQ (DRFHQ), that is specifically designed for rendering-style portraits. We leverage this dataset to fine-tune the StyleGAN2 generator, using our carefully crafted framework, which helps to preserve the geometric and color features relevant to facial identity. We evaluate our framework using portraits with diverse gender, age, and race variations. Qualitative and quantitative evaluations and ablation studies show the advantages of our method compared to state-of-the-art approaches.

CLJan 12
Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Bingyang Ye, Shan Chen, Jingxuan Tu et al.

Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models' judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
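
The time-partitioned setup described above can be sketched as a cutoff split: evidence dated on or before the cutoff forms the frozen snapshot a model may see, and later records are held out as the outcomes to forecast. The record layout (a `date` field per record) and the function name are hypothetical and are not taken from the PoT benchmark itself.

```python
from datetime import date

def split_by_cutoff(records, cutoff):
    """Partition records into a pre-cutoff snapshot and post-cutoff targets.

    Each record is assumed to be a dict with a `date` key holding a
    datetime.date; this layout is illustrative only.
    """
    snapshot = [r for r in records if r["date"] <= cutoff]
    targets = [r for r in records if r["date"] > cutoff]
    return snapshot, targets

# Hypothetical records for illustration.
records = [
    {"id": "a", "date": date(2023, 5, 1)},
    {"id": "b", "date": date(2024, 1, 15)},
    {"id": "c", "date": date(2024, 9, 30)},
]
snapshot, targets = split_by_cutoff(records, cutoff=date(2024, 1, 15))
```

Keeping the split a pure function of the cutoff date makes it straightforward to re-run the same evaluation at different cutoffs as new ground truth arrives.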

SEApr 13
Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure

Huacan Wang, Jie Zhou, Ningyan Zhu et al.

AI coding agents have become central to developer workflows, yet every existing solution locks its reasoning capabilities within a specific delivery form, such as a CLI, IDE plugin, or web application. This limitation creates systemic barriers when enterprises attempt to reuse these capabilities across heterogeneous engineering environments. To address this challenge, we present Sema Code, an open AI coding framework built on the principle of being embeddable, pluggable, and framework-first. Sema Code completely decouples the core agent engine from all client layers, publishing it as a standalone npm library that any runtime can drive programmatically. Built around this architecture, we designed eight key mechanisms: multi-tenant engine isolation, FIFO input queuing with safe session reconstruction, adaptive context compression, multi-agent collaborative scheduling, intelligent Todo-based process management, four-layer asynchronous permission control, three-tier ecosystem integration spanning MCP, Skills, and Plugins, and a background task framework with separated execution and observation privileges. These mechanisms collectively address the engineering challenges of transforming a complex agent engine into a shared, programmable core. Demonstrating its architectural versatility, the same Sema Core engine simultaneously powers a VSCode extension and a multi-channel messaging gateway, which we name SemaClaw, to unify agent interactions across platforms such as Telegram and Feishu. These represent two fundamentally different product forms sharing an identical reasoning kernel, differing only at the client layer.