CVFeb 6Code
Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language ModelsHaoyu Zhang, Zhipeng Li, Yiwen Guo et al.
Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset aims to facilitate augment OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable aligned speech and facial animation generation.
HCMar 26
Automating UI Optimization through Multi-Agentic ReasoningZhipeng Li, Christoph Gebhardt, Yi-Chi Liao et al.
We present AutoOptimization, a novel multi-objective optimization framework for adapting user interfaces. From a user's verbal preferences for changing a UI, our framework guides a prioritization-based Pareto frontier search over candidate layouts. It selects suitable objective functions for UI placement while simultaneously parameterizing them according to the user's instructions to define the optimization problem. A solver then generates a series of optimal UI layouts, which our framework validates against the user's instructions to adapt the UI with the final solution. Our approach thus overcomes the previous need for manual inspection of layouts and the use of population averages for objective parameters. We integrate multiple agents sequentially within our framework, enabling the system to leverage their reasoning capabilities to interpret user preferences, configure the optimization problem, and validate optimization outcomes.
CVJul 6, 2025Code
Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and AnalysisZhipeng Li, Kegang Wang, Hanguang Xiao et al.
Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at https://github.com/dalaoplan/Happp-rPPG-Toolkit.
CLMay 14, 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in MedicineHanguang Xiao, Feizhong Zhou, Xingyue Liu et al.
Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal large language models (MLLMs) have attracted widespread attention for their exceptional capabilities in understanding, reasoning, and generation, introducing transformative paradigms for integrating artificial intelligence into medicine. This survey provides a comprehensive overview of the development, principles, application scenarios, challenges, and future directions of LLMs and MLLMs in medicine. Specifically, it begins by examining the paradigm shift, tracing the transition from traditional models to LLMs and MLLMs, and highlighting the unique advantages of these LLMs and MLLMs in medical applications. Next, the survey reviews existing medical LLMs and MLLMs, providing detailed guidance on their construction and evaluation in a clear and systematic manner. Subsequently, to underscore the substantial value of LLMs and MLLMs in healthcare, the survey explores five promising applications in the field. Finally, the survey addresses the challenges confronting medical LLMs and MLLMs and proposes practical strategies and future directions for their integration into medicine. In summary, this survey offers a comprehensive analysis of the technical methodologies and practical clinical applications of medical LLMs and MLLMs, with the goal of bridging the gap between these advanced technologies and clinical practice, thereby fostering the evolution of the next generation of intelligent healthcare systems.
HCMar 3, 2024
Time2Stop: Adaptive and Explainable Human-AI Loop for Smartphone Overuse InterventionAdiba Orzikulova, Han Xiao, Zhipeng Li et al.
Despite a rich history of investigating smartphone overuse intervention techniques, AI-based just-in-time adaptive intervention (JITAI) methods for overuse reduction are lacking. We develop Time2Stop, an intelligent, adaptive, and explainable JITAI system that leverages machine learning to identify optimal intervention timings, introduces interventions with transparent AI explanations, and collects user feedback to establish a human-AI loop and adapt the intervention model over time. We conducted an 8-week field experiment (N=71) to evaluate the effectiveness of both the adaptation and explanation aspects of Time2Stop. Our results indicate that our adaptive models significantly outperform the baseline methods on intervention accuracy (>32.8\% relatively) and receptivity (>8.0\%). In addition, incorporating explanations further enhances the effectiveness by 53.8\% and 11.4\% on accuracy and receptivity, respectively. Moreover, Time2Stop significantly reduces overuse, decreasing app visit frequency by 7.0$\sim$8.9\%. Our subjective data also echoed these quantitative measures. Participants preferred the adaptive interventions and rated the system highly on intervention time accuracy, effectiveness, and level of trust. We envision our work can inspire future research on JITAI systems with a human-AI loop to evolve with users.
HCJul 21, 2025
Efficient Visual Appearance Optimization by Learning from Prior PreferencesZhipeng Li, Yi-Chi Liao, Christian Holz
Adjusting visual parameters such as brightness and contrast is common in our everyday experiences. Finding the optimal parameter setting is challenging due to the large search space and the lack of an explicit objective function, leaving users to rely solely on their implicit preferences. Prior work has explored Preferential Bayesian Optimization (PBO) to address this challenge, involving users to iteratively select preferred designs from candidate sets. However, PBO often requires many rounds of preference comparisons, making it more suitable for designers than everyday end-users. We propose Meta-PO, a novel method that integrates PBO with meta-learning to improve sample efficiency. Specifically, Meta-PO infers prior users' preferences and stores them as models, which are leveraged to intelligently suggest design candidates for the new users, enabling faster convergence and more personalized results. An experimental evaluation of our method for appearance design tasks on 2D and 3D content showed that participants achieved satisfactory appearance in 5.86 iterations using Meta-PO when participants shared similar goals with a population (e.g., tuning for a ``warm'' look) and in 8 iterations even generalizes across divergent goals (e.g., from ``vintage'', ``warm'', to ``holiday''). Meta-PO makes personalized visual optimization more applicable to end-users through a generalizable, more efficient optimization conditioned on preferences, with the potential to scale interface personalization more broadly.
CLJul 11, 2025
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in ConversationJialong Mai, Xiaofen Xing, Yawei Li et al.
Recent research has focused on applying speech large language model (SLLM) to improve speech emotion recognition (SER). However, the inherently high frame rate in speech modality severely limits the signal processing and understanding capabilities of SLLM. For example, a SLLM with a 4K context window can only process 80 seconds of audio at 50Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLM overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLM. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively "memorize" the contextual information. We trained an emotion SLLM as a backbone and incorporated our DPM into inference for emotion recognition in conversation (ERC). Experimental results on the IEMOCAP dataset show that DPM significantly improves the emotion recognition capabilities of SLLM when processing long audio sequences, achieving state-of-the-art performance.
AIJun 21, 2024
Automated architectural space layout planning using a physics-inspired generative design frameworkZhipeng Li, Sichao Li, Geoff Hinchcliffe et al.
The determination of space layout is one of the primary activities in the schematic design stage of an architectural project. The initial layout planning defines the shape, dimension, and circulation pattern of internal spaces; which can also affect performance and cost of the construction. When carried out manually, space layout planning can be complicated, repetitive and time consuming. In this work, a generative design framework for the automatic generation of spatial architectural layout has been developed. The proposed approach integrates a novel physics-inspired parametric model for space layout planning and an evolutionary optimisation metaheuristic. Results revealed that such a generative design framework can generate a wide variety of design suggestions at the schematic design stage, applicable to complex design problems.
LGDec 13, 2023
Multi-perspective Feedback-attention Coupling Model for Continuous-time Dynamic GraphsXiaobo Zhu, Yan Wu, Zhipeng Li et al.
Recently, representation learning over graph networks has gained popularity, with various models showing promising results. Despite this, several challenges persist: 1) most methods are designed for static or discrete-time dynamic graphs; 2) existing continuous-time dynamic graph algorithms focus on a single evolving perspective; and 3) many continuous-time dynamic graph approaches necessitate numerous temporal neighbors to capture long-term dependencies. In response, this paper introduces the Multi-Perspective Feedback-Attention Coupling (MPFA) model. MPFA incorporates information from both evolving and raw perspectives, efficiently learning the interleaved dynamics of observed processes. The evolving perspective employs temporal self-attention to distinguish continuously evolving temporal neighbors for information aggregation. Through dynamic updates, this perspective can capture long-term dependencies using a small number of temporal neighbors. Meanwhile, the raw perspective utilizes a feedback attention module with growth characteristic coefficients to aggregate raw neighborhood information. Experimental results on a self-organizing dataset and seven public datasets validate the efficacy and competitiveness of our proposed model.
CVFeb 20, 2021
Towards Accurate and Compact Architectures via Neural Architecture TransformerYong Guo, Yin Zheng, Mingkui Tan et al.
Designing effective architectures is one of the key factors behind the success of deep neural networks. Existing deep architectures are either manually designed or automatically searched by some Neural Architecture Search (NAS) methods. However, even a well-designed/searched architecture may still contain many nonsignificant or redundant modules/operations. Thus, it is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computational cost. To this end, we have proposed a Neural Architecture Transformer (NAT) method which casts the optimization problem into a Markov Decision Process (MDP) and seeks to replace the redundant operations with more efficient operations, such as skip or null connection. Note that NAT only considers a small number of possible transitions and thus comes with a limited search/transition space. As a result, such a small search space may hamper the performance of architecture optimization. To address this issue, we propose a Neural Architecture Transformer++ (NAT++) method which further enlarges the set of candidate transitions to improve the performance of architecture optimization. Specifically, we present a two-level transition rule to obtain valid transitions, i.e., allowing operations to have more efficient types (e.g., convolution->separable convolution) or smaller kernel sizes (e.g., 5x5->3x3). Note that different operations may have different valid transitions. We further propose a Binary-Masked Softmax (BMSoftmax) layer to omit the possible invalid transitions. Extensive experiments on several benchmark datasets show that the transformed architecture significantly outperforms both its original counterpart and the architectures optimized by existing methods.
IVDec 2, 2020
An Improved Iterative Neural Network for High-Quality Image-Domain Material Decomposition in Dual-Energy CTZhipeng Li, Yong Long, Il Yong Chun
Dual-energy computed tomography (DECT) has been widely used in many applications that need material decomposition. Image-domain methods directly decompose material images from high- and low-energy attenuation images, and thus, are susceptible to noise and artifacts on attenuation images. The purpose of this study is to develop an improved iterative neural network (INN) for high-quality image-domain material decomposition in DECT, and to study its properties. We propose a new INN architecture for DECT material decomposition. The proposed INN architecture uses distinct cross-material convolutional neural network (CNN) in image refining modules, and uses image decomposition physics in image reconstruction modules. The distinct cross-material CNN refiners incorporate distinct encoding-decoding filters and cross-material model that captures correlations between different materials. We study the distinct cross-material CNN refiner with patch-based reformulation and tight-frame condition. Numerical experiments with extended cardiactorso (XCAT) phantom and clinical data show that the proposed INN significantly improves the image quality over several image-domain material decomposition methods, including a conventional model-based image decomposition (MBID) method using an edge-preserving regularizer, a recent MBID method using pre-learned material-wise sparsifying transforms, and a noniterative deep CNN method. Our study with patch-based reformulations reveals that learned filters of distinct cross-material CNN refiners can approximately satisfy the tight-frame condition.
LGOct 26, 2019
SUPER Learning: A Supervised-Unsupervised Framework for Low-Dose CT Image ReconstructionZhipeng Li, Siqi Ye, Yong Long et al.
Recent years have witnessed growing interest in machine learning-based models and techniques for low-dose X-ray CT (LDCT) imaging tasks. The methods can typically be categorized into supervised learning methods and unsupervised or model-based learning methods. Supervised learning methods have recently shown success in image restoration tasks. However, they often rely on large training sets. Model-based learning methods such as dictionary or transform learning do not require large or paired training sets and often have good generalization properties, since they learn general properties of CT image sets. Recent works have shown the promising reconstruction performance of methods such as PWLS-ULTRA that rely on clustering the underlying (reconstructed) image patches into a learned union of transforms. In this paper, we propose a new Supervised-UnsuPERvised (SUPER) reconstruction framework for LDCT image reconstruction that combines the benefits of supervised learning methods and (unsupervised) transform learning-based methods such as PWLS-ULTRA that involve highly image-adaptive clustering. The SUPER model consists of several layers, each of which includes a deep network learned in a supervised manner and an unsupervised iterative method that involves image-adaptive components. The SUPER reconstruction algorithms are learned in a greedy manner from training data. The proposed SUPER learning methods dramatically outperform both the constituent supervised learning-based networks and iterative algorithms for LDCT, and use much fewer iterations in the iterative reconstruction modules.
IRJul 18, 2019
Combinatorial Keyword Recommendations for Sponsored Search with Deep Reinforcement LearningZhipeng Li, Jianwei Wu, Lin Sun et al.
In sponsored search, keyword recommendations help advertisers to achieve much better performance within limited budget. Many works have been done to mine numerous candidate keywords from search logs or landing pages. However, the strategy to select from given candidates remains to be improved. The existing relevance-based, popularity-based and regular combinatorial strategies fail to take the internal or external competitions among keywords into consideration. In this paper, we regard keyword recommendations as a combinatorial optimization problem and solve it with a modified pointer network structure. The model is trained on an actor-critic based deep reinforcement learning framework. A pre-clustering method called Equal Size K-Means is proposed to accelerate the training and testing procedure on the framework by reducing the action space. The performance of framework is evaluated both in offline and online environments, and remarkable improvements can be observed.
IVJan 1, 2019
DECT-MULTRA: Dual-Energy CT Image Decomposition With Learned Mixed Material Models and Efficient ClusteringZhipeng Li, Saiprasad Ravishankar, Yong Long et al.
Dual energy computed tomography (DECT) imaging plays an important role in advanced imaging applications due to its material decomposition capability. Image-domain decomposition operates directly on CT images using linear matrix inversion, but the decomposed material images can be severely degraded by noise and artifacts. This paper proposes a new method dubbed DECT-MULTRA for image-domain DECT material decomposition that combines conventional penalized weighted-least squares (PWLS) estimation with regularization based on a mixed union of learned transforms (MULTRA) model. Our proposed approach pre-learns a union of common-material sparsifying transforms from patches extracted from all the basis materials, and a union of cross-material sparsifying transforms from multi-material patches. The common-material transforms capture the common properties among different material images, while the cross-material transforms capture the cross-dependencies. The proposed PWLS formulation is optimized efficiently by alternating between an image update step and a sparse coding and clustering step, with both of these steps having closed-form solutions. The effectiveness of our method is validated with both XCAT phantom and clinical head data. The results demonstrate that our proposed method provides superior material image quality and decomposition accuracy compared to other competing methods.
MLNov 2, 2017
Sparse-View X-Ray CT Reconstruction Using $\ell_1$ Prior with Learned TransformXuehang Zheng, Il Yong Chun, Zhipeng Li et al.
A major challenge in X-ray computed tomography (CT) is reducing radiation dose while maintaining high quality of reconstructed images. To reduce the radiation dose, one can reduce the number of projection views (sparse-view CT); however, it becomes difficult to achieve high-quality image reconstruction as the number of projection views decreases. Researchers have applied the concept of learning sparse representations from (high-quality) CT image dataset to the sparse-view CT reconstruction. We propose a new statistical CT reconstruction model that combines penalized weighted-least squares (PWLS) and $\ell_1$ prior with learned sparsifying transform (PWLS-ST-$\ell_1$), and a corresponding efficient algorithm based on Alternating Direction Method of Multipliers (ADMM). To moderate the difficulty of tuning ADMM parameters, we propose a new ADMM parameter selection scheme based on approximated condition numbers. We interpret the proposed model by analyzing the minimum mean square error of its ($\ell_2$-norm relaxed) image update estimator. Our results with the extended cardiac-torso (XCAT) phantom data and clinical chest data show that, for sparse-view 2D fan-beam CT and 3D axial cone-beam CT, PWLS-ST-$\ell_1$ improves the quality of reconstructed images compared to the CT reconstruction methods using edge-preserving regularizer and $\ell_2$ prior with learned ST. These results also show that, for sparse-view 2D fan-beam CT, PWLS-ST-$\ell_1$ achieves comparable or better image quality and requires much shorter runtime than PWLS-DL using a learned overcomplete dictionary. Our results with clinical chest data show that, methods using the unsupervised learned prior generalize better than a state-of-the-art deep "denoising" neural network that does not use a physical imaging model.