Yiming Luo

h-index: 25

19 papers

360 citations

Novelty: 44%

AI Score: 53

Ranked #30,767 of 201,018 authors (top 15%); #12,499 in CV (top 21%)

19 Papers

CL · Jul 25, 2024 · Code

Closing the gap between open-source and commercial large language models for medical evidence summarization

Gongbo Zhang, Qiao Jin, Yiliang Zhou et al.

Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to GPT-3.5 with zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. The above trends of improvement were also manifested in both human and GPT4-simulated evaluations. Our results can be applied to guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.

CV · Mar 24, 2022

Self-supervised Video-centralised Transformer for Video Face Clustering

Yujiang Wang, Mingzhi Dong, Jie Shen et al.

This paper presents a novel method for face clustering in videos using a video-centralised transformer. Previous works often employed contrastive learning to learn frame-level representation and used average pooling to aggregate the features along the temporal dimension. This approach may not fully capture the complicated video dynamics. In addition, despite the recent progress in video-based contrastive learning, few have attempted to learn a self-supervised clustering-friendly face representation that benefits the video face clustering task. To overcome these limitations, our method employs a transformer to directly learn video-level representations that can better reflect the temporally-varying property of faces in videos, while we also propose a video-centralised self-supervised framework to train the transformer model. We also investigate face clustering in egocentric videos, a fast-emerging field that has not been studied yet in works related to face clustering. To this end, we present and release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering. We evaluate our proposed method on both the widely used Big Bang Theory (BBT) dataset and the new EasyCom-Clustering dataset. Results show the performance of our video-centralised transformer has surpassed all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.

CV · Jan 29

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo et al.

This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.

CV · Feb 13, 2023

Multiple Appropriate Facial Reaction Generation in Dyadic Interaction Settings: What, Why and How?

Siyang Song, Micol Spitale, Yiming Luo et al.

According to the Stimulus Organism Response (SOR) theory, all human behavioral reactions are stimulated by context, where people will process the received stimulus and produce an appropriate reaction. This implies that in a specific context for a given input stimulus, a person can react differently according to their internal state and other contextual factors. Analogously, in dyadic interactions, humans communicate using verbal and nonverbal cues, where a broad spectrum of listeners' non-verbal reactions might be appropriate for responding to a specific speaker behaviour. There already exists a body of work that investigated the problem of automatically generating an appropriate reaction for a given input. However, none attempted to automatically generate multiple appropriate reactions in the context of dyadic interactions and evaluate the appropriateness of those reactions using objective measures. This paper starts by defining the facial Multiple Appropriate Reaction Generation (fMARG) task for the first time in the literature and proposes a new set of objective evaluation metrics to evaluate the appropriateness of the generated reactions. The paper subsequently introduces a framework to predict, generate, and evaluate multiple appropriate facial reactions.

IR · Aug 9, 2024

Enhancing Exploratory Learning through Exploratory Search with the Emergence of Large Language Models

Yiming Luo, Patrick Cheong-Iao Pang, Shanton Chang

In the information era, how learners find, evaluate, and effectively use information has become a challenging issue, especially with the added complexity of large language models (LLMs) that have further confused learners in their information retrieval and search activities. This study attempts to unpack this complexity by combining exploratory search strategies with the theories of exploratory learning to form a new theoretical model of exploratory learning from the perspective of students' learning. Our work adapts Kolb's learning model by incorporating high-frequency exploration and feedback loops, aiming to promote deep cognitive and higher-order cognitive skill development in students. Additionally, this paper discusses and suggests how advanced LLMs integrated into information retrieval and information theory can support students in their exploratory searches, contributing theoretically to promoting student-computer interaction and supporting their learning journeys in the new era with LLMs.

CL · Aug 9, 2024

Ensemble BERT: A student social network text sentiment classification model based on ensemble learning and BERT architecture

Kai Jiang, Honghao Yang, Yuexian Wang et al.

The mental health assessment of middle school students has always been one of the focuses in the field of education. This paper introduces a new ensemble learning network based on BERT, employing the concept of enhancing model performance by integrating multiple classifiers. We trained a range of BERT-based learners, which were combined using the majority voting method. We collected social network text data of middle school students through China's Weibo and applied the method to the task of classifying emotional tendencies in middle school students' social network texts. Experimental results suggest that the ensemble learning network performs better than the base model, and that the performance of the ensemble learning model, consisting of three single-layer BERT models, is nearly the same as that of a three-layer BERT model but requires 11.58% more training time. Therefore, in terms of balancing prediction effect and efficiency, the deeper BERT network should be preferred for training. However, for interpretability, network ensembles can provide acceptable solutions.
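The majority-voting step named in this abstract can be illustrated with a minimal sketch. It assumes each base classifier has already produced one label per text; the function names, the label strings, and the tie-breaking rule (first label seen wins) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_model_labels):
    """Combine predictions, where per_model_labels[m][i] is model m's label for text i."""
    # zip(*...) regroups the labels so that each tuple holds all votes for one text.
    return [majority_vote(votes) for votes in zip(*per_model_labels)]


# Hypothetical outputs of three base classifiers on three texts.
model_outputs = [
    ["pos", "neg", "neg"],
    ["pos", "pos", "neg"],
    ["neg", "pos", "neg"],
]
combined = ensemble_predict(model_outputs)
print(combined)
```

With an odd number of binary classifiers a tie cannot occur, which is one reason three-member ensembles are a common choice for this scheme.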

CV · Aug 26, 2025 · Code

USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Shaojin Wu, Mengqi Huang, Yufeng Cheng et al.

Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO

CL · Apr 20

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou et al.

Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique-revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.

OT · Jan 9

Immunological Density Shapes Recovery Trajectories in Long COVID

Jing Wang, Tong Zhang, Xing Niu et al.

Post-acute sequelae of SARS-CoV-2 infection (Long COVID) frequently persists for months, yet drivers of clinical remission remain incompletely defined. Here we analyzed 97,564 longitudinal PASC assessments from 13,511 participants with linked vaccination histories to disentangle passive temporal progression from vaccine-associated change. Using a clinically validated threshold (PASC $\geq 12$), trajectories separated into three phenotypes: Protected (persistently sub-threshold), Refractory (persistently symptomatic), and Responders (transitioning from symptomatic to recovered). Across the full cohort, symptom severity increased modestly with elapsed time ($r=0.0521$, $P=1.26\times10^{-59}$), whereas cumulative vaccination showed an inverse association with severity ($r=-0.0434$, $P=5.95\times10^{-42}$). In summary, baseline Long COVID severity appears clinically deterministic. In the absence of intervention, symptoms typically persist without spontaneous resolution. Recovery is primarily associated with repeated immunization.

CV · Apr 23, 2025

DreamO: A Unified Framework for Image Customization

Chong Mou, Yanze Wu, Wenxu Wu et al.

Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) demonstrates strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their generalizability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process input of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

LG · Mar 18

Predicting Trajectories of Long COVID in Adult Women: The Critical Role of Causal Disentanglement

Jing Wang, Jie Shen, Yiming Luo et al.

Early prediction of Post-Acute Sequelae of SARS-CoV-2 (PASC) severity is a critical challenge for women's health, particularly given the diagnostic overlap between PASC and common hormonal transitions such as menopause. Identifying and accounting for these confounding factors is essential for accurate long-term trajectory prediction. We conducted a retrospective study of 1,155 women (mean age 61) from the NIH RECOVER dataset. By integrating static clinical profiles with four weeks of longitudinal wearable data (monitoring cardiac activity and sleep), we developed a causal network based on a Large Language Model to predict future PASC scores. Our framework achieved a precision of 86.7% in clinical severity prediction. Our causal attribution analysis demonstrates the model's ability to differentiate between active pathology and baseline noise: direct indicators such as breathlessness and malaise reached maximum saliency (1.00), while confounding factors like menopause and diabetes were successfully suppressed with saliency scores below 0.27.

CV · Sep 26, 2025

A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG-Enhanced Design

Zichen Zhang, Kunlong Zhang, Hongwei Ruan et al.

Transformer-based models have advanced the field of question answering, but multi-hop reasoning, where answers require combining evidence across multiple passages, remains difficult. This paper presents a comprehensive evaluation of retrieval strategies for multi-hop question answering within a retrieval-augmented generation framework. We compare cosine similarity, maximal marginal relevance, and a hybrid method that integrates dense embeddings with lexical overlap and re-ranking. To further improve retrieval, we adapt the EfficientRAG pipeline for query optimization, introducing token labeling and iterative refinement while maintaining efficiency. Experiments on the HotpotQA dataset show that the hybrid approach substantially outperforms baseline methods, achieving a relative improvement of 50 percent in exact match and 47 percent in F1 score compared to cosine similarity. Error analysis reveals that hybrid retrieval improves entity recall and evidence complementarity, while remaining limited in handling distractors and temporal reasoning. Overall, the results suggest that hybrid retrieval-augmented generation provides a practical zero-shot solution for multi-hop question answering, balancing accuracy, efficiency, and interpretability.
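Two of the retrieval strategies compared in this abstract, cosine similarity and maximal marginal relevance (MMR), can be contrasted in a short sketch. It uses the standard MMR formulation, which greedily trades relevance to the query against redundancy with passages already selected. The toy 2-D vectors, the function names, and the `lam` values are assumptions for illustration; the paper's embedding model and settings are not given here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query, docs, k, lam=0.5):
    """Greedily pick k document indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query, docs[i])
            # Redundancy is the highest similarity to any already selected document.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 is distinct.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
picked = mmr(query, docs, k=2, lam=0.3)
print(picked)
```

Setting `lam=1.0` removes the redundancy term and reduces the procedure to plain cosine-similarity ranking, which then returns the two near-duplicate documents.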

CV · Apr 25, 2025

Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

Yuanbing Ouyang, Yizhuo Liang, Qingpeng Li et al.

Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data characteristics. This study introduces 'LVTP', a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features with twice clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. A novel dynamic scoring mechanism using multi-scale Tsallis entropy weighting overcomes limitations of traditional single-parameter entropy. The framework also incorporates low-level feature analysis to preserve critical edge information while optimizing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%-45% computational reductions with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
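For readers unfamiliar with Tsallis entropy, the following sketch shows the standard definition, S_q(p) = (1 - sum_i p_i^q) / (q - 1), which recovers Shannon entropy as q approaches 1. It covers only the entropy itself; how LVTP combines several values of q into a multi-scale token score is not described in the abstract, so nothing here should be read as the paper's scoring mechanism. The function name and example distributions are assumptions.

```python
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy of a discrete distribution; q = 1 falls back to Shannon entropy."""
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: Shannon entropy in nats.
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [1.0, 0.0, 0.0, 0.0]

for q in (0.5, 1.0, 2.0):
    print(q, tsallis_entropy(uniform, q), tsallis_entropy(peaked, q))
```

The parameter q changes how strongly rare versus frequent outcomes contribute, which is the usual motivation for evaluating the entropy at more than one q.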

HC · Jan 9, 2022

In-Device Feedback in Immersive Head-Mounted Displays for Distance Perception During Teleoperation of Unmanned Ground Vehicles

Yiming Luo, Jialin Wang, Rongkai Shi et al.

In recent years, Virtual Reality (VR) Head-Mounted Displays (HMD) have been used to provide an immersive, first-person view in real-time for the remote-control of Unmanned Ground Vehicles (UGV). One critical issue is that it is challenging to perceive the distance of obstacles surrounding the vehicle from 2D views in the HMD, which deteriorates the control of UGV. Conventional distance indicators used in HMD take up screen space, which leads to clutter on the display and can further reduce situation awareness of the physical environment. To address the issue, in this paper we propose off-screen in-device feedback using vibro-tactile and/or light-visual cues to provide real-time distance information for the remote control of UGV. Results from a study show a significantly better performance with either feedback type, reduced workload and improved usability in a driving task that requires continuous perception of the distance between the UGV and its environmental objects or obstacles. Our findings show a solid case for in-device vibro-tactile and/or light-visual feedback to support remote operation of UGVs that relies heavily on distance perception of objects.

HC · Sep 29, 2021

RelicVR: A Virtual Reality Game for Active Exploration of Archaeological Relics

Yilin Liu, Yiming Lin, Rongkai Shi et al.

Digitalization is changing how people visit museums and explore the artifacts they house. Museums, as important educational venues outside classrooms, need to actively explore the application of digital interactive media, including games that can balance entertainment and knowledge acquisition. In this paper, we introduce RelicVR, a virtual reality (VR) game that encourages players to discover artifacts through physical interaction in a game-based approach. Players need to unearth artifacts hidden in a clod enclosure by using available tools and physical movements. The game relies on the dynamic voxel deformation technique to allow players to chip away earth covering the artifacts. We added uncertainty in the exploration process to bring it closer to how archaeological discovery happens in real life. Players do not know the shape or features of the hidden artifact and have to take away the earth gradually but strategically without hitting the artifact itself. From playtesting sessions with eight participants, we found that the uncertainty elements are conducive to their engagement and exploration experience. Overall, RelicVR is an innovative game that can improve players' learning motivation and outcomes of ancient artifacts.

HC · Jul 12, 2021

Monoscopic vs. Stereoscopic Views and Display Types in the Teleoperation of Unmanned Ground Vehicles for Object Avoidance

Yiming Luo, Jialin Wang, Hai-Ning Liang et al.

Virtual reality (VR) head-mounted displays (HMD) have recently been used to provide an immersive, first-person view in real-time for manipulating remotely-controlled unmanned ground vehicles (UGV). The teleoperation of UGV can be challenging for operators when it is done in real time. One big challenge is for operators to quickly perceive the distance of objects that are around the UGV while it is moving. In this research, we explore the use of monoscopic and stereoscopic views and display types (immersive and non-immersive VR) for operating vehicles remotely. We conducted two user studies to explore their feasibility and advantages. Results show a significantly better performance when using an immersive display with stereoscopic view for dynamic, real-time navigation tasks that require avoiding both moving and static obstacles. The use of stereoscopic view in an immersive display in particular improved user performance and led to better usability.

CV · Jan 25, 2021

DeepDT: Learning Geometry From Delaunay Triangulation for Surface Reconstruction

Yiming Luo, Zhenxing Mi, Wenbing Tao

In this paper, a novel learning-based network, named DeepDT, is proposed to reconstruct the surface from the Delaunay triangulation of a point cloud. DeepDT learns to predict inside/outside labels of Delaunay tetrahedrons directly from a point cloud and the corresponding Delaunay triangulation. The local geometry features are first extracted from the input point cloud and aggregated into a graph derived from the Delaunay triangulation. Then graph filtering is applied to the aggregated features in order to add structural regularization to the label prediction of tetrahedrons. Due to the complicated spatial relations between tetrahedrons and the triangles, it is impossible to directly generate ground truth labels of tetrahedrons from the ground truth surface. Therefore, we propose a multi-label supervision strategy which votes for the label of a tetrahedron with labels of sampling locations inside it. The proposed DeepDT can maintain abundant geometry details without generating overly complex surfaces, especially for inner surfaces of open scenes. Meanwhile, the generalization ability and time consumption of the proposed method are acceptable and competitive compared with the state-of-the-art methods. Experiments demonstrate the superior performance of the proposed DeepDT.

CV · Nov 18, 2019

SSRNet: Scalable 3D Surface Reconstruction Network

Zhenxing Mi, Yiming Luo, Wenbing Tao

Existing learning-based surface reconstruction methods from point clouds are still facing challenges in terms of scalability and preservation of details on large-scale point clouds. In this paper, we propose the SSRNet, a novel scalable learning-based method for surface reconstruction. The proposed SSRNet constructs local geometry-aware features for octree vertices and designs a scalable reconstruction pipeline, which not only greatly enhances the prediction accuracy of the relative position between the vertices and the implicit surface, facilitating the surface reconstruction quality, but also allows dividing the point cloud and octree vertices and processing different parts in parallel for superior scalability on large-scale point clouds with millions of points. Moreover, SSRNet demonstrates outstanding generalization capability and only needs several surface data for training, much less than other learning-based reconstruction methods, which can effectively avoid overfitting. The trained model of SSRNet on one dataset can be directly used on other datasets with superior performance. Finally, the time consumption with SSRNet on a large-scale point cloud is acceptable and competitive. To our knowledge, the proposed SSRNet is the first to really bring a convincing solution to the scalability issue of the learning-based surface reconstruction methods, and is an important step to make learning-based methods competitive with respect to geometry processing methods on real-world and challenging data. Experiments show that our method achieves a breakthrough in scalability and quality compared with state-of-the-art learning-based methods.

CV · Feb 7, 2018

Unsupervised Typography Transfer

Hanfei Sun, Yiming Luo, Ziang Lu

Traditional methods in Chinese typography synthesis view characters as an assembly of radicals and strokes, but they rely on manual definition of the key points, which is still time-consuming. Some recent work in computer vision proposes a brand new approach: to treat every Chinese character as an independent and inseparable image, so the pre-processing and post-processing of each character can be avoided. Then with a combination of a transfer network and a discriminating network, one typography can be well transferred to another. Despite the quite satisfying performance of the model, the training process must be supervised, which means that in the training data each character in the source domain and the target domain needs to be perfectly paired. Sometimes the pairing is time-consuming, and sometimes there is no perfect pairing, such as the pairing between traditional Chinese and simplified Chinese characters. In this paper, we propose an unsupervised typography transfer method which doesn't need pairing.