Xian Wang

CV
h-index: 33
19 papers
191 citations
Novelty: 45%
AI Score: 53

19 Papers

CV Jan 10, 2025 Code
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design

Ziheng Wu, Zhenghao Chen, Ruipu Luo et al.

Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.

48.9 HC Mar 29
Conflict Resolution Strategies for Co-manipulation of Virtual Objects Under Non-disjoint Conditions

Xian Wang, Xuanru Cheng, Rongkai Shi et al.

Virtual Reality (VR) co-manipulation enables multiple users to collaboratively interact with shared virtual objects. However, existing research treats objects as monolithic entities, overlooking scenarios where users need to manipulate different sub-components simultaneously. This work addresses conflict resolution when users select overlapping vertices (non-disjoint sets) during co-manipulation. We present a comprehensive framework comprising preventive strategies (Object-level and Action-level Restrictions) and reactive strategies (computational conflict resolution). Through two user studies with 76 participants (38 pairs), we evaluated these approaches in collaborative wireframe editing tasks. Study 1 identified Averaging as the optimal computational method, balancing task efficiency with user experience. Study 2 highlighted that Action-level Restriction, which permits overlapping selections but restricts concurrent identical operations, achieved better performance compared to exclusive object locking. Reactive strategies using averaging provided smooth collaboration for experienced users, while second-user priority enabled quick corrections. Our findings indicate that optimal strategy selection depends on task requirements, user expertise, and collaboration patterns. Based on the findings, we provide design implications for developing VR collaboration systems that support flexible sub-components manipulation while maintaining collaborative awareness and minimizing conflicts.
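The Averaging strategy named in the abstract above can be illustrated with a minimal sketch. It assumes each user applies a single translation to their selected vertex set, and that vertices in the overlap receive the mean of the two translations. The function name `resolve_by_averaging` and the data layout are hypothetical, not taken from the paper.

```python
from typing import Dict, Set, Tuple

Vec3 = Tuple[float, float, float]


def resolve_by_averaging(
    positions: Dict[int, Vec3],
    selection_a: Set[int],
    delta_a: Vec3,
    selection_b: Set[int],
    delta_b: Vec3,
) -> Dict[int, Vec3]:
    """Apply two users' translations; vertices selected by both get the mean.

    Assumption: a conflict is any vertex id present in both selections.
    """
    mean_delta = tuple((a + b) / 2.0 for a, b in zip(delta_a, delta_b))
    result: Dict[int, Vec3] = {}
    for vid, pos in positions.items():
        in_a, in_b = vid in selection_a, vid in selection_b
        if in_a and in_b:
            delta = mean_delta          # overlapping vertex: average both inputs
        elif in_a:
            delta = delta_a             # only user A moves this vertex
        elif in_b:
            delta = delta_b             # only user B moves this vertex
        else:
            delta = (0.0, 0.0, 0.0)     # untouched vertex
        result[vid] = tuple(p + d for p, d in zip(pos, delta))
    return result


if __name__ == "__main__":
    wireframe = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (2.0, 0.0, 0.0)}
    moved = resolve_by_averaging(
        wireframe, {0, 1}, (0.0, 2.0, 0.0), {1, 2}, (0.0, 0.0, 4.0)
    )
    print(moved)
```

In this sketch vertex 1 is shared, so it moves by the mean of the two translations, while vertices 0 and 2 follow their single owner.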

CV Dec 12, 2025
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Wei Chow, Linfeng Li, Lingdong Kong et al.

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than performing holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

78.0 CL Mar 27
AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents

Wenbo Gao, Renxi Liu, Xian Wang et al.

Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent's own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.

RO Sep 25, 2024
Dashing for the Golden Snitch: Multi-Drone Time-Optimal Motion Planning with Multi-Agent Reinforcement Learning

Xian Wang, Jin Zhou, Yuanli Feng et al.

Recent innovations in autonomous drones have facilitated time-optimal flight in single-drone configurations, and enhanced maneuverability in multi-drone systems by applying optimal control and learning-based methods. However, few studies have achieved time-optimal motion planning for multi-drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network using multi-agent reinforcement learning for time-optimal multi-drone flight. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision-free mechanism inspired by optimization-based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training while ensuring lightweight implementation. Extensive simulations show that, despite slight performance trade-offs compared to single-drone systems, our multi-drone approach maintains near-time-optimal performance with a low collision rate. Real-world experiments validate our method, with two quadrotors using the same network as in simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m × 5.5 m × 2.0 m space across various tracks, relying entirely on onboard computation.

58.1 RO Mar 11
MAVEN: A Meta-Reinforcement Learning Framework for Varying-Dynamics Expertise in Agile Quadrotor Maneuvers

Jin Zhou, Dongcheng Cao, Xian Wang et al.

Reinforcement learning (RL) has emerged as a powerful paradigm for achieving online agile navigation with quadrotors. Despite this success, policies trained via standard RL typically fail to generalize across significant dynamic variations, exhibiting a critical lack of adaptability. This work introduces MAVEN, a meta-RL framework that enables a single policy to achieve robust end-to-end navigation across a wide range of quadrotor dynamics. Our approach features a novel predictive context encoder, which learns to infer a latent representation of the system dynamics from interaction history. We demonstrate our method in agile waypoint traversal tasks under two challenging scenarios: large variations in quadrotor mass and severe single-rotor thrust loss. We leverage a GPU-vectorized simulator to distribute tasks across thousands of parallel environments, overcoming the long training times of meta-RL to converge in less than an hour. Through extensive experiments in both simulation and the real world, we validate that MAVEN achieves superior adaptation and agility. The policy successfully executes zero-shot sim-to-real transfer, demonstrating robust online adaptation by performing high-speed maneuvers despite mass variations of up to 66.7% and single-rotor thrust losses as severe as 70%.

97.2 CV May 11
Masked Generative Transformer Is What You Need for Image Editing

Wei Chow, Linfeng Li, Xian Sun et al.

Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.

HC Mar 6
Non-urgent Messages Do Not Jump into My Headset Suddenly! Adaptive Notification Design in Mixed Reality

Jingyao Zheng, Xian Wang, Sven Mayer et al.

Mixed reality (MR) notification systems currently display all messages in fixed central locations regardless of urgency, leading to unnecessary interruptions and cognitive overload. Drawing from previous MR/Virtual Reality (VR) notification design work and calm technology principles, we developed an adaptive notification system that adjusts spatial placement based on urgency levels: non-urgent notifications appear as peripheral icons accessible via head movement, moderately urgent messages anchor to the user's hand, and very urgent notifications transition progressively from peripheral to central view. Through a within-subjects study (N=18), we evaluated our adaptive system against the default centralised approach. Results demonstrate that the adaptive system significantly reduces mental workload (p=0.041), temporal workload (p=0.008), and frustration (p=0.004) while maintaining comparable notification awareness. Logistic regression analysis reveals that users prefer the adaptive system even with classification errors, provided the combined misclassification rate (disruptiveness + omission errors) remains below a determinable threshold. Our findings establish the first empirical evidence that urgency-based spatial notification distribution effectively addresses core MR usability challenges, offering practical design guidelines for immersive notification systems that balance user attention management with information accessibility.

HC Nov 5, 2025
When Generative Artificial Intelligence meets Extended Reality: A Systematic Review

Xinyu Ning, Yan Zhuo, Xian Wang et al.

With the continuous advancement of technology, the application of generative artificial intelligence (AI) in various fields is gradually demonstrating great potential, particularly when combined with Extended Reality (XR), creating unprecedented possibilities. This survey article systematically reviews the applications of generative AI in XR, covering as much relevant literature as possible from 2023 to 2025. The application areas of generative AI in XR and its key technology implementations are summarised through PRISMA screening and analysis of the final 26 articles. The survey highlights existing articles from the last three years related to how XR utilises generative AI, providing insights into current trends and research gaps. We also explore potential opportunities for future research to further empower XR through generative AI, providing guidance and information for future generative XR research.

CL Oct 23, 2024
LMLPA: Language Model Linguistic Personality Assessment

Jingyao Zheng, Xian Wang, Simo Hosio et al.

Large Language Models (LLMs) are increasingly used in everyday life and research. One of the most common use cases is conversational interactions, enabled by the language generation capabilities of LLMs. Just as between two humans, a conversation between an LLM-powered entity and a human depends on the personality of the conversants. However, measuring the personality of a given LLM is currently a challenge. This paper introduces the Language Model Linguistic Personality Assessment (LMLPA), a system designed to evaluate the linguistic personalities of LLMs. Our system helps to understand LLMs' language generation capabilities by quantitatively assessing the distinct personality traits reflected in their linguistic outputs. Unlike traditional human-centric psychometrics, the LMLPA adapts a personality assessment questionnaire, specifically the Big Five Inventory, to align with the operational capabilities of LLMs, and also incorporates the findings from previous language-based personality measurement literature. To mitigate sensitivity to the order of options, our questionnaire is designed to be open-ended, resulting in textual answers. Thus, the AI rater is needed to transform ambiguous personality information from text responses into clear numerical indicators of personality traits. Utilising Principal Component Analysis and reliability validations, our findings demonstrate that LLMs possess distinct personality traits that can be effectively quantified by the LMLPA. This research contributes to Human-Computer Interaction and Human-Centered AI, providing a robust framework for future studies to refine AI personality assessments and expand their applications in multiple areas, including education and manufacturing.

CV Jun 25, 2025
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

Zhentao He, Can Zhang, Ziheng Wu et al.

Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading to overreliance on linguistic priors or misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in the generation of hallucinatory content, especially when a precise answer is not feasible. To better demonstrate and analyze this phenomenon and problem, we propose KIE-HVQA, the first benchmark dedicated to evaluating OCR hallucination in degraded document understanding. This dataset includes test samples spanning identity cards and invoices, with simulated real-world degradations for OCR reliability. This setup allows for evaluating models' capacity, under degraded input, to distinguish reliable visual information and answer accordingly, thereby highlighting the challenge of avoiding hallucination on uncertain data. To achieve vision-faithful reasoning and thereby avoid the aforementioned issues, we further introduce a GRPO-based framework featuring a novel reward mechanism. By incorporating a self-awareness of visual uncertainty and an analysis method that initiates refusal to answer to increase task difficulty within our supervised fine-tuning and reinforcement learning framework, we successfully mitigated hallucinations in ambiguous regions. Experiments on Qwen2.5-VL demonstrate that our 7B-parameter model achieves a 22% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA and there is no significant performance drop in standard tasks, highlighting both effectiveness and robustness.

CV Jun 3, 2025
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Wei Chow, Yuan Gao, Linfeng Li et al.

Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a limitation of existing models: they focus solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.

HC Jan 18, 2022
VibroWeight: Simulating Weight and Center of Gravity Changes of Objects in Virtual Reality for Enhanced Realism

Xian Wang, Diego Monteiro, Lik-Hang Lee et al.

Haptic feedback in virtual reality (VR) allows users to perceive the physical properties of virtual objects (e.g., their weight and motion patterns). However, the lack of haptic sensations deteriorates users' immersion and overall experience. In this work, we designed and implemented a low-cost hardware prototype with liquid metal, VibroWeight, which can complement commercial VR handheld controllers. VibroWeight is characterized by bimodal feedback cues in VR, driven by adaptive absolute mass (weights) and gravity shift. To our knowledge, liquid metal is used in a VR haptic device for the first time. A study with 29 participants shows that VibroWeight delivers significantly better VR experiences in realism and comfort.

CV Aug 19, 2021
Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables

Tao Wang, Yong Li, Jingyang Peng et al.

Recently, deep learning-based image enhancement algorithms have achieved state-of-the-art (SOTA) performance on several publicly available datasets. However, most existing methods fail to meet practical requirements either for visual perception or for computation efficiency, especially for high-resolution images. In this paper, we propose a novel real-time image enhancer via learnable spatial-aware 3-dimensional lookup tables (3D LUTs), which considers both the global scenario and local spatial information. Specifically, we introduce a lightweight two-head weight predictor that has two outputs. One is a 1D weight vector used for image-level scenario adaptation; the other is a 3D weight map used for pixel-wise category fusion. We learn the spatial-aware 3D LUTs and fuse them according to the aforementioned weights in an end-to-end manner. The fused LUT is then used to transform the source image into the target tone in an efficient way. Extensive results show that our model outperforms SOTA image enhancement methods on public datasets both subjectively and objectively, and that our model only takes about 4ms to process a 4K resolution image on one NVIDIA V100 GPU.
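The idea of fusing several lookup tables by predicted weights can be sketched roughly as follows. This is a simplified illustration, not the paper's implementation: it covers only the image-level 1D weight vector (the pixel-wise weight map is omitted), it uses nearest-grid-point lookup in place of whatever interpolation the authors use, and the names `identity_lut`, `fuse_luts`, and `apply_lut` are hypothetical.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def identity_lut(n: int) -> List[RGB]:
    """Build an n*n*n LUT that maps every grid colour to itself.

    Entries are stored flat, indexed as (r * n + g) * n + b.
    """
    step = 1.0 / (n - 1)
    return [(r * step, g * step, b * step)
            for r in range(n) for g in range(n) for b in range(n)]


def fuse_luts(luts: Sequence[List[RGB]], weights: Sequence[float]) -> List[RGB]:
    """Entry-wise weighted sum of equally sized LUTs (image-level weights only)."""
    return [
        tuple(sum(w * lut[i][c] for w, lut in zip(weights, luts)) for c in range(3))
        for i in range(len(luts[0]))
    ]


def apply_lut(pixel: RGB, lut: List[RGB], n: int) -> RGB:
    """Nearest-grid-point lookup for one pixel with channels in [0, 1]."""
    r, g, b = (min(n - 1, max(0, round(ch * (n - 1)))) for ch in pixel)
    return lut[(r * n + g) * n + b]


if __name__ == "__main__":
    n = 3
    ident = identity_lut(n)
    inverted = [(1.0 - r, 1.0 - g, 1.0 - b) for r, g, b in ident]
    # Hypothetical predictor output: equal weight on both basis LUTs.
    fused = fuse_luts([ident, inverted], [0.5, 0.5])
    print(apply_lut((1.0, 0.0, 0.0), fused, n))
```

Because the fused table is computed once per image, the per-pixel cost stays a single table lookup, which is consistent with the efficiency argument in the abstract.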

HC Apr 15, 2021
Spatial Knowledge Acquisition in Virtual and Physical Reality: A Comparative Evaluation

Diego Monteiro, Xian Wang, Hai-Ning Liang et al.

Virtual Reality (VR) head-mounted displays (HMDs) have been studied widely as tools for the most diverse kinds of training activities. One special kind that is the basis for many real-world applications is spatial knowledge acquisition and navigation. For example, knowing escape routes by heart can be an important factor for firefighters and soldiers. Prior research on how well knowledge acquired in virtual worlds translates to the real, physical one has had mixed results, with some suggesting spatial learning in VR is akin to using a regular 2D display. However, VR HMDs have evolved drastically in the last decade, and little is known about how spatial training in a simulated environment using up-to-date VR HMDs compares to training in the real world. In this paper, we aim to investigate how people trained in a VR maze compare against those trained in a physical maze in terms of recall of the position of items inside the environment. While our results did not find significant differences in time performance between people who trained in the physical maze and those who trained in VR, other behavioural factors were different.

CV Dec 6, 2020
Robust Image Captioning

Daniel Yarnell, Xian Wang

Automated captioning of photos is a task that combines the difficulties of photo analysis and text generation. One essential feature of captioning is the concept of attention: how to determine what to describe and in which sequence. In this study, we leverage Object Relation with an adversarial robust cut algorithm, which builds upon this method by explicitly embedding knowledge about the spatial association between input data through a graph representation. Our experimental study demonstrates the promising performance of our proposed method for image captioning.

HC Oct 12, 2020
Evaluating the Effect of Audience in a Virtual Reality Presentation Training Tool

Diego Monteiro, Hai-Ning Liang, Hongji Li et al.

Public speaking is an essential skill in everyone's professional or academic career. Nevertheless, honing this skill is often tricky because training in front of a mirror does not give feedback or inspire the same anxiety as presenting in front of an audience. Further, most people do not always have access to the place where the presentation will happen. In this research, we developed a Virtual Reality (VR) environment to assist in improving people's presentation skills. Our system uses 3D scanned people to create more realistic scenarios. We conducted a study with twelve participants who had no prior experience with VR. We validated our virtual environment by analyzing whether it was preferred to no VR system and accepted regardless of the existence of a virtual audience. Our results show that users overwhelmingly prefer using the VR system as a tool to help them improve their public speaking skills over training in an empty environment. However, the preference for an audience is mixed.

CV Oct 10, 2018
Image Super-Resolution Using VDSR-ResNeXt and SRCGAN

Saifuddin Hitawala, Yao Li, Xian Wang et al.

Over the past decade, many Super Resolution techniques have been developed using deep learning. Among those, generative adversarial networks (GAN) and very deep convolutional networks (VDSR) have shown promising results in terms of HR image quality and computational speed. In this paper, we propose two approaches based on these two algorithms: VDSR-ResNeXt, which is a deep multi-branch convolutional network inspired by VDSR and ResNeXt; and SRCGAN, which is a conditional GAN that explicitly passes class labels as input to the GAN. The two methods were implemented on common SR benchmark datasets for both quantitative and qualitative assessment.

OPTICS Oct 28, 2014
A Short Image Series Based Scheme for Time Series Digital Image Correlation

Xian Wang, Shaopeng Ma

A new scheme for digital image correlation, i.e., short time series DIC (STS-DIC) is proposed. Instead of processing the original deformed speckle images individually, STS-DIC combines several adjacent deformed speckle images from a short time series and then processes the averaged image, for which deformation continuity over time is introduced. The deformation of several adjacent images is assumed to be linear in time and a new spatial-temporal displacement representation method with eight unknowns is presented based on the subset-based representation method. Then, the model of STS-DIC is created and a solving scheme is developed based on the Newton-Raphson iteration. The proposed method is verified for numerical and experimental cases. The results show that the proposed STS-DIC greatly improves the accuracy of traditional DIC, both under simple and complicated deformation conditions, while retaining acceptable actual computational cost.