Xinyan Chen

CV
h-index27
16papers
791citations
Novelty51%
AI Score63

16 Papers

LGJul 16, 2022Code
SenseFi: A Library and Benchmark on Deep-Learning-Empowered WiFi Human Sensing

Jianfei Yang, Xinyan Chen, Dazhuo Wang et al. · berkeley

WiFi sensing has been evolving rapidly in recent years. Empowered by propagation models and deep learning methods, many challenging applications are realized such as WiFi-based human activity recognition and gesture recognition. However, in contrast to deep learning for visual recognition and natural language processing, no sufficiently comprehensive public benchmark exists. In this paper, we review the recent progress on deep learning enabled WiFi sensing, and then propose a benchmark, SenseFi, to study the effectiveness of various deep learning models for WiFi sensing. These advanced models are compared in terms of distinct sensing tasks, WiFi platforms, recognition accuracy, model size, computational complexity, feature transferability, and adaptability of unsupervised learning. It is also regarded as a tutorial for deep learning based WiFi sensing, starting from CSI hardware platform to sensing algorithms. The extensive experiments provide us with experiences in deep model design, learning strategy skills and training techniques for real-world applications. To the best of our knowledge, this is the first benchmark with an open-source library for deep learning in WiFi sensing research. The benchmark codes are available at https://github.com/xyanchen/WiFi-CSI-Sensing-Benchmark.

NIApr 8, 2022
EfficientFi: Towards Large-Scale Lightweight WiFi Sensing via CSI Compression

Jianfei Yang, Xinyan Chen, Han Zou et al. · berkeley

WiFi technology has been applied to various places due to the increasing requirement of high-speed Internet access. Recently, besides network services, WiFi sensing is appealing in smart homes since it is device-free, cost-effective and privacy-preserving. Though numerous WiFi sensing methods have been developed, most of them only consider single smart home scenario. Without the connection of powerful cloud server and massive users, large-scale WiFi sensing is still difficult. In this paper, we firstly analyze and summarize these obstacles, and propose an efficient large-scale WiFi sensing framework, namely EfficientFi. The EfficientFi works with edge computing at WiFi APs and cloud computing at center servers. It consists of a novel deep neural network that can compress fine-grained WiFi Channel State Information (CSI) at edge, restore CSI at cloud, and perform sensing tasks simultaneously. A quantized auto-encoder and a joint classifier are designed to achieve these goals in an end-to-end fashion. To the best of our knowledge, the EfficientFi is the first IoT-cloud-enabled WiFi sensing framework that significantly reduces communication overhead while realizing sensing tasks accurately. We utilized human activity recognition and identification via WiFi sensing as two case studies, and conduct extensive experiments to evaluate the EfficientFi. The results show that it compresses CSI data from 1.368Mb/s to 0.768Kb/s with extremely low error of data reconstruction and achieves over 98% accuracy for human activity recognition.

AIFeb 4, 2023Code
HardSATGEN: Understanding the Difficulty of Hard SAT Formula Generation and A Strong Structure-Hardness-Aware Baseline

Yang Li, Xinyan Chen, Wenxuan Guo et al.

Industrial SAT formula generation is a critical yet challenging task. Existing SAT generation approaches can hardly simultaneously capture the global structural properties and maintain plausible computational hardness. We first present an in-depth analysis for the limitation of previous learning methods in reproducing the computational hardness of original instances, which may stem from the inherent homogeneity in their adopted split-merge procedure. On top of the observations that industrial formulae exhibit clear community structure and oversplit substructures lead to the difficulty in semantic formation of logical structures, we propose HardSATGEN, which introduces a fine-grained control mechanism to the neural split-merge paradigm for SAT formula generation to better recover the structural and computational properties of the industrial benchmarks. Experiments including evaluations on private and practical corporate testbed show the superiority of HardSATGEN being the only method to successfully augment formulae maintaining similar computational hardness and capturing the global structural properties simultaneously. Compared to the best previous methods, the average performance gains achieve 38.5% in structural statistics, 88.4% in computational metrics, and over 140.7% in the effectiveness of guiding solver tuning by our generated instances. Source code is available at http://github.com/Thinklab-SJTU/HardSATGEN

NIApr 12, 2022
AutoFi: Towards Automatic WiFi Human Sensing via Geometric Self-Supervised Learning

Jianfei Yang, Xinyan Chen, Han Zou et al. · berkeley

WiFi sensing technology has shown superiority in smart homes among various sensors for its cost-effective and privacy-preserving merits. It is empowered by Channel State Information (CSI) extracted from WiFi signals and advanced machine learning models to analyze motion patterns in CSI. Many learning-based models have been proposed for kinds of applications, but they severely suffer from environmental dependency. Though domain adaptation methods have been proposed to tackle this issue, it is not practical to collect high-quality, well-segmented and balanced CSI samples in a new environment for adaptation algorithms, but randomly-captured CSI samples can be easily collected. {\color{black}In this paper, we firstly explore how to learn a robust model from these low-quality CSI samples, and propose AutoFi, an annotation-efficient WiFi sensing model based on a novel geometric self-supervised learning algorithm.} The AutoFi fully utilizes unlabeled low-quality CSI samples that are captured randomly, and then transfers the knowledge to specific tasks defined by users, which is the first work to achieve cross-task transfer in WiFi sensing. The AutoFi is implemented on a pair of Atheros WiFi APs for evaluation. The AutoFi transfers knowledge from randomly collected CSI samples into human gait recognition and achieves state-of-the-art performance. Furthermore, we simulate cross-task transfer using public datasets to further demonstrate its capacity for cross-task learning. For the UT-HAR and Widar datasets, the AutoFi achieves satisfactory results on activity recognition and gesture recognition without any prior training. We believe that the AutoFi takes a huge step toward automatic WiFi sensing without any developer engagement.

HCJul 22, 2024Code
PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements

Xueyan Li, Xinyan Chen, Yazhe Niu et al.

In the field of psychology, traditional assessment methods, such as standardized scales, are frequently critiqued for their static nature, lack of personalization, and reduced participant engagement, while comprehensive counseling evaluations are often inaccessible. The complexity of quantifying psychological traits further limits these methods. Despite advances with large language models (LLMs), many still depend on single-round Question-and-Answer interactions. To bridge this gap, we introduce PsyDI, a personalized and progressively in-depth chatbot designed for psychological measurements, exemplified by its application in the Myers-Briggs Type Indicator (MBTI) framework. PsyDI leverages user-related multi-modal information and engages in customized, multi-turn interactions to provide personalized, easily accessible measurements, while ensuring precise MBTI type determination. To address the challenge of unquantifiable psychological traits, we introduce a novel training paradigm that involves learning the ranking of proxy variables associated with these traits, culminating in a robust score model for MBTI measurements. The score model enables PsyDI to conduct comprehensive and precise measurements through multi-turn interactions within a unified estimation context. Through various experiments, we validate the efficacy of both the score model and the PsyDI pipeline, demonstrating its potential to serve as a general framework for psychological measurements. Furthermore, the online deployment of PsyDI has garnered substantial user engagement, with over 3,000 visits, resulting in the collection of numerous multi-turn dialogues annotated with MBTI types, which facilitates further research. The source code for the training and web service components is publicly available as a part of OpenDILab at: https://github.com/opendilab/PsyDI

CVOct 30, 2025
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Ziyu Guo, Xinyan Chen, Renrui Zhang et al.

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

CVMar 20
MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Yu Qi, Xinyi Xu, Ziyu Guo et al.

Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for reliable deployment, a property we define as reasoning coherence. To bridge the gap in literature for missing reasoning coherence evaluation, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark to assess reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as evaluation metric for assessing process-level necessary intermediate reasoning steps, and includes three evaluation settings, (a) no hint (b) text hint and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results in 7 open and closed-source video models reveals insights including: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/

CVJun 5, 2025Code
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen, Renrui Zhang, Dongzhi Jiang et al.

Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shapes within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT

CVMay 14
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo, Rain Liu, Xinyan Chen et al.

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.

CYMar 19, 2025Code
EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?

Xinyan Chen, Jiaxin Ge, Hongming Dai et al.

Empathy is fundamental to human interactions, yet it remains unclear whether embodied agents can provide human-like empathetic support. Existing works have studied agents' tasks solving and social interactions abilities, but whether agents can understand empathetic needs and conduct empathetic behaviors remains overlooked. To address this, we introduce EmpathyAgent, the first benchmark to evaluate and enhance agents' empathetic actions across diverse scenarios. EmpathyAgent contains 10,000 multimodal samples with corresponding empathetic task plans and three different challenges. To systematically evaluate the agents' empathetic actions, we propose an empathy-specific evaluation suite that evaluates the agents' empathy process. We benchmark current models and found that exhibiting empathetic actions remains a significant challenge. Meanwhile, we train Llama3-8B using EmpathyAgent and find it can potentially enhance empathetic behavior. By establishing a standard benchmark for evaluating empathetic actions, we hope to advance research in empathetic embodied agents. Our code and data are publicly available at https://github.com/xinyan-cxy/EmpathyAgent.

CVNov 20, 2025Code
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Ziyu Guo, Renrui Zhang, Hongyu Li et al.

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

CVDec 23, 2023Code
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

Xinyan Chen, Jiaxin Ge, Tianjun Zhang et al.

Diffusion models have shown impressive performance in many domains. However, the model's capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. In this work, we propose Iterative Prompt Relabeling (IPR), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback. IPR first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations. With IPR, we improved up to 15.22% (absolute improvement) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods. Our code is publicly available at https://github.com/xinyan-cxy/IPR-RLDF.

CVFeb 13, 2025
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Dongzhi Jiang, Renrui Zhang, Ziyu Guo et al.

Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with reflection mechanism demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and demonstrating the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/

CVOct 14, 2024
X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

Xinyan Chen, Jianfei Yang

Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.

CVNov 19, 2025
Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training

Yunjiao Zhou, Xinyan Chen, Junlang Qian et al.

Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which are infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency of motion grounding performance, outperforming prior methods by +8.7\% mAP on HumanML3D benchmark. Meanwhile, significant improvements also exist in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.

SPMay 12, 2023
MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset for Versatile Wireless Sensing

Jianfei Yang, He Huang, Yunjiao Zhou et al.

4D human perception plays an essential role in a myriad of applications, such as home automation and metaverse avatar simulation. However, existing solutions which mainly rely on cameras and wearable devices are either privacy intrusive or inconvenient to use. To address these issues, wireless sensing has emerged as a promising alternative, leveraging LiDAR, mmWave radar, and WiFi signals for device-free human sensing. In this paper, we propose MM-Fi, the first multi-modal non-intrusive 4D human dataset with 27 daily or rehabilitation action categories, to bridge the gap between wireless sensing and high-level human perception tasks. MM-Fi consists of over 320k synchronized frames of five modalities from 40 human subjects. Various annotations are provided to support potential sensing tasks, e.g., human pose estimation and action recognition. Extensive experiments have been conducted to compare the sensing capacity of each or several modalities in terms of multiple tasks. We envision that MM-Fi can contribute to wireless sensing research with respect to action recognition, human pose estimation, multi-modal learning, cross-modal supervision, and interdisciplinary healthcare research.