Honghao Fu

CV
h-index: 19
9 papers
385 citations
Novelty: 58%
AI Score: 61

9 Papers

CV · Apr 24, 2025 · Code
Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing et al. · tsinghua

In recent years, image editing models have developed rapidly. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models fulfill the vast majority of user-driven editing requirements, marking a significant advance in image manipulation. However, a large gap remains between open-source algorithms and these closed-source models. In this paper, we therefore aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models such as GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline that produces a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making a significant contribution to the field of image editing.

CV · Dec 22, 2023 · Code
BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction

Honghao Fu, Zhiqi Shen, Jing Jih Chin et al.

Analyzing and reconstructing visual stimuli from brain signals effectively advances the understanding of the human visual system. However, EEG signals are complex and contain significant noise. This leads to substantial limitations in existing work on visual stimuli reconstruction from EEG, such as difficulty in aligning EEG embeddings with fine-grained semantic information and a heavy reliance on large, additional self-collected datasets for training. To address these challenges, we propose a novel approach called BrainVis. First, we divide the EEG signals into units and apply a self-supervised approach to them to obtain EEG time-domain features, in an attempt to ease the training difficulty. Additionally, we propose to use frequency-domain features to enhance the EEG representations. Then, we simultaneously align EEG time-frequency embeddings with the interpolation of the coarse and fine-grained semantics in the CLIP space, to highlight the primary visual components and reduce the cross-modal alignment difficulty. Finally, we adopt cascaded diffusion models to reconstruct images. Using only 10% of the training data of the previous work, our proposed BrainVis outperforms state-of-the-art methods in both semantic fidelity reconstruction and generation quality. The code is available at https://github.com/RomGai/BrainVis.

QUANT-PH · Apr 20
Nonlocal Games in the High-Noise Regime: Optimal Quantum Values and Rigidity

Honghao Fu, Minglong Qin, Haochen Xu et al.

Motivated by the limitations of near-term quantum devices, we study nonlocal games in the high-noise regime, where the two players may share arbitrarily many copies of a noisy entangled state. In this regime, existing rigidity theorems are unable to certify any nontrivial quantum structure. We first characterize the maximal quantum winning probabilities of the CHSH game [Clauser et al. '69], the Magic Square game [Mermin '90], and their 2-out-of-n variants [Chao et al. '18] as explicit functions of the noise rate. These characterizations enable the construction of device-independent protocols for estimating the underlying noise level. Building on these results, we prove noise-robust rigidity theorems showing that these games certify one, two, and n pairs of anticommuting Pauli observables, respectively. To our knowledge, these are the first rigidity results for Pauli measurements that remain sound in the high-noise regime, with applications in Measurement-Device-Independent (MDI) cryptography and in studying the computational power of Multi-prover Interactive Proof Systems with entanglement and a vanishing completeness-soundness gap ($\text{MIP}^*_0$). Our proofs rely on Sum-of-Squares decompositions and Pauli analysis techniques originating from quantum proof systems and quantum learning theory, respectively.
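As a point of reference for how a winning probability can depend on noise, the sketch below evaluates the textbook CHSH winning probability for a single copy of a depolarized EPR pair measured with the standard optimal CHSH observables. It is not the paper's multi-copy characterization: the single-copy setting, the visibility parameter `v`, and the function names are all assumptions made for illustration.

```python
import math

# Best winning probability achievable by classical strategies in the CHSH game.
CLASSICAL_VALUE = 0.75


def chsh_win_probability(v: float) -> float:
    """CHSH winning probability on rho = v*|EPR><EPR| + (1 - v)*I/4.

    Assumption: one copy of the state and the standard optimal CHSH
    measurements, so the CHSH correlator is 2*sqrt(2)*v.
    """
    if not 0.0 <= v <= 1.0:
        raise ValueError("visibility must lie in [0, 1]")
    return 0.5 + v * math.sqrt(2) / 4


def beats_classical(v: float) -> bool:
    """True when this fixed strategy exceeds the classical value."""
    return chsh_win_probability(v) > CLASSICAL_VALUE
```

With this fixed single-copy strategy the advantage over classical players disappears once `v` drops to 1/sqrt(2), which illustrates why a high-noise regime calls for different tools.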

CV · Mar 2
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

Yiying Yang, Wei Cheng, Sijin Chen et al.

OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible control of motion and visual content, we focus on Lottie, a lightweight JSON format that represents both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision-language models to follow multi-modal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multi-modal human instructions.

CV · Jun 6, 2023
SDR-GAIN: A High Real-Time Occluded Pedestrian Pose Completion Method for Autonomous Driving

Honghao Fu, Yongli Gu, Yidong Yan et al.

With the advancement of vision-based autonomous driving technology, pedestrian detection has become an important component for improving traffic safety and driving system robustness. Nevertheless, in complex traffic scenarios, conventional pose estimation approaches frequently fail to accurately reconstruct occluded keypoints, primarily due to obstructions caused by vehicles, vegetation, or architectural elements. To address this issue, we propose a novel real-time occluded pedestrian pose completion framework termed Separation and Dimensionality Reduction-based Generative Adversarial Imputation Nets (SDR-GAIN). Unlike previous approaches that train visual models to distinguish occlusion patterns, SDR-GAIN aims to learn human pose directly from the numerical distribution of keypoint coordinates and interpolate missing positions. It employs a self-supervised adversarial learning paradigm to train lightweight generators with residual structures for the imputation of missing pose keypoints. Additionally, it integrates multiple pose standardization techniques to alleviate the difficulty of the learning process. Experiments conducted on the COCO and JAAD datasets demonstrate that SDR-GAIN surpasses conventional machine learning and Transformer-based missing data interpolation algorithms in accurately recovering occluded pedestrian keypoints, while simultaneously achieving microsecond-level real-time inference.

CV · Apr 7
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

Honghao Fu, Miao Xu, Yiwei Wang et al.

Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) offers a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at the clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.
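The multi-hop retrieval step described above can be pictured with a small sketch. It assumes the clip-level graph is a plain adjacency dictionary and that retrieval is a breadth-first expansion from seed clips; the function name, the graph encoding, and the hop limit are hypothetical and are not taken from the VideoStir code.

```python
from collections import deque


def multi_hop_retrieve(graph, seeds, max_hops):
    """Collect clips reachable from the seed clips within max_hops edges.

    graph: dict mapping a clip id to a list of neighbouring clip ids
           (hypothetical encoding of a clip-level graph).
    Returns a dict mapping each retrieved clip id to its hop distance.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        clip = queue.popleft()
        if hops[clip] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbour in graph.get(clip, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[clip] + 1
                queue.append(neighbour)
    return hops


# Hypothetical four-clip chain: c0 - c1 - c2 - c3
clip_graph = {
    "c0": ["c1"],
    "c1": ["c0", "c2"],
    "c2": ["c1", "c3"],
    "c3": ["c2"],
}
retrieved = multi_hop_retrieve(clip_graph, ["c0"], max_hops=2)
```

Here the seed clips would come from an initial query match, and the hop distance could be used to down-weight evidence that is further from the seeds.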

AI · Aug 26, 2025
VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

Honghao Fu, Junlong Ren, Qi Chai et al.

Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.

CV · Aug 5, 2025
WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval

Junlong Ren, Gangjian Zhang, Honghao Fu et al.

Text-Motion Retrieval (TMR) aims to retrieve 3D motion sequences semantically relevant to text descriptions. However, matching 3D motions with text remains highly challenging, primarily due to the intricate structure of the human body and its spatial-temporal dynamics. Existing approaches often overlook these complexities, relying on general encoding methods that fail to distinguish different body parts and their dynamics, limiting precise semantic alignment. To address this, we propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. It captures part-specific and time-varying motion details across multiple resolutions on body joints, extracting discriminative motion features to achieve fine-grained alignment with texts. WaMo has three key components: (1) Trajectory Wavelet Decomposition decomposes motion signals into frequency components that preserve both local kinematic details and global motion semantics. (2) Trajectory Wavelet Reconstruction uses learnable inverse wavelet transforms to reconstruct original joint trajectories from extracted features, ensuring the preservation of essential spatial-temporal information. (3) Disordered Motion Sequence Prediction reorders shuffled motion sequences to improve the learning of inherent temporal coherence, enhancing motion-text alignment. Extensive experiments demonstrate WaMo's superiority, achieving 17.0% and 18.2% improvements in $Rsum$ on the HumanML3D and KIT-ML datasets, respectively, outperforming existing state-of-the-art (SOTA) methods.
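To make the decomposition and reconstruction idea concrete, here is a minimal sketch of a one-level Haar wavelet transform on a single joint coordinate over time. The Haar wavelet, the fixed (non-learnable) inverse, and the function names are assumptions chosen for illustration; the abstract does not state which wavelet WaMo uses, and its reconstruction is described as learnable.

```python
import math


def haar_decompose(signal):
    """One level of the orthonormal Haar transform.

    Returns (approx, detail): approx holds the low-frequency trend,
    detail holds the high-frequency differences between neighbours.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert haar_decompose and recover the original samples."""
    s = math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / s)
        signal.append((a - d) / s)
    return signal


# Hypothetical x-coordinate of one joint over four frames.
trajectory = [0.0, 1.0, 4.0, 9.0]
approx, detail = haar_decompose(trajectory)
recovered = haar_reconstruct(approx, detail)
```

Applying `haar_decompose` again to `approx` would give a coarser resolution level, which is one simple way to obtain multi-frequency components.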

AI · Oct 6, 2025
ContextNav: Towards Agentic Multimodal In-Context Learning

Honghao Fu, Yuan Ouyang, Kai-Wei Chang et al.

Recent advances demonstrate that multimodal large language models (MLLMs) exhibit strong multimodal in-context learning (ICL) capabilities, enabling them to adapt to novel vision-language tasks from a few contextual examples. However, existing ICL approaches face challenges in reconciling scalability with robustness across diverse tasks and noisy contextual examples: manually selecting examples produces clean contexts but is labor-intensive and task-specific, while similarity-based retrieval improves scalability but could introduce irrelevant or structurally inconsistent samples that degrade ICL performance. To address these limitations, we propose ContextNav, the first agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation, enabling noise-robust and dynamically optimized contextualization for multimodal ICL. ContextNav unifies context management and noise-robust contextualization within a closed-loop workflow driven by graph-based orchestration. Specifically, it builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. An Operational Grammar Graph (OGG) further supports adaptive workflow planning and optimization, enabling the agent to refine its operational strategies based on downstream ICL feedback. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets, underscoring the promise of agentic workflows for advancing scalable and robust contextualization in multimodal ICL.