h-index28
15papers
82citations
Novelty49%
AI Score56

15 Papers

CLJun 2
HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift

Yu-Kai Chan, Wen-Sheng Lien, Dong-Ting Yao et al.

Large Language Models (LLMs) rely on Knowledge Editing (KE) to maintain temporal validity, yet real-world knowledge is inherently n-ary. We demonstrate that in non-stationary environments, sequential updates to complex relations induce N-ary Structural Drift, a phenomenon where the binary reification of n-ary events into triples fractures relational atomicity. This precipitates Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever frequently misdiagnosed as parametric hallucination. To tackle this, we propose HyperPatch, a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds. HyperPatch preserves event integrity through three phases: (i) Structural Prior Initialization, establishing a topology-aware embedding space via contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, utilizing a dual-stage mechanism that employs SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without backbone retraining; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On the MQuAKE-CF and MQuAKE-T benchmarks, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, respectively. Further ablations demonstrate superior reliability under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.

CVFeb 4Code
VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image

Teng-Fang Hsiao, Bo-Kai Ruan, Yu-Lun Liu et al.

3D editing has emerged as a critical research area to provide users with flexible control over 3D assets. While current editing approaches predominantly focus on 3D Gaussian Splatting or multi-view images, the direct editing of 3D meshes remains underexplored. Prior attempts, such as VoxHammer, rely on voxel-based representations that suffer from limited resolution and necessitate labor-intensive 3D mask. To address these limitations, we propose \textbf{VecSet-Edit}, the first pipeline that leverages the high-fidelity VecSet Large Reconstruction Model (LRM) as a backbone for mesh editing. Our approach is grounded on a analysis of the spatial properties in VecSet tokens, revealing that token subsets govern distinct geometric regions. Based on this insight, we introduce Mask-guided Token Seeding and Attention-aligned Token Gating strategies to precisely localize target regions using only 2D image conditions. Also, considering the difference between VecSet diffusion process versus voxel we design a Drift-aware Token Pruning to reject geometric outliers during the denoising process. Finally, our Detail-preserving Texture Baking module ensures that we not only preserve the geometric details of original mesh but also the textural information. More details can be found in our project page: https://github.com/BlueDyee/VecSet-Edit/tree/main

CVJul 7, 2022
Vision Transformers: State of the Art and Research Challenges

Bo-Kai Ruan, Hong-Han Shuai, Wen-Huang Cheng

Transformers have achieved great success in natural language processing. Due to the powerful capability of self-attention mechanism in transformers, researchers develop the vision transformers for a variety of computer vision tasks, such as image recognition, object detection, image segmentation, pose estimation, and 3D reconstruction. This paper presents a comprehensive overview of the literature on different architecture designs and training tricks (including self-supervised learning) for vision transformers. Our goal is to provide a systematic review with the open research opportunities.

ROMar 26
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model

Bo-Kai Ruan, Hao-Tang Tsui, Yung-Hui Li et al.

Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free-form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi-agent behaviors and selecting roads that respect agents' configurations. To address these, we propose a modular framework, TTSG, comprising prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm to solve these challenges. While large language models (LLMs) are used as general planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety-critical scenarios, as well as multi-stage event composition. Experiments on SafeBench demonstrate that our method achieves the lowest average collision rate (3.5\%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results underscore our proposed framework for flexible, interpretable, and safety-oriented simulation.

CLFeb 16
HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation

Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao et al.

Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational expressiveness. In contrast, n-ary hypergraphs encode higher-order relational facts that capture richer inter-entity dependencies and enable shallower, more efficient reasoning paths. To address this limitation, we propose HyperRAG, a RAG framework tailored for n-ary hypergraphs with two complementary retrieval variants: (i) HyperRetriever learns structural-semantic reasoning over n-ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable n-ary chain construction, benefiting both open and closed-domain QA.

ROMay 8
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo et al.

World Action Models (WAMs) enable decision-making through imagined rollouts by predicting future observations and actions. However, the reliability of these imagined futures remains under-examined: is a generated future merely visually plausible, or is it dynamically compatible with the action sequence it claims to model? In this work, we identify action-state consistency, the alignment between predicted actions and induced state transitions, as a missing reliability axis for WAMs. Through a systematic study across representative joint-prediction and inverse-dynamics models, we find that action-state consistency systematically separates successful and failed rollouts across many tasks and follows similar success-failure trends as learned value estimates. These results suggest that consistency captures decision-relevant structure beyond visual realism. We further identify background collapse as an important boundary condition, where low-dynamics failed trajectories can become deceptively consistent because static futures are easier to predict. Building on these findings, we introduce a value-free consensus strategy for test-time selection, which ranks candidate rollouts by agreement among predicted futures. This strategy improves success rates on RoboCasa and RoboTwin 2.0 without additional training or reward modeling. Taken together, our findings establish action-state consistency as both a diagnostic tool for evaluating WAM reliability and a practical signal for value-free planning.

CVOct 21, 2025Code
Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng et al.

Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.

CEApr 2, 2025Code
Anomaly Detection for Hybrid Butterfly Subspecies via Probability Filtering

Bo-Kai Ruan, Yi-Zeng Fang, Hong-Han Shuai et al.

Detecting butterfly hybrids requires knowledge of the parent subspecies, and the process can be tedious when encountering a new subspecies. This study focuses on a specific scenario where a model trained to recognize hybrid species A can generalize to species B when B biologically mimics A. Since species A and B share similar patterns, we leverage BioCLIP as our feature extractor to capture features based on their taxonomy. Consequently, the algorithm designed for species A can be transferred to B, as their hybrid and non-hybrid patterns exhibit similar relationships. To determine whether a butterfly is a hybrid, we adopt proposed probability filtering and color jittering to augment and simulate the mimicry. With these approaches, we achieve second place in the official development phase. Our code is publicly available at https://github.com/Justin900429/NSF-HDR-Challenge.

CVApr 19, 2024Code
Training-and-Prompt-Free General Painterly Harmonization via Zero-Shot Disentenglement on Style and Content References

Teng-Fang Hsiao, Bo-Kai Ruan, Hong-Han Shuai

Painterly image harmonization aims at seamlessly blending disparate visual elements within a single image. However, previous approaches often struggle due to limitations in training data or reliance on additional prompts, leading to inharmonious and content-disrupted output. To surmount these hurdles, we design a Training-and-prompt-Free General Painterly Harmonization method (TF-GPH). TF-GPH incorporates a novel ``Similarity Disentangle Mask'', which disentangles the foreground content and background image by redirecting their attention to corresponding reference images, enhancing the attention mechanism for multi-image inputs. Additionally, we propose a ``Similarity Reweighting'' mechanism to balance harmonization between stylization and content preservation. This mechanism minimizes content disruption by prioritizing the content-similar features within the given background style reference. Finally, we address the deficiencies in existing benchmarks by proposing novel range-based evaluation metrics and a new benchmark to better reflect real-world applications. Extensive experiments demonstrate the efficacy of our method in all benchmarks. More detailed in https://github.com/BlueDyee/TF-GPH.

CVMar 19, 2025
TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models

Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu et al.

Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.

CVApr 3, 2025
MAD: Makeup All-in-One with Cross-Domain Diffusion Model

Bo-Kai Ruan, Hong-Han Shuai

Existing makeup techniques often require designing multiple models to handle different inputs and align features across domains for different makeup tasks, e.g., beauty filter, makeup transfer, and makeup removal, leading to increased complexity. Another limitation is the absence of text-guided makeup try-on, which is more user-friendly without needing reference images. In this study, we make the first attempt to use a single model for various makeup tasks. Specifically, we formulate different makeup tasks as cross-domain translations and leverage a cross-domain diffusion model to accomplish all tasks. Unlike existing methods that rely on separate encoder-decoder configurations or cycle-based mechanisms, we propose using different domain embeddings to facilitate domain control. This allows for seamless domain switching by merely changing embeddings with a single model, thereby reducing the reliance on additional modules for different tasks. Moreover, to support precise text-to-makeup applications, we introduce the MT-Text dataset by extending the MT dataset with textual annotations, advancing the practicality of makeup technologies.

CVNov 30, 2024
FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

Teng-Fang Hsiao, Bo-Kai Ruan, Sung-Lin Tsai et al.

In this study, we aim to determine and solve the deficiency of Stable Diffusion Inpainting (SDI) in following the instruction of both prompt and mask. Due to the training bias from masking, the inpainting quality is hindered when the prompt instruction and image condition are not related. Therefore, we conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layer. We observe that adapting text key tokens toward the input mask enables the model to selectively paint within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and image condition. By increasing the latent mask value and modifying the frequency of image condition, we align the cross-attention features with the model's training bias to improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, e.g., yielding up to a 60% and 58% improvement of SDI and SDXLI in the CLIP score.

CVNov 25, 2025
PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo et al.

Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.

CVSep 12, 2025
Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation

Sung-Lin Tsai, Bo-Lun Huang, Yu Ting Shen et al.

Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, lime green, hot pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. Our method first employs a large language model (LLM) to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation.

CVMay 27, 2025
Score Replacement with Bounded Deviation for Rare Prompt Generation

Bo-Kai Ruan, Zi-Xiang Ni, Bo-Lun Huang et al.

Diffusion models achieve impressive performance in high-fidelity image generation but often struggle with rare concepts that appear infrequently in the training distribution. Prior work attempts to address this issue by prompt switching, where generation begins with a frequent proxy prompt and later transitions to the original rare prompt. However, such designs typically rely on fixed schedules that disregard the model's internal dynamics, making them brittle across prompts and backbones. In this paper, we re-frame rare prompt generation through the lens of score replacement: the denoising trajectory of a rare prompt can be initially guided by the score of a semantically related frequent prompt, which acts as a proxy. However, as the process unfolds, the proxy score gradually diverges from the true rare prompt score. To control this drift, we introduce a bounded deviation criterion that triggers the switch once the deviation exceeds a threshold. This formulation offers both a principled justification and a practical mechanism for rare prompt generation, enabling adaptive switching that can be widely adopted by different models. Extensive experiments across SDXL, SD3, Flux, and Sana confirm that our method consistently improves rare concept synthesis, outperforming strong baselines in both automated metrics and human evaluations.