Penghui Ruan

CV
h-index24
8papers
53citations
Novelty57%
AI Score54

8 Papers

CVDec 4, 2025Code
Refaçade: Editing Object with Given Reference Texture

Youze Huang, Penghui Ruan, Bojia Zi et al.

Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at https://github.com/fishZe233/Refacade.

LGOct 23, 2023Code
MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency

Xiaoyun Liu, Divya Saxena, Jiannong Cao et al.

Neural architecture search (NAS) has gained significant traction in automating the design of neural networks. To reduce search time, differentiable architecture search (DAS) reframes the traditional paradigm of discrete candidate sampling and evaluation into a differentiable optimization over a super-net, followed by discretization. However, most existing DAS methods primarily focus on optimizing the coarse-grained operation-level topology, while neglecting finer-grained structures such as filter-level and weight-level patterns. This limits their ability to balance model performance with model size. Additionally, many methods compromise search quality to save memory during the search process. To tackle these issues, we propose Multi-Granularity Differentiable Architecture Search (MG-DARTS), a unified framework which aims to discover both effective and efficient architectures from scratch by comprehensively yet memory-efficiently exploring a multi-granularity search space. Specifically, we improve the existing DAS methods in two aspects. First, we adaptively adjust the retention ratios of searchable units across different granularity levels through adaptive pruning, which is achieved by learning granularity-specific discretization functions along with the evolving architecture. Second, we decompose the super-net optimization and discretization into multiple stages, each operating on a sub-net, and introduce progressive re-evaluation to enable re-pruning and regrowth of previous units, thereby mitigating potential bias. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MG-DARTS outperforms other state-of-the-art methods in achieving a better trade-off between model accuracy and parameter efficiency. Codes are available at https://github.com/lxy12357/MG_DARTS.

LGSep 20, 2024
Inductive Spatial Temporal Prediction Under Data Drift with Informative Graph Neural Network

Jialun Zheng, Divya Saxena, Jiannong Cao et al.

Inductive spatial temporal prediction can generalize historical data to predict unseen data, crucial for highly dynamic scenarios (e.g., traffic systems, stock markets). However, external events (e.g., urban structural growth, market crash) and emerging new entities (e.g., locations, stocks) can undermine prediction accuracy by inducing data drift over time. Most existing studies extract invariant patterns to counter data drift but ignore pattern diversity, exhibiting poor generalization to unseen entities. To address this issue, we design an Informative Graph Neural Network (INF-GNN) to distill diversified invariant patterns and improve prediction accuracy under data drift. Firstly, we build an informative subgraph with a uniquely designed metric, Relation Importance (RI), that can effectively select stable entities and distinct spatial relationships. This subgraph further generalizes new entities' data via neighbors merging. Secondly, we propose an informative temporal memory buffer to help the model emphasize valuable timestamps extracted using influence functions within time intervals. This memory buffer allows INF-GNN to discern influential temporal patterns. Finally, RI loss optimization is designed for pattern consolidation. Extensive experiments on real-world dataset under substantial data drift demonstrate that INF-GNN significantly outperforms existing alternatives.

LGMar 12
GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models

Xiaoyun Liu, Divya Saxena, Jiannong Cao et al.

Structured pruning is widely used to compress large language models (LLMs), yet its effectiveness depends heavily on neuron importance estimation. Most existing methods estimate neuron importance from activation statistics on a single calibration dataset, which introduces calibration bias and degrades downstream cross-task generalization. We observe that neurons exhibit heterogeneous distribution sensitivity, with distribution-robust neurons maintaining consistent rankings across datasets and distribution-sensitive neurons showing high cross-dataset ranking variance. Based on this, we identify two structural limitations in existing methods. First, ranking all neurons within a shared space causes distribution-sensitive neurons that strongly activate on calibration inputs to dominate, crowding out distribution-robust neurons critical for out-of-distribution tasks. Second, applying activation-based importance metrics uniformly can be unreliable. Distribution-sensitive neurons that infrequently activate on calibration data receive insufficient activation signal for accurate local ranking. To address these limitations, we propose GPrune-LLM, a generalization-aware structured pruning framework that explicitly accounts for neuron differences in cross-distribution behavior. We first partition neurons into behavior-consistent modules to localize ranking competition, then evaluate activation-based metric reliability per module according to distribution sensitivity and score magnitude. For modules where activation-based scoring is unreliable, we switch to an activation-independent metric. Finally, we adaptively learn module-wise sparsity. Extensive experiments across multiple downstream tasks demonstrate GPrune-LLM's consistent improvements in post-compression generalization, particularly at high sparsity, and reduced dependence on importance metric choice.

CVFeb 10, 2025
Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

Bojia Zi, Penghui Ruan, Marco Chen et al.

Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still suffer from several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. On the other hand, end-to-end methods, which rely on edited video pairs for training, offer faster inference speeds but often produce poor editing results due to a lack of high-quality training video pairs. In this paper, to close the gap in end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset. Señorita-2M consists of approximately 2 millions of video editing pairs. It is built by crafting four high-quality, specialized video editing models, each crafted and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure based on current pre-trained generative model. Extensive experiments show that our dataset can help to yield remarkably high-quality video editing results. More details are available at https://senorita-2m-dataset.github.io.

CVFeb 11
Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation

Penghui Ruan, Bojia Zi, Xianbiao Qi et al.

Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.

CVNov 28, 2025
Geometry-Consistent 4D Gaussian Splatting for Sparse-Input Dynamic View Synthesis

Yiwei Li, Jiannong Cao, Penghui Ruan et al.

Gaussian Splatting has been considered as a novel way for view synthesis of dynamic scenes, which shows great potential in AIoT applications such as digital twins. However, recent dynamic Gaussian Splatting methods significantly degrade when only sparse input views are available, limiting their applicability in practice. The issue arises from the incoherent learning of 4D geometry as input views decrease. This paper presents GC-4DGS, a novel framework that infuses geometric consistency into 4D Gaussian Splatting (4DGS), offering real-time and high-quality dynamic scene rendering from sparse input views. While learning-based Multi-View Stereo (MVS) and monocular depth estimators (MDEs) provide geometry priors, directly integrating these with 4DGS yields suboptimal results due to the ill-posed nature of sparse-input 4D geometric optimization. To address these problems, we introduce a dynamic consistency checking strategy to reduce estimation uncertainties of MVS across spacetime. Furthermore, we propose a global-local depth regularization approach to distill spatiotemporal-consistent geometric information from monocular depths, thereby enhancing the coherent geometry and appearance learning within the 4D volume. Extensive experiments on the popular N3DV and Technicolor datasets validate the effectiveness of GC-4DGS in rendering quality without sacrificing efficiency. Notably, our method outperforms RF-DeRF, the latest dynamic radiance field tailored for sparse-input dynamic view synthesis, and the original 4DGS by 2.62dB and 1.58dB in PSNR, respectively, with seamless deployability on resource-constrained IoT edge devices.

CVOct 31, 2024
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Penghui Ruan, Pichao Wang, Divya Saxena et al.

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from the internal biases in text encoding, which overlooks motions, and inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/