Yahui Liu

CV
h-index29
50papers
3,480citations
Novelty51%
AI Score62

50 Papers

CVMar 8, 2023Code
Point Cloud Classification Using Content-based Transformer via Clustering in Feature Space

Yahui Liu, Bin Tian, Yisheng Lv et al.

Recently, there have been some attempts of Transformer in 3D point cloud classification. In order to reduce computations, most existing methods focus on local spatial attention, but ignore their content and fail to establish relationships between distant but relevant points. To overcome the limitation of local spatial attention, we propose a point content-based Transformer architecture, called PointConT for short. It exploits the locality of points in the feature space (content-based), which clusters the sampled points with similar features into the same class and computes the self-attention within each class, thus enabling an effective trade-off between capturing long-range dependencies and computational complexity. We further introduce an Inception feature aggregator for point cloud classification, which uses parallel structures to aggregate high-frequency and low-frequency information in each branch separately. Extensive experiments show that our PointConT model achieves a remarkable performance on point cloud shape classification. Especially, our method exhibits 90.3% Top-1 accuracy on the hardest setting of ScanObjectNN. Source code of this paper is available at https://github.com/yahuiliu99/PointConT.

CVJul 25, 2022Code
Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer

Yingyi Chen, Xi Shen, Yahui Liu et al.

The success of Vision Transformer (ViT) in various computer vision tasks has promoted the ever-increasing prevalence of this convolution-free network. The fact that ViT works on image patches makes it potentially relevant to the problem of jigsaw puzzle solving, which is a classical self-supervised task aiming at reordering shuffled sequential image patches back to their natural form. Despite its simplicity, solving jigsaw puzzle has been demonstrated to be helpful for diverse tasks using Convolutional Neural Networks (CNNs), such as self-supervised feature representation learning, domain generalization, and fine-grained classification. In this paper, we explore solving jigsaw puzzle as a self-supervised auxiliary loss in ViT for image classification, named Jigsaw-ViT. We show two modifications that can make Jigsaw-ViT superior to standard ViT: discarding positional embeddings and masking patches randomly. Yet simple, we find that Jigsaw-ViT is able to improve both in generalization and robustness over the standard ViT, which is usually rather a trade-off. Experimentally, we show that adding the jigsaw puzzle branch provides better generalization than ViT on large-scale image classification on ImageNet. Moreover, the auxiliary task also improves robustness to noisy labels on Animal-10N, Food-101N, and Clothing1M as well as adversarial examples. Our implementation is available at https://yingyichen-cyy.github.io/Jigsaw-ViT/.

CVMay 25, 2022Code
Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

Bin Ren, Yahui Liu, Yue Song et al.

Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed. This caveat naturally raises a series of interesting questions about the impact of PEs on the accuracy, privacy, prediction consistency, etc. To tackle these issues, we propose a Masked Jigsaw Puzzle (MJP) position embedding method. In particular, MJP first shuffles the selected patches via our block-wise random jigsaw puzzle shuffle algorithm, and their corresponding PEs are occluded. Meanwhile, for the non-occluded patches, the PEs remain the original ones but their spatial relation is strengthened via our dense absolute localization regressor. The experimental results reveal that 1) PEs explicitly encode the 2D spatial relationship and lead to severe privacy leakage problems under gradient inversion attack; 2) Training ViTs with the naively shuffled patches can alleviate the problem, but it harms the accuracy; 3) Under a certain shuffle ratio, the proposed MJP not only boosts the performance and robustness on large-scale datasets (i.e., ImageNet-1K and ImageNet-C, -A/O) but also improves the privacy preservation ability under typical gradient attacks by a large margin. The source code and trained models are available at~\url{https://github.com/yhlleo/MJP}.

CVMay 5, 2022
Language Models Can See: Plugging Visual Controls in Text Generation

Yixuan Su, Tian Lan, Yahui Liu et al. · cambridge, deepmind

Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be semantically related to a given image while being coherent to the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation, therefore being computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.

CVAug 26, 2022Code
Training and Tuning Generative Neural Radiance Fields for Attribute-Conditional 3D-Aware Face Generation

Jichao Zhang, Aliaksandr Siarohin, Yahui Liu et al.

Generative Neural Radiance Fields (GNeRF)-based 3D-aware GANs have showcased remarkable prowess in crafting high-fidelity images while upholding robust 3D consistency, particularly face generation. However, specific existing models prioritize view consistency over disentanglement, leading to constrained semantic or attribute control during the generation process. While many methods have explored incorporating semantic masks or leveraging 3D Morphable Models (3DMM) priors to imbue models with semantic control, these methods often demand training from scratch, entailing significant computational overhead. In this paper, we propose a novel approach: a conditional GNeRF model that integrates specific attribute labels as input, thus amplifying the controllability and disentanglement capabilities of 3D-aware generative models. Our approach builds upon a pre-trained 3D-aware face model, and we introduce a Training as Init and Optimizing for Tuning (TRIOT) method to train a conditional normalized flow module to enable the facial attribute editing, then optimize the latent vector to improve attribute-editing precision further. Our extensive experiments substantiate the efficacy of our model, showcasing its ability to generate high-quality edits with enhanced view consistency while safeguarding non-target regions. The code for our model is publicly available at https://github.com/zhangqianhui/TT-GNeRF.

CVAug 3, 2022Code
YOLO-FaceV2: A Scale and Occlusion Aware Face Detector

Ziping Yu, Hongbo Huang, Weijun Chen et al.

In recent years, face detection algorithms based on deep learning have made great progress. These algorithms can be generally divided into two categories, i.e. two-stage detector like Faster R-CNN and one-stage detector like YOLO. Because of the better balance between accuracy and speed, one-stage detectors have been widely used in many applications. In this paper, we propose a real-time face detector based on the one-stage detector YOLOv5, named YOLO-FaceV2. We design a Receptive Field Enhancement module called RFE to enhance receptive field of small face, and use NWD Loss to make up for the sensitivity of IoU to the location deviation of tiny objects. For face occlusion, we present an attention module named SEAM and introduce Repulsion Loss to solve it. Moreover, we use a weight function Slide to solve the imbalance between easy and hard samples and use the information of the effective receptive field to design the anchor. The experimental results on WiderFace dataset show that our face detector outperforms YOLO and its variants can be find in all easy, medium and hard subsets. Source code in https://github.com/Krasjet-Yu/YOLO-FaceV2

CVJun 9, 2022Code
Spatial Entropy as an Inductive Bias for Vision Transformers

Elia Peruzzo, Enver Sangineto, Yahui Liu et al.

Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reducing the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the VT final accuracy when using small-medium training sets. The code is available at https://github.com/helia95/SAR.

CLMay 13, 2022
MuCPAD: A Multi-Domain Chinese Predicate-Argument Dataset

Yahui Liu, Haoping Yang, Chen Gong et al.

During the past decade, neural network models have made tremendous progress on in-domain semantic role labeling (SRL). However, performance drops dramatically under the out-of-domain setting. In order to facilitate research on cross-domain SRL, this paper presents MuCPAD, a multi-domain Chinese predicate-argument dataset, which consists of 30,897 sentences and 92,051 predicates from six different domains. MuCPAD exhibits three important features. 1) Based on a frame-free annotation methodology, we avoid writing complex frames for new predicates. 2) We explicitly annotate omitted core arguments to recover more complete semantic structure, considering that omission of content words is ubiquitous in multi-domain Chinese texts. 3) We compile 53 pages of annotation guidelines and adopt strict double annotation for improving data quality. This paper describes in detail the annotation methodology and annotation process of MuCPAD, and presents in-depth data analysis. We also give benchmark results on cross-domain SRL based on MuCPAD.

AINov 8, 2025Code
Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling

Qi Wang, Hongzhi Zhang, Jia Fu et al.

Despite the proliferation of powerful agentic models, the lack of critical post-training details hinders the development of strong counterparts in the open-source community. In this study, we present a comprehensive and fully open-source pipeline for training a high-performance agentic model for interacting with external tools and environments, named Klear-Qwen3-AgentForge, starting from the Qwen3-8B base model. We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks. We perform exclusive experiments on various agentic benchmarks in both tool use and coding domains. Klear-Qwen3-AgentForge-8B achieves state-of-the-art performance among LLMs of similar size and remains competitive with significantly larger models.

CVOct 3, 2022
Smooth image-to-image translations with latent space interpolations

Yahui Liu, Enver Sangineto, Yajing Chen et al.

Multi-domain image-to-image (I2I) translations can transform a source image according to the style of a target domain. One important, desired characteristic of these transformations, is their graduality, which corresponds to a smooth change between the source and the target image when their respective latent-space representations are linearly interpolated. However, state-of-the-art methods usually perform poorly when evaluated using inter-domain interpolations, often producing abrupt changes in the appearance or non-realistic intermediate images. In this paper, we argue that one of the main reasons behind this problem is the lack of sufficient inter-domain training data and we propose two different regularization methods to alleviate this issue: a new shrinkage loss, which compacts the latent space, and a Mixup data-augmentation strategy, which flattens the style representations between domains. We also propose a new metric to quantitatively evaluate the degree of the interpolation smoothness, an aspect which is not sufficiently covered by the existing I2I translation metrics. Using both our proposed metric and standard evaluation protocols, we show that our regularization techniques can improve the state-of-the-art multi-domain I2I translations by a large margin. Our code will be made publicly available upon the acceptance of this article.

LGDec 1, 2025Code
Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe

Yahui Liu, Yang Yue, Jingyuan Zhang et al.

Recent efforts on Diffusion Mixture-of-Experts (MoE) models have primarily focused on developing more sophisticated routing mechanisms. However, we observe that the underlying architectural configuration space remains markedly under-explored. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models--including DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. Our systematic study reveals that carefully tuning these configurations is essential for unlocking the full potential of Diffusion MoE models, often yielding gains that exceed those achieved by routing innovations alone. Through extensive experiments, we present novel architectures that can be efficiently applied to both latent and pixel-space diffusion frameworks, which provide a practical and efficient training recipe that enables Diffusion MoE models to surpass strong baselines while using equal or fewer activated parameters. All code and models are publicly available at: https://github.com/yhlleo/EfficientMoE.

CVFeb 24
Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

Junhao Xiao, Zhiyu Wu, Hao Lin et al.

Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.

CVFeb 10
Kelix Technique Report

Boyang Ding, Chenglong Chu, Dunju Zang et al.

Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.

AIJul 11, 2025Code
Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning

Xingguang Ji, Yahui Liu, Qi Wang et al.

We introduce our Leanabell-Prover-V2, a 7B large language models (LLMs) that can produce formal theorem proofs in Lean 4, with verifier-integrated Long Chain-of-Thoughts (CoT). Following our previous work Leanabell-Prover-V1, we continual to choose to posttrain existing strong prover models for further performance improvement. In our V2 version, we mainly upgrade the Reinforcement Learning (RL) with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become ``self-aware'' of the correctness of its own reasoning process and learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source codes, curated data and models are available at: https://github.com/Leanabell-LM/Leanabell-Prover-V2.

CVAug 9, 2025Code
AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

Shihao Yuan, Yahui Liu, Yang Yue et al.

Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models' outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: https://github.com/Kwai-Klear/AR-GRPO.

94.4CVMar 30
VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

Li-Heng Chen, Ke Cheng, Yahui Liu et al.

Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into the long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Besides, within the closed-loop generation, we introduce an object-level refinement module to refine the unsatisfied results evaluated from the MV-VLM and then feed them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.

CVMay 23, 2025Code
DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning

Bin Wu, Wei Wang, Yahui Liu et al.

Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial details and exhibit poor identity consistency. The core of our framework is the Face Reward Model (FRM), which is trained using carefully annotated data. It provides feedback signals that play a pivotal role in steering the optimization process of the restoration network. In particular, our ReFL framework incorporates a gradient flow into the denoising process of off-the-shelf face restoration methods to guide the update of model parameters. The guiding gradient is collaboratively determined by three aspects: (i) the FRM to ensure the perceptual quality of the restored faces; (ii) a regularization term that functions as a safeguard to preserve generative diversity; and (iii) a structural consistency constraint to maintain facial fidelity. Furthermore, the FRM undergoes dynamic optimization throughout the process. It not only ensures that the restoration network stays precisely aligned with the real face manifold, but also effectively prevents reward hacking. Experiments on synthetic and wild datasets demonstrate that our method outperforms state-of-the-art methods, significantly improving identity consistency and facial details. The source codes, data, and models are available at: https://github.com/01NeuralNinja/DiffusionReward.

CLApr 10, 2025Code
Data Metabolism: An Efficient Data Design Schema For Vision Language Model

Jingyuan Zhang, Hongzhi Zhang, Zhou Haonan et al.

Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework to build VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We show a detailed codebook on how to process existing massive datasets and build user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g. , visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger in size. Moreover, it achieves results that are on par with those of several leading proprietary models, demonstrating its remarkable competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.

SEAug 7, 2025Code
Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning

Jia Fu, Xinyu Yang, Hongzhi Zhang et al.

Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases. Our approach achieves broad coverage of programming problems via a novel Generator-Validation (G-V) framework, ensuring correctness through a consistency validation mechanism that verifies outputs against gold solutions. The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment in code reinforcement learning. In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution. Through comprehensive experiments, we demonstrate the effectiveness of our curated dataset, showing significant improvements in model performance and training stability. The source codes, curated dataset and sandbox system are available at: https://github.com/Kwai-Klear/CodeTest.

CVJun 11, 2025Code
Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search

Linhao Yu, Xinguang Ji, Yahui Liu et al.

Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (\textit{i.e.}, key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects' attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.

CLApr 10, 2025Code
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models

Xingguang Ji, Jiakang Wang, Hongzhi Zhang et al.

With the development of Multimodal Large Language Models (MLLMs), numerous outstanding accomplishments have emerged within the open-source community. Due to the complexity of creating and training multimodal data pairs, it is still a computational and time-consuming process to build powerful MLLMs. In this work, we introduce Capybara-OMNI, an MLLM that trains in a lightweight and efficient manner and supports understanding text, image, video, and audio modalities. We present in detail the framework design, the data construction, and the training recipe, to develop an MLLM step-by-step to obtain competitive performance. We also provide exclusive benchmarks utilized in our experiments to show how to properly verify understanding capabilities across different modalities. Results show that by following our guidance, we can efficiently build an MLLM that achieves competitive performance among models of the same scale on various multimodal benchmarks. Additionally, to enhance the multimodal instruction following and conversational capabilities of the model, we further discuss how to train the chat version upon an MLLM understanding model, which is more in line with user habits for tasks like real-time interaction with humans. We publicly disclose the Capybara-OMNI model, along with its chat-based version. The disclosure includes both the model weights, a portion of the training data, and the inference codes, which are made available on GitHub.

CLJan 14
DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing

Qian Cao, Yahui Liu, Wei Bi et al.

Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.

CLOct 8, 2025Code
Accelerating Diffusion LLM Inference via Local Determinism Propagation

Fanheng Kong, Jingyuan Zhang, Yahui Liu et al.

Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations--a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94$\times$ throughput improvements and reduces decoding steps to just 14.2\% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: https://github.com/friedrichor/LocalLeap.

CVAug 26, 2025Code
All-in-One Slider for Attribute Manipulation in Diffusion Models

Weixin Ye, Hongguang Zhu, Wei Wang et al.

Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.

LGJun 16, 2025Code
SeqPE: Transformer with Sequential Position Encoding

Huayang Li, Yahui Liu, Hongyu Sun et al.

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE, mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy--particularly under context length extrapolation--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.

CVSep 26, 2021Code
ISF-GAN: An Implicit Style Function for High-Resolution Image-to-Image Translation

Yahui Liu, Yajing Chen, Linchao Bao et al.

Recently, there has been an increasing interest in image editing methods that employ pre-trained unconditional image generators (e.g., StyleGAN). However, applying these methods to translate images to multiple visual domains remains challenging. Existing works do not often preserve the domain-invariant part of the image (e.g., the identity in human face translations), they do not usually handle multiple domains, or do not allow for multi-modal translations. This work proposes an implicit style function (ISF) to straightforwardly achieve multi-modal and multi-domain image-to-image translation from pre-trained unconditional generators. The ISF manipulates the semantics of an input latent code to make the image generated from it lying in the desired visual domain. Our results in human face and animal manipulations show significantly improved results over the baselines. Our model enables cost-effective multi-modal unsupervised image-to-image translations at high resolution using pre-trained unconditional GANs. The code and data are available at: \url{https://github.com/yhlleo/stylegan-mmuit}.

CVJun 7, 2021Code
Efficient Training of Visual Transformers with Small Datasets

Yahui Liu, Enver Sangineto, Wei Bi et al.

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design, in VTs should be learned from samples. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose a self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data are scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged in the existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. Our code is available at: https://github.com/yhlleo/VTs-Drloc.

CVFeb 22, 2021Code
Adversarial Shape Learning for Building Extraction in VHR Remote Sensing Images

Lei Ding, Hao Tang, Yahui Liu et al.

Building extraction in VHR RSIs remains a challenging task due to occlusion and boundary ambiguity problems. Although conventional convolutional neural networks (CNNs) based methods are capable of exploiting local texture and context information, they fail to capture the shape patterns of buildings, which is a necessary constraint in the human recognition. To address this issue, we propose an adversarial shape learning network (ASLNet) to model the building shape patterns that improve the accuracy of building segmentation. In the proposed ASLNet, we introduce the adversarial learning strategy to explicitly model the shape constraints, as well as a CNN shape regularizer to strengthen the embedding of shape features. To assess the geometric accuracy of building segmentation results, we introduced several object-based quality assessment metrics. Experiments on two open benchmark datasets show that the proposed ASLNet improves both the pixel-based accuracy and the object-based quality measurements by a large margin. The code is available at: https://github.com/ggsDing/ASLNet

CVJan 8, 2019Code
Fast 3D Line Segment Detection From Unorganized Point Cloud

Xiaohu Lu, Yahui Liu, Kai Li

This paper presents a very simple but efficient algorithm for 3D line segment detection from large scale unorganized point cloud. Unlike traditional methods which usually extract 3D edge points first and then link them to fit for 3D line segments, we propose a very simple 3D line segment detection algorithm based on point cloud segmentation and 2D line detection. Given the input unorganized point cloud, three steps are performed to detect 3D line segments. Firstly, the point cloud is segmented into 3D planes via region growing and region merging. Secondly, for each 3D plane, all the points belonging to it are projected onto the plane itself to form a 2D image, which is followed by 2D contour extraction and Least Square Fitting to get the 2D line segments. Those 2D line segments are then re-projected onto the 3D plane to get the corresponding 3D line segments. Finally, a post-processing procedure is proposed to eliminate outliers and merge adjacent 3D line segments. Experiments on several public datasets demonstrate the efficiency and robustness of our method. More results and the C++ source code of the proposed algorithm are publicly available at https://github.com/xiaohulugo/3DLineDetection.

AIApr 8, 2025
Leanabell-Prover: Posttraining Scaling in Formal Reasoning

Jingyuan Zhang, Qi Wang, Xingguang Ji et al.

Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.

AIMay 28, 2025
What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning

Gangwei Jiang, Yahui Liu, Zhaoyi Li et al.

Recent advances in reasoning with large language models (LLMs) have popularized Long Chain-of-Thought (LCoT), a strategy that encourages deliberate and step-by-step reasoning before producing a final answer. While LCoTs have enabled expert-level performance in complex tasks, how the internal structures of their reasoning chains drive, or even predict, the correctness of final answers remains a critical yet underexplored question. In this work, we present LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and thus enables deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we reveal that structural patterns extracted by LCoT2Tree, including exploration, backtracking, and verification, serve as stronger predictors of final performance across a wide range of tasks and models. Leveraging an explainability technique, we further identify critical thought patterns such as over-branching that account for failures. Beyond diagnostic insights, the structural patterns by LCoT2Tree support practical applications, including improving Best-of-N decoding effectiveness. Overall, our results underscore the critical role of internal structures of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.

ROSep 16, 2025
The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

Titong Jiang, Xuefeng Jiang, Yuan Ma et al.

We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.

CVApr 21, 2024
LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions

Xiaoran Zhao, Tianhao Wu, Yu Lai et al.

Controllable text-to-image generation synthesizes visual text and objects in images with certain conditions, which are frequently applied to emoji and poster generation. Visual text rendering and layout-to-image generation tasks have been popular in controllable text-to-image generation. However, each of these tasks typically focuses on single modality generation or rendering, leaving yet-to-be-bridged gaps between the approaches correspondingly designed for each of the tasks. In this paper, we combine text rendering and layout-to-image generation tasks into a single task: layout-controllable text-object synthesis (LTOS) task, aiming at synthesizing images with object and visual text based on predefined object layout and text contents. As compliant datasets are not readily available for our LTOS task, we construct a layout-aware text-object synthesis dataset, containing elaborate well-aligned labels of visual text and object information. Based on the dataset, we propose a layout-controllable text-object adaptive fusion (TOF) framework, which generates images with clear, legible visual text and plausible objects. We construct a visual-text rendering module to synthesize text and employ an object-layout control module to generate objects while integrating the two modules to harmoniously generate and integrate text content and objects in images. To better the image-text integration, we propose a self-adaptive cross-attention fusion module that helps the image generation to attend more to important text information. Within such a fusion module, we use a self-adaptive learnable factor to learn to flexibly control the influence of cross-attention outputs on image generation. Experimental results show that our method outperforms the state-of-the-art in LTOS, text rendering, and layout-to-image tasks, enabling harmonious visual text rendering and object generation.

CVDec 12, 2024
SEGT: A General Spatial Expansion Group Transformer for nuScenes Lidar-based Object Detection Task

Cheng Mei, Hao He, Yahui Liu et al.

In the technical report, we present a novel transformer-based framework for nuScenes lidar-based object detection task, termed Spatial Expansion Group Transformer (SEGT). To efficiently handle the irregular and sparse nature of point cloud, we propose migrating the voxels into distinct specialized ordered fields with the general spatial expansion strategies, and employ group attention mechanisms to extract the exclusive feature maps within each field. Subsequently, we integrate the feature representations across different ordered fields by alternately applying diverse expansion strategies, thereby enhancing the model's ability to capture comprehensive spatial information. The method was evaluated on the nuScenes lidar-based object detection test dataset, achieving an NDS score of 73.9 without Test-Time Augmentation (TTA) and 74.5 with TTA, demonstrating the effectiveness of the proposed method. Notably, our method ranks the 1st place in the nuScenes lidar-based object detection task.

CVSep 27, 2025
LRPO: Enhancing Blind Face Restoration through Online Reinforcement Learning

Bin Wu, Yahui Liu, Chi Zhang et al.

Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs while improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restoration results that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that our proposed LRPO significantly improves the face restoration quality over baseline methods and achieves state-of-the-art performance.

CVMay 24, 2025
Guiding the Experts: Semantic Priors for Efficient and Focused MoE Routing

Chengxi Min, Wei Wang, Yahui Liu et al.

Mixture-of-Experts (MoE) models have emerged as a promising direction for scaling vision architectures efficiently. Among them, Soft MoE improves training stability by assigning each token to all experts via continuous dispatch weights. However, current designs overlook the semantic structure which is implicitly encoded in these weights, resulting in suboptimal expert routing. In this paper, we discover that dispatch weights in Soft MoE inherently exhibit segmentation-like patterns but are not explicitly aligned with semantic regions. Motivated by this observation, we propose a foreground-guided enhancement strategy. Specifically, we introduce a spatially aware auxiliary loss that encourages expert activation to align with semantic foreground regions. To further reinforce this supervision, we integrate a lightweight LayerScale mechanism that improves information flow and stabilizes optimization in skip connections. Our method necessitates only minor architectural adjustments and can be seamlessly integrated into prevailing Soft MoE frameworks. Comprehensive experiments on ImageNet-1K and multiple smaller-scale classification benchmarks not only showcase consistent performance enhancements but also reveal more interpretable expert routing mechanisms.

CVMay 26, 2025
Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

Fanheng Kong, Jingyuan Zhang, Yahui Liu et al.

Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins. Through extensive experiments, we demonstrate that strategic modality curation and tailored training protocols are pivotal for robust cross-modal representation learning. This work not only advances MIR performance but also provides a foundational blueprint for future research in multimodal systems. Our project is available at https://friedrichor.github.io/projects/UNITE.

CLMay 25, 2025
Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Qian Cao, Xiting Wang, Yuzhuo Yuan et al.

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.

CLMar 29, 2022
Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics

Jiannan Xiang, Huayang Li, Yahui Liu et al.

Current practices in metric evaluation focus on one single dataset, e.g., Newstest dataset in each year's WMT Metrics Shared Task. However, in this paper, we qualitatively and quantitatively show that the performances of metrics are sensitive to data. The ranking of metrics varies when the evaluation is conducted on different datasets. Then this paper further investigates two potential hypotheses, i.e., insignificant data points and the deviation of Independent and Identically Distributed (i.i.d) assumption, which may take responsibility for the issue of data variance. In conclusion, our findings suggest that when evaluating automatic translation metrics, researchers should take data variance into account and be cautious to claim the result on a single dataset, because it may leads to inconsistent results with most of other datasets.

CVJun 16, 2021
Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation

Yahui Liu, Enver Sangineto, Yajing Chen et al.

Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation. Moreover, we propose a novel evaluation metric to properly measure the smoothness of latent style space of I2I translation models. The proposed method can be plugged into existing translation approaches, and our extensive experiments on different datasets show that it can significantly boost the quality of the generated images and the graduality of the interpolations.

CLMay 6, 2021
Assessing Dialogue Systems with Distribution Distances

Jiannan Xiang, Yahui Liu, Deng Cai et al.

An important aspect of developing dialogue systems is how to evaluate and compare the performance of different systems. Existing automatic evaluation metrics are based on turn-level quality evaluation and use average scores for system-level comparison. In this paper, we propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations. Specifically, two distribution-wise metrics, FBD and PRD, are developed and evaluated. Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.

ROMar 30, 2021
The Matter of Time -- A General and Efficient System for Precise Sensor Synchronization in Robotic Computing

Shaoshan Liu, Bo Yu, Yahui Liu et al.

Time synchronization is a critical task in robotic computing such as autonomous driving. In the past few years, as we developed advanced robotic applications, our synchronization system has evolved as well. In this paper, we first introduce the time synchronization problem and explain the challenges of time synchronization, especially in robotic workloads. Summarizing these challenges, we then present a general hardware synchronization system for robotic computing, which delivers high synchronization accuracy while maintaining low energy and resource consumption. The proposed hardware synchronization system is a key building block in our future robotic products.

CVOct 19, 2020
Semantic-Guided Inpainting Network for Complex Urban Scenes Manipulation

Pierfrancesco Ardino, Yahui Liu, Elisa Ricci et al.

Manipulating images of complex scenes to reconstruct, insert and/or remove specific object instances is a challenging task. Complex scenes contain multiple semantics and objects, which are frequently cluttered or ambiguous, thus hampering the performance of inpainting models. Conventional techniques often rely on structural information such as object contours in multi-stage approaches that generate unreliable results and boundaries. In this work, we propose a novel deep learning model to alter a complex urban scene by removing a user-specified portion of the image and coherently inserting a new object (e.g. car or pedestrian) in that scene. Inspired by recent works on image inpainting, our proposed method leverages the semantic segmentation to model the content and structure of the image, and learn the best shape and location of the object to insert. To generate reliable results, we design a new decoder block that combines the semantic segmentation and generation task to guide better the generation of new objects and scenes, which have to be semantically consistent with the image. Our experiments, conducted on two large-scale datasets of urban scenes (Cityscapes and Indian Driving), show that our proposed approach successfully address the problem of semantically-guided inpainting of complex urban scene.

CVAug 11, 2020
Retrieval Guided Unsupervised Multi-domain Image-to-Image Translation

Raul Gomez, Yahui Liu, Marco De Nadai et al.

Image to image translation aims to learn a mapping that transforms an image from one visual domain to another. Recent works assume that images descriptors can be disentangled into a domain-invariant content representation and a domain-specific style representation. Thus, translation models seek to preserve the content of source images while changing the style to a target visual domain. However, synthesizing new images is extremely challenging especially in multi-domain translations, as the network has to compose content and style to generate reliable and diverse images in multiple domains. In this paper we propose the use of an image retrieval system to assist the image-to-image translation task. First, we train an image-to-image translation model to map images to multiple domains. Then, we train an image retrieval model using real and generated images to find images similar to a query one in content but in a different domain. Finally, we exploit the image retrieval system to fine-tune the image-to-image translation model and generate higher quality images. Our experiments show the effectiveness of the proposed solution and highlight the contribution of the retrieval network, which can benefit from additional unlabeled data and help image-to-image translation models in the presence of scarce data.

CVAug 10, 2020
Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach

Yahui Liu, Marco De Nadai, Deng Cai et al.

Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or to use richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". Contrarily to state-of-the-art approaches, our model does not require a human-annotated dataset nor a textual description of all the attributes of the desired image, but only those that have to be modified. Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text might be inherently ambiguous (blond hair may refer to different shadows of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performances on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.

CVApr 6, 2020
Vanishing Point Guided Natural Image Stitching

Kai Chen, Jian Yao, Jingmin Tu et al.

Recently, works on improving the naturalness of stitching images gain more and more extensive attention. Previous methods suffer the failures of severe projective distortion and unnatural rotation, especially when the number of involved images is large or images cover a very wide field of view. In this paper, we propose a novel natural image stitching method, which takes into account the guidance of vanishing points to tackle the mentioned failures. Inspired by a vital observation that mutually orthogonal vanishing points in Manhattan world can provide really useful orientation clues, we design a scheme to effectively estimate prior of image similarity. Given such estimated prior as global similarity constraints, we feed it into a popular mesh deformation framework to achieve impressive natural stitching performances. Compared with other existing methods, including APAP, SPHP, AANAP, and GSP, our method achieves state-of-the-art performance in both quantitative and qualitative experiments on natural image stitching.

CVMar 15, 2020
GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling

Yahui Liu, Marco De Nadai, Jian Yao et al.

Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images. Recent studies have shown remarkable success for multiple domains but they suffer from two main limitations: they are either built from several two-domain mappings that are required to be learned independently, or they generate low-diversity results, a problem known as mode collapse. To overcome these limitations, we propose a method named GMM-UNIT, which is based on a content-attribute disentangled representation where the attribute space is fitted with a GMM. Each GMM component represents a domain, and this simple assumption has two prominent advantages. First, it can be easily extended to most multi-domain and multi-modal image-to-image translation tasks. Second, the continuous domain encoding allows for interpolation between domains and for extrapolation to unseen domains and translations. Additionally, we show how GMM-UNIT can be constrained down to different methods in the literature, meaning that GMM-UNIT is a unifying framework for unsupervised image-to-image translation.

IRNov 18, 2019
Pairwise Interactive Graph Attention Network for Context-Aware Recommendation

Yahui Liu, Furao Shen, Jian Zhao

Context-aware recommender systems (CARS), which consider rich side information to improve recommendation performance, have caught more and more attention in both academia and industry. How to predict user preferences from diverse contextual features is the core of CARS. Several recent models pay attention to user behaviors and use specifically designed structures to extract adaptive user interests from history behaviors. However, few works take item history interactions into consideration, which leads to the insufficiency of item feature representation and item attraction extraction. From these observations, we model the user-item interaction as a dynamic interaction graph (DIG) and proposed a GNN-based model called Pairwise Interactive Graph Attention Network (PIGAT) to capture dynamic user interests and item attractions simultaneously. PIGAT introduces the attention mechanism to consider the importance of each interacted user/item to both the user and the item, which captures user interests, item attractions and their influence on the recommendation context. Moreover, confidence embeddings are applied to interactions to distinguish the confidence of interactions occurring at different times. Then more expressive user/item representations and adaptive interaction features are generated, which benefits the recommendation performance especially when involving long-tail items. We conduct experiments on three real-world datasets to demonstrate the effectiveness of PIGAT.

CVJul 12, 2019
Gesture-to-Gesture Translation in the Wild via Category-Independent Conditional Maps

Yahui Liu, Marco De Nadai, Gloria Zen et al.

Recent works have shown Generative Adversarial Networks (GANs) to be particularly effective in image-to-image translations. However, in tasks such as body pose and hand gesture translation, existing methods usually require precise annotations, e.g. key-points or skeletons, which are time-consuming to draw. In this work, we propose a novel GAN architecture that decouples the required annotations into a category label - that specifies the gesture type - and a simple-to-draw category-independent conditional map - that expresses the location, rotation and size of the hand gesture. Our architecture synthesizes the target gesture while preserving the background context, thus effectively dealing with gesture translation in the wild. To this aim, we use an attention module and a rolling guidance approach, which loops the generated images back into the network and produces higher quality images compared to competing works. Thus, our GAN learns to generate new images from simple annotations without requiring key-points or skeleton labels. Results on two public datasets show that our method outperforms state of the art approaches both quantitatively and qualitatively. To the best of our knowledge, no work so far has addressed the gesture-to-gesture translation in the wild by requiring user-friendly annotations.

CVMay 12, 2017
Learning to Refine Object Contours with a Top-Down Fully Convolutional Encoder-Decoder Network

Yahui Liu, Jian Yao, Li Li et al.

We develop a novel deep contour detection algorithm with a top-down fully convolutional encoder-decoder network. Our proposed method, named TD-CEDN, solves two important issues in this low-level vision problem: (1) learning multi-scale and multi-level features; and (2) applying an effective top-down refined approach in the networks. TD-CEDN performs the pixel-wise prediction by means of leveraging features at all layers of the net. Unlike skip connections and previous encoder-decoder methods, we first learn a coarse feature map after the encoder stage in a feedforward pass, and then refine this feature map in a top-down strategy during the decoder stage utilizing features at successively lower layers. Therefore, the deconvolutional process is conducted stepwise, which is guided by Deeply-Supervision Net providing the integrated direct supervision. The above proposed technologies lead to a more precise and clearer prediction. Our proposed algorithm achieved the state-of-the-art on the BSDS500 dataset (ODS F-score of 0.788), the PASCAL VOC2012 dataset (ODS F-score of 0.588), and and the NYU Depth dataset (ODS F-score of 0.735).