Boyuan Liu

CV
h-index4
7papers
35citations
Novelty40%
AI Score36

7 Papers

CVJul 27, 2022
Multi-Forgery Detection Challenge 2022: Push the Frontier of Unconstrained and Diverse Forgery Detection

Jianshu Li, Man Luo, Jian Liu et al. · deepmind

In this paper, we present the Multi-Forgery Detection Challenge held concurrently with the IEEE Computer Society Workshop on Biometrics at CVPR 2022. Our Multi-Forgery Detection Challenge aims to detect automatic image manipulations including but not limited to image editing, image synthesis, image generation, image photoshop, etc. Our challenge has attracted 674 teams from all over the world, with about 2000 valid result submission counts. We invited the Top 10 teams to present their solutions to the challenge, from which three teams are awarded prizes in the grand finale. In this paper, we present the solutions from the Top 3 teams, in order to boost the research work in the field of image forgery detection.

CVJan 28, 2025Code
CascadeV: An Implementation of Wurstchen Architecture for Video Generation

Wenfeng Lin, Jiangchuan Wei, Boyuan Liu et al.

Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM), that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.

BMNov 13, 2022
Drug-target affinity prediction method based on consistent expression of heterogeneous data

Boyuan Liu

The first step in drug discovery is finding drug molecule moieties with medicinal activity against specific targets. Therefore, it is crucial to investigate the interaction between drug-target proteins and small chemical molecules. However, traditional experimental methods for discovering potential small drug molecules are labor-intensive and time-consuming. There is currently a lot of interest in building computational models to screen small drug molecules using drug molecule-related databases. In this paper, we propose a method for predicting drug-target binding affinity using deep learning models. This method uses a modified GRU and GNN to extract features from the drug-target protein sequences and the drug molecule map, respectively, to obtain their feature vectors. The combined vectors are used as vector representations of drug-target molecule pairs and then fed into a fully connected network to predict drug-target binding affinity. This proposed model demonstrates its accuracy and effectiveness in predicting drug-target binding affinity on the DAVIS and KIBA datasets.

CVJan 23, 2025
EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion

Jiangchuan Wei, Shiyue Yan, Wenfeng Lin et al.

Recent advancements in video generation have significantly impacted various downstream applications, particularly in identity-preserving video generation (IPT2V). However, existing methods struggle with "copy-paste" artifacts and low similarity issues, primarily due to their reliance on low-level facial image information. This dependence can result in rigid facial appearances and artifacts reflecting irrelevant details. To address these challenges, we propose EchoVideo, which employs two key strategies: (1) an Identity Image-Text Fusion Module (IITF) that integrates high-level semantic features from text, capturing clean facial identity representations while discarding occlusions, poses, and lighting variations to avoid the introduction of artifacts; (2) a two-stage training strategy, incorporating a stochastic method in the second phase to randomly utilize shallow facial information. The objective is to balance the enhancements in fidelity provided by shallow features while mitigating excessive reliance on them. This strategy encourages the model to utilize high-level features during training, ultimately fostering a more robust representation of facial identities. EchoVideo effectively preserves facial identities and maintains full-body integrity. Extensive experiments demonstrate that it achieves excellent results in generating high-quality, controllability and fidelity videos.

CLAug 24, 2025
Omne-R1: Learning to Reason with Memory for Multi-hop Question Answering

Boyuan Liu, Feng Ji, Jiayan Nan et al.

This paper introduces Omne-R1, a novel approach designed to enhance multi-hop question answering capabilities on schema-free knowledge graphs by integrating advanced reasoning models. Our method employs a multi-stage training workflow, including two reinforcement learning phases and one supervised fine-tuning phase. We address the challenge of limited suitable knowledge graphs and QA data by constructing domain-independent knowledge graphs and auto-generating QA pairs. Experimental results show significant improvements in answering multi-hop questions, with notable performance gains on more complex 3+ hop questions. Our proposed training framework demonstrates strong generalization abilities across diverse knowledge domains.

CVJun 5, 2025
ContentV: Efficient Training of Video Generation Models with Limited Compute

Wenfeng Lin, Renjie Chen, Boyuan Liu et al.

Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.

CVJun 2, 2021
DFGC 2021: A DeepFake Game Competition

Bo Peng, Hongxing Fan, Wei Wang et al.

This paper presents a summary of the DFGC 2021 competition. DeepFake technology is developing fast, and realistic face-swaps are increasingly deceiving and hard to detect. At the same time, DeepFake detection methods are also improving. There is a two-party game between DeepFake creators and detectors. This competition provides a common platform for benchmarking the adversarial game between current state-of-the-art DeepFake creation and detection methods. In this paper, we present the organization, results and top solutions of this competition and also share our insights obtained during this event. We also release the DFGC-21 testing dataset collected from our participants to further benefit the research community.