CVMar 9, 2023
RiDDLE: Reversible and Diversified De-identification with Latent EncryptorDongze Li, Wei Wang, Kang Zhao et al.
This work presents RiDDLE, short for Reversible and Diversified De-identification with Latent Encryptor, to protect the identity information of people from being misused. Built upon a pre-learned StyleGAN2 generator, RiDDLE manages to encrypt and decrypt the facial identity within the latent space. The design of RiDDLE has three appealing properties. First, the encryption process is cipher-guided and hence allows diverse anonymization using different passwords. Second, the true identity can only be decrypted with the correct password, otherwise the system will produce another de-identified face to maintain the privacy. Third, both encryption and decryption share an efficient implementation, benefiting from a carefully tailored lightweight encryptor. Comparisons with existing alternatives confirm that our approach accomplishes the de-identification task with better quality, higher diversity, and stronger reversibility. We further demonstrate the effectiveness of RiDDLE in anonymizing videos. Code and models will be made publicly available.
CLJun 18, 2023
UniMC: A Unified Framework for Long-Term Memory Conversation via Relevance Representation LearningKang Zhao, Wei Liu, Jian Luan et al.
Open-domain long-term memory conversation can establish long-term intimacy with humans, and the key is the ability to understand and memorize long-term dialogue history information. Existing works integrate multiple models for modelling through a pipeline, which ignores the coupling between different stages. In this paper, we propose a Unified framework for Long-term Memory Conversations (UniMC), which increases the connection between different stages by learning relevance representation. Specifically, we decompose the main task into three subtasks based on probability graphs: 1) conversation summarization, 2) memory retrieval, 3) memory-augmented generation. Each subtask involves learning a representation for calculating the relevance between the query and memory, which is modelled by inserting a special token at the beginning of the decoder input. The relevance representation learning strengthens the connection across subtasks through parameter sharing and joint training. Extensive experimental results show that the proposed method consistently improves over strong baselines and yields better dialogue consistency and engagingness.
CVNov 7, 2023
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion ModelsShiwei Zhang, Jiayu Wang, Yingya Zhang et al.
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280$\times$720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at \url{https://i2vgen-xl.github.io}.
CVOct 12, 2023Code
DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image EditingYueming Lyu, Kang Zhao, Bo Peng et al.
Text-guided image editing faces significant challenges when considering training and inference flexibility. Much literature collects large amounts of annotated image-text pairs to train text-conditioned generative models from scratch, which is expensive and not efficient. After that, some approaches that leverage pre-trained vision-language models have been proposed to avoid data collection, but they are limited by either per text-prompt optimization or inference-time hyper-parameters tuning. To address these issues, we investigate and identify a specific space, referred to as CLIP DeltaSpace, where the CLIP visual feature difference of two images is semantically aligned with the CLIP textual feature difference of their corresponding text descriptions. Based on DeltaSpace, we propose a novel framework called DeltaEdit, which maps the CLIP visual feature differences to the latent space directions of a generative model during the training phase, and predicts the latent space directions from the CLIP textual feature differences during the inference phase. And this design endows DeltaEdit with two advantages: (1) text-free training; (2) generalization to various text prompts for zero-shot inference. Extensive experiments validate the effectiveness and versatility of DeltaEdit with different generative models, including both the GAN model and the diffusion model, in achieving flexible text-guided image editing. Code is available at https://github.com/Yueming6568/DeltaEdit.
CVMar 26, 2025Code
Wan: Open and Advanced Large-Scale Video Generative ModelsTeam Wan, Ang Wang, Baole Ai et al.
This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.
CLMar 5, 2022
Consistent Representation Learning for Continual Relation ExtractionKang Zhao, Hua Xu, Jiangong Yang et al.
Continual relation extraction (CRE) aims to continuously train a model on data with new relations while avoiding forgetting old ones. Some previous work has proved that storing a few typical samples of old relations and replaying them when learning new relations can effectively avoid forgetting. However, these memory-based methods tend to overfit the memory samples and perform poorly on imbalanced datasets. To solve these challenges, a consistent representation learning method is proposed, which maintains the stability of the relation embedding by adopting contrastive learning and knowledge distillation when replaying memory. Specifically, supervised contrastive learning based on a memory bank is first used to train each new task so that the model can effectively learn the relation representation. Then, contrastive replay is conducted of the samples in memory and makes the model retain the knowledge of historical relations through memory knowledge distillation to prevent the catastrophic forgetting of the old task. The proposed method can better learn consistent representations to alleviate forgetting effectively. Extensive experiments on FewRel and TACRED datasets show that our method significantly outperforms state-of-the-art baselines and yield strong robustness on the imbalanced dataset.
CVApr 21
Wan-Image: Pushing the Boundaries of Generative Visual IntelligenceChaojie Mao, Chen-Wei Xie, Chongyang Zhong et al.
We present Wan-Image, a unified visual generation system explicitly engineered to paradigm-shift image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.
CVAug 18, 2024
S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head SynthesisDongze Li, Kang Zhao, Wei Wang et al.
Talking head synthesis is a practical technique with wide applications. Current Neural Radiance Field (NeRF) based approaches have shown their superiority on driving one-shot talking heads with videos or signals regressed from audio. However, most of them failed to take the audio as driven information directly, unable to enjoy the flexibility and availability of speech. Since mapping audio signals to face deformation is non-trivial, we design a Single-Shot Speech-Driven Neural Radiance Field (S^3D-NeRF) method in this paper to tackle the following three difficulties: learning a representative appearance feature for each identity, modeling motion of different face regions with audio, and keeping the temporal consistency of the lip area. To this end, we introduce a Hierarchical Facial Appearance Encoder to learn multi-scale representations for catching the appearance of different speakers, and elaborate a Cross-modal Facial Deformation Field to perform speech animation according to the relationship between the audio signal and different face regions. Moreover, to enhance the temporal consistency of the important lip area, we introduce a lip-sync discriminator to penalize the out-of-sync audio-visual sequences. Extensive experiments have shown that our S^3D-NeRF surpasses previous arts on both video fidelity and audio-lip synchronization.
LGAug 26, 2024
1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bitChang Gao, Jianfei Chen, Kang Zhao et al. · tencent-ai
Fully quantized training (FQT) accelerates the training of deep neural networks by quantizing the activations, weights, and gradients into lower precision. To explore the ultimate limit of FQT (the lowest achievable precision), we make a first attempt to 1-bit FQT. We provide a theoretical analysis of FQT based on Adam and SGD, revealing that the gradient variance influences the convergence of FQT. Building on these theoretical results, we introduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages the heterogeneity of gradients by pruning less informative gradients and enhancing the numerical precision of remaining gradients to mitigate gradient variance. Additionally, we propose Sample Channel joint Quantization (SCQ), which utilizes different quantization strategies in the computation of weight gradients and activation gradients to ensure that the method is friendly to low-bitwidth hardware. Finally, we present a framework to deploy our algorithm. For fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm achieves an average accuracy improvement of approximately 6%, compared to per-sample quantization. Moreover, our training speedup can reach a maximum of 5.13x compared to full precision training.
CVJun 27, 2023
Freestyle 3D-Aware Portrait Synthesis Based on Compositional Generative PriorsTianxiang Ma, Kang Zhao, Jianxin Sun et al.
Efficiently generating a freestyle 3D portrait with high quality and 3D-consistency is a promising yet challenging task. The portrait styles generated by most existing methods are usually restricted by their 3D generators, which are learned in specific facial datasets, such as FFHQ. To get the diverse 3D portraits, one can build a large-scale multi-style database to retrain a 3D-aware generator, or use a off-the-shelf tool to do the style translation. However, the former is time-consuming due to data collection and training process, the latter may destroy the multi-view consistency. To tackle this problem, we propose a novel text-driven 3D-aware portrait synthesis framework that can generate out-of-distribution portrait styles. Specifically, for a given portrait style prompt, we first composite two generative priors, a 3D-aware GAN generator and a text-guided image editor, to quickly construct a few-shot stylized portrait set. Then we map the special style domain of this set to our proposed 3D latent feature generator and obtain a 3D representation containing the given style information. Finally we use a pre-trained 3D renderer to generate view-consistent stylized portraits from the 3D representation. Extensive experimental results show that our method is capable of synthesizing high-quality 3D portraits with specified styles in a few minutes, outperforming the state-of-the-art.
CVSep 30, 2024
FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video EditingLingling Cai, Kang Zhao, Hangjie Yuan et al.
Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these fundamental models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose the metric Mask Matching Cost (MMC) that quantifies this variability and propose FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism within comprehensive attention features, e.g., temp, cross, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning but enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
CVDec 10, 2025
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective LearningXinyu Liu, Hangjie Yuan, Yujie Wei et al.
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To address this gap, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose the ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. The differential feedback that refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.
CLOct 12, 2024Code
FlatQuant: Flatness Matters for LLM QuantizationYuxuan Sun, Ruikang Liu, Haoli Bai et al.
Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead of affine transformation, we apply Kronecker product with two lightweight matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1\% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5\%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant.
NAMay 24, 2018
Updating quadratic palindromic models with no spillover effect on unmeasured spectral dataKang Zhao, Lizhi Cheng, Shengguo Li et al.
This paper concerns the model updating problems with no spillover of the quadratic palindromic system $P(λ)=λ^2 A^{\star}+λQ+A$
CVMar 1
PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video GenerationJiangshan Wang, Kang Zhao, Jiayi Guo et al.
High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose \textbf{PreciseCache}, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, such as achieving an average of $2.6\times$ speedup on Wan2.1-14B without noticeable quality loss.
LGApr 2, 2024Code
Accelerating Transformer Pre-training with 2:4 SparsityYuezhou Hu, Kang Zhao, Weiyu Huang et al.
Training large transformers is slow, but recent innovations on GPU architecture give us an advantage. NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In the light of this property, we comprehensively investigate the feasibility of accelerating feed-forward networks (FFNs) of transformers in pre-training. First, we define a ``flip rate'' to monitor the stability of a 2:4 training process. Utilizing this metric, we propose three techniques to preserve accuracy: to modify the sparse-refined straight-through estimator by applying the masked decay term on gradients, to determine a feasible decay factor in warm-up stage, and to enhance the model's quality by a dense fine-tuning procedure near the end of pre-training. Besides, we devise two techniques to practically accelerate training: to calculate transposable 2:4 masks by convolution, and to accelerate gated activation functions by reducing GPU L2 cache miss. Experiments show that our 2:4 sparse training algorithm achieves similar convergence to dense training algorithms on several transformer pre-training tasks, while actual acceleration can be observed on different shapes of transformer block apparently. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
ROApr 10, 2023
Learning Residual Model of Model Predictive Control via Random Forests for Autonomous DrivingKang Zhao, Jianru Xue, Xiangning Meng et al.
One major issue in learning-based model predictive control (MPC) for autonomous driving is the contradiction between the system model's prediction accuracy and computation efficiency. The more situations a system model covers, the more complex it is, along with highly nonlinear and nonconvex properties. These issues make the optimization too complicated to solve and render real-time control impractical.To address these issues, we propose a hierarchical learning residual model which leverages random forests and linear regression.The learned model consists of two levels. The low level uses linear regression to fit the residues, and the high level uses random forests to switch different linear models. Meanwhile, we adopt the linear dynamic bicycle model with error states as the nominal model.The switched linear regression model is added to the nominal model to form the system model. It reformulates the learning-based MPC as a quadratic program (QP) problem and optimization solvers can effectively solve it. Experimental path tracking results show that the driving vehicle's prediction accuracy and tracking accuracy are significantly improved compared with the nominal MPC.Compared with the state-of-the-art Gaussian process-based nonlinear model predictive control (GP-NMPC), our method gets better performance on tracking accuracy while maintaining a lower computation consumption.
LGJan 27
R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement LearningZhizheng Jiang, Kang Zhao, Weikai Xu et al.
Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods rely on advantage gaps induced by high-quality samples within the same batch, which makes the training process fragile and inefficient when intra-group advantages collapse under challenging tasks. To address these problems, we propose a reinforcement learning mechanism named \emph{\textbf{R^3}} that along three directions: (1) a \emph{cross-context \underline{\textbf{R}}eplay} strategy that maintains the intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an \emph{in-context self-\underline{\textbf{R}}eflection} mechanism enabling models to refine outputs by leveraging past failures, and (3) a \emph{structural entropy \underline{\textbf{R}}anking reward}, which assigns relative rewards to truncated or failed samples by ranking responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement our method on Deepseek-R1-Distill-Qwen-1.5B and train it on the DeepscaleR-40k in the math domain. Experiments demonstrate our method achieves SoTA performance on several math benchmarks, representing significant improvements and fewer reasoning tokens over the base models. Code and model will be released.
CLMay 30, 2025Code
A Simple Linear Patch Revives Layer-Pruned Large Language ModelsXinrui Chen, Haoli Bai, Tao Yuan et al.
Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We identify the majority of this degradation to a single yet previously overlooked issue: \textit{the mismatch of activation magnitudes at the pruning interface}. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing the distributional shift as it propagates through the remaining layers. To address this issue, we introduce \textsc{LinearPatch}, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, \textsc{LinearPatch} preserves up to \textbf{94.15\%} of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by \textbf{4\%}. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16\% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
CVJan 4, 2023
Semi-MAE: Masked Autoencoders for Semi-supervised Vision TransformersHaojie Yu, Kang Zhao, Xiaoming Xu
Vision Transformer (ViT) suffers from data scarcity in semi-supervised learning (SSL). To alleviate this issue, inspired by masked autoencoder (MAE), which is a data-efficient self-supervised learner, we propose Semi-MAE, a pure ViT-based SSL framework consisting of a parallel MAE branch to assist the visual representation learning and make the pseudo labels more accurate. The MAE branch is designed as an asymmetric architecture consisting of a lightweight decoder and a shared-weights encoder. We feed the weakly-augmented unlabeled data with a high masking ratio to the MAE branch and reconstruct the missing pixels. Semi-MAE achieves 75.9% top-1 accuracy on ImageNet with 10% labels, surpassing prior state-of-the-art in semi-supervised image classification. In addition, extensive experiments demonstrate that Semi-MAE can be readily used for other ViT models and masked image modeling methods.
CVJul 22, 2025Code
Task-Specific Zero-shot Quantization-Aware Training for Object DetectionChanghao Li, Xinrui Chen, Ji Wang et al.
Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method. Our code is publicly available at: https://github.com/DFQ-Dojo/dfq-toolkit .
CLJun 3, 2024Code
SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLMQuandong Wang, Yuxuan Yuan, Xiaoyu Yang et al.
While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules. The subsampling modules are responsible for shortening the sequence, while the upsampling modules restore the sequence length, and the bypass modules enhance convergence. In comparison to LLaMA, the proposed SUBLLM exhibits significant enhancements in both training and inference speeds as well as memory usage, while maintaining competitive few-shot performance. During training, SUBLLM increases speeds by 26% and cuts memory by 10GB per GPU. In inference, it boosts speeds by up to 37% and reduces memory by 1GB per GPU. The training and inference speeds can be enhanced by 34% and 52% respectively when the context window is expanded to 8192. Our code is available at https://github.com/XiaoMi/subllm.
LGMar 19, 2024Code
Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block QuantizationHaocheng Xi, Yuxiang Chen, Kang Zhao et al.
Pretraining transformers are generally time-consuming. Fully quantized training (FQT) is a promising approach to speed up pretraining. However, most FQT methods adopt a quantize-compute-dequantize procedure, which often leads to suboptimal speedup and significant performance degradation when used in transformers due to the high memory access overheads and low-precision computations. In this work, we propose Jetfire, an efficient and accurate INT8 training method specific to transformers. Our method features an INT8 data flow to optimize memory access and a per-block quantization method to maintain the accuracy of pretrained transformers. Extensive experiments demonstrate that our INT8 FQT method achieves comparable accuracy to the FP16 training baseline and outperforms the existing INT8 training works for transformers. Moreover, for a standard transformer block, our method offers an end-to-end training speedup of 1.42x and a 1.49x memory reduction compared to the FP16 baseline. Our code is open sourced at https://github.com/thu-ml/Jetfire-INT8Training.
CLSep 13, 2021Code
TEXTOIR: An Integrated and Visualized Platform for Text Open Intent RecognitionHanlei Zhang, Xiaoteng Li, Hua Xu et al.
TEXTOIR is the first integrated and visualized platform for text open intent recognition. It is composed of two main modules: open intent detection and open intent discovery. Each module integrates most of the state-of-the-art algorithms and benchmark intent datasets. It also contains an overall framework connecting the two modules in a pipeline scheme. In addition, this platform has visualized tools for data and model management, training, evaluation and analysis of the performance from different aspects. TEXTOIR provides useful toolkits and convenient visualized interfaces for each sub-module (Toolkit code: https://github.com/thuiar/TEXTOIR), and designs a framework to implement a complete process to both identify known intents and discover open intents (Demo code: https://github.com/thuiar/TEXTOIR-DEMO).
CVDec 18, 2023
AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head SynthesisDongze Li, Kang Zhao, Wei Wang et al.
Audio-driven talking head synthesis is a promising topic with wide applications in digital human, film making and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only few seconds of talking video is available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., mouth is audio related, while ear is audio independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker with fewshot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of audio between reference and target image. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio related and audio independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown AE-NeRF surpasses the state-of-the-art on image fidelity, audio-lip synchronization, and generalization ability, even in limited training set or training iterations.
NAMay 1
Spectral decomposition of $(\star, ε)$-palindromic matrix polynomial and its applicationsKang Zhao, Xin Wang, Xiaoxiao Ma
This paper provides the spectral decomposition of $(\star,ε)$-palindromic quadratic matrix polynomial $P(λ)$ by a standard pair and a parameter matrix. When $J$ is assumed to be a block diagonal matrix, the parameter matrix $Γ$ has a special structure. And then the spectral decomposition is applied to solve the inverse eigenvalue problem and the eigenvalue embedding problem with no spill-over.
CVNov 13, 2024
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video GenerationXiaofeng Wang, Kang Zhao, Feng Liu et al.
Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleaning pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.
LGOct 22, 2024
FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUsHaoran Lin, Xianzhi Yu, Kang Zhao et al.
FlashAttention series has been widely applied in the inference of large language models (LLMs). However, FlashAttention series only supports the high-level GPU architectures, e.g., Ampere and Hopper. At present, FlashAttention series is not easily transferrable to NPUs and low-resource GPUs. Moreover, FlashAttention series is inefficient for multi- NPUs or GPUs inference scenarios. In this work, we propose FastAttention which pioneers the adaptation of FlashAttention series for NPUs and low-resource GPUs to boost LLM inference efficiency. Specifically, we take Ascend NPUs and Volta-based GPUs as representatives for designing our FastAttention. We migrate FlashAttention series to Ascend NPUs by proposing a novel two-level tiling strategy for runtime speedup, tiling-mask strategy for memory saving and the tiling-AllReduce strategy for reducing communication overhead, respectively. Besides, we adapt FlashAttention for Volta-based GPUs by redesigning the operands layout in shared memory and introducing a simple yet effective CPU-GPU cooperative strategy for efficient memory utilization. On Ascend NPUs, our FastAttention can achieve a 10.7$\times$ speedup compared to the standard attention implementation. Llama-7B within FastAttention reaches up to 5.16$\times$ higher throughput than within the standard attention. On Volta architecture GPUs, FastAttention yields 1.43$\times$ speedup compared to its equivalents in \texttt{xformers}. Pangu-38B within FastAttention brings 1.46$\times$ end-to-end speedup using FasterTransformer. Coupled with the propose CPU-GPU cooperative strategy, FastAttention supports a maximal input length of 256K on 8 V100 GPUs. All the codes will be made available soon.
LGOct 21, 2024
Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUsKang Zhao, Tao Yuan, Han Bao et al.
To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often possesses low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, Some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pertaining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers, including heuristic V and M selection, V:N:M-specific channel permutation, and three-staged LoRA training techniques. Experimental results show that, with our methods, the DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while the DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned LLama2-7B at 64:2:5 sparsity performs comparably or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs compared to 2:4 sparsity. Overall, our exploration largely facilitates the V:N:M sparsity to act as a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.
SIApr 2
Structural Diversity Drives Disruptive Scientific InnovationYichun Peng, Saike He, Peijie Zhang et al.
Scientific innovation increasingly depends on collaboration, yet the organizational structure that fosters breakthrough ideas remains poorly understood. Existing metrics - such as team size or compositional diversity - capture readily observable characteristics but not the deeper architecture of collaboration. We introduce Structural Diversity (SD): the extent to which a team bridges multiple distinct knowledge communities within its prior collaboration network. Using a century-scale dataset of 260 million scientific publications (1900-2025) and combining causal inference with a quasi-natural experiment based on a U.S. National Science Foundation policy change in 2012, we show that SD is a powerful and robust predictor of disruptive innovation, outperforming traditional team novelty indicators such as team freshness and edge density. Moreover, SD positively interacts with team size and is able to mitigate the well-known "curse of scale" by transforming scale from a liability into a resource for creative synthesis. We find that one mechanism underlying this effect is Disciplinary Integration (DI): teams with higher SD can more effectively combine heterogeneous knowledge into novel configurations. Our findings position SD as both a new theoretical construct and an actionable design principle for organizing scientific collaboration. By linking the architecture of team assembly to the dynamics of creative discovery, our work offers a structural explanation for how collective intelligence can be systematically engineered to foster disruptive innovation.
CVJun 26, 2025
DFVEdit: Conditional Delta Flow Vector for Zero-shot Video EditingLingling Cai, Kang Zhao, Hangjie Yuan et al.
The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) -- a theoretically unbiased estimation of DFV -- and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.
CYJan 27
Dynamics of Human-AI Collective Knowledge on the Web: A Scalable Model and Insights for Sustainable GrowthBuddhika Nettasinghe, Kang Zhao
Humans and large language models (LLMs) now co-produce and co-consume the web's shared knowledge archives. Such human-AI collective knowledge ecosystems contain feedback loops with both benefits (e.g., faster growth, easier learning) and systemic risks (e.g., quality dilution, skill reduction, model collapse). To understand such phenomena, we propose a minimal, interpretable dynamical model of the co-evolution of archive size, archive quality, model (LLM) skill, aggregate human skill, and query volume. The model captures two content inflows (human, LLM) controlled by a gate on LLM-content admissions, two learning pathways for humans (archive study vs. LLM assistance), and two LLM-training modalities (corpus-driven scaling vs. learning from human feedback). Through numerical experiments, we identify different growth regimes (e.g., healthy growth, inverted flow, inverted learning, oscillations), and show how platform and policy levers (gate strictness, LLM training, human learning pathways) shift the system across regime boundaries. Two domain configurations (PubMed, GitHub and Copilot) illustrate contrasting steady states under different growth rates and moderation norms. We also fit the model to Wikipedia's knowledge flow during pre-ChatGPT and post-ChatGPT eras separately. We find a rise in LLM additions with a concurrent decline in human inflow, consistent with a regime identified by the model. Our model and analysis yield actionable insights for sustainable growth of human-AI collective knowledge on the Web.
AISep 29, 2025
Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning IntensityZhen Bi, Zhenlin Hu, Jinnan Yang et al.
Recent advances in large language models (LLMs) highlight the importance of training data structure and quality in shaping reasoning behavior. However, most existing approaches focus on transforming data formats while neglecting the internal reasoning complexity of training samples, leaving the reasoning potential of data under-explored and underutilized. In this work, we posit that LLM logical reasoning performance is jointly constrained by the potential of the training data and the cognitive capacity of the model. To make this relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples by decomposing and aggregating their logical structures. This allows us to analyze how well current LLMs utilize logical reasoning signals and identify performance gaps relative to data potential. Based on this insight, we introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data. Rather than increasing data volume, our method re-optimizes existing samples to better align with the LLM's logical reasoning boundary. Extensive experiments show that our approach significantly improves performance and generalization over data-centric strategies. We further validate our method under a reinforcement learning framework. Our results indicate that prioritizing reasoning complexity in data rather than sheer scale or superficial form is essential to realizing LLMs' full cognitive potential.
LGMar 24, 2025
Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMsChang Gao, Kang Zhao, Runqi Wang et al.
Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocation the sparsity for each layer. Recent sparsity allocation methods is often based on heuristics or search that can easily lead to suboptimal performance. In this paper, we conducted an extensive investigation into various LLMs and revealed three significant discoveries: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: \emph{non-uniformity}, \emph{pruning metric dependency}, and \emph{uniform layerwise redundancy level} in the pruned model. To this end, we proposed Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes in the most redundant layers (\emph{i.e.}, those with the highest non-outlier ratio) at each iteration. The achieved layerwise sparsity aligns with the outlined principles. We conducted extensive experiments on publicly available LLMs, including the LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
CVFeb 9, 2021
Distribution Adaptive INT8 Quantization for Training CNNsKang Zhao, Sida Huang, Pan Pan et al.
Researches have demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate the inference process. It makes the gradient quantization very promising since the backward propagation requires approximately twice more computation than forward one. Due to the variability and uncertainty of gradient distribution, a lot of methods have been proposed to attain training stability. However, most of them ignore the channel-wise gradient distributions and the impact of gradients with different magnitudes, resulting in the degradation of final accuracy. In this paper, we propose a novel INT8 quantization training framework for convolutional neural network to address the above issues. Specifically, we adopt Gradient Vectorized Quantization to quantize the gradient, based on the observation that layer-wise gradients contain multiple distributions along the channel dimension. Then, Magnitude-aware Clipping Strategy is introduced by taking the magnitudes of gradients into consideration when minimizing the quantization error, and we present a theoretical derivation to solve the quantization parameters of different distributions. Experimental results on broad range of computer vision tasks, such as image classification, object detection and video classification, demonstrate that the proposed Distribution Adaptive INT8 Quantization training method has achieved almost lossless training accuracy for different backbones, including ResNet, MobileNetV2, InceptionV3, VGG and AlexNet, which is superior to the state-of-the-art techniques. Moreover, we further implement the INT8 kernel that can accelerate the training iteration more than 200% under the latest Turing architecture, i.e., our method excels on both training accuracy and speed.
LGFeb 9, 2021
Large-Scale Training System for 100-Million Classification at AlibabaLiuyihan Song, Pan Pan, Kang Zhao et al.
In the last decades, extreme classification has become an essential topic for deep learning. It has achieved great success in many areas, especially in computer vision and natural language processing (NLP). However, it is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer. In this paper, we propose a large-scale training system to address these challenges. First, we build a hybrid parallel training framework to make the training process feasible. Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs and improves the throughput of training. Then, to eliminate the communication overhead, we propose a new overlapping pipeline and a gradient sparsification method. Furthermore, we design a fast continuous convergence strategy to reduce total training iterations by adaptively adjusting learning rate and updating model parameters. With the help of all the proposed methods, we gain 3.9$\times$ throughput of our training system and reduce almost 60\% of training iterations. The experimental results show that using an in-house 256 GPUs cluster, we could train a classifier of 100 million classes on Alibaba Retail Product Dataset in about five days while achieving a comparable accuracy with the naive softmax training process.
CVFeb 9, 2021
Visual Search at AlibabaYanhao Zhang, Pan Pan, Yun Zheng et al.
This paper introduces the large scale visual search algorithm and system infrastructure at Alibaba. The following challenges are discussed under the E-commercial circumstance at Alibaba (a) how to handle heterogeneous image data and bridge the gap between real-shot images from user query and the online images. (b) how to deal with large scale indexing for massive updating data. (c) how to train deep models for effective feature representation without huge human annotations. (d) how to improve the user engagement by considering the quality of the content. We take advantage of large image collection of Alibaba and state-of-the-art deep learning techniques to perform visual search at scale. We present solutions and implementation details to overcome those problems and also share our learnings from building such a large scale commercial visual search engine. Specifically, model and search-based fusion approach is introduced to effectively predict categories. Also, we propose a deep CNN model for joint detection and feature learning by mining user click behavior. The binary index engine is designed to scale up indexing without compromising recall and precision. Finally, we apply all the stages into an end-to-end system architecture, which can simultaneously achieve highly efficient and scalable performance adapting to real-shot images. Extensive experiments demonstrate the advancement of each module in our system. We hope visual search at Alibaba becomes more widely incorporated into today's commercial applications.
CVFeb 9, 2021
Virtual ID Discovery from E-commerce Media at Alibaba: Exploiting Richness of User Click Behavior for Visual Search RelevanceYanhao Zhang, Pan Pan, Yun Zheng et al.
Visual search plays an essential role for E-commerce. To meet the search demands of users and promote shopping experience at Alibaba, visual search relevance of real-shot images is becoming the bottleneck. Traditional visual search paradigm is usually based upon supervised learning with labeled data. However, large-scale categorical labels are required with expensive human annotations, which limits its applicability and also usually fails in distinguishing the real-shot images. In this paper, we propose to discover Virtual ID from user click behavior to improve visual search relevance at Alibaba. As a totally click-data driven approach, we collect various types of click data for training deep networks without any human annotations at all. In particular, Virtual ID are learned as classification supervision with co-click embedding, which explores image relationship from user co-click behaviors to guide category prediction and feature learning. Concretely, we deploy Virtual ID Category Network by integrating first-clicks and switch-clicks as regularizer. Incorporating triplets and list constraints, Virtual ID Feature Network is trained in a joint classification and ranking manner. Benefiting from exploration of user click data, our networks are more effective to encode richer supervision and better distinguish real-shot images in terms of category and feature. To validate our method for visual search relevance, we conduct an extensive set of offline and online experiments on the collected real-shot images. We consistently achieve better experimental results across all components, compared with alternative and state-of-the-art methods.
IRFeb 9, 2021
Large-Scale Visual Search with Binary Distributed Graph at AlibabaKang Zhao, Pan Pan, Yun Zheng et al.
Graph-based approximate nearest neighbor search has attracted more and more attentions due to its online search advantages. Numbers of methods studying the enhancement of speed and recall have been put forward. However, few of them focus on the efficiency and scale of offline graph-construction. For a deployed visual search system with several billions of online images in total, building a billion-scale offline graph in hours is essential, which is almost unachievable by most existing methods. In this paper, we propose a novel algorithm called Binary Distributed Graph to solve this problem. Specifically, we combine binary codes with graph structure to speedup online and offline procedures, and achieve comparable performance with the ones in real-value based scenarios by recalling more binary candidates. Furthermore, the graph-construction is optimized to completely distributed implementation, which significantly accelerates the offline process and gets rid of the limitation of memory and disk within a single machine. Experimental comparisons on Alibaba Commodity Data Set (more than three billion images) show that the proposed method outperforms the state-of-the-art with respect to the online/offline trade-off.
CVJun 23, 2020
Boundary Regularized Building Footprint Extraction From Satellite Images Using Deep Neural NetworkKang Zhao, Muhammad Kamran, Gunho Sohn
In recent years, an ever-increasing number of remote satellites are orbiting the Earth which streams vast amount of visual data to support a wide range of civil, public and military applications. One of the key information obtained from satellite imagery is to produce and update spatial maps of built environment due to its wide coverage with high resolution data. However, reconstructing spatial maps from satellite imagery is not a trivial vision task as it requires reconstructing a scene or object with high-level representation such as primitives. For the last decade, significant advancement in object detection and representation using visual data has been achieved, but the primitive-based object representation still remains as a challenging vision task. Thus, a high-quality spatial map is mainly produced through complex labour-intensive processes. In this paper, we propose a novel deep neural network, which enables to jointly detect building instance and regularize noisy building boundary shapes from a single satellite imagery. The proposed deep learning method consists of a two-stage object detection network to produce region of interest (RoI) features and a building boundary extraction network using graph models to learn geometric information of the polygon shapes. Extensive experiments show that our model can accomplish multi-tasks of object localization, recognition, semantic labelling and geometric shape extraction simultaneously. In terms of building extraction accuracy, computation efficiency and boundary regularization performance, our model outperforms the state-of-the-art baseline models.
AIJun 17, 2015
Early Predictions of Movie Success: the Who, What, and When of ProfitabilityMichael T. Lash, Kang Zhao
This paper proposes a decision support system to aid movie investment decisions at the early stage of movie productions. The system predicts the success of a movie based on its profitability by leveraging historical data from various sources. Using social network analysis and text mining techniques, the system automatically extracts several groups of features, including "who" are on the cast, "what" a movie is about, "when" a movie will be released, as well as "hybrid" features that match "who" with "what", and "when" with "what". Experiment results with movies during an 11-year period showed that the system outperforms benchmark methods by a large margin in predicting movie profitability. Novel features we proposed also made great contributions to the prediction. In addition to designing a decision support system with practical utilities, our analysis of key factors for movie profitability may also have implications for theoretical research on team performance and the success of creative work.
SINov 11, 2013
User recommendation in reciprocal and bipartite social networks -- a case study of online datingKang Zhao, Xi Wang, Mo Yu et al.
Many social networks in our daily life are bipartite networks built on reciprocity. How can we recommend users/friends to a user, so that the user is interested in and attractive to recommended users? In this research, we propose a new collaborative filtering model to improve user recommendations in reciprocal and bipartite social networks. The model considers a user's "taste" in picking others and "attractiveness" in being picked by others. A case study of an online dating network shows that the new model has good performance in recommending both initial and reciprocal contacts.