Yao Tang

CV
h-index 20
24 papers
601 citations
Novelty 53%
AI Score 61

24 Papers

CV · Jul 26, 2022 · Code
Efficient One Pass Self-distillation with Zipf's Label Smoothing

Jiajun Liang, Linze Li, Zhaodong Bing et al.

Self-distillation exploits non-uniform soft supervision from itself during training and improves performance without any runtime cost. However, the overhead during training is often overlooked, and yet reducing time and memory overhead during training is increasingly important in the era of giant models. This paper proposes an efficient self-distillation method named Zipf's Label Smoothing (Zipf's LS), which uses the on-the-fly prediction of a network to generate soft supervision that conforms to a Zipf distribution without using any contrastive samples or auxiliary parameters. Our idea comes from an empirical observation that when the network is duly trained, the output values of a network's final softmax layer, after sorting by magnitude and averaging across samples, should follow a distribution reminiscent of Zipf's Law in the word frequency statistics of natural languages. By enforcing this property on the sample level and throughout the whole training period, we find that the prediction accuracy can be greatly improved. Using ResNet50 on the INAT21 fine-grained classification dataset, our technique achieves +3.61% accuracy gain compared to the vanilla baseline, and 0.88% more gain against the previous label smoothing or self-distillation strategies. The implementation is publicly available at https://github.com/megvii-research/zipfls.
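To make the idea of Zipf-shaped soft supervision concrete, here is a minimal sketch, not the authors' implementation. It assumes the non-target classes are ranked directly by the network's own logits and that the class at rank r receives mass proportional to r^(-alpha) with alpha = 1. The function name `zipf_soft_label` and the parameter `alpha` are hypothetical.

```python
def zipf_soft_label(logits, target, alpha=1.0):
    """Build a soft label over non-target classes that follows a Zipf law.

    Assumption: ranks come from sorting the network's own logits
    (rank 1 = largest non-target logit). The target class gets zero mass here
    because it would be supervised separately by the hard label.
    """
    non_target = [i for i in range(len(logits)) if i != target]
    # Sort non-target class indices by descending logit.
    ranked = sorted(non_target, key=lambda i: logits[i], reverse=True)
    # Zipf weights: 1, 1/2**alpha, 1/3**alpha, ...
    weights = [(r + 1) ** -alpha for r in range(len(ranked))]
    total = sum(weights)
    soft = [0.0] * len(logits)
    for idx, w in zip(ranked, weights):
        soft[idx] = w / total
    return soft


soft = zipf_soft_label([2.0, 0.5, 1.0, -1.0], target=0)
```

In this toy call, class 2 has the largest non-target logit, so it receives the largest share of the soft label, followed by class 1 and then class 3.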

CV · Apr 21, 2023 · Code
Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

Siyuan Wei, Tianzhu Ye, Shen Zhang et al.

Although vision transformers (ViTs) have shown promising results in various computer vision tasks recently, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated a good trade-off between performance and computation costs. Nevertheless, errors caused by pruning strategies can lead to significant information loss. Our quantitative experiments reveal that the impact of pruned tokens on performance should be noticeable. To address this issue, we propose a novel joint Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency. Firstly, TPS adopts pruning to get the reserved and pruned subsets. Secondly, TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-based fusing steps. Compared to state-of-the-art methods, our approach outperforms them under all token pruning intensities. Especially while shrinking DeiT-tiny&small computational budgets to 35%, it improves the accuracy by 1%-6% compared with baselines on ImageNet classification. The proposed method can accelerate the throughput of DeiT-small beyond DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%. Experiments on various transformers demonstrate the effectiveness of our method, while analysis experiments prove our higher robustness to the errors of the token pruning policy. Code is available at https://github.com/megvii-research/TPS-CVPR2023.
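The squeezing step described above can be sketched as follows. This is an illustration under stated assumptions, not the TPS code: each pruned token is matched to its most similar reserved token by cosine similarity (one direction only, pruned to reserved), and the fusion weights are exp(similarity), with each reserved token weighted as if it matched itself perfectly. The helper names `cosine` and `squeeze` and the weighting scheme are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def squeeze(reserved, pruned):
    """Fold each pruned token into its nearest reserved token.

    Assumption: fusion is a weighted average with weights exp(similarity);
    a reserved token contributes with weight exp(1).
    """
    sums = [[math.e * x for x in tok] for tok in reserved]
    norms = [math.e for _ in reserved]
    for tok in pruned:
        # Unidirectional matching: compare against the original reserved tokens.
        sims = [cosine(tok, r) for r in reserved]
        host = max(range(len(reserved)), key=lambda i: sims[i])
        w = math.exp(sims[host])
        sums[host] = [s + w * x for s, x in zip(sums[host], tok)]
        norms[host] += w
    return [[s / n for s in row] for row, n in zip(sums, norms)]


reserved = [[1.0, 0.0], [0.0, 1.0]]
pruned = [[0.9, 0.1], [0.2, 0.8]]
fused = squeeze(reserved, pruned)
```

The output has as many tokens as `reserved`, so the token count after pruning is unchanged while each kept token carries some information from the pruned tokens closest to it.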

CV · Oct 26, 2022 · Code
SimpleDG: Simple Domain Generalization Baseline without Bells and Whistles

Zhi Lv, Bo Lin, Siyuan Liang et al.

We present a simple domain generalization baseline, which won second place in both the common context generalization track and the hybrid context generalization track of NICO CHALLENGE 2022. We verify the finding in recent literature, DomainBed, that ERM is a strong baseline compared to recent state-of-the-art domain generalization methods, and propose SimpleDG, which includes several simple yet effective designs that further boost generalization performance. Code is available at https://github.com/megvii-research/SimpleDG

CL · Jan 13 · Code
Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Yao Tang, Li Dong, Yaru Hao et al.

Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
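A minimal sketch of the branch-and-merge step may help. It assumes the K candidates are drawn independently from the next-token distribution and merged by a plain average of their embeddings; the abstract does not specify the aggregation rule, so the averaging and the name `multiplex_token` are assumptions.

```python
import random


def multiplex_token(probs, embeddings, k, rng):
    """Sample k token ids from probs and average their embedding vectors.

    Assumption: uniform averaging over the k samples (illustrative only).
    """
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]
    return ids, merged


rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Confident case: all probability mass on token 0, so every sample is token 0
# and the merged vector equals that token's embedding (the discrete limit).
ids, merged = multiplex_token([1.0, 0.0, 0.0], embeddings, k=4, rng=rng)

# Uncertain case: the merged vector is a blend of several embeddings,
# yet it still occupies a single position in the sequence.
ids_u, merged_u = multiplex_token([0.4, 0.3, 0.3], embeddings, k=4, rng=rng)
```

The two calls mirror the self-adaptive behaviour described in the abstract: a peaked distribution collapses to one ordinary token embedding, while a flat one yields a mixture.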

IT · Jun 5, 2023
Integrated Sensing, Computation, and Communication for UAV-assisted Federated Edge Learning

Yao Tang, Guangxu Zhu, Wei Xu et al.

Federated edge learning (FEEL) enables privacy-preserving model training through periodic communication between edge devices and the server. Unmanned Aerial Vehicle (UAV)-mounted edge devices are particularly advantageous for FEEL due to their flexibility and mobility in efficient data collection. In UAV-assisted FEEL, sensing, computation, and communication are coupled and compete for limited onboard resources, and UAV deployment also affects sensing and communication performance. Therefore, the joint design of UAV deployment and resource allocation is crucial to achieving the optimal training performance. In this paper, we address the problem of joint UAV deployment design and resource allocation for FEEL via a concrete case study of human motion recognition based on wireless sensing. We first analyze the impact of UAV deployment on the sensing quality and identify a threshold value for the sensing elevation angle that guarantees a satisfactory quality of data samples. Due to the non-ideal sensing channels, we consider the probabilistic sensing model, where the successful sensing probability of each UAV is determined by its position. Then, we derive the upper bound of the FEEL training loss as a function of the sensing probability. Theoretical results suggest that the convergence rate can be improved if UAVs have a uniform successful sensing probability. Based on this analysis, we formulate a training time minimization problem by jointly optimizing UAV deployment, integrated sensing, computation, and communication (ISCC) resources under a desirable optimality gap constraint. To solve this challenging mixed-integer non-convex problem, we apply the alternating optimization technique, and propose the bandwidth, batch size, and position optimization (BBPO) scheme to optimize these three decision variables alternately.

CV · Nov 29, 2023
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

Shen Zhang, Zhaowei Chen, Zhenyu Zhao et al.

Diffusion models have become a mainstream approach for high-resolution image synthesis. However, directly generating higher-resolution images from pretrained diffusion models will encounter unreasonable object duplication and exponentially increase the generation time. In this paper, we discover that object duplication arises from feature duplication in the deep blocks of the U-Net. Concurrently, we attribute the extended generation time to self-attention redundancy in the U-Net's top blocks. To address these issues, we propose a tuning-free higher-resolution framework named HiDiffusion. Specifically, HiDiffusion contains Resolution-Aware U-Net (RAU-Net) that dynamically adjusts the feature map size to resolve object duplication and engages Modified Shifted Window Multi-head Self-Attention (MSW-MSA) that utilizes optimized window attention to reduce computations. We can integrate HiDiffusion into various pretrained diffusion models to scale image generation resolutions even to 4096x4096 at 1.5-6x the inference speed of previous methods. Extensive experiments demonstrate that our approach can address object duplication and heavy computation issues, achieving state-of-the-art performance on higher-resolution image synthesis tasks.

LG · Feb 4, 2025 · Code
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Zongyu Lin, Yao Tang, Xingcheng Yao et al.

Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.

CV · Jul 2, 2025 · Code
Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

Ge Wu, Shen Zhang, Ruijing Shi et al.

REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256×256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving 63× and 23× faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations (10× longer). Code is available at: https://github.com/Martinser/REG.

CV · Oct 7, 2023
1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023 Technical Report: A Concise Pipeline for Egocentric Hand Pose Reconstruction

Zhishan Zhou, Zhi Lv, Shihao Zhou et al.

This report introduces our work for the Egocentric 3D Hand Pose Estimation workshop. Using AssemblyHands, this challenge focuses on egocentric 3D hand pose estimation from a single-view image. In the competition, we adopt ViT-based backbones and a simple regressor for 3D keypoint prediction, which provides strong model baselines. We noticed that hand-object occlusions and self-occlusions lead to performance degradation, and thus proposed a non-model method to merge multi-view results in the post-processing stage. Moreover, we utilized test-time augmentation and model ensembling for further improvement. We also found that public datasets and rational preprocessing are beneficial. Our method achieved 12.21mm MPJPE on the test dataset, taking first place in the Egocentric 3D Hand Pose Estimation challenge.
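For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below shows the computation on made-up coordinates; the function name `mpjpe` and the sample values are illustrative and not taken from the report.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between matching 3D joints.

    The result is in the same unit as the inputs (millimetres here).
    """
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


# Two joints: the first is off by a 3-4-5 triangle (5 mm), the second is exact.
pred = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
gt = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
error_mm = mpjpe(pred, gt)  # (5 + 0) / 2 = 2.5
```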

CV · May 14
The Velocity Deficit: Initial Energy Injection for Flow Matching

Linze Li, Zong-Wei Hong, Shen Zhang et al.

While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.

CV · Jul 14, 2025 · Code
Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

Jinglun Li, Kaixun Jiang, Zhaoyu Chen et al.

Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, and the code is available at https://github.com/Jarvisgivemeasuit/SynOOD.

CV · Sep 7, 2017 · Code
FingerNet: An Unified Deep Network for Fingerprint Minutiae Extraction

Yao Tang, Fei Gao, Jufu Feng et al.

Minutiae extraction is of critical importance in automated fingerprint recognition. Previous works on rolled/slap fingerprints failed on latent fingerprints due to noisy ridge patterns and complex background noises. In this paper, we propose a new way to design a deep convolutional network that combines domain knowledge with the representation ability of deep learning. In terms of orientation estimation, segmentation, enhancement and minutiae extraction, several typical traditional methods that perform well on rolled/slap fingerprints are transformed into convolutional manners and integrated as a unified plain network. We demonstrate that this pipeline is equivalent to a shallow network with fixed weights. The network is then expanded to enhance its representation ability and the weights are released to learn complex background variance from data, while preserving end-to-end differentiability. Experimental results on NIST SD27 latent database and FVC 2004 slap database demonstrate that the proposed algorithm outperforms the state-of-the-art minutiae extraction algorithms. Code is made publicly available at: https://github.com/felixTY/FingerNet.

CV · Sep 28, 2024
1st Place Solution of Multiview Egocentric Hand Tracking Challenge ECCV2024

Minqiang Zou, Zhi Lv, Riqiang Jin et al.

Multi-view egocentric hand tracking is a challenging task and plays a critical role in VR interaction. In this report, we present a method that uses multi-view input images and camera extrinsic parameters to estimate both hand shape and pose. To reduce overfitting to the camera layout, we apply crop jittering and extrinsic parameter noise augmentation. Additionally, we propose an offline neural smoothing post-processing method to further improve the accuracy of hand position and pose. Our method achieves 13.92mm MPJPE on the Umetrack dataset and 21.66mm MPJPE on the HOT3D dataset.

CV · Oct 20, 2023
Boosting Generalization with Adaptive Style Techniques for Fingerprint Liveness Detection

Kexin Zhu, Bo Lin, Yang Qiu et al.

We introduce a high-performance fingerprint liveness feature extraction technique that secured first place in LivDet 2023 Fingerprint Representation Challenge. Additionally, we developed a practical fingerprint recognition system with 94.68% accuracy, earning second place in LivDet 2023 Liveness Detection in Action. By investigating various methods, particularly style transfer, we demonstrate improvements in accuracy and generalization when faced with limited training data. As a result, our approach achieved state-of-the-art performance in LivDet 2023 Challenges.

IR · Mar 10
Diagnosing and Repairing Citation Failures in Generative Engine Optimization

Zhihua Tian, Yuhan Chen, Yao Tang et al.

Generative Engine Optimization (GEO) aims to improve content visibility in AI-generated responses. However, existing methods measure contribution (how much a document influences a response) rather than citation, the mechanism that actually drives traffic back to creators. Also, these methods apply generic rewriting rules uniformly, failing to diagnose why individual documents are not cited. This paper introduces a diagnostic approach to GEO that asks why a document fails to be cited and intervenes accordingly. We develop a unified framework comprising: (1) the first taxonomy of citation failure modes spanning different stages of a citation pipeline; (2) AgentGEO, an agentic system that diagnoses failures using this taxonomy, selects targeted repairs from a corresponding tool library, and iterates until citation is achieved; and (3) a document-centric benchmark evaluating whether optimizations generalize across held-out queries. AgentGEO achieves over 40% relative improvement in citation rates while modifying only 5% of content, compared to 25% for baselines. Our analysis reveals that generic optimization can harm long-tail content and that some documents face challenges that optimization alone cannot fully address, findings with implications for equitable visibility in AI-mediated information access.

CV · Mar 4, 2024
A Simple Baseline for Efficient Hand Mesh Reconstruction

Zhishan Zhou, Shihao Zhou, Zhi Lv et al.

3D hand pose estimation has found broad application in areas such as gesture recognition and human-machine interaction tasks. As performance improves, the complexity of the systems also increases, which can limit the comparative analysis and practical implementation of these methods. In this paper, we propose a simple yet effective baseline that not only surpasses state-of-the-art (SOTA) methods but also demonstrates computational efficiency. To establish this baseline, we abstract existing work into two components: a token generator and a mesh regressor, and then examine their core structures. A core structure, in this context, is one that fulfills intrinsic functions, brings about significant improvements, and achieves excellent performance without unnecessary complexities. Our proposed approach is decoupled from any modifications to the backbone, making it adaptable to any modern model. Our method outperforms existing solutions, achieving SOTA results across multiple datasets. On the FreiHAND dataset, our approach produced a PA-MPJPE of 5.7mm and a PA-MPVPE of 6.0mm. Similarly, on the DexYCB dataset, we observed a PA-MPJPE of 5.5mm and a PA-MPVPE of 5.0mm. As for speed, our method reached up to 33 frames per second (fps) when using HRNet and up to 70 fps when employing FastViT-MA36.

CL · Jun 9, 2025
Reinforcement Pre-Training

Qingxiu Dong, Li Dong, Yao Tang et al. · tsinghua

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
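The verifiable reward at the heart of this setup can be sketched in a few lines. This is an illustration only: it assumes a binary exact-match reward against the corpus token and averages it over several sampled predictions. The function names and the toy sentence are hypothetical, and the paper's actual reward definition may differ.

```python
def next_token_reward(predicted, tokens, position):
    """Return 1.0 if the prediction matches the corpus token at position, else 0.0.

    Assumption: exact-match binary reward; the corpus itself serves as the verifier.
    """
    return 1.0 if predicted == tokens[position] else 0.0


def mean_reward(predictions, tokens, position):
    """Average reward over several sampled predictions for the same context."""
    rewards = [next_token_reward(p, tokens, position) for p in predictions]
    return sum(rewards) / len(rewards)


tokens = ["the", "cat", "sat", "on", "the", "mat"]
# The context is tokens[:2]; the ground-truth next token is tokens[2] == "sat".
avg = mean_reward(["sat", "ran", "sat", "slept"], tokens, position=2)
```

Because the reward is checked against plain text, any corpus can supply training signal without domain-specific annotated answers, which is the scalability argument the abstract makes.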

CV · Mar 6, 2025
LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding

Shen Zhang, Siyuan Liang, Yaning Tan et al.

Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that explicit positional encodings (PEs), such as RoPE, must be extrapolated to unseen positions, which degrades performance when the inference resolution differs from training. In this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT) to overcome this limitation. LEDiT needs no explicit PEs, thereby avoiding PE extrapolation. The key innovation of LEDiT lies in the use of causal attention. We demonstrate that causal attention can implicitly encode global positional information and show that such information facilitates extrapolation. We further introduce a locality enhancement module, which captures fine-grained local information to complement the global coarse-grained position information encoded by causal attention. Experimental results on both conditional and text-to-image generation tasks demonstrate that LEDiT supports up to 4x resolution scaling (e.g., from 256x256 to 512x512), achieving better image quality compared to the state-of-the-art length extrapolation methods. We believe that LEDiT marks a departure from the standard RoPE-based methods and offers a promising insight into length extrapolation. Project page: https://shenzhang2145.github.io/ledit/

LG · Oct 23, 2024
Learning Versatile Skills with Curriculum Masking

Yao Tang, Zhihui Xie, Zichuan Lin et al.

Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme.

CV · Nov 24, 2025
VeCoR -- Velocity Contrastive Regularization for Flow Matching

Zong-Wei Hong, Jing-lun Li, Lin-Ze Li et al.

Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." To be formal, we propose Velocity Contrastive Regularization (VeCoR), a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256×256, VeCoR yields 22% and 35% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/

CL · Sep 3, 2025
NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management

Josh Wisoff, Yao Tang, Zhengyu Fang et al.

Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. The recent success of large language models has accelerated the development of AI-assisted tools, yet existing solutions often struggle with efficiency. We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. To support research and evaluation in this space, we further introduce a novel persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts across 16 MBTI personas, offering both diversity and semantic richness for downstream tasks. Finally, we demonstrate that NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Together, NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management.

CL · Mar 10, 2024
FMPAF: How Do Fed Chairs Affect the Financial Market? A Fine-grained Monetary Policy Analysis Framework on Their Language

Yayue Deng, Mohan Xu, Yao Tang

The effectiveness of central bank communication is a crucial aspect of monetary policy transmission. While recent research has examined the influence of policy communication by the chairs of the Federal Reserve on various financial variables, much of the literature relies on rule-based or dictionary-based methods in parsing the language of the chairs, leaving nuanced information about policy stance contained in nonverbal emotion out of the analysis. In the current study, we propose the Fine-Grained Monetary Policy Analysis Framework (FMPAF), a novel approach that integrates large language models (LLMs) with regression analysis to provide a comprehensive analysis of the impact of the press-conference communications of chairs of the Federal Reserve on financial markets. We conduct extensive comparisons of model performance under different levels of granularity, modalities, and communication scenarios. Based on our preferred specification, a one-unit increase in the sentiment score is associated with an increase of the price of S&P 500 Exchange-Traded Fund by approximately 500 basis points, a 15-basis-point decrease in the policy interest rate, while not leading to a significant response in exchange rates.

CV · Dec 6, 2023
FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability

Linze Li, Sunqi Fan, Hengjun Pu et al.

Over recent years, diffusion models have facilitated significant advancements in video generation. Yet, the creation of face-related videos still confronts issues such as low facial fidelity, lack of frame consistency, limited editability and uncontrollable human poses. To address these challenges, we introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities while ensuring frame consistency. This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models when incorporating a motion module. We propose two strategies towards this objective: training-free and training-based anchor frame methods. Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models, delivering substantial improvements over the original outcomes in terms of facial fidelity, text-to-image editability, and video motion. Moreover, we introduce conditional control using a 3D parametric face model to capture accurate facial movements and expressions. This solution augments the creative possibilities for facial animation generation through the integration of multiple control signals. For additional samples, please visit https://paper-faac.github.io/.

CV · Sep 30, 2016
Latent fingerprint minutia extraction using fully convolutional network

Yao Tang, Fei Gao, Jufu Feng

Minutiae play a major role in fingerprint identification. Extracting reliable minutiae is difficult for latent fingerprints, which are usually of poor quality. Given the limitations of traditional handcrafted features, a fully convolutional network (FCN) is utilized to learn features directly from data to overcome complex background noises. Raw fingerprints are mapped to a correspondingly-sized minutia-score map with a fixed stride, and a large number of candidate minutiae are then extracted by applying a given threshold. Small regions centered at these minutia points are then fed into a convolutional neural network (CNN) to reclassify these minutiae and calculate their orientations. The CNN shares convolutional layers with the fully convolutional network to speed up computation. Processing one fingerprint takes 0.45 seconds on average on a GPU. On the NIST SD27 database, we achieve 53% recall and 53% precision, outperforming many other algorithms. Our trained model is also visualized to show that we have successfully extracted features preserving ridge information of a latent fingerprint.