Yutian Lin

CV
h-index4
16papers
1,326citations
Novelty50%
AI Score57

16 Papers

CVJun 1, 2023
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Yingying Fan, Yu Wu, Bo Du et al.

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities. Previous works only concentrate on video-level overall label denoising across modalities, but overlook the segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events in the segment is challenging because its label could be any combination of events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, the similarity between language prompts and segments is calculated, where the event of the most similar prompt is regarded as the segment-level label. In addition, to deal with the mislabeled segments, we propose to perform dynamic re-weighting on the unreliable segments to adjust their labels. Experiments show that our simple yet effective approach outperforms state-of-the-art methods by a large margin.

CVFeb 16, 2023
Visible-Infrared Person Re-Identification via Patch-Mixed Cross-Modality Learning

Zhihao Qian, Yutian Lin, Bo Du

Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same pedestrian from different modalities, where the challenges lie in the significant modality discrepancy. To alleviate the modality gap, recent methods generate intermediate images by GANs, grayscaling, or mixup strategies. However, these methods could introduce extra data distribution, and the semantic correspondence between the two modalities is not well learned. In this paper, we propose a Patch-Mixed Cross-Modality framework (PMCM), where two images of the same person from two modalities are split into patches and stitched into a new one for model learning. A part-alignment loss is introduced to regularize representation learning, and a patch-mixed modality learning loss is proposed to align between the modalities. In this way, the model learns to recognize a person through patches of different styles, thereby the modality semantic correspondence can be inferred. In addition, with the flexible image generation strategy, the patch-mixed images freely adjust the ratio of different modality patches, which could further alleviate the modality imbalance problem. On two VI-ReID datasets, we report new state-of-the-art performance with the proposed method.

CVApr 2, 2024Code
Improving Bird's Eye View Semantic Segmentation by Task Decomposition

Tianhao Zhao, Yongcan Chen, Yu Wu et al.

Semantic segmentation in bird's eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline, directly predicting the BEV segmentation map from monocular RGB inputs. However, the challenge arises when the RGB inputs and BEV targets from distinct perspectives, making the direct point-to-point predicting hard to optimize. In this paper, we decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and RGB-BEV feature alignment. In the first stage, we train a BEV autoencoder to reconstruct the BEV segmentation maps given corrupted noisy latent representation, which urges the decoder to learn fundamental knowledge of typical BEV patterns. The second stage involves mapping RGB input images into the BEV latent space of the first stage, directly optimizing the correlations between the two views at the feature level. Our approach simplifies the complexity of combining perception and generation into distinct steps, equipping the model to handle intricate and challenging scenes effectively. Besides, we propose to transform the BEV segmentation map from the Cartesian to the polar coordinate system to establish the column-wise correspondence between RGB images and BEV maps. Moreover, our method requires neither multi-scale features nor camera intrinsic parameters for depth estimation and saves computational overhead. Extensive experiments on nuScenes and Argoverse show the effectiveness and efficiency of our method. Code is available at https://github.com/happytianhao/TaDe.

CVNov 23, 2020Code
Re-identification = Retrieval + Verification: Back to Essence and Forward with a New Metric

Zheng Wang, Xin Yuan, Toshihiko Yamasaki et al.

Re-identification (re-ID) is currently investigated as a closed-world image retrieval task, and evaluated by retrieval based metrics. The algorithms return ranking lists to users, but cannot tell which images are the true target. In essence, current re-ID overemphasizes the importance of retrieval but underemphasizes that of verification, \textit{i.e.}, all returned images are considered as the target. On the other hand, re-ID should also include the scenario that the query identity does not appear in the gallery. To this end, we go back to the essence of re-ID, \textit{i.e.}, a combination of retrieval and verification in an open-set setting, and put forward a new metric, namely, Genuine Open-set re-ID Metric (GOM). GOM explicitly balances the effect of performing retrieval and verification into a single unified metric. It can also be decomposed into a family of sub-metrics, enabling a clear analysis of re-ID performance. We evaluate the effectiveness of GOM on the re-ID benchmarks, showing its ability to capture important aspects of re-ID performance that have not been taken into account by established metrics so far. Furthermore, we show GOM scores excellent in aligning with human visual evaluation of re-ID performance. Related codes are available at https://github.com/YuanXinCherry/Person-reID-Evaluation

53.6CVMar 28
RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation

Yiyang Zou, Tianhao Zhao, Peilun Xiao et al.

Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated "anomaly onset" frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose RiskProp, a novel collision-anchored self-supervised risk propagation paradigm for early accident anticipation, which removes the need for anomaly onset annotations and leverages only the reliably annotated collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model's next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.

CVOct 25, 2025
Accident Anticipation via Temporal Occurrence Prediction

Tianhao Zhao, Yiyang Zou, Zihao Mao et al.

Accident anticipation aims to predict potential collisions in an online manner, enabling timely alerts to enhance road safety. Existing methods typically predict frame-level risk scores as indicators of hazard. However, these approaches rely on ambiguous binary supervision (labeling all frames in accident videos as positive) despite the fact that risk varies continuously over time, leading to unreliable learning and false alarms. To address this, we propose a novel paradigm that shifts the prediction target from current-frame risk scoring to directly estimating accident scores at multiple future time steps (e.g., 0.1s-2.0s ahead), leveraging precisely annotated accident timestamps as supervision. Our method employs a snippet-level encoder to jointly model spatial and temporal dynamics, and a Transformer-based temporal decoder that predicts accident scores for all future horizons simultaneously using dedicated temporal queries. Furthermore, we introduce a refined evaluation protocol that reports Time-to-Accident (TTA) and recall (evaluated at multiple pre-accident intervals (0.5s, 1.0s, and 1.5s)) only when the false alarm rate (FAR) remains within an acceptable range, ensuring practical relevance. Experiments show that our method achieves superior performance in both recall and TTA under realistic FAR constraints.

CVSep 21, 2025
Penalizing Boundary Activation for Object Completeness in Diffusion Models

Haoyang Xu, Tianhao Zhao, Sibei Yang et al.

Diffusion models have emerged as a powerful technique for text-to-image (T2I) generation, creating high-quality, diverse images across various domains. However, a common limitation in these models is the incomplete display of objects, where fragments or missing parts undermine the model's performance in downstream applications. In this study, we conduct an in-depth analysis of the incompleteness issue and reveal that the primary factor behind incomplete object generation is the usage of RandomCrop during model training. This widely used data augmentation method, though enhances model generalization ability, disrupts object continuity during training. To address this, we propose a training-free solution that penalizes activation values at image boundaries during the early denoising steps. Our method is easily applicable to pre-trained Stable Diffusion models with minimal modifications and negligible computational overhead. Extensive experiments demonstrate the effectiveness of our method, showing substantial improvements in object integrity and image quality.

CVSep 15, 2025
IS-Diff: Improving Diffusion-Based Inpainting with Better Initial Seed

Yongzhe Lyu, Yu Wu, Yutian Lin et al.

Diffusion models have shown promising results in free-form inpainting. Recent studies based on refined diffusion samplers or novel architectural designs led to realistic results and high data consistency. However, random initialization seed (noise) adopted in vanilla diffusion process may introduce mismatched semantic information in masked regions, leading to biased inpainting results, e.g., low consistency and low coherence with the other unmasked area. To address this issue, we propose the Initial Seed refined Diffusion Model (IS-Diff), a completely training-free approach incorporating distributional harmonious seeds to produce harmonious results. Specifically, IS-Diff employs initial seeds sampled from unmasked areas to imitate the masked data distribution, thereby setting a promising direction for the diffusion procedure. Moreover, a dynamic selective refinement mechanism is proposed to detect severe unharmonious inpaintings in intermediate latent and adjust the strength of our initialization prior dynamically. We validate our method on both standard and large-mask inpainting tasks using the CelebA-HQ, ImageNet, and Places2 datasets, demonstrating its effectiveness across all metrics compared to state-of-the-art inpainting methods.

LGAug 14, 2025
Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

Xinyi Li, Sai Wang, Yutian Lin et al.

Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on static pattern-matching paradigm, which limits their ability to perform effective logic decision-making, leading to black-box decision-making. Building on this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models perform shallow reasoning to construct high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions and corresponding interpretable reasoning path, and (3) reinforcement learning optimizing interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.

CVJul 3, 2025
Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Feizhen Huang, Yu Wu, Yutian Lin et al.

Video-to-Audio (V2A) Generation achieves significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

CVJun 18, 2024
MAC: A Benchmark for Multiple Attributes Compositional Zero-Shot Learning

Shuo Xu, Sai Wang, Xinyue Hu et al.

Compositional Zero-Shot Learning (CZSL) aims to learn semantic primitives (attributes and objects) from seen compositions and recognize unseen attribute-object compositions. Existing CZSL datasets focus on single attributes, neglecting the fact that objects naturally exhibit multiple interrelated attributes. Their narrow attribute scope and single attribute labeling introduce annotation biases, misleading the learning of attributes and causing inaccurate evaluation. To address these issues, we introduce the Multi-Attribute Composition (MAC) dataset, encompassing 22,838 images and 17,627 compositions with comprehensive and representative attribute annotations. MAC shows complex relationship between attributes and objects, with each attribute type linked to an average of 82.2 object types, and each object type associated with 31.4 attribute types. Based on MAC, we propose multi-attribute compositional zero-shot learning that requires deeper semantic understanding and advanced attribute associations, establishing a more realistic and challenging benchmark for CZSL. We also propose Multi-attribute Visual-Primitive Integrator (MVP-Integrator), a robust baseline for multi-attribute CZSL, which disentangles semantic primitives and performs effective visual-primitive association. Experimental results demonstrate that MVP-Integrator significantly outperforms existing CZSL methods on MAC with improved inference efficiency.

LGJan 21, 2024
Visual Imitation Learning with Calibrated Contrastive Representation

Yunke Wang, Linwei Tao, Bo Du et al.

Adversarial Imitation Learning (AIL) allows the agent to reproduce expert behavior with low-dimensional states and actions. However, challenges arise in handling visual states due to their less distinguishable representation compared to low-dimensional proprioceptive features. While existing methods resort to adopt complex network architectures or separate the process of learning representation and decision-making, they overlook valuable intra-agent information within demonstrations. To address this problem, this paper proposes a simple and effective solution by incorporating calibrated contrastive representative learning into visual AIL framework. Specifically, we present an image encoder in visual AIL, utilizing a combination of unsupervised and supervised contrastive learning to extract valuable features from visual states. Based on the fact that the improved agent often produces demonstrations of varying quality, we propose to calibrate the contrastive loss by treating each agent demonstrations as a mixed sample. The incorporation of contrastive learning can be jointly optimized with the AIL framework, without modifying the architecture or incurring significant computational costs. Experimental results on DMControl Suite demonstrate our proposed method is sample efficient and can outperform other compared methods from different aspects.

CVNov 4, 2021
MixSiam: A Mixture-based Approach to Self-supervised Representation Learning

Xiaoyang Guo, Tianhao Zhao, Yutian Lin et al.

Recently contrastive learning has shown significant progress in learning visual representations from unlabeled data. The core idea is training the backbone to be invariant to different augmentations of an instance. While most methods only maximize the feature similarity between two augmented data, we further generate more challenging training samples and force the model to keep predicting discriminative representation on these hard samples. In this paper, we propose MixSiam, a mixture-based approach upon the traditional siamese network. On the one hand, we input two augmented images of an instance to the backbone and obtain the discriminative representation by performing an element-wise maximum of two features. On the other hand, we take the mixture of these augmented images as input, and expect the model prediction to be close to the discriminative representation. In this way, the model could access more variant data samples of an instance and keep predicting invariant discriminative representations for them. Thus the learned model is more robust compared to previous contrastive learning methods. Extensive experiments on large-scale datasets show that MixSiam steadily improves the baseline and achieves competitive results with state-of-the-art methods. Our code will be released soon.

CVAug 16, 2021
Unsupervised Person Re-identification with Stochastic Training Strategy

Tianyang Liu, Yutian Lin, Bo Du

Unsupervised person re-identification (re-ID) has attracted increasing research interests because of its scalability and possibility for real-world applications. State-of-the-art unsupervised re-ID methods usually follow a clustering-based strategy, which generates pseudo labels by clustering and maintains a memory to store instance features and represent the centroid of the clusters for contrastive learning. This approach suffers two problems. First, the centroid generated by unsupervised learning may not be a perfect prototype. Forcing images to get closer to the centroid emphasizes the result of clustering, which could accumulate clustering errors during iterations. Second, previous methods utilize features obtained at different training iterations to represent one centroid, which is not consistent with the current training sample, since the features are not directly comparable. To this end, we propose an unsupervised re-ID approach with a stochastic learning strategy. Specifically, we adopt a stochastic updated memory, where a random instance from a cluster is used to update the cluster-level memory for contrastive learning. In this way, the relationship between randomly selected pair of images are learned to avoid the training bias caused by unreliable pseudo labels. The stochastic memory is also always up-to-date for classifying to keep the consistency. Besides, to relieve the issue of camera variance, a unified distance matrix is proposed during clustering, where the distance bias from different camera domain is reduced and the variances of identities is emphasized.

CVApr 7, 2020
Unsupervised Person Re-identification via Softened Similarity Learning

Yutian Lin, Lingxi Xie, Yu Wu et al.

Person re-identification (re-ID) is an important topic in computer vision. This paper studies the unsupervised setting of re-ID, which does not require any labeled information and thus is freely deployed to new scenarios. There are very few studies under this setting, and one of the best approach till now used iterative clustering and classification, so that unlabeled images are clustered into pseudo classes for a classifier to get trained, and the updated features are used for clustering and so on. This approach suffers two problems, namely, the difficulty of determining the number of clusters, and the hard quantization loss in clustering. In this paper, we follow the iterative training mechanism but discard clustering, since it incurs loss from hard quantization, yet its only product, image-level similarity, can be easily replaced by pairwise computation and a softened classification task. With these improvements, our approach becomes more elegant and is more robust to hyper-parameter changes. Experiments on two image-based and video-based datasets demonstrate state-of-the-art performance under the unsupervised re-ID setting.

CVMar 21, 2017
Improving Person Re-identification by Attribute and Identity Learning

Yutian Lin, Liang Zheng, Zhedong Zheng et al.

Person re-identification (re-ID) and attribute recognition share a common target at learning pedestrian descriptions. Their difference consists in the granularity. Most existing re-ID methods only take identity labels of pedestrians into consideration. However, we find the attributes, containing detailed local descriptions, are beneficial in allowing the re-ID model to learn more discriminative feature representations. In this paper, based on the complementarity of attribute labels and ID labels, we propose an attribute-person recognition (APR) network, a multi-task network which learns a re-ID embedding and at the same time predicts pedestrian attributes. We manually annotate attribute labels for two large-scale re-ID datasets, and systematically investigate how person re-ID and attribute recognition benefit from each other. In addition, we re-weight the attribute predictions considering the dependencies and correlations among the attributes. The experimental results on two large-scale re-ID benchmarks demonstrate that by learning a more discriminative representation, APR achieves competitive re-ID performance compared with the state-of-the-art methods. We use APR to speed up the retrieval process by ten times with a minor accuracy drop of 2.92% on Market-1501. Besides, we also apply APR on the attribute recognition task and demonstrate improvement over the baselines.