Zhengyang Yu

CV
h-index: 6
9 papers
46 citations
Novelty: 47%
AI Score: 44

9 Papers

CV | Oct 18, 2022
Sequence and Circle: Exploring the Relationship Between Patches

Zhengyang Yu, Jochen Triesch

The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It utilizes a learnable position embedding (PE) mechanism to encode the location of each image patch. However, it is presently unclear if this learnable PE is really necessary and what its benefits are. This paper explores two alternative ways of encoding the location of individual patches that exploit prior knowledge about their spatial arrangement. One is called the sequence relationship embedding (SRE), and the other is called the circle relationship embedding (CRE). The SRE treats all patches as an ordered sequence in which adjacent patches are separated by the same distance. The CRE treats the central patch as the center of a circle and measures the distance of the remaining patches from the center based on the four-neighborhood principle, so that concentric circles with different radii group different patches. Finally, we implemented these two relations on three classic ViTs and tested them on four popular datasets. Experiments show that SRE and CRE can replace PE to reduce the randomly initialized learnable parameters while achieving the same performance. Combining SRE or CRE with PE performs better than using PE alone.
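As a rough illustration of the two relations described in this abstract, the sketch below computes a per-patch index for a patch grid. It assumes that the "four-neighborhood principle" corresponds to Manhattan distance from the central patch and that the sequence order is row-major; both are assumptions, as are the function names. How the paper turns these indices into embeddings is not stated in the abstract and is not shown here.

```python
def sequence_relation(grid_h, grid_w):
    """Row-major position of each patch; neighbors in the sequence differ by 1.

    Assumption: row-major ordering (the abstract only says patches are 'in order').
    """
    return [[r * grid_w + c for c in range(grid_w)] for r in range(grid_h)]


def circle_relation(grid_h, grid_w):
    """Distance of each patch from the central patch.

    Assumption: four-neighborhood distance is taken to be Manhattan distance,
    so patches with equal values lie on the same 'circle'.
    """
    center_r, center_c = grid_h // 2, grid_w // 2
    return [
        [abs(r - center_r) + abs(c - center_c) for c in range(grid_w)]
        for r in range(grid_h)
    ]


if __name__ == "__main__":
    for row in circle_relation(3, 3):
        print(row)
```

Neither function has learnable parameters, which is in line with the abstract's point that such fixed relations can stand in for a learned PE.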

CL | Jul 25, 2024
Are Large Language Models Possible to Conduct Cognitive Behavioral Therapy?

Hao Shen, Zihan Li, Minqiang Yang et al.

In contemporary society, the issue of psychological health has become increasingly prominent, characterized by the diversification, complexity, and universality of mental disorders. Cognitive Behavioral Therapy (CBT), currently the most influential and clinically effective psychological treatment method with no side effects, has limited coverage and poor quality in most countries. In recent years, research on the recognition and intervention of emotional disorders using large language models (LLMs) has been validated, providing new possibilities for psychological assistance therapy. However, can LLMs truly conduct cognitive behavioral therapy? Many concerns have been raised by mental health experts regarding the use of LLMs for therapy. Seeking to answer this question, we collected a real CBT corpus from online video websites, and designed and applied a targeted automatic evaluation framework covering the emotion tendency of generated text, structured dialogue pattern, and proactive inquiry ability. For emotion tendency, we calculate the emotion tendency score of the CBT dialogue text generated by each model. For structured dialogue pattern, we use a diverse range of automatic evaluation metrics to compare speaking style, the ability to maintain consistency of topic, and the use of technology in CBT between different models. As for inquiring to guide the patient, we use the PQA (Proactive Questioning Ability) metric. We also evaluated the CBT ability of the LLM after integrating a CBT knowledge base, to explore how introducing additional knowledge helps enhance the model's CBT counseling ability. Four LLM variants with excellent performance on natural language processing are evaluated, and the experimental results show the great potential of LLMs in the psychological counseling realm, especially after combining with other technological means.

CV | Nov 12, 2023
IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models

Zhaoyuan Yang, Zhengyang Yu, Zhiwei Xu et al.

We present a diffusion-based image morphing approach with perceptually-uniform sampling (IMPUS) that produces smooth, direct and realistic interpolations given an image pair. The embeddings of two images may lie on distinct conditioned distributions of a latent diffusion model, especially when they have significant semantic difference. To bridge this gap, we interpolate in the locally linear and continuous text embedding space and Gaussian latent space. We first optimize the endpoint text embeddings and then map the images to the latent space using a probability flow ODE. Unlike existing work that takes an indirect morphing path, we show that the model adaptation yields a direct path and suppresses ghosting artifacts in the interpolated images. To achieve this, we propose a heuristic bottleneck constraint based on a novel relative perceptual path diversity score that automatically controls the bottleneck size and balances the diversity along the path with its directness. We also propose a perceptually-uniform sampling technique that enables visually smooth changes between the interpolated images. Extensive experiments validate that our IMPUS can achieve smooth, direct, and realistic image morphing and is adaptable to several other generative tasks.

CV | Dec 12, 2025
AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path

Zhengyang Yu, Akio Hayakawa, Masato Ishii et al.

Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.

CV | Nov 27, 2025 | Code
TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning

Qingtao Yu, Changlin Song, Minghao Sun et al.

A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the ranking of rewards predicted for intermediate estimates is often inconsistent with that predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noisy images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization. Code: https://github.com/TerrysLearning/TTSnap/.
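The search-with-pruning idea in this abstract can be sketched generically as follows. This is not the TTSnap implementation: `prune_search`, its parameters, the checkpoint schedule, and the keep fraction are all hypothetical, and the denoiser and reward model are passed in as plain callables so that the sketch stays self-contained. It only shows why pruning at intermediate steps spends fewer denoising steps than fully denoising every seed.

```python
def prune_search(seeds, init, step_fn, reward_fn, total_steps, checkpoints,
                 keep_fraction=0.5):
    """Advance all candidates step by step, dropping low scorers at checkpoints.

    init(seed) -> state, step_fn(state, t) -> state, reward_fn(state, t) -> float.
    Returns the best surviving seed and the number of denoising steps spent.
    All names and the pruning schedule are illustrative assumptions.
    """
    alive = {seed: init(seed) for seed in seeds}
    steps_used = 0
    for t in range(total_steps):
        for seed in alive:
            alive[seed] = step_fn(alive[seed], t)
            steps_used += 1
        if t in checkpoints and len(alive) > 1:
            # Rank partially denoised candidates with the (noise-aware) reward.
            ranked = sorted(alive, key=lambda s: reward_fn(alive[s], t),
                            reverse=True)
            keep = max(1, int(len(ranked) * keep_fraction))
            alive = {s: alive[s] for s in ranked[:keep]}
    best = max(alive, key=lambda s: reward_fn(alive[s], total_steps - 1))
    return best, steps_used


if __name__ == "__main__":
    # Toy stand-ins: each seed has a fixed hidden quality, denoising is a no-op,
    # and the reward reads the quality directly (a perfectly aligned reward).
    best, used = prune_search(
        seeds=range(8),
        init=lambda s: float((s * 37) % 11),
        step_fn=lambda x, t: x,
        reward_fn=lambda x, t: x,
        total_steps=4,
        checkpoints={0, 1},
    )
    print(best, used)
```

In this toy setting the reward at intermediate steps ranks candidates exactly as the final reward would; the abstract's point is that real reward models do not, which is what the noise-aware training is meant to address.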

CV | Oct 15, 2024
DreamSteerer: Enhancing Source Image Conditioned Editability using Personalized Diffusion Models

Zhengyang Yu, Zhaoyuan Yang, Jing Zhang

Recent text-to-image personalization methods have shown great promise in teaching a diffusion model user-specified concepts given a few images for reusing the acquired concepts in a novel context. With massive efforts being dedicated to personalized generation, a promising extension is personalized editing, namely to edit an image using personalized concepts, which can provide a more precise guidance signal than traditional textual guidance. A straightforward solution is to combine a personalized diffusion model with a text-driven editing framework. However, such a solution often shows unsatisfactory editability on the source image. To address this, we propose DreamSteerer, a plug-in method for augmenting existing T2I personalization methods. Specifically, we enhance the source image conditioned editability of a personalized diffusion model via a novel Editability Driven Score Distillation (EDSD) objective. Moreover, we identify a mode trapping issue with EDSD, and propose a mode shifting regularization with spatial feature guided sampling to avoid such an issue. We further employ two key modifications to the Delta Denoising Score framework that enable high-fidelity local editing with personalized concepts. Extensive experiments validate that DreamSteerer can significantly improve the editability of several T2I personalization baselines while being computationally efficient.

CV | Nov 4, 2024
Toddlers' Active Gaze Behavior Supports Self-Supervised Object Learning

Zhengyang Yu, Arthur Aubret, Marcel C. Raabe et al.

Toddlers learn to recognize objects from different viewpoints with almost no supervision. During this learning, they execute frequent eye and head movements that shape their visual experience. It is presently unclear if and how these behaviors contribute to toddlers' emerging object recognition abilities. To answer this question, we here combine head-mounted eye tracking during dyadic play with unsupervised machine learning. We approximate toddlers' central visual field experience by cropping image regions from a head-mounted camera centered on the current gaze location estimated via eye tracking. This visual stream feeds an unsupervised computational model of toddlers' learning, which constructs visual representations that slowly change over time. Our experiments demonstrate that toddlers' gaze strategy supports the learning of invariant object representations. Our analysis also shows that the limited size of the central visual field where acuity is high is crucial for this. Overall, our work reveals how toddlers' gaze behavior may support their development of view-invariant object recognition.
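The gaze-centered cropping step mentioned in this abstract might look roughly like the sketch below. The function name, the square window, and the choice to clamp the window at the image border are assumptions made for illustration; the abstract does not say how out-of-frame gaze positions are handled.

```python
def gaze_crop(image, gaze_row, gaze_col, size):
    """Return a size x size window of `image` centered on the gaze location.

    `image` is a list of rows. Assumptions: the window is square, size does not
    exceed the image dimensions, and the window is shifted inward (clamped)
    when the gaze is too close to the border.
    """
    height, width = len(image), len(image[0])
    half = size // 2
    top = min(max(gaze_row - half, 0), height - size)
    left = min(max(gaze_col - half, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]


if __name__ == "__main__":
    frame = [[r * 5 + c for c in range(5)] for r in range(5)]
    for row in gaze_crop(frame, 2, 2, 3):
        print(row)
```

Varying `size` is one simple way to probe the abstract's claim that the limited extent of the central visual field matters.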

CV | Sep 19, 2025
Simulated Cortical Magnification Supports Self-Supervised Object Learning

Zhengyang Yu, Arthur Aubret, Chen Yu et al.

Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. However, these models ignore the foveated nature of human vision with high/low resolution in the center/periphery of the visual field. Here, we investigate the role of this varying resolution in the development of object representations. We leverage two datasets of egocentric videos that capture the visual experience of humans during interactions with objects. We apply models of human foveation and cortical magnification to modify these inputs, such that the visual content becomes less distinct towards the periphery. The resulting sequences are used to train two bio-inspired self-supervised learning models that implement a time-based learning objective. Our results show that modeling aspects of foveated vision improves the quality of the learned object representations in this setting. Our analysis suggests that this improvement comes from making objects appear bigger and inducing a better trade-off between central and peripheral visual information. Overall, this work takes a step towards making models of humans' learning of visual representations more realistic and performant.

CV | Dec 7, 2020
Self-supervised asymmetric deep hashing with margin-scalable constraint

Zhengyang Yu, Song Wu, Zhihao Dou et al.

Due to their effectiveness and efficiency, deep hashing approaches are widely used for large-scale visual search. However, it is still challenging to produce compact and discriminative hash codes for images associated with multiple semantics, for two main reasons: 1) the similarity constraints designed in most existing methods are based upon an oversimplified similarity assignment (i.e., 0 for instance pairs sharing no label, 1 for instance pairs sharing at least 1 label); 2) the exploration of multi-semantic relevance is insufficient or even neglected in many existing methods. These problems significantly limit the discrimination of generated hash codes. In this paper, we propose a novel self-supervised asymmetric deep hashing method with a margin-scalable constraint (SADH) to cope with these problems. SADH implements a self-supervised network to sufficiently preserve semantic information in a semantic feature dictionary and a semantic code dictionary for the semantics of the given dataset, which efficiently and precisely guides a feature learning network to preserve multilabel semantic information using an asymmetric learning strategy. By further exploiting semantic dictionaries, a new margin-scalable constraint is employed for both precise similarity searching and robust hash code generation. Extensive empirical research on four popular benchmarks validates the proposed method and shows it outperforms several state-of-the-art approaches.
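The "oversimplified similarity assignment" this abstract criticizes can be written down in a few lines. The sketch below shows the binary rule exactly as the abstract states it, next to a Jaccard overlap that is included only as a contrast to show what the binary rule discards; the Jaccard measure is an illustrative assumption and is not the margin-scalable constraint proposed in the paper.

```python
def binary_similarity(labels_a, labels_b):
    """1 if the two instances share at least one label, else 0."""
    return 1 if set(labels_a) & set(labels_b) else 0


def jaccard_similarity(labels_a, labels_b):
    """Fraction of shared labels (illustrative contrast, not the paper's method)."""
    a, b = set(labels_a), set(labels_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    x = {"beach", "sunset", "people"}
    y = {"beach"}
    z = {"beach", "sunset", "people"}
    # The binary rule cannot distinguish a partial match from a full match.
    print(binary_similarity(x, y), binary_similarity(x, z))
    print(jaccard_similarity(x, y), jaccard_similarity(x, z))
```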