Jie Du

CV
h-index1
4papers
27citations
Novelty54%
AI Score36

4 Papers

IVMay 2, 2022
Unsupervised Denoising of Optical Coherence Tomography Images with Dual_Merged CycleWGAN

Jie Du, Xujian Yang, Kecheng Jin et al.

Nosie is an important cause of low quality Optical coherence tomography (OCT) image. The neural network model based on Convolutional neural networks(CNNs) has demonstrated its excellent performance in image denoising. However, OCT image denoising still faces great challenges because many previous neural network algorithms required a large number of labeled data, which might cost much time or is expensive. Besides, these CNN-based algorithms need numerous parameters and good tuning techniques, which is hardware resources consuming. To solved above problems, We proposed a new Cycle-Consistent Generative Adversarial Nets called Dual-Merged Cycle-WGAN for retinal OCT image denoiseing, which has remarkable performance with less unlabeled traning data. Our model consists of two Cycle-GAN networks with imporved generator, descriminator and wasserstein loss to achieve good training stability and better performance. Using image merge technique between two Cycle-GAN networks, our model could obtain more detailed information and hence better training effect. The effectiveness and generality of our proposed network has been proved via ablation experiments and comparative experiments. Compared with other state-of-the-art methods, our unsupervised method obtains best subjective visual effect and higher evaluation objective indicators.

CVNov 3, 2025
Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

Jie Du, Xinyu Gong, Qingshan Tan et al.

Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce a GT-Pair that automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO loss to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.

CVDec 3, 2021
Make A Long Image Short: Adaptive Token Length for Vision Transformers

Yichen Zhu, Yuqin Zhu, Jie Du et al.

The vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance but considerably increased computational cost. Motivated by the proverb "A picture is worth a thousand words" we aim to accelerate the ViT model by making a long image short. To this end, we propose a novel approach to assign token length adaptively during inference. Specifically, we first train a ViT model, called Resizable-ViT (ReViT), that can process any given input with diverse token lengths. Then, we retrieve the "token-length label" from ReViT and use it to train a lightweight Token-Length Assigner (TLA). The token-length labels are the smallest number of tokens to split an image that the ReViT can make the correct prediction, and TLA is learned to allocate the optimal token length based on these labels. The TLA enables the ReViT to process the image with the minimum sufficient number of tokens during inference. Thus, the inference speed is boosted by reducing the token numbers in the ViT model. Our approach is general and compatible with modern vision transformer architectures and can significantly reduce computational expanse. We verified the effectiveness of our methods on multiple representative ViT models (DeiT, LV-ViT, and TimesFormer) across two tasks (image classification and action recognition).

LGDec 1, 2021
Training BatchNorm Only in Neural Architecture Search and Beyond

Yichen Zhu, Jie Du, Yuqin Zhu et al.

This work investigates the usage of batch normalization in neural architecture search (NAS). Specifically, Frankle et al. find that training BatchNorm only can achieve nontrivial performance. Furthermore, Chen et al. claim that training BatchNorm only can speed up the training of the one-shot NAS supernet over ten times. Critically, there is no effort to understand 1) why training BatchNorm only can find the perform-well architectures with the reduced supernet-training time, and 2) what is the difference between the train-BN-only supernet and the standard-train supernet. We begin by showing that the train-BN-only networks converge to the neural tangent kernel regime, obtain the same training dynamics as train all parameters theoretically. Our proof supports the claim to train BatchNorm only on supernet with less training time. Then, we empirically disclose that train-BN-only supernet provides an advantage on convolutions over other operators, cause unfair competition between architectures. This is due to only the convolution operator being attached with BatchNorm. Through experiments, we show that such unfairness makes the search algorithm prone to select models with convolutions. To solve this issue, we introduce fairness in the search space by placing a BatchNorm layer on every operator. However, we observe that the performance predictor in Chen et al. is inapplicable on the new search space. To this end, we propose a novel composite performance indicator to evaluate networks from three perspectives: expressivity, trainability, and uncertainty, derived from the theoretical property of BatchNorm. We demonstrate the effectiveness of our approach on multiple NAS-benchmarks (NAS-Bench101, NAS-Bench-201) and search spaces (DARTS search space and MobileNet search space).