Jen-Yuan Huang

CV
h-index26
4papers
3citations
Novelty51%
AI Score42

4 Papers

CVApr 20
Long-Text-to-Image Generation via Compositional Prompt Decomposition

Jen-Yuan Huang, Tong Lin, Yilun Du

While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions that shape their training distributions. Existing methods attempt to bridge this gap by either fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths; or by projecting the oversize inputs into normal-prompt space and compromising fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes independent noise predictions for each component, and their outputs are merged into a single denoising step using energy-based conjunction. We evaluate PRISM across a wide range of model architectures, showing comparable performances to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens in a challenging public benchmark.

CVFeb 23, 2025Code
Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang et al.

Despite significant progress on popular multimodal benchmarks, state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle with basic visual reasoning tasks that are trivially solved by humans, such as recognizing spatial relationships. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. These subtests span four core domains of human visual cognition: (1) Visualization and Spatial Processing, (2) Perceptual and Closure, (3) Memory, and (4) Reasoning. We evaluate 20 frontier MLLMs from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that current MLLM performance gains on high-level benchmarks do not reflect human-like low-level visual cognition, challenging the assumption that large-scale pretraining naturally induces gestalt-like perceptual capabilities. The dataset and evaluation toolkit are publicly available at: https://github.com/CUHK-ARISE/VisFactor.

CVMar 26, 2025Code
Dynamic Pyramid Network for Efficient Multimodal Large Language Model

Hao Ai, Kunyi Wang, Zezhou Wang et al.

Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit the real-world application. To address this issue, recent efforts aim to compress the visual features to save the computational costs of MLLMs. However, direct visual compression methods, e.g. efficient projectors, inevitably destroy the visual semantics in MLLM, especially in difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples will be assigned larger computations, thus preserving the model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Besides, the generalization ability of DPN is also validated on the existing high-resolution MLLM called LLaVA-HR. The source code will be released at https://github.com/aihao2000/DPN-LLaVA.

CVDec 14, 2024
Optimizing Few-Step Sampler for Diffusion Probabilistic Model

Jen-Yuan Huang

Diffusion Probabilistic Models (DPMs) have demonstrated exceptional capability of generating high-quality and diverse images, but their practical application is hindered by the intensive computational cost during inference. The DPM generation process requires solving a Probability-Flow Ordinary Differential Equation (PF-ODE), which involves discretizing the integration domain into intervals for numerical approximation. This corresponds to the sampling schedule of a diffusion ODE solver, and we notice the solution from a first-order solver can be expressed as a convex combination of model outputs at all scheduled time-steps. We derive an upper bound for the discretization error of the sampling schedule, which can be efficiently optimized with Monte-Carlo estimation. Building on these theoretical results, we purpose a two-phase alternating optimization algorithm. In Phase-1, the sampling schedule is optimized for the pre-trained DPM; in Phase-2, the DPM further tuned on the selected time-steps. Experiments on a pre-trained DPM for ImageNet64 dataset demonstrate the purposed method consistently improves the baseline across various number of sampling steps.