88.0 GR May 23
AnySurf: Any Surface Generation with Directed Edge
Wenda Shi, Chenyuan Pan, Dengming Zhang et al.
Open surface components prevail in real industrial 3D content and support rendering, physical simulation, and geometric editing. Garments are a typical open-surface type, and many existing generation methods rely on sewing patterns, generating 2D panels and stitching them into 3D shapes. Such domain-specific designs scale poorly and do not generalize to shoes and accessories. Common field-based 3D generators prioritize watertight meshes and tend to create flawed double-layer structures on open surfaces. Although Trellis2 adopts a field-free representation, its open-surface results still contain normal and topology errors. We present AnySurf, a unified framework that generates open, closed, and hybrid 3D surfaces with accurate face orientation. Built on a directed-edge enhanced Flexible Dual Grid (FDG-D), our representation retains normal direction information via oriented grid edges. We also propose ROS-FT post-training and a lightweight DE-Adapter with only 1% extra parameters, which facilitate directed-edge learning while preserving the original generation performance. We further construct the Outfit3D dataset, which contains industrial garments and closed accessories. Our work recasts garment modeling as a universal 3D generation task. Experimental results demonstrate superior mesh quality and better practicality for downstream applications.
CV Nov 15, 2025
Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound
Dengming Zhang, Weitao You, Jingxiong Li et al.
Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized audio-video clips, enabling hearing without a large audio dataset. In Stage 2, a lightweight Cross-Modal Emotion Adapter (EmoAdapter), composed of the Emotion Enhancer and the Emotion Supervisor, injects emotion-sensitive residuals and applies emotion supervision to enhance cross-modal emotion understanding. We also construct ArtEmoBenchmark, an art-centric emotion benchmark that evaluates content and emotion understanding under audio-only, visual-only, and audio-visual inputs. VAEmotionLLM achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show that the proposed components are complementary.
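As a rough illustration of the Stage 1 idea (aligning the next-token distributions produced through the audio pathway with those produced through the frozen visual pathway), the sketch below computes a per-position divergence between two sets of logits. It assumes KL divergence as the alignment measure, which the abstract does not specify; the function names `log_softmax` and `next_token_kl` and the toy logits are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_kl(teacher_logits, student_logits):
    """Mean KL(teacher || student) over sequence positions.

    teacher_logits: rows of logits from the frozen (visual) pathway.
    student_logits: rows of logits from the trainable (audio) pathway.
    KL is an assumed choice of alignment measure, used here for illustration.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return total / len(teacher_logits)

# Toy logits: 2 positions, vocabulary of 3 tokens (made-up numbers).
visual_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
audio_logits = [[1.0, 1.0, 0.0], [0.3, 0.2, 0.1]]
loss = next_token_kl(visual_logits, audio_logits)
```

In a training loop under this assumption, such a loss would be minimized with respect to the audio pathway only, with the visual pathway and its logits held fixed.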
CV Nov 28, 2024
FonTS: Text Rendering with Typography and Style Controls
Wenda Shi, Yiren Song, Dengming Zhang et al.
Visual text rendering is widespread in real-world applications and requires careful font selection and typographic choices. Recent progress in diffusion transformer (DiT)-based text-to-image (T2I) models shows promise in automating these processes. However, these methods still encounter challenges such as inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline that addresses these problems by enhancing controllability over typography and style in text rendering. We introduce typography control fine-tuning (TC-FT), a parameter-efficient fine-tuning method (on $5\%$ key parameters) with enclosing typography control tokens (ETC-tokens), which enables precise word-level application of typographic features. To further address style inconsistency in text rendering, we propose a text-agnostic style control adapter (SCA) that prevents content leakage while enhancing style consistency. To implement TC-FT and SCA effectively, we incorporate HTML-render into the data synthesis pipeline and propose the first word-level controllable dataset. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in text rendering tasks. The datasets and models will be available for academic use.
CV Jun 26, 2025
WordCon: Word-level Typography Control in Scene Text Rendering
Wenda Shi, Yiren Song, Zihan Rao et al.
Achieving precise word-level typography control within generated images remains a persistent challenge. To address it, we construct a new word-level controlled scene text dataset and introduce the Text-Image Alignment (TIA) framework. This framework leverages the cross-modal correspondence between text and local image regions provided by grounding models to enhance Text-to-Image (T2I) model training. Furthermore, we propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes selected key parameters, improving both efficiency and portability. This allows seamless integration into diverse pipelines, including artistic text rendering, text editing, and image-conditioned text rendering. To further enhance controllability, a masked loss at the latent level guides the model to concentrate on learning the text region in the image, and a joint-attention loss provides feature-level supervision to promote disentanglement between different words. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art. The datasets and source code will be available for academic use.
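The latent-level masked loss mentioned above can be pictured as a reconstruction error that only counts positions inside the text region. The sketch below is a minimal illustration under assumed details: it uses a mean squared error and a binary mask over a 2D latent grid, neither of which is specified in the abstract, and `masked_mse` is a hypothetical name.

```python
def masked_mse(pred, target, mask):
    """Mean squared error restricted to positions where mask is nonzero.

    pred, target: 2D lists of latent values with the same shape.
    mask: 2D list of 0/1 weights marking the text region (assumed binary).
    Returns 0.0 when the mask selects no positions.
    """
    weighted_error = 0.0
    weight_sum = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            weighted_error += m * (p - t) ** 2
            weight_sum += m
    return weighted_error / weight_sum if weight_sum > 0 else 0.0

# Toy 2x2 latent grid (made-up numbers); only the top-right cell is "text".
pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 0.0], [0.0, 4.0]]
text_mask = [[0, 1], [0, 0]]
loss = masked_mse(pred, target, text_mask)
```

Errors outside the masked region contribute nothing here, which is the sense in which such a loss focuses learning on the text area.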