CVMar 28, 2023
StarNet: Style-Aware 3D Point Cloud GenerationYunfan Zhang, Hao Wang, Guosheng Lin et al.
This paper investigates an open research task of reconstructing and generating 3D point clouds. Most existing works of 3D generative models directly take the Gaussian prior as input for the decoder to generate 3D point clouds, which fail to learn disentangled latent codes, leading noisy interpolated results. Most of the GAN-based models fail to discriminate the local geometries, resulting in the point clouds generated not evenly distributed at the object surface, hence degrading the point cloud generation quality. Moreover, prevailing methods adopt computation-intensive frameworks, such as flow-based models and Markov chains, which take plenty of time and resources in the training phase. To resolve these limitations, this paper proposes a unified style-aware network architecture combining both point-wise distance loss and adversarial loss, StarNet which is able to reconstruct and generate high-fidelity and even 3D point clouds using a mapping network that can effectively disentangle the Gaussian prior from input's high-level attributes in the mapped latent space to generate realistic interpolated objects. Experimental results demonstrate that our framework achieves comparable state-of-the-art performance on various metrics in the point cloud reconstruction and generation tasks, but is more lightweight in model size, requires much fewer parameters and less time for model training.
NCMar 21
Hierarchical Multiscale Structure-Function Coupling for Brain Connectome IntegrationJianwei Chen, Zhengyang Miao, Wenjie Cai et al.
Integrating structural and functional connectomes remains challenging because their relationship is non-linear and organized over nested modular hierarchies. We propose a hierarchical multiscale structure-function coupling framework for connectome integration that jointly learns individualized modular organization and hierarchical coupling across structural connectivity (SC) and functional connectivity (FC). The framework includes: (i) Prototype-based Modular Pooling (PMPool), which learns modality-specific multiscale communities by selecting prototypical ROIs and optimizing a differentiable modularity-inspired objective; (ii) an Attention-based Hierarchical Coupling Module (AHCM) that models both within-hierarchy and cross-hierarchy SC-FC interactions to produce enriched hierarchical coupling representations; and (iii) a Coupling-guided Clustering loss (CgC-Loss) that regularizes SC and FC community assignments with coupling signals, allowing cross-modal interactions to shape community alignment across hierarchies. We evaluate the model's performance across four cohorts for predicting brain age, cognitive score, and disease classification. Our model consistently outperforms baselines and other state-of-the-art approaches across three tasks. Ablation and sensitivity analyses verify the contributions of key components. Finally, the visualizations of learned coupling reveal interpretable differences, suggesting that the framework captures biologically meaningful structure-function relationships.
CVSep 6, 2023
Image Aesthetics Assessment via Learnable QueriesZhiwei Xiong, Yunfan Zhang, Zhiqi Shen et al.
Image aesthetics assessment (IAA) aims to estimate the aesthetics of images. Depending on the content of an image, diverse criteria need to be selected to assess its aesthetics. Existing works utilize pre-trained vision backbones based on content knowledge to learn image aesthetics. However, training those backbones is time-consuming and suffers from attention dispersion. Inspired by learnable queries in vision-language alignment, we propose the Image Aesthetics Assessment via Learnable Queries (IAA-LQ) approach. It adapts learnable queries to extract aesthetic features from pre-trained image features obtained from a frozen image encoder. Extensive experiments on real-world data demonstrate the advantages of IAA-LQ, beating the best state-of-the-art method by 2.2% and 2.1% in terms of SRCC and PLCC, respectively.
AIApr 13
Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data ClusteringMingjie Zhao, Yunfan Zhang, Yiqun Zhang et al.
Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like `Flu' and `Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
CLMar 5Code
FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM ApplicationsYunfan Zhang, Yijie Bei, Jetashree Ravi et al.
Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at fire-bench.com to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.
CVDec 14, 2023
iComMa: Inverting 3D Gaussian Splatting for Camera Pose Estimation via Comparing and MatchingYuan Sun, Xuan Wang, Yunfan Zhang et al.
We present a method named iComMa to address the 6D camera pose estimation problem in computer vision. Conventional pose estimation methods typically rely on the target's CAD model or necessitate specific network training tailored to particular object classes. Some existing methods have achieved promising results in mesh-free object and scene pose estimation by inverting the Neural Radiance Fields (NeRF). However, they still struggle with adverse initializations such as large rotations and translations. To address this issue, we propose an efficient method for accurate camera pose estimation by inverting 3D Gaussian Splatting (3DGS). Specifically, a gradient-based differentiable framework optimizes camera pose by minimizing the residual between the query image and the rendered image, requiring no training. An end-to-end matching module is designed to enhance the model's robustness against adverse initializations, while minimizing pixel-level comparing loss aids in precise pose estimation. Experimental results on synthetic and complex real-world data demonstrate the effectiveness of the proposed approach in challenging conditions and the accuracy of camera pose estimation.
LGDec 29, 2024
Asynchronous Federated Clustering with Unknown Number of ClustersYunfan Zhang, Yiqun Zhang, Yang Lu et al.
Federated Clustering (FC) is crucial to mining knowledge from unlabeled non-Independent Identically Distributed (non-IID) data provided by multiple clients while preserving their privacy. Most existing attempts learn cluster distributions at local clients, and then securely pass the desensitized information to the server for aggregation. However, some tricky but common FC problems are still relatively unexplored, including the heterogeneity in terms of clients' communication capacity and the unknown number of proper clusters $k^*$. To further bridge the gap between FC and real application scenarios, this paper first shows that the clients' communication asynchrony and unknown $k^*$ are complex coupling problems, and then proposes an Asynchronous Federated Cluster Learning (AFCL) method accordingly. It spreads the excessive number of seed points to the clients as a learning medium and coordinates them across the clients to form a consensus. To alleviate the distribution imbalance cumulated due to the unforeseen asynchronous uploading from the heterogeneous clients, we also design a balancing mechanism for seeds updating. As a result, the seeds gradually adapt to each other to reveal a proper number of clusters. Extensive experiments demonstrate the efficacy of AFCL.
CLDec 26, 2024
SketchFill: Sketch-Guided Code Generation for Imputing Derived Missing ValuesYunfan Zhang, Changlun Li, Yuyu Luo et al.
Missing value is a critical issue in data science, significantly impacting the reliability of analyses and predictions. Missing value imputation (MVI) is a longstanding problem because it highly relies on domain knowledge. Large language models (LLMs) have emerged as a promising tool for data cleaning, including MVI for tabular data, offering advanced capabilities for understanding and generating content. However, despite their promise, existing LLM techniques such as in-context learning and Chain-of-Thought (CoT) often fall short in guiding LLMs to perform complex reasoning for MVI, particularly when imputing derived missing values, which require mathematical formulas and data relationships across rows and columns. This gap underscores the need for further advancements in LLM methodologies to enhance their reasoning capabilities for more reliable imputation outcomes. To fill this gap, we propose SketchFill, a novel sketch-based method to guide LLMs in generating accurate formulas to impute missing numerical values. Our experimental results demonstrate that SketchFill significantly outperforms state-of-the-art methods, achieving 56.2% higher accuracy than CoT-based methods and 78.8% higher accuracy than MetaGPT. This sets a new standard for automated data cleaning and advances the field of MVI for numerical values.
CLApr 11, 2025
Forecasting Conversation Derailments Through GenerationYunfan Zhang, Kathleen McKeown, Smaranda Muresan
Forecasting conversation derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models' success at identifying offensive speech present in conversations, they struggle to forecast future conversation derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the conversation outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of future conversation trajectories surpasses state-of-the-art results on English conversation derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.
CVJan 22, 2025
InsTex: Indoor Scenes Stylized Texture SynthesisYunfan Zhang, Zhiwei Xiong, Zhiqi Shen et al.
Generating high-quality textures for 3D scenes is crucial for applications in interior design, gaming, and augmented/virtual reality (AR/VR). Although recent advancements in 3D generative models have enhanced content creation, significant challenges remain in achieving broad generalization and maintaining style consistency across multiple viewpoints. Current methods, such as 2D diffusion models adapted for 3D texturing, suffer from lengthy processing times and visual artifacts, while approaches driven by 3D data often fail to generalize effectively. To overcome these challenges, we introduce InsTex, a two-stage architecture designed to generate high-quality, style-consistent textures for 3D indoor scenes. InsTex utilizes depth-to-image diffusion priors in a coarse-to-fine pipeline, first generating multi-view images with a pre-trained 2D diffusion model and subsequently refining the textures for consistency. Our method supports both textual and visual prompts, achieving state-of-the-art results in visual quality and quantitative metrics, and demonstrates its effectiveness across various 3D texturing applications.
SPNov 7, 2024
Improve the Fitting Accuracy of Deep Learning for the Nonlinear Schrödinger Equation Using Linear Feature Decoupling MethodYunfan Zhang, Zekun Niu, Minghui Shi et al.
We utilize the Feature Decoupling Distributed (FDD) method to enhance the capability of deep learning to fit the Nonlinear Schrodinger Equation (NLSE), significantly reducing the NLSE loss compared to non decoupling model.
CVMay 2, 2024
Multi-modal Learnable Queries for Image Aesthetics AssessmentZhiwei Xiong, Yunfan Zhang, Zhiqi Shen et al.
Image aesthetics assessment (IAA) is attracting wide interest with the prevalence of social media. The problem is challenging due to its subjective and ambiguous nature. Instead of directly extracting aesthetic features solely from the image, user comments associated with an image could potentially provide complementary knowledge that is useful for IAA. With existing large-scale pre-trained models demonstrating strong capabilities in extracting high-quality transferable visual and textual features, learnable queries are shown to be effective in extracting useful features from the pre-trained visual features. Therefore, in this paper, we propose MMLQ, which utilizes multi-modal learnable queries to extract aesthetics-related features from multi-modal pre-trained features. Extensive experimental results demonstrate that MMLQ achieves new state-of-the-art performance on multi-modal IAA, beating previous methods by 7.7% and 8.3% in terms of SRCC and PLCC, respectively.
CLOct 5, 2025
Exploring Chain-of-Thought Reasoning for Steerable Pluralistic AlignmentYunfan Zhang, Kathleen McKeown, Smaranda Muresan
Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism -- the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.
CVJan 24, 2024
Style-Consistent 3D Indoor Scene Synthesis with Decoupled ObjectsYunfan Zhang, Hong Huang, Zhiwei Xiong et al.
Controllable 3D indoor scene synthesis stands at the forefront of technological progress, offering various applications like gaming, film, and augmented/virtual reality. The capability to stylize and de-couple objects within these scenarios is a crucial factor, providing an advanced level of control throughout the editing process. This control extends not just to manipulating geometric attributes like translation and scaling but also includes managing appearances, such as stylization. Current methods for scene stylization are limited to applying styles to the entire scene, without the ability to separate and customize individual objects. Addressing the intricacies of this challenge, we introduce a unique pipeline designed for synthesis 3D indoor scenes. Our approach involves strategically placing objects within the scene, utilizing information from professionally designed bounding boxes. Significantly, our pipeline prioritizes maintaining style consistency across multiple objects within the scene, ensuring a cohesive and visually appealing result aligned with the desired aesthetic. The core strength of our pipeline lies in its ability to generate 3D scenes that are not only visually impressive but also exhibit features like photorealism, multi-view consistency, and diversity. These scenes are crafted in response to various natural language prompts, demonstrating the versatility and adaptability of our model.
LGDec 21, 2021
Implicit Neural Video CompressionYunfan Zhang, Ties van Rozendaal, Johann Brehmer et al.
We propose a method to compress full-resolution video sequences with implicit neural representations. Each frame is represented as a neural network that maps coordinate positions to pixel values. We use a separate implicit network to modulate the coordinate inputs, which enables efficient motion compensation between frames. Together with a small residual network, this allows us to efficiently compress P-frames relative to the previous frame. We further lower the bitrate by storing the network weights with learned integer quantization. Our method, which we call implicit pixel flow (IPF), offers several simplifications over established neural video codecs: it does not require the receiver to have access to a pretrained neural network, does not use expensive interpolation-based warping operations, and does not require a separate training dataset. We demonstrate the feasibility of neural implicit compression on image and video data.
IVNov 19, 2021
Instance-Adaptive Video Compression: Improving Neural Codecs by Training on the Test SetTies van Rozendaal, Johann Brehmer, Yunfan Zhang et al.
We introduce a video compression algorithm based on instance-adaptive learning. On each video sequence to be transmitted, we finetune a pretrained compression model. The optimal parameters are transmitted to the receiver along with the latent code. By entropy-coding the parameter updates under a suitable mixture model prior, we ensure that the network parameters can be encoded efficiently. This instance-adaptive compression algorithm is agnostic about the choice of base model and has the potential to improve any neural video codec. On UVG, HEVC, and Xiph datasets, our codec improves the performance of a scale-space flow model by between 21% and 27% BD-rate savings, and that of a state-of-the-art B-frame model by 17 to 20% BD-rate savings. We also demonstrate that instance-adaptive finetuning improves the robustness to domain shift. Finally, our approach reduces the capacity requirements of compression models. We show that it enables a competitive performance even after reducing the network size by 70%.