Zhonglin Jiang

CL
h-index19
14papers
32citations
Novelty58%
AI Score56

14 Papers

95.1ASMar 16Code
SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Jiale Qian, Hao Meng, Tian Zheng et al.

While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.

CLJan 20Code
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

Zhiyuan Shi, Qibo Qiu, Feng Xue et al.

The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shift, and trigger an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model in the 224K context. Our code will be open-source.

LGSep 9, 2024
M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture

Hongyang Lei, Xiaolong Cheng, Qi Qin et al.

Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to incorporate with the backbone of pretrained language model, but might result in modality collapse. To alleviate such issues, we leverage the Joint-Embedding Predictive Architecture (JEPA) on the multimodal tasks, which converts the input embedding into the output embedding space by a predictor and then conducts the cross-modal alignment on the latent space. We implement this predictor by a Multi-Gate Mixture of Experts (MMoE) and name the framework as M3-JEPA, accordingly. The gating function disentangles the modality-specific and shared information and derives information-theoretic optimality. The framework is implemented with both contrastive and regularization loss, and solved by alternative gradient descent (AGD) between different multimodal tasks. By thoroughly designed experiments, we show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference. Our observation suggests that M3-JEPA might become a new basis to self-supervised learning in the open world.

LGSep 26, 2025Code
HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Ke Li, Zheng Yang, Zhongbin Zhou et al.

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where d is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including DeepSeek MoE and Qwen MoE family, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20% ~ 25% in most models, while also reducing FLOPs nearly by 20%. The code can be found at \href{https://github.com/LLIKKE/HEAPr}{https://github.com/LLIKKE/HEAPr}.

IRDec 21, 2024Code
Large Language Model Can Be a Foundation for Hidden Rationale-Based Retrieval

Luo Ji, Feixiang Guo, Teng Chen et al.

Despite the recent advancement in Retrieval-Augmented Generation (RAG) systems, most retrieval methodologies are often developed for factual retrieval, which assumes query and positive documents are semantically similar. In this paper, we instead propose and study a more challenging type of retrieval task, called hidden rationale retrieval, in which query and document are not similar but can be inferred by reasoning chains, logic relationships, or empirical experiences. To address such problems, an instruction-tuned Large language model (LLM) with a cross-encoder architecture could be a reasonable choice. To further strengthen pioneering LLM-based retrievers, we design a special instruction that transforms the retrieval task into a generative task by prompting LLM to answer a binary-choice question. The model can be fine-tuned with direct preference optimization (DPO). The framework is also optimized for computational efficiency with no performance degradation. We name this retrieval framework by RaHoRe and verify its zero-shot and fine-tuned performance superiority on Emotional Support Conversation (ESC), compared with previous retrieval works. Our study suggests the potential to employ LLM as a foundation for a wider scope of retrieval tasks. Our codes, models, and datasets are available on https://github.com/flyfree5/LaHoRe.

CLSep 18, 2024
MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning

Ningyuan Xi, Xiaoyu Wang, Yetao Wu et al.

Current research efforts are focused on enhancing the thinking and reasoning capability of large language model (LLM) by prompting, data-driven emergence and inference-time computation. In this study, we consider stimulating language model's thinking and cognitive abilities from a modular perspective, which mimics the human brain architecture. We select a specific intermediate attention layer with newly implemented language heads. We conduct dual-layer fine-tuning by annotated (query, thought, answer) samples and show that the intermediate layer can also learn to decode fluent and reasonable language tokens. A two-pass inference mechanism is designed to generate thoughts then formal responses. The entire framework is called modularized thinking language model (MeTHanol) which can enhance LLM's cognitive behaviors as indicated by Theory of Mind (ToM) and Vignette-based experiments. Case studies also show that MeTHanol can plan and self-reflect and generate human-like thoughts and answers, even on unseen and open-domain tasks. MeTHanol can also adapt to a personalized prompt and behave as the specified character. Our study holds promise for significant cognitive gains from a modular perspective. Our code, model and data are available at https://bachozean.github.io/methanol-page

LGMar 28, 2025
Generative Reliability-Based Design Optimization Using In-Context Learning Capabilities of Large Language Models

Zhonglin Jiang, Qian Tang, Zequn Wang

Large Language Models (LLMs) have demonstrated remarkable in-context learning capabilities, enabling flexible utilization of limited historical information to play pivotal roles in reasoning, problem-solving, and complex pattern recognition tasks. Inspired by the successful applications of LLMs in multiple domains, this paper proposes a generative design method by leveraging the in-context learning capabilities of LLMs with the iterative search mechanisms of metaheuristic algorithms for solving reliability-based design optimization problems. In detail, reliability analysis is performed by engaging the LLMs and Kriging surrogate modeling to overcome the computational burden. By dynamically providing critical information of design points to the LLMs with prompt engineering, the method enables rapid generation of high-quality design alternatives that satisfy reliability constraints while achieving performance optimization. With the Deepseek-V3 model, three case studies are used to demonstrated the performance of the proposed approach. Experimental results indicate that the proposed LLM-RBDO method successfully identifies feasible solutions that meet reliability constraints while achieving a comparable convergence rate compared to traditional genetic algorithms.

CLDec 6, 2024
Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation

Xiaoyu Wang, Ningyuan Xi, Teng Chen et al.

Large Language Models (LLM) are usually fine-tuned to participate in dyadic or two-party dialogues, which can not adapt well to multi-party dialogues (MPD), which hinders their applications in such scenarios including multi-personal meetings, discussions and daily communication. Previous LLM-based researches mainly focus on the multi-agent framework, while their base LLMs are still pairwisely fine-tuned. In this work, we design a multi-party fine-tuning framework (MuPaS) for LLMs on the multi-party dialogue datasets, and prove such a straightforward framework can let the LLM align with the multi-party conversation style efficiently and effectively. We also design two training strategies which can convert MuPaS into the MPD simulator. Substantial experiments show that MuPaS can achieve state-of-the-art multi-party response, higher accuracy of the-next-speaker prediction, higher human and automatic evaluated utterance qualities, and can even generate reasonably with out-of-distribution scene, topic and role descriptions. The MuPaS framework bridges the LLM training with more complicated multi-party applications, such as conversation generation, virtual rehearsal or meta-universe.

CVSep 1, 2025
FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation

Wenzhuang Wang, Yifan Zhao, Mingcan Ma et al. · pku

Layout-to-image (L2I) generation has exhibited promising results in natural domains, but suffers from limited generative fidelity and weak alignment with user-provided layouts when applied to degraded scenes (i.e., low-light, underwater). We primarily attribute these limitations to the "contextual illusion dilemma" in degraded conditions, where foreground instances are overwhelmed by context-dominant frequency distributions. Motivated by this, our paper proposes a new Frequency-Inspired Contextual Disentanglement Generative (FICGen) paradigm, which seeks to transfer frequency knowledge of degraded images into the latent diffusion space, thereby facilitating the rendering of degraded instances and their surroundings via contextual frequency-aware guidance. To be specific, FICGen consists of two major steps. Firstly, we introduce a learnable dual-query mechanism, each paired with a dedicated frequency resampler, to extract contextual frequency prototypes from pre-collected degraded exemplars in the training set. Secondly, a visual-frequency enhanced attention is employed to inject frequency prototypes into the degraded generation process. To alleviate the contextual illusion and attribute leakage, an instance coherence map is developed to regulate latent-space disentanglement between individual instances and their surroundings, coupled with an adaptive spatial-frequency aggregation module to reconstruct spatial-frequency mixed degraded representations. Extensive experiments on 5 benchmarks involving a variety of degraded scenarios-from severe low-light to mild blur-demonstrate that FICGen consistently surpasses existing L2I methods in terms of generative fidelity, alignment and downstream auxiliary trainability.

CLMay 11, 2025
Convert Language Model into a Value-based Strategic Planner

Xiaoyu Wang, Yue Zhao, Qingqing Gu et al.

Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Q-learning on LLMs, and propose a framework called straQ*. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to response. Substantial experiments on ESC datasets suggest that straQ* outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.

CLApr 16, 2025
FiSMiness: A Finite State Machine Based Paradigm for Emotional Support Conversations

Yue Zhao, Qingqing Gu, Xiaoyu Wang et al.

Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called FiSMiness. Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker's emotion, support strategy and the final response upon each conversational turn. Substantial experiments on ESC datasets suggest that FiSMiness outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and external-assisted methods, even those with many more parameters.

CLOct 21, 2025
Chain-of-Conceptual-Thought Elicits Daily Conversation in Large Language Models

Qingqing Gu, Dan Wang, Yue Zhao et al.

Chain-of-Thought (CoT) is widely applied to enhance the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks, when there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose a new prompt-based paradigm called Chain of Conceptual Thoughts (CoCT), which suggests the LLM first to produce the tag of concepts, then complete the detailed content following the concept. To encourage this hierarchical way of thinking, we implement the concepts with emotions, strategies and topics. We experiment with this paradigm in daily and emotional support conversations, covering tasks with both in-domain and out-of-domain concept settings. Automatic, human, and LLM-based evaluations reveal that CoCT surpasses several prompt-based baselines such as self-refine, ECoT, SoT and RAG, suggesting a potential solution of LLM prompting paradigm for a wider scope of tasks.

CLAug 23, 2025
Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling

Yue Zhao, Xiaoyu Wang, Dan Wang et al.

World models have been widely utilized in robotics, gaming, and auto-driving. However, their applications on natural language tasks are relatively limited. In this paper, we construct the dialogue world model, which could predict the user's emotion, sentiment, and intention, and future utterances. By defining a POMDP, we argue emotion, sentiment and intention can be modeled as the user belief and solved by maximizing the information bottleneck. By this user belief modeling, we apply the model-based reinforcement learning framework to the dialogue system, and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model can achieve state-of-the-art performances on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic and dialogue world model. Further analysis shows that this manner holds a reasonable exploration-exploitation balance and also transfers well to out-of-domain scenarios such as empathetic dialogues.

CLJul 17, 2025
Making Language Model a Hierarchical Classifier

Yihong Wang, Zhonglin Jiang, Ningyuan Xi et al.

Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by human's hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computationally resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. By thorough experiments, we validate that these selective intermediate layers could be adapted to speak meaningful and reasonable contents, and this paradigm of hierarchical decoder can obtain state-of-the-art performances on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. HdLM outperforms all baselines on WoS, DBpedia, ESconv, EmpatheticDialogues, and several cognitive tests. We also provide thorough theoretical analysis to validate the convergence and computational savings of our methodology. This study suggests the possibility of a generalized hierarchical reasoner, pretraining from scratch.