CLJul 1, 2024Code
CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error CorrectionJingheng Ye, Zishan Xu, Yinghui Li et al.
The paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which received little attention in previous studies. To bridge the gap, we introduce **CLEME2.0**, a reference-based metric describing four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. They collectively contribute to exposing critical qualities and locating drawbacks of GEC systems. Evaluating systems by combining these aspects also leads to superior human consistency over other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, achieving a new state-of-the-art result. Our codes are released at https://github.com/THUKElab/CLEME.
AIFeb 12Code
Evolving Beyond Snapshots: Harmonizing Structure and Sequence via Entity State Tuning for Temporal Knowledge Graph ForecastingSiyuan Li, Yunjia Wu, Yiyong Xiao et al.
Temporal knowledge graph (TKG) forecasting requires predicting future facts by jointly modeling structural dependencies within each snapshot and temporal evolution across snapshots. However, most existing methods are stateless: they recompute entity representations at each timestamp from a limited query window, leading to episodic amnesia and rapid decay of long-term dependencies. To address this limitation, we propose Entity State Tuning (EST), an encoder-agnostic framework that endows TKG forecasters with persistent and continuously evolving entity states. EST maintains a global state buffer and progressively aligns structural evidence with sequential signals via a closed-loop design. Specifically, a topology-aware state perceiver first injects entity-state priors into structural encoding. Then, a unified temporal context module aggregates the state-enhanced events with a pluggable sequence backbone. Subsequently, a dual-track evolution mechanism writes the updated context back to the global entity state memory, balancing plasticity against stability. Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state-of-the-art performance, highlighting the importance of state persistence for long-horizon TKG forecasting. The code is published at https://github.com/yuanwuyuan9/Evolving-Beyond-Snapshots
CLFeb 4
Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process SupervisionLingzhuang Sun, Ruitong Liu, Yuxia Zhu et al.
Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the \textbf{Guided Verifier} framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing \textbf{CoRe} dataset of process-level negatives and \textbf{Co}rrect-guide \textbf{Re}asoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
CLFeb 11
Canvas-of-Thought: Grounding Reasoning via Mutable Structured StatesLingzhuang Sun, Yuxia Zhu, Ruitong Liu et al.
While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce \textbf{Canvas-of-Thought (Canvas-CoT)}. By leveraging a HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the "ground truth". Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.
AIFeb 18
OpenSage: Self-programming Agent Generation EngineHongwei Li, Zhun Wang, Qinrun Dai et al.
Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed agents' performance, especially the functionality for agent topology, tools, and memory. However, current ADKs either lack sufficient functional support or rely on humans to manually design these components, limiting agents' generalizability and overall performance. We propose OpenSage, the first ADK that enables LLMs to automatically create agents with self-generated topology and toolsets while providing comprehensive and structured memory support. OpenSage offers effective functionality for agents to create and manage their own sub-agents and toolkits. It also features a hierarchical, graph-based memory system for efficient management and a specialized toolkit tailored to software engineering tasks. Extensive experiments across three state-of-the-art benchmarks with various backbone models demonstrate the advantages of OpenSage over existing ADKs. We also conduct rigorous ablation studies to demonstrate the effectiveness of our design for each component. We believe OpenSage can pave the way for the next generation of agent development, shifting the focus from human-centered to AI-centered paradigms.
CLJun 29, 2025Code
Flow-Modulated Scoring for Semantic-Aware Knowledge Graph CompletionSiyuan Li, Ruitong Liu, Yan Wen et al.
Knowledge graph completion demands effective modeling of multifaceted semantic relationships between entities. Yet, prevailing methods, which rely on static scoring functions over learned embeddings, struggling to simultaneously capture rich semantic context and the dynamic nature of relations. To overcome this limitation, we propose the Flow-Modulated Scoring (FMS) framework, conceptualizing a relation as a dynamic evolutionary process governed by its static semantic environment. FMS operates in two stages: it first learns context-aware entity embeddings via a Semantic Context Learning module, and then models a dynamic flow between them using a Conditional Flow-Matching module. This learned flow dynamically modulates a base static score for the entity pair. By unifying context-rich static representations with a conditioned dynamic flow, FMS achieves a more comprehensive understanding of relational semantics. Extensive experiments demonstrate that FMS establishes a new state of the art across both canonical knowledge graph completion tasks: relation prediction and entity prediction. On the standard relation prediction benchmark FB15k-237, FMS achieves a near-perfect MRR of 99.8\% and Hits@1 of 99.7\% using a mere 0.35M parameters, while also attaining a 99.9\% MRR on WN18RR. Its dominance extends to entity prediction, where it secures a 25.2\% relative MRR gain in the transductive setting and substantially outperforms all baselines in challenging inductive settings. By unifying a dynamic flow mechanism with rich static contexts, FMS offers a highly effective and parameter-efficient new paradigm for knowledge graph completion. Code published at: https://github.com/yuanwuyuan9/FMS.
CLFeb 8, 2025
Position: LLMs Can be Good Tutors in English EducationJingheng Ye, Shen Wang, Deqing Zou et al.
While recent efforts have begun integrating large language models (LLMs) into English education, they often rely on traditional approaches to learning tasks without fully embracing educational methodologies, thus lacking adaptability to language learning. To address this gap, we argue that LLMs have the potential to serve as effective tutors in English Education. Specifically, LLMs can play three critical roles: (1) as data enhancers, improving the creation of learning materials or serving as student simulations; (2) as task predictors, serving as learner assessment or optimizing learning pathway; and (3) as agents, enabling personalized and inclusive education. We encourage interdisciplinary research to explore these roles, fostering innovation while addressing challenges and risks, ultimately advancing English Education through the thoughtful integration of LLMs.
CLNov 28, 2025
FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English WritingJingheng Ye, Shen Wang, Jiaqi Chen et al.
Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback for K-12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the problem of Fine-grained Error Analysis for English Learners and present the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, and a well-developed English writing error taxonomy. Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to explore their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs' ability to perform fine-grained error analysis, highlighting the need for advancements in particular methods for educational applications.
SENov 23, 2025
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code IntelligenceJian Yang, Xianglong Liu, Weifeng Lv et al.
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like Github Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). While the field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95\% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
AIOct 10, 2025
Semantic-Condition Tuning: Fusing Graph Context with Large Language Models for Knowledge Graph CompletionRuitong Liu, Yan Wen, Te Sun et al.
Fusing Knowledge Graphs with Large Language Models is crucial for knowledge-intensive tasks like knowledge graph completion. The prevailing paradigm, prefix-tuning, simply concatenates knowledge embeddings with text inputs. However, this shallow fusion overlooks the rich relational semantics within KGs and imposes a significant implicit reasoning burden on the LLM to correlate the prefix with the text. To address these, we propose Semantic-condition Tuning (SCT), a new knowledge injection paradigm comprising two key modules. First, a Semantic Graph Module employs a Graph Neural Network to extract a context-aware semantic condition from the local graph neighborhood, guided by knowledge-enhanced relations. Subsequently, this condition is passed to a Condition-Adaptive Fusion Module, which, in turn, adaptively modulates the textual embedding via two parameterized projectors, enabling a deep, feature-wise, and knowledge-aware interaction. The resulting pre-fused embedding is then fed into the LLM for fine-tuning. Extensive experiments on knowledge graph benchmarks demonstrate that SCT significantly outperforms prefix-tuning and other strong baselines. Our analysis confirms that by modulating the input representation with semantic graph context before LLM inference, SCT provides a more direct and potent signal, enabling more accurate and robust knowledge reasoning.
AIJun 29, 2025
Context-Driven Knowledge Graph Completion with Semantic-Aware Relational Message PassingSiyuan Li, Yan Wen, Ruitong Liu et al.
Semantic context surrounding a triplet $(h, r, t)$ is crucial for Knowledge Graph Completion (KGC), providing vital cues for prediction. However, traditional node-based message passing mechanisms, when applied to knowledge graphs, often introduce noise and suffer from information dilution or over-smoothing by indiscriminately aggregating information from all neighboring edges. To address this challenge, we propose a semantic-aware relational message passing. A core innovation of this framework is the introduction of a semantic-aware Top-K neighbor selection strategy. Specifically, this strategy first evaluates the semantic relevance between a central node and its incident edges within a shared latent space, selecting only the Top-K most pertinent ones. Subsequently, information from these selected edges is effectively fused with the central node's own representation using a multi-head attention aggregator to generate a semantically focused node message. In this manner, our model not only leverages the structure and features of edges within the knowledge graph but also more accurately captures and propagates the contextual information most relevant to the specific link prediction task, thereby effectively mitigating interference from irrelevant information. Extensive experiments demonstrate that our method achieves superior performance compared to existing approaches on several established benchmarks.