Jing Guo

CV
h-index18
15papers
88citations
Novelty52%
AI Score56

15 Papers

CVMar 12Code
InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

InSpatio Team, Xiaoyu Zhang, Weihong Pan et al.

We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.

LGFeb 24Code
QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye et al.

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix (7 judges x 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models like GPT-5 Pro and Claude Sonnet 4.5 see their performance significantly degrade in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we also release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly published at https://github.com/qqliu/Yale-QEDBench.

ASSep 11, 2023
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Jinzuomu Zhong, Yang Li, Hui Huang et al.

In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, a two-stage automatic annotation pipeline is novelly proposed in this paper. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance with 0.72 and 0.93 f1 score for Prosodic Word and Prosodic Phrase boundary respectively, while bearing remarkable robustness to data scarcity.

CLJan 4, 2024Code
Advanced Unstructured Data Processing for ESG Reports: A Methodology for Structured Transformation and Enhanced Analysis

Jiahui Peng, Jing Gao, Xin Tong et al.

In the evolving field of corporate sustainability, analyzing unstructured Environmental, Social, and Governance (ESG) reports is a complex challenge due to their varied formats and intricate content. This study introduces an innovative methodology utilizing the "Unstructured Core Library", specifically tailored to address these challenges by transforming ESG reports into structured, analyzable formats. Our approach significantly advances the existing research by offering high-precision text cleaning, adept identification and extraction of text from images, and standardization of tables within these reports. Emphasizing its capability to handle diverse data types, including text, images, and tables, the method adeptly manages the nuances of differing page layouts and report styles across industries. This research marks a substantial contribution to the fields of industrial ecology and corporate sustainability assessment, paving the way for the application of advanced NLP technologies and large language models in the analysis of corporate governance and sustainability. Our code is available at https://github.com/linancn/TianGong-AI-Unstructure.git.

CLJan 10, 2025Code
Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain

Jing Guo, Nan Li, Ming Xu

Generative AI holds significant potential for ecological and environmental applications such as monitoring, data analysis, education, and policy support. However, its effectiveness is limited by the lack of a unified evaluation framework. To address this, we present the Environmental Large Language model Evaluation (ELLE) question answer (QA) dataset, the first benchmark designed to assess large language models and their applications in ecological and environmental sciences. The ELLE dataset includes 1,130 question answer pairs across 16 environmental topics, categorized by domain, difficulty, and type. This comprehensive dataset standardizes performance assessments in these fields, enabling consistent and objective comparisons of generative AI performance. By providing a dedicated evaluation tool, ELLE dataset promotes the development and application of generative AI technologies for sustainable environmental outcomes. The dataset and code are available at https://elle.ceeai.net/ and https://github.com/CEEAI/elle.

SEApr 25
GlycoPy: A CasADi-based Python Framework for Hierarchical Modeling, Optimization, and Control of Bioprocesses

Yingjie Ma, Jing Guo, Richard D. Braatz

Efficient implementation of nonlinear model predictive control (NMPC) for bioprocesses remains challenging because large nonlinear models are difficult to organize, simulate, and embed within optimization and control workflows. This difficulty is particularly pronounced for large-scale and multiscale systems that require hierarchical model construction and customized simulation strategies. To address this issue, we present GlycoPy, a CasADi-based Python framework for hierarchical modeling, optimization, and control of bioprocesses. GlycoPy combines an equation-oriented, object-oriented modeling architecture with CasADi's symbolic and differentiable computational capabilities, enabling hierarchical model composition, numerical and symbolic simulation, parameter estimation, dynamic optimization, and NMPC within a unified workflow. A key feature of the framework is its support for customized differentiable simulation algorithms that can be embedded directly in gradient-based optimization and control. GlycoPy is demonstrated on a multiscale monoclonal antibody glycosylation process in Chinese hamster ovary cell culture, where it is used for hierarchical model construction, quasi-steady-state simulation, and adaptive NMPC. The results show that GlycoPy provides a practical and reusable framework for applying advanced optimization and control methods to computationally demanding bioprocesses.

CLDec 22, 2023
Empowering Working Memory for Large Language Model Agents

Jing Guo, Nan Li, Jianchuan Qi et al.

Large language models (LLMs) have achieved impressive linguistic capabilities. However, a key limitation persists in their lack of human-like memory faculties. LLMs exhibit constrained memory retention across sequential interactions, hindering complex reasoning. This paper explores the potential of applying cognitive psychology's working memory frameworks, to enhance LLM architecture. The limitations of traditional LLM memory designs are analyzed, including their isolation of distinct dialog episodes and lack of persistent memory links. To address this, an innovative model is proposed incorporating a centralized Working Memory Hub and Episodic Buffer access to retain memories across episodes. This architecture aims to provide greater continuity for nuanced contextual reasoning during intricate tasks and collaborative scenarios. While promising, further research is required into optimizing episodic memory encoding, storage, prioritization, retrieval, and security. Overall, this paper provides a strategic blueprint for developing LLM agents with more sophisticated, human-like memory capabilities, highlighting memory mechanisms as a vital frontier in artificial general intelligence.

CVApr 8
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team, Donghui Shen, Guofeng Zhang et al.

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

CVJan 24, 2025
SyncAnimation: A Real-Time End-to-End Framework for Audio-Driven Human Pose and Talking Head Animation

Yujian Liu, Shidang Xu, Jing Guo et al.

Generating talking avatar driven by audio remains a significant challenge. Existing methods typically require high computational costs and often lack sufficient facial detail and realism, making them unsuitable for applications that demand high real-time performance and visual quality. Additionally, while some methods can synchronize lip movement, they still face issues with consistency between facial expressions and upper body movement, particularly during silent periods. In this paper, we introduce SyncAnimation, the first NeRF-based method that achieves audio-driven, stable, and real-time generation of speaking avatar by combining generalized audio-to-pose matching and audio-to-expression synchronization. By integrating AudioPose Syncer and AudioEmotion Syncer, SyncAnimation achieves high-precision poses and expression generation, progressively producing audio-synchronized upper body, head, and lip shapes. Furthermore, the High-Synchronization Human Renderer ensures seamless integration of the head and upper body, and achieves audio-sync lip. The project page can be found at https://syncanimation.github.io

CROct 27, 2025
MCPGuard : Automatically Detecting Vulnerabilities in MCP Servers

Bin Wang, Zexin Liu, Hao Yu et al.

The Model Context Protocol (MCP) has emerged as a standardized interface enabling seamless integration between Large Language Models (LLMs) and external data sources and tools. While MCP significantly reduces development complexity and enhances agent capabilities, its openness and extensibility introduce critical security vulnerabilities that threaten system trustworthiness and user data protection. This paper systematically analyzes the security landscape of MCP-based systems, identifying three principal threat categories: (1) agent hijacking attacks stemming from protocol design deficiencies; (2) traditional web vulnerabilities in MCP servers; and (3) supply chain security. To address these challenges, we comprehensively survey existing defense strategies, examining both proactive server-side scanning approaches, ranging from layered detection pipelines and agentic auditing frameworks to zero-trust registry systems, and runtime interaction monitoring solutions that provide continuous oversight and policy enforcement. Our analysis reveals that MCP security fundamentally represents a paradigm shift where the attack surface extends from traditional code execution to semantic interpretation of natural language metadata, necessitating novel defense mechanisms tailored to this unique threat model.

AIOct 13, 2025
Spec-Driven AI for Science: The ARIA Framework for Automated and Reproducible Data Analysis

Chuke Chen, Biao Luo, Nan Li et al.

The rapid expansion of scientific data has widened the gap between analytical capability and research intent. Existing AI-based analysis tools, ranging from AutoML frameworks to agentic research assistants, either favor automation over transparency or depend on manual scripting that hinders scalability and reproducibility. We present ARIA (Automated Research Intelligence Assistant), a spec-driven, human-in-the-loop framework for automated and interpretable data analysis. ARIA integrates six interoperable layers, namely Command, Context, Code, Data, Orchestration, and AI Module, within a document-centric workflow that unifies human reasoning and machine execution. Through natural-language specifications, researchers define analytical goals while ARIA autonomously generates executable code, validates computations, and produces transparent documentation. Beyond achieving high predictive accuracy, ARIA can rapidly identify optimal feature sets and select suitable models, minimizing redundant tuning and repetitive experimentation. In the Boston Housing case, ARIA discovered 25 key features and determined XGBoost as the best performing model (R square = 0.93) with minimal overfitting. Evaluations across heterogeneous domains demonstrate ARIA's strong performance, interpretability, and efficiency compared with state-of-the-art systems. By combining AI for research and AI for science principles within a spec-driven architecture, ARIA establishes a new paradigm for transparent, collaborative, and reproducible scientific discovery.

CLAug 20, 2025
ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine

Junying Chen, Zhenyang Cai, Zhiheng Liu et al.

Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.

CVMay 21, 2024
Experimental investigation of trans-scale displacement responses of wrinkle defects in fiber reinforced composite laminates

Li Ma, Shoulong Wang, Changchen Liu et al.

Wrinkle defects were found widely exist in the field of industrial products, i.e. wind turbine blades and filament-wound composite pressure vessels. The magnitude of wrinkle wavelength varies from several millimeters to over one hundred millimeters. Locating the wrinkle defects and measuring their responses are very important to the assessment of the structures that containing wrinkle defects. A meso-mechanical modeling is presented based on the homogenization method to obtain the effective stiffness of a graded wrinkle. The finite element simulation predicts the trans-scale response of out-of-plane displacement of wrinkled laminates, where the maximum displacement ranges from nanoscale to millimeter scale. Such trans-scale effect requires different measurement approaches to observe the displacement responses. Here we employed Shearography (Speckle Pattern Shearing Interferometry) and fringe projection profilometry (FPP) method respectively according to the different magnitude of displacement. In FPP method, a displacement extraction algorithm was presented to obtain the out-of-plane displacement. The measurement sensitivity and accuracy of Shearography and FPP are compared, which provides a quantitative reference for industrial non-destructive test.

SDNov 1, 2021
RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses

Shengyuan Xu, Wenxiao Zhao, Jing Guo

Most GAN(Generative Adversarial Network)-based approaches towards high-fidelity waveform generation heavily rely on discriminators to improve their performance. However, GAN methods introduce much uncertainty into the generation process and often result in mismatches of pitch and intensity, which is fatal when it comes to sensitive use cases such as singing voice synthesis(SVS). To address this problem, we propose RefineGAN, a high-fidelity neural vocoder focused on the robustness, pitch and intensity accuracy, and high-speed full-band audio generation. We applyed a pitch-guided refine architecture with a multi-scale spectrogram-based loss function to help stabilize the training process and maintain the robustness of the neural vocoder while using the GAN-based training method. Audio generated using this method shows a better performance in subjective tests when compared with the ground-truth audio. This result shows that the fidelity is even improved during the waveform reconstruction by eliminating defects produced by recording procedures. Moreover, it shows that models trained on a specified type of data can perform on totally unseen language and unseen speaker identically well. Generated sample pairs are provided onhttps://timedomain-tech.github.io/refinegan/.

CVNov 20, 2020
Image Denoising by Gaussian Patch Mixture Model and Low Rank Patches

Jing Guo, Shuping Wang, Chen Luo et al.

Non-local self-similarity based low rank algorithms are the state-of-the-art methods for image denoising. In this paper, a new method is proposed by solving two issues: how to improve similar patches matching accuracy and build an appropriate low rank matrix approximation model for Gaussian noise. For the first issue, similar patches can be found locally or globally. Local patch matching is to find similar patches in a large neighborhood which can alleviate noise effect, but the number of patches may be insufficient. Global patch matching is to determine enough similar patches but the error rate of patch matching may be higher. Based on this, we first use local patch matching method to reduce noise and then use Gaussian patch mixture model to achieve global patch matching. The second issue is that there is no low rank matrix approximation model to adapt to Gaussian noise. We build a new model according to the characteristics of Gaussian noise, then prove that there is a globally optimal solution of the model. By solving the two issues, experimental results are reported to show that the proposed approach outperforms the state-of-the-art denoising methods includes several deep learning ones in both PSNR / SSIM values and visual quality.