Zijie Guo

AI
h-index: 40
Papers: 12
Citations: 115
Novelty: 63%
AI Score: 58

12 Papers

AI | Jul 17, 2023 | Code
Long-range Meta-path Search on Large-scale Heterogeneous Graphs

Chao Li, Zijie Guo, Qiuting He et al.

Utilizing long-range dependency, a concept extensively studied in homogeneous graphs, remains underexplored in heterogeneous graphs, especially on large ones, posing two significant challenges: reducing computational costs while maximizing effective information utilization in the presence of heterogeneity, and overcoming the over-smoothing issue in graph neural networks. To address this gap, we investigate the importance of different meta-paths and introduce an automatic framework for utilizing long-range dependency on heterogeneous graphs, denoted as Long-range Meta-path Search through Progressive Sampling (LMSPS). Specifically, we develop a search space with all meta-paths related to the target node type. By employing a progressive sampling algorithm, LMSPS dynamically shrinks the search space with hop-independent time complexity. Through a sampling evaluation strategy, LMSPS conducts a specialized and effective meta-path selection, leading to retraining with only effective meta-paths, thus mitigating costs and over-smoothing. Extensive experiments across diverse heterogeneous datasets validate LMSPS's capability in discovering effective long-range meta-paths, surpassing state-of-the-art methods. Our code is available at https://github.com/JHL-HUST/LMSPS.

AI | Dec 18, 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Wanghan Xu, Yuhao Zhou, Yifan Zhou et al.

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10-20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.

GEO-PH | Mar 24
TRACE: A Multi-Agent System for Autonomous Physical Reasoning in Seismological

Feng Liu, Jian Xu, Xin Cui et al.

Inferring the physical mechanisms that govern earthquake sequences from indirect geophysical observations remains difficult, particularly across tectonically distinct environments where similar seismic patterns can reflect different underlying processes. Current interpretations rely heavily on the expert synthesis of catalogs, spatiotemporal statistics, and candidate physical models, limiting reproducibility and the systematic transfer of insight across settings. Here we present TRACE (Trans-perspective Reasoning and Automated Comprehensive Evaluator), a multi-agent system that combines large language model planning with formal seismological constraints to derive auditable, physically grounded mechanistic inference from raw observations. Applied to the 2019 Ridgecrest sequence, TRACE autonomously identifies stress-perturbation-induced delayed triggering, resolving the cascading interaction between the Mw 6.4 and Mw 7.1 mainshocks; in the Santorini-Kolumbo case, the system identifies a structurally guided intrusion model, distinguishing fault-channeled episodic migration from the continuous propagation expected in homogeneous crustal failure. By providing a generalizable logical infrastructure for interpreting heterogeneous seismic phenomena, TRACE advances the field from expert-dependent analysis toward knowledge-guided autonomous discovery in Earth sciences.

AI | Nov 26, 2025
EWE: An Agentic Framework for Extreme Weather Analysis

Zhe Jiang, Jiong Wang, Xiaoyu Yue et al.

Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of automated diagnostic reasoning remains largely unexplored. We present the Extreme Weather Expert (EWE), the first intelligent agent framework dedicated to this task. EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data, enabling comprehensive diagnostic analyses. To catalyze progress, we introduce the first benchmark for this emerging field, comprising a curated dataset of 103 high-impact events and a novel step-wise evaluation metric. EWE marks a step toward automated scientific discovery and offers the potential to democratize expertise and intellectual resources, particularly for developing nations vulnerable to extreme weather.

AI | May 18
Latent Action Reparameterization for Efficient Agent Inference

Wenhao Huang, Qingwen Zeng, Qiyue Chen et al.

Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

AI | Feb 9
InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

Shiyang Feng, Runmin Ma, Xiangchao Yan et al.

We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supported by foundational capabilities for deep research, solution optimization, and long-horizon memory. The architecture allows InternAgent-1.5 to operate continuously across extended discovery cycles while maintaining coherent and improving behavior. It also enables the system to coordinate computational modeling and laboratory experimentation within a single unified system. We evaluate InternAgent-1.5 on scientific reasoning benchmarks such as GAIA, HLE, GPQA, and FrontierScience, and the system achieves leading performance that demonstrates strong foundational capabilities. Beyond these benchmarks, we further assess two categories of discovery tasks. In algorithm discovery tasks, InternAgent-1.5 autonomously designs competitive methods for core machine learning problems. In empirical discovery tasks, it executes complete computational or wet lab experiments and produces scientific findings in earth, life, biological, and physical domains. Overall, these results show that InternAgent-1.5 provides a general and scalable framework for autonomous scientific discovery.

IR | Mar 18
IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data

Zhuofan Shi, Zijie Guo, Xinjian Ma et al.

The rapid growth of multi-source, heterogeneous, and multimodal scientific data has increasingly exposed the limitations of traditional data management. Most existing DeepResearch (DR) efforts focus primarily on web search while overlooking local private data. Consequently, these frameworks exhibit low retrieval efficiency for private data and fail to comply with the FAIR principles, ultimately resulting in inefficiency and limited reusability. To this end, we propose IoDResearch (Internet of Data Research), a private data-centric Deep Research framework that operationalizes the Internet of Data paradigm. IoDResearch encapsulates heterogeneous resources as FAIR-compliant digital objects, and further refines them into atomic knowledge units and knowledge graphs, forming a heterogeneous graph index for multi-granularity retrieval. On top of this representation, a multi-agent system supports both reliable question answering and structured scientific report generation. Furthermore, we establish the IoD DeepResearch Benchmark to systematically evaluate both data representation and Deep Research capabilities in IoD scenarios. Experimental results on retrieval, QA, and report-writing tasks show that IoDResearch consistently surpasses representative RAG and Deep Research baselines. Overall, IoDResearch demonstrates the feasibility of private-data-centric Deep Research under the IoD paradigm, paving the way toward more trustworthy, reusable, and automated scientific discovery.

CL | Sep 25, 2025 | Code
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning

Xiangru Tang, Wanghan Xu, Yujie Wang et al.

Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy, the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.

AI | Jun 12, 2025
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou, Yiheng Wang, Xuming He et al.

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.

AO-PH | May 24, 2024
Data-driven Global Ocean Modeling for Seasonal to Decadal Prediction

Zijie Guo, Pumeng Lyu, Fenghua Ling et al.

Accurate ocean dynamics modeling is crucial for enhancing understanding of ocean circulation, predicting climate variability, and tackling challenges posed by climate change. Despite improvements in traditional numerical models, predicting global ocean variability over multi-year scales remains challenging. Here, we propose ORCA-DL (Oceanic Reliable foreCAst via Deep Learning), the first data-driven 3D ocean model for seasonal to decadal prediction of global ocean circulation. ORCA-DL accurately simulates three-dimensional ocean dynamics and outperforms state-of-the-art dynamical models in capturing extreme events, including El Niño-Southern Oscillation and upper ocean heatwaves. This demonstrates the high potential of data-driven models for efficient and accurate global ocean forecasting. Moreover, ORCA-DL stably emulates ocean dynamics at decadal timescales, demonstrating its potential even for skillful decadal predictions and climate projections.

CV | May 29, 2025
OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

Fengxiang Wang, Mingshuo Chen, Xuming He et al.

Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere of atmosphere and to at most 16 tasks. These limitations include narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility. Therefore, we introduce OmniEarth-Bench, the first multimodal benchmark that systematically spans all six spheres (atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere) as well as cross-sphere interactions. Built with a scalable, modular-topology data inference framework, native multi-observation sources, and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code were released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).

LG | Jul 23, 2025
A Self-Evolving AI Agent System for Climate Science

Zijie Guo, Jiong Wang, Fenghua Ling et al.

Scientific progress in Earth science depends on integrating data across the planet's interconnected spheres. However, the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self-evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks, including model-observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Niño, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human-AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern-ai.org.cn.