Fengli Xu

AI
h-index: 63
46 papers
1,856 citations
Novelty: 46%
AI Score: 60

46 Papers

CL Aug 23, 2024 Code
LIMP: Large Language Model Enhanced Intent-aware Mobility Prediction

Songwei Li, Jie Feng, Jiawei Chi et al.

Human mobility prediction is essential for applications like urban planning and transportation management, yet it remains challenging due to the complex, often implicit, intentions behind human behavior. Existing models predominantly focus on spatiotemporal patterns, paying less attention to the underlying intentions that govern movements. Recent advancements in large language models (LLMs) offer a promising alternative research angle for integrating commonsense reasoning into mobility prediction. However, it is a non-trivial problem because LLMs are not natively built for mobility intention inference, and they also face scalability issues and integration difficulties with spatiotemporal models. To address these challenges, we propose a novel LIMP (LLMs for Intent-aware Mobility Prediction) framework. Specifically, LIMP introduces an "Analyze-Abstract-Infer" (A2I) agentic workflow to unleash LLMs' commonsense reasoning power for mobility intention inference. Besides, we design an efficient fine-tuning scheme to transfer reasoning power from a commercial LLM to a smaller-scale, open-source language model, ensuring LIMP's scalability to millions of mobility records. Moreover, we propose a transformer-based intention-aware mobility prediction model to effectively harness the intention inference ability of LLMs. Evaluated on two real-world datasets, LIMP significantly outperforms baseline models, demonstrating improved accuracy in next-location prediction and effective intention inference. The interpretability of intention-aware mobility prediction highlights our LIMP framework's potential for real-world applications. Code and data can be found at https://github.com/tsinghua-fib-lab/LIMP.

AI Nov 4, 2025 Code
Deep Ideation: Designing LLM Agents to Generate Novel Research Ideas on Scientific Concept Network

Keyu Zhao, Weiquan Lin, Qirui Zheng et al.

Novel research ideas play a critical role in advancing scientific inquiries. Recent advancements in Large Language Models (LLMs) have demonstrated their potential to generate novel research ideas by leveraging large-scale scientific literature. However, previous work in research ideation has primarily relied on simplistic methods, such as keyword co-occurrence or semantic similarity. These approaches focus on identifying statistical associations in the literature but overlook the complex, contextual relationships between scientific concepts, which are essential to effectively leverage knowledge embedded in human literature. For instance, papers that simultaneously mention "keyword A" and "keyword B" often present research ideas that integrate both concepts. Additionally, some LLM-driven methods propose and refine research ideas using the model's internal knowledge, but they fail to effectively utilize the scientific concept network, limiting the grounding of ideas in established research. To address these challenges, we propose the Deep Ideation framework, which integrates a scientific network that captures keyword co-occurrence and contextual relationships, enriching LLM-driven ideation. The framework introduces an explore-expand-evolve workflow to iteratively refine research ideas, using an Idea Stack to track progress. A critic engine, trained on real-world reviewer feedback, guides the process by providing continuous feedback on the novelty and feasibility of ideas. Our experiments show that our approach improves the quality of generated ideas by 10.67% compared to other methods, with ideas surpassing top conference acceptance levels. Human evaluation highlights their practical value in scientific research, and ablation studies confirm the effectiveness of each component in the workflow. The code repository is available at https://github.com/kyZhao-1/Deep-Ideation.

97.9 CE May 22 Code
LiveFigure: Generating Editable Scientific Illustration with VLM Agents

Chenyang Shao, Jiahe Liu, Fengli Xu et al.

Scientific illustrations are essential for depicting conceptual designs, methodologies, and experimental workflows in research, playing a pivotal role in communicating complex academic insights. However, creating high-quality scientific illustrations remains a labor-intensive task for human scientists. While recent generative image models have advanced prompt-based editing, the synthesis of fully editable figures remains a fundamental challenge. Valid editability involves structured transformations of graphical elements, scales, attributes, and text, rather than simple pixel-level changes. Existing models generate raster outputs that do not support manual correction or layout adjustment, limiting their utility in scientific publishing, where editable vector figures are typically required for submission. To address this challenge, we introduce LiveFigure, an agentic framework driven by VLM agents that imitates the multi-step drawing workflow of human researchers. It first plans figure blueprints by drawing inspiration from high-quality references in previous works, then generates executable scripts that produce figures via the PowerPoint interface based on skills and experience, and finally refines the outputs with targeted visual diagnostics, producing fully vectorized, editable figures that meet publication standards. Extensive experiments demonstrate that LiveFigure generates inherently editable figures, achieving 80% publication-readiness in only 17 manual edits, far surpassing the 24% rate of the strongest baseline, NanoBanana. Human preference studies further validate this advantage, with LiveFigure securing a 60% win rate against NanoBanana. Our code is available at https://github.com/tsinghua-fib-lab/LiveFigure.git.

AI Aug 8, 2024
Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions

Qingbin Zeng, Qinglong Yang, Shunan Dong et al.

This paper considers a scenario in city navigation: an AI agent is provided with language descriptions of the goal location with respect to some well-known landmarks. By observing only the surrounding scene, including recognizing landmarks and road network connections, the agent has to make decisions to navigate to the goal location without instructions. This problem is very challenging because it requires the agent to establish self-position and acquire a spatial representation of a complex urban environment, where landmarks are often invisible. In the absence of navigation instructions, such abilities are vital for the agent to make high-quality decisions in long-range city navigation. With the emergent reasoning ability of large language models (LLMs), a tempting baseline is to prompt LLMs to "react" to each observation and make decisions accordingly. However, this baseline performs very poorly: the agent often revisits the same locations and makes short-sighted, inconsistent decisions. To address these issues, this paper introduces a novel agentic workflow characterized by its abilities to perceive, reflect, and plan. Specifically, we find LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation. Moreover, reflection is achieved through a memory mechanism, where past experiences are stored and can be retrieved with current perception for effective decision argumentation. Planning uses reflection results to produce long-term plans, which can avoid short-sighted decisions in long-range navigation. We show the designed workflow significantly improves the navigation ability of the LLM agent compared with the state-of-the-art baselines.

CL Oct 31, 2025 Code
Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning

Chenyang Shao, Sijian Ren, Fengli Xu et al.

In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs' autoregressive generation paradigm results in reasoning performance scaling sub-optimally with test-time computation, often requiring excessive computational overhead to propose thoughts while yielding only marginal performance gains. In contrast, diffusion language models (DLMs) can efficiently produce diverse samples through parallel denoising in a single forward pass, inspiring us to leverage them for proposing intermediate thoughts, thereby alleviating the computational burden associated with autoregressive generation while maintaining quality. In this work, we propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Experiments across diverse benchmarks demonstrate that our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research. Our code is open-source at https://anonymous.4open.science/r/Diffuse-Thinking-EC60.

AI Jan 16, 2025 Code
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Fengli Xu, Qianyue Hao, Zefang Zong et al.

Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to show a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.

96.5 LG May 11 Code
ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Wanghan Xu, Yuhao Zhou, Hengyuan Zhao et al.

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .
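The four-quadrant decomposition described in this abstract can be illustrated with a minimal sketch. The quadrant names come from the abstract; the function names and the numeric reward weights below are hypothetical placeholders chosen only to show the sign structure (reward correction and robustness, penalize sycophancy, weak signal for persistent errors), not the values used by ReCrit.

```python
from enum import Enum


class Transition(Enum):
    CORRECTION = "correction"  # initially wrong, correct after criticism
    SYCOPHANCY = "sycophancy"  # initially correct, wrong after criticism
    ROBUSTNESS = "robustness"  # correct before and after criticism
    BOUNDARY = "boundary"      # wrong before and after criticism


def classify_transition(initial_correct: bool, critic_correct: bool) -> Transition:
    """Map the correctness of the initial and post-critic answers to a quadrant."""
    if initial_correct and critic_correct:
        return Transition.ROBUSTNESS
    if initial_correct and not critic_correct:
        return Transition.SYCOPHANCY
    if not initial_correct and critic_correct:
        return Transition.CORRECTION
    return Transition.BOUNDARY


# Hypothetical weights (assumption): only the signs and the small magnitude
# of the boundary term follow the abstract's description.
QUADRANT_REWARD = {
    Transition.CORRECTION: 1.0,
    Transition.ROBUSTNESS: 1.0,
    Transition.SYCOPHANCY: -1.0,
    Transition.BOUNDARY: -0.1,
}


def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    """Return the illustrative reward for one Initial-to-Critic transition."""
    return QUADRANT_REWARD[classify_transition(initial_correct, critic_correct)]
```

A final-answer reward would score Robustness and Correction identically and Sycophancy and Boundary identically; separating the four cases is what lets the weak boundary signal differ from the sycophancy penalty.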

HC Dec 18, 2025
HybridQuestion: Human-AI Collaboration for Identifying High-Impact Research Questions

Keyu Zhao, Fengli Xu, Yong Li et al.

The "AI Scientist" paradigm is transforming scientific research by automating key stages of the research process, from idea generation to scholarly writing. This shift is expected to accelerate discovery and expand the scope of scientific inquiry. However, a key question remains unclear: can AI scientists identify meaningful research questions? While Large Language Models (LLMs) have been applied successfully to task-specific ideation, their potential to conduct strategic, long-term assessments of past breakthroughs and future questions remains largely unexplored. To address this gap, we explore a human-AI hybrid solution that integrates the scalable data processing capabilities of AI with the value judgment of human experts. Our methodology is structured in three phases. The first phase, AI-Accelerated Information Gathering, leverages AI's advantage in processing vast amounts of literature to generate a hybrid information base. The second phase, Candidate Question Proposing, utilizes this synthesized data to prompt an ensemble of six diverse LLMs to propose an initial candidate pool, filtered via a cross-model voting mechanism. The third phase, Hybrid Question Selection, refines this pool through a multi-stage filtering process that progressively increases human oversight. To validate this system, we conducted an experiment aiming to identify the Top 10 Scientific Breakthroughs of 2025 and the Top 10 Scientific Questions for 2026 across five major disciplines. Our analysis reveals that while AI agents demonstrate high alignment with human experts in recognizing established breakthroughs, they exhibit greater divergence in forecasting prospective questions, suggesting that human judgment remains crucial for evaluating subjective, forward-looking challenges.
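The abstract does not specify how the cross-model voting in the second phase works. As a rough sketch under a simplifying assumption, one could count a vote whenever a model independently proposes the same question (matched here by exact string, which a real system would likely replace with semantic matching) and keep questions that reach a threshold. The function name `cross_model_vote` and the `min_votes` parameter are hypothetical.

```python
from collections import Counter


def cross_model_vote(proposals: dict[str, list[str]], min_votes: int) -> list[str]:
    """Keep questions proposed by at least `min_votes` distinct models.

    `proposals` maps a model name to the questions it proposed.
    Assumption: two proposals count as the same question only if the
    strings are identical.
    """
    votes = Counter()
    for questions in proposals.values():
        # A set ensures each model contributes at most one vote per question.
        votes.update(set(questions))
    return sorted(q for q, n in votes.items() if n >= min_votes)
```

For example, with three models proposing `["q1", "q2"]`, `["q2", "q3"]`, and `["q2", "q1"]`, a threshold of two votes keeps `q1` and `q2` and drops `q3`.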

CL Nov 7, 2025
AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Yu Li, Lehui Li, Qingmin Liao et al.

Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents to facilitate scientific inquiry. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases the model toward superficial similarity and overlooks experimental suitability. Harnessing collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
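For readers unfamiliar with the two reported metrics, the following is a minimal sketch of their common definitions for a single query; the paper may aggregate or define them slightly differently. The function names and argument layout are illustrative, not taken from the authors' code.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_rate_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0
```

Both values are typically averaged over all queries (here, papers for which datasets or baselines are recommended) to obtain the reported score.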

CL Nov 21, 2024 Code
Understanding World or Predicting Future? A Comprehensive Survey of World Models

Jingtao Ding, Yunke Zhang, Yu Shang et al.

The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including generative games, autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in https://github.com/tsinghua-fib-lab/World-Model.

AI Jan 7
Controllable LLM Reasoning via Sparse Autoencoder-Based Steering

Yi Fang, Wenjie Wang, Mingfeng Xue et al.

Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g., backtracking, cross-verification) during the reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are autonomously selected by LRMs themselves. However, such autonomous selection often produces inefficient or even erroneous reasoning paths. To make reasoning more reliable and flexible, it is important to develop methods for controlling reasoning strategies. Existing methods struggle to control fine-grained reasoning strategies due to conceptual entanglement in LRMs' hidden states. To address this, we leverage Sparse Autoencoders (SAEs) to decompose strategy-entangled hidden states into a disentangled feature space. To identify the few strategy-specific features from the vast pool of SAE features, we propose SAE-Steering, an efficient two-stage feature identification pipeline. SAE-Steering first recalls features that amplify the logits of strategy-specific keywords, filtering out over 99% of features, and then ranks the remaining features by their control effectiveness. Using the identified strategy-specific features as control vectors, SAE-Steering outperforms existing methods by over 15% in control effectiveness. Furthermore, controlling reasoning strategies can redirect LRMs from erroneous paths to correct ones, achieving a 7% absolute accuracy improvement.

AI Dec 19, 2023 Code
Urban Generative Intelligence (UGI): A Foundational Platform for Agents in Embodied City Environment

Fengli Xu, Jun Zhang, Chen Gao et al.

Urban environments, characterized by their complex, multi-layered networks encompassing physical, social, economic, and environmental dimensions, face significant challenges in the face of rapid urbanization. These challenges, ranging from traffic congestion and pollution to social inequality, call for advanced technological interventions. Recent developments in big data, artificial intelligence, urban computing, and digital twins have laid the groundwork for sophisticated city modeling and simulation. However, a gap persists between these technological capabilities and their practical implementation in addressing urban challenges in a systemic, intelligent way. This paper proposes Urban Generative Intelligence (UGI), a novel foundational platform integrating Large Language Models (LLMs) into urban systems to foster a new paradigm of urban intelligence. UGI leverages CityGPT, a foundation model trained on city-specific multi-source data, to create embodied agents for various urban tasks. These agents, operating within a textual urban environment emulated by a city simulator and an urban knowledge graph, interact through a natural language interface, offering an open platform for diverse intelligent and embodied agent development. This platform not only addresses specific urban issues but also simulates complex urban systems, providing a multidisciplinary approach to understand and manage urban complexity. This work signifies a transformative step in city science and urban intelligence, harnessing the power of LLMs to unravel and address the intricate dynamics of urban systems. The code repository with demonstrations will soon be released at https://github.com/tsinghua-fib-lab/UGI.

99.7 CL Apr 7
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Yu Li, Chenyang Shao, Xinyang Liu et al.

Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

LG Feb 18, 2024 Code
Large Language Model-driven Meta-structure Discovery in Heterogeneous Information Network

Lin Chen, Fengli Xu, Nian Li et al.

Heterogeneous information networks (HIN) have gained increasing popularity in recent years for capturing complex relations between diverse types of nodes. Meta-structures are proposed as a useful tool to identify the important patterns in HINs, but hand-crafted meta-structures pose significant challenges for scaling up, drawing wide research attention towards developing automatic search algorithms. Previous efforts primarily focused on searching for meta-structures with good empirical performance, overlooking the importance of human comprehensibility and generalizability. To address this challenge, we draw inspiration from the emergent reasoning abilities of large language models (LLMs). We propose ReStruct, a meta-structure search framework that integrates LLM reasoning into the evolutionary procedure. ReStruct uses a grammar translator to encode the meta-structures into natural language sentences, and leverages the reasoning power of LLMs to evaluate their semantic feasibility. Besides, ReStruct also employs performance-oriented evolutionary operations. These two competing forces allow ReStruct to jointly optimize the semantic explainability and empirical performance of meta-structures. Furthermore, ReStruct contains a differential LLM explainer to generate and refine natural language explanations for the discovered meta-structures by reasoning through the search history. Experiments on eight representative HIN datasets demonstrate that ReStruct achieves state-of-the-art performance in both recommendation and node classification tasks. Moreover, a survey study involving 73 graduate students shows that the discovered meta-structures and generated explanations by ReStruct are substantially more comprehensible. Our code and questionnaire are available at https://github.com/LinChen-65/ReStruct.

AI Feb 15, 2024 Code
Chain-of-Planned-Behaviour Workflow Elicits Few-Shot Mobility Generation in LLMs

Chenyang Shao, Fengli Xu, Bingbing Fan et al.

The powerful reasoning capabilities of large language models (LLMs) have brought revolutionary changes to many fields, but their performance in human behaviour generation has not yet been extensively explored. This gap likely emerges because the internal processes governing behavioral intentions cannot be solely explained by abstract reasoning. Instead, they are also influenced by a multitude of factors, including social norms and personal preference. Inspired by the Theory of Planned Behaviour (TPB), we develop an LLM workflow named Chain-of-Planned Behaviour (CoPB) for mobility behaviour generation, which reflects the important spatio-temporal dynamics of human activities. Through exploiting the cognitive structures of attitude, subjective norms, and perceived behaviour control in TPB, CoPB significantly enhances the ability of LLMs to reason about the intention of the next movement. Specifically, CoPB substantially reduces the error rate of mobility intention generation from 57.8% to 19.4%. To improve the scalability of the proposed CoPB workflow, we further explore the synergy between LLMs and mechanistic models. We find mechanistic mobility models, such as the gravity model, can effectively map mobility intentions to physical mobility behaviours. The strategy of integrating CoPB with the gravity model can reduce the token cost by 97.7% and achieve better performance simultaneously. Besides, the proposed CoPB workflow can facilitate GPT-4-turbo to automatically generate high-quality labels for mobility behavior reasoning. We show such labels can be leveraged to fine-tune the smaller-scale, open-source LLaMA 3-8B, which significantly reduces usage costs without sacrificing the quality of the generated behaviours.
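To illustrate how a gravity model could map an inferred intention to a physical location, here is a minimal sketch using the textbook form, where the probability of visiting a candidate grows with its "mass" (e.g., popularity) and decays with distance. The abstract does not give the exact formulation used in CoPB, so the function name, the `beta` exponent, and the data layout are assumptions for illustration only.

```python
import math


def gravity_probabilities(
    origin: tuple[float, float],
    candidates: dict[str, tuple[float, float, float]],
    beta: float = 2.0,
) -> dict[str, float]:
    """Normalize weights mass / distance**beta over candidate locations.

    `candidates` maps a location name to (x, y, mass). In this sketch the
    candidates are assumed to be pre-filtered to those matching the
    intention produced by the LLM (e.g., restaurants for "dining").
    """
    weights = {}
    for name, (x, y, mass) in candidates.items():
        # Clamp the distance to avoid division by zero at the origin.
        dist = max(math.hypot(x - origin[0], y - origin[1]), 1e-6)
        weights[name] = mass / dist**beta
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

Because the location choice is computed in closed form, the LLM only needs to produce the intention, which is consistent with the token-cost reduction the abstract reports for the combined approach.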

IR Feb 26, 2025 Code
AgentSociety Challenge: Designing LLM Agents for User Modeling and Recommendation on Web Platforms

Yuwei Yan, Yu Shang, Qingbin Zeng et al.

The AgentSociety Challenge is the first competition in the Web Conference that aims to explore the potential of Large Language Model (LLM) agents in modeling user behavior and enhancing recommender systems on web platforms. The Challenge consists of two tracks: the User Modeling Track and the Recommendation Track. Participants are tasked to utilize a combined dataset from Yelp, Amazon, and Goodreads, along with an interactive environment simulator, to develop innovative LLM agents. The Challenge has attracted 295 teams across the globe and received over 1,400 submissions in total over the course of 37 official competition days. The participants have achieved 21.9% and 20.3% performance improvement for Track 1 and Track 2 in the Development Phase, and 9.1% and 15.9% in the Final Phase, representing a significant accomplishment. This paper discusses the detailed designs of the Challenge, analyzes the outcomes, and highlights the most successful LLM agent designs. To support further research and development, we have open-sourced the benchmark environment at https://tsinghua-fib-lab.github.io/AgentSocietyChallenge.

IR Jun 5, 2025 Code
Reason-to-Recommend: Using Interaction-of-Thought Reasoning to Enhance LLM Recommendation

Keyu Zhao, Fengli Xu, Yong Li

Driven by advances in Large Language Models (LLMs), integrating them into recommendation tasks has gained interest due to their strong semantic understanding and prompt flexibility. Prior work encoded user-item interactions or metadata into prompts for recommendations. In parallel, LLM reasoning, boosted by test-time scaling and reinforcement learning, has excelled in fields like mathematics and code, where reasoning traces and correctness signals are clear, enabling high performance and interpretability. However, directly applying these reasoning methods to recommendation is ineffective because user feedback is implicit and lacks reasoning supervision. To address this, we propose R2Rec, a reasoning-enhanced recommendation framework that samples interaction chains from the user-item graph and converts them into structured interaction-of-thoughts via a progressive masked prompting strategy, with each thought representing stepwise reasoning grounded in interaction context. This allows LLMs to simulate step-by-step decision-making based on implicit patterns. We design a two-stage training pipeline: supervised fine-tuning teaches basic reasoning from high-quality traces, and reinforcement learning refines reasoning via reward signals, alleviating sparse explicit supervision. Experiments on three real-world datasets show R2Rec outperforms classical and LLM-based baselines with an average 10.48% improvement in HitRatio@1 and 131.81% gain over the original LLM. Furthermore, the explicit reasoning chains enhance interpretability by revealing the decision process. Our code is available at: https://anonymous.4open.science/r/R2Rec-7C5D.

AI Dec 19, 2023
Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives

Chen Gao, Xiaochong Lan, Nian Li et al.

Agent-based modeling and simulation has evolved as a powerful tool for modeling complex systems, offering insights into emergent behaviors and interactions among diverse agents. Integrating large language models into agent-based modeling and simulation presents a promising avenue for enhancing simulation capabilities. This paper surveys the landscape of utilizing large language models in agent-based modeling and simulation, examining their challenges and promising future directions. In this survey, since this is an interdisciplinary field, we first introduce the background of agent-based modeling and simulation and large language model-empowered agents. We then discuss the motivation for applying large language models to agent-based simulation and systematically analyze the challenges in environment perception, human alignment, action generation, and evaluation. Most importantly, we provide a comprehensive overview of the recent works of large language model-empowered agent-based modeling and simulation in multiple scenarios, which can be divided into four domains: cyber, physical, social, and hybrid, covering simulation of both real-world and virtual environments. Finally, since this area is new and quickly evolving, we discuss the open problems and promising future directions.

AIJan 8
ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

Zhilun Zhou, Zihan Liu, Jiahe Liu et al.

Large Language Model-based Multi-Agent Systems (LLM-based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied the adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems. In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two-stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS's resilience, based on which we train a topology generator to automatically design resilient topology for specific tasks through reinforcement learning. Second, we introduce a topology-aware prompt optimization method that refines each agent's prompt based on its connections and interactions with other agents. Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.

CLJun 26, 2025Code
AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

Chenyang Shao, Tianxing Li, Chenhao Pu et al.

In today's digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at https://github.com/tsinghua-fib-lab/AgentStealth.

HCJun 9, 2025Code
FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

Qinglong Yang, Haoming Li, Haotian Zhao et al.

Mobile GUI agents are becoming critical tools for enhancing human-device interaction efficiency, with multimodal large language models (MLLMs) emerging as dominant paradigms in this domain. Current agents, however, are limited to following explicit human instructions, resulting in insufficient capability for proactive intent anticipation. Additionally, these agents fail to leverage the contextual information associated with users during task execution, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip benchmark. It contains two new tracks: proactive task suggestions by analyzing environment observation and users' previous intents, and personalized task execution by catering to users' action preferences. We collected unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users' long-term usage in their real lives, and encompass essential user-related contextual information. Our experiments reveal challenges of the tasks we propose. The model fine-tuned with the data we collected effectively utilized user information and achieved good results, highlighting the potential of our approach in building more user-oriented mobile GUI agents. Our code is open-source at https://anonymous.4open.science/r/FingerTip-57B8 for reproducibility.

LGSep 25, 2025Code
Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs

Honglin Zhang, Qianyue Hao, Fengli Xu et al.

Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the mechanisms by which RL fine-tuning enhances the capability of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.
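As a rough illustration of what "higher entropy and less concentrated edge distributions" can mean, the sketch below computes the Shannon entropy of a set of edge scores after normalizing their magnitudes into a probability distribution. The function name `edge_entropy` and the normalization by absolute value are assumptions made for this example; the abstract does not specify how the paper computes its diversity measure.

```python
import math

def edge_entropy(scores):
    """Shannon entropy (in nats) of edge scores normalized by magnitude.

    Hypothetical helper: normalizing by absolute value is an assumption,
    not a detail taken from the paper.
    """
    weights = [abs(s) for s in scores]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one edge score must be non-zero")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

# Evenly spread scores give the maximum entropy, log(number of edges).
uniform = edge_entropy([1.0, 1.0, 1.0, 1.0])
# Scores dominated by a single edge give a lower entropy.
concentrated = edge_entropy([10.0, 0.1, 0.1, 0.1])
```

Under this toy measure, a model whose attribution mass is spread over many edges scores higher than one that relies on a few dominant edges.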

IRMay 26, 2025Code
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

Yu Shang, Peijie Liu, Yuwei Yan et al.

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing and studying agentic recommender systems; and (3) the first comprehensive benchmark comparing 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a continuously maintained leaderboard (https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html), fostering ongoing community engagement and reproducible research. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.

LGNov 9, 2020Code
Automorphic Equivalence-aware Graph Neural Network

Fengli Xu, Quanming Yao, Pan Hui et al.

Distinguishing the automorphic equivalence of nodes in a graph plays an essential role in many scientific domains, e.g., computational biology and social network analysis. However, existing graph neural networks (GNNs) fail to capture such an important property. To make GNNs aware of automorphic equivalence, we first introduce a localized variant of this concept -- ego-centered automorphic equivalence (Ego-AE). Then, we design a novel variant of GNN, i.e., GRAPE, that uses learnable AE-aware aggregators to explicitly differentiate the Ego-AE of each node's neighbors with the aid of various subgraph templates. Since designing subgraph templates can be hard, we further propose a genetic algorithm to automatically search for them from graph data. Moreover, we theoretically prove that GRAPE is expressive in terms of generating distinct representations for nodes with different Ego-AE features, which fills in a fundamental gap of existing GNN variants. Finally, we empirically validate our model on eight real-world graph datasets, including social networks, e-commerce co-purchase networks, and citation networks, and show that it consistently outperforms existing GNNs. The source code is publicly available at https://github.com/tsinghua-fib-lab/GRAPE.

SIFeb 12, 2025
AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

Jinghua Piao, Yuwei Yan, Jun Zhang et al.

Understanding human behavior and society is a central focus in social sciences, with the rise of generative social science marking a significant paradigmatic shift. By leveraging bottom-up simulations, it replaces costly and logistically challenging traditional experiments with scalable, replicable, and systematic computational approaches for studying complex social dynamics. Recent advances in large language models (LLMs) have further transformed this research paradigm, enabling the creation of human-like generative social agents and realistic simulacra of society. In this paper, we propose AgentSociety, a large-scale social simulator that integrates LLM-driven agents, a realistic societal environment, and a powerful large-scale simulation engine. Based on the proposed simulator, we generate social lives for over 10k agents, simulating their 5 million interactions both among agents and between agents and their environment. Furthermore, we explore the potential of AgentSociety as a testbed for computational social experiments, focusing on four key social issues: polarization, the spread of inflammatory messages, the effects of universal basic income policies, and the impact of external shocks such as hurricanes. These four issues serve as valuable cases for assessing AgentSociety's support for typical research methods -- such as surveys, interviews, and interventions -- as well as for investigating the patterns, causes, and underlying mechanisms of social issues. The alignment between AgentSociety's outcomes and real-world experimental results not only demonstrates its ability to capture human behaviors and their underlying mechanisms, but also underscores its potential as an important platform for social scientists and policymakers.

MAOct 11, 2024
OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents

Yuwei Yan, Qingbin Zeng, Zhiheng Zheng et al.

Agent-based models (ABMs) have long been employed to explore how individual behaviors aggregate into complex societal phenomena in urban space. Unlike black-box predictive models, ABMs excel at explaining the micro-macro linkages that drive such emergent behaviors. The recent rise of Large Language Models (LLMs) has led to the development of LLM agents capable of simulating urban activities with unprecedented realism. However, the extremely high computational cost of LLMs presents significant challenges for scaling up the simulations of LLM agents. To address this problem, we propose OpenCity, a scalable simulation platform optimized for both system and prompt efficiencies. Specifically, we propose an LLM request scheduler to reduce communication overhead by parallelizing requests through IO multiplexing. Besides, we design a "group-and-distill" prompt optimization strategy that minimizes redundancy by clustering agents with similar static attributes. Through experiments on six global cities, OpenCity achieves a 600-fold acceleration in simulation time per agent, a 70% reduction in LLM requests, and a 50% reduction in token usage. These improvements enable the simulation of 10,000 agents' daily activities in 1 hour on commodity hardware. Besides, the substantial speedup of OpenCity allows us to establish an urban simulation benchmark for LLM agents for the first time, comparing simulated urban activities with real-world data in 6 major cities around the globe. We believe our OpenCity platform provides a critical infrastructure to harness the power of LLMs for interdisciplinary studies in urban space, fostering the collective efforts of broader research communities. Code repo is available at https://anonymous.4open.science/r/Anonymous-OpenCity-42BD.
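The two ideas in this abstract, issuing many LLM requests concurrently and sharing one prompt across agents with the same static attributes, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenCity's implementation: `fake_llm` is a hypothetical stand-in for a real LLM call, the attribute names `age_group` and `occupation` are invented, and grouping by exact attribute match is a simplification of the clustering the abstract describes.

```python
import asyncio
from collections import defaultdict

async def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; the sleep simulates latency.
    await asyncio.sleep(0.01)
    return f"plan for {prompt}"

def group_agents(agents):
    """Group agent ids by identical static attributes (assumed attribute names)."""
    groups = defaultdict(list)
    for agent in agents:
        key = (agent["age_group"], agent["occupation"])
        groups[key].append(agent["id"])
    return dict(groups)

async def simulate(agents):
    groups = group_agents(agents)
    keys = list(groups)
    # One request per group, all in flight at once on a single event loop.
    replies = await asyncio.gather(*(fake_llm(f"{k[0]} {k[1]}") for k in keys))
    plans = {}
    for key, reply in zip(keys, replies):
        for agent_id in groups[key]:
            plans[agent_id] = reply  # every agent in the group reuses the reply
    return plans, len(keys)

agents = [
    {"id": 1, "age_group": "adult", "occupation": "teacher"},
    {"id": 2, "age_group": "adult", "occupation": "teacher"},
    {"id": 3, "age_group": "senior", "occupation": "retired"},
]
plans, n_requests = asyncio.run(simulate(agents))
```

Here three agents are served by two requests, and the requests overlap in time instead of running one after another.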

AIJan 16, 2025
A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

Huandong Wang, Wenjie Fu, Yingzhou Tang et al.

While large language models (LLMs) present significant potential for supporting numerous real-world applications and delivering positive social impacts, they still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously used for generating toxic content and unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advancements aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collecting and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on the recent advances for enhancing the performance of LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defenses. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework that encompasses these diverse dimensions, providing a comprehensive view of enhancing LLMs to better serve real-world applications.

CLFeb 6, 2025
Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

Chenyang Shao, Xinyuan Hu, Yutang Lin et al.

The rapid expansion of web content has made on-device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning ability in large language models offers a promising path for next-generation on-device AI agents. However, deploying full-scale Large Language Models (LLMs) on resource-limited local devices is challenging. In this paper, we propose Division-of-Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT leverages a Task Decomposer to elicit the inherent planning abilities in language models to decompose user queries into smaller sub-tasks, which allows hybrid language models to fully exploit their respective strengths. Besides, DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks and create a dependency graph, facilitating parallel reasoning of sub-tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an additional task head attached to the SLM that does not alter the SLM's parameters. To boost the adapter's task allocation capability, we propose a self-reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that our DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, respectively, while achieving comparable reasoning accuracy with the best baseline methods.
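The role of a dependency graph in enabling parallel reasoning can be illustrated with Python's standard `graphlib` module. The sketch below groups sub-tasks into batches whose members have no unmet dependencies and could therefore run concurrently. The sub-task names and the `parallel_batches` helper are hypothetical; this shows the general scheduling idea, not DoT's Task Scheduler.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group sub-tasks into batches that can run concurrently.

    `dependencies` maps each sub-task to the set of sub-tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all sub-tasks with no unmet dependency
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches

# Hypothetical decomposition of one user query into four sub-tasks.
deps = {
    "parse": set(),
    "search_a": {"parse"},
    "search_b": {"parse"},
    "answer": {"search_a", "search_b"},
}
batches = parallel_batches(deps)
```

In this example the two search sub-tasks land in the same batch, so they can be dispatched in parallel once parsing is complete.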

CLNov 20, 2024
A Survey on Human-Centric LLMs

Jing Yi Wang, Nicholas Sukiennik, Tong Li et al.

The rapid evolution of large language models (LLMs) and their capacity to simulate human cognition and behavior has given rise to LLM-based frameworks and tools that are evaluated and applied based on their ability to perform tasks traditionally performed by humans, namely those involving cognition, decision-making, and social interaction. This survey provides a comprehensive examination of such human-centric LLM capabilities, focusing on their performance in both individual tasks (where an LLM acts as a stand-in for a single human) and collective tasks (where multiple LLMs coordinate to mimic group dynamics). We first evaluate LLM competencies across key areas including reasoning, perception, and social cognition, comparing their abilities to human-like skills. Then, we explore real-world applications of LLMs in human-centric domains such as behavioral science, political science, and sociology, assessing their effectiveness in replicating human behaviors and interactions. Finally, we identify challenges and future research directions, such as improving LLM adaptability, emotional intelligence, and cultural sensitivity, while addressing inherent biases and enhancing frameworks for human-AI collaboration. This survey aims to provide a foundational understanding of LLMs from a human-centric perspective, offering insights into their current capabilities and potential for future development.

CYApr 11, 2025
An Evaluation of Cultural Value Alignment in LLM

Nicholas Sukiennik, Chen Gao, Fengli Xu et al.

LLMs as intelligent agents are being increasingly applied in scenarios where human interactions are involved, leading to a critical concern about whether LLMs are faithful to the variations in culture across regions. Several works have investigated this question in various ways, finding that there are biases present in the cultural representations of LLM outputs. To gain a more comprehensive view, in this work, we conduct the first large-scale evaluation of LLM culture, assessing 20 countries' cultures and languages across ten LLMs. With a renowned cultural values questionnaire and by carefully comparing LLM output with human ground truth scores, we thoroughly study LLMs' cultural alignment across countries and among individual models. Our findings show that the output over all models represents a moderate cultural middle ground. Given the overall skew, we propose an alignment metric, revealing that the United States is the best-aligned country and GLM-4 has the best ability to align to cultural values. Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output. Specifically, models, regardless of where they originate, align better with the US than they do with China. The conclusions provide insight into how LLMs can be better aligned to various cultures as well as provoke further discussion of the potential for LLMs to propagate cultural bias and the need for more culturally adaptable models.

CLFeb 4, 2024
Synergy-of-Thoughts: Eliciting Efficient Reasoning in Hybrid Language Models

Yu Shang, Yu Li, Fengli Xu et al.

Large language models (LLMs) have shown impressive emergent abilities in a wide range of tasks, but the associated expensive API cost greatly limits real-world application. Previous works like chain-of-thought (CoT) and tree-of-thoughts (ToT) have predominantly focused on enhancing accuracy, but overlook the rapidly increasing API cost, which could be particularly problematic for open-ended real-world tasks with huge solution spaces. Motivated by the dual process theory of human cognition, we propose "Synergy of Thoughts" (SoT) to unleash the synergistic potential of hybrid LLMs with different scales for efficient reasoning. By default, SoT uses smaller-scale language models to generate multiple low-cost intuitive thoughts, which resemble the parallel intuitions produced by System 1. We then design a confidence evaluator where the intuitive thoughts are cross-evaluated and introduce a controllable threshold mechanism to decide their mutual conflict. If these intuitive thoughts exhibit conflicts, SoT will invoke the reflective reasoning of scaled-up language models to emulate the intervention of System 2, which will override the intuitive thoughts and rectify the reasoning results. This framework is model-agnostic and training-free, and can be flexibly implemented with various off-the-shelf LLMs. Experiments on six representative reasoning tasks show that SoT substantially reduces the API cost by 38.3%-75.1%, and simultaneously achieves state-of-the-art reasoning accuracy and solution diversity. Notably, the average token cost reduction on open-ended tasks reaches up to 69.1%.
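The threshold-based escalation described here can be sketched as a small routing function. This is a simplified illustration, not the SoT method itself: majority agreement among the small model's outputs stands in for the paper's cross-evaluating confidence evaluator, and `small_model`, `large_model`, `k`, and `threshold` are hypothetical names and values chosen for the example.

```python
from collections import Counter

def answer_with_escalation(question, small_model, large_model, k=3, threshold=0.6):
    """Use cheap answers when they agree; otherwise escalate to the large model.

    Agreement ratio is an assumed proxy for a confidence score.
    """
    # Draw k low-cost "intuitive" answers from the small model.
    thoughts = [small_model(question, seed) for seed in range(k)]
    top_answer, votes = Counter(thoughts).most_common(1)[0]
    agreement = votes / k
    if agreement >= threshold:
        return top_answer, "small"  # no conflict: keep the cheap answer
    # Conflict detected: the large model overrides the intuitive answers.
    return large_model(question), "large"

# Stub models for demonstration only.
consistent_small = lambda question, seed: "4"
conflicting_small = lambda question, seed: str(seed)
large = lambda question: "42"

easy = answer_with_escalation("2 + 2?", consistent_small, large)
hard = answer_with_escalation("hard question", conflicting_small, large)
```

The large model is called only on the second query, which is where the cost saving in such a hybrid setup comes from.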

CLOct 29, 2024
Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN

Zhilun Zhou, Jingyang Fan, Yu Liu et al.

The fast development of location-based social networks (LBSNs) has led to significant changes in society, resulting in popular studies of using LBSN data for socioeconomic prediction, e.g., regional population and commercial activity estimation. Existing studies design various graphs to model heterogeneous LBSN data, and further apply graph representation learning methods for socioeconomic prediction. However, these approaches heavily rely on heuristic ideas and expertise to extract task-relevant knowledge from diverse data, which may not be optimal for specific tasks. Additionally, they tend to overlook the inherent relationships between different indicators, limiting the prediction accuracy. Motivated by the remarkable abilities of large language models (LLMs) in commonsense reasoning, embedding, and multi-agent collaboration, in this work, we synergize LLM agents and knowledge graph for socioeconomic prediction. We first construct a location-based knowledge graph (LBKG) to integrate multi-sourced LBSN data. Then we leverage the reasoning power of LLM agent to identify relevant meta-paths in the LBKG for each type of socioeconomic prediction task, and design a semantic-guided attention module for knowledge fusion with meta-paths. Moreover, we introduce a cross-task communication mechanism to further enhance performance by enabling knowledge sharing across tasks at both LLM agent and KG levels. On the one hand, the LLM agents for different tasks collaborate to generate more diverse and comprehensive meta-paths. On the other hand, the embeddings from different tasks are adaptively merged for better socioeconomic prediction. Experiments on two datasets demonstrate the effectiveness of the synergistic design between LLM and KG, providing insights for information sharing across socioeconomic prediction tasks.

AIApr 14, 2025
A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science

Jie Feng, Jinwei Zeng, Qingyue Long et al. · tsinghua

Over the past year, the development of large language models (LLMs) has brought spatial intelligence into focus, with much attention on vision-based embodied intelligence. However, spatial intelligence spans a broader range of disciplines and scales, from navigation and urban planning to remote sensing and earth science. What are the differences and connections between spatial intelligence across these fields? In this paper, we first review human spatial cognition and its implications for spatial intelligence in LLMs. We then examine spatial memory, knowledge representations, and abstract reasoning in LLMs, highlighting their roles and connections. Finally, we analyze spatial intelligence across scales -- from embodied to urban and global levels -- following a framework that progresses from spatial memory and understanding to spatial reasoning and intelligence. Through this survey, we aim to provide insights into interdisciplinary spatial intelligence research and inspire future studies.

CLFeb 17, 2025
Invisible Walls in Cities: Leveraging Large Language Models to Predict Urban Segregation Experience with Social Media Content

Bingbing Fan, Lin Chen, Songwei Li et al.

Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user-generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambiguity, and confluence of diverse perspectives. To tackle these challenges, we propose using Large Language Models (LLMs) to automate online review mining for segregation prediction. We design a Reflective LLM Coder to digest social media content into insights consistent with real-world feedback, and eventually produce a codebook capturing key dimensions that signal segregation experience, such as cultural resonance and appeal, accessibility and convenience, and community engagement and local involvement. Guided by the codebook, LLMs can generate both informative review summaries and ratings for segregation prediction. Moreover, we design a REasoning-and-EMbedding (RE'EM) framework, which combines the reasoning and embedding capabilities of language models to integrate multi-channel features for segregation prediction. Experiments on real-world data demonstrate that our framework greatly improves prediction accuracy, with a 22.79% elevation in R2 and a 9.33% reduction in MSE. The derived codebook is generalizable across three different cities, consistently improving prediction accuracy. Moreover, our user study confirms that the codebook-guided summaries provide cognitive gains for human participants in perceiving POIs' social inclusiveness. Our study marks an important step toward understanding implicit social barriers and inequalities, demonstrating the great potential of promoting social inclusiveness with AI.

CEDec 16, 2025
SciNetBench: A Relation-Aware Benchmark for Scientific Literature Retrieval Agents

Chenyang Shao, Yong Li, Fengli Xu

The rapid development of AI agents has spurred the development of advanced research tools, such as Deep Research. Achieving this requires a nuanced understanding of the relations within scientific literature, which surpasses the scope of keyword-based or embedding-based retrieval. Existing retrieval agents mainly focus on content-level similarities and are unable to decode critical relational dynamics, such as identifying corroborating or conflicting studies or tracing technological lineages, all of which are essential for a comprehensive literature review. Consequently, this fundamental limitation often results in a fragmented knowledge structure, misleading sentiment interpretation, and inadequate modeling of collective scientific progress. To investigate relation-aware retrieval more deeply, we propose SciNetBench, the first Scientific Network Relation-aware Benchmark for literature retrieval agents. Constructed from a corpus of over 18 million AI papers, our benchmark systematically evaluates three levels of relations: ego-centric retrieval of papers with novel knowledge structures, pair-wise identification of scholarly relationships, and path-wise reconstruction of scientific evolutionary trajectories. Through extensive evaluation of three categories of retrieval agents, we find that their accuracy on relation-aware retrieval tasks often falls below 20%, revealing a core shortcoming of current retrieval paradigms. Notably, further experiments on the literature review tasks demonstrate that providing agents with relational ground truth leads to a substantial 23.4% performance improvement in the review quality, validating the critical importance of relation-aware retrieval. We publicly release our benchmark at https://anonymous.4open.science/r/SciNetBench/ to support future research on advanced retrieval systems.

CYNov 26, 2025
AI Urban Scientist: Multi-Agent Collaborative Automation for Urban Research

Tong Xia, Jiankun Zhang, Ruiwen You et al.

Urban research aims to understand how cities operate and evolve as complex adaptive systems. With the rapid growth of urban data and analytical methodologies, the central challenge of the field has shifted from data availability to the integration of heterogeneous data into coherent, verifiable urban knowledge through multidisciplinary approaches. Recent advances in AI, particularly the emergence of large language models (LLMs), have enabled the development of AI scientists capable of autonomous reasoning, hypothesis generation, and data-driven experimentation, demonstrating substantial potential for autonomous urban research. However, most general-purpose AI systems remain misaligned with the domain-specific knowledge, methodological conventions, and inferential standards required in urban studies. Here, we introduce the AI Urban Scientist, a knowledge-driven multi-agent framework designed to support autonomous urban research. Grounded in hypotheses, peer-review feedback, datasets, and research methodologies distilled from large-scale prior studies, the system constructs structured domain knowledge that guides LLM-based agents to automatically generate hypotheses, identify and integrate multi-source urban datasets, conduct empirical analyses and simulations, and iteratively refine analytical methods. Through this process, the framework synthesizes new insights in urban science and accelerates the urban research lifecycle.

AINov 21, 2025
MirrorMind: Empowering OmniScientist with the Expert Perspectives and Collective Knowledge of Human Scientists

Qingbin Zeng, Bingbing Fan, Zhiyu Chen et al.

The emergence of AI Scientists has demonstrated remarkable potential in automating scientific research. However, current approaches largely conceptualize scientific discovery as a solitary optimization or search process, overlooking that knowledge production is inherently a social and historical endeavor. Human scientific insight stems from two distinct yet interconnected sources. The first is the individual cognitive trajectory, where a researcher's unique insight is shaped by their evolving research history and stylistic preferences; the second is the collective disciplinary memory, where knowledge is sedimented into vast, interconnected networks of citations and concepts. Existing LLMs still struggle to represent these structured, high-fidelity cognitive and social contexts. To bridge this gap, we introduce MirrorMind, a hierarchical cognitive architecture that integrates dual-memory representations within a three-level framework. The Individual Level constructs high-fidelity cognitive models of individual researchers by capturing their episodic, semantic, and persona memories; the Domain Level maps collective knowledge into structured disciplinary concept graphs; and the Interdisciplinary Level acts as an orthogonal orchestration engine. Crucially, our architecture separates memory storage from agentic execution, enabling AI scientist agents to flexibly access individual memories for unique perspectives or collective structures to reason. We evaluate MirrorMind across four comprehensive tasks, including author-level cognitive simulation, complementary reasoning, cross-disciplinary collaboration promotion, and multi-agent scientific problem solving. The results show that by integrating individual cognitive depth with collective disciplinary breadth, MirrorMind moves beyond simple fact retrieval toward structural, personalized, and insight-generating scientific reasoning.

CYNov 21, 2025
OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

Chenyang Shao, Dehao Huang, Yu Li et al.

With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and collaborative endeavor. Real-world science relies on a complex scientific infrastructure composed of collaborative mechanisms, contribution attribution, peer review, and structured scientific knowledge networks. Due to the lack of modeling for these critical dimensions, current systems struggle to establish a genuine research ecosystem or interact deeply with the human scientific community. To bridge this gap, we introduce OmniScientist, a framework that explicitly encodes the underlying mechanisms of human research into the AI scientific workflow. OmniScientist not only achieves end-to-end automation across data foundation, literature review, research ideation, experiment automation, scientific writing, and peer review, but also provides comprehensive infrastructural support by simulating the human scientific system, comprising: (1) a structured knowledge system built upon citation networks and conceptual correlations; (2) a collaborative research protocol (OSP), which enables seamless multi-agent collaboration and human researcher participation; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. This infrastructure empowers agents to not only comprehend and leverage human knowledge systems but also to collaborate and co-evolve, fostering a sustainable and scalable innovation ecosystem.
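The abstract mentions Elo rankings computed from blind pairwise votes but does not say how ScienceArena implements them. As a rough illustration only, the textbook Elo update can be sketched as follows; the function names, the K-factor of 32, and the 400-point scale are conventional defaults assumed here, not details taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise vote.

    score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie.
    k is an assumed K-factor; the platform's actual value is not stated.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's result and expectation are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical systems start level; A wins one blind vote.
    print(elo_update(1500.0, 1500.0, 1.0))
```

Because each vote moves the two ratings by equal and opposite amounts, the sum of ratings is preserved across updates.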

AISep 18, 2025
Rationality Check! Benchmarking the Rationality of Large Language Models

Zhilun Zhou, Jing Yi Wang, Nicholas Sukiennik et al.

Large language models (LLMs), a recent advance in deep learning and machine intelligence, have manifested astonishing capacities and are now considered among the most promising routes toward artificial general intelligence. With human-like capabilities, LLMs have been used to simulate humans and serve as AI assistants across many applications. As a result, great concern has arisen about whether and under what circumstances LLMs think and behave like real human agents. Rationality is among the most important concepts in assessing human behavior, both in thinking (i.e., theoretical rationality) and in taking action (i.e., practical rationality). In this work, we propose the first benchmark for evaluating the omnibus rationality of LLMs, covering a wide range of domains and LLMs. The benchmark includes an easy-to-use toolkit, extensive experimental results, and analysis that illuminates where LLMs converge with and diverge from idealized human rationality. We believe the benchmark can serve as a foundational tool for both developers and users of LLMs.

IRFeb 21, 2021
Genetic Meta-Structure Search for Recommendation on Heterogeneous Information Network

Zhenyu Han, Fengli Xu, Jinghan Shi et al.

In the past decade, the heterogeneous information network (HIN) has become an important methodology for modern recommender systems. To fully leverage its power, manually designed network templates, i.e., meta-structures, are introduced to filter out semantic-aware information. Hand-crafted meta-structures rely on intense expert knowledge, and designing them is both laborious and data-dependent. On the other hand, the number of meta-structures grows exponentially with their size and the number of node types, which prohibits brute-force search. To address these challenges, we propose Genetic Meta-Structure Search (GEMS) to automatically optimize meta-structure designs for recommendation on HINs. Specifically, GEMS adopts a parallel genetic algorithm to search for meaningful meta-structures for recommendation, and designs dedicated rules and a meta-structure predictor to efficiently explore the search space. Finally, we propose an attention-based multi-view graph convolutional network module to dynamically fuse information from different meta-structures. Extensive experiments on three real-world datasets suggest the effectiveness of GEMS, which consistently outperforms all baseline methods in HIN recommendation. Compared with a simplified GEMS that uses hand-crafted meta-paths, GEMS achieves over $6\%$ performance gain on most evaluation metrics. More importantly, we conduct an in-depth analysis of the identified meta-structures, which sheds light on HIN-based recommender system design.

LGJan 3, 2021
AttnMove: History Enhanced Trajectory Recovery via Attentional Network

Tong Xia, Yunhan Qi, Jie Feng et al.

A considerable amount of mobility data has been accumulated due to the proliferation of location-based services. Nevertheless, compared with mobility data from transportation systems, such as the GPS modules in taxis, this kind of data is commonly sparse in terms of individual trajectories, because users do not access mobile services and contribute their data all the time. Consequently, the sparsity inevitably weakens the practical value of the data even if it has a high user penetration rate. To solve this problem, we propose a novel attentional neural network-based model, named AttnMove, to densify individual trajectories by recovering unobserved locations at a fine-grained spatial-temporal resolution. To tackle the challenges posed by sparsity, we design various intra- and inter-trajectory attention mechanisms to better model the mobility regularity of users and fully exploit the periodic pattern from long-term history. We evaluate our model on two real-world datasets, and extensive results demonstrate the performance gain compared with state-of-the-art methods. This also shows that, by providing high-quality mobility data, our model can benefit a variety of mobility-oriented downstream applications.

CYOct 16, 2020
Understanding the Role of Intermediaries in Online Social E-commerce: An Exploratory Study of Beidian

Zhilong Chen, Hancheng Cao, Fengli Xu et al.

Social e-commerce, as a new form of social computing based marketing platforms, utilizes existing real-world social relationships for promotions and sales of products. It has been growing rapidly in recent years and has attracted tens of millions of users in China. A key group of actors who enable market transactions on these platforms are intermediaries who connect producers with consumers by sharing information with and recommending products to their real-world social contacts. Despite their crucial role, the nature and behavior of these intermediaries on social e-commerce platforms have not been systematically analyzed. Here we address this knowledge gap through a mixed-methods study. Leveraging 9 months of all-round behavioral data from about 40 million users on Beidian, one of the largest social e-commerce sites in China, alongside qualitative evidence from online forums and interviews, we examine characteristics of intermediaries, identify their behavioral patterns, and uncover strategies and mechanisms that make successful intermediaries. We demonstrate that intermediaries on social e-commerce sites act as local trend detectors and "social grocers". Furthermore, successful intermediaries are highly dedicated whenever best sellers appear and broaden items for promotion. To the best of our knowledge, this paper presents the first large-scale analysis of the emerging role of intermediaries in social e-commerce platforms, which provides potential insights for the design and management of social computing marketing platforms.

CYOct 4, 2020
Learning from Home: A Mixed-Methods Analysis of Live Streaming Based Remote Education Experience in Chinese Colleges During the COVID-19 Pandemic

Zhilong Chen, Hancheng Cao, Yuting Deng et al.

The COVID-19 global pandemic and the resulting lockdown policies have forced education in nearly every country to switch from a traditional co-located paradigm to a purely online 'distance learning from home' paradigm. At the center of this paradigm shift is the emergence and wide adoption of distance communication tools and live streaming platforms for education. Here, we present a mixed-methods study of the live-streaming-based education experience during the COVID-19 pandemic. We focus our analysis on Chinese higher education: we carried out semi-structured interviews with 30 students and 7 instructors from diverse colleges and disciplines, and launched a large-scale survey covering 6291 students and 1160 instructors in one leading Chinese university. Our study not only reveals important design guidelines and insights to better support the current remote learning experience during the pandemic, but also provides valuable implications for constructing future collaborative education support systems and experiences after the pandemic.

SIAug 15, 2019
When Your Friends Become Sellers: An Empirical Study of Social Commerce Site Beidian

Hancheng Cao, Zhilong Chen, Fengli Xu et al.

The past few years have witnessed the emergence and phenomenal success of strong-tie based social commerce. Embedded in social networking sites, these E-Commerce platforms transform ordinary people into sellers, where they advertise and sell products to their friends and family in online social networks. These sites can acquire millions of users within a short time and are growing at an accelerating rate. However, little is known about how these social commerce platforms develop as a blend of social relationships and economic transactions. In this paper we present the first measurement study on the full-scale data of Beidian, one of the fastest growing social commerce sites in China, which involves 11.8 million users. We first analyzed the topological structure of the Beidian platform and highlighted its decentralized nature. We then studied the site's rapid growth and its growth mechanism via invitation cascade. Finally, we investigated purchasing behavior on Beidian, where we focused on user proximity and loyalty, which contribute to the site's high conversion rate. As a consequence of interactions between strong ties and economic logic, emerging social commerce demonstrates significant property deviations from all known social networks and E-Commerce platforms in terms of network structure, dynamics, and user behavior. To the best of our knowledge, this work is the first quantitative study on the network characteristics and dynamics of emerging social commerce platforms.

CYNov 26, 2017
Smartphone App Usage Prediction Using Points of Interest

Donghan Yu, Yong Li, Fengli Xu et al.

In this paper we present the first population-level, city-scale analysis of application usage on smartphones. Using deep packet inspection at the network operator level, we obtained a geo-tagged dataset with more than 6 million unique devices that launched more than 10,000 unique applications across the city of Shanghai over one week. We develop a technique that leverages transfer learning to predict which applications are most popular and to estimate the whole usage distribution based on the Point of Interest (POI) information of a particular location. We demonstrate that our technique has an 83.0% hit rate in successfully identifying the top five popular applications, and a 0.15 RMSE when estimating usage with just 10% sampled sparse data. It outperforms existing state-of-the-art approaches by about 25.7%. Our findings pave the way for predicting which apps are relevant to a user given their current location, and which applications are popular where. The implications of our findings are broad: they enable a range of systems to benefit from such timely predictions, including operating systems, network operators, app stores, advertisers, and service providers.
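The abstract reports a top-five hit rate and an RMSE without spelling out how either is computed. One plausible reading, offered purely as an assumption, is that hit rate is the overlap between the predicted and true top-k app sets, and that RMSE is taken over the per-app usage shares. The helper names and toy data below are hypothetical.

```python
import math


def top_k(usage: dict, k: int) -> set:
    """Return the k apps with the largest usage share."""
    return set(sorted(usage, key=usage.get, reverse=True)[:k])


def hit_rate(pred: dict, true: dict, k: int = 5) -> float:
    """Fraction of the true top-k apps that also appear in the predicted top-k."""
    return len(top_k(pred, k) & top_k(true, k)) / k


def rmse(pred: dict, true: dict) -> float:
    """Root-mean-square error over the union of apps (missing apps count as 0)."""
    apps = set(pred) | set(true)
    sq = sum((pred.get(a, 0.0) - true.get(a, 0.0)) ** 2 for a in apps)
    return math.sqrt(sq / len(apps))


if __name__ == "__main__":
    # Toy usage distributions at one hypothetical location.
    true = {"chat": 0.4, "maps": 0.3, "video": 0.2, "news": 0.1}
    pred = {"chat": 0.4, "maps": 0.2, "video": 0.3, "news": 0.1}
    print(hit_rate(pred, true, k=2), rmse(pred, true))
```

The paper's exact definitions may differ, for example by averaging over locations or by weighting apps, so this should be read as a sketch of the metric family rather than a reproduction of the evaluation.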

CYFeb 21, 2017
Trajectory Recovery From Ash: User Privacy Is NOT Preserved in Aggregated Mobility Data

Fengli Xu, Zhen Tu, Yong Li et al.

Human mobility data has been ubiquitously collected through cellular networks and mobile applications, and publicly released for academic research and commercial purposes over the last decade. Since releasing individuals' mobility records usually gives rise to privacy issues, dataset owners tend to publish only aggregated mobility data, such as the number of users covered by a cellular tower at a specific timestamp, which is believed to be sufficient for preserving users' privacy. However, in this paper, we argue and prove that even publishing aggregated mobility data could lead to privacy breaches in individuals' trajectories. We develop an attack system that is able to exploit the uniqueness and regularity of human mobility to recover individuals' trajectories from the aggregated mobility data without any prior knowledge. By conducting experiments on two real-world datasets collected from a mobile application and a cellular network, we reveal that the attack system is able to recover users' trajectories with an accuracy of about 73% to 91% at the scale of tens of thousands to hundreds of thousands of users, which indicates severe privacy leakage in such datasets. Through the investigation of aggregated mobility data, our work recognizes a novel privacy problem in publishing statistical data, which calls for immediate attention from both academia and industry.