20.6 CL Jun 3
MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
Qiyang Xie, Jialun Wu, Xinjie He et al.
AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each, sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.
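The Cohen's $\kappa$ statistic quoted above measures agreement between two sets of categorical labels after correcting for chance. The sketch below shows the standard computation. It assumes, purely for illustration, that the two label sequences are judge verdicts from two prompt variants over the same QA pairs; the function name and the input format are hypothetical and are not taken from the released pipeline.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both runs give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each run's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Both runs used one identical label throughout; kappa is formally
        # undefined here, and this sketch treats it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical verdicts (1 = answer judged correct) from two prompt variants.
run_a = [1, 1, 0, 0]
run_b = [1, 0, 0, 0]
kappa = cohens_kappa(run_a, run_b)
```

A per-micro-world median, as reported in the abstract, would then be the median of this value computed separately over each micro-world's questions.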
57.3 IT Jun 2
Generative Spectrum Cartography: Unified Reconstruction and Active Sensing via Diffusion Models
Yuntong Gu, Xiangming Meng, Zhiyuan Lin et al.
High-fidelity spectrum cartography is important for spectrum monitoring and wireless situational awareness, especially in satellite-based wide-area sensing scenarios where measurements are sparse, noisy, and often low-bit quantized. In such settings, two coupled challenges arise: accurate reconstruction from severely incomplete measurements and efficient allocation of additional sensing resources under a limited sensing budget. Existing methods usually address these problems separately, and, for reconstruction, they often rely on priors that are insufficiently expressive under sparse and quantized measurements. This paper proposes Generative Spectrum Cartography (GSC), a diffusion-based posterior inference framework for spectrum cartography with uncertainty-aware active sensing. Specifically, spectrum map recovery is formulated as a Bayesian inverse problem under a learned diffusion model prior, and closed-form posterior mean updates are derived for both linear and quantized measurement models. By embedding these updates into the reverse diffusion process, GSC enables gradient-free and measurement-consistent posterior sampling without relying on computationally costly likelihood-gradient guidance. The resulting posterior samples are further used to estimate spatial uncertainty and to guide diversity-aware selection of additional measurement locations for active sensing. Experiments on simulated electromagnetic maps and a high-fidelity simulated satellite monitoring scenario show that GSC achieves higher PSNR, lower LPIPS, and more efficient sensing than representative baseline methods under sparse, noisy, and low-bit quantized measurements.
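To make the idea of a closed-form posterior mean update concrete, the sketch below works through the textbook Gaussian case for sparse point measurements. It is an illustration under stated assumptions, not the update derived in the paper: each map pixel is assumed to have an independent Gaussian prior with shared variance `prior_var`, and each observed pixel is assumed to be measured with additive Gaussian noise of variance `noise_var`. The function name and arguments are hypothetical.

```python
def posterior_mean_masked(prior_mean, prior_var, obs, noise_var):
    """Posterior mean of a flattened map under sparse noisy point measurements.

    prior_mean: list of per-pixel prior means
    prior_var:  shared prior variance (assumed i.i.d. Gaussian prior)
    obs:        dict mapping pixel index -> noisy measurement at that pixel
    noise_var:  measurement noise variance (assumed Gaussian)
    """
    post = list(prior_mean)
    # Scalar gain from Gaussian conditioning; it tends to 1 as the noise vanishes.
    gain = prior_var / (prior_var + noise_var)
    for i, y in obs.items():
        # Observed pixels move toward the measurement; unobserved pixels keep
        # their prior mean because the assumed prior has no spatial correlation.
        post[i] = prior_mean[i] + gain * (y - prior_mean[i])
    return post


# Three-pixel toy map with a single measurement at pixel 1.
estimate = posterior_mean_masked([0.0, 0.0, 0.0], 1.0, {1: 2.0}, 1.0)
```

No gradient of a likelihood is evaluated here, which is the sense in which such an update is gradient-free. The quantized measurement case described in the abstract needs a different likelihood and is not covered by this sketch.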
44.7 AI May 28
When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs
Shuai Xiao, Su Liu, Weikai Zhou et al.
Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.
21.9 CE May 23
Toward Secure Operation and Management (O&M) of Satellite Constellations: Efficiency, Resilience, and Reliability in a Network Perspective
Linan Huang, Peilong Liu, Xi Chen et al.
Satellite constellations equipped with Inter-Satellite Links and onboard packet switching enable real-time Operation and Management across globally distributed satellites, but also broaden the attack surface and introduce unprecedented cybersecurity threats. Existing efforts mainly focus on cryptography for single-satellite point-to-point links, without considering constellation-level security. To address this gap, this article extends security research in two directions: from individual satellites to constellation-wide architectures, and from isolated cryptography to system-level security incorporating efficiency, resilience, and reliability. These extensions raise three key questions: how to design efficient security mechanisms for dynamic constellation topologies with adaptive onboard routing; how a constellation O&M system can recover resiliently under worst-case failures of onboard security functions; and how to improve the reliability of onboard security functions under stringent resource constraints. To address these challenges, we first construct a constellation-wide hybrid security framework that protects semantically sensitive content fields using End-to-End encryption, while safeguarding routing-related fields through Moving Target Defense. Next, we introduce a ciphered-mode and safe-mode management mechanism with an M-delayed fallback that balances recovery timeliness and exploitability. Finally, we propose security-aware routers that manage plaintext/ciphered modes and coordinate access to a shared pool of onboard cipher modules, enabling redundancy sharing across multiple endpoints and extending secure operation duration in ciphered mode. These solutions comply with existing standards defined by organizations including DVB and the CCSDS, while translating conceptual security principles into practical system-level mechanisms.
23.8 CL May 21
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Xinjie He, Zhiyuan Lin, Su Liu et al.
Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
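The observation about binary rewards at small group sizes follows from how GRPO scores each sampled response against its own group. The sketch below uses the commonly described form of the group-relative advantage, reward minus group mean divided by group standard deviation. The `eps` stabiliser, the function names, and the example rewards are illustrative assumptions, not details of the authors' implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def prob_no_signal(p_success, group_size):
    """Chance that a group of binary rewards is all 0s or all 1s.

    Assumes each sample succeeds independently with probability p_success.
    """
    return p_success ** group_size + (1.0 - p_success) ** group_size


# Binary exact match with G = 4 and no correct sample: every advantage is zero,
# so the group contributes no policy-gradient signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A continuous reward (hypothetical token-overlap scores) still ranks the samples.
graded = group_relative_advantages([0.1, 0.2, 0.0, 0.4])
```

Under the independence assumption, `prob_no_signal` shows why small groups are fragile: when exact-match successes are rare, most groups of four contain no success at all, and every such group has zero advantage for all of its members.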
SY Aug 23, 2023
A Mobile Data-Driven Hierarchical Deep Reinforcement Learning Approach for Real-time Demand-Responsive Railway Rescheduling and Station Overcrowding Mitigation
Enze Liu, Zhiyuan Lin, Judith Y. T. Wang et al.
Real-time railway rescheduling is an important technique for enabling timely and flexible operational recovery in response to unexpected and dynamic conditions. Current research relies mostly on origin-destination (OD) based data and model-based methods for estimating train passenger demand. These approaches primarily focus on averaged disruption patterns, often overlooking the immediate uneven distribution of demand over time. In reality, passenger demand deviates significantly from predictions, especially during a disaster. Disasters such as the flood in Zhengzhou, China in 2022 have had an unprecedented effect not only on Zhengzhou railway station itself, which is a major railway hub in China, but also on other major hubs connected to Zhengzhou, e.g., Xi'an, the closest hub west of Zhengzhou. In this study, we define a real-time demand-responsive (RTDR) railway rescheduling problem focusing on two specific aspects: the volatility of demand and the management of station crowdedness. For the first time, we propose a data-driven approach using real-time mobile data (MD) to deal with this RTDR problem. A hierarchical deep reinforcement learning (HDRL) framework is designed to perform real-time rescheduling in a demand-responsive manner. The use of MD enables the modelling of passenger dynamics in response to train delays and station crowdedness, as well as real-time optimisation of train service rescheduling in view of the change in demand that results from passengers' behavioural response to disruption. Results show that the agent can steadily satisfy over 62% of the demand with only 61% of the original rolling stock, ensuring continuous operations without overcrowding. Moreover, the agent exhibits adaptability when transferred to a new environment with increased demand, highlighting its effectiveness in addressing unforeseen disruptions in real-time settings.
LG Sep 23, 2025
GSTM-HMU: Generative Spatio-Temporal Modeling for Human Mobility Understanding
Wenying Luo, Zhiyuan Lin, Wenhao Xu et al.
Human mobility traces, often recorded as sequences of check-ins, provide a unique window into both short-term visiting patterns and persistent lifestyle regularities. In this work we introduce GSTM-HMU, a generative spatio-temporal framework designed to advance mobility analysis by explicitly modeling the semantic and temporal complexity of human movement. The framework consists of four key innovations. First, a Spatio-Temporal Concept Encoder (STCE) integrates geographic location, POI category semantics, and periodic temporal rhythms into unified vector representations. Second, a Cognitive Trajectory Memory (CTM) adaptively filters historical visits, emphasizing recent and behaviorally salient events in order to capture user intent more effectively. Third, a Lifestyle Concept Bank (LCB) contributes structured human preference cues, such as activity types and lifestyle patterns, to enhance interpretability and personalization. Finally, task-oriented generative heads transform the learned representations into predictions for multiple downstream tasks. We conduct extensive experiments on four widely used real-world datasets, including Gowalla, WeePlace, Brightkite, and FourSquare, and evaluate performance on three benchmark tasks: next-location prediction, trajectory-user identification, and time estimation. The results demonstrate consistent and substantial improvements over strong baselines, confirming the effectiveness of GSTM-HMU in extracting semantic regularities from complex mobility data. Beyond raw performance gains, our findings also suggest that generative modeling provides a promising foundation for building more robust, interpretable, and generalizable systems for human mobility intelligence.
IR May 26, 2015
Seeing the Forest through the Trees: Adaptive Local Exploration of Large Graphs
Robert Pienta, Zhiyuan Lin, Minsuk Kahng et al.
Visualization is a powerful paradigm for exploratory data analysis. Visualizing large graphs, however, often results in a meaningless hairball. In this paper, we propose a different approach that helps the user adaptively explore large million-node graphs from a local perspective. For nodes that the user investigates, we propose to only show the neighbors with the most subjectively interesting neighborhoods. We contribute novel ideas to measure this interestingness in terms of how surprising a neighborhood is given the background distribution, as well as how well it fits the nodes the user chose to explore. We introduce FACETS, a fast and scalable method for visually exploring large graphs. By implementing our above ideas, it allows users to look into the forest through its trees. Empirical evaluation shows that our method works very well in practice, providing rankings of nodes that match interests of users. Moreover, as it scales linearly, FACETS is suited for the exploration of very large graphs.