Zhihan Jiang

HC
h-index20
14papers
106citations
Novelty50%
AI Score54

14 Papers

AIJun 3
Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang et al.

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

SESep 20, 2024Code
Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis

Junjie Huang, Zhihan Jiang, Jinyang Liu et al.

Logs are imperative in the maintenance of online service systems, which often encompass important information for effective failure mitigation. While existing anomaly detection methodologies facilitate the identification of anomalous logs within extensive runtime data, manual investigation of log messages by engineers remains essential to comprehend faults, which is labor-intensive and error-prone. Upon examining the log-based troubleshooting practices at CloudA, we find that engineers typically prioritize two categories of log information for diagnosis. These include fault-indicating descriptions, which record abnormal system events, and fault-indicating parameters, which specify the associated entities. Motivated by this finding, we propose an approach to automatically extract such faultindicating information from logs for fault diagnosis, named LoFI. LoFI comprises two key stages. In the first stage, LoFI performs coarse-grained filtering to collect logs related to the faults based on semantic similarity. In the second stage, LoFI leverages a pre-trained language model with a novel prompt-based tuning method to extract fine-grained information of interest from the collected logs. We evaluate LoFI on logs collected from Apache Spark and an industrial dataset from CloudA. The experimental results demonstrate that LoFI outperforms all baseline methods by a significant margin, achieving an absolute improvement of 25.8~37.9 in F1 over the best baseline method, ChatGPT. This highlights the effectiveness of LoFI in recognizing fault-indicating information. Furthermore, the successful deployment of LoFI at CloudA and user studies validate the utility of our method. The code and data are available at https://github.com/Jun-jie-Huang/LoFI.

SEJun 8, 2023
Log-based Anomaly Detection based on EVT Theory with feedback

Jinyang Liu, Junjie Huang, Yintong Huo et al.

System logs play a critical role in maintaining the reliability of software systems. Fruitful studies have explored automatic log-based anomaly detection and achieved notable accuracy on benchmark datasets. However, when applied to large-scale cloud systems, these solutions face limitations due to high resource consumption and lack of adaptability to evolving logs. In this paper, we present an accurate, lightweight, and adaptive log-based anomaly detection framework, referred to as SeaLog. Our method introduces a Trie-based Detection Agent (TDA) that employs a lightweight, dynamically-growing trie structure for real-time anomaly detection. To enhance TDA's accuracy in response to evolving log data, we enable it to receive feedback from experts. Interestingly, our findings suggest that contemporary large language models, such as ChatGPT, can provide feedback with a level of consistency comparable to human experts, which can potentially reduce manual verification efforts. We extensively evaluate SeaLog on two public datasets and an industrial dataset. The results show that SeaLog outperforms all baseline methods in terms of effectiveness, runs 2X to 10X faster and only consumes 5% to 41% of the memory resource.

LGAug 22, 2023
Internal Cross-layer Gradients for Extending Homogeneity to Heterogeneity in Federated Learning

Yun-Hin Chan, Rui Zhou, Running Zhao et al.

Federated learning (FL) inevitably confronts the challenge of system heterogeneity in practical scenarios. To enhance the capabilities of most model-homogeneous FL methods in handling system heterogeneity, we propose a training scheme that can extend their capabilities to cope with this challenge. In this paper, we commence our study with a detailed exploration of homogeneous and heterogeneous FL settings and discover three key observations: (1) a positive correlation between client performance and layer similarities, (2) higher similarities in the shallow layers in contrast to the deep layers, and (3) the smoother gradients distributions indicate the higher layer similarities. Building upon these observations, we propose InCo Aggregation that leverages internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to augment the similarity in the deep layers without requiring additional communication between clients. Furthermore, our methods can be tailored to accommodate model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, to expand their capabilities to handle the system heterogeneity. Copious experimental results validate the effectiveness of InCo Aggregation, spotlighting internal cross-layer gradients as a promising avenue to enhance the performance in heterogeneous FL.

LGApr 3, 2023
FedIN: Federated Intermediate Layers Learning for Model Heterogeneity

Yun-Hin Chan, Zhihan Jiang, Jing Deng et al.

Federated learning (FL) facilitates edge devices to cooperatively train a global shared model while maintaining the training data locally and privately. However, a common assumption in FL requires the participating edge devices to have similar computation resources and train on an identical global model architecture. In this study, we propose an FL method called Federated Intermediate Layers Learning (FedIN), supporting heterogeneous models without relying on any public dataset. Instead, FedIN leverages the inherent knowledge embedded in client model features to facilitate knowledge exchange. The training models in FedIN are partitioned into three distinct components: an extractor, intermediate layers, and a classifier. We capture client features by extracting the outputs of the extractor and the inputs of the classifier. To harness the knowledge from client features, we propose IN training for aligning the intermediate layers based on features obtained from other clients. IN training only needs minimal memory and communication overhead by utilizing a single batch of client features. Additionally, we formulate and address a convex optimization problem to mitigate the challenge of gradient divergence caused by conflicts between IN training and local training. The experiment results demonstrate the superior performance of FedIN in heterogeneous model environments compared to state-of-the-art algorithms. Furthermore, our ablation study demonstrates the effectiveness of IN training and the proposed solution for alleviating gradient divergence.

CRMar 30
LiFeChain: Lightweight Blockchain for Secure and Efficient Federated Lifelong Learning in IoT

Handi Chen, Jing Deng, Xiuzhe Wu et al.

Internet of Things (IoT) devices constantly generate heterogeneous data streams, driving demand for continuous, decentralized intelligence. Federated Lifelong Learning (FLL) provides an ideal solution by incorporating federated learning and lifelong learning. However, the extended lifecycle of FLL in IoT systems increases their vulnerability to persistent attacks. This problem is exacerbated by the single point of failure. Furthermore, the single point of trust created by the central server hinders reliable auditing for long-term threats. Blockchain technology provides a tamper-proof foundation for trustworthy FLL. Nevertheless, directly applying blockchain to FLL significantly increases computational and retrieval costs with the expansion of the knowledge base, slowing down the training on resource-constrained IoT devices. To address these challenges, we propose LiFeChain, a lightweight blockchain for secure and efficient federated lifelong learning with minimal on-chain disclosure and bidirectional verification. LiFeChain is the first blockchain tailored for FLL. It incorporates two complementary mechanisms: the Proof-of-Model-Correlation (PoMC) consensus on the server, which couples learning and unlearning mechanisms to mitigate negative transfer; and Segmented Zero-knowledge Arbitration (Seg-ZA) at the client, which detects and arbitrates abnormal committee behavior without compromising privacy. LiFeChain is a plug-and-play component that can be seamlessly integrated into existing FLL algorithms for IoT applications. To demonstrate its practicality and performance, we implement LiFeChain in representative FLL algorithms with Hyperledger Fabric under 6 attacks. Theoretical analysis and extensive evaluations demonstrate that LiFeChain effectively mitigates long-term attacks, and significantly reduces latency and storage overhead compared to state-of-the-art blockchain solutions.

CRDec 20, 2024Code
Continual Learning with Strategic Selection and Forgetting for Network Intrusion Detection

Xinchen Zhang, Running Zhao, Zhihan Jiang et al.

Intrusion Detection Systems (IDS) are crucial for safeguarding digital infrastructure. In dynamic network environments, both threat landscapes and normal operational behaviors are constantly changing, resulting in concept drift. While continuous learning mitigates the adverse effects of concept drift, insufficient attention to drift patterns and excessive preservation of outdated knowledge can still hinder the IDS's adaptability. In this paper, we propose SSF (Strategic Selection and Forgetting), a novel continual learning method for IDS, providing continuous model updates with a constantly refreshed memory buffer. Our approach features a strategic sample selection algorithm to select representative new samples and a strategic forgetting mechanism to drop outdated samples. The proposed strategic sample selection algorithm prioritizes new samples that cause the `drifted' pattern, enabling the model to better understand the evolving landscape. Additionally, we introduce strategic forgetting upon detecting significant drift by discarding outdated samples to free up memory, allowing the incorporation of more recent data. SSF captures evolving patterns effectively and ensures the model is aligned with the change of data patterns, significantly enhancing the IDS's adaptability to concept drift. The state-of-the-art performance of SSF on NSL-KDD and UNSW-NB15 datasets demonstrates its superior adaptability to concept drift for network intrusion detection. The code is released at https://github.com/xinchen930/SSF-Strategic-Selection-and-Forgetting.

HCFeb 5
Hear You in Silence: Designing for Active Listening in Human Interaction with Conversational Agents Using Context-Aware Pacing

Zhihan Jiang, Qianhui Chen, Chu Zhang et al.

In human conversation, empathic dialogue requires nuanced temporal cues indicating whether the conversational partner is paying attention. This type of "active listening" is overlooked in the design of Conversational Agents (CAs), which use the same pacing for one conversation. To model the temporal cues in human conversation, we need CAs that dynamically adjust response pacing according to user input. We qualitatively analyzed ten cases of active listening to distill five context-aware pacing strategies: Reflective Silence, Facilitative Silence, Empathic Silence, Holding Space, and Immediate Response. In a between-subjects study (N=50) with two conversational scenarios (relationship and career-support), the context-aware agent scored higher than static-pacing control on perceived human-likeness, smoothness, and interactivity, supporting deeper self-disclosure and higher engagement. In the career support scenario, the CA yielded higher perceived listening quality and affective trust. This work shows how insights from human conversation like context-aware pacing can empower the design of more empathic human-AI communication.

AIMay 7
AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

Qinshi Zhang, Weipeng Deng, Zhihan Jiang et al.

In model-based learning, the agent learns behaviors by simulating trajectories based on world model predictions. Standard world models typically learn a stationary transition function that maps states and actions to next states, when an action and an outcome frequently co-occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions. In interactive environments, however, agent actions can reshape the future affordance space. At each timestep, an action may becomes executable only after its prerequisites are met, or non-executable when they are destroyed. We term such events structure-changing events (SC events). As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi-step predictions. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon. In this paper, we propose AGWM (Affordance-Grounded World Model), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions. Experiments on game-based simulated environments demonstrate the effectiveness of our method by achieving lower multi-step prediction error, better generalization to novel configurations, and improved interpretability.

HCMay 5
Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework

Zhihan Jiang, Mengyuan Millie Wu, Ruishi Zou et al.

Individuals frequently form deep attachments to physical objects (e.g., plush toys) that usually cannot sense or respond to their emotions. While AI companions offer responsiveness and personalization, they exist independently of these physical objects and lack an ongoing connection to them. To bridge this gap, we conducted a formative study (N=9) to explore how digital agents could inherit and extend the emotional bond, deriving four design principles (Faithful Identity, Calibrated Agency, Ambient Presence, and Reciprocal Memory). We then present the Dual-Embodiment Companion Framework, instantiated as Deco, a mobile system integrating multimodal Large Language Models (LLMs) and Augmented Reality to create synchronized digital embodiments of users' physical companions. A within-subjects study (N=25) showed Deco significantly outperformed a personalized LLM-empowered digital companion baseline on perceived companionship, emotional bond, and design-principle scales (all p<0.01). A seven-day field deployment (N=17) showed sustained engagement, subjective well-being improvement (p=.040), and three key relational patterns: digital activities retroactively vitalized physical objects, bond deepening was driven by emotional engagement depth rather than interaction frequency, and users sustained bonds while actively navigating digital companions' AI nature. This work highlights a promising alternative for designing digital companions: moving from creating new relationships to dual embodiment, where digital agents seamlessly extend the emotional history of physical objects.

SEFeb 27, 2024
FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems

Junjie Huang, Jinyang Liu, Zhuangbin Chen et al.

Postmortem analysis is essential in the management of incidents within cloud systems, which provides valuable insights to improve system's reliability and robustness. At CloudA, fault pattern profiling is performed during the postmortem phase, which involves the classification of incidents' faults into unique categories, referred to as fault pattern. By aggregating and analyzing these fault patterns, engineers can discern common faults, vulnerable components and emerging fault trends. However, this process is currently conducted by manual labeling, which has inherent drawbacks. On the one hand, the sheer volume of incidents means only the most severe ones are analyzed, causing a skewed overview of fault patterns. On the other hand, the complexity of the task demands extensive domain knowledge, which leads to errors and inconsistencies. To address these limitations, we propose an automated approach, named FaultProfIT, for Fault pattern Profiling of Incident Tickets. It leverages hierarchy-guided contrastive learning to train a hierarchy-aware incident encoder and predicts fault patterns with enhanced incident representations. We evaluate FaultProfIT using the production incidents from CloudA. The results demonstrate that FaultProfIT outperforms state-of-the-art methods. Our ablation study and analysis also verify the effectiveness of hierarchy-guided contrastive learning. Additionally, we have deployed FaultProfIT at CloudA for six months. To date, FaultProfIT has analyzed 10,000+ incidents from 30+ cloud services, successfully revealing several fault trends that have informed system improvements.

DCOct 16, 2024
Towards Edge General Intelligence via Large Language Models: Opportunities and Challenges

Handi Chen, Weipeng Deng, Shuo Yang et al.

Edge Intelligence (EI) has been instrumental in delivering real-time, localized services by leveraging the computational capabilities of edge networks. The integration of Large Language Models (LLMs) empowers EI to evolve into the next stage: Edge General Intelligence (EGI), enabling more adaptive and versatile applications that require advanced understanding and reasoning capabilities. However, systematic exploration in this area remains insufficient. This survey delineates the distinctions between EGI and traditional EI, categorizing LLM-empowered EGI into three conceptual systems: centralized, hybrid, and decentralized. For each system, we detail the framework designs and review existing implementations. Furthermore, we evaluate the performance and throughput of various Small Language Models (SLMs) that are more suitable for development on edge devices. This survey provides researchers with a comprehensive vision of EGI, offering insights into its vast potential and establishing a foundation for future advancements in this rapidly evolving field.

HCMar 6
MindfulAgents: Personalizing Mindfulness Meditation via an Expert-Aligned Multi-Agent System

Mengyuan Millie Wu, Zhihan Jiang, Yuang Fan et al.

Mindfulness meditation is a widely accessible and evidence-based method for supporting mental health. Despite the proliferation of mindfulness meditation apps, sustaining user engagement remains a persistent challenge. Personalizing the meditation experience is a promising strategy to improve engagement, but it often requires costly and unscalable manual effort. We present MindfulAgents, a multi-agent system powered by large language models that (1) generates guided meditation scripts based on an expert-established mindfulness framework, (2) encourages users' reflection on emotional states and mindfulness skills, and (3) enables real-time personalization of the mindfulness meditation experience for each user. In a formative lab study (N=13), MindfulAgents significantly improved in-session engagement (p = 0.011) and self-awareness (p = 0.014), and reduced momentary stress (p = 0.020). Furthermore, a four-week deployment study (N=62) demonstrated a notable increase in long-term engagement (p = 0.002) and level of mindfulness (p = 0.023). Participants reported that MindfulAgents offered more relevant meditation sessions personalized to individual needs in various contexts, supporting sustained practice. Our findings highlight the potential of LLM-driven personalization for enhancing user engagement in digital mindfulness meditation interventions.

HCAug 20, 2025
NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding

Running Zhao, Zhihan Jiang, Xinchen Zhang et al.

Users often take notes for instructional videos to access key knowledge later without revisiting long videos. Automated note generation tools enable users to obtain informative notes efficiently. However, notes generated by existing research or off-the-shelf tools fail to preserve the information conveyed in the original videos comprehensively, nor can they satisfy users' expectations for diverse presentation formats and interactive features when using notes digitally. In this work, we present NoteIt, a system, which automatically converts instructional videos to interactable notes using a novel pipeline that faithfully extracts hierarchical structure and multimodal key information from videos. With NoteIt's interface, users can interact with the system to further customize the content and presentation formats of the notes according to their preferences. We conducted both a technical evaluation and a comparison user study (N=36). The solid performance in objective metrics and the positive user feedback demonstrated the effectiveness of the pipeline and the overall usability of NoteIt. Project website: https://zhaorunning.github.io/NoteIt/