LGApr 17, 2025Code
GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic TasksHao Xu, Xiangru Jian, Xinjian Zhao et al.
This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni encompasses diverse graph types, serialization formats, and prompting schemes, significantly exceeding prior efforts in both scope and depth. Through extensive systematic evaluation, we identify critical interactions among these dimensions, demonstrating their substantial impact on model performance. Our experiments reveal that state-of-the-art models like Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models exhibit substantial room for improvement. Performance variability is evident depending on the specific combinations of factors we considered, underscoring the necessity of comprehensive evaluations across these interconnected dimensions. Additionally, we observe distinct impacts of serialization and prompting strategies between open-source and closed-source models, encouraging the development of tailored approaches. Motivated by the findings, we also propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities. This flexible and extendable benchmark not only deepens our understanding of LLM performance on structured tasks but also provides a robust foundation for advancing research in LLM-based graph reasoning. The code and datasets are available at https://github.com/GAI-Community/GraphOmni.
CLMay 12
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon AgentsJingyi Peng, Zhongwei Wan, Weiting Liu et al.
Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
LGNov 2, 2025
SARIMAX-Based Power Outage Prediction During Extreme Weather EventsHaoran Ye, Qiuzhuang Sun, Yang Yang
This study develops a SARIMAX-based prediction system for short-term power outage forecasting during extreme weather events. Using hourly data from Michigan counties with outage counts and comprehensive weather features, we implement a systematic two-stage feature engineering pipeline: data cleaning to remove zero-variance and unknown features, followed by correlation-based filtering to eliminate highly correlated predictors. The selected features are augmented with temporal embeddings, multi-scale lag features, and weather variables with their corresponding lags as exogenous inputs to the SARIMAX model. To address data irregularity and numerical instability, we apply standardization and implement a hierarchical fitting strategy with sequential optimization methods, automatic downgrading to ARIMA when convergence fails, and historical mean-based fallback predictions as a final safeguard. The model is optimized separately for short-term (24 hours) and medium-term (48 hours) forecast horizons using RMSE as the evaluation metric. Our approach achieves an RMSE of 177.2, representing an 8.4\% improvement over the baseline method (RMSE = 193.4), thereby validating the effectiveness of our feature engineering and robust optimization strategy for extreme weather-related outage prediction.
CVJan 5, 2025
Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical Cross-Attention Using Graph-Based and Vision Backbone-Extracted FeaturesHaixu Liu, Penghao Jiang, Zerui Tao et al.
Predicting plant species composition in specific spatiotemporal contexts plays an important role in biodiversity management and conservation, as well as in improving species identification tools. Our work utilizes 88,987 plant survey records conducted in specific spatiotemporal contexts across Europe. We also use the corresponding satellite images, time series data, climate time series, and other rasterized environmental data such as land cover, human footprint, bioclimatic, and soil variables as training data to train the model to predict the outcomes of 4,716 plant surveys. We propose a feature construction and result correction method based on the graph structure. Through comparative experiments, we select the best-performing backbone networks for feature extraction in both temporal and image modalities. In this process, we built a backbone network based on the Swin-Transformer Block for extracting temporal Cubes features. We then design a hierarchical cross-attention mechanism capable of robustly fusing features from multiple modalities. During training, we adopt a 10-fold cross-fusion method based on fine-tuning and use a Threshold Top-K method for post-processing. Ablation experiments demonstrate the improvements in model performance brought by our proposed solution pipeline.
CVApr 8
FORGE:Fine-grained Multimodal Evaluation for Manufacturing ScenariosXiangru Jian, Hao Xu, Wei Pang et al.
The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. Wefirst construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
CVJan 2, 2025
Multi-Modal Video Feature Extraction for Popularity PredictionHaixu Liu, Wenning Wang, Haoxiang Zheng et al.
This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. This study employs video classification models with different architectures and training methods as backbone networks to extract video modality features. Meanwhile, the cleaned video captions are incorporated into a carefully designed prompt framework, along with the video, as input for video-to-text generation models, which generate detailed text-based video content understanding. These texts are then encoded into vectors using a pre-trained BERT model. Based on the six sets of vectors mentioned above, a neural network is trained for each of the four prediction metrics. Moreover, the study conducts data mining and feature engineering based on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained, and the most stable model, XGBoost, is selected. Finally, the predictions from the neural network and XGBoost models are averaged to obtain the final result.
CVJan 2, 2025
nnY-Net: Swin-NeXt with Cross-Attention for 3D Medical Images SegmentationHaixu Liu, Zerui Tao, Wenzhen Dong et al.
This paper provides a novel 3D medical image segmentation model structure called nnY-Net. This name comes from the fact that our model adds a cross-attention module at the bottom of the U-net structure to form a Y structure. We integrate the advantages of the two latest SOTA models, MedNeXt and SwinUNETR, and use Swin Transformer as the encoder and ConvNeXt as the decoder to innovatively design the Swin-NeXt structure. Our model uses the lowest-level feature map of the encoder as Key and Value and uses patient features such as pathology and treatment information as Query to calculate the attention weights in a Cross Attention module. Moreover, we simplify some pre- and post-processing as well as data enhancement methods in 3D image segmentation based on the dynUnet and nnU-net frameworks. We integrate our proposed Swin-NeXt with Cross-Attention framework into this framework. Last, we construct a DiceFocalCELoss to improve the training efficiency for the uneven data convergence of voxel classification.
LGJun 26, 2024
Unveiling and Controlling Anomalous Attention Distribution in TransformersRuiqing Yan, Xingbo Du, Haoyu Deng et al.
With the advent of large models based on the Transformer architecture, researchers have observed an anomalous phenomenon in the Attention mechanism--there is a very high attention on the first element, which is prevalent across Transformer-based models. It is crucial to understand it for the development of techniques focusing on attention distribution, such as Key-Value (KV) Cache compression and infinite extrapolation; however, the latent cause leaves to be unknown. In this paper, we analyze such a phenomenon from the perspective of waiver phenomenon, which involves reducing the internal values of certain elements in the sequence, allowing them to absorb excess attention without affecting their contribution to information. In specific models, due to differences in positional encoding and attention patterns, we have found that the selection of waiver elements by the model can be categorized into two methods: positional-encoding-based and feature-distribution-within-elements-based.
SYApr 20, 2024
Online Planning of Power Flows for Power Systems Against Bushfires Using Spatial ContextJianyu Xu, Qiuzhuang Sun, Yang Yang et al.
The 2019-20 Australia bushfire incurred numerous economic losses and significantly affected the operations of power systems. A power station or transmission line can be significantly affected due to bushfires, leading to an increase in operational costs. We study a fundamental but challenging problem of planning the optimal power flow (OPF) for power systems subject to bushfires. Considering the stochastic nature of bushfire spread, we develop a model to capture such dynamics based on Moore's neighborhood model. Under a periodic inspection scheme that reveals the in-situ bushfire status, we propose an online optimization modeling framework that sequentially plans the power flows in the electricity network. Our framework assumes that the spread of bushfires is non-stationary over time, and the spread and containment probabilities are unknown. To meet these challenges, we develop a contextual online learning algorithm that treats the in-situ geographical information of the bushfire as a 'spatial context'. The online learning algorithm learns the unknown probabilities sequentially based on the observed data and then makes the OPF decision accordingly. The sequential OPF decisions aim to minimize the regret function, which is defined as the cumulative loss against the clairvoyant strategy that knows the true model parameters. We provide a theoretical guarantee of our algorithm by deriving a bound on the regret function, which outperforms the regret bound achieved by other benchmark algorithms. Our model assumptions are verified by the real bushfire data from NSW, Australia, and we apply our model to two power systems to illustrate its applicability.