22.1CLMay 24
Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand PredictionHangxuan Li, Renjun Jia, Xuezhang Wu et al.
Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.
LGDec 22, 2025Code
OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM AdaptationXueming Yan, Boyan Xu, Yaochu Jin et al.
Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. https://github.com/yanxm01/INDOMER
NEMar 20, 2025Code
SpiLiFormer: Enhancing Spiking Transformers with Lateral InhibitionZeqi Zheng, Yanchen Huang, Yingchao Yu et al.
Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46% using only 39% of the parameters and half the time steps. The code and model checkpoints are publicly available at https://github.com/KirinZheng/SpiLiFormer.
NEAug 1, 2025Code
STF: Shallow-Level Temporal Feedback to Enhance Spiking TransformersZeqi Zheng, Zizheng Zhu, Yingchao Yu et al.
Transformer-based Spiking Neural Networks (SNNs) suffer from a great performance gap compared to floating-point \mbox{Artificial} Neural Networks (ANNs) due to the binary nature of spike trains. Recent efforts have introduced deep-level feedback loops to transmit high-level semantic information to narrow this gap. However, these designs often span \mbox{multiple} deep layers, resulting in costly feature transformations, higher parameter overhead, increased energy consumption, and longer inference latency. To address this issue, we propose Shallow-level Temporal Feedback (STF), a lightweight plug-and-play module for the encoding layer, which consists of Temporal-Spatial Position Embedding (TSPE) and Temporal Feedback (TF). Extensive experiments show that STF consistently improves performance across various Transformer-based SNN backbones on static datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, under different spike timestep settings. Further analysis reveals that STF enhances the diversity of spike patterns, which is key to performance gain. Moreover, evaluations on adversarial robustness and temporal sensitivity confirm that STF outperforms direct coding and its variants, highlighting its potential as a new spike encoding scheme for static scenarios. Our code will be released upon acceptance.
ROJan 25, 2025
Think Small, Plan Smart: Minimalist Symbolic Abstraction and Heuristic Subspace Search for LLM-Guided Task PlanningJunfeng Tang, Yuping Yan, Zihan Ye et al.
Reliable task planning is pivotal for achieving long-horizon autonomy in real-world robotic systems. Large language models (LLMs) offer a promising interface for translating complex and ambiguous natural language instructions into actionable plans. However, their probabilistic and opaque nature often leads to logically inconsistent or infeasible outputs. To address these limitations, recent frameworks combine LLMs with symbolic planners by first generating action models (Planning Domain Definition Language) and then applying heuristic search. Although promising, such systems still suffer from representation redundancy and exponential search complexity, often resulting in inefficient or overly long plans. To improve planning efficiency and effectiveness, we propose PLAHX (Planning from Language using Abstraction and Heuristic eXploration), a two-stage LLM-symbolic planning framework that integrates abstract symbolic representations with meta-heuristic subspace search in a parallel and iterative fashion. Rather than relying on verbose LLM-generated domain models, we introduce a minimalist symbolic abstraction pipeline that preserves semantic fidelity while eliminating redundancy. Our approach redefines LLM-symbolic planning not by making LLMs smarter, but by reducing the symbolic search space adaptively. Empirical results across four challenging domains, including block stacking and robotic mobile grasping, show that our approach improves the success rate by 21.47% on average, while reducing token consumption by 13% compared to state-of-the-art baselines.
CVSep 29, 2025
Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMsYuanshuai Li, Yuping Yan, Junfeng Tang et al.
Multimodal Large Language Models (MLLMs) have significantly improved the performance of various tasks, but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization(DPO) is widely used for alignment, its application to MLLMs often fails to capture fine-grained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLMs alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions validate that SCPO demonstrates superior performance compared to baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on generalized benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.
NCJun 10, 2025
Sparse Autoencoders Bridge The Deep Learning Model and The BrainZiming Mao, Jia Xu, Zeqi Zheng et al.
We present SAE-BrainMap, a novel framework that directly aligns deep learning visual model representations with voxel-level fMRI responses using sparse autoencoders (SAEs). First, we train layer-wise SAEs on model activations and compute the correlations between SAE unit activations and cortical fMRI signals elicited by the same natural image stimuli with cosine similarity, revealing strong activation correspondence (maximum similarity up to 0.76). Depending on this alignment, we construct a voxel dictionary by optimally assigning the most similar SAE feature to each voxel, demonstrating that SAE units preserve the functional structure of predefined regions of interest (ROIs) and exhibit ROI-consistent selectivity. Finally, we establish fine-grained hierarchical mapping between model layers and the human ventral visual pathway, also by projecting voxel dictionary activations onto individual cortical surfaces, we visualize the dynamic transformation of the visual information in deep learning models. It is found that ViT-B/16$_{CLIP}$ tends to utilize low-level information to generate high-level semantic information in the early layers and reconstructs the low-dimension information later. Our results establish a direct, downstream-task-free bridge between deep neural networks and human visual cortex, offering new insights into model interpretability.
NEMay 17, 2025
TDFormer: A Top-Down Attention-Controlled Spiking TransformerZizheng Zhu, Yingchao Yu, Zeqi Zheng et al.
Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks with each running for one time step, where the parameters are shared, and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to effectively represent temporal information. As a result, each time step cannot fully leverage information from previous time steps, seriously limiting the model's performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure plays a role from two perspectives: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated in different time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.
NEJan 24, 2025
IP$^{2}$-RSNN: Bi-level Intrinsic Plasticity Enables Learning-to-learn in Recurrent Spiking Neural NetworksYingchao Yu, Yaochu Jin, Kuangrong Hao et al.
Learning-to-learn (L2L), defined as progressively faster learning across similar tasks, is fundamental to both neuroscience and artificial intelligence. However, its neural basis remains elusive, as most studies emphasize neural population dynamics induced by synaptic plasticity while overlooking adaptations driven by intrinsic neuronal plasticity, which point-neuron models cannot capture. To address the above issue, we develop a recurrent spiking neural network with bi-level intrinsic plasticity (IP$^{2}$-RSNN). First, based on task demands, a slow meta-intrinsic plasticity determines which intrinsic neuronal properties are learnable, which is preserved throughout subsequent task learning once configured. Second, a fast intrinsic plasticity fine-tunes those learnable properties within each task. Our results indicate that the proposed bi-level intrinsic plasticity plays a critical role in enabling L2L in RSNNs and show that IP$^{2}$-RSNNs outperform point-neuron recurrent neural networks and self-attention models. Furthermore, our analysis of multi-scale neural dynamics reveals that the bi-level intrinsic plasticity is essential to task-type-specific adaptations at both the neuronal and network levels during L2L, while such adaptations cannot be captured by point-neuron models. Our results suggest that intrinsic plasticity provides significant computational advantages in L2L, shedding light on the design of brain-inspired deep learning models and algorithms.