Shan Yu

CV
h-index15
34papers
646citations
Novelty54%
AI Score58

34 Papers

91.1DCJun 2
UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

Xinming Wei, Chao Jin, Tuo Dai et al.

Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering 1.49$\times$ improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30$-$4.01 to 1.01$-$1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.

LGApr 3, 2022Code
BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster

Jason Dai, Ding Ding, Dongjie Shi et al.

Most AI projects start with a Python notebook running on a single laptop; however, one usually needs to go through a mountain of pains to scale it to handle larger dataset (for both experimentation and production deployment). These usually entail many manual and error-prone steps for the data scientists to fully take advantage of the available hardware resources (e.g., SIMD instructions, multi-processing, quantization, memory allocation optimization, data partitioning, distributed computing, etc.). To address this challenge, we have open sourced BigDL 2.0 at https://github.com/intel-analytics/BigDL/ under Apache 2.0 license (combining the original BigDL and Analytics Zoo projects); using BigDL 2.0, users can simply build conventional Python notebooks on their laptops (with possible AutoML support), which can then be transparently accelerated on a single node (with up-to 9.6x speedup in our experiments), and seamlessly scaled out to a large cluster (across several hundreds servers in real-world use cases). BigDL 2.0 has already been adopted by many real-world users (such as Mastercard, Burger King, Inspur, etc.) in production.

CLMar 9, 2023Code
TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions

He Zhu, Xihua Li, Xuemin Zhao et al.

Recently, more and more people study online for the convenience of access to massive learning materials (e.g. test questions/notes), thus accurately understanding learning materials became a crucial issue, which is essential for many educational applications. Previous studies focus on using language models to represent the question data. However, test questions (TQ) are usually heterogeneous and multi-modal, e.g., some of them may only contain text, while others half contain images with information beyond their literal description. In this context, both supervised and unsupervised methods are difficult to learn a fused representation of questions. Meanwhile, this problem cannot be solved by conventional methods such as image caption, as the images may contain information complementary rather than duplicate to the text. In this paper, we first improve previous text-only representation with a two-stage unsupervised instance level contrastive based pre-training method (MCL: Mixture Unsupervised Contrastive Learning). Then, TQ-Net was proposed to fuse the content of images to the representation of heterogeneous data. Finally, supervised contrastive learning was conducted on relevance prediction-related downstream tasks, which helped the model to learn the representation of questions effectively. We conducted extensive experiments on question-based tasks on large-scale, real-world datasets, which demonstrated the effectiveness of TQ-Net and improve the precision of downstream applications (e.g. similar questions +2.02% and knowledge point prediction +7.20%). Our code will be available, and we will open-source a subset of our data to promote the development of relative studies.

CVNov 3, 2023Code
VQPy: An Object-Oriented Approach to Modern Video Analytics

Shan Yu, Zhenting Zhu, Yu Chen et al.

Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend$\unicode{x2015}$a Python variant with constructs that make it easy for users to express video objects and their interactions$\unicode{x2015}$as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.

LGJun 1, 2023Code
Out-of-distribution forgetting: vulnerability of continual learning to intra-class distribution shift

Liangxuan Guo, Yang Chen, Shan Yu

Continual learning (CL) is an important technique to allow artificial neural networks to work in open environments. CL enables a system to learn new tasks without severe interference to its performance on old tasks, i.e., overcome the problems of catastrophic forgetting. In joint learning, it is well known that the out-of-distribution (OOD) problem caused by intentional attacks or environmental perturbations will severely impair the ability of networks to generalize. In this work, we reported a special form of catastrophic forgetting raised by the OOD problem in continual learning settings, and we named it out-of-distribution forgetting (OODF). In continual image classification tasks, we found that for a given category, introducing an intra-class distribution shift significantly impaired the recognition accuracy of CL methods for that category during subsequent learning. Interestingly, this phenomenon is special for CL as the same level of distribution shift had only negligible effects in the joint learning scenario. We verified that CL methods without dedicating subnetworks for individual tasks are all vulnerable to OODF. Moreover, OODF does not depend on any specific way of shifting the distribution, suggesting it is a risk for CL in a wide range of circumstances. Taken together, our work identified an under-attended risk during CL, highlighting the importance of developing approaches that can overcome OODF. Code available: \url{https://github.com/Hiroid/OODF}

ROFeb 4Code
GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning

Guoqing Ma, Siheng Wang, Zeyu Zhang et al.

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that can be more effective in utilizing the generalization of foundation models, enabling zero-shot manipulation and automatically generating data for robotics. In particular, we study a class of hierarchical VLA model where the high-level ASM (Affordance Segmentation Module) is finetuned to perceive image keypoint affordances of the scene; the mid-level 3DAgent carries out task understanding, skill knowledge, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory. The intermediate 3D path prediction is then served as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstration, making it much more scalable to diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. The generated demonstrations can train more robust behavior cloning policies than training with human demonstrations or from data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can be the scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Code: https://github.com/AIGeeksGroup/GeneralVLA. Website: https://aigeeksgroup.github.io/GeneralVLA.

CVJul 20, 2023
Ethosight: A Reasoning-Guided Iterative Learning System for Nuanced Perception based on Joint-Embedding & Contextual Label Affinity

Hugo Latapie, Shan Yu, Patrick Hammer et al.

Traditional computer vision models often necessitate extensive data acquisition, annotation, and validation. These models frequently struggle in real-world applications, resulting in high false positive and negative rates, and exhibit poor adaptability to new scenarios, often requiring costly retraining. To address these issues, we present Ethosight, a flexible and adaptable zero-shot video analytics system. Ethosight begins from a clean slate based on user-defined video analytics, specified through natural language or keywords, and leverages joint embedding models and reasoning mechanisms informed by ontologies such as WordNet and ConceptNet. Ethosight operates effectively on low-cost edge devices and supports enhanced runtime adaptation, thereby offering a new approach to continuous learning without catastrophic forgetting. We provide empirical validation of Ethosight's promising effectiveness across diverse and complex use cases, while highlighting areas for further improvement. A significant contribution of this work is the release of all source code and datasets to enable full reproducibility and to foster further innovation in both the research and commercial domains.

CVApr 5, 2023
Adaptive Data Augmentation for Contrastive Learning

Yuhan Zhang, He Zhu, Shan Yu

In computer vision, contrastive learning is the most advanced unsupervised learning framework. Yet most previous methods simply apply fixed composition of data augmentations to improve data efficiency, which ignores the changes in their optimal settings over training. Thus, the pre-determined parameters of augmentation operations cannot always fit well with an evolving network during the whole training period, which degrades the quality of the learned representations. In this work, we propose AdDA, which implements a closed-loop feedback structure to a generic contrastive learning network. AdDA works by allowing the network to adaptively adjust the augmentation compositions according to the real-time feedback. This online adjustment helps maintain the dynamic optimal composition and enables the network to acquire more generalizable representations with minimal computational overhead. AdDA achieves competitive results under the common linear protocol on ImageNet-100 classification (+1.11% on MoCo v2).

49.5LGApr 28
Towards Unified Multi-task EEG Analysis with Low-Rank Adaptation

Sicheng Dai, Kai Chen, Hongwang Xiao et al.

Recent self-supervised pre-training methods for electroencephalogram (EEG) have shown promising results. However, the pre-trained models typically require full fine-tuning on each downstream task individually to achieve good performance. In practical applications involving multiple tasks, utilizing a separate model for each task is not ideal regarding computational and spatial cost. In this study, we go one step further and explore the simultaneous adaptation of a pre-trained model to multiple different tasks. The EEG signals exhibit significant heterogeneity due to their collection from various subjects using diverse devices and experimental setups, resulting in potential conflicts among different tasks that impede joint optimization. To tackle this challenge, we propose MTEEG, a multi-task EEG analysis framework which incorporates task-specific low-rank adaptation (LoRA) modules to disentangle the parameter space and alleviate task conflicts. To investigate the trade-off between task specification and interaction, we propose three variants of MTEEG that integrate the LoRA modules in different ways and evaluate them on six downstream tasks, demonstrating that MTEEG can surpass state-of-the-art single-task methods on the majority of metrics. MTEEG shows the potential of multi-task EEG analysis and promotes the development of general-purpose brain-computer interfaces in the future.

LGJan 12, 2023
Effective Decision Boundary Learning for Class Incremental Learning

Kunchi Li, Jun Wan, Shan Yu

Rehearsal approaches in class incremental learning (CIL) suffer from decision boundary overfitting to new classes, which is mainly caused by two factors: insufficiency of old classes data for knowledge distillation and imbalanced data learning between the learned and new classes because of the limited storage memory. In this work, we present a simple but effective approach to tackle these two factors. First, we employ a re-sampling strategy and Mixup K}nowledge D}istillation (Re-MKD) to improve the performances of KD, which would greatly alleviate the overfitting problem. Specifically, we combine mixup and re-sampling strategies to synthesize adequate data used in KD training that are more consistent with the latent distribution between the learned and new classes. Second, we propose a novel incremental influence balance (IIB) method for CIL to tackle the classification of imbalanced data by extending the influence balance method into the CIL setting, which re-weights samples by their influences to create a proper decision boundary. With these two improvements, we present the effective decision boundary learning algorithm (EDBL) which improves the performance of KD and deals with the imbalanced data learning simultaneously. Experiments show that the proposed EDBL achieves state-of-the-art performances on several CIL benchmarks.

CVSep 27, 2024
Learning from Pattern Completion: Self-supervised Controllable Generation

Zhiqiang Chen, Guofan Fan, Jinying Gao et al.

The human brain exhibits a strong ability to spontaneously associate different visual attributes of the same or similar visual scene, such as associating sketches and graffiti with real-world visual objects, usually without supervising information. In contrast, in the field of artificial intelligence, controllable generation methods like ControlNet heavily rely on annotated training datasets such as depth maps, semantic segmentation maps, and poses, which limits the method's scalability. Inspired by the neural mechanisms that may contribute to the brain's associative power, specifically the cortical modularization and hippocampal pattern completion, here we propose a self-supervised controllable generation (SCG) framework. Firstly, we introduce an equivariant constraint to promote inter-module independence and intra-module correlation in a modular autoencoder network, thereby achieving functional specialization. Subsequently, based on these specialized modules, we employ a self-supervised pattern completion approach for controllable generation training. Experimental results demonstrate that the proposed modular autoencoder effectively achieves functional specialization, including the modular processing of color, brightness, and edge detection, and exhibits brain-like features including orientation selectivity, color antagonism, and center-surround receptive fields. Through self-supervised training, associative generation capabilities spontaneously emerge in SCG, demonstrating excellent generalization ability to various tasks such as associative generation on painting, sketches, and ancient graffiti. Compared to the previous representative method ControlNet, our proposed approach not only demonstrates superior robustness in more challenging high-noise scenarios but also possesses more promising scalability potential due to its self-supervised manner.Codes are released on Github and Gitee.

CVSep 15, 2023
AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with Pretrained ViT

Fangbo Qin, Taogang Hou, Shan Lin et al.

Towards flexible object-centric visual perception, we propose a one-shot instance-aware object keypoint (OKP) extraction approach, AnyOKP, which leverages the powerful representation ability of pretrained vision transformer (ViT), and can obtain keypoints on multiple object instances of arbitrary category after learning from a support image. An off-the-shelf petrained ViT is directly deployed for generalizable and transferable feature extraction, which is followed by training-free feature enhancement. The best-prototype pairs (BPPs) are searched for in support and query images based on appearance similarity, to yield instance-unaware candidate keypoints.Then, the entire graph with all candidate keypoints as vertices are divided to sub-graphs according to the feature distributions on the graph edges. Finally, each sub-graph represents an object instance. AnyOKP is evaluated on real object images collected with the cameras of a robot arm, a mobile robot, and a surgical robot, which not only demonstrates the cross-category flexibility and instance awareness, but also show remarkable robustness to domain shift and viewpoint change.

CVMar 3
StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting

Guoqing Ma, Xun Lin, Hui Ma et al.

Most existing Face Forgery Detection (FFD) models assume access to raw face images. In practice, under a client-server framework, private facial data may be intercepted during transmission or leaked by untrusted servers. Previous privacy protection approaches, such as anonymization, encryption, or distortion, partly mitigate leakage but often introduce severe semantic distortion, making images appear obviously protected. This alerts attackers, provoking more aggressive strategies and turning the process into a cat-and-mouse game. Moreover, these methods heavily manipulate image contents, introducing degradation or artifacts that may confuse FFD models, which rely on extremely subtle forgery traces. Inspired by advances in image steganography, which enable high-fidelity hiding and recovery, we propose a Stega}nography-based Face Forgery Detection framework (StegaFFD) to protect privacy without raising suspicion. StegaFFD hides facial images within natural cover images and directly conducts forgery detection in the steganographic domain. However, the hidden forgery-specific features are extremely subtle and interfered with by cover semantics, posing significant challenges. To address this, we propose Low-Frequency-Aware Decomposition (LFAD) and Spatial-Frequency Differential Attention (SFDA), which suppress interference from low-frequency cover semantics and enhance hidden facial feature perception. Furthermore, we introduce Steganographic Domain Alignment (SDA) to align the representations of hidden faces with those of their raw counterparts, enhancing the model's ability to perceive subtle facial cues in the steganographic domain. Extensive experiments on seven FFD datasets demonstrate that StegaFFD achieves strong imperceptibility, avoids raising attackers' suspicion, and better preserves FFD accuracy compared to existing facial privacy protection methods.

NCJul 1, 2024
Individual brain parcellation: Review of methods, validations and applications

Chengyi Li, Shan Yu, Yue Cui

Individual brains vary greatly in morphology, connectivity and organization. The applicability of group-level parcellations is limited by the rapid development of precision medicine today because they do not take into account the variation of parcels at the individual level. Accurate mapping of brain functional regions at the individual level is pivotal for a comprehensive understanding of the variations in brain function and behaviors, early and precise identification of brain abnormalities, as well as personalized treatments for neuropsychiatric disorders. With the development of neuroimaging and machine learning techniques, studies on individual brain parcellation are booming. In this paper, we offer an overview of recent advances in the methodologies of individual brain parcellation, including optimization- and learning-based methods. Comprehensive evaluation metrics to validate individual brain mapping have been introduced. We also review the studies of how individual brain mapping promotes neuroscience research and clinical medicine. Finally, we summarize the major challenges and important future directions of individualized brain parcellation. Collectively, we intend to offer a thorough overview of individual brain parcellation methods, validations, and applications, along with highlighting the current challenges that call for an urgent demand for integrated platforms that integrate datasets, methods, and validations.

AIApr 13, 2023
Emergence of Symbols in Neural Networks for Semantic Understanding and Communication

Yang Chen, Liangxuan Guo, Shan Yu

The capacity to generate meaningful symbols and effectively employ them for advanced cognitive processes, such as communication, reasoning, and planning, constitutes a fundamental and distinctive aspect of human intelligence. Existing deep neural networks still notably lag human capabilities in terms of generating symbols for higher cognitive functions. Here, we propose a solution (symbol emergence artificial network (SEA-net)) to endow neural networks with the ability to create symbols, understand semantics, and achieve communication. SEA-net generates symbols that dynamically configure the network to perform specific tasks. These symbols capture compositional semantic information that allows the system to acquire new functions purely by symbolic manipulation or communication. In addition, these self-generated symbols exhibit an intrinsic structure resembling that of natural language, suggesting a common framework underlying the generation and understanding of symbols in both human brains and artificial neural networks. We believe that the proposed framework will be instrumental in producing more capable systems that can synergize the strengths of connectionist and symbolic approaches for artificial intelligence (AI).

LGFeb 26
Autoregressive Visual Decoding from EEG Signals

Sicheng Dai, Hongwang Xiao, Shan Yu et al.

Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limit their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.

NCJan 5
A neural network for modeling human concept formation, understanding and communication

Liangxuan Guo, Haoyang Chen, Yang Chen et al.

A remarkable capability of the human brain is to form more abstract conceptual representations from sensorimotor experiences and flexibly apply them independent of direct sensory inputs. However, the computational mechanism underlying this ability remains poorly understood. Here, we present a dual-module neural network framework, the CATS Net, to bridge this gap. Our model consists of a concept-abstraction module that extracts low-dimensional conceptual representations, and a task-solving module that performs visual judgement tasks under the hierarchical gating control of the formed concepts. The system develops transferable semantic structure based on concept representations that enable cross-network knowledge transfer through conceptual communication. Model-brain fitting analyses reveal that these emergent concept spaces align with both neurocognitive semantic model and brain response structures in the human ventral occipitotemporal cortex, while the gating mechanisms mirror that in the semantic control brain network. This work establishes a unified computational framework that can offer mechanistic insights for understanding human conceptual cognition and engineering artificial systems with human-like conceptual intelligence.

58.1AIMar 16
Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning

Guangfu Hao, Yuming Dai, Xianzhe Qin et al.

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language tasks, yet complex multi-step reasoning remains a fundamental challenge. While Large Reasoning Models (LRMs) equipped with extended chain-of-thought mechanisms demonstrate improved performance over standard LLMs, both model types still suffer from accuracy collapse on sufficiently complex tasks, suggesting that scaling model-level reasoning alone is insufficient. Inspired by the global workspace theory of human cognition, we propose Brain-Inspired Graph Multi-Agent Systems (BIGMAS), in which specialized LLM agents are organized as nodes in a dynamically constructed directed graph and coordinate exclusively through a centralized shared workspace. A problem-adaptive GraphDesigner constructs task-specific agent topologies, while a global Orchestrator leverages the complete shared state for routing decisions, overcoming the local-view bottleneck of reactive approaches. Experiments on Game24, Six Fives, and Tower of London across six frontier LLMs demonstrate that BIGMAS consistently improves reasoning performance for both standard LLMs and LRMs, outperforming existing multi-agent baselines including ReAct and Tree of Thoughts, showing that multi-agent architectural design provides complementary gains orthogonal to model-level reasoning enhancements.

LGFeb 25
Orthogonal Weight Modification Enhances Learning Scalability and Convergence Efficiency without Gradient Backpropagation

Guoqing Ma, Shan Yu

Recognizing the substantial computational cost of backpropagation (BP), non-BP methods have emerged as attractive alternatives for efficient learning on emerging neuromorphic systems. However, existing non-BP approaches still face critical challenges in efficiency and scalability. Inspired by neural representations and dynamic mechanisms in the brain, we propose a perturbation-based approach called LOw-rank Cluster Orthogonal (LOCO) weight modification. We find that low-rank is an inherent property of perturbation-based algorithms. Under this condition, the orthogonality constraint limits the variance of the node perturbation (NP) gradient estimates and enhances the convergence efficiency. Through extensive evaluations on multiple datasets, LOCO demonstrates the capability to locally train the deepest spiking neural networks to date (more than 10 layers), while exhibiting strong continual learning ability, improved convergence efficiency, and better task performance compared to other brain-inspired non-BP algorithms. Notably, LOCO requires only O(1) parallel time complexity for weight updates, which is significantly lower than that of BP methods. This offers a promising direction for achieving high-performance, real-time, and lifelong learning on neuromorphic systems.

75.0LGApr 30
AG-TAL: Anatomically-Guided Topology-Aware Loss for Multiclass Segmentation of the Circle of Willis Using Large-Scale Multi-Center Datasets

Jialu Liu, Yue Cui, Shan Yu

Accurate multiclass segmentation of the Circle of Willis (CoW) is essential for neurovascular disease management but remains challenging due to complex vascular topology and variable morphology. Existing deep learning methods often suffer from vascular discontinuities and inter-class misclassification, while current topological loss functions incur prohibitive computational costs in 3D multiclass settings. To address these limitations, we propose an Anatomically-Guided Topology-Aware Loss (AG-TAL) and introduce a large-scale, multi-center CoW dataset with unified annotations to facilitate robust model training. AG-TAL specifically integrates a radius-aware Dice loss to address class imbalance in small vessels, a breakage-aware clDice loss that utilizes group convolutions to efficiently preserve local connectivity, and an adjacency-aware co-occurrence loss that leverages anatomical priors to enforce distinct boundaries between neighboring arteries. Evaluated using 5-fold cross-validation, AG-TAL achieved an average Dice score of 80.85% for all CoW arteries, with small arteries notably higher by 1.05-3.09% compared to state-of-the-art methods. Across six independent datasets, the performance of AG-TAL achieved Dice scores ranging from 74.46% to 81.17% for all CoW arteries, with improvements of 2.20% to 9.98% for small arteries compared to other methods. This study demonstrates the superiority of AG-TAL in identifying multiclass CoW arteries and its ability to generalize well to multiple independent datasets. Furthermore, reliability analyses and clinical applications in an Alzheimer's disease cohort validate the AG-TAL's robustness and its potential for discovering imaging-based morphological biomarkers.

94.2MAApr 28
Pythia: Toward Predictability-Driven Agent-Native LLM Serving

Shan Yu, Junyi Shu, Yuanjiang Ni et al.

As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty -- yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.

DCMay 6, 2025
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Shan Yu, Jiarong Xing, Yifan Qiao et al.

Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods present opportunities to improve utilization through GPU sharing. However, existing GPU sharing systems lack the ability to adjust their resource allocation and sharing policies at runtime, making them ineffective at meeting latency service-level objectives (SLOs) under rapidly fluctuating workloads. This paper presents Prism, a multi-LLM serving system that unleashes the full potential of GPU sharing to achieve both cost efficiency and SLO attainment. At its core, Prism tackles a key limitation of existing systems$\unicode{x2014}$the lack of $\textit{cross-model memory coordination}$, which is essential for flexibly sharing GPU memory across models under dynamic workloads. Prism achieves this with two key designs. First, it supports on-demand memory allocation by dynamically mapping physical to virtual memory pages, allowing flexible memory redistribution among models that space- and time-share a GPU. Second, it improves memory efficiency through a two-level scheduling policy that dynamically adjusts sharing strategies based on models' runtime demands. Evaluations on real-world traces show that Prism achieves more than $2\times$ cost savings and $3.3\times$ SLO attainment compared to state-of-the-art systems.

LGNov 2, 2025
Clustering-Based Weight Orthogonalization for Stabilizing Deep Reinforcement Learning

Guoqing Ma, Yuhan Zhang, Yuming Dai et al.

Reinforcement learning (RL) has made significant advancements, achieving superhuman performance in various tasks. However, RL agents often operate under the assumption of environmental stationarity, which poses a great challenge to learning efficiency since many environments are inherently non-stationary. This non-stationarity results in the requirement of millions of iterations, leading to low sample efficiency. To address this issue, we introduce the Clustering Orthogonal Weight Modified (COWM) layer, which can be integrated into the policy network of any RL algorithm and mitigate non-stationarity effectively. The COWM layer stabilizes the learning process by employing clustering techniques and a projection matrix. Our approach not only improves learning speed but also reduces gradient interference, thereby enhancing the overall learning efficiency. Empirically, the COWM outperforms state-of-the-art methods and achieves improvements of 9% and 12.6% in vision based and state-based DMControl benchmark. It also shows robustness and generality across various algorithms and tasks.

MLMar 10, 2024
Nonparametric Automatic Differentiation Variational Inference with Spline Approximation

Yuda Shao, Shan Yu, Tianshu Feng

Automatic Differentiation Variational Inference (ADVI) is efficient in learning probabilistic models. Classic ADVI relies on the parametric approach to approximate the posterior. In this paper, we develop a spline-based nonparametric approximation approach that enables flexible posterior approximation for distributions with complicated structures, such as skewness, multimodality, and bounded support. Compared with widely-used nonparametric variational inference methods, the proposed method is easy to implement and adaptive to various data structures. By adopting the spline approximation, we derive a lower bound of the importance weighted autoencoder and establish the asymptotic consistency. Experiments demonstrate the efficiency of the proposed method in approximating complex posterior distributions and improving the performance of generative models with incomplete data.

CVFeb 8, 2025
Efficient Reinforcement Learning Through Adaptively Pretrained Visual Encoder

Yuhan Zhang, Guoqing Ma, Guangfu Hao et al.

While Reinforcement Learning (RL) agents can successfully learn to handle complex tasks, effectively generalizing acquired skills to unfamiliar settings remains a challenge. One of the reasons behind this is the visual encoders used are task-dependent, preventing effective feature extraction in different settings. To address this issue, recent studies have tried to pretrain encoders with diverse visual inputs in order to improve their performance. However, they rely on existing pretrained encoders without further exploring the impact of pretraining period. In this work, we propose APE: efficient reinforcement learning through Adaptively Pretrained visual Encoder -- a framework that utilizes adaptive augmentation strategy during the pretraining phase and extracts generalizable features with only a few interactions within the task environments in the policy learning period. Experiments are conducted across various domains, including DeepMind Control Suite, Atari Games and Memory Maze benchmarks, to verify the effectiveness of our method. Results show that mainstream RL methods, such as DreamerV3 and DrQ-v2, achieve state-of-the-art performance when equipped with APE. In addition, APE significantly improves the sampling efficiency using only visual inputs during learning, approaching the efficiency of state-based method in several control tasks. These findings demonstrate the potential of adaptive pretraining of encoder in enhancing the generalization ability and efficiency of visual RL algorithms.

CVOct 10, 2025
Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array

Yitong Chen, Xinyao Xu, Ping Zhu et al.

Flexible microelectrode (FME) implantation into brain cortex is challenging due to the deformable fiber-like structure of FME probe and the interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.

AIMay 28, 2025
Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test

Guangfu Hao, Frederic Alexandre, Shan Yu

Cognitive flexibility has been extensively studied in human cognition but remains relatively unexplored in the context of Visual Large Language Models (VLLMs). This study assesses the cognitive flexibility of state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet) using the Wisconsin Card Sorting Test (WCST), a classic measure of set-shifting ability. Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs. However, their abilities are highly influenced by both input modality and prompting strategy. In addition, we find that through role-playing, VLLMs can simulate various functional deficits aligned with patients having impairments in cognitive flexibility, suggesting that VLLMs may possess a cognitive architecture, at least regarding the ability of set-shifting, similar to the brain. This study reveals the fact that VLLMs have already approached the human level on a key component underlying our higher cognition, and highlights the potential to use them to emulate complex brain processes.

CVMay 28, 2025
Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

Guangfu Hao, Haojie Wen, Liangxuan Guo et al.

Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks-significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Human evaluation studies validate our framework's alignment with human decision-making patterns, and generalization experiments demonstrate effective performance on novel tool categories. Ablation studies revealed that manipulation-related attributes (graspability, elongation, hand-relatedness) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.

CVDec 28, 2021
Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking

Jin Gao, Yan Lu, Xiaojuan Qi et al.

Tracking visual objects from a single initial exemplar in the testing phase has been broadly cast as a one-/few-shot problem, i.e., one-shot learning for initial adaptation and few-shot learning for online adaptation. The recent few-shot online adaptation methods incorporate the prior knowledge from large amounts of annotated training data via complex meta-learning optimization in the offline phase. This helps the online deep trackers to achieve fast adaptation and reduce overfitting risk in tracking. In this paper, we propose a simple yet effective recursive least-squares estimator-aided online learning approach for few-shot online adaptation without requiring offline training. It allows an in-built memory retention mechanism for the model to remember the knowledge about the object seen before, and thus the seen data can be safely removed from training. This also bears certain similarities to the emerging continual learning field in preventing catastrophic forgetting. This mechanism enables us to unveil the power of modern online deep trackers without incurring too much extra computational cost. We evaluate our approach based on two networks in the online learning families for tracking, i.e., multi-layer perceptrons in RT-MDNet and convolutional neural networks in DiMP. The consistent improvements on several challenging tracking benchmarks demonstrate its effectiveness and efficiency.

CVApr 12, 2021
Intra-Class Uncertainty Loss Function for Classification

He Zhu, Shan Yu

Most classification models can be considered as the process of matching templates. However, when intra-class uncertainty/variability is not considered, especially for datasets containing unbalanced classes, this may lead to classification errors. To address this issue, we propose a loss function with intra-class uncertainty following Gaussian distribution. Specifically, in our framework, the features extracted by deep networks of each class are characterized by independent Gaussian distribution. The parameters of distribution are learned with a likelihood regularization along with other network parameters. The means of the Gaussian play a similar role as the center anchor in existing methods, and the variance describes the uncertainty of different classes. In addition, similar to the inter-class margin in traditional loss functions, we introduce a margin to intra-class uncertainty to make each cluster more compact and reduce the imbalance of feature distribution from different categories. Based on MNIST, CIFAR, ImageNet, and Long-tailed CIFAR analyses, the proposed approach shows improved classification performance, through learning a better class representation.

CROct 30, 2020
EEG-Based Brain-Computer Interfaces Are Vulnerable to Backdoor Attacks

Lubin Meng, Jian Huang, Zhigang Zeng et al.

Research and development of electroencephalogram (EEG) based brain-computer interfaces (BCIs) have advanced rapidly, partly due to deeper understanding of the brain and wide adoption of sophisticated machine learning approaches for decoding the EEG signals. However, recent studies have shown that machine learning algorithms are vulnerable to adversarial attacks. This article proposes to use narrow period pulse for poisoning attack of EEG-based BCIs, which is implementable in practice and has never been considered before. One can create dangerous backdoors in the machine learning model by injecting poisoning samples into the training set. Test samples with the backdoor key will then be classified into the target class specified by the attacker. What most distinguishes our approach from previous ones is that the backdoor key does not need to be synchronized with the EEG trials, making it very easy to implement. The effectiveness and robustness of the backdoor attack approach is demonstrated, highlighting a critical security concern for EEG-based BCIs and calling for urgent attention to address it.

CVAug 8, 2019
Progressive Relation Learning for Group Activity Recognition

Guyue Hu, Bo Cui, Yuan He et al.

Group activities usually involve spatiotemporal dynamics among many interactive individuals, while only a few participants at several key frames essentially define the activity. Therefore, effectively modeling the group-relevant and suppressing the irrelevant actions (and interactions) are vital for group activity recognition. In this paper, we propose a novel method based on deep reinforcement learning to progressively refine the low-level features and high-level relations of group activities. Firstly, we construct a semantic relation graph (SRG) to explicitly model the relations among persons. Then, two agents adopting policy according to two Markov decision processes are applied to progressively refine the SRG. Specifically, one feature-distilling (FD) agent in the discrete action space refines the low-level spatio-temporal features by distilling the most informative frames. Another relation-gating (RG) agent in continuous action space adjusts the high-level semantic graph to pay more attention to group-relevant relations. The SRG, FD agent, and RG agent are optimized alternately to mutually boost the performance of each other. Extensive experiments on two widely used benchmarks demonstrate the effectiveness and superiority of the proposed approach.

CVNov 10, 2018
Skeleton-Based Action Recognition with Synchronous Local and Non-local Spatio-temporal Learning and Frequency Attention

Guyue Hu, Bo Cui, Shan Yu

Benefiting from its succinctness and robustness, skeleton-based action recognition has recently attracted much attention. Most existing methods utilize local networks (e.g., recurrent, convolutional, and graph convolutional networks) to extract spatio-temporal dynamics hierarchically. As a consequence, the local and non-local dependencies, which contain more details and semantics respectively, are asynchronously captured in different level of layers. Moreover, existing methods are limited to the spatio-temporal domain and ignore information in the frequency domain. To better extract synchronous detailed and semantic information from multi-domains, we propose a residual frequency attention (rFA) block to focus on discriminative patterns in the frequency domain, and a synchronous local and non-local (SLnL) block to simultaneously capture the details and semantics in the spatio-temporal domain. Besides, a soft-margin focal loss (SMFL) is proposed to optimize the learning whole process, which automatically conducts data selection and encourages intrinsic margins in classifiers. Our approach significantly outperforms other state-of-the-art methods on several large-scale datasets.

LGSep 29, 2018
Continual Learning of Context-dependent Processing in Neural Networks

Guanxiong Zeng, Yang Chen, Bo Cui et al.

Deep neural networks (DNNs) are powerful tools in learning sophisticated but fixed mapping rules between inputs and outputs, thereby limiting their application in more complex and dynamic situations in which the mapping rules are not kept the same but changing according to different contexts. To lift such limits, we developed a novel approach involving a learning algorithm, called orthogonal weights modification (OWM), with the addition of a context-dependent processing (CDP) module. We demonstrated that with OWM to overcome the problem of catastrophic forgetting, and the CDP module to learn how to reuse a feature representation and a classifier for different contexts, a single network can acquire numerous context-dependent mapping rules in an online and continual manner, with as few as $\sim$10 samples to learn each. This should enable highly compact systems to gradually learn myriad regularities of the real world and eventually behave appropriately within it.