97.0 · LG · May 26 · Code
Heterogeneous Parallelism for Multimodal Large Language Model Training
Yashaswi Karnati, Kamran Jafari, Akash Mehra et al.
Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension.
CL · Nov 18, 2022 · Code
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao, Ji Lin, Mickael Seznec et al.
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, Llama-1/2, Falcon, Mistral, and Mixtral models. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.
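The "mathematically equivalent transformation" described above can be illustrated with a per-channel rescaling: each activation channel is divided by a scale and the matching weight row is multiplied by the same scale, so the product is unchanged while activation outliers shrink. The sketch below is a minimal pure-Python illustration; the scale rule `act_max**alpha / w_max**(1 - alpha)` with `alpha=0.5`, the function names, and the toy matrices are assumptions for illustration and are not taken from the abstract.

```python
def matmul(a, b):
    """Plain nested-list matrix product, a is (n x k), b is (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def smooth(x, w, alpha=0.5):
    """Move quantization difficulty from activations x to weights w.

    x is (tokens x channels), w is (channels x out_features).
    The per-channel scale rule is an assumed choice for this sketch;
    it requires non-zero channel maxima.
    """
    n_ch = len(w)
    scales = []
    for j in range(n_ch):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / scales[j] for j in range(n_ch)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(n_ch)]
    return x_s, w_s, scales

# Toy data: activation channel 1 carries large outliers.
x = [[0.5, 40.0], [-1.0, -60.0]]
w = [[0.2, -0.4], [0.1, 0.3]]

x_s, w_s, scales = smooth(x, w)
y = matmul(x, w)        # original product
y_s = matmul(x_s, w_s)  # product after smoothing, equal up to rounding
```

After smoothing, the largest activation magnitude drops well below its original value while `y_s` matches `y`, which is what makes the smoothed activations easier to quantize to 8 bits.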
98.1 · AI · May 31
ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation
Jingqi Zhou, Sheng Wang, Dezhao Deng et al.
LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Such rigidity forces a trade-off between domain-specific performance and cross-task generalization: strong priors and compact tool spaces aid specialization but weaken transfer, while task-agnostic workflows and broad action spaces expand coverage but dilute guidance. Existing pre-execution optimization, planner-worker orchestration, and configuration patching fall short of resolving this tension, as they decouple adaptation from execution, causing information loss, fragmented optimization, and ambiguous credit assignment. We propose ToolSelf, a tool-driven runtime self-reconfiguration paradigm that abstracts configuration updates as a standardized tool interface and unifies execution and adaptation within one policy's action space. The execution agent can dynamically update sub-goals, strategies, toolboxes, context, and context-management modes based on task progress and feedback. We further introduce Configuration-Aware Two-stage Training (CAT), which combines rejection sampling fine-tuning with trajectory-level KTO reinforcement learning to internalize self-reconfiguration. Across diverse benchmarks, zero-shot ToolSelf rivals task-specialized agents; after CAT training, ToolSelf gains 28.8 points over the static-configuration baseline on average, illuminating a path toward emergent adaptivity that obviates manually injected guidance.
97.8 · CV · May 27 · Code
ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
Wenjie Liu, Hao Wu, Xin Qiu et al.
Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
LG · Jan 28, 2023 · Code
A Closer Look at Few-shot Classification Again
Xu Luo, Hao Wu, Ji Zhang et al.
Few-shot classification consists of a training phase where a model is learned on a relatively large dataset and an adaptation phase where the learned model is adapted to previously-unseen tasks with limited labeled samples. In this paper, we empirically prove that the training algorithm and the adaptation algorithm can be completely disentangled, which allows algorithm analysis and design to be done individually for each phase. Our meta-analysis for each phase reveals several interesting insights that may help better understand key aspects of few-shot classification and connections with other fields such as visual representation learning and transfer learning. We hope the insights and research challenges revealed in this paper can inspire future work in related directions. Code and pre-trained models (in PyTorch) are available at https://github.com/Frankluox/CloserLookAgainFewShot.
CL · Aug 17, 2022 · Code
Exploiting Unlabeled Data for Target-Oriented Opinion Words Extraction
Yidong Wang, Hao Wu, Ao Liu et al. · pku
Target-oriented Opinion Words Extraction (TOWE) is a fine-grained sentiment analysis task that aims to extract the corresponding opinion words of a given opinion target from the sentence. Recently, deep learning approaches have made remarkable progress on this task. Nevertheless, the TOWE task still suffers from the scarcity of training data due to the expensive data annotation process. Limited labeled data increase the risk of distribution shift between test data and training data. In this paper, we propose exploiting massive unlabeled data to reduce the risk by increasing the exposure of the model to varying distribution shifts. Specifically, we propose a novel Multi-Grained Consistency Regularization (MGCR) method to make use of unlabeled data and design two filters specifically for TOWE to filter noisy data at different granularity. Extensive experimental results on four TOWE benchmark datasets indicate the superiority of MGCR compared with current state-of-the-art methods. The in-depth analysis also demonstrates the effectiveness of the different-granularity filters. Our codes are available at https://github.com/TOWESSL/TOWESSL.
CV · Jun 4, 2022 · Code
Priors in Deep Image Restoration and Enhancement: A Survey
Yunfan Lu, Yiqi Lin, Hao Wu et al.
Image restoration and enhancement is the process of improving image quality by removing degradations such as noise, blur, and resolution degradation. Deep learning (DL) has recently been applied to image restoration and enhancement. Because the problem is ill-posed, many works have explored priors to facilitate the training of deep neural networks (DNNs). However, the importance of priors has not yet been systematically studied and analyzed in the research community. Therefore, this paper serves as the first study that provides a comprehensive overview of recent advancements in priors for deep image restoration and enhancement. Our work covers five main topics: (1) A theoretical analysis of priors for deep image restoration and enhancement; (2) A hierarchical and structural taxonomy of priors commonly used in DL-based methods; (3) An insightful discussion of each prior regarding its principle, potential, and applications; (4) A summary of crucial problems that highlights potential future directions, especially adopting large-scale foundation models as priors, to spark more research in the community; (5) An open-source repository that provides a taxonomy of all mentioned works and code links.
71.4 · AI · May 27
Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
Jiawei Kong, Hao Fang, Shunxiang Liao et al.
Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
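For context, the response-level DPO baseline that the abstract contrasts against can be sketched as a single loss on sequence log-probabilities, where each response (CoT plus answer) is scored as one monolithic output. This is the standard DPO objective, not the RC-DPO objective proposed in the paper; the function name, argument names, and `beta=0.1` are assumptions for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard response-level DPO loss for one preference pair.

    logp_w / logp_l: policy log-probability of the preferred / rejected
    full response (CoT and answer treated as one sequence).
    ref_logp_w / ref_logp_l: same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# With no preference margin the loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the preferred response's likelihood lowers the loss.
improved = dpo_loss(-3.0, -5.0, -5.0, -5.0)
```

Because the whole response contributes a single log-probability, this loss has no term that singles out the reasoning chain, which is the gap the abstract points to.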
CV · Oct 17, 2022
Track Targets by Dense Spatio-Temporal Position Encoding
Jinkun Cao, Hao Wu, Kris Kitani · cmu
In this work, we propose a novel paradigm to encode the position of targets for target tracking in videos using transformers. The proposed paradigm, Dense Spatio-Temporal (DST) position encoding, encodes spatio-temporal position information in a pixel-wise dense fashion. The encoding provides location information to associate targets across frames, going beyond appearance matching that compares objects in two bounding boxes. Compared to the typical transformer positional encoding, our proposed encoding is applied to the 2D CNN features instead of the projected feature vectors to avoid losing positional information. Moreover, the designed DST encoding can uniformly represent both the location of a single-frame object and the evolution of a trajectory's location across frames. Integrated with the DST encoding, we build a transformer-based multi-object tracking model. The model takes a video clip as input and conducts the target association in the clip. It can also perform online inference by associating existing trajectories with objects from newly arriving frames. Experiments on video multi-object tracking (MOT) and multi-object tracking and segmentation (MOTS) datasets demonstrate the effectiveness of the proposed DST position encoding.
99.4 · NA · Jun 2
Physics-guided correction for operator learning under model misspecification
Lei Ma, Nicolas Boullé, Yu-Sen Yang et al.
Physics-informed operator learning provides an efficient framework for approximating solution operators of partial differential equations by combining observational data with governing physical laws. However, most existing methods implicitly assume that the prescribed governing equation is accurate. This assumption may fail in practical applications, where model simplifications, missing physical effects, parameter drift, or incomplete constitutive relations can lead to model misspecification. In this work, we propose a physics-guided operator correction framework for learning solution operators under misspecified governing equations. At the operator level, the target mapping is decomposed into a prior operator induced by an approximate physical model and a learnable correction operator that accounts for the remaining discrepancy. Although the formulation is architecture independent, we realize it using a serial DeepONet architecture, where the first DeepONet provides a prior prediction and the second DeepONet learns an additive correction conditioned on both the input function and the prior prediction. The learned correction is incorporated into the physics residual and trained together with data-consistency constraints, allowing the model to retain useful physical structure while adapting to inaccurate governing equations. Numerical experiments on diffusion-reaction, Burgers, cavity flow, and hyperelastic problems show that the proposed method substantially reduces errors induced by misspecified physics. Additional tests under sparse and noisy observations further demonstrate the robustness of the framework and its ability to provide informative uncertainty estimates through deep ensembles.
CV · May 30, 2022
Edge YOLO: Real-Time Intelligent Object Detection System Based on Edge-Cloud Cooperation in Autonomous Vehicles
Siyuan Liang, Hao Wu
Driven by the ever-increasing requirements of autonomous vehicles, such as traffic monitoring and driving assistance, deep learning-based object detection (DL-OD) has become increasingly attractive in intelligent transportation systems. However, it is difficult for existing DL-OD schemes to realize responsible, cost-saving, and energy-efficient autonomous vehicle systems due to their inherent defects of low timeliness and high energy consumption. In this paper, we propose an object detection (OD) system based on edge-cloud cooperation and reconstructive convolutional neural networks, which is called Edge YOLO. This system can effectively avoid excessive dependence on computing power and the uneven distribution of cloud computing resources. Specifically, it is a lightweight OD framework realized by combining a pruned feature extraction network with a compressed feature fusion network to enhance the efficiency of multi-scale prediction to the largest extent. In addition, we developed an autonomous driving platform equipped with NVIDIA Jetson for system-level verification. We experimentally demonstrate the reliability and efficiency of Edge YOLO on the COCO2017 and KITTI datasets. On COCO2017, at a speed of 26.6 frames per second (FPS), the results show that the number of parameters in the entire network is only 25.67 MB, while the accuracy (mAP) is up to 47.3%.
87.4 · AI · Jun 4
Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents
Yaoqi Chen, Haibin Lai, Yuru Feng et al.
LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose MAGE (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. Experiments on MemoryArena show that MAGE improves the average task success rate by 7.8--20.4 pp over baselines, while reducing token consumption by 55.1%.
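The hierarchical state tree described above can be pictured with a small data structure in which the agent's state is read off the active root-to-current path. The sketch below is a simplified illustration under stated assumptions: the class and method names (`Node`, `StateTree`, `open_subgoal`, `active_state`) are hypothetical, the Maintain operation and hints from prior branches are omitted, and the paper's actual implementation may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    subgoal: str
    parent: Optional["Node"] = None
    traces: List[str] = field(default_factory=list)
    summary: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

class StateTree:
    def __init__(self, root_goal: str):
        self.root = Node(root_goal)
        self.current = self.root

    def open_subgoal(self, subgoal: str) -> Node:
        child = Node(subgoal, parent=self.current)
        self.current.children.append(child)
        self.current = child
        return child

    def grow(self, trace: str) -> None:
        """Grow: record a new trace on the current node."""
        self.current.traces.append(trace)

    def compress(self, summary: str) -> None:
        """Compress: replace a finished subgoal's traces with a summary."""
        self.current.summary = summary
        self.current.traces.clear()
        if self.current.parent is not None:
            self.current = self.current.parent

    def revise(self, boundary: Node, new_subgoal: str) -> Node:
        """Revise: return to a boundary node and resume on a new branch."""
        self.current = boundary
        return self.open_subgoal(new_subgoal)

    def active_state(self) -> List[str]:
        """Build the state from the root-to-current path only."""
        path, node = [], self.current
        while node is not None:
            path.append(node)
            node = node.parent
        lines = []
        for n in reversed(path):
            lines.append(f"goal: {n.subgoal}")
            lines.extend(f"done: {c.summary}" for c in n.children if c.summary)
            lines.extend(f"trace: {t}" for t in n.traces)
        return lines

tree = StateTree("book trip")
tree.open_subgoal("find flight")
tree.grow("searched flights")
tree.compress("flight booked for Mon")
tree.open_subgoal("book hotel A")
tree.grow("hotel A is full")                 # flawed branch
tree.revise(tree.root, "book hotel B")       # abandon it, branch again
tree.grow("hotel B available")
state = tree.active_state()
```

In this toy run the abandoned "hotel A" branch stays in the tree but never reaches `state`, which mirrors the abstract's point about isolating flawed segments from the active path.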
CL · Sep 29, 2025
Pretraining Large Language Models with NVFP4
Felix Abecassis, Anjulie Agrusa, Dong Ahn et al. · nvidia
Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
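The role of stochastic rounding in unbiased gradient estimation can be seen in a toy setting: rounding up with probability equal to the fractional part makes the expected rounded value equal to the input, whereas round-to-nearest is biased for any fixed input. The sketch below rounds to a uniform grid, which is an assumption for illustration; it does not model the NVFP4 number grid or block scaling, and the function name is hypothetical.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability
    equal to the fractional distance from the lower grid point."""
    scaled = x / step
    low = math.floor(scaled)
    frac = scaled - low
    return (low + (1 if rng.random() < frac else 0)) * step

rng = random.Random(0)   # fixed seed for repeatability
x, step = 0.3, 1.0

samples = [stochastic_round(x, step, rng) for _ in range(100_000)]
mean_sr = sum(samples) / len(samples)   # close to 0.3 on average
nearest = round(x / step) * step        # always 0.0 for this input
```

Each individual sample is still either 0.0 or 1.0; only the average over many roundings recovers the original value, which is why the technique suits accumulated quantities such as gradients.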
LG · Sep 12, 2022
FP8 Formats for Deep Learning
Paulius Micikevicius, Dusan Stosic, Neil Burgess et al.
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training-quantization of language models trained using 16-bit formats that resisted fixed point int8 quantization.
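The effect of the two special-value conventions on dynamic range can be checked by decoding every bit pattern. The sketch below is a minimal decoder, assuming the usual IEEE-style exponent bias of 2^(e-1) - 1 (7 for E4M3, 15 for E5M2), which the abstract does not state; the function names are hypothetical.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, ieee_specials):
    """Decode one 8-bit pattern (1 sign bit, exp_bits, man_bits).

    ieee_specials=True : all-ones exponent encodes inf/NaN (E5M2 style).
    ieee_specials=False: only all-ones exponent AND all-ones mantissa
                         is NaN, and there are no infinities (E4M3 style).
    The exponent bias is an assumption of this sketch.
    """
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    if exp == exp_max:
        if ieee_specials:
            return sign * math.inf if man == 0 else math.nan
        if man == man_max:
            return math.nan
    if exp == 0:  # subnormal numbers and zero
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def max_finite(exp_bits, man_bits, ieee_specials):
    """Largest finite value among the non-negative bit patterns."""
    vals = (decode_fp8(b, exp_bits, man_bits, ieee_specials) for b in range(128))
    return max(v for v in vals if math.isfinite(v))

e4m3_max = max_finite(4, 3, False)   # 448.0
e5m2_max = max_finite(5, 2, True)    # 57344.0
```

Under strict IEEE conventions the top E4M3 exponent would be reserved entirely for infinities and NaNs; reclaiming seven of its eight mantissa patterns for finite values is what extends the range up to 448.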
CV · May 16, 2022 · Code
A New Outlier Removal Strategy Based on Reliability of Correspondence Graph for Fast Point Cloud Registration
Li Yan, Pengcheng Wei, Hong Xie et al.
Registration is a basic yet crucial task in point cloud processing. In correspondence-based point cloud registration, matching correspondences by point feature techniques may lead to an extremely high outlier ratio. Current methods still suffer from low efficiency, accuracy, and recall rate. We use a simple and intuitive method to describe the 6-DOF (degree of freedom) curtailment process in point cloud registration and propose an outlier removal strategy based on the reliability of the correspondence graph. The method constructs the correspondence graph from the given correspondences and defines the reliability degree of a graph node for optimal candidate selection and the reliability degree of a graph edge to obtain the global maximum consensus set. The presented method achieves fast and accurate outlier removal along with gradual estimation of the alignment parameters. Extensive experiments on simulations and challenging real-world datasets demonstrate that the proposed method can still perform effective point cloud registration even when the correspondence outlier ratio is over 99%, and that its efficiency is better than the state of the art. Code is available at https://github.com/WPC-WHU/GROR.
SP · Apr 12, 2022
GMSS: Graph-Based Multi-Task Self-Supervised Learning for EEG Emotion Recognition
Yang Li, Ji Chen, Fu Li et al.
Previous electroencephalogram (EEG) emotion recognition relies on single-task learning, which may lead to overfitting and learned emotion features that lack generalization. In this paper, a graph-based multi-task self-supervised learning model (GMSS) for EEG emotion recognition is proposed. GMSS can learn more general representations by integrating multiple self-supervised tasks, including spatial and frequency jigsaw puzzle tasks and contrastive learning tasks. By learning from multiple tasks simultaneously, GMSS can find a representation that captures all of the tasks, thereby decreasing the chance of overfitting on the original task, i.e., the emotion recognition task. In particular, the spatial jigsaw puzzle task aims to capture the intrinsic spatial relationships of different brain regions. Considering the importance of frequency information in EEG emotional signals, the goal of the frequency jigsaw puzzle task is to explore the crucial frequency bands for EEG emotion recognition. To further regularize the learned features and encourage the network to learn inherent representations, a contrastive learning task is adopted in this work by mapping the transformed data into a common feature space. The performance of the proposed GMSS is compared with several popular unsupervised and supervised methods. Experiments on SEED, SEED-IV, and MPED datasets show that the proposed model has remarkable advantages in learning more discriminative and general features for EEG emotional signals.
84.0 · NI · Jun 3
vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models
Xunzhuo Liu, Huamin Chen, Samzong Lu et al.
As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, the problem of intelligent request routing, that is, selecting the right model for each query at inference time, has become a critical systems challenge. We present vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality (MoM) model deployments. The architecture follows two complementary Shannon-inspired views. In the information-theoretic regime, signal extraction reduces the entropy of "which model?" by distilling routing-relevant information from raw queries. In the Boolean-algebraic regime, the decision engine composes functionally complete routing policies from signal conditions. The central innovation is composable signal orchestration: thirteen heterogeneous signal types, spanning sub-millisecond heuristics and neural classifiers for semantics, safety, and modality, are composed through configurable Boolean decision rules into deployment-specific routing policies, so that fundamentally different scenarios (multi-cloud enterprise, privacy-regulated, cost-optimized) are expressed as different configurations over the same architecture. Matched decisions drive semantic model routing via thirteen selection algorithms, while per-decision plugin chains enforce safety constraints including a three-stage HaluGate hallucination detection pipeline and a lightweight episodic memory system with ReflectionGate for personalized multi-turn context. A typed neural-symbolic DSL specifies these routing policies and compiles them to multiple deployment targets, enabling configuration-first adaptation without code changes. Together, these components show that composable signal orchestration enables a single framework to serve diverse deployment scenarios with differentiated cost, privacy, and safety policies.
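Composing signal conditions through Boolean decision rules can be illustrated with a few predicate combinators. The sketch below is a toy under stated assumptions: the signal functions, model names, rule order, and first-match policy are all hypothetical and do not reflect the project's actual DSL, signal types, or selection algorithms.

```python
from typing import Callable, List, Tuple

Signal = Callable[[str], bool]

def all_of(*conds: Signal) -> Signal:
    return lambda q: all(c(q) for c in conds)

def any_of(*conds: Signal) -> Signal:
    return lambda q: any(c(q) for c in conds)

def negate(cond: Signal) -> Signal:
    return lambda q: not cond(q)

# Hypothetical cheap heuristic signals (stand-ins for real classifiers).
def has_code(q: str) -> bool:
    return "def " in q or "```" in q

def mentions_pii(q: str) -> bool:
    return "ssn" in q.lower()

def is_long(q: str) -> bool:
    return len(q.split()) > 50

# A deployment-specific policy: ordered (model, condition) rules.
RULES: List[Tuple[str, Signal]] = [
    ("on-prem-model", mentions_pii),
    ("code-model", all_of(has_code, negate(mentions_pii))),
    ("large-model", any_of(is_long)),
]

def route(query: str, default: str = "small-model") -> str:
    """Return the model of the first rule whose condition matches."""
    for model, cond in RULES:
        if cond(query):
            return model
    return default

privacy_choice = route("My SSN is 123-45-6789, what should I do?")
code_choice = route("def f(): pass  # why does this fail?")
fallback_choice = route("hi")
```

Swapping in a different `RULES` list changes the policy without touching `route`, which is the sense in which different scenarios become different configurations over the same engine.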
78.0 · CV · Apr 16
The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview
Zheng Chen, Kai Liu, Jingkai Wang et al.
This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective super-resolution solutions and analyze recent advances in the field. To reflect the evolving objectives of image super-resolution, the challenge includes two tracks: (1) a restoration track, which emphasizes pixel-wise fidelity and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and evaluates results using a perceptual score. A total of 194 participants registered for the challenge, with 31 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and methods of participating teams. The challenge provides a unified benchmark and offers insights into current progress and future directions in image super-resolution.
LG · Mar 16, 2022
Monte Carlo PINNs: deep learning approach for forward and inverse problems involving high dimensional fractional partial differential equations
Ling Guo, Hao Wu, Xiaochen Yu et al.
We introduce a sampling-based machine learning approach, Monte Carlo physics informed neural networks (MC-PINNs), for solving forward and inverse fractional partial differential equations (FPDEs). As a generalization of physics informed neural networks (PINNs), our method relies on deep neural network surrogates in addition to a stochastic approximation strategy for computing the fractional derivatives of the DNN outputs. A key ingredient in our MC-PINNs is the construction of an unbiased estimate of the physical soft constraints in the loss function. Our direct sampling approach can yield a lower overall computational cost than the fPINNs proposed in \cite{pang2019fpinns} and thus provides an opportunity for solving high dimensional fractional PDEs. We validate the performance of the MC-PINNs method via several examples that include high dimensional integral fractional Laplacian equations, parametric identification of time-space fractional PDEs, and a fractional diffusion equation with random inputs. The results show that MC-PINNs is flexible and promising for tackling high-dimensional FPDEs.
NA · May 2, 2012
Gaussian Beam Methods for the Dirac Equation in the Semi-classical Regime
Hao Wu, Zhongyi Huang, Shi Jin et al.
The Dirac equation is an important model in relativistic quantum mechanics. In the semi-classical regime $ε\ll1$, even a spatially spectrally accurate time splitting method \cite{HuJi:05} requires the mesh size to be $O(ε)$, which makes the direct simulation extremely expensive. In this paper, we present the Gaussian beam method for the Dirac equation. With the help of an eigenvalue decomposition, the Gaussian beams can be independently evolved along each eigenspace and summed to construct an approximate solution of the Dirac equation. Moreover, the proposed Eulerian Gaussian beam keeps the advantages of constructing the Hessian matrices by simply using level set functions' derivatives. Finally, several numerical examples show the efficiency and accuracy of the method.
CL · Mar 10, 2022
Compilable Neural Code Generation with Compiler Feedback
Xin Wang, Yasheng Wang, Yao Wan et al.
Automatically generating compilable programs with (or without) natural language descriptions has always been a touchstone problem for computational linguistics and automated software engineering. Existing deep-learning approaches model code generation as text generation, either constrained by grammar structures in the decoder, or driven by language models pre-trained on large-scale code corpora (e.g., CodeGPT, PLBART, and CodeT5). However, few of them account for the compilability of the generated programs. To improve the compilability of the generated programs, this paper proposes COMPCODER, a three-stage pipeline utilizing compiler feedback for compilable code generation, including language model fine-tuning, compilability reinforcement, and compilability discrimination. Comprehensive experiments on two code generation tasks demonstrate the effectiveness of our proposed approach, improving the average compilation success rate from 44.18 to 89.18 in code completion and from 70.3 to 96.2 in text-to-code generation, compared with the state-of-the-art CodeGPT.
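The compiler-feedback signal behind a "success rate of compilation" metric can be sketched as a binary check on each generated program. The example below uses Python's built-in `compile()` as a stand-in compiler, which only checks syntax; the choice of Python as the target language, the function names, and the toy candidates are assumptions for illustration and are not taken from the paper.

```python
from typing import List

def compiles(source: str) -> bool:
    """Return True if the source byte-compiles as a Python module.

    This is a syntax-level check only; it does not execute the code
    or resolve names, so it is a weak proxy for a real compiler.
    """
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False

def compilation_rate(candidates: List[str]) -> float:
    """Fraction of candidates that compile."""
    return sum(compiles(c) for c in candidates) / len(candidates)

# Toy generated programs: one valid, two with syntax errors.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",
    "x = [1, 2\n",
]
rate = compilation_rate(candidates)   # one of three compiles
```

A per-sample value such as `compiles(c)` is the kind of feedback that could serve as a reward during reinforcement or as a label for a discriminator, the two later stages the abstract names.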
99.3 · DC · May 6
eLLM: Elastic Memory Management Framework for Efficient LLM Serving
Jiale Xu, Rui Zhang, Yi Xiong et al.
Large Language Models are increasingly being deployed in datacenters. Serving these models requires careful memory management, as their memory usage includes static weights, dynamic activations, and key-value caches. While static weights are constant and predictable, dynamic components such as activations and KV caches change frequently during runtime, presenting significant challenges for efficient memory management. Modern LLM serving systems typically handle runtime memory and KV caches at distinct abstraction levels: runtime memory management relies on static tensor abstractions, whereas KV caches utilize a page table-based virtualization layer built on top of the tensor abstraction. This virtualization dynamically manages KV caches to mitigate memory fragmentation. However, this dual-level approach fundamentally isolates runtime memory and KV cache management, resulting in suboptimal memory utilization under dynamic workloads, which can lead to a nearly 20% drop in throughput. To address these limitations, we propose eLLM, an elastic memory management framework inspired by the classical memory ballooning mechanism in operating systems. The core components of eLLM include: (1) Virtual Tensor Abstraction, which decouples the virtual address space of tensors from the physical GPU memory, creating a unified and flexible memory pool; (2) an Elastic Memory Mechanism that dynamically adjusts memory allocation through runtime memory inflation and deflation, leveraging CPU memory as an extensible buffer; and (3) a Lightweight Scheduling Strategy employing SLO-aware policies to optimize memory utilization and effectively balance performance trade-offs under stringent SLO constraints. Comprehensive evaluations demonstrate that eLLM significantly outperforms state-of-the-art systems, achieving 2.32x higher decoding throughput and supporting 3x larger batch sizes for 128K-token inputs.
NA Feb 7, 2023
IB-UQ: Information bottleneck based uncertainty quantification for neural function regression and neural operator learning
Ling Guo, Hao Wu, Wenwen Zhou et al.
We propose a novel framework for uncertainty quantification via information bottleneck (IB-UQ) for scientific machine learning tasks, including deep neural network (DNN) regression and neural operator learning (DeepONet). Specifically, we incorporate the bottleneck by a confidence-aware encoder, which encodes inputs into latent representations according to the confidence of the input data belonging to the region where training data is located, and utilize a Gaussian decoder to predict means and variances of outputs conditional on representation variables. Furthermore, we propose a data augmentation based information bottleneck objective which can enhance the quantification quality of the extrapolation uncertainty, and the encoder and decoder can be both trained by minimizing a tractable variational bound of the objective. In comparison to uncertainty quantification (UQ) methods for scientific learning tasks that rely on Bayesian neural networks with Hamiltonian Monte Carlo posterior estimators, the model we propose is computationally efficient, particularly when dealing with large-scale data sets. The effectiveness of the IB-UQ model has been demonstrated through several representative examples, such as regression for discontinuous functions, real-world data set regression, learning nonlinear operators for partial differential equations, and a large-scale climate model. The experimental results indicate that the IB-UQ model can handle noisy data, generate robust predictions, and provide confident uncertainty evaluation for out-of-distribution data.
LG Feb 22, 2023
Do We Really Need Complicated Model Architectures For Temporal Networks?
Weilin Cong, Si Zhang, Jian Kang et al.
Recurrent neural network (RNN) and self-attention mechanism (SAM) are the de facto methods to extract spatial-temporal information for temporal graph learning. Interestingly, we found that although both RNN and SAM could lead to a good performance, in practice neither of them is always necessary. In this paper, we propose GraphMixer, a conceptually and technically simple architecture that consists of three components: (1) a link-encoder that is only based on multi-layer perceptrons (MLP) to summarize the information from temporal links, (2) a node-encoder that is only based on neighbor mean-pooling to summarize node information, and (3) an MLP-based link classifier that performs link prediction based on the outputs of the encoders. Despite its simplicity, GraphMixer attains an outstanding performance on temporal link prediction benchmarks with faster convergence and better generalization performance. These results motivate us to rethink the importance of simpler model architecture.
77.7 CL Apr 13 Code
C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts
Chenxi Qing, Junxi Wu, Zheng Liu et al.
Recently, large language models (LLMs) have become capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.
CV Mar 3 Code
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Jialiang Zhang, Junlong Tong, Junyan Lin et al.
Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.
NA Apr 25, 2023
Efficient Bayesian inference using physics-informed invertible neural networks for inverse problems
Xiaofei Guan, Xintong Wang, Hao Wu et al.
In this paper, we introduce an innovative approach for addressing Bayesian inverse problems through the utilization of physics-informed invertible neural networks (PI-INN). The PI-INN framework encompasses two sub-networks: an invertible neural network (INN) and a neural basis network (NB-Net). The primary role of the NB-Net lies in modeling the spatial basis functions characterizing the solution to the forward problem dictated by the underlying partial differential equation. Simultaneously, the INN is designed to partition the parameter vector linked to the input physical field into two distinct components: the expansion coefficients representing the forward problem solution and the Gaussian latent noise. If the forward mapping is precisely estimated, and the statistical independence between expansion coefficients and latent noise is well-maintained, the PI-INN offers a precise and efficient generative model for Bayesian inverse problems, yielding tractable posterior density estimates. As a particular physics-informed deep learning model, the primary training challenge for PI-INN centers on enforcing the independence constraint, which we tackle by introducing a novel independence loss based on estimated density. We support the efficacy and precision of the proposed PI-INN through a series of numerical experiments, including inverse kinematics, 1-dimensional and 2-dimensional diffusion equations, and seismic traveltime tomography. Specifically, our experimental results showcase the superior performance of the proposed independence loss in comparison to the commonly used but computationally demanding kernel-based maximum mean discrepancy loss.
NA Oct 28, 2017
The quadratic Wasserstein metric for earthquake location
Jing Chen, Yifan Chen, Hao Wu et al.
In [Engquist et al., Commun. Math. Sci., 14(2016)], the Wasserstein metric was successfully introduced to full waveform inversion. We apply this method to the earthquake location problem. For this problem, the seismic stations are far from each other. Thus, the trace-by-trace comparison [Yang et al., arXiv(2016)] is a natural way to compare the earthquake signals. Under this framework, we have derived a concise analytic expression of the Fréchet gradient of the Wasserstein metric, which leads to a simple and efficient implementation of the adjoint method. We square and normalize the earthquake signals for comparison so that the convexity of the misfit function with respect to earthquake hypocenter and origin time can be observed numerically. To reduce the impact of noise, which cannot cancel out after the signals are squared, a new control parameter is introduced. Finally, the LMF (Levenberg-Marquardt-Fletcher) method is applied to solve the resulting optimization problem. According to the numerical experiments, only a few iterations are required to converge to the real earthquake hypocenter and origin time. Even for data with noise, we can obtain reasonable and convergent numerical results.
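The trace-by-trace comparison described in the abstract above can be illustrated with a minimal sketch. It assumes the standard one-dimensional formula for the quadratic Wasserstein distance, in which the squared distance is the integral over s in (0, 1) of the squared difference between the two inverse cumulative distribution functions. The function names, the uniform time grid, and the Gaussian test pulses are all hypothetical choices for illustration; the paper's noise-control parameter and Fréchet gradient are not reproduced here.

```python
import numpy as np

def squared_normalized(signal, dt):
    """Square a trace and scale it to unit mass so it behaves like a density."""
    p = signal ** 2
    return p / (p.sum() * dt)

def w2_squared_1d(f, g, t, n_quantiles=2000):
    """Approximate the squared quadratic Wasserstein distance between two traces.

    Assumes a uniform time grid `t` (an illustrative simplification).
    """
    dt = t[1] - t[0]
    # Cumulative distribution functions of the squared, normalized traces.
    F = np.cumsum(squared_normalized(f, dt)) * dt
    G = np.cumsum(squared_normalized(g, dt)) * dt
    # Midpoint quantile levels in (0, 1).
    s = (np.arange(n_quantiles) + 0.5) / n_quantiles
    # Inverse CDFs obtained by interpolating time as a function of mass.
    F_inv = np.interp(s, F, t)
    G_inv = np.interp(s, G, t)
    return float(np.mean((F_inv - G_inv) ** 2))

# Hypothetical test traces: the same Gaussian pulse arriving at two times.
t = np.linspace(0.0, 10.0, 1001)

def pulse(t0, width=0.5):
    return np.exp(-((t - t0) ** 2) / (2 * width ** 2))

misfit_shifted = w2_squared_1d(pulse(4.0), pulse(5.0), t)
misfit_same = w2_squared_1d(pulse(4.0), pulse(4.0), t)
```

For a pure time shift the misfit should be close to the squared shift, which is the convex dependence on origin time that the abstract refers to; a pointwise least-squares misfit does not have this property once the pulses stop overlapping.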
LG Sep 11, 2023
Reaction coordinate flows for model reduction of molecular kinetics
Hao Wu, Frank Noé
In this work, we introduce a flow based machine learning approach, called reaction coordinate (RC) flow, for discovery of low-dimensional kinetic models of molecular systems. The RC flow utilizes a normalizing flow to design the coordinate transformation and a Brownian dynamics model to approximate the kinetics of RC, where all model parameters can be estimated in a data-driven manner. In contrast to existing model reduction methods for molecular kinetics, RC flow offers a trainable and tractable model of reduced kinetics in continuous time and space due to the invertibility of the normalizing flow. Furthermore, the Brownian dynamics-based reduced kinetic model investigated in this work yields a readily discernible representation of metastable states within the phase space of the molecular system. Numerical experiments demonstrate how effectively the proposed method discovers interpretable and accurate low-dimensional representations of given full-state kinetics from simulations.
QM Aug 23, 2023
Enhancing cardiovascular risk prediction through AI-enabled calcium-omics
Ammar Hoori, Sadeer Al-Kindi, Tao Hu et al.
Background. Coronary artery calcium (CAC) is a powerful predictor of major adverse cardiovascular events (MACE). The traditional Agatston score simply sums the calcium, albeit in a non-linear way, leaving room for improved calcification assessments that will more fully capture the extent of disease. Objective. To determine if AI methods using detailed calcification features (i.e., calcium-omics) can improve MACE prediction. Methods. We investigated additional features of calcification including assessment of mass, volume, density, spatial distribution, territory, etc. We used a Cox model with elastic-net regularization on 2457 CT calcium score (CTCS) enriched for MACE events obtained from a large no-cost CLARIFY program (ClinicalTrials.gov Identifier: NCT04075162). We employed sampling techniques to enhance model training. We also investigated Cox models with selected features to identify explainable high-risk characteristics. Results. Our proposed calcium-omics model with modified synthetic down sampling and up sampling gave C-index (80.5%/71.6%) and two-year AUC (82.4%/74.8%) for (80:20, training/testing), respectively (sampling was applied to the training set only). Results compared favorably to Agatston which gave C-index (71.3%/70.3%) and AUC (71.8%/68.8%), respectively. Among calcium-omics features, numbers of calcifications, LAD mass, and diffusivity (a measure of spatial distribution) were important determinants of increased risk, with dense calcification (>1000HU) associated with lower risk. The calcium-omics model reclassified 63% of MACE patients to the high risk group in a held-out test. The categorical net-reclassification index was NRI=0.153. Conclusions. AI analysis of coronary calcification can lead to improved results as compared to Agatston scoring. Our findings suggest the utility of calcium-omics in improved prediction of risk.
CV Mar 2 Code
FireRed-OCR Technical Report
Hao Wu, Haoran Lou, Xinyue Li et al.
We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
CV Aug 8, 2022
Distinctive Image Captioning via CLIP Guided Group Optimization
Youyuan Zhang, Jiuniu Wang, Hao Wu et al.
Image captioning models are usually trained on human-annotated ground-truth captions, which can yield accurate but generic captions. In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images. To evaluate the distinctiveness of captions, we introduce a series of metrics that use the large-scale vision-language pre-trained model CLIP to quantify the distinctiveness. To further improve the distinctiveness of captioning models, we propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap. Extensive experiments are conducted on various baseline models to demonstrate the wide applicability of our strategy and the consistency of metric results with human evaluation. By comparing the performance of our best model with existing state-of-the-art models, we claim that our model achieves a new state of the art on the distinctiveness objective.
CV Jan 17, 2023
Vision Learners Meet Web Image-Text Pairs
Bingchen Zhao, Quan Cui, Hao Wu et al.
Many self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy web sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretical view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.
77.6 DB May 29
Modeling and Optimization for Massive Data Allocation in Database
Panpan Niu, Boxiang Ren, Hao Wu et al.
In the era of big data, e-commerce and Internet platforms face the challenge of processing massive amounts of data. However, because data is scattered across different machines in a distributed database, extra communication costs are incurred in gathering relevant data to complete transactions. Without a carefully designed data placement scheme, this cost can severely impact the performance of Online Transaction Processing systems. To meet industry requirements, algorithms that output a data placement scheme achieving i) data balance and ii) low communication overhead within a fixed period of time are actively investigated. Although some existing methods have been studied, they do not adequately meet the aforementioned requirements. In this paper, inspired by the normalized cut of spectral clustering, we introduce a novel model for the data allocation problem. The normalized cut reconciles the inherent conflict between the two objectives. Taking into account the variable characteristics of the model, we formulate the problem as a 0-1 optimization problem, and solve the relaxed problem using the Bregman proximal gradient method with guaranteed convergence. The numerical experiments reveal that the convergent solutions can be smoothly rounded to discrete solutions. Furthermore, our algorithm surpasses both simple and meta-heuristic partitioning schemes by minimizing migration costs while maintaining a superior balance.
CV Nov 1, 2025 Code
Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond
Fan Zhang, Haoxuan Li, Shengju Qian et al.
Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality and large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR), respectively. Building upon them, we develop a unified and interpretable FER foundation model termed UniFER-7B, which outperforms many open-sourced and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
LG Jan 8 Code
FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts
Yiji Zhao, Zihao Zhong, Ao Wang et al.
Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
CV Aug 16, 2024 Code
FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models
Mohammadreza Samadi, Fred X. Han, Mohammad Salameh et al.
Diffusion models have demonstrated outstanding performance in generative tasks, making them ideal candidates for image editing. Recent studies highlight their ability to apply desired edits effectively by following textual instructions, yet with two key challenges remaining. First, these models struggle to apply multiple edits simultaneously, resulting in computational inefficiencies due to their reliance on sequential processing. Second, relying on textual prompts to determine the editing region can lead to unintended alterations to the image. We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. This approach enables complex editing tasks, such as object movement, by aggregating multiple functions and applying them simultaneously to specific areas. Our experiments demonstrate that FunEditor significantly outperforms recent inference-time optimization methods and fine-tuned models, either quantitatively across various metrics or through visual comparisons or both, on complex tasks like object movement and object pasting. In the meantime, with only 4 steps of inference, FunEditor achieves 5-24x inference speedups over existing popular methods. The code is available at: mhmdsmdi.github.io/funeditor/.
73.6 CV May 28
IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
Hao Wu, Xiangyang Luo, Hao Wang et al.
With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.
45.0 LG May 28
Gated Graph Attention Networks with Learnable Temperature
Zhongtian Ma, Hao Wu, Yexin Zhang et al.
Graph attention networks learn neighbor importance through data-dependent coefficients, but standard layers lack explicit control over unreliable feature dimensions and use fixed sharpness of attention coefficient distributions. This paper proposes gated graph attention and learnable temperature for common graph attention mechanisms. Gated graph attention filters feature or message responses to reduce the influence of unreliable dimensions, while learnable temperature dynamically adjusts the sharpness of the attention coefficient distribution. Experiments on homogeneous and heterophilic heterogeneous benchmarks show that the proposed variants consistently improve the corresponding graph attention backbones, and controlled noise studies further verify their behavior under feature perturbations. Theoretical analysis explains these results by showing that gating improves robustness when only part of the feature coordinates are reliable, while temperature is beneficial when global noise weakens the discriminability of node features.
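The two ideas in the abstract above, gating of feature dimensions and a temperature that controls the sharpness of attention coefficients, can be sketched for a single node. This is only an illustration under stated assumptions: attention coefficients are taken as softmax(scores / temperature), and the gate is an elementwise sigmoid applied to neighbor messages. The function names and the exact placement of the gate are hypothetical and may differ from the paper's formulation; in a trained model the temperature and gate logits would be learnable parameters rather than fixed arrays.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def gated_attention_aggregate(scores, neighbor_feats, gate_logits, temperature):
    """Aggregate neighbor messages for one node (illustrative form only).

    scores:         (num_neighbors,) unnormalized attention scores
    neighbor_feats: (num_neighbors, dim) neighbor feature vectors
    gate_logits:    (dim,) per-dimension gate logits (assumed parameterization)
    temperature:    positive scalar; smaller values give sharper attention
    """
    alpha = softmax(scores / temperature)       # attention coefficients
    gate = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid gate in (0, 1)
    messages = neighbor_feats * gate            # damp unreliable dimensions
    return alpha @ messages, alpha

# Hypothetical inputs: three neighbors with four feature dimensions.
scores = np.array([2.0, 1.0, 0.0])
feats = np.ones((3, 4))
gate_logits = np.zeros(4)  # a zero logit gives a gate value of 0.5

out_sharp, alpha_sharp = gated_attention_aggregate(scores, feats, gate_logits, 0.5)
out_flat, alpha_flat = gated_attention_aggregate(scores, feats, gate_logits, 2.0)
```

Lowering the temperature concentrates the coefficients on the highest-scoring neighbor, while raising it moves them toward a uniform average.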
PL May 4, 2022
CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training
Xin Wang, Yasheng Wang, Yao Wan et al.
Recent years have witnessed increasing interest in code representation learning, which aims to represent the semantics of source code into distributed vectors. Currently, various works have been proposed to represent the complex semantics of source code from different views, including plain text, Abstract Syntax Tree (AST), and several kinds of code graphs (e.g., Control/Data Flow Graph). However, most of them only consider a single view of source code independently, ignoring the correspondences among different views. In this paper, we propose to integrate different views with the natural-language description of source code into a unified framework with Multi-View contrastive Pre-training, and name our model as CODE-MVP. Specifically, we first extract multiple code views using compiler tools, and learn the complementary information among them under a contrastive learning framework. Inspired by the type checking in compilation, we also design a fine-grained type inference objective in the pre-training. Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP when compared with several state-of-the-art baselines. For example, we achieve 2.4/2.3/1.1 gain in terms of MRR/MAP/Accuracy metrics on natural language code retrieval, code similarity, and code defect detection tasks, respectively.
CV Jun 27, 2023
Cardiac CT perfusion imaging of pericoronary adipose tissue (PCAT) highlights potential confounds in coronary CTA
Hao Wu, Yingnan Song, Ammar Hoori et al.
Features of pericoronary adipose tissue (PCAT) assessed from coronary computed tomography angiography (CCTA) are associated with inflammation and cardiovascular risk. As PCAT is vascularly connected with coronary vasculature, the presence of iodine is a potential confounding factor on PCAT HU and textures that has not been adequately investigated. We used dynamic cardiac CT perfusion (CCTP) to inform contrast determinants of PCAT assessment. From CCTP, we analyzed HU dynamics of territory-specific PCAT, myocardium, and other adipose depots in patients with coronary artery disease. HU, blood flow, and radiomics were assessed over time. Changes from peak aorta time, Pa, chosen to model the time of CCTA, were obtained. HU in PCAT increased more than in other adipose depots. The estimated blood flow in PCAT was ~23% of that in the contiguous myocardium. Comparing PCAT distal and proximal to a significant stenosis, we found less enhancement and longer time-to-peak distally. Two-second offsets [before, after] Pa resulted in [ 4-HU, 3-HU] differences in PCAT. Due to changes in HU, the apparent PCAT volume reduced ~15% from the first scan (P1) to Pa using a conventional fat window. Comparing radiomic features over time, 78% of features changed >10% relative to P1. CCTP elucidates blood flow in PCAT and enables analysis of PCAT features over time. PCAT assessments (HU, apparent volume, and radiomics) are sensitive to acquisition timing and the presence of obstructive stenosis, which may confound the interpretation of PCAT in CCTA images. Data normalization may be in order.
CL Jul 1, 2024
$\text{Memory}^3$: Language Modeling with Explicit Memory
Hongkang Yang, Zehao Lin, Wenjin Wang et al.
The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.
AO-PH Aug 6, 2023
AI-GOMS: Large AI-Driven Global Ocean Modeling System
Wei Xiong, Yanfei Xiang, Hao Wu et al.
Ocean modeling is a powerful tool for simulating the physical, chemical, and biological processes of the ocean, which is the foundation for marine science research and operational oceanography. Modern numerical ocean modeling mainly consists of governing equations and numerical algorithms. Nonlinear instability, computational expense, low reusability efficiency and high coupling costs have gradually become the main bottlenecks for the further development of numerical ocean modeling. Recently, artificial intelligence-based modeling in scientific computing has shown revolutionary potential for digital twins and scientific simulations, but the bottlenecks of numerical ocean modeling have not been further solved. Here, we present AI-GOMS, a large AI-driven global ocean modeling system, for accurate and efficient global ocean daily prediction. AI-GOMS consists of a backbone model with the Fourier-based Masked Autoencoder structure for basic ocean variable prediction and lightweight fine-tuning models incorporating regional downscaling, wave decoding, and biochemistry coupling modules. AI-GOMS has achieved the best performance in 30 days of prediction for the global ocean basic variables with 15 depth layers at 1/4° spatial resolution. Beyond the good performance in statistical metrics, AI-GOMS realizes the simulation of mesoscale eddies in the Kuroshio region at 1/12° spatial resolution and ocean stratification in the tropical Pacific Ocean. AI-GOMS provides a new backbone-downstream paradigm for Earth system modeling, which makes the system transferable, scalable and reusable.
37.4 LG May 26
PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design
Runtian Wang, Renhao Xue, Baige Chen et al.
The inverse problem of multilayer thin-film optical coatings design represents a complex combinatorial-continuous optimization challenge. We present PRISM (Position-encoded Regressive Inverse Spectral Model), a unified decoder-only autoregressive transformer that streamlines this process by jointly predicting discrete material selection and continuous thickness regression within a single backbone. PRISM introduces two primary architectural innovations: (1) spectrum prefix conditioning, which utilizes standard prefix tokens for in-context target injection, and (2) cumulative-depth Rotary Position Embeddings, which encode continuous thickness directly into the positional representation to preserve the physical spatial relationships of the stack. Our benchmarks demonstrate that a PRISM-13M model reduces MAE by over 50% compared to other transformer baselines while utilizing only one-fifth of the parameters. Furthermore, a 44M-parameter variant achieves state-of-the-art performance (MAE = 0.010) on our in-distribution validation benchmark and operates significantly faster than simulated annealing, offering a highly efficient alternative to classical optimization methods.
64.7 NA May 26
FINOM: Fast Sinkhorn on Non-uniform Meshes
Qihao Cheng, Qichen Liao, Hao Wu et al.
A linear-complexity algorithm for computing the Wasserstein-1 distance on non-uniform meshes is proposed. This work extends the fast Sinkhorn algorithms from [Q. Liao et al., Commun. Math. Sci., 20 (2022)] and [Q. Liao et al., J. Sci. Comput., 98 (2024)] to non-uniform meshes. In those prior works, a distinctive collinear structure of the kernel matrix on uniform meshes was identified, enabling \(O(N)\) acceleration via dynamic programming. While non-uniform meshes are prevalent in practical applications like computational fluid dynamics and finance, their lack of collinearity has hindered direct acceleration. In this paper, we introduce the concept of a "dividing index", which partitions the kernel matrix into two blocks. We demonstrate that each block exhibits a quasi-collinear property, a generalization of the structure found in uniform meshes. Leveraging this insight, we develop the Fast Sinkhorn algorithm on Non-uniform Meshes (FINOM), a dynamic programming approach that reduces the per-iteration complexity of the Sinkhorn algorithm from \(O(N^2)\) to \(O(N)\). Extensive numerical experiments on 1D and 2D problems confirm these improvements, achieving speed-ups of several orders of magnitude while maintaining accuracy.
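The kind of dynamic-programming speed-up described above can be illustrated on a simple 1D case. The sketch below is not the FINOM algorithm and does not reproduce its dividing-index construction. It assumes the kernel is K[i, j] = exp(-|x[i] - x[j]| / eps) on sorted 1D nodes, and shows only how a kernel-vector product, the dominant cost of a Sinkhorn iteration, can be accumulated with two linear sweeps instead of a dense \(O(N^2)\) multiplication. The function name and parameters are hypothetical.

```python
import numpy as np

def laplace_kernel_matvec(x, v, eps):
    """Compute K @ v in O(N), where K[i, j] = exp(-|x[i] - x[j]| / eps).

    Assumes x is sorted in ascending order (spacing may be non-uniform).
    """
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    n = x.size
    # decay[i] = exp(-(x[i+1] - x[i]) / eps): attenuation across one mesh cell
    decay = np.exp(-np.diff(x) / eps)

    # Forward sweep: left[i] = sum over j <= i of K[i, j] * v[j]
    left = np.empty(n)
    left[0] = v[0]
    for i in range(1, n):
        left[i] = decay[i - 1] * left[i - 1] + v[i]

    # Backward sweep: right[i] = sum over j > i of K[i, j] * v[j]
    right = np.empty(n)
    right[-1] = 0.0
    for i in range(n - 2, -1, -1):
        right[i] = decay[i] * (right[i + 1] + v[i + 1])

    return left + right
```

Each sweep reuses the previous partial sum and rescales it by the attenuation across one cell, so the cost is linear in the number of nodes; the result can be checked against the dense product on a small mesh.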
NA Oct 23, 2016
A new earthquake location method based on the waveform inversion
Hao Wu, Jing Chen, Xueyuan Huang et al.
In this paper, a new earthquake location method based on waveform inversion is proposed. It is well known that the waveform misfit function is very sensitive to the phase shift between the synthetic waveform signal and the real waveform signal. As a result, the convergence domain of conventional waveform-based earthquake location methods is very small. In the present study, by introducing and solving a simple sub-optimization problem, we greatly expand the convergence domain of the waveform-based earthquake location method. According to a large number of numerical experiments, the new method expands the range of convergence by several tens of times. This allows us to locate the earthquake accurately even from relatively poor initial values.
98.6 CV Mar 24 Code
3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding
Yiping Chen, Jinpeng Li, Wenyu Ke et al.
While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for the target object, inter-object relationships, and the global scene. To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.
55.9 SI May 25
Scalable Algorithm for Dynamic Quasi-clique Detection
Jingbang Chen, Weinuo Li, Yingli Zhou et al.
Identifying dense subgraphs known as quasi-cliques is pivotal in numerous graph mining tasks across domains such as social networks, biology, and e-commerce. While prior work has developed efficient algorithms for quasi-clique detection in static graphs, real-world networks are inherently dynamic, with edges appearing and disappearing continuously. This renders static methods inefficient and ill-suited for real-time analysis. In this paper, we initiate the study of the Dynamic Maximum Quasi-Clique Problem (DMQCP), which aims to maintain and update the largest quasi-clique in a graph under streaming graph updates. We propose DMI, a novel MinHash-based dynamic framework that supports fast, high-quality approximate maintenance of quasi-cliques. DMI leverages two update-efficient hashing schemes, i.e., $l$-buffered $k$-MinHash and Bottom-$k$ MinHash, to maintain candidate quasi-cliques incrementally. To ensure robustness and reduce bias, we further design a batch reconstruction strategy to periodically rebuild the candidate set, guaranteeing both stability and adaptability under frequent updates. Extensive experiments on real-world and synthetic datasets show that DMI achieves up to four orders of magnitude speedup over static baselines, while preserving solution quality. As a by-product, we also propose a framework, NSF, that primarily uses the neighbor-search technique to maintain quasi-clique candidates under edge updates. This work establishes the first efficient algorithmic framework for dynamic quasi-clique extraction, enabling scalable and real-time dense subgraph mining in evolving networks.
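For readers unfamiliar with Bottom-$k$ MinHash, a minimal sketch of the generic technique follows. It is not DMI itself: the abstract does not specify how DMI applies the sketches or handles updates, so the example only shows the standard estimator of Jaccard similarity between two sets (for instance, two vertex neighborhoods). The helper names and the choice of BLAKE2b as the hash function are assumptions made for illustration.

```python
import hashlib
import heapq

def _hash64(item):
    """Deterministic 64-bit hash of an item (assumed hash choice: BLAKE2b)."""
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def bottom_k_sketch(items, k):
    """Return the k smallest distinct hash values of the set, in sorted order."""
    return sorted(heapq.nsmallest(k, {_hash64(x) for x in items}))

def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-k sketches.

    The k smallest hashes of the merged sketches form the bottom-k sketch of
    A ∪ B; the fraction of them present in both sketches is the estimate.
    """
    a, b = set(sketch_a), set(sketch_b)
    union_bottom = sorted(a | b)[:k]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in a and h in b)
    return shared / len(union_bottom)
```

When both sets have at most k elements the sketches hold every hash, so the estimate is exact (up to hash collisions); for larger sets it is an approximation whose accuracy improves with k.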
CV Jul 22, 2023
An Intelligent Remote Sensing Image Quality Inspection System
Yijiong Yu, Tao Wang, Kang Ran et al.
Because quality problems are inevitable, quality inspection of remote sensing images is an indispensable step between their acquisition and their application. However, traditional manual inspection suffers from low efficiency. Hence, we propose a novel deep learning-based two-step intelligent system consisting of multiple advanced computer vision models, which first performs image classification with SwinV2 and then adopts the most appropriate method, such as semantic segmentation with Segformer, to localize the quality problems. Results demonstrate that the proposed method exhibits excellent performance and efficiency, surpassing traditional methods. Furthermore, we conduct an initial exploration of applying multimodal models to remote sensing image quality inspection.
SD Jan 23 Code
Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG
Haoyun Yang, Xin Xiao, Jiang Zhong et al.
Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets. Specifically, we employ 8 similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals 3 key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.