62.8AIMay 29
UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time ScalingKaiyu Huang, Xingyu Wang, Mingze Kong et al.
In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.
83.6LGJun 2
Libra: Efficient Resource Management for Agentic RL Post-TrainingKaiwen Chen, Xin Tan, Jingzong Li et al.
Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that challenge conventional resource-management assumptions. Three fundamental challenges arise. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Third, as the RL policy evolves, the trajectory-length distribution drifts over time, rendering any static resource split progressively suboptimal. We present Libra, which introduces two core mechanisms. The first is a periodic global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0$\times$ higher throughput and converges up to 2.5$\times$ faster in reward compared to the baselines.
70.2CLMay 25Code
PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight ExtractionDaren Wang, Hong Xu, Jiawen Xian
This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of "Harness Engineering" techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive "consensus bias" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.
81.4CVMay 25Code
Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent CollaborationXiaotian Hu, Mingxuan Liu, Junwei Huang et al.
Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one-task, one-model" paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.
ARSep 4, 2024Code
RTLRewriter: Methodologies for Large Models aided RTL Code OptimizationXufeng Yao, Yiwen Wang, Xing Li et al.
Register Transfer Level (RTL) code optimization is crucial for enhancing the efficiency and performance of digital circuits during early synthesis stages. Currently, optimization relies heavily on manual efforts by skilled engineers, often requiring multiple iterations based on synthesis feedback. In contrast, existing compiler-based methods fall short in addressing complex designs. This paper introduces RTLRewriter, an innovative framework that leverages large models to optimize RTL code. A circuit partition pipeline is utilized for fast synthesis and efficient rewriting. A multi-modal program analysis is proposed to incorporate vital visual diagram information as optimization cues. A specialized search engine is designed to identify useful optimization guides, algorithms, and code snippets that enhance the model ability to generate optimized RTL. Additionally, we introduce a Cost-aware Monte Carlo Tree Search (C-MCTS) algorithm for efficient rewriting, managing diverse retrieved contents and steering the rewriting results. Furthermore, a fast verification pipeline is proposed to reduce verification cost. To cater to the needs of both industry and academia, we propose two benchmarking suites: the Large Rewriter Benchmark, targeting complex scenarios with extensive circuit partitioning, optimization trade-offs, and verification challenges, and the Small Rewriter Benchmark, designed for a wider range of scenarios and patterns. Our comparative analysis with established compilers such as Yosys and E-graph demonstrates significant improvements, highlighting the benefits of integrating large models into the early stages of circuit design. We provide our benchmarks at https://github.com/yaoxufeng/RTLRewriter-Bench.
NIJan 26, 2015
RepNet: Cutting Tail Latency in Data Center Networks with Flow ReplicationShuhao Liu, Wei Bai, Hong Xu et al.
Data center networks need to provide low latency, especially at the tail, as demanded by many interactive applications. To improve tail latency, existing approaches require modifications to switch hardware and/or end-host operating systems, making them difficult to be deployed. We present the design, implementation, and evaluation of RepNet, an application layer transport that can be deployed today. RepNet exploits the fact that only a few paths among many are congested at any moment in the network, and applies simple flow replication to mice flows to opportunistically use the less congested path. RepNet has two designs for flow replication: (1) RepSYN, which only replicates SYN packets and uses the first connection that finishes TCP handshaking for data transmission, and (2) RepFlow which replicates the entire mice flow. We implement RepNet on {\tt node.js}, one of the most commonly used platforms for networked interactive applications. {\tt node}'s single threaded event-loop and non-blocking I/O make flow replication highly efficient. Performance evaluation on a real network testbed and in Mininet reveals that RepNet is able to reduce the tail latency of mice flows, as well as application completion times, by more than 50\%.
42.5LGMay 22
AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-TuningWei Lin, Yining Jiang, Qingyu Song et al.
Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.
76.3SEMay 7
TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the CloudYitao Yang, Yangtao Deng, Yifan Xiong et al.
AI workloads incur frequent failures and incidents from the underlying infrastructure. The current incident management workflow follows a provider-centric paradigm, where users report incidents to the infrastructure provider who then conducts troubleshooting. Due to the large number of incidents and the manual nature of the troubleshooting process, the provider often takes several days to resolve an incident, resulting in operational delays and productivity loss. To address these challenges, we present TSGuard, a user-centric multi-agent system that delivers immediate incident diagnosis to users who deploy the workloads. The core innovation of TSGuard is twofold: (1) constructing domain-specific knowledge bases by mining historical on-call experiences in the offline phase, and (2) mimicking human expert diagnosis via structured reasoning and iterative trial-and-error in the online phase. Evaluation using production incident records from Microsoft Azure demonstrates that TSGuard significantly outperforms state-of-the-art baselines, improving diagnostic accuracy by 19.8%. Furthermore, TSGuard reduces the average verification time by 63.4% compared to the sequential execution baseline.
LGJan 30Code
FOCUS: DLLMs Know How to Tame Their Compute BoundKaihua Liang, Xin Tan, An Zhong et al.
Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS -- an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands-lab/FOCUS.
CLOct 11, 2023
Adaptive Gating in Mixture-of-Experts based Language ModelsJiamin Li, Qiang Su, Yitao Yang et al.
Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities in various NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models while maintaining a constant number of computational operations. Existing MoE model adopts a fixed gating network where each token is computed by the same number of experts. However, this approach contradicts our intuition that the tokens in each sequence vary in terms of their linguistic complexity and, consequently, require different computational costs. Little is discussed in prior research on the trade-off between computation per token and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Additionally, curriculum learning is leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces at most 22.5% training time while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the routing decisions and present our insights when adaptive gating is used.
NIFeb 18, 2023
Moby: Empowering 2D Models for Efficient Point Cloud Analytics on the EdgeJingzong Li, Yik Hong Cai, Libin Liu et al.
3D object detection plays a pivotal role in many applications, most notably autonomous driving and robotics. These applications are commonly deployed on edge devices to promptly interact with the environment, and often require near real-time response. With limited computation power, it is challenging to execute 3D detection on the edge using highly complex neural networks. Common approaches such as offloading to the cloud induce significant latency overheads due to the large amount of point cloud data during transmission. To resolve the tension between wimpy edge devices and compute-intensive inference workloads, we explore the possibility of empowering fast 2D detection to extrapolate 3D bounding boxes. To this end, we present Moby, a novel system that demonstrates the feasibility and potential of our approach. We design a transformation pipeline for Moby that generates 3D bounding boxes efficiently and accurately based on 2D detection results without running 3D detectors. Further, we devise a frame offloading scheduler that decides when to launch the 3D detector judiciously in the cloud to avoid the errors from accumulating. Extensive evaluations on NVIDIA Jetson TX2 with real-world autonomous driving datasets demonstrate that Moby offers up to 91.9% latency improvement with modest accuracy loss over state of the art.
90.9DCMay 20
Frontier: Towards Comprehensive and Accurate LLM Inference SimulationYicheng Feng, Xin Tan, Yangtao Deng et al.
Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration.
LGDec 26, 2023Code
MoTCoder: Elevating Large Language Models with Modular of Thought for Challenging Programming TasksJingyao Li, Pengguang Chen, Bin Xia et al.
Large Language Models (LLMs) have showcased impressive capabilities in handling straightforward programming tasks. However, their performance tends to falter when confronted with more challenging programming problems. We observe that conventional models often generate solutions as monolithic code blocks, restricting their effectiveness in tackling intricate questions. To overcome this limitation, we present Module-of-Thought Coder (MoTCoder). We introduce a framework for MoT instruction tuning, designed to promote the decomposition of tasks into logical sub-tasks and sub-modules. Our investigations reveal that, through the cultivation and utilization of sub-modules, MoTCoder significantly improves both the modularity and correctness of the generated solutions, leading to substantial pass@1 improvements of 5.9% on APPS and 5.8% on CodeContests. MoTCoder also achieved significant improvements in self-correction capabilities, surpassing the current SOTA by 3.3%. Additionally, we provide an analysis of between problem complexity and optimal module decomposition and evaluate the maintainability index, confirming that the code generated by MoTCoder is easier to understand and modify, which can be beneficial for long-term code maintenance and evolution. Our codes are available at https://github.com/dvlab-research/MoTCoder.
98.2DCMar 12
OrchestrRL: Dynamic Compute and Network Orchestration for Disaggregated RLXin Tan, Yicheng Feng, Yu Zhou et al.
Disaggregating the generation and training stages in RL is widely adopted to scale LLM post-training. There are two critical challenges here. First, the generation stage often becomes a bottleneck due to dynamic workload shifts and severe execution imbalances. Second, the decoupled stages result in diverse and dynamic network traffic patterns that strain the conventional static fabric. We build OrchestrRL to orchestrate dynamically both compute and network in disaggregated RL. OrchestrRL employs an adaptive compute scheduler that adjusts parallelism configuration to match changing workload characteristics within and across generation steps. OrchestrRL adopts a reconfigurable optical-electrical fabric called RFabric: It leverages optical circuit switches to reconfigure the aggregation and core layers of the topology on demand, tailoring bandwidth resources to the unique communication patterns across various phases of training, generation, and weight synchronization. Evaluated on a 64-H800 GPU testbed, OrchestrRL demonstrates up to a 1.42x throughput improvement over static baselines. Using a high-fidelity simulator, we also show that RFabric achieves superior performance-cost efficiency at scale over static Fat-Tree networks.
LGJun 17, 2025Code
Towards Robust Learning to Optimize with Theoretical GuaranteesQingyu Song, Wei Lin, Juncheng Wang et al.
Learning to optimize (L2O) is an emerging technique to solve mathematical optimization problems with learning-based methods. Although with great success in many real-world scenarios such as wireless communications, computer networks, and electronic design, existing L2O works lack theoretical demonstration of their performance and robustness in out-of-distribution (OOD) scenarios. We address this gap by providing comprehensive proofs. First, we prove a sufficient condition for a robust L2O model with homogeneous convergence rates over all In-Distribution (InD) instances. We assume an L2O model achieves robustness for an InD scenario. Based on our proposed methodology of aligning OOD problems to InD problems, we also demonstrate that the L2O model's convergence rate in OOD scenarios will deteriorate by an equation of the L2O model's input features. Moreover, we propose an L2O model with a concise gradient-only feature construction and a novel gradient-based history modeling method. Numerical simulation demonstrates that our proposed model outperforms the state-of-the-art baseline in both InD and OOD scenarios and achieves up to 10 $\times$ convergence speedup. The code of our method can be found from https://github.com/NetX-lab/GoMathL2O-Official.
MMJul 25, 2025Code
MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG TasksLei Zhang, Xin Zhou, Chaoyue He et al.
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in ESG domain. To fill the gap, we introduce \textbf{MMESGBench}, a first-of-its-kind benchmark dataset targeted to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning across seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.
LGMay 30, 2025Code
Logits-Based FinetuningJingyao Li, Senqiao Yang, Sitong Wu et al.
In recent years, developing compact and efficient large language models (LLMs) has emerged as a thriving area of research. Traditional Supervised Fine-Tuning (SFT), which relies on singular ground truth labels, often fails to capture token-level dependencies and linguistic diversity. To address these limitations, we propose a logits-based fine-tuning framework that integrates the strengths of supervised learning and knowledge distillation. Our approach constructs enriched training targets by combining teacher logits with ground truth labels, preserving both correctness and linguistic diversity. This ensures more reliable and effective training. We constructed a large-scale 1.2M logits dataset and trained a series of science-focused models. Experimental results demonstrate that our method achieves significant improvements, with accuracy gains of 18% on Mawps and 22.7% on TabMWP. Across nine widely used mathematical benchmarks, our method consistently outperforms prior SFT models, achieving an average improvement of 7.28%. Codes are available at https://github.com/dvlab-research/Logits-Based-Finetuning.
DCJun 29, 2024Code
Teola: Towards End-to-End Optimization of LLM-based ApplicationsXin Tan, Yimin Jiang, Yitao Yang et al.
Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications. The code is available at https://github.com/NetX-lab/Ayo.
LGJun 11, 2024Code
QuickLLaMA: Query-aware Inference Acceleration for Large Language ModelsJingyao Li, Han Shi, Xin Jiang et al.
The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still stuggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn't require extra training and can be seamlessly integrated with any LLMs. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the $\infty$-bench. In the Needle-in-a-Haystack and BABILong task, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code can be found in https://github.com/dvlab-research/Q-LLM.
LGJan 30, 2025Code
Learning Provably Improves the Convergence of Gradient DescentQingyu Song, Wei Lin, Hong Xu
Learn to Optimize (L2O) trains deep neural network-based solvers for optimization, achieving success in accelerating convex problems and improving non-convex solutions. However, L2O lacks rigorous theoretical backing for its own training convergence, as existing analyses often use unrealistic assumptions -- a gap this work highlights empirically. We bridge this gap by proving the training convergence of L2O models that learn Gradient Descent (GD) hyperparameters for quadratic programming, leveraging the Neural Tangent Kernel (NTK) theory. We propose a deterministic initialization strategy to support our theoretical results and promote stable training over extended optimization horizons by mitigating gradient explosion. Our L2O framework demonstrates over 50% better optimality than GD and superior robustness over state-of-the-art L2O methods on synthetic datasets. The code of our method can be found from https://github.com/NetX-lab/MathL2OProof-Official.
LGSep 4, 2019Code
Model Asset eXchange: Path to Ubiquitous Deep Learning DeploymentAlex Bozarth, Brendan Dwyer, Fei Hu et al.
A recent trend observed in traditionally challenging fields such as computer vision and natural language processing has been the significant performance gains shown by deep learning (DL). In many different research fields, DL models have been evolving rapidly and become ubiquitous. Despite researchers' excitement, unfortunately, most software developers are not DL experts and oftentimes have a difficult time following the booming DL research outputs. As a result, it usually takes a significant amount of time for the latest superior DL models to prevail in industry. This issue is further exacerbated by the common use of sundry incompatible DL programming frameworks, such as Tensorflow, PyTorch, Theano, etc. To address this issue, we propose a system, called Model Asset Exchange (MAX), that avails developers of easy access to state-of-the-art DL models. Regardless of the underlying DL programming frameworks, it provides an open source Python library (called the MAX framework) that wraps DL models and unifies programming interfaces with our standardized RESTful APIs. These RESTful APIs enable developers to exploit the wrapped DL models for inference tasks without the need to fully understand different DL programming frameworks. Using MAX, we have wrapped and open-sourced more than 30 state-of-the-art DL models from various research fields, including computer vision, natural language processing and signal processing, etc. In the end, we selectively demonstrate two web applications that are built on top of MAX, as well as the process of adding a DL model to MAX.
90.2LGMay 8
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State InjectionChaobo Jia, Ruipeng Wan, Ting Sun et al.
LLM-based game generation promises to turn natural-language specifications into executable games, but progress is limited by the lack of reliable automated verification. Unlike conventional code generation, game correctness is defined over long-horizon interaction: a game may appear correct while violating core mechanics such as state updates, interaction rules, and phase transitions. Existing Agent-as-a-Verifier approaches collapse verification into open-ended gameplay, making verdicts reachability-bound, time-consuming, coverage-limited, and sensitive to the agent's gameplay ability. We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.
CLJan 29
SP^2DPO: An LLM-assisted Semantic Per-Pair DPO GeneralizationChaoyue He, Xin Zhou, Di Wang et al.
Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.
AIFeb 2
PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft ModelsXuliang Wang, Yuetao Chen, Maochan Zhen et al.
Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.
DCFeb 11, 2025
DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT TrainingXin Tan, Yuetao Chen, Yimin Jiang et al.
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. However, the quadratic complexity of 3D full attention remains a bottleneck in scaling DiT training, especially with high-definition, lengthy videos, where it can consume up to 95% of processing time and demand specialized context parallelism. This paper introduces DSV to accelerate video DiT training by leveraging the dynamic attention sparsity we empirically observe. DSV uses a two-stage algorithm to capture the dynamic sparsity patterns via low-rank based approximation of the original query and key. It employs custom kernels to efficiently identify critical key-value pairs and compute the sparse attention. To accommodate the new sparsity dimension, DSV adopts a hybrid sparsity-aware context parallelism that re-balances the skewed workload across attention heads and blocks due to sparsity heterogeneity. DSV achieves up to 3.02x higher training throughput, scaling to 128 GPUs and 520k token lengths, without quality loss.
LGDec 17, 2024
Echo: Simulating Distributed Training At ScaleYicheng Feng, Yuetao Chen, Kaiwen Chen et al.
Simulation offers unique values for both enumeration and extrapolation purposes, and is becoming increasingly important for managing the massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workloads at each device in an ex-situ fashion so we can use a single device to obtain the actual execution graphs of 1K-GPU training, (2) accurately estimating the collective communication without high overheads of discrete-event based network simulation, and (3) accounting for the interference-induced computation slowdown from overlapping communication and computation kernels on the same device. Echo delivers on average 8% error in training step -- roughly 3x lower than state-of-the-art simulators -- for GPT-175B on a 96-GPU H800 cluster with 3D parallelism on Megatron-LM under 2 minutes.
CLJun 2, 2025
ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability KnowledgeChaoyue He, Xin Zhou, Yi Wu et al.
We introduce ESGenius, a comprehensive benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social, and Governance (ESG) and sustainability-focused question answering. ESGenius comprises two key components: (i) ESGenius-QA, a collection of 1,136 Multiple-Choice Questions (MCQs) generated by LLMs and rigorously validated by domain experts, covering a broad range of ESG pillars and sustainability topics. Each question is systematically linked to its corresponding source text, enabling transparent evaluation and supporting Retrieval-Augmented Generation (RAG) methods; and (ii) ESGenius-Corpus, a meticulously curated repository of 231 foundational frameworks, standards, reports, and recommendation documents from 7 authoritative sources. Moreover, to fully assess the capabilities and adaptation potential of LLMs, we implement a rigorous two-stage evaluation protocol -- Zero-Shot and RAG. Extensive experiments across 50 LLMs (0.5B to 671B) demonstrate that state-of-the-art models achieve only moderate performance in zero-shot settings, with accuracies around 55--70%, highlighting a significant knowledge gap for LLMs in this specialized, interdisciplinary domain. However, models employing RAG demonstrate significant performance improvements, particularly for smaller models. For example, DeepSeek-R1-Distill-Qwen-14B improves from 63.82% (zero-shot) to 80.46% with RAG. These results demonstrate the necessity of grounding responses in authoritative sources for enhanced ESG understanding. To the best of our knowledge, ESGenius is the first comprehensive QA benchmark designed to rigorously evaluate LLMs on ESG and sustainability knowledge, providing a critical tool to advance trustworthy AI in this vital domain.
DCSep 3, 2025
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM TrainingYangtao Deng, Lei Zhang, Qinlong Wang et al.
Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective communication related issues at runtime. It detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. We also conducted extensive fault injection experiments to demonstrate Mycroft's capability and efficiency.
CLMay 7, 2024
LingML: Linguistic-Informed Machine Learning for Enhanced Fake News DetectionJasraj Singh, Fang Liu, Hong Xu et al.
Nowadays, Information spreads at an unprecedented pace in social media and discerning truth from misinformation and fake news has become an acute societal challenge. Machine learning (ML) models have been employed to identify fake news but are far from perfect with challenging problems like limited accuracy, interpretability, and generalizability. In this paper, we enhance ML-based solutions with linguistics input and we propose LingML, linguistic-informed ML, for fake news detection. We conducted an experimental study with a popular dataset on fake news during the pandemic. The experiment results show that our proposed solution is highly effective. There are fewer than two errors out of every ten attempts with only linguistic input used in ML and the knowledge is highly explainable. When linguistics input is integrated with advanced large-scale ML models for natural language processing, our solution outperforms existing ones with 1.8% average error rate. LingML creates a new path with linguistics to push the frontier of effective and efficient fake news detection. It also sheds light on real-world multi-disciplinary applications requiring both ML and domain expertise to achieve optimal performance.
LGJul 29, 2025
MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache ReuseKaiwen Chen, Xin Tan, Minchen Yu et al.
Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks. However, their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference. We observe that LRMs frequently produce highly similar intermediate reasoning steps, which correspond to similar KV cache states across layers. Motivated by this observation, we propose MemShare, a novel KV cache management approach that effectively reduces memory overhead. MemShare employs a collaborative filtering algorithm to efficiently identify reusable KV cache blocks and enables zero copy cache reuse to significantly reduce memory overhead, improve throughput while maintaining accuracy. Experimental results demonstrate that MemShare delivers up to 84.79\% improvement in throughput while maintaining better accuracy compared to existing KV cache management methods.
LGJul 27, 2025
EcoTransformer: Attention without MultiplicationXin Gao, Xingming Xu, Shirin Amiraslani et al.
The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy. (This version (v2) supersedes v1 and reflects the intended release and licensing.)
CVNov 24, 2024
Optimization-Driven Statistical Models of Anatomies using Radial Basis Function Shape RepresentationHong Xu, Shireen Y. Elhabian
Particle-based shape modeling (PSM) is a popular approach to automatically quantify shape variability in populations of anatomies. The PSM family of methods employs optimization to automatically populate a dense set of corresponding particles (as pseudo landmarks) on 3D surfaces to allow subsequent shape analysis. A recent deep learning approach leverages implicit radial basis function representations of shapes to better adapt to the underlying complex geometry of anatomies. Here, we propose an adaptation of this method using a traditional optimization approach that allows more precise control over the desired characteristics of models by leveraging both an eigenshape and a correspondence loss. Furthermore, the proposed approach avoids using a black-box model and allows more freedom for particles to navigate the underlying surfaces, yielding more informative statistical models. We demonstrate the efficacy of the proposed approach to state-of-the-art methods on two real datasets and justify our choice of losses empirically.
LGSep 19, 2025
The Multi-Query Paradox in Zeroth-Order OptimizationWei Lin, Qingyu Song, Hong Xu
Zeroth-order (ZO) optimization provides a powerful framework for problems where explicit gradients are unavailable and have to be approximated using only queries to function value. The prevalent single-query approach is simple, but suffers from high estimation variance, motivating a multi-query paradigm to improves estimation accuracy. This, however, creates a critical trade-off: under a fixed budget of queries (i.e. cost), queries per iteration and the total number of optimization iterations are inversely proportional to one another. How to best allocate this budget is a fundamental, under-explored question. This work systematically resolves this query allocation problem. We analyze two aggregation methods: the de facto simple averaging (ZO-Avg), and a new Projection Alignment method (ZO-Align) we derive from local surrogate minimization. By deriving convergence rates for both methods that make the dependence on the number of queries explicit across strongly convex, convex, non-convex, and stochastic settings, we uncover a stark dichotomy: For ZO-Avg, we prove that using more than one query per iteration is always query-inefficient, rendering the single-query approach optimal. On the contrary, ZO-Align generally performs better with more queries per iteration, resulting in a full-subspace estimation as the optimal approach. Thus, our work clarifies that the multi-query problem boils down to a choice not about an intermediate query size, but between two classic algorithms, a choice dictated entirely by the aggregation method used. These theoretical findings are also consistently validated by extensive experiments.
CVSep 19, 2025
Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality FusionShanghong Li, Chiam Wen Qi Ruth, Hong Xu et al.
The rapid proliferation of short video platforms has necessitated advanced methods for detecting fake news. This need arises from the widespread influence and ease of sharing misinformation, which can lead to significant societal harm. Current methods often struggle with the dynamic and multimodal nature of short video content. This paper presents HFN, Heterogeneous Fusion Net, a novel multimodal framework that integrates video, audio, and text data to evaluate the authenticity of short video content. HFN introduces a Decision Network that dynamically adjusts modality weights during inference and a Weighted Multi-Modal Feature Fusion module to ensure robust performance even with incomplete data. Additionally, we contribute a comprehensive dataset VESV (VEracity on Short Videos) specifically designed for short video fake news detection. Experiments conducted on the FakeTT and newly collected VESV datasets demonstrate improvements of 2.71% and 4.14% in Marco F1 over state-of-the-art methods. This work establishes a robust solution capable of effectively identifying fake news in the complex landscape of short video platforms, paving the way for more reliable and comprehensive approaches in combating misinformation.
NIAug 25, 2025
Automating Conflict-Aware ACL Configurations with Natural Language IntentsWenlong Ding, Jianqiang Li, Zhixiong Niu et al.
ACL configuration is essential for managing network flow reachability, yet its complexity grows significantly with topologies and pre-existing rules. To carry out ACL configuration, the operator needs to (1) understand the new configuration policies or intents and translate them into concrete ACL rules, (2) check and resolve any conflicts between the new and existing rules, and (3) deploy them across the network. Existing systems rely heavily on manual efforts for these tasks, especially for the first two, which are tedious, error-prone, and impractical to scale. We propose Xumi to tackle this problem. Leveraging LLMs with domain knowledge of the target network, Xumi automatically and accurately translates the natural language intents into complete ACL rules to reduce operators' manual efforts. Xumi then detects all potential conflicts between new and existing rules and generates resolved intents for deployment with operators' guidance, and finally identifies the best deployment plan that minimizes the rule additions while satisfying all intents. Evaluation shows that Xumi accelerates the entire configuration pipeline by over 10x compared to current practices, addresses O(100) conflicting ACLs and reduces rule additions by ~40% in modern cloud network.
LGAug 5, 2025
Frontier: Simulating the Next Generation of LLM Inference SystemsYicheng Feng, Xin Tan, Kin Hang Sew et al.
Large Language Model (LLM) inference is growing increasingly complex with the rise of Mixture-of-Experts (MoE) models and disaggregated architectures that decouple components like prefill/decode (PD) or attention/FFN (AF) for heterogeneous scaling. Existing simulators, architected for co-located, dense models, are unable to capture the intricate system dynamics of these emerging paradigms. We present Frontier, a high-fidelity simulator designed from the ground up for this new landscape. Frontier introduces a unified framework to model both co-located and disaggregated systems, providing native support for MoE inference with expert parallelism (EP). It enables the simulation of complex workflows like cross-cluster expert routing and advanced pipelining strategies for latency hiding. To ensure fidelity and usability, Frontier incorporates refined operator models for improved accuracy. Frontier empowers the community to design and optimize the future of LLM inference at scale.
CVJul 10, 2025
Adaptive Particle-Based Shape Modeling for Anatomical Surface CorrespondenceHong Xu, Shireen Y. Elhabian
Particle-based shape modeling (PSM) is a family of approaches that automatically quantifies shape variability across anatomical cohorts by positioning particles (pseudo landmarks) on shape surfaces in a consistent configuration. Recent advances incorporate implicit radial basis function representations as self-supervised signals to better capture the complex geometric properties of anatomical structures. However, these methods still lack self-adaptivity -- that is, the ability to automatically adjust particle configurations to local geometric features of each surface, which is essential for accurately representing complex anatomical variability. This paper introduces two mechanisms to increase surface adaptivity while maintaining consistent particle configurations: (1) a novel neighborhood correspondence loss to enable high adaptivity and (2) a geodesic correspondence algorithm that regularizes optimization to enforce geodesic neighborhood consistency. We evaluate the efficacy and scalability of our approach on challenging datasets, providing a detailed analysis of the adaptivity-correspondence trade-off and benchmarking against existing methods on surface representation accuracy and correspondence metrics.
LGJun 26, 2025
DynamicBench: Evaluating Real-Time Report Generation in Large Language ModelsJingyao Li, Hao Sun, Zile Qiao et al.
Traditional benchmarks for large language models (LLMs) typically rely on static evaluations through storytelling or opinion expression, which fail to capture the dynamic requirements of real-time information processing in contemporary applications. To address this limitation, we present DynamicBench, a benchmark designed to evaluate the proficiency of LLMs in storing and processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval pipeline, integrating web searches with local report databases. It necessitates domain-specific knowledge, ensuring accurate responses report generation within specialized fields. By evaluating models in scenarios that either provide or withhold external documents, DynamicBench effectively measures their capability to independently process recent information or leverage contextual enhancements. Additionally, we introduce an advanced report generation system adept at managing dynamic information synthesis. Our experimental results confirm the efficacy of our approach, with our method achieving state-of-the-art performance, surpassing GPT4o in document-free and document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data will be made publicly available.
IVMay 3, 2025
Easz: An Agile Transformer-based Image Compression Framework for Resource-constrained IoTsYu Mao, Jingzong Li, Jun Wang et al.
Neural image compression, necessary in various machine-to-machine communication scenarios, suffers from its heavy encode-decode structures and inflexibility in switching between different compression levels. Consequently, it raises significant challenges in applying the neural image compression to edge devices that are developed for powerful servers with high computational and storage capacities. We take a step to solve the challenges by proposing a new transformer-based edge-compute-free image coding framework called Easz. Easz shifts the computational overhead to the server, and hence avoids the heavy encoding and model switching overhead on the edge. Easz utilizes a patch-erase algorithm to selectively remove image contents using a conditional uniform-based sampler. The erased pixels are reconstructed on the receiver side through a transformer-based framework. To further reduce the computational overhead on the receiver, we then introduce a lightweight transformer-based reconstruction structure to reduce the reconstruction load on the receiver side. Extensive evaluations conducted on a real-world testbed demonstrate multiple advantages of Easz over existing compression approaches, in terms of adaptability to different compression levels, computational efficiency, and image reconstruction quality.
LGNov 25, 2024
A Learn-to-Optimize Approach for Coordinate-Wise Step Sizes for Quasi-Newton MethodsWei Lin, Qingyu Song, Hong Xu
Tuning step sizes is crucial for the stability and efficiency of optimization algorithms. While adaptive coordinate-wise step sizes have been shown to outperform scalar step size in first-order methods, their use in second-order methods is still under-explored and more challenging. Current approaches, including hypergradient descent and cutting plane methods, offer limited improvements or encounter difficulties in second-order contexts. To address these limitations, we first conduct a theoretical analysis within the Broyden-Fletcher-Goldfarb-Shanno (BFGS) framework, a prominent quasi-Newton method, and derive sufficient conditions for coordinate-wise step sizes that ensure convergence and stability. Building on this theoretical foundation, we introduce a novel learn-to-optimize (L2O) method that employs LSTM-based networks to learn optimal step sizes by leveraging insights from past optimization trajectories, while inherently respecting the derived theoretical guarantees. Extensive experiments demonstrate that our approach achieves substantial improvements over scalar step size methods and hypergradient descent-based method, offering up to 4$\times$ faster convergence across diverse optimization tasks.
ROJun 6, 2024
RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language ModelsJingyao Li, Pengguang Chen, Sitong Wu et al.
The emergence of Large Language Models (LLMs) has improved the prospects for robotic tasks. However, existing benchmarks are still limited to single tasks with limited generalization capabilities. In this work, we introduce a comprehensive benchmark and an autonomous learning framework, RoboCoder aimed at enhancing the generalization capabilities of robots in complex environments. Unlike traditional methods that focus on single-task learning, our research emphasizes the development of a general-purpose robotic coding algorithm that enables robots to leverage basic skills to tackle increasingly complex tasks. The newly proposed benchmark consists of 80 manually designed tasks across 7 distinct entities, testing the models' ability to learn from minimal initial mastery. Initial testing revealed that even advanced models like GPT-4 could only achieve a 47% pass rate in three-shot scenarios with humanoid entities. To address these limitations, the RoboCoder framework integrates Large Language Models (LLMs) with a dynamic learning system that uses real-time environmental feedback to continuously update and refine action codes. This adaptive method showed a remarkable improvement, achieving a 36% relative improvement. Our codes will be released.
CVFeb 22, 2024
VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision TuningJingyao Li, Pengguang Chen, Xuan Ju et al.
Thanks to advances in deep learning techniques, Human Pose Estimation (HPE) has achieved significant progress in natural scenarios. However, these models perform poorly in artificial scenarios such as painting and sculpture due to the domain gap, constraining the development of virtual reality and augmented reality. With the growth of model size, retraining the whole model on both natural and artificial data is computationally expensive and inefficient. Our research aims to bridge the domain gap between natural and artificial scenarios with efficient tuning strategies. Leveraging the potential of language models, we enhance the adaptability of traditional pose estimation models across diverse scenarios with a novel framework called VLPose. VLPose leverages the synergy between language and vision to extend the generalization and robustness of pose estimation models beyond the traditional domains. Our approach has demonstrated improvements of 2.26% and 3.74% on HumanArt and MSCOCO, respectively, compared to state-of-the-art tuning strategies.
CVDec 29, 2023
Particle-Based Shape Modeling for Arbitrary Regions-of-InterestHong Xu, Alan Morris, Shireen Y. Elhabian
Statistical Shape Modeling (SSM) is a quantitative method for analyzing morphological variations in anatomical structures. These analyses often necessitate building models on targeted anatomical regions of interest to focus on specific morphological features. We propose an extension to \particle-based shape modeling (PSM), a widely used SSM framework, to allow shape modeling to arbitrary regions of interest. Existing methods to define regions of interest are computationally expensive and have topological limitations. To address these shortcomings, we use mesh fields to define free-form constraints, which allow for delimiting arbitrary regions of interest on shape surfaces. Furthermore, we add a quadratic penalty method to the model optimization to enable computationally efficient enforcement of any combination of cutting-plane and free-form constraints. We demonstrate the effectiveness of this method on a challenging synthetic dataset and two medical datasets.
CVMay 19, 2023
Image2SSM: Reimagining Statistical Shape Models from Images with Radial Basis FunctionsHong Xu, Shireen Y. Elhabian
Statistical shape modeling (SSM) is an essential tool for analyzing variations in anatomical morphology. In a typical SSM pipeline, 3D anatomical images, gone through segmentation and rigid registration, are represented using lower-dimensional shape features, on which statistical analysis can be performed. Various methods for constructing compact shape representations have been proposed, but they involve laborious and costly steps. We propose Image2SSM, a novel deep-learning-based approach for SSM that leverages image-segmentation pairs to learn a radial-basis-function (RBF)-based representation of shapes directly from images. This RBF-based shape representation offers a rich self-supervised signal for the network to estimate a continuous, yet compact representation of the underlying surface that can adapt to complex geometries in a data-driven manner. Image2SSM can characterize populations of biological structures of interest by constructing statistical landmark-based shape models of ensembles of anatomical shapes while requiring minimal parameter tuning and no user assistance. Once trained, Image2SSM can be used to infer low-dimensional shape representations from new unsegmented images, paving the way toward scalable approaches for SSM, especially when dealing with large cohorts. Experiments on synthetic and real datasets show the efficacy of the proposed method compared to the state-of-art correspondence-based method for SSM.
DCFeb 16, 2022
Aryl: An Elastic Cluster Scheduler for Deep LearningJiamin Li, Hong Xu, Yibo Zhu et al.
Companies build separate training and inference GPU clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both training and inference: inference clusters have low GPU utilization when the traffic load is low; training jobs often experience long queueing time due to lack of resources. We introduce Aryl, a new cluster scheduler to address these problems. Aryl introduces capacity loaning to loan idle inference GPU servers for training jobs. It further exploits elastic scaling that scales a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges to cluster management. When the loaned servers need to be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs and minimize the job completion time (JCT). Aryl addresses these combinatorial problems using principled heuristics. It introduces the notion of server preemption cost which it greedily reduces during server reclaiming. It further relies on the JCT reduction value defined for each additional worker for an elastic job to solve the scheduling problem as a multiple-choice knapsack problem. Prototype implementation on a 64-GPU testbed and large-scale simulation with 15-day traces of over 50,000 production jobs show that Aryl brings 1.53x and 1.50x reductions in average queuing time and JCT, and improves cluster usage by up to 26.9% over the cluster scheduler without capacity loaning or elastic scaling.
CLOct 1, 2021
Neural sentence embedding models for semantic similarity estimation in the biomedical domainKathrin Blagec, Hong Xu, Asan Agibetov et al.
BACKGROUND: In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set. RESULTS: With a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson's r (r=0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models' performance on the smaller contradiction subset to be poor. CONCLUSIONS: In this study we highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
LGJun 1, 2021
More Behind Your Electricity Bill: a Dual-DNN Approach to Non-Intrusive Load MonitoringYu Zhang, Guoming Tang, Qianyi Huang et al.
Non-intrusive load monitoring (NILM) is a well-known single-channel blind source separation problem that aims to decompose the household energy consumption into itemised energy usage of individual appliances. In this way, considerable energy savings could be achieved by enhancing household's awareness of energy usage. Recent investigations have shown that deep neural networks (DNNs) based approaches are promising for the NILM task. Nevertheless, they normally ignore the inherent properties of appliance operations in the network design, potentially leading to implausible results. We are thus motivated to develop the dual Deep Neural Networks (dual-DNN), which aims to i) take advantage of DNNs' learning capability of latent features and ii) empower the DNN architecture with identification ability of universal properties. Specifically in the design of dual-DNN, we adopt one subnetwork to measure power ratings of different appliances' operation states, and the other subnetwork to identify the running states of target appliances. The final result is then obtained by multiplying these two network outputs and meanwhile considering the multi-state property of household appliances. To enforce the sparsity property in appliance's state operating, we employ median filtering and hard gating mechanisms to the subnetwork for state identification. Compared with the state-of-the-art NILM methods, our dual-DNN approach demonstrates a 21.67% performance improvement in average on two public benchmark datasets.
CLMay 29, 2021
Novel Slot Detection: A Benchmark for Discovering Unknown Slot Types in the Task-Oriented Dialogue SystemYanan Wu, Zhiyuan Zeng, Keqing He et al.
Existing slot filling models can only recognize pre-defined in-domain slot types from a limited slot set. In the practical application, a reliable dialogue system should know what it does not know. In this paper, we introduce a new task, Novel Slot Detection (NSD), in the task-oriented dialogue system. NSD aims to discover unknown or out-of-domain slot types to strengthen the capability of a dialogue system based on in-domain training data. Besides, we construct two public NSD datasets, propose several strong NSD baselines, and establish a benchmark for future work. Finally, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide new guidance for future directions.
CLMay 29, 2021
Modeling Discriminative Representations for Out-of-Domain Detection with Supervised Contrastive LearningZhiyuan Zeng, Keqing He, Yuanmeng Yan et al.
Detecting Out-of-Domain (OOD) or unknown intents from user queries is essential in a task-oriented dialog system. A key challenge of OOD detection is to learn discriminative semantic features. Traditional cross-entropy loss only focuses on whether a sample is correctly classified, and does not explicitly distinguish the margins between categories. In this paper, we propose a supervised contrastive learning objective to minimize intra-class variance by pulling together in-domain intents belonging to the same class and maximize inter-class variance by pushing apart samples from different classes. Besides, we employ an adversarial augmentation mechanism to obtain pseudo diverse views of a sample in the latent space. Experiments on two public datasets prove the effectiveness of our method capturing discriminative representations for OOD detection.
CVJul 10, 2020
Attention-guided Quality Assessment for Automated Cryo-EM Grid ScreeningHong Xu, David E. Timm, Shireen Y. Elhabian
Cryogenic electron microscopy (cryo-EM) has become an enabling technology in drug discovery and in understanding molecular bases of disease by producing near-atomic resolution (less than 0.4 nm) 3D reconstructions of biological macromolecules. The imaging process required for 3D reconstructions involves a highly iterative and empirical screening process, starting with the acquisition of low magnification images of the cryo-EM grids. These images are inspected for squares that are likely to contain useful molecular signals. Potentially useful squares within the grid are then imaged at progressively higher magnifications, with the goal of identifying sub-micron areas within circular holes (bounded by the squares) for imaging at high magnification. This arduous, multi-step data acquisition process represents a bottleneck for obtaining a high throughput data collection. Here, we focus on automating the early decision making for the microscope operator, scoring low magnification images of squares, and proposing the first deep learning framework, XCryoNet, for automated cryo-EM grid screening. XCryoNet is a semi-supervised, attention-guided deep learning approach that provides explainable scoring of automatically extracted square images using limited amounts of labeled data. Results show up to 8% and 37% improvements over a fully supervised and a no-attention solution, respectively, when labeled data is scarce.