h-index29
43papers
956citations
Novelty50%
AI Score59

43 Papers

CVJun 2Code
World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Yucheng Zhou, Wei Tao, Yiwen Guo et al.

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.

SEMar 5, 2022
RACE: Retrieval-Augmented Commit Message Generation

Ensheng Shi, Yanlin Wang, Wei Tao et al. · tencent-ai

Commit messages are important for software development and maintenance. Many neural network-based approaches have been proposed and shown promising results on automatic commit message generation. However, the generated commit messages could be repetitive or redundant. In this paper, we propose RACE, a new retrieval-augmented neural commit message generation method, which treats the retrieved similar commit as an exemplar and leverages it to generate an accurate commit message. As the retrieved commit message may not always accurately describe the content/intent of the current code diff, we also propose an exemplar guider, which learns the semantic similarity between the retrieved and current code diff and then guides the generation of commit message based on the similarity. We conduct extensive experiments on a large public dataset with five programming languages. Experimental results show that RACE can outperform all baselines. Furthermore, RACE can boost the performance of existing Seq2Seq models in commit message generation.

SEJan 15Code
Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

Caihua Li, Lianghong Guo, Yanlin Wang et al. · tencent-ai

Issue resolution, a complex Software Engineering (SWE) task integral to real-world development, has emerged as a compelling challenge for artificial intelligence. The establishment of benchmarks like SWE-bench revealed this task as profoundly difficult for large language models, thereby significantly accelerating the evolution of autonomous coding agents. This paper presents a systematic survey of this emerging domain. We begin by examining data construction pipelines, covering automated collection and synthesis approaches. We then provide a comprehensive analysis of methodologies, spanning training-free frameworks with their modular components to training-based techniques, including supervised fine-tuning and reinforcement learning. Subsequently, we discuss critical analyses of data quality and agent behavior, alongside practical applications. Finally, we identify key challenges and outline promising directions for future research. An open-source repository is maintained at https://github.com/DeepSoftwareAnalytics/Awesome-Issue-Resolution to serve as a dynamic resource in this field.

CVJun 3
Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Xunyi Zhao, Sihao Lin, Gengze Zhou et al.

Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.

SEMay 17Code
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

Sirui Hong, Zhijie Liu, Tengfei Li et al.

Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones. On false-negative cases, DiagEval recovers 45.6-62.1% of failures that were initially misattributed to software defects, outperforming retry-based baselines with 34.4-160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI-agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator-side errors from genuine software defects. Our code is available at https://github.com/scutGit/DiagEval.

LGApr 18, 2022
A Convergence Analysis of Nesterov's Accelerated Gradient Method in Training Deep Linear Neural Networks

Xin Liu, Wei Tao, Zhisong Pan

Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated gradient~(NAG), are widely used in training neural networks for their fast convergence. However, there is a lack of theoretical guarantees for their convergence and acceleration since the optimization landscape of the neural network is non-convex. Nowadays, some works make progress towards understanding the convergence of momentum methods in an over-parameterized regime, where the number of the parameters exceeds that of the training instances. Nonetheless, current results mainly focus on the two-layer neural network, which are far from explaining the remarkable success of the momentum methods in training deep neural networks. Motivated by this, we investigate the convergence of NAG with constant learning rate and momentum parameter in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets. Based on the over-parameterization regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under the random Gaussian initialization. Our results show that NAG can converge to the global minimum at a $(1 - \mathcal{O}(1/\sqrtκ))^t$ rate, where $t$ is the iteration number and $κ> 1$ is a constant depending on the condition number of the feature matrix. Compared to the $(1 - \mathcal{O}(1/κ))^t$ rate of GD, NAG achieves an acceleration over GD. To the best of our knowledge, this is the first theoretical guarantee for the convergence of NAG to the global minimum in training deep neural networks. Furthermore, we extend our analysis to deep linear ResNets and derive a similar convergence result.

QUANT-PHMay 26, 2022
QSpeech: Low-Qubit Quantum Speech Application Toolkit

Zhenhou Hong, Jianzong Wang, Xiaoyang Qu et al.

Quantum devices with low qubits are common in the Noisy Intermediate-Scale Quantum (NISQ) era. However, Quantum Neural Network (QNN) running on low-qubit quantum devices would be difficult since it is based on Variational Quantum Circuit (VQC), which requires many qubits. Therefore, it is critical to make QNN with VQC run on low-qubit quantum devices. In this study, we propose a novel VQC called the low-qubit VQC. VQC requires numerous qubits based on the input dimension; however, the low-qubit VQC with linear transformation can liberate this condition. Thus, it allows the QNN to run on low-qubit quantum devices for speech applications. Furthermore, as compared to the VQC, our proposed low-qubit VQC can stabilize the training process more. Based on the low-qubit VQC, we implement QSpeech, a library for quick prototyping of hybrid quantum-classical neural networks in the speech field. It has numerous quantum neural layers and QNN models for speech applications. Experiments on Speech Command Recognition and Text-to-Speech show that our proposed low-qubit VQC outperforms VQC and is more stable.

AIFeb 28, 2024Code
Data Interpreter: An LLM Agent For Data Science

Sirui Hong, Yizhang Lin, Bang Liu et al. · tencent-ai, tsinghua

Large Language Model (LLM)-based agents have shown effectiveness across many applications. However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging. Previous approaches primarily focus on individual tasks, making it difficult to assess the complete data science workflow. Moreover, they struggle to handle real-time changes in intermediate data and fail to adapt dynamically to evolving task dependencies inherent to data science problems. In this paper, we present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end. Our Data Interpreter incorporates two key modules: 1) Hierarchical Graph Modeling, which breaks down complex problems into manageable subproblems, enabling dynamic node generation and graph optimization; and 2) Programmable Node Generation, a technique that refines and verifies each subproblem to iteratively improve code generation results and robustness. Extensive experiments consistently demonstrate the superiority of Data Interpreter. On InfiAgent-DABench, it achieves a 25% performance boost, raising accuracy from 75.9% to 94.9%. For machine learning and open-ended tasks, it improves performance from 88% to 95%, and from 60% to 97%, respectively. Moreover, on the MATH dataset, Data Interpreter achieves remarkable performance with a 26% improvement compared to state-of-the-art baselines. The code is available at https://github.com/geekan/MetaGPT.

LGJan 27, 2023
Adapting Step-size: A Unified Perspective to Analyze and Improve Gradient-based Methods for Adversarial Attacks

Wei Tao, Lei Bao, Sheng Long et al.

Learning adversarial examples can be formulated as an optimization problem of maximizing the loss function with some box-constraints. However, for solving this induced optimization problem, the state-of-the-art gradient-based methods such as FGSM, I-FGSM and MI-FGSM look different from their original methods especially in updating the direction, which makes it difficult to understand them and then leaves some theoretical issues to be addressed in viewpoint of optimization. In this paper, from the perspective of adapting step-size, we provide a unified theoretical interpretation of these gradient-based adversarial learning methods. We show that each of these algorithms is in fact a specific reformulation of their original gradient methods but using the step-size rules with only current gradient information. Motivated by such analysis, we present a broad class of adaptive gradient-based algorithms based on the regular gradient methods, in which the step-size strategy utilizing information of the accumulated gradients is integrated. Such adaptive step-size strategies directly normalize the scale of the gradients rather than use some empirical operations. The important benefit is that convergence for the iterative algorithms is guaranteed and then the whole optimization process can be stabilized. The experiments demonstrate that our AdaI-FGM consistently outperforms I-FGSM and AdaMI-FGM remains competitive with MI-FGSM for black-box attacks.

SEApr 21Code
PlayCoder: Making LLM-Generated GUI Code Playable

Zhiyuan Peng, Wei Tao, Xin Yin et al.

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.

LGAug 8, 2022
Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Xin Liu, Wei Tao, Wei Li et al.

Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the first-order method is capable of attaining a global minimum during training over-parameterized neural networks, where the number of parameters is significantly larger than that of training instances. Momentum methods, including the heavy ball (HB) method and Nesterov's accelerated gradient (NAG) method, are the workhorse of first-order gradient methods owning to their accelerated convergence. In practice, NAG often exhibits superior performance than HB. However, current theoretical works fail to distinguish their convergence difference in training neural networks. To fill this gap, we consider the training problem of the two-layer ReLU neural network under over-parameterization and random initialization. Leveraging high-resolution dynamical systems and neural tangent kernel (NTK) theory, our result not only establishes tighter upper bounds of the convergence rate for both HB and NAG, but also provides the first theoretical guarantee for the acceleration of NAG over HB in training neural networks. Finally, we validate our theoretical results on three benchmark datasets.

CVApr 30Code
EdgeFM: Efficient Edge Inference for Vision-Language Models

Mengling Deng, Yuanpeng Chen, Sheng Yang et al.

Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in limitation and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to 1.49 times speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

LGFeb 22
The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers

Wei Tao, Yang Dai, Jincai Huang et al.

Crafting adversarial examples can be formulated as an optimization problem. While sign-based optimizers such as I-FGSM and MI-FGSM have become the de facto standard for the induced optimization problems, there still exist several unsolved problems in theoretical grounding and practical reliability especially in non-convergence and instability, which inevitably influences their transferability. Contrary to the expectation, we observe that the attack success rate may degrade sharply when more number of iterations are conducted. In this paper, we address these issues from an optimization perspective. By reformulating the sign-based optimizer as a specific coordinate-wise gradient descent, we argue that one cause for non-convergence and instability is their non-decaying step-size scheduling. Based upon this viewpoint, we propose a series of new attack algorithms that enforce Monotonically Decreasing Coordinate-wise Step-sizes (MDCS) within sign-based optimizers. Typically, we further provide theoretical guarantees proving that MDCS-MI attains an optimal convergence rate of $O(1/\sqrt{T})$, where $T$ is the number of iterations. Extensive experiments on image classification and cross-modal retrieval tasks demonstrate that our approach not only significantly improves transferability but also enhances attack stability compared to state-of-the-art sign-based methods.

CVFeb 9
Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries

Haocheng Lu, Nan Zhang, Wei Tao et al.

Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs), as video frames arrive sequentially and user queries can be issued at arbitrary time points. Existing solutions relying on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. The innovation of Vista can be summarized in three aspects: (1) scene-aware segmentation, where Vista dynamically clusters incoming frames into temporally and visually coherent scene units; (2) scene-aware compression, where each scene is compressed into a compact token representation and stored in GPU memory for efficient index-based retrieval, while full-resolution frames are offloaded to CPU memory; and (3) scene-aware recall, where relevant scenes are selectively recalled and reintegrated into the model input upon receiving a query, enabling both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.

CVJan 30
Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models

Anmin Wang, Nan Zhang, Wei Tao et al.

Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which creates prohibitively long token sequences. To address this, we introduce Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting. Its first stage, Frame-Level Budgeting, identifies keyframes by evaluating their visual dynamics and relevance, generating a strategic prior based on their importance scores. Guided by this prior, the second stage, Token-Level Budgeting, allocates tokens in two phases: it first secures high-relevance Core Tokens, followed by diverse Context Tokens selected with an efficient batched Maximal Marginal Relevance (MMR) algorithm. Extensive experiments demonstrate that Triage improves inference speed and reduces memory footprint, while maintaining or surpassing the performance of baselines and other methods on various video reasoning benchmarks.

SEJun 12, 2025Code
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Lianghong Guo, Yanlin Wang, Caihua Li et al. · tencent-ai

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. To tackle these issues, our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.

CVDec 9, 2025Code
FastBEV++: Fast by Algorithm, Deployable by Design

Yuanpeng Chen, Hui Song, Wei Tao et al.

The advancement of camera-only Bird's-Eye-View(BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the "Deployable by Design" principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator native primitives (e.g Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures fully TensorRT-native portability. Concurrently, our framework is "Fast by Algorithm", leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation.Empirical validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state-of-the-art 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g Tesla T4). By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: https://github.com/ymlab/advanced-fastbev

SEJan 16, 2024Code
KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation

Wei Tao, Yucheng Zhou, Yanlin Wang et al.

Commit messages are natural language descriptions of code changes, which are important for software evolution such as code understanding and maintenance. However, previous methods are trained on the entire dataset without considering the fact that a portion of commit messages adhere to good practice (i.e., good-practice commits), while the rest do not. On the basis of our empirical study, we discover that training on good-practice commits significantly contributes to the commit message generation. Motivated by this finding, we propose a novel knowledge-aware denoising learning method called KADEL. Considering that good-practice commits constitute only a small proportion of the dataset, we align the remaining training samples with these good-practice commits. To achieve this, we propose a model that learns the commit knowledge by training on good-practice commits. This knowledge model enables supplementing more information for training samples that do not conform to good practice. However, since the supplementary information may contain noise or prediction errors, we propose a dynamic denoising training method. This method composes a distribution-aware confidence function and a dynamic distribution list, which enhances the effectiveness of the training process. Experimental results on the whole MCMD dataset demonstrate that our method overall achieves state-of-the-art performance compared with previous methods. Our source code and data are available at https://github.com/DeepSoftwareAnalytics/KADEL

SEMar 26, 2024
MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

Wei Tao, Yucheng Zhou, Yanlin Wang et al. · tencent-ai

In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving Github issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.

CVNov 26, 2025
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang, Wei Tao, Xi Xiao et al.

In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.

CVMay 4
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

Wei Tao, Xiaoyang Qu, Peiqiang Wang et al.

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in VLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called WindowQuant, which employs window-adaptive mixed-precision quantization to optimize the KV cache. WindowQuant consists of two modules: window-level quantization search and window-level KV cache computation. Window-level quantization search quickly determines the optimal bit-width configuration of the KV cache windows based on the similarity scores between the corresponding visual token windows and the text prompt, maintaining the model accuracy. Furthermore, window-level KV cache computation reorders the KV cache windows before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLM models and KV cache quantization methods on various datasets.

LGDec 16, 2025
Optimizing the Adversarial Perturbation with a Momentum-based Adaptive Matrix

Wei Tao, Sheng Long, Xin Liu et al.

Generating adversarial examples (AEs) can be formulated as an optimization problem. Among various optimization-based attacks, the gradient-based PGD and the momentum-based MI-FGSM have garnered considerable interest. However, all these attacks use the sign function to scale their perturbations, which raises several theoretical concerns from the point of view of optimization. In this paper, we first reveal that PGD is actually a specific reformulation of the projected gradient method using only the current gradient to determine its step-size. Further, we show that when we utilize a conventional adaptive matrix with the accumulated gradients to scale the perturbation, PGD becomes AdaGrad. Motivated by this analysis, we present a novel momentum-based attack AdaMI, in which the perturbation is optimized with an interesting momentum-based adaptive matrix. AdaMI is proved to attain optimal convergence for convex problems, indicating that it addresses the non-convergence issue of MI-FGSM, thereby ensuring stability of the optimization process. The experiments demonstrate that the proposed momentum-based adaptive matrix can serve as a general and effective technique to boost adversarial transferability over the state-of-the-art methods across different networks while maintaining better stability and imperceptibility.

CLJul 30, 2025
C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma, Wei Tao, Yiwen Guo · tencent-ai

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

CLApr 13, 2025
MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs

Wei Tao, Xiaoyang Qu, Kai Lu et al.

When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.

CLMar 30, 2025
Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference

Wei Tao, Bin Zhang, Xiaoyang Qu et al.

Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in LLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.

CVJun 9, 2025
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts

Wei Tao, Haocheng Lu, Xiaoyang Qu et al.

One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.

CVDec 24, 2024
UNet--: Memory-Efficient and Feature-Enhanced Network Architecture based on U-Net with Reduced Skip-Connections

Lingxiao Yin, Wei Tao, Dongyue Zhao et al.

U-Net models with encoder, decoder, and skip-connections components have demonstrated effectiveness in a variety of vision tasks. The skip-connections transmit fine-grained information from the encoder to the decoder. It is necessary to maintain the feature maps used by the skip-connections in memory before the decoding stage. Therefore, they are not friendly to devices with limited resource. In this paper, we propose a universal method and architecture to reduce the memory consumption and meanwhile generate enhanced feature maps to improve network performance. To this end, we design a simple but effective Multi-Scale Information Aggregation Module (MSIAM) in the encoder and an Information Enhancement Module (IEM) in the decoder. The MSIAM aggregates multi-scale feature maps into single-scale with less memory. After that, the aggregated feature maps can be expanded and enhanced to multi-scale feature maps by the IEM. By applying the proposed method on NAFNet, a SOTA model in the field of image restoration, we design a memory-efficient and feature-enhanced network architecture, UNet--. The memory demand by the skip-connections in the UNet-- is reduced by 93.3%, while the performance is improved compared to NAFNet. Furthermore, we show that our proposed method can be generalized to multiple visual tasks, with consistent improvements in both memory consumption and network accuracy compared to the existing efficient architectures.

CVMay 31, 2025
BAGNet: A Boundary-Aware Graph Attention Network for 3D Point Cloud Semantic Segmentation

Wei Tao, Xiaoyang Qu, Kai Lu et al.

Since the point cloud data is inherently irregular and unstructured, point cloud semantic segmentation has always been a challenging task. The graph-based method attempts to model the irregular point cloud by representing it as a graph; however, this approach incurs substantial computational cost due to the necessity of constructing a graph for every point within a large-scale point cloud. In this paper, we observe that boundary points possess more intricate spatial structural information and develop a novel graph attention network known as the Boundary-Aware Graph attention Network (BAGNet). On one hand, BAGNet contains a boundary-aware graph attention layer (BAGLayer), which employs edge vertex fusion and attention coefficients to capture features of boundary points, reducing the computation time. On the other hand, BAGNet employs a lightweight attention pooling layer to extract the global feature of the point cloud to maintain model accuracy. Extensive experiments on standard datasets demonstrate that BAGNet outperforms state-of-the-art methods in point cloud semantic segmentation with higher accuracy and less inference time.

CVJan 3, 2025
Dedicated Inference Engine and Binary-Weight Neural Networks for Lightweight Instance Segmentation

Tse-Wei Chen, Wei Tao, Dongyue Zhao et al.

Reducing computational costs is an important issue for development of embedded systems. Binary-weight Neural Networks (BNNs), in which weights are binarized and activations are quantized, are employed to reduce computational costs of various kinds of applications. In this paper, a design methodology of hardware architecture for inference engines is proposed to handle modern BNNs with two operation modes. Multiply-Accumulate (MAC) operations can be simplified by replacing multiply operations with bitwise operations. The proposed method can effectively reduce the gate count of inference engines by removing a part of computational costs from the hardware system. The architecture of MAC operations can calculate the inference results of BNNs efficiently with only 52% of hardware costs compared with the related works. To show that the inference engine can handle practical applications, two lightweight networks which combine the backbones of SegNeXt and the decoder of SparseInst for instance segmentation are also proposed. The output results of the lightweight networks are computed using only bitwise operations and add operations. The proposed inference engine has lower hardware costs than related works. The experimental results show that the proposed inference engine can handle the proposed instance-segmentation networks and achieves higher accuracy than YOLACT on the "Person" category although the model size is 77.7$\times$ smaller compared with YOLACT.

CVJan 24, 2024
Value-Driven Mixed-Precision Quantization for Patch-Based Inference on Microcontrollers

Wei Tao, Shenglin He, Kai Lu et al.

Deploying neural networks on microcontroller units (MCUs) presents substantial challenges due to their constrained computation and memory resources. Previous researches have explored patch-based inference as a strategy to conserve memory without sacrificing model accuracy. However, this technique suffers from severe redundant computation overhead, leading to a substantial increase in execution latency. A feasible solution to address this issue is mixed-precision quantization, but it faces the challenges of accuracy degradation and a time-consuming search time. In this paper, we propose QuantMCU, a novel patch-based inference method that utilizes value-driven mixed-precision quantization to reduce redundant computation. We first utilize value-driven patch classification (VDPC) to maintain the model accuracy. VDPC classifies patches into two classes based on whether they contain outlier values. For patches containing outlier values, we apply 8-bit quantization to the feature maps on the dataflow branches that follow. In addition, for patches without outlier values, we utilize value-driven quantization search (VDQS) on the feature maps of their following dataflow branches to reduce search time. Specifically, VDQS introduces a novel quantization search metric that takes into account both computation and accuracy, and it employs entropy as an accuracy representation to avoid additional training. VDQS also adopts an iterative approach to determine the bitwidth of each feature map to further accelerate the search process. Experimental results on real-world MCU devices show that QuantMCU can reduce computation by 2.2x on average while maintaining comparable model accuracy compared to the state-of-the-art patch-based inference methods.

SEJul 12, 2021
On the Evaluation of Commit Message Generation Models: An Experimental Study

Wei Tao, Yanlin Wang, Ensheng Shi et al.

Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance. However, writing commit messages manually is time-consuming and laborious, especially when the code is updated frequently. Various approaches utilizing generation or retrieval techniques have been proposed to automatically generate commit messages. To achieve a better understanding of how the existing approaches perform in solving this problem, this paper conducts a systematic and in-depth analysis of the state-of-the-art models and datasets. We find that: (1) Different variants of the BLEU metric are used in previous works, which affects the evaluation and understanding of existing methods. (2) Most existing datasets are crawled only from Java repositories while repositories in other programming languages are not sufficiently explored. (3) Dataset splitting strategies can influence the performance of existing models by a large margin. Some models show better performance when the datasets are split by commit, while other models perform better when the datasets are split by timestamp or by project. Based on our findings, we conduct a human evaluation and find the BLEU metric that best correlates with the human scores for the task. We also collect a large-scale, information-rich, and multi-language commit message dataset MCMD and evaluate existing models on this dataset. Furthermore, we conduct extensive experiments under different dataset splitting strategies and suggest the suitable models under different scenarios. Based on the experimental results and findings, we provide feasible suggestions for comprehensively evaluating commit message generation models and discuss possible future research directions. We believe this work can help practitioners and researchers better evaluate and select models for automatic commit message generation.

LGJul 5, 2021
Provable Convergence of Nesterov's Accelerated Gradient Method for Over-Parameterized Neural Networks

Xin Liu, Zhisong Pan, Wei Tao

Momentum methods, such as heavy ball method~(HB) and Nesterov's accelerated gradient method~(NAG), have been widely used in training neural networks by incorporating the history of gradients into the current updating process. In practice, they often provide improved performance over (stochastic) gradient descent~(GD) with faster convergence. Despite these empirical successes, theoretical understandings of their accelerated convergence rates are still lacking. Recently, some attempts have been made by analyzing the trajectories of gradient-based methods in an over-parameterized regime, where the number of the parameters is significantly larger than the number of the training instances. However, the majority of existing theoretical work is mainly concerned with GD and the established convergence result of NAG is inferior to HB and GD, which fails to explain the practical success of NAG. In this paper, we take a step towards closing this gap by analyzing NAG in training a randomly initialized over-parameterized two-layer fully connected neural network with ReLU activation. Despite the fact that the objective function is non-convex and non-smooth, we show that NAG converges to a global minimum at a non-asymptotic linear rate $(1-Θ(1/\sqrtκ))^t$, where $κ> 1$ is the condition number of a gram matrix and $t$ is the number of the iterations. Compared to the convergence rate $(1-Θ(1/κ))^t$ of GD, our result provides theoretical guarantees for the acceleration of NAG in neural network training. Furthermore, our findings suggest that NAG and HB have similar convergence rate. Finally, we conduct extensive experiments on six benchmark datasets to validate the correctness of our theoretical results.

CVApr 29, 2021
CASSOD-Net: Cascaded and Separable Structures of Dilated Convolution for Embedded Vision Systems and Applications

Tse-Wei Chen, Deyu Wang, Wei Tao et al.

The field of view (FOV) of convolutional neural networks is highly related to the accuracy of inference. Dilated convolutions are known as an effective solution to the problems which require large FOVs. However, for general-purpose hardware or dedicated hardware, it usually takes extra time to handle dilated convolutions compared with standard convolutions. In this paper, we propose a network module, Cascaded and Separable Structure of Dilated (CASSOD) Convolution, and a special hardware system to handle the CASSOD networks efficiently. A CASSOD-Net includes multiple cascaded $2 \times 2$ dilated filters, which can be used to replace the traditional $3 \times 3$ dilated filters without decreasing the accuracy of inference. Two example applications, face detection and image segmentation, are tested with dilated convolutions and the proposed CASSOD modules. The new network for face detection achieves higher accuracy than the previous work with only 47% of filter weights in the dilated convolution layers of the context module. Moreover, the proposed hardware system can accelerate the computations of dilated convolutions, and it is 2.78 times faster than traditional hardware systems when the filter size is $3 \times 3$.

CVApr 29, 2021
Hardware Architecture of Embedded Inference Accelerator and Analysis of Algorithms for Depthwise and Large-Kernel Convolutions

Tse-Wei Chen, Wei Tao, Deyu Wang et al.

In order to handle modern convolutional neural networks (CNNs) efficiently, a hardware architecture of CNN inference accelerator is proposed to handle depthwise convolutions and regular convolutions, which are both essential building blocks for embedded-computer-vision algorithms. Different from related works, the proposed architecture can support filter kernels with different sizes with high flexibility since it does not require extra costs for intra-kernel parallelism, and it can generate convolution results faster than the architecture of the related works. The experimental results show the importance of supporting depthwise convolutions and dilated convolutions with the proposed hardware architecture. In addition to depthwise convolutions with large-kernels, a new structure called DDC layer, which includes the combination of depthwise convolutions and dilated convolutions, is also analyzed in this paper. For face detection, the computational costs decrease by 30%, and the model size decreases by 20% when the DDC layers are applied to the network. For image classification, the accuracy is increased by 1% by simply replacing $3 \times 3$ filters with $5 \times 5$ filters in depthwise convolutions.

CVApr 29, 2021
Condensation-Net: Memory-Efficient Network Architecture with Cross-Channel Pooling Layers and Virtual Feature Maps

Tse-Wei Chen, Motoki Yoshinaga, Hongxing Gao et al.

"Lightweight convolutional neural networks" is an important research topic in the field of embedded vision. To implement image recognition tasks on a resource-limited hardware platform, it is necessary to reduce the memory size and the computational cost. The contribution of this paper is stated as follows. First, we propose an algorithm to process a specific network architecture (Condensation-Net) without increasing the maximum memory storage for feature maps. The architecture for virtual feature maps saves 26.5% of memory bandwidth by calculating the results of cross-channel pooling before storing the feature map into the memory. Second, we show that cross-channel pooling can improve the accuracy of object detection tasks, such as face detection, because it increases the number of filter weights. Compared with Tiny-YOLOv2, the improvement of accuracy is 2.0% for quantized networks and 1.5% for full-precision networks when the false-positive rate is 0.1. Last but not the least, the analysis results show that the overhead to support the cross-channel pooling with the proposed hardware architecture is negligible small. The extra memory cost to support Condensation-Net is 0.2% of the total size, and the extra gate count is only 1.0% of the total size.

LGFeb 15, 2021
The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods

Wei Tao, Sheng Long, Gaowei Wu et al.

The adaptive stochastic gradient descent (SGD) with momentum has been widely adopted in deep learning as well as convex optimization. In practice, the last iterate is commonly used as the final solution to make decisions. However, the available regret analysis and the setting of constant momentum parameters only guarantee the optimal convergence of the averaged solution. In this paper, we fill this theory-practice gap by investigating the convergence of the last iterate (referred to as individual convergence), which is a more difficult task than convergence analysis of the averaged solution. Specifically, in the constrained convex cases, we prove that the adaptive Polyak's Heavy-ball (HB) method, in which only the step size is updated using the exponential moving average strategy, attains an optimal individual convergence rate of $O(\frac{1}{\sqrt{t}})$, as opposed to the optimality of $O(\frac{\log t}{\sqrt {t}})$ of SGD, where $t$ is the number of iterations. Our new analysis not only shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization but also gives valuable hints how the momentum parameters should be scheduled in deep learning. Empirical results on optimizing convex functions and training deep networks validate the correctness of our convergence analysis and demonstrate the improved performance of the adaptive HB methods.

ROFeb 8, 2021
Simultaneous Localization and Mapping Related Datasets: A Comprehensive Survey

Yuanzhi Liu, Yujia Fu, Fengdong Chen et al.

Due to the complicated procedure and costly hardware, Simultaneous Localization and Mapping (SLAM) has been heavily dependent on public datasets for drill and evaluation, leading to many impressive demos and good benchmark scores. However, with a huge contrast, SLAM is still struggling on the way towards mature deployment, which sounds a warning: some of the datasets are overexposed, causing biased usage and evaluation. This raises the problem on how to comprehensively access the existing datasets and correctly select them. Moreover, limitations do exist in current datasets, then how to build new ones and which directions to go? Nevertheless, a comprehensive survey which can tackle the above issues does not exist yet, while urgently demanded by the community. To fill the gap, this paper strives to cover a range of cohesive topics about SLAM related datasets, including general collection methodology and fundamental characteristic dimensions, SLAM related tasks taxonomy and datasets categorization, introduction of state-of-the-arts, overview and comparison of existing datasets, review of evaluation criteria, and analyses and discussions about current limitations and future directions, looking forward to not only guiding the dataset selection, but also promoting the dataset research.

LGDec 29, 2020
Gradient Descent Averaging and Primal-dual Averaging for Strongly Convex Optimization

Wei Tao, Wei Li, Zhisong Pan et al.

Averaging scheme has attracted extensive attention in deep learning as well as traditional machine learning. It achieves theoretically optimal convergence and also improves the empirical model performance. However, there is still a lack of sufficient convergence analysis for strongly convex optimization. Typically, the convergence about the last iterate of gradient descent methods, which is referred to as individual convergence, fails to attain its optimality due to the existence of logarithmic factor. In order to remove this factor, we first develop gradient descent averaging (GDA), which is a general projection-based dual averaging algorithm in the strongly convex setting. We further present primal-dual averaging for strongly convex cases (SC-PDA), where primal and dual averaging schemes are simultaneously utilized. We prove that GDA yields the optimal convergence rate in terms of output averaging, while SC-PDA derives the optimal individual convergence. Several experiments on SVMs and deep learning models validate the correctness of theoretical analysis and effectiveness of algorithms.

CVSep 29, 2020
BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods to Deep Binary Model

Junjie Liu, Dongchao Wen, Deyu Wang et al.

Recent methods have significantly reduced the performance degradation of Binary Neural Networks (BNNs), but guaranteeing the effective and efficient training of BNNs is an unsolved problem. The main reason is that the estimated gradients produced by the Straight-Through-Estimator (STE) mismatches with the gradients of the real derivatives. In this paper, we provide an explicit convex optimization example where training the BNNs with the traditionally adaptive optimization methods still faces the risk of non-convergence, and identify that constraining the range of gradients is critical for optimizing the deep binary model to avoid highly suboptimal solutions. For solving above issues, we propose a BAMSProd algorithm with a key observation that the convergence property of optimizing deep binary model is strongly related to the quantization errors. In brief, it employs an adaptive range constraint via an errors measurement for smoothing the gradients transition while follows the exponential moving strategy from AMSGrad to avoid errors accumulation during the optimization. The experiments verify the corollary of theoretical convergence analysis, and further demonstrate that our optimization method can speed up the convergence about 1:2x and boost the performance of BNNs to a significant level than the specific binary optimizer about 3:7%, even in a highly non-convex optimization problem.

CVSep 10, 2020
QuantNet: Learning to Quantize by Learning within Fully Differentiable Framework

Junjie Liu, Dongchao Wen, Deyu Wang et al.

Despite the achievements of recent binarization methods on reducing the performance degradation of Binary Neural Networks (BNNs), gradient mismatching caused by the Straight-Through-Estimator (STE) still dominates quantized networks. This paper proposes a meta-based quantizer named QuantNet, which utilizes a differentiable sub-network to directly binarize the full-precision weights without resorting to STE and any learnable gradient estimators. Our method not only solves the problem of gradient mismatching, but also reduces the impact of discretization errors, caused by the binarizing operation in the deployment, on performance. Generally, the proposed algorithm is implemented within a fully differentiable framework, and is easily extended to the general network quantization with any bits. The quantitative experiments on CIFAR-100 and ImageNet demonstrate that QuantNet achieves the signifficant improvements comparing with previous binarization methods, and even bridges gaps of accuracies between binarized models and full-precision models.

CVNov 19, 2019
IFQ-Net: Integrated Fixed-point Quantization Networks for Embedded Vision

Hongxing Gao, Wei Tao, Dongchao Wen et al.

Deploying deep models on embedded devices has been a challenging problem since the great success of deep learning based networks. Fixed-point networks, which represent their data with low bits fixed-point and thus give remarkable savings on memory usage, are generally preferred. Even though current fixed-point networks employ relative low bits (e.g. 8-bits), the memory saving is far from enough for the embedded devices. On the other hand, quantization deep networks, for example XNOR-Net and HWGQNet, quantize the data into 1 or 2 bits resulting in more significant memory savings but still contain lots of floatingpoint data. In this paper, we propose a fixed-point network for embedded vision tasks through converting the floatingpoint data in a quantization network into fixed-point. Furthermore, to overcome the data loss caused by the conversion, we propose to compose floating-point data operations across multiple layers (e.g. convolution, batch normalization and quantization layers) and convert them into fixedpoint. We name the fixed-point network obtained through such integrated conversion as Integrated Fixed-point Quantization Networks (IFQ-Net). We demonstrate that our IFQNet gives 2.16x and 18x more savings on model size and runtime feature map memory respectively with similar accuracy on ImageNet. Furthermore, based on YOLOv2, we design IFQ-Tinier-YOLO face detector which is a fixed-point network with 256x reduction in model size (246k Bytes) than Tiny-YOLO. We illustrate the promising performance of our face detector in terms of detection rate on Face Detection Data Set and Bencmark (FDDB) and qualitative results of detecting small faces of Wider Face dataset.

CVNov 13, 2019
DupNet: Towards Very Tiny Quantized CNN with Improved Accuracy for Face Detection

Hongxing Gao, Wei Tao, Dongchao Wen et al.

Deploying deep learning based face detectors on edge devices is a challenging task due to the limited computation resources. Even though binarizing the weights of a very tiny network gives impressive compactness on model size (e.g. 240.9 KB for IFQ-Tinier-YOLO), it is not tiny enough to fit in the embedded devices with strict memory constraints. In this paper, we propose DupNet which consists of two parts. Firstly, we employ weights with duplicated channels for the weight-intensive layers to reduce the model size. Secondly, for the quantization-sensitive layers whose quantization causes notable accuracy drop, we duplicate its input feature maps. It allows us to use more weights channels for convolving more representative outputs. Based on that, we propose a very tiny face detector, DupNet-Tinier-YOLO, which is 6.5X times smaller on model size and 42.0% less complex on computation and meanwhile achieves 2.4% higher detection than IFQ-Tinier-YOLO. Comparing with the full precision Tiny-YOLO, our DupNet-Tinier-YOLO gives 1,694.2X and 389.9X times savings on model size and computation complexity respectively with only 4.0% drop on detection rate (0.880 vs. 0.920). Moreover, our DupNet-Tinier-YOLO is only 36.9 KB, which is the tiniest deep face detector to our best knowledge.

CVNov 13, 2019
Knowledge Representing: Efficient, Sparse Representation of Prior Knowledge for Knowledge Distillation

Junjie Liu, Dongchao Wen, Hongxing Gao et al.

Despite the recent works on knowledge distillation (KD) have achieved a further improvement through elaborately modeling the decision boundary as the posterior knowledge, their performance is still dependent on the hypothesis that the target network has a powerful capacity (representation ability). In this paper, we propose a knowledge representing (KR) framework mainly focusing on modeling the parameters distribution as prior knowledge. Firstly, we suggest a knowledge aggregation scheme in order to answer how to represent the prior knowledge from teacher network. Through aggregating the parameters distribution from teacher network into more abstract level, the scheme is able to alleviate the phenomenon of residual accumulation in the deeper layers. Secondly, as the critical issue of what the most important prior knowledge is for better distilling, we design a sparse recoding penalty for constraining the student network to learn with the penalized gradients. With the proposed penalty, the student network can effectively avoid the over-regularization during knowledge distilling and converge faster. The quantitative experiments exhibit that the proposed framework achieves the state-ofthe-arts performance, even though the target network does not have the expected capacity. Moreover, the framework is flexible enough for combining with other KD methods based on the posterior knowledge.