Meng Cai

AS
h-index: 10
10 papers
174 citations
Novelty: 53%
AI Score: 49

10 Papers

CV Jul 29, 2024 Code
Background Semantics Matter: Cross-Task Feature Exchange Network for Clustered Infrared Small Target Detection

Mengxuan Xiao, Yinfei Zhu, Yiming Zhu et al.

Infrared small target detection presents significant challenges due to the limited intrinsic features of the target and the overwhelming presence of visually similar background distractors. We contend that background semantics are critical for distinguishing between objects that appear visually similar in this context. To address this challenge, we propose a task, clustered infrared small target detection, and introduce DenseSIRST, a benchmark dataset that provides per-pixel semantic annotations for background regions. This dataset facilitates the shift from sparse to dense target detection. Building on this resource, we propose the Background-Aware Feature Exchange Network (BAFE-Net), a multi-task architecture that jointly tackles target detection and background semantic segmentation. BAFE-Net incorporates a dynamic cross-task feature hard-exchange mechanism, enabling the effective exchange of target and background semantics between the two tasks. Comprehensive experiments demonstrate that BAFE-Net significantly enhances target detection accuracy while mitigating false alarms. The DenseSIRST dataset, along with the code and trained models, is publicly available at https://github.com/GrokCV/BAFE-Net.

CL Aug 9, 2024 Code
MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Junhao Xu, Zhenlin Liang, Yi Liu et al.

In this paper, we present MooER, an LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads. A 5000-hour pseudo-labeled dataset containing open-source and self-collected speech data is used for training. We achieve performance comparable to other open-source models trained with up to hundreds of thousands of hours of labeled speech data. Meanwhile, experiments conducted on the CoVoST2 Zh2en test set suggest that our model outperforms other open-source speech LLMs, obtaining a BLEU score of 25.2. The main contributions of this paper are summarized as follows. First, this paper presents a training strategy for encoders and LLMs on speech-related tasks (including ASR and AST) using a small amount of pseudo-labeled data without any extra manual annotation or selection. Second, we release our ASR and AST models and plan to open-source our training code and strategy in the near future. Moreover, a model trained on 80,000-hour (8wh) scale training data is planned for later release.

67.8 LG May 25
ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

Meng Cai, Lars Kulik, Farhana Choudhury

When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-Δ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples achieves around the mid-94% range, while a same-pool top-2 oracle reaches around the mid-96% range. ARBITER recovers a subset of these cases using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
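The wrong-majority failure described in this abstract can be illustrated with a minimal sketch. It assumes the sampled answers have already been normalized to comparable strings, and the function names (`majority_vote`, `top2_oracle_contains`, `is_wrong_majority`) are hypothetical helpers for illustration. This is not the ARBITER method itself, only the consensus baseline and the same-pool top-2 oracle it is compared against.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def top2_oracle_contains(answers, gold):
    """True if the gold answer is one of the two largest answer clusters."""
    top2 = [answer for answer, _ in Counter(answers).most_common(2)]
    return gold in top2


def is_wrong_majority(answers, gold):
    """True if the gold answer was sampled but lost the vote."""
    return gold in answers and majority_vote(answers) != gold


# Hypothetical pool of six normalized answers for one question.
samples = ["42", "42", "42", "41", "41", "7"]
print(majority_vote(samples))
print(is_wrong_majority(samples, gold="41"))
print(top2_oracle_contains(samples, gold="41"))
```

In this toy pool the correct answer is present but outvoted, so a top-2 oracle would recover it while plain consensus would not; the gap between the two is the headroom the abstract refers to.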

CV May 21, 2025 Code
AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection

Yangting Shi, Yinfei Zhu, Renjie He et al.

Omni-domain infrared small target detection (Omni-IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multimodal framework that is the first to incorporate metadata into the IRSTD paradigm for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at https://github.com/GrokCV/AuxDet.

93.6 NA Mar 28
Weak convergence order of stochastic theta method for SDEs driven by time-changed Lévy noise

Ziheng Chen, Jiao Liu, Meng Cai

This paper studies the weak convergence order of the stochastic theta method for stochastic differential equations (SDEs) driven by time-changed Lévy noise under global Lipschitz and linear growth conditions. In contrast to classical Lévy-driven SDEs, the presence of a random time change makes the weak error analysis involve both the discretization error of the underlying equation and the approximation error of the random clock. Moreover, compared with the explicit Euler--Maruyama method, the implicit drift correction in the stochastic theta method makes the associated weak error analysis substantially more delicate. To address these difficulties, we first establish a global weak convergence estimate of order one for the stochastic theta method applied to the corresponding non-time-changed Lévy SDEs on the infinite time interval by means of the Kolmogorov backward partial integro-differential equations. Incorporating the approximation of the inverse subordinator together with the duality principle, we derive the weak convergence order of the stochastic theta method with $\theta \in [0,1]$ in the time-changed Lévy setting. The result advances the currently available weak convergence analysis beyond the Euler--Maruyama method to the more general class of stochastic theta methods, and establishes a workable route from the weak analysis of the underlying non-time-changed Lévy equation to the corresponding time-changed problem. Finally, numerical experiments are presented to further support the theoretical findings.
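As a rough illustration of the scheme this abstract analyzes, the sketch below applies the stochastic theta method to a scalar linear test equation $dX = aX\,dt + bX\,dW$. It is a deliberate simplification: the noise is plain Brownian motion, with no Lévy jumps and no time change, and the function name `theta_method_linear` and the test equation are assumptions chosen so that the implicit step can be solved in closed form.

```python
import math
import random


def theta_method_linear(x0, a, b, theta, T, n_steps, rng):
    """Simulate one path of dX = a*X dt + b*X dW with the stochastic theta method.

    The update X_{n+1} = X_n + h*(theta*a*X_{n+1} + (1 - theta)*a*X_n) + b*X_n*dW
    is implicit in the drift only; for this linear drift it is solved exactly.
    theta = 0 gives the explicit Euler-Maruyama scheme.
    """
    h = T / n_steps
    denom = 1.0 - theta * a * h
    if denom <= 0.0:
        raise ValueError("step size too large for the implicit solve")
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(h))  # Brownian increment over one step
        x = (x * (1.0 + (1.0 - theta) * a * h) + b * x * dw) / denom
    return x


def weak_error_of_mean(x0, a, b, theta, T, n_steps, n_paths, seed=0):
    """Monte Carlo estimate of |E[X_T^numerical] - x0*exp(a*T)|."""
    rng = random.Random(seed)
    total = sum(
        theta_method_linear(x0, a, b, theta, T, n_steps, rng)
        for _ in range(n_paths)
    )
    return abs(total / n_paths - x0 * math.exp(a * T))


print(weak_error_of_mean(1.0, -1.0, 0.5, theta=0.5, T=1.0, n_steps=50, n_paths=2000))
```

The printed value mixes discretization bias with Monte Carlo sampling error, so observing a weak order in practice would require many more paths and several step sizes.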

CL Jan 30, 2022
Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection

Minglun Han, Linhao Dong, Zhenlin Liang et al.

Nowadays, most methods in end-to-end contextual speech recognition bias the recognition process towards contextual knowledge. Since all-neural contextual biasing methods rely on phrase-level contextual modeling and attention-based relevance modeling, they may encounter confusion between similar context-specific phrases, which hurts predictions at the token level. In this work, we focus on mitigating confusion problems with fine-grained contextual knowledge selection (FineCoS). In FineCoS, we introduce fine-grained knowledge to reduce the uncertainty of token predictions. Specifically, we first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates. Moreover, we re-normalize the attention weights of most relevant phrases in inference to obtain more focused phrase-level contextual representations, and inject position information to better discriminate phrases or tokens. On LibriSpeech and an in-house 160,000-hour dataset, we explore the proposed methods based on a controllable all-neural biasing method, collaborative decoding (ColDec). The proposed methods provide at most 6.1% relative word error rate reduction on LibriSpeech and 16.4% relative character error rate reduction on the in-house dataset over ColDec.

AS Oct 8, 2021
Improving Pseudo-label Training For End-to-end Speech Recognition Using Gradient Mask

Shaoshi Ling, Chen Shen, Meng Cai et al.

In the recent trend of semi-supervised speech recognition, both self-supervised representation learning and pseudo-labeling have shown promising results. In this paper, we propose a novel approach that combines their ideas for end-to-end speech recognition models. Without any extra loss function, we use the Gradient Mask to optimize the model when training on pseudo-labels. This method forces the speech recognition model to predict from the masked input in order to learn strong acoustic representations, and it makes training robust to label noise. In our semi-supervised experiments, the method improves model performance when training on pseudo-labels and achieves competitive results compared with other semi-supervised approaches on the LibriSpeech 100-hour experiments.

AS Nov 3, 2020
Improving RNN transducer with normalized jointer network

Mingkun Huang, Jun Zhang, Meng Cai et al.

Recurrent neural transducer (RNN-T) is a promising end-to-end (E2E) model in automatic speech recognition (ASR). It has shown superior performance compared to traditional hybrid ASR systems. However, training RNN-T from scratch is still challenging. We observe a huge gradient variance during RNN-T training and suspect it hurts the performance. In this work, we analyze the cause of the huge gradient variance in RNN-T training and propose a new normalized jointer network to overcome it. We also propose to enhance the RNN-T network with a modified conformer encoder network and transformer-XL predictor networks to achieve the best performance. Experiments are conducted on the open 170-hour AISHELL-1 dataset and an industrial-level 30,000-hour Mandarin speech dataset. On AISHELL-1, our RNN-T system achieves state-of-the-art results on the streaming and non-streaming benchmarks, with CERs of 6.15% and 5.37% respectively. We further compare our RNN-T system with our well-trained commercial hybrid system on the 30,000-hour industrial audio data and obtain a 9% relative improvement without pre-training or an external language model.

AS Nov 3, 2020
Dynamic latency speech recognition with asynchronous revision

Mingkun Huang, Meng Cai, Jun Zhang et al.

In this work, we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Specifically, we achieve dynamic latency with only one model by using arbitrary right context during inference. The model is composed of a stack of convolutional layers for audio encoding. In the inference stage, the history states of the encoder and decoder can be asynchronously revised to trade off between the latency and the accuracy of the model. To alleviate the mismatch between training and inference, we propose a training technique, segment cropping, which randomly splits input utterances into several segments with forward connections. This allows us to obtain dynamic-latency speech recognition results with large improvements in accuracy. Experiments show that our dynamic-latency model with asynchronous revision gives 8%-14% relative improvements over the streaming models.

SP Nov 23, 2018
Application of Machine Learning in Fiber Nonlinearity Modeling and Monitoring for Elastic Optical Networks

Qunbi Zhuge, Xiaobo Zeng, Huazhi Lun et al.

Fiber nonlinear interference (NLI) modeling and monitoring are the key building blocks to support elastic optical networks (EONs). In the past, they were normally developed and investigated separately. Moreover, the accuracy of the previously proposed methods still needs to be improved for heterogeneous dynamic optical networks. In this paper, we present the application of machine learning (ML) in NLI modeling and monitoring. In particular, we first propose to use ML approaches to calibrate the errors of current fiber nonlinearity models. The Gaussian-noise (GN) model is used as an illustrative example, and significant improvement is demonstrated with the aid of an artificial neural network (ANN). Further, we propose to use ML to combine the modeling and monitoring schemes for a better estimation of NLI variance. The following errors, as listed in the comments, are the reasons for withdrawal. (1) As the title indicates, the work should address elastic optical networks (EONs); however, the simulation setup and the results section focus on conventional wavelength-division multiplexing (WDM) networks. This error may confuse researchers and lead to misleading decisions in research on elastic optical networks. (2) There are errors in the results section; for example, Fig. 9(b) and (c) have the wrong captions, which may lead to misleading decisions. (3) The split-step Fourier method (SSFM) gives good accuracy if sufficiently small steps are adopted in the calculation; however, this paper lacks the necessary content and effort to optimize the SSFM step length. This error may cast doubt on the accuracy of the simulation results. Therefore, we decided to withdraw this paper from arXiv. The correct and complete paper with the same title was published in the Journal of Lightwave Technology with doi: 10.1109/JLT.2019.2910143.