LGNov 11, 2025
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM TrainingDonglai Xu, Hongzheng Yang, Yuzhi Zhao et al.
Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones - Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B - spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
SYOct 10, 2025
Resilient Multi-Dimensional Consensus and Distributed Optimization against Agent-Based and Denial-of-Service AttacksHongjian Chen, Changyun Wen, Xiaolei Li
In this paper, we consider the resilient multi-dimensional consensus and distributed optimization problems of multi-agent systems (MASs) in the presence of both agent-based and denial-of-service (DoS) attacks. The considered agent-based attacks can cover malicious, Byzantine, and stubborn agents. The links between agents in the network can be blocked by DoS attacks, which may lead the digraph to be time-varying and even disconnected. The objective is to ensure that the remaining benign agents achieve consensus. To this end, an "auxiliary point"-based resilient control algorithm is proposed for MASs. Under the proposed algorithm, each healthy agent constructs a "safe kernel" utilizing the states of its in-neighbors and updates its state toward a specific point within this kernel at each iteration. If an agent cannot receive its neighbors' states owing to DoS attacks, it will use the states received immediately before the DoS period. Moreover, a resilient multi-dimensional distributed optimization (RMDO) algorithm is also proposed. Theoretical proofs and numerical examples are presented to demonstrate the effectiveness of the proposed algorithms.
MLMar 24, 2020
Training a U-Net based on a random mode-coupling matrix model to recover acoustic interference striationsXiaolei Li, Wenhua Song, Dazhi Gao et al.
A U-Net is trained to recover acoustic interference striations (AISs) from distorted ones. A random mode-coupling matrix model is introduced to generate a large number of training data quickly, which are used to train the U-Net. The performance of AIS recovery of the U-Net is tested in range-dependent waveguides with nonlinear internal waves (NLIWs). Although the random mode-coupling matrix model is not an accurate physical model, the test results show that the U-Net successfully recovers AISs under different signal-to-noise ratios (SNRs) and different amplitudes and widths of NLIWs for different shapes.
LGApr 1, 2019
Sound source ranging using a feed-forward neural network with fitting-based early stoppingJing Chi, Xiaolei Li, Haozhong Wang et al.
When a feed-forward neural network (FNN) is trained for source ranging in an ocean waveguide, it is difficult evaluating the range accuracy of the FNN on unlabeled test data. A fitting-based early stopping (FEAST) method is introduced to evaluate the range error of the FNN on test data where the distance of source is unknown. Based on FEAST, when the evaluated range error of the FNN reaches the minimum on test data, stopping training, which will help to improve the ranging accuracy of the FNN on the test data. The FEAST is demonstrated on simulated and experimental data.
CRDec 10, 2016
Monet: A User-oriented Behavior-based Malware Variants Detection System for AndroidMingshen Sun, Xiaolei Li, John C. S. Lui et al.
Android, the most popular mobile OS, has around 78% of the mobile market share. Due to its popularity, it attracts many malware attacks. In fact, people have discovered around one million new malware samples per quarter, and it was reported that over 98% of these new malware samples are in fact "derivatives" (or variants) from existing malware families. In this paper, we first show that runtime behaviors of malware's core functionalities are in fact similar within a malware family. Hence, we propose a framework to combine "runtime behavior" with "static structures" to detect malware variants. We present the design and implementation of MONET, which has a client and a backend server module. The client module is a lightweight, in-device app for behavior monitoring and signature generation, and we realize this using two novel interception techniques. The backend server is responsible for large scale malware detection. We collect 3723 malware samples and top 500 benign apps to carry out extensive experiments of detecting malware variants and defending against malware transformation. Our experiments show that MONET can achieve around 99% accuracy in detecting malware variants. Furthermore, it can defend against 10 different obfuscation and transformation techniques, while only incurs around 7% performance overhead and about 3% battery overhead. More importantly, MONET will automatically alert users with intrusion details so to prevent further malicious behaviors.