Mingli Song

h-index3

6papers

82citations

Novelty46%

AI Score46

Ranked #40,168 of 194,257 authors (top 21%)#14,185 in CV (top 24%)

6 Papers

11.1AIDec 29, 2025Code

Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following

Kongcheng Zhang, Qi Yao, Shunyu Liu et al.

Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset is available at https://github.com/sastpg/HIR.

1.5CVNov 19, 2023Code

Pair-wise Layer Attention with Spatial Masking for Video Prediction

Ping Li, Chenhan Zhang, Zheng Yang et al.

Video prediction yields future frames by employing the historical frames and has exhibited its great potential in many applications, e.g., meteorological prediction, and autonomous driving. Previous works often decode the ultimate high-level semantic features to future frames without texture details, which deteriorates the prediction quality. Motivated by this, we develop a Pair-wise Layer Attention (PLA) module to enhance the layer-wise semantic dependency of the feature maps derived from the U-shape structure in Translator, by coupling low-level visual cues and high-level features. Hence, the texture details of predicted frames are enriched. Moreover, most existing methods capture the spatiotemporal dynamics by Translator, but fail to sufficiently utilize the spatial features of Encoder. This inspires us to design a Spatial Masking (SM) module to mask partial encoding features during pretraining, which adds the visibility of remaining feature pixels by Decoder. To this end, we present a Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video prediction to capture the spatiotemporal dynamics, which reflect the motion trend. Extensive experiments and rigorous ablation studies on five benchmarks demonstrate the advantages of the proposed approach. The code is available at GitHub.

21.8CVMar 4, 2024Code

Training-Free Pretrained Model Merging

Zhengqi Xu, Ke Yuan, Huiqiong Wang et al.

Recently, model merging techniques have surfaced as a solution to combine multiple single-talent models into a single multi-talent model. However, previous endeavors in this field have either necessitated additional training or fine-tuning processes, or require that the models possess the same pre-trained initialization. In this work, we identify a common drawback in prior works w.r.t. the inconsistency of unit similarity in the weight space and the activation space. To address this inconsistency, we propose an innovative model merging framework, coined as merging under dual-space constraints (MuDSC). Specifically, instead of solely maximizing the objective of a single space, we advocate for the exploration of permutation matrices situated in a region with a unified high similarity in the dual space, achieved through the linear combination of activation and weight similarity matrices. In order to enhance usability, we have also incorporated adaptations for group structure, including Multi-Head Attention and Group Normalization. Comprehensive experimental comparisons demonstrate that MuDSC can significantly boost the performance of merged models with various task combinations and architectures. Furthermore, the visualization of the merged model within the multi-task loss landscape reveals that MuDSC enables the merged model to reside in the overlapping segment, featuring a unified lower loss for each task. Our code is publicly available at https://github.com/zju-vipa/training_free_model_merging.

14.4LGDec 29, 2024Code

Training-free Heterogeneous Model Merging

Zhengqi Xu, Han Zheng, Jie Song et al.

Model merging has attracted significant attention as a powerful paradigm for model reuse, facilitating the integration of task-specific models into a singular, versatile framework endowed with multifarious capabilities. Previous studies, predominantly utilizing methods such as Weight Average (WA), have shown that model merging can effectively leverage pretrained models without the need for laborious retraining. However, the inherent heterogeneity among models poses a substantial constraint on its applicability, particularly when confronted with discrepancies in model architectures. To overcome this challenge, we propose an innovative model merging framework designed for heterogeneous models, encompassing both depth and width heterogeneity. To address depth heterogeneity, we introduce a layer alignment strategy that harmonizes model layers by segmenting deeper models, treating consecutive layers with similar representations as a cohesive segment, thus enabling the seamless merging of models with differing layer depths. For width heterogeneity, we propose a novel elastic neuron zipping algorithm that projects the weights from models of varying widths onto a common dimensional space, eliminating the need for identical widths. Extensive experiments validate the efficacy of these proposed methods, demonstrating that the merging of structurally heterogeneous models can achieve performance levels comparable to those of homogeneous merging, across both vision and NLP tasks. Our code is publicly available at https://github.com/zju-vipa/training_free_heterogeneous_model_merging.

5.1GRMay 17, 2013

Sparse Norm Filtering

Chengxi Ye, Dacheng Tao, Mingli Song et al.

Optimization-based filtering smoothes an image by minimizing a fidelity function and simultaneously preserves edges by exploiting a sparse norm penalty over gradients. It has obtained promising performance in practical problems, such as detail manipulation, HDR compression and deblurring, and thus has received increasing attentions in fields of graphics, computer vision and image processing. This paper derives a new type of image filter called sparse norm filter (SNF) from optimization-based filtering. SNF has a very simple form, introduces a general class of filtering techniques, and explains several classic filters as special implementations of SNF, e.g. the averaging filter and the median filter. It has advantages of being halo free, easy to implement, and low time and memory costs (comparable to those of the bilateral filter). Thus, it is more generic than a smoothing operator and can better adapt to different tasks. We validate the proposed SNF by a wide variety of applications including edge-preserving smoothing, outlier tolerant filtering, detail manipulation, HDR compression, non-blind deconvolution, image segmentation, and colorization.

7.5CVFeb 3, 2013

Sparse Camera Network for Visual Surveillance -- A Comprehensive Survey

Mingli Song, Dachent Tao, Stephen J. Maybank

Technological advances in sensor manufacture, communication, and computing are stimulating the development of new applications that are transforming traditional vision systems into pervasive intelligent camera networks. The analysis of visual cues in multi-camera networks enables a wide range of applications, from smart home and office automation to large area surveillance and traffic surveillance. While dense camera networks - in which most cameras have large overlapping fields of view - are well studied, we are mainly concerned with sparse camera networks. A sparse camera network undertakes large area surveillance using as few cameras as possible, and most cameras have non-overlapping fields of view with one another. The task is challenging due to the lack of knowledge about the topological structure of the network, variations in the appearance and motion of specific tracking targets in different views, and the difficulties of understanding composite events in the network. In this review paper, we present a comprehensive survey of recent research results to address the problems of intra-camera tracking, topological structure learning, target appearance modeling, and global activity understanding in sparse camera networks. A number of current open research issues are discussed.