Yuxun Qu

h-index: 3

3 papers

68 citations

3 Papers

15.5 | CV | Dec 19, 2025 | Code

Jiaze Li, Jingyang Chen, Yuxun Qu et al.

We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
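The reinforcement-learning stage above is described as based on Group Relative Policy Optimization (GRPO). As a rough illustration of the group-relative idea only, the sketch below normalizes each sampled response's reward against the other responses drawn for the same prompt. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for this example, not details taken from the paper or its code release.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. Population std and `eps` are illustrative assumptions.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled responses, two judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response earns the same reward carries no signal.
flat_advantages = group_relative_advantages([2.0, 2.0])
```

Because the baseline is the group mean, responses that beat their siblings receive positive advantages and the rest receive negative ones, without a separately learned value model.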

9.1 | CV | Jun 19

Synergistic Dual-Branch Adaptation for Multi-modal Generalized Category Discovery

Yuxun Qu, Minyu Zhou, Yongqiang Tang et al.

Generalized Category Discovery (GCD) aims to classify old categories and discover new ones from unlabeled data. Recent multi-modal approaches introduce retrieved or synthesized texts into a dual-branch architecture to provide semantic cues complementary to visual features. However, the cross-modal synergy in existing dual-branch methods remains coarse and incomplete: the two modalities are encoded independently with the bias and noise in the derived text left unaddressed during encoding, and existing mutual learning strategies operate only on global class-level anchors, lacking fine-grained relational supervision. To address these limitations, we propose the Synergistic Dual-Branch Adaptation (SDBA) framework, which serves as a plug-and-play enhancement compatible with existing dual-branch methods such as GET and TextGCD. SDBA comprises two components: the cross-modal synergistic adapter inserts lightweight adapters into both branches and further injects visual information into the text adapter at each encoder layer to enhance text feature learning during encoding; the neighborhood mutual learning module enforces consistent local neighborhood distributions between the two branches via bidirectional KL divergence, providing fine-grained relational supervision for both old and new classes. Extensive experiments on six benchmarks demonstrate state-of-the-art performance, and consistent improvements on different baselines validate the broad scalability of the proposed framework.
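The neighborhood mutual learning module is described as matching local neighborhood distributions between the two branches via bidirectional KL divergence. The following is a minimal sketch of one way such a loss could look, assuming dot-product similarity, a softmax over all other samples in the batch, and a temperature of 0.1. These choices and all function names are hypothetical and are not taken from the SDBA paper.

```python
import math

def neighborhood_distribution(features, temperature=0.1):
    """For each sample, softmax over dot-product similarity to every other sample."""
    n = len(features)
    dists = []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(features[i], features[j])) / temperature
            for j in range(n) if j != i  # exclude the sample itself
        ]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bidirectional_kl(visual_feats, text_feats, temperature=0.1):
    """Mean over samples of KL(visual || text) + KL(text || visual)."""
    pv = neighborhood_distribution(visual_feats, temperature)
    pt = neighborhood_distribution(text_feats, temperature)
    per_sample = [kl(a, b) + kl(b, a) for a, b in zip(pv, pt)]
    return sum(per_sample) / len(per_sample)

# Toy batch of three samples with 2-d features per branch (made-up values).
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]

loss_mismatch = bidirectional_kl(visual, text)
loss_match = bidirectional_kl(visual, visual)
```

The loss is zero when both branches induce the same neighborhood structure and grows as they disagree, which is the sense in which it supplies relational rather than class-level supervision. It needs no labels, so it applies to old and new classes alike.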

3.7 | CV | Oct 29, 2024

AdaptGCD: Multi-Expert Adapter Tuning for Generalized Category Discovery

Yuxun Qu, Yongqiang Tang, Chenyang Zhang et al.

Unlike the traditional semi-supervised learning paradigm, which is constrained by the closed-world assumption, Generalized Category Discovery (GCD) presumes that the unlabeled dataset contains new categories that do not appear in the labeled set, and aims not only to classify old categories but also to discover new categories in the unlabeled data. Existing studies on GCD typically focus on transferring general knowledge from a self-supervised pretrained model to the target GCD task via fine-tuning strategies such as partial tuning and prompt learning. Nevertheless, these fine-tuning methods fail to strike a sound balance between the generalization capacity of the pretrained backbone and the adaptability to the GCD task. To fill this gap, we propose a novel adapter-tuning-based method named AdaptGCD, which is the first work to introduce adapter tuning into the GCD task and provides key insights expected to inform future research. Furthermore, considering the discrepancy in supervision information between the old and new classes, we devise a multi-expert adapter structure equipped with a route assignment constraint, such that data from old and new classes are separated into different expert groups. Extensive experiments on 7 widely used datasets show remarkable performance improvements, highlighting the effectiveness of our proposals.
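To make the multi-expert adapter idea more concrete, here is a small sketch of a gated mixture of bottleneck adapters, together with a toy penalty that pushes old-class and new-class samples toward separate expert groups. The abstract does not give the exact form of the route assignment constraint, so the penalty, the `is_old` flag (which in practice might come from labels or pseudo-labels), and all names and weights below are assumptions for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adapter(x, down, up):
    """Bottleneck adapter: down-project, ReLU, up-project, add residual."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in down]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in up]
    return [xi + oi for xi, oi in zip(x, out)]

def multi_expert_adapter(x, experts, gate_weights):
    """Mix the expert adapter outputs with a softmax gate computed from x."""
    gate = softmax([sum(w * xi for w, xi in zip(row, x)) for row in gate_weights])
    outputs = [adapter(x, down, up) for down, up in experts]
    mixed = [sum(g * o[d] for g, o in zip(gate, outputs)) for d in range(len(x))]
    return mixed, gate

def route_assignment_penalty(gate, is_old, old_experts):
    """Hypothetical penalty: negative log of the gate mass on the sample's group."""
    mass_old = sum(gate[i] for i in old_experts)
    target_mass = mass_old if is_old else 1.0 - mass_old
    return -math.log(target_mass + 1e-12)

# Two experts over 2-d features with a 1-d bottleneck (made-up weights).
# Zero up-projections make each adapter start as an identity mapping.
experts = [
    ([[0.5, 0.5]], [[0.0], [0.0]]),
    ([[1.0, -1.0]], [[0.0], [0.0]]),
]
gate_weights = [[2.0, 0.0], [0.0, 2.0]]  # one gate row per expert

x = [1.0, -1.0]
mixed, gate = multi_expert_adapter(x, experts, gate_weights)

# Expert 0 is designated for old classes, expert 1 for new classes.
penalty_if_old = route_assignment_penalty(gate, is_old=True, old_experts=[0])
penalty_if_new = route_assignment_penalty(gate, is_old=False, old_experts=[0])
```

In this toy setup the gate favors expert 0 for `x`, so the penalty is small if `x` is treated as an old-class sample and large if it is treated as a new-class one. Minimizing a term of this kind would encourage the separation into expert groups that the abstract describes.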