AI | Mar 5

Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning

arXiv:2603.05235v1 | 4 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of underutilized information in CLIP's text encoder for researchers and practitioners working on cross-domain few-shot learning, offering an incremental improvement to existing methods.

The paper investigates the "Lost Layers" phenomenon in CLIP's text encoder for Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL), where removing certain middle layers improves performance. The authors find that these layers contain beneficial information that goes underutilized because of visual gaps, and they propose a method to re-utilize this information at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts.

Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate that CLIP's text encoder is more suitable for cross-domain tasks. However, we find that removing certain middle layers of the text encoder can effectively improve performance in SF-CDFSL; we call these layers the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that the information in these layers is not harmful to the SF-CDFSL task but actually beneficial; visual gaps prevent this useful information from being fully utilized, which makes the layers seem redundant. Based on this understanding, unlike current works that simply remove these layers, we propose a method that teaches the model to re-utilize the information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-VtT.
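
The layer-removal observation in the abstract can be pictured with a small sketch. This is not the authors' code or their proposed method (which re-utilizes the layers rather than dropping them); it only illustrates what "removing certain middle layers" of a residual encoder stack means. The toy layers, the helper names `make_toy_layer`, `encode`, and `cosine`, and the skipped indices 5 to 7 are all assumptions made for illustration; the abstract does not say which layers are the lost ones.

```python
import math
import random
from typing import Callable, Iterable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(dim: int, seed: int) -> Layer:
    """Build a toy residual layer x + tanh(W x); a stand-in for a transformer block."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]

    def layer(x: Vector) -> Vector:
        update = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]
        return [xi + ui for xi, ui in zip(x, update)]

    return layer


def encode(x: Vector, layers: Sequence[Layer], skip: Iterable[int] = ()) -> Vector:
    """Run the layer stack; the input passes unchanged through skipped indices."""
    skipped = set(skip)
    for index, layer in enumerate(layers):
        if index in skipped:
            continue  # residual layers keep the shape, so dropping one is safe
        x = layer(x)
    return x


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


DIM, DEPTH = 8, 12
toy_layers = [make_toy_layer(DIM, seed) for seed in range(DEPTH)]
token = [1.0] * DIM

full = encode(token, toy_layers)
# Hypothetical choice of "middle" layers; the real indices are not given here.
ablated = encode(token, toy_layers, skip=range(5, 8))
similarity = cosine(full, ablated)
```

Comparing `full` with `ablated` (for example through `similarity`) shows how much the final embedding shifts when the middle blocks are bypassed. In an actual SF-CDFSL ablation the comparison would instead be few-shot accuracy on the target domain with and without those layers.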

