66.4CVMay 12Code
RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq PredictionYaxuan Song, Jianan Fan, Tianyi Wang et al.
Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
SESep 18, 2023
EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal LearningLiuqing Chen, Yunnong Chen, Shuhong Xiao et al.
When translating UI design prototypes to code in industry, automatically generating code from design prototypes can expedite the development of applications and GUI iterations. However, in design prototypes without strict design specifications, UI components may be composed of fragmented elements. Grouping these fragmented elements can greatly improve the readability and maintainability of the generated code. Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements. Unfortunately, the performance of these methods is not satisfying due to visually overlapped and tiny UI elements. In this study, we propose EGFE, a novel method for automatically End-to-end Grouping Fragmented Elements via UI sequence prediction. To facilitate the UI understanding, we innovatively construct a Transformer encoder to model the relationship between the UI elements with multi-modal representation learning. The evaluation on a dataset of 4606 UI prototypes collected from professional UI designers shows that our method outperforms the state-of-the-art baselines in the precision (by 29.75\%), recall (by 31.07\%), and F1-score (by 30.39\%) at edit distance threshold of 4. In addition, we conduct an empirical study to assess the improvement of the generated front-end code. The results demonstrate the effectiveness of our method on a real software engineering application. Our end-to-end fragmented elements grouping method creates opportunities for improving UI-related software engineering tasks.
CVMar 1Code
GuiDINO: Rethinking Vision Foundation Model in Medical Image SegmentationZhuonan Liang, Wei Guo, Jie Gan et al.
Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions native foundation model to acting as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representation from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of medical dedicated architectures. Training relies on a guide supervision objective loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at https://github.com/Hi-FishU/GuiDINO
CVFeb 9, 2024Code
Multi-source-free Domain Adaptation via Uncertainty-aware Adaptive DistillationYaxuan Song, Jianan Fan, Dongnan Liu et al.
Source-free domain adaptation (SFDA) alleviates the domain discrepancy among data obtained from domains without accessing the data for the awareness of data privacy. However, existing conventional SFDA methods face inherent limitations in medical contexts, where medical data are typically collected from multiple institutions using various equipment. To address this problem, we propose a simple yet effective method, named Uncertainty-aware Adaptive Distillation (UAD) for the multi-source-free unsupervised domain adaptation (MSFDA) setting. UAD aims to perform well-calibrated knowledge distillation from (i) model level to deliver coordinated and reliable base model initialisation and (ii) instance level via model adaptation guided by high-quality pseudo-labels, thereby obtaining a high-performance target domain model. To verify its general applicability, we evaluate UAD on two image-based diagnosis benchmarks among two multi-centre datasets, where our method shows a significant performance gain compared with existing works. The code is available at https://github.com/YXSong000/UAD.
CVJul 19, 2025Code
Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology ImagesYaxuan Song, Jianan Fan, Hang Chang et al.
Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modeling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and processed datasets are available at https://github.com/YXSong000/Gene-DML.
HCFeb 7, 2024
ChatScratch: An AI-Augmented System Toward Autonomous Visual Programming Learning for Children Aged 6-12Liuqing Chen, Shuhong Xiao, Yunnong Chen et al.
As Computational Thinking (CT) continues to permeate younger age groups in K-12 education, established CT platforms such as Scratch face challenges in catering to these younger learners, particularly those in the elementary school (ages 6-12). Through formative investigation with Scratch experts, we uncover three key obstacles to children's autonomous Scratch learning: artist's block in project planning, bounded creativity in asset creation, and inadequate coding guidance during implementation. To address these barriers, we introduce ChatScratch, an AI-augmented system to facilitate autonomous programming learning for young children. ChatScratch employs structured interactive storyboards and visual cues to overcome artist's block, integrates digital drawing and advanced image generation technologies to elevate creativity, and leverages Scratch-specialized Large Language Models (LLMs) for professional coding guidance. Our study shows that, compared to Scratch, ChatScratch efficiently fosters autonomous programming learning, and contributes to the creation of high-quality, personally meaningful Scratch projects for children.
IVNov 22, 2024
Cell as Point: One-Stage Framework for Efficient Cell TrackingYaxuan Song, Jianan Fan, Heng Huang et al.
Conventional multi-stage cell tracking approaches rely heavily on detection or segmentation in each frame as a prerequisite, requiring substantial resources for high-quality segmentation masks and increasing the overall prediction time. To address these limitations, we propose CAP, a novel end-to-end one-stage framework that reimagines cell tracking by treating Cell as Point. Unlike traditional methods, CAP eliminates the need for explicit detection or segmentation, instead jointly tracking cells for sequences in one stage by leveraging the inherent correlations among their trajectories. This simplification reduces both labeling requirements and pipeline complexity. However, directly processing the entire sequence in one stage poses challenges related to data imbalance in capturing cell division events and long sequence inference. To solve these challenges, CAP introduces two key innovations: (1) adaptive event-guided (AEG) sampling, which prioritizes cell division events to mitigate the occurrence imbalance of cell events, and (2) the rolling-as-window (RAW) inference strategy, which ensures continuous and stable tracking of newly emerging cells over extended sequences. By removing the dependency on segmentation-based preprocessing while addressing the challenges of imbalanced occurrence of cell events and long-sequence tracking, CAP demonstrates promising cell tracking performance and is 8 to 32 times more efficient than existing methods. The code and model checkpoints will be available soon.
CVJun 20, 2025
On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive CountingZhuonan Liang, Dongnan Liu, Jianan Fan et al.
Object counting models suffer when deployed across domains with differing density variety, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment and provide a straightforward implementation. By theoretical analysis, our framework is feasible to achieve superior cross-domain generalization for counting. In the presented network, the features related to density are explicitly preserved across domains. Theoretically, we formalize the notion of conditional divergence by partitioning each domain into subsets and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. Empirically, we demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.