24.7 CV May 21
Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition
Ganlin Feng, Yuxi Long, Erin Lou et al.
Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that, at sufficient scale, synthetic-only training achieves performance comparable to real-data-only baselines across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. Together, these findings support the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, including clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children's healthcare.
12.4 CV Apr 3
RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
Ganlin Feng, Yuxi Long, Hafsa Ali et al.
Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
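The abstract above says generated images are filtered by facial landmark similarity but does not give the metric or the threshold. The following is a minimal sketch of what such a filter could look like, not the RDFace implementation. It assumes landmarks are already extracted as equal-length lists of (x, y) points; the function names, the similarity formula, and the 0.8 threshold are all hypothetical.

```python
import math


def normalize(landmarks):
    """Center landmarks at their centroid and scale them to unit RMS radius."""
    n = len(landmarks)
    cx = sum(x for x, _ in landmarks) / n
    cy = sum(y for _, y in landmarks) / n
    centered = [(x - cx, y - cy) for x, y in landmarks]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n)
    if scale == 0:
        raise ValueError("degenerate landmark set: all points coincide")
    return [(x / scale, y / scale) for x, y in centered]


def landmark_similarity(a, b):
    """Hypothetical score in (0, 1]: 1 / (1 + mean point-to-point distance)."""
    if len(a) != len(b):
        raise ValueError("landmark sets must have the same number of points")
    na, nb = normalize(a), normalize(b)
    mean_dist = sum(math.dist(p, q) for p, q in zip(na, nb)) / len(na)
    return 1.0 / (1.0 + mean_dist)


def filter_synthetic(real_sets, synthetic_sets, threshold=0.8):
    """Return indices of synthetic images whose best match among the
    real images reaches the (assumed) similarity threshold."""
    kept = []
    for i, synthetic in enumerate(synthetic_sets):
        best = max(landmark_similarity(synthetic, real) for real in real_sets)
        if best >= threshold:
            kept.append(i)
    return kept
```

Because the landmarks are normalized for translation and scale before comparison, a synthetic face that is merely shifted or resized relative to a real one still scores as a close match, while one whose landmark layout differs is dropped. Rotation is not handled in this sketch.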
2.2 CY Apr 4
Internet-Mediated Digital Informal Learning Portfolios in STEM Higher Education: A Computational Grounded Theory Study of Online Peer Advice Communities
Jianjun Xiao, Yuxi Long
Internet technologies have expanded higher education students' access to learning resources, peer guidance, and skill-development opportunities beyond formal curricula. Yet the ways students assemble these distributed online resources into coherent learning pathways remain insufficiently understood. This study examines how STEM students construct digital informal learning portfolios through internet-mediated peer advice and platform use. Drawing on Social Cognitive Career Theory (SCCT) and informal learning frameworks, we analyze 3,607 peer advice posts from a large online student community using Computational Grounded Theory (CGT). Results show that career pathway (69.6% of coded documents) and career orientation (59.7%) are the dominant organizing dimensions, yielding three distinct digital informal learning portfolios: a graduate-study portfolio centered on competition training, mathematical foundations, and staged preparation; an industry-employment portfolio centered on self-directed skill building, online platform learning, and strategically timed internships; and a public-sector portfolio characterized by dual-track hedging across graduate study, enterprise employment, and public-sector preparation pathways. The online peer community itself functions as a distributed informal curriculum, collectively producing and transmitting pathway-specific guidance about what to learn, when to learn it, and which internet resources to prioritize. These findings extend SCCT into the domain of internet-mediated digital informal learning and introduce career front-loading as a pattern of early learning reorganization. Implications are discussed for institutional learning support, recognition of internet-enabled learning, and the design of digital guidance infrastructures in higher education.
LG Feb 4
Pruning for Generalization: A Transfer-Oriented Spatiotemporal Graph Framework
Zihao Jing, Yuxi Long, Ganlin Feng
Multivariate time series forecasting in graph-structured domains is critical for real-world applications, yet existing spatiotemporal models often suffer from performance degradation under data scarcity and cross-domain shifts. We address these challenges through the lens of structure-aware context selection. We propose TL-GPSTGN, a transfer-oriented spatiotemporal framework that enhances sample efficiency and out-of-distribution generalization by selectively pruning non-optimized graph context. Specifically, our method employs information-theoretic and correlation-based criteria to extract structurally informative subgraphs and features, resulting in a compact, semantically grounded representation. This optimized context is subsequently integrated into a spatiotemporal convolutional architecture to capture complex multivariate dynamics. Evaluations on large-scale traffic benchmarks demonstrate that TL-GPSTGN consistently outperforms baselines in low-data transfer scenarios. Our findings suggest that explicit context pruning serves as a powerful inductive bias for improving the robustness of graph-based forecasting models.
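The abstract above mentions correlation-based criteria for extracting informative subgraphs without specifying them. As a rough illustration only, and not the TL-GPSTGN method, the sketch below keeps a graph edge when the time series at its two endpoints have a sufficiently large absolute Pearson correlation. The function names and the 0.5 cutoff are assumptions, and the information-theoretic criterion is not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def prune_edges(series, edges, min_abs_corr=0.5):
    """Keep only edges whose endpoint series are correlated at or above
    the (assumed) cutoff.

    series: dict mapping node id -> list of observations
    edges:  iterable of (u, v) node-id pairs
    """
    return [
        (u, v)
        for u, v in edges
        if abs(pearson(series[u], series[v])) >= min_abs_corr
    ]
```

For example, with `series = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}`, the edge between `a` and `b` is kept because the two series move together, while the edge between `a` and `c` is pruned.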