Ruijie Yao

CV
h-index: 33
Papers: 5
Citations: 14
Novelty: 45%
AI Score: 48

5 Papers

CV | Aug 28, 2023 | Code
GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition

Ruijie Yao, Sheng Jin, Lumin Xu et al.

Multi-Label Image Recognition (MLIR) is a challenging task that aims to predict multiple object labels in a single image while modeling the complex relationships between labels and image regions. Although convolutional neural networks and vision transformers have succeeded in processing images as regular grids of pixels or patches, these representations are sub-optimal for capturing irregular and discontinuous regions of interest. In this work, we present the first fully graph convolutional model, Group K-nearest neighbor based Graph convolutional Network (GKGNet), which models the connections between semantic label embeddings and image patches in a flexible and unified graph structure. To address the scale variance of different objects and to capture information from multiple perspectives, we propose the Group KGCN module for dynamic graph construction and message passing. Our experiments demonstrate that GKGNet achieves state-of-the-art performance with significantly lower computational costs on two challenging multi-label datasets, MS-COCO and VOC2007. Code is available at https://github.com/jin-s13/GKGNet.
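To make the idea of group-wise K-nearest-neighbor graph construction concrete, the following is a minimal sketch, not the authors' implementation. It assumes that feature channels are split into equal groups and that each label embedding is connected to its k nearest image patches separately within each group, using squared Euclidean distance. The function name `group_knn` and its parameters are hypothetical.

```python
def group_knn(queries, keys, num_groups, k):
    """Return neighbors[g][q]: indices of the k keys nearest to query q,
    measured only on the channels that belong to group g.

    queries, keys: lists of equal-length feature vectors (lists of floats).
    This is an illustrative assumption about how a group KNN graph
    could be built, not the GKGNet code.
    """
    dim = len(keys[0])
    if dim % num_groups != 0:
        raise ValueError("feature dimension must be divisible by num_groups")
    width = dim // num_groups
    neighbors = []
    for g in range(num_groups):
        lo, hi = g * width, (g + 1) * width
        per_query = []
        for q in queries:
            # Squared Euclidean distance restricted to this group's channels.
            dists = [
                sum((q[c] - key[c]) ** 2 for c in range(lo, hi))
                for key in keys
            ]
            # Sort key indices by distance; ties are broken by index.
            order = sorted(range(len(keys)), key=lambda i: (dists[i], i))
            per_query.append(order[:k])
        neighbors.append(per_query)
    return neighbors


# One label embedding, three patch features, two channel groups.
label_embeddings = [[0.0, 0.0, 1.0, 1.0]]
patches = [[0.0, 0.0, 5.0, 5.0], [3.0, 3.0, 1.0, 1.0], [0.1, 0.0, 4.0, 4.0]]
edges = group_knn(label_embeddings, patches, num_groups=2, k=1)
```

In this toy input the two groups select different patches for the same label (patch 0 in the first group, patch 1 in the second), which illustrates how grouping can expose several views of the same label-to-patch relationship.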

HC | Apr 28 | Code
Feature Anchors for Time-Series Sensor-Based Human Activity Recognition

Ruijie Yao, Chenhang Li, Danyang Zhuo et al.

Wearable Human Activity Recognition (HAR) still lacks a representation that is both explicit and adaptable. Handcrafted time-series features (TSFs) capture meaningful motion statistics and remain competitive on standard benchmarks, but they are usually used as fixed preprocessing outputs. Deep models learn adaptable representations directly from raw signals, but those representations are typically latent and difficult to inspect. We address this gap by treating handcrafted TSFs as feature anchors: explicit intermediate representations that remain inside the model and are adjusted by neural context instead of being discarded. We propose the Temporal Conditioning Network for Feature Anchors (TCNet), which extracts handcrafted anchors, encodes complementary time-domain and frequency-domain context from raw IMU windows, and predicts context-conditioned scale, bias, and gating parameters to modulate anchor groups directly in feature space. This design keeps anchor semantics visible while allowing the representation to adapt to the classification objective. Across five HAR benchmarks, TCNet achieves 70.2% mF1 on USC-HAD, 85.1% mF1 on Daphnet, 93.9% mF1 on MHealth, and 94.5% mF1 on PAMAP2. Relative to rTsfNet, it improves by 4.5 points on USC-HAD, 14.6 points on Daphnet, and 6.5 points on MHealth. Ablations show that the gains come primarily from anchor guidance rather than simple branch fusion, and feature-space analyses indicate that several discriminative TSF families are not reliably accessible in standard latent representations. These results suggest that, for HAR, handcrafted TSFs are most useful when they remain explicit and adaptable within the model. The code is available at: https://github.com/ni-x-lab/TCNet-har
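The scale, bias, and gating modulation described above can be sketched as follows. This is a hedged illustration only: the abstract does not give the exact formula, so the blend used here (a sigmoid gate that mixes the original anchor with its affinely transformed version) is an assumption, and `modulate_anchors` and its argument names are hypothetical. In TCNet the three parameters are predicted from raw IMU context; here they are passed in directly.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modulate_anchors(anchors, scale, bias, gate_logit):
    """Modulate handcrafted feature anchors group by group.

    anchors: dict mapping group name -> list of handcrafted feature values.
    scale, bias, gate_logit: dicts mapping group name -> float.
    Assumed rule: out = (1 - g) * a + g * ((1 + scale) * a + bias),
    with g = sigmoid(gate_logit). Not taken from the paper.
    """
    out = {}
    for name, values in anchors.items():
        g = sigmoid(gate_logit[name])
        out[name] = [
            (1.0 - g) * v + g * ((1.0 + scale[name]) * v + bias[name])
            for v in values
        ]
    return out


# A closed gate (very negative logit) leaves the anchor essentially unchanged,
# so the handcrafted feature keeps its original meaning.
anchors = {"mean": [2.0], "energy": [2.0]}
modulated = modulate_anchors(
    anchors,
    scale={"mean": 0.5, "energy": 0.5},
    bias={"mean": 1.0, "energy": 1.0},
    gate_logit={"mean": 0.0, "energy": -50.0},
)
```

Because the output stays in the same feature space as the input anchors, each value can still be read as a (shifted and rescaled) version of the original handcrafted statistic.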

SD | Mar 29
Investigation on the Robustness of Acoustic Foundation Models on Post Exercise Speech

Xiangyuan Xue, Yuyu Wang, Ruijie Yao et al.

Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on Post-All. Fine-tuning substantially improves several CTC-based models, whereas Whisper shows unstable adaptation. As an exploratory case study, we further stratify results by fluent and non-fluent speakers; although the non-fluent subset is small, it is consistently more challenging than the fluent subset. Overall, our findings show that post-exercise ASR robustness is strongly model-dependent, that in-domain adaptation can be highly effective but not uniformly stable, and that future post-exercise ASR studies should explicitly separate fluency-related effects from exercise-induced speech variation.
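For readers unfamiliar with the reported metrics, word error rate (WER) and character error rate (CER) are both edit distances normalized by reference length. The sketch below uses the standard Levenshtein definition; it is not the paper's evaluation script. Text normalization is left out, and removing spaces before computing CER is an assumption (conventions differ). The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, with spaces removed (an assumed convention)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# A repetition plus a dropped word, as might occur in breathless speech.
reference = "i need to catch my breath"
hypothesis = "i i need to catch breath"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this made-up example, one inserted word and one deleted word against a six-word reference give a WER of 2/6. Repetitions and dropped words of this kind are the disfluencies the abstract associates with post-exercise speech.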

CV | Apr 30, 2024 | Code
UniFS: Universal Few-shot Instance Perception with Point Representations

Sheng Jin, Ruijie Yao, Lumin Xu et al.

Instance perception tasks (object detection, instance segmentation, pose estimation, counting) play a key role in industrial applications of visual models. As supervised learning methods suffer from high labeling cost, few-shot learning methods that learn effectively from a limited number of labeled examples are desired. Existing few-shot learning methods primarily focus on a restricted set of tasks, presumably due to the challenges involved in designing a generic model capable of representing diverse tasks in a unified manner. In this paper, we propose UniFS, a universal few-shot instance perception model that unifies a wide range of instance perception tasks by reformulating them into a dynamic point representation learning framework. Additionally, we propose Structure-Aware Point Learning (SAPL) to exploit the higher-order structural relationship among points to further enhance representation learning. Our approach makes minimal assumptions about the tasks, yet it achieves competitive results compared to highly specialized, well-optimized models. Code and data are available at https://github.com/jin-s13/UniFS.

IV | May 9, 2025
Predicting Diabetic Macular Edema Treatment Responses Using OCT: Dataset and Methods of APTOS Competition

Weiyi Zhang, Peranut Chotcomwongse, Yinwen Li et al.

Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance this research, we organized the 2nd Asia-Pacific Tele-Ophthalmology Society (APTOS) Big Data Competition in 2021. The competition focused on improving predictive accuracy for anti-VEGF therapy responses using ophthalmic OCT images. We provided a dataset containing tens of thousands of OCT images from 2,000 patients with labels across four sub-tasks. This paper details the competition's structure, dataset, leading methods, and evaluation metrics. The competition attracted strong scientific community participation, with 170 teams initially registering and 41 reaching the final round. The top-performing team achieved an AUC of 80.06%, highlighting the potential of AI in personalized DME treatment and clinical decision-making.
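The headline result is reported as an AUC. As a reminder of what that number measures, the sketch below computes the area under the ROC curve for a binary task using its pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. This is the textbook definition, not the competition's scoring code, and the labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of predicted scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = responder, 0 = non-responder.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
```

Here five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6. The pairwise form is quadratic in the number of examples, which is acceptable for a small illustration but would be replaced by a rank-based computation on a large dataset.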