Miki Haseyama

CV
h-index76
53papers
496citations
Novelty50%
AI Score57

53 Papers

AIJun 2
Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making

Keigo Sakurai, Takahiro Ogawa, Miki Haseyama et al.

Financial decision-making tasks such as stock recommendation and portfolio allocation typically estimate future return and risk and then select trades or allocations for an investor, and the chosen optimization objective often determines realized performance. However, because market conditions evolve over time, a fixed objective can be suboptimal across regimes, while regime-switching pipelines that rely on latent regime estimates can be noisy or delayed and frequent switching can increase turnover and operational instability. In this paper, we propose DOSS (Dynamic Objective Selection with Safeguards), a learning-based selector that directly chooses the decision-relevant objective function at each time point from interpretable statistical summaries of recent returns, selecting among a small set of candidates (e.g., return-seeking, loss-averse, and risk-adjusted) without introducing intermediate regime variables. DOSS formulates objective selection as a classification problem over objectives and performs sequential updates with a rolling window to make forward-looking selections without temporal leakage, while also outputting a confidence score for each proposal. To mitigate misselection and excessive switching in deployment, DOSS applies confidence-aware gating with a fail-safe that overrides low-confidence proposals to a conservative default and enforces explicit controls tied to switching frequency. We further integrate governance by positioning a Large Language Model (LLM) as an oversight component rather than a generator of new objectives: the LLM is restricted to accept a proposed objective or override it to a predefined safe default, with deterministic rule-based constraints triggering overrides when needed.

CVOct 23, 2024
Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering

He Zhu, Ren Togo, Takahiro Ogawa et al.

Conventional medical artificial intelligence (AI) models face barriers in clinical application and ethical issues owing to their inability to handle the privacy-sensitive characteristics of medical data. We present a novel personalized federated learning (pFL) method for medical visual question answering (VQA) models, addressing privacy reliability challenges in the medical domain. Our method introduces learnable prompts into a Transformer architecture to efficiently train it on diverse medical datasets without massive computational costs. Then we introduce a reliable client VQA model that incorporates Dempster-Shafer evidence theory to quantify uncertainty in predictions, enhancing the model's reliability. Furthermore, we propose a novel inter-client communication mechanism that uses maximum likelihood estimation to balance accuracy and uncertainty, fostering efficient integration of insights across clients.

CVAug 16, 2024Code
Generative Dataset Distillation Based on Diffusion Model

Duo Su, Junjie Hou, Guang Li et al.

This paper presents our method for the generative track of The First Dataset Distillation Challenge at ECCV 2024. Since the diffusion model has become the mainstay of generative models because of its high-quality generative effects, we focus on distillation methods based on the diffusion model. Considering that the track can only generate a fixed number of images in 10 minutes using a generative model for CIFAR-100 and Tiny-ImageNet datasets, we need to use a generative model that can generate images at high speed. In this study, we proposed a novel generative dataset distillation method based on Stable Diffusion. Specifically, we use the SDXL-Turbo model which can generate images at high speed and quality. Compared to other diffusion models that can only generate images per class (IPC) = 1, our method can achieve an IPC = 10 for Tiny-ImageNet and an IPC = 20 for CIFAR-100, respectively. Additionally, to generate high-quality distilled datasets for CIFAR-100 and Tiny-ImageNet, we use the class information as text prompts and post data augmentation for the SDXL-Turbo model. Experimental results show the effectiveness of the proposed method, and we achieved third place in the generative track of the ECCV 2024 DD Challenge. Codes are available at https://github.com/Guang000/BANKO.

CVJun 2
Hierarchical Federated Learning with Dynamic Clustering and Adaptive Regularization for Robust Infrastructure Inspection

Yuhu Feng, Keisuke Maeda, Takahiro Ogawa et al.

The deployment of data-driven computer vision models for structural health monitoring (SHM) is heavily constrained by the data silo dilemma due to stringent privacy and security regulations. While federated learning (FL) offers a privacy-preserving collaborative alternative, its application to nationwide infrastructure networks is severely hindered by the challenge of ``double heterogeneity'': macro-level physical divergence across disparate structural types and micro-level statistical imbalances within local datasets. To overcome this challenge, this paper proposes a novel hierarchical federated learning framework. The framework orchestrates a synergistic two-tier optimization strategy. At the macro-level, a dynamic gradient-based clustering mechanism autonomously aggregates distributed clients into specialized expert groups based on their structural degradation trajectories, circumventing the need for prior geographical metadata. Concurrently, at the micro-level, an intra-cluster Dynamic Region-Adaptive Proximal Regularization (DRAPR) module computes a real-time statistical Non-IID Intensity Score for each client. By adaptively modulating a proximal penalty based on local label skewness and gradient divergence, DRAPR effectively calibrates local updates, mitigates client drift, and prevents the catastrophic forgetting of minority damage classes. Comprehensive evaluations on a large-scale, real-world structural inspection dataset demonstrate that the hierarchical integration of macro-clustering and micro-regularization successfully neutralizes dual-level heterogeneity, yielding highly robust and specialized diagnostic models for complex infrastructure inspection.

CVDec 6, 2022Code
Union-set Multi-source Model Adaptation for Semantic Segmentation

Zongyao Li, Ren Togo, Takahiro Ogawa et al.

This paper solves a generalized version of the problem of multi-source model adaptation for semantic segmentation. Model adaptation is proposed as a new domain adaptation problem which requires access to a pre-trained model instead of data for the source domain. A general multi-source setting of model adaptation assumes strictly that each source domain shares a common label space with the target domain. As a relaxation, we allow the label space of each source domain to be a subset of that of the target domain and require the union of the source-domain label spaces to be equal to the target-domain label space. For the new setting named union-set multi-source model adaptation, we propose a method with a novel learning strategy named model-invariant feature learning, which takes full advantage of the diverse characteristics of the source-domain models, thereby improving the generalization in the target domain. We conduct extensive experiments in various adaptation settings to show the superiority of our method. The code is available at https://github.com/lzy7976/union-set-model-adaptation.

CVSep 29, 2022
Compressed Gastric Image Generation Based on Soft-Label Dataset Distillation for Medical Data Sharing

Guang Li, Ren Togo, Takahiro Ogawa et al.

Background and objective: Sharing of medical data is required to enable the cross-agency flow of healthcare information and construct high-accuracy computer-aided diagnosis systems. However, the large sizes of medical datasets, the massive amount of memory of saved deep convolutional neural network (DCNN) models, and patients' privacy protection are problems that can lead to inefficient medical data sharing. Therefore, this study proposes a novel soft-label dataset distillation method for medical data sharing. Methods: The proposed method distills valid information of medical image data and generates several compressed images with different data distributions for anonymous medical data sharing. Furthermore, our method can extract essential weights of DCNN models to reduce the memory required to save trained models for efficient medical data sharing. Results: The proposed method can compress tens of thousands of images into several soft-label images and reduce the size of a trained model to a few hundredths of its original size. The compressed images obtained after distillation have been visually anonymized; therefore, they do not contain the private information of the patients. Furthermore, we can realize high-detection performance with a small number of compressed images. Conclusions: The experimental results show that the proposed method can improve the efficiency and security of medical data sharing.

CRSep 29, 2022
Dataset Distillation for Medical Dataset Sharing

Guang Li, Ren Togo, Takahiro Ogawa et al.

Sharing medical datasets between hospitals is challenging because of the privacy-protection problem and the massive cost of transmitting and storing many high-resolution medical images. However, dataset distillation can synthesize a small dataset such that models trained on it achieve comparable performance with the original large dataset, which shows potential for solving the existing medical sharing problems. Hence, this paper proposes a novel dataset distillation-based method for medical dataset sharing. Experimental results on a COVID-19 chest X-ray image dataset show that our method can achieve high detection performance even using scarce anonymized chest X-ray images.

IVJun 7, 2022
Self-Knowledge Distillation based Self-Supervised Learning for Covid-19 Detection from Chest X-Ray Images

Guang Li, Ren Togo, Takahiro Ogawa et al.

The global outbreak of the Coronavirus 2019 (COVID-19) has overloaded worldwide healthcare systems. Computer-aided diagnosis for COVID-19 fast detection and patient triage is becoming critical. This paper proposes a novel self-knowledge distillation based self-supervised learning method for COVID-19 detection from chest X-ray images. Our method can use self-knowledge of images based on similarities of their visual features for self-supervised learning. Experimental results show that our method achieved an HM score of 0.988, an AUC of 0.999, and an accuracy of 0.957 on the largest open COVID-19 chest X-ray dataset.

IVDec 19, 2022
COVID-19 Detection Based on Self-Supervised Transfer Learning Using Chest X-Ray Images

Guang Li, Ren Togo, Takahiro Ogawa et al.

Purpose: Considering several patients screened due to COVID-19 pandemic, computer-aided detection has strong potential in assisting clinical workflow efficiency and reducing the incidence of infections among radiologists and healthcare providers. Since many confirmed COVID-19 cases present radiological findings of pneumonia, radiologic examinations can be useful for fast detection. Therefore, chest radiography can be used to fast screen COVID-19 during the patient triage, thereby determining the priority of patient's care to help saturated medical facilities in a pandemic situation. Methods: In this paper, we propose a new learning scheme called self-supervised transfer learning for detecting COVID-19 from chest X-ray (CXR) images. We compared six self-supervised learning (SSL) methods (Cross, BYOL, SimSiam, SimCLR, PIRL-jigsaw, and PIRL-rotation) with the proposed method. Additionally, we compared six pretrained DCNNs (ResNet18, ResNet50, ResNet101, CheXNet, DenseNet201, and InceptionV3) with the proposed method. We provide quantitative evaluation on the largest open COVID-19 CXR dataset and qualitative results for visual inspection. Results: Our method achieved a harmonic mean (HM) score of 0.985, AUC of 0.999, and four-class accuracy of 0.953. We also used the visualization technique Grad-CAM++ to generate visual explanations of different classes of CXR images with the proposed method to increase the interpretability. Conclusions: Our method shows that the knowledge learned from natural images using transfer learning is beneficial for SSL of the CXR images and boosts the performance of representation learning for COVID-19 detection. Our method promises to reduce the incidence of infections among radiologists and healthcare providers.

CVSep 29, 2022
Dataset Distillation Using Parameter Pruning

Guang Li, Ren Togo, Takahiro Ogawa et al.

In this study, we propose a novel dataset distillation method based on parameter pruning. The proposed method can synthesize more robust distilled datasets and improve distillation performance by pruning difficult-to-match parameters during the distillation process. Experimental results on two benchmark datasets show the superiority of the proposed method.

CVJun 7, 2022
TriBYOL: Triplet BYOL for Self-Supervised Representation Learning

Guang Li, Ren Togo, Takahiro Ogawa et al.

This paper proposes a novel self-supervised learning method for learning better representations with small batch sizes. Many self-supervised learning methods based on certain forms of the siamese network have emerged and received significant attention. However, these methods need to use large batch sizes to learn good representations and require heavy computational resources. We present a new triplet network combined with a triple-view loss to improve the performance of self-supervised representation learning with small batch sizes. Experimental results show that our method can drastically outperform state-of-the-art self-supervised learning methods on several datasets in small-batch cases. Our method provides a feasible solution for self-supervised learning with real-world high-resolution images that uses small batch sizes.

IVDec 19, 2022
Boosting Automatic COVID-19 Detection Performance with Self-Supervised Learning and Batch Knowledge Ensembling

Guang Li, Ren Togo, Takahiro Ogawa et al.

Problem: Detecting COVID-19 from chest X-Ray (CXR) images has become one of the fastest and easiest methods for detecting COVID-19. However, the existing methods usually use supervised transfer learning from natural images as a pretraining process. These methods do not consider the unique features of COVID-19 and the similar features between COVID-19 and other pneumonia. Aim: In this paper, we want to design a novel high-accuracy COVID-19 detection method that uses CXR images, which can consider the unique features of COVID-19 and the similar features between COVID-19 and other pneumonia. Methods: Our method consists of two phases. One is self-supervised learning-based pertaining; the other is batch knowledge ensembling-based fine-tuning. Self-supervised learning-based pretraining can learn distinguished representations from CXR images without manually annotated labels. On the other hand, batch knowledge ensembling-based fine-tuning can utilize category knowledge of images in a batch according to their visual feature similarities to improve detection performance. Unlike our previous implementation, we introduce batch knowledge ensembling into the fine-tuning phase, reducing the memory used in self-supervised learning and improving COVID-19 detection accuracy. Results: On two public COVID-19 CXR datasets, namely, a large dataset and an unbalanced dataset, our method exhibited promising COVID-19 detection performance. Our method maintains high detection accuracy even when annotated CXR training images are reduced significantly (e.g., using only 10% of the original dataset). In addition, our method is insensitive to changes in hyperparameters.

CVJul 8, 2024
Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

Yaozong Gan, Guang Li, Ren Togo et al.

Recent multimodal large language models (MLLM) such as GPT-4o and GPT-4v have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on the MLLM for enhancing traffic sign recognition (TSR). We first construct a traffic sign detection network based on Vision Transformer Adapter and an extraction module to extract traffic signs from the original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce a cross-domain few-shot in-context learning method based on the MLLM. To enhance MLLM's fine-grained recognition ability of traffic signs, the proposed method generates corresponding description texts using template traffic signs. These description texts contain key information about the shape, color, and composition of traffic signs, which can stimulate the ability of MLLM to perceive fine-grained traffic sign categories. By using the description texts, our method reduces the cross-domain differences between template and real traffic signs. Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels. We perform comprehensive evaluations on the German traffic sign recognition benchmark dataset, the Belgium traffic sign dataset, and two real-world datasets taken from Japan. The experimental results show that our method significantly enhances the TSR performance.

CVSep 29, 2022
Dataset Complexity Assessment Based on Cumulative Maximum Scaled Area Under Laplacian Spectrum

Guang Li, Ren Togo, Takahiro Ogawa et al.

Dataset complexity assessment aims to predict classification performance on a dataset with complexity calculation before training a classifier, which can also be used for classifier selection and dataset reduction. The training process of deep convolutional neural networks (DCNNs) is iterative and time-consuming because of hyperparameter uncertainty and the domain shift introduced by different datasets. Hence, it is meaningful to predict classification performance by assessing the complexity of datasets effectively before training DCNN models. This paper proposes a novel method called cumulative maximum scaled Area Under Laplacian Spectrum (cmsAULS), which can achieve state-of-the-art complexity assessment performance on six datasets.

LGMay 8Code
Predictive but Not Plannable: RC-aux for Latent World Models

Wenyuan Li, Guang Li, Keisuke Maeda et al.

A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC-aux.

SDSep 2, 2024
MMT-BERT: Chord-aware Symbolic Music Generation Based on Multitrack Music Transformer and MusicBERT

Jinlong Zhu, Keigo Sakurai, Ren Togo et al.

We propose a novel symbolic music representation and Generative Adversarial Network (GAN) framework specially designed for symbolic multitrack music generation. The main theme of symbolic music generation primarily encompasses the preprocessing of music data and the implementation of a deep learning framework. Current techniques dedicated to symbolic music generation generally encounter two significant challenges: training data's lack of information about chords and scales and the requirement of specially designed model architecture adapted to the unique format of symbolic music representation. In this paper, we solve the above problems by introducing new symbolic music representation with MusicLang chord analysis model. We propose our MMT-BERT architecture adapting to the representation. To build a robust multitrack music generator, we fine-tune a pre-trained MusicBERT model to serve as the discriminator, and incorporate relativistic standard loss. This approach, supported by the in-depth understanding of symbolic music encoded within MusicBERT, fortifies the consonance and humanity of music generated by our method. Experimental results demonstrate the effectiveness of our approach which strictly follows the state-of-the-art methods.

CVMar 26
FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation

Hongxu Ma, Guang Li, Shijie Wang et al.

Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.

CVSep 3, 2024
Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild

Yaozong Gan, Guang Li, Ren Togo et al.

In this study, we propose Cross-domain Multi-step Thinking (CdMT) to improve zero-shot fine-grained traffic sign recognition (TSR) performance in the wild. Zero-shot fine-grained TSR in the wild is challenging due to the cross-domain problem between clean template traffic signs and real-world counterparts, and existing approaches particularly struggle with cross-country TSR scenarios, where traffic signs typically differ between countries. The proposed CdMT framework tackles these challenges by leveraging the multi-step reasoning capabilities of large multimodal models (LMMs). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for LMMs. Context descriptions, which are enhanced by center coordinate prompt optimization, enable the precise localization of target traffic signs in complex road images and filter irrelevant responses via novel prior traffic sign hypotheses. Characteristic descriptions, which are derived from in-context learning with template traffic signs, bridge cross-domain gaps and enhance fine-grained TSR. Differential descriptions refine the multimodal reasoning ability of LMMs by distinguishing subtle differences among similar signs. CdMT is independent of training data and requires only simple and uniform instructions, enabling it to achieve cross-country TSR. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries. The proposed CdMT framework achieved superior performance compared with other state-of-the-art methods on all five datasets, with recognition accuracies of 0.93, 0.89, 0.97, 0.89, and 0.85 on the GTSRB, BTSD, TT-100K, Sapporo, and Yokohama datasets, respectively.

CVMar 8, 2023
Interpretable Visual Question Answering Referring to Outside Knowledge

He Zhu, Ren Togo, Takahiro Ogawa et al.

We present a novel multimodal interpretable VQA model that can answer the question more accurately and generate diverse explanations. Although researchers have proposed several methods that can generate human-readable and fine-grained natural language sentences to explain a model's decision, these methods have focused solely on the information in the image. Ideally, the model should refer to various information inside and outside the image to correctly generate explanations, just as we use background knowledge daily. The proposed method incorporates information from outside knowledge and multiple image captions to increase the diversity of information available to the model. The contribution of this paper is to construct an interpretable visual question answering model using multimodal inputs to improve the rationality of generated results. Experimental results show that our model can outperform state-of-the-art methods regarding answer accuracy and explanation rationality.

CVNov 1, 2022
RGMIM: Region-Guided Masked Image Modeling for Learning Meaningful Representations from X-Ray Images

Guang Li, Ren Togo, Takahiro Ogawa et al.

In this study, we propose a novel method called region-guided masked image modeling (RGMIM) for learning meaningful representations from X-ray images. Our method adopts a new masking strategy that utilizes organ mask information to identify valid regions for learning more meaningful representations. We conduct quantitative evaluations on an open lung X-ray image dataset as well as masking ratio hyperparameter studies. When using the entire training set, RGMIM outperformed other comparable methods, achieving a 0.962 lung disease detection accuracy. Specifically, RGMIM significantly improved performance in small data volumes, such as 5% and 10% of the training set compared to other methods. RGMIM can mask more valid regions, facilitating the learning of discriminative representations and the subsequent high-accuracy lung disease detection. RGMIM outperforms other state-of-the-art self-supervised learning methods in experiments, particularly when limited training data is used.

CVJan 15
Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks

Mingzhuo Li, Guang Li, Linfeng Ye et al.

In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, therefore improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but have time and storage-consuming training processes. Dataset distillation is proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose leveraging characteristics that benefit the downstream training into data distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following the specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods. It also highlights the broader potential of difficulty for diverse downstream tasks.

LGJan 29
L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

Minghao Yang, Ren Togo, Guang Li et al.

Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank \& Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.

CVFeb 23
Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation

He Zhu, Ren Togo, Takahiro Ogawa et al.

Longitudinal medical report generation is clinically important yet remains challenging due to strict privacy constraints and the evolving nature of disease progression. Although federated learning (FL) enables collaborative training without data sharing, existing FL methods largely overlook longitudinal dynamics by assuming stationary client distributions, making them unable to model temporal shifts across visits or patient-specific heterogeneity-ultimately leading to unstable optimization and suboptimal report generation. We introduce Federated Temporal Adaptation (FTA), a federated setting that explicitly accounts for the temporal evolution of client data. Building upon this setting, we propose FedTAR, a framework that integrates demographic-driven personalization with time-aware global aggregation. FedTAR generates lightweight LoRA adapters from demographic embeddings and performs temporal residual aggregation, where updates from different visits are weighted by a meta-learned temporal policy optimized via first-order MAML. Experiments on J-MID (1M exams) and MIMIC-CXR demonstrate consistent improvements in linguistic accuracy, temporal coherence, and cross-site generalization, establishing FedTAR as a robust and privacy-preserving paradigm for federated longitudinal modeling.

CVNov 10, 2025
Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

Wenbo Huang, Jinghui Zhang, Zhenghao Chen et al.

Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module~(CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruct temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.

CVOct 31, 2025
Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness

Ren Tasai, Guang Li, Ren Togo et al.

We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.

IVJul 6, 2023
Few-shot Personalized Saliency Prediction Based on Interpersonal Gaze Patterns

Yuya Moroto, Keisuke Maeda, Takahiro Ogawa et al.

This study proposes a few-shot personalized saliency prediction method that leverages interpersonal gaze patterns. Unlike general saliency maps, personalized saliency maps (PSMs) capture individual visual attention and provide insights into individual visual preferences. However, predicting PSMs is challenging because of the complexity of gaze patterns and the difficulty of collecting extensive eye-tracking data from individuals. An effective strategy for predicting PSMs from limited data is the use of eye-tracking data from other persons. To efficiently handle the PSMs of other persons, this study focuses on the selection of images to acquire eye-tracking data and the preservation of the structural information of PSMs. In the proposed method, these images are selected such that they bring more diverse gaze patterns to persons, and structural information is preserved using tensor-based regression. The experimental results demonstrate that these two factors are beneficial for few-shot PSM prediction.

CVMay 18
SAS: Semantic-aware Sampling for Generative Dataset Distillation

Mingzhuo Li, Guang Li, Linfeng Ye et al.

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.

LGMay 30, 2025Code
Hyperbolic Dataset Distillation

Wenyuan Li, Guang Li, Keisuke Maeda et al.

To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. To the best of our knowledge, this is the first work to incorporate the hyperbolic space into the dataset distillation process. The code is available at https://github.com/Guang000/HDD.

LGSep 15, 2022
Gromov-Wasserstein Autoencoders

Nao Nakagawa, Ren Togo, Takahiro Ogawa et al.

Variational Autoencoder (VAE)-based generative models offer flexible representation learning by incorporating meta-priors, general premises considered beneficial for downstream tasks. However, the incorporated meta-priors often involve ad-hoc model deviations from the original likelihood architecture, causing undesirable changes in their training. In this paper, we propose a novel representation learning method, Gromov-Wasserstein Autoencoders (GWAE), which directly matches the latent and data distributions using the variational autoencoding scheme. Instead of likelihood-based objectives, GWAE models minimize the Gromov-Wasserstein (GW) metric between the trainable prior and given data distributions. The GW metric measures the distance structure-oriented discrepancy between distributions even with different dimensionalities, which provides a direct measure between the latent and data spaces. By restricting the prior family, we can introduce meta-priors into the latent space without changing their objective. The empirical comparisons with VAE-based models show that GWAE models work in two prominent meta-priors, disentanglement and clustering, with their GW objective unchanged.

CVJul 4, 2025Code
Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

Mingzhuo Li, Guang Li, Jiafeng Mao et al.

To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks. The code is available at https://github.com/SumomoTaku/DiffGuideSamp.

CVSep 11, 2025Code
Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation

Yuiko Uchida, Ren Togo, Keisuke Maeda et al.

This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on "objects," which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the "objectness" of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.

CVMay 8
Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models

Bincheng Peng, Guang Li, Ping Liu et al.

Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.

CVJan 29, 2024
Importance-Aware Adaptive Dataset Distillation

Guang Li, Ren Togo, Takahiro Ogawa et al.

Herein, we propose a novel dataset distillation method for constructing small informative datasets that preserve the information of the large original datasets. The development of deep learning models is enabled by the availability of large-scale datasets. Despite unprecedented success, large-scale datasets considerably increase the storage and transmission costs, resulting in a cumbersome model training process. Moreover, using raw data for training raises privacy and copyright concerns. To address these issues, a new task named dataset distillation has been introduced, aiming to synthesize a compact dataset that retains the essential information from the large original dataset. State-of-the-art (SOTA) dataset distillation methods have been proposed by matching gradients or network parameters obtained during training on real and synthetic datasets. The contribution of different network parameters to the distillation process varies, and uniformly treating them leads to degraded distillation performance. Based on this observation, we propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance by automatically assigning importance weights to different network parameters during distillation, thereby synthesizing more robust distilled datasets. IADD demonstrates superior performance over other SOTA dataset distillation methods based on parameter matching on multiple benchmark datasets and outperforms them in terms of cross-architecture generalization. In addition, the analysis of self-adaptive weights demonstrates the effectiveness of IADD. Furthermore, the effectiveness of IADD is validated in a real-world medical application such as COVID-19 detection.

CVApr 26, 2024
Generative Dataset Distillation: Balancing Global Structure and Local Details

Longzhen Li, Guang Li, Ren Togo et al.

In this paper, we propose a new dataset distillation method that considers balancing global structure and local details when distilling the information from a large dataset into a generative model. Dataset distillation has been proposed to reduce the size of the required dataset when training models. The conventional dataset distillation methods face the problem of long redeployment time and poor cross-architecture performance. Moreover, previous methods focused too much on the high-level semantic attributes between the synthetic dataset and the original dataset while ignoring the local features such as texture and shape. Based on the above understanding, we propose a new method for distilling the original image dataset into a generative model. Our method involves using a conditional generative adversarial network to generate the distilled dataset. Subsequently, we ensure balancing global structure and local details in the distillation process, continuously optimizing the generator for more information-dense dataset generation.

CVFeb 15, 2024
Prompt-based Personalized Federated Learning for Medical Visual Question Answering

He Zhu, Ren Togo, Takahiro Ogawa et al.

We present a novel prompt-based personalized federated learning (pFL) method to address data heterogeneity and privacy concerns in traditional medical visual question answering (VQA) methods. Specifically, we regard medical datasets from different organs as clients and use pFL to train personalized transformer-based VQA models for each client. To address the high computational complexity of client-to-client communication in previous pFL methods, we propose a succinct information sharing system by introducing prompts that are small learnable parameters. In addition, the proposed method introduces a reliability parameter to prevent the negative effects of low performance and irrelevant clients. Finally, extensive evaluations on various heterogeneous medical datasets attest to the effectiveness of our proposed method.

CVJan 8, 2025
Generative Dataset Distillation Based on Self-knowledge Distillation

Longzhen Li, Guang Li, Ren Togo et al.

Dataset distillation is an effective technique for reducing the cost and complexity of model training while maintaining performance by compressing large datasets into smaller, more efficient versions. In this paper, we present a novel generative dataset distillation method that can improve the accuracy of aligning prediction logits. Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data, thereby capturing the overall structure and relationships within the data. To further improve the accuracy of alignment, we introduce a standardization step on the logits before performing distribution matching, ensuring consistency in the range of logits. Through extensive experiments, we demonstrate that our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.

CVDec 10, 2024
Manta: Enhancing Mamba for Few-Shot Action Recognition of Long Sub-Sequence

Wenbo Huang, Jinghui Zhang, Guang Li et al.

In few-shot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively. However, the high computational complexity of mainstream Transformer-based methods limits their application. Recent Mamba demonstrates efficiency in modeling long sequences, but directly applying Mamba to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To solve these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). Firstly, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features. An Outer Module captures dependencies of timeline between these local features for implicit temporal alignment. Secondly, a hybrid contrastive learning paradigm, combining both supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in two parallel branches within Manta, enhancing Mamba for FSAR of long sub-sequence. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies prove that Manta significantly improves FSAR of long sub-sequence from multiple perspectives.

CVJan 8, 2025
Continual Self-supervised Learning Considering Medical Domain Knowledge in Chest CT Images

Ren Tasai, Guang Li, Ren Togo et al.

We propose a novel continual self-supervised learning method (CSSL) considering medical domain knowledge in chest CT images. Our approach addresses the challenge of sequential learning by effectively capturing the relationship between previously learned knowledge and new information at different stages. By incorporating an enhanced DER into CSSL and maintaining both diversity and representativeness within the rehearsal buffer of DER, the risk of data interference during pretraining is reduced, enabling the model to learn more richer and robust feature representations. In addition, we incorporate a mixup strategy and feature distillation to further enhance the model's ability to learn meaningful representations. We validate our method using chest CT images obtained under two different imaging conditions, demonstrating superior performance compared to state-of-the-art methods.

LGJul 7, 2025
Information-Guided Diffusion Sampling for Dataset Distillation

Linfeng Ye, Shayan Mohajer Hamidi, Guang Li et al.

Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) contextual information $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + β\mathrm{H}(X | Y)$ during the DM sampling process, where $β$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.

CVMar 27, 2024
Enhancing Generative Class Incremental Learning Performance with Model Forgetting Approach

Taro Togo, Ren Togo, Keisuke Maeda et al.

This study presents a novel approach to Generative Class Incremental Learning (GCIL) by introducing the forgetting mechanism, aimed at dynamically managing class information for better adaptation to streaming data. GCIL is one of the hot topics in the field of computer vision, and this is considered one of the crucial tasks in society, specifically the continual learning of generative models. The ability to forget is a crucial brain function that facilitates continual learning by selectively discarding less relevant information for humans. However, in the field of machine learning models, the concept of intentionally forgetting has not been extensively investigated. In this study we aim to bridge this gap by incorporating the forgetting mechanisms into GCIL, thereby examining their impact on the models' ability to learn in continual learning. Through our experiments, we have found that integrating the forgetting mechanisms significantly enhances the models' performance in acquiring new knowledge, underscoring the positive role that strategic forgetting plays in the process of continual learning.

LGFeb 6, 2025
StarMAP: Global Neighbor Embedding for Faithful Data Visualization

Koshi Watanabe, Keisuke Maeda, Takahiro Ogawa et al.

Neighbor embedding is widely employed to visualize high-dimensional data; however, it frequently overlooks the global structure, e.g., intercluster similarities, thereby impeding accurate visualization. To address this problem, this paper presents Star-attracted Manifold Approximation and Projection (StarMAP), which incorporates the advantage of principal component analysis (PCA) in neighbor embedding. Inspired by the property of PCA embedding, which can be viewed as the largest shadow of the data, StarMAP introduces the concept of \textit{star attraction} by leveraging the PCA embedding. This approach yields faithful global structure preservation while maintaining the interpretability and computational efficiency of neighbor embedding. StarMAP was compared with existing methods in the visualization tasks of toy datasets, single-cell RNA sequencing data, and deep representation. The experimental results show that StarMAP is simple but effective in realizing faithful visualizations.

CVNov 22, 2025
Decoupled Audio-Visual Dataset Distillation

Wenyuan Li, Guang Li, Keisuke Maeda et al.

Audio-Visual Dataset Distillation aims to compress large-scale datasets into compact subsets while preserving the performance of the original data. However, conventional Distribution Matching (DM) methods struggle to capture intrinsic cross-modal alignment. Subsequent studies have attempted to introduce cross-modal matching, but two major challenges remain: (i) independently and randomly initialized encoders lead to inconsistent modality mapping spaces, increasing training difficulty; and (ii) direct interactions between modalities tend to damage modality-specific (private) information, thereby degrading the quality of the distilled data. To address these challenges, we propose DAVDD, a pretraining-based decoupled audio-visual distillation framework. DAVDD leverages a diverse pretrained bank to obtain stable modality features and uses a lightweight decoupler bank to disentangle them into common and private representations. To effectively preserve cross-modal structure, we further introduce Common Intermodal Matching together with a Sample-Distribution Joint Alignment strategy, ensuring that shared representations are aligned both at the sample level and the global distribution level. Meanwhile, private representations are entirely isolated from cross-modal interaction, safeguarding modality-specific cues throughout distillation. Extensive experiments across multiple benchmarks show that DAVDD achieves state-of-the-art results under all IPC settings, demonstrating the effectiveness of decoupled representation learning for high-quality audio-visual dataset distillation. Code will be released.

CVOct 1, 2025
Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning

Minghao Yang, Ren Togo, Guang Li et al.

Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL). However, existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing during the transition from single-task to multi-task learning (STL to MTL). To address these limitations, we propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts. This design facilitates STL to MTL transition, enhances expert specialization, and cooperation. Furthermore, we incorporate fine-grained experts by increasing the number of LoRA experts while proportionally reducing their rank, enabling more effective knowledge sharing under a comparable parameter budget. Extensive experiments on the PASCAL-Context benchmark, under unified training settings, demonstrate that ASE consistently improves performance across diverse configurations and validates the effectiveness of fine-grained designs for MTL.

CLAug 30, 2025
Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

Ryo Takahashi, Naoki Saito, Keisuke Maeda et al.

Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in a personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning inspired by the process of humans' prompt engineering to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt for the realization of accurate personalized VER.

CVAug 28, 2025
Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification

Ayaka Tsutsumi, Guang Li, Ren Togo et al.

We propose a novel medical image classification method that integrates dual-model weight selection with self-knowledge distillation (SKD). In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, which pose significant challenges for their practical implementation. Thus, developing lightweight models that achieve comparable performance to large-scale models while maintaining computational efficiency is crucial. To address this, we employ a dual-model weight selection strategy that initializes two lightweight models with weights derived from a large pretrained model, enabling effective knowledge transfer. Next, SKD is applied to these selected models, allowing the use of a broad range of initial weight configurations without imposing additional excessive computational cost, followed by fine-tuning for the target classification tasks. By combining dual-model weight selection with self-knowledge distillation, our method overcomes the limitations of conventional approaches, which often fail to retain critical information in compact models. Extensive experiments on publicly available datasets-chest X-ray images, lung computed tomography scans, and brain magnetic resonance imaging scans-demonstrate the superior performance and robustness of our approach compared to existing methods.

LGMay 26, 2025
Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory

Mingzhuo Li, Guang Li, Jiafeng Mao et al.

Dataset distillation enables the training of deep neural networks with comparable performance in significantly reduced time by compressing large datasets into small and representative ones. Although the introduction of generative models has made great achievements in this field, the distributions of their distilled datasets are not diverse enough to represent the original ones, leading to a decrease in downstream validation accuracy. In this paper, we present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem. We introduce self-adaptive memory to align the distribution between distilled and real datasets, assessing the representativeness. The degree of alignment leads the diffusion model to generate more diverse datasets during the distillation process. Extensive experiments show that our method outperforms existing state-of-the-art methods in most situations, proving its ability to tackle dataset distillation tasks.

CVFeb 25, 2025
Personalized Federated Learning for Egocentric Video Gaze Estimation with Comprehensive Parameter Frezzing

Yuhu Feng, Keisuke Maeda, Takahiro Ogawa et al.

Egocentric video gaze estimation requires models to capture individual gaze patterns while adapting to diverse user data. Our approach leverages a transformer-based architecture, integrating it into a PFL framework where only the most significant parameters, those exhibiting the highest rate of change during training, are selected and frozen for personalization in client models. Through extensive experimentation on the EGTEA Gaze+ and Ego4D datasets, we demonstrate that FedCPF significantly outperforms previously reported federated learning methods, achieving superior recall, precision, and F1-score. These results confirm the effectiveness of our comprehensive parameters freezing strategy in enhancing model personalization, making FedCPF a promising approach for tasks requiring both adaptability and accuracy in federated learning settings.

CVJan 22, 2025
Triplet Synthesis For Enhancing Composed Image Retrieval via Counterfactual Image Generation

Kenta Uesugi, Naoki Saito, Keisuke Maeda et al.

Composed Image Retrieval (CIR) provides an effective way to manage and access large-scale visual data. Construction of the CIR model utilizes triplets that consist of a reference image, modification text describing desired changes, and a target image that reflects these changes. For effectively training CIR models, extensive manual annotation to construct high-quality training datasets, which can be time-consuming and labor-intensive, is required. To deal with this problem, this paper proposes a novel triplet synthesis method by leveraging counterfactual image generation. By controlling visual feature modifications via counterfactual image generation, our approach automatically generates diverse training triplets without any manual intervention. This approach facilitates the creation of larger and more expressive datasets, leading to the improvement of CIR model's performance.

LGOct 22, 2024
Hyperboloid GPLVM for Discovering Continuous Hierarchies via Nonparametric Estimation

Koshi Watanabe, Keisuke Maeda, Takahiro Ogawa et al.

Dimensionality reduction (DR) offers a useful representation of complex high-dimensional data. Recent DR methods focus on hyperbolic geometry to derive a faithful low-dimensional representation of hierarchical data. However, existing methods are based on neighbor embedding, frequently ruining the continual relation of the hierarchies. This paper presents hyperboloid Gaussian process (GP) latent variable models (hGP-LVMs) to embed high-dimensional hierarchical data with implicit continuity via nonparametric estimation. We adopt generative modeling using the GP, which brings effective hierarchical embedding and executes ill-posed hyperparameter tuning. This paper presents three variants that employ original point, sparse point, and Bayesian estimations. We establish their learning algorithms by incorporating the Riemannian optimization and active approximation scheme of GP-LVM. For Bayesian inference, we further introduce the reparameterization trick to realize Bayesian latent variable learning. In the last part of this paper, we apply hGP-LVMs to several datasets and show their ability to represent high-dimensional hierarchies in low-dimensional spaces.

CVJun 27, 2024
Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

Huaying Zhang, Rintaro Yanagi, Ren Togo et al.

This paper proposes a novel zero-shot composed image retrieval (CIR) method considering the query-target relationship by masked image-text pairs. The objective of CIR is to retrieve the target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text and use a pre-trained visual-language model to realize the retrieval. However, they do not consider the query-target relationship to train the textual inversion network to acquire information for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting the abundant image-text pairs that are convenient to obtain with a masking strategy for learning the query-target relationship, it is expected that accurate zero-shot CIR using a retrieval-focused textual inversion network can be realized. Experimental results show the effectiveness of the proposed method.