Masakazu Yoshimura

CV
h-index15
16papers
111citations
Novelty52%
AI Score52

16 Papers

CVNov 2, 2022
DynamicISP: Dynamically Controlled Image Signal Processor for Image Recognition

Masakazu Yoshimura, Junji Otsuka, Atsushi Irie et al.

Image Signal Processors (ISPs) play important roles in image recognition tasks as well as in the perceptual quality of captured images. In most cases, experts make a lot of effort to manually tune many parameters of ISPs, but the parameters are sub-optimal. In the literature, two types of techniques have been actively studied: a machine learning-based parameter tuning technique and a DNN-based ISP technique. The former is lightweight but lacks expressive power. The latter has expressive power, but the computational cost is too heavy on edge devices. To solve these problems, we propose "DynamicISP," which consists of multiple classical ISP functions and dynamically controls the parameters of each frame according to the recognition result of the previous frame. We show our method successfully controls the parameters of multiple ISP functions and achieves state-of-the-art accuracy with low computational cost in single and multi-category object detection tasks.

CVOct 28, 2022
Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments

Masakazu Yoshimura, Junji Otsuka, Atsushi Irie et al.

Image recognition models that work in challenging environments (e.g., extremely dark, blurry, or high dynamic range conditions) must be useful. However, creating training datasets for such environments is expensive and hard due to the difficulties of data collection and annotation. It is desirable if we could get a robust model without the need for hard-to-obtain datasets. One simple approach is to apply data augmentation such as color jitter and blur to standard RGB (sRGB) images in simple scenes. Unfortunately, this approach struggles to yield realistic images in terms of pixel intensity and noise distribution due to not considering the non-linearity of Image Signal Processors (ISPs) and noise characteristics of image sensors. Instead, we propose a noise-accounted RAW image augmentation method. In essence, color jitter and blur augmentation are applied to a RAW image before applying non-linear ISP, resulting in realistic intensity. Furthermore, we introduce a noise amount alignment method that calibrates the domain gap in the noise property caused by the augmentation. We show that our proposed noise-accounted RAW augmentation method doubles the image recognition accuracy in challenging environments only with simple training data.

CVMar 24, 2023
Self-Supervised Reversed Image Signal Processing via Reference-Guided Dynamic Parameter Selection

Junji Otsuka, Masakazu Yoshimura, Takeshi Ohashi

Unprocessed sensor outputs (RAW images) potentially improve both low-level and high-level computer vision algorithms, but the lack of large-scale RAW image datasets is a barrier to research. Thus, reversed Image Signal Processing (ISP) which converts existing RGB images into RAW images has been studied. However, most existing methods require camera-specific metadata or paired RGB and RAW images to model the conversion, and they are not always available. In addition, there are issues in handling diverse ISPs and recovering global illumination. To tackle these limitations, we propose a self-supervised reversed ISP method that does not require metadata and paired images. The proposed method converts a RGB image into a RAW-like image taken in the same environment with the same sensor as a reference RAW image by dynamically selecting parameters of the reversed ISP pipeline based on the reference RAW image. The parameter selection is trained via pseudo paired data created from unpaired RGB and RAW images. We show that the proposed method is able to learn various reversed ISPs with comparable accuracy to other state-of-the-art supervised methods and convert unknown RGB images from COCO and Flickr1M to target RAW-like images more accurately in terms of pixel distribution. We also demonstrate that our generated RAW images improve performance on real RAW image object detection task.

34.7LGApr 17
LLM as a Tool, Not an Agent: Code-Mined Tree Transformations for Neural Architecture Search

Masakazu Yoshimura, Zitang Sun, Yuiko Sakuma et al.

Neural Architecture Search (NAS) aims to automatically discover high-performing deep neural network (DNN) architectures. However, conventional algorithm-driven NAS relies on carefully hand-crafted search spaces to ensure executability, which restricts open-ended exploration. Recent coding-based agentic approaches using large language models (LLMs) reduce manual design, but current LLMs struggle to reliably generate complex, valid architectures, and their proposals are often biased toward a narrow set of patterns observed in their training data. To bridge reliable algorithmic search with powerful LLM assistance, we propose LLMasTool, a hierarchical tree-based NAS framework for stable and open-ended model evolution. Our method automatically extracts reusable modules from arbitrary source code and represents full architectures as hierarchical trees, enabling evolution through reliable tree transformations rather than code generation. At each evolution step, coarse-level planning is governed by a diversity-guided algorithm that leverages Bayesian modeling to improve exploration efficiency, while the LLM resolves the remaining degrees of freedom to ensure a meaningful evolutionary trajectory and an executable generated architecture. With this formulation, instead of fully agentic LLM approaches, our method explores diverse directions beyond the inherent biases in the LLM. Our method improves over existing NAS methods by 0.69, 1.83, and 2.68 points on CIFAR-10, CIFAR-100, and ImageNet16-120, demonstrating its effectiveness.

CVNov 9, 2022
Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer

Siddharth Sagar Nijhawan, Leo Hoshikawa, Atsushi Irie et al.

We propose a light-weight and highly efficient Joint Detection and Tracking pipeline for the task of Multi-Object Tracking using a fully-transformer architecture. It is a modified version of TransTrack, which overcomes the computational bottleneck associated with its design, and at the same time, achieves state-of-the-art MOTA score of 73.20%. The model design is driven by a transformer based backbone instead of CNN, which is highly scalable with the input resolution. We also propose a drop-in replacement for Feed Forward Network of transformer encoder layer, by using Butterfly Transform Operation to perform channel fusion and depth-wise convolution to learn spatial context within the feature maps, otherwise missing within the attention maps of the transformer. As a result of our modifications, we reduce the overall model size of TransTrack by 58.73% and the complexity by 78.72%. Therefore, we expect our design to provide novel perspectives for architecture optimization in future research related to multi-object tracking.

23.9CVMay 19
Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

Yuiko Sakuma, Masakazu Yoshimura, Marcel Gröpl et al.

Current neural architecture search (NAS) methods are often limited by their predefined, restrictive search spaces. While recent large language model (LLM)-assisted NAS methods enable open-ended search spaces, they often suffer from inefficient exploration due to biased or low-quality design ideas. To address these issues, we propose to semi-automatically structure model design knowledge to guide the search process. Our approach first defines a high-level structural template of architectural attributes. An LLM then populates this template by analyzing papers, creating a rich and diverse search space that embodies this structured design knowledge. To efficiently explore this vast space, we introduce FairNAD, using a multi-type mutation that enables broad exploration through mutation with fair idea sampling, Pareto-aware mutation, LLM-driven iterative mutation, and a fine-grained feedback loop. We demonstrate the effectiveness of FairNAD in discovering high-performing architectures that yield 0.84, 2.17, and 2.35 points improvement on CIFAR-10, CIFAR-100, and ImageNet16-120, respectively, compared to current state-of-the-art methods.

19.2CVMar 17
SF-Mamba: Rethinking State Space Model for Vision

Masakazu Yoshimura, Teruaki Hayashi, Yuki Hoshino et al.

The realm of Mamba for vision has been advanced in recent years to strike for the alternatives of Vision Transformers (ViTs) that suffer from the quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under an unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.

CLNov 6, 2024
MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba

Masakazu Yoshimura, Teruaki Hayashi, Yota Maeda

An ecosystem of Transformer-based models has been established by building large models with extensive data. Parameter-efficient fine-tuning (PEFT) is a crucial technology for deploying these models to downstream tasks with minimal cost while achieving effective performance. Recently, Mamba, a State Space Model (SSM)-based model, has attracted attention as a potential alternative to Transformers. While many large-scale Mamba-based models have been proposed, efficiently adapting pre-trained Mamba-based models to downstream tasks remains unexplored. In this paper, we conduct an exploratory analysis of PEFT methods for Mamba. We investigate the effectiveness of existing PEFT methods for Transformers when applied to Mamba. We also modify these methods to better align with the Mamba architecture. Additionally, we propose new Mamba-specific PEFT methods that leverage the distinctive structure of Mamba. Our experiments indicate that PEFT performs more effectively for Mamba than Transformers. Lastly, we demonstrate how to effectively combine multiple PEFT methods and provide a framework that outperforms previous works. To ensure reproducibility, we will release the code after publication.

IVMar 15, 2024
PQDynamicISP: Dynamically Controlled Image Signal Processor for Any Image Sensors Pursuing Perceptual Quality

Masakazu Yoshimura, Junji Otsuka, Takeshi Ohashi

Full DNN-based image signal processors (ISPs) have been actively studied and have achieved superior image quality compared to conventional ISPs. In contrast to this trend, we propose a lightweight ISP that consists of simple conventional ISP functions but achieves high image quality by increasing expressiveness. Specifically, instead of tuning the parameters of the ISP, we propose to control them dynamically for each environment and even locally. As a result, state-of-the-art accuracy is achieved on various datasets, including other tasks like tone mapping and image enhancement, even though ours is lighter than DNN-based ISPs. Additionally, our method can process different image sensors with a single ISP through dynamic control, whereas conventional methods require training for each sensor.

CVNov 18, 2025
Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision

Zitang Sun, Masakazu Yoshimura, Junji Otsuka et al.

High-quality data has become a primary driver of progress under scale laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model's evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.

CVApr 3, 2025
SemiISP/SemiIE: Semi-Supervised Image Signal Processor and Image Enhancement Leveraging One-to-Many Mapping sRGB-to-RAW

Masakazu Yoshimura, Junji Otsuka, Radu Berdan et al.

DNN-based methods have been successful in Image Signal Processor (ISP) and image enhancement (IE) tasks. However, the cost of creating training data for these tasks is considerably higher than for other tasks, making it difficult to prepare large-scale datasets. Also, creating personalized ISP and IE with minimal training data can lead to new value streams since preferred image quality varies depending on the person and use case. While semi-supervised learning could be a potential solution in such cases, it has rarely been utilized for these tasks. In this paper, we realize semi-supervised learning for ISP and IE leveraging a RAW image reconstruction (sRGB-to-RAW) method. Although existing sRGB-to-RAW methods can generate pseudo-RAW image datasets that improve the accuracy of RAW-based high-level computer vision tasks such as object detection, their quality is not sufficient for ISP and IE tasks that require precise image quality definition. Therefore, we also propose a sRGB-to-RAW method that can improve the image quality of these tasks. The proposed semi-supervised learning with the proposed sRGB-to-RAW method successfully improves the image quality of various models on various datasets.

CVMar 17, 2025
Beyond RGB: Adaptive Parallel Processing for RAW Object Detection

Shani Gamrian, Hila Barel, Feiran Li et al.

Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can lead to a loss of critical information that may be essential in optimizing for computer vision tasks, such as object detection. In this work, we introduce Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This novel approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges.

CVMar 29, 2024
Mixed-precision Supernet Training from Vision Foundation Models using Low Rank Adapter

Yuiko Sakuma, Masakazu Yoshimura, Junji Otsuka et al.

Compression of large and performant vision foundation models (VFMs) into arbitrary bit-wise operations (BitOPs) allows their deployment on various hardware. We propose to fine-tune a VFM to a mixed-precision quantized supernet. The supernet-based neural architecture search (NAS) can be adopted for this purpose, which trains a supernet, and then subnets within arbitrary hardware budgets can be extracted. However, existing methods face difficulties in optimizing the mixed-precision search space and incurring large memory costs during training. To tackle these challenges, first, we study the effective search space design for fine-tuning a VFM by comparing different operators (such as resolution, feature size, width, depth, and bit-widths) in terms of performance and BitOPs reduction. Second, we propose memory-efficient supernet training using a low-rank adapter (LoRA) and a progressive training strategy. The proposed method is evaluated for the recently proposed VFM, Segment Anything Model, fine-tuned on segmentation tasks. The searched model yields about a 95% reduction in BitOPs without incurring performance degradation.

ROMar 15, 2021
MBAPose: Mask and Bounding-Box Aware Pose Estimation of Surgical Instruments with Photorealistic Domain Randomization

Masakazu Yoshimura, Murilo Marques Marinho, Kanako Harada et al.

Surgical robots are usually controlled using a priori models based on the robots' geometric parameters, which are calibrated before the surgical procedure. One of the challenges in using robots in real surgical settings is that those parameters can change over time, consequently deteriorating control accuracy. In this context, our group has been investigating online calibration strategies without added sensors. In one step toward that goal, we have developed an algorithm to estimate the pose of the instruments' shafts in endoscopic images. In this study, we build upon that earlier work and propose a new framework to more precisely estimate the pose of a rigid surgical instrument. Our strategy is based on a novel pose estimation model called MBAPose and the use of synthetic training data. Our experiments demonstrated an improvement of 21 % for translation error and 26 % for orientation error on synthetic test data with respect to our previous work. Results with real test data provide a baseline for further research.

CVOct 15, 2020
FOSS: Multi-Person Age Estimation with Focusing on Objects and Still Seeing Surroundings

Masakazu Yoshimura, Satoshi Ogata

Age estimation from images can be used in many practical scenes. Most of the previous works targeted on the estimation from images in which only one face exists. Also, most of the open datasets for age estimation contain images like that. However, in some situations, age estimation in the wild and for multi-person is needed. Usually, such situations were solved by two separate models; one is a face detector model which crops facial regions and the other is an age estimation model which estimates from cropped images. In this work, we propose a method that can detect and estimate the age of multi-person with a single model which estimates age with focusing on faces and still seeing surroundings. Also, we propose a training method which enables the model to estimate multi-person well despite trained with images in which only one face is photographed. In the experiments, we evaluated our proposed method compared with the traditional approach using two separate models. As the result, the accuracy could be enhanced with our proposed method. We also adapted our proposed model to commonly used single person photographed age estimation datasets and it is proved that our method is also effective to those images and outperforms the state of the art accuracy.

CVMar 3, 2020
Single-Shot Pose Estimation of Surgical Robot Instruments' Shafts from Monocular Endoscopic Images

Masakazu Yoshimura, Murilo M. Marinho, Kanako Harada et al.

Surgical robots are used to perform minimally invasive surgery and alleviate much of the burden imposed on surgeons. Our group has developed a surgical robot to aid in the removal of tumors at the base of the skull via access through the nostrils. To avoid injuring the patients, a collision-avoidance algorithm that depends on having an accurate model for the poses of the instruments' shafts is used. Given that the model's parameters can change over time owing to interactions between instruments and other disturbances, the online estimation of the poses of the instrument's shaft is essential. In this work, we propose a new method to estimate the pose of the surgical instruments' shafts using a monocular endoscope. Our method is based on the use of an automatically annotated training dataset and an improved pose-estimation deep-learning architecture. In preliminary experiments, we show that our method can surpass state of the art vision-based marker-less pose estimation techniques (providing an error decrease of 55% in position estimation, 64% in pitch, and 69% in yaw) by using artificial images.