Toru Tamaki

h-index30

36papers

129citations

Novelty38%

AI Score52

Ranked #36,948 of 201,326 authors (top 18%)#14,548 in CV (top 25%)

36 Papers

CVMay 27Code

Reflective Dialogue between Teacher and Solver Agents for Video Question Answering

Takuya Murakawa, Toru Tamaki

Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD) -- a multi-turn conversation between two agents, in which Teacher poses each support question and delivers correctness feedback, and Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.

CVApr 1, 2022

ObjectMix: Data Augmentation by Copy-Pasting Objects in Videos for Action Recognition

Jun Kimata, Tomoya Nitta, Toru Tamaki

In this paper, we propose a data augmentation method for action recognition using instance segmentation. Although many data augmentation methods have been proposed for image recognition, few of them are tailored for action recognition. Our proposed method, ObjectMix, extracts each object region from two videos using instance segmentation and combines them to create new videos. Experiments on two action recognition datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed method and show its superiority over VideoMix, a prior work.

CVApr 19, 2022

Performance Evaluation of Action Recognition Models on Low Quality Videos

Aoi Otani, Ryota Hashiguchi, Kazuki Omi et al.

In the design of action recognition models, the quality of videos is an important issue; however, the trade-off between the quality and performance is often ignored. In general, action recognition models are trained on high-quality videos, hence it is not known how the model performance degrades when tested on low-quality videos, and how much the quality of training videos affects the performance. The issue of video quality is important, however, it has not been studied so far. The goal of this study is to show the trade-off between the performance and the quality of training and test videos by quantitative performance evaluation of several action recognition models for transcoded videos in different qualities. First, we show how the video quality affects the performance of pre-trained models. We transcode the original validation videos of Kinetics400 by changing quality control parameters of JPEG (compression strength) and H.264/AVC (CRF). Then we use the transcoded videos to validate the pre-trained models. Second, we show how the models perform when trained on transcoded videos. We transcode the original training videos of Kinetics400 by changing the quality parameters of JPEG and H.264/AVC. Then we train the models on the transcoded training videos and validate them with the original and transcoded validation videos. Experimental results with JPEG transcoding show that there is no severe performance degradation (up to -1.5%) for compression strength smaller than 70 where no quality degradation is visually observed, and for larger than 80 the performance degrades linearly with respect to the quality index. Experiments with H.264/AVC transcoding show that there is no significant performance loss (up to -1%) with CRF30 while the total size of video files is reduced to 30%.

CVJul 27, 2022

Object-ABN: Learning to Generate Sharp Attention Maps for Action Recognition

Tomoya Nitta, Tsubasa Hirakawa, Hironobu Fujiyoshi et al.

In this paper we propose an extension of the Attention Branch Network (ABN) by using instance segmentation for generating sharper attention maps for action recognition. Methods for visual explanation such as Grad-CAM usually generate blurry maps which are not intuitive for humans to understand, particularly in recognizing actions of people in videos. Our proposed method, Object-ABN, tackles this issue by introducing a new mask loss that makes the generated attention maps close to the instance segmentation result. Further the PC loss and multiple attention maps are introduced to enhance the sharpness of the maps and improve the performance of classification. Experimental results with UCF101 and SSv2 shows that the generated maps by the proposed method are much clearer qualitatively and quantitatively than those of the original ABN.

CVJan 16Code

M3DDM+: An improved video outpainting by a modified masking strategy

Takuya Murakawa, Takumi Fukuzawa, Ning Ding et al.

M3DDM provides a computationally efficient framework for video outpainting via latent diffusion modeling. However, it exhibits significant quality degradation -- manifested as spatial blur and temporal inconsistency -- under challenging scenarios characterized by limited camera motion or large outpainting regions, where inter-frame information is limited. We identify the cause as a training-inference mismatch in the masking strategy: M3DDM's training applies random mask directions and widths across frames, whereas inference requires consistent directional outpainting throughout the video. To address this, we propose M3DDM+, which applies uniform mask direction and width across all frames during training, followed by fine-tuning of the pretrained M3DDM model. Experiments demonstrate that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency. The code is available at https://github.com/tamaki-lab/M3DDM-Plus.

CVApr 15, 2022

Model-agnostic Multi-Domain Learning with Domain-Specific Adapters for Action Recognition

Kazuki Omi, Jun Kimata, Toru Tamaki

In this paper, we propose a multi-domain learning model for action recognition. The proposed method inserts domain-specific adapters between layers of domain-independent layers of a backbone network. Unlike a multi-head network that switches classification heads only, our model switches not only the heads, but also the adapters for facilitating to learn feature representations universal to multiple domains. Unlike prior works, the proposed method is model-agnostic and doesn't assume model structures unlike prior works. Experimental results on three popular action recognition datasets (HMDB51, UCF101, and Kinetics-400) demonstrate that the proposed method is more effective than a multi-head architecture and more efficient than separately training models for each domain.

CVApr 1, 2022

Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

Ryota Hashiguchi, Toru Tamaki

Feature shifts have been shown to be useful for action recognition with CNN-based models since Temporal Shift Module (TSM) was proposed. It is based on frame-wise feature extraction with late fusion, and layer features are shifted along the time direction for the temporal interaction. TokenShift, a recent model based on Vision Transformer (ViT), also uses the temporal feature shift mechanism, which, however, does not fully exploit the structure of Multi-head Self-Attention (MSA) in ViT. In this paper, we propose Multi-head Self/Cross-Attention (MSCA), which fully utilizes the attention structure. TokenShift is based on a frame-wise ViT with features temporally shifted with successive frames (at time t+1 and t-1). In contrast, the proposed MSCA replaces MSA in the frame-wise ViT, and some MSA heads attend to successive frames instead of the current frame. The computation cost is the same as the frame-wise ViT and TokenShift as it simply changes the target to which the attention is taken. There is a choice about which of key, query, and value are taken from the successive frames, then we experimentally compared these variants with Kinetics400. We also investigate other variants in which the proposed MSCA is used along the patch dimension of ViT, instead of the head dimension. Experimental results show that a variant, MSCA-KV, shows the best performance and is better than TokenShift by 0.1% and then ViT by 1.2%.

CVAug 21, 2023

Joint learning of images and videos with a single Vision Transformer

Shuki Shimizu, Toru Tamaki

In this study, we propose a method for jointly learning of images and videos using a single model. In general, images and videos are often trained by separate models. We propose in this paper a method that takes a batch of images as input to Vision Transformer IV-ViT, and also a set of video frames with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.

CVOct 23, 2023

S3Aug: Segmentation, Sampling, and Shift for Action Recognition

Taiki Sugiura, Toru Tamaki

Action recognition is a well-established area of research in computer vision. In this paper, we propose S3Aug, a video data augmenatation for action recognition. Unlike conventional video data augmentation methods that involve cutting and pasting regions from two videos, the proposed method generates new videos from a single training video through segmentation and label-to-image transformation. Furthermore, the proposed method modifies certain categories of label images by sampling to generate a variety of videos, and shifts intermediate features to enhance the temporal coherency between frames of the generate videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets demonstrate the effectiveness of the proposed method, paricularlly for out-of-context videos of the Mimetics dataset.

CVDec 12, 2019Code

Meaning guided video captioning

Rushi J. Babariya, Toru Tamaki

Current video captioning approaches often suffer from problems of missing objects in the video to be described, while generating captions semantically similar with ground truth sentences. In this paper, we propose a new approach to video captioning that can describe objects detected by object detection, and generate captions having similar meaning with correct captions. Our model relies on S2VT, a sequence-to-sequence model for video captioning. Given a sequence of video frames, the encoding RNN takes a frame as well as detected objects in the frame in order to incorporate the information of the objects in the scene. The following decoding RNN outputs are then fed into an attention layer and then to a decoder for generating captions. The caption is compared with the ground truth by learning metric so that vector representations of generated captions are semantically similar to those of ground truth. Experimental results with the MSDV dataset demonstrate that the performance of the proposed approach is much better than the model without the proposed meaning-guided framework, showing the effectiveness of the proposed model. Code are publicly available at https://github.com/captanlevi/Meaning-guided-video-captioning-.

CVNov 28, 2017Code

Revisiting hand-crafted feature for action recognition: a set of improved dense trajectories

Kenji Matsui, Toru Tamaki, Gwladys Auffret et al.

We propose a feature for action recognition called Trajectory-Set (TS), on top of the improved Dense Trajectory (iDT). The TS feature encodes only trajectories around densely sampled interest points, without any appearance features. Experimental results on the UCF50, UCF101, and HMDB51 action datasets demonstrate that TS is comparable to state-of-the-arts, and outperforms many other methods; for HMDB the accuracy of 85.4%, compared to the best accuracy of 80.2% obtained by a deep method. Our code is available on-line at https://github.com/Gauffret/TrajectorySet .

CVMar 26

BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

Ning Ding, Keisuke Fujii, Toru Tamaki

Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.

CVAug 28, 2024

Online pre-training with long-form videos

Itsuki Kato, Kodai Kamiya, Toru Tamaki

In this study, we investigate the impact of online pre-training with continuous video clips. We will examine three methods for pre-training (masked image modeling, contrastive learning, and knowledge distillation), and assess the performance on downstream action recognition tasks. As a result, online pre-training with contrast learning showed the highest performance in downstream tasks. Our findings suggest that learning from long-form videos can be helpful for action recognition with short videos.

CVSep 27, 2024

Query matching for spatio-temporal action detection with query-based object detector

Shimon Hori, Kazuki Omi, Toru Tamaki

In this paper, we propose a method that extends the query-based object detection model, DETR, to spatio-temporal action detection, which requires maintaining temporal consistency in videos. Our proposed method applies DETR to each frame and uses feature shift to incorporate temporal information. However, DETR's object queries in each frame may correspond to different objects, making a simple feature shift ineffective. To overcome this issue, we propose query matching across different frames, ensuring that queries for the same object are matched and used for the feature shift. Experimental results show that performance on the JHMDB21 dataset improves significantly when query features are shifted using the proposed query matching.

CVAug 27, 2024

Fine-grained length controllable video captioning with ordinal embeddings

Tomoya Nitta, Takumi Fukuzawa, Toru Tamaki

This paper proposes a method for video captioning that controls the length of generated captions. Previous work on length control often had few levels for expressing length. In this study, we propose two methods of length embedding for fine-grained length control. A traditional embedding method is linear, using a one-hot vector and an embedding matrix. In this study, we propose methods that represent length in multi-hot vectors. One is bit embedding that expresses length in bit representation, and the other is ordinal embedding that uses the binary representation often used in ordinal regression. These length representations of multi-hot vectors are converted into length embedding by a nonlinear MLP. This method allows for not only the length control of caption sentences but also the control of the time when reading the caption. Experiments using ActivityNet Captions and Spoken Moments in Time show that the proposed method effectively controls the length of the generated captions. Analysis of the embedding vectors with ICA shows that length and semantics were learned separately, demonstrating the effectiveness of the proposed embedding methods.

CVJan 22, 2025

Can masking background and object reduce static bias for zero-shot action recognition?

Takumi Fukuzawa, Kensho Hara, Hirokatsu Kataoka et al.

In this paper, we address the issue of static bias in zero-shot action recognition. Action recognition models need to represent the action itself, not the appearance. However, some fully-supervised works show that models often rely on static appearances, such as the background and objects, rather than human actions. This issue, known as static bias, has not been investigated for zero-shot. Although CLIP-based zero-shot models are now common, it remains unclear if they sufficiently focus on human actions, as CLIP primarily captures appearance features related to languages. In this paper, we investigate the influence of static bias in zero-shot action recognition with CLIP-based models. Our approach involves masking backgrounds, objects, and people differently during training and validation. Experiments with masking background show that models depend on background bias as their performance decreases for Kinetics400. However, for Mimetics, which has a weak background bias, masking the background leads to improved performance even if the background is masked during validation. Furthermore, masking both the background and objects in different colors improves performance for SSv2, which has a strong object bias. These results suggest that masking the background or objects during training prevents models from overly depending on static bias and makes them focus more on human action.

CVSep 27, 2025

Disentangling Static and Dynamic Information for Reducing Static Bias in Action Recognition

Masato Kobayashi, Ning Ding, Toru Tamaki

Action recognition models rely excessively on static cues rather than dynamic human motion, which is known as static bias. This bias leads to poor performance in real-world applications and zero-shot action recognition. In this paper, we propose a method to reduce static bias by separating temporal dynamic information from static scene information. Our approach uses a statistical independence loss between biased and unbiased streams, combined with a scene prediction loss. Our experiments demonstrate that this method effectively reduces static bias and confirm the importance of scene prediction loss.

CVOct 16, 2025

Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

Ning Ding, Keisuke Fujii, Toru Tamaki

Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose \textbf{Shot2Tactic-Caption}, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.

LGAug 15, 2025

The 1st International Workshop on Disentangled Representation Learning for Controllable Generation (DRL4Real): Methods and Results

Qiuyu Chen, Xin Jin, Yue Song et al.

This paper reviews the 1st International Workshop on Disentangled Representation Learning for Controllable Generation (DRL4Real), held in conjunction with ICCV 2025. The workshop aimed to bridge the gap between the theoretical promise of Disentangled Representation Learning (DRL) and its application in realistic scenarios, moving beyond synthetic benchmarks. DRL4Real focused on evaluating DRL methods in practical applications such as controllable generation, exploring advancements in model robustness, interpretability, and generalization. The workshop accepted 9 papers covering a broad range of topics, including the integration of novel inductive biases (e.g., language), the application of diffusion models to DRL, 3D-aware disentanglement, and the expansion of DRL into specialized domains like autonomous driving and EEG analysis. This summary details the workshop's objectives, the themes of the accepted papers, and provides an overview of the methodologies proposed by the authors.

CVAug 5, 2025

MoExDA: Domain Adaptation for Edge-based Action Recognition

Takuya Sugimoto, Ning Ding, Toru Tamaki

Modern action recognition models suffer from static bias, leading to reduced generalization performance. In this paper, we propose MoExDA, a lightweight domain adaptation between RGB and edge information using edge frames in addition to RGB frames to counter the static bias issue. Experiments demonstrate that the proposed method effectively suppresses static bias with a lower computational cost, allowing for more robust action recognition than previous approaches.

CVAug 5, 2025

Separating Shared and Domain-Specific LoRAs for Multi-Domain Learning

Yusaku Takama, Ning Ding, Tatsuya Yokota et al.

Existing architectures of multi-domain learning have two types of adapters: shared LoRA for all domains and domain-specific LoRA for each particular domain. However, it remains unclear whether this structure effectively captures domain-specific information. In this paper, we propose a method that ensures that shared and domain-specific LoRAs exist in different subspaces; specifically, the column and left null subspaces of the pre-trained weights. We apply the proposed method to action recognition with three datasets (UCF101, Kinetics400, and HMDB51) and demonstrate its effectiveness in some cases along with the analysis of the dimensions of LoRA weights.

CVMar 17, 2025

Action tube generation by person query matching for spatio-temporal action detection

Kazuki Omi, Jion Oshima, Toru Tamaki

This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames compared to queries for different people. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip. Experimental results on JHMDB, UCF101-24, and AVA datasets demonstrate that our method performs well for large position changes of people while offering superior computational efficiency and lower resource requirements.

CVJan 26, 2024

Multi-model learning by sequential reading of untrimmed videos for action recognition

Kodai Kamiya, Toru Tamaki

We propose a new method for learning videos by aggregating multiple models by sequentially extracting video clips from untrimmed video. The proposed method reduces the correlation between clips by feeding clips to multiple models in turn and synchronizes these models through federated learning. Experimental results show that the proposed method improves the performance compared to the no synchronization.

CVAug 26, 2021

Improving the Reliability of Semantic Segmentation of Medical Images by Uncertainty Modeling with Bayesian Deep Networks and Curriculum Learning

Sora Iwamoto, Bisser Raytchev, Toru Tamaki et al.

In this paper we propose a novel method which leverages the uncertainty measures provided by Bayesian deep networks through curriculum learning so that the uncertainty estimates are fed back to the system to resample the training data more densely in areas where uncertainty is high. We show in the concrete setting of a semantic segmentation task (iPS cell colony segmentation) that the proposed system is able to increase significantly the reliability of the model.

CVApr 12, 2020

An Entropy Clustering Approach for Assessing Visual Question Difficulty

Kento Terao, Toru Tamaki, Bisser Raytchev et al.

We propose a novel approach to identify the difficulty of visual questions for Visual Question Answering (VQA) without direct supervision or annotations to the difficulty. Prior works have considered the diversity of ground-truth answers of human annotators. In contrast, we analyze the difficulty of visual questions based on the behavior of multiple different VQA models. We propose to cluster the entropy values of the predicted answer distributions obtained by three different models: a baseline method that takes as input images and questions, and two variants that take as input images only and questions only. We use a simple k-means to cluster the visual questions of the VQA v2 validation set. Then we use state-of-the-art methods to determine the accuracy and the entropy of the answer distributions for each cluster. A benefit of the proposed method is that no annotation of the difficulty is required, because the accuracy of each cluster reflects the difficulty of visual questions that belong to it. Our approach can identify clusters of difficult visual questions that are not answered correctly by state-of-the-art methods. Detailed analysis on the VQA v2 dataset reveals that 1) all methods show poor performances on the most difficult cluster (about 10\% accuracy), 2) as the cluster difficulty increases, the answers predicted by the different methods begin to differ, and 3) the values of cluster entropy are highly correlated with the cluster accuracy. We show that our approach has the advantage of being able to assess the difficulty of visual questions without ground-truth (\ie, the test set of VQA v2) by assigning them to one of the clusters. We expect that this can stimulate the development of novel directions of research and new algorithms.

CVApr 10, 2020

Rephrasing visual questions by specifying the entropy of the answer distribution

Kento Terao, Toru Tamaki, Bisser Raytchev et al.

Visual question answering (VQA) is a task of answering a visual question that is a pair of question and image. Some visual questions are ambiguous and some are clear, and it may be appropriate to change the ambiguity of questions from situation to situation. However, this issue has not been addressed by any prior work. We propose a novel task, rephrasing the questions by controlling the ambiguity of the questions. The ambiguity of a visual question is defined by the use of the entropy of the answer distribution predicted by a VQA model. The proposed model rephrases a source question given with an image so that the rephrased question has the ambiguity (or entropy) specified by users. We propose two learning strategies to train the proposed model with the VQA v2 dataset, which has no ambiguity information. We demonstrate the advantage of our approach that can control the ambiguity of the rephrased questions, and an interesting observation that it is harder to increase than to reduce ambiguity.

CVFeb 19, 2020

On-line non-overlapping camera calibration net

Zhao Fangda, Toru Tamaki, Takio Kurita et al.

We propose an easy-to-use non-overlapping camera calibration method. First, successive images are fed to a PoseNet-based network to obtain ego-motion of cameras between frames. Next, the pose between cameras are estimated. Instead of using a batch method, we propose an on-line method of the inter-camera pose estimation. Furthermore, we implement the entire procedure on a computation graph. Experiments with simulations and the KITTI dataset show the proposed method to be effective in simulation.

CVDec 12, 2019

Improved Activity Forecasting for Generating Trajectories

Daisuke Ogawa, Toru Tamaki, Tsubasa Hirakawa et al.

An efficient inverse reinforcement learning for generating trajectories is proposed based of 2D and 3D activity forecasting. We modify reward function with $L_p$ norm and propose convolution into value iteration steps, which is called convolutional value iteration. Experimental results with seabird trajectories (43 for training and 10 for test), our method is best in terms of MHD error and performs fastest. Generated trajectories for interpolating missing parts of trajectories look much similar to real seabird trajectories than those by the previous works.

CVDec 12, 2019

Semantic segmentation of trajectories with improved agent models for pedestrian behavior analysis

Toru Tamaki, Daisuke Ogawa, Bisser Raytchev et al.

In this paper, we propose a method for semantic segmentation of pedestrian trajectories based on pedestrian behavior models, or agents. The agents model the dynamics of pedestrian movements in two-dimensional space using a linear dynamics model and common start and goal locations of trajectories. First, agent models are estimated from the trajectories obtained from image sequences. Our method is built on top of the Mixture model of Dynamic pedestrian Agents (MDA); however, the MDA's trajectory modeling and estimation are improved. Then, the trajectories are divided into semantically meaningful segments. The subsegments of a trajectory are modeled by applying a hidden Markov model using the estimated agent models. Experimental results with a real trajectory dataset show the effectiveness of the proposed method as compared to the well-known classical Ramer-Douglas-Peucker algorithm and also to the original MDA model.

CVSep 27, 2019

Biomedical Image Segmentation by Retina-like Sequential Attention Mechanism Using Only A Few Training Images

Shohei Hayashi, Bisser Raytchev, Toru Tamaki et al.

In this paper we propose a novel deep learning-based algorithm for biomedical image segmentation which uses a sequential attention mechanism able to shift the focus of attention across the image in a selective way, allowing subareas which are more difficult to classify to be processed at increased resolution. The spatial distribution of class information in each subarea is learned using a retina-like representation where resolution decreases with distance from the center of attention. The final segmentation is achieved by averaging class predictions over overlapping subareas, utilizing the power of ensemble learning to increase segmentation accuracy. Experimental results for semantic segmentation task for which only a few training images are available show that a CNN using the proposed method outperforms both a patch-based classification CNN and a fully convolutional-based method.

CVNov 1, 2018

Survey on Vision-based Path Prediction

Tsubasa Hirakawa, Takayoshi Yamashita, Toru Tamaki et al.

Path prediction is a fundamental task for estimating how pedestrians or vehicles are going to move in a scene. Because path prediction as a task of computer vision uses video as input, various information used for prediction, such as the environment surrounding the target and the internal state of the target, need to be estimated from the video in addition to predicting paths. Many prediction approaches that include understanding the environment and the internal state have been proposed. In this survey, we systematically summarize methods of path prediction that take video as input and and extract features from the video. Moreover, we introduce datasets used to evaluate path prediction methods quantitatively.

CVFeb 27, 2018

Semantic segmentation of trajectories with agent models

Daisuke Ogawa, Toru Tamaki, Bisser Raytchev et al.

In many cases, such as trajectories clustering and classification, we often divide a trajectory into segments as preprocessing. In this paper, we propose a trajectory semantic segmentation method based on learned behavior models. In the proposed method, we learn some behavior models from video sequences. Next, using learned behavior models and a hidden Markov model, we segment a trajectory into semantic segments. Comparing with the Ramer-Douglas-Peucker algorithm, we show the effectiveness of the proposed method.

CVDec 15, 2016

Development of a Real-time Colorectal Tumor Classification System for Narrow-band Imaging zoom-videoendoscopy

Tsubasa Hirakawa, Toru Tamaki, Bisser Raytchev et al.

Colorectal endoscopy is important for the early detection and treatment of colorectal cancer and is used worldwide. A computer-aided diagnosis (CAD) system that provides an objective measure to endoscopists during colorectal endoscopic examinations would be of great value. In this study, we describe a newly developed CAD system that provides real-time objective measures. Our system captures the video stream from an endoscopic system and transfers it to a desktop computer. The captured video stream is then classified by a pretrained classifier and the results are displayed on a monitor. The experimental results show that our developed system works efficiently in actual endoscopic examinations and is medically significant.

CVNov 8, 2016

Domain Adaptation with L2 constraints for classifying images from different endoscope systems

Toru Tamaki, Shoji Sonoyama, Takio Kurita et al.

This paper proposes a method for domain adaptation that extends the maximum margin domain transfer (MMDT) proposed by Hoffman et al., by introducing L2 distance constraints between samples of different domains; thus, our method is denoted as MMDTL2. Motivated by the differences between the images taken by narrow band imaging (NBI) endoscopic devices, we utilize different NBI devices as different domains and estimate the transformations between samples of different domains, i.e., image samples taken by different NBI endoscope systems. We first formulate the problem in the primal form, and then derive the dual form with much lesser computational costs as compared to the naive approach. From our experimental results using NBI image datasets from two different NBI endoscopic devices, we find that MMDTL2 is better than MMDT and also support vector machines without adaptation, especially when NBI image features are high-dimensional and the per-class training samples are greater than 20.

CVAug 24, 2016

Transfer Learning for Endoscopic Image Classification

Shoji Sonoyama, Toru Tamaki, Tsubasa Hirakawa et al.

In this paper we propose a method for transfer learning of endoscopic images. For transferring between features obtained from images taken by different (old and new) endoscopes, we extend the Max-Margin Domain Transfer (MMDT) proposed by Hoffman et al. in order to use L2 distance constraints as regularization, called Max-Margin Domain Transfer with L2 Distance Constraints (MMDTL2). Furthermore, we develop the dual formulation of the optimization problem in order to reduce the computation cost. Experimental results demonstrate that the proposed MMDTL2 outperforms MMDT for real data sets taken by different endoscopes.

CVAug 24, 2016

Computer-Aided Colorectal Tumor Classification in NBI Endoscopy Using CNN Features

Toru Tamaki, Shoji Sonoyama, Tsubasa Hirakawa et al.

In this paper we report results for recognizing colorectal NBI endoscopic images by using features extracted from convolutional neural network (CNN). In this comparative study, we extract features from different layers from different CNN models, and then train linear SVM classifiers. Experimental results with 10-fold cross validations show that features from first few convolution layers are enough to achieve similar performance (i.e., recognition rate of 95%) with non-CNN local features such as Bag-of-Visual words, Fisher vector, and VLAD.