Yoshitaka Ushiku

CV
h-index39
62papers
8,010citations
Novelty49%
AI Score60

62 Papers

RONov 2, 2023Code
Vision-Language Interpreter for Robot Task Planning

Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya et al.

Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By generating PDs from language instruction and scene observation, we can drive symbolic planners in a language-guided framework. We propose a Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLM and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99\% accuracy and valid plans with more than 58\% accuracy. Our code and dataset are available at https://github.com/omron-sinicx/ViLaIn.

CVApr 20, 2023Code
Noisy Universal Domain Adaptation via Divergence Optimization for Visual Recognition

Qing Yu, Atsushi Hashimoto, Yoshitaka Ushiku

To transfer the knowledge learned from a labeled source domain to an unlabeled target domain, many studies have worked on universal domain adaptation (UniDA), where there is no constraint on the label sets of the source domain and target domain. However, the existing UniDA methods rely on source samples with correct annotations. Due to the limited resources in the real world, it is difficult to obtain a large amount of perfectly clean labeled data in a source domain in some applications. As a result, we propose a novel realistic scenario named Noisy UniDA, in which classifiers are trained using noisy labeled data from the source domain as well as unlabeled domain data from the target domain that has an uncertain class distribution. A multi-head convolutional neural network framework is proposed in this paper to address all of the challenges faced in the Noisy UniDA at once. Our network comprises a single common feature generator and multiple classifiers with various decision bounds. We can detect noisy samples in the source domain, identify unknown classes in the target domain, and align the distribution of the source and target domains by optimizing the divergence between the outputs of the various classifiers. The proposed method outperformed the existing methods in most of the settings after a thorough analysis of the various domain adaption scenarios. The source code is available at \url{https://github.com/YU1ut/Divergence-Optimization}.

59.1ROMay 17
Robust and Resilient Soft Robotic Object Insertion with Compliance-Enabled Contact Formation and Failure Recovery

Mimo Shirasaka, Cristian C. Beltran-Hernandez, Masashi Hamaya et al.

Object insertion tasks are prone to failure under pose uncertainty and environmental variation, often requiring manual fine-tuning or controller retraining. We present a novel approach for robust and resilient object insertion using a passively compliant soft wrist that enables safe contact absorption through large deformations, without high-frequency control or force sensing. Our method structures insertion as compliance-enabled contact formations, sequential contact states that progressively constrain degrees of freedom, and integrates automated failure recovery strategies. Our key insight is that wrist compliance permits safe, repeated recovery attempts; hence, we refer to it as compliance-enabled failure recovery. We employ a pre-trained vision-language model (VLM) that assesses each skill execution from terminal poses and images, identifies failure modes, and proposes recovery actions by selecting skills and updating goals. In simulation, our method achieved an 83% success rate, recovering from failures induced by randomized conditions, including grasp misalignments up to 5 degrees, hole-pose errors up to 20 mm, fivefold increases in friction, and unseen square/rectangular pegs, and we further validated the approach on a real robot. Project page is available at https://omron-sinicx.github.io/compliance-enabled-failure-recovery/.

26.6ROMay 13
SCU-Hand with Integrated Single-Sheet Valve: A Funnel-Shaped Robotic Hand for Milligram-Scale Powder Handling

Tomoya Takahashi, Yusaku Nakajima, Cristian Camilo Beltran-Hernandez et al.

Laboratory Automation (LA) has the potential to accelerate solid-state materials discovery by enabling continuous robotic operation without human intervention. While robotic systems have been developed for tasks such as powder grinding and X-ray diffraction (XRD) analysis, fully automating powder handling at the milligram scale remains a significant challenge due to the complex flow dynamics of powders and the diversity of laboratory tasks. To address this challenge, this study proposes the SCU-Hand-SV (Soft Conical Universal Robotic Hand with Single-sheet Valve), which preserves the softness and conical sheet designs in prior work while incorporating a controllable valve at the cone apex to enable precise, incremental dispensing of milligram-scale powder quantities. The SCU-Hand-SV is integrated with an external balance through a feedback control system based on a model of powder flow and online parameter identification. Experimental evaluations with glass beads, monosodium glutamate, and titanium dioxide demonstrated that 80% of the trials achieved an error within -2 mg to +2 mg, and the maximum error observed was approximately 20 mg across a target range of 20 mg to 3 g. In addition, by incorporating flow prediction models commonly used for hoppers and performing online parameter identification, the system is able to adapt to variations in powder dynamics. Compared to direct PID control, the proposed model-based control significantly improved both accuracy and convergence speed. These results highlight the potential of the proposed system to enable efficient and flexible powder weighing, with scalability toward larger quantities and applicability to a broad range of laboratory automation tasks.

LGJun 21, 2022
Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery

Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi et al.

This paper revisits datasets and evaluation criteria for Symbolic Regression (SR), specifically focused on its potential for scientific discovery. Focused on a set of formulas used in the existing datasets based on Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values so that our new SRSD datasets can be used for evaluating the potential of SRSD such as whether or not an SR method can (re)discover physical laws from such datasets. We also create another 120 datasets that contain dummy variables to examine whether SR methods can choose necessary variables only. Besides, we propose to use normalized edit distances (NED) between a predicted equation and the true equation trees for addressing a critical issue that existing SR metrics are either binary or errors between the target values and an SR model's predicted values for a given input. We conduct benchmark experiments on our new SRSD datasets using various representative SR methods. The experimental results show that we provide a more realistic performance evaluation, and our user study shows that the NED correlates with human judges significantly more than an existing SR metric. We publish repositories of our code and 240 SRSD datasets.

CVJul 29, 2024Code
SciPostLayout: A Dataset for Layout Analysis and Layout Generation of Scientific Posters

Shohei Tanaka, Hao Wang, Yoshitaka Ushiku

Scientific posters are used to present the contributions of scientific papers effectively in a graphical format. However, creating a well-designed poster that efficiently summarizes the core of a paper is both labor-intensive and time-consuming. A system that can automatically generate well-designed posters from scientific papers would reduce the workload of authors and help readers understand the outline of the paper visually. Despite the demand for poster generation systems, only a limited research has been conduced due to the lack of publicly available datasets. Thus, in this study, we built the SciPostLayout dataset, which consists of 7,855 scientific posters and manual layout annotations for layout analysis and generation. SciPostLayout also contains 100 scientific papers paired with the posters. All of the posters and papers in our dataset are under the CC-BY license and are publicly available. As benchmark tests for the collected dataset, we conducted experiments for layout analysis and generation utilizing existing computer vision models and found that both layout analysis and generation of posters using SciPostLayout are more challenging than with scientific papers. We also conducted experiments on generating layouts from scientific papers to demonstrate the potential of utilizing LLM as a scientific poster generation system. The dataset is publicly available at https://huggingface.co/datasets/omron-sinicx/scipostlayout_v2. The code is also publicly available at https://github.com/omron-sinicx/scipostlayout.

CLSep 13, 2022
Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows

Keisuke Shirai, Atsushi Hashimoto, Taichi Nishimura et al.

We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn each cooking action result in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning and procedural text generation.

MMSep 21, 2022
Recipe Generation from Unsegmented Cooking Videos

Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku et al.

This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of the DVC model and confirm that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this, we propose a transformer-based multimodal recurrent approach of training an event selector and sentence generator for selecting oracle events from the DVC's events and generating sentences for them. In addition, we extend the model by including ingredients to generate more accurate recipes. The experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling the recipe in a story-aware manner, the proposed model outputs the appropriate number of events in the correct order.

MTRL-SCIDec 8, 2022
Neural Structure Fields with Application to Crystal Structure Autoencoders

Naoya Chiba, Yuta Suzuki, Tatsunori Taniai et al.

Representing crystal structures of materials to facilitate determining them via neural networks is crucial for enabling machine-learning applications involving crystal structure estimation. Among these applications, the inverse design of materials can contribute to explore materials with desired properties without relying on luck or serendipity. We propose neural structure fields (NeSF) as an accurate and practical approach for representing crystal structures using neural networks. Inspired by the concepts of vector fields in physics and implicit neural representations in computer vision, the proposed NeSF considers a crystal structure as a continuous field rather than as a discrete set of atoms. Unlike existing grid-based discretized spatial representations, the NeSF overcomes the tradeoff between spatial resolution and computational complexity and can represent any crystal structure. We propose an autoencoder of crystal structures that can recover various crystal structures, such as those of perovskite structure materials and cuprate superconductors. Extensive quantitative results demonstrate the superior performance of the NeSF compared with the existing grid-based approach.

CVJul 6, 2023
A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task

Shiqi Yang, Atsushi Hashimoto, Yoshitaka Ushiku

In recent years large model trained on huge amount of cross-modality data, which is usually be termed as foundation model, achieves conspicuous accomplishment in many fields, such as image recognition and generation. Though achieving great success in their original application case, it is still unclear whether those foundation models can be applied to other different downstream tasks. In this paper, we conduct a short survey on the current methods for discriminative dense recognition tasks, which are built on the pretrained foundation model. And we also provide some preliminary experimental analysis of an existing open-vocabulary segmentation method based on Stable Diffusion, which indicates the current way of deploying diffusion model for segmentation is not optimal. This aims to provide insights for future research on adopting foundation model for downstream task.

CVNov 28, 2023
Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura et al.

We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates the in-depth study of cross-view transfer under complex view changes. To this end, we first create a real-life egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2 captions, enabling transfer learning between these datasets with access to their ground-truth. To bridge the view gaps, we propose a view-invariant learning method using adversarial training, which consists of pre-training and fine-tuning stages. Our experiments confirm the effectiveness of overcoming the view change problem and knowledge transfer to egocentric views. Our benchmark pushes the study of cross-view transfer into a new task domain of dense video captioning and envisions methodologies that describe egocentric videos in natural language.

CVAug 5, 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto et al.

Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query instructional contents from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we propose the novel video-to-text retrieval task Online Recipe Retrieval (OnRR) and new video captioning domain Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.

AIJul 28, 2024
AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering

Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto et al.

Visual question answering aims to provide responses to natural language questions given visual input. Recently, visual programmatic models (VPMs), which generate executable programs to answer questions through large language models (LLMs), have attracted research interest. However, they often require long input prompts to provide the LLM with sufficient API usage details to generate relevant code. To address this limitation, we propose AdaCoder, an adaptive prompt compression framework for VPMs. AdaCoder operates in two phases: a compression phase and an inference phase. In the compression phase, given a preprompt that describes all API definitions in the Python language with example snippets of code, a set of compressed preprompts is generated, each depending on a specific question type. In the inference phase, given an input question, AdaCoder predicts the question type and chooses the appropriate corresponding compressed preprompt to generate code to answer the question. Notably, AdaCoder employs a single frozen LLM and pre-defined prompts, negating the necessity of additional training and maintaining adaptability across different powerful black-box LLMs such as GPT and Claude. In experiments, we apply AdaCoder to ViperGPT and demonstrate that it reduces token length by 71.1%, while maintaining or even improving the performance of visual question answering.

LGOct 19, 2023
WeaveNet for Approximating Two-sided Matching Problems

Shusaku Sone, Jiaxin Ma, Atsushi Hashimoto et al.

Matching, a task to optimally assign limited resources under constraints, is a fundamental technology for society. The task potentially has various objectives, conditions, and constraints; however, the efficient neural network architecture for matching is underexplored. This paper proposes a novel graph neural network (GNN), \textit{WeaveNet}, designed for bipartite graphs. Since a bipartite graph is generally dense, general GNN architectures lose node-wise information by over-smoothing when deeply stacked. Such a phenomenon is undesirable for solving matching problems. WeaveNet avoids it by preserving edge-wise information while passing messages densely to reach a better solution. To evaluate the model, we approximated one of the \textit{strongly NP-hard} problems, \textit{fair stable matching}. Despite its inherent difficulties and the network's general purpose design, our model reached a comparative performance with state-of-the-art algorithms specially designed for stable matching for small numbers of agents.

CVJul 28, 2025Code
AgroBench: Vision-Language Model Benchmark in Agriculture

Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka et al.

Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLM models across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code are available at https://dahlian00.github.io/AgroBenchPage/ .

69.2CVMar 16
HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning

Kuniaki Saito, Risa Shinoda, Shohei Tanaka et al.

Hallucination detection in captions (HalDec) assesses a vision-language model's ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at https://dahlian00.github.io/HalDec-Bench-Page/.

CLFeb 2
Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

Yuzheng Xu, Tosho Hirasawa, Tadashi Kozuno et al.

Large language models (LLMs) are now widely used to evaluate the quality of text, a field commonly referred to as LLM-as-a-judge. While prior works mainly focus on point-wise and pair-wise evaluation paradigms. Rubric-based evaluation, where LLMs select a score from multiple rubrics, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multi-choice setting and therefore has position bias: LLMs prefer score options appearing at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate consistent position bias. To mitigate this bias, we propose a balanced permutation strategy that evenly distributes each score option across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias, but also improves correlation between the LLM-as-a-Judge and human. Our results suggest that rubric-based LLM-as-a-Judge is not inherently point-wise and that simple permutation-based calibration can substantially improve its reliability.

5.6CLMar 12
MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi et al.

We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.

CVNov 23, 2025Code
SciPostLayoutTree: A Dataset for Structural Analysis of Scientific Posters

Shohei Tanaka, Atsushi Hashimoto, Yoshitaka Ushiku

Scientific posters play a vital role in academic communication by presenting ideas through visual summaries. Analyzing reading order and parent-child relations of posters is essential for building structure-aware interfaces that facilitate clear and accurate understanding of research content. Despite their prevalence in academic communication, posters remain underexplored in structural analysis research, which has primarily focused on papers. To address this gap, we constructed SciPostLayoutTree, a dataset of approximately 8,000 posters annotated with reading order and parent-child relations. Compared to an existing structural analysis dataset, SciPostLayoutTree contains more instances of spatially challenging relations, including upward, horizontal, and long-distance relations. As a solution to these challenges, we develop Layout Tree Decoder, which incorporates visual features as well as bounding box features including position and category information. The model also uses beam search to predict relations while capturing sequence-level plausibility. Experimental results demonstrate that our model improves the prediction accuracy for spatially challenging relations and establishes a solid baseline for poster structure analysis. The dataset is publicly available at https://huggingface.co/datasets/omron-sinicx/scipostlayouttree. The code is also publicly available at https://github.com/omron-sinicx/scipostlayouttree.

CVJul 2, 2025Code
CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

Kuniaki Saito, Donghyun Kim, Kwanyong Park et al.

An image captioning model flexibly switching its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition its language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506\% despite better lexical alignment. Code will be available on https://github.com/omron-sinicx/captionsmiths.

CVJul 6, 2021Code
Foreground-Aware Stylization and Consensus Pseudo-Labeling for Domain Adaptation of First-Person Hand Segmentation

Takehiko Ohkawa, Takuma Yagi, Atsushi Hashimoto et al.

Hand segmentation is a crucial task in first-person vision. Since first-person images exhibit strong bias in appearance among different environments, adapting a pre-trained segmentation model to a new domain is required in hand segmentation. Here, we focus on appearance gaps for hand regions and backgrounds separately. We propose (i) foreground-aware image stylization and (ii) consensus pseudo-labeling for domain adaptation of hand segmentation. We stylize source images independently for the foreground and background using target images as style. To resolve the domain shift that the stylization has not addressed, we apply careful pseudo-labeling by taking a consensus between the models trained on the source and stylized source images. We validated our method on domain adaptation of hand segmentation from real and simulation images. Our method achieved state-of-the-art performance in both settings. We also demonstrated promising results in challenging multi-target domain adaptation and domain generalization settings. Code is available at https://github.com/ut-vision/FgSty-CPL.

CVDec 12, 2018Code
Strong-Weak Distribution Alignment for Adaptive Object Detection

Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada et al.

We propose an approach for unsupervised adaptation of object detectors from label-rich to label-poor domains which can significantly reduce annotation costs associated with detection. Recently, approaches that align distributions of source and target images using an adversarial loss have been proven effective for adapting object classifiers. However, for object detection, fully matching the entire distributions of source and target images to each other at the global image level may fail, as domains could have distinct scene layouts and different combinations of objects. On the other hand, strong matching of local features such as texture and color makes sense, as it does not change category level semantics. This motivates us to propose a novel method for detector adaptation based on strong local alignment and weak global alignment. Our key contribution is the weak alignment model, which focuses the adversarial alignment loss on images that are globally similar and puts less emphasis on aligning images that are globally dissimilar. Additionally, we design the strong domain alignment model to only look at local receptive fields of the feature map. We empirically verify the effectiveness of our method on four datasets comprising both large and small domain shifts. Our code is available at \url{https://github.com/VisionLearningGroup/DA_Detection}

CVNov 29, 2018Code
Generating Easy-to-Understand Referring Expressions for Target Identifications

Mikihiro Tanaka, Takayuki Itamochi, Kenichi Narioka et al.

This paper addresses the generation of referring expressions that not only refer to objects correctly but also let humans find them quickly. As a target becomes relatively less salient, identifying referred objects itself becomes more difficult. However, the existing studies regarded all sentences that refer to objects correctly as equally good, ignoring whether they are easily understood by humans. If the target is not salient, humans utilize relationships with the salient contexts around it to help listeners to comprehend it better. To derive this information from human annotations, our model is designed to extract information from the target and from the environment. Moreover, we regard that sentences that are easily understood are those that are comprehended correctly and quickly by humans. We optimized this by using the time required to locate the referred objects by humans and their accuracies. To evaluate our system, we created a new referring expression dataset whose images were acquired from Grand Theft Auto V (GTA V), limiting targets to persons. Experimental results show the effectiveness of our approach. Our code and dataset are available at https://github.com/mikittt/easy-to-understand-REG.

CVNov 27, 2018Code
Label-Noise Robust Generative Adversarial Networks

Takuhiro Kaneko, Yoshitaka Ushiku, Tatsuya Harada

Generative adversarial networks (GANs) are a framework that learns a generative distribution through adversarial training. Recently, their class-conditional extensions (e.g., conditional GAN (cGAN) and auxiliary classifier GAN (AC-GAN)) have attracted much attention owing to their ability to learn the disentangled representations and to improve the training stability. However, their training requires the availability of large-scale accurate class-labeled data, which are often laborious or impractical to collect in a real-world scenario. To remedy this, we propose a novel family of GANs called label-noise robust GANs (rGANs), which, by incorporating a noise transition model, can learn a clean label conditional generative distribution even when training labels are noisy. In particular, we propose two variants: rAC-GAN, which is a bridging model between AC-GAN and the label-noise robust classification model, and rcGAN, which is an extension of cGAN and solves this problem with no reliance on any classifier. In addition to providing the theoretical background, we demonstrate the effectiveness of our models through extensive experiments using diverse GAN configurations, various noise settings, and multiple evaluation metrics (in which we tested 402 conditions in total). Our code is available at https://github.com/takuhirok/rGAN/.

CVNov 27, 2018Code
Class-Distinct and Class-Mutual Image Generation with GANs

Takuhiro Kaneko, Yoshitaka Ushiku, Tatsuya Harada

Class-conditional extensions of generative adversarial networks (GANs), such as auxiliary classifier GAN (AC-GAN) and conditional GAN (cGAN), have garnered attention owing to their ability to decompose representations into class labels and other factors and to boost the training stability. However, a limitation is that they assume that each class is separable and ignore the relationship between classes even though class overlapping frequently occurs in a real-world scenario when data are collected on the basis of diverse or ambiguous criteria. To overcome this limitation, we address a novel problem called class-distinct and class-mutual image generation, in which the goal is to construct a generator that can capture between-class relationships and generate an image selectively conditioned on the class specificity. To solve this problem without additional supervision, we propose classifier's posterior GAN (CP-GAN), in which we redesign the generator input and the objective function of AC-GAN for class-overlapping data. Precisely, we incorporate the classifier's posterior into the generator input and optimize the generator so that the classifier's posterior of generated data corresponds with that of real data. We demonstrate the effectiveness of CP-GAN using both controlled and real-world class-overlapping data with a model configuration analysis and comparative study. Our code is available at https://github.com/takuhirok/CP-GAN/.

CVAug 6, 2018Code
Visual Question Generation for Class Acquisition of Unknown Objects

Kohei Uehara, Antonio Tejero-De-Pablos, Yoshitaka Ushiku et al.

Traditional image recognition methods only consider objects belonging to already learned classes. However, since training a recognition model with every object class in the world is unfeasible, a way of getting information on unknown objects (i.e., objects whose class has not been learned) is necessary. A way for an image recognition system to learn new classes could be asking a human about objects that are unknown. In this paper, we propose a method for generating questions about unknown objects in an image, as means to get information about classes that have not been learned. Our method consists of a module for proposing objects, a module for identifying unknown objects, and a module for generating questions about unknown objects. The experimental results via human evaluation show that our method can successfully get information about unknown objects in an image dataset. Our code and dataset are available at https://github.com/mil-tokyo/vqg-unknown.

CVDec 7, 2017Code
Maximum Classifier Discrepancy for Unsupervised Domain Adaptation

Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku et al.

In this work, we present a method for unsupervised domain adaptation. Many adversarial learning methods train domain classifier networks to distinguish the features as either a source or target and train a feature generator network to mimic the discriminator. Two problems exist with these methods. First, the domain classifier only tries to distinguish the features as a source or target and thus does not consider task-specific decision boundaries between classes. Therefore, a trained generator can generate ambiguous features near class boundaries. Second, these methods aim to completely match the feature distributions between different domains, which is difficult because of each domain's characteristics. To solve these problems, we introduce a new approach that attempts to align distributions of source and target by utilizing the task-specific decision boundaries. We propose to maximize the discrepancy between two classifiers' outputs to detect target samples that are far from the support of the source. A feature generator learns to generate target features near the support to minimize the discrepancy. Our method outperforms other methods on several datasets of image classification and semantic segmentation. The codes are available at \url{https://github.com/mil-tokyo/MCD_DA}

CLSep 5, 2024
MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models

Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi et al.

A college-level benchmark dataset for large language models (LLMs) in the materials science field, MaterialBENCH, is constructed. This dataset consists of problem-answer pairs, based on university textbooks. There are two types of problems: one is the free-response answer type, and the other is the multiple-choice type. Multiple-choice problems are constructed by adding three incorrect answers as choices to a correct answer, so that LLMs can choose one of the four as a response. Most of the problems for free-response answer and multiple-choice types overlap except for the format of the answers. We also conduct experiments using the MaterialBENCH on LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 with the OpenAI API. The differences and similarities in the performance of LLMs measured by the MaterialBENCH are analyzed and discussed. Performance differences between the free-response type and multiple-choice type in the same models and the influence of using system massages on multiple-choice problems are also studied. We anticipate that MaterialBENCH will encourage further developments of LLMs in reasoning abilities to solve more complicated problems and eventually contribute to materials research and discovery.

LGMar 18, 2024
Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding

Tatsunori Taniai, Ryo Igarashi, Yuta Suzuki et al.

Predicting physical properties of materials from their crystal structures is a fundamental problem in materials science. In peripheral areas such as the prediction of molecular properties, fully connected attention networks have been shown to be successful. However, unlike these finite atom arrangements, crystal structures are infinitely repeating, periodic arrangements of atoms, whose fully connected attention results in infinitely connected attention. In this work, we show that this infinitely connected attention can lead to a computationally tractable formulation, interpreted as neural potential summation, that performs infinite interatomic potential summations in a deeply learned feature space. We then propose a simple yet effective Transformer-based encoder architecture for crystal structures called Crystalformer. Compared to an existing Transformer-based model, the proposed model requires only 29.4% of the number of parameters, with minimal modifications to the original Transformer architecture. Despite the architectural simplicity, the proposed method outperforms state-of-the-art methods for various property regression tasks on the Materials Project and JARVIS-DFT datasets.

CLFeb 16, 2024
Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction

Kuniaki Saito, Kihyuk Sohn, Chen-Yu Lee et al.

Large language models require updates to remain up-to-date or adapt to new domains by fine-tuning them with new documents. One key is memorizing the latest information in a way that the memorized information is extractable with a query prompt. However, LLMs suffer from a phenomenon called perplexity curse; despite minimizing document perplexity during fine-tuning, LLMs struggle to extract information through a prompt sentence. In this new knowledge acquisition and extraction, we find a very intriguing fact that LLMs can accurately answer questions about the first sentence, but they struggle to extract information described in the middle or end of the documents used for fine-tuning. Our study suggests that the auto-regressive training causes this issue; each token is prompted by reliance on all previous tokens, which hinders the model from recalling information from training documents by question prompts. To conduct the in-depth study, we publish both synthetic and real datasets, enabling the evaluation of the QA performance w.r.t. the position of the corresponding answer in a document. Our investigation shows that even a large model suffers from the perplexity curse, but regularization such as denoising auto-regressive loss can enhance the information extraction from diverse positions. These findings will be (i) a key to improving knowledge extraction from LLMs and (ii) new elements to discuss the trade-off between RAG and fine-tuning in adapting LLMs to a new domain.

LGDec 7, 2023
A Transformer Model for Symbolic Regression towards Scientific Discovery

Florian Lalande, Yoshitomo Matsubara, Naoya Chiba et al.

Symbolic Regression (SR) searches for mathematical expressions which best describe numerical datasets. This allows to circumvent interpretation issues inherent to artificial neural networks, but SR algorithms are often computationally expensive. This work proposes a new Transformer model aiming at Symbolic Regression particularly focused on its application for Scientific Discovery. We propose three encoder architectures with increasing flexibility but at the cost of column-permutation equivariance violation. Training results indicate that the most flexible architecture is required to prevent from overfitting. Once trained, we apply our best model to the SRSD datasets (Symbolic Regression for Scientific Discovery datasets) which yields state-of-the-art results using the normalized tree-based edit distance, at no extra computational cost.

CVDec 23, 2024
SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images

Risa Shinoda, Kuniaki Saito, Shohei Tanaka et al.

Building a large-scale figure QA dataset requires a considerable amount of work, from gathering and selecting figures to extracting attributes like text, numbers, and colors, and generating QAs. Although recent developments in LLMs have led to efforts to synthesize figures, most of these focus primarily on QA generation. Additionally, creating figures directly using LLMs often encounters issues such as code errors, similar-looking figures, and repetitive content in figures. To address this issue, we present SBSFigures (Stage-by-Stage Synthetic Figures), a dataset for pre-training figure QA. Our proposed pipeline enables the creation of chart figures with complete annotations of the visualized data and dense QA annotations without any manual annotation process. Our stage-by-stage pipeline makes it possible to create diverse topic and appearance figures efficiently while minimizing code errors. Our SBSFigures demonstrate a strong pre-training effect, making it possible to achieve efficient training with a limited amount of real-world chart data starting from our pre-trained weights.

LGMar 4, 2025
CrystalFramer: Rethinking the Role of Frames for SE(3)-Invariant Crystal Structure Modeling

Yusei Ito, Tatsunori Taniai, Ryo Igarashi et al.

Crystal structure modeling with graph neural networks is essential for various applications in materials informatics, and capturing SE(3)-invariant geometric features is a fundamental requirement for these networks. A straightforward approach is to model with orientation-standardized structures through structure-aligned coordinate systems, or"frames." However, unlike molecules, determining frames for crystal structures is challenging due to their infinite and highly symmetric nature. In particular, existing methods rely on a statically fixed frame for each structure, determined solely by its structural information, regardless of the task under consideration. Here, we rethink the role of frames, questioning whether such simplistic alignment with the structure is sufficient, and propose the concept of dynamic frames. While accommodating the infinite and symmetric nature of crystals, these frames provide each atom with a dynamic view of its local environment, focusing on actively interacting atoms. We demonstrate this concept by utilizing the attention mechanism in a recent transformer-based crystal encoder, resulting in a new architecture called CrystalFramer. Extensive experiments show that CrystalFramer outperforms conventional frames and existing crystal encoders in various crystal property prediction tasks.

LGJan 22, 2025
Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science

Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi et al.

Understanding structure-property relationships is an essential yet challenging aspect of materials discovery and development. To facilitate this process, recent studies in materials informatics have sought latent embedding spaces of crystal structures to capture their similarities based on properties and functionalities. However, abstract feature-based embedding spaces are human-unfriendly and prevent intuitive and efficient exploration of the vast materials space. Here we introduce Contrastive Language--Structure Pre-training (CLaSP), a learning paradigm for constructing crossmodal embedding spaces between crystal structures and texts. CLaSP aims to achieve material embeddings that 1) capture property- and functionality-related similarities between crystal structures and 2) allow intuitive retrieval of materials via user-provided description texts as queries. To compensate for the lack of sufficient datasets linking crystal structures with textual descriptions, CLaSP leverages a dataset of over 400,000 published crystal structures and corresponding publication records, including paper titles and abstracts, for training. We demonstrate the effectiveness of CLaSP through text-based crystal structure screening and embedding space visualization.

CVMar 4, 2024
TNF: Tri-branch Neural Fusion for Multimodal Medical Data Classification

Tong Zheng, Shusaku Sone, Yoshitaka Ushiku et al.

This paper presents a Tri-branch Neural Fusion (TNF) approach designed for classifying multimodal medical images and tabular data. It also introduces two solutions to address the challenge of label inconsistency in multimodal classification. Traditional methods in multi-modality medical data classification often rely on single-label approaches, typically merging features from two distinct input modalities. This becomes problematic when features are mutually exclusive or labels differ across modalities, leading to reduced accuracy. To overcome this, our TNF approach implements a tri-branch framework that manages three separate outputs: one for image modality, another for tabular modality, and a third hybrid output that fuses both image and tabular data. The final decision is made through an ensemble method that integrates likelihoods from all three branches. We validate the effectiveness of TNF through extensive experiments, which illustrate its superiority over traditional fusion and ensemble methods in various convolutional neural networks and transformer-based architectures across multiple datasets.

CVNov 27, 2025
SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts

Shun Inadumi, Shohei Tanaka, Tosho Hirasawa et al.

As the number of scientific papers continues to grow, there is a demand for approaches that can effectively convey research findings, with posters serving as a key medium for presenting paper contents. Poster layouts determine how effectively research is communicated and understood, highlighting their growing importance. In particular, a gap remains in understanding how papers correspond to the layouts that present them, which calls for datasets with paired annotations at scale. To bridge this gap, we introduce SciPostGen, a large-scale dataset for understanding and generating poster layouts from scientific papers. Our analyses based on SciPostGen show that paper structures are associated with the number of layout elements in posters. Based on this insight, we explore a framework, Retrieval-Augmented Poster Layout Generation, which retrieves layouts consistent with a given paper and uses them as guidance for layout generation. We conducted experiments under two conditions: with and without layout constraints typically specified by poster creators. The results show that the retriever estimates layouts aligned with paper structures, and our framework generates layouts that also satisfy given constraints.

CVNov 25, 2025
AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

Kuniaki Saito, Risa Shinoda, Shohei Tanaka et al.

Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.

HCAug 22, 2025
Cooperative Design Optimization through Natural Language Interaction

Ryogo Niwa, Shigeo Yoshida, Yuki Koyama et al.

Designing successful interactions requires identifying optimal design parameters. To do so, designers often conduct iterative user testing and exploratory trial-and-error. This involves balancing multiple objectives in a high-dimensional space, making the process time-consuming and cognitively demanding. System-led optimization methods, such as those based on Bayesian optimization, can determine for designers which parameters to test next. However, they offer limited opportunities for designers to intervene in the optimization process, negatively impacting the designer's experience. We propose a design optimization framework that enables natural language interactions between designers and the optimization system, facilitating cooperative design optimization. This is achieved by integrating system-led optimization methods with Large Language Models (LLMs), allowing designers to intervene in the optimization process and better understand the system's reasoning. Experimental results show that our method provides higher user agency than a system-led method and shows promising optimization performance compared to manual design. It also matches the performance of an existing cooperative method with lower cognitive load.

ROJun 14, 2025
Perspective on Utilizing Foundation Models for Laboratory Automation in Materials Research

Kan Hatakeyama-Sato, Toshihiko Nishida, Kenta Kitamura et al.

This review explores the potential of foundation models to advance laboratory automation in the materials and chemical sciences. It emphasizes the dual roles of these models: cognitive functions for experimental planning and data analysis, and physical functions for hardware operations. While traditional laboratory automation has relied heavily on specialized, rigid systems, foundation models offer adaptability through their general-purpose intelligence and multimodal capabilities. Recent advancements have demonstrated the feasibility of using large language models (LLMs) and multimodal robotic systems to handle complex and dynamic laboratory tasks. However, significant challenges remain, including precision manipulation of hardware, integration of multimodal data, and ensuring operational safety. This paper outlines a roadmap highlighting future directions, advocating for close interdisciplinary collaboration, benchmark establishment, and strategic human-AI integration to realize fully autonomous experimental laboratories.

MTRL-SCIOct 11, 2024
Rethinking Gradient-Based Methods: Multi-Property Materials Design Beyond Differentiable Targets

Akihiro Fujii, Yoshitaka Ushiku, Koji Shimizu et al.

Gradient-based methods offer a simple, efficient strategy for materials design by directly optimizing candidates using gradients from pretrained property predictors. However, their use in crystal structure optimization is hindered by two key challenges: handling non-differentiable constraints, such as charge neutrality and structural fidelity, and susceptibility to poor local minima. We revisit and extend the gradient-based methods to address these issues. We propose Simultaneous Multi-property Optimization using Adaptive Crystal Synthesizer (SMOACS), which integrates oxidation-number masks and template-based initialization to enforce non-differentiable constraints, avoid poor local minima, and flexibly incorporate additional constraints without retraining. SMOACS enables multi-property optimization. including exceptional targets such as high-temperature superconductivity, and scales to large crystal systems, both persistent challenges for generative models, even those enhanced with gradient-based guidance from property predictors. In experiments on five target properties and three datasets, SMOACS outperforms generative models and Bayesian optimization methods, successfully designing 135-atom perovskite structures that satisfy multiple property targets and constraints, a task at which the other methods fail entirely.

CVFeb 4, 2022
3D Point Cloud Registration with Learning-based Matching Algorithm

Rintaro Yanagi, Atsushi Hashimoto, Shusaku Sone et al.

We present a novel differential matching algorithm for 3D point cloud registration. Instead of only optimizing the feature extractor for a matching algorithm, we propose a learning-based matching module optimized to the jointly-trained feature extractor. We focused on edge-wise feature-forwarding architectures, which are memory-consuming but can avoid the over-smoothing effect that GNNs suffer. We improve its memory efficiency to scale it for point cloud registration while investigating the best way of connecting it to the feature extractor. Experimental results show our matching module's significant impact on performance improvement in rigid/non-rigid and whole/partial point cloud registration datasets with multiple contemporary feature extractors. For example, our module boosted the current SOTA method, RoITr, by +5.4%, and +7.2% in the NFMR metric and +6.1% and +8.5% in the IR metric on the 4DMatch and 4DLoMatch datasets, respectively.

CLApr 28, 2021
Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning

Ukyo Honda, Yoshitaka Ushiku, Atsushi Hashimoto et al.

Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs, but only with images and sentences drawn from different sources and object labels detected from the images. In previous work, pseudo-captions, i.e., sentences that contain the detected object labels, were assigned to a given image. The focus of the previous work was on the alignment of input images and pseudo-captions at the sentence level. However, pseudo-captions contain many words that are irrelevant to a given image. In this work, we investigate the effect of removing mismatched words from image-sentence alignment to determine how they make this task difficult. We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions: the detected object labels. The experimental results show that our proposed method outperforms the previous methods without introducing complex sentence-level learning objectives. Combined with the sentence-level alignment method of previous work, our method further improves its performance. These results confirm the importance of careful alignment in word-level details.

CVApr 1, 2021
Divergence Optimization for Noisy Universal Domain Adaptation

Qing Yu, Atsushi Hashimoto, Yoshitaka Ushiku

Universal domain adaptation (UniDA) has been proposed to transfer knowledge learned from a label-rich source domain to a label-scarce target domain without any constraints on the label sets. In practice, however, it is difficult to obtain a large amount of perfectly clean labeled data in a source domain with limited resources. Existing UniDA methods rely on source samples with correct annotations, which greatly limits their application in the real world. Hence, we consider a new realistic setting called Noisy UniDA, in which classifiers are trained with noisy labeled data from the source domain and unlabeled data with an unknown class distribution from the target domain. This paper introduces a two-head convolutional neural network framework to solve all problems simultaneously. Our network consists of one common feature generator and two classifiers with different decision boundaries. By optimizing the divergence between the two classifiers' outputs, we can detect noisy source samples, find "unknown" classes in the target domain, and align the distribution of the source and target domains. In an extensive evaluation of different domain adaptation settings, the proposed method outperformed existing methods by a large margin in most settings.

CVNov 22, 2019
Crowd Density Forecasting by Modeling Patch-based Dynamics

Hiroaki Minoura, Ryo Yonetani, Mai Nishimura et al.

Forecasting human activities observed in videos is a long-standing challenge in computer vision, which leads to various real-world applications such as mobile robots, autonomous driving, and assistive systems. In this work, we present a new visual forecasting task called crowd density forecasting. Given a video of a crowd captured by a surveillance camera, our goal is to predict how that crowd will move in future frames. To address this task, we have developed the patch-based density forecasting network (PDFN), which enables forecasting over a sequence of crowd density maps describing how crowded each location is in each video frame. PDFN represents a crowd density map based on spatially overlapping patches and learns density dynamics patch-wise in a compact latent space. This enables us to model diverse and complex crowd density dynamics efficiently, even when the input video involves a variable number of crowds that each move independently. Experimental results with several public datasets demonstrate the effectiveness of our approach compared with state-of-the-art forecasting methods.

LGMay 23, 2019
Decentralized Learning of Generative Adversarial Networks from Non-iid Data

Ryo Yonetani, Tomohiro Takahashi, Atsushi Hashimoto et al.

This work addresses a new problem that learns generative adversarial networks (GANs) from multiple data collections that are each i) owned separately by different clients and ii) drawn from a non-identical distribution that comprises different classes. Given such non-iid data as input, we aim to learn a distribution involving all the classes input data can belong to, while keeping the data decentralized in each client storage. Our key contribution to this end is a new decentralized approach for learning GANs from non-iid data called Forgiver-First Update (F2U), which a) asks clients to train an individual discriminator with their own data and b) updates a generator to fool the most `forgiving' discriminators who deem generated samples as the most real. Our theoretical analysis proves that this updating strategy allows the decentralized GAN to achieve a generator's distribution with all the input classes as its global optimum based on f-divergence minimization. Moreover, we propose a relaxed version of F2U called Forgiver-First Aggregation (F2A) that performs well in practice, which adaptively aggregates the discriminators while emphasizing forgiving ones. Our empirical evaluations with image generation tasks demonstrated the effectiveness of our approach over state-of-the-art decentralized learning methods.

CVMar 15, 2019
Pose Graph Optimization for Unsupervised Monocular Visual Odometry

Yang Li, Yoshitaka Ushiku, Tatsuya Harada

Unsupervised Learning based monocular visual odometry (VO) has lately drawn significant attention for its potential in label-free leaning ability and robustness to camera parameters and environmental variations. However, partially due to the lack of drift correction technique, these methods are still by far less accurate than geometric approaches for large-scale odometry estimation. In this paper, we propose to leverage graph optimization and loop closure detection to overcome limitations of unsupervised learning based monocular visual odometry. To this end, we propose a hybrid VO system which combines an unsupervised monocular VO called NeuralBundler with a pose graph optimization back-end. NeuralBundler is a neural network architecture that uses temporal and spatial photometric loss as main supervision and generates a windowed pose graph consists of multi-view 6DoF constraints. We propose a novel pose cycle consistency loss to relieve the tensions in the windowed pose graph, leading to improved performance and robustness. In the back-end, a global pose graph is built from local and loop 6DoF constraints estimated by NeuralBundler and is optimized over SE(3). Empirical evaluation on the KITTI odometry dataset demonstrates that 1) NeuralBundler achieves state-of-the-art performance on unsupervised monocular VO estimation, and 2) our whole approach can achieve efficient loop closing and show favorable overall translational accuracy compared to established monocular SLAM systems.

CVDec 11, 2018
Multichannel Semantic Segmentation with Unsupervised Domain Adaptation

Kohei Watanabe, Kuniaki Saito, Yoshitaka Ushiku et al.

Most contemporary robots have depth sensors, and research on semantic segmentation with RGBD images has shown that depth images boost the accuracy of segmentation. Since it is time-consuming to annotate images with semantic labels per pixel, it would be ideal if we could avoid this laborious work by utilizing an existing dataset or a synthetic dataset which we can generate on our own. Robot motions are often tested in a synthetic environment, where multichannel (eg, RGB + depth + instance boundary) images plus their pixel-level semantic labels are available. However, models trained simply on synthetic images tend to demonstrate poor performance on real images. In order to address this, we propose two approaches that can efficiently exploit multichannel inputs combined with an unsupervised domain adaptation (UDA) algorithm. One is a fusion-based approach that uses depth images as inputs. The other is a multitask learning approach that uses depth images as outputs. We demonstrated that the segmentation results were improved by using a multitask learning approach with a post-process and created a benchmark for this task.

CVDec 4, 2018
Conditional Video Generation Using Action-Appearance Captions

Shohei Yamamoto, Antonio Tejero-de-Pablos, Yoshitaka Ushiku et al.

The field of automatic video generation has received a boost thanks to the recent Generative Adversarial Networks (GANs). However, most existing methods cannot control the contents of the generated video using a text caption, losing their usefulness to a large extent. This particularly affects human videos due to their great variety of actions and appearances. This paper presents Conditional Flow and Texture GAN (CFT-GAN), a GAN-based video generation method from action-appearance captions. We propose a novel way of generating video by encoding a caption (e.g., "a man in blue jeans is playing golf") in a two-stage generation pipeline. Our CFT-GAN uses such caption to generate an optical flow (action) and a texture (appearance) for each frame. As a result, the output video reflects the content specified in the caption in a plausible way. Moreover, to train our method, we constructed a new dataset for human video generation with captions. We evaluated the proposed method qualitatively and quantitatively via an ablation study and a user study. The results demonstrate that CFT-GAN is able to successfully generate videos containing the action and appearances indicated in the captions.

CLApr 27, 2018
Customized Image Narrative Generation via Interactive Visual Question Generation and Answering

Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada

Image description task has been invariably examined in a static manner with qualitative presumptions held to be universally applicable, regardless of the scope or target of the description. In practice, however, different viewers may pay attention to different aspects of the image, and yield different descriptions or interpretations under various contexts. Such diversity in perspectives is difficult to derive with conventional image description techniques. In this paper, we propose a customized image narrative generation task, in which the users are interactively engaged in the generation process by providing answers to the questions. We further attempt to learn the user's interest via repeating such interactive stages, and to automatically reflect the interest in descriptions for new images. Experimental results demonstrate that our model can generate a variety of descriptions from single image that cover a wider range of topics than conventional models, while being customizable to the target user of interaction.

CVApr 27, 2018
Open Set Domain Adaptation by Backpropagation

Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku et al.

Numerous algorithms have been proposed for transferring knowledge from a label-rich domain (source) to a label-scarce domain (target). Almost all of them are proposed for a closed-set scenario, where the source and the target domain completely share the class of their samples. We call the shared class the \doublequote{known class.} However, in practice, when samples in target domain are not labeled, we cannot know whether the domains share the class. A target domain can contain samples of classes that are not shared by the source domain. We call such classes the \doublequote{unknown class} and algorithms that work well in the open set situation are very practical. However, most existing distribution matching methods for domain adaptation do not work well in this setting because unknown target samples should not be aligned with the source. In this paper, we propose a method for an open set domain adaptation scenario which utilizes adversarial training. A classifier is trained to make a boundary between the source and the target samples whereas a generator is trained to make target samples far from the boundary. Thus, we assign two options to the feature generator: aligning them with source known samples or rejecting them as unknown target samples. This approach allows extracting features that separate unknown target samples from known target samples. Our method was extensively evaluated in domain adaptation setting and outperformed other methods with a large margin in most settings.