RO May 27
Whose Is This?: Context-Aware Object Ownership Inference with Uncertainty-Guided Questioning
Saki Hashimoto, Akira Taniguchi, Shoichi Hasegawa et al.
Service robots must infer object ownership to correctly interpret instructions such as "bring me my cup." However, ownership is a latent attribute that cannot be directly observed, and existing methods often rely on limited cues such as recent usage, making them unreliable in scenarios such as temporary sharing. We propose a framework for context-aware ownership inference with uncertainty-guided interaction (COIN). The method integrates user background information and object usage history using a large language model (LLM) to estimate ownership scores. To handle uncertainty, we apply conformal prediction to construct a set of plausible owners and selectively generate user queries when the prediction is uncertain. Experiments in a simulated home environment show that the proposed method consistently outperforms baseline approaches, achieving a Subset Accuracy of 0.988 and a Mean Jaccard index of 0.991. The method also maintains high performance in scenarios involving temporary use and shared ownership. The results demonstrate that combining contextual reasoning with uncertainty-aware interaction improves both estimation accuracy and robustness. The project page is available at https://emergentsystemlabstudent.github.io/COIN/.
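The uncertainty-handling step described above can be illustrated with split conformal prediction. The following is a minimal sketch, not the authors' implementation: the nonconformity score (1 minus the ownership score), the calibration values, the candidate names, and the rule "ask whenever the set is not a singleton" are all assumptions made for illustration.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal quantile of the calibration nonconformity scores."""
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(calibration_scores)[rank - 1]

def prediction_set(owner_scores, threshold):
    """Keep each candidate whose nonconformity (1 - score) is within the threshold."""
    return {name for name, s in owner_scores.items() if 1.0 - s <= threshold}

def decide(owner_scores, threshold):
    """Answer directly for a singleton set; otherwise ask the user (assumed rule)."""
    owners = prediction_set(owner_scores, threshold)
    if len(owners) == 1:
        return ("answer", owners)
    return ("ask", owners)

# Hypothetical calibration data: 1 - score that was assigned to the true owner.
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
q_hat = conformal_threshold(calibration, alpha=0.2)

print(decide({"Alice": 0.9, "Bob": 0.05, "Carol": 0.05}, q_hat))
print(decide({"Alice": 0.7, "Bob": 0.65}, q_hat))
```

In this sketch a confident score profile yields a single plausible owner and no question, while two similar scores yield a larger set and trigger a query.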
RO Jan 14, 2023
World Models and Predictive Coding for Cognitive and Developmental Robotics: Frontiers and Challenges
Tadahiro Taniguchi, Shingo Murata, Masahiro Suzuki et al.
Creating autonomous robots that can actively explore the environment, acquire knowledge and learn skills continuously is the ultimate achievement envisioned in cognitive and developmental robotics. Their learning processes should be based on interactions with their physical and social world in the manner of human learning and cognitive development. Based on this context, in this paper, we focus on the two concepts of world models and predictive coding. Recently, world models have attracted renewed attention as a topic of considerable interest in artificial intelligence. Cognitive systems learn world models to better predict future sensory observations and optimize their policies, i.e., controllers. Alternatively, in neuroscience, predictive coding proposes that the brain continuously predicts its inputs and adapts to model its own dynamics and control behavior in its environment. Both ideas may be considered as underpinning the cognitive development of robots and humans capable of continual or lifelong learning. Although many studies have been conducted on predictive coding in cognitive robotics and neurorobotics, the relationship between world model-based approaches in AI and predictive coding in robotics has rarely been discussed. Therefore, in this paper, we clarify the definitions, relationships, and status of current research on these topics, as well as missing pieces of world models and predictive coding in conjunction with crucially related concepts such as the free-energy principle and active inference in the context of cognitive and developmental robotics. Furthermore, we outline the frontiers and challenges involved in world models and predictive coding toward the further integration of AI and robotics, as well as the creation of robots with real cognitive and developmental capabilities in the future.
LG Mar 1, 2022
DreamingV2: Reinforcement Learning with Discrete World Models without Reconstruction
Masashi Okada, Tadahiro Taniguchi
The present paper proposes a novel reinforcement learning method with world models, DreamingV2, a collaborative extension of DreamerV2 and Dreaming. DreamerV2 is a cutting-edge model-based reinforcement learning method from pixels that uses discrete world models to represent latent states with categorical variables. Dreaming is also a form of reinforcement learning from pixels that attempts to avoid the autoencoding process in general world model training by involving a reconstruction-free contrastive learning objective. The proposed DreamingV2 adopts both the discrete representation of DreamerV2 and the reconstruction-free objective of Dreaming. Compared to DreamerV2 and other recent model-based methods without reconstruction, DreamingV2 achieves the best scores on five simulated challenging 3D robot arm tasks. We believe that DreamingV2 will be a reliable solution for robot learning since its discrete representation is suitable for describing discontinuous environments, and its reconstruction-free objective handles complex visual observations well.
AI May 24, 2022
Emergent Communication through Metropolis-Hastings Naming Game with Deep Generative Models
Tadahiro Taniguchi, Yuto Yoshida, Akira Taniguchi et al.
Constructive studies on symbol emergence systems seek to investigate computational models that can better explain human language evolution, the creation of symbol systems, and the construction of internal representations. This study provides a new model for emergent communication, which is based on a probabilistic generative model (PGM) instead of a discriminative model based on deep reinforcement learning. We define the Metropolis-Hastings (MH) naming game by generalizing previously proposed models. It is not a referential game with explicit feedback, as assumed by many emergent communication studies. Instead, it is a game based on joint attention without explicit feedback. Mathematically, the MH naming game is proved to be a type of MH algorithm for an integrative PGM that combines two agents that play the naming game. From this viewpoint, symbol emergence is regarded as decentralized Bayesian inference, and semiotic communication is regarded as inter-personal cross-modal inference. This notion leads to the collective predictive coding hypothesis regarding language evolution and, in general, the emergence of symbols. We also propose the inter-Gaussian mixture model (GMM) + variational autoencoder (VAE), a deep generative model for emergent communication based on the MH naming game. The model has been validated on the MNIST and Fruits 360 datasets. Experimental findings demonstrate that categories are formed from real images observed by agents, and signs are correctly shared across agents by successfully utilizing the observations of both agents via the MH naming game. Furthermore, we verified that visual images were recalled from signs uttered by agents. Notably, emergent communication without supervision and reward feedback improved the performance of the unsupervised representation learning of agents.
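The listener-side decision in an MH naming game can be sketched as a standard Metropolis-Hastings acceptance test. This is a minimal sketch under stated assumptions, not the paper's code: `listener_likelihood` is a hypothetical table giving how probable the listener's own internal state is under each sign, and the function names are illustrative.

```python
import random

def mh_accept(p_proposed, p_current, rng):
    """Accept with probability min(1, p_proposed / p_current)."""
    if p_current <= 0.0:
        # Assumed convention: any proposal beats a sign with zero probability.
        return True
    ratio = min(1.0, p_proposed / p_current)
    return rng.random() < ratio

def naming_game_step(listener_sign, proposed_sign, listener_likelihood, rng):
    """Return the listener's sign after receiving one proposal from the speaker."""
    p_new = listener_likelihood[proposed_sign]
    p_old = listener_likelihood[listener_sign]
    return proposed_sign if mh_accept(p_new, p_old, rng) else listener_sign

rng = random.Random(0)
# Hypothetical likelihoods of the listener's own state under each sign.
likelihood = {"a": 0.1, "b": 0.5}
print(naming_game_step("a", "b", likelihood, rng))
```

The listener judges the proposal only against its own beliefs, so no explicit feedback or reward is passed back to the speaker.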
CV Mar 22, 2022
Representation Uncertainty in Self-Supervised Learning as Variational Inference
Hiroki Nakamura, Masashi Okada, Tadahiro Taniguchi
In this study, a novel self-supervised learning (SSL) method is proposed, which considers SSL in terms of variational inference to learn not only representation but also representation uncertainties. SSL is a method of learning representations without labels by maximizing the similarity between image representations of different augmented views of an image. Meanwhile, variational autoencoder (VAE) is an unsupervised representation learning method that trains a probabilistic generative model with variational inference. Both VAE and SSL can learn representations without labels, but their relationship has not been investigated in the past. Herein, the theoretical relationship between SSL and variational inference has been clarified. Furthermore, a novel method, namely variational inference SimSiam (VI-SimSiam), has been proposed. VI-SimSiam can predict the representation uncertainty by interpreting SimSiam with variational inference and defining the latent space distribution. The present experiments qualitatively show that VI-SimSiam could learn uncertainty by comparing input images and predicted uncertainties. Additionally, we described a relationship between estimated uncertainty and classification accuracy.
RO Mar 10, 2022
Tactile-Sensitive NewtonianVAE for High-Accuracy Industrial Connector Insertion
Ryo Okumura, Nobuki Nishio, Tadahiro Taniguchi
An industrial connector insertion task requires submillimeter positioning and grasp pose compensation for a plug. Thus, highly accurate estimation of the relative pose between a plug and socket is fundamental for achieving the task. World models are promising technologies for visuomotor control because they obtain an appropriate state representation by jointly optimizing feature extraction and a latent dynamics model. Recent studies show that the NewtonianVAE, a type of world model, acquires a latent space equivalent to a mapping from images to physical coordinates. Proportional control can be achieved in the latent space of NewtonianVAE. However, applying NewtonianVAE to high-accuracy industrial tasks in physical environments is an open problem. Moreover, the existing framework does not consider the grasp pose compensation in the obtained latent space. In this work, we proposed tactile-sensitive NewtonianVAE and applied it to a USB connector insertion with grasp pose variation in the physical environments. We adopted a GelSight-type tactile sensor and estimated the insertion position compensated by the grasp pose of the plug. Our method trains the latent space in an end-to-end manner, and no additional engineering and annotation are required. Simple proportional control is available in the obtained latent space. Moreover, we showed that the original NewtonianVAE fails in some situations, and demonstrated that domain knowledge induction improves model accuracy. This domain knowledge can be easily obtained using robot specification and grasp pose error measurement. We demonstrated that our proposed method achieved a 100% success rate and 0.3 mm positioning accuracy in the USB connector insertion task in the physical environment. It outperformed SOTA CNN-based two-stage goal pose regression with grasp pose compensation using coordinate transformation.
AI Mar 15, 2022
Multi-View Dreaming: Multi-View World Model with Contrastive Learning
Akira Kinose, Masashi Okada, Ryo Okumura et al.
In this paper, we propose Multi-View Dreaming, a novel reinforcement learning agent for integrated recognition and control from multi-view observations by extending Dreaming. Most current reinforcement learning methods assume a single-view observation space, and this imposes limitations on the observed data, such as lack of spatial information and occlusions. This makes obtaining ideal observational information from the environment difficult and is a bottleneck for real-world robotics applications. In this paper, we use contrastive learning to train a shared latent space between different viewpoints, and show how the Products of Experts approach can be used to integrate and control the probability distributions of latent states for multiple viewpoints. We also propose Multi-View DreamingV2, a variant of Multi-View Dreaming that uses a categorical distribution to model the latent state instead of the Gaussian distribution. Experiments show that the proposed method outperforms simple extensions of existing methods in a realistic robot control task.
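For Gaussian latent states, a Product of Experts fusion has a closed form: precisions add, and the fused mean is the precision-weighted average of the per-view means. The sketch below shows this for a single latent dimension; it is an illustration of the general technique, not the paper's implementation, and the function name and example numbers are assumptions.

```python
def product_of_gaussians(means, variances):
    """Fuse per-view Gaussian estimates of one latent dimension.

    The product of Gaussian densities is (up to normalization) a Gaussian
    whose precision is the sum of the experts' precisions.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_var = 1.0 / total_precision
    # Precision-weighted average of the experts' means.
    fused_mean = fused_var * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_var

# Two hypothetical camera views: the second one is more certain.
mean, var = product_of_gaussians(means=[0.0, 2.0], variances=[1.0, 0.25])
print(mean, var)
```

A view with an occluded or uninformative observation can report a large variance, which reduces its weight in the fused estimate.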
SD Jun 9, 2022
Speak Like a Dog: Human to Non-human creature Voice Conversion
Kohei Suzuki, Shoki Sakamoto, Tadahiro Taniguchi et al.
This paper proposes a new voice conversion (VC) task from human speech to dog-like speech while preserving linguistic information as an example of human to non-human creature voice conversion (H2NH-VC) tasks. Although most VC studies deal with human to human VC, H2NH-VC aims to convert human speech into non-human creature-like speech. Non-parallel VC allows us to develop H2NH-VC, because we cannot collect a parallel dataset in which non-human creatures speak human language. In this study, we propose to use dogs as an example of a non-human creature target domain and define the "speak like a dog" task. To clarify the possibilities and characteristics of the "speak like a dog" task, we conducted a comparative experiment using existing representative non-parallel VC methods in acoustic features (Mel-cepstral coefficients and Mel-spectrograms), network architectures (five different kernel-size settings), and training criteria (variational autoencoder (VAE)-based and generative adversarial network-based). Finally, the converted voices were evaluated using mean opinion scores: dog-likeness, sound quality and intelligibility, and character error rate (CER). The experiment showed that the employment of the Mel-spectrogram improved the dog-likeness of the converted speech, while it is challenging to preserve linguistic information. Challenges and limitations of the current VC methods for H2NH-VC are highlighted.
RO Nov 20, 2022
Active Exploration based on Information Gain by Particle Filter for Efficient Spatial Concept Formation
Akira Taniguchi, Yoshiki Tabuchi, Tomochika Ishikawa et al.
Autonomous robots need to learn the categories of various places by exploring their environments and interacting with users. However, preparing training datasets with linguistic instructions from users is time-consuming and labor-intensive. Moreover, effective exploration is essential for appropriate concept formation and rapid environmental coverage. To address this issue, we propose an active inference method, referred to as spatial concept formation with information gain-based active exploration (SpCoAE), that combines sequential Bayesian inference using particle filters and information gain-based destination determination in a probabilistic generative model. This study interprets the robot's action as a selection of destinations to ask the user, "What kind of place is this?" in the context of active inference. This study provides insights into the technical aspects of the proposed method, including active perception and exploration by the robot, and how the method can enable mobile robots to learn spatial concepts through active exploration. Our experiment demonstrated the effectiveness of the SpCoAE in efficiently determining a destination for learning appropriate spatial concepts in home environments.
LG May 24, 2022
Symbol Emergence as Inter-personal Categorization with Head-to-head Latent Word
Kazuma Furukawa, Akira Taniguchi, Yoshinobu Hagiwara et al.
In this study, we propose a head-to-head type (H2H-type) inter-personal multimodal Dirichlet mixture (Inter-MDM) by modifying the original Inter-MDM, which is a probabilistic generative model that represents the symbol emergence between two agents as multiagent multimodal categorization. A Metropolis-Hastings method-based naming game based on the Inter-MDM enables two agents to collaboratively perform multimodal categorization and share signs with a solid mathematical foundation of convergence. However, the conventional Inter-MDM presumes a tail-to-tail connection across a latent word variable, causing inflexibility of the further extension of Inter-MDM for modeling a more complex symbol emergence. Therefore, we propose herein an H2H-type Inter-MDM that treats a latent word variable as a child node of an internal variable of each agent in the same way as many prior studies of multimodal categorization. On the basis of the H2H-type Inter-MDM, we propose a naming game in the same way as the conventional Inter-MDM. The experimental results show that the H2H-type Inter-MDM yields almost the same performance as the conventional Inter-MDM from the viewpoint of multimodal categorization and sign sharing.
RO Mar 21, 2022
Hierarchical Path-planning from Speech Instructions with Spatial Concept-based Topometric Semantic Mapping
Akira Taniguchi, Shuya Ito, Tadahiro Taniguchi
Assisting individuals in their daily activities through autonomous mobile robots, especially for users without specialized knowledge, is crucial. Specifically, the capability of robots to navigate to destinations based on human speech instructions is essential. While robots can take different paths to the same goal, the shortest path is not always the best. A preferred approach is to accommodate waypoint specifications flexibly, planning an improved alternative path, even with detours. Additionally, robots require real-time inference capabilities. This study aimed to realize a hierarchical spatial representation using a topometric semantic map and path planning with speech instructions, including waypoints. This paper presents Spatial Concept-based Topometric Semantic Mapping for Hierarchical Path Planning (SpCoTMHP), integrating place connectivity. This approach offers a novel integrated probabilistic generative model and fast approximate inference across hierarchy levels. A formulation based on control as probabilistic inference theoretically supports the proposed path planning algorithm. We conducted experiments in home environments using the Toyota Human Support Robot on the SIGVerse simulator and in a lab-office environment with the real robot, Albert. Users issued speech commands specifying the waypoint and goal, such as "Go to the bedroom via the corridor." Navigation experiments using speech instructions with a waypoint demonstrated a performance improvement of SpCoTMHP over the baseline hierarchical path planning method with heuristic path costs (HPP-I), in terms of the weighted success rate at which the robot reaches the closest target and passes the correct waypoints, by 0.590. The computation time was significantly accelerated by 7.14 seconds with SpCoTMHP compared to baseline HPP-I in advanced tasks.
AI Jul 11, 2023
Control as Probabilistic Inference as an Emergent Communication Mechanism in Multi-Agent Reinforcement Learning
Tomoaki Nakamura, Akira Taniguchi, Tadahiro Taniguchi
This paper proposes a generative probabilistic model integrating emergent communication and multi-agent reinforcement learning. The agents plan their actions by probabilistic inference, called control as inference, and communicate using messages that are latent variables and estimated based on the planned actions. Through these messages, each agent can send information about its actions and obtain information about the actions of another agent. Therefore, the agents change their actions according to the estimated messages to achieve cooperative tasks. This inference of messages can be considered as communication, and this procedure can be formulated by the Metropolis-Hastings naming game. Through experiments in the grid world environment, we show that the proposed PGM can infer meaningful messages to achieve the cooperative task.
NC Jul 6, 2022
Brain-inspired probabilistic generative model for double articulation analysis of spoken language
Akira Taniguchi, Maoko Muro, Hiroshi Yamakawa et al.
The human brain, among its several functions, analyzes the double articulation structure in spoken language, i.e., double articulation analysis (DAA). A hierarchical structure in which words are connected to form a sentence and words are composed of phonemes or syllables is called a double articulation structure. Where and how DAA is performed in the human brain has not been established, although some insights have been obtained. In addition, existing computational models based on a probabilistic generative model (PGM) do not incorporate neuroscientific findings, and their consistency with the brain has not been previously discussed. This study compared, mapped, and integrated these existing computational models with neuroscientific findings to bridge this gap, and the findings are relevant for future applications and further research. This study proposes a PGM for a DAA hypothesis that can be realized in the brain based on the outcomes of several neuroscientific surveys. The study involved (i) investigation and organization of anatomical structures related to spoken language processing, and (ii) design of a PGM that matches the anatomy and functions of the region of interest. Therefore, this study provides novel insights that will be foundational to further exploring DAA in the brain.
RO Mar 30
Reducing Mental Workload through On-Demand Human Assistance for Physical Action Failures in LLM-based Multi-Robot Coordination
Shoichi Hasegawa, Akira Taniguchi, Lotfi El Hafi et al.
Multi-robot coordination based on large language models (LLMs) has attracted growing attention, since LLMs enable the direct translation of natural language instructions into robot action plans by decomposing tasks and generating high-level plans. However, recovering from physical execution failures remains difficult, and tasks often stagnate due to the repetition of the same unsuccessful actions. While frameworks for remote robot operation using Mixed Reality were proposed, there have been few attempts to implement remote error resolution specifically for physical failures in multi-robot environments. In this study, we propose REPAIR (Robot Execution with Planned And Interactive Recovery), a human-in-the-loop framework that integrates remote error resolution into LLM-based multi-robot planning. In this method, robots execute tasks autonomously; however, when an irrecoverable failure occurs, the LLM requests assistance from an operator, enabling task continuity through remote intervention. Evaluations using a multi-robot trash collection task in a real-world environment confirmed that REPAIR significantly improves task progress (the number of items cleared within a time limit) compared to fully autonomous methods. Furthermore, for easily collectable items, it achieved task progress equivalent to full remote control. The results also suggested that the mental workload on the operator may differ in terms of physical demand and effort. The project website is https://emergentsystemlabstudent.github.io/REPAIR/.
CV Sep 8, 2023
Representation Synthesis by Probabilistic Many-Valued Logic Operation in Self-Supervised Learning
Hiroki Nakamura, Masashi Okada, Tadahiro Taniguchi
In this paper, we propose a new self-supervised learning (SSL) method for representations that enable logic operations. Representation learning has been applied to various tasks, such as image generation and retrieval. The logical controllability of representations is important for these tasks. Although some methods have been shown to enable the intuitive control of representations using natural languages as the inputs, representation control via logic operations between representations has not been demonstrated. Some SSL methods using representation synthesis (e.g., elementwise mean and maximum operations) have been proposed, but the operations performed in these methods do not incorporate logic operations. In this work, we propose a logic-operable self-supervised representation learning method by replacing the existing representation synthesis with the OR operation on the probabilistic extension of many-valued logic. The representations comprise a set of feature-possession degrees, which are truth values indicating the presence or absence of each feature in the image, and realize the logic operations (e.g., OR and AND). Our method can generate a representation that has the features of both representations or only those features common to both representations. In addition, the expression of the ambiguous presence of a feature is realized by indicating the feature-possession degree by the probability distribution of truth values of the many-valued logic. We showed that our method performs competitively in single and multi-label classification tasks compared with prior SSL methods using synthetic representations. Moreover, experiments on image retrieval using MNIST and PascalVOC showed that the representations of our method can be operated by OR and AND operations.
CL Jun 27, 2023
Symbol emergence as interpersonal cross-situational learning: the emergence of lexical knowledge with combinatoriality
Yoshinobu Hagiwara, Kazuma Furukawa, Takafumi Horie et al.
We present a computational model for a symbol emergence system that enables the emergence of lexical knowledge with combinatoriality among agents through a Metropolis-Hastings naming game and cross-situational learning. Many computational models have been proposed to investigate combinatoriality in emergent communication and symbol emergence in cognitive and developmental robotics. However, existing models do not sufficiently address category formation based on sensory-motor information and semiotic communication through the exchange of word sequences within a single integrated model. Our proposed model facilitates the emergence of lexical knowledge with combinatoriality by performing category formation using multimodal sensory-motor information and enabling semiotic communication through the exchange of word sequences among agents in a unified model. Furthermore, the model enables an agent to predict sensory-motor information for unobserved situations by combining words associated with categories in each modality. We conducted two experiments with two humanoid robots in a simulated environment to evaluate our proposed model. The results demonstrated that the agents can acquire lexical knowledge with combinatoriality through interpersonal cross-situational learning based on the Metropolis-Hastings naming game and cross-situational learning. Furthermore, our results indicate that the lexical knowledge developed using our proposed model exhibits generalization performance for novel situations through interpersonal cross-modal inference.
CL Sep 14, 2024
Constructive Approach to Bidirectional Influence between Qualia Structure and Language Emergence
Tadahiro Taniguchi, Masafumi Oizumi, Noburo Saji et al.
This perspective paper explores the bidirectional influence between language emergence and the relational structure of subjective experiences, termed qualia structure, and lays out a constructive approach to the intricate dependency between the two. We hypothesize that the emergence of languages with distributional semantics (e.g., syntactic-semantic structures) is linked to the coordination of internal representations shaped by experience, potentially facilitating more structured language through reciprocal influence. This hypothesized mutual dependency connects to recent advancements in AI and symbol emergence robotics, and is explored within this paper through theoretical frameworks such as the collective predictive coding. Computational studies show that neural network-based language models form systematically structured internal representations, and multimodal language models can share representations between language and perceptual information. This perspective suggests that language emergence serves not only as a mechanism creating a communication tool but also as a mechanism for allowing people to realize shared understanding of qualitative experiences. The paper discusses the implications of this bidirectional influence in the context of consciousness studies, linguistics, and cognitive science, and outlines future constructive research directions to further explore this dynamic relationship between language emergence and qualia structure.
CL Nov 8, 2023
Lewis's Signaling Game as beta-VAE For Natural Word Lengths and Segments
Ryo Ueda, Tadahiro Taniguchi
As a sub-discipline of evolutionary and computational linguistics, emergent communication (EC) studies communication protocols, called emergent languages, arising in simulations where agents communicate. A key goal of EC is to give rise to languages that share statistical properties with natural languages. In this paper, we reinterpret Lewis's signaling game, a frequently used setting in EC, as beta-VAE and reformulate its objective function as ELBO. Consequently, we clarify the existence of prior distributions of emergent languages and show that the choice of the priors can influence their statistical properties. Specifically, we address the properties of word lengths and segmentation, known as Zipf's law of abbreviation (ZLA) and Harris's articulation scheme (HAS), respectively. It has been reported that the emergent languages do not follow them when using the conventional objective. We experimentally demonstrate that by selecting an appropriate prior distribution, more natural segments emerge, while suggesting that the conventional one prevents the languages from following ZLA and HAS.
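For reference, the standard beta-VAE objective is written below in its usual generic form; it is not copied from the paper. Reading $x$ as the object to be communicated, $z$ as the message, $q_\phi$ as the sender, and $p_\theta$ as the receiver is an assumed mapping for illustration, while $p(z)$ corresponds to the prior distribution over emergent languages that the abstract refers to.

```latex
\mathcal{L}_{\beta}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```

With $\beta = 1$ this reduces to the usual ELBO. The KL term is where the choice of prior over messages enters the objective, which is the lever the abstract describes for shaping word lengths and segmentation.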
LG Sep 7, 2022
Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit
Gabriela Sejnova, Michal Vavrecka, Karla Stepanova et al.
Multimodal Variational Autoencoders (VAEs) have been the subject of intense research in the past years as they can integrate multiple modalities into a joint representation and can thus serve as a promising tool for both data classification and generation. Several approaches toward multimodal VAE learning have been proposed so far; however, their comparison and evaluation have been rather inconsistent. One reason is that the models differ at the implementation level. Another problem is that the datasets commonly used in these cases were not initially designed to evaluate multimodal generative models. This paper addresses both mentioned issues. First, we propose a toolkit for systematic multimodal VAE training and comparison. The toolkit currently comprises 4 existing multimodal VAEs and 6 commonly used benchmark datasets along with instructions on how to easily add a new model or a dataset. Second, we present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities across multiple difficulty levels. We demonstrate the utility of our dataset by comparing the implemented state-of-the-art models.
CV Feb 18
EasyControlEdge: A Foundation-Model Fine-Tuning for Edge Detection
Hiroki Nakamura, Hiroto Iino, Masashi Okada et al.
We propose EasyControlEdge, adapting an image-generation foundation model to edge detection. In real-world edge detection (e.g., floor-plan walls, satellite roads/buildings, and medical organ boundaries), crispness and data efficiency are crucial, yet producing crisp raw edge maps with limited training samples remains challenging. Although image-generation foundation models perform well on many downstream tasks, their pretrained priors for data-efficient transfer and iterative refinement for high-frequency detail preservation remain underexploited for edge detection. To enable crisp and data-efficient edge detection using these capabilities, we introduce an edge-specialized adaptation of image-generation foundation models. To better specialize the foundation model for edge detection, we incorporate an edge-oriented objective with an efficient pixel-space loss. At inference, we introduce guidance based on unconditional dynamics, enabling a single model to control the edge density through a guidance scale. Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa compare against state-of-the-art methods and show consistent gains, particularly under no-post-processing crispness evaluation and with limited training data.
CV May 12
Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning
Mikako Ochiai, Masatoshi Nagano, Tadahiro Taniguchi
Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
NEJan 29
MolLIBRA: Genetic Molecular Optimization with Multi-Fingerprint Surrogates and Text-Molecule Aligned CriticMasahi Okada, Kazuki Sakai, Hiroaki Yoshida et al.
We study sample-efficient molecular optimization under a limited budget of oracle evaluations. We propose MolLIBRA (MultimOdaLity and Language Integrated Bayesian and evolutionaRy optimizAtion), a genetic-algorithm-based framework that pre-ranks candidate molecules using multiple critics before oracle calls: (i) an ensemble of Gaussian process (GP) surrogates defined over multiple molecular fingerprints and (ii) a pretrained text-molecule aligned encoder, CLAMP. The GP ensemble enables adaptive selection of task-appropriate fingerprints, while CLAMP provides a zero-shot scoring signal from task descriptions by measuring the similarity between molecular and text embeddings. On the Practical Molecular Optimization (PMO) benchmark with a budget of 1,000 evaluations (PMO-1K), MolLIBRA-L, our variant with a language-model-based candidate generator, attains the best Top-10 AUC on 14/22 tasks and a higher overall sum of Top-10 AUC across tasks than prior methods.
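The zero-shot scoring idea in this abstract, ranking candidates by the similarity between molecular and text embeddings, can be illustrated with a minimal sketch. This is not the CLAMP API or the MolLIBRA implementation: the embeddings are made-up vectors, the function names `cosine_similarity` and `pre_rank` are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pre_rank(candidate_embeddings, text_embedding, top_k):
    """Return the names of the top_k candidates most similar to the task text.

    candidate_embeddings: dict mapping a candidate name to its embedding
    (hypothetical; in practice these would come from a pretrained encoder).
    """
    ranked = sorted(
        candidate_embeddings.items(),
        key=lambda item: cosine_similarity(item[1], text_embedding),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy example with invented 2-D embeddings.
candidates = {"mol_a": [1.0, 0.0], "mol_b": [0.0, 1.0], "mol_c": [1.0, 1.0]}
task_text = [1.0, 0.1]
shortlist = pre_rank(candidates, task_text, top_k=2)
```

Only the shortlisted candidates would then be passed to the expensive oracle, which is the point of pre-ranking under a fixed evaluation budget.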
MAMay 10
Emergent Communication for Co-constructed Emotion Between Embodied Agents via Collective Predictive CodingZehang Zhang, Nguyen Le Hoang, Tadahiro Taniguchi et al.
According to the theory of constructed emotion, the brain actively forms emotion categories by integrating multimodal bodily signals, and constructs emotional experiences by using these categories to predict and interpret sensory inputs. While research has advanced in modeling individual emotion construction, the social process of co-construction, that is, how a shared understanding of emotions emerges between individuals, remains computationally underexplored. This study investigates this process by modeling emergent communication between two embodied agents using the Metropolis-Hastings Naming Game (MHNG), grounded in the Collective Predictive Coding (CPC) framework. Our experiments, using visual, auditory, and simulated interoceptive inputs, yield two main findings. First, MHNG-based communication significantly improves the alignment, clarity, and inter-agent agreement of the learned emotion categories compared to non-communicative and non-selective baselines, with the alignment effect concentrated at the symbolic layer rather than the perceptual latent representation. Second, even when the two agents have systematically divergent interoceptive dynamics, communication still produces robust categorical alignment, with distinct, category-specific reshaping patterns of each agent's emotion categories, consistent with the constructed-emotion view that interoceptive heterogeneity is constitutive of, rather than an obstacle to, shared emotional meaning. These findings provide computational support for the co-constructionist view of emotion and extend the CPC framework from physical to socially-grounded domains.
AIDec 31, 2024
Generative Emergent Communication: Large Language Model is a Collective World ModelTadahiro Taniguchi, Ryo Ueda, Tomoaki Nakamura et al.
Large Language Models (LLMs) have demonstrated a remarkable ability to capture extensive world knowledge, yet how this is achieved without direct sensorimotor experience remains a fundamental puzzle. This study proposes a novel theoretical solution by introducing the Collective World Model hypothesis. We argue that an LLM does not learn a world model from scratch; instead, it learns a statistical approximation of a collective world model that is already implicitly encoded in human language through a society-wide process of embodied, interactive sense-making. To formalize this process, we introduce generative emergent communication (Generative EmCom), a framework built on the Collective Predictive Coding (CPC). This framework models the emergence of language as a process of decentralized Bayesian inference over the internal states of multiple agents. We argue that this process effectively creates an encoder-decoder structure at a societal scale: human society collectively encodes its grounded, internal representations into language, and an LLM subsequently decodes these symbols to reconstruct a latent space that mirrors the structure of the original collective representations. This perspective provides a principled, mathematical explanation for how LLMs acquire their capabilities. The main contributions of this paper are: 1) the formalization of the Generative EmCom framework, clarifying its connection to world models and multi-agent reinforcement learning, and 2) its application to interpret LLMs, explaining phenomena such as distributional semantics as a natural consequence of representation reconstruction. This work provides a unified theory that bridges individual cognitive development, collective language evolution, and the foundations of large-scale AI.
MAApr 4, 2025
Decentralized Collective World Model for Emergent Communication and CoordinationKentaro Nomura, Tatsuya Aoki, Tadahiro Taniguchi et al.
We propose a fully decentralized multi-agent world model that enables both symbol emergence for communication and coordinated behavior through temporal extension of collective predictive coding. Unlike previous research that focuses on either communication or coordination separately, our approach achieves both simultaneously. Our method integrates world models with communication channels, enabling agents to predict environmental dynamics, estimate states from partial observations, and share critical information through bidirectional message exchange with contrastive learning for message alignment. Using a two-agent trajectory drawing task, we demonstrate that our communication-based approach outperforms non-communicative models when agents have divergent perceptual capabilities, achieving the second-best coordination after centralized models. Importantly, our decentralized approach with constraints preventing direct access to other agents' internal states facilitates the emergence of more meaningful symbol systems that accurately reflect environmental states. These findings demonstrate the effectiveness of decentralized communication for supporting coordination while developing shared representations of the environment.
CLApr 13, 2025
Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian InferenceYuta Matsui, Ryosuke Yamaki, Ryo Ueda et al.
We propose the Metropolis-Hastings Captioning Game (MHCG), a method to fuse knowledge of multiple vision-language models (VLMs) by learning from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing VLMs' category-level vocabulary by observing the occurrence of the vocabulary in the generated captions.
AIMar 8, 2025
System 0/1/2/3: Quad-process theory for multi-timescale embodied collective cognitive systemsTadahiro Taniguchi, Yasushi Hirai, Masahiro Suzuki et al.
This paper introduces the System 0/1/2/3 framework as an extension of dual-process theory, employing a quad-process model of cognition. Expanding upon System 1 (fast, intuitive thinking) and System 2 (slow, deliberative thinking), we incorporate System 0, which represents pre-cognitive embodied processes, and System 3, which encompasses collective intelligence and symbol emergence. We contextualize this model within Bergson's philosophy by adopting multi-scale time theory to unify the diverse temporal dynamics of cognition. System 0 emphasizes morphological computation and passive dynamics, illustrating how physical embodiment enables adaptive behavior without explicit neural processing. Systems 1 and 2 are explained from a constructive perspective, incorporating neurodynamical and AI viewpoints. In System 3, we introduce collective predictive coding to explain how societal-level adaptation and symbol emergence operate over extended timescales. This comprehensive framework ranges from rapid embodied reactions to slow-evolving collective intelligence, offering a unified perspective on cognition across multiple timescales, levels of abstraction, and forms of human intelligence. The System 0/1/2/3 model provides a novel theoretical foundation for understanding the interplay between adaptive and cognitive processes, thereby opening new avenues for research in cognitive science, AI, robotics, and collective intelligence.
MAMay 28, 2025
Reward-Independent Messaging for Decentralized Multi-Agent Reinforcement LearningNaoto Yoshida, Tadahiro Taniguchi
In multi-agent reinforcement learning (MARL), effective communication improves agent performance, particularly under partial observability. We propose MARL-CPC, a framework that enables communication among fully decentralized, independent agents without parameter sharing. MARL-CPC incorporates a message learning model based on collective predictive coding (CPC) from emergent communication research. Unlike conventional methods that treat messages as part of the action space and assume cooperation, MARL-CPC links messages to state inference, supporting communication in non-cooperative, reward-independent settings. We introduce two algorithms, Bandit-CPC and IPPO-CPC, and evaluate them in non-cooperative MARL tasks. Benchmarks show that both outperform standard message-as-action approaches, establishing effective communication even when messages offer no direct benefit to the sender. These results highlight MARL-CPC's potential for enabling coordination in complex, decentralized environments.
CLOct 29, 2024
SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent CommunicationNguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei et al.
Emergent communication, driven by generative models, enables agents to develop a shared language for describing their individual views of the same objects through interactions. Meanwhile, self-supervised learning (SSL), particularly SimSiam, uses discriminative representation learning to make representations of augmented views of the same data point closer in the representation space. Building on the prior work of VI-SimSiam, which incorporates a generative and Bayesian perspective into the SimSiam framework via variational inference (VI) interpretation, we propose SimSiam+VAE, a unified approach for both representation learning and emergent communication. SimSiam+VAE integrates a variational autoencoder (VAE) into the predictor of the SimSiam network to enhance representation learning and capture uncertainty. Experimental results show that SimSiam+VAE outperforms both SimSiam and VI-SimSiam. We further extend this model into a communication framework called the SimSiam Naming Game (SSNG), which applies the generative and Bayesian approach based on VI to develop internal representations and emergent language, while utilizing the discriminative process of SimSiam to facilitate mutual understanding between agents. In experiments with established models, despite the dynamic alternation of agent roles during interactions, SSNG demonstrates comparable performance to the referential game and slightly outperforms the Metropolis-Hastings naming game.
ROAug 22, 2025
Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View InstructionsAkira Oyama, Shoichi Hasegawa, Akira Taniguchi et al.
Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as "Bring me that cup," even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), which is a multimodal exophora resolution framework leveraging sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and estimates candidate objects from a linguistic query with the user's skeletal data. SSL is utilized to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed results that were approximately 1.3 times better when the user was visible to the robot and 2.0 times better when the user was not visible to the robot, compared to the methods without SSL and interactive questioning. The project website is https://emergentsystemlabstudent.github.io/MIEL/.
NCAug 20, 2025
Beyond Individuals: Collective Predictive Coding for Memory, Attention, and the Emergence of LanguageTadahiro Taniguchi
This commentary extends the discussion by Parr et al. on memory and attention beyond individual cognitive systems. From the perspective of the Collective Predictive Coding (CPC) hypothesis, a framework for understanding these faculties and the emergence of language at the group level, we introduce a hypothetical idea: that language, with its embedded distributional semantics, serves as a collectively formed external representation. CPC generalises the concepts of individual memory and attention to the collective level. This offers a new perspective on how shared linguistic structures, which may embrace collective world models learned through next-word prediction, emerge from and shape group-level cognition.
ROSep 16, 2025
Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative ModelSaki Hashimoto, Shoichi Hasegawa, Tomochika Ishikawa et al.
Robots operating in domestic and office environments must understand object ownership to correctly execute instructions such as "Bring me my cup." However, ownership cannot be reliably inferred from visual features alone. To address this gap, we propose Active Ownership Learning (ActOwL), a framework that enables robots to actively generate and ask ownership-related questions to users. ActOwL employs a probabilistic generative model to select questions that maximize information gain, thereby acquiring ownership knowledge efficiently. Additionally, by leveraging commonsense knowledge from Large Language Models (LLMs), objects are pre-classified as either shared or owned, and only owned objects are targeted for questioning. Through experiments in a simulated home environment and a real-world laboratory setting, ActOwL achieved significantly higher ownership clustering accuracy with fewer questions than baseline methods. These findings demonstrate the effectiveness of combining active inference with LLM-guided commonsense reasoning, advancing the capability of robots to acquire ownership knowledge for practical and socially appropriate task execution.
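Selecting the question that maximizes information gain can be sketched under a strong simplifying assumption: if a user's answer reveals an object's owner with certainty, the expected information gain of asking about that object equals the entropy of the current belief over its owners. The sketch below is illustrative only and is not the ActOwL model; the belief tables and the names `entropy_bits` and `select_object_to_ask` are hypothetical.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


def select_object_to_ask(owner_beliefs):
    """Pick the object whose owner is currently most uncertain.

    owner_beliefs: dict mapping an object name to a list of probabilities,
    one per candidate owner. Assumes a definitive answer, so the expected
    information gain of a question equals the entropy of the belief.
    """
    return max(owner_beliefs, key=lambda obj: entropy_bits(owner_beliefs[obj]))


# Toy beliefs over two candidate owners (invented numbers).
beliefs = {
    "cup": [0.5, 0.5],     # maximally uncertain: 1 bit
    "book": [0.9, 0.1],    # fairly confident
    "laptop": [1.0, 0.0],  # already known: 0 bits
}
next_question = select_object_to_ask(beliefs)
```

With noisy or partial answers the expected gain would be smaller than the entropy and would need to be computed from the answer likelihoods instead.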
ROSep 16, 2025
Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language ModelsKento Murata, Shoichi Hasegawa, Tomochika Ishikawa et al.
It is crucial to efficiently execute instructions such as "Find an apple and a banana" or "Get ready for a field trip," which require searching for multiple objects or understanding context-dependent commands. This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge, specifically spatial concepts learned from the area designated to it by the user. We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots. We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random (28/50) and commonsense-based assignment (26/50). Furthermore, we conducted qualitative evaluations using two actual mobile manipulators. The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as "Get ready for a field trip," by successfully performing task decomposition, assignment, sequential planning, and execution.
HCJun 18, 2025
Co-Creative Learning via Metropolis-Hastings Interaction between Humans and AIRyota Okumura, Tadahiro Taniguchi, Akira Taniguchi et al.
We propose co-creative learning as a novel paradigm where humans and AI, i.e., biological and artificial agents, mutually integrate their partial perceptual information and knowledge to construct shared external representations, a process we interpret as symbol emergence. Unlike traditional AI teaching based on unilateral knowledge transfer, this addresses the challenge of integrating information from inherently different modalities. We empirically test this framework using a human-AI interaction model based on the Metropolis-Hastings naming game (MHNG), a decentralized Bayesian inference mechanism. In an online experiment, 69 participants played a joint attention naming game (JA-NG) with one of three computer agent types (MH-based, always-accept, or always-reject) under partial observability. Results show that human-AI pairs with an MH-based agent significantly improved categorization accuracy through interaction and achieved stronger convergence toward a shared sign system. Furthermore, human acceptance behavior aligned closely with the MH-derived acceptance probability. These findings provide the first empirical evidence for co-creative learning emerging in human-AI dyads via MHNG-based interaction. This suggests a promising path toward symbiotic AI systems that learn with humans, rather than from them, by dynamically aligning perceptual experiences, opening a new venue for symbiotic AI alignment.
ROApr 15, 2024
Real-world Instance-specific Image Goal Navigation: Bridging Domain Gaps via Contrastive LearningTaichi Sakaguchi, Akira Taniguchi, Yoshinobu Hagiwara et al.
Improving instance-specific image goal navigation (InstanceImageNav), which locates the identical object in a real-world environment from a query image, is essential for robotic systems to assist users in finding desired objects. The challenge lies in the domain gap between low-quality images observed by the moving robot, characterized by motion blur and low resolution, and high-quality query images provided by the user. Such domain gaps could significantly reduce the task success rate but have not been the focus of previous work. To address this, we propose a novel method called Few-shot Cross-quality Instance-aware Adaptation (CrossIA), which employs contrastive learning with an instance classifier to align features between a large number of low-quality images and a few high-quality images. This approach effectively reduces the domain gap by bringing the latent representations of cross-quality images closer on an instance basis. Additionally, the system integrates an object image collection with a pre-trained deblurring model to enhance the observed image quality. Our method fine-tunes the SimSiam model, pre-trained on ImageNet, using CrossIA. We evaluated our method's effectiveness through an InstanceImageNav task with 20 different types of instances, where the robot identifies the same instance in a real-world environment as a high-quality query image. Our experiments showed that our method improves the task success rate by up to three times compared to the baseline, a conventional approach based on SuperGlue. These findings highlight the potential of leveraging contrastive learning and image enhancement techniques to bridge the domain gap and improve object localization in robotic applications. The project website is https://emergentsystemlabstudent.github.io/DomainBridgingNav/.
CLMay 31, 2023
Metropolis-Hastings algorithm in joint-attention naming game: Experimental semiotics studyRyota Okumura, Tadahiro Taniguchi, Yosinobu Hagiwara et al.
In this study, we explore the emergence of symbols during interactions between individuals through an experimental semiotic study. Previous studies investigate how humans organize symbol systems through communication using artificially designed subjective experiments. In this study, we have focused on a joint attention-naming game (JA-NG) in which participants independently categorize objects and assign names while assuming their joint attention. In the theory of the Metropolis-Hastings naming game (MHNG), listeners accept provided names according to the acceptance probability computed using the Metropolis-Hastings (MH) algorithm. The theory of MHNG suggests that symbols emerge as an approximate decentralized Bayesian inference of signs, which is represented as a shared prior variable if the conditions of MHNG are satisfied. This study examines whether human participants exhibit behavior consistent with MHNG theory when playing JA-NG. By comparing human acceptance decisions of a partner's naming with acceptance probabilities computed in the MHNG, we tested whether human behavior is consistent with the MHNG theory. The main contributions of this study are twofold. First, we reject the null hypothesis that humans make acceptance judgments with a constant probability, regardless of the acceptance probability calculated by the MH algorithm. This result suggests that people followed the acceptance probability computed by the MH algorithm to some extent. Second, the MH-based model predicted human acceptance/rejection behavior more accurately than the other four models: Constant, Numerator, Subtraction, and Binary. This result indicates that symbol emergence in JA-NG can be explained using MHNG and is considered an approximate decentralized Bayesian inference.
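The listener's decision described in this abstract can be sketched in a few lines. The sketch assumes the acceptance probability takes the usual Metropolis-Hastings ratio form, min(1, P(observation | proposed name) / P(observation | current name)), evaluated with the listener's own model; that form, the function names, and the likelihood values are assumptions for illustration and are not taken from the paper's code.

```python
import random


def mh_acceptance_probability(lik_proposed, lik_current):
    """Acceptance probability min(1, lik_proposed / lik_current).

    lik_proposed: listener's likelihood of its observation under the
                  speaker's proposed name.
    lik_current:  listener's likelihood under the name it currently holds.
    If the current name has zero likelihood, any proposal is accepted
    (a convention chosen here to avoid division by zero).
    """
    if lik_current <= 0.0:
        return 1.0
    return min(1.0, lik_proposed / lik_current)


def listener_accepts(lik_proposed, lik_current, uniform_sample=random.random):
    """Accept the proposed name with the MH acceptance probability."""
    return uniform_sample() < mh_acceptance_probability(lik_proposed, lik_current)


# A proposal that explains the observation half as well as the current name
# is accepted about half of the time; a better proposal is always accepted.
p_half = mh_acceptance_probability(0.2, 0.4)
p_better = mh_acceptance_probability(0.9, 0.3)
```

Comparing such a computed probability with recorded human accept/reject decisions is the kind of analysis the abstract describes; the always-accept and always-reject agents mentioned elsewhere in this listing correspond to replacing the function with a constant 1 or 0.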
CLMay 31, 2023
Recursive Metropolis-Hastings Naming Game: Symbol Emergence in a Multi-agent System based on Probabilistic Generative ModelsJun Inukai, Tadahiro Taniguchi, Akira Taniguchi et al.
In the studies on symbol emergence and emergent communication in a population of agents, a computational model was employed in which agents participate in various language games. Among these, the Metropolis-Hastings naming game (MHNG) possesses a notable mathematical property: symbol emergence through MHNG is proven to be a decentralized Bayesian inference of representations shared by the agents. However, the previously proposed MHNG is limited to a two-agent scenario. This paper extends MHNG to an N-agent scenario. The main contributions of this paper are twofold: (1) we propose the recursive Metropolis-Hastings naming game (RMHNG) as an N-agent version of MHNG and demonstrate that RMHNG is an approximate Bayesian inference method for the posterior distribution over a latent variable shared by agents, similar to MHNG; and (2) we empirically evaluate the performance of RMHNG on synthetic and real image data, enabling multiple agents to develop and share a symbol system. Furthermore, we introduce two types of approximations (one-sample and limited-length) to reduce computational complexity while maintaining the ability to explain communication in a population of agents. The experimental findings showcased the efficacy of RMHNG as a decentralized Bayesian inference for approximating the posterior distribution concerning latent variables, which are jointly shared among agents, akin to MHNG. Moreover, the utilization of RMHNG elucidated the agents' capacity to exchange symbols. Furthermore, the study discovered that even the computationally simplified version of RMHNG could enable symbols to emerge among the agents.
AIJan 18, 2022
Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cuesAkira Taniguchi, Hiroaki Murakami, Ryo Ozaki et al.
Human infants acquire their verbal lexicon with minimal prior knowledge of language based on the statistical properties of phonological distributions and the co-occurrence of other sensory stimuli. This study proposes a novel fully unsupervised learning method for discovering speech units using phonological information as a distributional cue and object information as a co-occurrence cue. The proposed method can acquire words and phonemes from speech signals using unsupervised learning and utilize object information based on multiple modalities (vision, tactile, and auditory) simultaneously. The proposed method is based on the nonparametric Bayesian double articulation analyzer (NPB-DAA) discovering phonemes and words from phonological features, and multimodal latent Dirichlet allocation (MLDA) categorizing multimodal information obtained from objects. In an experiment, the proposed method showed higher word discovery performance than baseline methods. Words that expressed the characteristics of objects (i.e., words corresponding to nouns and adjectives) were segmented accurately. Furthermore, we examined how learning performance is affected by differences in the importance of linguistic information. Increasing the weight of the word modality further improved performance relative to that of the fixed condition.
AISep 15, 2021
Multiagent Multimodal Categorization for Symbol Emergence: Emergent Communication via Interpersonal Cross-modal InferenceYoshinobu Hagiwara, Kazuma Furukawa, Akira Taniguchi et al.
This paper describes a computational model of multiagent multimodal categorization that realizes emergent communication. We clarify whether the computational model can reproduce the following functions in a symbol emergence system, comprising two agents with different sensory modalities playing a naming game. (1) Function for forming a shared lexical system that comprises perceptual categories and corresponding signs, formed by agents through individual learning and semiotic communication between agents. (2) Function to improve the categorization accuracy in an agent via semiotic communication with another agent, even when some sensory modalities of each agent are missing. (3) Function that an agent infers unobserved sensory information based on a sign sampled from another agent in the same manner as cross-modal inference. We propose an interpersonal multimodal Dirichlet mixture (Inter-MDM), which is derived by dividing an integrative probabilistic generative model, which is obtained by integrating two Dirichlet mixtures (DMs). The Markov chain Monte Carlo algorithm realizes emergent communication. The experimental results demonstrated that Inter-MDM enables agents to form multimodal categories and appropriately share signs between agents. It is shown that emergent communication improves categorization accuracy, even when some sensory modalities are missing. Inter-MDM enables an agent to predict unobserved information based on a shared sign.
SDAug 10, 2021
StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech RecognitionShoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi et al.
Preserving the linguistic content of input speech is essential during voice conversion (VC). The star generative adversarial network-based VC method (StarGAN-VC) is a recently developed method that allows non-parallel many-to-many VC. Although this method is powerful, it can fail to preserve the linguistic content of input speech when the number of available training samples is extremely small. To overcome this problem, we propose the use of automatic speech recognition to assist model training, to improve StarGAN-VC, especially in low-resource scenarios. Experimental results show that using our proposed method, StarGAN-VC can retain more linguistic information than vanilla StarGAN-VC.
AIJun 16, 2021
Unsupervised Lexical Acquisition of Relative Spatial Concepts Using Spoken User UtterancesRikunari Sagara, Ryo Taguchi, Akira Taniguchi et al.
This paper proposes methods for unsupervised lexical acquisition for relative spatial concepts using spoken user utterances. A robot with a flexible spoken dialog system must be able to acquire linguistic representations and their meanings specific to an environment through interactions with humans, as children do. Specifically, relative spatial concepts (e.g., front and right) are widely used in our daily lives; however, it is not obvious which object is a reference object when a robot learns relative spatial concepts. Therefore, we propose methods by which a robot without prior knowledge of words can learn relative spatial concepts. The methods are formulated using a probabilistic model to estimate the proper reference objects and distributions representing concepts simultaneously. The experimental results show that relative spatial concepts and a phoneme sequence representing each concept can be learned under the condition that the robot does not know which located object is the reference object. Additionally, we show that two processes in the proposed method improve the estimation accuracy of the concepts: generating candidate word sequences by class n-gram and selecting word sequences using location information. Furthermore, we show that clues to reference objects improve accuracy even though the number of candidate reference objects increases.
SDApr 5, 2021
StarGAN-based Emotional Voice Conversion for Japanese PhrasesAsuka Moritani, Ryo Ozaki, Shoki Sakamoto et al.
This paper shows that StarGAN-VC, a spectral envelope transformation method for non-parallel many-to-many voice conversion (VC), is capable of emotional VC (EVC). Although StarGAN-VC has been shown to enable speaker identity conversion, its capability for EVC for Japanese phrases has not been clarified. In this paper, we describe the direct application of StarGAN-VC to an EVC task with minimal fundamental frequency and aperiodicity processing. Through subjective evaluation experiments, we evaluated the performance of our StarGAN-EVC system in terms of its ability to achieve EVC for Japanese phrases. The subjective evaluation is conducted in terms of subjective classification and mean opinion score of neutrality and similarity. In addition, the interdependence between the source and target emotional domains was investigated from the perspective of the quality of EVC.
ROMar 16, 2021
Map completion from partial observation using the global structure of multiple environmental mapsYuki Katsumata, Akinori Kanechika, Akira Taniguchi et al.
By using the spatial structure of various indoor environments as prior knowledge, a robot could construct maps more efficiently. Autonomous mobile robots generally apply simultaneous localization and mapping (SLAM) methods to understand the reachable area in newly visited environments. However, conventional mapping approaches are limited by only considering sensor observation and control signals to estimate the current environment map. This paper proposes a novel SLAM method, map completion network-based SLAM (MCN-SLAM), based on a probabilistic generative model incorporating deep neural networks for map completion. These map completion networks are primarily trained in the framework of generative adversarial networks (GANs) to extract the global structure of large amounts of existing map data. We show in experiments that the proposed method can estimate the environment map 1.3 times better than the previous SLAM methods in the situation of partial observation.
CLMar 15, 2021
Double Articulation Analyzer with Prosody for Unsupervised Word and Phoneme DiscoveryYasuaki Okuda, Ryo Ozaki, Tadahiro Taniguchi
Infants acquire words and phonemes from unsegmented speech signals using segmentation cues, such as distributional, prosodic, and co-occurrence cues. Many pre-existing computational models that represent the process tend to focus on distributional or prosodic cues. This paper proposes a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process-hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM, an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We conducted three experiments on different types of datasets and demonstrated the validity of the proposed method. The results show that the prosodic double articulation analyzer (Prosodic DAA) successfully uses prosodic cues and outperforms a method that solely uses distributional cues. The main contributions of this study are as follows: 1) We develop a probabilistic generative model for time series data including prosody that potentially has a double articulation structure; 2) We propose the Prosodic DAA by deriving the inference procedure for Prosodic HDP-HLM and show that Prosodic DAA can discover words directly from continuous human speech signals using statistical information and prosodic information in an unsupervised manner; 3) We show that prosodic cues contribute more to word segmentation when word frequencies are naturally distributed, i.e., when they follow Zipf's law.
AIMar 15, 2021
A Whole Brain Probabilistic Generative Model: Toward Realizing Cognitive Architectures for Developmental RobotsTadahiro Taniguchi, Hiroshi Yamakawa, Takayuki Nagai et al.
Building a humanlike integrative artificial cognitive system, that is, an artificial general intelligence (AGI), is the holy grail of the artificial intelligence (AI) field. Furthermore, a computational model that enables an artificial system to achieve cognitive development will be an excellent reference for brain and cognitive science. This paper describes an approach to develop a cognitive architecture by integrating elemental cognitive modules to enable the training of the modules as a whole. This approach is based on two ideas: (1) brain-inspired AI, learning human brain architecture to build human-level intelligence, and (2) a probabilistic generative model (PGM)-based cognitive system to develop a cognitive system for developmental robots by integrating PGMs. The development framework is called a whole brain PGM (WB-PGM), which differs fundamentally from existing cognitive architectures in that it can learn continuously through a system based on sensory-motor information. In this study, we describe the rationale of WB-PGM, the current status of PGM-based elemental cognitive modules, their relationship with the human brain, the approach to the integration of the cognitive modules, and future challenges. Our findings can serve as a reference for brain studies. As PGMs describe explicit informational relationships between variables, this description provides interpretable guidance from computational sciences to brain science. By providing such information, researchers in neuroscience can provide feedback to researchers in AI and robotics on what the current models lack with reference to the brain. Further, it can facilitate collaboration among researchers in neuro-cognitive sciences as well as AI and robotics.
ROMar 11, 2021
Hierarchical Bayesian Model for the Transfer of Knowledge on Spatial Concepts based on Multimodal InformationYoshinobu Hagiwara, Keishiro Taguchi, Satoshi Ishibushi et al.
This paper proposes a hierarchical Bayesian model based on spatial concepts that enables a robot to transfer the knowledge of places from experienced environments to a new environment. The transfer of knowledge based on spatial concepts is modeled as the calculation process of the posterior distribution based on the observations obtained in each environment with the parameters of spatial concepts generalized to environments as prior knowledge. We conducted experiments to evaluate the generalization performance of spatial knowledge for general places such as kitchens and the adaptive performance of spatial knowledge for unique places such as 'Emma's room' in a new environment. In the experiments, the accuracies of the proposed method and conventional methods were compared in the prediction task of location names from an image and a position, and the prediction task of positions from a location name. The experimental results demonstrated that the proposed method has a higher prediction accuracy of location names and positions than the conventional method owing to the transfer of knowledge.
HCAug 23, 2020
Visual Exploration System for Analyzing Trends in Annual Recruitment Using Time-varying GraphsToshiyuki T. Yokoyama, Masashi Okada, Tadahiro Taniguchi
Annual recruitment data of new graduates are manually analyzed by human resources (HR) specialists in industries, which signifies the need to evaluate the recruitment strategy of HR specialists. Every year, different applicants send in job applications to companies. The relationships between applicants' attributes (e.g., English skill or academic credential) can be used to analyze the changes in recruitment trends across multiple years' data. However, most attributes are unnormalized and thus require thorough preprocessing. Such unnormalized data hinder the effective comparison of the relationship between applicants in the early stage of data analysis. Thus, a visual exploration system is highly needed to gain insight from the overview of the relationship between applicants across multiple years. In this study, we propose the Polarizing Attributes for Network Analysis of Correlation on Entities Association (Panacea) visualization system. The proposed system integrates a time-varying graph model and dynamic graph visualization for heterogeneous tabular data. Using this system, HR specialists can interactively inspect the relationships between two attributes of prospective employees across multiple years. Further, we demonstrate the usability of Panacea with representative examples for finding hidden trends in real-world datasets and then describe HR specialists' feedback obtained throughout Panacea's development. The proposed Panacea system enables HR specialists to visually explore the annual recruitment of new graduates.
LGJul 29, 2020
Dreaming: Model-based Reinforcement Learning by Latent Imagination without ReconstructionMasashi Okada, Tadahiro Taniguchi
In the present paper, we propose a decoder-free extension of Dreamer, a leading model-based reinforcement learning (MBRL) method from pixels. Dreamer is a sample- and cost-efficient solution to robot learning, as it is used to train latent state-space models based on a variational autoencoder and to conduct policy optimization by latent trajectory imagination. However, this autoencoding-based approach often causes object vanishing, in which the autoencoder fails to perceive key objects for solving control tasks, thus significantly limiting Dreamer's potential. This work aims to relieve this bottleneck of Dreamer and enhance its performance by removing the decoder. For this purpose, we first derive a likelihood-free and InfoMax objective of contrastive learning from the evidence lower bound of Dreamer. Second, we incorporate two components, (i) independent linear dynamics and (ii) random crop data augmentation, into the learning scheme so as to improve the training performance. In comparison to Dreamer and other recent model-free reinforcement learning methods, our newly devised Dreamer with InfoMax and without generative decoder (Dreaming) achieves the best scores on 5 difficult simulated robotics tasks, in which Dreamer suffers from object vanishing.
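As a rough illustration of what a likelihood-free contrastive objective can look like, the sketch below computes an InfoNCE-style loss in plain Python. The abstract does not give the exact form of the objective, so the choice of InfoNCE, the dot-product similarity, and the names `info_nce`, `queries`, and `keys` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, keys):
    """Mean InfoNCE loss, assuming queries[i] and keys[i] form the positive pair.

    All other keys in the batch act as negatives. No decoder or pixel
    likelihood is involved; only embedding similarities are compared.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) for k in keys]
        # Numerically stable log-sum-exp over all candidate keys.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(queries)


# Toy embeddings (made up): each vector is its own positive in the aligned case.
z = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce(z, z)
# Rotating the keys breaks the pairing, so the loss should increase.
shuffled = info_nce(z, z[1:] + z[:1])
```

The loss is low when each query is most similar to its own key and rises when the pairing is broken, which is the signal a decoder-free model would be trained on under this assumption.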
CRJul 17, 2020
Graph Convolutional Network-based Suspicious Communication Pair Estimation for Industrial Control SystemsTatsumi Oba, Tadahiro Taniguchi
Whitelisting is considered an effective security monitoring method for networks used in industrial control systems, where the whitelists consist of observed tuples of the IP address of the server, the TCP/UDP port number, and the IP address of the client (communication triplets). However, this method causes frequent false detections. To reduce false positives due to a simple whitelist-based judgment, we propose a new framework for scoring communications to judge whether the communications not present in whitelists are normal or anomalous. To solve this problem, we developed a graph convolutional network-based suspicious communication pair estimation using relational graph convolution networks, and evaluated its performance. For this, we collected the network traffic of three factories owned by Panasonic Corporation, Japan. The proposed method achieved a receiver operating characteristic area under the curve of 0.957, which outperforms baseline approaches such as DistMult, a method that directly optimizes the node embeddings, and heuristics, which score the triplets using first- and second-order proximities of multigraphs. This method enables security operators to concentrate on significant alerts.
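The simple whitelist-based judgment that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration only: the `Triplet` type, the helper names, and the IP addresses are hypothetical, and the sketch covers the baseline check, not the proposed graph convolutional scoring.

```python
from typing import Iterable, NamedTuple, Set


class Triplet(NamedTuple):
    """A communication triplet: server IP, TCP/UDP port, client IP."""
    server_ip: str
    port: int
    client_ip: str


def build_whitelist(observed: Iterable[Triplet]) -> Set[Triplet]:
    """Collect every triplet seen during an assumed learning period."""
    return set(observed)


def is_alert(whitelist: Set[Triplet], triplet: Triplet) -> bool:
    """Plain whitelist judgment: any triplet not seen before raises an alert."""
    return triplet not in whitelist


# Illustrative traffic only; the addresses are made up.
whitelist = build_whitelist([
    Triplet("10.0.0.1", 502, "10.0.0.20"),
    Triplet("10.0.0.1", 502, "10.0.0.21"),
])

known = Triplet("10.0.0.1", 502, "10.0.0.20")
# A new client contacting the same server and port is flagged even if benign.
unseen = Triplet("10.0.0.1", 502, "10.0.0.22")
```

Because the check is purely binary, every unseen triplet is treated alike, which is one way to see why a graded score for non-whitelisted communications can help operators prioritize alerts.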
LGMar 1, 2020
PlaNet of the Bayesians: Reconsidering and Improving Deep Planning Network by Incorporating Bayesian InferenceMasashi Okada, Norio Kosaka, Tadahiro Taniguchi
In the present paper, we propose an extension of the Deep Planning Network (PlaNet), also referred to as PlaNet of the Bayesians (PlaNet-Bayes). There has been a growing demand for model predictive control (MPC) in partially observable environments in which complete information is unavailable because of, for example, lack of expensive sensors. PlaNet is a promising solution to realize such latent MPC, as it is used to train state-space models via model-based reinforcement learning (MBRL) and to conduct planning in the latent space. However, recent state-of-the-art strategies mentioned in MBRL literature, such as incorporating uncertainty into training and planning, have not been considered, significantly suppressing the training performance. The proposed extension is to make PlaNet uncertainty-aware on the basis of Bayesian inference, in which both model and action uncertainty are incorporated. Uncertainty in latent models is represented using a neural network ensemble to approximately infer model posteriors. The ensemble of optimal action candidates is also employed to capture multimodal uncertainty in the optimality. The concept of the action ensemble relies on a general variational inference MPC (VI-MPC) framework and its instance, probabilistic action ensemble with trajectory sampling (PaETS). In this paper, we extend VI-MPC and PaETS, which were originally introduced in previous literature, to address partially observable cases. We experimentally compare the performances on continuous control tasks, and conclude that our method can consistently improve the asymptotic performance compared with PlaNet.