91.9ITApr 13
Prior-Guided Movable Antenna Control for Agile Multi-Path Sensing (extended version)Jaehong Kim, Jihong Park, Changsheng You et al.
Multi-path sensing, which aims to extract the geometric attributes of multiple propagation paths, is expected to be a key functionality of 6G. A movable antenna (MA) can enable this functionality by creating a synthetic aperture through sequential mechanical motion. However, existing MA-based sensing methods typically rely on exhaustive scanning over the entire movable plate, resulting in significant control overhead and sensing latency, which limits their practicality for agile sensing. To address this challenge, this paper develops a prior-guided agile multi-path sensing framework that leverages weak prior angle-of-arrival (AoA) statistics as side information. The proposed framework comprises two steps. First, the movable plate's three-dimensional orientation is optimized only once to maximize path visibility while preserving path discriminability, both induced from Fisher information analysis. Second, only two predetermined linear MA scans are made on the tilted plate to estimate the elevation and azimuth AoAs from the resulting sequence of received signals. By incorporating the prior AoA statistics, a maximum a posteriori (MAP)-based AoA estimation algorithm is developed. With only one orientation control and two linear scans, the proposed framework enables agile multi-path sensing with significantly reduced control overhead and latency, while achieving AoA estimation accuracy approaching that of the single-path benchmark.
ROFeb 11, 2025Code
Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually ImpairedByungOk Han, Woo-han Yun, Beom-Su Seo et al.
Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, which is crucial for navigation in complex environments such as street crossings. We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully open-sourced the SAIT dataset and SA-Bench, along with the related code, at https://github.com/byungokhan/Space-awareVLM
HCAug 10, 2021Code
SGToolkit: An Interactive Gesture Authoring Toolkit for Embodied Conversational AgentsYoungwoo Yoon, Keunwoo Park, Minsu Jang et al.
Non-verbal behavior is essential for embodied agents like social robots, virtual avatars, and digital humans. Existing behavior authoring approaches including keyframe animation and motion capture are too expensive to use when there are numerous utterances requiring gestures. Automatic generation methods show promising results, but their output quality is not satisfactory yet, and it is hard to modify outputs as a gesture designer wants. We introduce a new gesture generation toolkit, named SGToolkit, which gives a higher quality output than automatic methods and is more efficient than manual authoring. For the toolkit, we propose a neural generative model that synthesizes gestures from speech and accommodates fine-level pose controls and coarse-level style controls from users. The user study with 24 participants showed that the toolkit is favorable over manual authoring, and the generated gestures were also human-like and appropriate to input speech. The SGToolkit is platform agnostic, and the code is available at https://github.com/ai4r/SGToolkit.
GRSep 4, 2020Code
Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker IdentityYoungwoo Yoon, Bok Cha, Joo-Haeng Lee et al.
For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human--agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match with speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models. We further confirm that our model is able to work with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All the code and data is available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
ROSep 4, 2020Code
AIR-Act2Act: Human-human interaction dataset for teaching non-verbal social behaviors to robotsWoo-Ri Ko, Minsu Jang, Jaeyeon Lee et al.
To better interact with users, a social robot should understand the users' behavior, infer the intention, and respond appropriately. Machine learning is one way of implementing robot intelligence. It provides the ability to automatically learn and improve from experience instead of explicitly telling the robot what to do. Social skills can also be learned through watching human-human interaction videos. However, human-human interaction datasets are relatively scarce to learn interactions that occur in various situations. Moreover, we aim to use service robots in the elderly-care domain; however, there has been no interaction dataset collected for this domain. For this reason, we introduce a human-human interaction dataset for teaching non-verbal social behaviors to robots. It is the only interaction dataset that elderly people have participated in as performers. We recruited 100 elderly people and two college students to perform 10 interactions in an indoor environment. The entire dataset has 5,000 interaction samples, each of which contains depth maps, body indexes and 3D skeletal data that are captured with three Microsoft Kinect v2 cameras. In addition, we provide the joint angles of a humanoid NAO robot which are converted from the human behavior that robots need to learn. The dataset and useful python scripts are available for download at https://github.com/ai4r/AIR-Act2Act. It can be used to not only teach social skills to robots but also benchmark action recognition algorithms.
AINov 4, 2025
ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task PlanningJae-Woo Choi, Hyungmin Kim, Hyobin Ong et al.
Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods still struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations, attempting to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into more manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct's 31%.
CYJan 29
Moral Outrage Shapes Commitments Beyond Attention: Multimodal Moral Emotions on YouTube in Korea and the USSeongchan Park, Jaehong Kim, Hyeonseung Kim et al.
Understanding how media rhetoric shapes audience engagement is crucial in the attention economy. This study examines how moral emotional framing by mainstream news channels on YouTube influences user behavior across Korea and the United States. To capture the platform's multimodal nature, combining thumbnail images and video titles, we develop a multimodal moral emotion classifier by fine tuning a vision language model. The model is trained on human annotated multimodal datasets in both languages and applied to approximately 400,000 videos from major news outlets. We analyze engagement levels including views, likes, and comments, representing increasing degrees of commitment. The results show that other condemning rhetoric expressions of moral outrage that criticize others morally consistently increase all forms of engagement across cultures, with effects ranging from passive viewing to active commenting. These findings suggest that moral outrage is a particularly effective emotional strategy, attracting not only attention but also active participation. We discuss concerns about the potential misuse of other condemning rhetoric, as such practices may deepen polarization by reinforcing in group and out group divisions. To facilitate future research and ensure reproducibility, we publicly release our Korean and English multimodal moral emotion classifiers.
AIFeb 13, 2024
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied AgentsJae-Woo Choi, Youngwoo Yoon, Hyobin Ong et al.
Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning. However, comparing the performance of language-oriented task planners becomes difficult, and there exists a dearth of detailed exploration regarding the effects of various factors such as pre-trained model selection and prompt construction. To address this, we propose a benchmark system for automatically quantifying performance of task planning for home-service embodied agents. Task planners are tested on two pairs of datasets and simulators: 1) ALFRED and AI2-THOR, 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several enhancements of the baseline planner. We expect that the proposed benchmark tool would accelerate the development of language-oriented task planners.
ROOct 21, 2024
A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLMByungOk Han, Jaehong Kim, Jinhyeok Jang
Vision-Language-Action (VLA) models are receiving increasing attention for their ability to enable robots to perform complex tasks by integrating visual context with linguistic commands. However, achieving efficient real-time performance remains challenging due to the high computational demands of existing models. To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory. DP-VLA utilizes a Large System 2 Model (L-Sys2) for complex reasoning and decision-making, while a Small System 1 Model (S-Sys1) handles real-time motor control and sensory processing. By leveraging Vision-Language Models (VLMs), the L-Sys2 operates at low frequencies, reducing computational overhead, while the S-Sys1 ensures fast and accurate task execution. Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates, providing a scalable solution for advanced robotic applications.
45.0CLApr 23
Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model DecisionsJiseon Kim, Jea Kwon, Luiz Felipe Vecchietti et al.
Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.
ROFeb 17, 2025
Learning Dexterous Bimanual Catch Skills through Adversarial-Cooperative Heterogeneous-Agent Reinforcement LearningTaewoo Kim, Youngwoo Yoon, Jaehong Kim
Robotic catching has traditionally focused on single-handed systems, which are limited in their ability to handle larger or more complex objects. In contrast, bimanual catching offers significant potential for improved dexterity and object handling but introduces new challenges in coordination and control. In this paper, we propose a novel framework for learning dexterous bimanual catching skills using Heterogeneous-Agent Reinforcement Learning (HARL). Our approach introduces an adversarial reward scheme, where a throw agent increases the difficulty of throws-adjusting speed-while a catch agent learns to coordinate both hands to catch objects under these evolving conditions. We evaluate the framework in simulated environments using 15 different objects, demonstrating robustness and versatility in handling diverse objects. Our method achieved approximately a 2x increase in catching reward compared to single-agent baselines across 15 diverse objects.
LGAug 7, 2025
Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of ForgettingJinhyeok Jang, Jaehong Kim, Jung Uk Kim
Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce \textbf{KNowledge Overflowed Weights (KNOW)} prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our \textbf{KNowledge Overflowed Weights Nowcaster (KNOWN)} acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms Naïve fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer in deep learning.
AIFeb 20, 2025
A novel approach to the relationships between data features -- based on comprehensive examination of mathematical, technological, and causal methodologyJaeHong Kim
The expansion of artificial intelligence (AI) has raised concerns about transparency, accountability, and interpretability, with counterfactual reasoning emerging as a key approach to addressing these issues. However, current mathematical, technological, and causal methodologies rely on externalization techniques that normalize feature relationships within a single coordinate space, often distorting intrinsic interactions. This study proposes the Convergent Fusion Paradigm (CFP) theory, a framework integrating mathematical, technological, and causal perspectives to provide a more precise and comprehensive analysis of feature relationships. CFP theory introduces Hilbert space and backward causation to reinterpret the feature relationships as emergent structures, offering a potential solution to the common cause problem -- a fundamental challenge in causal modeling. From a mathematical -- technical perspective, it utilizes a Riemannian manifold-based framework, thereby improving the structural representation of high- and low-dimensional data interactions. From a causal inference perspective, CFP theory adopts abduction as a methodological foundation, employing Hilbert space for a dynamic causal reasoning approach, where causal relationships are inferred abductively, and feature relationships evolve as emergent properties. Ultimately, CFP theory introduces a novel AI modeling methodology that integrates Hilbert space, backward causation, and Riemannian geometry, strengthening AI governance and transparency in counterfactual reasoning.
LGFeb 14, 2025
A novel approach to data generation in generative modelJaeHong Kim, Jaewon Shim
Variational Autoencoders (VAEs) and other generative models are widely employed in artificial intelligence to synthesize new data. However, current approaches rely on Euclidean geometric assumptions and statistical approximations that fail to capture the structured and emergent nature of data generation. This paper introduces the Convergent Fusion Paradigm (CFP) theory, a novel geometric framework that redefines data generation by integrating dimensional expansion accompanied by qualitative transformation. By modifying the latent space geometry to interact with emergent high-dimensional structures, CFP theory addresses key challenges such as identifiability issues and unintended artifacts like hallucinations in Large Language Models (LLMs). CFP theory is based on two key conceptual hypotheses that redefine how generative models structure relationships between data and algorithms. Through the lens of CFP theory, we critically examine existing metric-learning approaches. CFP theory advances this perspective by introducing time-reversed metric embeddings and structural convergence mechanisms, leading to a novel geometric approach that better accounts for data generation as a structured epistemic process. Beyond its computational implications, CFP theory provides philosophical insights into the ontological underpinnings of data generation. By offering a systematic framework for high-dimensional learning dynamics, CFP theory contributes to establishing a theoretical foundation for understanding the data-relationship structures in AI. Finally, future research in CFP theory will be led to its implications for fully realizing qualitative transformations, introducing the potential of Hilbert space in generative modeling.
ASJan 20, 2021
VOTE400(Voide Of The Elderly 400 Hours): A Speech Dataset to Study Voice Interface for Elderly-CareMinsu Jang, Sangwon Seo, Dohyung Kim et al.
This paper introduces a large-scale Korean speech dataset, called VOTE400, that can be used for analyzing and recognizing voices of the elderly people. The dataset includes about 300 hours of continuous dialog speech and 100 hours of read speech, both recorded by the elderly people aged 65 years or over. A preliminary experiment showed that speech recognition system trained with VOTE400 can outperform conventional systems in speech recognition of elderly people's voice. This work is a multi-organizational effort led by ETRI and MINDs Lab Inc. for the purpose of advancing the speech recognition performance of the elderly-care robots.
ROMar 4, 2020
ETRI-Activity3D: A Large-Scale RGB-D Dataset for Robots to Recognize Daily Activities of the ElderlyJinhyeok Jang, Dohyung Kim, Cheonshu Park et al.
Deep learning, based on which many modern algorithms operate, is well known to be data-hungry. In particular, the datasets appropriate for the intended application are difficult to obtain. To cope with this situation, we introduce a new dataset called ETRI-Activity3D, focusing on the daily activities of the elderly in robot-view. The major characteristics of the new dataset are as follows: 1) practical action categories that are selected from the close observation of the daily lives of the elderly; 2) realistic data collection, which reflects the robot's working environment and service situations; and 3) a large-scale dataset that overcomes the limitations of the current 3D activity analysis benchmark datasets. The proposed dataset contains 112,620 samples including RGB videos, depth maps, and skeleton sequences. During the data acquisition, 100 subjects were asked to perform 55 daily activities. Additionally, we propose a novel network called four-stream adaptive CNN (FSA-CNN). The proposed FSA-CNN has three main properties: robustness to spatio-temporal variations, input-adaptive activation function, and extension of the conventional two-stream approach. In the experiment section, we confirmed the superiority of the proposed FSA-CNN using NTU RGB+D and ETRI-Activity3D. Further, the domain difference between both groups of age was verified experimentally. Finally, the extension of FSA-CNN to deal with the multimodal data was investigated.
ROSep 26, 2019
Cut-and-Paste Dataset Generation for Balancing Domain Gaps in Object Instance DetectionWoo-han Yun, Taewoo Kim, Jaeyeon Lee et al.
Training an object instance detector where only a few training object images are available is a challenging task. One solution is a cut-and-paste method that generates a training dataset by cutting object areas out of training images and pasting them onto other background images. A detector trained on a dataset generated with a cut-and-paste method suffers from the conventional domain shift problem, which stems from a discrepancy between the source domain (generated training dataset) and the target domain (real test dataset). Though state-of-the-art domain adaptation methods are able to reduce this gap, it is limited because they do not consider the difference of domain gaps of foreground and background. In this study, we present that the conventional domain gap can be divided into two sub-domain gaps for foreground and background. Then, we show that the original cut-and-paste approach suffers from a new domain gap problem, an unbalanced domain gaps, because it has two separate source domains for foreground and background, unlike the conventional domain shift problem. Then, we introduce an advanced cut-and-paste method to balance the unbalanced domain gaps by diversifying the foreground with GAN (generative adversarial network)-generated seed images and simplifying the background using image processing techniques. Experimental results show that our method is effective for balancing domain gaps and improving the accuracy of object instance detection in a cluttered indoor environment using only a few seed images. Furthermore, we show that balancing domain gaps can improve the detection accuracy of state-of-the-art domain adaptation methods.
CVNov 21, 2018
Neural Networks with Activation NetworksJinhyeok Jang, Jaehong Kim, Jaeyeon Lee et al.
This work presents an adaptive activation method for neural networks that exploits the interdependency of features. Each pixel, node, and layer is assigned with a polynomial activation function, whose coefficients are provided by an auxiliary activation network. The activation of a feature depends on the features of neighboring pixels in a convolutional layer and other nodes in a dense layer. The dependency is learned from data by the activation networks. In our experiments, networks with activation networks provide significant performance improvement compared to the baseline networks on which they are built. The proposed method can be used to improve the network performance as an alternative to increasing the number of nodes and layers.
ROOct 30, 2018
Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid RobotsYoungwoo Yoon, Woo-Ri Ko, Minsu Jang et al.
Co-speech gestures enhance interaction experiences between humans as well as between humans and robots. Existing robots use rule-based speech-gesture association, but this requires human labor and prior knowledge of experts to be implemented. We present a learning-based co-speech gesture generation that is learned from 52 h of TED talks. The proposed end-to-end neural network model consists of an encoder for speech text understanding and a decoder to generate a sequence of gestures. The model successfully produces various gestures including iconic, metaphoric, deictic, and beat gestures. In a subjective evaluation, participants reported that the gestures were human-like and matched the speech content. We also demonstrate a co-speech gesture with a NAO robot working in real time.
LGSep 11, 2018
Deep Asymmetric Networks with a Set of Node-wise Variant Activation FunctionsJinhyeok Jang, Hyunjoong Cho, Jaehong Kim et al.
This work presents deep asymmetric networks with a set of node-wise variant activation functions. The nodes' sensitivities are affected by activation function selections such that the nodes with smaller indices become increasingly more sensitive. As a result, features learned by the nodes are sorted by the node indices in the order of their importance. Asymmetric networks not only learn input features but also the importance of those features. Nodes of lesser importance in asymmetric networks can be pruned to reduce the complexity of the networks, and the pruned networks can be retrained without incurring performance losses. We validate the feature-sorting property using both shallow and deep asymmetric networks as well as deep asymmetric networks transferred from famous networks.
LGJun 20, 2018
Doubly Nested Network for Resource-Efficient InferenceJaehong Kim, Sungeun Hong, Yongseok Choi et al.
We propose doubly nested network(DNNet) where all neurons represent their own sub-models that solve the same task. Every sub-model is nested both layer-wise and channel-wise. While nesting sub-models layer-wise is straight-forward with deep-supervision as proposed in \cite{xie2015holistically}, channel-wise nesting has not been explored in the literature to our best knowledge. Channel-wise nesting is non-trivial as neurons between consecutive layers are all connected to each other. In this work, we introduce a technique to solve this problem by sorting channels topologically and connecting neurons accordingly. For the purpose, channel-causal convolutions are used. Slicing doubly nested network gives a working sub-network. The most notable application of our proposed network structure with slicing operation is resource-efficient inference. At test time, computing resources such as time and memory available for running the prediction algorithm can significantly vary across devices and applications. Given a budget constraint, we can slice the network accordingly and use a sub-model for inference within budget, requiring no additional computation such as training or fine-tuning after deployment. We demonstrate the effectiveness of our approach in several practical scenarios of utilizing available resource efficiently.
LGJun 11, 2018
Auto-Meta: Automated Gradient Based Meta Learner SearchJaehong Kim, Sangyeul Lee, Sungwan Kim et al.
Fully automating machine learning pipelines is one of the key challenges of current artificial intelligence research, since practical machine learning often requires costly and time-consuming human-powered processes such as model design, algorithm development, and hyperparameter tuning. In this paper, we verify that automated architecture search synergizes with the effect of gradient-based meta learning. We adopt the progressive neural architecture search \cite{liu:pnas_google:DBLP:journals/corr/abs-1712-00559} to find optimal architectures for meta-learners. The gradient based meta-learner whose architecture was automatically found achieved state-of-the-art results on the 5-shot 5-way Mini-ImageNet classification problem with $74.65\%$ accuracy, which is $11.54\%$ improvement over the result obtained by the first gradient-based meta-learner called MAML \cite{finn:maml:DBLP:conf/icml/FinnAL17}. To our best knowledge, this work is the first successful neural architecture search implementation in the context of meta learning.
AIMay 24, 2017
Continual Learning with Deep Generative ReplayHanul Shin, Jung Kwon Lee, Jaehong Kim et al.
Attempts to train a comprehensive artificial intelligence capable of solving multiple tasks have been impeded by a chronic problem called catastrophic forgetting. Although simply replaying all previous data alleviates the problem, it requires large memory and even worse, often infeasible in real world applications where the access to past data is limited. Inspired by the generative nature of hippocampus as a short-term memory system in primate brain, we propose the Deep Generative Replay, a novel framework with a cooperative dual model architecture consisting of a deep generative model ("generator") and a task solving model ("solver"). With only these two models, training data for previous tasks can easily be sampled and interleaved with those for a new task. We test our methods in several sequential learning settings involving image classification tasks.