Yu-Kai Wang

LG
h-index 23
19 papers
265 citations
Novelty 54%
AI Score 51

19 Papers

AS · Mar 1, 2023
SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks

Kai-Wei Chang, Yu-Kai Wang, Hua Shen et al. · meta-ai, mit

Prompt tuning is a technique that tunes a small set of parameters to steer a pre-trained language model (LM) to directly generate the output for downstream tasks. Recently, prompt tuning has demonstrated its storage and computation efficiency in both natural language processing (NLP) and speech processing fields. These advantages have also revealed prompt tuning as a candidate approach to serving a pre-trained LM for multiple tasks in a unified manner. For speech processing, SpeechPrompt shows high parameter efficiency and competitive performance on a few speech classification tasks. However, whether SpeechPrompt is capable of serving a large number of tasks is unanswered. In this work, we propose SpeechPrompt v2, a prompt tuning framework capable of performing a wide variety of speech classification tasks, covering multiple languages and prosody-related tasks. The experimental results show that SpeechPrompt v2 achieves performance on par with prior works with fewer than 0.15M trainable parameters in a unified framework.

AS · Aug 23, 2024
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks

Kai-Wei Chang, Haibin Wu, Yu-Kai Wang et al. · meta-ai, mit

Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneering research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experimental results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, as more advanced speech LMs emerge, the proposed prompting framework shows great potential.

SP · Oct 1, 2022
Cross Task Neural Architecture Search for EEG Signal Classifications

Yiqun Duan, Zhen Wang, Yi Li et al.

Electroencephalograms (EEGs) are brain dynamics measured outside the brain, which have been widely utilized in non-invasive brain-computer interface applications. Recently, various neural network approaches have been proposed to improve the accuracy of EEG signal recognition. However, these approaches rely heavily on manually designed network structures for different tasks, which generally do not share the same empirical design across tasks. In this paper, we propose a cross-task neural architecture search (CTNAS-EEG) framework for EEG signal recognition, which can automatically design the network structure across tasks and improve the recognition accuracy of EEG signals. Specifically, a compatible search space for cross-task searching and an efficient constrained searching method are proposed to overcome challenges brought by EEG signals. By unifying structure search on different EEG tasks, this work is the first to explore and analyze the differences in searched structures across tasks. Moreover, by introducing architecture search, this work is the first to analyze model performance by customizing model structure for each human subject. Detailed experimental results suggest that the proposed CTNAS-EEG could reach state-of-the-art performance on different EEG tasks, such as Motor Imagery (MI) and Emotion recognition. Extensive experiments and detailed analysis are provided as a good reference for follow-up researchers.

SP · Aug 28, 2024
BELT-2: Bootstrapping EEG-to-Language representation alignment for multi-task brain decoding

Jinzhao Zhou, Yiqun Duan, Fred Chang et al.

The remarkable success of large language models (LLMs) across various multi-modality applications is well established. However, integrating large language models with humans, or brain dynamics, remains relatively unexplored. In this paper, we introduce BELT-2, a pioneering multi-task model designed to enhance both encoding and decoding performance from EEG signals. To bolster the quality of the EEG encoder, BELT-2 is the first work to innovatively 1) adopt byte-pair encoding (BPE)-level EEG-language alignment and 2) integrate multi-task training and decoding in the EEG domain. Inspired by the idea of Bridging the Brain with GPT, we further connect the multi-task EEG encoder with LLMs by utilizing prefix-tuning on intermediary output from the EEG encoder. These innovative efforts make BELT-2 a pioneering breakthrough, making it the first work in the field capable of decoding coherent and readable sentences from non-invasive brain signals. Our experiments highlight significant advancements over prior techniques in both quantitative and qualitative measures, achieving a decoding performance with a BLEU-1 score of 52.2% on the ZuCo dataset. Furthermore, BELT-2 shows a remarkable improvement ranging from 31% to 162% on other translation benchmarks. Codes can be accessed via the provided anonymous link: https://anonymous.4open.science/r/BELT-2-0048.

AI · Sep 21, 2023
BELT: Bootstrapping Electroencephalography-to-Language Decoding and Zero-Shot Sentiment Classification by Natural Language Supervision

Jinzhao Zhou, Yiqun Duan, Yu-Cheng Chang et al.

This paper presents BELT, a novel model and learning framework for the pivotal topic of brain-to-language translation research. The translation from noninvasive brain signals into readable natural language has the potential to promote the application scenario as well as the development of brain-computer interfaces (BCI) as a whole. The critical problem in brain signal decoding or brain-to-language translation is the acquisition of semantically appropriate and discriminative EEG representation from a dataset of limited scale and quality. The proposed BELT method is a generic and efficient framework that bootstraps EEG representation learning using off-the-shelf large-scale pretrained language models (LMs). With a large LM's capacity for understanding semantic information and zero-shot generalization, BELT utilizes large LMs trained on Internet-scale datasets to bring significant improvements to the understanding of EEG signals. In particular, the BELT model is composed of a deep conformer encoder and a vector quantization encoder. Semantic EEG representation is achieved by a contrastive learning step that provides natural language supervision. We achieve state-of-the-art results on two featured brain decoding tasks: brain-to-language translation and zero-shot sentiment classification. Specifically, our model surpasses the baseline model on both tasks by 5.45% and over 10% and achieves a 42.31% BLEU-1 score and 67.32% precision on the main evaluation metrics for translation and zero-shot sentiment classification respectively.

LG · Aug 8, 2024
Masked EEG Modeling for Driving Intention Prediction

Jinzhao Zhou, Justin Sia, Yiqun Duan et al.

Driving under drowsy conditions significantly escalates the risk of vehicular accidents. Although recent efforts have focused on using electroencephalography to detect drowsiness, helping prevent accidents caused by driving in such states, seamless human-machine interaction in driving scenarios requires a more versatile EEG-based system. This system should be capable of understanding a driver's intention while demonstrating resilience to artifacts induced by sudden movements. This paper pioneers a novel research direction in BCI-assisted driving, studying the neural patterns related to driving intentions and presenting a novel method for driving intention prediction. In particular, our preliminary analysis of the EEG signal using independent component analysis suggests a close relation between the intention of driving maneuvers and the neural activities in central-frontal and parietal areas. Power spectral density analysis at a group level also reveals a notable distinction among various driving intentions in the frequency domain. To exploit these brain dynamics, we propose a novel Masked EEG Modeling framework for predicting human driving intentions, including the intention for left turning, right turning, and straight proceeding. Extensive experiments, encompassing comprehensive quantitative and qualitative assessments on a public dataset, demonstrate that the proposed method is proficient in predicting driving intentions across various vigilance states. Specifically, our model attains an accuracy of 85.19% when predicting driving intentions for drowsy subjects, which shows its promising potential for mitigating traffic accidents related to drowsy driving. Notably, our method maintains over 75% accuracy when more than half of the channels are missing or corrupted, underscoring its adaptability in real-life driving.

CL · Aug 8, 2024
Towards Linguistic Neural Representation Learning and Sentence Retrieval from Electroencephalogram Recordings

Jinzhao Zhou, Yiqun Duan, Ziyi Zhao et al.

Decoding linguistic information from non-invasive brain signals using EEG has gained increasing research attention due to its vast application potential. Recently, a number of works have adopted a generative-based framework to decode electroencephalogram (EEG) signals into sentences by utilizing the powerful generative capacity of pretrained large language models (LLMs). However, this approach has several drawbacks that hinder the further development of linguistic applications for brain-computer interfaces (BCIs). Specifically, the ability of the EEG encoder to learn semantic information from EEG data remains questionable, and the LLM decoder's tendency to generate sentences based on its training memory can be hard to avoid. These issues necessitate a novel approach for converting EEG signals into sentences. In this paper, we propose a novel two-step pipeline that addresses these limitations and enhances the validity of linguistic EEG decoding research. We first confirm that word-level semantic information can be learned from EEG data recorded during natural reading by training a Conformer encoder via a masked contrastive objective for word-level classification. To achieve sentence decoding results, we employ a training-free retrieval method to retrieve sentences based on the predictions from the EEG encoder. Extensive experiments and ablation studies were conducted in this paper for a comprehensive evaluation of the proposed approach. Visualization of the top prediction candidates reveals that our model effectively groups EEG segments into semantic categories with similar meanings, thereby validating its ability to learn patterns from unspoken EEG recordings. Despite the exploratory nature of this work, these results suggest that our method holds promise for providing more reliable solutions for converting EEG signals into text.

29.5 · CV · Mar 26
Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control

Zhuoli Zhuang, Yu-Cheng Chang, Yu-Kai Wang et al.

Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights without behaviour response interruption into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems. Our project page is: https://alex95gogo.github.io/Cognitive-Reward/.

AI · Jan 29
BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding

Ziyi Zhao, Jinzhao Zhou, Xiaowei Jiang et al.

Decoding linguistic information from electroencephalography (EEG) remains challenging due to the brain's distributed and nonlinear organization. We present BrainStack, a functionally guided neuro-mixture-of-experts (Neuro-MoE) framework that models the brain's modular functional architecture through anatomically partitioned expert networks. Each functional region is represented by a specialized expert that learns localized neural dynamics, while a transformer-based global expert captures cross-regional dependencies. A learnable routing gate adaptively aggregates these heterogeneous experts, enabling context-dependent expert coordination and selective fusion. To promote coherent representation across the hierarchy, we introduce cross-regional distillation, where the global expert provides top-down regularization to the regional experts. We further release SilentSpeech-EEG (SS-EEG), a large-scale benchmark comprising over 120 hours of EEG recordings from 12 subjects performing 24 silent words, the largest dataset of its kind. Experiments demonstrate that BrainStack consistently outperforms state-of-the-art models, achieving superior accuracy and generalization across subjects. Our results establish BrainStack as a functionally modular, neuro-inspired MoE paradigm that unifies neuroscientific priors with adaptive expert routing, paving the way for scalable and interpretable brain-language decoding.

RO · Sep 26, 2025 · Code
WoW: Towards a World omniscient World model Through Embodied Interaction

Xiaowei Chi, Peidong Jia, Chun-Kai Fan et al.

Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.

RO · Jan 7
Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju et al.

As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Building upon 609 robot manipulation data, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson Correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to ≈ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking World Model in Embodied AI.

CV · May 13, 2024
MaskFuser: Masked Fusion of Joint Multi-Modal Tokenization for End-to-End Autonomous Driving

Yiqun Duan, Xianda Guo, Zheng Zhu et al.

Current multi-modality driving frameworks normally fuse representation by utilizing attention between single-modality branches. However, the existing networks still suppress the driving performance as the Image and LiDAR branches are independent and lack a unified observation representation. Thus, this paper proposes MaskFuser, which tokenizes various modalities into a unified semantic feature space and provides a joint representation for further behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training. The masked training enhances the fusion representation by reconstruction on masked tokens. Architecturally, a hybrid-fusion network is proposed to combine advantages from both early and late fusion: For the early fusion stage, modalities are fused by performing monotonic-to-BEV translation attention between branches; Late fusion is performed by tokenizing various modalities into a unified token space with shared encoding on it. MaskFuser respectively reaches a driving score of 49.05 and route completion of 92.85% on the CARLA LongSet6 benchmark evaluation, which improves the best of previous baselines by 1.74 and 3.21%. The introduced masked fusion increases driving stability under damaged sensory inputs. MaskFuser outperforms the best of previous baselines on driving score by 6.55 (27.8%), 1.53 (13.8%), 1.57 (30.9%), respectively given sensory masking ratios 25%, 50%, and 75%.

MA · Dec 30, 2023
Contrastive learning-based agent modeling for deep reinforcement learning

Wenhao Ma, Yu-Cheng Chang, Jie Yang et al.

Multi-agent systems often require agents to collaborate with or compete against other agents with diverse goals, behaviors, or strategies. Agent modeling is essential when designing adaptive policies for intelligent machine agents in multi-agent systems, as this is the means by which the ego agent understands other agents' behavior and extracts their meaningful policy representations. These representations can be used to enhance the ego agent's adaptive policy, which is trained by reinforcement learning. However, existing agent modeling approaches typically assume the availability of local observations from other agents (modeled agents) during training or a long observation trajectory for policy adaptation. To remove these restrictive assumptions and improve agent modeling performance, we devised a Contrastive Learning-based Agent Modeling (CLAM) method that relies only on the local observations from the ego agent during training and execution. With these observations, CLAM is capable of generating consistent high-quality policy representations in real time right from the beginning of each episode. We evaluated the efficacy of our approach in both cooperative and competitive multi-agent environments. Our experiments demonstrate that our approach achieves state-of-the-art performance on both cooperative and competitive tasks, highlighting the potential of contrastive learning-based agent modeling for enhancing reinforcement learning.

AI · Apr 8, 2025
AEGIS: Human Attention-based Explainable Guidance for Intelligent Vehicle Systems

Zhuoli Zhuang, Cheng-You Lu, Yu-Cheng Fred Chang et al.

Improving decision-making capabilities in Autonomous Intelligent Vehicles (AIVs) has been a hot topic in recent years. Despite advancements, training machines to capture regions of interest for comprehensive scene understanding, like human perception and reasoning, remains a significant challenge. This study introduces a novel framework, Human Attention-based Explainable Guidance for Intelligent Vehicle Systems (AEGIS). AEGIS uses a pre-trained human attention model, with attention converted from eye-tracking data, to guide reinforcement learning (RL) models to identify critical regions of interest for decision-making. By collecting 1.2 million frames from 20 participants across six scenarios, AEGIS pre-trains a model to predict human attention patterns.

HC · Jan 18, 2021
Motor-Imagery-Based Brain Computer Interface using Signal Derivation and Aggregation Functions

Javier Fumanal-Idocin, Yu-Kai Wang, Chin-Teng Lin et al.

Brain Computer Interface (BCI) technologies are popular methods of communication between the human brain and external devices. One of the most popular approaches to BCI is Motor Imagery (MI). In BCI applications, electroencephalography (EEG) is a very popular measurement for brain dynamics because of its non-invasive nature. Although there is a high interest in the BCI topic, the performance of existing systems is still far from ideal, due to the difficulty of performing pattern recognition tasks in EEG signals. BCI systems are composed of a wide range of components that perform signal pre-processing, feature extraction and decision making. In this paper, we define a BCI Framework, named Enhanced Fusion Framework, where we propose three different ideas to improve the existing MI-based BCI frameworks. Firstly, we include an additional pre-processing step of the signal: a differentiation of the EEG signal that makes it time-invariant. Secondly, we add an additional frequency band as a feature for the system and we show its effect on the performance of the system. Finally, we study in depth how to make the final decision in the system. We propose the use of up to six different types of classifiers and a wide range of aggregation functions (including classical aggregations, Choquet and Sugeno integrals and their extensions, and overlap functions) to fuse the information given by the considered classifiers. We have tested this new system on a dataset of 20 volunteers performing motor imagery-based brain-computer interface experiments. On this dataset, the new system achieved an accuracy of 88.80%. We also propose an optimized version of our system that is able to obtain up to 90.76%. Furthermore, we find that the pair Choquet/Sugeno integrals and overlap functions are the ones providing the best results.
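
The abstract above mentions Choquet integrals as one way to fuse classifier outputs. As a rough illustration only, the sketch below computes a discrete Choquet integral over per-classifier scores. The fuzzy measure used here, mu(A) = (|A| / n) ** q, is an assumption chosen for simplicity and is not taken from the paper; the function name `choquet_integral` and the parameter `q` are likewise hypothetical. Scores are assumed to be non-negative (for example, class probabilities in [0, 1]).

```python
def choquet_integral(scores, q=1.0):
    """Discrete Choquet integral of non-negative scores.

    Uses an assumed cardinality-based fuzzy measure mu(A) = (|A| / n) ** q.
    This measure is illustrative and is NOT the one used in the paper.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)  # ascending order x_(1) <= ... <= x_(n)
    total = 0.0
    previous = 0.0
    for i, value in enumerate(ordered):
        # Exactly n - i of the sorted inputs are at position i or later,
        # so the coalition A_(i) has cardinality n - i.
        measure = ((n - i) / n) ** q
        total += (value - previous) * measure
        previous = value
    return total


if __name__ == "__main__":
    # Hypothetical scores from three classifiers for one class.
    classifier_scores = [0.2, 0.5, 0.8]
    print(choquet_integral(classifier_scores, q=1.0))  # behaves like the mean
    print(choquet_integral(classifier_scores, q=3.0))  # leans toward the minimum
```

With q = 1 this measure is additive and the integral reduces to the arithmetic mean; larger q pushes the result toward the minimum score and smaller q toward the maximum, which is one way an aggregation can be made more or less conservative.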

HC · Nov 19, 2020
Interval-valued aggregation functions based on moderate deviations applied to Motor-Imagery-Based Brain Computer Interface

Javier Fumanal-Idocin, Zdenko Takáč, Javier Fernández, Jose Antonio Sanz et al.

In this work we study the use of moderate deviation functions to measure similarity and dissimilarity among a set of given interval-valued data. To do so, we introduce the notion of interval-valued moderate deviation function and we study in particular those interval-valued moderate deviation functions which preserve the width of the input intervals. Then, we study how to apply these functions to construct interval-valued aggregation functions. We have applied them in the decision making phase of two Motor-Imagery Brain Computer Interface frameworks, obtaining better results than those obtained using other numerical and intervalar aggregations.

LG · Jan 9, 2020
Supervised Discriminative Sparse PCA with Adaptive Neighbors for Dimensionality Reduction

Zhenhua Shi, Dongrui Wu, Jian Huang et al.

Dimensionality reduction is an important operation in information visualization, feature extraction, clustering, regression, and classification, especially for processing noisy high dimensional data. However, most existing approaches preserve either the global or the local structure of the data, but not both. Approaches that preserve only the global data structure, such as principal component analysis (PCA), are usually sensitive to outliers. Approaches that preserve only the local data structure, such as locality preserving projections, are usually unsupervised (and hence cannot use label information) and use a fixed similarity graph. We propose a novel linear dimensionality reduction approach, supervised discriminative sparse PCA with adaptive neighbors (SDSPCAAN), to integrate neighborhood-free supervised discriminative sparse PCA and projected clustering with adaptive neighbors. As a result, both global and local data structures, as well as the label information, are used for better dimensionality reduction. Classification experiments on nine high-dimensional datasets validated the effectiveness and robustness of our proposed SDSPCAAN.

LG · Jan 8, 2020
EEG-based Drowsiness Estimation for Driving Safety using Deep Q-Learning

Yurui Ming, Dongrui Wu, Yu-Kai Wang et al.

Fatigue is the most significant factor in road fatalities, and one manifestation of fatigue during driving is drowsiness. In this paper, we propose using deep Q-learning to analyze an electroencephalogram (EEG) dataset captured during a simulated endurance driving test. By measuring the correlation between drowsiness and driving performance, this experiment represents an important brain-computer interface (BCI) paradigm, especially from an application perspective. We adapt the terminologies in the driving test to fit the reinforcement learning framework, thus formulating the drowsiness estimation problem as an optimization of a Q-learning task. By referring to the latest deep Q-learning technologies and attending to the characteristics of EEG data, we tailor a deep Q-network for action proposition that can indirectly estimate drowsiness. Our results show that the trained model can trace the variations of mind state in a satisfactory way against the testing EEG data, which demonstrates the feasibility and practicability of this new computation paradigm. We also show that our method outperforms the supervised learning counterpart and is superior for real applications. To the best of our knowledge, we are the first to introduce the deep reinforcement learning method to this BCI scenario, and our method can be potentially generalized to other BCI cases.
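
For readers unfamiliar with the Q-learning task that the abstract above refers to, the following is a minimal sketch of the standard tabular Q-learning update. It is not the paper's method: the paper tailors a deep Q-network, which replaces the table with a neural network over EEG features. The state and action labels, the function name `q_update`, and the default hyperparameters are all hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one standard tabular Q-learning update in place.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

    q_table maps state -> {action: value}. Unknown or empty next states
    are treated as having value 0.0 (an assumption of this sketch).
    """
    next_values = q_table.get(next_state)
    best_next = max(next_values.values()) if next_values else 0.0
    old_value = q_table[state][action]
    td_target = reward + gamma * best_next
    q_table[state][action] = old_value + alpha * (td_target - old_value)
    return q_table[state][action]


if __name__ == "__main__":
    # Hypothetical two-state, two-action table.
    table = {
        "s0": {"a0": 0.0, "a1": 0.0},
        "s1": {"a0": 1.0, "a1": 2.0},
    }
    print(q_update(table, "s0", "a0", reward=1.0, next_state="s1",
                   alpha=0.5, gamma=0.9))
```

A deep Q-network keeps the same temporal-difference target but fits a function approximator to it instead of overwriting a table entry.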

LG · Nov 27, 2019
Adaptive Initialization Method for K-means Algorithm

Jie Yang, Yu-Kai Wang, Xin Yao et al.

The K-means algorithm is a widely used clustering algorithm that offers simplicity and efficiency. However, the traditional K-means algorithm uses the random method to determine the initial cluster centers, which makes clustering results prone to local optima and thus degrades clustering performance. Many initialization methods have been proposed, but none of them can dynamically adapt to datasets with various characteristics. In our previous research, an initialization method for K-means based on hybrid distance was proposed, and this algorithm can adapt to datasets with different characteristics. However, it has the following drawbacks: (a) When calculating density, the threshold cannot be uniquely determined, resulting in unstable results. (b) It depends heavily on parameter tuning: the parameter must be adjusted five times to obtain better clustering results. (c) The time complexity of the algorithm is quadratic, which makes it difficult to apply to large datasets. In the current paper, we propose an adaptive initialization method for the K-means algorithm (AIMK) to improve our previous work. AIMK can not only adapt to datasets with various characteristics but also obtain better clustering results within two interactions. In addition, we then leverage random sampling in AIMK, yielding a variant named AIMK-RS, to reduce the time complexity. AIMK-RS is easily applied to large and high-dimensional datasets. We compared AIMK and AIMK-RS with 10 different algorithms on 16 normal and six extra-large datasets. The experimental results show that AIMK and AIMK-RS outperform the current initialization methods and several well-known clustering algorithms. Furthermore, AIMK-RS can significantly reduce the complexity of applying it to extra-large datasets with high dimensions. The time complexity of AIMK-RS is O(n).
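
The sensitivity of K-means to its initial centers, which motivates the work above, can be seen on a toy example. The sketch below runs plain Lloyd iterations on one-dimensional data from two different starting points. It does not implement AIMK or AIMK-RS; the data, the starting centers, and the function name `lloyd_1d` are made up for illustration.

```python
def lloyd_1d(points, centers, max_iter=100):
    """Run Lloyd's K-means iterations on 1-D data from the given initial centers.

    Returns the final centers and the sum of squared errors (SSE).
    Empty clusters keep their previous center (a simplification of this sketch).
    """
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center (ties go to the lowest index).
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        new_centers = [
            sum(members) / len(members) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


if __name__ == "__main__":
    data = [0, 1, 5, 6, 10, 11]  # three well-separated pairs
    poor_centers, poor_sse = lloyd_1d(data, [0, 1, 5])   # two centers in one pair
    good_centers, good_sse = lloyd_1d(data, [0, 5, 10])  # one center per pair
    print(poor_centers, poor_sse)
    print(good_centers, good_sse)
```

Starting with two centers inside the same pair leaves the algorithm stuck in a poorer local optimum (SSE 26.0) than starting with one center per pair (SSE 1.5), even though both runs converge.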