CLAug 9, 2022Code
Thai Wav2Vec2.0 with CommonVoice V8Wannaphong Phatthiyaphaibun, Chompakorn Chaksangchaichot, Peerat Limkonchotiwat et al.
Recently, Automatic Speech Recognition (ASR), a system that converts audio into text, has attracted a lot of attention in the machine learning community. As a result, many publicly available models have been released on HuggingFace. However, most of these ASR models are available in English; only a minority are available in Thai. Additionally, most Thai ASR models are closed-source, and existing open-source models lack robustness. To address this problem, we train a new ASR model on a pre-trained XLSR-Wav2Vec model with the Thai CommonVoice corpus V8 and train a trigram language model to boost the performance of our ASR model. We hope that our models will be beneficial to individuals and the ASR community in Thailand.
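As a rough illustration of what a trigram language model estimates, the sketch below counts trigrams over tokenized sentences and returns an add-alpha smoothed conditional probability. The romanized toy corpus, the function names, and the smoothing choice are assumptions made for this example; the abstract does not describe the toolkit or smoothing the authors used.

```python
from collections import Counter, defaultdict


def train_trigram(sentences):
    """Count trigrams over token lists padded with start/end markers."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b)][c] += 1
    return counts


def trigram_prob(counts, a, b, c, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(c | a, b)."""
    context = counts.get((a, b), Counter())
    return (context[c] + alpha) / (sum(context.values()) + alpha * vocab_size)


# Hypothetical romanized toy corpus, for illustration only.
corpus = [["sawasdee", "krub"], ["sawasdee", "ka"], ["sawasdee", "krub"]]
model = train_trigram(corpus)
vocab = {tok for sent in corpus for tok in sent} | {"</s>"}

p_krub = trigram_prob(model, "<s>", "sawasdee", "krub", len(vocab))
p_ka = trigram_prob(model, "<s>", "sawasdee", "ka", len(vocab))
```

In an ASR decoder, scores of this kind are typically combined with the acoustic model's scores to prefer word sequences that are more plausible in the language.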
CVMar 23, 2023
Zero-guidance Segmentation Using Zero Segment LabelsPitchaporn Rewatbowornwong, Nattanat Chatthee, Ekapol Chuangsuwanich et al.
CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.
IRJun 17, 2023
Typo-Robust Representation Learning for Dense RetrievalPanuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat et al.
Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words. A popular approach for handling misspelled queries is minimizing the representation discrepancy between misspelled queries and their pristine ones. Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries. To assess the effectiveness of our proposed method, we compare it against the existing competitors using two benchmark datasets and two base encoders. Our method outperforms the competitors in all cases with misspelled queries. Our code and models are available at https://github.com/panuthept/DST-DenseRetrieval.
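The idea of pulling a misspelled query toward its pristine version while pushing it away from surrounding queries can be sketched with a generic InfoNCE-style contrastive loss. This is a standard formulation used here for illustration and is not claimed to be the authors' exact objective; the two-dimensional embeddings and the temperature value are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Hypothetical embeddings: a misspelled query, its clean form, other queries.
misspelled = [0.9, 0.1]
clean = [1.0, 0.0]
other_queries = [[0.0, 1.0], [-1.0, 0.0]]

aligned_loss = info_nce(misspelled, clean, other_queries)
misaligned_loss = info_nce(misspelled, [0.0, 1.0], [clean, [-1.0, 0.0]])
```

The loss is low when the misspelled query sits close to its clean form and far from the other queries, which captures both the alignment and the contrast terms in one expression.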
CLNov 6, 2023
An Efficient Self-Supervised Cross-View Training For Sentence EmbeddingPeerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul et al.
Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases.
IROct 30, 2025
Evaluating Perspectival Biases in Cross-Modal RetrievalTeerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant et al.
Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
CLMar 24, 2024Code
WangchanLion and WangchanX MRC EvalWannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee et al.
This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction following datasets. To promote open research and reproducibility, we publicly release all training data, code, and the final model weights under the Apache-2 license. To assess the contextual understanding capability, we conducted extensive experimental studies using two Thai MRC datasets, XQuAD and Iapp_wiki_qa_squad. Experimental results demonstrate the model's ability to comprehend the context and produce an answer faithful to the reference one in 0-shot and 1-shot settings. In addition, our evaluation goes beyond the traditional MRC. We propose a new evaluation scheme assessing the answer's correctness, helpfulness, conciseness, and contextuality. Our code is available publicly at https://github.com/vistec-AI/WangchanLion.
CVOct 22, 2020Code
High resolution weakly supervised localization architectures for medical imagesKonpat Preechakul, Sira Sriswasdi, Boonserm Kijsirikul et al.
In medical imaging, Class-Activation Map (CAM) serves as the main explainability tool by pointing to the region of interest. Since the localization accuracy from CAM is constrained by the resolution of the model's feature map, one may expect that segmentation models, which generally have large feature maps, would produce more accurate CAMs. However, we have found that this is not the case due to task mismatch. While segmentation models are developed for datasets with pixel-level annotation, only image-level annotation is available in most medical imaging datasets. Our experiments suggest that Global Average Pooling (GAP) and Group Normalization are the main culprits that worsen the localization accuracy of CAM. To address this issue, we propose Pyramid Localization Network (PYLON), a model for high-accuracy weakly-supervised localization that achieved 0.62 average point localization accuracy on NIH's Chest X-Ray 14 dataset, compared to 0.45 for a traditional CAM model. Source code and extended results are available at https://github.com/cmb-chula/pylon.
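For a classifier that ends in Global Average Pooling followed by a linear layer, the class-activation map for one class is the weighted sum of the final feature maps, which is why its resolution is tied to the feature map size. The sketch below shows that standard CAM computation on toy data; it does not reproduce PYLON, and the 2x2 feature maps and weights are invented for the example.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one class."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w * fmap[y][x]
    return cam


def gap_logit(feature_maps, class_weights):
    """Class score of a GAP + linear classifier (bias omitted)."""
    logit = 0.0
    for fmap, w in zip(feature_maps, class_weights):
        cells = [v for row in fmap for v in row]
        logit += w * sum(cells) / len(cells)
    return logit


# Hypothetical 2x2 feature maps from two channels, and one class's weights.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = [1.0, 0.5]

cam = class_activation_map(maps, weights)
logit = gap_logit(maps, weights)
cam_mean = sum(v for row in cam for v in row) / 4
```

The class score equals the spatial mean of the CAM, so each cell of the map can be read as that location's contribution to the prediction.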
CLFeb 25, 2025
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual EnvironmentsPatomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek et al.
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
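The separation that Program-of-Thought relies on can be shown in miniature: the model only writes a program, and a Python interpreter computes the answer. In the sketch below the "generated" program is a hard-coded string standing in for model output, and the convention that the result is stored in a variable named `answer` is an assumption for this example.

```python
def run_program(program_text):
    """Execute generated code in a fresh namespace and return `answer`.

    Emptying the builtins is only a light restriction, not a real sandbox;
    untrusted model output needs proper isolation.
    """
    namespace = {}
    exec(program_text, {"__builtins__": {}}, namespace)
    return namespace["answer"]


# Hypothetical program a model might emit for a word problem such as
# "3 baskets hold 4 apples each; 5 apples are eaten. How many remain?"
generated = "apples = 3 * 4\neaten = 5\nanswer = apples - eaten\n"

result = run_program(generated)
```

Because execution is deterministic, an evaluation can inspect the program itself (the reasoning) separately from whether the executed answer is correct.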
CLFeb 5, 2025
Mitigating Language Bias in Cross-Lingual Job Retrieval: A Recruitment Platform PerspectiveNapat Laosaengpha, Thanit Tativannarat, Attapol Rutherford et al.
Understanding the textual components of resumes and job postings is critical for improving job-matching accuracy and optimizing job search systems in online recruitment platforms. However, existing works primarily focus on analyzing individual components within this information, requiring multiple specialized tools to analyze each aspect. Such disjointed methods could potentially hinder overall generalizability in recruitment-related text processing. Therefore, we propose a unified sentence encoder that utilizes a multi-task dual-encoder framework for jointly learning multiple components in a single encoder. The results show that our method outperforms other state-of-the-art models, despite its smaller model size. Moreover, we propose a novel metric, Language Bias Kullback-Leibler Divergence (LBKL), to evaluate language bias in the encoder, demonstrating significant bias reduction and superior cross-lingual performance.
CLAug 17, 2025
SEA-BED: Southeast Asia Embedding BenchmarkWuttikorn Ponwitayarat, Raymond Ng, Jann Railey Montalan et al.
Sentence embeddings are essential for NLP tasks such as semantic search, re-ranking, and textual similarity. Although multilingual benchmarks like MMTEB broaden coverage, Southeast Asia (SEA) datasets are scarce and often machine-translated, missing native linguistic properties. With nearly 700 million speakers, the SEA region lacks a region-specific embedding benchmark. We introduce SEA-BED, the first large-scale SEA embedding benchmark with 169 datasets across 9 tasks and 10 languages, where 71% are formulated by humans, not machine generation or translation. We address three research questions: (1) which SEA languages and tasks are challenging, (2) whether SEA languages show unique performance gaps globally, and (3) how human vs. machine translations affect evaluation. We evaluate 17 embedding models across six studies, analyzing task and language challenges, cross-benchmark comparisons, and translation trade-offs. Results show sharp ranking shifts, inconsistent model performance among SEA languages, and the importance of human-curated datasets for low-resource languages like Burmese.
CLJul 19, 2025
Mangosteen: An Open Thai Corpus for Language Model PretrainingWannaphong Phatthiyaphaibun, Can Udomcharoenchaikit, Pakpoom Singkorapoom et al.
Pre-training data shapes a language model's quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims CommonCrawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.
CLAug 9, 2025
SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian LanguagesMuhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim et al.
Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.
ROJul 26, 2025
Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot ObservationsSupawich Sitdhipol, Waritwong Sukprasongdee, Ekapol Chuangsuwanich et al.
Fusing information from human observations can help robots overcome sensing limitations in collaborative tasks. However, an uncertainty-aware fusion framework requires a grounded likelihood representing the uncertainty of human inputs. This paper presents a Feature Pyramid Likelihood Grounding Network (FP-LGN) that grounds spatial language by learning relevant map image features and their relationships with spatial relation semantics. The model is trained as a probability estimator to capture aleatoric uncertainty in human language using three-stage curriculum learning. Results showed that FP-LGN matched expert-designed rules in mean Negative Log-Likelihood (NLL) and demonstrated greater robustness with lower standard deviation. Collaborative sensing results demonstrated that the grounded likelihood successfully enabled uncertainty-aware fusion of heterogeneous human language observations and robot sensor measurements, achieving significant improvements in human-robot collaborative task performance.
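The role a grounded likelihood plays in Bayesian fusion can be sketched on a small discrete grid: each observation, whether from a robot sensor or from human language, contributes a likelihood over candidate locations, and the belief is the renormalized product. The four-cell grid and all probability values below are hypothetical; FP-LGN itself, which would produce the language likelihood, is not modeled here.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def bayes_update(prior, likelihood):
    """Posterior over grid cells: prior times likelihood, renormalized."""
    return normalize([p * l for p, l in zip(prior, likelihood)])


def negative_log_likelihood(probabilities, true_index):
    """NLL of the true cell under a belief; lower is better."""
    return -math.log(probabilities[true_index])


prior = [0.25, 0.25, 0.25, 0.25]
robot_sensor = [0.1, 0.2, 0.6, 0.1]       # hypothetical sensor likelihood
human_language = [0.05, 0.15, 0.7, 0.1]   # hypothetical grounded likelihood

sensor_only = bayes_update(prior, robot_sensor)
belief = bayes_update(sensor_only, human_language)
```

With the target in cell 2, the fused belief assigns it more mass than the sensor-only belief, so its NLL is lower; a poorly calibrated language likelihood would instead pull the belief toward the wrong cell.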
CLJul 13, 2025
Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?Pawitsapak Akarajaradwong, Chompakorn Chaksangchaichot, Pirat Pothavorn et al.
The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach aligning LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, reducing computational expenses by up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains from the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
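Two ingredients named in the abstract can be sketched together: a semantic-similarity reward computed from embeddings, and the group-relative advantage that GRPO derives by standardizing rewards within a group of sampled responses. The three-dimensional vectors below are hypothetical stand-ins for BGE-M3 embeddings (no embedding model is called), and the authors' full reward may include other terms.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical embeddings of a reference answer and four sampled responses.
reference = [1.0, 0.0, 0.0]
responses = [
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.0, 0.7],
]

rewards = [cosine(reference, r) for r in responses]
advantages = group_relative_advantages(rewards)
```

Responses more similar to the reference than the group average receive positive advantages and are reinforced; scoring by embedding similarity avoids a call to a large judge model for each sample.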
CLMay 30, 2025
Explainable Depression Detection using Masked Hard Instance MiningPatawee Prakrankamanant, Shinji Watanabe, Ekapol Chuangsuwanich
This paper addresses the critical need for improved explainability in text-based depression detection. While offering predictive outcomes, current solutions often overlook the understanding of model predictions which can hinder trust in the system. We propose the use of Masked Hard Instance Mining (MHIM) to enhance the explainability in the depression detection task. MHIM strategically masks attention weights within the model, compelling it to distribute attention across a wider range of salient features. We evaluate MHIM on two datasets representing distinct languages: Thai (Thai-Maywe) and English (DAIC-WOZ). Our results demonstrate that MHIM significantly improves performance in terms of both prediction accuracy and explainability metrics.
LGMay 29, 2025
Decom-Renorm-Merge: Model Merging on the Right Space Improves MultitaskingYuatyong Chaichana, Thanapat Trachu, Peerat Limkonchotiwat et al. · berkeley
In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings, ranging from smaller encoder-based models such as ViT and DeBERTa and encoder-decoder-based models such as T5 to larger decoder-based models such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method's performance.
CLJun 12, 2024
Learning Job Title Representation from Job Description Aggregation NetworkNapat Laosaengpha, Thanit Tativannarat, Chawan Piansaddhayanon et al.
Learning job title representation is a vital process for developing automatic human resource tools. To do so, existing methods primarily rely on learning the title representation through skills extracted from the job description, neglecting the rich and diverse content within. Thus, we propose an alternative framework for learning job titles through their respective job description (JD) and utilize a Job Description Aggregator component to handle the lengthy description and bidirectional contrastive loss to account for the bidirectional relationship between the job title and its description. We evaluated the performance of our method on both in-domain and out-of-domain settings, achieving a superior performance over the skill-based approach.
SDJun 10, 2024
Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian BridgeThanapat Trachu, Chawan Piansaddhayanon, Ekapol Chuangsuwanich
Diffusion-based speech enhancement has shown promising results, but can suffer from a slower inference time. Initializing the diffusion process with the enhanced audio generated by a regression-based model can be used to reduce the computational steps required. However, these approaches often necessitate a regression model, further increasing the system's complexity. We propose Thunder, a unified regression-diffusion model that utilizes the Brownian bridge process, which allows the model to act in both modes. The regression mode can be accessed by setting the diffusion time step close to 1. However, the standard score-based diffusion modeling does not perform well in this setup due to gradient instability. To mitigate this problem, we modify the diffusion model to predict the clean speech instead of the score function, achieving competitive performance with a more compact model size and fewer reverse steps.
CLJun 9, 2024
MrRank: Improving Question Answering Retrieval System through Multi-Result Ranking ModelDanupat Khamnuansin, Tawunrat Chalothorn, Ekapol Chuangsuwanich
Large Language Models (LLMs) often struggle with hallucinations and outdated information. To address this, Information Retrieval (IR) systems can be employed to augment LLMs with up-to-date knowledge. However, existing IR techniques contain deficiencies, posing a performance bottleneck. Given the extensive array of IR systems, combining diverse approaches presents a viable strategy. Nevertheless, prior attempts have yielded restricted efficacy. In this work, we propose an approach that leverages learning-to-rank techniques to combine heterogeneous IR systems. We demonstrate the method on two Retrieval Question Answering (ReQA) tasks. Our empirical findings exhibit a significant performance enhancement, outperforming previous approaches and achieving state-of-the-art results on ReQA SQuAD.
CLJun 5, 2024
Space Decomposition for Sentence EmbeddingWuttikorn Ponwitayarat, Peerat Limkonchotiwat, Ekapol Chuangsuwanich et al.
Determining sentence pair similarity is crucial for various NLP tasks. A common technique to address this is typically evaluated on a continuous semantic textual similarity scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that a score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a new approach that treats the upper-range and lower-range classes separately. In this paper, we introduce a novel embedding space decomposition method called MixSP utilizing a Mixture of Specialized Projectors, designed to distinguish and rank upper-range and lower-range samples accurately. The experimental results demonstrate that MixSP significantly decreased the representation overlap between upper-range and lower-range classes while outperforming competitors on STS and zero-shot benchmarks.
CVFeb 28, 2022
ReCasNet: Improving consistency within the two-stage mitosis detection frameworkChawan Piansaddhayanon, Sakun Santisukwongchote, Shanop Shuangshoti et al.
Mitotic count (MC) is an important histological parameter for cancer diagnosis and grading, but the manual process for obtaining MC from whole-slide histopathological images is very time-consuming and prone to error. Therefore, deep learning models have been proposed to facilitate this process. Existing approaches utilize a two-stage pipeline: the detection stage for identifying the locations of potential mitotic cells and the classification stage for refining prediction confidences. However, this pipeline formulation can lead to inconsistencies in the classification stage due to the poor prediction quality of the detection stage and the mismatches in training data distributions between the two stages. In this study, we propose a Refine Cascade Network (ReCasNet), an enhanced deep learning pipeline that mitigates the aforementioned problems with three improvements. First, window relocation was used to reduce the number of poor quality false positives generated during the detection stage. Second, object re-cropping was performed with another deep learning model to adjust poorly centered objects. Third, improved data selection strategies were introduced during the classification stage to reduce the mismatches in training data distributions. ReCasNet was evaluated on two large-scale mitotic figure recognition datasets, canine cutaneous mast cell tumor (CCMCT) and canine mammary carcinoma (CMC), which resulted in up to 4.8 percentage point improvements in the F1 scores for mitotic cell detection and a 44.1% reduction in mean absolute percentage error (MAPE) for MC prediction. Techniques that underlie ReCasNet can be generalized to other two-stage object detection networks and should contribute to improving the performances of deep learning models in broad digital pathology applications.
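The two metrics reported above follow standard definitions, sketched below for reference. The counts used in the example are made up and are unrelated to the CCMCT or CMC results; the MAPE helper assumes every true count is non-zero.

```python
def mean_absolute_percentage_error(true_counts, predicted_counts):
    """MAPE in percent; assumes every true count is non-zero."""
    errors = [abs(t - p) / t for t, p in zip(true_counts, predicted_counts)]
    return 100.0 * sum(errors) / len(errors)


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for detections."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical mitotic counts for three slides.
mape = mean_absolute_percentage_error([10, 20, 40], [12, 18, 40])
f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

F1 scores detection of individual mitotic cells, while MAPE scores the per-slide count, so a pipeline can improve on one without a matching improvement on the other.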
ASMay 16, 2020
Reducing Spelling Inconsistencies in Code-Switching ASR using Contextualized CTC LossBurin Naowarat, Thananchai Kongthaworn, Korrawe Karunratanakul et al.
Code-Switching (CS) remains a challenge for Automatic Speech Recognition (ASR), especially character-based models. With the combined choice of characters from multiple languages, the outcome from character-based models suffers from phoneme duplication, resulting in language-inconsistent spellings. We propose Contextualized Connectionist Temporal Classification (CCTC) loss to encourage spelling consistencies of a character-based non-autoregressive ASR which allows for faster inference. The CCTC loss conditions the main prediction on the predicted contexts to ensure language consistency in the spellings. In contrast to existing CTC-based approaches, CCTC loss does not require frame-level alignments, since the context ground truth is obtained from the model's estimated path. Compared to the same model trained with regular CTC loss, our method consistently improved the ASR performance on both CS and monolingual corpora.
NCApr 8, 2020
MetaSleepLearner: A Pilot Study on Fast Adaptation of Bio-signals-Based Sleep Stage Classifier to New Individual Subject Using Meta-LearningNannapas Banluesombatkul, Pichayoot Ouppaphan, Pitshaporn Leelaarporn et al.
Identifying sleep stages from bio-signals requires time-consuming and tedious labor from skilled clinicians. Deep learning approaches have been introduced to tackle automatic sleep stage classification. However, replacing clinicians with an automatic system is difficult because individual bio-signals differ in many respects, causing inconsistent model performance on each incoming individual. Thus, we explore the feasibility of a novel approach capable of assisting clinicians and lessening their workload. We propose a transfer learning framework, entitled MetaSleepLearner, based on Model Agnostic Meta-Learning (MAML), to transfer the acquired sleep staging knowledge from a large dataset to new individual subjects. The framework was demonstrated to require the labelling of only a few sleep epochs by the clinicians and allow the remainder to be handled by the system. Layer-wise Relevance Propagation (LRP) was also applied to understand the learning course of our approach. In all acquired datasets, in comparison to the conventional approach, MetaSleepLearner achieved a 5.4% to 17.7% improvement, with a statistically significant difference between the means of the two approaches. The illustration of the model interpretation after the adaptation to each subject also confirmed that the performance was directed towards reasonable learning. MetaSleepLearner outperformed the conventional approaches as a result of fine-tuning using the recordings of both healthy subjects and patients. This is the first work that investigated a non-conventional pre-training method, MAML, resulting in a possibility for human-machine collaboration in sleep stage classification and easing the burden of the clinicians in labelling the sleep stages through only several epochs rather than an entire recording.
CLAug 4, 2019
Semi-supervised Thai Sentence Segmentation Using Local and Distant Word RepresentationsChanatip Saetia, Ekapol Chuangsuwanich, Tawunrat Chalothorn et al.
A sentence is typically treated as the minimal syntactic unit used for extracting valuable information from a longer piece of text. However, in written Thai, there are no explicit sentence markers. We propose a deep learning model for the task of sentence segmentation that includes three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, due to the scarcity of labeled data, for which annotation is difficult and time-consuming, we also investigate and adapt Cross-View Training (CVT) as a semi-supervised learning technique, allowing us to utilize unlabeled data to improve the model representations. In the Thai sentence segmentation experiments, our model reduced the relative error by 7.4% and 10.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. We also applied our model to the task of pronunciation recovery on the IWSLT English dataset. Our model outperformed the prior sequence tagging models, achieving a relative error reduction of 2.5%. Ablation studies revealed that utilizing n-gram representations was the main contributing factor for Thai, while the semi-supervised training helped the most for English.
SPAug 31, 2018
Towards Asynchronous Motor Imagery-Based Brain-Computer Interfaces: a joint training scheme using deep learningPatcharin Cheng, Phairot Autthasan, Boriwat Pijarana et al.
In this paper, the deep learning (DL) approach is applied to a joint training scheme for asynchronous motor imagery-based Brain-Computer Interface (BCI). The proposed DL approach is a cascade of one-dimensional convolutional neural networks and fully-connected neural networks (CNN-FC). The focus is mainly on three types of brain responses: non-imagery EEG (background EEG), pure imagery EEG, and EEG during the transitional period between background EEG and pure imagery (transitional imagery). The study of transitional imagery signals should provide greater insight into real-world scenarios. It may be inferred that pure imagery and transitional EEG are high and low power EEG imagery, respectively. Moreover, the results from the CNN-FC are compared to the conventional approach for motor imagery-BCI, namely the common spatial pattern (CSP) for feature extraction and support vector machine (SVM) for classification (CSP-SVM). Under a joint training scheme, pure and transitional imagery are treated as the same class, while background EEG is another class. Ten-fold cross-validation is used to evaluate whether the joint training scheme significantly improves performance on the task of classifying pure and transitional imagery signals from background EEG. Using just a few sparse electrode channels (Cz, C3, and C4), mean accuracy reaches 71.52% and 70.27% for CNN-FC and CSP-SVM, respectively. On the other hand, mean accuracy without the joint training scheme reaches only 62.68% and 52.41% for CNN-FC and CSP-SVM, respectively.
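The labeling rule of the joint training scheme and the ten-fold evaluation protocol can be sketched as follows. The segment-type strings and the interleaved fold assignment are assumptions for this example; the abstract does not say how the authors partitioned their recordings.

```python
def joint_labels(segment_types):
    """Joint scheme: pure and transitional imagery share one class."""
    mapping = {"background": 0, "pure": 1, "transitional": 1}
    return [mapping[s] for s in segment_types]


def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # interleaved assignment
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


labels = joint_labels(["background", "pure", "transitional", "background"])
splits = list(k_fold_splits(25, k=10))
```

Each sample appears in exactly one test fold, so the mean accuracy over the ten folds uses every segment for evaluation once.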
SPJul 5, 2018
Affective EEG-Based Person Identification Using the Deep Learning ApproachTheerawit Wilaiprasitporn, Apiwat Ditthapron, Karis Matchaparn et al.
Electroencephalography (EEG) is another mode for performing Person Identification (PI). Due to the nature of the EEG signals, EEG-based PI is typically done while the person is performing some kind of mental task, such as motor control. However, few works have considered EEG-based PI while the person is in different mental states (affective EEG). The aim of this paper is to improve the performance of affective EEG-based PI using a deep learning approach. We proposed a cascade of deep learning using a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs are used to handle the spatial information from the EEG while RNNs extract the temporal information. We evaluated two types of RNNs, namely, Long Short-Term Memory (CNN-LSTM) and Gated Recurrent Unit (CNN-GRU). The proposed method is evaluated on the state-of-the-art affective dataset DEAP. The results indicate that CNN-GRU and CNN-LSTM can perform PI from different affective states and reach up to 99.90-100% mean Correct Recognition Rate (CRR), significantly outperforming a support vector machine (SVM) baseline system that uses power spectral density (PSD) features. Notably, the 100% mean CRR comes from only 40 subjects in the DEAP dataset. When the number of EEG electrodes is reduced from thirty-two to five for more practical applications, the frontal region gives the best results, reaching up to 99.17% CRR (from CNN-GRU). Amongst the two deep learning models, we find CNN-GRU to slightly outperform CNN-LSTM, while having faster training time. Furthermore, CNN-GRU overcomes the influence of affective states in EEG-Based PI reported in the previous works.
CVMay 29, 2018
Rice Classification Using Spatio-Spectral Deep Convolutional Neural NetworkItthi Chatnuntawech, Kittipong Tantisantisom, Paisan Khanchaitit et al.
Rice has been one of the staple foods that contribute significantly to human food supplies. Numerous rice varieties have been cultivated, imported, and exported worldwide. Different rice varieties could be mixed during rice production and trading. Rice impurities could damage the trust between rice importers and exporters, calling for the need to develop a rice variety inspection system. In this work, we develop a non-destructive rice variety classification system that benefits from the synergy between hyperspectral imaging and deep convolutional neural network (CNN). The proposed method uses a hyperspectral imaging system to simultaneously acquire complementary spatial and spectral information of rice seeds. The rice varieties are then determined from the acquired spatio-spectral data using a deep CNN. As opposed to several existing rice variety classification methods that require hand-engineered features, the proposed method automatically extracts spatio-spectral features from the raw sensor data. As demonstrated using two types of rice datasets, the proposed method achieved up to 11.9% absolute improvement in the mean classification accuracy, compared to the commonly used classification methods based on support vector machines.
CLOct 30, 2015
Prediction-Adaptation-Correction Recurrent Neural Networks for Low-Resource Language Speech RecognitionYu Zhang, Ekapol Chuangsuwanich, James Glass et al.
In this paper, we investigate the use of prediction-adaptation-correction recurrent neural networks (PAC-RNNs) for low-resource speech recognition. A PAC-RNN is comprised of a pair of neural networks in which a correction network uses auxiliary information given by a prediction network to help estimate the state probability. The information from the correction network is also used by the prediction network in a recurrent loop. Our model outperforms other state-of-the-art neural networks (DNNs, LSTMs) on IARPA-Babel tasks. Moreover, transfer learning from a language that is similar to the target language can help improve performance further.