CL · Sep 17, 2024 · Code
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong et al.
Audio language models process audio inputs using textual prompts for tasks like speech recognition and audio captioning. Although built on multilingual pre-trained components, most are trained primarily on English, limiting their usability for other languages. This paper evaluates audio language models on Thai, a low-resource language, and finds that they lack emergent cross-lingual abilities despite their multilingual foundations. To address this, we explore data mixtures that optimize audio language models for both a target language and English while integrating audio comprehension and speech instruction-following into a unified model. Our experiments provide insights into improving instruction-following in low-resource languages by balancing language-specific and multilingual training data. The proposed model, Typhoon-Audio, significantly outperforms existing open-source models and achieves performance comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai.
CL · Nov 6, 2025 · Code
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat et al.
We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.
CL · Dec 18, 2024
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach et al.
This paper introduces Typhoon 2, a series of text and multimodal large language models optimized for the Thai language. The series includes models for text, vision, and audio. Typhoon2-Text builds on state-of-the-art open models, such as Llama 3 and Qwen2, and we perform continual pre-training on a mixture of English and Thai data. We employ post-training techniques to enhance Thai language performance while preserving the base models' original capabilities. We release text models across a range of sizes, from 1 to 70 billion parameters, available in both base and instruction-tuned variants. To guardrail text generation, we release Typhoon2-Safety, a classifier enhanced for Thai cultures and language. Typhoon2-Vision improves Thai document understanding while retaining general visual capabilities, such as image captioning. Typhoon2-Audio introduces an end-to-end speech-to-speech model architecture capable of processing audio, speech, and text inputs and generating both text and speech outputs.
CL · Jul 17, 2025
AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan et al. · gatech
Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.
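The system-ranking agreement reported above is a Spearman rank correlation. As a minimal sketch of how such a figure can be computed, the snippet below ranks two score lists and applies the no-ties shortcut formula. The system scores are hypothetical and are not taken from the paper, and the function names `ranks` and `spearman` are illustrative.

```python
def ranks(values):
    """Return 1-based ranks (largest value gets rank 1). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rho via the no-ties shortcut: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical per-system scores (illustration only, not the paper's data).
human_scores = [4.5, 4.1, 3.8, 3.0, 2.2]       # e.g. mean human preference
judge_scores = [0.80, 0.85, 0.70, 0.55, 0.40]  # e.g. automatic judge win rate

rho = spearman(human_scores, judge_scores)
print(round(rho, 3))
```

Here the judge swaps only the top two systems, so the correlation stays high. With tied scores the shortcut formula no longer applies and average ranks would be needed.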
CL · Jun 19, 2025
FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning
Natapong Nitarach, Warit Sirichotedumrong, Panop Pitchayarthorn et al.
This paper presents FinCoT, a structured chain-of-thought (CoT) prompting framework that embeds domain-specific expert financial reasoning blueprints to guide large language models' behaviors. We identify three main prompting styles in financial NLP (FinNLP): (1) standard prompting (zero-shot), (2) unstructured CoT (free-form reasoning), and (3) structured CoT (with explicitly structured reasoning steps). Prior work has mainly focused on the first two, while structured CoT remains underexplored and lacks domain expertise incorporation. Therefore, we evaluate all three prompting approaches across ten CFA-style financial domains and introduce FinCoT as the first structured finance-specific prompting approach incorporating blueprints from domain experts. FinCoT improves the accuracy of a general-purpose model, Qwen3-8B-Base, from 63.2% to 80.5%, and boosts Fin-R1 (7B), a finance-specific model, from 65.7% to 75.7%, while reducing output length by up to 8.9x and 1.16x compared to structured CoT methods, respectively. We find that FinCoT proves most effective for models lacking financial post-training. Our findings show that FinCoT not only improves performance and reduces inference costs but also yields more interpretable and expert-aligned reasoning traces.
CL · Jan 19
Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition
Warit Sirichotedumrong, Adisai Na-Thalang, Potsawee Manakul et al.
Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription, including context-dependent number verbalization and repetition markers (mai yamok), creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled dataset with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.
CL · Oct 17, 2025
Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong et al.
Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and that the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
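The idea of changing only audio token positions can be illustrated with a deliberately simplified sketch. The snippet below uses plain linear position interpolation, which is an assumption made here for illustration: YaRN itself rescales RoPE frequencies per dimension, and the abstract does not give the exact Partial YaRN formulation. The function name `partial_position_ids` and the parameter `audio_window` are hypothetical.

```python
def partial_position_ids(token_kinds, audio_window):
    """Assign a position to every token in a mixed text/audio sequence.

    Text tokens always advance the position by 1, so their spacing is
    unchanged. Audio tokens advance by a compressed step so that the whole
    audio span covers at most `audio_window` positions (assumed to be the
    audio length the model saw in training).
    """
    n_audio = sum(1 for kind in token_kinds if kind == "audio")
    # Compress only when the audio is longer than the original window.
    step = min(1.0, audio_window / n_audio) if n_audio else 1.0

    positions = []
    pos = 0.0
    for kind in token_kinds:
        positions.append(pos)
        pos += step if kind == "audio" else 1.0
    return positions


# Hypothetical layout: 2 prompt tokens, 8 audio tokens, 2 question tokens,
# with a model assumed to support only 4 audio positions.
kinds = ["text"] * 2 + ["audio"] * 8 + ["text"] * 2
pos_ids = partial_position_ids(kinds, audio_window=4)
print(pos_ids)
```

In this toy layout the eight audio tokens are squeezed into four positions, while neighbouring text tokens remain exactly one position apart, which is the property the abstract attributes to the audio-only approach.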
CR · Oct 13, 2020
Visual Security Evaluation of Learnable Image Encryption Methods against Ciphertext-only Attacks
Warit Sirichotedumrong, Hitoshi Kiya
Various visual information protection methods have been proposed for privacy-preserving deep neural networks (DNNs). At the same time, attacks on such protection methods have also been studied. In this paper, we evaluate state-of-the-art visual protection methods for privacy-preserving DNNs in terms of visual security against ciphertext-only attacks (COAs). We focus on the brute-force attack, feature reconstruction attack (FR-Attack), inverse transformation attack (ITN-Attack), and GAN-based attack (GAN-Attack), which have been proposed to reconstruct visual information on plain images from the visually-protected images. The details of the various attacks are first summarized, and then the visual security of the protection methods is evaluated. Experimental results demonstrate that most protection methods, including pixel-wise encryption, are not robust enough against GAN-Attack, while a few protection methods are.
CR · Jun 2, 2020
A GAN-Based Image Transformation Scheme for Privacy-Preserving Deep Neural Networks
Warit Sirichotedumrong, Hitoshi Kiya
We propose a novel image transformation scheme using generative adversarial networks (GANs) for privacy-preserving deep neural networks (DNNs). The proposed scheme enables us not only to apply images without visual information to DNNs, but also to enhance robustness against ciphertext-only attacks (COAs) including DNN-based attacks. In this paper, the proposed transformation scheme is demonstrated to be able to protect visual information on plain images, and the visually-protected images are directly applied to DNNs for privacy-preserving image classification. Since the proposed scheme utilizes GANs, there is no need to manage encryption keys. In an image classification experiment, we evaluate the effectiveness of the proposed scheme in terms of classification accuracy and robustness against COAs.
CR · Dec 9, 2019
On the Security of Pixel-Based Image Encryption for Privacy-Preserving Deep Neural Networks
Warit Sirichotedumrong, Yuma Kinoshita, Hitoshi Kiya
This paper aims to evaluate the safety of a pixel-based image encryption method, which has been proposed to apply images with no visual information to deep neural networks (DNNs), in terms of robustness against ciphertext-only attacks (COAs). In addition, we propose a novel DNN-based COA that aims to reconstruct the visual information of encrypted images. The effectiveness of the proposed attack is evaluated under two encryption key conditions: the same encryption key, and different encryption keys. The results show that the proposed attack can recover the visual information of the encrypted images if the images are encrypted under the same encryption key. Otherwise, the pixel-based image encryption method is robust against COAs.
IV · Jul 31, 2019
Adversarial Test on Learnable Image Encryption
MaungMaung AprilPyone, Warit Sirichotedumrong, Hitoshi Kiya
Data used for deep learning should be protected to preserve privacy. Researchers have come up with the notion of learnable image encryption to satisfy this requirement. However, existing privacy-preserving approaches have never considered the threat of adversarial attacks. In this paper, we ran an adversarial test on learnable image encryption in five different scenarios. The results show different behaviors of the network in the variable-key scenarios and suggest that learnable image encryption provides a certain level of adversarial robustness.
CR · May 6, 2019
Privacy-Preserving Deep Neural Networks with Pixel-based Image Encryption Considering Data Augmentation in the Encrypted Domain
Warit Sirichotedumrong, Takahiro Maekawa, Yuma Kinoshita et al.
We present a novel privacy-preserving scheme for deep neural networks (DNNs) that enables us not only to apply images without visual information to DNNs for both training and testing but also to consider data augmentation in the encrypted domain for the first time. In this paper, a novel pixel-based image encryption method is first proposed for privacy-preserving DNNs. In addition, a novel adaptation network is considered that reduces the influence of image encryption. In an experiment, the proposed method is applied to a well-known network, ResNet-18, for image classification. The experimental results demonstrate that conventional privacy-preserving machine learning methods, including state-of-the-art ones, cannot be applied to data augmentation in the encrypted domain and that the proposed method outperforms them in terms of classification accuracy.
CR · Dec 14, 2018
Grayscale-Based Image Encryption Considering Color Sub-sampling Operation for Encryption-then-Compression Systems
Warit Sirichotedumrong, Tatsuya Chuman, Hitoshi Kiya
A new grayscale-based block scrambling image encryption scheme is presented to enhance the security of Encryption-then-Compression (EtC) systems, which are used to securely transmit images through an untrusted channel provider. The proposed scheme enables the use of a smaller block size and a larger number of blocks than the conventional scheme. Images encrypted using the proposed scheme include less color information due to the use of grayscale images even when the original image has three color channels. These features enhance security against various attacks, such as jigsaw puzzle solver and brute-force attacks. Moreover, it allows the use of color sub-sampling, which can improve the compression performance, although the encrypted images have no color information. In an experiment, encrypted images were uploaded to and then downloaded from Facebook and Twitter, and the results demonstrated that the proposed scheme is effective for EtC systems, while maintaining a high compression performance.
CR · Nov 1, 2018
Encryption-then-Compression Systems using Grayscale-based Image Encryption for JPEG Images
Tatsuya Chuman, Warit Sirichotedumrong, Hitoshi Kiya
A block scrambling-based encryption scheme is presented to enhance the security of Encryption-then-Compression (EtC) systems with JPEG compression, which allow us to securely transmit images through an untrusted channel provider, such as social network service providers. The proposed scheme enables the use of a smaller block size and a larger number of blocks than the conventional scheme. Images encrypted using the proposed scheme include less color information due to the use of grayscale images even when the original image has three color channels. These features enhance security against various attacks such as jigsaw puzzle solver and brute-force attacks. In an experiment, the security against jigsaw puzzle solver attacks is evaluated. Encrypted images were uploaded to and then downloaded from Facebook and Twitter, and the results demonstrated that the proposed scheme is effective for EtC systems.
CR · Oct 31, 2018
Compression Performance of Grayscale-based Image Encryption for Encryption-then-Compression Systems
Warit Sirichotedumrong, Tatsuya Chuman, Hitoshi Kiya
This paper considers a new grayscale-based image encryption for Encryption-then-Compression (EtC) systems with JPEG compression. Firstly, generation methods of grayscale-based images are discussed in terms of the selection of color space. In addition, a new JPEG quantization table for the grayscale-based images is proposed to provide better compression performance. Moreover, the quality of images both uploaded to and downloaded from Social Network Services (SNS) is discussed and evaluated. In the experiments, encrypted images are compressed using various compression parameters and quantization tables, and uploaded to Twitter and Facebook. The results showed that the selection of color space and the proposed quantization table can improve the compression performance of not only uploaded images but also downloaded ones.
CR · Oct 4, 2018
Image Manipulation Specifications on Social Networking Services for Encryption-then-Compression Systems
Tatsuya Chuman, Kenta Iida, Warit Sirichotedumrong et al.
Encryption-then-Compression (EtC) systems have been proposed to securely transmit images through an untrusted channel provider. In this study, EtC systems were applied to social media like Twitter that carry out image manipulations. The block scrambling-based encryption schemes used in EtC systems were evaluated in terms of their robustness against image manipulation on social media. The aim was to investigate how five social networking service (SNS) providers, Facebook, Twitter, Google+, Tumblr and Flickr, manipulate images and to determine whether the encrypted images uploaded to SNS providers can avoid being distorted by such manipulations. In an experiment, encrypted and non-encrypted JPEG images were uploaded to various SNS providers. The results show that EtC systems are applicable to the five SNS providers.
CR · Jun 11, 2018
Grayscale-based Block Scrambling Image Encryption for Social Networking Services
Warit Sirichotedumrong, Tatsuya Chuman, Shoko Imaizumi et al.
This paper proposes a new block scrambling encryption scheme that enhances the security of encryption-then-compression (EtC) systems for JPEG images, which are used, for example, to securely transmit images through an untrusted channel provider. The proposed method allows the use of a smaller block size and a larger number of blocks than the conventional ones. Moreover, images encrypted using the proposed scheme include less color information due to the use of grayscale even when the original image has three color channels. These features enhance security against various attacks such as jigsaw puzzle solver and brute-force attacks. The results of an experiment in which encrypted images were uploaded to and then downloaded from Twitter and Facebook demonstrated the effectiveness of the proposed scheme for EtC systems.
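The core block scrambling step can be sketched on a small grayscale image as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it shows only a key-driven block permutation and its inverse, omits any further per-block transforms the actual scheme may apply, and uses Python's `random.Random` as a stand-in for a proper keyed generator. All function names are hypothetical.

```python
import random


def split_blocks(img, bs):
    """Cut a 2D list into bs x bs blocks, in row-major block order."""
    h, w = len(img), len(img[0])
    if h % bs or w % bs:
        raise ValueError("image dimensions must be multiples of the block size")
    return [[row[x:x + bs] for row in img[y:y + bs]]
            for y in range(0, h, bs) for x in range(0, w, bs)]


def join_blocks(blocks, h, w, bs):
    """Reassemble row-major blocks into an h x w image."""
    img = [[0] * w for _ in range(h)]
    per_row = w // bs
    for i, block in enumerate(blocks):
        y0, x0 = (i // per_row) * bs, (i % per_row) * bs
        for dy, row in enumerate(block):
            img[y0 + dy][x0:x0 + bs] = row
    return img


def key_permutation(n, key):
    """Derive a block permutation from the key (illustrative, not secure)."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return perm


def scramble(img, bs, key):
    h, w = len(img), len(img[0])
    blocks = split_blocks(img, bs)
    perm = key_permutation(len(blocks), key)
    # Output block i is taken from input block perm[i].
    return join_blocks([blocks[p] for p in perm], h, w, bs)


def unscramble(img, bs, key):
    h, w = len(img), len(img[0])
    blocks = split_blocks(img, bs)
    perm = key_permutation(len(blocks), key)
    restored = [None] * len(blocks)
    for dst, src in enumerate(perm):
        restored[src] = blocks[dst]  # undo: put each block back at its source
    return join_blocks(restored, h, w, bs)


# Hypothetical 8x8 grayscale image with 4x4 blocks and an arbitrary key.
image = [[(y * 8 + x) % 256 for x in range(8)] for y in range(8)]
encrypted = scramble(image, bs=4, key=1234)
decrypted = unscramble(encrypted, bs=4, key=1234)
```

Because only whole blocks move, a smaller block size yields more blocks and therefore a larger permutation space, which is consistent with the abstract's point about smaller blocks and brute-force attacks.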