CVApr 3, 2023
Grand Challenge On Detecting CheapfakesDuc-Tien Dang-Nguyen, Sohail Ahmed Khan, Cise Midoglu et al.
Cheapfake is a recently coined term that encompasses non-AI ("cheap") manipulations of multimedia content. Cheapfakes are known to be more prevalent than deepfakes. Cheapfake media can be created using editing software for image/video manipulations, or even without using any software, by simply altering the context of an image/video by sharing the media alongside misleading claims. This alteration of context is referred to as out-of-context (OOC) misuse of media. OOC media is much harder to detect than fake media, since the images and videos are not tampered. In this challenge, we focus on detecting OOC images, and more specifically the misuse of real photographs with conflicting image captions in news items. The aim of this challenge is to develop and benchmark models that can be used to detect whether given samples (news image and associated captions) are OOC, based on the recently compiled COSMOS dataset.
CVJan 6, 2023
Augmenting Ego-Vehicle for Traffic Near-Miss and Accident Classification Dataset using Manipulating Conditional Style TranslationHilmil Pradana, Minh-Son Dao, Koji Zettsu
To develop the advanced self-driving systems, many researchers are focusing to alert all possible traffic risk cases from closed-circuit television (CCTV) and dashboard-mounted cameras. Most of these methods focused on identifying frame-by-frame in which an anomaly has occurred, but they are unrealized, which road traffic participant can cause ego-vehicle leading into collision because of available annotation dataset only to detect anomaly on traffic video. Near-miss is one type of accident and can be defined as a narrowly avoided accident. However, there is no difference between accident and near-miss at the time before the accident happened, so our contribution is to redefine the accident definition and re-annotate the accident inconsistency on DADA-2000 dataset together with near-miss. By extending the start and end time of accident duration, our annotation can precisely cover all ego-motions during an incident and consistently classify all possible traffic risk accidents including near-miss to give more critical information for real-world driving assistance systems. The proposed method integrates two different components: conditional style translation (CST) and separable 3-dimensional convolutional neural network (S3D). CST architecture is derived by unsupervised image-to-image translation networks (UNIT) used for augmenting the re-annotation DADA-2000 dataset to increase the number of traffic risk accident videos and to generalize the performance of video classification model on different types of conditions while S3D is useful for video classification to prove dataset re-annotation consistency. In evaluation, the proposed method achieved a significant improvement result by 10.25% positive margin from the baseline model for accuracy on cross-validation analysis.
CYNov 23, 2022
Monitoring and Improving Personalized Sleep Quality from Long-Term LifelogsWenbin Gan, Minh-Son Dao, Koji Zettsu
Sleep plays a vital role in our physical, cognitive, and psychological well-being. Despite its importance, long-term monitoring of personalized sleep quality (SQ) in real-world contexts is still challenging. Many sleep researches are still developing clinically and far from accessible to the general public. Fortunately, wearables and IoT devices provide the potential to explore the sleep insights from multimodal data, and have been used in some SQ researches. However, most of these studies analyze the sleep related data and present the results in a delayed manner (i.e., today's SQ obtained from last night's data), it is sill difficult for individuals to know how their sleep will be before they go to bed and how they can proactively improve it. To this end, this paper proposes a computational framework to monitor the individual SQ based on both the objective and subjective data from multiple sources, and moves a step further towards providing the personalized feedback to improve the SQ in a data-driven manner. The feedback is implemented by referring the insights from the PMData dataset based on the discovered patterns between life events and different levels of SQ. The deep learning based personal SQ model (PerSQ), using the long-term heterogeneous data and considering the carry-over effect, achieves higher prediction performance compared with baseline models. A case study also shows reasonable results for an individual to monitor and improve the SQ in the future.
CYNov 23, 2022
An Open Case-based Reasoning Framework for Personalized On-board Driving Assistance in Risk ScenariosWenbin Gan, Minh-Son Dao, Koji Zettsu
Driver reaction is of vital importance in risk scenarios. Drivers can take correct evasive maneuver at proper cushion time to avoid the potential traffic crashes, but this reaction process is highly experience-dependent and requires various levels of driving skills. To improve driving safety and avoid the traffic accidents, it is necessary to provide all road drivers with on-board driving assistance. This study explores the plausibility of case-based reasoning (CBR) as the inference paradigm underlying the choice of personalized crash evasive maneuvers and the cushion time, by leveraging the wealthy of human driving experience from the steady stream of traffic cases, which have been rarely explored in previous studies. To this end, in this paper, we propose an open evolving framework for generating personalized on-board driving assistance. In particular, we present the FFMTE model with high performance to model the traffic events and build the case database; A tailored CBR-based method is then proposed to retrieve, reuse and revise the existing cases to generate the assistance. We take the 100-Car Naturalistic Driving Study dataset as an example to build and test our framework; the experiments show reasonable results, providing the drivers with valuable evasive information to avoid the potential crashes in different scenarios.
AISep 3, 2022
Multimodal and Crossmodal AI for Smart Data AnalysisMinh-Son Dao
Recently, the multimodal and crossmodal AI techniques have attracted the attention of communities. The former aims to collect disjointed and heterogeneous data to compensate for complementary information to enhance robust prediction. The latter targets to utilize one modality to predict another modality by discovering the common attention sharing between them. Although both approaches share the same target: generate smart data from collected raw data, the former demands more modalities while the latter aims to decrease the variety of modalities. This paper first discusses the role of multimodal and crossmodal AI in smart data analysis in general. Then, we introduce the multimodal and crossmodal AI framework (MMCRAI) to balance the abovementioned approaches and make it easy to scale into different domains. This framework is integrated into xDataPF (the cross-data platform https://www.xdata.nict.jp/). We also introduce and discuss various applications built on this framework and xDataPF.
51.0IRApr 28
GeoSearch: Augmenting Worldwide Geolocalization with Web-Scale Reverse Image Search and Image MatchingTung-Duong Le-Duc, Hoang-Quoc Nguyen-Son, Minh-Son Dao
Worldwide image geolocalization, which aims to predict the GPS coordinates of any image on Earth, remains challenging due to global visual diversity. Recent generative approaches based on Retrieval-Augmented Generation (RAG) and Large Multimodal Models (LMMs) leverage candidates retrieved from fixed databases for reasoning, but often struggle with scenes that are absent from the reference set. In this work, we propose GeoSearch, an open-world geolocation framework that integrates web-scale reverse image search into the RAG pipeline. GeoSearch augments LMM prompts with database-retrieved coordinates and textual evidence extracted from web pages. To mitigate noise from irrelevant content, we introduce a two-layer filtering mechanism consisting of image matching, followed by confidence-based gating. Experiments on standard benchmarks Im2GPS3k and YFCC4k demonstrate the superiority of GeoSearch under leakage-aware evaluation. Our code and data are publicly available to support reproducibility.
AIJun 25, 2025Code
Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving ScenariosWenbin Gan, Minh-Son Dao, Koji Zettsu
Driving in safety-critical scenarios requires quick, context-aware decision-making grounded in both situational understanding and experiential reasoning. Large Language Models (LLMs), with their powerful general-purpose reasoning capabilities, offer a promising foundation for such decision-making. However, their direct application to autonomous driving remains limited due to challenges in domain adaptation, contextual grounding, and the lack of experiential knowledge needed to make reliable and interpretable decisions in dynamic, high-risk environments. To address this gap, this paper presents a Case-Based Reasoning Augmented Large Language Model (CBR-LLM) framework for evasive maneuver decision-making in complex risk scenarios. Our approach integrates semantic scene understanding from dashcam video inputs with the retrieval of relevant past driving cases, enabling LLMs to generate maneuver recommendations that are both context-sensitive and human-aligned. Experiments across multiple open-source LLMs show that our framework improves decision accuracy, justification quality, and alignment with human expert behavior. Risk-aware prompting strategies further enhance performance across diverse risk types, while similarity-based case retrieval consistently outperforms random sampling in guiding in-context learning. Case studies further demonstrate the framework's robustness in challenging real-world conditions, underscoring its potential as an adaptive and trustworthy decision-support tool for intelligent driving systems.
CLJan 23
SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search EngineHoang-Quoc Nguyen-Son, Minh-Son Dao, Koji Zettsu
With the advent of large language models (LLMs), it has become common practice for users to draft text and utilize LLMs to enhance its quality through paraphrasing. However, this process can sometimes result in the loss or distortion of the original intended meaning. Due to the human-like quality of LLM-generated text, traditional detection methods often fail, particularly when text is paraphrased to closely mimic original content. In response to these challenges, we propose a novel approach named SearchLLM, designed to identify LLM-paraphrased text by leveraging search engine capabilities to locate potential original text sources. By analyzing similarities between the input and regenerated versions of candidate sources, SearchLLM effectively distinguishes LLM-paraphrased content. SearchLLM is designed as a proxy layer, allowing seamless integration with existing detectors to enhance their performance. Experimental results across various LLMs demonstrate that SearchLLM consistently enhances the accuracy of recent detectors in detecting LLM-paraphrased text that closely mimics original content. Furthermore, SearchLLM also helps the detectors prevent paraphrasing attacks.
CVMay 18, 2025Code
KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News DetectionTuan-Vinh La, Minh-Hieu Nguyen, Minh-Son Dao
Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) Existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly at \href{https://github.com/latuanvinh1998/KGAlign}{github.com/latuanvinh1998/KGAlign}.
LGMay 8, 2024
xMTrans: Temporal Attentive Cross-Modality Fusion Transformer for Long-Term Traffic PredictionHuy Quang Ung, Hao Niu, Minh-Son Dao et al.
Traffic predictions play a crucial role in intelligent transportation systems. The rapid development of IoT devices allows us to collect different kinds of data with high correlations to traffic predictions, fostering the development of efficient multi-modal traffic prediction models. Until now, there are few studies focusing on utilizing advantages of multi-modal data for traffic predictions. In this paper, we introduce a novel temporal attentive cross-modality transformer model for long-term traffic predictions, namely xMTrans, with capability of exploring the temporal correlations between the data of two modalities: one target modality (for prediction, e.g., traffic congestion) and one support modality (e.g., people flow). We conducted extensive experiments to evaluate our proposed model on traffic congestion and taxi demand predictions using real-world datasets. The results showed the superiority of xMTrans against recent state-of-the-art methods on long-term traffic predictions. In addition, we also conducted a comprehensive ablation study to further analyze the effectiveness of each module in xMTrans.