Kyohei Atarashi

h-index5

8papers

94citations

Novelty51%

AI Score49

Ranked #23,168 of 194,257 authors (top 12%)#4,849 in CL (top 16%)

8 Papers

1.1CLMar 3

Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Shunki Uebayashi, Kento Masui, Kyohei Atarashi et al.

Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.

13.9CLFeb 18, 2025

Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs

Joon Park, Kyohei Atarashi, Koh Takeuchi et al.

This paper addresses the challenge of comprehending very long contexts in Large Language Models (LLMs) by proposing a method that emulates Retrieval Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought (CoT) reasoning. While recent LLMs support over 100,000 tokens in a single prompt, simply enlarging context windows has not guaranteed robust multi-hop reasoning when key details are scattered across massive input. Our approach treats the model as both the retriever and the reasoner: it first tags relevant segments within a long passage, then employs a stepwise CoT workflow to integrate these pieces of evidence. This single-pass method thereby reduces reliance on an external retriever, yet maintains focus on crucial segments. We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text. Compared to baseline (no retrieval) and naive RAG pipelines, our approach more accurately handles multi-fact questions such as object location tracking, counting, and indefinite knowledge. Furthermore, we analyze how prompt structure, including the order of question, relevant-text tags, and overall instructions, significantly affects performance. These findings underscore that optimized prompt engineering, combined with guided reasoning, can enhance LLMs' long-context comprehension and serve as a lightweight alternative to traditional retrieval pipelines.

CYMay 22

Estimating Learners' Skill Acquisition Without Temporal Information

Ryosuke Nagai, Kyohei Atarashi, Koh Takeuchi et al.

Recent research in educational data mining, especially knowledge tracing, has focused on predicting learners' future knowledge states to support adaptive instruction. However, in many real-world educational settings, learning data are often available only as single-time-point assessments without temporal information, making existing time-series-based approaches difficult to apply. In this paper, we propose a novel framework for predicting future skill acquisition using only snapshot data. Specifically, we address the problem of predicting the next skill to be acquired from skill mastery patterns estimated by cognitive diagnostic models (CDMs). In the absence of temporal information, we exploit inclusion relations among learners' skill sets to induce a pseudo-temporal ordering, interpreting expanding skill sets as a proxy for learning progression. To efficiently approximate unobserved acquisition paths, we introduce a neural model that captures latent skill acquisition dynamics through expected skill increments. Experiments on both synthetic and real-world datasets demonstrate that the proposed method consistently outperforms baseline approaches, with particularly strong advantages as the skill space becomes larger. These results indicate that meaningful skill acquisition patterns can be inferred from snapshot data alone, providing a practical framework for adaptive learning support in data-constrained educational environments.

4.1LGSep 22, 2025

Robust Anomaly Detection Under Normality Distribution Shift in Dynamic Graphs

Xiaoyang Xu, Xiaofeng Lin, Koh Takeuchi et al.

Anomaly detection in dynamic graphs is a critical task with broad real-world applications, including social networks, e-commerce, and cybersecurity. Most existing methods assume that normal patterns remain stable over time; however, this assumption often fails in practice due to the phenomenon we refer to as normality distribution shift (NDS), where normal behaviors evolve over time. Ignoring NDS can lead models to misclassify shifted normal instances as anomalies, degrading detection performance. To tackle this issue, we propose WhENDS, a novel unsupervised anomaly detection method that aligns normal edge embeddings across time by estimating distributional statistics and applying whitening transformations. Extensive experiments on four widely-used dynamic graph datasets show that WhENDS consistently outperforms nine strong baselines, achieving state-of-the-art results and underscoring the importance of addressing NDS in dynamic graph anomaly detection.

6.4CRJul 31, 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Lijia Liu, Takumi Kondo, Kyohei Atarashi et al.

This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.

4.5MLJun 12, 2025Code

Box-Constrained Softmax Function and Its Application for Post-Hoc Calibration

Kyohei Atarashi, Satoshi Oyama, Hiromi Arai et al.

Controlling the output probabilities of softmax-based models is a common problem in modern machine learning. Although the $\mathrm{Softmax}$ function provides soft control via its temperature parameter, it lacks the ability to enforce hard constraints, such as box constraints, on output probabilities, which can be critical in certain applications requiring reliable and trustworthy models. In this work, we propose the box-constrained softmax ($\mathrm{BCSoftmax}$) function, a novel generalization of the $\mathrm{Softmax}$ function that explicitly enforces lower and upper bounds on output probabilities. While $\mathrm{BCSoftmax}$ is formulated as the solution to a box-constrained optimization problem, we develop an exact and efficient computation algorithm for $\mathrm{BCSoftmax}$. As a key application, we introduce two post-hoc calibration methods based on $\mathrm{BCSoftmax}$. The proposed methods mitigate underconfidence and overconfidence in predictive models by learning the lower and upper bounds of the output probabilities or logits after model training, thereby enhancing reliability in downstream decision-making tasks. We demonstrate the effectiveness of our methods experimentally using the TinyImageNet, CIFAR-100, and 20NewsGroups datasets, achieving improvements in calibration metrics.

8.4CVFeb 24, 2025

Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models

Yaqi Sun, Kyohei Atarashi, Koh Takeuchi et al.

Large Vision-Language Models (LVLMs) integrate image encoders with Large Language Models (LLMs) to process multi-modal inputs and perform complex visual tasks. However, they often generate hallucinations by describing non-existent objects or attributes, compromising their reliability. This study analyzes hallucination patterns in image captioning, showing that not all tokens in the generation process are influenced by image input and that image dependency can serve as a useful signal for hallucination detection. To address this, we develop an automated pipeline to identify hallucinated objects and train a token-level classifier using hidden representations from parallel inference passes-with and without image input. Leveraging this classifier, we introduce a decoding strategy that effectively controls hallucination rates in image captioning at inference time.

3.8MLOct 19, 2020Code

Factorization Machines with Regularization for Sparse Feature Interactions

Kyohei Atarashi, Satoshi Oyama, Masahito Kurihara

Factorization machines (FMs) are machine learning predictive models based on second-order feature interactions and FMs with sparse regularization are called sparse FMs. Such regularizations enable feature selection, which selects the most relevant features for accurate prediction, and therefore they can contribute to the improvement of the model accuracy and interpretability. However, because FMs use second-order feature interactions, the selection of features often causes the loss of many relevant feature interactions in the resultant models. In such cases, FMs with regularization specially designed for feature interaction selection trying to achieve interaction-level sparsity may be preferred instead of those just for feature selection trying to achieve feature-level sparsity. In this paper, we present a new regularization scheme for feature interaction selection in FMs. The proposed regularizer is an upper bound of the $\ell_1$ regularizer for the feature interaction matrix, which is computed from the parameter matrix of FMs. For feature interaction selection, our proposed regularizer makes the feature interaction matrix sparse without a restriction on sparsity patterns imposed by the existing methods. We also describe efficient proximal algorithms for the proposed FMs and present theoretical analyses of both existing and the new regularize. In addition, we will discuss how our ideas can be applied or extended to more accurate feature selection and other related models such as higher-order FMs and the all-subsets model. The analysis and experimental results on synthetic and real-world datasets show the effectiveness of the proposed methods.