CVNov 15, 2025Code
MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question AnsweringSeokwon Song, Minsu Park, Gunhee Kim
Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenario and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions of informativeness, groundedness, and fluency, and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but they exhibit weaker groundedness for image documents than for text documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research. The dataset and experimental code are available at https://github.com/seokwon99/MAVIS
SDMay 7
Do Melody and Rhythm Coevolve?Harin Lee, Rainer Polak, Manuel Anglada-Tort et al.
Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-interval and percussive inter-onset timing distributions from 27,628 popular songs across 59 countries, enabling large-scale cross-cultural comparison that bypasses traditional music annotations. Musical similarities between countries aligned with geographic and linguistic relationships, validating our approach. Substantial variation emerged in both melodic and rhythmic structures across countries, yet the diversity of the two components was not significantly correlated, challenging assumptions of coupled evolution. Only rhythmic diversity was significantly associated with ethnic and linguistic heterogeneity, while melodic diversity showed no such association. These findings suggest that melody and rhythm constitute partially independent systems shaped by distinct cultural and evolutionary pressures, rather than components of a single monolithic musical style.
CLMay 4
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language ModelsCosimo Galeone, Minsu Park, Giuseppe Ettorre et al.
Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and in several settings degrades task performance substantially. To overcome this limitation, we developed AloLab, an iterative system-prompt optimizer (meta-agent: Claude Sonnet 4.5) requiring only black-box API access to the target model; it reaches 84-87% output accuracy on GSM8K and 34-40% on MATH across five independent runs per model, with 29/30 paired McNemar comparisons against the best static prompt significant at p < 0.05, at near-NAIVE inference latency and without model fine-tuning. The same format failure extends to GPT-4o (OpenAI, 2024), a proprietary closed-source model: REFERENCE achieves 0% output accuracy due to systematic markdown-fence wrapping, while AloLab reaches 95.2% [94.8, 95.6]. An ablation replacing the Sonnet 4.5 meta-agent with Claude 3 Haiku reduces mean output accuracy to 61.0% and increases run-to-run standard deviation from <1 pp to 21.8 pp, confirming that meta-agent capability is a primary driver of optimization quality.
LGMar 4, 2024
Locally Convex Global Loss Network for Decision-Focused LearningHaeun Jeon, Hyunglip Bae, Minsu Park et al.
In decision-making problems under uncertainty, predicting unknown parameters is often considered independent of the optimization part. Decision-focused learning (DFL) is a task-oriented framework that integrates prediction and optimization by adapting the predictive model to give better decisions for the corresponding task. Here, an inevitable challenge arises when computing the gradients of the optimal decision with respect to the parameters. Existing research copes with this issue by smoothly reforming surrogate optimization or constructing surrogate loss functions that mimic task loss. However, they are applied to restricted optimization domains. In this paper, we propose Locally Convex Global Loss Network (LCGLN), a global surrogate loss model that can be implemented in a general DFL paradigm. LCGLN learns task loss via a partial input convex neural network which is guaranteed to be convex for chosen inputs while keeping the non-convex global structure for the other inputs. This enables LCGLN to admit general DFL through only a single surrogate loss without any sense for choosing appropriate parametric forms. We confirm the effectiveness and flexibility of LCGLN by evaluating our proposed model with three stochastic decision-making problems.
CLOct 28, 2025
LASTIST: LArge-Scale Target-Independent STance datasetDongJae Kim, Yaejin Lee, Minsu Park et al.
Stance detection has emerged as an area of research in the field of artificial intelligence. However, most research is currently centered on the target-dependent stance detection task, which is based on a person's stance in favor of or against a specific target. Furthermore, most benchmark datasets are based on English, making it difficult to develop models in low-resource languages such as Korean, especially for an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of both parties on Korean political parties, the LASTIST dataset uses 563,299 labeled Korean sentences. We provide a detailed description of how we collected and constructed the dataset and trained state-of-the-art deep learning and stance detection models. Our LASTIST dataset is designed for various tasks in stance detection, including target-independent stance detection and diachronic evolution stance detection. We deploy our dataset on https://anonymous.4open.science/r/LASTIST-3721/.
MMApr 30, 2025
Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning PipelineMinwoo Oh, Minsu Park, Eunil Park
Short video platforms like YouTube Shorts and TikTok face significant copyright compliance challenges, as infringers frequently embed arbitrary background music (BGM) to obscure original soundtracks (OST) and evade content originality detection. To tackle this issue, we propose a novel pipeline that integrates Music Source Separation (MSS) and cross-modal video-music retrieval (CMVMR). Our approach effectively separates arbitrary BGM from the original OST, enabling the restoration of authentic video audio tracks. To support this work, we introduce two domain-specific datasets: OASD-20K for audio separation and OSVAR-160 for pipeline evaluation. OASD-20K contains 20,000 audio clips featuring mixed BGM and OST pairs, while OSVAR-160 is a unique benchmark dataset comprising 1,121 video and mixed-audio pairs, specifically designed for short video restoration tasks. Experimental results demonstrate that our pipeline not only removes arbitrary BGM with high accuracy but also restores OSTs, ensuring content integrity. This approach provides an ethical and scalable solution to copyright challenges in user-generated content on short video platforms.
CYMay 7, 2023
Perception, performance, and detectability of conversational artificial intelligence across 32 university coursesHazem Ibrahim, Fengyuan Liu, Rohail Asim et al.
The emergence of large language models has led to the development of powerful tools such as ChatGPT that can produce text indistinguishable from human-generated work. With the increasing accessibility of such technology, students across the globe may utilize it to help with their school work -- a possibility that has sparked discussions on the integrity of student evaluations in the age of artificial intelligence (AI). To date, it is unclear how such tools perform compared to students on university-level courses. Further, students' perspectives regarding the use of such tools, and educators' perspectives on treating their use as plagiarism, remain unknown. Here, we compare the performance of ChatGPT against students on 32 university-level courses. We also assess the degree to which its use can be detected by two classifiers designed specifically for this purpose. Additionally, we conduct a survey across five countries, as well as a more in-depth survey at the authors' institution, to discern students' and educators' perceptions of ChatGPT's use. We find that ChatGPT's performance is comparable, if not superior, to that of students in many courses. Moreover, current AI-text classifiers cannot reliably detect ChatGPT's use in school work, due to their propensity to classify human-written answers as AI-generated, as well as the ease with which AI-generated text can be edited to evade detection. Finally, we find an emerging consensus among students to use the tool, and among educators to treat this as plagiarism. Our findings offer insights that could guide policy discussions addressing the integration of AI into educational frameworks.
IRAug 2, 2021
Cross-cultural Mood Perception in Pop Songs and its Alignment with Mood Detection AlgorithmsHarin Lee, Frank Hoeger, Marc Schoenwiesner et al.
Do people from different cultural backgrounds perceive the mood in music the same way? How closely do human ratings across different cultures approximate automatic mood detection algorithms that are often trained on corpora of predominantly Western popular music? Analyzing 166 participants responses from Brazil, South Korea, and the US, we examined the similarity between the ratings of nine categories of perceived moods in music and estimated their alignment with four popular mood detection algorithms. We created a dataset of 360 recent pop songs drawn from major music charts of the countries and constructed semantically identical mood descriptors across English, Korean, and Portuguese languages. Multiple participants from the three countries rated their familiarity, preference, and perceived moods for a given song. Ratings were highly similar within and across cultures for basic mood attributes such as sad, cheerful, and energetic. However, we found significant cross-cultural differences for more complex characteristics such as dreamy and love. To our surprise, the results of mood detection algorithms were uniformly correlated across human ratings from all three countries and did not show a detectable bias towards any particular culture. Our study thus suggests that the mood detection algorithms can be considered as an objective measure at least within the popular music context.
HCMar 28, 2016
A Data-driven Study of View Duration on YouTubeMinsu Park, Mor Naaman, Jonah Berger
Video watching had emerged as one of the most frequent media activities on the Internet. Yet, little is known about how users watch online video. Using two distinct YouTube datasets, a set of random YouTube videos crawled from the Web and a set of videos watched by participants tracked by a Chrome extension, we examine whether and how indicators of collective preferences and reactions are associated with view duration of videos. We show that video view duration is positively associated with the video's view count, the number of likes per view, and the negative sentiment in the comments. These metrics and reactions have a significant predictive power over the duration the video is watched by individuals. Our findings provide a more precise understandings of user engagement with video content in social media beyond view count.