CV | Mar 25 | Code
A^3: Towards Advertising Aesthetic Assessment
Kaiyuan Ji, Yixuan Gao, Lu Sun et al.
Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align.
ST | Dec 31, 2025 | Code
PriceSeer: Evaluating Large Language Models in Real-Time Stock Prediction
Bohan Liang, Zijian Chen, Qi Jia et al.
Stock prediction, a subject closely related to people's investment activities in fully dynamic and live environments, has been widely studied. Current large language models (LLMs) have shown remarkable potential in various domains, exhibiting expert-level performance through advanced reasoning and contextual understanding. In this paper, we introduce PriceSeer, a live, dynamic, and data-uncontaminated benchmark specifically designed for LLMs performing stock prediction tasks. Specifically, PriceSeer includes 110 U.S. stocks from 11 industrial sectors, with each containing 249 historical data points. Our benchmark implements both internal and external information expansion, where LLMs receive extra financial indicators, news, and fake news to perform stock price prediction. We evaluate six cutting-edge LLMs under different prediction horizons, demonstrating their potential in generating investment strategies after obtaining accurate price predictions for different sectors. Additionally, we provide analyses of LLMs' suboptimal performance in long-term predictions, including the vulnerability to fake news and specific industries. The code and evaluation data will be open-sourced at https://github.com/BobLiang2113/PriceSeer.
QUANT-PH | Mar 4
Barycentric bounds on the error exponents of quantum hypothesis exclusion
Kaiyuan Ji, Hemant K. Mishra, Milán Mosonyi et al.
Quantum state exclusion is an operational task with application to ontological interpretations of quantum states. In such a task, one is given a system whose state is randomly selected from a finite set, and the goal is to identify a state from the set that is not the true state of the system. An error occurs if and only if the state identified is the true state. In this paper, we study the optimal error probability of quantum state exclusion and its error exponent from an information-theoretic perspective. Our main finding is a single-letter upper bound on the error exponent of state exclusion given by the multivariate log-Euclidean Chernoff divergence, and we prove that this improves upon the best previously known upper bound. We also extend our analysis to quantum channel exclusion, and we establish a single-letter and efficiently computable upper bound on its error exponent, admitting the use of adaptive strategies. We derive both upper bounds, for state and channel exclusion, based on one-shot analysis and formulate them as a type of multivariate divergence measure called a barycentric Chernoff divergence. Moreover, our result on channel exclusion has implications in two important special cases. First, when there are two hypotheses, our result provides the first known efficiently computable upper bound on the error exponent of symmetric binary channel discrimination. Second, when all channels are classical, we show that our upper bound is achievable by a parallel strategy, thus solving the exact error exponent of classical channel exclusion.
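To make the exclusion task concrete, here is a minimal sketch restricted to the classical special case, where each "state" is a probability distribution over measurement outcomes. This is an illustration of the task definition only, not the paper's method; the function name `exclusion_error_probability` and its arguments are hypothetical. After observing outcome x, excluding hypothesis i is an error exactly when i is the true hypothesis, so the best choice is the hypothesis with the smallest joint probability of being true and producing x.

```python
from typing import Sequence


def exclusion_error_probability(
    priors: Sequence[float],
    likelihoods: Sequence[Sequence[float]],
) -> float:
    """Optimal error probability of classical state exclusion.

    priors[i] is the prior probability of hypothesis i, and
    likelihoods[i][x] is the probability of outcome x under hypothesis i.
    (Both names are illustrative assumptions, not taken from the paper.)
    """
    num_outcomes = len(likelihoods[0])
    total = 0.0
    for x in range(num_outcomes):
        # Excluding hypothesis i after seeing x is wrong with joint
        # probability priors[i] * likelihoods[i][x]; pick the smallest.
        total += min(p * q[x] for p, q in zip(priors, likelihoods))
    return total


# Three hypotheses, each of which never produces one distinct outcome:
# seeing outcome x rules out hypothesis x with certainty.
perfect = exclusion_error_probability(
    [1 / 3, 1 / 3, 1 / 3],
    [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]],
)

# Two hypotheses: excluding one is the same as guessing the other,
# so the result coincides with the binary discrimination error.
binary = exclusion_error_probability([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]])
```

In the first example every outcome has a hypothesis that could not have produced it, so exclusion succeeds with certainty even though the three distributions cannot be perfectly distinguished from one another.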
QUANT-PH | Oct 23, 2025
Converse bounds for quantum hypothesis exclusion: A divergence-radius approach
Kaiyuan Ji, Hemant K. Mishra, Milán Mosonyi et al.
Hypothesis exclusion is an information-theoretic task in which an experimenter aims at ruling out a false hypothesis from a finite set of known candidates, and an error occurs if and only if the hypothesis being ruled out is the ground truth. For the tasks of quantum state exclusion and quantum channel exclusion -- where hypotheses are represented by quantum states and quantum channels, respectively -- efficiently computable upper bounds on the asymptotic error exponents were established in a recent work of the current authors [Ji et al., arXiv:2407.13728 (2024)], where the derivation was based on nonasymptotic analysis. In this companion paper of our previous work, we provide alternative proofs for the same upper bounds on the asymptotic error exponents of quantum state and channel exclusion, but using a conceptually different approach from the one adopted in the previous work. Specifically, we apply strong converse results for asymmetric binary hypothesis testing to distinguishing an arbitrary "dummy" hypothesis from each of the concerned candidates. This leads to the desired upper bounds in terms of divergence radii via a geometrically inspired argument.
CL | Jun 2, 2025 | Code
Human-Centric Evaluation for Foundation Models
Yijin Guo, Kaiyuan Ji, Xiaorong Zhu et al.
Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is https://github.com/yijinguo/Human-Centric-Evaluation.
LG | Mar 2
SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
Xiangyang Zhu, Yuan Tian, Qi Jia et al.
The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address these problems, we introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement. SafeSciBench distinguishes between safety knowledge and risk to cover extensive scopes and employs objective metrics such as deterministically answerable questions to mitigate evaluation bias. We evaluate 24 advanced LLMs, revealing critical vulnerabilities in current models. We also observe that LLMs exhibit varying degrees of excessive refusal behaviors on safety-related issues. For safety enhancement, we demonstrate that fine-tuning on SafeSciTrain significantly enhances the safety alignment of models. Finally, we argue that knowledge is a double-edged sword, and determining the safety of a scientific question should depend on specific context, rather than universally categorizing it as safe or unsafe. Our work provides both a diagnostic tool and a practical resource for building safer scientific AI systems.
CV | Mar 11, 2025 | Code
Towards All-in-One Medical Image Re-Identification
Yuan Tian, Kaiyuan Ji, Rongzhao Zhang et al.
Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models, in terms of the differential features. Compared to single-image features, modeling inter-image differences better fits the re-identification problem, which involves discriminating multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique to two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Code and model are available at https://github.com/tianyuan168326/All-in-One-MedReID-Pytorch.
QUANT-PH | Apr 15
Retrocausal capacity of a quantum channel
Kaiyuan Ji, Seth Lloyd, Mark M. Wilde
We study the capacity of a quantum channel for retrocausal communication, where messages are transmitted backward in time, from a sender in the future to a receiver in the past, through a noisy postselected closed timelike curve (P-CTC) mathematically represented by the channel. We completely characterize the one-shot retrocausal quantum and classical capacities, and we show that the corresponding asymptotic capacities are equal to the average and sum, respectively, of the channel's max-information and its regularized Doeblin information. This endows these information measures with a novel operational interpretation. Furthermore, our characterization can be generalized beyond quantum channels to all completely positive maps. This imposes information-theoretic limits on transmitting messages via postselected-teleportation-like mechanisms with arbitrary initial- and final-state boundary conditions, including those considered in various black-hole final-state models.
CV | Aug 22, 2025 | Code
MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine
Kaiyuan Ji, Yijin Guo, Zicheng Zhang et al.
With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness -- whether reasoning aligns with responses and medical facts -- and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45°, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics -- Accuracy, CoT-Faithfulness, and Anti-Sycophancy -- are combined into a composite score visualized with a 45° plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81°), balancing safety and accuracy but not leading in both. MedOmni-45° thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
QUANT-PH | Sep 24, 2025
Entropic and operational characterizations of dynamic quantum resources
Kaiyuan Ji, Eric Chitambar
We offer new methods for characterizing general closed and convex quantum resource theories, including dynamic ones, based on entropic concepts and operational tasks. We propose a resource-theoretic generalization of the quantum conditional min-entropy, termed the free conditional min-entropy (FCME), in the sense that it quantifies an observer's "subjective" degree of uncertainty about a quantum system given that the observer's information processing is limited to free operations of the resource theory. Using this generalized concept, we provide a complete set of entropic conditions for free convertibility between quantum states or channels in any closed and convex quantum resource theory. We also derive an information-theoretic interpretation for the resource global robustness of a state or a channel in terms of a mutual-information-like quantity based on the FCME. Apart from this entropic approach, we characterize dynamic resources by also analyzing their performance in operational tasks. We construct operationally meaningful and complete sets of resource monotones with these tasks, which enable faithful tests of free convertibility between quantum channels. Finally, we show that every well-defined robustness-based measure of a channel can be interpreted as an operational advantage of the channel over free channels in a communication task.