70.9 CL May 29
KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning
Dominik Soós, Meng Jiang, Jian Wu
Science news is an important medium for communicating discoveries from research communities to the public. Yet most metrics for generated or summarized text evaluate semantic similarity and factual consistency, but do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gain after reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric successfully captures the differential knowledge gained by human readers reading different types of science media. The data allowed us to calibrate a prompt-only LLM reader simulator, which we use to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.
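The abstract does not give the formula behind "normalized KnowledgeGain". As a rough illustration only, the sketch below assumes a Hake-style normalized gain, (post - pre) / (1 - pre), computed from pre- and post-reading quiz accuracy, and uses it to rank candidate articles the way a reader simulator's scores might be used. The function names and the scoring formula are assumptions, not the authors' definitions.

```python
def normalized_gain(pre_acc: float, post_acc: float) -> float:
    """Assumed Hake-style normalized gain: share of the available headroom that was gained.

    This formula is an illustrative assumption, not the paper's stated definition.
    """
    if not (0.0 <= pre_acc <= 1.0 and 0.0 <= post_acc <= 1.0):
        raise ValueError("accuracies must lie in [0, 1]")
    headroom = 1.0 - pre_acc
    if headroom == 0.0:
        return 0.0  # reader already answered everything correctly; nothing left to gain
    return (post_acc - pre_acc) / headroom


def rank_candidates(post_acc_by_article: dict, pre_acc: float) -> list:
    """Order hypothetical candidate articles from highest to lowest normalized gain."""
    return sorted(
        post_acc_by_article,
        key=lambda name: normalized_gain(pre_acc, post_acc_by_article[name]),
        reverse=True,
    )
```

Under this assumed definition, a reader who moves from 50% to 75% accuracy has closed half of the remaining gap, so the gain is 0.5 regardless of how many questions were asked.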
AI Feb 11 Code
ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
Bang Nguyen, Dominik Soós, Qian Ma et al.
The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when given access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non-replicable research. Furthermore, most benchmarks evaluate only outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark that includes human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in the real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with the necessary tools, such as web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm-benchmarking.
23.8 QUANT-PH Apr 22
Distributed Quantum-Enhanced Optimization: A Topographical Preconditioning Approach for High-Dimensional Search
Dominik Soós, Marc Paterno, John Stenger et al.
Optimization problems become fundamentally challenging as the number of variables increases. Because the volume of the search space grows exponentially, classical algorithms frequently fail to locate the global minimum of non-convex functions. While quantum optimization offers a potential alternative, mapping continuous problems onto near-term quantum hardware introduces severe scaling limits and barren plateaus. To bridge this gap, we propose the Distributed Quantum-Enhanced Optimization (D-QEO) framework. Instead of forcing the quantum processor to find the exact minimum, we use it simply as a topographical preconditioner. The QPU maps the landscape to locate the most promising basin of attraction, generating high-quality seed points for a classical GPU-accelerated solver to refine. To make this approach viable for utility-scale problems, we exploit the mathematical structure of separable functions. This allows us to cut a 50-qubit (i.e., $2^{50}$) global search space into independent and manageable sub-spaces using 5-qubit subcircuits. By executing these fragments concurrently with CUDA-Q, we completely bypass the overhead of cross-register entanglement and classical tensor knitting for separable functions. Benchmarks on the 10-dimensional Rastrigin and Ackley functions show that D-QEO prevents the exponential failure rates observed in purely classical algorithms. Furthermore, this quantum warm-start significantly reduces the number of classical BFGS iterations required to converge, providing a highly practical blueprint for utilizing near-term quantum resources in complex global search.
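The separability argument can be made concrete with a purely classical stand-in. The sketch below assumes that each 5-qubit fragment corresponds to 32 candidate values of one coordinate on the usual Rastrigin domain [-5.12, 5.12]; it replaces the quantum preconditioner with an exhaustive scan of those 32 values and replaces BFGS with plain gradient descent. None of this is the authors' implementation, and no CUDA-Q calls are shown; the grid encoding, step size, and function names are all assumptions.

```python
import math

LOWER, UPPER = -5.12, 5.12        # conventional Rastrigin domain (assumed here)
QUBITS_PER_DIM = 5                # one 5-qubit fragment per coordinate, as in the abstract
LEVELS = 2 ** QUBITS_PER_DIM      # 32 candidate values per coordinate
DIMS = 10


def rastrigin_term(x: float) -> float:
    """One coordinate's contribution; Rastrigin is a sum of such terms."""
    return x * x - 10.0 * math.cos(2.0 * math.pi * x) + 10.0


def rastrigin(point) -> float:
    return sum(rastrigin_term(x) for x in point)


def coordinate_grid(levels: int = LEVELS):
    """Evenly spaced candidate values (an assumed encoding of the 5-qubit register)."""
    step = (UPPER - LOWER) / (levels - 1)
    return [LOWER + i * step for i in range(levels)]


def seed_point(dims: int = DIMS):
    """Search each coordinate independently: a classical stand-in for the preconditioner."""
    grid = coordinate_grid()
    return [min(grid, key=rastrigin_term) for _ in range(dims)]


def refine(point, lr: float = 0.002, steps: int = 200):
    """Plain gradient descent, swapped in for BFGS to keep the sketch dependency-free."""
    x = list(point)
    for _ in range(steps):
        x = [xi - lr * (2.0 * xi + 20.0 * math.pi * math.sin(2.0 * math.pi * xi)) for xi in x]
    return x


separable_evaluations = DIMS * LEVELS   # 10 fragments of 32 states each
joint_search_space = LEVELS ** DIMS     # the undecomposed 50-qubit space
seed = seed_point()
refined = refine(seed)
```

Because the objective is a sum of one-dimensional terms, the ten independent scans cost 320 evaluations in total, while the undecomposed grid has 2^50 points. The coarse seed lands in the basin around the origin, and the local refinement then converges to the global minimum from there.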