Yuxuan Qin

AI
h-index: 9
Papers: 4
Citations: 5
Novelty: 61%
AI Score: 51

4 Papers

CL | Dec 17, 2025 | Code
Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams

Yiming Cui, Xin Yao, Yuxuan Qin et al.

Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion; in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.

95.7 | AI | May 4 | Code
AcademiClaw: When Students Set Challenges for AI Agents

Junjie Yu, Pengrui Lu, Weiye Si et al.

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.

22.4 | CR | Apr 24
Information-Theoretic Authenticated PIR: From PIR-RV To APIR

Pengzhen Ke, Yuxuan Qin, Liang Feng Zhang

Private Information Retrieval (PIR) allows clients to retrieve database entries without leaking retrieval indices, yet malicious servers seriously compromise retrieval correctness. Existing Authenticated PIR (APIR) schemes resist selective-failure attacks but rely on computational hardness assumptions. In contrast, information-theoretic PIR with Result Verification (itPIR-RV) achieves integrity without computational assumptions, yet only provides relaxed query privacy with no defense against selective-failure attacks. This paper focuses on unconditionally secure information-theoretic APIR (itAPIR) constructions. We propose a rigorous information-theoretic security definition for itAPIR with statistical privacy against selective-failure attacks and integrity as core properties, formalize the hierarchical relation between itAPIR and itPIR-RV as a relaxed variant with identical integrity but basic query privacy, and prove a conversion theorem that valid itPIR-RV schemes can be directly upgraded to secure itAPIR with no extra overhead. Our work bridges the theoretical gap, simplifies itAPIR design, and enables quantum-resistant PIR in malicious server environments.

DC | Jan 17, 2022
Efficient Data-Plane Memory Scheduling for In-Network Aggregation

Hao Wang, Yuxuan Qin, ChonLam Lao et al.

As the scale of distributed training grows, communication becomes a bottleneck. To accelerate communication, recent works introduce In-Network Aggregation (INA), which moves gradient summation into network middle-boxes, e.g., programmable switches, to reduce the traffic volume. However, switch memory is scarce compared to the volume of gradients transmitted in distributed training. Although the literature applies methods like pool-based streaming or dynamic sharing to tackle the mismatch, switch memory is still a potential performance bottleneck. Furthermore, we observe the under-utilization of switch memory due to the synchronization requirement for aggregator deallocation in recent works. To improve switch memory utilization, we propose ESA, an Efficient Switch Memory Scheduler for In-Network Aggregation. At its core, ESA enforces a preemptive aggregator allocation primitive and introduces priority scheduling in the data plane, which improves switch memory utilization and average job completion time (JCT). Experiments show that ESA can improve the average JCT by up to 1.35×.