Qianyi Wang

CL
h-index11
3papers
1citation
Novelty42%
AI Score41

3 Papers

CLMay 26
Self-Ensembling Vision-Language Models for Chart Data Extraction

Thomas Berkane, Qianyi Wang, Maimuna S. Majumder

Charts effectively convey quantitative information, but the underlying data are often locked in image form, hindering reuse and analysis. Manually digitizing charts is time-consuming and error-prone, motivating automatic chart-to-table extraction. Recent approaches use specialized vision-language models (VLMs), yet performance still lags on charts with many datapoints or substantial stylistic variation. We propose a VLM self-ensembling method that repeatedly samples multiple tabular outputs from the same VLM for a fixed chart image and aggregates them at the level of individual table cells. We align candidate tables and take per-cell medians over numerical values to produce a more accurate consensus table. Our method also includes convergence detection to stop sampling once the aggregated table stabilizes, and uncertainty estimation based on dispersion across samples to help users assess extraction reliability. Because existing chart extraction benchmarks contain relatively simple plots with limited room for improvement, we introduce WB-ChartExtract, a new benchmark built from World Bank data with more complex and stylistically diverse charts; on average, its charts contain 7 times more datapoints than those in the ChartQA benchmark. Across both ChartQA and WB-ChartExtract, our approach improves extraction accuracy over single-pass VLM outputs, yielding up to 23% relative improvement on WB-ChartExtract after ensembling. More broadly, our method helps unlock tabular data previously siloed in chart images, enabling downstream analysis and reuse.

CVNov 21, 2025
A lightweight detector for real-time detection of remote sensing images

Qianyi Wang, Guoqiang Ren

Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.

CYJul 25, 2025
PEMUTA: Pedagogically-Enriched Multi-Granular Undergraduate Thesis Assessment

Jialu Zhang, Qingyang Sun, Qianyi Wang et al.

The undergraduate thesis (UGTE) plays an indispensable role in assessing a student's cumulative academic development throughout their college years. Although large language models (LLMs) have advanced education intelligence, they typically focus on holistic assessment with only one single evaluation score, but ignore the intricate nuances across multifaceted criteria, limiting their ability to reflect structural criteria, pedagogical objectives, and diverse academic competencies. Meanwhile, pedagogical theories have long informed manual UGTE evaluation through multi-dimensional assessment of cognitive development, disciplinary thinking, and academic performance, yet remain underutilized in automated settings. Motivated by the research gap, we pioneer PEMUTA, a pedagogically-enriched framework that effectively activates domain-specific knowledge from LLMs for multi-granular UGTE assessment. Guided by Vygotsky's theory and Bloom's Taxonomy, PEMUTA incorporates a hierarchical prompting scheme that evaluates UGTEs across six fine-grained dimensions: Structure, Logic, Originality, Writing, Proficiency, and Rigor (SLOWPR), followed by holistic synthesis. Two in-context learning techniques, \ie, few-shot prompting and role-play prompting, are also incorporated to further enhance alignment with expert judgments without fine-tuning. We curate a dataset of authentic UGTEs with expert-provided SLOWPR-aligned annotations to support multi-granular UGTE assessment. Extensive experiments demonstrate that PEMUTA achieves strong alignment with expert evaluations, and exhibits strong potential for fine-grained, pedagogically-informed UGTE evaluations.