Jie Yao

CL
h-index54
10papers
478citations
Novelty52%
AI Score54

10 Papers

CLJun 15, 2023Code
Learning by Analogy: Diverse Questions Generation in Math Word Problem

Zihao Zhou, Maizhen Ning, Qiufeng Wang et al.

Solving math word problem (MWP) with AI techniques has recently made great progress with the success of deep neural networks (DNN), but it is far from being solved. We argue that the ability of learning by analogy is essential for an MWP solver to better understand same problems which may typically be formulated in diverse ways. However most existing works exploit the shortcut learning to train MWP solvers simply based on samples with a single question. In lack of diverse questions, these methods merely learn shallow heuristics. In this paper, we make a first attempt to solve MWPs by generating diverse yet consistent questions/equations. Given a typical MWP including the scenario description, question, and equation (i.e., answer), we first generate multiple consistent equations via a group of heuristic rules. We then feed them to a question generator together with the scenario to obtain the corresponding diverse questions, forming a new MWP with a variety of questions and equations. Finally we engage a data filter to remove those unreasonable MWPs, keeping the high-quality augmented ones. To evaluate the ability of learning by analogy for an MWP solver, we generate a new MWP dataset (called DiverseMath23K) with diverse questions by extending the current benchmark Math23K. Extensive experimental results demonstrate that our proposed method can generate high-quality diverse questions with corresponding equations, further leading to performance improvement on Diverse-Math23K. The code and dataset is available at: https://github.com/zhouzihao501/DiverseMWP

CLSep 4, 2023
MathAttack: Attacking Large Language Models Towards Math Solving Ability

Zihao Zhou, Qiufeng Wang, Mingyu Jin et al.

With the boom of Large Language Models (LLMs), the research of solving Math Word Problem (MWP) has recently made great progress. However, there are few studies to examine the security of LLMs in math solving ability. Instead of attacking prompts in the use of LLMs, we propose a MathAttack model to attack MWP samples which are closer to the essence of security in solving math problems. Compared to traditional text adversarial attack, it is essential to preserve the mathematical logic of original MWPs during the attacking. To this end, we propose logical entity recognition to identify logical entries which are then frozen. Subsequently, the remaining text are attacked by adopting a word-level attacker. Furthermore, we propose a new dataset RobustMath to evaluate the robustness of LLMs in math solving ability. Extensive experiments on our RobustMath and two another math benchmark datasets GSM8K and MultiAirth show that MathAttack could effectively attack the math solving ability of LLMs. In the experiments, we observe that (1) Our adversarial samples from higher-accuracy LLMs are also effective for attacking LLMs with lower accuracy (e.g., transfer from larger to smaller-size LLMs, or from few-shot to zero-shot prompts); (2) Complex MWPs (such as more solving steps, longer text, more numbers) are more vulnerable to attack; (3) We can improve the robustness of LLMs by using our adversarial samples in few-shot prompts. Finally, we hope our practice and observation can serve as an important attempt towards enhancing the robustness of LLMs in math solving ability. We will release our code and dataset.

CVNov 16, 2022
Learnable Graph Convolutional Network and Feature Fusion for Multi-view Learning

Zhaoliang Chen, Lele Fu, Jie Yao et al.

In practical applications, multi-view data depicting objectives from assorted perspectives can facilitate the accuracy increase of learning algorithms. However, given multi-view data, there is limited work for learning discriminative node relationships and graph information simultaneously via graph convolutional network that has drawn the attention from considerable researchers in recent years. Most of existing methods only consider the weighted sum of adjacency matrices, yet a joint neural network of both feature and graph fusion is still under-explored. To cope with these issues, this paper proposes a joint deep learning framework called Learnable Graph Convolutional Network and Feature Fusion (LGCN-FF), consisting of two stages: feature fusion network and learnable graph convolutional network. The former aims to learn an underlying feature representation from heterogeneous views, while the latter explores a more discriminative graph fusion via learnable weights and a parametric activation function dubbed Differentiable Shrinkage Activation (DSA) function. The proposed LGCN-FF is validated to be superior to various state-of-the-art methods in multi-view semi-supervised classification.

CLAug 26, 2023Code
Solving Math Word Problem with Problem Type Classification

Jie Yao, Zihao Zhou, Qiufeng Wang

Math word problems (MWPs) require analyzing text descriptions and generating mathematical equations to derive solutions. Existing works focus on solving MWPs with two types of solvers: tree-based solver and large language model (LLM) solver. However, these approaches always solve MWPs by a single solver, which will bring the following problems: (1) Single type of solver is hard to solve all types of MWPs well. (2) A single solver will result in poor performance due to over-fitting. To address these challenges, this paper utilizes multiple ensemble approaches to improve MWP-solving ability. Firstly, We propose a problem type classifier that combines the strengths of the tree-based solver and the LLM solver. This ensemble approach leverages their respective advantages and broadens the range of MWPs that can be solved. Furthermore, we also apply ensemble techniques to both tree-based solver and LLM solver to improve their performance. For the tree-based solver, we propose an ensemble learning framework based on ten-fold cross-validation and voting mechanism. In the LLM solver, we adopt self-consistency (SC) method to improve answer selection. Experimental results demonstrate the effectiveness of these ensemble approaches in enhancing MWP-solving ability. The comprehensive evaluation showcases improved performance, validating the advantages of our proposed approach. Our code is available at this url: https://github.com/zhouzihao501/NLPCC2023-Shared-Task3-ChineseMWP.

AIJul 20, 2023
PPN: Parallel Pointer-based Network for Key Information Extraction with Complex Layouts

Kaiwen Wei, Jie Yao, Jingyuan Zhang et al.

Key Information Extraction (KIE) is a challenging multimodal task that aims to extract structured value semantic entities from visually rich documents. Although significant progress has been made, there are still two major challenges that need to be addressed. Firstly, the layout of existing datasets is relatively fixed and limited in the number of semantic entity categories, creating a significant gap between these datasets and the complex real-world scenarios. Secondly, existing methods follow a two-stage pipeline strategy, which may lead to the error propagation problem. Additionally, they are difficult to apply in situations where unseen semantic entity categories emerge. To address the first challenge, we propose a new large-scale human-annotated dataset named Complex Layout form for key information EXtraction (CLEX), which consists of 5,860 images with 1,162 semantic entity categories. To solve the second challenge, we introduce Parallel Pointer-based Network (PPN), an end-to-end model that can be applied in zero-shot and few-shot scenarios. PPN leverages the implicit clues between semantic entities to assist extracting, and its parallel extraction mechanism allows it to extract multiple results simultaneously and efficiently. Experiments on the CLEX dataset demonstrate that PPN outperforms existing state-of-the-art methods while also offering a much faster inference speed.

FLU-DYNMay 18
Long-horizon prediction of three-dimensional wall-bounded turbulence with CTA-Swin-UNet and resolvent analysis

Bo Chen, Yitong Fan, Jie Yao et al.

Long-horizon prediction of three-dimensional (3D) wall-bounded turbulence with machine-learning methods remains a challenging task, due to the rapid accumulation of autoregressive errors and the substantially computational cost. To address these challenges, we present a hybrid machine-learning framework, in which a channel-time-attention Swin-UNet (CTA-Swin-UNet) and a multi-time-scale fusion correction (MTFC) strategy are developed to predict the turbulent flow fields in a wall-parallel plane, with affordable computational cost. Then, 3D flow fields are reconstructed via a resolvent-based spectral linear stochastic estimation (SLSE), rooting from the predicted planar flow. Results show that the CTA-Swin-UNet outperforms the baseline models (LSTM, FNO and traditional Swin-UNet) in both single-step prediction and autoregressive rollouts, indicating the effectiveness of introducing the CTA module into the Swin-UNet architecture. At the same temporal interval, the CTA-Swin-UNet remains stable for approximately 150 rollout steps, while the baseline models fail within 20 to 50 rollout steps. After introducing the MTFC strategy, a longer horizon upto 300 steps is achieved. Using the resolvent-based SLSE reconstruction further recovers the 3D flow structures and energy spectral distributions from the predicted planar inputs, which demonstrates that the proposed framework provides an effective and computationally efficient approach for long-horizon autoregressive prediction of 3D wall-bounded turbulence.

AIFeb 5
Clinical Validation of Medical-based Large Language Model Chatbots on Ophthalmic Patient Queries with LLM-based Evaluation

Ting Fang Tan, Kabilan Elangovan, Andreas Pollreisz et al.

Domain specific large language models are increasingly used to support patient education, triage, and clinical decision making in ophthalmology, making rigorous evaluation essential to ensure safety and accuracy. This study evaluated four small medical LLMs Meerkat-7B, BioMistral-7B, OpenBioLLM-8B, and MedLLaMA3-v20 in answering ophthalmology related patient queries and assessed the feasibility of LLM based evaluation against clinician grading. In this cross sectional study, 180 ophthalmology patient queries were answered by each model, generating 2160 responses. Models were selected for parameter sizes under 10 billion to enable resource efficient deployment. Responses were evaluated by three ophthalmologists of differing seniority and by GPT-4-Turbo using the S.C.O.R.E. framework assessing safety, consensus and context, objectivity, reproducibility, and explainability, with ratings assigned on a five point Likert scale. Agreement between LLM and clinician grading was assessed using Spearman rank correlation, Kendall tau statistics, and kernel density estimate analyses. Meerkat-7B achieved the highest performance with mean scores of 3.44 from Senior Consultants, 4.08 from Consultants, and 4.18 from Residents. MedLLaMA3-v20 performed poorest, with 25.5 percent of responses containing hallucinations or clinically misleading content, including fabricated terminology. GPT-4-Turbo grading showed strong alignment with clinician assessments overall, with Spearman rho of 0.80 and Kendall tau of 0.67, though Senior Consultants graded more conservatively. Overall, medical LLMs demonstrated potential for safe ophthalmic question answering, but gaps remained in clinical depth and consensus, supporting the feasibility of LLM based evaluation for large scale benchmarking and the need for hybrid automated and clinician review frameworks to guide safe clinical deployment.

MMDec 25, 2024
Adaptive Rate Control for Deep Video Compression with Rate-Distortion Prediction

Bowen Gu, Hao Chen, Ming Lu et al.

Deep video compression has made significant progress in recent years, achieving rate-distortion performance that surpasses that of traditional video compression methods. However, rate control schemes tailored for deep video compression have not been well studied. In this paper, we propose a neural network-based $λ$-domain rate control scheme for deep video compression, which determines the coding parameter $λ$ for each to-be-coded frame based on the rate-distortion-$λ$ (R-D-$λ$) relationships directly learned from uncompressed frames, achieving high rate control accuracy efficiently without the need for pre-encoding. Moreover, this content-aware scheme is able to mitigate inter-frame quality fluctuations and adapt to abrupt changes in video content. Specifically, we introduce two neural network-based predictors to estimate the relationship between bitrate and $λ$, as well as the relationship between distortion and $λ$ for each frame. Then we determine the coding parameter $λ$ for each frame to achieve the target bitrate. Experimental results demonstrate that our approach achieves high rate control accuracy at the mini-GOP level with low time overhead and mitigates inter-frame quality fluctuations across video content of varying resolutions.

CVSep 29, 2025
EVLF-FM: Explainable Vision Language Foundation Model for Medicine

Yang Bai, Haoran Cheng, Yang Zhou et al.

Despite the promise of foundation models in medical AI, current systems remain limited - they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grain explainability. The development and testing of EVLF-FM encompassed over 1.3 million total samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also achieved stellar performance across nine modalities with average mIOU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM model with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.

IVJun 30, 2025
Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou et al.

Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.