Hao Liang

CV
h-index8
19papers
250citations
Novelty50%
AI Score59

19 Papers

4.2CLSep 30, 2024Code
QAEncoder: Towards Aligned Representation Learning in Question Answering Systems

Zhengren Wang, Qinhan Yu, Shida Wei et al.

Modern QA systems entail retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. We introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to effectively distinguish these embeddings. Extensive experiments across diverse datasets, languages, and embedding models confirmed QAEncoder's alignment capability, which offers a simple-yet-effective solution with zero additional index storage, retrieval latency, training costs, or catastrophic forgetting and hallucination issues. The repository is publicly available at https://github.com/IAAR-Shanghai/QAEncoder.

12.1CVAug 14, 2024Code
MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts

Hao Liang, Linzhuang Sun, Minxuan Zhou et al.

With the rapid progress of Multimodal LLMs, evaluating their mathematical reasoning capabilities has become an increasingly important research direction. In particular, visual-textual mathematical reasoning serves as a key indicator of an MLLM's ability to comprehend and solve complex, multi-step quantitative problems. While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they primarily rely on digitally rendered content and fall short in capturing the complexity of real-world scenarios. To bridge this gap, we introduce MathScape, a novel benchmark focused on assessing MLLMs' reasoning ability in realistic mathematical contexts. MathScape comprises 1,369 high-quality math problems paired with human-captured real-world images, closely reflecting the challenges encountered in practical educational settings. We conduct a thorough multi-dimensional evaluation across nine leading closed-source MLLMs, three open-source MLLMs with over 20 billion parameters, and seven smaller-scale MLLMs. Our results show that even state-of-the-art models struggle with real-world math tasks, lagging behind human performance, highlighting critical limitations in current model capabilities. Moreover, we find that strong performance on synthetic or digitally rendered images does not guarantee similar effectiveness on real-world tasks. This underscores the necessity of MathScape in the next stage of multimodal mathematical reasoning.

23.8CVJul 3, 2024
KeyVideoLLM: Towards Large-scale Video Keyframe Selection

Hao Liang, Jiapeng Li, Tianyi Bai et al.

Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.

7.7CLSep 26, 2024Code
BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search

Linzhuang Sun, Hao Liang, Jingxuan Wei et al.

Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains. However, they still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics. Previous studies have employed techniques such as supervised fine-tuning (SFT), prompt engineering, and search-based methods to improve the mathematical problem-solving abilities of LLMs. Despite these efforts, their performance remains suboptimal and demands substantial computational resources. To address this issue, we propose a novel approach, BEATS, to enhance mathematical problem-solving abilities. Our method leverages newly designed prompts that guide the model to iteratively rewrite, advance by one step, and generate answers based on previous steps. Additionally, we introduce a new back-verification technique that uses LLMs to validate the correctness of the generated answers. Furthermore, we employ a pruning tree search to optimize search time while achieving strong performance. Notably, our method improves Qwen2-7b-Instruct's score from 36.94 to 61.52, outperforming GPT4's 42.5 on the MATH benchmark.

14.1CVNov 18, 2024Code
MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Ruichuan An, Sihan Yang, Ming Lu et al.

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.

11.3CVAug 1, 2024
Are Bigger Encoders Always Better in Vision Large Models?

Bozhou Li, Hao Liang, Zimo Meng et al.

In recent years, multimodal large language models (MLLMs) have shown strong potential in real-world applications. They are developing rapidly due to their remarkable ability to comprehend multimodal information and their inherent powerful cognitive and reasoning capabilities. Among MLLMs, vision language models (VLM) stand out for their ability to understand vision information. However, the scaling trend of VLMs under the current mainstream paradigm has not been extensively studied. Whether we can achieve better performance by training even larger models is still unclear. To address this issue, we conducted experiments on the pretraining stage of MLLMs. We conduct our experiment using different encoder sizes and large language model (LLM) sizes. Our findings indicate that merely increasing the size of encoders does not necessarily enhance the performance of VLMs. Moreover, we analyzed the effects of LLM backbone parameter size and data quality on the pretraining outcomes. Additionally, we explored the differences in scaling laws between LLMs and VLMs.

1.0CLSep 26, 2024Code
Data Proportion Detection for Optimized Data Management for Large Language Models

Hao Liang, Keshi Zhao, Yajie Yang et al.

Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks and domains, with data preparation playing a critical role in achieving these results. Pre-training data typically combines information from multiple domains. To maximize performance when integrating data from various domains, determining the optimal data proportion is essential. However, state-of-the-art (SOTA) LLMs rarely disclose details about their pre-training data, making it difficult for researchers to identify ideal data proportions. In this paper, we introduce a new topic, \textit{data proportion detection}, which enables the automatic estimation of pre-training data proportions by analyzing the generated outputs of LLMs. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection. Based on these findings, we offer valuable insights into the challenges and future directions for effective data proportion detection and data management.

19.4CLMay 20, 2025Code
Let's Verify Math Questions Step by Step

Chengyu Shen, Zhen Hao Wong, Runming He et al.

Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at https://github.com/scuuy/MathQ-Verify.

16.4CVDec 14, 2025Code
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Yuran Wang, Bohan Zeng, Chengzhuo Tong et al.

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

3.3AINov 20, 2025Code
FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks

Zhen Hao Wong, Jingwen Deng, Hao Liang et al.

The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer(QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.

8.4CVOct 20, 2025Code
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

ZhaoYang Han, Qihan Lin, Hao Liang et al.

We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at https://anonymous.4open.science/r/LongInsightBench-910F/.

16.4CVJun 15, 2025Code
Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models

Junbo Niu, Yuanhong Zheng, Ziyang Miao et al.

Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.

6.3IVJun 13, 2024Code
Hybrid Spatial-spectral Neural Network for Hyperspectral Image Denoising

Hao Liang, Chengjie, Kun Li et al.

Hyperspectral image (HSI) denoising is an essential procedure for HSI applications. Unfortunately, the existing Transformer-based methods mainly focus on non-local modeling, neglecting the importance of locality in image denoising. Moreover, deep learning methods employ complex spectral learning mechanisms, thus introducing large computation costs. To address these problems, we propose a hybrid spatial-spectral denoising network (HSSD), in which we design a novel hybrid dual-path network inspired by CNN and Transformer characteristics, leading to capturing both local and non-local spatial details while suppressing noise efficiently. Furthermore, to reduce computational complexity, we adopt a simple but effective decoupling strategy that disentangles the learning of space and spectral channels, where multilayer perception with few parameters is utilized to learn the global correlations among spectra. The synthetic and real experiments demonstrate that our proposed method outperforms state-of-the-art methods on spatial and spectral reconstruction. The code and details are available on https://github.com/HLImg/HSSD.

6.2CVMar 2, 2025Code
Evaluating and Predicting Distorted Human Body Parts for Generated Images

Lu Ma, Kaibo Cao, Hao Liang et al.

Recent advancements in text-to-image (T2I) models enable high-quality image synthesis, yet generating anatomically accurate human figures remains challenging. AI-generated images frequently exhibit distortions such as proliferated limbs, missing fingers, deformed extremities, or fused body parts. Existing evaluation metrics like Inception Score (IS) and Fréchet Inception Distance (FID) lack the granularity to detect these distortions, while human preference-based metrics focus on abstract quality assessments rather than anatomical fidelity. To address this gap, we establish the first standards for identifying human body distortions in AI-generated images and introduce Distortion-5K, a comprehensive dataset comprising 4,700 annotated images of normal and malformed human figures across diverse styles and distortion types. Based on this dataset, we propose ViT-HD, a Vision Transformer-based model tailored for detecting human body distortions in AI-generated images, which outperforms state-of-the-art segmentation models and visual language models, achieving an F1 score of 0.899 and IoU of 0.831 on distortion localization. Additionally, we construct the Human Distortion Benchmark with 500 human-centric prompts to evaluate four popular T2I models using trained ViT-HD, revealing that nearly 50\% of generated images contain distortions. This work pioneers a systematic approach to evaluating anatomical accuracy in AI-generated humans, offering tools to advance the fidelity of T2I models and their real-world applicability. The Distortion-5K dataset, trained ViT-HD will soon be released in our GitHub repository: \href{https://github.com/TheRoadQaQ/Predicting-Distortion}{https://github.com/TheRoadQaQ/Predicting-Distortion}.

28.2CLFeb 19, 2025Code
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification

Linzhuang Sun, Hao Liang, Jingxuan Wei et al.

According to the Test-Time Scaling, the integration of External Slow-Thinking with the Verify mechanism has been demonstrated to enhance multi-round reasoning in large language models (LLMs). However, in the multimodal (MM) domain, there is still a lack of a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (COT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista, surpassing GPT-4o (63.8) with 12 rollouts.

16.2CLOct 16, 2024Code
Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

Mingyang Chen, Haoze Sun, Tianpeng Li et al.

Large Language Models (LLMs) have exhibited significant potential in performing diverse tasks, including the ability to call functions or use external tools to enhance their performance. While current research on function calling by LLMs primarily focuses on single-turn interactions, this paper addresses the overlooked necessity for LLMs to engage in multi-turn function calling--critical for handling compositional, real-world queries that require planning with functions but not only use functions. To facilitate this, we introduce an approach, BUTTON, which generates synthetic compositional instruction tuning data via bottom-up instruction construction and top-down trajectory generation. In the bottom-up phase, we generate simple atomic tasks based on real-world scenarios and build compositional tasks using heuristic strategies based on atomic tasks. Corresponding function definitions are then synthesized for these compositional tasks. The top-down phase features a multi-agent environment where interactions among simulated humans, assistants, and tools are utilized to gather multi-turn function calling trajectories. This approach ensures task compositionality and allows for effective function and trajectory generation by examining atomic tasks within compositional tasks. We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs.

17.4CVMay 19, 2025
Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning

Mingrui Chen, Haogeng Liu, Hao Liang et al.

In this work, we investigate how explicitly modeling problem's difficulty prior information shapes the effectiveness of reinforcement learning based fine-tuning for multimodal reasoning. Our exploration mainly comprises of following three perspective: First, through offline data curation, we analyze the U-shaped difficulty distribution of two given datasets using the base model by multi-round sampling, and then filter out prompts that are either too simple or extremely difficult to provide meaningful gradients and perform subsequent two-stage training. Second, we implement an online advantage differentiation, computing group-wise empirical accuracy as a difficulty proxy to adaptively reweight advantages estimation, providing stronger learning signals for more challenging problems. Finally, we introduce difficulty hints as explicit prompts for more complex samples in the second training stage, encouraging the model to calibrate its reasoning depth and perform reflective validation checks. Our comprehensive approach demonstrates significant performances across various multi-modal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.

7.8AIOct 26, 2025
Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval

Binxiao Xu, Junyu Feng, Shaolin Lu et al.

The rapid development of Vision-language models (VLMs) enables open-ended perception and reasoning. Recent works have started to investigate how to adapt general-purpose VLMs into personalized assistants. Even commercial models such as ChatGPT now support model personalization by incorporating user-specific information. However, existing methods either learn a set of concept tokens or train a VLM to utilize user-specific information. However, both pipelines struggle to generate accurate answers as personalized assistants. We introduce Jarvis, an innovative framework for a personalized AI assistant through personal KV-Cache retrieval, which stores user-specific information in the KV-Caches of both textual and visual tokens. The textual tokens are created by summarizing user information into metadata, while the visual tokens are produced by extracting distinct image patches from the user's images. When answering a question, Jarvis first retrieves related KV-Caches from personal storage and uses them to ensure accuracy in responses. We also introduce a fine-grained benchmark built with the same distinct image patch mining pipeline, emphasizing accurate question answering based on fine-grained user-specific information. Jarvis is capable of providing more accurate responses, particularly when they depend on specific local details. Jarvis achieves state-of-the-art results in both visual question answering and text-only tasks across multiple datasets, indicating a practical path toward personalized AI assistants. The code and dataset will be released.

8.4CVMar 17, 2025
Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs

Ruichuan An, Kai Zeng, Ming Lu et al.

Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for existing techniques. To reveal the relationship between sample and model performance, we systematically investigate the amount and diversity impact of positive and negative samples (easy and hard) on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity, and can be easily extended to multi-concept scenarios. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the capabilities of VLMs across personalization benchmarks. To the best of our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code will be released.