Zhe Yang

CL
h-index67
45papers
2,489citations
Novelty46%
AI Score59

45 Papers

CLOct 12, 2023Code
Not All Demonstration Examples are Equally Beneficial: Reweighting Demonstration Examples for In-Context Learning

Zhe Yang, Damai Dai, Peiyi Wang et al. · pku

Large Language Models (LLMs) have recently gained the In-Context Learning (ICL) ability with the models scaling up, allowing them to quickly adapt to downstream tasks with only a few demonstration examples prepended in the input sequence. Nonetheless, the current practice of ICL treats all demonstration examples equally, which still warrants improvement, as the quality of examples is usually uneven. In this paper, we investigate how to determine approximately optimal weights for demonstration examples and how to apply them during ICL. To assess the quality of weights in the absence of additional validation data, we design a masked self-prediction (MSP) score that exhibits a strong correlation with the final ICL performance. To expedite the weight-searching process, we discretize the continuous weight space and adopt beam search. With approximately optimal weights obtained, we further propose two strategies to apply them to demonstrations at different model positions. Experimental results on 8 text classification tasks show that our approach outperforms conventional ICL by a large margin. Our code are publicly available at https:github.com/Zhe-Young/WICL.

CLSep 4, 2024
Towards a Unified View of Preference Learning for Large Language Models: A Survey

Bofei Gao, Feifan Song, Yibo Miao et al. · pku

Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.

CVAug 25, 2024Code
SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Wenrui Li, Fucheng Cai, Yapeng Mi et al.

Text-driven 3D scene generation has seen significant advancements recently. However, most existing methods generate single-view images using generative models and then stitch them together in 3D space. This independent generation for each view often results in spatial inconsistency and implausibility in the 3D scenes. To address this challenge, we proposed a novel text-driven 3D-consistent scene generation model: SceneDreamer360. Our proposed method leverages a text-driven panoramic image generation model as a prior for 3D scene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency across multi-view panoramic images. Specifically, SceneDreamer360 enhances the fine-tuned Panfusion generator with a three-stage panoramic enhancement, enabling the generation of high-resolution, detail-rich panoramic images. During the 3D scene construction, a novel point cloud fusion initialization method is used, producing higher quality and spatially consistent point clouds. Our extensive experiments demonstrate that compared to other methods, SceneDreamer360 with its panoramic image generation and 3DGS can produce higher quality, spatially consistent, and visually appealing 3D scenes from any text prompt. Our codes are available at \url{https://github.com/liwrui/SceneDreamer360}.

LGOct 28, 2023
Explainable Modeling for Wind Power Forecasting: A Glass-Box Approach with High Accuracy

Wenlong Liao, Fernando Porte-Agel, Jiannong Fang et al.

Machine learning models (e.g., neural networks) achieve high accuracy in wind power forecasting, but they are usually regarded as black boxes that lack interpretability. To address this issue, the paper proposes a glass-box approach that combines high accuracy with transparency for wind power forecasting. Specifically, the core is to sum up the feature effects by constructing shape functions, which effectively map the intricate non-linear relationships between wind power output and input features. Furthermore, the forecasting model is enriched by incorporating interaction terms that adeptly capture interdependencies and synergies among the input features. The additive nature of the proposed glass-box approach ensures its interpretability. Simulation results show that the proposed glass-box approach effectively interprets the results of wind power forecasting from both global and instance perspectives. Besides, it outperforms most benchmark models and exhibits comparable performance to the best-performing neural networks. This dual strength of transparency and high accuracy positions the proposed glass-box approach as a compelling choice for reliable wind power forecasting.

CLOct 13, 2022
Low-resource Neural Machine Translation with Cross-modal Alignment

Zhe Yang, Qingkai Fang, Yang Feng

How to achieve neural machine translation with limited parallel data? Existing techniques often rely on large-scale monolingual corpora, which is impractical for some low-resource languages. In this paper, we turn to connect several low-resource languages to a particular high-resource one by additional visual modality. Specifically, we propose a cross-modal contrastive learning method to learn a shared space for all languages, where both a coarse-grained sentence-level objective and a fine-grained token-level one are introduced. Experimental results and further analysis show that our method can effectively learn the cross-modal and cross-lingual alignment with a small amount of image-text pairs and achieves significant improvements over the text-only baseline under both zero-shot and few-shot scenarios.

CLJul 3, 2024
FSM: A Finite State Machine Based Zero-Shot Prompting Paradigm for Multi-Hop Question Answering

Xiaochen Wang, Junqing He, Zhe yang et al. · pku

Large Language Models (LLMs) with chain-of-thought (COT) prompting have demonstrated impressive abilities on simple nature language inference tasks. However, they tend to perform poorly on Multi-hop Question Answering (MHQA) tasks due to several challenges, including hallucination, error propagation and limited context length. We propose a prompting method, Finite State Machine (FSM) to enhance the reasoning capabilities of LLM for complex tasks in addition to improved effectiveness and trustworthiness. Different from COT methods, FSM addresses MHQA by iteratively decomposing a question into multi-turn sub-questions, and self-correcting in time, improving the accuracy of answers in each step. Specifically, FSM addresses one sub-question at a time and decides on the next step based on its current result and state, in an automaton-like format. Experiments on benchmarks show the effectiveness of our method. Although our method performs on par with the baseline on relatively simpler datasets, it excels on challenging datasets like Musique. Moreover, this approach mitigates the hallucination phenomenon, wherein the correct final answer can be recovered despite errors in intermediate reasoning. Furthermore, our method improves LLMs' ability to follow specified output format requirements, significantly reducing the difficulty of answer interpretation and the need for reformatting.

LGNov 14, 2025Code
Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement

Zhe Yang, Wenrui Li, Hongtao Chen et al.

Multimodal learning aims to improve performance by leveraging data from multiple sources. During joint multimodal training, due to modality bias, the advantaged modality often dominates backpropagation, leading to imbalanced optimization. Existing methods still face two problems: First, the long-term dominance of the dominant modality weakens representation-output coupling in the late stages of training, resulting in the accumulation of redundant information. Second, previous methods often directly and uniformly adjust the gradients of the advantaged modality, ignoring the semantics and directionality between modalities. To address these limitations, we propose Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement (RedReg), which is inspired by information bottleneck principle. Specifically, we construct a redundancy phase monitor that uses a joint criterion of effective gain growth rate and redundancy to trigger intervention only when redundancy is high. Furthermore, we design a co-information gating mechanism to estimate the contribution of the current dominant modality based on cross-modal semantics. When the task primarily relies on a single modality, the suppression term is automatically disabled to preserve modality-specific information. Finally, we project the gradient of the dominant modality onto the orthogonal complement of the joint multimodal gradient subspace and suppress the gradient according to redundancy. Experiments show that our method demonstrates superiority among current major methods in most scenarios. Ablation experiments verify the effectiveness of our method. The code is available at https://github.com/xia-zhe/RedReg.git

CVFeb 6Code
Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction

Bo Du, Xiaochen Ma, Xuekang Zhu et al.

Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in real-world forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, by discovering the ``heterogeneous phenomenon'', which is the intrinsic distinctness of artifacts across subdomains, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space driven by such phenomenon. The core challenge for developing a practical monolithic FID model thus boils down to the ``unified-yet-discriminative" reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset are available at:https: //github.com/scu-zjz/SICA_OpenMMSec.

AINov 1, 2023
On the Opportunities of Green Computing: A Survey

You Zhou, Xiujing Lin, Xiang Zhang et al.

Artificial Intelligence (AI) has achieved significant advancements in technology and research with the development over several decades, and is widely used in many areas including computing vision, natural language processing, time-series analysis, speech synthesis, etc. During the age of deep learning, especially with the arise of Large Language Models, a large majority of researchers' attention is paid on pursuing new state-of-the-art (SOTA) results, resulting in ever increasing of model size and computational complexity. The needs for high computing power brings higher carbon emission and undermines research fairness by preventing small or medium-sized research institutions and companies with limited funding in participating in research. To tackle the challenges of computing resources and environmental impact of AI, Green Computing has become a hot research topic. In this survey, we give a systematic overview of the technologies used in Green Computing. We propose the framework of Green Computing and devide it into four key components: (1) Measures of Greenness, (2) Energy-Efficient AI, (3) Energy-Efficient Computing Systems and (4) AI Use Cases for Sustainability. For each components, we discuss the research progress made and the commonly used techniques to optimize the AI efficiency. We conclude that this new research direction has the potential to address the conflicts between resource constraints and AI development. We encourage more researchers to put attention on this direction and make AI more environmental friendly.

92.2ROMar 13Code
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots

Guoqiang Zhao, Zhe Yang, Sheng Wu et al.

Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16%} in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at https://github.com/SXDR/PanoMMOcc, along with the calibration tools released at https://github.com/losehu/CameraLiDAR-Calib.

CLFeb 26
Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang, Yudong Wang, Rang Li et al. · pku

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple responses for a question), and we argue that this approach fails to fully exploit the potential of training data because of the neglect of critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training

CVDec 19, 2025
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Rang Li, Lei Li, Shuhuai Ren et al. · pku

Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve complex grounding by up to 2.9%, and (2) data-mixture training teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.

SYNov 7, 2023
An Explainable Framework for Machine learning-Based Reactive Power Optimization of Distribution Network

Wenlong Liao, Benjamin Schäfer, Dalin Qin et al.

To reduce the heavy computational burden of reactive power optimization of distribution networks, machine learning models are receiving increasing attention. However, most machine learning models (e.g., neural networks) are usually considered as black boxes, making it challenging for power system operators to identify and comprehend potential biases or errors in the decision-making process of machine learning models. To address this issue, an explainable machine-learning framework is proposed to optimize the reactive power in distribution networks. Firstly, a Shapley additive explanation framework is presented to measure the contribution of each input feature to the solution of reactive power optimizations generated from machine learning models. Secondly, a model-agnostic approximation method is developed to estimate Shapley values, so as to avoid the heavy computational burden associated with direct calculations of Shapley values. The simulation results show that the proposed explainable framework can accurately explain the solution of the machine learning model-based reactive power optimization by using visual analytics, from both global and instance perspectives. Moreover, the proposed explainable framework is model-agnostic, and thus applicable to various models (e.g., neural networks).

CLDec 27, 2024Code
Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Zhe Yang, Yichang Zhang, Yudong Wang et al. · pku

Large Language Models (LLMs) can correct their self-generated responses, but a decline in accuracy after self-correction is also witnessed. To have a deeper understanding of self-correction, we endeavor to decompose, evaluate, and analyze the self-correction behaviors of LLMs. By enumerating and analyzing answer correctness before and after self-correction, we decompose the self-correction capability into confidence (being confident to correct answers) and critique (turning wrong answers to correct) capabilities, and propose two metrics from a probabilistic perspective to measure these 2 capabilities, along with another metric for overall self-correction capability evaluation. Based on our decomposition and evaluation metrics, we conduct extensive experiments and draw some empirical conclusions. For example, we find different models can exhibit distinct behaviors: some models are confident while others are more critical. We also find the trade-off between the two capabilities (i.e. improving one can lead to a decline in the other) when manipulating model self-correction behavior by prompts or in-context learning. Further, we find a simple yet efficient strategy to improve self-correction capability by transforming Supervision Fine-Tuning (SFT) data format, and our strategy outperforms vanilla SFT in both capabilities and achieves much higher accuracy after self-correction. Our code will be publicly available on GitHub.

84.7CRApr 20
Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

Meifang Chen, Zhe Yang, Huang Nianchen et al.

Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. While the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs are shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected behavior of secret memorization, which we term as \textit{gibberish bias}. Specifically, we identified that some secrets are among the easiest for CLLMs to memorize. These secrets yield high character-level entropy, but low token-level entropy. Then, this paper supports the biased claim with numerical data. We identified that the roots of the bias are the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the ``larger vocabulary'' trend. To conclude the paper, we discuss potential mitigation strategies and the broader implications on current tokenizer design.

29.6CVMay 16
Towards Generalized Image Manipulation Localization via Score-based Model

Yunfei Wang, Bo Du, Zhe Yang et al.

With the rapid evolution of synthetic media, Image Manipulation Localization (IML) has emerged as a critical component in multimedia forensics for ensuring the integrity of digital content. However, generalization remains a core challenge, as existing discriminative methods typically learn a fixed decision boundary that tends to overfit to specific training artifacts and fails to adapt to unseen manipulation types. To address this, we propose DiffIML, a novel framework that introduces score-based generative modeling to IML. Diverging from the direct estimation of hard boundaries, DiffIML approximates the score function, the gradient of the log-likelihood, to capture the intrinsic geometric topology of mask distributions. This paradigm leverages structural priors to iteratively recover coherent masks from noise, thereby circumventing the brittleness associated with discriminative models. Under this formulation, diffusion models serve as an effective numerical solver for the learned score function.To ensure practicality, we respectively resolve the efficiency and stability bottlenecks of standard diffusion by: (1) utilizing a Lightweight Mask-Specific VAE for fast latent-space process and a decoupled architecture with a lightweight denoising UNet, (2) edge supervision and error prior to mitigate error accumulation during sampling. Extensive experiments of two distinct protocols on eight non-generative and three generative benchmarks demonstrate that DiffIML consistently outperforms state-of-the-art methods, yielding remarkable generalization improvements on diverse unseen datasets. The code will be publicly available.

CLJan 15, 2024
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Heming Xia, Zhe Yang, Qingxiu Dong et al. · pku

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.

CLNov 12, 2025
C$^3$TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation

Yu Li, Zhe Yang, Yi Huang et al.

Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C$^3$TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C$^3$TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C$^3$TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C$^3$TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.

IRDec 15, 2024Code
Multi-Graph Co-Training for Capturing User Intent in Session-based Recommendation

Zhe Yang, Tiantian Liang

Session-based recommendation focuses on predicting the next item a user will interact with based on sequences of anonymous user sessions. A significant challenge in this field is data sparsity due to the typically short-term interactions. Most existing methods rely heavily on users' current interactions, overlooking the wealth of auxiliary information available. To address this, we propose a novel model, the Multi-Graph Co-Training model (MGCOT), which leverages not only the current session graph but also similar session graphs and a global item relation graph. This approach allows for a more comprehensive exploration of intrinsic relationships and better captures user intent from multiple views, enabling session representations to complement each other. Additionally, MGCOT employs multi-head attention mechanisms to effectively capture relevant session intent and uses contrastive learning to form accurate and robust session representations. Extensive experiments on three datasets demonstrate that MGCOT significantly enhances the performance of session-based recommendations, particularly on the Diginetica dataset, achieving improvements up to 2.00% in P@20 and 10.70% in MRR@20. Resources have been made publicly available in our GitHub repository https://github.com/liang-tian-tian/MGCOT.

IRDec 16, 2024Code
SPGL: Enhancing Session-based Recommendation with Single Positive Graph Learning

Tiantian Liang, Zhe Yang

Session-based recommendation seeks to forecast the next item a user will be interested in, based on their interaction sequences. Due to limited interaction data, session-based recommendation faces the challenge of limited data availability. Traditional methods enhance feature learning by constructing complex models to generate positive and negative samples. This paper proposes a session-based recommendation model using Single Positive optimization loss and Graph Learning (SPGL) to deal with the problem of data sparsity, high model complexity and weak transferability. SPGL utilizes graph convolutional networks to generate global item representations and batch session representations, effectively capturing intrinsic relationships between items. The use of single positive optimization loss improves uniformity of item representations, thereby enhancing recommendation accuracy. In the intent extractor, SPGL considers the hop count of the adjacency matrix when constructing the directed global graph to fully integrate spatial information. It also takes into account the reverse positional information of items when constructing session representations to incorporate temporal information. Comparative experiments across three benchmark datasets, Tmall, RetailRocket and Diginetica, demonstrate the model's effectiveness. The source code can be accessed on https://github.com/liang-tian-tian/SPGL .

CVJan 7, 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Shaolei Zhang, Qingkai Fang, Zhe Yang et al.

The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.

LGApr 7, 2024
TimeGPT in Load Forecasting: A Large Time Series Model Perspective

Wenlong Liao, Fernando Porte-Agel, Jiannong Fang et al.

Machine learning models have made significant progress in load forecasting, but their forecast accuracy is limited in cases where historical load data is scarce. Inspired by the outstanding performance of large language models (LLMs) in computer vision and natural language processing, this paper aims to discuss the potential of large time series models in load forecasting with scarce historical data. Specifically, the large time series model is constructed as a time series generative pre-trained transformer (TimeGPT), which is trained on massive and diverse time series datasets consisting of 100 billion data points (e.g., finance, transportation, banking, web traffic, weather, energy, healthcare, etc.). Then, the scarce historical load data is used to fine-tune the TimeGPT, which helps it to adapt to the data distribution and characteristics associated with load forecasting. Simulation results show that TimeGPT outperforms the benchmarks (e.g., popular machine learning models and statistical models) for load forecasting on several real datasets with scarce training samples, particularly for short look-ahead times. However, it cannot be guaranteed that TimeGPT is always superior to benchmarks for load forecasting with scarce data, since the performance of TimeGPT may be affected by the distribution differences between the load data and the training data. In practical applications, we can divide the historical data into a training set and a validation set, and then use the validation set loss to decide whether TimeGPT is the best choice for a specific dataset.

CLFeb 25, 2024
PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization

Xiangdi Meng, Damai Dai, Weiyao Luo et al. · pku

Supervised fine-tuning is the most common method to adapt large language models (LLMs) to downstream tasks, but full fine-tuning LLMs requires massive computational resources. Recently, parameter-efficient fine-tuning (PEFT) methods have been widely studied due to its cost-effectiveness. LoRA is one of the most widely used methods, which assumes that the optimization process is essentially low-dimensional. Although LoRA fine-tuning is effective, there is still a performance gap compared to full fine-tuning, since its weight update is limited to low-rank matrices. In order to break the low-rank bottleneck in LoRA Optimization, we propose PeriodicLoRA (PLoRA), which accumulates low-rank update matrices multiple times to achieve a higher update rank. PLoRA has multiple training stages. During each stage, we still update only the LoRA weights. However, at the end of each stage, we unload the LoRA weights into the backbone parameters and then reinitialize the LoRA states. Experimental results show that PLoRA has stronger learning ability, approximately 1.8 times that of LoRA's learning ability at most, but it does not increase memory usage. Further, we introduce a momentum-based unloading strategy for PLoRA to mitigate the training instability.

84.0DCApr 21
DPC: A Distributed Page Cache over CXL

Shai Bergman, Zhe Yang, Julien Eudine et al.

Modern distributed file systems rely on uncoordinated, per node page caches that replicate hot data locally across the cluster. While ensuring fast local access, this architecture underutilizes aggregate cluster DRAM capacity through massive data redundancy and incurs prohibitive coherence overhead via heavyweight, lock-based protocols. In this paper, we focus on the design of a distributed page cache that treats the entire cluster's main memory as a single cache budget while preserving standard file-system interfaces and semantics. We present Distributed Page Cache (DPC), an OS-level, distributed page cache built on top of Compute Express Link (CXL) 3.0 memory semantics. DPC enforces a single-copy invariant at page granularity: each file page has exactly one owner node holding the sole resident DRAM copy, and other nodes access it via CXL-based remote mappings rather than creating replicas of the page. DPC is implemented end-to-end on a CXL-based emulation framework that models multi-host CXL 3.0 memory fabrics, enabling detailed evaluation in the absence of widespread hardware. Across real-world and representative data-sharing workloads, DPC delivers speedups of up to 12.4X, with a geometric-mean speedup of 5.6X.

LGNov 18, 2024
Zero-Shot Load Forecasting with Large Language Models

Wenlong Liao, Zhe Yang, Mengshuo Jia et al.

Deep learning models have shown strong performance in load forecasting, but they generally require large amounts of data for model training before being applied to new scenarios, which limits their effectiveness in data-scarce scenarios. Inspired by the great success of pre-trained language models (LLMs) in natural language processing, this paper proposes a zero-shot load forecasting approach using an advanced LLM framework denoted as the Chronos model. By utilizing its extensive pre-trained knowledge, the Chronos model enables accurate load forecasting in data-scarce scenarios without the need for extensive data-specific training. Simulation results across five real-world datasets demonstrate that the Chronos model significantly outperforms nine popular baseline models for both deterministic and probabilistic load forecasting with various forecast horizons (e.g., 1 to 48 hours), even though the Chronos model is neither tailored nor fine-tuned to these specific load datasets. Notably, Chronos reduces root mean squared error (RMSE), continuous ranked probability score (CRPS), and quantile score (QS) by approximately 7.34%-84.30%, 19.63%-60.06%, and 22.83%-54.49%, respectively, compared to baseline models. These results highlight the superiority and flexibility of the Chronos model, positioning it as an effective solution in data-scarce scenarios.

CVDec 12, 2024
Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D Images

Wenrui Li, Zhe Yang, Wei Han et al.

Reconstructing desired objects and scenes has long been a primary goal in 3D computer vision. Single-view point cloud reconstruction has become a popular technique due to its low cost and accurate results. However, single-view reconstruction methods often rely on expensive CAD models and complex geometric priors. Effectively utilizing prior knowledge about the data remains a challenge. In this paper, we introduce hyperbolic space to 3D point cloud reconstruction, enabling the model to represent and understand complex hierarchical structures in point clouds with low distortion. We build upon previous methods by proposing a hyperbolic Chamfer distance and a regularized triplet loss to enhance the relationship between partial and complete point clouds. Additionally, we design adaptive boundary conditions to improve the model's understanding and reconstruction of 3D structures. Our model outperforms most existing models, and ablation studies demonstrate the significance of our model and its components. Experimental results show that our method significantly improves feature extraction capabilities. Our model achieves outstanding performance in 3D reconstruction tasks.

86.0CVApr 6
FileGram: Grounding Agent Personalization in File-System Behavioral Traces

Shuai Liu, Shulin Tian, Kairui Hu et al.

Coworking AI agents operating within local file systems are rapidly emerging as a paradigm in human-AI interaction; however, effective personalization remains limited by severe data constraints, as strict privacy barriers and the difficulty of jointly collecting multimodal real-world traces prevent scalable training and evaluation, and existing methods remain interaction-centric while overlooking dense behavioral traces in file-system operations; to address this gap, we propose FileGram, a comprehensive framework that grounds agent memory and personalization in file-system behavioral traces, comprising three core components: (1) FileGramEngine, a scalable persona-driven data engine that simulates realistic workflows and generates fine-grained multimodal action sequences at scale; (2) FileGramBench, a diagnostic benchmark grounded in file-system behavioral traces for evaluating memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; and (3) FileGramOS, a bottom-up memory architecture that builds user profiles directly from atomic actions and content deltas rather than dialogue summaries, encoding these traces into procedural, semantic, and episodic channels with query-time abstraction; extensive experiments show that FileGramBench remains challenging for state-of-the-art memory systems and that FileGramEngine and FileGramOS are effective, and by open-sourcing the framework, we hope to support future research on personalized memory-centric file-system agents.

88.2AIApr 1
HippoCamp: Benchmarking Contextual Agents on Personal Computers

Zhe Yang, Shulin Tian, Kairui Hu et al.

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

AIDec 22, 2025
FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation based on Frame-Compressed Multimodal Trajectory Reasoning

Zhe Yang, Xiaoshuang Sheng, Zhengnan Zhang et al.

Identifying user intent from mobile UI operation trajectories is critical for advancing UI understanding and enabling task automation agents. While Multimodal Large Language Models (MLLMs) excel at video understanding tasks, their real-time mobile deployment is constrained by heavy computational costs and inefficient redundant frame processing. To address these issues, we propose the FC-MIR framework: leveraging keyframe sampling and adaptive concatenation, it cuts visual redundancy to boost inference efficiency, while integrating state-of-the-art closed-source MLLMs or fine-tuned models (e.g., Qwen3-VL) for trajectory summarization and intent prediction. We further expand task scope to explore generating post-prediction operations and search suggestions, and introduce a fine-grained metric to evaluate the practical utility of summaries, predictions, and suggestions. For rigorous assessment, we construct a UI trajectory dataset covering scenarios from UI-Agents (Agent-I) and real user interactions (Person-I). Experimental results show our compression method retains performance at 50%-60% compression rates; both closed-source and fine-tuned MLLMs demonstrate strong intent summarization, supporting potential lightweight on-device deployment. However, MLLMs still struggle with useful and "surprising" suggestions, leaving room for improvement. Finally, we deploy the framework in a real-world setting, integrating UI perception and UI-Agent proxies to lay a foundation for future progress in this field.

CLAug 22, 2025
A Probabilistic Inference Scaling Theory for LLM Self-Correction

Zhe Yang, Yichang Zhang, Yudong Wang et al. · pku

Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - α^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $α$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.

CLJul 8, 2025
Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Yizhan Huang, Zhe Yang, Meifang Chen et al.

Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or "gibberish", we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).

SYDec 25, 2023
Improving the Accuracy and Interpretability of Neural Networks for Wind Power Forecasting

Wenlong Liao, Fernando Porte-Agel, Jiannong Fang et al.

Deep neural networks (DNNs) are receiving increasing attention in wind power forecasting due to their ability to effectively capture complex patterns in wind data. However, their forecasted errors are severely limited by the local optimal weight issue in optimization algorithms, and their forecasted behavior also lacks interpretability. To address these two challenges, this paper firstly proposes simple but effective triple optimization strategies (TriOpts) to accelerate the training process and improve the model performance of DNNs in wind power forecasting. Then, permutation feature importance (PFI) and local interpretable model-agnostic explanation (LIME) techniques are innovatively presented to interpret forecasted behaviors of DNNs, from global and instance perspectives. Simulation results show that the proposed TriOpts not only drastically improve the model generalization of DNNs for both the deterministic and probabilistic wind power forecasting, but also accelerate the training process. Besides, the proposed PFI and LIME techniques can accurately estimate the contribution of each feature to wind power forecasting, which helps to construct feature engineering and understand how to obtain forecasted values for a given sample.

CLMar 14, 2025
Palette of Language Models: A Solver for Controlled Text Generation

Zhe Yang, Yi Huang, Yaqin Chen et al.

Recent advancements in large language models have revolutionized text generation with their remarkable capabilities. These models can produce controlled texts that closely adhere to specific requirements when prompted appropriately. However, designing an optimal prompt to control multiple attributes simultaneously can be challenging. A common approach is to linearly combine single-attribute models, but this strategy often overlooks attribute overlaps and can lead to conflicts. Therefore, we propose a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization on generative language models. This method has been adapted for single-attribute control scenario and is termed the Palette of Language Models due to its theoretical linkage between attribute strength and generation style, akin to blending colors on an artist's palette. Moreover, positive correlation and attribute enhancement are advanced as theoretical properties to guide a rational combination strategy design. We conduct experiments on both single control and multiple control settings, and achieve surpassing results.

CLSep 4, 2025
Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning

Zhilin Wang, Zhe Yang, Yun Luo et al.

Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. Inspired by mathematics, where simple operations yield infinite verifiable problems, we introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions to systematically synthesize a vast and diverse corpus of sheet music reasoning problems. This approach allows us to introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music, while also pointing out the ongoing challenges in understanding sheet music in a visual format. By leveraging synthetic data for RLVR, all models show significant improvements on the SSMR-Bench. Additionally, they also demonstrate considerable advancements on previously established human-crafted benchmarks, such as MusicTheoryBench and the music subset of MMMU. Finally, our results show that the enhanced reasoning ability can also facilitate music composition.

CLFeb 15, 2025
CiteCheck: Towards Accurate Citation Faithfulness Detection

Ziyao Xu, Shaohang Wei, Zhuoheng Han et al. · pku

Citation faithfulness detection is critical for enhancing retrieval-augmented generation (RAG) systems, yet large-scale Chinese datasets for this task are scarce. Existing methods face prohibitive costs due to the need for manually annotated negative samples. To address this, we introduce the first large-scale Chinese dataset CiteCheck for citation faithfulness detection, constructed via a cost-effective approach using two-stage manual annotation. This method balances positive and negative samples while significantly reducing annotation expenses. CiteCheck comprises training and test splits. Experiments demonstrate that: (1) the test samples are highly challenging, with even state-of-the-art LLMs failing to achieve high accuracy; and (2) training data augmented with LLM-generated negative samples enables smaller models to attain strong performance using parameter-efficient fine-tuning. CiteCheck provides a robust foundation for advancing citation faithfulness detection in Chinese RAG systems. The dataset is publicly available to facilitate research.

CLJun 18, 2024
Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Zhe Yang, Yichang Zhang, Tianyu Liu et al.

Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g. LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2\% but is still inconsistent to specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.

AIJun 14, 2024
SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering

Zhe Yang, Wenrui Li, Guanghui Cheng

The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared to traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion processes more challenging. Euclidean space is difficult to effectively represent multi-dimensional relationships of data. Especially when extracting and processing data with a tree structure or hierarchical structure, Euclidean space is not suitable as an embedding space. Additionally, the self-attention mechanism in Transformers is effective in capturing the dynamic relationships between elements in a sequence. However, the self-attention mechanism's limitations in window modeling and quadratic computational complexity reduce its effectiveness in modeling long sequences. To address these limitations, we propose SHMamba: Structured Hyperbolic State Space Model to integrate the advantages of hyperbolic geometry and state space models. Specifically, SHMamba leverages the intrinsic properties of hyperbolic space to represent hierarchical structures and complex relationships in audio-visual data. Meanwhile, the state space model captures dynamic changes over time by globally modeling the entire sequence. Furthermore, we introduce an adaptive curvature hyperbolic alignment module and a cross fusion block to enhance the understanding of hierarchical structures and the dynamic exchange of cross-modal information, respectively. Extensive experiments demonstrate that SHMamba outperforms previous methods with fewer parameters and computational costs. Our learnable parameters are reduced by 78.12\%, while the average performance improves by 2.53\%. Experiments show that our method demonstrates superiority among all current major methods and is more suitable for practical application scenarios.

SYNov 15, 2021
Short-Term Power Prediction for Renewable Energy Using Hybrid Graph Convolutional Network and Long Short-Term Memory Approach

Wenlong Liao, Birgitte Bak-Jensen, Jayakrishnan Radhakrishna Pillai et al.

Accurate short-term solar and wind power predictions play an important role in the planning and operation of power systems. However, the short-term power prediction of renewable energy has always been considered a complex regression problem, owing to the fluctuation and intermittence of output powers and the law of dynamic change with time due to local weather conditions, i.e. spatio-temporal correlation. To capture the spatio-temporal features simultaneously, this paper proposes a new graph neural network-based short-term power forecasting approach, which combines the graph convolutional network (GCN) and long short-term memory (LSTM). Specifically, the GCN is employed to learn complex spatial correlations between adjacent renewable energies, and the LSTM is used to learn dynamic changes of power generation curves. The simulation results show that the proposed hybrid approach can model the spatio-temporal correlation of renewable energies, and its performance outperforms popular baselines on real-world datasets.

CVNov 10, 2021
A Structure Feature Algorithm for Multi-modal Forearm Registration

Jiaxin Li, Yan Ding, Weizhong Zhang et al.

Augmented reality technology based on image registration is becoming increasingly popular for the convenience of pre-surgery preparation and medical education. This paper focuses on the registration of forearm images and digital anatomical models. Due to the difference in texture features of forearm multi-modal images, this paper proposes a forearm feature representation curve (FFRC) based on structure compliant multi-modal image registration framework (FAM) for the forearm.

CVAug 30, 2021
LUAI Challenge 2021 on Learning to Understand Aerial Images

Gui-Song Xia, Jian Ding, Ming Qian et al.

This report summarizes the results of Learning to Understand Aerial Images (LUAI) 2021 challenge held on ICCV 2021, which focuses on object detection and semantic segmentation in aerial images. Using DOTA-v2.0 and GID-15 datasets, this challenge proposes three tasks for oriented object detection, horizontal object detection, and semantic segmentation of common categories in aerial images. This challenge received a total of 146 registrations on the three tasks. Through the challenge, we hope to draw attention from a wide range of communities and call for more efforts on the problems of learning to understand aerial images.

DCMay 13, 2021
CrossRoI: Cross-camera Region of Interest Optimization for Efficient Real Time Video Analytics at Scale

Hongpeng Guo, Shuochao Yao, Zhe Yang et al.

Video cameras are pervasively deployed in city scale for public good or community safety (i.e. traffic monitoring or suspected person tracking). However, analyzing large scale video feeds in real time is data intensive and poses severe challenges to network and computation systems today. We present CrossRoI, a resource-efficient system that enables real time video analytics at scale via harnessing the videos content associations and redundancy across a fleet of cameras. CrossRoI exploits the intrinsic physical correlations of cross-camera viewing fields to drastically reduce the communication and computation costs. CrossRoI removes the repentant appearances of same objects in multiple cameras without harming comprehensive coverage of the scene. CrossRoI operates in two phases - an offline phase to establish cross-camera correlations, and an efficient online phase for real time video inference. Experiments on real-world video feeds show that CrossRoI achieves 42% - 65% reduction for network overhead and 25% - 34% reduction for response delay in real time video analytics applications with more than 99% query accuracy, when compared to baseline methods. If integrated with SotA frame filtering systems, the performance gains of CrossRoI reach 50% - 80% (network overhead) and 33% - 61% (end-to-end delay).

DCMay 5, 2021
DeepRT: A Soft Real Time Scheduler for Computer Vision Applications on the Edge

Zhe Yang, Klara Nahrstedt, Hongpeng Guo et al.

The ubiquity of smartphone cameras and IoT cameras, together with the recent boom of deep learning and deep neural networks, proliferate various computer vision driven mobile and IoT applications deployed on the edge. This paper focuses on applications which make soft real time requests to perform inference on their data - they desire prompt responses within designated deadlines, but occasional deadline misses are acceptable. Supporting soft real time applications on a multi-tenant edge server is not easy, since the requests sharing the limited GPU computing resources of an edge server interfere with each other. In order to tackle this problem, we comprehensively evaluate how latency and throughput respond to different GPU execution plans. Based on this analysis, we propose a GPU scheduler, DeepRT, which provides latency guarantee to the requests while maintaining high overall system throughput. The key component of DeepRT, DisBatcher, batches data from different requests as much as possible while it is proven to provide latency guarantee for requests admitted by an Admission Control Module. DeepRT also includes an Adaptation Module which tackles overruns. Our evaluation results show that DeepRT outperforms state-of-the-art works in terms of the number of deadline misses and throughput.

CLAug 29, 2019
A Joint Model for Aspect-Category Sentiment Analysis with Shared Sentiment Prediction Layer

Yuncong Li, Zhe Yang, Cunxiang Yin et al.

Aspect-category sentiment analysis (ACSA) aims to predict the aspect categories mentioned in texts and their corresponding sentiment polarities. Some joint models have been proposed to address this task. Given a text, these joint models detect all the aspect categories mentioned in the text and predict the sentiment polarities toward them at once. Although these joint models obtain promising performances, they train separate parameters for each aspect category and therefore suffer from data deficiency of some aspect categories. To solve this problem, we propose a novel joint model which contains a shared sentiment prediction layer. The shared sentiment prediction layer transfers sentiment knowledge between aspect categories and alleviates the problem caused by data deficiency. Experiments conducted on SemEval-2016 Datasets demonstrate the effectiveness of our model.

IVAug 28, 2019
CAMEL: A Weakly Supervised Learning Framework for Histopathology Image Segmentation

Gang Xu, Zhigang Song, Zhuo Sun et al.

Histopathology image analysis plays a critical role in cancer diagnosis and treatment. To automatically segment the cancerous regions, fully supervised segmentation algorithms require labor-intensive and time-consuming labeling at the pixel level. In this research, we propose CAMEL, a weakly supervised learning framework for histopathology image segmentation using only image-level labels. Using multiple instance learning (MIL)-based label enrichment, CAMEL splits the image into latticed instances and automatically generates instance-level labels. After label enrichment, the instance-level labels are further assigned to the corresponding pixels, producing the approximate pixel-level labels and making fully supervised training of segmentation models possible. CAMEL achieves comparable performance with the fully supervised approaches in both instance-level classification and pixel-level segmentation on CAMELYON16 and a colorectal adenoma dataset. Moreover, the generality of the automatic labeling methodology may benefit future weakly supervised learning studies for histopathology image analysis.

CVJun 10, 2019
BAGS: An automatic homework grading system using the pictures taken by smart phones

Xiaoshuo Li, Tiezhu Yue, Xuanping Huang et al.

Homework grading is critical to evaluate teaching quality and effect. However, it is usually time-consuming to grade the homework manually. In automatic homework grading scenario, many optical mark reader (OMR)-based solutions which require specific equipments have been proposed. Although many of them can achieve relatively high accuracy, they are less convenient for users. In contrast, with the popularity of smart phones, the automatic grading system which depends on the image photographed by phones becomes more available. In practice, due to different photographing angles or uneven papers, images may be distorted. Moreover, most of images are photographed under complex backgrounds, making answer areas detection more difficult. To solve these problems, we propose BAGS, an automatic homework grading system which can effectively locate and recognize handwritten answers. In BAGS, all the answers would be written above the answer area underlines (AAU), and we use two segmentation networks based on DeepLabv3+ to locate the answer areas. Then, we use the characters recognition part to recognize students' answers. Finally, the grading part is designed for the comparison between the recognized answers and the standard ones. In our test, BAGS correctly locates and recognizes the handwritten answers in 91% of total answer areas.