Yong He

CV
h-index24
34papers
445citations
Novelty45%
AI Score53

34 Papers

CVJun 27, 2022
PST: Plant segmentation transformer for 3D point clouds of rapeseed plants at the podding stage

Ruiming Du, Zhihong Ma, Pengyao Xie et al.

Segmentation of plant point clouds to obtain high-precise morphological traits is essential for plant phenotyping. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, previous studies mainly focus on the hard voxelization-based or down-sampling-based methods, which are limited to segmenting simple plant organs. Segmentation of complex plant point clouds with a high spatial resolution still remains challenging. In this study, we proposed a deep learning network plant segmentation transformer (PST) to achieve the semantic and instance segmentation of rapeseed plants point clouds acquired by handheld laser scanning (HLS) with the high spatial resolution, which can characterize the tiny siliques as the main traits targeted. PST is composed of: (i) a dynamic voxel feature encoder (DVFE) to aggregate the point features with the raw spatial resolution; (ii) the dual window sets attention blocks to capture the contextual information; and (iii) a dense feature propagation module to obtain the final dense point feature map. The results proved that PST and PST-PointGroup (PG) achieved superior performance in semantic and instance segmentation tasks. For the semantic segmentation, the mean IoU, mean Precision, mean Recall, mean F1-score, and overall accuracy of PST were 93.96%, 97.29%, 96.52%, 96.88%, and 97.07%, achieving an improvement of 7.62%, 3.28%, 4.8%, 4.25%, and 3.88% compared to the second-best state-of-the-art network PAConv. For instance segmentation, PST-PG reached 89.51%, 89.85%, 88.83% and 82.53% in mCov, mWCov, mPerc90, and mRec90, achieving an improvement of 2.93%, 2.21%, 1.99%, and 5.9% compared to the original PG. This study proves that the deep-learning-based point cloud segmentation method has a great potential for resolving dense plant point clouds with complex morphological traits.

CVMar 8, 2023
Full Point Encoding for Local Feature Aggregation in 3D Point Clouds

Yong He, Hongshan Yu, Zhengeng Yang et al.

Point cloud processing methods exploit local point features and global context through aggregation which does not explicity model the internal correlations between local and global features. To address this problem, we propose full point encoding which is applicable to convolution and transformer architectures. Specifically, we propose Full Point Convolution (FPConv) and Full Point Transformer (FPTransformer) architectures. The key idea is to adaptively learn the weights from local and global geometric connections, where the connections are established through local and global correlation functions respectively. FPConv and FPTransformer simultaneously model the local and global geometric relationships as well as their internal correlations, demonstrating strong generalization ability and high performance. FPConv is incorporated in classical hierarchical network architectures to achieve local and global shape-aware learning. In FPTransformer, we introduce full point position encoding in self-attention, that hierarchically encodes each point position in the global and local receptive field. We also propose a shape aware downsampling block which takes into account the local shape and the global context. Experimental comparison to existing methods on benchmark datasets show the efficacy of FPConv and FPTransformer for semantic segmentation, object detection, classification, and normal estimation tasks. In particular, we achieve state-of-the-art semantic segmentation results of 76% mIoU on S3DIS 6-fold and 72.2% on S3DIS Area5.

CVMar 8, 2023
DANet: Density Adaptive Convolutional Network with Interactive Attention for 3D Point Clouds

Yong He, Hongshan Yu, Zhengeng Yang et al.

Local features and contextual dependencies are crucial for 3D point cloud analysis. Many works have been devoted to designing better local convolutional kernels that exploit the contextual dependencies. However, current point convolutions lack robustness to varying point cloud density. Moreover, contextual modeling is dominated by non-local or self-attention models which are computationally expensive. To solve these problems, we propose density adaptive convolution, coined DAConv. The key idea is to adaptively learn the convolutional weights from geometric connections obtained from the point density and position. To extract precise context dependencies with fewer computations, we propose an interactive attention module (IAM) that embeds spatial information into channel attention along different spatial directions. DAConv and IAM are integrated in a hierarchical network architecture to achieve local density and contextual direction-aware learning for point cloud analysis. Experiments show that DAConv is significantly more robust to point density compared to existing methods and extensive comparisons on challenging 3D point cloud datasets show that our network achieves state-of-the-art classification results of 93.6% on ModelNet40, competitive semantic segmentation results of 68.71% mIoU on S3DIS and part segmentation results of 86.7% mIoU on ShapeNet.

IRAug 31, 2023
AntM$^{2}$C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction

Zhaoxin Huan, Ke Ding, Ang Li et al.

Click-through rate (CTR) prediction is a crucial issue in recommendation systems. There has been an emergence of various public CTR datasets. However, existing datasets primarily suffer from the following limitations. Firstly, users generally click different types of items from multiple scenarios, and modeling from multiple scenarios can provide a more comprehensive understanding of users. Existing datasets only include data for the same type of items from a single scenario. Secondly, multi-modal features are essential in multi-scenario prediction as they address the issue of inconsistent ID encoding between different scenarios. The existing datasets are based on ID features and lack multi-modal features. Third, a large-scale dataset can provide a more reliable evaluation of models, fully reflecting the performance differences between models. The scale of existing datasets is around 100 million, which is relatively small compared to the real-world CTR prediction. To address these limitations, we propose AntM$^{2}$C, a Multi-Scenario Multi-Modal CTR dataset based on industrial data from Alipay. Specifically, AntM$^{2}$C provides the following advantages: 1) It covers CTR data of 5 different types of items, providing insights into the preferences of users for different items, including advertisements, vouchers, mini-programs, contents, and videos. 2) Apart from ID-based features, AntM$^{2}$C also provides 2 multi-modal features, raw text and image features, which can effectively establish connections between items with different IDs. 3) AntM$^{2}$C provides 1 billion CTR data with 200 features, including 200 million users and 6 million items. It is currently the largest-scale CTR dataset available. Based on AntM$^{2}$C, we construct several typical CTR tasks and provide comparisons with baseline methods. The dataset homepage is available at https://www.atecup.cn/home.

CLOct 8, 2022
KG-MTT-BERT: Knowledge Graph Enhanced BERT for Multi-Type Medical Text Classification

Yong He, Cheng Wang, Shun Zhang et al.

Medical text learning has recently emerged as a promising area to improve healthcare due to the wide adoption of electronic health record (EHR) systems. The complexity of the medical text such as diverse length, mixed text types, and full of medical jargon, poses a great challenge for developing effective deep learning models. BERT has presented state-of-the-art results in many NLP tasks, such as text classification and question answering. However, the standalone BERT model cannot deal with the complexity of the medical text, especially the lengthy clinical notes. Herein, we develop a new model called KG-MTT-BERT (Knowledge Graph Enhanced Multi-Type Text BERT) by extending the BERT model for long and multi-type text with the integration of the medical knowledge graph. Our model can outperform all baselines and other state-of-the-art models in diagnosis-related group (DRG) classification, which requires comprehensive medical text for accurate classification. We also demonstrated that our model can effectively handle multi-type text and the integration of medical knowledge graph can significantly improve the performance.

LGJan 31, 2023
GDOD: Effective Gradient Descent using Orthogonal Decomposition for Multi-Task Learning

Xin Dong, Ruize Wu, Chao Xiong et al.

Multi-task learning (MTL) aims at solving multiple related tasks simultaneously and has experienced rapid growth in recent years. However, MTL models often suffer from performance degeneration with negative transfer due to learning several tasks simultaneously. Some related work attributed the source of the problem is the conflicting gradients. In this case, it is needed to select useful gradient updates for all tasks carefully. To this end, we propose a novel optimization approach for MTL, named GDOD, which manipulates gradients of each task using an orthogonal basis decomposed from the span of all task gradients. GDOD decomposes gradients into task-shared and task-conflict components explicitly and adopts a general update rule for avoiding interference across all task gradients. This allows guiding the update directions depending on the task-shared components. Moreover, we prove the convergence of GDOD theoretically under both convex and non-convex assumptions. Experiment results on several multi-task datasets not only demonstrate the significant improvement of GDOD performed to existing MTL models but also prove that our algorithm outperforms state-of-the-art optimization methods in terms of AUC and Logloss metrics.

IRFeb 25
Trie-Aware Transformers for Generative Recommendation

Zhenxiang Xu, Jiawei Chen, Sirui Chen et al.

Generative recommendation (GR) aligns with advances in generative AI by casting next-item prediction as token-level generation rather than score-based ranking. Most GR methods adopt a two-stage pipeline: (i) \textit{item tokenization}, which maps each item to a sequence of discrete, hierarchically organized tokens; and (ii) \textit{autoregressive generation}, which predicts the next item's tokens conditioned on the tokens of user's interaction history. Although hierarchical tokenization induces a prefix tree (trie) over items, standard autoregressive modeling with conventional Transformers often flattens item tokens into a linear stream and overlooks the underlying topology. To address this, we propose TrieRec, a trie-aware generative recommendation method that augments Transformers with structural inductive biases via two positional encodings. First, a \textit{trie-aware absolute positional encoding} aggregates a token's (node's) local structural context (\eg depth, ancestors, and descendants) into the token representation. Second, a \textit{topology-aware relative positional encoding} injects pairwise structural relations into self-attention to capture topology-induced semantic relatedness. TrieRec is also model-agnostic, efficient, and hyperparameter-free. In our experiments, we implement TrieRec within three representative GR backbones, achieving notably improvements of 8.83\% on average across four real-world datasets.

LGNov 20, 2024Code
Multimodal large language model for wheat breeding: a new exploration of smart breeding

Guofeng Yang, Yu Li, Yong He et al.

UAV remote sensing technology has become a key technology in crop breeding, which can achieve high-throughput and non-destructive collection of crop phenotyping data. However, the multidisciplinary nature of breeding has brought technical barriers and efficiency challenges to knowledge mining. Therefore, it is important to develop a smart breeding goal tool to mine cross-domain multimodal data. Based on different pre-trained open-source multimodal large language models (MLLMs) (e.g., Qwen-VL, InternVL, Deepseek-VL), this study used supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) technologies to inject cross-domain knowledge into MLLMs, thereby constructing multiple multimodal large language models for wheat breeding (WBLMs). The above WBLMs were evaluated using the newly created evaluation benchmark in this study. The results showed that the WBLM constructed using SFT, RAG and RLHF technologies and InternVL2-8B has leading performance. Then, subsequent experiments were conducted using the WBLM. Ablation experiments indicated that the combination of SFT, RAG, and RLHF technologies can improve the overall generation performance, enhance the generated quality, balance the timeliness and adaptability of the generated answer, and reduce hallucinations and biases. The WBLM performed best in wheat yield prediction using cross-domain data (remote sensing, phenotyping, weather, germplasm) simultaneously, with R2 and RMSE of 0.821 and 489.254 kg/ha, respectively. Furthermore, the WBLM can generate professional decision support answers for phenotyping estimation, environmental stress assessment, target germplasm screening, cultivation technique recommendation, and seed price query tasks.

37.3IRMar 22
MI-DPG: Decomposable Parameter Generation Network Based on Mutual Information for Multi-Scenario Recommendation

Wenzhuo Cheng, Ke Ding, Xin Dong et al.

Conversion rate (CVR) prediction models play a vital role in recommendation and advertising systems. Recent research on multi-scenario recommendation shows that learning a unified model to serve multiple scenarios is effective for improving overall performance. However, it remains challenging to improve model prediction performance across scenarios at low model parameter cost, and current solutions are hard to robustly model multi-scenario diversity. In this paper, we propose MI-DPG for the multi-scenario CVR prediction, which learns scenario-conditioned dynamic model parameters for each scenario in a more efficient and effective manner. Specifically, we introduce an auxiliary network to generate scenario-conditioned dynamic weighting matrices, which are obtained by combining decomposed scenario-specific and scenario-shared low-rank matrices with parameter efficiency. For each scene, weighting the backbone model parameters by the weighting matrix helps to specialize the model parameters for different scenarios. It can not only modulate the complete parameter space of the backbone model but also improve the model effectiveness. Furthermore, we design a mutual information regularization to enhance the diversity of model parameters across different scenarios by maximizing the mutual information between the scenario-aware input and the scene-conditioned dynamic weighting matrix. Experiments from three real-world datasets show that MI-DPG significantly outperforms previous multi-scenario recommendation models.

CLDec 8, 2024Code
7B Fully Open Source Moxin-LLM/VLM -- From Pretraining to GRPO-based Reinforcement Learning Enhancement

Pu Zhao, Xuan Shen, Zhenglun Kong et al. · harvard

Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed, adhering to principles of open science, open source, open data, and open access. We release the pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints, aiming to make continuous commitments to fully open-source LLMs. After pre-training the base model, we finetune the Moxin Base model with SOTA post-training framework and instruction data to obtain Moxin Instruct model. To improve the reasoning capability, we further finetune our Instruct model with chain-of-thought data distilled from DeepSeek R1, and then use Group Relative Policy Optimization (GRPO) following DeepSeek R1 to finetune our model, leading to the Moxin Reasoning model. Moreover, we develop our vision language model based on our Moxin model. Experiments show that our models achieve superior performance in various evaluations such as zero-shot evaluation, few-shot evaluation, and CoT evaluation.

28.4SYMar 14
Fully distributed consensus control for stochastic multi-agent systems under undirected and directed topologies

Xuping Hou, Xiaofeng Zong, Yong He

This work aims to address the design of fully distributed control protocols for stochastic consensus, and, for the first time, establishes the existence and uniqueness of solutions for the path-dependent and highly nonlinear closed-loop systems under both undirected and directed topologies, bridging a critical gap in the literature. For the case of directed graphs, a unified fully distributed control protocol is designed for the first time to guarantee mean square and almost sure consensus of stochastic multi-agent systems under directed graphs. Moreover, an enhanced fully distributed protocol with additional tunable parameters designed for undirected graphs is proposed, which guarantees stochastic consensus while achieving superior convergence speed. Additionally, our work provides explicit exponential estimates for the corresponding convergence rates of stochastic consensus, elucidating the relationship between the exponential convergence rate and the system parameters. Simulations validate the theoretical results.

CLAug 5, 2025Code
CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction

Zixuan Li, Binzong Geng, Jing Xiong et al.

Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs' strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose $\textit{CTR-Sink}$, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and a attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method's effectiveness across scenarios.

IRMar 28, 2024
Breaking the Length Barrier: LLM-Enhanced CTR Prediction in Long Textual User Behaviors

Binzong Geng, Zhaoxin Huan, Xiaolu Zhang et al.

With the rise of large language models (LLMs), recent works have leveraged LLMs to improve the performance of click-through rate (CTR) prediction. However, we argue that a critical obstacle remains in deploying LLMs for practical use: the efficiency of LLMs when processing long textual user behaviors. As user sequences grow longer, the current efficiency of LLMs is inadequate for training on billions of users and items. To break through the efficiency barrier of LLMs, we propose Behavior Aggregated Hierarchical Encoding (BAHE) to enhance the efficiency of LLM-based CTR modeling. Specifically, BAHE proposes a novel hierarchical architecture that decouples the encoding of user behaviors from inter-behavior interactions. Firstly, to prevent computational redundancy from repeated encoding of identical user behaviors, BAHE employs the LLM's pre-trained shallow layers to extract embeddings of the most granular, atomic user behaviors from extensive user sequences and stores them in the offline database. Subsequently, the deeper, trainable layers of the LLM facilitate intricate inter-behavior interactions, thereby generating comprehensive user embeddings. This separation allows the learning of high-level user representations to be independent of low-level behavior encoding, significantly reducing computational complexity. Finally, these refined user embeddings, in conjunction with correspondingly processed item embeddings, are incorporated into the CTR model to compute the CTR scores. Extensive experimental results show that BAHE reduces training time and memory by five times for CTR models using LLMs, especially with longer user sequences. BAHE has been deployed in a real-world system, allowing for daily updates of 50 million CTR data on 8 A100 GPUs, making LLMs practical for industrial CTR prediction.

AIFeb 11
Abstraction Generation for Generalized Planning with Pretrained Large Language Models

Zhenhe Cui, Huaxiang Xia, Hangjun Shen et al.

Qualitative Numerical Planning (QNP) serves as an important abstraction model for generalized planning (GP), which aims to compute general plans that solve multiple instances at once. Recent works show that large language models (LLMs) can function as generalized planners. This work investigates whether LLMs can serve as QNP abstraction generators for GP problems and how to fix abstractions via automated debugging. We propose a prompt protocol: input a GP domain and training tasks to LLMs, prompting them to generate abstract features and further abstract the initial state, action set, and goal into QNP problems. An automated debugging method is designed to detect abstraction errors, guiding LLMs to fix abstractions. Experiments demonstrate that under properly guided by automated debugging, some LLMs can generate useful QNP abstractions.

CVFeb 1, 2025
Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering

Zechuan Li, Hongshan Yu, Yihao Ding et al.

3D Scene Question Answering (3D SQA) represents an interdisciplinary task that integrates 3D visual perception and natural language processing, empowering intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modelling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. In this survey, we provide the first comprehensive and systematic review of 3D SQA. We organize existing work from three perspectives: datasets, methodologies, and evaluation metrics. Beyond basic categorization, we identify shared architectural patterns across methods. Our survey further synthesizes core limitations and discusses how current trends, such as instruction tuning, multimodal alignment, and zero-shot, can shape future developments. Finally, we propose a range of promising research directions covering dataset construction, task generalization, interaction modeling, and unified evaluation protocols. This work aims to serve as a foundation for future research and foster progress toward more generalizable and intelligent 3D SQA systems.

MLMar 12, 2024
Knowledge Transfer across Multiple Principal Component Analysis Studies

Zeyu Li, Kangxiang Qin, Yong He et al.

Transfer learning has aroused great interest in the statistical community. In this article, we focus on knowledge transfer for unsupervised learning tasks in contrast to the supervised learning tasks in the literature. Given the transferable source populations, we propose a two-step transfer learning algorithm to extract useful information from multiple source principal component analysis (PCA) studies, thereby enhancing estimation accuracy for the target PCA task. In the first step, we integrate the shared subspace information across multiple studies by a proposed method named as Grassmannian barycenter, instead of directly performing PCA on the pooled dataset. The proposed Grassmannian barycenter method enjoys robustness and computational advantages in more general cases. Then the resulting estimator for the shared subspace from the first step is further utilized to estimate the target private subspace in the second step. Our theoretical analysis credits the gain of knowledge transfer between PCA studies to the enlarged eigenvalue gap, which is different from the existing supervised transfer learning tasks where sparsity plays the central role. In addition, we prove that the bilinear forms of the empirical spectral projectors have asymptotic normality under weaker eigenvalue gap conditions after knowledge transfer. When the set of informativesources is unknown, we endow our algorithm with the capability of useful dataset selection by solving a rectified optimization problem on the Grassmann manifold, which in turn leads to a computationally friendly rectified Grassmannian K-means procedure. In the end, extensive numerical simulation results and a real data case concerning activity recognition are reported to support our theoretical claims and to illustrate the empirical usefulness of the proposed transfer learning methods.

MLDec 9, 2024
Representational Transfer Learning for Matrix Completion

Yong He, Zeyu Li, Dong Liu et al.

We propose to transfer representational knowledge from multiple sources to a target noisy matrix completion task by aggregating singular subspaces information. Under our representational similarity framework, we first integrate linear representation information by solving a two-way principal component analysis problem based on a properly debiased matrix-valued dataset. After acquiring better column and row representation estimators from the sources, the original high-dimensional target matrix completion problem is then transformed into a low-dimensional linear regression, of which the statistical efficiency is guaranteed. A variety of extensional arguments, including post-transfer statistical inference and robustness against negative transfer, are also discussed alongside. Finally, extensive simulation results and a number of real data cases are reported to support our claims.

CVNov 13, 2024
Biomass phenotyping of oilseed rape through UAV multi-view oblique imaging with 3DGS and SAM model

Yutao Shen, Hongyu Zhou, Xin Yang et al.

Biomass estimation of oilseed rape is crucial for optimizing crop productivity and breeding strategies. While UAV-based imaging has advanced high-throughput phenotyping, current methods often rely on orthophoto images, which struggle with overlapping leaves and incomplete structural information in complex field environments. This study integrates 3D Gaussian Splatting (3DGS) with the Segment Anything Model (SAM) for precise 3D reconstruction and biomass estimation of oilseed rape. UAV multi-view oblique images from 36 angles were used to perform 3D reconstruction, with the SAM module enhancing point cloud segmentation. The segmented point clouds were then converted into point cloud volumes, which were fitted to ground-measured biomass using linear regression. The results showed that 3DGS (7k and 30k iterations) provided high accuracy, with peak signal-to-noise ratios (PSNR) of 27.43 and 29.53 and training times of 7 and 49 minutes, respectively. This performance exceeded that of structure from motion (SfM) and mipmap Neural Radiance Fields (Mip-NeRF), demonstrating superior efficiency. The SAM module achieved high segmentation accuracy, with a mean intersection over union (mIoU) of 0.961 and an F1-score of 0.980. Additionally, a comparison of biomass extraction models found the point cloud volume model to be the most accurate, with an determination coefficient (R2) of 0.976, root mean square error (RMSE) of 2.92 g/plant, and mean absolute percentage error (MAPE) of 6.81%, outperforming both the plot crop volume and individual crop volume models. This study highlights the potential of combining 3DGS with multi-view UAV imaging for improved biomass phenotyping.

AIAug 12, 2025
AgriGPT: a Large Language Model Ecosystem for Agriculture

Bo Yang, Yu Zhang, Lanfei Feng et al.

Despite the rapid progress of Large Language Models (LLMs), their application in agriculture remains limited due to the lack of domain-specific models, curated datasets, and robust evaluation frameworks. To address these challenges, we propose AgriGPT, a domain-specialized LLM ecosystem for agricultural usage. At its core, we design a multi-agent scalable data engine that systematically compiles credible data sources into Agri-342K, a high-quality, standardized question-answer (QA) dataset. Trained on this dataset, AgriGPT supports a broad range of agricultural stakeholders, from practitioners to policy-makers. To enhance factual grounding, we employ Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning, thereby improving the LLM's reasoning reliability. For comprehensive evaluation, we introduce AgriBench-13K, a benchmark suite comprising 13 tasks with varying types and complexities. Experiments demonstrate that AgriGPT significantly outperforms general-purpose LLMs on both domain adaptation and reasoning. Beyond the model itself, AgriGPT represents a modular and extensible LLM ecosystem for agriculture, comprising structured data construction, retrieval-enhanced generation, and domain-specific evaluation. This work provides a generalizable framework for developing scientific and industry-specialized LLMs. All models, datasets, and code will be released to empower agricultural communities, especially in underserved regions, and to promote open, impactful research.

AISep 30, 2025
Collaborative Compression for Large-Scale MoE Deployment on Edge

Yixiao Chen, Yanyue Xie, Ruining Yang et al.

The Mixture of Experts (MoE) architecture is an important method for scaling Large Language Models (LLMs). It increases model capacity while keeping computation cost low. However, the ultra-large MoE models still have hundreds of billions of parameters, requiring massive memory/storage and leading to difficulties for deployment on resource-constrained edge platforms. Pruning or quantization alone can hardly address the issue, because of the super-aggressive compression ratio with significantly degraded accuracy and output quality. To facilitate the deployment of ultra-large MoEs on edge platforms, we propose a collaborative compression framework by combining expert pruning, mixed-precision quantization, and activation optimization. It can effectively reduce the storage footprint of the ultra-large MoE DeepSeek-V3 from 1.3TB to 103GB, while preserving high output quality with better accuracy than traditional uniform low-bit quantization methods. To the best of our knowledge, we are the first to deploy a compressed model from the ultra-large DeepSeek-V3 on the platform with a strict 128GB total memory limit. Our comprehensive experiments on multiple benchmarks under various memory constraints demonstrate the effectiveness of our method with smaller model sizes and higher accuracy than uniform low-bit quantization methods.

CVApr 26, 2025
Depth as Points: Center Point-based Depth Estimation

Zhiheng Tu, Xinjian Huang, Yong He et al.

The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection, imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Leveraging this method, we construct the virtual depth estimation dataset VirDepth, a large-scale, multi-task autonomous driving dataset. Subsequently, we propose CenterDepth, a lightweight architecture for monocular depth estimation that ensures high operational efficiency and exhibits superior performance in depth estimation tasks with highly imbalanced height-scale distributions. CenterDepth integrates global semantic information through the innovative Center FC-CRFs algorithm, aggregates multi-scale features based on object key points, and enables detection-based depth estimation of targets. Experiments demonstrate that our proposed method achieves superior performance in terms of both computational speed and prediction accuracy.

CVMar 8, 2025
PointDiffuse: A Dual-Conditional Diffusion Model for Enhanced Point Cloud Semantic Segmentation

Yong He, Hongshan Yu, Mingtao Feng et al.

Diffusion probabilistic models are traditionally used to generate colors at fixed pixel positions in 2D images. Building on this, we extend diffusion models to point cloud semantic segmentation, where point positions also remain fixed, and the diffusion model generates point labels instead of colors. To accelerate the denoising process in reverse diffusion, we introduce a noisy label embedding mechanism. This approach integrates semantic information into the noisy label, providing an initial semantic reference that improves the reverse diffusion efficiency. Additionally, we propose a point frequency transformer that enhances the adjustment of high-level context in point clouds. To reduce computational complexity, we introduce the position condition into MLP and propose denoising PointNet to process the high-resolution point cloud without sacrificing geometric details. Finally, we integrate the proposed noisy label embedding, point frequency transformer and denoising PointNet in our proposed dual conditional diffusion model-based network (PointDiffuse) to perform large-scale point cloud semantic segmentation. Extensive experiments on five benchmarks demonstrate the superiority of PointDiffuse, achieving the state-of-the-art mIoU of 74.2\% on S3DIS Area 5, 81.2\% on S3DIS 6-fold and 64.8\% on SWAN dataset.

NCMar 4, 2025
TReND: Transformer derived features and Regularized NMF for neonatal functional network Delineation

Sovesh Mohapatra, Minhui Ouyang, Shufang Tan et al.

Precise parcellation of functional networks (FNs) of early developing human brain is the fundamental basis for identifying biomarker of developmental disorders and understanding functional development. Resting-state fMRI (rs-fMRI) enables in vivo exploration of functional changes, but adult FN parcellations cannot be directly applied to the neonates due to incomplete network maturation. No standardized neonatal functional atlas is currently available. To solve this fundamental issue, we propose TReND, a novel and fully automated self-supervised transformer-autoencoder framework that integrates regularized nonnegative matrix factorization (RNMF) to unveil the FNs in neonates. TReND effectively disentangles spatiotemporal features in voxel-wise rs-fMRI data. The framework integrates confidence-adaptive masks into transformer self-attention layers to mitigate noise influence. A self supervised decoder acts as a regulator to refine the encoder's latent embeddings, which serve as reliable temporal features. For spatial coherence, we incorporate brain surface-based geodesic distances as spatial encodings along with functional connectivity from temporal features. The TReND clustering approach processes these features under sparsity and smoothness constraints, producing robust and biologically plausible parcellations. We extensively validated our TReND framework on three different rs-fMRI datasets: simulated, dHCP and HCP-YA against comparable traditional feature extraction and clustering techniques. Our results demonstrated the superiority of the TReND framework in the delineation of neonate FNs with significantly better spatial contiguity and functional homogeneity. Collectively, we established TReND, a novel and robust framework, for neonatal FN delineation. TReND-derived neonatal FNs could serve as a neonatal functional atlas for perinatal populations in health and disease.

LGJan 8, 2025
Integrating remote sensing data assimilation, deep learning and large language model for interactive wheat breeding yield prediction

Guofeng Yang, Nanfei Jin, Wenjie Ai et al.

Yield is one of the core goals of crop breeding. By predicting the potential yield of different breeding materials, breeders can screen these materials at various growth stages to select the best performing. Based on unmanned aerial vehicle remote sensing technology, high-throughput crop phenotyping data in breeding areas is collected to provide data support for the breeding decisions of breeders. However, the accuracy of current yield predictions still requires improvement, and the usability and user-friendliness of yield forecasting tools remain suboptimal. To address these challenges, this study introduces a hybrid method and tool for crop yield prediction, designed to allow breeders to interactively and accurately predict wheat yield by chatting with a large language model (LLM). First, the newly designed data assimilation algorithm is used to assimilate the leaf area index into the WOFOST model. Then, selected outputs from the assimilation process, along with remote sensing inversion results, are used to drive the time-series temporal fusion transformer model for wheat yield prediction. Finally, based on this hybrid method and leveraging an LLM with retrieval augmented generation technology, we developed an interactive yield prediction Web tool that is user-friendly and supports sustainable data updates. This tool integrates multi-source data to assist breeding decision-making. This study aims to accelerate the identification of high-yield materials in the breeding process, enhance breeding efficiency, and enable more scientific and smart breeding decisions.

CVMar 21, 2024
Soft Masked Transformer for Point Cloud Processing with Skip Attention-Based Upsampling

Yong He, Hongshan Yu, Muhammad Ibrahim et al.

Point cloud processing methods leverage local and global point features %at the feature level to cater to downstream tasks, yet they often overlook the task-level context inherent in point clouds during the encoding stage. We argue that integrating task-level information into the encoding stage significantly enhances performance. To that end, we propose SMTransformer which incorporates task-level information into a vector-based transformer by utilizing a soft mask generated from task-level queries and keys to learn the attention weights. Additionally, to facilitate effective communication between features from the encoding and decoding layers in high-level tasks such as segmentation, we introduce a skip-attention-based up-sampling block. This block dynamically fuses features from various resolution points across the encoding and decoding layers. To mitigate the increase in network parameters and training time resulting from the complexity of the aforementioned blocks, we propose a novel shared position encoding strategy. This strategy allows various transformer blocks to share the same position information over the same resolution points, thereby reducing network parameters and training time without compromising accuracy.Experimental comparisons with existing methods on multiple datasets demonstrate the efficacy of SMTransformer and skip-attention-based up-sampling for point cloud processing tasks, including semantic segmentation and classification. In particular, we achieve state-of-the-art semantic segmentation results of 73.4% mIoU on S3DIS Area 5 and 62.4% mIoU on SWAN dataset

CVMar 26, 2021
MagDR: Mask-guided Detection and Reconstruction for Defending Deepfakes

Zhikai Chen, Lingxi Xie, Shanmin Pang et al.

Deepfakes raised serious concerns on the authenticity of visual contents. Prior works revealed the possibility to disrupt deepfakes by adding adversarial perturbations to the source data, but we argue that the threat has not been eliminated yet. This paper presents MagDR, a mask-guided detection and reconstruction pipeline for defending deepfakes from adversarial attacks. MagDR starts with a detection module that defines a few criteria to judge the abnormality of the output of deepfakes, and then uses it to guide a learnable reconstruction procedure. Adaptive masks are extracted to capture the change in local facial regions. In experiments, MagDR defends three main tasks of deepfakes, and the learned reconstruction pipeline transfers across input data, showing promising performance in defending both black-box and white-box attacks.

CVMar 9, 2021
Deep Learning Based 3D Segmentation: A Survey

Yong He, Hongshan Yu, Xiaoyan Liu et al.

3D segmentation is a fundamental and challenging problem in computer vision with applications in autonomous driving and robotics. It has received significant attention from the computer vision, graphics and machine learning communities. Conventional methods for 3D segmentation, based on hand-crafted features and machine learning classifiers, lack generalization ability. Driven by their success in 2D computer vision, deep learning techniques have recently become the tool of choice for 3D segmentation tasks. This has led to an influx of many methods in the literature that have been evaluated on different benchmark datasets. Whereas survey papers on RGB-D and point cloud segmentation exist, there is a lack of a recent in-depth survey that covers all 3D data modalities and application domains. This paper fills the gap and comprehensively surveys the recent progress in deep learning-based 3D segmentation techniques. We cover over 220 works from the last six years, analyze their strengths and limitations, and discuss their competitive results on benchmark datasets. The survey provides a summary of the most commonly used pipelines and finally highlights promising research directions for the future.

CVDec 18, 2020
Self-supervised Learning with Fully Convolutional Networks

Zhengeng Yang, Hongshan Yu, Yong He et al.

Although deep learning based methods have achieved great success in many computer vision tasks, their performance relies on a large number of densely annotated samples that are typically difficult to obtain. In this paper, we focus on the problem of learning representation from unlabeled data for semantic segmentation. Inspired by two patch-based methods, we develop a novel self-supervised learning framework by formulating the Jigsaw Puzzle problem as a patch-wise classification process and solving it with a fully convolutional network. By learning to solve a Jigsaw Puzzle problem with 25 patches and transferring the learned features to semantic segmentation task on Cityscapes dataset, we achieve a 5.8 percentage point improvement over the baseline model that initialized from random values. Moreover, experiments show that our self-supervised learning method can be applied to different datasets and models. In particular, we achieved competitive performance with the state-of-the-art methods on the PASCAL VOC2012 dataset using significant fewer training images.

ROAug 22, 2020
Fast ORB-SLAM without Keypoint Descriptors

Qiang Fu, Hongshan Yu, Xiaolong Wang et al.

Indirect methods for visual SLAM are gaining popularity due to their robustness to environmental variations. ORB-SLAM2 \cite{orbslam2} is a benchmark method in this domain, however, it consumes significant time for computing descriptors that never get reused unless a frame is selected as a keyframe. To overcome these problems, we present FastORB-SLAM which is lightweight and efficient as it tracks keypoints between adjacent frames without computing descriptors. To achieve this, a two-stage coarse-to-fine descriptor independent keypoint matching method is proposed based on sparse optical flow. In the first stage, we predict initial keypoint correspondences via a simple but effective motion model and then robustly establish the correspondences via pyramid-based sparse optical flow tracking. In the second stage, we leverage the constraints of the motion smoothness and epipolar geometry to refine the correspondences. In particular, our method computes descriptors only for keyframes. We test FastORB-SLAM on \textit{TUM} and \textit{ICL-NUIM} RGB-D datasets and compare its accuracy and efficiency to nine existing RGB-D SLAM methods. Qualitative and quantitative results show that our method achieves state-of-the-art accuracy and is about twice as fast as the ORB-SLAM2.

CYJul 8, 2020
Enable an Open Software Defined Mobility Ecosystem through VEC-OF

Sanchu Han, Yong He, Yin Ding

OEMs and new entrants can take the Mobility as a Service market (MaaS) as the entry point, upgrade its E/E (Electric and Electronic) architecture to be C/C (Computing and Communication) architecture, build one open software defined and data driven software platform for its production and service model, use efficient and collaborative ways of vehicles, roads, cloud and network to continuously improve core technologies such as autonomous driving, provide MaaS operators with an affordable and agile platform. In this paper we present one new framework, VEC-OF (Vehicle-Edge-Cloud Open Framework), which is a new data and AI centric vehicle software framework enabling a much safer, more efficient, connected and trusted MaaS through cooperative vehicle, infrastructure and cloud capabilities and intelligence

CVDec 10, 2019
Appending Adversarial Frames for Universal Video Attack

Zhikai Chen, Lingxi Xie, Shanmin Pang et al.

There have been many efforts in attacking image classification models with adversarial perturbations, but the same topic on video classification has not yet been thoroughly studied. This paper presents a novel idea of video-based attack, which appends a few dummy frames (e.g., containing the texts of `thanks for watching') to a video clip and then adds adversarial perturbations only on these new frames. Our approach enjoys three major benefits, namely, a high success rate, a low perceptibility, and a strong ability in transferring across different networks. These benefits mostly come from the common dummy frame which pushes all samples towards the boundary of classification. On the other hand, such attacks are easily to be concealed since most people would not notice the abnormality behind the perturbed video clips. We perform experiments on two popular datasets with six state-of-the-art video classification models, and demonstrate the effectiveness of our approach in the scenario of universal video attacks.

MANov 16, 2019
Optimizing Cooperative path-finding: A Scalable Multi-Agent RRT* with Dynamic Potential Fields

Jinmingwu Jiang, Kaigui Wu, Haiyang Liu et al.

Cooperative path-finding in multi-agent systems demands scalable solutions to navigate agents from their origins to destinations without conflict. Despite the breadth of research, scalability remains hampered by increased computational demands in complex environments. This study introduces the multi-agent RRT* potential field (MA-RRT*PF), an innovative algorithm that addresses computational efficiency and path-finding efficacy in dense scenarios. MA-RRT*PF integrates a dynamic potential field with a heuristic method, advancing obstacle avoidance and optimizing the expansion of random trees in congested spaces. The empirical evaluations highlight MA-RRT*PF's significant superiority over conventional multi-agent RRT* (MA-RRT*) in dense environments, offering enhanced performance and solution quality without compromising integrity. This work not only contributes a novel approach to the field of cooperative multi-agent path-finding but also offers a new perspective for practical applications in densely populated settings where traditional methods are less effective.

SOFTSep 11, 2017
Confined subdiffusion in three dimensions

Shanlin Qin, Yong He

The three-dimensional (3D) Fick's diffusion equation and fractional diffusion equation are solved for different reflecting boundaries. We use the continuous time random walk model (CTRW) to investigate the time averaged mean square displacement (MSD) of 3D single particle trajectory. Theoretical results show the ensemble average of the time averaged MSD can be expressed analytically by a Mittag-Leffler function. Our new expression is in agreement with previous formulas in two limiting cases which are $ < \overline{δ^2} > \simΔ$ in short lag time and $ < \overline{δ^2}> \simΔ^{1-α}$ in long lag time. We also simulate the experimental data of mRNA diffusion in living E. coli using 3D CTRW model under confined and crowded conditions. The simulated results are well consistent with experimental results. The calculations of power spectral density (PSD) indicate further the subdiffsive behavior of individual trajectory.

CLMar 30, 2016
Model Interpolation with Trans-dimensional Random Field Language Models for Speech Recognition

Bin Wang, Zhijian Ou, Yong He et al.

The dominant language models (LMs) such as n-gram and neural network (NN) models represent sentence probabilities in terms of conditionals. In contrast, a new trans-dimensional random field (TRF) LM has been recently introduced to show superior performances, where the whole sentence is modeled as a random field. In this paper, we examine how the TRF models can be interpolated with the NN models, and obtain 12.1\% and 17.9\% relative error rate reductions over 6-gram LMs for English and Chinese speech recognition respectively through log-linear combination.