Jing Zhu

CV
h-index40
28papers
1,346citations
Novelty48%
AI Score54

28 Papers

52.1ITMay 30
Hybrid Bit and Semantic Communications for UAV-Enabled Wireless Power Transfer Networks: A Decision-Assisted Deep Reinforcement Learning Approach

Jingfu Li, Jingjing Cui, Chong Huang et al.

Semantic communications which can significantly reduce spectrum consumption in wireless networks, have recently become a popular research area. When combined with wireless power transfer (WPT), semantic communications can help achieve high spectral efficiency for energy-limited devices in wireless communications. In energy-constrained and link budget-limited scenarios such as UAV networks, the integration of semantic communications and WPT enables highly energyefficient transmission mechanisms. In this paper, we investigate semantic communications in UAV-enabled WPT networks. To achieve adaptability to varying signal-to-noise ratio (SNR) and task requirements, we introduce a multi-layer hybrid bit and semantic communication framework. We adopt a semantic communication efficiency metric and aim to maximize it by jointly optimizing UAV trajectory, energy harvesting base station (EHBS) selection, user association, semantic mode selection, and energy harvesting time allocation. To address this complex longterm optimization problem, we introduce the distributional soft actor-critic (DSAC) algorithm and introduce a decision assistant to further enhance the convergence performance of DSAC. Simulation results validate the effectiveness of the proposed method and framework and demonstrate that our algorithm can achieve superior long-term optimization performance in dynamic network environments.

98.1CLJun 4
YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

PSBC LLM Team, Huawei LLM Team, Ruihan Long et al.

Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

LGSep 25, 2023
TouchUp-G: Improving Feature Representation through Graph-Centric Finetuning

Jing Zhu, Xiang Song, Vassilis N. Ioannidis et al.

How can we enhance the node features acquired from Pretrained Models (PMs) to better suit downstream graph learning tasks? Graph Neural Networks (GNNs) have become the state-of-the-art approach for many high-impact, real-world graph applications. For feature-rich graphs, a prevalent practice involves utilizing a PM directly to generate features, without incorporating any domain adaptation techniques. Nevertheless, this practice is suboptimal because the node features extracted from PM are graph-agnostic and prevent GNNs from fully utilizing the potential correlations between the graph structure and node features, leading to a decline in GNNs performance. In this work, we seek to improve the node features obtained from a PM for downstream graph tasks and introduce TOUCHUP-G, which has several advantages. It is (a) General: applicable to any downstream graph task, including link prediction which is often employed in recommender systems; (b) Multi-modal: able to improve raw features of any modality (e.g. images, texts, audio); (c) Principled: it is closely related to a novel metric, feature homophily, which we propose to quantify the potential correlations between the graph structure and node features and we show that TOUCHUP-G can effectively shrink the discrepancy between the graph structure and node features; (d) Effective: achieving state-of-the-art results on four real-world datasets spanning different tasks and modalities.

CVNov 22, 2022
Touch and Go: Learning from Human-Collected Vision and Touch

Fengyu Yang, Chenyang Ma, Jiacheng Zhang et al.

The ability to associate touch with sight is essential for tasks that require physically interacting with objects in the world. We propose a dataset with paired visual and tactile data called Touch and Go, in which human data collectors probe objects in natural environments using tactile sensors, while simultaneously recording egocentric video. In contrast to previous efforts, which have largely been confined to lab settings or simulated environments, our dataset spans a large number of "in the wild" objects and scenes. To demonstrate our dataset's effectiveness, we successfully apply it to a variety of tasks: 1) self-supervised visuo-tactile feature learning, 2) tactile-driven image stylization, i.e., making the visual appearance of an object more consistent with a given tactile signal, and 3) predicting future frames of a tactile signal from visuo-tactile inputs.

LGJun 1, 2023
Pitfalls in Link Prediction with Graph Neural Networks: Understanding the Impact of Target-link Inclusion & Better Practices

Jing Zhu, Yuhang Zhou, Vassilis N. Ioannidis et al.

While Graph Neural Networks (GNNs) are remarkably successful in a variety of high-impact applications, we demonstrate that, in link prediction, the common practices of including the edges being predicted in the graph at training and/or test have outsized impact on the performance of low-degree nodes. We theoretically and empirically investigate how these practices impact node-level performance across different degrees. Specifically, we explore three issues that arise: (I1) overfitting; (I2) distribution shift; and (I3) implicit test leakage. The former two issues lead to poor generalizability to the test data, while the latter leads to overestimation of the model's performance and directly impacts the deployment of GNNs. To address these issues in a systematic way, we introduce an effective and efficient GNN training framework, SpotTarget, which leverages our insight on low-degree nodes: (1) at training time, it excludes a (training) edge to be predicted if it is incident to at least one low-degree node; and (2) at test time, it excludes all test edges to be predicted (thus, mimicking real scenarios of using GNNs, where the test data is not included in the graph). SpotTarget helps researchers and practitioners adhere to best practices for learning from graph data, which are frequently overlooked even by the most widely-used frameworks. Our experiments on various real-world datasets show that SpotTarget makes GNNs up to 15x more accurate in sparse graphs, and significantly improves their performance for low-degree nodes in dense graphs.

LGSep 26, 2024
On the Impact of Feature Heterophily on Link Prediction with Graph Neural Networks

Jiong Zhu, Gaotang Li, Yao-An Yang et al.

Heterophily, or the tendency of connected nodes in networks to have different class labels or dissimilar features, has been identified as challenging for many Graph Neural Network (GNN) models. While the challenges of applying GNNs for node classification when class labels display strong heterophily are well understood, it is unclear how heterophily affects GNN performance in other important graph learning tasks where class labels are not available. In this work, we focus on the link prediction task and systematically analyze the impact of heterophily in node features on GNN performance. Theoretically, we first introduce formal definitions of homophilic and heterophilic link prediction tasks, and present a theoretical framework that highlights the different optimizations needed for the respective tasks. We then analyze how different link prediction encoders and decoders adapt to varying levels of feature homophily and introduce designs for improved performance. Our empirical analysis on a variety of synthetic and real-world datasets confirms our theoretical insights and highlights the importance of adopting learnable decoders and GNN encoders with ego- and neighbor-embedding separation in message passing for link prediction tasks beyond homophily.

IVJul 13, 2023
Full-resolution Lung Nodule Segmentation from Chest X-ray Images using Residual Encoder-Decoder Networks

Michael James Horry, Subrata Chakraborty, Biswajeet Pradhan et al.

Lung cancer is the leading cause of cancer death and early diagnosis is associated with a positive prognosis. Chest X-ray (CXR) provides an inexpensive imaging mode for lung cancer diagnosis. Suspicious nodules are difficult to distinguish from vascular and bone structures using CXR. Computer vision has previously been proposed to assist human radiologists in this task, however, leading studies use down-sampled images and computationally expensive methods with unproven generalization. Instead, this study localizes lung nodules using efficient encoder-decoder neural networks that process full resolution images to avoid any signal loss resulting from down-sampling. Encoder-decoder networks are trained and tested using the JSRT lung nodule dataset. The networks are used to localize lung nodules from an independent external CXR dataset. Sensitivity and false positive rates are measured using an automated framework to eliminate any observer subjectivity. These experiments allow for the determination of the optimal network depth, image resolution and pre-processing pipeline for generalized lung nodule localization. We find that nodule localization is influenced by subtlety, with more subtle nodules being detected in earlier training epochs. Therefore, we propose a novel self-ensemble model from three consecutive epochs centered on the validation optimum. This ensemble achieved a sensitivity of 85% in 10-fold internal testing with false positives of 8 per image. A sensitivity of 81% is achieved at a false positive rate of 6 following morphological false positive reduction. This result is comparable to more computationally complex systems based on linear and spatial filtering, but with a sub-second inference time that is faster than other methods. The proposed algorithm achieved excellent generalization results against an external dataset with sensitivity of 77% at a false positive rate of 7.6.

SIAug 23, 2022
CAPER: Coarsen, Align, Project, Refine - A General Multilevel Framework for Network Alignment

Jing Zhu, Danai Koutra, Mark Heimann

Network alignment, or the task of finding corresponding nodes in different networks, is an important problem formulation in many application domains. We propose CAPER, a multilevel alignment framework that Coarsens the input graphs, Aligns the coarsened graphs, Projects the alignment solution to finer levels and Refines the alignment solution. We show that CAPER can improve upon many different existing network alignment algorithms by enforcing alignment consistency across multiple graph resolutions: nodes matched at finer levels should also be matched at coarser levels. CAPER also accelerates the use of slower network alignment methods, at the modest cost of linear-time coarsening and refinement steps, by allowing them to be run on smaller coarsened versions of the input graphs. Experiments show that CAPER can improve upon diverse network alignment methods by an average of 33% in accuracy and/or an order of magnitude faster in runtime.

SPSep 13, 2022
Weight-based Channel-model Matrix Framework provides a reasonable solution for EEG-based cross-dataset emotion recognition

Huayu Chen, Huanhuan He, Jing Zhu et al.

Cross-dataset emotion recognition as an extremely challenging task in the field of EEG-based affective computing is influenced by many factors, which makes the universal models yield unsatisfactory results. Facing the situation that lacks EEG information decoding research, we first analyzed the impact of different EEG information(individual, session, emotion and trial) for emotion recognition by sample space visualization, sample aggregation phenomena quantification, and energy pattern analysis on five public datasets. Based on these phenomena and patterns, we provided the processing methods and interpretable work of various EEG differences. Through the analysis of emotional feature distribution patterns, the Individual Emotional Feature Distribution Difference(IEFDD) was found, which was also considered as the main factor of the stability for emotion recognition. After analyzing the limitations of traditional modeling approach suffering from IEFDD, the Weight-based Channel-model Matrix Framework(WCMF) was proposed. To reasonably characterize emotional feature distribution patterns, four weight extraction methods were designed, and the optimal was the correction T-test(CT) weight extraction method. Finally, the performance of WCMF was validated on cross-dataset tasks in two kinds of experiments that simulated different practical scenarios, and the results showed that WCMF had more stable and better emotion recognition ability.

AINov 11, 2023
BClean: A Bayesian Data Cleaning System

Jianbin Qin, Sifan Huang, Yaoshu Wang et al.

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of prevalent approaches is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., Gaussian distribution), which is frequently underfitted in practice, or they necessitate experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.

CLMay 21, 2025Code
DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Yuhang Zhou, Jing Zhu, Shengyi Qian et al.

Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups, assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks. Our code and data are available at https://github.com/Tonyzhou98/disco_grpo.

CVNov 9, 2023
Spatial Attention-based Distribution Integration Network for Human Pose Estimation

Sihan Gao, Jing Zhu, Xiaoxuan Zhuang et al.

In recent years, human pose estimation has made significant progress through the implementation of deep learning techniques. However, these techniques still face limitations when confronted with challenging scenarios, including occlusion, diverse appearances, variations in illumination, and overlap. To cope with such drawbacks, we present the Spatial Attention-based Distribution Integration Network (SADI-NET) to improve the accuracy of localization in such situations. Our network consists of three efficient models: the receptive fortified module (RFM), spatial fusion module (SFM), and distribution learning module (DLM). Building upon the classic HourglassNet architecture, we replace the basic block with our proposed RFM. The RFM incorporates a dilated residual block and attention mechanism to expand receptive fields while enhancing sensitivity to spatial information. In addition, the SFM incorporates multi-scale characteristics by employing both global and local attention mechanisms. Furthermore, the DLM, inspired by residual log-likelihood estimation (RLE), reconfigures a predicted heatmap using a trainable distribution weight. For the purpose of determining the efficacy of our model, we conducted extensive experiments on the MPII and LSP benchmarks. Particularly, our model obtained a remarkable $92.10\%$ percent accuracy on the MPII test dataset, demonstrating significant improvements over existing models and establishing state-of-the-art performance.

CLNov 10, 2025
LLM Optimization Unlocks Real-Time Pairwise Reranking

Jingyu Wu, Aditya Shrivastava, Jing Zhu et al.

Efficiently reranking documents retrieved from information retrieval (IR) pipelines to enhance overall quality of Retrieval-Augmented Generation (RAG) system remains an important yet challenging problem. Recent studies have highlighted the importance of Large Language Models (LLMs) in reranking tasks. In particular, Pairwise Reranking Prompting (PRP) has emerged as a promising plug-and-play approach due to its usability and effectiveness. However, the inherent complexity of the algorithm, coupled with the high computational demands and latency incurred due to LLMs, raises concerns about its feasibility in real-time applications. To address these challenges, this paper presents a focused study on pairwise reranking, demonstrating that carefully applied optimization methods can significantly mitigate these issues. By implementing these methods, we achieve a remarkable latency reduction of up to 166 times, from 61.36 seconds to 0.37 seconds per query, with an insignificant drop in performance measured by Recall@k. Our study highlights the importance of design choices that were previously overlooked, such as using smaller models, limiting the reranked set, using lower precision, reducing positional bias with one-directional order inference, and restricting output tokens. These optimizations make LLM-based reranking substantially more efficient and feasible for latency-sensitive, real-world deployments.

IRMar 30, 2025
Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics

Jing Zhu, Mingxuan Ju, Yozen Liu et al.

Generative recommendation (GR) has become a powerful paradigm in recommendation systems that implicitly links modality and semantics to item representation, in contrast to previous methods that relied on non-semantic item identifiers in autoregressive models. However, previous research has predominantly treated modalities in isolation, typically assuming item content is unimodal (usually text). We argue that this is a significant limitation given the rich, multimodal nature of real-world data and the potential sensitivity of GR models to modality choices and usage. Our work aims to explore the critical problem of Multimodal Generative Recommendation (MGR), highlighting the importance of modality choices in GR nframeworks. We reveal that GR models are particularly sensitive to different modalities and examine the challenges in achieving effective GR when multiple modalities are available. By evaluating design strategies for effectively leveraging multiple modalities, we identify key challenges and introduce MGR-LF++, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20% compared to single-modality alternatives.

NIOct 30, 2024
NetworkGym: Reinforcement Learning Environments for Multi-Access Traffic Management in Network Simulation

Momin Haider, Ming Yin, Menglei Zhang et al. · princeton

Mobile devices such as smartphones, laptops, and tablets can often connect to multiple access networks (e.g., Wi-Fi, LTE, and 5G) simultaneously. Recent advancements facilitate seamless integration of these connections below the transport layer, enhancing the experience for apps that lack inherent multi-path support. This optimization hinges on dynamically determining the traffic distribution across networks for each device, a process referred to as \textit{multi-access traffic splitting}. This paper introduces \textit{NetworkGym}, a high-fidelity network environment simulator that facilitates generating multiple network traffic flows and multi-access traffic splitting. This simulator facilitates training and evaluating different RL-based solutions for the multi-access traffic splitting problem. Our initial explorations demonstrate that the majority of existing state-of-the-art offline RL algorithms (e.g. CQL) fail to outperform certain hand-crafted heuristic policies on average. This illustrates the urgent need to evaluate offline RL algorithms against a broader range of benchmarks, rather than relying solely on popular ones such as D4RL. We also propose an extension to the TD3+BC algorithm, named Pessimistic TD3 (PTD3), and demonstrate that it outperforms many state-of-the-art offline RL algorithms. PTD3's behavioral constraint mechanism, which relies on value-function pessimism, is theoretically motivated and relatively simple to implement.

DLFeb 15
FMMD: A multimodal open peer review dataset based on F1000Research

Zhenzhen Zhuang, Yuqing Fu, Jing Zhu et al.

Automated scholarly paper review (ASPR) has entered the coexistence phase with traditional peer review, where artificial intelligence (AI) systems are increasingly incorporated into real-world manuscript evaluation. In parallel, research on automated and AI-assisted peer review has proliferated. Despite this momentum, empirical progress remains constrained by several critical limitations in existing datasets. While reviewers routinely evaluate figures, tables, and complex layouts to assess scientific claims, most existing datasets remain overwhelmingly text-centric. This bias is reinforced by a narrow focus on data from computer science venues. Furthermore, these datasets lack precise alignment between reviewer comments and specific manuscript versions, obscuring the iterative relationship between peer review and manuscript evolution. In response, we introduce FMMD, a multimodal and multidisciplinary open peer review dataset curated from F1000Research. The dataset bridges the current gap by integrating manuscript-level visual and structural data with version-specific reviewer reports and editorial decisions. By providing explicit alignment between reviewer comments and the exact article iteration under review, FMMD enables fine-grained analysis of the peer review lifecycle across diverse scientific domains. FMMD supports tasks such as multimodal issue detection and multimodal review comment generation. It provides a comprehensive empirical resource for the development of peer review research.

LGJun 24, 2024
Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning

Jing Zhu, Yuhang Zhou, Shengyi Qian et al.

Graph machine learning has made significant strides in recent years, yet the integration of visual information with graph structure and its potential for improving performance in downstream tasks remains an underexplored area. To address this critical gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), a pioneering benchmark that incorporates both visual and textual information into graph learning tasks. MM-GRAPH extends beyond existing text-attributed graph benchmarks, offering a more comprehensive evaluation framework for multimodal graph learning Our benchmark comprises seven diverse datasets of varying scales (ranging from thousands to millions of edges), designed to assess algorithms across different tasks in real-world scenarios. These datasets feature rich multimodal node attributes, including visual data, which enables a more holistic evaluation of various graph learning frameworks in complex, multimodal environments. To support advancements in this emerging field, we provide an extensive empirical study on various graph learning frameworks when presented with features from multiple modalities, particularly emphasizing the impact of visual information. This study offers valuable insights into the challenges and opportunities of integrating visual data into graph learning.

CLJun 19, 2024
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Yuhang Zhou, Jing Zhu, Paiheng Xu et al.

Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.

LGJun 7, 2024
LinkGPT: Teaching Large Language Models To Predict Missing Links

Zhongmou He, Jing Zhu, Shengyi Qian et al.

Large Language Models (LLMs) have shown promising results on various language and vision tasks. Recently, there has been growing interest in applying LLMs to graph-based tasks, particularly on Text-Attributed Graphs (TAGs). However, most studies have focused on node classification, while the use of LLMs for link prediction (LP) remains understudied. In this work, we propose a new task on LLMs, where the objective is to leverage LLMs to predict missing links between nodes in a graph. This task evaluates an LLM's ability to reason over structured data and infer new facts based on learned patterns. This new task poses two key challenges: (1) How to effectively integrate pairwise structural information into the LLMs, which is known to be crucial for LP performance, and (2) how to solve the computational bottleneck when teaching LLMs to perform LP. To address these challenges, we propose LinkGPT, the first end-to-end trained LLM for LP tasks. To effectively enhance the LLM's ability to understand the underlying structure, we design a two-stage instruction tuning approach where the first stage fine-tunes the pairwise encoder, projector, and node projector, and the second stage further fine-tunes the LLMs to predict links. To address the efficiency challenges at inference time, we introduce a retrieval-reranking scheme. Experiments show that LinkGPT can achieve state-of-the-art performance on real-world graphs as well as superior generalization in zero-shot and few-shot learning, surpassing existing benchmarks. At inference time, it can achieve $10\times$ speedup while maintaining high LP accuracy.

CVJan 24, 2022
Debiasing pipeline improves deep learning model generalization for X-ray based lung nodule detection

Michael Horry, Subrata Chakraborty, Biswajeet Pradhan et al.

Lung cancer is the leading cause of cancer death worldwide and a good prognosis depends on early diagnosis. Unfortunately, screening programs for the early diagnosis of lung cancer are uncommon. This is in-part due to the at-risk groups being located in rural areas far from medical facilities. Reaching these populations would require a scaled approach that combines mobility, low cost, speed, accuracy, and privacy. We can resolve these issues by combining the chest X-ray imaging mode with a federated deep-learning approach, provided that the federated model is trained on homogenous data to ensure that no single data source can adversely bias the model at any point in time. In this study we show that an image pre-processing pipeline that homogenizes and debiases chest X-ray images can improve both internal classification and external generalization, paving the way for a low-cost and accessible deep learning-based clinical system for lung cancer screening. An evolutionary pruning mechanism is used to train a nodule detection deep learning model on the most informative images from a publicly available lung nodule X-ray dataset. Histogram equalization is used to remove systematic differences in image brightness and contrast. Model training is performed using all combinations of lung field segmentation, close cropping, and rib suppression operators. We show that this pre-processing pipeline results in deep learning models that successfully generalize an independent lung nodule dataset using ablation studies to assess the contribution of each operator in this pipeline. In stripping chest X-ray images of known confounding variables by lung field segmentation, along with suppression of signal noise from the bone structure we can train a highly accurate deep learning lung nodule detection algorithm with outstanding generalization accuracy of 89% to nodule samples in unseen data.

CVAug 7, 2021
ContinuityLearner: Geometric Continuity Feature Learning for Lane Segmentation

Haoyu Fang, Jing Zhu, Yi Fang

Lane segmentation is a challenging issue in autonomous driving system designing because lane marks show weak textural consistency due to occlusion or extreme illumination but strong geometric continuity in traffic images, from which general convolution neural networks (CNNs) are not capable of learning semantic objects. To empower conventional CNNs in learning geometric clues of lanes, we propose a deep network named ContinuityLearner to better learn geometric prior within lane. Specifically, our proposed CNN-based paradigm involves a novel Context-encoding image feature learning network to generate class-dependent image feature maps and a new encoding layer to exploit the geometric continuity feature representation by fusing both spatial and visual information of lane together. The ContinuityLearner, performing on the geometric continuity feature of lanes, is trained to directly predict the lane in traffic scenarios with integrated and continuous instance semantic. The experimental results on the CULane dataset and the Tusimple benchmark demonstrate that our ContinuityLearner has superior performance over other state-of-the-art techniques in lane segmentation.

SIFeb 26, 2021
Node Proximity Is All You Need: Unified Structural and Positional Node and Graph Embedding

Jing Zhu, Xingyu Lu, Mark Heimann et al.

While most network embedding techniques model the relative positions of nodes in a network, recently there has been significant interest in structural embeddings that model node role equivalences, irrespective of their distances to any specific nodes. We present PhUSION, a proximity-based unified framework for computing structural and positional node embeddings, which leverages well-established methods for calculating node proximity scores. Clarifying a point of contention in the literature, we show which step of PhUSION produces the different kinds of embeddings and what steps can be used by both. Moreover, by aggregating the PhUSION node embeddings, we obtain graph-level features that model information lost by previous graph feature learning and kernel methods. In a comprehensive empirical study with over 10 datasets, 4 tasks, and 35 methods, we systematically reveal successful design choices for node and graph-level machine learning with embeddings.

AINov 15, 2020
NegatER: Unsupervised Discovery of Negatives in Commonsense Knowledge Bases

Tara Safavi, Jing Zhu, Danai Koutra

Codifying commonsense knowledge in machines is a longstanding goal of artificial intelligence. Recently, much progress toward this goal has been made with automatic knowledge base (KB) construction techniques. However, such techniques focus primarily on the acquisition of positive (true) KB statements, even though negative (false) statements are often also important for discriminative reasoning over commonsense KBs. As a first step toward the latter, this paper proposes NegatER, a framework that ranks potential negatives in commonsense KBs using a contextual language model (LM). Importantly, as most KBs do not contain negatives, NegatER relies only on the positive knowledge in the LM and does not require ground-truth negative examples. Experiments demonstrate that, compared to multiple contrastive data augmentation approaches, NegatER yields negatives that are more grammatical, coherent, and informative -- leading to statistically significant accuracy improvements in a challenging KB completion task and confirming that the positive knowledge in LMs can be "re-purposed" to generate negative knowledge.

IRJun 27, 2020
Modeling Long-Term and Short-Term Interests with Parallel Attentions for Session-based Recommendation

Jing Zhu, Yanan Xu, Yanmin Zhu

The aim of session-based recommendation is to predict the users' next clicked item, which is a challenging task due to the inherent uncertainty in user behaviors and anonymous implicit feedback information. A powerful session-based recommender can typically explore the users' evolving interests (i.e., a combination of his/her long-term and short-term interests). Recent advances in attention mechanisms have led to state-of-the-art methods for solving this task. However, there are two main drawbacks. First, most of the attention-based methods only simply utilize the last clicked item to represent the user's short-term interest ignoring the temporal information and behavior context, which may fail to capture the recent preference of users comprehensively. Second, current studies typically think long-term and short-term interests as equally important, but the importance of them should be user-specific. Therefore, we propose a novel Parallel Attention Network model (PAN) for Session-based Recommendation. Specifically, we propose a novel time-aware attention mechanism to learn user's short-term interest by taking into account the contextual information and temporal signals simultaneously. Besides, we introduce a gated fusion method that adaptively integrates the user's long-term and short-term preferences to generate the hybrid interest representation. Experiments on the three real-world datasets show that PAN achieves obvious improvements than the state-of-the-art methods.

DLFeb 20, 2020
MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis

Hanshu Cai, Yiwen Gao, Shuting Sun et al.

According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important reason is due to the lack of physiological indicators for mental disorders. With the rising of tools such as data mining and artificial intelligence, using physiological data to explore new possible physiological indicators of mental disorder and creating new applications for mental disorder diagnosis has become a new research hot topic. However, good quality physiological data for mental disorder patients are hard to acquire. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and audio data from clinically depressed patients and matching normal controls. All our patients were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG dataset includes not only data collected using traditional 128-electrodes mounted elastic cap, but also a novel wearable 3-electrode EEG collector for pervasive applications. The 128-electrodes EEG signals of 53 subjects were recorded as both in resting state and under stimulation; the 3-electrode EEG signals of 55 subjects were recorded in resting state; the audio data of 52 subjects were recorded during interviewing, reading, and picture description. We encourage other researchers in the field to use it for testing their methods of mental-disorder analysis.

CVSep 28, 2019
Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment

Yunxiao Shi, Jing Zhu, Yi Fang et al.

Learning to predict scene depth and camera motion from RGB inputs only is a challenging task. Most existing learning based methods deal with this task in a supervised manner which require ground-truth data that is expensive to acquire. More recent approaches explore the possibility of estimating scene depth and camera pose in a self-supervised learning framework. Despite encouraging results are shown, current methods either learn from monocular videos for depth and pose and typically do so without enforcing multi-view geometry constraints between scene structure and camera motion, or require stereo sequences as input where the ground-truth between-frame motion parameters need to be known. In this paper we propose to jointly optimize the scene depth and camera motion via incorporating differentiable Bundle Adjustment (BA) layer by minimizing the feature-metric error, and then form the photometric consistency loss with view synthesis as the final supervisory signal. The proposed approach only needs unlabeled monocular videos as input, and extensive experiments on the KITTI and Cityscapes dataset show that our method achieves state-of-the-art results in self-supervised approaches using monocular videos as input, and even gains advantage to the line of methods that learns from calibrated stereo sequences (i.e. with pose supervision).

CVSep 10, 2019
Structure-Attentioned Memory Network for Monocular Depth Estimation

Jing Zhu, Yunxiao Shi, Mengwei Ren et al.

Monocular depth estimation is a challenging task that aims to predict a corresponding depth map from a given single RGB image. Recent deep learning models have been proposed to predict the depth from the image by learning the alignment of deep features between the RGB image and the depth domains. In this paper, we present a novel approach, named Structure-Attentioned Memory Network, to more effectively transfer domain features for monocular depth estimation by taking into account the common structure regularities (e.g., repetitive structure patterns, planar surfaces, symmetries) in domain adaptation. To this end, we introduce a new Structure-Oriented Memory (SOM) module to learn and memorize the structure-specific information between RGB image domain and the depth domain. More specifically, in the SOM module, we develop a Memorable Bank of Filters (MBF) unit to learn a set of filters that memorize the structure-aware image-depth residual pattern, and also an Attention Guided Controller (AGC) unit to control the filter selection in the MBF given image features queries. Given the query image feature, the trained SOM module is able to adaptively select the best customized filters for cross-domain feature transferring with an optimal structural disparity between image and depth. In summary, we focus on addressing this structure-specific domain adaption challenge by proposing a novel end-to-end multi-scale memorable network for monocular depth estimation. The experiments show that our proposed model demonstrates the superior performance compared to the existing supervised monocular depth estimation approaches on the challenging KITTI and NYU Depth V2 benchmarks.

CVSep 9, 2019
Learning Object-specific Distance from a Monocular Image

Jing Zhu, Yi Fang, Husam Abu-Haimed et al.

Environment perception, including object detection and distance estimation, is one of the most crucial tasks for autonomous driving. Many attentions have been paid on the object detection task, but distance estimation only arouse few interests in the computer vision community. Observing that the traditional inverse perspective mapping algorithm performs poorly for objects far away from the camera or on the curved road, in this paper, we address the challenging distance estimation problem by developing the first end-to-end learning-based model to directly predict distances for given objects in the images. Besides the introduction of a learning-based base model, we further design an enhanced model with a keypoint regressor, where a projection loss is defined to enforce a better distance estimation, especially for objects close to the camera. To facilitate the research on this task, we construct the extented KITTI and nuScenes (mini) object detection datasets with a distance for each object. Our experiments demonstrate that our proposed methods outperform alternative approaches (e.g., the traditional IPM, SVR) on object-specific distance estimation, particularly for the challenging cases that objects are on a curved road. Moreover, the performance margin implies the effectiveness of our enhanced method.