Jing Zhao

CL
h-index14
11papers
1,576citations
Novelty41%
AI Score30

11 Papers

12.6CVMar 23, 2023Code
MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

Jing Zhao, Heliang Zheng, Chaoyue Wang et al.

The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models that are trained on various datasets. While few explorations have been conducted on ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that can empower the fused text-guided diffusion models to achieve more controllable generation. Specifically, we experimentally find that the responses of classifier-free guidance are highly related to the saliency of generated images. Thus we propose to trust different models in their areas of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a DDIM sampling process. Additionally, it can automatically align the semantics of two noise spaces without requiring additional annotations such as masks. Extensive experiments show the impressive effectiveness of SNB in various applications. Project page is available at https://magicfusion.github.io/.

31.9CLMay 5, 2022Code
LUNA: Learning Slot-Turn Alignment for Dialogue State Tracking

Yifan Wang, Jing Zhao, Junwei Bao et al.

Dialogue state tracking (DST) aims to predict the current dialogue state given the dialogue history. Existing methods generally exploit the utterances of all dialogue turns to assign value for each slot. This could lead to suboptimal results due to the information introduced from irrelevant utterances in the dialogue history, which may be useless and can even cause confusion. To address this problem, we propose LUNA, a sLot-tUrN Alignment enhanced approach. It first explicitly aligns each slot with its most relevant utterance, then further predicts the corresponding value based on this aligned utterance instead of all dialogue utterances. Furthermore, we design a slot ranking auxiliary task to learn the temporal correlation among slots which could facilitate the alignment. Comprehensive experiments are conducted on multi-domain task-oriented dialogue datasets, i.e., MultiWOZ 2.0, MultiWOZ 2.1, and MultiWOZ 2.2. The results show that LUNA achieves new state-of-the-art results on these datasets.

31.8CLApr 29, 2022
OPERA:Operation-Pivoted Discrete Reasoning over Text

Yongwei Zhou, Junwei Bao, Chaoqun Duan et al.

Machine reading comprehension (MRC) that requires discrete reasoning involving symbolic operations, e.g., addition, sorting, and counting, is a challenging task. According to this nature, semantic parsing-based methods predict interpretable but complex logical forms. However, logical form generation is nontrivial and even a little perturbation in a logical form will lead to wrong answers. To alleviate this issue, multi-predictor -based methods are proposed to directly predict different types of answers and achieve improvements. However, they ignore the utilization of symbolic operations and encounter a lack of reasoning ability and interpretability. To inherit the advantages of these two types of methods, we propose OPERA, an operation-pivoted discrete reasoning framework, where lightweight symbolic operations (compared with logical forms) as neural modules are utilized to facilitate the reasoning ability and interpretability. Specifically, operations are first selected and then softly executed to simulate the answer reasoning procedure. Extensive experiments on both DROP and RACENum datasets show the reasoning ability of OPERA. Moreover, further analysis verifies its interpretability.

22.1CLDec 13, 2022Code
TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities

Zhe Zhao, Yudong Li, Cheng Hou et al.

Recently, the success of pre-training in text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities are showing a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is the modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all of common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new one. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.

0.3CLOct 22, 2022
P$^3$LM: Probabilistically Permuted Prophet Language Modeling for Generative Pre-Training

Junwei Bao, Yifan Wang, Jiangyong Ying et al.

Conventional autoregressive left-to-right (L2R) sequence generation faces two issues during decoding: limited to unidirectional target sequence modeling, and constrained on strong local dependencies. To address the aforementioned problem, we propose P$^3$LM, a probabilistically permuted prophet language model, which strengthens the modeling of bidirectional information and long token dependencies for sequence generation. Specifically, P$^3$LM learns to generate tokens in permuted order upon an order-aware transformer decoder, as well as to generate the corresponding future $N$ tokens with a multi-stream attention mechanism. Extensive experiments are conducted on the GLGE benchmark, which includes four datasets for summarization, two for question generation, one for conversational question answering, and one for dialog response generation, where P$^3$LM achieves state-of-the-art results compared with strong publicly available generative pre-training methods.

3.7CVSep 21, 2022Code
IoU-Enhanced Attention for End-to-End Task Specific Object Detection

Jing Zhao, Shengjian Wu, Li Sun et al.

Without densely tiled anchor boxes or grid points in the image, sparse R-CNN achieves promising results through a set of object queries and proposal boxes updated in the cascaded training manner. However, due to the sparse nature and the one-to-one relation between the query and its attending region, it heavily depends on the self attention, which is usually inaccurate in the early training stage. Moreover, in a scene of dense objects, the object query interacts with many irrelevant ones, reducing its uniqueness and harming the performance. This paper proposes to use IoU between different boxes as a prior for the value routing in self attention. The original attention matrix multiplies the same size matrix computed from the IoU of proposal boxes, and they determine the routing scheme so that the irrelevant features can be suppressed. Furthermore, to accurately extract features for both classification and regression, we add two lightweight projection heads to provide the dynamic channel masks based on object query, and they multiply with the output from dynamic convs, making the results suitable for the two different tasks. We validate the proposed scheme on different datasets, including MS-COCO and CrowdHuman, showing that it significantly improves the performance and increases the model convergence speed.

1.8LGJul 5, 2022
A Safe Semi-supervised Graph Convolution Network

Zhi Yang, Yadong Yan, Haitao Gan et al.

In the semi-supervised learning field, Graph Convolution Network (GCN), as a variant model of GNN, has achieved promising results for non-Euclidean data by introducing convolution into GNN. However, GCN and its variant models fail to safely use the information of risk unlabeled data, which will degrade the performance of semi-supervised learning. Therefore, we propose a Safe GCN framework (Safe-GCN) to improve the learning performance. In the Safe-GCN, we design an iterative process to label the unlabeled data. In each iteration, a GCN and its supervised version(S-GCN) are learned to find the unlabeled data with high confidence. The high-confidence unlabeled data and their pseudo labels are then added to the label set. Finally, both added unlabeled data and labeled ones are used to train a S-GCN which can achieve the safe exploration of the risk unlabeled data and enable safe use of large numbers of unlabeled data. The performance of Safe-GCN is evaluated on three well-known citation network datasets and the obtained results demonstrate the effectiveness of the proposed framework over several graph-based semi-supervised learning methods.

0.3CLJun 29, 2022
The THUEE System Description for the IARPA OpenASR21 Challenge

Jing Zhao, Haoyu Wang, Jinpeng Li et al.

This paper describes the THUEE team's speech recognition system for the IARPA Open Automatic Speech Recognition Challenge (OpenASR21), with further experiment explorations. We achieve outstanding results under both the Constrained and Constrained-plus training conditions. For the Constrained training condition, we construct our basic ASR system based on the standard hybrid architecture. To alleviate the Out-Of-Vocabulary (OOV) problem, we extend the pronunciation lexicon using Grapheme-to-Phoneme (G2P) techniques for both OOV and potential new words. Standard acoustic model structures such as CNN-TDNN-F and CNN-TDNN-F-A are adopted. In addition, multiple data augmentation techniques are applied. For the Constrained-plus training condition, we use the self-supervised learning framework wav2vec2.0. We experiment with various fine-tuning techniques with the Connectionist Temporal Classification (CTC) criterion on top of the publicly available pre-trained model XLSR-53. We find that the frontend feature extractor plays an important role when applying the wav2vec2.0 pre-trained model to the encoder-decoder based CTC/Attention ASR architecture. Extra improvements can be achieved by using the CTC model finetuned in the target language as the frontend feature extractor.

12.6CVMay 11, 2023
Null-text Guidance in Diffusion Models is Secretly a Cartoon-style Creator

Jing Zhao, Heliang Zheng, Chaoyue Wang et al.

Classifier-free guidance is an effective sampling technique in diffusion models that has been widely adopted. The main idea is to extrapolate the model in the direction of text guidance and away from null-text guidance. In this paper, we demonstrate that null-text guidance in diffusion models is secretly a cartoon-style creator, i.e., the generated images can be efficiently transformed into cartoons by simply perturbing the null-text guidance. Specifically, we proposed two disturbance methods, i.e., Rollback disturbance (Back-D) and Image disturbance (Image-D), to construct misalignment between the noisy images used for predicting null-text guidance and text guidance (subsequently referred to as \textbf{null-text noisy image} and \textbf{text noisy image} respectively) in the sampling process. Back-D achieves cartoonization by altering the noise level of null-text noisy image via replacing $x_t$ with $x_{t+Δt}$. Image-D, alternatively, produces high-fidelity, diverse cartoons by defining $x_t$ as a clean input image, which further improves the incorporation of finer image details. Through comprehensive experiments, we delved into the principle of noise disturbing for null-text and uncovered that the efficacy of disturbance depends on the correlation between the null-text noisy image and the source image. Moreover, our proposed techniques, which can generate cartoon images and cartoonize specific ones, are training-free and easily integrated as a plug-and-play component in any classifier-free guided diffusion model. Project page is available at \url{https://nulltextforcartoon.github.io/}.

4.5AIJun 16, 2021
An Intelligent Question Answering System based on Power Knowledge Graph

Yachen Tang, Haiyun Han, Xianmao Yu et al.

The intelligent question answering (IQA) system can accurately capture users' search intention by understanding the natural language questions, searching relevant content efficiently from a massive knowledge-base, and returning the answer directly to the user. Since the IQA system can save inestimable time and workforce in data search and reasoning, it has received more and more attention in data science and artificial intelligence. This article introduced a domain knowledge graph using the graph database and graph computing technologies from massive heterogeneous data in electric power. It then proposed an IQA system based on the electrical power knowledge graph to extract the intent and constraints of natural interrogation based on the natural language processing (NLP) method, to construct graph data query statements via knowledge reasoning, and to complete the accurate knowledge search and analysis to provide users with an intuitive visualization. This method thoroughly combined knowledge graph and graph computing characteristics, realized high-speed multi-hop knowledge correlation reasoning analysis in tremendous knowledge. The proposed work can also provide a basis for the context-aware intelligent question and answer.

1.9RONov 9, 2019
Human Driver Behavior Prediction based on UrbanFlow

Zhiqian Qiao, Jing Zhao, Zachariah Tyree et al.

How autonomous vehicles and human drivers share public transportation systems is an important problem, as fully automatic transportation environments are still a long way off. Understanding human drivers' behavior can be beneficial for autonomous vehicle decision making and planning, especially when the autonomous vehicle is surrounded by human drivers who have various driving behaviors and patterns of interaction with other vehicles. In this paper, we propose an LSTM-based trajectory prediction method for human drivers which can help the autonomous vehicle make better decisions, especially in urban intersection scenarios. Meanwhile, in order to collect human drivers' driving behavior data in the urban scenario, we describe a system called UrbanFlow which includes the whole procedure from raw bird's-eye view data collection via drone to the final processed trajectories. The system is mainly intended for urban scenarios but can be extended to be used for any traffic scenarios.