Fengshun Xiao

CL
3 papers
1,260 citations
Novelty 52%
AI Score 27

3 Papers

CL · Nov 6, 2019
Hierarchical Contextualized Representation for Named Entity Recognition

Ying Luo, Fengshun Xiao, Hai Zhao

Named entity recognition (NER) models are typically based on the bidirectional LSTM (BiLSTM) architecture. Their sequential nature and their modeling of a single input at a time prevent full use of global information from a larger scope: not only the entire sentence, but also the entire document (dataset). In this paper, we address these two deficiencies and propose a model augmented with hierarchical contextualized representation: a sentence-level representation and a document-level representation. At the sentence level, we account for the different contributions of the words in a sentence, using a label embedding attention mechanism to enhance the sentence representation learned by an independent BiLSTM. At the document level, a key-value memory network records document-aware information for each unique word and is sensitive to the similarity of context information. The two levels of hierarchical contextualized representation are fused with each input token embedding and the corresponding BiLSTM hidden state, respectively. Experimental results on three benchmark NER datasets (the CoNLL-2003 and OntoNotes 5.0 English datasets and the CoNLL-2002 Spanish dataset) show that we establish new state-of-the-art results.

CL · Aug 22, 2019
Controllable Dual Skew Divergence Loss for Neural Machine Translation

Zuchao Li, Hai Zhao, Yingting Wu et al.

In sequence prediction tasks like neural machine translation, training with cross-entropy loss often leads to models that overgeneralize and plunge into local optima. In this paper, we propose an extended loss function called dual skew divergence (DSD) that integrates two symmetric terms on KL divergences with a balanced weight. We empirically found that this balanced weight plays a crucial role in applying the proposed DSD loss to deep models. We therefore develop a controllable DSD loss for general-purpose scenarios. Our experiments indicate that switching to the DSD loss after maximum-likelihood (ML) training has converged helps models escape local optima and yields stable performance improvements. Our evaluations on the WMT 2014 English-German and English-French translation tasks demonstrate that the proposed loss, as a general and convenient means for NMT training, improves performance over strong baselines.
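The phrase "two symmetric terms on KL divergences with a balanced weight" can be illustrated with a small sketch. This is not the paper's DSD loss: the abstract does not give the formula, so the skewing of the distributions and the controller for the weight are omitted. The function names `kl_divergence` and `weighted_dual_kl` and the parameter `beta` are hypothetical, chosen only for this illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def weighted_dual_kl(p, q, beta=0.5):
    """Hypothetical illustration (an assumption, not the published loss):
    a weighted sum of the forward and reverse KL divergences,
    beta * KL(p || q) + (1 - beta) * KL(q || p).
    """
    return beta * kl_divergence(p, q) + (1.0 - beta) * kl_divergence(q, p)


# Toy distributions over a three-word vocabulary (made-up numbers).
target = [0.7, 0.2, 0.1]
model = [0.5, 0.3, 0.2]

loss = weighted_dual_kl(target, model, beta=0.5)
print(f"weighted dual KL at beta=0.5: {loss:.4f}")
```

Setting `beta=1.0` recovers the forward KL term alone, which differs from cross-entropy only by the entropy of the target distribution; lowering `beta` shifts weight toward the reverse term.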

CL · Jun 4, 2019
Lattice-Based Transformer Encoder for Neural Machine Translation

Fengshun Xiao, Jiangtong Li, Hai Zhao et al.

Neural machine translation (NMT) takes deterministic sequences as source representations. However, both word-level and subword-level segmentation offer multiple ways to split a source sequence, depending on the word segmenter or the subword vocabulary size. We hypothesize that this diversity in segmentations may affect NMT performance. To integrate different segmentations into the state-of-the-art NMT model, the Transformer, we propose lattice-based encoders that explore effective word or subword representations automatically during training. We propose two methods: 1) lattice positional encoding and 2) lattice-aware self-attention. The two methods can be used together and are complementary, further improving translation performance. Experimental results show that lattice-based encoders outperform the conventional Transformer encoder with both word-level and subword-level representations.