Xiangpeng Wan

h-index10

6papers

569citations

Novelty36%

AI Score30

Ranked #138,872 of 194,257 authors (top 71%)#24,876 in CL (top 81%)

6 Papers

26.4CLMay 14, 2022Code

RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL

Jiexing Qi, Jingyao Tang, Ziwei He et al. · meta-ai, mila

Relational structures such as schema linking and schema encoding have been validated as a key component to qualitatively translating natural language into SQL queries. However, introducing these structural relations comes with prices: they often result in a specialized model structure, which largely prohibits using large pretrained models in text-to-SQL. To address this problem, we propose RASAT: a Transformer seq2seq architecture augmented with relation-aware self-attention that could leverage a variety of relational structures while inheriting the pretrained parameters from the T5 model effectively. Our model can incorporate almost all types of existing relations in the literature, and in addition, we propose introducing co-reference relations for the multi-turn scenario. Experimental results on three widely used text-to-SQL datasets, covering both single-turn and multi-turn scenarios, have shown that RASAT could achieve state-of-the-art results across all three benchmarks (75.5% EX on Spider, 52.6% IEX on SParC, and 37.4% IEX on CoSQL).

21.5CLMay 23, 2022Code

Local Byte Fusion for Neural Machine Translation

Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng et al.

Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages leading to a drop in translation performance. A simple alternative to subword tokenizers is byte-based methods i.e. tokenization into byte sequences using encoding schemes such as UTF-8. Byte tokens often represent inputs at a sub-character granularity i.e. one character can be represented by a sequence of multiple byte tokens. This results in byte sequences that are significantly longer than character sequences. Enforcing aggregation of local information in the lower layers can guide the model to build higher-level semantic information. We propose a Local Byte Fusion (LOBEF) method for byte-based machine translation -- utilizing byte $n$-gram and word boundaries -- to aggregate local semantic information. Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques. Further analysis also indicates that our byte-based models are parameter-efficient and can be trained faster than subword models.

1.7CLDec 5, 2023Code

Inherent limitations of LLMs regarding spatial information

He Yan, Xinyao Hu, Xiangpeng Wan et al.

Despite the significant advancements in natural language processing capabilities demonstrated by large language models such as ChatGPT, their proficiency in comprehending and processing spatial information, especially within the domains of 2D and 3D route planning, remains notably underdeveloped. This paper investigates the inherent limitations of ChatGPT and similar models in spatial reasoning and navigation-related tasks, an area critical for applications ranging from autonomous vehicle guidance to assistive technologies for the visually impaired. In this paper, we introduce a novel evaluation framework complemented by a baseline dataset, meticulously crafted for this study. This dataset is structured around three key tasks: plotting spatial points, planning routes in two-dimensional (2D) spaces, and devising pathways in three-dimensional (3D) environments. We specifically developed this dataset to assess the spatial reasoning abilities of ChatGPT. Our evaluation reveals key insights into the model's capabilities and limitations in spatial understanding.

0.2CLMay 25, 2021

Topic Modeling and Progression of American Digital News Media During the Onset of the COVID-19 Pandemic

Xiangpeng Wan, Michael C. Lucic, Hakim Ghazzai et al.

Currently, the world is in the midst of a severe global pandemic, which has affected all aspects of people's lives. As a result, there is a deluge of COVID-related digital media articles published in the United States, due to the disparate effects of the pandemic. This large volume of information is difficult to consume by the audience in a reasonable amount of time. In this paper, we develop a Natural Language Processing (NLP) pipeline that is capable of automatically distilling various digital articles into manageable pieces of information, while also modelling the progression topics discussed over time in order to aid readers in rapidly gaining holistic perspectives on pressing issues (i.e., the COVID-19 pandemic) from a diverse array of sources. We achieve these goals by first collecting a large corpus of COVID-related articles during the onset of the pandemic. After, we apply unsupervised and semi-supervised learning procedures to summarize articles, then cluster them based on their similarities using the community detection methods. Next, we identify the topic of each cluster of articles using the BART algorithm. Finally, we provide a detailed digital media analysis based on the NLP-pipeline outputs and show how the conversation surrounding COVID-19 evolved over time.

0.3CLApr 21, 2020

Word Embedding-based Text Processing for Comprehensive Summarization and Distinct Information Extraction

Xiangpeng Wan, Hakim Ghazzai, Yehia Massoud

In this paper, we propose two automated text processing frameworks specifically designed to analyze online reviews. The objective of the first framework is to summarize the reviews dataset by extracting essential sentence. This is performed by converting sentences into numerical vectors and clustering them using a community detection algorithm based on their similarity levels. Afterwards, a correlation score is measured for each sentence to determine its importance level in each cluster and assign it as a tag for that community. The second framework is based on a question-answering neural network model trained to extract answers to multiple different questions. The collected answers are effectively clustered to find multiple distinct answers to a single question that might be asked by a customer. The proposed frameworks are shown to be more comprehensive than existing reviews processing solutions.

0.2CLApr 21, 2020

Leveraging Personal Navigation Assistant Systems Using Automated Social Media Traffic Reporting

Xiangpeng Wan, Hakim Ghazzai, Yehia Massoud

Modern urbanization is demanding smarter technologies to improve a variety of applications in intelligent transportation systems to relieve the increasing amount of vehicular traffic congestion and incidents. Existing incident detection techniques are limited to the use of sensors in the transportation network and hang on human-inputs. Despite of its data abundance, social media is not well-exploited in such context. In this paper, we develop an automated traffic alert system based on Natural Language Processing (NLP) that filters this flood of information and extract important traffic-related bullets. To this end, we employ the fine-tuning Bidirectional Encoder Representations from Transformers (BERT) language embedding model to filter the related traffic information from social media. Then, we apply a question-answering model to extract necessary information characterizing the report event such as its exact location, occurrence time, and nature of the events. We demonstrate the adopted NLP approaches outperform other existing approach and, after effectively training them, we focus on real-world situation and show how the developed approach can, in real-time, extract traffic-related information and automatically convert them into alerts for navigation assistance applications such as navigation apps.