LGMar 19, 2022
Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and DefenseThai Le, Jooyoung Lee, Kevin Yen et al.
We proposes a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attack. Unlike existing character-based attacks which often deductively hypothesize a set of manipulation strategies, our work is grounded on actual observations from real-world texts. We find that adversarial texts generated by ANTHRO achieve the best trade-off between (1) attack success rate, (2) semantic preservation of the original text, and (3) stealthiness--i.e. indistinguishable from human writings hence harder to be flagged as suspicious. Specifically, our attacks accomplished around 83% and 91% attack success rates on BERT and RoBERTa, respectively. Moreover, it outperformed the TextBugger baseline with an increase of 50% and 40% in terms of semantic preservation and stealthiness when evaluated by both layperson and professional human workers. ANTHRO can further enhance a BERT classifier's performance in understanding different variations of human-written toxic texts via adversarial training when compared to the Perspective API.
LGJun 13, 2022
SmartGD: A GAN-Based Graph Drawing Framework for Diverse Aesthetic GoalsXiaoqi Wang, Kevin Yen, Yifan Hu et al.
While a multitude of studies have been conducted on graph drawing, many existing methods only focus on optimizing a single aesthetic aspect of graph layouts, which can lead to sub-optimal results. There are a few existing methods that have attempted to develop a flexible solution for optimizing different aesthetic aspects measured by different aesthetic criteria. Furthermore, thanks to the significant advance in deep learning techniques, several deep learning-based layout methods were proposed recently. These methods have demonstrated the advantages of deep learning approaches for graph drawing. However, none of these existing methods can be directly applied to optimizing non-differentiable criteria without special accommodation. In this work, we propose a novel Generative Adversarial Network (GAN) based deep learning framework for graph drawing, called SmartGD, which can optimize different quantitative aesthetic goals, regardless of their differentiability. To demonstrate the effectiveness and efficiency of SmartGD, we conducted experiments on minimizing stress, minimizing edge crossing, maximizing crossing angle, maximizing shape-based metrics, and a combination of multiple aesthetics. Compared with several popular graph drawing algorithms, the experimental results show that SmartGD achieves good performance both quantitatively and qualitatively.
CLSep 18, 2021
BERT-Beta: A Proactive Probabilistic Approach to Text ModerationFei Tan, Yifan Hu, Kevin Yen et al.
Text moderation for user generated content, which helps to promote healthy interaction among users, has been widely studied and many machine learning models have been proposed. In this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. Specifically, we propose a new concept {\it text toxicity propensity} to characterize the extent to which a text tends to attract toxic comments. Beta regression is then introduced to do the probabilistic modeling, which is demonstrated to function well in comprehensive experiments. We also propose an explanation method to communicate the model decision clearly. Both propensity scoring and interpretation benefit text moderation in a novel manner. Finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.
CLAug 18, 2021
TSI: an Ad Text Strength Indicator using Text-to-CTR and Semantic-Ad-SimilarityShaunak Mishra, Changwei Hu, Manisha Verma et al.
Coming up with effective ad text is a time consuming process, and particularly challenging for small businesses with limited advertising experience. When an inexperienced advertiser onboards with a poorly written ad text, the ad platform has the opportunity to detect low performing ad text, and provide improvement suggestions. To realize this opportunity, we propose an ad text strength indicator (TSI) which: (i) predicts the click-through-rate (CTR) for an input ad text, (ii) fetches similar existing ads to create a neighborhood around the input ad, (iii) and compares the predicted CTRs in the neighborhood to declare whether the input ad is strong or weak. In addition, as suggestions for ad text improvement, TSI shows anonymized versions of superior ads (higher predicted CTR) in the neighborhood. For (i), we propose a BERT based text-to-CTR model trained on impressions and clicks associated with an ad text. For (ii), we propose a sentence-BERT based semantic-ad-similarity model trained using weak labels from ad campaign setup data. Offline experiments demonstrate that our BERT based text-to-CTR model achieves a significant lift in CTR prediction AUC for cold start (new) advertisers compared to bag-of-words based baselines. In addition, our semantic-textual-similarity model for similar ads retrieval achieves a precision@1 of 0.93 (for retrieving ads from the same product category); this is significantly higher compared to unsupervised TF-IDF, word2vec, and sentence-BERT baselines. Finally, we share promising online results from advertisers in the Yahoo (Verizon Media) ad platform where a variant of TSI was implemented with sub-second end-to-end latency.
LGJun 27, 2021
DeepGD: A Deep Learning Framework for Graph Drawing Using GNNXiaoqi Wang, Kevin Yen, Yifan Hu et al.
In the past decades, many graph drawing techniques have been proposed for generating aesthetically pleasing graph layouts. However, it remains a challenging task since different layout methods tend to highlight different characteristics of the graphs. Recently, studies on deep learning based graph drawing algorithm have emerged but they are often not generalizable to arbitrary graphs without re-training. In this paper, we propose a Convolutional Graph Neural Network based deep learning framework, DeepGD, which can draw arbitrary graphs once trained. It attempts to generate layouts by compromising among multiple pre-specified aesthetics considering a good graph layout usually complies with multiple aesthetics simultaneously. In order to balance the trade-off, we propose two adaptive training strategies which adjust the weight factor of each aesthetic dynamically during training. The quantitative and qualitative assessment of DeepGD demonstrates that it is capable of drawing arbitrary graphs effectively, while being flexible at accommodating different aesthetic criteria.
CVDec 19, 2020
Political Posters Identification with Appearance-Text FusionXuan Qin, Meizhu Liu, Yifan Hu et al.
In this paper, we propose a method that efficiently utilizes appearance features and text vectors to accurately classify political posters from other similar political images. The majority of this work focuses on political posters that are designed to serve as a promotion of a certain political event, and the automated identification of which can lead to the generation of detailed statistics and meets the judgment needs in a variety of areas. Starting with a comprehensive keyword list for politicians and political events, we curate for the first time an effective and practical political poster dataset containing 13K human-labeled political images, including 3K political posters that explicitly support a movement or a campaign. Second, we make a thorough case study for this dataset and analyze common patterns and outliers of political posters. Finally, we propose a model that combines the power of both appearance and text information to classify political posters with significantly high accuracy.
CLOct 17, 2020
HABERTOR: An Efficient and Effective Deep Hatespeech DetectorThanh Tran, Yifan Hu, Changwei Hu et al.
We present our HABERTOR model for detecting hatespeech in large scale user-generated content. Inspired by the recent success of the BERT model, we propose several modifications to BERT to enhance the performance on the downstream hatespeech classification task. HABERTOR inherits BERT's architecture, but is different in four aspects: (i) it generates its own vocabularies and is pre-trained from the scratch using the largest scale hatespeech dataset; (ii) it consists of Quaternion-based factorized components, resulting in a much smaller number of parameters, faster training and inferencing, as well as less memory usage; (iii) it uses our proposed multi-source ensemble heads with a pooling layer for separate input sources, to further enhance its effectiveness; and (iv) it uses a regularized adversarial training with our proposed fine-grained and adaptive noise magnitude to enhance its robustness. Through experiments on the large-scale real-world hatespeech dataset with 1.4M annotated comments, we show that HABERTOR works better than 15 state-of-the-art hatespeech detection methods, including fine-tuning Language Models. In particular, comparing with BERT, our HABERTOR is 4~5 times faster in the training/inferencing phase, uses less than 1/3 of the memory, and has better performance, even though we pre-train it by using less than 1% of the number of words. Our generalizability analysis shows that HABERTOR transfers well to other unseen hatespeech datasets and is a more efficient and effective alternative to BERT for the hatespeech classification.