LG · May 24, 2023
TaxoKnow: Taxonomy as Prior Knowledge in the Loss Function of Multi-class Classification
Mohsen Pourvali, Yao Meng, Chen Sheng et al.
In this paper, we investigate the effectiveness of integrating a hierarchical taxonomy of labels as prior knowledge into the learning algorithm of a flat classifier. We introduce two methods to integrate the hierarchical taxonomy as an explicit regularizer into the loss function of learning algorithms. By reasoning on a hierarchical taxonomy, a neural network alleviates its output distributions over the classes, allowing conditioning on upper concepts for a minority class. We limit ourselves to the flat classification task and provide our experimental results on two industrial in-house datasets and two public benchmarks, RCV1 and Amazon product reviews. Our results show that a taxonomy significantly improves the performance of a learner in semi-supervised multi-class classification and yields considerable results in the fully supervised setting.
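To make the idea of a taxonomy acting as a regularizer in the loss concrete, here is a minimal sketch. It is not one of the two methods from the paper, which the abstract does not specify. It assumes a hypothetical two-level taxonomy (`PARENT_OF`), a hypothetical weight `lam`, and one plausible regularizer: a cross-entropy term on the parent concept, where a parent's probability is the sum of its children's probabilities.

```python
import math

# Hypothetical two-level taxonomy: leaf class index -> parent concept index.
PARENT_OF = [0, 0, 1, 1, 1]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def parent_probs(leaf_probs, parent_of):
    """Aggregate leaf probabilities into one probability per parent concept."""
    out = [0.0] * (max(parent_of) + 1)
    for leaf, p in enumerate(leaf_probs):
        out[parent_of[leaf]] += p
    return out

def taxonomy_loss(logits, label, parent_of=PARENT_OF, lam=0.5):
    """Flat cross-entropy plus an assumed parent-level regularizer.

    lam is an illustrative weight; lam = 0 recovers plain cross-entropy.
    """
    probs = softmax(logits)
    leaf_ce = -math.log(probs[label])
    parent_ce = -math.log(parent_probs(probs, parent_of)[parent_of[label]])
    return leaf_ce + lam * parent_ce
```

Under this assumed form, a wrong prediction that stays under the correct parent concept is penalized less than one that lands under a different parent, which is one way a flat classifier could be nudged toward taxonomy-consistent outputs.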
CV · May 19, 2023
Embrace Limited and Imperfect Training Datasets: Opportunities and Challenges in Plant Disease Recognition Using Deep Learning
Mingle Xu, Hyongsuk Kim, Jucheng Yang et al.
Recent advancements in deep learning have brought significant improvements to plant disease recognition. However, achieving satisfactory performance often requires high-quality training datasets, which are challenging and expensive to collect. Consequently, the practical application of current deep learning-based methods in real-world scenarios is hindered by the scarcity of high-quality datasets. In this paper, we argue that embracing poor datasets is viable and aim to explicitly define the challenges associated with using these datasets. To delve into this topic, we analyze the characteristics of high-quality datasets, namely large-scale images and desired annotation, and contrast them with the limited and imperfect nature of poor datasets. Challenges arise when the training datasets deviate from these characteristics. To provide a comprehensive understanding, we propose a novel and informative taxonomy that categorizes these challenges. Furthermore, we offer a brief overview of existing studies and approaches that address these challenges. We believe that our paper sheds light on the importance of embracing poor datasets, enhances the understanding of the associated challenges, and contributes to the ambitious objective of deploying deep learning in real-world applications. To facilitate the progress, we finally describe several outstanding questions and point out potential future directions. Although our primary focus is on plant disease recognition, we emphasize that the principles of embracing and analyzing poor datasets are applicable to a wider range of domains, including agriculture.
CL · May 22, 2020
Bootstrapping Named Entity Recognition in E-Commerce with Positive Unlabeled Learning
Hanchu Zhang, Leonhard Hennig, Christoph Alt et al.
Named Entity Recognition (NER) in domains like e-commerce is an understudied problem due to the lack of annotated datasets. Recognizing novel entity types in this domain, such as products, components, and attributes, is challenging because of their linguistic complexity and the low coverage of existing knowledge resources. To address this problem, we present a bootstrapped positive-unlabeled learning algorithm that integrates domain-specific linguistic features to quickly and efficiently expand the seed dictionary. The model achieves an average F1 score of 72.02% on a novel dataset of product descriptions, an improvement of 3.63% over a baseline BiLSTM classifier, and in particular exhibits better recall (4.96% on average).
SY · Apr 3, 2020
FeederGAN: Synthetic Feeder Generation via Deep Graph Adversarial Nets
Ming Liang, Yao Meng, Jiyu Wang et al.
This paper presents a novel, automated, generative adversarial network (GAN) based synthetic feeder generation mechanism, abbreviated as FeederGAN. FeederGAN digests real feeder models represented by directed graphs via a deep learning framework powered by GAN and graph convolutional networks (GCN). Information about a distribution feeder circuit is extracted from its model input files so that the device connectivity is mapped onto the adjacency matrix and the device characteristics, such as circuit types (i.e., 3-phase, 2-phase, and 1-phase) and component attributes (e.g., length and current ratings), are mapped onto the attribute matrix. Then, Wasserstein distance is used to optimize the GAN, and GCN is used to discriminate the generated graphs from the actual ones. A greedy method based on graph theory is developed to reconstruct the feeder using the generated adjacency and attribute matrices. Our results show that the GAN-generated feeders resemble the actual feeder in both topology and attributes, as verified by visual inspection and by empirical statistics obtained from actual distribution feeders.
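The mapping from a feeder graph to an adjacency matrix and an attribute matrix can be illustrated with a small sketch. The data layout here is an assumption, not FeederGAN's actual input format: each directed segment is a hypothetical tuple `(from_node, to_node, phases, length_km, rating_a)`, and each node's attribute row is assumed to hold the properties of the segment feeding it.

```python
# Hypothetical toy feeder: (from_node, to_node, phases, length_km, rating_a).
SAMPLE_SEGMENTS = [
    (0, 1, 3, 1.2, 400.0),  # 3-phase trunk from the substation node
    (1, 2, 1, 0.4, 100.0),  # 1-phase lateral
    (1, 3, 2, 0.7, 200.0),  # 2-phase lateral
]

def build_matrices(n_nodes, segments):
    """Return (adjacency, attributes) for a directed feeder graph.

    adjacency[i][j] == 1 means a segment runs from node i to node j.
    attributes[j] holds [phases, length_km, rating_a] of the segment
    feeding node j (an assumed convention); the root row stays all zeros.
    """
    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    attributes = [[0.0, 0.0, 0.0] for _ in range(n_nodes)]
    for src, dst, phases, length_km, rating_a in segments:
        adjacency[src][dst] = 1
        attributes[dst] = [float(phases), length_km, rating_a]
    return adjacency, attributes
```

Calling `build_matrices(4, SAMPLE_SEGMENTS)` yields a 4x4 adjacency matrix with three nonzero entries and a 4x3 attribute matrix, the kind of paired representation a graph-based generator and discriminator could operate on.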
IR · May 5, 2015
A Feature-based Classification Technique for Answering Multi-choice World History Questions
Shuangyong Song, Yao Meng, Zhongguang Zheng et al.
Our FRDC_QA team participated in the QA-Lab English subtask of NTCIR-11. In this paper, we describe our system for solving real-world university entrance exam questions on world history. Wikipedia is used as the main external resource for our system. Since questions that require choosing the right or wrong sentence from multiple candidate sentences account for about two-thirds of the total, we design a dedicated classification-based model for this question type. For other question types, we design some simple methods.
IR · May 5, 2015
Classifying and Ranking Microblogging Hashtags with News Categories
Shuangyong Song, Yao Meng
In microblogging, hashtags serve as topical markers and are adopted by users who contribute similar content or express a related idea. However, hashtags are created freely and carry no domain category information, which makes it hard for users to access an organized presentation of hashtags. In this paper, we propose an approach that classifies hashtags into news categories and then carries out a domain-sensitive popularity ranking to find hot hashtags in each domain. The proposed approach first trains a domain classification model on news content and news category information, then detects microblogs related to a hashtag to serve as its representative text, based on which the hashtag is assigned a domain. Finally, we calculate the domain-sensitive popularity of each hashtag from multiple factors to obtain the most hotly discussed hashtags in each domain. Preliminary experimental results on a dataset from Sina Weibo, one of the largest Chinese microblogging websites, show the usefulness of the proposed approach for describing hashtags.
CL · Apr 30, 2015
Detecting Concept-level Emotion Cause in Microblogging
Shuangyong Song, Yao Meng
In this paper, we propose a Concept-level Emotion Cause Model (CECM), rather than a mere word-level model, to discover the causes of microblogging users' diversified emotions about a specific hot event. CECM uses a modified topic-supervised biterm topic model to detect emotion topics in event-related tweets, and then uses context-sensitive topical PageRank to detect meaningful multiword expressions as emotion causes. Experimental results on a dataset from Sina Weibo, one of the largest microblogging websites in China, show that CECM detects emotion causes better than baseline methods.