Cross-Language Domain Adaptation for Classifying Crisis-Related Short Messages
This addresses data scarcity for crisis response volunteers, but it is incremental as it builds on existing domain adaptation methods.
The study tackled the problem of classifying crisis-related short messages by investigating whether labeled data from past disasters can improve classifier performance, finding that past labels are useful when source and target events are of the same type, and cross-language adaptation works for similar languages but decreases for different ones.
Rapid crisis response requires real-time analysis of messages. After a disaster happens, volunteers attempt to classify tweets to determine needs, e.g., supplies, infrastructure damage, etc. Given labeled data, supervised machine learning can help classify these messages. Scarcity of labeled data causes poor performance in machine training. Can we reuse old tweets to train classifiers? How can we choose labeled tweets for training? Specifically, we study the usefulness of labeled data of past events. Do labeled tweets in different language help? We observe the performance of our classifiers trained using different combinations of training sets obtained from past disasters. We perform extensive experimentation on real crisis datasets and show that the past labels are useful when both source and target events are of the same type (e.g. both earthquakes). For similar languages (e.g., Italian and Spanish), cross-language domain adaptation was useful, however, when for different languages (e.g., Italian and English), the performance decreased.