CL Mar 1, 2021
Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines
Mahdi Hajiaghayi, Monir Hajiaghayi, Mark Bolin
In this paper, we present a multi-lingual sentence encoder that can be used in search engines as a query and document encoder. The resulting embedding yields a semantic similarity score between queries and documents, which can serve as an important feature in document ranking and relevancy. To train such a customized sentence encoder, it is beneficial to leverage users' search data in the form of query-document clicked pairs. However, we must avoid relying too heavily on search click data, as it is biased and does not cover many unseen cases. The search data is heavily skewed towards short queries; for long queries, it is small and often noisy. The goal is to design a universal multi-lingual encoder that works in all cases and covers both short and long queries. We select a number of public NLI datasets in different languages, together with translation data and user search data, and train a language model using a multi-task approach. A challenge is that these datasets are not homogeneous in terms of content, size, and balance ratio. While the public NLI datasets are usually two-sentence based, with equal proportions of positive and negative pairs, the user search data can contain multi-sentence documents and only positive pairs. We show how multi-task training enables us to leverage all these datasets and exploit knowledge sharing across these tasks.
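As a rough illustration of how a query/document embedding can produce a ranking feature, the sketch below scores documents against a query by cosine similarity. The abstract does not state which similarity function is used, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical stand-ins for real encoder output.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar to the query.

    query_vec and the values of doc_vecs are assumed to come from the same
    sentence encoder; here they are plain lists of floats.
    """
    scores = {doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings, purely illustrative (not produced by any real model).
ranking = rank_documents([1.0, 0.0], {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]})
```

In a ranking pipeline, a score like this would typically be one feature among many rather than the sole ordering criterion.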
LG Dec 13, 2018
Code Failure Prediction and Pattern Extraction using LSTM Networks
Mahdi Hajiaghayi, Ehsan Vahedi
In this paper, we use a well-known deep learning technique, Long Short-Term Memory (LSTM) recurrent neural networks, to find sessions that are prone to code failure in applications that rely on telemetry data for system health monitoring. We also use LSTM networks to extract telemetry patterns that lead to a specific code failure. For code failure prediction, we treat the telemetry events, the sequence of telemetry events, and the outcome of each sequence as the words, sentence, and sentiment, respectively, in the context of sentiment analysis. Our proposed method is able to process a large set of data and can automatically handle edge cases in code failure prediction. We use Bayesian optimization to find the optimal hyperparameters as well as the type of LSTM cell that leads to the best prediction performance. We then introduce the concepts of Contributors and Blockers. In this paper, contributors are the set of events that cause a code failure, while blockers are the set of events each of which individually prevents a code failure from happening, even in the presence of one or more contributors. Once the proposed LSTM model is trained, we use a greedy approach to find the contributors and blockers. To develop and test our proposed method, we first use synthetic (simulated) data. The synthetic data is generated using a number of rules for code failures, as well as a number of rules for preventing a code failure from happening. The trained LSTM model shows over 99% accuracy in detecting code failures in the synthetic data. The results from the proposed method outperform classical learning models such as Decision Tree and Random Forest. Using the proposed greedy method, we are able to find the contributors and blockers in the synthetic data in more than 90% of the cases, with better performance than sequential rule and pattern mining algorithms.
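The abstract does not spell out the greedy procedure, so the following is only one plausible sketch of the idea, under stated assumptions: the trained model is treated as a black-box function that maps an event sequence to a failure prediction, contributors are found by greedily dropping events that are not needed for the failure prediction, and blockers are events whose addition flips a failing sequence to non-failing. The names `find_contributors`, `find_blockers`, and `toy_predictor` are hypothetical, and the rule-based predictor merely stands in for a trained LSTM.

```python
def find_contributors(sequence, predicts_failure):
    """Greedily remove events while the sequence is still predicted to fail.

    The events that survive are those whose removal would stop the failure
    prediction, i.e. candidate contributors. Returns [] if the input
    sequence is not predicted to fail in the first place.
    """
    current = list(sequence)
    if not predicts_failure(current):
        return []
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if predicts_failure(candidate):
            current = candidate  # event i was not needed for the failure
        else:
            i += 1  # event i is needed; keep it and move on
    return current


def find_blockers(failing_sequence, vocabulary, predicts_failure):
    """Return events that, appended alone to a failing sequence, prevent the failure."""
    blockers = []
    for event in vocabulary:
        if event in failing_sequence:
            continue
        if not predicts_failure(list(failing_sequence) + [event]):
            blockers.append(event)
    return blockers


def toy_predictor(sequence):
    """Hypothetical stand-in for a trained model: fail when A and B occur without X."""
    events = set(sequence)
    return {"A", "B"} <= events and "X" not in events


contributors = find_contributors(["C", "A", "D", "B"], toy_predictor)
blockers = find_blockers(contributors, ["A", "B", "C", "X"], toy_predictor)
```

With the toy rule above, the search isolates `A` and `B` as contributors and `X` as a blocker; with a real model, the predictor would be a thresholded model output and the result would be approximate.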