CL Mar 1, 2021
Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines
Mahdi Hajiaghayi, Monir Hajiaghayi, Mark Bolin
In this paper, we present a multi-lingual sentence encoder that can be used in search engines as a query and document encoder. This embedding enables a semantic similarity score between queries and documents that can be an important feature in document ranking and relevancy. To train such a customized sentence encoder, it is beneficial to leverage users' search data in the form of query-document clicked pairs; however, we must avoid relying too much on search click data, as it is biased and does not cover many unseen cases. The search data is heavily skewed towards short queries, while the data for long queries is scarce and often noisy. The goal is to design a universal multi-lingual encoder that works for all cases and covers both short and long queries. We select a number of public NLI datasets in different languages, along with translation data, and together with user search data we train a language model using a multi-task approach. A challenge is that these datasets are not homogeneous in terms of content, size, and balance ratio. While the public NLI datasets are usually two-sentence based with equal proportions of positive and negative pairs, the user search data can contain multi-sentence documents and only positive pairs. We show how multi-task training enables us to leverage all these datasets and exploit knowledge sharing across these tasks.
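As a rough illustration of the query-document similarity feature described above, the sketch below scores documents against a query by cosine similarity of their embeddings. The abstract does not say which similarity function is used, so cosine similarity is an assumption here, and the names `cosine_similarity` and `rank_documents` are hypothetical. The vectors stand in for encoder outputs, which are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity to the query.

    doc_vecs maps a document id to its embedding; in a real system both the
    query and the documents would be encoded by the same sentence encoder.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In a ranker, a score like this would typically be one input feature among many rather than the final ordering.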
CV Jan 28, 2019
An End-to-End Solution for Effectively Demoting Watermarked Images in Image Search
Ning Ma, Xin Zhao, Mark Bolin
We propose an end-to-end solution, from watermark feature generation to metric design, for effectively demoting watermarked images surfaced by a real-world image search engine. We use a few fundamental techniques to obtain effective watermark features of images in the image search index, and utilize the signals in a commercial search engine to improve the image search quality. We collect a large and diverse set (about 1M) of images with human labels indicating whether each image contains a visible watermark. We train a few deep convolutional neural networks to extract watermark information from the raw images. The deep CNN classifiers we trained achieve high accuracy on the watermark test data set. We also analyze the images based on their domains to get watermark information from a domain-based watermark classifier. We design a novel hybrid metric that combines relevance, image attractiveness, and watermark information. We demonstrate that using these watermark signals together with the new metric in the image search ranker can significantly demote watermarked images during online image ranking.
CV Apr 12, 2018
An Universal Image Attractiveness Ranking Framework
Ning Ma, Alexey Volkov, Aleksandr Livshits et al.
We propose a new framework to rank image attractiveness using a novel pairwise deep network trained with a large set of side-by-side multi-labeled image pairs from a web image index. The judges only provide a relative ranking between two images, without needing to assign an absolute score or rate any predefined image attribute, which makes the rating more intuitive and accurate. We investigate a deep attractiveness rank net (DARN), a combination of a deep convolutional neural network and a rank net, to directly learn an attractiveness score mean and variance for each image, along with the underlying criteria the judges use to label each pair. An extension of this model (DARN-V2) is able to adapt to an individual judge's personal preferences. We also show that the attractiveness of search results is significantly improved by using this attractiveness information in a real commercial search engine. We evaluate our model against other state-of-the-art models on our side-by-side web test data and on another public aesthetic data set. With far fewer judgments (1M vs 50M), our model outperforms them on side-by-side labeled data and is comparable on data labeled by absolute score.
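To make the pairwise setup above more concrete, here is a minimal sketch of how a per-image score mean and variance could be turned into a preference probability and a training loss. It assumes independent Gaussian scores for the two images, which is an illustrative choice and not necessarily the formulation used in DARN. The function names and the soft label (the fraction of judges preferring the first image) are hypothetical.

```python
import math


def preference_probability(mean_a, var_a, mean_b, var_b):
    """P(image A is preferred over image B) under independent Gaussian scores.

    The score difference is Gaussian with mean (mean_a - mean_b) and
    variance (var_a + var_b); the result is the normal CDF evaluated at zero
    offset, computed with math.erf.
    """
    spread = math.sqrt(var_a + var_b)
    if spread == 0:
        # Degenerate case: both scores are deterministic.
        if mean_a == mean_b:
            return 0.5
        return 1.0 if mean_a > mean_b else 0.0
    z = (mean_a - mean_b) / spread
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pairwise_loss(p_pred, label):
    """Cross-entropy between the predicted preference and a soft label in [0, 1]."""
    eps = 1e-12  # keeps log() finite at the extremes
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
```

A larger variance pulls the predicted preference towards 0.5, so an uncertain image contributes a weaker ranking signal than a confidently scored one.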