CV Aug 13, 2024 Code
Masked Image Modeling: A Survey
Vlad Hondru, Florinel Alin Croitoru, Shervin Minaee et al.
In this work, we survey recent studies on masked image modeling (MIM), an approach that emerged as a powerful self-supervised learning technique in computer vision. The MIM task involves masking some information, e.g., pixels, patches, or even latent representations, and training a model, usually an autoencoder, to predict the missing information by using the context available in the visible part of the input. We identify and formalize two categories of approaches on how to implement MIM as a pretext task, one based on reconstruction and one based on contrastive learning. Then, we construct a taxonomy and review the most prominent papers in recent years. We complement the manually constructed taxonomy with a dendrogram obtained by applying a hierarchical clustering algorithm. We further identify relevant clusters by manually inspecting the resulting dendrogram. Our review also includes datasets that are commonly used in MIM research. We aggregate the performance results of various masked image modeling methods on the most popular datasets, to facilitate the comparison of competing methods. Finally, we identify research gaps and propose several interesting directions of future work. We supplement our survey with the following public repository containing organized references: https://github.com/vladhondru25/MIM-Survey.
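The reconstruction-based variant of the pretext task described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 75% mask ratio, and the mean-squared-error loss restricted to masked patches are choices made for this example, not details taken from the survey.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a masked (hidden) patch."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]


def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error averaged over the masked patches only."""
    errors = [
        sum((t - p) ** 2 for t, p in zip(target_patch, predicted_patch))
        / len(target_patch)
        for target_patch, predicted_patch, is_masked in zip(target, prediction, mask)
        if is_masked
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Toy "image": 16 patches, each flattened to 4 values (hypothetical sizes).
patches = [[float(i)] * 4 for i in range(16)]
mask = random_patch_mask(len(patches), mask_ratio=0.75)

# A placeholder "model" that predicts zeros everywhere.
prediction = [[0.0] * 4 for _ in patches]
loss = masked_reconstruction_loss(patches, prediction, mask)
```

In a real setup the encoder would see only the visible patches and a decoder would produce `prediction`; here the prediction is a stand-in so that the masking and loss logic can be read in isolation.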
CV Mar 4, 2022
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
Ligong Han, Jian Ren, Hsin-Ying Lee et al.
Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user since there is no means to provide motion information. Conversely, language information can describe the desired motion, while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. We introduce text augmentation to improve the robustness of the textual representation and diversity of generated videos. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images. It can generate much longer sequences than the one used for training. In addition, our model can extract visual information as suggested by the text prompt, e.g., "an object in image one is moving northeast", and generate corresponding videos. We run evaluations on three public datasets and a newly collected dataset labeled with facial attributes, achieving state-of-the-art generation results on all four.
CL Feb 9, 2024
Large Language Models: A Survey
Shervin Minaee, Tomas Mikolov, Narjes Nikzad et al.
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data, as predicted by scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
CV Apr 20, 2020 Code
Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning
Shervin Minaee, Rahele Kafieh, Milan Sonka et al.
The COVID-19 pandemic is causing a major outbreak in more than 150 countries around the world, having a severe impact on the health and life of many people globally. One of the crucial steps in fighting COVID-19 is the ability to detect infected patients early enough, and put them under special care. Detecting this disease from radiography and radiology images is perhaps one of the fastest ways to diagnose patients. Some of the early studies showed specific abnormalities in the chest radiograms of patients infected with COVID-19. Inspired by earlier works, we study the application of deep learning models to detect COVID-19 patients from their chest radiography images. We first prepare a dataset of 5,000 chest X-rays from publicly available datasets. Images exhibiting COVID-19 disease presence were identified by a board-certified radiologist. Transfer learning on a subset of 2,000 radiograms was used to train four popular convolutional neural networks, including ResNet18, ResNet50, SqueezeNet, and DenseNet-121, to identify COVID-19 disease in the analyzed chest X-ray images. We evaluated these models on the remaining 3,000 images, and most of these networks achieved a sensitivity rate of 98% (± 3%), while having a specificity rate of around 90%. Besides sensitivity and specificity rates, we also present the receiver operating characteristic (ROC) curve, precision-recall curve, average prediction, and confusion matrix of each model. We also used a technique to generate heatmaps of lung regions potentially infected by COVID-19 and show that the generated heatmaps contain most of the infected areas annotated by our board-certified radiologist. While the achieved performance is very encouraging, further analysis is required on a larger set of COVID-19 images, to have a more reliable estimation of accuracy rates. The dataset, model implementations (in PyTorch), and evaluations are all made publicly available for the research community at https://github.com/shervinmin/DeepCovid.git
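For readers unfamiliar with the two headline metrics, the following minimal sketch shows how sensitivity and specificity follow from binary labels and predictions. The function name and the toy labels are made up for illustration and are unrelated to the paper's data or code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels, 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: fraction of actual positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of actual negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy labels: 4 positive cases, 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

With these toy labels, three of the four positives are detected and five of the six negatives are rejected, so sensitivity is 0.75 and specificity is 5/6.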
CV Feb 18, 2022
Modern Augmented Reality: Applications, Trends, and Future Directions
Shervin Minaee, Xiaodan Liang, Shuicheng Yan
Augmented reality (AR) is one of the relatively old, yet trending areas at the intersection of computer vision and computer graphics, with numerous applications in several areas, from gaming and entertainment to education and healthcare. Although it has been around for nearly fifty years, it has seen a lot of interest from the research community in recent years, mainly because of the huge success of deep learning models for various computer vision and AR applications, which made creating new generations of AR technologies possible. This work tries to provide an overview of modern augmented reality, from both application-level and technical perspectives. We first give an overview of the main AR applications, grouped into more than ten categories. We then give an overview of around 100 recent promising machine learning based works developed for AR systems, such as deep learning works for AR shopping (clothing, makeup), AR based image filters (such as Snapchat's lenses), AR animations, and more. Finally, we discuss some of the current challenges in the AR domain, and the future directions in this area.
CV Mar 27, 2021
Going Deeper Into Face Detection: A Survey
Shervin Minaee, Ping Luo, Zhe Lin et al.
Face detection is a crucial first step in many facial recognition and face analysis systems. Early approaches for face detection were mainly based on classifiers built on top of hand-crafted features extracted from local image regions, such as Haar Cascades and Histogram of Oriented Gradients. However, these approaches were not powerful enough to achieve a high accuracy on images from uncontrolled environments. With the breakthrough work in image classification using deep neural networks in 2012, there has been a huge paradigm shift in face detection. Inspired by the rapid progress of deep learning in computer vision, many deep learning based frameworks have been proposed for face detection over the past few years, achieving significant improvements in accuracy. In this work, we provide a detailed overview of some of the most representative deep learning based face detection methods by grouping them into a few major categories, and present their core architectural designs and accuracies on popular benchmarks. We also describe some of the most popular face detection datasets. Finally, we discuss some current challenges in the field, and suggest potential future research directions.
CV Oct 8, 2020
Age and Gender Prediction From Face Images Using Attentional Convolutional Network
Amirali Abdolrashidi, Mehdi Minaei, Elham Azimi et al.
Automatic prediction of age and gender from face images has drawn a lot of attention recently, due to its wide applications in various facial analysis problems. However, due to the large intra-class variation of face images (such as variation in lighting, pose, scale, occlusion), the existing models are still behind the desired accuracy level, which is necessary for the use of these models in real-world applications. In this work, we propose a deep learning framework, based on the ensemble of attentional and residual convolutional networks, to predict the gender and age group of facial images with a high accuracy rate. Using an attention mechanism enables our model to focus on the important and informative parts of the face, which can help it to make a more accurate prediction. We train our model in a multi-task learning fashion, and augment the feature embedding of the age classifier with the predicted gender, and show that doing so can further increase the accuracy of age prediction. Our model is trained on a popular face age and gender dataset, and achieved promising results. Through visualization of the attention maps of the trained model, we show that our model has learned to become sensitive to the right regions of the face.
IV Sep 10, 2020
COVID CT-Net: Predicting Covid-19 From Chest CT Images Using Attentional Convolutional Network
Shakib Yazdani, Shervin Minaee, Rahele Kafieh et al.
The novel corona-virus disease (COVID-19) pandemic has caused a major outbreak in more than 200 countries around the world, leading to a severe impact on the health and life of many people globally. As of August 25, 2020, more than 20 million people had been infected, and more than 800,000 deaths had been reported. Computed Tomography (CT) images can be used as an alternative to the time-consuming "reverse transcription polymerase chain reaction (RT-PCR)" test, to detect COVID-19. In this work we developed a deep learning framework to predict COVID-19 from CT images. We propose to use an attentional convolution network, which can focus on the infected areas of the chest, enabling it to perform a more accurate prediction. We trained our model on a dataset of more than 2000 CT images, and report its performance in terms of various popular metrics, such as sensitivity, specificity, area under the curve, and also precision-recall curve, and achieve very promising results. We also provide a visualization of the attention maps of the model for several test images, and show that our model is attending to the infected regions as intended. In addition to developing a machine learning modeling framework, we also provide the manual annotation of the potentially infected regions of the chest, with the help of a board-certified radiologist, and make that publicly available for other researchers.
CL Sep 8, 2020
Covid-Transformer: Detecting COVID-19 Trending Topics on Twitter Using Universal Sentence Encoder
Meysam Asgari-Chenaghlu, Narjes Nikzad-Khasmakhi, Shervin Minaee
The novel corona-virus disease (also known as COVID-19) has led to a pandemic, impacting more than 200 countries across the globe. With its global impact, COVID-19 has become a major concern of people almost everywhere, and therefore there are a large number of tweets coming out from every corner of the world about COVID-19 related topics. In this work, we try to analyze the tweets and detect the trending topics and major concerns of people on Twitter, which can enable us to better understand the situation, and devise better planning. More specifically, we propose a model based on the universal sentence encoder to detect the main topics of tweets in recent months. We used the universal sentence encoder to derive the semantic representation and the similarity of tweets. We then fed the sentence embeddings and their similarities to the K-means clustering algorithm to group similar tweets (in the semantic sense). After that, the cluster summary is obtained using a text summarization algorithm based on deep learning, which can uncover the underlying topics of each cluster. Through experimental results, we show that our model can detect very informative topics, by processing a large number of tweets at the sentence level (which can preserve the overall meaning of the tweets). Since this framework has no restriction on specific data distribution, it can be used to detect trending topics from any other social media and any context other than COVID-19. Experimental results show the superiority of our proposed approach to other baselines, including TF-IDF and latent Dirichlet allocation (LDA).
IV Jul 24, 2020
COVID TV-UNet: Segmenting COVID-19 Chest CT Images Using Connectivity Imposed U-Net
Narges Saeedizadeh, Shervin Minaee, Rahele Kafieh et al.
The novel corona-virus disease (COVID-19) pandemic has caused a major outbreak in more than 200 countries around the world, leading to a severe impact on the health and life of many people globally. As of mid-July 2020, more than 12 million people were infected, and more than 570,000 deaths were reported. Computed Tomography (CT) images can be used as an alternative to the time-consuming RT-PCR test, to detect COVID-19. In this work we propose a segmentation framework to detect chest regions in CT images which are infected by COVID-19. We use an architecture similar to the U-Net model, and train it to detect ground glass regions at the pixel level. As the infected regions tend to form a connected component (rather than randomly distributed pixels), we add a suitable regularization term to the loss function, to promote connectivity of the segmentation map for COVID-19 pixels. 2D-anisotropic total-variation is used for this purpose, and therefore the proposed model is called "TV-UNet". Through experimental results on a relatively large-scale CT segmentation dataset of around 900 images, we show that adding this new regularization term leads to a 2% gain in overall segmentation performance compared to the U-Net model. Our experimental analysis, ranging from visual evaluation of the predicted segmentation results to quantitative assessment of segmentation performance (precision, recall, Dice score, and mIoU), demonstrated great ability to identify COVID-19 associated regions of the lungs, achieving a mIoU rate of over 99%, and a Dice score of around 86%.
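To see why a 2D anisotropic total-variation term favors connected segmentation maps, consider the minimal sketch below. It uses the standard definition (sum of absolute differences between horizontally and vertically adjacent pixels); the function name, the toy maps, and the weight `lam` are assumptions for this example, and the paper's exact loss and weighting are not given in the abstract.

```python
def anisotropic_tv(image):
    """2D anisotropic total variation: sum of |vertical| and |horizontal| neighbor differences."""
    rows, cols = len(image), len(image[0])
    vertical = sum(
        abs(image[r + 1][c] - image[r][c])
        for r in range(rows - 1)
        for c in range(cols)
    )
    horizontal = sum(
        abs(image[r][c + 1] - image[r][c])
        for r in range(rows)
        for c in range(cols - 1)
    )
    return vertical + horizontal


# Two toy binary maps with the same number of foreground pixels (four each).
connected = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
scattered = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]

tv_connected = anisotropic_tv(connected)   # 8: perimeter of one 2x2 block
tv_scattered = anisotropic_tv(scattered)   # 16: four isolated pixels

# Hypothetical combined objective: segmentation loss plus a weighted TV penalty.
lam = 0.1
segmentation_loss = 0.5  # placeholder value standing in for a real loss
total_loss = segmentation_loss + lam * tv_connected
```

The scattered map pays twice the penalty of the connected one even though both contain four foreground pixels, which is the behavior the regularizer relies on.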
CL Apr 6, 2020
Deep Learning Based Text Classification: A Comprehensive Review
Shervin Minaee, Nal Kalchbrenner, Erik Cambria et al.
Deep learning based models have surpassed classical machine learning based approaches in various text classification tasks, including sentiment analysis, news categorization, question answering, and natural language inference. In this paper, we provide a comprehensive review of more than 150 deep learning based models for text classification developed in recent years, and discuss their technical contributions, similarities, and strengths. We also provide a summary of more than 40 popular datasets widely used for text classification. Finally, we provide a quantitative analysis of the performance of different deep learning models on popular benchmarks, and discuss future research directions.
CV Mar 21, 2020
Palm-GAN: Generating Realistic Palmprint Images Using Total-Variation Regularized GAN
Shervin Minaee, Mehdi Minaei, Amirali Abdolrashidi
Generating realistic palmprint (more generally biometric) images has always been an interesting and, at the same time, challenging problem. Classical statistical models fail to generate realistic-looking palmprint images, as they are not powerful enough to capture the complicated texture representation of palmprint images. In this work, we present a deep learning framework based on generative adversarial networks (GAN), which is able to generate realistic palmprint images. To help the model learn more realistic images, we propose to add a suitable regularization to the loss function, which imposes the line connectivity of generated palmprint images. This is very desirable for palmprints, as the principal lines in the palm are usually connected. We apply this framework to a popular palmprint database, and generate images which look very realistic, and similar to the samples in this database. Through experimental results, we show that the generated palmprint images look very realistic, have a good diversity, and are able to capture different parts of the prior distribution. We also report the Frechet Inception distance (FID) of the proposed model, and show that our model is able to achieve really good quantitative performance in terms of FID score.
LG Feb 10, 2020
Regularized Submodular Maximization at Scale
Ehsan Kazemi, Shervin Minaee, Moran Feldman et al.
In this paper, we propose scalable methods for maximizing a regularized submodular function $f = g - \ell$ expressed as the difference between a monotone submodular function $g$ and a modular function $\ell$. Indeed, submodularity is inherently related to the notions of diversity, coverage, and representativeness. In particular, finding the mode of many popular probabilistic models of diversity, such as determinantal point processes, submodular probabilistic models, and strongly log-concave distributions, involves maximization of (regularized) submodular functions. Since a regularized function $f$ can potentially take on negative values, the classic theory of submodular maximization, which heavily relies on the non-negativity assumption of submodular functions, may not be applicable. To circumvent this challenge, we develop the first one-pass streaming algorithm for maximizing a regularized submodular function subject to a $k$-cardinality constraint. It returns a solution $S$ with the guarantee that $f(S)\geq(φ^{-2}-ε) \cdot g(OPT)-\ell (OPT)$, where $φ$ is the golden ratio. Furthermore, we develop the first distributed algorithm that returns a solution $S$ with the guarantee that $\mathbb{E}[f(S)] \geq (1-ε) [(1-e^{-1}) \cdot g(OPT)-\ell(OPT)]$ in $O(1/ ε)$ rounds of MapReduce computation, without keeping multiple copies of the entire dataset in each round (as is usually done). We should highlight that our result, even for the unregularized case where the modular term $\ell$ is zero, improves the memory and communication complexity of the existing work by a factor of $O(1/ ε)$ while arguably providing a simpler distributed algorithm and a unifying analysis. We also empirically study the performance of our scalable methods on a set of real-life applications, including finding the mode of distributions, data summarization, and product recommendation.
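A small worked example may help make the objective $f = g - \ell$ concrete. The sketch below uses a set-coverage function as the monotone submodular $g$ and a per-element cost as the modular $\ell$; the element names, covered sets, and costs are invented for illustration, and the code only evaluates the objective rather than implementing the paper's streaming or distributed algorithms.

```python
import math


def coverage(selected, covers):
    """Monotone submodular g: number of distinct items covered by the selection."""
    covered = set()
    for element in selected:
        covered |= covers[element]
    return len(covered)


def modular_cost(selected, cost):
    """Modular ell: sum of per-element costs."""
    return sum(cost[element] for element in selected)


def regularized_value(selected, covers, cost):
    """f(S) = g(S) - ell(S)."""
    return coverage(selected, covers) - modular_cost(selected, cost)


# Hypothetical ground set: each element covers some items and has a cost.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4}}
cost = {"a": 1.0, "b": 1.5, "c": 2.0}

f_a = regularized_value({"a"}, covers, cost)        # 3 - 1.0 = 2.0
f_c = regularized_value({"c"}, covers, cost)        # 1 - 2.0 = -1.0 (negative)
f_ab = regularized_value({"a", "b"}, covers, cost)  # 4 - 2.5 = 1.5

# Constant in the streaming guarantee: phi is the golden ratio.
phi = (1 + math.sqrt(5)) / 2
streaming_factor = phi ** -2  # approximately 0.382
```

The value for `{"c"}` is negative, which illustrates why analyses that assume a non-negative objective do not directly apply. Note also that adding `"b"` gains two covered items on its own but only one once `"a"` is selected, the diminishing-returns property that makes coverage submodular.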
CV Jan 15, 2020
Image Segmentation Using Deep Learning: A Survey
Shervin Minaee, Yuri Boykov, Fatih Porikli et al.
Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial number of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarities, strengths and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.
CV Nov 30, 2019
Biometrics Recognition Using Deep Learning: A Survey
Shervin Minaee, Amirali Abdolrashidi, Hang Su et al.
Deep learning-based models have been very successful in achieving state-of-the-art results in many of the computer vision, speech recognition, and natural language processing tasks in the last few years. These models seem a natural fit for handling the ever-increasing scale of biometric recognition problems, from cellphone authentication to airport security systems. Deep learning-based models have increasingly been leveraged to improve the accuracy of different biometric recognition systems in recent years. In this work, we provide a comprehensive survey of more than 120 promising works on biometric recognition (including face, fingerprint, iris, palmprint, ear, voice, signature, and gait recognition), which deploy deep learning models, and show their strengths and potentials in different applications. For each biometric, we first introduce the available datasets that are widely used in the literature and their characteristics. We will then talk about several promising deep learning works developed for that biometric, and show their performance on popular public benchmarks. We will also discuss some of the main challenges while using these models for biometric recognition, and possible future directions to which research in this area is headed.
IR Sep 30, 2019
Hotel2vec: Learning Attribute-Aware Hotel Embeddings with Self-Supervision
Ali Sadeghian, Shervin Minaee, Ioannis Partalas et al.
We propose a neural network architecture for learning vector representations of hotels. Unlike previous works, which typically only use user click information for learning item embeddings, we propose a framework that combines several sources of data, including user clicks, hotel attributes (e.g., property type, star rating, average user rating), amenity information (e.g., the hotel has free Wi-Fi or free breakfast), and geographic information. During model training, a joint embedding is learned from all of the above information. We show that including structured attributes about hotels enables us to make better predictions in a downstream task than when we rely exclusively on click data. We train our embedding model on more than 40 million user click sessions from a leading online travel platform and learn embeddings for more than one million hotels. Our final learned embeddings integrate distinct sub-embeddings for user clicks, hotel attributes, and geographic information, providing an interpretable representation that can be used flexibly depending on the application. We show empirically that our model generates high-quality representations that boost the performance of a hotel recommendation system in addition to other applications. An important advantage of the proposed neural model is that it addresses the cold-start problem for hotels with insufficient historical click information by incorporating additional hotel attributes which are available for all hotels.
CV Sep 17, 2019
Masked-RPCA: Sparse and Low-rank Decomposition Under Overlaying Model and Application to Moving Object Detection
Amirhossein Khalilian-Gourtani, Shervin Minaee, Yao Wang
Foreground detection in a given video sequence is a pivotal step in many computer vision applications such as video surveillance systems. Robust Principal Component Analysis (RPCA) performs low-rank and sparse decomposition and accomplishes such a task when the background is stationary and the foreground is dynamic and relatively small. A fundamental issue with RPCA is the assumption that the low-rank and sparse components are added at each element, whereas in reality, the moving foreground is overlaid on the background. We propose the representation via masked decomposition (i.e., an overlaying model) where each element either belongs to the low-rank or the sparse component, decided by a mask. We propose the Masked-RPCA algorithm to recover the mask and the low-rank components simultaneously, utilizing linearizing and alternating direction techniques. We further extend our formulation to be robust to dynamic changes in the background and enforce spatial connectivity in the foreground component. Our study shows significant improvement of the detected mask compared to post-processing on the sparse component obtained by other frameworks.
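The difference between the additive assumption and the overlaying model described above can be shown on a toy frame. This sketch only contrasts the two ways of forming an observation; the function names, variable names, and pixel values are assumptions for the example, and it does not implement the Masked-RPCA recovery algorithm.

```python
def additive_model(low_rank, sparse):
    """Classic RPCA assumption: observation = low-rank + sparse at every element."""
    return [
        [l + s for l, s in zip(low_row, sparse_row)]
        for low_row, sparse_row in zip(low_rank, sparse)
    ]


def overlay_model(low_rank, foreground, mask):
    """Overlaying model: each element comes from exactly one component, chosen by the mask."""
    return [
        [f if m else l for l, f, m in zip(low_row, fg_row, mask_row)]
        for low_row, fg_row, mask_row in zip(low_rank, foreground, mask)
    ]


# Toy 2x3 frame: uniform background, one foreground pixel of intensity 9.
background = [[5, 5, 5], [5, 5, 5]]
foreground = [[0, 9, 0], [0, 0, 0]]
mask = [[0, 1, 0], [0, 0, 0]]  # 1 where the foreground occludes the background

added = additive_model(background, foreground)
overlaid = overlay_model(background, foreground, mask)
```

Under the additive model the occluded pixel reads 14 (background plus foreground), while under the overlay model it reads 9, the foreground value alone, which matches how a moving object actually hides the background behind it.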
CV Jul 28, 2019
FingerNet: Pushing The Limits of Fingerprint Recognition Using Convolutional Neural Network
Shervin Minaee, Elham Azimi, Amirali Abdolrashidi
Fingerprint recognition has been utilized for cellphone authentication, airport security and beyond. Many different features and algorithms have been proposed to improve fingerprint recognition. In this paper, we propose an end-to-end deep learning framework for fingerprint recognition using convolutional neural networks (CNNs) which can jointly learn the feature representation and perform recognition. We train our model on a large-scale fingerprint recognition dataset, and improve over previous approaches in terms of accuracy. Our proposed model is able to achieve a very high recognition accuracy on a well-known fingerprint dataset. We believe this framework can be widely used for biometrics recognition tasks, making more scalable and accurate systems possible. We have also used a visualization technique to highlight the important areas in an input fingerprint image, that mostly impact the recognition results.
CV Jul 22, 2019
DeepIris: Iris Recognition Using A Deep Learning Approach
Shervin Minaee, Amirali Abdolrashidi
Iris recognition has been an active research area during the last few decades, because of its wide applications in security, from airports to homeland security border control. Different features and algorithms have been proposed for iris recognition in the past. In this paper, we propose an end-to-end deep learning framework for iris recognition based on a residual convolutional neural network (CNN), which can jointly learn the feature representation and perform recognition. We train our model on a well-known iris recognition dataset using only a few training images from each class, and show promising results and improvements over previous approaches. We also present a visualization technique which is able to detect the important areas in iris images which most impact the recognition results. We believe this framework can be widely used for other biometrics recognition tasks, helping to build more scalable and accurate systems.
CL Apr 8, 2019
Deep-Sentiment: Sentiment Analysis Using Ensemble of CNN and Bi-LSTM Models
Shervin Minaee, Elham Azimi, AmirAli Abdolrashidi
With the popularity of social networks and e-commerce websites, sentiment analysis has become a more active area of research in the past few years. At a high level, sentiment analysis tries to understand the public opinion about a specific product or topic, or trends from reviews or tweets. Sentiment analysis plays an important role in better understanding customer/user opinion, and also in extracting social/political trends. There have been many previous works on sentiment analysis, some based on hand-engineering relevant textual features, and others based on different neural network architectures. In this work, we present a model based on an ensemble of long short-term memory (LSTM) and convolutional neural network (CNN) models, one to capture the temporal information of the data, and the other to extract the local structure thereof. Through experimental results, we show that using this ensemble model we can outperform both individual models. We are also able to achieve a very high accuracy rate compared to the previous works.
CV Feb 4, 2019
Deep-Emotion: Facial Expression Recognition Using Attentional Convolutional Network
Shervin Minaee, Amirali Abdolrashidi
Facial expression recognition has been an active research area over the past few decades, and it is still challenging due to the high intra-class variation. Traditional approaches for this problem rely on hand-crafted features such as SIFT, HOG and LBP, followed by a classifier trained on a database of images or videos. Most of these works perform reasonably well on datasets of images captured in a controlled condition, but fail to perform as well on more challenging datasets with more image variation and partial faces. In recent years, several works proposed an end-to-end framework for facial expression recognition, using deep learning models. Despite the better performance of these works, there still seems to be great room for improvement. In this work, we propose a deep learning approach based on an attentional convolutional network, which is able to focus on important parts of the face, and achieves significant improvement over previous models on multiple datasets, including FER-2013, CK+, FERG, and JAFFE. We also use a visualization technique which is able to find important face regions for detecting different emotions, based on the classifier's output. Through experimental results, we show that different emotions seem to be sensitive to different parts of the face.
CV Dec 25, 2018
Finger-GAN: Generating Realistic Fingerprint Images Using Connectivity Imposed GAN
Shervin Minaee, Amirali Abdolrashidi
Generating realistic biometric images has been an interesting and, at the same time, challenging problem. Classical statistical models fail to generate realistic-looking fingerprint images, as they are not powerful enough to capture the complicated texture representation in fingerprint images. In this work, we present a machine learning framework based on generative adversarial networks (GAN), which is able to generate fingerprint images sampled from a prior distribution (learned from a set of training images). We also add a suitable regularization term to the loss function, to impose the connectivity of generated fingerprint images. This is highly desirable for fingerprints, as the lines in each finger are usually connected. We apply this framework to two popular fingerprint databases, and generate images which look very realistic, and similar to the samples in those databases. Through experimental results, we show that the generated fingerprint images have a good diversity, and are able to capture different parts of the prior distribution. We also evaluate the Frechet Inception distance (FID) of our proposed model, and show that our model is able to achieve good quantitative performance in terms of this score.
CV Dec 12, 2018
Iris-GAN: Learning to Generate Realistic Iris Images Using Convolutional GAN
Shervin Minaee, Amirali Abdolrashidi
Generating iris images which look realistic is both an interesting and challenging problem. Most of the classical statistical models are not powerful enough to capture the complicated texture representation in iris images, and therefore fail to generate iris images which look realistic. In this work, we present a machine learning framework based on a generative adversarial network (GAN), which is able to generate iris images sampled from a prior distribution (learned from a set of training images). We apply this framework to two popular iris databases, and generate images which look very realistic, and similar to the image distribution in those databases. Through experimental results, we show that the generated iris images have a good diversity, and are able to capture different parts of the prior distribution.
CVDec 12, 2018
Efficient Super Resolution For Large-Scale Images Using Attentional GANHarsh Nilesh Pathak, Xinxin Li, Shervin Minaee et al.
Single Image Super Resolution (SISR) is a well-researched problem with broad commercial relevance. However, most of the SISR literature focuses on small-size images under 500px, whereas business needs can mandate the generation of very high resolution images. At Expedia Group, we were tasked with generating images of at least 2000px for display on the website, four times greater than the sizes typically reported in the literature. This requirement poses a challenge that state-of-the-art models, validated on small images, have not been proven to handle. In this paper, we investigate solutions to the problem of generating high-quality images for large-scale super resolution in a commercial setting. We find that training a generative adversarial network (GAN) with attention from scratch using a large-scale lodging image data set generates images with high PSNR and SSIM scores. We describe a novel attentional SISR model for large-scale images, A-SRGAN, that uses a Flexible Self Attention layer to enable processing of large-scale images. We also describe a distributed algorithm which speeds up training by around a factor of five.
CVJun 27, 2018
MTBI Identification From Diffusion MR Images Using Bag of Adversarial Visual FeaturesShervin Minaee, Yao Wang, Alp Aygar et al.
In this work, we propose bag of adversarial features (BAF) for identifying mild traumatic brain injury (MTBI) patients from their diffusion magnetic resonance images (MRI) (obtained within one month of injury) by incorporating unsupervised feature learning techniques. MTBI is a growing public health problem with an estimated incidence of over 1.7 million people annually in the US. Diagnosis is based on clinical history and symptoms, and accurate, concrete measures of injury are lacking. Unlike most previous works, which use hand-crafted features extracted from different parts of the brain for MTBI classification, we employ feature learning algorithms to learn a more discriminative representation for this task. A major challenge in this field thus far is the relatively small number of subjects available for training. This makes it difficult to use an end-to-end convolutional neural network to directly classify a subject from MR images. To overcome this challenge, we first apply an adversarial auto-encoder (with convolutional structure) to learn patch-level features from overlapping image patches extracted from different brain regions. We then aggregate these features through a bag-of-words approach. We perform an extensive experimental study on a dataset of 227 subjects (including 109 MTBI patients, and 118 age and sex matched healthy controls), and compare the bag-of-deep-features with several previous approaches. Our experimental results show that the BAF significantly outperforms earlier works relying on the mean values of MR metrics in selected brain regions.
CVJun 22, 2018
Ad-Net: Audio-Visual Convolutional Neural Network for Advertisement Detection In VideosShervin Minaee, Imed Bouazizi, Prakash Kolan et al.
Personalized advertisement is a crucial task for many online businesses and video broadcasters. Many of today's broadcasters use the same commercial for all customers, but different viewers have different interests, and it seems reasonable to show customized commercials to different groups of people, chosen based on their demographic features and history. In this project, we propose a framework that takes broadcast videos, analyzes them, detects the commercial, and replaces it with a more suitable one. We propose a two-stream audio-visual convolutional neural network in which one branch analyzes the visual information and the other analyzes the audio information; the audio and visual embeddings are then fused and used for commercial detection and content categorization. We show that using both the visual and audio content of the videos significantly improves the model performance for video analysis. This network is trained on a dataset of more than 50k regular video and commercial shots, and achieves much better performance than models based on hand-crafted features.
CVApr 6, 2018
Image Segmentation Using Subspace Representation and Sparse DecompositionShervin Minaee
Image foreground extraction is a classical problem in image processing and vision, with a large range of applications. In this dissertation, we focus on the extraction of text and graphics in mixed-content images, and design novel approaches for various aspects of this problem. We first propose a sparse decomposition framework, which models the background by a subspace containing smooth basis vectors, and foreground as a sparse and connected component. We then formulate an optimization framework to solve this problem, by adding suitable regularizations to the cost function to promote the desired characteristics of each component. We present two techniques to solve the proposed optimization problem, one based on alternating direction method of multipliers (ADMM), and the other one based on robust regression. Promising results are obtained for screen content image segmentation using the proposed algorithm. We then propose a robust subspace learning algorithm for the representation of the background component using training images that could contain both background and foreground components, as well as noise. With the learnt subspace for the background, we can further improve the segmentation results, compared to using a fixed subspace. Lastly, we investigate a different class of signal/image decomposition problem, where only one signal component is active at each signal element. In this case, besides estimating each component, we need to find their supports, which can be specified by a binary mask. We propose a mixed-integer programming problem, that jointly estimates the two components and their supports through an alternating optimization scheme. We show the application of this algorithm on various problems, including image segmentation, video motion segmentation, and also separation of text from textured images.
CVFeb 8, 2018
A Deep Unsupervised Learning Approach Toward MTBI Identification Using Diffusion MRIShervin Minaee, Yao Wang, Anna Choromanska et al.
Mild traumatic brain injury is a growing public health problem with an estimated incidence of over 1.7 million people annually in the US. Diagnosis is based on clinical history and symptoms, and accurate, concrete measures of injury are lacking. This work aims to directly use diffusion MR images obtained within one month of trauma to detect injury, by incorporating deep learning techniques. To overcome the challenge due to limited training data, we describe each brain region using the bag-of-words representation, which specifies the distribution of representative patch patterns. We apply a convolutional auto-encoder to learn patch-level features, in an unsupervised manner, from overlapping image patches extracted from the diffusion MR images of the brain. Our experimental results show that the bag-of-words representation using patch-level features learnt by the auto-encoder provides performance similar to that using the raw patch patterns, and that both significantly outperform earlier work relying on the mean values of MR metrics in selected brain regions.
CVOct 18, 2017
Identifying Mild Traumatic Brain Injury Patients From MR Images Using Bag of Visual WordsShervin Minaee, Siyun Wang, Yao Wang et al.
Mild traumatic brain injury (mTBI) is a growing public health problem with an estimated incidence of one million people annually in the US. Neurocognitive tests are used both to assess the patient's condition and to monitor the patient's progress. This work aims to directly use MR images taken shortly after injury to detect whether a patient suffers from mTBI, by incorporating machine learning and computer vision techniques to learn features suitable for discriminating between mTBI and normal patients. We focus on 3 regions in the brain, extract multiple patches from them, and use the bag-of-visual-words technique to represent each subject as a histogram of representative patterns derived from patches from all training subjects. After extracting the features, we use greedy forward feature selection to choose a subset of features which achieves the highest accuracy. We show through experimental studies that BoW features perform better than the simple mean value features which were used previously.
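The bag-of-visual-words step described in the abstract above can be illustrated with a minimal NumPy sketch. It assumes patches have already been extracted and flattened into row vectors; the plain k-means routine, the function names `kmeans` and `bow_histogram`, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centers (the codebook)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Distance from every sample to every center, shape (n_samples, k).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return centers

def bow_histogram(patches, centers):
    """Normalized histogram of nearest-codeword assignments for one subject."""
    d = np.linalg.norm(patches[:, None, :] - centers[None, :, :], axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return counts / counts.sum()
```

In this sketch the codebook would be learned once from patches pooled over all training subjects, and each subject is then summarized by `bow_histogram(subject_patches, centers)`, which can be passed to any classifier or feature-selection step.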
CVAug 27, 2017
A Machine Learning Approach For Identifying Patients with Mild Traumatic Brain Injury Using Diffusion MRI ModelingShervin Minaee, Yao Wang, Sohae Chung et al.
While diffusion MRI has been extremely promising in the study of MTBI, identifying patients with recent MTBI remains a challenge. The literature is mixed with regard to localizing injury in these patients; however, gray matter such as the thalamus and white matter including the corpus callosum and frontal deep white matter have been repeatedly implicated as areas at high risk for injury. The purpose of this study is to develop a machine learning framework to classify MTBI patients and controls using features derived from multi-shell diffusion MRI in the thalamus, frontal white matter and corpus callosum.
CLAug 5, 2017
Automatic Question-Answering Using A Deep Similarity Neural NetworkShervin Minaee, Zhu Liu
Automatic question-answering is a classical problem in natural language processing, which aims at designing systems that can automatically answer a question in the same way as a human does. In this work, we propose a deep learning based model for automatic question-answering. First, the questions and answers are embedded using neural probabilistic modeling. Then a deep similarity neural network is trained to find the similarity score of a question-answer pair. For each question, the best answer is then found as the one with the highest similarity score. We first train this model on a large-scale public question-answering database, and then fine-tune it to transfer to the customer-care chat data. We have also tested our framework on a public question-answering database and achieved very good performance.
CVJun 11, 2017
Text Extraction From Texture Images Using Masked Signal DecompositionShervin Minaee, Yao Wang
Text extraction is an important problem in image processing with applications from optical character recognition to autonomous driving. Most of the traditional text segmentation algorithms consider separating text from a simple background (which usually has a different color from texts). In this work we consider separating texts from a textured background, that has similar color to texts. We look at this problem from a signal decomposition perspective, and consider a more realistic scenario where signal components are overlaid on top of each other (instead of adding together). When the signals are overlaid, to separate signal components, we need to find a binary mask which shows the support of each component. Because directly solving the binary mask is intractable, we relax this problem to the approximated continuous problem, and solve it by alternating optimization method. We show that the proposed algorithm achieves significantly better results than other recent works on several challenging images.
CVApr 25, 2017
An ADMM Approach to Masked Signal Decomposition Using Subspace RepresentationShervin Minaee, Yao Wang
Signal decomposition is a classical problem in signal processing, which aims to separate an observed signal into two or more components, each with its own property. Usually each component is described by its own subspace or dictionary. Extensive research has been done for the case where the components are additive, but in real world applications, the components are often non-additive. For example, an image may consist of a foreground object overlaid on a background, where each pixel either belongs to the foreground or the background. In such a situation, to separate signal components, we need to find a binary mask which shows the location of each component. This requires solving a binary optimization problem. Since most binary optimization problems are intractable, we relax this problem to an approximate continuous problem, and solve it by an alternating optimization technique. We show the application of the proposed algorithm to three problems: separation of text from background in images, separation of moving objects from a background undergoing global camera motion in videos, and separation of sinusoidal and spike components in one-dimensional signals. We demonstrate in each case that considering the non-additive nature of the problem can lead to significant improvement.
CVMar 14, 2017
Subspace Learning in The Presence of Sparse Structured Outliers and NoiseShervin Minaee, Yao Wang
Subspace learning is an important problem, which has many applications in image and video processing. It can be used to find a low-dimensional representation of signals and images. But in many applications, the desired signal is heavily distorted by outliers and noise, which negatively affect the learned subspace. In this work, we present a novel algorithm for learning a subspace for signal representation, in the presence of structured outliers and noise. The proposed algorithm tries to jointly detect the outliers and learn the subspace for images. We present an alternating optimization algorithm for solving this problem, which iterates between learning the subspace and finding the outliers. This algorithm has been trained on a large number of image patches, and the learned subspace is used for image segmentation, and is shown to achieve better segmentation results than prior methods, including least absolute deviation fitting, k-means clustering based segmentation in DjVu, and shape primitive extraction and coding algorithm.
CVFeb 4, 2017
An Experimental Study of Deep Convolutional Features For Iris RecognitionShervin Minaee, Amirali Abdolrashidi, Yao Wang
Iris is one of the popular biometrics that is widely used for identity authentication. Different features have been used to perform iris recognition in the past. Most of them are based on hand-crafted features designed by biometrics experts. Due to the tremendous success of deep learning in computer vision problems, there has been a lot of interest in applying features learned by convolutional neural networks on general image recognition to other tasks such as segmentation, face recognition, and object detection. In this paper, we have investigated the application of deep features extracted from VGG-Net for iris recognition. The proposed scheme has been tested on two well-known iris databases, and has shown promising results with the best accuracy rate of 99.4%, which outperforms the previous best result.
CVNov 23, 2016
Image Segmentation Using Overlapping Group SparsityShervin Minaee, Yao Wang
Sparse decomposition has been widely used for different applications, such as source separation, image classification and image denoising. This paper presents a new algorithm for segmentation of an image into background and foreground text and graphics using sparse decomposition. First, the background is represented using a suitable smooth model, which is a linear combination of a few smoothly varying basis functions, and the foreground text and graphics are modeled as a sparse component overlaid on the smooth background. Then the background and foreground are separated using a sparse decomposition framework and imposing some prior information, which promotes the smoothness of the background, and the sparsity and connectivity of foreground pixels. This algorithm has been tested on a dataset of images extracted from HEVC standard test sequences for screen content coding, and is shown to outperform prior methods, including least absolute deviation fitting, k-means clustering based segmentation in DjVu, and shape primitive extraction and coding algorithm.
CVSep 13, 2016
Image Decomposition Using a Robust Regression ApproachShervin Minaee, Yao Wang
This paper considers how to separate text and/or graphics from smooth background in screen content and mixed content images and proposes an algorithm to perform this segmentation task. The proposed methods make use of the fact that the background in each block is usually smoothly varying and can be modeled well by a linear combination of a few smoothly varying basis functions, while the foreground text and graphics create sharp discontinuity. This algorithm separates the background and foreground pixels by trying to fit pixel values in the block into a smooth function using a robust regression method. The inlier pixels that can be well represented with the smooth model will be considered as background, while remaining outlier pixels will be considered foreground. We have also created a dataset of screen content images extracted from HEVC standard test sequences for screen content coding with their ground truth segmentation result which can be used for this task. The proposed algorithm has been tested on the dataset mentioned above and is shown to have superior performance over other methods, such as the hierarchical k-means clustering algorithm, shape primitive extraction and coding, and the least absolute deviation fitting scheme for foreground segmentation.
CVJul 30, 2016
Face Recognition Using Scattering Convolutional NetworkShervin Minaee, Amirali Abdolrashidi, Yao Wang
Face recognition has been an active research area in the past few decades. In general, face recognition can be very challenging due to variations in viewpoint, illumination, facial expression, etc. Therefore it is essential to extract features which are invariant to some or all of these variations. Here a new image representation, called scattering transform/network, has been used to extract features from faces. The scattering transform is a kind of convolutional network which provides a powerful multi-layer representation for signals. After extraction of scattering features, PCA is applied to reduce the dimensionality of the data and then a multi-class support vector machine is used to perform recognition. The proposed algorithm has been tested on three face datasets and achieved a very high recognition rate.
CVJul 8, 2016
Screen Content Image Segmentation Using Robust Regression and Sparse DecompositionShervin Minaee, Yao Wang
This paper considers how to separate text and/or graphics from smooth background in screen content and mixed document images and proposes two approaches to perform this segmentation task. The proposed methods make use of the fact that the background in each block is usually smoothly varying and can be modeled well by a linear combination of a few smoothly varying basis functions, while the foreground text and graphics create sharp discontinuity. The algorithms separate the background and foreground pixels by trying to fit background pixel values in the block into a smooth function using two different schemes. One is based on robust regression, where the inlier pixels will be considered as background, while remaining outlier pixels will be considered foreground. The second approach uses a sparse decomposition framework where the background and foreground layers are modeled with smooth and sparse components, respectively. These algorithms have been tested on images extracted from HEVC standard test sequences for screen content coding, and are shown to have superior performance over previous approaches. The proposed methods can be used in different applications such as text extraction, separate coding of background and foreground for compression of screen content, and medical image segmentation.
CVMar 30, 2016
Palmprint Recognition Using Deep Scattering Convolutional NetworkShervin Minaee, Yao Wang
Palmprint recognition has drawn a lot of attention in recent years. Many algorithms have been proposed for palmprint recognition in the past, the majority of them being based on features extracted from the transform domain. Many of these transform domain features are not translation or rotation invariant, and therefore a great deal of preprocessing is needed to align the images. In this paper, a powerful image representation, called scattering network/transform, is used for palmprint recognition. A scattering network is a convolutional network whose architecture and filters are predefined wavelet transforms. The first layer of the scattering network captures features similar to SIFT descriptors, and the higher-layer features capture higher-frequency content of the signal, which is lost in SIFT and other similar descriptors. After extraction of the scattering features, their dimensionality is reduced by applying principal component analysis (PCA), which reduces the computational complexity of the recognition task. Two different classifiers are used for recognition: multi-class SVM and minimum-distance classifier. The proposed scheme has been tested on a well-known palmprint database and achieved accuracy rates of 99.95% and 100% using the minimum distance classifier and SVM, respectively.
CVFeb 7, 2016
Screen Content Image Segmentation Using Sparse Decomposition and Total Variation MinimizationShervin Minaee, Yao Wang
Sparse decomposition has been widely used for different applications, such as source separation, image classification, image denoising and more. This paper presents a new algorithm for segmentation of an image into background and foreground text and graphics using sparse decomposition and total variation minimization. The proposed method is designed based on the assumption that the background part of the image is smoothly varying and can be represented by a linear combination of a few smoothly varying basis functions, while the foreground text and graphics can be modeled with a sparse component overlaid on the smooth background. The background and foreground are separated using a sparse decomposition framework regularized with a few suitable regularization terms which promote the sparsity and connectivity of foreground pixels. This algorithm has been tested on a dataset of images extracted from HEVC standard test sequences for screen content coding, and is shown to have superior performance over some prior methods, including least absolute deviation fitting, k-means clustering based segmentation in DjVu and shape primitive extraction and coding (SPEC) algorithm.
CVNov 21, 2015
Screen Content Image Segmentation Using Sparse-Smooth DecompositionShervin Minaee, Amirali Abdolrashidi, Yao Wang
Sparse decomposition has been extensively used for different applications including signal compression and denoising and document analysis. In this paper, sparse decomposition is used for image segmentation. The proposed algorithm separates the background and foreground using a sparse-smooth decomposition technique such that the smooth and sparse components correspond to the background and foreground respectively. This algorithm is tested on several test images from HEVC test sequences and is shown to have superior performance over other methods, such as the hierarchical k-means clustering in DjVu. This segmentation algorithm can also be used for text extraction, video compression and medical image segmentation.
CVSep 11, 2015
Fingerprint Recognition Using Translation Invariant Scattering NetworkShervin Minaee, Yao Wang
Fingerprint recognition has drawn a lot of attention during the last decades. Different features and algorithms have been used for fingerprint recognition in the past. In this paper, a powerful image representation, called scattering transform/network, is used for recognition. A scattering network is a convolutional network whose architecture and filters are predefined wavelet transforms. The first layer of the scattering representation is similar to SIFT descriptors, and the higher layers capture higher-frequency content of the signal. After extraction of scattering features, their dimensionality is reduced by applying principal component analysis (PCA). Finally, a multi-class SVM is used to perform template matching for the recognition task. The proposed scheme is tested on a well-known fingerprint database and has shown promising results with the best accuracy rate of 98%.
CVJul 8, 2015
Iris Recognition Using Scattering Transform and Textural FeaturesShervin Minaee, AmirAli Abdolrashidi, Yao Wang
Iris recognition has drawn a lot of attention since the mid-twentieth century. Among all biometric features, iris is known to possess a rich set of features. Different features have been used to perform iris recognition in the past. In this paper, two powerful sets of features are introduced to be used for iris recognition: scattering transform-based features and textural features. PCA is also applied on the extracted features to reduce the dimensionality of the feature vector while preserving most of the information of its initial value. Minimum distance classifier is used to perform template matching for each new test sample. The proposed scheme is tested on a well-known iris database, and showed promising results with the best accuracy rate of 99.2%.
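The PCA and minimum distance classification steps mentioned in the abstract above can be sketched as follows. This is a minimal NumPy illustration that assumes feature vectors (for example scattering or textural features) have already been computed; the function names are hypothetical, and treating the minimum distance classifier as a nearest-class-mean rule with Euclidean distance is an assumption, since the abstract does not specify the distance or the template definition.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal directions (as columns)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def pca_transform(X, mean, components):
    """Project centered feature vectors onto the retained directions."""
    return (X - mean) @ components

def min_distance_predict(Z_train, y_train, Z_test):
    """Assign each test vector the label of the nearest class mean (Euclidean)."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]
```

A typical call sequence under these assumptions is `mean, W = pca_fit(X_train, k)`, followed by `min_distance_predict(pca_transform(X_train, mean, W), y_train, pca_transform(X_test, mean, W))`.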
CVJan 15, 2015
Screen Content Image Segmentation Using Least Absolute Deviation FittingShervin Minaee, Yao Wang
We propose an algorithm for separating the foreground (mainly text and line graphics) from the smoothly varying background in screen content images. The proposed method is designed based on the assumption that the background part of the image is smoothly varying and can be represented by a linear combination of a few smoothly varying basis functions, while the foreground text and graphics create sharp discontinuity and cannot be modeled by this smooth representation. The algorithm separates the background and foreground using a least absolute deviation method to fit the smooth model to the image pixels. This algorithm has been tested on several images from HEVC standard test sequences for screen content coding, and is shown to have superior performance over other popular methods, such as k-means clustering based segmentation in DjVu and shape primitive extraction and coding (SPEC) algorithm. Such background/foreground segmentation is an important pre-processing step for text extraction and separate coding of background and foreground for compression of screen content images.
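The fitting idea in the abstract above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the smooth model is taken to be a small set of low-frequency 2-D DCT basis functions, the least absolute deviation fit is approximated with iteratively reweighted least squares (IRLS), and the names `dct_basis` and `lad_segment`, the basis size `k`, and the residual threshold `tol` are all hypothetical choices.

```python
import numpy as np

def dct_basis(n, k):
    """First k 1-D DCT-II cosine basis vectors of length n, as columns."""
    x = np.arange(n)
    return np.stack(
        [np.cos(np.pi * (2 * x + 1) * u / (2 * n)) for u in range(k)], axis=1
    )

def lad_segment(block, k=3, tol=10.0, iters=50, eps=1e-6):
    """Fit a smooth model to `block` by approximate least absolute deviation
    (IRLS) and return a boolean mask that is True for foreground pixels."""
    h, w = block.shape
    # Separable 2-D basis; rows follow the row-major order of block.ravel().
    P = np.kron(dct_basis(h, k), dct_basis(w, k))  # shape (h*w, k*k)
    y = block.astype(float).ravel()
    alpha = np.linalg.lstsq(P, y, rcond=None)[0]   # ordinary least-squares start
    for _ in range(iters):
        r = np.abs(y - P @ alpha)
        sw = np.sqrt(1.0 / np.maximum(r, eps))     # square roots of the L1 IRLS weights
        alpha = np.linalg.lstsq(P * sw[:, None], y * sw, rcond=None)[0]
    residual = np.abs(y - P @ alpha).reshape(h, w)
    return residual > tol                          # large residual = foreground
```

Pixels that the smooth model explains well end up with small residuals and are labeled background; the threshold `tol` is in the same units as the pixel intensities and would need tuning for real screen content.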
CVDec 16, 2014
A Robust Regression Approach for Background/Foreground SegmentationShervin Minaee, Haoping Yu, Yao Wang
Background/foreground segmentation has a lot of applications in image and video processing. In this paper, a segmentation algorithm is proposed which is mainly designed for text and line extraction in screen content. The proposed method makes use of the fact that the background in each block is usually smoothly varying and can be modeled well by a linear combination of a few smoothly varying basis functions, while the foreground text and graphics create sharp discontinuity. The algorithm separates the background and foreground pixels by trying to fit pixel values in the block into a smooth function using a robust regression method. The inlier pixels that can fit well will be considered as background, while remaining outlier pixels will be considered foreground. This algorithm has been extensively tested on several images from HEVC standard test sequences for screen content coding, and is shown to have superior performance over other methods, such as the k-means clustering based segmentation algorithm in DjVu. This background/foreground segmentation can be used in different applications such as text extraction, separate coding of background and foreground for compression of screen content and mixed content documents, principal line extraction from palmprints, and crease detection in fingerprint images.
CVSep 27, 2014
On The Power of Joint Wavelet-DCT Features for Multispectral Palmprint RecognitionShervin Minaee, AmirAli Abdolrashidi
Biometric-based identification has drawn a lot of attention in recent years. Among all biometrics, palmprint is known to possess a rich set of features. In this paper we have proposed to use DCT-based features in parallel with wavelet-based ones for palmprint identification. PCA is applied to the features to reduce their dimensionality and the majority voting algorithm is used to perform classification. The features introduced here result in a near-perfectly accurate identification. This method is tested on a well-known multispectral palmprint database and an accuracy rate of 99.97-100% is achieved, outperforming all previous methods in similar conditions.
CVAug 28, 2014
Multispectral Palmprint Recognition Using Textural FeaturesShervin Minaee, AmirAli Abdolrashidi
In order to utilize identification to the best extent, we need robust and fast algorithms and systems to process the data. Having palmprint as a reliable and unique characteristic of every person, we extract and use its features based on its geometry, lines and angles. There are countless ways to define measures for the recognition task. To analyze a new point of view, we extracted textural features and used them for palmprint recognition. Co-occurrence matrix can be used for textural feature extraction. As classifiers, we have used the minimum distance classifier (MDC) and the weighted majority voting system (WMV). The proposed method is tested on a well-known multispectral palmprint dataset of 6000 samples and an accuracy rate of 99.96-100% is obtained for most scenarios which outperforms all previous works in multispectral palmprint recognition.
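The co-occurrence matrix mentioned in the abstract above can be computed directly with NumPy. The sketch below assumes an integer image already quantized to `levels` gray levels and a single pixel offset; the three statistics shown (contrast, energy, homogeneity) are common co-occurrence features chosen for illustration, and the abstract does not state which features the authors actually used.

```python
import numpy as np

def cooccurrence_matrix(img, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the offset (dy, dx)."""
    h, w = img.shape
    glcm = np.zeros((levels, levels), dtype=float)
    # Visit only pixels whose offset neighbor lies inside the image.
    for i in range(max(0, -dy), min(h, h - dy)):
        for j in range(max(0, -dx), min(w, w - dx)):
            glcm[img[i, j], img[i + dy, j + dx]] += 1
    return glcm / glcm.sum()

def texture_features(glcm):
    """A few standard statistics of a normalized co-occurrence matrix."""
    i, j = np.indices(glcm.shape)
    return {
        "contrast": float(np.sum(glcm * (i - j) ** 2)),
        "energy": float(np.sum(glcm ** 2)),
        "homogeneity": float(np.sum(glcm / (1.0 + np.abs(i - j)))),
    }
```

Features from several offsets (and, for multispectral data, several bands) could be concatenated into one vector before classification; that combination step is left out of this sketch.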
CVAug 16, 2014
Highly Accurate Multispectral Palmprint Recognition Using Statistical and Wavelet FeaturesShervin Minaee, AmirAli Abdolrashidi
Palmprint is one of the most useful physiological biometrics that can be used as a powerful means in personal recognition systems. The major features of palmprints are palm lines, wrinkles and ridges, and many approaches use them in different ways towards solving the palmprint recognition problem. Here we have proposed to use a set of statistical and wavelet-based features: statistical features to capture the general characteristics of palmprints, and wavelet-based features to find information not evident in the spatial domain. We also use two different classification approaches, a minimum distance classifier scheme and a weighted majority voting algorithm, to perform palmprint matching. The proposed method is tested on a well-known palmprint dataset of 6000 samples and has shown an impressive accuracy rate of 99.65%-100% for most scenarios.