Yosi Keller

CV
h-index13
31papers
1,504citations
Novelty52%
AI Score51

31 Papers

CVJul 12, 2022Code
Camera Pose Auto-Encoders for Improving Pose Regression

Yoli Shavit, Yosi Keller

Absolute pose regressor (APR) networks are trained to estimate the pose of the camera given a captured image. They compute latent image representations from which the camera position and orientation are regressed. APRs provide a different tradeoff between localization accuracy, runtime, and memory, compared to structure-based localization schemes that provide state-of-the-art accuracy. In this work, we introduce Camera Pose Auto-Encoders (PAEs), multilayer perceptrons that are trained via a Teacher-Student approach to encode camera poses using APRs as their teachers. We show that the resulting latent pose representations can closely reproduce APR performance and demonstrate their effectiveness for related tasks. Specifically, we propose a light-weight test-time optimization in which the closest train poses are encoded and used to refine camera position estimation. This procedure achieves a new state-of-the-art position accuracy for APRs, on both the CambridgeLandmarks and 7Scenes benchmarks. We also show that train images can be reconstructed from the learned pose encoding, paving the way for integrating visual information from the train set at a low memory cost. Our code and pre-trained models are available at https://github.com/yolish/camera-pose-auto-encoders.

CVAug 22, 2023
Coarse-to-Fine Multi-Scene Pose Regression with Transformers

Yoli Shavit, Ron Ferens, Yosi Keller

Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach \cite{shavit2021learning} by introducing a mixed classification-regression architecture that improves the localization accuracy. Our method is evaluated on commonly benchmark indoor and outdoor datasets and has been shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors.

CVFeb 3Code
LaVPR: Benchmarking Language and Vision for Place Recognition

Ofer Idan, Dan Badur, Yosi Keller et al.

Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at https://github.com/oferidan1/LaVPR.

CVOct 7, 2022
Learning to embed semantic similarity for joint image-text retrieval

Noam Malali, Yosi Keller

We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. For that, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss, that is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flicke30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches.

CVMar 5, 2023
Learning to Localize in Unseen Scenes with Relative Pose Regressors

Ofer Idan, Yoli Shavit, Yosi Keller

Relative pose regressors (RPRs) localize a camera by estimating its relative translation and rotation to a pose-labelled reference. Unlike scene coordinate regression and absolute pose regression methods, which learn absolute scene parameters, RPRs can (theoretically) localize in unseen environments, since they only learn the residual pose between camera pairs. In practice, however, the performance of RPRs is significantly degraded in unseen scenes. In this work, we propose to aggregate paired feature maps into latent codes, instead of operating on global image descriptors, in order to improve the generalization of RPRs. We implement aggregation with concatenation, projection, and attention operations (Transformer Encoders) and learn to regress the relative pose parameters from the resulting latent codes. We further make use of a recently proposed continuous representation of rotation matrices, which alleviates the limitations of the commonly used quaternions. Compared to state-of-the-art RPRs, our model is shown to localize significantly better in unseen environments, across both indoor and outdoor benchmarks, while maintaining competitive performance in seen scenes. We validate our findings and architecture design through multiple ablations. Our code and pretrained models is publicly available.

CVAug 27, 2023
FaceCoresetNet: Differentiable Coresets for Face Set Recognition

Gil Shapira, Yosi Keller

In set-based face recognition, we aim to compute the most discriminative descriptor from an unbounded set of images and videos showing a single person. A discriminative descriptor balances two policies when aggregating information from a given set. The first is a quality-based policy: emphasizing high-quality and down-weighting low-quality images. The second is a diversity-based policy: emphasizing unique images in the set and down-weighting multiple occurrences of similar images as found in video clips which can overwhelm the set representation. This work frames face-set representation as a differentiable coreset selection problem. Our model learns how to select a small coreset of the input set that balances quality and diversity policies using a learned metric parameterized by the face quality, optimized end-to-end. The selection process is a differentiable farthest-point sampling (FPS) realized by approximating the non-differentiable Argmax operation with differentiable sampling from the Gumbel-Softmax distribution of distances. The small coreset is later used as queries in a self and cross-attention architecture to enrich the descriptor with information from the whole set. Our model is order-invariant and linear in the input set size. We set a new SOTA to set face verification on the IJB-B and IJB-C datasets. Our code is publicly available.

CVApr 23, 2023
Deep Convolutional Tables: Deep Learning without Convolutions

Shay Dekel, Yosi Keller, Aharon Bar-Hillel

We propose a novel formulation of deep networks that do not use dot-product neurons and rely on a hierarchy of voting tables instead, denoted as Convolutional Tables (CT), to enable accelerated CPU-based inference. Convolutional layers are the most time-consuming bottleneck in contemporary deep learning techniques, severely limiting their use in Internet of Things and CPU-based devices. The proposed CT performs a fern operation at each image location: it encodes the location environment into a binary index and uses the index to retrieve the desired local output from a table. The results of multiple tables are combined to derive the final output. The computational complexity of a CT transformation is independent of the patch (filter) size and grows gracefully with the number of channels, outperforming comparable convolutional layers. It is shown to have a better capacity:compute ratio than dot-product neurons, and that deep CT networks exhibit a universal approximation property similar to neural networks. As the transformation involves computing discrete indices, we derive a soft relaxation and gradient-based approach for training the CT hierarchy. Deep CT networks have been experimentally shown to have accuracy comparable to that of CNNs of similar architectures. In the low compute regime, they enable an error:speed trade-off superior to alternative efficient CNN architectures.

CVMar 5, 2023
Estimating Extreme 3D Image Rotation with Transformer Cross-Attention

Shay Dekel, Yosi Keller, Martin Cadik

The estimation of large and extreme image rotation plays a key role in multiple computer vision domains, where the rotated images are related by a limited or a non-overlapping field of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume to estimate the relative rotation between image pairs. In this work, we propose a cross-attention-based approach that utilizes CNN feature maps and a Transformer-Encoder, to compute the cross-attention between the activation maps of the image pairs, which is shown to be an improved equivalent of the 4D correlation volume, used in previous works. In the suggested approach, higher attention scores are associated with image regions that encode visual cues of rotation. Our approach is end-to-end trainable and optimizes a simple regression loss. It is experimentally shown to outperform contemporary state-of-the-art schemes when applied to commonly used image rotation datasets and benchmarks, and establishes a new state-of-the-art accuracy on these datasets. We make our code publicly available.

CVMar 5, 2023
HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset

Ron Ferens, Yosi Keller

In this work, we propose HyperPose, which utilizes hyper-networks in absolute camera pose regressors. The inherent appearance variations in natural scenes, attributable to environmental conditions, perspective, and lighting, induce a significant domain disparity between the training and test datasets. This disparity degrades the precision of contemporary localization networks. To mitigate this, we advocate for incorporating hypernetworks into single-scene and multiscene camera pose regression models. During inference, the hypernetwork dynamically computes adaptive weights for the localization regression heads based on the particular input image, effectively narrowing the domain gap. Using indoor and outdoor datasets, we evaluate the HyperPose methodology across multiple established absolute pose regression architectures. We also introduce and share the Extended Cambridge Landmarks (ECL), a novel localization dataset, based on the Cambridge Landmarks dataset, showing it in multiple seasons with significantly varying appearance conditions. Our empirical experiments demonstrate that HyperPose yields notable performance enhancements for single- and multi-scene architectures. We have made our source code, pre-trained models, and the ECL dataset openly available.

CVMar 8Code
GazeShift: Unsupervised Gaze Estimation and Dataset for VR

Gil Shapira, Ishay Goldin, Evgeny Artyomov et al.

Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity - particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84-degree mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15-degree person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at https://github.com/gazeshift3/gazeshift.

CVAug 12, 2025Code
Relative Pose Regression with Pose Auto-Encoders: Enhancing Accuracy and Data Efficiency for Retail Applications

Yoli Shavit, Yosi Keller

Accurate camera localization is crucial for modern retail environments, enabling enhanced customer experiences, streamlined inventory management, and autonomous operations. While Absolute Pose Regression (APR) from a single image offers a promising solution, approaches that incorporate visual and spatial scene priors tend to achieve higher accuracy. Camera Pose Auto-Encoders (PAEs) have recently been introduced to embed such priors into APR. In this work, we extend PAEs to the task of Relative Pose Regression (RPR) and propose a novel re-localization scheme that refines APR predictions using PAE-based RPR, without requiring additional storage of images or pose data. We first introduce PAE-based RPR and establish its effectiveness by comparing it with image-based RPR models of equivalent architectures. We then demonstrate that our refinement strategy, driven by a PAE-based RPR, enhances APR localization accuracy on indoor benchmarks. Notably, our method is shown to achieve competitive performance even when trained with only 30% of the data, substantially reducing the data collection burden for retail deployment. Our code and pre-trained models are available at: https://github.com/yolish/camera-pose-auto-encoders

CVJan 17, 2022Code
Synthesis and Reconstruction of Fingerprints using Generative Adversarial Networks

Rafael Bouzaglo, Yosi Keller

Deep learning-based models have been shown to improve the accuracy of fingerprint recognition. While these algorithms show exceptional performance, they require large-scale fingerprint datasets for training and evaluation. In this work, we propose a novel fingerprint synthesis and reconstruction framework based on the StyleGan2 architecture, to address the privacy issues related to the acquisition of such large-scale datasets. We also derive a computational approach to modify the attributes of the generated fingerprint while preserving their identity. This allows synthesizing multiple different fingerprint images per finger. In particular, we introduce the SynFing synthetic fingerprints dataset consisting of 100K image pairs, each pair corresponding to the same identity. The proposed framework was experimentally shown to outperform contemporary state-of-the-art approaches for both fingerprint synthesis and reconstruction. It significantly improved the realism of the generated fingerprints, both visually and in terms of their ability to spoof fingerprint-based verification systems. The code and fingerprints dataset are publicly available: https://github.com/rafaelbou/fingerprint_generator.

CVMar 21, 2021Code
Learning Multi-Scene Absolute Pose Regression with Transformers

Yoli Shavit, Ron Ferens, Yosi Keller

Absolute camera pose regressors estimate the position and orientation of a camera from the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended for learning multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into candidate pose predictions. This mechanism allows our model to focus on general features that are informative for localization while embedding multiple scenes in parallel. We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors. We make our code publicly available from https://github.com/yolish/multi-scene-pose-transformer.

CVFeb 7, 2025
Relative Age Estimation Using Face Images

Ran Sandhaus, Yosi Keller

This work introduces a novel deep-learning approach for estimating age from a single facial image by refining an initial age estimate. The refinement leverages a reference face database of individuals with similar ages and appearances. We employ a network that estimates age differences between an input image and reference images with known ages, thus refining the initial estimate. Our method explicitly models age-dependent facial variations using differential regression, yielding improved accuracy compared to conventional absolute age estimation. Additionally, we introduce an age augmentation scheme that iteratively refines initial age estimates by modeling their error distribution during training. This iterative approach further enhances the initial estimates. Our approach surpasses existing methods, achieving state-of-the-art accuracy on the MORPH II and CACD datasets. Furthermore, we examine the biases inherent in contemporary state-of-the-art age estimation techniques.

CVMay 13, 2023
Image Segmentation via Probabilistic Graph Matching

Ayelet Heimowitz, Yosi Keller

This work presents an unsupervised and semi-automatic image segmentation approach where we formulate the segmentation as a inference problem based on unary and pairwise assignment probabilities computed using low-level image cues. The inference is solved via a probabilistic graph matching scheme, which allows rigorous incorporation of low level image cues and automatic tuning of parameters. The proposed scheme is experimentally shown to compare favorably with contemporary semi-supervised and unsupervised image segmentation schemes, when applied to contemporary state-of-the-art image sets.

CVMay 4, 2023
Age-Invariant Face Embedding using the Wasserstein Distance

Eran Dahan, Yosi Keller

In this work, we study face verification in datasets where images of the same individuals exhibit significant age differences. This poses a major challenge for current face recognition and verification techniques. To address this issue, we propose a novel approach that utilizes multitask learning and a Wasserstein distance discriminator to disentangle age and identity embeddings of facial images. Our approach employs multitask learning with a Wasserstein distance discriminator that minimizes the mutual information between the age and identity embeddings by minimizing the Jensen-Shannon divergence. This improves the encoding of age and identity information in face images and enhances the performance of face verification in age-variant datasets. We evaluate the effectiveness of our approach using multiple age-variant face datasets and demonstrate its superiority over state-of-the-art methods in terms of face verification accuracy.

CVFeb 25, 2022
FSGANv2: Improved Subject Agnostic Face Swapping and Reenactment

Yuval Nirkin, Yosi Keller, Tal Hassner

We present Face Swapping GAN (FSGAN) for face swapping and reenactment. Unlike previous work, we offer a subject agnostic swapping scheme that can be applied to pairs of faces without requiring training on those faces. We derive a novel iterative deep learning--based approach for face reenactment which adjusts significant pose and expression variations that can be applied to a single image or a video sequence. For video sequences, we introduce a continuous interpolation of the face views based on reenactment, Delaunay Triangulation, and barycentric coordinates. Occluded face regions are handled by a face completion network. Finally, we use a face blending network for seamless blending of the two faces while preserving the target skin color and lighting conditions. This network uses a novel Poisson blending loss combining Poisson optimization with a perceptual loss. We compare our approach to existing state-of-the-art systems and show our results to be both qualitatively and quantitatively superior. This work describes extensions of the FSGAN method, proposed in an earlier conference version of our work, as well as additional experiments and results.

CVMar 21, 2021
Paying Attention to Activation Maps in Camera Pose Regression

Yoli Shavit, Ron Ferens, Yosi Keller

Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose. As such, they offer a fast and light-weight alternative to traditional localization schemes based on image retrieval. Pose regression approaches simultaneously learn two regression tasks, aiming to jointly estimate the camera position and orientation using a single embedding vector computed by a convolutional backbone. We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs. Transformers are applied to encode the sequential activation maps as latent vectors, used for camera pose regression. This allows us to pay attention to spatially-varying deep features. Using two Transformer heads, we separately focus on the features for camera position and orientation, based on how informative they are per task. Our proposed approach is shown to compare favorably to contemporary pose regressors schemes and achieves state-of-the-art accuracy across multiple outdoor and indoor benchmarks. In particular, to the best of our knowledge, our approach is the only method to attain sub-meter average accuracy across outdoor scenes. We make our code publicly available from here.

CVMar 20, 2021
Attention-Based Multimodal Image Matching

Aviad Moreshet, Yosi Keller

We propose an attention-based approach for multimodal image patch matching using a Transformer encoder attending to the feature maps of a multiscale Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues. We also introduce an attention-residual architecture, using a residual connection bypassing the encoder. This additional learning signal facilitates end-to-end training from scratch. Our approach is experimentally shown to achieve new state-of-the-art accuracy on both multimodal and single modality benchmarks, illustrating its general applicability. To the best of our knowledge, this is the first successful implementation of the Transformer encoder architecture to the multimodal image patch matching task.

CVMar 17, 2021
Hierarchical Attention-based Age Estimation and Bias Estimation

Shakediel Hiba, Yosi Keller

In this work we propose a novel deep-learning approach for age estimation based on face images. We first introduce a dual image augmentation-aggregation approach based on attention. This allows the network to jointly utilize multiple face image augmentations whose embeddings are aggregated by a Transformer-Encoder. The resulting aggregated embedding is shown to better encode the face image attributes. We then propose a probabilistic hierarchical regression framework that combines a discrete probabilistic estimate of age labels, with a corresponding ensemble of regressors. Each regressor is particularly adapted and trained to refine the probabilistic estimate over a range of ages. Our scheme is shown to outperform contemporary schemes and provide a new state-of-the-art age estimation accuracy, when applied to the MORPH II dataset for age estimation. Last, we introduce a bias analysis of state-of-the-art age estimation results.

CVSep 12, 2020
A Unified Approach to Kinship Verification

Eran Dahan, Yosi Keller

In this work, we propose a deep learning-based approach for kin verification using a unified multi-task learning scheme where all kinship classes are jointly learned. This allows us to better utilize small training sets that are typical of kin verification. We introduce a novel approach for fusing the embeddings of kin images, to avoid overfitting, which is a common issue in training such networks. An adaptive sampling scheme is derived for the training set images to resolve the inherent imbalance in kin verification datasets. A thorough ablation study exemplifies the effectivity of our approach, which is experimentally shown to outperform contemporary state-of-the-art kin verification results when applied to the Families In the Wild, FG2018, and FG2020 datasets.

CVAug 27, 2020
DeepFake Detection Based on the Discrepancy Between the Face and its Context

Yuval Nirkin, Lior Wolf, Yosi Keller et al.

We propose a method for detecting face swapping and other identity manipulations in single images. Face swapping methods, such as DeepFake, manipulate the face region, aiming to adjust the face to the appearance of its context, while leaving the context unchanged. We show that this modus operandi produces discrepancies between the two regions. These discrepancies offer exploitable telltale signs of manipulation. Our approach involves two networks: (i) a face identification network that considers the face region bounded by a tight semantic segmentation, and (ii) a context recognition network that considers the face context (e.g., hair, ears, neck). We describe a method which uses the recognition signals from our two networks to detect such discrepancies, providing a complementary detection signal that improves conventional real vs. fake classifiers commonly used for detecting fake images. Our method achieves state of the art results on the FaceForensics++, Celeb-DF-v2, and DFDC benchmarks for face manipulation detection, and even generalizes to detect fakes produced by unseen methods.

CVAug 16, 2019
FSGAN: Subject Agnostic Face Swapping and Reenactment

Yuval Nirkin, Yosi Keller, Tal Hassner

We present Face Swapping GAN (FSGAN) for face swapping and reenactment. Unlike previous work, FSGAN is subject agnostic and can be applied to pairs of faces without requiring training on those faces. To this end, we describe a number of technical contributions. We derive a novel recurrent neural network (RNN)-based approach for face reenactment which adjusts for both pose and expression variations and can be applied to a single image or a video sequence. For video sequences, we introduce continuous interpolation of the face views based on reenactment, Delaunay Triangulation, and barycentric coordinates. Occluded face regions are handled by a face completion network. Finally, we use a face blending network for seamless blending of the two faces while preserving target skin color and lighting conditions. This network uses a novel Poisson blending loss which combines Poisson optimization with perceptual loss. We compare our approach to existing state-of-the-art systems and show our results to be both qualitatively and quantitatively superior.

CVNov 3, 2018
Auto-ML Deep Learning for Rashi Scripts OCR

Shahar Mahpod, Yosi Keller

In this work we propose an OCR scheme for manuscripts printed in Rashi font that is an ancient Hebrew font and corresponding dialect used in religious Jewish literature, for more than 600 years. The proposed scheme utilizes a convolution neural network (CNN) for visual inference and Long-Short Term Memory (LSTM) to learn the Rashi scripts dialect. In particular, we derive an AutoML scheme to optimize the CNN architecture, and a book-specific CNN training to improve the OCR accuracy. The proposed scheme achieved an accuracy of more than 99.8% using a dataset of more than 3M annotated letters from the Responsa Project dataset.

CVOct 30, 2018
Joint detection and matching of feature points in multimodal images

Elad Ben Baruch, Yosi Keller

In this work, we propose a novel Convolutional Neural Network (CNN) architecture for the joint detection and matching of feature points in images acquired by different sensors using a single forward pass. The resulting feature detector is tightly coupled with the feature descriptor, in contrast to classical approaches (SIFT, etc.), where the detection phase precedes and differs from computing the descriptor. Our approach utilizes two CNN subnetworks, the first being a Siamese CNN and the second, consisting of dual non-weight-sharing CNNs. This allows simultaneous processing and fusion of the joint and disjoint cues in the multimodal image patches. The proposed approach is experimentally shown to outperform contemporary state-of-the-art schemes when applied to multiple datasets of multimodal images. It is also shown to provide repeatable feature points detections across multisensor images, outperforming state-of-the-art detectors. To the best of our knowledge, it is the first unified approach for the detection and matching of such images.

CVSep 22, 2018
SelfKin: Self Adjusted Deep Model For Kinship Verification

Eran Dahan, Yosi Keller

One of the unsolved challenges in the field of biometrics and face recognition is Kinship Verification. This problem aims to understand if two people are family-related and how (sisters, brothers, etc.) Solving this problem can give rise to varied tasks and applications. In the area of homeland security (HLS) it is crucial to auto-detect if the person questioned is related to a wanted suspect, In the field of biometrics, kinship-verification can help to discriminate between families by photos and in the field of predicting or fashion it can help to predict an older or younger model of people faces. Lately, and with the advanced deep learning technology, this problem has gained focus from the research community in matters of data and research. In this article, we propose using a Deep Learning approach for solving the Kinship-Verification problem. Further, we offer a novel self-learning deep model, which learns the essential features from different faces. We show that our model wins the Recognize Families In the Wild(RFIW2018,FG2018) challenge and obtains state-of-the-art results. Moreover, we show that our proposed model can reduce the size of the network by half without loss in performance.

CVMay 3, 2018
Facial Landmarks Localization using Cascaded Neural Networks

Shahar Mahpod, Rig Das, Emanuele Maiorana et al.

The accurate localization of facial landmarks is at the core of face analysis tasks, such as face recognition and facial expression analysis, to name a few. In this work, we propose a novel localization approach based on a deep learning architecture that utilizes cascaded subnetworks with convolutional neural network units. The cascaded units of the first subnetwork estimate heatmap-based encodings of the landmarks locations, while the cascaded units of the second subnetwork receive as input the output of the corresponding heatmap estimation units, and refine them through regression. The proposed scheme is experimentally shown to compare favorably with contemporary state-of-the-art schemes, especially when applied to images depicting challenging localization conditions.

CVMar 26, 2018
Multi-scale Processing of Noisy Images using Edge Preservation Losses

Nati Ofir, Yosi Keller

Noisy images processing is a fundamental task of computer vision. The first example is the detection of faint edges in noisy images, a challenging problem studied in the last decades. A recent study introduced a fast method to detect faint edges in the highest accuracy among all the existing approaches. Their complexity is nearly linear in the image's pixels and their runtime is seconds for a noisy image. Their approach utilizes a multi-scale binary partitioning of the image. By utilizing the multi-scale U-net architecture, we show in this paper that their method can be dramatically improved in both aspects of run time and accuracy. By training the network on a dataset of binary images, we developed an approach for faint edge detection that works in a linear complexity. Our runtime of a noisy image is milliseconds on a GPU. Even though our method is orders of magnitude faster, we still achieve higher accuracy of detection under many challenging scenarios. In addition, we show that our approach to performing multi-scale preprocessing of noisy images using U-net improves the ability to perform other vision tasks under the presence of noise. We prove it on the problems of noisy objects classification and classical image denoising. We show that multi-scale denoising can be carried out by a novel edge preservation loss. As our experiments show, we achieve high-quality results in the three aspects of faint edge detection, noisy image classification and natural image denoising.

CVJan 16, 2018
Deep Multi-Spectral Registration Using Invariant Descriptor Learning

Nati Ofir, Shai Silberstein, Hila Levi et al.

In this paper, we introduce a novel deep-learning method to align cross-spectral images. Our approach relies on a learned descriptor which is invariant to different spectra. Multi-modal images of the same scene capture different signals and therefore their registration is challenging and it is not solved by classic approaches. To that end, we developed a feature-based approach that solves the visible (VIS) to Near-Infra-Red (NIR) registration problem. Our algorithm detects corners by Harris and matches them by a patch-metric learned on top of CIFAR-10 network descriptor. As our experiments demonstrate we achieve a high-quality alignment of cross-spectral images with a sub-pixel accuracy. Comparing to other existing methods, our approach is more accurate in the task of VIS to NIR registration.

CVNov 5, 2017
Registration and Fusion of Multi-Spectral Images Using a Novel Edge Descriptor

Nati Ofir, Shai Silberstein, Dani Rozenbaum et al.

In this paper we introduce a fully end-to-end approach for multi-spectral image registration and fusion. Our method for fusion combines images from different spectral channels into a single fused image by different approaches for low and high frequency signals. A prerequisite of fusion is a stage of geometric alignment between the spectral bands, commonly referred to as registration. Unfortunately, common methods for image registration of a single spectral channel do not yield reasonable results on images from different modalities. For that end, we introduce a new algorithm for multi-spectral image registration, based on a novel edge descriptor of feature points. Our method achieves an accurate alignment of a level that allows us to further fuse the images. As our experiments show, we produce a high quality of multi-spectral image registration and fusion under many challenging scenarios.

CVNov 20, 2014
An algorithm for improving Non-Local Means operators via low-rank approximation

Victor May, Yosi Keller, Nir Sharon et al.

We present a method for improving a Non Local Means operator by computing its low-rank approximation. The low-rank operator is constructed by applying a filter to the spectrum of the original Non Local Means operator. This results in an operator which is less sensitive to noise while preserving important properties of the original operator. The method is efficiently implemented based on Chebyshev polynomials and is demonstrated on the application of natural images denoising. For this application, we provide a comprehensive comparison of our method with leading denoising methods.