Raja Giryes

CV
h-index74
150papers
37,803citations
Novelty51%
AI Score62

150 Papers

CVMar 1, 2022
Generative Adversarial Networks

Gilad Cohen, Raja Giryes · deepmind

Generative Adversarial Networks (GANs) are very popular frameworks for generating high-quality data, and are immensely used in both the academia and industry in many domains. Arguably, their most substantial impact has been in the area of computer vision, where they achieve state-of-the-art image generation. This chapter gives an introduction to GANs, by discussing their principle mechanism and presenting some of their inherent problems during training and evaluation. We focus on these three issues: (1) mode collapse, (2) vanishing gradients, and (3) generation of low-quality images. We then list some architecture-variant and loss-variant GANs that remedy the above challenges. Lastly, we present two utilization examples of GANs for real-world applications: Data augmentation and face images generation.

CVNov 14, 2022Code
Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures

Gal Metzer, Elad Richardson, Or Patashnik et al.

Text-guided image generation has progressed rapidly in recent years, inspiring major breakthroughs in text-guided shape generation. Recently, it has been shown that using score distillation, one can successfully text-guide a NeRF model to generate a 3D object. We adapt the score distillation to the publicly available, and computationally efficient, Latent Diffusion Models, which apply the entire diffusion process in a compact latent space of a pretrained autoencoder. As NeRFs operate in image space, a naive solution for guiding them with latent score distillation would require encoding to the latent space at each guidance step. Instead, we propose to bring the NeRF to the latent space, resulting in a Latent-NeRF. Analyzing our Latent-NeRF, we show that while Text-to-3D models can generate impressive results, they are inherently unconstrained and may lack the ability to guide or enforce a specific 3D structure. To assist and direct the 3D generation, we propose to guide our Latent-NeRF using a Sketch-Shape: an abstract geometry that defines the coarse structure of the desired object. Then, we present means to integrate such a constraint directly into a Latent-NeRF. This unique combination of text and shape guidance allows for increased control over the generation process. We also show that latent score distillation can be successfully applied directly on 3D meshes. This allows for generating high-quality textures on a given geometry. Our experiments validate the power of our different forms of guidance and the efficiency of using latent rendering. Implementation is available at https://github.com/eladrich/latent-nerf

CVJun 10, 2023
AutoSAM: Adapting SAM to Medical Images by Overloading the Prompt Encoder

Tal Shaharabany, Aviad Dahan, Raja Giryes et al. · meta-ai

The recently introduced Segment Anything Model (SAM) combines a clever architecture and large quantities of training data to obtain remarkable image segmentation capabilities. However, it fails to reproduce such results for Out-Of-Distribution (OOD) domains such as medical images. Moreover, while SAM is conditioned on either a mask or a set of points, it may be desirable to have a fully automatic solution. In this work, we replace SAM's conditioning with an encoder that operates on the same input image. By adding this encoder and without further fine-tuning SAM, we obtain state-of-the-art results on multiple medical images and video benchmarks. This new encoder is trained via gradients provided by a frozen SAM. For inspecting the knowledge within it, and providing a lightweight segmentation solution, we also learn to decode it into a mask by a shallow deconvolution network.

CVNov 25, 2022Code
MAEDAY: MAE for few and zero shot AnomalY-Detection

Eli Schwartz, Assaf Arbelle, Leonid Karlinsky et al.

We propose using Masked Auto-Encoder (MAE), a transformer model self-supervisedly trained on image inpainting, for anomaly detection (AD). Assuming anomalous regions are harder to reconstruct compared with normal regions. MAEDAY is the first image-reconstruction-based anomaly detection method that utilizes a pre-trained model, enabling its use for Few-Shot Anomaly Detection (FSAD). We also show the same method works surprisingly well for the novel tasks of Zero-Shot AD (ZSAD) and Zero-Shot Foreign Object Detection (ZSFOD), where no normal samples are available. Code is available at https://github.com/EliSchwartz/MAEDAY .

LGMay 26, 2022Code
Membership Inference Attack Using Self Influence Functions

Gilad Cohen, Raja Giryes

Member inference (MI) attacks aim to determine if a specific data sample was used to train a machine learning model. Thus, MI is a major privacy threat to models trained on private sensitive data, such as medical records. In MI attacks one may consider the black-box settings, where the model's parameters and activations are hidden from the adversary, or the white-box case where they are available to the attacker. In this work, we focus on the latter and present a novel MI attack for it that employs influence functions, or more specifically the samples' self-influence scores, to perform the MI prediction. We evaluate our attack on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, using versatile architectures such as AlexNet, ResNet, and DenseNet. Our attack method achieves new state-of-the-art results for both training with and without data augmentations. Code is available at https://github.com/giladcohen/sif_mi_attack.

CVJul 9, 2024Code
ProtoSAM: One-Shot Medical Image Segmentation With Foundational Models

Lev Ayzenberg, Raja Giryes, Hayit Greenspan

This work introduces a new framework, ProtoSAM, for one-shot medical image segmentation. It combines the use of prototypical networks, known for few-shot segmentation, with SAM - a natural image foundation model. The method proposed creates an initial coarse segmentation mask using the ALPnet prototypical network, augmented with a DINOv2 encoder. Following the extraction of an initial mask, prompts are extracted, such as points and bounding boxes, which are then input into the Segment Anything Model (SAM). State-of-the-art results are shown on several medical image datasets and demonstrate automated segmentation capabilities using a single image example (one shot) with no need for fine-tuning of the foundation model. Our code is available at: https://github.com/levayz/ProtoSAM

CVFeb 3, 2023
TEXTure: Text-Guided Texturing of 3D Shapes

Elad Richardson, Gal Metzer, Yuval Alaluf et al.

In this paper, we present TEXTure, a novel method for text-guided generation, editing, and transfer of textures for 3D shapes. Leveraging a pretrained depth-to-image diffusion model, TEXTure applies an iterative scheme that paints a 3D model from different viewpoints. Yet, while depth-to-image models can create plausible textures from a single viewpoint, the stochastic nature of the generation process can cause many inconsistencies when texturing an entire 3D object. To tackle these problems, we dynamically define a trimap partitioning of the rendered image into three progression states, and present a novel elaborated diffusion sampling process that uses this trimap representation to generate seamless textures from different views. We then show that one can transfer the generated texture maps to new 3D geometries without requiring explicit surface-to-surface mapping, as well as extract semantic textures from a set of images without requiring any explicit reconstruction. Finally, we show that TEXTure can be used to not only generate new textures but also edit and refine existing textures using either a text prompt or user-provided scribbles. We demonstrate that our TEXTuring method excels at generating, transferring, and editing textures through extensive evaluation, and further close the gap between 2D image generation and 3D texturing.

CVMar 23, 2023
Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes

Dana Cohen-Bar, Elad Richardson, Gal Metzer et al.

Recent breakthroughs in text-guided image generation have led to remarkable progress in the field of 3D synthesis from text. By optimizing neural radiance fields (NeRF) directly from text, recent methods are able to produce remarkable results. Yet, these methods are limited in their control of each object's placement or appearance, as they represent the scene as a whole. This can be a major issue in scenarios that require refining or manipulating objects in the scene. To remedy this deficit, we propose a novel GlobalLocal training framework for synthesizing a 3D scene using object proxies. A proxy represents the object's placement in the generated scene and optionally defines its coarse geometry. The key to our approach is to represent each object as an independent NeRF. We alternate between optimizing each NeRF on its own and as part of the full scene. Thus, a complete representation of each object can be learned, while also creating a harmonious scene with style and lighting match. We show that using proxies allows a wide variety of editing options, such as adjusting the placement of each independent object, removing objects from a scene, or refining an object. Our results show that Set-the-Scene offers a powerful solution for scene synthesis and manipulation, filling a crucial gap in controllable text-to-3D synthesis.

CVJun 16, 2023Code
Group Orthogonalization Regularization For Vision Models Adaptation and Robustness

Yoav Kurtz, Noga Bar, Raja Giryes

As neural networks become deeper, the redundancy within their parameters increases. This phenomenon has led to several methods that attempt to reduce the correlation between convolutional filters. We propose a computationally efficient regularization technique that encourages orthonormality between groups of filters within the same layer. Our experiments show that when incorporated into recent adaptation methods for diffusion models and vision transformers (ViTs), this regularization improves performance on downstream tasks. We further show improved robustness when group orthogonality is enforced during adversarial training. Our code is available at https://github.com/YoavKurtz/GOR.

CVNov 21, 2022
Teaching Structured Vision&Language Concepts to Vision&Language Models

Sivan Doveh, Assaf Arbelle, Sivan Harary et al.

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities both when training from scratch or fine-tuning a pre-trained model.

CVNov 14, 2023Code
On The Relationship Between Universal Adversarial Attacks And Sparse Representations

Dana Weitzner, Raja Giryes

The prominent success of neural networks, mainly in computer vision tasks, is increasingly shadowed by their sensitivity to small, barely perceivable adversarial perturbations in image input. In this work, we aim at explaining this vulnerability through the framework of sparsity. We show the connection between adversarial attacks and sparse representations, with a focus on explaining the universality and transferability of adversarial examples in neural networks. To this end, we show that sparse coding algorithms, and the neural network-based learned iterative shrinkage thresholding algorithm (LISTA) among them, suffer from this sensitivity, and that common attacks on neural networks can be expressed as attacks on the sparse representation of the input image. The phenomenon that we observe holds true also when the network is agnostic to the sparse representation and dictionary, and thus can provide a possible explanation for the universality and transferability of adversarial attacks. The code is available at https://github.com/danawr/adversarial_attacks_and_sparse_representations.

GRJun 9, 2023
SENS: Part-Aware Sketch-based Implicit Neural Shape Modeling

Alexandre Binninger, Amir Hertz, Olga Sorkine-Hornung et al.

We present SENS, a novel method for generating and editing 3D models from hand-drawn sketches, including those of abstract nature. Our method allows users to quickly and easily sketch a shape, and then maps the sketch into the latent space of a part-aware neural implicit shape architecture. SENS analyzes the sketch and encodes its parts into ViT patch encoding, subsequently feeding them into a transformer decoder that converts them to shape embeddings suitable for editing 3D neural implicit shapes. SENS provides intuitive sketch-based generation and editing, and also succeeds in capturing the intent of the user's sketch to generate a variety of novel and expressive 3D shapes, even from abstract and imprecise sketches. Additionally, SENS supports refinement via part reconstruction, allowing for nuanced adjustments and artifact removal. It also offers part-based modeling capabilities, enabling the combination of features from multiple sketches to create more complex and customized 3D shapes. We demonstrate the effectiveness of our model compared to the state-of-the-art using objective metric evaluation criteria and a user study, both indicating strong performance on sketches with a medium level of abstraction. Furthermore, we showcase our method's intuitive sketch-based shape editing capabilities, and validate it through a usability study.

CVAug 30, 2022
A Diffusion Model Predicts 3D Shapes from 2D Microscopy Images

Dominik J. E. Waibel, Ernst Röell, Bastian Rieck et al.

Diffusion models are a special type of generative model, capable of synthesising new data from a learnt distribution. We introduce DISPR, a diffusion-based model for solving the inverse problem of three-dimensional (3D) cell shape prediction from two-dimensional (2D) single cell microscopy images. Using the 2D microscopy image as a prior, DISPR is conditioned to predict realistic 3D shape reconstructions. To showcase the applicability of DISPR as a data augmentation tool in a feature-based single cell classification task, we extract morphological features from the red blood cells grouped into six highly imbalanced classes. Adding features from the DISPR predictions to the three minority classes improved the macro F1 score from $F1_\text{macro} = 55.2 \pm 4.6\%$ to $F1_\text{macro} = 72.2 \pm 4.9\%$. We thus demonstrate that diffusion models can be successfully applied to inverse biomedical problems, and that they learn to reconstruct 3D shapes with realistic morphological features from 2D microscopy images.

CVDec 6, 2022
ADIR: Adaptive Diffusion for Image Reconstruction

Shady Abu-Hussein, Tom Tirer, Raja Giryes

Denoising diffusion models have recently achieved remarkable success in image generation, capturing rich information about natural image statistics. This makes them highly promising for image reconstruction, where the goal is to recover a clean image from a degraded observation. In this work, we introduce a conditional sampling framework that leverages the powerful priors learned by diffusion models while enforcing consistency with the available measurements. To further adapt pre-trained diffusion models to the specific degradation at hand, we propose a novel fine-tuning strategy. In particular, we employ LoRA-based adaptation using images that are semantically and visually similar to the degraded input, efficiently retrieved from a large and diverse dataset via an off-the-shelf vision-language model. We evaluate our approach on two leading publicly available diffusion models--Stable Diffusion and Guided Diffusion--and demonstrate that our method, termed Adaptive Diffusion for Image Reconstruction (ADIR), yields substantial improvements across a range of image reconstruction tasks.

CVApr 25, 2022
ProCST: Boosting Semantic Segmentation Using Progressive Cyclic Style-Transfer

Shahaf Ettedgui, Shady Abu-Hussein, Raja Giryes

Using synthetic data for training neural networks that achieve good performance on real-world data is an important task as it can reduce the need for costly data annotation. Yet, synthetic and real world data have a domain gap. Reducing this gap, also known as domain adaptation, has been widely studied in recent years. Closing the domain gap between the source (synthetic) and target (real) data by directly performing the adaptation between the two is challenging. In this work, we propose a novel two-stage framework for improving domain adaptation techniques on image data. In the first stage, we progressively train a multi-scale neural network to perform image translation from the source domain to the target domain. We denote the new transformed data as "Source in Target" (SiT). Then, we insert the generated SiT data as the input to any standard UDA approach. This new data has a reduced domain gap from the desired target domain, which facilitates the applied UDA approach to close the gap further. We emphasize the effectiveness of our method via a comparison to other leading UDA and image-to-image translation techniques when used as SiT generators. Moreover, we demonstrate the improvement of our framework with three state-of-the-art UDA methods for semantic segmentation, HRDA, DAFormer and ProDA, on two UDA tasks, GTA5 to Cityscapes and Synthia to Cityscapes.

CVNov 25, 2022
PIP: Positional-encoding Image Prior

Nimrod Shabtay, Eli Schwartz, Raja Giryes

In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g. noisy) image but in the process learns to reconstruct the clean image. This phenomenon is attributed to CNN's internal image-prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random or learned latent with Fourier-Features (Positional Encoding). We show that thanks to the Fourier features properties, we can replace the convolution layers with simple pixel-level MLPs. We name this scheme ``Positional Encoding Image Prior" (PIP) and exhibit that it performs very similarly to DIP on various image-reconstruction tasks with much less parameters required. Additionally, we demonstrate that PIP can be easily extended to videos, where 3D-DIP struggles and suffers from instability. Code and additional examples for all tasks, including videos, are available on the project page https://nimrodshabtay.github.io/PIP/

CVMar 26, 2023
VisDA 2022 Challenge: Domain Adaptation for Industrial Waste Sorting

Dina Bashkirova, Samarth Mishra, Diala Lteif et al.

Label-efficient and reliable semantic segmentation is essential for many real-life applications, especially for industrial settings with high visual diversity, such as waste sorting. In industrial waste sorting, one of the biggest challenges is the extreme diversity of the input stream depending on factors like the location of the sorting facility, the equipment available in the facility, and the time of year, all of which significantly impact the composition and visual appearance of the waste stream. These changes in the data are called ``visual domains'', and label-efficient adaptation of models to such domains is needed for successful semantic segmentation of industrial waste. To test the abilities of computer vision models on this task, we present the VisDA 2022 Challenge on Domain Adaptation for Industrial Waste Sorting. Our challenge incorporates a fully-annotated waste sorting dataset, ZeroWaste, collected from two real material recovery facilities in different locations and seasons, as well as a novel procedurally generated synthetic waste sorting dataset, SynthWaste. In this competition, we aim to answer two questions: 1) can we leverage domain adaptation techniques to minimize the domain gap? and 2) can synthetic data augmentation improve performance on this task and help adapt to changing data distributions? The results of the competition show that industrial waste detection poses a real domain adaptation problem, that domain generalization techniques such as augmentations, ensembling, etc., improve the overall performance on the unlabeled target domain examples, and that leveraging synthetic data effectively remains an open problem. See https://ai.bu.edu/visda-2022/

HEP-EXJun 11, 2023
Trees versus Neural Networks for enhancing tau lepton real-time selection in proton-proton collisions

Maayan Yaary, Uriel Barron, Luis Pascual Domínguez et al.

This paper introduces supervised learning techniques for real-time selection (triggering) of hadronically decaying tau leptons in proton-proton colliders. By implementing classic machine learning decision trees and advanced deep learning models, such as Multi-Layer Perceptron or residual neural networks, visible improvements in performance compared to standard threshold tau triggers are observed. We show how such an implementation may lower selection energy thresholds, thus contributing to increasing the sensitivity of searches for new phenomena in proton-proton collisions classified by low-energy tau leptons. Moreover, we analyze when it is better to use neural networks versus decision trees for tau triggers with conclusions relevant to other problems in physics.

LGOct 25, 2022
Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets

Edo Cohen-Karlik, Itamar Menuhin-Gruman, Raja Giryes et al.

Overparameterization in deep learning typically refers to settings where a trained neural network (NN) has representational capacity to fit the training data in many ways, some of which generalize well, while others do not. In the case of Recurrent Neural Networks (RNNs), there exists an additional layer of overparameterization, in the sense that a model may exhibit many solutions that generalize well for sequence lengths seen in training, some of which extrapolate to longer sequences, while others do not. Numerous works have studied the tendency of Gradient Descent (GD) to fit overparameterized NNs with solutions that generalize well. On the other hand, its tendency to fit overparameterized RNNs with solutions that extrapolate has been discovered only recently and is far less understood. In this paper, we analyze the extrapolation properties of GD when applied to overparameterized linear RNNs. In contrast to recent arguments suggesting an implicit bias towards short-term memory, we provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory. Our result relies on a dynamical characterization which shows that GD (with small step size and near-zero initialization) strives to maintain a certain form of balancedness, as well as on tools developed in the context of the moment problem from statistics (recovery of a probability distribution from its moments). Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.

CVApr 16, 2022
Stress-Testing Point Cloud Registration on Automotive LiDAR

Amnon Drory, Shai Avidan, Raja Giryes

Rigid Point Cloud Registration (PCR) algorithms aim to estimate the 6-DOF relative motion between two point clouds, which is important in various fields, including autonomous driving. Recent years have seen a significant improvement in global PCR algorithms, i.e. algorithms that can handle a large relative motion. This has been demonstrated in various scenarios, including indoor scenes, but has only been minimally tested in the Automotive setting, where point clouds are produced by vehicle-mounted LiDAR sensors. In this work, we aim to answer questions that are important for automotive applications, including: which of the new algorithms is the most accurate, and which is fastest? How transferable are deep-learning approaches, e.g. what happens when you train a network with data from Boston, and run it in a vehicle in Singapore? How small can the overlap between point clouds be before the algorithms start to deteriorate? To what extent are the algorithms rotation invariant? Our results are at times surprising. When comparing robust parameter estimation methods for registration, we find that the fastest and most accurate is not one of the newest approaches. Instead, it is a modern variant of the well known RANSAC technique. We also suggest a new outlier filtering method, Grid-Prioritized Filtering (GPF), to further improve it. An additional contribution of this work is an algorithm for selecting challenging sets of frame-pairs from automotive LiDAR datasets. This enables meaningful benchmarking in the Automotive LiDAR setting, and can also improve training for learning algorithms.

LGJun 11, 2023
An information-Theoretic Approach to Semi-supervised Transfer Learning

Daniel Jakubovitz, David Uliel, Miguel Rodrigues et al.

Transfer learning is a valuable tool in deep learning as it allows propagating information from one "source dataset" to another "target dataset", especially in the case of a small number of training examples in the latter. Yet, discrepancies between the underlying distributions of the source and target data are commonplace and are known to have a substantial impact on algorithm performance. In this work we suggest novel information-theoretic approaches for the analysis of the performance of deep neural networks in the context of transfer learning. We focus on the task of semi-supervised transfer learning, in which unlabeled samples from the target dataset are available during network training on the source dataset. Our theory suggests that one may improve the transferability of a deep neural network by incorporating regularization terms on the target data based on information-theoretic quantities, namely the Mutual Information and the Lautum Information. We demonstrate the effectiveness of the proposed approaches in various semi-supervised transfer learning experiments.

CVMar 23Code
Anatomical Token Uncertainty for Transformer-Guided Active MRI Acquisition

Lev Ayzenberg, Shady Abu-Hussein, Raja Giryes et al.

Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the $k$-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the $k$-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI singlecoil Knee and Brain datasets at $\times 8$ and $\times 16$ acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics, and feature-based distances. Our code is available at https://github.com/levayz/TRUST-MRI.

LGJul 12, 2022
Utilizing Excess Resources in Training Neural Networks

Amit Henig, Raja Giryes

In this work, we suggest Kernel Filtering Linear Overparameterization (KFLO), where a linear cascade of filtering layers is used during training to improve network performance in test time. We implement this cascade in a kernel filtering fashion, which prevents the trained architecture from becoming unnecessarily deeper. This also allows using our approach with almost any network architecture and let combining the filtering layers into a single layer in test time. Thus, our approach does not add computational complexity during inference. We demonstrate the advantage of KFLO on various network models and datasets in supervised learning.

LGMar 12
Entropy-Preserving Reinforcement Learning

Aleksei Petrenko, Ben Lipkin, Kevin Chen et al.

Policy gradient algorithms have driven many recent advancements in language model reasoning. An appealing property is their ability to learn from exploration on their own trajectories, a process crucial for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce the entropy -- and thus the diversity of explored trajectories -- as part of training, yielding a policy increasingly limited in its ability to explore. In this paper, we argue that entropy should be actively monitored and controlled throughout training. We formally analyze the contributions of leading policy gradient objectives on entropy dynamics, identify empirical factors (such as numerical precision) that significantly impact entropy behavior, and propose explicit mechanisms for entropy control. These include REPO, a family of algorithms that modify the advantage function to regulate entropy, and ADAPO, an adaptive asymmetric clipping approach. Models trained with our entropy-preserving methods maintain diversity throughout training, yielding final policies that are more performant and retain their trainability for sequential learning in new environments.

CVApr 24
Video Analysis and Generation via a Semantic Progress Function

Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri et al.

Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.

CVDec 6, 2023Code
Mitigating Open-Vocabulary Caption Hallucinations

Assaf Ben-Kish, Moran Yanuka, Morris Alper et al. · apple-ml

While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. Code and models can be found at: https://github.com/assafbk/mocha_code

SDMar 10
ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Aviad Dahan, Moran Yanuka, Noa Kraicer et al. · apple-ml

Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.

CVMar 14
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Nimrod Shabtay, Moshe Kimhi, Artem Spector et al.

Vision-language models (VLMs) typically process images at a native high-resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs advocate for efficiency, they potentially miss critical visual information, like small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs.\ high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes

CVOct 14, 2024Code
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh et al.

The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models we propose LiveXiv: A scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.

CVDec 28, 2023Code
3VL: Using Trees to Improve Vision-Language Models' Interpretability

Nir Yellinek, Leonid Karlinsky, Raja Giryes

Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure. Our code is available at: https://github.com/niryellinek/3VL.

IVMar 17
A Lensless Polarization Camera

Noa Kraicer, Shay Elmalem, Erez Yosef et al.

Polarization imaging is a technique that creates a pixel map of the polarization state in a scene. Although invisible to the human eye, polarization can assist various sensing and computer vision tasks. Existing polarization cameras use spatial or temporal multiplexing, which increases the camera volume, weight, cost, or all of the above. Recent lensless imaging approaches, such as DiffuserCam, have demonstrated that compact imaging systems can be realized by replacing the lens with a coding element and performing computational reconstruction. In this work, we propose a compact lensless polarization camera composed of a diffuser and a simple striped polarization mask. By combining this optical design with a reconstruction algorithm that explicitly models the polarization-encoded lensless measurements, four linear polarization images are recovered from a single snapshot. Our results demonstrate the potential of lensless approaches for polarization imaging and reveal the physical factors that govern reconstruction quality, guiding the development of high-quality practical systems.

CLDec 18, 2024Code
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models

Ido Cohen, Daniela Gottesman, Mor Geva et al. · deepmind

Vision-language models (VLMs) excel at extracting and reasoning about information from images. Yet, their capacity to leverage internal knowledge about specific entities remains underexplored. This work investigates the disparity in model performance when answering factual questions about an entity described in text versus depicted in an image. Our results reveal a significant accuracy drop - reaching 18% for some models - when the entity is presented visually instead of textually. To study this gap we present PopVQA, a dataset which allows separating entity recognition and question answering, and use it to benchmark several models. We hypothesize that this decline arises from limitations in how information flows from image tokens to query tokens. Thus, we use mechanistic interpretability tools to reveal that, although image tokens are preprocessed by the vision encoder, meaningful information flow from these tokens occurs only in the much deeper layers. Furthermore, critical image processing happens in the language model's middle layers, allowing few layers for consecutive reasoning, highlighting a potential inefficiency in how the model utilizes its layers for reasoning. These insights shed light on the internal mechanics of VLMs and offer pathways for enhancing their reasoning capabilities. PopVQA can be found at https://huggingface.co/datasets/idoco/PopVQA.

CVNov 13, 2024Code
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

Moran Yanuka, Assaf Ben Kish, Yonatan Bitton et al. · apple-ml

Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations. We will release our code and models.

CVAug 14, 2024
DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model

Erez Yosef, Raja Giryes

The flat lensless camera design reduces the camera size and weight significantly. In this design, the camera lens is replaced by another optical element that interferes with the incoming light. The image is recovered from the raw sensor measurements using a reconstruction algorithm. Yet, the quality of the reconstructed images is not satisfactory. To mitigate this, we propose utilizing a pre-trained diffusion model with a control network and a learned separable transformation for reconstruction. This allows us to build a prototype flat camera with high-quality imaging, presenting state-of-the-art results in both terms of quality and perceptuality. We demonstrate its ability to leverage also textual descriptions of the captured scene to further enhance reconstruction. Our reconstruction method which leverages the strong capabilities of a pre-trained diffusion model can be used in other imaging systems for improved reconstruction results.

OPTICSJun 26, 2025Code
Inverse Design of Diffractive Metasurfaces Using Diffusion Models

Liav Hen, Erez Yosef, Dan Raviv et al.

Metasurfaces are ultra-thin optical elements composed of engineered sub-wavelength structures that enable precise control of light. Their inverse design - determining a geometry that yields a desired optical response - is challenging due to the complex, nonlinear relationship between structure and optical properties. This often requires expert tuning, is prone to local minima, and involves significant computational overhead. In this work, we address these challenges by integrating the generative capabilities of diffusion models into computational design workflows. Using an RCWA simulator, we generate training data consisting of metasurface geometries and their corresponding far-field scattering patterns. We then train a conditional diffusion model to predict meta-atom geometry and height from a target spatial power distribution at a specified wavelength, sampled from a continuous supported band. Once trained, the model can generate metasurfaces with low error, either directly using RCWA-guided posterior sampling or by serving as an initializer for traditional optimization methods. We demonstrate our approach on the design of a spatially uniform intensity splitter and a polarization beam splitter, both produced with low error in under 30 minutes. To support further research in data-driven metasurface design, we publicly release our code and datasets.

CVMay 10
Do multimodal models imagine electric sheep?

Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes et al.

Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

CVNov 25, 2025Code
Smooth regularization for efficient video recognition

Gil Goldman, Raja Giryes, Mahadev Satyanarayanan

We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/cmusatyalab/grw-smoothing.

IVJun 23, 2024Code
X-ray2CTPA: Leveraging Diffusion Models to Enhance Pulmonary Embolism Classification

Noa Cahan, Eyal Klang, Galit Aviram et al.

Chest X-rays or chest radiography (CXR), commonly used for medical diagnostics, typically enables limited imaging compared to computed tomography (CT) scans, which offer more detailed and accurate three-dimensional data, particularly contrast-enhanced scans like CT Pulmonary Angiography (CTPA). However, CT scans entail higher costs, greater radiation exposure, and are less accessible than CXRs. In this work we explore cross-modal translation from a 2D low contrast-resolution X-ray input to a 3D high contrast and spatial-resolution CTPA scan. Driven by recent advances in generative AI, we introduce a novel diffusion-based approach to this task. We evaluate the models performance using both quantitative metrics and qualitative feedback from radiologists, ensuring diagnostic relevance of the generated images. Furthermore, we employ the synthesized 3D images in a classification framework and show improved AUC in a PE categorization task, using the initial CXR input. The proposed method is generalizable and capable of performing additional cross-modality translations in medical imaging. It may pave the way for more accessible and cost-effective advanced diagnostic tools. The code for this project is available: https://github.com/NoaCahan/X-ray2CTPA .

CVMay 25, 2023Code
UDPM: Upsampling Diffusion Probabilistic Models

Shady Abu-Hussein, Raja Giryes

Denoising Diffusion Probabilistic Models (DDPM) have recently gained significant attention. DDPMs compose a Markovian process that begins in the data domain and gradually adds noise until reaching pure white noise. DDPMs generate high-quality samples from complex data distributions by defining an inverse process and training a deep neural network to learn this mapping. However, these models are inefficient because they require many diffusion steps to produce aesthetically pleasing samples. Additionally, unlike generative adversarial networks (GANs), the latent space of diffusion models is less interpretable. In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM). In the forward process, we reduce the latent variable dimension through downsampling, followed by the traditional noise perturbation. As a result, the reverse process gradually denoises and upsamples the latent variable to produce a sample from the data distribution. We formalize the Markovian diffusion processes of UDPM and demonstrate its generation capabilities on the popular FFHQ, AFHQv2, and CIFAR10 datasets. UDPM generates images with as few as three network evaluations, whose overall computational cost is less than a single DDPM or EDM step, while achieving an FID score of 6.86. This surpasses current state-of-the-art efficient diffusion models that use a single denoising step for sampling. Additionally, UDPM offers an interpretable and interpolable latent space, which gives it an advantage over traditional DDPMs. Our code is available online: \url{https://github.com/shadyabh/UDPM/}

CVSep 16, 2021Code
Simple Post-Training Robustness Using Test Time Augmentations and Random Forest

Gilad Cohen, Raja Giryes

Although Deep Neural Networks (DNNs) achieve excellent performance on many real-world tasks, they are highly vulnerable to adversarial attacks. A leading defense against such attacks is adversarial training, a technique in which a DNN is trained to be robust to adversarial attacks by introducing adversarial noise to its input. This procedure is effective but must be done during the training phase. In this work, we propose Augmented Random Forest (ARF), a simple and easy-to-use strategy for robustifying an existing pretrained DNN without modifying its weights. For every image, we generate randomized test time augmentations by applying diverse color, blur, noise, and geometric transforms. Then we use the DNN's logits output to train a simple random forest to predict the real class label. Our method achieves state-of-the-art adversarial robustness on a diversity of white and black box attacks with minimal compromise on the natural images' classification. We test ARF also against numerous adaptive white-box attacks and it shows excellent results when combined with adversarial training. Code is available at https://github.com/giladcohen/ARF.

LGSep 15, 2019Code
Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors

Gilad Cohen, Guillermo Sapiro, Raja Giryes

Deep neural networks (DNNs) are notorious for their vulnerability to adversarial attacks, which are small perturbations added to their input images to mislead their prediction. Detection of adversarial examples is, therefore, a fundamental requirement for robust classification frameworks. In this work, we present a method for detecting such adversarial attacks, which is suitable for any pre-trained neural network classifier. We use influence functions to measure the impact of every training sample on the validation set data. From the influence scores, we find the most supportive training samples for any given validation example. A k-nearest neighbor (k-NN) model fitted on the DNN's activation layers is employed to search for the ranking of these supporting training samples. We observe that these samples are highly correlated with the nearest neighbors of the normal inputs, while this correlation is much weaker for adversarial inputs. We train an adversarial detector using the k-NN ranks and distances and show that it successfully distinguishes adversarial examples, getting state-of-the-art results on six attack methods with three datasets. Code is available at https://github.com/giladcohen/NNIF_adv_defense.

CVSep 29, 2018Code
NICE: Noise Injection and Clamping Estimation for Neural Network Quantization

Chaim Baskin, Natan Liss, Yoav Chai et al.

Convolutional Neural Networks (CNN) are very popular in many fields including computer vision, speech recognition, natural language processing, to name a few. Though deep learning leads to groundbreaking performance in these domains, the networks used are very demanding computationally and are far from real-time even on a GPU, which is not power efficient and therefore does not suit low power systems such as mobile devices. To overcome this challenge, some solutions have been proposed for quantizing the weights and activations of these networks, which accelerate the runtime significantly. Yet, this acceleration comes at the cost of a larger error. The \uniqname method proposed in this work trains quantized neural networks by noise injection and a learned clamping, which improve the accuracy. This leads to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification with architectures such as ResNet-18/34/50 with low as 3-bit weights and activations. We implement the proposed solution on an FPGA to demonstrate its applicability for low power real-time applications. The implementation of the paper is available at https://github.com/Lancer555/NICE

IVMar 28
Guided Lensless Polarization Imaging

Noa Kraicer, Erez Yosef, Raja Giryes

Polarization imaging captures the polarization state of light, revealing information invisible to the human eye yet valuable in domains such as biomedical diagnostics, autonomous driving, and remote sensing. However, conventional polarization cameras are often expensive, bulky, or both, limiting their practical use. Lensless imaging offers a compact, low-cost alternative by replacing the lens with a simple optical element like a diffuser and performing computational reconstruction, but existing lensless polarization systems suffer from limited reconstruction quality. To overcome these limitations, we introduce a RGB-guided lensless polarization imaging system that combines a compact polarization-RGB sensor with an auxiliary, widely available conventional RGB camera providing structural guidance. We reconstruct multi-angle polarization images for each RGB color channel through a two-stage pipeline: a physics-based inversion recovers an initial polarization image, followed by a Transformer-based fusion network that refines this reconstruction using the RGB guidance image from the conventional RGB camera. Our two-stage method significantly improves reconstruction quality and fidelity over lensless-only baselines, generalizes across datasets and imaging conditions, and achieves high-quality real-world results on our physical prototype lensless camera without any fine-tuning.

ASNov 5, 2025
Principled Coarse-Grained Acceptance for Speculative Decoding in Speech

Moran Yanuka, Paul Dixon, Eyal Finkelshtein et al.

Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while maintaining speech quality.

MLFeb 8, 2024
How do Transformers perform In-Context Autoregressive Learning?

Michael E. Sander, Raja Giryes, Taiji Suzuki et al.

Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a simple next token prediction task, where sequences are generated as a first-order autoregressive process $s_{t+1} = W s_t$. We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping. We call the resulting procedure in-context autoregressive learning. More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens. When the tokens are not augmented, we characterize the global minima of a one-layer diagonal linear multi-head Transformer. Importantly, we exhibit orthogonality between heads and show that positional encoding captures trigonometric relations in the data. On the experimental side, we consider the general case of non-commuting orthogonal matrices and generalize our theoretical findings.

IVDec 26, 2023
A Self Supervised StyleGAN for Image Annotation and Classification with Extremely Limited Labels

Dana Cohen Hochberg, Hayit Greenspan, Raja Giryes

The recent success of learning-based algorithms can be greatly attributed to the immense amount of annotated data used for training. Yet, many datasets lack annotations due to the high costs associated with labeling, resulting in degraded performances of deep learning methods. Self-supervised learning is frequently adopted to mitigate the reliance on massive labeled datasets since it exploits unlabeled data to learn relevant feature representations. In this work, we propose SS-StyleGAN, a self-supervised approach for image annotation and classification suitable for extremely small annotated datasets. This novel framework adds self-supervision to the StyleGAN architecture by integrating an encoder that learns the embedding to the StyleGAN latent space, which is well-known for its disentangled properties. The learned latent space enables the smart selection of representatives from the data to be labeled for improved classification performance. We show that the proposed method attains strong classification results using small labeled datasets of sizes 50 and even 10. We demonstrate the superiority of our approach for the tasks of COVID-19 and liver tumor pathology identification.

LGMar 2, 2024
ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation

Moran Yanuka, Morris Alper, Hadar Averbuch-Elor et al. · apple-ml

Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings.

LGDec 12, 2023
Deep Internal Learning: Deep Learning from a Single Input

Tom Tirer, Raja Giryes, Se Young Chun et al.

Deep learning, in general, focuses on training a neural network from large labeled datasets. Yet, in many cases there is value in training a network just from the input at hand. This is particularly relevant in many signal and image processing problems where training data is scarce and diversity is large on the one hand, and on the other, there is a lot of structure in the data that can be exploited. Using this information is the key to deep internal-learning strategies, which may involve training a network from scratch using a single input or adapting an already trained network to a provided input example at inference time. This survey paper aims at covering deep internal-learning techniques that have been proposed in the past few years for these two important directions. While our main focus will be on image processing problems, most of the approaches that we survey are derived for general signals (vectors with recurring patterns that can be distinguished from noise) and are therefore applicable to other modalities.

CVMar 5, 2024
DINOv2 based Self Supervised Learning For Few Shot Medical Image Segmentation

Lev Ayzenberg, Raja Giryes, Hayit Greenspan

Deep learning models have emerged as the cornerstone of medical image segmentation, but their efficacy hinges on the availability of extensive manually labeled datasets and their adaptability to unforeseen categories remains a challenge. Few-shot segmentation (FSS) offers a promising solution by endowing models with the capacity to learn novel classes from limited labeled examples. A leading method for FSS is ALPNet, which compares features between the query image and the few available support segmented images. A key question about using ALPNet is how to design its features. In this work, we delve into the potential of using features from DINOv2, which is a foundational self-supervised learning model in computer vision. Leveraging the strengths of ALPNet and harnessing the feature extraction capabilities of DINOv2, we present a novel approach to few-shot segmentation that not only enhances performance but also paves the way for more robust and adaptable medical image analysis.

CVJan 11, 2024
TriNeRFLet: A Wavelet Based Triplane NeRF Representation

Rajaei Khatib, Raja Giryes

In recent years, the neural radiance field (NeRF) model has gained popularity due to its ability to recover complex 3D scenes. Following its success, many approaches proposed different NeRF representations in order to further improve both runtime and performance. One such example is Triplane, in which NeRF is represented using three 2D feature planes. This enables easily using existing 2D neural networks in this framework, e.g., to generate the three planes. Despite its advantage, the triplane representation lagged behind in its 3D recovery quality compared to NeRF solutions. In this work, we propose TriNeRFLet, a 2D wavelet-based multiscale triplane representation for NeRF, which closes the 3D recovery performance gap and is competitive with current state-of-the-art methods. Building upon the triplane framework, we also propose a novel super-resolution (SR) technique that combines a diffusion model with TriNeRFLet for improving NeRF resolution.