Sahand Sharifzadeh

CV
h-index117
19papers
10,716citations
Novelty54%
AI Score43

19 Papers

CVApr 29, 2022
Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc et al. · deepmind

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

CVAug 22, 2022Code
InstanceFormer: An Online Video Instance Segmentation Framework

Rajat Koner, Tanveer Hannan, Suprosanna Shit et al. · deepmind

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full Spatio-temporal attention limit them in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage transformer-based efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependency and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial to long-range dependency modeling, including challenging scenarios like occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches for challenging and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at https://github.com/rajatkoner08/InstanceFormer.

CVMar 19, 2022
Relationformer: A Unified Framework for Image-to-Graph Generation

Suprosanna Shit, Rajat Koner, Bastian Wittmann et al. · deepmind

A comprehensive representation of an image requires understanding objects and their mutual relationship, especially in image-to-graph generation, e.g., road network extraction, blood-vessel network extraction, or scene graph generation. Traditionally, image-to-graph generation is addressed with a two-stage approach consisting of object detection followed by a separate relation prediction, which prevents simultaneous object-relation interaction. This work proposes a unified one-stage transformer-based framework, namely Relationformer, that jointly predicts objects and their relations. We leverage direct set-based object prediction and incorporate the interaction among the objects to learn an object-relation representation jointly. In addition to existing [obj]-tokens, we propose a novel learnable token, namely [rln]-token. Together with [obj]-tokens, [rln]-token exploits local and global semantic reasoning in an image through a series of mutual associations. In combination with the pair-wise [obj]-token, the [rln]-token contributes to a computationally efficient relation prediction. We achieve state-of-the-art performance on multiple, diverse and multi-domain datasets that demonstrate our approach's effectiveness and generalizability.

INS-DETMar 7, 2023
Ultra-High-Resolution Detector Simulation with Intra-Event Aware GAN and Self-Supervised Relational Reasoning

Baran Hashemi, Nikolai Hartmann, Sahand Sharifzadeh et al. · deepmind

Simulating high-resolution detector responses is a computationally intensive process that has long been challenging in Particle Physics. Despite the ability of generative models to streamline it, full ultra-high-granularity detector simulation still proves to be difficult as it contains correlated and fine-grained information. To overcome these limitations, we propose Intra-Event Aware Generative Adversarial Network (IEA-GAN). IEA-GAN presents a Relational Reasoning Module that approximates an event in detector simulation, generating contextualized high-resolution full detector responses with a proper relational inductive bias. IEA-GAN also introduces a Self-Supervised intra-event aware loss and Uniformity loss, significantly enhancing sample fidelity and diversity. We demonstrate IEA-GAN's application in generating sensor-dependent images for the ultra-high-granularity Pixel Vertex Detector (PXD), with more than 7.5 M information channels at the Belle II Experiment. Applications of this work span from Foundation Models for high-granularity detector simulation, such as at the HL-LHC (High Luminosity LHC), to simulation-based inference and fine-grained density estimation. To our knowledge, IEA-GAN is the first algorithm for faithful ultra-high-granularity full detector simulation with event-based reasoning.

CVDec 23, 2022
Do DALL-E and Flamingo Understand Each Other?

Hang Li, Jindong Gu, Rajat Koner et al. · deepmind, oxford

The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether Flamingo and DALL-E understand each other. To study this question, we propose a reconstruction task where Flamingo generates a description for a given image and DALL-E uses this description as input to synthesize a new image. We argue that these models understand each other if the generated image is similar to the given image. Specifically, we study the relationship between the quality of the image reconstruction and that of the text generation. We find that an optimal description of an image is one that gives rise to a generated image similar to the original one. The finding motivates us to propose a unified framework to finetune the text-to-image and image-to-text models. Concretely, the reconstruction part forms a regularization loss to guide the tuning of the models. Extensive experiments on multiple datasets with different image captioning and image generation models validate our findings and demonstrate the effectiveness of our proposed unified framework. As DALL-E and Flamingo are not publicly available, we use Stable Diffusion and BLIP in the remaining work. Project website: https://dalleflamingo.github.io.

CVMar 24, 2023
Prior-RadGraphFormer: A Prior-Knowledge-Enhanced Transformer for Generating Radiology Graphs from X-Rays

Yiheng Xiong, Jingsong Liu, Kamilia Zaripova et al. · deepmind

The extraction of structured clinical information from free-text radiology reports in the form of radiology graphs has been demonstrated to be a valuable approach for evaluating the clinical correctness of report-generation methods. However, the direct generation of radiology graphs from chest X-ray (CXR) images has not been attempted. To address this gap, we propose a novel approach called Prior-RadGraphFormer that utilizes a transformer model with prior knowledge in the form of a probabilistic knowledge graph (PKG) to generate radiology graphs directly from CXR images. The PKG models the statistical relationship between radiology entities, including anatomical structures and medical observations. This additional contextual information enhances the accuracy of entity and relation extraction. The generated radiology graphs can be applied to various downstream tasks, such as free-text or structured reports generation and multi-label classification of pathologies. Our approach represents a promising method for generating radiology graphs directly from CXR images, and has significant potential for improving medical image analysis and clinical decision-making.

CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

LGFeb 21, 2024Code
PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models

Eli M Carrami, Sahand Sharifzadeh

Understanding protein structure and function is crucial in biology. However, current computational methods are often task-specific and resource-intensive. To address this, we propose zero-shot Protein Question Answering (PQA), a task designed to answer a wide range of protein-related queries without task-specific training. The success of PQA hinges on high-quality datasets and robust evaluation strategies, both of which are lacking in current research. Existing datasets suffer from biases, noise, and lack of evolutionary context, while current evaluation methods fail to accurately assess model performance. We introduce the Pika framework to overcome these limitations. Pika comprises a curated, debiased dataset tailored for PQA and a biochemically relevant benchmarking strategy. We also propose multimodal large language models as a strong baseline for PQA, leveraging their natural language processing and knowledge. This approach promises a more flexible and efficient way to explore protein properties, advancing protein research. Our comprehensive PQA framework, Pika, including dataset, code, and model checkpoints, is openly accessible on github.com/EMCarrami/Pika, promoting wider research in the field.

LGJun 23, 2020Code
Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework

Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt et al.

The heterogeneity in recently published knowledge graph embedding models' implementations, training, and evaluation has made fair and thorough comparisons difficult. In order to assess the reproducibility of previously published results, we re-implemented and evaluated 21 interaction models in the PyKEEN software package. Here, we outline which results could be reproduced with their reported hyper-parameters, which could only be reproduced with alternate hyper-parameters, and which could not be reproduced at all as well as provide insight as to why this might be the case. We then performed a large-scale benchmarking on four datasets with several thousands of experiments and 24,804 GPU hours of computation time. We present insights gained as to best practices, best configurations for each model, and where improvements could be made over previously published best configurations. Our results highlight that the combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model's performances, and not only determined by the model architecture. We provide evidence that several architectures can obtain results competitive to the state-of-the-art when configured carefully. We have made all code, experimental configurations, results, and analyses that lead to our interpretations available at https://github.com/pykeen/pykeen and https://github.com/pykeen/benchmarking

CVMar 12, 2024
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak et al.

The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions which reveals that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25\% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.

LGMay 28, 2023
MemeGraphs: Linking Memes to Knowledge Graphs

Vasiliki Kougia, Simon Fetzel, Thomas Kirchmair et al.

Memes are a popular form of communicating trends and ideas in social media and on the internet in general, combining the modalities of images and text. They can express humor and sarcasm but can also have offensive content. Analyzing and classifying memes automatically is challenging since their interpretation relies on the understanding of visual elements, language, and background knowledge. Thus, it is important to meaningfully represent these sources and the interaction between them in order to classify a meme as a whole. In this work, we propose to use scene graphs, that express images in terms of objects and their visual relations, and knowledge graphs as structured representations for meme classification with a Transformer-based architecture. We compare our approach with ImgBERT, a multimodal model that uses only learned (instead of structured) representations of the meme, and observe consistent improvements. We further provide a dataset with human graph annotations that we compare to automatically generated graphs and entity linking. Analysis shows that automatic methods link more entities than human annotators and that automatically generated graphs are better suited for hatefulness classification in memes.

AISep 27, 2021
The Tensor Brain: A Unified Theory of Perception, Memory and Semantic Decoding

Volker Tresp, Sahand Sharifzadeh, Hang Li et al.

We present a unified computational theory of an agent's perception and memory. In our model, perception, episodic memory, and semantic memory are realized by different operational modes of the oscillating interactions between a symbolic index layer and a subsymbolic representation layer. The two layers form a bilayer tensor network (BTN). Although memory appears to be about the past, its main purpose is to support the agent in the present and the future. Recent episodic memory provides the agent with a sense of the here and now. Remote episodic memory retrieves relevant past experiences to provide information about possible future scenarios. This aids the agent in decision-making. "Future" episodic memory, based on expected future events, guides planning and action. Semantic memory retrieves specific information, which is not delivered by current perception, and defines priors for future observations. We argue that it is important for the agent to encode individual entities, not just classes and attributes. We demonstrate that a form of self-supervised learning can acquire new concepts and refine existing ones. We test our model on a standard benchmark data set, which we expanded to contain richer representations for attributes, classes, and individuals. Our key hypothesis is that obtaining a better understanding of perception and memory is a crucial prerequisite to comprehending human-level intelligence.

CVFeb 9, 2021
Improving Scene Graph Classification by Exploiting Knowledge from Texts

Sahand Sharifzadeh, Sina Moayed Baharlou, Martin Schmitt et al.

Training scene graph classification models requires a large amount of annotated image data. Meanwhile, scene graphs represent relational knowledge that can be modeled with symbolic data from texts or knowledge graphs. While image annotation demands extensive labor, collecting textual descriptions of natural scenes requires less effort. In this work, we investigate whether textual scene descriptions can substitute for annotated image data. To this end, we employ a scene graph classification framework that is trained not only from annotated images but also from symbolic data. In our architecture, the symbolic entities are first mapped to their correspondent image-grounded representations and then fed into the relational reasoning pipeline. Even though a structured form of knowledge, such as the form in knowledge graphs, is not always available, we can generate it from unstructured texts using a transformer-based language model. We show that by fine-tuning the classification pipeline with the extracted knowledge from texts, we can achieve ~8x more accurate results in scene graph classification, ~3x in object classification, and ~1.5x in predicate classification, compared to the supervised baselines with only 1% of the annotated images.

CVNov 19, 2020
Classification by Attention: Scene Graph Classification with Prior Knowledge

Sahand Sharifzadeh, Sina Moayed Baharlou, Volker Tresp

A major challenge in scene graph classification is that the appearance of objects and relations can be significantly different from one image to another. Previous works have addressed this by relational reasoning over all objects in an image or incorporating prior knowledge into classification. Unlike previous works, we do not consider separate models for perception and prior knowledge. Instead, we take a multi-task learning approach, where we implement the classification as an attention layer. This allows for the prior knowledge to emerge and propagate within the perception model. By enforcing the model also to represent the prior, we achieve a strong inductive bias. We show that our model can accurately generate commonsense knowledge and that the iterative injection of this knowledge to scene representations leads to significantly higher classification performance. Additionally, our model can be fine-tuned on external knowledge given as triples. When combined with self-supervised learning and with 1% of annotated images only, this gives more than 3% improvement in object classification, 26% in scene graph classification, and 36% in predicate prediction accuracy.

LGJul 28, 2020
PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings

Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt et al.

Recently, knowledge graph embeddings (KGEs) received significant attention, and several software libraries have been developed for training and evaluating KGEs. While each of them addresses specific needs, we re-designed and re-implemented PyKEEN, one of the first KGE libraries, in a community effort. PyKEEN 1.0 enables users to compose knowledge graph embedding models (KGEMs) based on a wide range of interaction models, training approaches, loss functions, and permits the explicit modeling of inverse relations. Besides, an automatic memory optimization has been realized in order to exploit the provided hardware optimally, and through the integration of Optuna extensive hyper-parameter optimization (HPO) functionalities are provided.

AIJan 29, 2020
The Tensor Brain: Semantic Decoding for Perception and Memory

Volker Tresp, Sahand Sharifzadeh, Dario Konopatzki et al.

We analyse perception and memory, using mathematical models for knowledge graphs and tensors, to gain insights into the corresponding functionalities of the human mind. Our discussion is based on the concept of propositional sentences consisting of \textit{subject-predicate-object} (SPO) triples for expressing elementary facts. SPO sentences are the basis for most natural languages but might also be important for explicit perception and declarative memories, as well as intra-brain communication and the ability to argue and reason. A set of SPO sentences can be described as a knowledge graph, which can be transformed into an adjacency tensor. We introduce tensor models, where concepts have dual representations as indices and associated embeddings, two constructs we believe are essential for the understanding of implicit and explicit perception and memory in the brain. We argue that a biological realization of perception and memory imposes constraints on information processing. In particular, we propose that explicit perception and declarative memories require a semantic decoder, which, in a simple realization, is based on four layers: First, a sensory memory layer, as a buffer for sensory input, second, an index layer representing concepts, third, a memoryless representation layer for the broadcasting of information ---the "blackboard", or the "canvas" of the brain--- and fourth, a working memory layer as a processing center and data buffer. We discuss the operations of the four layers and relate them to the global workspace theory. In a Bayesian brain interpretation, semantic memory defines the prior for observable triple statements. We propose that ---in evolution and during development--- semantic memory, episodic memory, and natural language evolved as emergent properties in agents' process to gain a deeper understanding of sensory information.

CVMay 2, 2019
Improving Visual Relation Detection using Depth Maps

Sahand Sharifzadeh, Sina Moayed Baharlou, Max Berrendorf et al.

Visual relation detection methods rely on object information extracted from RGB images such as 2D bounding boxes, feature maps, and predicted class probabilities. We argue that depth maps can additionally provide valuable information on object relations, e.g. helping to detect not only spatial relations, such as standing behind, but also non-spatial relations, such as holding. In this work, we study the effect of using different object features with a focus on depth maps. To enable this study, we release a new synthetic dataset of depth maps, VG-Depth, as an extension to Visual Genome (VG). We also note that given the highly imbalanced distribution of relations in VG, typical evaluation metrics for visual relation detection cannot reveal improvements of under-represented relations. To address this problem, we propose using an additional metric, calling it Macro Recall@K, and demonstrate its remarkable performance on VG. Finally, our experiments confirm that by effective utilization of depth maps within a simple, yet competitive framework, the performance of visual relation detection can be improved by a margin of up to 8%.

CLApr 20, 2019
An Unsupervised Joint System for Text Generation from Knowledge Graphs and Semantic Parsing

Martin Schmitt, Sahand Sharifzadeh, Volker Tresp et al.

Knowledge graphs (KGs) can vary greatly from one domain to another. Therefore supervised approaches to both graph-to-text generation and text-to-graph knowledge extraction (semantic parsing) will always suffer from a shortage of domain-specific parallel graph-text data; at the same time, adapting a model trained on a different domain is often impossible due to little or no overlap in entities and relations. This situation calls for an approach that (1) does not need large amounts of annotated data and thus (2) does not need to rely on domain adaptation techniques to work well in different domains. To this end, we present the first approach to unsupervised text generation from KGs and show simultaneously how it can be used for unsupervised semantic parsing. We evaluate our approach on WebNLG v2.1 and a new benchmark leveraging scene graphs from Visual Genome. Our system outperforms strong baselines for both text$\leftrightarrow$graph conversion tasks without any manual adaptation from one dataset to the other. In additional experiments, we investigate the impact of using different unsupervised objectives.

AIDec 12, 2016
Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks

Sahand Sharifzadeh, Ioannis Chiotellis, Rudolph Triebel et al.

We propose an inverse reinforcement learning (IRL) approach using Deep Q-Networks to extract the rewards in problems with large state spaces. We evaluate the performance of this approach in a simulation-based autonomous driving scenario. Our results resemble the intuitive relation between the reward function and readings of distance sensors mounted at different poses on the car. We also show that, after a few learning rounds, our simulated agent generates collision-free motions and performs human-like lane change behaviour.