Evžen Wybitul

CV
h-index2
3papers
3citations
Novelty52%
AI Score35

3 Papers

CVJan 12
Representations of Text and Images Align From Layer One

Evžen Wybitul, Javier Rando, Florian Tramèr et al.

We show that for a variety of concepts in adapter-based vision-language models, the representations of their images and their text descriptions are meaningfully aligned from the very first layer. This contradicts the established view that such image-text alignment only appears in late layers. We show this using a new synthesis-based method inspired by DeepDream: given a textual concept such as "Jupiter", we extract its concept vector at a given layer, and then use optimisation to synthesise an image whose representation aligns with that vector. We apply our approach to hundreds of concepts across seven layers in Gemma 3, and find that the synthesised images often depict salient visual features of the targeted textual concepts: for example, already at layer 1, more than 50 % of images depict recognisable features of animals, activities, or seasons. Our method thus provides direct, constructive evidence of image-text alignment on a concept-by-concept and layer-by-layer basis. Unlike previous methods for measuring multimodal alignment, our approach is simple, fast, and does not require auxiliary models or datasets. It also offers a new path towards model interpretability, by providing a way to visualise a model's representation space by backtracing through its image processing components.

AIMay 14, 2025
Access Controls Will Solve the Dual-Use Dilemma

Evžen Wybitul · eth-zurich

AI safety systems face the dual-use dilemma. It is unclear whether to answer dual-use requests, since the same query could be either harmless or harmful depending on who made it and why. To make better decisions, such systems would need to examine requests' real-world context, but currently, they lack access to this information. Instead, they sometimes end up making arbitrary choices that result in refusing legitimate queries and allowing harmful ones, which hurts both utility and safety. To address this, we propose a conceptual framework based on access controls where only verified users can access dual-use outputs. We describe the framework's components, analyse its feasibility, and explain how it addresses both over-refusals and under-refusals. While only a high-level proposal, our work takes the first step toward giving model providers more granular tools for managing dual-use content. Such tools would enable users to access more capabilities without sacrificing safety, and offer regulators new options for targeted policies.

CVNov 20, 2024
ViSTa Dataset: Do vision-language models understand sequential tasks?

Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov et al. · eth-zurich

Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.