Adil Kaan Akan

CV
h-index27
14papers
270citations
Novelty55%
AI Score57

14 Papers

CVMay 28
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral et al.

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.

CVMar 20, 2022
Stochastic Video Prediction with Structure and Motion

Adil Kaan Akan, Sadra Safadoust, Fatma Güney

While stochastic video prediction models enable future prediction under uncertainty, they mostly fail to model the complex dynamics of real-world scenes. For example, they cannot provide reliable predictions for scenes with a moving camera and independently moving foreground objects in driving scenarios. The existing methods fail to fully capture the dynamics of the structured world by only focusing on changes in pixels. In this paper, we assume that there is an underlying process creating observations in a video and propose to factorize it into static and dynamic components. We model the static part based on the scene structure and the ego-motion of the vehicle, and the dynamic part based on the remaining motion of the dynamic objects. By learning separate distributions of changes in foreground and background, we can decompose the scene into static and dynamic parts and separately model the change in each. Our experiments demonstrate that disentangling structure and motion helps stochastic video prediction, leading to better future predictions in complex driving scenarios on two real-world driving datasets, KITTI and Cityscapes.

CVJul 26, 2023
ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation

Görkay Aydemir, Adil Kaan Akan, Fatma Güney

Forecasting future trajectories of agents in complex traffic scenes requires reliable and efficient predictions for all agents in the scene. However, existing methods for trajectory prediction are either inefficient or sacrifice accuracy. To address this challenge, we propose ADAPT, a novel approach for jointly predicting the trajectories of all agents in the scene with dynamic weight learning. Our approach outperforms state-of-the-art methods in both single-agent and multi-agent settings on the Argoverse and Interaction datasets, with a fraction of their computational overhead. We attribute the improvement in our performance: first, to the adaptive head augmenting the model capacity without increasing the model size; second, to our design choices in the endpoint-conditioned prediction, reinforced by gradient stopping. Our analyses show that ADAPT can focus on each agent with adaptive prediction, allowing for accurate predictions efficiently. https://KUIS-AI.github.io/adapt

CVMar 25, 2022
StretchBEV: Stretching Future Instance Prediction Spatially and Temporally

Adil Kaan Akan, Fatma Güney

In self-driving, predicting future in terms of location and motion of all the agents around the vehicle is a crucial requirement for planning. Recently, a new joint formulation of perception and prediction has emerged by fusing rich sensory information perceived from multiple cameras into a compact bird's-eye view representation to perform prediction. However, the quality of future predictions degrades over time while extending to longer time horizons due to multiple plausible predictions. In this work, we address this inherent uncertainty in future predictions with a stochastic temporal model. Our model learns temporal dynamics in a latent space through stochastic residual updates at each time step. By sampling from a learned distribution at each time step, we obtain more diverse future predictions that are also more accurate compared to previous work, especially stretching both spatially further regions in the scene and temporally over longer time horizons. Despite separate processing of each time step, our model is still efficient through decoupling of the learning of dynamics and the generation of future predictions.

CVJul 1, 2022
Trajectory Forecasting on Temporal Graphs

Görkay Aydemir, Adil Kaan Akan, Fatma Güney

Predicting future locations of agents in the scene is an important problem in self-driving. In recent years, there has been a significant progress in representing the scene and the agents in it. The interactions of agents with the scene and with each other are typically modeled with a Graph Neural Network. However, the graph structure is mostly static and fails to represent the temporal changes in highly dynamic scenes. In this work, we propose a temporal graph representation to better capture the dynamics in traffic scenes. We complement our representation with two types of memory modules; one focusing on the agent of interest and the other on the entire scene. This allows us to learn temporally-aware representations that can achieve good results even with simple regression of multiple futures. When combined with goal-conditioned prediction, we show better results that can reach the state-of-the-art performance on the Argoverse benchmark.

CVMay 14
Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Tuna Han Salih Meral, Kaan Oktay, Hidir Yesiltepe et al.

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.

CVSep 21, 2022
Stochastic Future Prediction in Real World Driving Scenarios

Adil Kaan Akan

Uncertainty plays a key role in future prediction. The future is uncertain. That means there might be many possible futures. A future prediction method should cover the whole possibilities to be robust. In autonomous driving, covering multiple modes in the prediction part is crucially important to make safety-critical decisions. Although computer vision systems have advanced tremendously in recent years, future prediction remains difficult today. Several examples are uncertainty of the future, the requirement of full scene understanding, and the noisy outputs space. In this thesis, we propose solutions to these challenges by modeling the motion explicitly in a stochastic way and learning the temporal dynamics in a latent space.

CVJan 27, 2025
Slot-Guided Adaptation of Pre-trained Diffusion Models for Object-Centric Learning and Compositional Generation

Adil Kaan Akan, Yucel Yemez

We present SlotAdapt, an object-centric learning method that combines slot attention with pretrained diffusion models by introducing adapters for slot-based conditioning. Our method preserves the generative power of pretrained diffusion models, while avoiding their text-centric conditioning bias. We also incorporate an additional guidance loss into our architecture to align cross-attention from adapter layers with slot attention. This enhances the alignment of our model with the objects in the input image without using external supervision. Experimental results show that our method outperforms state-of-the-art techniques in object discovery and image generation tasks across multiple datasets, including those with real images. Furthermore, we demonstrate through experiments that our method performs remarkably well on complex real-world images for compositional generation, in contrast to other slot-based generative methods in the literature. The project page can be found at https://kaanakan.github.io/SlotAdapt/.

CVNov 25, 2025
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan et al.

Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.

CVSep 29, 2025
Learning Object-Centric Representations Based on Slots in Real World Scenarios

Adil Kaan Akan

A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine-grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object-level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot-conditioned modules for objects, reducing text-conditioning bias and achieving state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer-based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object-centric generative modeling for images and videos. By bridging human object-based perception and machine learning, it expands the design space for interactive, structured, and user-driven generative tools in creative, scientific, and practical domains.

CVJul 28, 2025
Compositional Video Synthesis by Temporal Object-Centric Learning

Adil Kaan Akan, Yucel Yemez

We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches either lack generative capabilities entirely or treat video sequences holistically, thus neglecting explicit object-level structure, our approach explicitly captures temporal dynamics by learning pose invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence, and offers intuitive compositional editing capabilities such as object insertion, deletion, or replacement, maintaining consistent object identities across frames. Extensive experiments demonstrate that our method sets new benchmarks in video generation quality and temporal consistency, outperforming previous object-centric generative methods. Although our segmentation performance closely matches state-of-the-art methods, our approach uniquely integrates this capability with robust generative performance, significantly advancing interactive and controllable video generation and opening new possibilities for advanced content creation, semantic editing, and dynamic scene understanding.

CVAug 5, 2021
SLAMP: Stochastic Latent Appearance and Motion Prediction

Adil Kaan Akan, Erkut Erdem, Aykut Erdem et al.

Motion is an important cue for video prediction and often utilized by separating video content into static and dynamic components. Most of the previous work utilizing motion is deterministic but there are stochastic methods that can model the inherent uncertainty of the future. Existing stochastic models either do not reason about motion explicitly or make limiting assumptions about the static part. In this paper, we reason about appearance and motion in the video stochastically by predicting the future based on the motion history. Explicit reasoning about motion without history already reaches the performance of current stochastic models. The motion history further improves the results by allowing to predict consistent dynamics several frames into the future. Our model performs comparably to the state-of-the-art models on the generic video prediction datasets, however, significantly outperforms them on two challenging real-world autonomous driving datasets with complex motion and dynamic background.

CVFeb 16, 2021
Just Noticeable Difference for Machine Perception and Generation of Regularized Adversarial Images with Minimal Perturbation

Adil Kaan Akan, Emre Akbas, Fatos T. Yarman Vural

In this study, we introduce a measure for machine perception, inspired by the concept of Just Noticeable Difference (JND) of human perception. Based on this measure, we suggest an adversarial image generation algorithm, which iteratively distorts an image by an additive noise until the model detects the change in the image by outputting a false label. The noise added to the original image is defined as the gradient of the cost function of the model. A novel cost function is defined to explicitly minimize the amount of perturbation applied to the input image while enforcing the perceptual similarity between the adversarial and input images. For this purpose, the cost function is regularized by the well-known total variation and bounded range terms to meet the natural appearance of the adversarial image. We evaluate the adversarial images generated by our algorithm both qualitatively and quantitatively on CIFAR10, ImageNet, and MS COCO datasets. Our experiments on image classification and object detection tasks show that adversarial images generated by our JND method are both more successful in deceiving the recognition/detection models and less perturbed compared to the images generated by the state-of-the-art methods, namely, FGV, FSGM, and DeepFool methods.

IVJan 29, 2020
Just Noticeable Difference for Machines to Generate Adversarial Images

Adil Kaan Akan, Mehmet Ali Genc, Fatos T. Yarman Vural

One way of designing a robust machine learning algorithm is to generate authentic adversarial images which can trick the algorithms as much as possible. In this study, we propose a new method to generate adversarial images which are very similar to true images, yet, these images are discriminated from the original ones and are assigned into another category by the model. The proposed method is based on a popular concept of experimental psychology, called, Just Noticeable Difference. We define Just Noticeable Difference for a machine learning model and generate a least perceptible difference for adversarial images which can trick a model. The suggested model iteratively distorts a true image by gradient descent method until the machine learning algorithm outputs a false label. Deep Neural Networks are trained for object detection and classification tasks. The cost function includes regularization terms to generate just noticeably different adversarial images which can be detected by the model. The adversarial images generated in this study looks more natural compared to the output of state of the art adversarial image generators.