Özgür Erkent

RO
7papers
22citations
Novelty47%
AI Score51

7 Papers

CVMay 25Code
Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation

İsmail Emre Canıtez, Özgür Erkent

Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies. For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism. For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73\% and 86.24\% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION

ROFeb 10, 2023
LAPTNet-FPN: Multi-scale LiDAR-aided Projective Transform Network for Real Time Semantic Grid Prediction

Manuel Alejandro Diaz-Zapata, David Sierra González, Özgür Erkent et al.

Semantic grids can be useful representations of the scene around an autonomous system. By having information about the layout of the space around itself, a robot can leverage this type of representation for crucial tasks such as navigation or tracking. By fusing information from multiple sensors, robustness can be increased and the computational load for the task can be lowered, achieving real time performance. Our multi-scale LiDAR-Aided Perspective Transform network uses information available in point clouds to guide the projection of image features to a top-view representation, resulting in a relative improvement in the state of the art for semantic grid generation for human (+8.67%) and movable object (+49.07%) classes in the nuScenes dataset, as well as achieving results close to the state of the art for the vehicle, drivable area and walkway classes, while performing inference at 25 FPS.

ROMay 15Code
Health-Conditioned Vision-Language-Action Models for Malfunction-Aware Robot Control

Hüseyin Arslan, Özgür Erkent

Research on Vision Language Action (VLA) models has been increasing rapidly in recent years. Although some of them focus on detecting, preventing, and recovering from task failures, they usually don't deal with adapting to robot's physical failures. In real-life scenarios, most robots face physical degradations in various ways such as joint degradation, actuator failure, or weak gripper. We introduce malfunction-aware (health-conditioned) VLA that takes a health vector as an input that gives information about robots' joints' operation angle and torque capability, and adapts its predictions to complete the tasks with the degraded joints. To achieve this, we inject a Health Projector module to the VLA-Adapter architecture and train it on malfunction robot data we collected on the LIBERO environment [1]. We collect 128 teleoperated episodes on Libero-Spatial tasks. Our results show that, with a very lightweight addition, the model can learn to operate successfully with different configurations of degraded joints which the default pretrained VLA-Adapter's Libero-Spatial-Pro model cannot. The code and dataset will be available soon at https://github.com/h-arslan/health-aware-vla

ROAug 23, 2023
Multi-Modal Multi-Task (3MT) Road Segmentation

Erkan Milli, Özgür Erkent, Asım Egemen Yılmaz

Multi-modal systems have the capacity of producing more reliable results than systems with a single modality in road detection due to perceiving different aspects of the scene. We focus on using raw sensor inputs instead of, as it is typically done in many SOTA works, leveraging architectures that require high pre-processing costs such as surface normals or dense depth predictions. By using raw sensor inputs, we aim to utilize a low-cost model thatminimizes both the pre-processing andmodel computation costs. This study presents a cost-effective and highly accurate solution for road segmentation by integrating data from multiple sensorswithin a multi-task learning architecture.Afusion architecture is proposed in which RGB and LiDAR depth images constitute the inputs of the network. Another contribution of this study is to use IMU/GNSS (inertial measurement unit/global navigation satellite system) inertial navigation system whose data is collected synchronously and calibrated with a LiDAR-camera to compute aggregated dense LiDAR depth images. It has been demonstrated by experiments on the KITTI dataset that the proposed method offers fast and high-performance solutions. We have also shown the performance of our method on Cityscapes where raw LiDAR data is not available. The segmentation results obtained for both full and half resolution images are competitive with existing methods. Therefore, we conclude that our method is not dependent only on raw LiDAR data; rather, it can be used with different sensor modalities. The inference times obtained in all experiments are very promising for real-time experiments.

CVNov 14, 2022
LAPTNet: LiDAR-Aided Perspective Transform Network

Manuel Alejandro Diaz-Zapata, Özgür Erkent, Christian Laugier et al.

Semantic grids are a useful representation of the environment around a robot. They can be used in autonomous vehicles to concisely represent the scene around the car, capturing vital information for downstream tasks like navigation or collision assessment. Information from different sensors can be used to generate these grids. Some methods rely only on RGB images, whereas others choose to incorporate information from other sensors, such as radar or LiDAR. In this paper, we present an architecture that fuses LiDAR and camera information to generate semantic grids. By using the 3D information from a LiDAR point cloud, the LiDAR-Aided Perspective Transform Network (LAPTNet) is able to associate features in the camera plane to the bird's eye view without having to predict any depth information about the scene. Compared to state-of-theart camera-only methods, LAPTNet achieves an improvement of up to 8.8 points (or 38.13%) over state-of-art competing approaches for the classes proposed in the NuScenes dataset validation split.

ROMay 23
Sum of Costs Diffusion with Dynamic Guidance for Motion Planning

Aysu Aylin Kaplan, Özgür Erkent

The motion planning problem for robotic manipulation can be addressed through classical or deep learning approaches. Existing methods face significant challenges in generalizing to diverse settings. In this study, we present a method with high generalization capability that generates collision-free trajectories using diffusion models where the denoising process is guided by the gradient of the total collision cost. We are also presenting a dynamic approach for choosing start step of the gradient guidance. Experimental results demonstrate that guiding the diffusion model dynamically with the sum of collision costs offers more robust performance by overcoming the generalization issues faced by competing methods. The proposed model demonstrates its effectiveness by achieving the highest performance on diverse test settings in M$π$nets\ dataset among the compared methods.

CVApr 14
Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

Ahmet İnanç, Özgür Erkent

Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose \textbf{CTAB} (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.