Jan Piewek

CV
h-index2
3papers
3citations
Novelty53%
AI Score40

3 Papers

45.8CVApr 15
Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

Danish Nazir, Antoine Hanna-Asaad, Lucas Görnhardt et al.

Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, being computationally complex. To address this problem, current state-of-the-art (SOTA) \texttt{ToC3D} for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, there are two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with a token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike \texttt{ToC3D}, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy, which trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than $300$ million (M) to only $1.6$ M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by $48\%$ ... $55\%$, inference latency (on an \texttt{NVIDIA-GV100} GPU) by $9\%$ ... $25\%$, while still improving mean average precision by $1.0\%$ ... $2.8\%$ absolute and NuScenes detection score by $0.4\%$ ... $1.2\%$ absolute compared to so-far SOTA \texttt{ToC3D}.

IVJul 15, 2024
Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

Danish Nazir, Timo Bartels, Jan Piewek et al.

Distributed computing in the context of deep neural networks (DNNs) implies the execution of one part of the network on edge devices and the other part typically on a large-scale cloud platform. Conventional methods propose to employ a serial concatenation of a learned image and source encoder, the latter projecting the image encoder output (bottleneck features) into a quantized representation for bitrate-efficient transmission. In the cloud, a respective source decoder reprojects the quantized representation to the original feature representation, serving as an input for the downstream task decoder performing, e.g., semantic segmentation. In this work, we propose joint source and task decoding, as it allows for a smaller network size in the cloud. This further enables the scalability of such services in large numbers without requiring extensive computational load on the cloud per channel. We demonstrate the effectiveness of our method by achieving a distributed semantic segmentation SOTA over a wide range of bitrates on the mean intersection over union metric, while using only $9.8 \%$ ... $11.59 \%$ of cloud DNN parameters used in the previous SOTA on the COCO and Cityscapes datasets.

LGOct 19, 2025
An Efficient Semantic Segmentation Decoder for In-Car or Distributed Applications

Danish Nazir, Gowtham Sai Inti, Timo Bartels et al.

Modern automotive systems leverage deep neural networks (DNNs) for semantic segmentation and operate in two key application areas: (1) In-car, where the DNN solely operates in the vehicle without strict constraints on the data rate. (2) Distributed, where one DNN part operates in the vehicle and the other part typically on a large-scale cloud platform with a particular constraint on transmission bitrate efficiency. Typically, both applications share an image and source encoder, while each uses distinct (joint) source and task decoders. Prior work utilized convolutional neural networks for joint source and task decoding but did not investigate transformer-based alternatives such as SegDeformer, which offer superior performance at the cost of higher computational complexity. In this work, we propose joint feature and task decoding for SegDeformer, thereby enabling lower computational complexity in both in-car and distributed applications, despite SegDeformer's computational demands. This improves scalability in the cloud while reducing in-car computational complexity. For the in-car application, we increased the frames per second (fps) by up to a factor of $11.7$ ($1.4$ fps to $16.5$ fps) on Cityscapes and by up to a factor of $3.5$ ($43.3$ fps to $154.3$ fps) on ADE20K, while being on-par w.r.t.\ the mean intersection over union (mIoU) of the transformer-based baseline that doesn't compress by a source codec. For the distributed application, we achieve state-of-the-art (SOTA) over a wide range of bitrates on the mIoU metric, while using only $0.14$\% ($0.04$\%) of cloud DNN parameters used in previous SOTA, reported on ADE20K (Cityscapes).