DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture
This work addresses efficient depth map compression for applications such as 3D sensing and robotics, though it appears incremental because it builds on existing encoding and neural network methods.
The paper tackles depth map compression by proposing DepthTCM, a physics-aware framework that converts depth maps to 3-channel images using multiwavelength depth (MWD) encoding, quantizes them to 4 bits per channel, and compresses them with a Transformer-CNN mixed neural network. On Middlebury 2014 it reaches 0.307 bpp with 99.38% accuracy, and in ablations 4-bit quantization reduces bitrate by 66% relative to 8-bit quantization while maintaining comparable reconstruction quality.
We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first losslessly converted to a conventional 3-channel image representation using a method inspired by physical sinusoidal fringe-pattern profilometry systems; the resulting 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel representation using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantages of the proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks improve PSNR by up to 0.75 dB over CNN-only architectures.
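The abstract does not spell out the encoding formulas, so what follows is only a minimal per-pixel sketch of how an MWD-style mapping and 4-bit quantization could fit together. It assumes a commonly used multiwavelength formulation (sine and cosine fringes of the depth in two channels, normalized depth in the third), and the function names, the fringe period, the depth range, and the uniform quantizer are all illustrative assumptions rather than the authors' implementation. The learned Transformer-CNN codec is not modeled here.

```python
import math

TWO_PI = 2.0 * math.pi


def mwd_encode(z, z_range, period):
    """Map a depth z in [0, z_range] to three channels in [0, 1].

    Assumed formulation: two phase-shifted fringes plus a coarse depth channel.
    """
    phase = TWO_PI * z / period
    r = 0.5 + 0.5 * math.sin(phase)   # sine fringe
    g = 0.5 + 0.5 * math.cos(phase)   # cosine fringe
    b = z / z_range                   # coarse depth, used to unwrap the phase
    return r, g, b


def quantize(v, bits=4):
    """Uniformly quantize v in [0, 1] to an integer code (assumed quantizer)."""
    levels = (1 << bits) - 1
    return round(v * levels)


def dequantize(q, bits=4):
    """Map an integer code back to [0, 1]."""
    levels = (1 << bits) - 1
    return q / levels


def mwd_decode(r, g, b, z_range, period):
    """Recover depth: wrapped phase from r and g, fringe order from b."""
    wrapped = math.atan2(r - 0.5, g - 0.5)          # in [-pi, pi]
    coarse_phase = TWO_PI * (b * z_range) / period  # unwrapped but imprecise
    order = round((coarse_phase - wrapped) / TWO_PI)
    return period * (wrapped + TWO_PI * order) / TWO_PI


def roundtrip(z, z_range=1.0, period=0.25, bits=4):
    """Encode, quantize each channel, dequantize, and decode one depth value."""
    channels = mwd_encode(z, z_range, period)
    restored = [dequantize(quantize(c, bits), bits) for c in channels]
    return mwd_decode(*restored, z_range, period)


if __name__ == "__main__":
    for z in (0.10, 0.37, 0.93):
        print(f"z={z:.2f}  decoded={roundtrip(z):.4f}")
```

In this sketch the coarse channel only has to identify the correct fringe order, so it tolerates heavy quantization, while the two fringe channels carry the fine depth information. This is one plausible reading of why reducing to 4 bits per channel can cost little accuracy; the fringe period must remain large relative to the coarse channel's quantization step for the unwrapping to stay correct.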