Accelerating Discrete Wavelet Transforms on Parallel Architectures
This work addresses a performance bottleneck in image-processing algorithms that matters to researchers and engineers who use GPUs. The contribution is incremental, since it builds on prior separable methods.
The paper accelerates the 2-D discrete wavelet transform (DWT) on parallel architectures by proposing non-separable calculation schemes that merge corresponding separable parts, which halves the number of steps. An optional optimization additionally reduces the number of arithmetic operations. In the evaluation, the non-separable methods outperformed existing separable schemes on numerous GPU setups, particularly with pixel shaders.
The 2-D discrete wavelet transform (DWT) lies at the heart of many image-processing algorithms. Several recent studies have compared the performance of this transform on various shared-memory parallel architectures, especially on graphics processing units (GPUs). All of these studies, however, considered only separable calculation schemes. We show that corresponding separable parts can be merged into non-separable units, which halves the number of steps. In addition, we introduce an optional optimization approach that reduces the number of arithmetic operations. The discussed schemes were adapted to the OpenCL framework and to pixel shaders, and then evaluated on GPUs from the two biggest vendors. We demonstrate the performance of the proposed non-separable methods by comparing them with existing separable schemes. The non-separable schemes outperform their separable counterparts on numerous setups, especially with pixel shaders.
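The idea of merging separable parts can be illustrated with a minimal sketch. It is not the paper's implementation: the predictor (a CDF 5/3-style average of two neighbours), the periodic boundary handling, and all function names are assumptions chosen for illustration. The image is split into four polyphase components `a`, `b`, `c`, `d`. The separable version applies a horizontal predict step and then a vertical one, while the merged version computes the same result in a single step from the original components.

```python
def predict_h(x):
    """0.5 * (sample + right neighbour), periodic; assumed CDF 5/3-style predictor."""
    w = len(x[0])
    return [[0.5 * (row[j] + row[(j + 1) % w]) for j in range(w)] for row in x]


def predict_v(x):
    """0.5 * (sample + lower neighbour), periodic; vertical counterpart of predict_h."""
    h = len(x)
    return [[0.5 * (x[i][j] + x[(i + 1) % h][j]) for j in range(len(x[0]))]
            for i in range(h)]


def sub(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(img):
    """Polyphase split: a = even rows/even cols, b = even/odd, c = odd/even, d = odd/odd."""
    a = [row[0::2] for row in img[0::2]]
    b = [row[1::2] for row in img[0::2]]
    c = [row[0::2] for row in img[1::2]]
    d = [row[1::2] for row in img[1::2]]
    return a, b, c, d


def separable_predict(a, b, c, d):
    # Step 1 (horizontal): odd columns are predicted from even columns.
    b1 = sub(b, predict_h(a))
    d1 = sub(d, predict_h(c))
    # Step 2 (vertical): odd rows are predicted from even rows,
    # using the results of step 1.
    c2 = sub(c, predict_v(a))
    d2 = sub(d1, predict_v(b1))
    return a, b1, c2, d2


def non_separable_predict(a, b, c, d):
    # Single step: every output depends only on the original components.
    b1 = sub(b, predict_h(a))
    c1 = sub(c, predict_v(a))
    # d - Ph(c) - Pv(b) + Pv(Ph(a)), obtained by expanding the two separable steps.
    d1 = add(sub(sub(d, predict_h(c)), predict_v(b)), predict_v(predict_h(a)))
    return a, b1, c1, d1


def max_abs_diff(p, q):
    """Largest absolute difference between two 4-tuples of 2-D lists."""
    return max(abs(u - v)
               for x, y in zip(p, q)
               for rx, ry in zip(x, y)
               for u, v in zip(rx, ry))


# Small deterministic test image (hypothetical data, 8x8).
img = [[(7 * i + 3 * j) % 11 for j in range(8)] for i in range(8)]
sep = separable_predict(*split(img))
non = non_separable_predict(*split(img))
```

In this sketch the merged step needs no intermediate result from a previous pass, which on a GPU would plausibly mean one synchronization point instead of two. The price is extra arithmetic in the `d` component; the sketch does not attempt the operation-count optimization mentioned in the abstract.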