DC, Jan 20, 2016
Architecture-Aware Optimization of an HEVC decoder on Asymmetric Multicore Processors
Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí
Low-power asymmetric multicore processors (AMPs) attract considerable attention due to their appealing performance-power ratio for energy-constrained environments. However, these processors pose a significant programming challenge because they integrate cores with different performance capabilities, calling for an asymmetry-aware scheduling solution that carefully distributes the workload. The recent HEVC standard, which offers several high-level parallelization strategies, is an important application that can benefit from an implementation tailored for the low-power AMPs present in many current mobile or hand-held devices. In this scenario, we present an architecture-aware implementation of an HEVC decoder that embeds a criticality-aware scheduling strategy tuned for a Samsung Exynos 5422 system-on-chip furnished with an ARM big.LITTLE AMP. The performance and energy efficiency of our solution are further enhanced by exploiting the NEON vector engine available in the ARM big.LITTLE architecture. Experimental results show real-time 1080p HEVC decoding at 24 frames/sec and a reduction in energy consumption of over 20%.
PF, Jun 30, 2015
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Sandra Catalán, Francisco D. Igual, Rafael Mayo et al.
Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
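The idea of an asymmetric static schedule can be illustrated with a minimal sketch. This is not the authors' BLIS implementation: the function names, the `big_speedup` parameter, and the value 3.0 used below are hypothetical, chosen only to show how a fixed amount of work might be split between a big and a LITTLE cluster in proportion to their assumed aggregate throughput. The Exynos 5422 has four Cortex-A15 and four Cortex-A7 cores; the relative per-core speed is an assumption, not a measured figure.

```python
def asymmetric_static_split(n_blocks, n_big, n_little, big_speedup):
    """Split n_blocks work units between the big and LITTLE clusters.

    big_speedup is an ASSUMED ratio: how many times faster one big core
    is than one LITTLE core on this workload (hypothetical parameter).
    """
    big_capacity = n_big * big_speedup      # aggregate throughput of big cluster
    little_capacity = n_little * 1.0        # LITTLE core taken as the unit
    share = big_capacity / (big_capacity + little_capacity)
    big_blocks = min(max(round(n_blocks * share), 0), n_blocks)
    return big_blocks, n_blocks - big_blocks


def per_core_ranges(start, count, n_cores):
    """Divide [start, start + count) into contiguous ranges, one per core."""
    base, extra = divmod(count, n_cores)
    ranges, lo = [], start
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)  # spread the remainder
        ranges.append((lo, lo + size))
        lo += size
    return ranges


if __name__ == "__main__":
    # 4 big + 4 LITTLE cores; a speedup of 3.0 is assumed for illustration only.
    big, little = asymmetric_static_split(100, n_big=4, n_little=4, big_speedup=3.0)
    print("big cluster:", per_core_ranges(0, big, 4))
    print("LITTLE cluster:", per_core_ranges(big, little, 4))
```

A static split like this fixes the partition before execution, so it depends on how well the assumed ratio matches the real behavior of the cores. A dynamic strategy would instead let each core claim the next block as it becomes idle.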