CVJul 1, 2020
Robustifying the Deployment of tinyML Models for Autonomous mini-vehiclesMiguel de Prado, Manuele Rusci, Romain Donze et al.
Standard-size autonomous navigation vehicles have rapidly improved thanks to the breakthroughs of deep learning. However, scaling autonomous driving to low-power systems deployed on dynamic environments poses several challenges that prevent their adoption. To address them, we propose a closed-loop learning flow for autonomous driving mini-vehicles that includes the target environment in-the-loop. We leverage a family of compact and high-throughput tinyCNNs to control the mini-vehicle, which learn in the target environment by imitating a computer vision algorithm, i.e., the expert. Thus, the tinyCNNs, having only access to an on-board fast-rate linear camera, gain robustness to lighting conditions and improve over time. Further, we leverage GAP8, a parallel ultra-low-power RISC-V SoC, to meet the inference requirements. When running the family of CNNs, our GAP8's solution outperforms any other implementation on the STM32L4 and NXP k64f (Cortex-M4), reducing the latency by over 13x and the energy consummation by 92%.
LGJun 9, 2020
Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUsMiguel de Prado, Andrew Mundy, Rabia Saeed et al.
The spread of deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNN). Works have mainly focused on: i) efficient DNN architectures, ii) network optimisation techniques such as pruning and quantisation, iii) optimised algorithms to speed up the execution of the most computational intensive layers and, iv) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on cross-level optimisation as the space of approaches becomes too large to test and obtain a globally optimised solution. Thus, leading to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyse the methods to improve the deployment of DNNs across the different levels of software optimisation. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs. The framework relies on a Reinforcement Learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimised solution that speeds up the performance and reduces the memory on embedded CPU platforms. Thus, we present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory with negligible loss in accuracy with respect to the BLAS floating-point implementation.
LGJan 15, 2019
Bonseyes AI Pipeline -- bringing AI to you. End-to-end integration of data, algorithms and deployment toolsMiguel de Prado, Jing Su, Rabia Saeed et al.
Next generation of embedded Information and Communication Technology (ICT) systems are collaborative systems able to perform autonomous tasks. The remarkable expansion of the embedded ICT market, together with the rise and breakthroughs of Artificial Intelligence (AI), have put the focus on the Edge as it stands as one of the keys for the next technological revolution: the seamless integration of AI in our daily life. However, training and deployment of custom AI solutions on embedded devices require a fine-grained integration of data, algorithms, and tools to achieve high accuracy. Such integration requires a high level of expertise that becomes a real bottleneck for small and medium enterprises wanting to deploy AI solutions on the Edge which, ultimately, slows down the adoption of AI on daily-life applications. In this work, we present a modular AI pipeline as an integrating framework to bring data, algorithms, and deployment tools together. By removing the integration barriers and lowering the required expertise, we can interconnect the different stages of tools and provide a modular end-to-end development of AI products for embedded devices. Our AI pipeline consists of four modular main steps: i) data ingestion, ii) model training, iii) deployment optimization and, iv) the IoT hub integration. To show the effectiveness of our pipeline, we provide examples of different AI applications during each of the steps. Besides, we integrate our deployment framework, LPDNN, into the AI pipeline and present its lightweight architecture and deployment capabilities for embedded devices. Finally, we demonstrate the results of the AI pipeline by showing the deployment of several AI applications such as keyword spotting, image classification and object detection on a set of well-known embedded platforms, where LPDNN consistently outperforms all other popular deployment frameworks.
CVNov 18, 2018
Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded SystemsMiguel de Prado, Nuria Pazos, Luca Benini
Deep Learning is increasingly being adopted by industry for computer vision applications running on embedded devices. While Convolutional Neural Networks' accuracy has achieved a mature and remarkable state, inference latency and throughput are a major concern especially when targeting low-cost and low-power embedded platforms. CNNs' inference latency may become a bottleneck for Deep Learning adoption by industry, as it is a crucial specification for many real-time processes. Furthermore, deployment of CNNs across heterogeneous platforms presents major compatibility issues due to vendor-specific technology and acceleration libraries. In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores through the design space and empirically finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices. We show that, an optimized combination can achieve 45x speedup in inference latency on CPU compared to a dependency-free baseline and 2x on average on GPGPU compared to the best vendor library. Further, we demonstrate that, the quality of results and time "to-solution" is much better than with Random Search and achieves up to 15x better results for a short-time search.
NENov 14, 2018
QUENN: QUantization Engine for low-power Neural NetworksMiguel de Prado, Maurizio Denna, Luca Benini et al.
Deep Learning is moving to edge devices, ushering in a new age of distributed Artificial Intelligence (AI). The high demand of computational resources required by deep neural networks may be alleviated by approximate computing techniques, and most notably reduced-precision arithmetic with coarsely quantized numerical representations. In this context, Bonseyes comes in as an initiative to enable stakeholders to bring AI to low-power and autonomous environments such as: Automotive, Medical Healthcare and Consumer Electronics. To achieve this, we introduce LPDNN, a framework for optimized deployment of Deep Neural Networks on heterogeneous embedded devices. In this work, we detail the quantization engine that is integrated in LPDNN. The engine depends on a fine-grained workflow which enables a Neural Network Design Exploration and a sensitivity analysis of each layer for quantization. We demonstrate the engine with a case study on Alexnet and VGG16 for three different techniques for direct quantization: standard fixed-point, dynamic fixed-point and k-means clustering, and demonstrate the potential of the latter. We argue that using a Gaussian quantizer with k-means clustering can achieve better performance than linear quantizers. Without retraining, we achieve over 55.64\% saving for weights' storage and 69.17\% for run-time memory accesses with less than 1\% drop in top5 accuracy in Imagenet.