LGApr 8, 2024Code
David and Goliath: An Empirical Evaluation of Attacks and Defenses for QNNs at the Deep EdgeMiguel Costa, Sandro Pinto
ML is shifting from the cloud to the edge. Edge computing reduces the surface exposing private data and enables reliable throughput guarantees in real-time applications. Of the panoply of devices deployed at the edge, resource-constrained MCUs, e.g., Arm Cortex-M, are more prevalent, orders of magnitude cheaper, and less power-hungry than application processors or GPUs. Thus, enabling intelligence at the deep edge is the zeitgeist, with researchers focusing on unveiling novel approaches to deploy ANNs on these constrained devices. Quantization is a well-established technique that has proved effective in enabling the deployment of neural networks on MCUs; however, it is still an open question to understand the robustness of QNNs in the face of adversarial examples. To fill this gap, we empirically evaluate the effectiveness of attacks and defenses from (full-precision) ANNs on (constrained) QNNs. Our evaluation includes three QNNs targeting TinyML applications, ten attacks, and six defenses. With this study, we draw a set of interesting findings. First, quantization increases the point distance to the decision boundary and leads the gradient estimated by some attacks to explode or vanish. Second, quantization can act as a noise attenuator or amplifier, depending on the noise magnitude, and causes gradient misalignment. Regarding adversarial defenses, we conclude that input pre-processing defenses show impressive results on small perturbations; however, they fall short as the perturbation increases. At the same time, train-based defenses increase the average point distance to the decision boundary, which holds after quantization. However, we argue that train-based defenses still need to smooth the quantization-shift and gradient misalignment phenomenons to counteract adversarial example transferability to QNNs. All artifacts are open-sourced to enable independent validation of results.
CRJul 8, 2021Code
Towards a Trusted Execution Environment via Reconfigurable FPGASérgio Pereira, David Cerdeira, Cristiano Rodrigues et al.
Trusted Execution Environments (TEEs) are used to protect sensitive data and run secure execution for security-critical applications, by providing an environment isolated from the rest of the system. However, over the last few years, TEEs have been proven weak, as either TEEs built upon security-oriented hardware extensions (e.g., Arm TrustZone) or resorting to dedicated secure elements were exploited multiple times. In this project, we introduce Trusted Execution Environments On-Demand (TEEOD), a novel TEE design that leverages the programmable logic (PL) in the heterogeneous system on chips (SoC) as the secure execution environment. Unlike other TEE designs, TEEOD can provide high-bandwidth connections and physical on-chip isolation. We implemented a proof-of-concept (PoC) implementation targeting an Ultra96-V2 platform. The conducted evaluation demonstrated TEEOD can host up to 6 simultaneous enclaves with a resource usage per enclave of 7.0%, 3.8%, and 15.3% of the total LUTs, FFs, and BRAMS, respectively. To demonstrate the practicability of TEEOD in real-world applications, we successfully run a legacy open-source Bitcoin wallet.
CRFeb 6, 2021Code
uTango: an open-source TEE for IoT devicesDaniel Oliveira, Tiago Gomes, Sandro Pinto
Security is one of the main challenges of the Internet of Things (IoT). IoT devices are mainly powered by low-cost microcontrollers (MCUs) that typically lack basic hardware security mechanisms to separate security-critical applications from less critical components. Recently, Arm has started to release Cortex-M MCUs enhanced with TrustZone technology (i.e., TrustZone-M), a system-wide security solution aiming at providing robust protection for IoT devices. Trusted Execution Environments (TEEs) relying on TrustZone hardware have been perceived as safe havens for securing mobile devices. However, for the past few years, considerable effort has gone into unveiling hundreds of vulnerabilities and proposing a collection of relevant defense techniques to address several issues. While new TEE solutions built on TrustZone-M start flourishing, the lessons gathered from the research community appear to be falling short, as these new systems are trapping into the same pitfalls of the past. In this paper, we present uTango, the first multi-world TEE for modern IoT devices. uTango proposes a novel architecture aiming at tackling the major architectural deficiencies currently affecting TrustZone(-M)-assisted TEEs. In particular, we leverage the very same TrustZone hardware primitives used by dual-world implementations to create multiple and equally secure execution environments within the normal world. We demonstrate the benefits of uTango by conducting an extensive evaluation on a real TrustZone-M hardware platform, i.e., Arm Musca-B1. uTango will be open-sourced and freely available on GitHub in hopes of engaging academia and industry on securing the foreseeable trillion IoT devices.
ARJan 7, 2024
A Heterogeneous RISC-V based SoC for Secure Nano-UAV NavigationLuca Valente, Alessandro Nadalini, Asif Veeran et al.
The rapid advancement of energy-efficient parallel ultra-low-power (ULP) ucontrollers units (MCUs) is enabling the development of autonomous nano-sized unmanned aerial vehicles (nano-UAVs). These sub-10cm drones represent the next generation of unobtrusive robotic helpers and ubiquitous smart sensors. However, nano-UAVs face significant power and payload constraints while requiring advanced computing capabilities akin to standard drones, including real-time Machine Learning (ML) performance and the safe co-existence of general-purpose and real-time OSs. Although some advanced parallel ULP MCUs offer the necessary ML computing capabilities within the prescribed power limits, they rely on small main memories (<1MB) and ucontroller-class CPUs with no virtualization or security features, and hence only support simple bare-metal runtimes. In this work, we present Shaheen, a 9mm2 200mW SoC implemented in 22nm FDX technology. Differently from state-of-the-art MCUs, Shaheen integrates a Linux-capable RV64 core, compliant with the v1.0 ratified Hypervisor extension and equipped with timing channel protection, along with a low-cost and low-power memory controller exposing up to 512MB of off-chip low-cost low-power HyperRAM directly to the CPU. At the same time, it integrates a fully programmable energy- and area-efficient multi-core cluster of RV32 cores optimized for general-purpose DSP as well as reduced- and mixed-precision ML. To the best of the authors' knowledge, it is the first silicon prototype of a ULP SoC coupling the RV64 and RV32 cores in a heterogeneous host+accelerator architecture fully based on the RISC-V ISA. We demonstrate the capabilities of the proposed SoC on a wide range of benchmarks relevant to nano-UAV applications. The cluster can deliver up to 90GOp/s and up to 1.8TOp/s/W on 2-bit integer kernels and up to 7.9GFLOp/s and up to 150GFLOp/s/W on 16-bit FP kernels.
LGSep 10, 2025
Decentor-V: Lightweight ML Training on Low-Power RISC-V Edge DevicesMarcelo Ribeiro, Diogo Costa, Gonçalo Moreira et al.
Modern IoT devices increasingly rely on machine learning solutions to process data locally. However, the lack of graphics processing units (GPUs) or dedicated accelerators on most platforms makes on-device training largely infeasible, often requiring cloud-based services to perform this task. This procedure often raises privacy-related concerns, and creates dependency on reliable and always-on connectivity. Federated Learning (FL) is a new trend that addresses these issues by enabling decentralized and collaborative training directly on devices, but it requires highly efficient optimization algorithms. L-SGD, a lightweight variant of stochastic gradient descent, has enabled neural network training on Arm Cortex-M Microcontroller Units (MCUs). This work extends L-SGD to RISC-V-based MCUs, an open and emerging architecture that still lacks robust support for on-device training. L-SGD was evaluated on both Arm and RISC-V platforms using 32-bit floating-point arithmetic, highlighting the performance impact of the absence of Floating-Point Units (FPUs) in RISC-V MCUs. To mitigate these limitations, we introduce an 8-bit quantized version of L-SGD for RISC-V, which achieves nearly 4x reduction in memory usage and a 2.2x speedup in training time, with negligible accuracy degradation.
LGOct 6, 2021
Shifting Capsule Networks from the Cloud to the Deep EdgeMiguel Costa, Diogo Costa, Tiago Gomes et al.
Capsule networks (CapsNets) are an emerging trend in image processing. In contrast to a convolutional neural network, CapsNets are not vulnerable to object deformation, as the relative spatial information of the objects is preserved across the network. However, their complexity is mainly related to the capsule structure and the dynamic routing mechanism, which makes it almost unreasonable to deploy a CapsNet, in its original form, in a resource-constrained device powered by a small microcontroller (MCU). In an era where intelligence is rapidly shifting from the cloud to the edge, this high complexity imposes serious challenges to the adoption of CapsNets at the very edge. To tackle this issue, we present an API for the execution of quantized CapsNets in Arm Cortex-M and RISC-V MCUs. Our software kernels extend the Arm CMSIS-NN and RISC-V PULP-NN to support capsule operations with 8-bit integers as operands. Along with it, we propose a framework to perform post-training quantization of a CapsNet. Results show a reduction in memory footprint of almost 75%, with accuracy loss ranging from 0.07% to 0.18%. In terms of throughput, our Arm Cortex-M API enables the execution of primary capsule and capsule layers with medium-sized kernels in just 119.94 and 90.60 milliseconds (ms), respectively (STM32H755ZIT6U, Cortex-M7 @ 480 MHz). For the GAP-8 SoC (RISC-V RV32IMCXpulp @ 170 MHz), the latency drops to 7.02 and 38.03 ms, respectively.