Sri Parameswaran

ARSep 9, 2022Code

ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference

Jing Gong, Hassaan Saadat, Hasindu Gamaarachchi et al.

Edge training of Deep Neural Networks (DNNs) is a desirable goal for continuous learning; however, it is hindered by the enormous computational power required by training. Hardware approximate multipliers have shown their effectiveness for gaining resource-efficiency in DNN inference accelerators; however, training with approximate multipliers is largely unexplored. To build resource efficient accelerators with approximate multipliers supporting DNN training, a thorough evaluation of training convergence and accuracy for different DNN architectures and different approximate multipliers is needed. This paper presents ApproxTrain, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires only a high-level description of a DNN architecture along with C/C++ functional models of the approximate multiplier. We improve the speed of the simulation at the multiplier level by using a novel LUT-based approximate floating-point (FP) multiplier simulator on GPU (AMSim). ApproxTrain leverages CUDA and efficiently integrates AMSim into the TensorFlow library, in order to overcome the absence of native hardware approximate multiplier in commercial GPUs. We use ApproxTrain to evaluate the convergence and accuracy of DNN training with approximate multipliers for small and large datasets (including ImageNet) using LeNets and ResNets architectures. The evaluations demonstrate similar convergence behavior and negligible change in test accuracy compared to FP32 and bfloat16 multipliers. Compared to CPU-based approximate multiplier simulations in training and inference, the GPU-accelerated ApproxTrain is more than 2500x faster. Based on highly optimized closed-source cuDNN/cuBLAS libraries with native hardware multipliers, the original TensorFlow is only 8x faster than ApproxTrain.

CRNov 20, 2020

SIMF: Single-Instruction Multiple-Flush Mechanism for Processor Temporal Isolation

Tuo Li, Bradley Hopkins, Sri Parameswaran

Microarchitectural timing attacks are a type of information leakage attack, which exploit the time-shared microarchitectural components, such as caches, translation look-aside buffers (TLBs), branch prediction unit (BPU), and speculative execution, in modern processors to leak critical information from a victim process or thread. To mitigate such attacks, the mechanism for flushing the on-core state is extensively used by operating-system-level solutions, since on-core state is too expensive to partition. In these systems, the flushing operations are implemented in software (using cache maintenance instructions), which severely limit the efficiency of timing attack protection. To bridge this gap, we propose specialized hardware support, a single-instruction multiple-flush (SIMF) mechanism to flush the core-level state, which consists of L1 caches, BPU, TLBs, and register file. We demonstrate SIMF by implementing it as an ISA extension, i.e., flushx instruction, in scalar in-order RISC-V processor. The resultant processor is prototyped on Xilinx ZCU102 FPGA and validated with state-of-art seL4 microkernel, Linux kernel in multi-core scenarios, and a cache timing attack. Our evaluation shows that SIMF significantly alleviates the overhead of flushing by more than a factor of two in execution time and reduces dynamic instruction count by orders-of-magnitude.

Sri Parameswaran

2 Papers