Akul Swami

SY
3 papers
1 citation
Novelty 28%
AI Score 41

3 Papers

71.7 SY May 30 Code
Wire-Level Interrupt-to-Decision Latency of On-Sensor MLC versus Host Inference on the NVIDIA Jetson Orin Nano: A Pre-Registered Measurement Study

Akul Swami, Dnyaneshwar Sonawane

The Machine Learning Core (MLC) embedded in the STMicroelectronics LSM6DSOX IMU is widely cited as a low-latency alternative to host-side inference, yet wire-level decision-delivery latency is rarely measured. Using a Saleae Logic Pro 8 logic analyzer on an NVIDIA Jetson Orin Nano, we measured interrupt-to-decision latency (sensor INT1 edge to host decision GPIO) for three pipelines (a host-side decision-tree classifier, the standard MLC bank-switch read protocol, and an MLC binary-fast variant) under idle, I2C bus contention, and CPU stress. The protocol was pre-registered with 12 externally timestamped Zenodo amendments before confirmatory data collection (4,770 of 4,860 trials included, 98.15%, across nine cells). The host pipeline exhibits lower median latency than the MLC pipeline under all conditions: 321.7 vs 681.5 μs at idle (2.1x faster) and 574.5 vs 1,325.4 μs under I2C contention (2.3x faster). The three-transaction I2C read protocol, not the silicon's classification, is the dominant latency contributor. We additionally characterize a reproducible 706.5 ms MLC decision cadence that bounds full stimulus-to-decision latency. Code, data, and pre-registration: github.com/akulswami/sensor-mlc-latency.
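The speedup ratios and inclusion rate quoted in this abstract can be re-derived from its own figures. The following is a minimal Python sketch that uses only the numbers stated above; the variable and function names are illustrative and do not come from the authors' repository.

```python
# Figures copied from the abstract; names are illustrative assumptions,
# not identifiers from the authors' code.
host_idle_us, mlc_idle_us = 321.7, 681.5    # median latency at idle
host_i2c_us, mlc_i2c_us = 574.5, 1325.4     # median latency under I2C contention
included, total = 4770, 4860                # trials included / trials collected


def ratio(slow_us: float, fast_us: float) -> float:
    """Return how many times longer the slower median is than the faster one."""
    return slow_us / fast_us


idle_ratio = ratio(mlc_idle_us, host_idle_us)
i2c_ratio = ratio(mlc_i2c_us, host_i2c_us)
inclusion_pct = 100.0 * included / total

print(f"idle: {idle_ratio:.1f}x, I2C contention: {i2c_ratio:.1f}x, "
      f"included: {inclusion_pct:.2f}%")
```

Rounded to the precision used in the abstract, these values agree with the reported 2.1x, 2.3x, and 98.15%.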

43.7 SY May 17
Architecture Dependent Temporal Observability Under Deployment Interference in Edge Inference Systems

Akul Swami, Nikhil Chougule

Edge inference systems are typically evaluated with software-reported latency collected under controlled conditions. We argue, and demonstrate empirically, that deployment interference can corrupt not only the inference timing being measured but the timing observability infrastructure that measures it, and that the two failures can occur independently. We pair software-reported timing with externally observable GPIO intervals captured by a Saleae Logic Pro 8 logic analyzer on an NVIDIA Jetson Orin Nano, running MobileNetV2 under two inference architectures (TensorRT FP16 GPU and ONNX Runtime CPU) across baseline, light memory pressure, and storage writeback stress. Across 35 paired capture runs (3,500 samples) plus 3 storage-stress runs where external pairing failed (300 software-only samples), we observe three findings the software-only view does not surface. (1) The two architectures differ not only in mean latency but in distributional structure: TensorRT baseline clusters tightly near 1.23 ms (run-mean SD 15 μs) while ORT CPU baseline is multimodal with run-mean SD 31.8 ms. (2) Light memory pressure inflates TensorRT P99 from 1.28 ms to 1.61 ms, while one of five ORT memory-stress runs collapses into a deterministic 198 ms regime rather than uniformly inflating variance. (3) All three TensorRT storage-stress runs produce complete software timing logs (100/100 iterations) alongside externally observable timing failures of three different kinds (full post-marker collapse, ~40% transition loss, and complete acquisition failure), while the runtime reports normal completion in every case. We claim, narrowly, that timing observability is itself an interference-sensitive resource, and that summary statistics from a single timing source can hide failure modes an independent external observer makes visible.

0.8 SY May 4
Per-Platform GPIO Overhead in Hardware-Validated Edge ML Inference Timing

Akul Swami, Nikhil Chougule

Edge machine learning (ML) deployments increasingly rely on per-inference timing measured by software clocks such as Python's perf_counter, but these measurements are not always validated against external hardware references on embedded Linux, and edge ML benchmarking methodologies typically do not isolate platform-dependent instrumentation overhead. This paper reports a preliminary characterization of GPIO call overhead in hardware-validated edge ML inference timing on two embedded platforms running a one-dimensional convolutional neural network (1-D CNN) arrhythmia classifier on electrocardiogram (ECG) data from the MIT-BIH Arrhythmia Database, with five classes per the Association for the Advancement of Medical Instrumentation (AAMI) EC57 standard. Across n = 10 trials on each platform at a controlled steady-state baseline, the per-platform constant on the Jetson Orin Nano (TensorRT FP16, Jetson.GPIO) is approximately -20 μs, and on the Raspberry Pi 4 (ONNX Runtime CPU, pigpio) approximately -86 μs, yielding a cross-platform asymmetry of approximately 66 μs that is large relative to commonly used uniform validation tolerances. The Jetson constant is well approximated by direct GPIO call duration (the direct profile accounts for ~88% of the platform constant), while the Pi direct profile over-predicts the platform constant by ~19%, motivating empirical per-platform calibration in the deployed measurement context. The Pi constant is not a single sharp value but exhibits a cross-day range of approximately 6 μs across the three sessions sampled, while the Jetson constant reproduces to within approximately 0.14 μs. These preliminary results suggest that cross-platform edge ML timing studies may benefit from platform-aware and potentially session-aware validation gates.
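To illustrate why a uniform tolerance can mislead across platforms, the sketch below shows one way a platform-aware validation gate might look. It rests on several assumptions that the abstract does not confirm: that the per-platform constant is the additive offset (hardware interval minus software interval), that a 10 μs tolerance is in use, and that the function and dictionary names are suitable. All of these are hypothetical. Only the two constants and the resulting asymmetry come from the abstract.

```python
# Approximate per-platform constants from the abstract, in microseconds.
# ASSUMPTION: the constant is defined as (hardware interval - software interval);
# the abstract does not state the sign convention.
PLATFORM_OFFSET_US = {
    "jetson_orin_nano": -20.0,
    "raspberry_pi_4": -86.0,
}

# Cross-platform asymmetry implied by the two constants.
asymmetry_us = abs(PLATFORM_OFFSET_US["jetson_orin_nano"]
                   - PLATFORM_OFFSET_US["raspberry_pi_4"])


def passes_gate(platform: str, sw_us: float, hw_us: float,
                tol_us: float = 10.0, calibrated: bool = True) -> bool:
    """Check whether software and hardware timings agree within tol_us.

    With calibrated=True the platform constant is subtracted first;
    with calibrated=False a uniform (platform-blind) gate is applied.
    tol_us = 10.0 is a hypothetical tolerance chosen for illustration.
    """
    offset = PLATFORM_OFFSET_US[platform] if calibrated else 0.0
    residual = (hw_us - sw_us) - offset
    return abs(residual) <= tol_us


# Hypothetical Pi sample whose hardware interval is 86 us shorter than
# the software-reported one, i.e. exactly the platform constant.
print(asymmetry_us)
print(passes_gate("raspberry_pi_4", 1000.0, 914.0))
print(passes_gate("raspberry_pi_4", 1000.0, 914.0, calibrated=False))
```

Under these assumptions the same sample passes the calibrated gate and fails the uniform one, which is the kind of discrepancy the abstract's 66 μs asymmetry points to. A session-aware variant would need to refresh the Pi constant per session, given its reported cross-day range of about 6 μs.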