Weitao Pan

4 papers

21 citations

Novelty 55%

AI Score 43

Ranked #80,219 of 201,326 authors (top 40%); #305 in AR (top 40%)

4 Papers

AR · Mar 29, 2022

Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform

Mingjun Li, Jianlei Yang, Yingjie Qi et al.

Event cameras are bio-inspired vision sensors that asynchronously represent pixel-level brightness changes as event streams. Event-based monocular multi-view stereo (EMVS) is a technique that exploits the event streams to estimate semi-dense 3D structure with a known trajectory. It is a critical task for event-based monocular SLAM. However, the intensive computation workloads it requires make real-time deployment on embedded platforms challenging. In this paper, Eventor is proposed as a fast and efficient EMVS accelerator that realizes the most critical and time-consuming stages, including event back-projection and volumetric ray-counting, on FPGA. Highly parallel and fully pipelined processing elements are specially designed on the FPGA and integrated with the embedded ARM as a heterogeneous system to improve the throughput and reduce the memory footprint. Meanwhile, the EMVS algorithm is reformulated in a more hardware-friendly manner through rescheduling, approximate computing, and hybrid data quantization. Evaluation results on the DAVIS dataset show that Eventor achieves up to $24\times$ improvement in energy efficiency compared with an Intel i5 CPU platform.
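
The two stages named in the abstract, event back-projection and volumetric ray-counting, can be illustrated with a small software sketch. This is not Eventor's hardware design or its reformulated algorithm: it assumes a pinhole camera, a pure translation between the event camera and the reference view (no rotation), and a handful of fronto-parallel depth planes. The function name `vote_ray` and all parameters are hypothetical.

```python
def vote_ray(counts, depths, event_px, cam_pos, f, cx, cy):
    """Back-project one event as a ray and count it in every depth plane it crosses.

    counts  : counts[i][y][x] is the vote count for depth plane i at reference pixel (x, y)
    depths  : plane depths along the reference camera's Z axis
    cam_pos : (assumed) event-camera position in the reference frame, no rotation
    """
    u, v = event_px
    # Ray direction through the event pixel (pinhole model, unit step in Z).
    dx, dy = (u - cx) / f, (v - cy) / f
    px, py, pz = cam_pos
    for i, z in enumerate(depths):
        s = z - pz                      # distance along the ray to reach plane Z = z
        if s <= 0:
            continue                    # plane lies behind the event camera
        X, Y = px + s * dx, py + s * dy
        # Project the intersection point into the reference view.
        x = round(f * X / z + cx)
        y = round(f * Y / z + cy)
        if 0 <= y < len(counts[i]) and 0 <= x < len(counts[i][0]):
            counts[i][y][x] += 1


# Two views of a point at (0, 0, 2): the rays agree only at the true depth.
depths = [1.0, 2.0, 4.0]
counts = [[[0] * 17 for _ in range(17)] for _ in depths]
vote_ray(counts, depths, (8, 8), (0.0, 0.0, 0.0), f=8.0, cx=8.0, cy=8.0)
vote_ray(counts, depths, (6, 8), (0.5, 0.0, 0.0), f=8.0, cx=8.0, cy=8.0)
```

In this toy setup the voxel with the highest count marks where rays from different viewpoints intersect, which is the cue a ray-counting approach uses to recover structure.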

AR · Oct 31, 2023

DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory

Cenlin Duan, Jianlei Yang, Xiaolin He et al.

Processing-in-memory (PIM), as a novel computing paradigm, provides significant performance benefits by effectively reducing data movement. SRAM-based PIM has been demonstrated as one of the most promising candidates due to its endurance and compatibility. However, the integration density of SRAM-based PIM is much lower than that of non-volatile memory-based alternatives, due to its inherent 6T structure for storing a single bit. Within comparable area constraints, SRAM-based PIM exhibits notably lower capacity. Thus, aiming to unleash its capacity potential, we propose DDC-PIM, an efficient algorithm/architecture co-design methodology that effectively doubles the equivalent data capacity. At the algorithmic level, we propose a filter-wise complementary correlation (FCC) algorithm to obtain a bitwise complementary pair. At the architecture level, we exploit the intrinsic cross-coupled structure of 6T SRAM to store the bitwise complementary pair in their complementary states ($Q/\overline{Q}$), thereby maximizing the data capacity of each SRAM cell. The dual-broadcast input structure and reconfigurable unit support both depthwise and pointwise convolution, adhering to the requirements of various neural networks. Evaluation results show that DDC-PIM yields about $2.84\times$ speedup on MobileNetV2 and $2.69\times$ on EfficientNet-B0 with negligible accuracy loss compared with a PIM baseline implementation. Compared with state-of-the-art SRAM-based PIM macros, DDC-PIM achieves up to $8.41\times$ and $2.75\times$ improvement in weight density and area efficiency, respectively.
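
The idea of holding a bitwise complementary pair in one cell's $Q/\overline{Q}$ states can be illustrated with plain integer arithmetic. The sketch below is not the FCC algorithm or the DDC-PIM datapath; it assumes unsigned 8-bit weights and only shows that, once two filters are bitwise complements, storing one of them determines the other. The names `complement` and `dot` are hypothetical.

```python
BITS = 8                      # assumed weight width
MASK = (1 << BITS) - 1        # 0xFF for 8-bit weights


def complement(w):
    """Bitwise complement within BITS bits: what the Q-bar side would hold if Q holds w."""
    return ~w & MASK


def dot(xs, ws):
    """Plain dot product of activations and weights."""
    return sum(x * w for x, w in zip(xs, ws))


stored = [0b10110010, 0b00001111, 0b11110000]   # filter kept in the Q states
paired = [complement(w) for w in stored]        # filter implied by the Q-bar states
acts = [3, 1, 4]

# For unsigned weights, complement(w) == MASK - w, so the paired filter's
# dot product follows from the stored filter's result and the activation sum.
assert dot(acts, paired) == MASK * sum(acts) - dot(acts, stored)
```

The identity in the final assertion is ordinary arithmetic on unsigned values; whether the actual design computes the second result this way is not stated in the abstract.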

96.6 · NI · Apr 13

Programmable Packet Scheduling with Dynamic Reordering at Line Rate

Zekun Wang, Binghao Yue, Yichen Deng et al.

High-speed switch packet scheduling demands both line-rate performance and programmability. Existing programmable hardware scheduling models, such as PIFO and PIEO, can express a broad range of scheduling algorithms; however, their semantics are restricted to packet-level ordering and cannot dynamically reorder buffered packets, which limits the support for dynamic-ordering algorithms such as pFabric. To overcome this limitation, we propose UIFO (Update-In-First-Out), a new programmable scheduling model that introduces a two-level abstraction over classes and packets. UIFO enables dynamic updates to the scheduling order at the class level while preserving in-order packet scheduling within each class, thereby supporting dynamic reordering of already-buffered packets. Furthermore, UIFO remains fully compatible with and generalizes existing PIFO and PIEO models. We implement a hardware prototype of UIFO based on priority-queue designs and evaluate it on an FPGA platform and in a 28 nm ASIC process. Overall, UIFO significantly enhances scheduling expressiveness and maintains favorable scalability while sustaining 100 Gbps line-rate throughput.
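
The two-level abstraction described in the abstract (ranks attached to classes, packets kept in order within each class) can be modeled in a few lines of software. This is a toy model for intuition only, not the UIFO hardware or its actual interface; the class name `TwoLevelScheduler` and its methods are hypothetical, and it assumes a lower rank is served first and that each enqueue sets the rank of its class.

```python
from collections import deque


class TwoLevelScheduler:
    """Toy model: ranks live on classes, packets stay FIFO inside a class."""

    def __init__(self):
        self.rank = {}      # class_id -> current rank (lower is served first)
        self.queues = {}    # class_id -> deque of buffered packets

    def enqueue(self, class_id, packet, rank):
        # Assumption: an arriving packet also sets the rank of its class.
        self.queues.setdefault(class_id, deque()).append(packet)
        self.rank[class_id] = rank

    def update(self, class_id, new_rank):
        # Re-rank a class; every packet already buffered in it moves with it.
        if class_id in self.rank:
            self.rank[class_id] = new_rank

    def dequeue(self):
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        best = min(backlogged, key=lambda c: self.rank[c])
        return self.queues[best].popleft()


sched = TwoLevelScheduler()
sched.enqueue("A", "a1", rank=5)
sched.enqueue("A", "a2", rank=5)
sched.enqueue("B", "b1", rank=9)
sched.update("B", 1)          # B's buffered packet now overtakes A's
order = [sched.dequeue() for _ in range(4)]
```

A single `update` call reorders all buffered packets of a class at once, which a scheduler that fixes each packet's rank at enqueue time cannot express. In a pFabric-style use, the rank of a flow's class would be lowered as its remaining size shrinks.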

DS · Jan 14

A Grouped Sorting Queue Supporting Dynamic Updates for Timer Management in High-Speed Network Interface Cards

Zekun Wang, Binghao Yue, Weitao Pan et al.

With the hardware offloading of network functions, network interface cards (NICs) undertake massive stateful, high-precision, and high-throughput tasks, where timers serve as a critical enabling component. However, existing timer management schemes suffer from heavy software load, low precision, lack of hardware update support, and overflow. This paper proposes two novel operations for priority queues, update and group sorting, to enable hardware timer management. To the best of our knowledge, this work presents the first hardware priority queue to support an update operation through the composition and propagation of basic operations to modify the priorities of elements within the queue. The group sorting mechanism ensures correct timing behavior post-overflow by establishing a group boundary priority to alter the sorting process and element insertion positions. Implemented with a hybrid architecture of a one-dimensional (1D) systolic array and shift registers, our design is validated through packet-level simulations for flow table timeout management. Results demonstrate that a 4K-depth, 16-bit timer queue achieves over 500 MHz (175 Mpps, 12 ns precision) in a 28 nm process and over 300 MHz (116 Mpps) on an FPGA. Critically, it reduces LUT and FF usage by 31% and 25%, respectively, compared to existing designs.
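
The overflow problem and the update operation can be illustrated with a software analogue. The sketch below is not the paper's systolic-array design or its group sorting mechanism. It swaps in a standard technique, ordering 16-bit expiry values by their modular distance from a fixed boundary value, so that timers which have wrapped past the counter's maximum still sort after those that have not. The class `WrapAwareTimerQueue`, its methods, and the fixed `boundary` are all hypothetical.

```python
import bisect

WIDTH = 16                 # timer width taken from the abstract's 16-bit example
MOD = 1 << WIDTH


class WrapAwareTimerQueue:
    """Sorted timer list whose ordering survives 16-bit counter wraparound."""

    def __init__(self, boundary=0):
        self.boundary = boundary   # assumed fixed reference point separating the two groups
        self.entries = []          # sorted list of (key, timer_id)
        self.expiry = {}           # timer_id -> raw 16-bit expiry value

    def _key(self, expiry):
        # Distance from the boundary, modulo 2**16: wrapped values sort last.
        return (expiry - self.boundary) % MOD

    def insert(self, timer_id, expiry):
        expiry %= MOD
        self.expiry[timer_id] = expiry
        bisect.insort(self.entries, (self._key(expiry), timer_id))

    def update(self, timer_id, new_expiry):
        # Software stand-in for an in-queue update: remove, then re-insert.
        old = self.expiry[timer_id]
        self.entries.remove((self._key(old), timer_id))
        self.insert(timer_id, new_expiry)

    def pop_earliest(self):
        _, timer_id = self.entries.pop(0)
        return timer_id, self.expiry.pop(timer_id)


q = WrapAwareTimerQueue(boundary=65000)
q.insert("a", 65500)
q.insert("b", 100)       # numerically small, but it expires after the counter wraps
q.insert("c", 65100)
q.update("c", 200)       # e.g. a flow entry whose timeout is refreshed
fired = [q.pop_earliest() for _ in range(3)]
```

A plain numeric sort would fire `b` and `c` first because 100 and 200 are smaller than 65500; keying on the distance from the boundary restores the intended order. The sketch keeps the boundary fixed, since moving it would require recomputing every stored key.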