IV CV Oct 21, 2022

High-Fidelity Visual Structural Inspections through Transformers and Learnable Resizers

Kareem Eltouny, Seyedomid Sajedi, Xiao Liang

arXiv:2210.12175v12.73 citations h-index: 11

Originality Incremental advance

AI Analysis

This work addresses the problem of efficient and accurate autonomous visual inspections for civil engineers, offering an incremental improvement over existing methods by managing computational trade-offs.

The paper tackles high-resolution semantic segmentation for visual inspections of civil infrastructure. It proposes a hybrid framework that balances global and local semantics to improve performance on tasks such as component type and damage state segmentation, and it reports metrics on the Quake City dataset.

Visual inspection is the predominant technique for evaluating the condition of civil infrastructure. Recent advances in unmanned aerial vehicles (UAVs) and artificial intelligence have made visual inspections faster, safer, and more reliable. Camera-equipped UAVs are becoming the new industry standard, collecting massive amounts of visual data for human inspectors. Meanwhile, there has been significant research on autonomous visual inspections using deep learning algorithms, including semantic segmentation. While UAVs can capture high-resolution images of buildings' façades, high-resolution segmentation is extremely challenging because of its high memory demands. Typically, images are uniformly downsized at the price of losing fine local details. Conversely, breaking the images into multiple smaller patches can cause a loss of global contextual information. We propose a hybrid strategy that can adapt to different inspection tasks by managing the trade-off between global and local semantics. The framework comprises a compound, high-resolution deep learning architecture equipped with an attention-based segmentation model and learnable downsampler-upsampler modules designed for optimal efficiency and information retention. The framework also applies vision transformers to a grid of image crops, aiming for high-precision learning without downsizing. An augmented inference technique is used to boost performance and reduce the possible loss of context due to grid cropping. Comprehensive experiments have been performed on synthetic environments built from 3D physics-based graphics models in the Quake City dataset. The proposed framework is evaluated using several metrics on three segmentation tasks: component type, component damage state, and global damage (crack, rebar, spalling).
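The grid-cropping idea mentioned in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`grid_boxes`, `crop_image`, `stitch`) and the choice to shift edge crops inward are assumptions made for illustration, and the learnable resizers, the transformer model, and the augmented inference step are not reproduced here. The sketch only shows how an image can be split into fixed-size crops at full resolution and reassembled afterwards.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), hypothetical layout


def grid_boxes(height: int, width: int, crop: int) -> List[Box]:
    """Tile an image with crop x crop boxes.

    Edge boxes are shifted inward so every box has the same size; they may
    therefore overlap their neighbours (an assumption of this sketch).
    """
    if crop <= 0 or crop > height or crop > width:
        raise ValueError("crop must be positive and fit inside the image")
    boxes: List[Box] = []
    for row_start in range(0, height, crop):
        top = min(row_start, height - crop)
        for col_start in range(0, width, crop):
            left = min(col_start, width - crop)
            boxes.append((top, left, top + crop, left + crop))
    return boxes


def crop_image(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Cut one box out of a 2D image stored as nested sequences."""
    top, left, bottom, right = box
    return [list(row[left:right]) for row in image[top:bottom]]


def stitch(crops: Sequence[Sequence[Sequence[int]]], boxes: Sequence[Box],
           height: int, width: int) -> List[List[int]]:
    """Paste per-crop outputs back onto a full-size canvas.

    Where boxes overlap, later crops overwrite earlier ones.
    """
    canvas = [[0] * width for _ in range(height)]
    for crop, (top, left, _bottom, right) in zip(crops, boxes):
        for offset, row in enumerate(crop):
            canvas[top + offset][left:right] = row
    return canvas


# Usage: a 5 x 7 toy "image" split into 3 x 3 crops and reassembled unchanged.
image = [[r * 7 + c for c in range(7)] for r in range(5)]
boxes = grid_boxes(5, 7, 3)
crops = [crop_image(image, b) for b in boxes]
restored = stitch(crops, boxes, 5, 7)
```

In a segmentation pipeline, each crop would be passed through a model before stitching, so every pixel is predicted at its original resolution. Each crop is processed without seeing the rest of the image, which is the loss of global context the abstract describes.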
