RO · Oct 2, 2019

ROS Rescue: Fault Tolerance System for Robot Operating System

arXiv:1910.01078v1 · 4 citations · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses reliability issues for robotic deployments in real-world applications, but it is an incremental advance because it builds on existing ROS infrastructure.

The paper tackles the problem of master failure in ROS 1.0 by designing a fault-tolerant mechanism that lets the master recover without aborting or restarting all nodes. Preliminary tests were conducted successfully on land, aerial and underwater robots and on a tele-operating computer.

This tutorial chapter discusses the problem of master failure in ROS 1.0 and its impact on real-world robotic deployments. We outline, design and demonstrate a fault-tolerant mechanism for recovering from ROS master failure. Unlike previous solutions, which rely on process-heavy primary-backup replication and external checkpointing libraries, our mechanism adds lightweight functionality to the ROS master itself so that it can recover from failure. We present a modified version of the ROS master equipped with a logging mechanism that records the meta information and network state of ROS nodes, and a recovery mechanism that restores the previous state without aborting or restarting all the nodes. We also implement an additional master monitor node that detects master failure by polling the master for its availability. Our code is implemented in Python, and preliminary tests were conducted successfully on a variety of land, aerial and underwater robots and a tele-operating computer running ROS Kinetic on Ubuntu 16.04. The code is publicly available under a Creative Commons license on GitHub at https://github.com/PushyamiKaveti/fault-tolerant-ros-master
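The logging, recovery and polling ideas in the abstract can be illustrated with a small sketch. This is not the code from the repository: the `RegistrationLog` and `MasterMonitor` classes, the JSON-lines log format, the miss threshold and the `/master_monitor` caller ID are all assumptions made for illustration. The sketch uses only the Python 3 standard library (ROS Kinetic itself targets Python 2), and its only ROS-specific call is `getUri`, which is part of the documented ROS Master XML-RPC API.

```python
import http.client
import json
import os
import socket
import xmlrpc.client


class RegistrationLog:
    """Append-only log of node registrations (hypothetical format, one JSON object per line)."""

    def __init__(self, path):
        self.path = path

    def record(self, action, kind, name, node, api_uri):
        # action: "register" or "unregister"; kind: e.g. "publisher" or "subscriber".
        entry = {"action": action, "kind": kind, "name": name,
                 "node": node, "api_uri": api_uri}
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make the entry durable before returning

    def replay(self):
        # Rebuild the registration table a restarted master would need.
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                e = json.loads(line)
                key = (e["kind"], e["name"], e["node"])
                if e["action"] == "register":
                    state[key] = e["api_uri"]
                else:
                    state.pop(key, None)
        return state


def master_is_alive(master_uri, caller_id="/master_monitor", timeout=1.0):
    """Return True if the master answers getUri() with a success code (1)."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout)
    try:
        code, _status, _uri = xmlrpc.client.ServerProxy(master_uri).getUri(caller_id)
        return code == 1
    except (OSError, xmlrpc.client.Error, http.client.HTTPException):
        return False
    finally:
        socket.setdefaulttimeout(previous)


class MasterMonitor:
    """Declare failure after `max_misses` consecutive failed probes (assumed policy)."""

    def __init__(self, probe, max_misses=3):
        self.probe = probe          # zero-argument callable returning True/False
        self.max_misses = max_misses
        self.misses = 0

    def poll_once(self):
        # Returns True when the master should be considered failed.
        if self.probe():
            self.misses = 0
            return False
        self.misses += 1
        return self.misses >= self.max_misses
```

In a deployment, the monitor could be built as `MasterMonitor(lambda: master_is_alive("http://localhost:11311"))` (11311 is the default ROS master port) and `poll_once()` called at a fixed interval. Once it returns True, the master would be restarted and `replay()` used to restore the registration table, so that running nodes need not re-register. Requiring several consecutive misses is one way to avoid reacting to a single dropped request.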

