Revisiting FunnyBirds evaluation framework for prototypical parts networks
This work highlights a methodological flaw in an existing explainability benchmark. It is relevant to researchers in interpretable machine learning, but the contribution is incremental because it focuses on correcting a visualization choice.
The study addresses the inconsistent visualization methods in the FunnyBirds benchmark for evaluating prototypical parts networks such as ProtoPNet, finding that using similarity maps instead of bounding boxes yields different metric scores and better aligns with the network's essence.
Prototypical parts networks, such as ProtoPNet, became popular due to their potential to produce more genuine explanations than post-hoc methods. However, for a long time, this potential has been strictly theoretical, and no systematic studies have existed to support it. That changed recently with the introduction of the FunnyBirds benchmark, which includes metrics for evaluating different aspects of explanations. However, this benchmark visualizes explanations as attribution maps for all explanation techniques except ProtoPNet, for which bounding boxes are used. This choice significantly influences the metric scores and calls into question the conclusions stated in the FunnyBirds publication. In this study, we comprehensively compare the metric scores obtained for two types of ProtoPNet visualizations: bounding boxes and similarity maps. Our analysis indicates that employing similarity maps aligns better with the essence of ProtoPNet, as evidenced by the different metric scores obtained from FunnyBirds. Therefore, we advocate using similarity maps as the visualization technique for prototypical parts networks in explainability evaluation benchmarks.
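To make the difference between the two visualizations concrete, the following minimal sketch derives both from the same prototype similarity map. It is an illustration under stated assumptions, not the FunnyBirds or ProtoPNet code: the function names `bounding_box_mask` and `normalized_similarity_map`, the 95th-percentile threshold, and the toy 6x6 map are all hypothetical choices made for this example.

```python
import numpy as np


def bounding_box_mask(similarity_map: np.ndarray, percentile: float = 95.0) -> np.ndarray:
    """Binary mask of the smallest rectangle enclosing the most activated pixels.

    The percentile threshold is an assumption made for this sketch.
    """
    threshold = np.percentile(similarity_map, percentile)
    high = similarity_map >= threshold
    rows = np.where(high.any(axis=1))[0]  # rows containing a high activation
    cols = np.where(high.any(axis=0))[0]  # columns containing a high activation
    mask = np.zeros(similarity_map.shape, dtype=float)
    mask[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1] = 1.0
    return mask


def normalized_similarity_map(similarity_map: np.ndarray) -> np.ndarray:
    """Min-max normalize the similarity map so it can be read as an attribution map."""
    lo, hi = similarity_map.min(), similarity_map.max()
    if hi == lo:
        return np.zeros(similarity_map.shape, dtype=float)
    return (similarity_map - lo) / (hi - lo)


# Toy similarity map (hypothetical values): one strong peak and two weaker neighbors.
sim = np.zeros((6, 6))
sim[2, 3] = 1.0
sim[2, 2] = 0.6
sim[3, 3] = 0.5

box = bounding_box_mask(sim)          # hard, rectangular explanation
attr = normalized_similarity_map(sim)  # graded, per-pixel explanation
```

In this toy case the bounding box assigns equal weight to the pixels it encloses and none to anything outside, whereas the similarity map keeps graded scores, including the weaker activation that falls outside the box. A metric that reads pixel-level importance will therefore see different inputs from the two visualizations.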