Detecting Call Graph Unsoundness without Ground Truth
This work addresses a fundamental problem for static analysis researchers and practitioners: it exposes inconsistencies in evaluation practices and highlights the need to reason jointly about algorithms, configurations, and framework semantics. Its contribution is incremental, however, in that it challenges existing assumptions rather than introducing a new solution.
The study examines the assumption that Java static analysis frameworks yield comparable results through a large-scale empirical analysis of four frameworks. It finds that algorithmic precision orders break on modern language features, that configuration choices interact with algorithms to produce synergistic failures, and that cross-framework comparisons expose irreconcilable semantic gaps.
Java static analysis frameworks are commonly compared under the assumption that analysis algorithms and configurations compose monotonically and yield semantically comparable results across tools. In this work, we show that this assumption is fundamentally flawed. We present a large-scale empirical study of semantic consistency within and across four widely used Java static analysis frameworks: Soot, SootUp, WALA, and Doop. Using precision partial orders over analysis algorithms and configurations, we systematically identify violations where increased precision introduces new call-graph edges or amplifies inconsistencies. Our results reveal three key findings. First, algorithmic precision orders frequently break within frameworks due to modern language features such as lambdas, reflection, and native modeling. Second, configuration choices strongly interact with analysis algorithms, producing synergistic failures that exceed the effects of algorithm or configuration changes alone. Third, cross-framework comparisons expose irreconcilable semantic gaps, demonstrating that different frameworks operate over incompatible notions of call-graph ground truth. These findings challenge prevailing evaluation practices in static analysis and highlight the need to reason jointly about algorithms, configurations, and framework semantics when assessing precision and soundness.
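The precision-order check described above can be illustrated with a minimal sketch. It assumes each call graph is a set of (caller, callee) edges and that the precision partial order is a list of (coarser, finer) pairs. Under that assumption the finer analysis should report a subset of the coarser analysis's edges, so any extra edge is flagged as a violation. The function name `find_order_violations`, the method signatures, and the toy edge sets are hypothetical and are not taken from the study. CHA and RTA serve only as a familiar coarser/finer pair.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee); hypothetical string signatures


def find_order_violations(
    call_graphs: Dict[str, Set[Edge]],
    precision_order: List[Tuple[str, str]],
) -> Dict[Tuple[str, str], Set[Edge]]:
    """Return, for each (coarser, finer) pair, the edges that appear only
    in the finer analysis. A non-empty set contradicts the expectation
    that a more precise analysis yields a subset of the coarser one."""
    violations: Dict[Tuple[str, str], Set[Edge]] = {}
    for coarser, finer in precision_order:
        extra = call_graphs[finer] - call_graphs[coarser]
        if extra:
            violations[(coarser, finer)] = extra
    return violations


# Toy data (assumed for illustration, not measured results).
call_graphs = {
    "CHA": {("Main.main", "A.run"), ("Main.main", "B.run")},
    "RTA": {("Main.main", "A.run"), ("Main.main", "Lambda$1.apply")},
}
precision_order = [("CHA", "RTA")]  # RTA is expected to refine CHA

violations = find_order_violations(call_graphs, precision_order)
for (coarser, finer), edges in violations.items():
    print(f"{finer} adds {len(edges)} edge(s) not present in {coarser}")
```

No ground-truth call graph is needed for this check: the only oracle is the expected subset relation between two analyses. The same function could be applied to configuration pairs, if one configuration is expected to refine another.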