Local Adjoints for Simultaneous Preaccumulations with Shared Inputs
For developers of parallel automatic differentiation tools, this work provides practical local-adjoint storage schemes that make preaccumulations with shared inputs safe, though the improvements are incremental.
The paper addresses data races in shared-memory parallel automatic differentiation that arise when inputs are shared among simultaneous thread-local preaccumulations. It proposes local adjoints, stored in vector- or map-based containers, to re-enable these preaccumulations, implements them in CoDiPack, and benchmarks them in SU2, reporting tradeoffs in memory consumption, memory allocation, and adjoint variable access times.
In shared-memory parallel automatic differentiation, inputs that are shared among simultaneous thread-local preaccumulations lead to data races if Jacobians are accumulated with a single, shared vector of adjoint variables. In this work, we discuss the benefits and tradeoffs of re-enabling such preaccumulations by a transition to suitable local adjoints. We propose different vector- and map-based approaches for storing local adjoint variables and analyze them with respect to memory consumption, memory allocation, and adjoint variable access times in the context of simultaneous preaccumulations in multiple threads. We implement the approaches in CoDiPack and benchmark them in parallel discrete adjoint computations in the multiphysics simulation suite SU2.
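The idea of map-based local adjoints can be illustrated with a minimal C++ sketch. It does not use the CoDiPack API; the `Statement` tape layout, the `LocalAdjoints` alias, and the functions `preaccumulate` and `runSharedInputExample` are hypothetical names introduced only for this illustration. Two threads preaccumulate simultaneously, and both tapes read the shared input with identifier 0. Because each thread performs its reverse sweep on a private adjoint map, the updates to the adjoint of identifier 0 do not collide.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <unordered_map>
#include <vector>

// Toy tape entry (assumed layout): lhs = f(rhs...), partials[k] = d lhs / d rhs[k].
struct Statement {
  int lhs;
  std::vector<int> rhs;
  std::vector<double> partials;
};

// Hypothetical map-based local adjoints: only identifiers touched by this
// preaccumulation get an entry, so storage scales with the local tape.
using LocalAdjoints = std::unordered_map<int, double>;

// Reverse sweep over a thread-local tape with thread-local adjoints.
// Returns d output / d inputs[k] for each k.
std::vector<double> preaccumulate(const std::vector<Statement>& tape,
                                  const std::vector<int>& inputs, int output) {
  assert(!tape.empty());
  LocalAdjoints adj;   // private to the calling thread
  adj[output] = 1.0;   // seed the output adjoint
  for (auto it = tape.rbegin(); it != tape.rend(); ++it) {
    const double lhsAdj = adj[it->lhs];  // copy before the map is modified
    adj[it->lhs] = 0.0;
    for (std::size_t k = 0; k < it->rhs.size(); ++k) {
      // With a single shared adjoint vector, this update would race
      // whenever rhs[k] is an input shared with another thread.
      adj[it->rhs[k]] += it->partials[k] * lhsAdj;
    }
  }
  std::vector<double> jacobianRow;
  for (int id : inputs) jacobianRow.push_back(adj[id]);
  return jacobianRow;
}

struct Result {
  std::vector<double> a, b;
};

// Two simultaneous preaccumulations that share the input with identifier 0.
Result runSharedInputExample() {
  // Thread A: v1 = 2 * x, v2 = 3 * v1          (x has identifier 0)
  const std::vector<Statement> tapeA = {{1, {0}, {2.0}}, {2, {1}, {3.0}}};
  // Thread B: v3 = 5 * x, v4 = 4 * v3 + x
  const std::vector<Statement> tapeB = {{3, {0}, {5.0}},
                                        {4, {3, 0}, {4.0, 1.0}}};
  Result r;
  // Each thread writes only its own member of r, so the results do not race.
  std::thread ta([&] { r.a = preaccumulate(tapeA, {0}, 2); });
  std::thread tb([&] { r.b = preaccumulate(tapeB, {0}, 4); });
  ta.join();
  tb.join();
  return r;
}
```

Calling `runSharedInputExample()` yields one Jacobian entry per thread with respect to the shared input. A vector-based variant would replace the map with a thread-local `std::vector<double>` indexed by identifier, which trades a larger allocation for direct indexing; this is the kind of tradeoff in memory consumption, allocation, and access time that the abstract refers to.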