Smart caching in a Data Lake for High Energy Physics analysis
This work addresses data management challenges for High Energy Physics researchers in distributed environments, although it is an incremental application of existing methods to a specific domain.
The authors tackle the problem of data access and management in a distributed High Energy Physics Data Lake by proposing an autonomous caching method based on reinforcement learning, intended to improve the user experience and contain the maintenance costs of the infrastructure.
The continuous growth of data production in almost all scientific areas raises new problems in data access and management, especially when the end-users and the resources they can access are distributed worldwide. This work focuses on data cache management in a Data Lake infrastructure in the context of High Energy Physics. We propose an autonomous method, based on Reinforcement Learning techniques, to improve the user experience and to contain the maintenance costs of the infrastructure.
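
To make the idea of a Reinforcement Learning cache policy concrete, the following is a minimal sketch of a tabular Q-learning agent that decides whether to store a requested file. It is an illustration only and not the authors' method: the class name `QCacheAgent`, the state labels ("popular", "rare"), the reward values and the hyperparameters are all assumptions introduced here.

```python
import random
from collections import defaultdict


class QCacheAgent:
    """Toy tabular Q-learning agent for cache admission (hypothetical sketch)."""

    ACTIONS = (0, 1)  # 0 = do not store the file, 1 = store it in the cache

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor for future rewards
        self.epsilon = epsilon  # probability of taking a random action
        self.q = defaultdict(lambda: [0.0, 0.0])  # state -> [Q(s,0), Q(s,1)]
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy choice; ties are resolved in favour of not storing."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.ACTIONS)
        values = self.q[state]
        return 0 if values[0] >= values[1] else 1

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update towards reward + gamma * max_a Q(s', a)."""
        td_target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])


# Assumed toy feedback: storing a popular file is rewarded (later cache hits),
# storing a rarely requested file is penalised (wasted cache space).
agent = QCacheAgent(alpha=0.5, gamma=0.0, epsilon=0.0)
agent.update("popular", 1, 1.0, "popular")
agent.update("rare", 1, -1.0, "rare")
print(agent.act("popular"), agent.act("rare"))  # prints "1 0"
```

In a realistic setting the state would be derived from file and cache statistics (for example size, request frequency or cache occupancy) and the reward from observed hits and data movement; those choices are left open here because the abstract does not specify them.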