Near-Optimal Data Source Selection for Bayesian Learning
This work provides a method for efficiently selecting data sources to optimize cost and learning performance for practitioners using Bayesian learning, offering a practical solution to a known challenge.
This paper addresses the problem of selecting data sources to minimize cost while achieving a desired learning performance in Bayesian learning. The authors demonstrate the NP-hardness of this problem and propose both a standard greedy algorithm and a faster greedy algorithm, both with provable performance guarantees, to solve it.
We study a fundamental problem in Bayesian learning, where the goal is to select a set of data sources with minimum cost while achieving a certain learning performance based on the data streams provided by the selected data sources. First, we show that the data source selection problem for Bayesian learning is NP-hard. We then show that the data source selection problem can be transformed into an instance of the submodular set covering problem studied in the literature, and provide a standard greedy algorithm to solve the data source selection problem with provable performance guarantees. Next, we propose a fast greedy algorithm that improves the running times of the standard greedy algorithm, while achieving performance guarantees that are comparable to those of the standard greedy algorithm. The fast greedy algorithm can also be applied to solve the general submodular set covering problem with performance guarantees. Finally, we validate the theoretical results using numerical examples, and show that the greedy algorithms work well in practice.