Making the most of your day: online learning for optimal allocation of time
This work addresses a practical problem for agents in resource allocation scenarios such as job scheduling or ride-sharing. The contribution is incremental, however, since it builds on existing bandit frameworks and adds a specific twist.
The paper tackles online learning for optimal time allocation: an agent sequentially decides to accept or reject proposed tasks whose duration distribution, and possibly reward function, are unknown to her. It studies the regret incurred in two scenarios: when the agent knows her reward function but not the distribution of task durations, and when she knows neither. The result is a theoretical analysis of regret bounds for this setting, which differs from contextual bandits because the normalized reward associated with a context depends on the entire context distribution.
We study online learning for optimal allocation when the resource to be allocated is time. Examples of possible applications include job scheduling for a computing server, a driver filling a day with rides, a landlord renting an estate, etc. An agent receives task proposals sequentially according to a Poisson process and can either accept or reject a proposed task. If she accepts the proposal, she is busy for the duration of the task and obtains a reward that depends on the task duration. If she rejects it, she remains on hold until a new task proposal arrives. We study the regret incurred by the agent, first when she knows her reward function but does not know the distribution of the task duration, and then when she does not know her reward function, either. This natural setting bears similarities with contextual (one-armed) bandits, but with the crucial difference that the normalized reward associated with a context depends on the whole distribution of contexts.
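The accept/reject dynamics described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function `simulate`, the uniform duration distribution, the square-root reward, the fixed `threshold` policy, and the rule that a task is taken only if it finishes before the horizon are all hypothetical choices made for illustration.

```python
import math
import random


def simulate(horizon, rate, sample_duration, reward, accept, seed=0):
    """Simulate one run; return (total_reward, number_of_accepted_tasks)."""
    rng = random.Random(seed)
    t = 0.0          # current time
    total = 0.0      # cumulative reward
    accepted = 0     # number of tasks taken
    while True:
        # While on hold, the wait until the next proposal of a Poisson
        # process with the given rate is exponentially distributed.
        t += rng.expovariate(rate)
        if t >= horizon:
            break
        x = sample_duration(rng)  # duration of the proposed task
        # Assumption: a task is taken only if it finishes before the horizon.
        if accept(x) and t + x <= horizon:
            total += reward(x)
            accepted += 1
            t += x  # busy for the whole task; proposals arriving meanwhile are lost
    return total, accepted


if __name__ == "__main__":
    # Hypothetical instance: durations uniform on [0, 2], reward sqrt(duration).
    reward = math.sqrt
    sample_duration = lambda rng: rng.uniform(0.0, 2.0)
    threshold = 0.8  # hypothetical reward-per-unit-time cutoff
    policy = lambda x: reward(x) >= threshold * x
    total, n = simulate(1000.0, 1.0, sample_duration, reward, policy)
    print(f"accepted {n} tasks, reward per unit time {total / 1000.0:.3f}")
```

Changing `threshold` changes how much time is spent waiting versus working. The best cutoff depends on the whole duration distribution rather than on the current task alone, which is the dependence the summary contrasts with standard contextual bandits.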