Non-stationary bandit problems

Restless bandits

November 28, 2014 — June 19, 2024

bandit problems
decision theory
signal processing
stochastic processes
Figure 1

Bandit problems where the reward is nonstationary.

Wikipedia recommends we consider Whittle (1988) as a foundational article and that hip recent results include Garivier and Moulines (2008).

1 References

Bedi, Koppel, Rajawat, et al. 2019. Nonstationary Nonparametric Online Learning: Balancing Dynamic Regret and Model Parsimony.”
D’Orazio, Loizou, Laradji, et al. 2023. Stochastic Mirror Descent: Convergence Analysis and Adaptive Variants via the Mirror Stochastic Polyak Stepsize.” Transactions on Machine Learning Research.
Dick, Gyorgy, and Szepesvari. 2014. Online Learning in Markov Decision Processes with Changing Cost Sequences.” In Proceedings of the 31st International Conference on Machine Learning.
Duchi, Agarwal, Johansson, et al. 2012. Ergodic Mirror Descent.” SIAM Journal on Optimization.
Even. 2023. Stochastic Gradient Descent Under Markovian Sampling Schemes.”
Fei, Yang, Wang, et al. 2020. Dynamic Regret of Policy Optimization in Non-Stationary Environments.” In Advances in Neural Information Processing Systems.
Garivier, and Moulines. 2008. On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems.”
Karampatziakis, Langford, and Mineiro. 2020. Empirical Likelihood for Contextual Bandits.” In Advances in Neural Information Processing Systems.
Li, Zhao, and Lan. n.d. “Robust Policy Mirror Descent for Controlling Uncertain Markov Decision Process.”
Whittle. 1988. Restless Bandits: Activity Allocation in a Changing World.” Journal of Applied Probability.
Yan, Zhao, and Zhou. 2023. Online Non-Stochastic Control with Partial Feedback.” Journal of Machine Learning Research.