Bandit problems

Also stochastic control

November 28, 2014 — June 19, 2024

bandit problems
decision theory
signal processing
stochastic processes

Bandit problems, Markov decision processes, a smattering of dynamic programming, game theory, optimal control, and online learning of the solutions to such problems, esp. reinforcement learning.

Learning, where you must learn an optimal action in response to your stimulus, possibly an optimal “policy” of trying different actions over time, not just an MMSE-minimal prediction from complete data.

Comes in adversarial, Markov, and stochastic flavours, apparently, although I’ve hit the boundaries of my knowledge there.

1 Pseudopolitical diversion

See clickbait bandits.

2 Intros

Figure 1

Here’s an intro to all of machine learning through a historical tale about how one particular attempt to teach a machine (not a computer!) to play tic-tac-toe: Rodney Brooks, Machine Learning Explained. Introductions recommended by Bubeck include (Slivkins 2019; Bubeck and Cesa-Bianchi 2012; Lattimore 2020).

2.1 Theory

3 Bayes version

And other probability-matching methods

3.1 Thompson sampling

Thompson sampling (Chapelle and Li 2011; Scott 2010)is a heuristic for choosing actions that addresses the exploration/exploitation trade-off in the multi-armed bandit problem.

There is a neat tutorial (Russo et al. 2018) with supporting code at iosband/ts_tutorial.

4 Bandits-meet-optimisation

Bubeck again: Kernel-based methods for bandit convex optimization, part 1.

5 Bandits-meet-evolution

Ben Recht and Roy Frostig, Nesterov’s punctuation equilibrium:

In a landmark new paper by Salimans, Ho, Chen, and Sutskever from OpenAI, (Salimans et al. 2017) the authors show that a particular class of genetic algorithms (called Evolutionary Strategies) gives excellent performance on a variety of reinforcement learning benchmarks. As optimizers, the application of genetic algorithms raises red flags and usually causes us to close browser Windows. But fear not! As we will explain, the particular algorithm deployed happens to be a core method in optimization, and the fact that this method is successful sheds light on the peculiarities of reinforcement learning more than it does about genetic algorithms in general.

6 Details


Conceptually, the base model is a one- or many-armed poker machine. You can pop coins in, and each time you do you may pull an arm; you might get rewarded. Each arm of the machine might have different returns; but the only way to find out is to play.

How do you choose optimally which arms to pull and when? How much is to work spending to find the arm with the best return on investment, given that it costs to collect more data?

This can be formalised by minimizing regret and defining some other terms and what you get out is a formalized version of Skinnerian learning that you can easily implement as an algorithm and feel that you have got some satisfyingly optimal properties for.

The thing about whether to choose a new arm when you are on a known good one, or to switch to another one in hope of it being better, this is a symbolic one and it’s called the exploration/exploitation trade-off.

7 Multi-world testing

A setting by Microsoft: Multi-World Testing (MWT) appears to be an online learning problem that augments its data by re-using the data for offline testing:

… is a toolbox of machine learning technology for principled and efficient experimentation, plausibly applicable to most Microsoft services that interact with customers. In many scenarios, this technology is exponentially more efficient than the traditional A/B testing. The underlying research area, mature and yet very active, is known under many names: “multi-armed bandits”, “contextual bandits”, “associative reinforcement learning”, and “counterfactual evaluation”, among others.

To take an example, suppose one wants to optimize clicks on suggested news stories. To discover what works, one needs to explore over the possible news stories. Further, if the suggested news story can be chosen depending on the visitor’s profile, then one needs to explore over the possible “policies” that map profiles to news stories (and there are exponentially more “policies” than news stories!). Traditional ML fails at this because it does not explore. Whereas MWT allows you to explore continuously, and optimize your decisions using this exploration data.

It has a better introduction here, by John Langford: (cf (Alekh Agarwal et al. 2015).)

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”

A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research.

8 Reinforcement learning

See reinforcement learning.

9 Markov decision problems


10 Connection to graphical models

See Levine (2018).

11 Practicalities

Vowpal Wabbit does contextual bandit learning.:

VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not handle the exploration issue, it assumes it can only use the currently available data previously collected from some “exploration” policy.

Multiple World Testing source code.

12 Sequential surrogate interactive model optimisation

Not what I think of in reference to bandit problems, but it and many other hyperparameter optimization-type problems have an RL interpretation apparently. That is, you can use RL to learn the hyper parameters of your deep learning model. (This is not the same as deep learning of RL policies.) See Sequential surrogate model optimisation.

13 Incoming

14 References

Afsharrad, Moradipari, and Lall. 2023. Convex Methods for Constrained Linear Bandits.”
Agarwal, Aman, Basu, Schnabel, et al. 2017. Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers.” In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17.
Agarwal, Alekh, Bird, Cozowicz, et al. 2015. Introduction to Multi-World Testing.”
Allen-Zhu, Li, Singh, et al. 2017. Near-Optimal Design of Experiments via Regret Minimization.” In PMLR.
Bellman. 1957. A Markovian Decision Process.”
Bellman, and Kalaba. 1961. A Note on Interrupted Stochastic Control Processes.” Information and Control.
Bensoussan, Li, Nguyen, et al. 2020. Machine Learning and Control Theory.” arXiv:2006.05604 [Cs, Math, Stat].
Bottou, Leon. 2014. “Learning to Interact.”
Bottou, Léon, Peters, Quiñonero-Candela, et al. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising.” Journal of Machine Learning Research.
Brockman, Cheung, Pettersson, et al. 2016. OpenAI Gym.” arXiv:1606.01540 [Cs].
Bubeck, and Cesa-Bianchi. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
Bubeck, Munos, and Stoltz. 2011. Pure Exploration in Finitely–Armed and Continuous–Armed Bandits.” Theoretical Computer Science.
Bubeck, and Slivkins. 2012. The Best of Both Worlds: Stochastic and Adversarial Bandits.” arXiv:1202.4473 [Cs].
Cesa-Bianchi, and Lugosi. 2006. Prediction, Learning, and Games.
Chapelle, and Li. 2011. An Empirical Evaluation of Thompson Sampling.” In Advances in Neural Information Processing Systems.
Clifton, and Laber. 2020. Q-Learning: Theory and Applications.” Annual Review of Statistics and Its Application.
Dayan, and Watkins. n.d. “Reinforcement Learning.” In Encyclopedia of Cognitve Science.
Du, Krishnamurthy, Jiang, et al. 2019. Provably Efficient RL with Rich Observations via Latent State Decoding.”
Gopnik. 2020. Childhood as a Solution to Explore–Exploit Tensions.” Philosophical Transactions of the Royal Society B: Biological Sciences.
Gueudré, Dobrinevski, and Bouchaud. 2014. Explore or Exploit? A Generic Model and an Exactly Solvable Case.” Physical Review Letters.
Hofmann, Schuth, Whiteson, et al. 2013. Reusing Historical Interaction Data for Faster Online Learning to Rank for IR.”
Howard. 1960. Dynamic Programming and Markov Processes..
Jaakkola, Singh, and Jordan. 1995. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems.” In Advances in Neural Information Processing Systems.
Jamieson, and Talwalkar. 2015. Non-Stochastic Best Arm Identification and Hyperparameter Optimization.” arXiv:1502.07943 [Cs, Stat].
Kaelbling, Littman, and Moore. 1996. Reinforcement Learning: A Survey.” Journal of Artifical Intelligence Research.
Krakovsky. 2016. Reinforcement Renaissance.” Commun. ACM.
Krishnamurthy, Agarwal, and Langford. 2016. Contextual-MDPs for PAC-Reinforcement Learning with Rich Observations.” arXiv:1602.02722 [Cs, Stat].
Langford. 2013. Learning to Interact.”
Lattimore. 2020. Bandit Algorithms.
Levine. 2018. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv:1805.00909 [Cs, Stat].
Li, Lihong, Chen, Kleban, et al. 2015. Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study.” In Proceedings of the 24th International World Wide Web Conference (WWW’14), Companion Volume.
Li, Lihong, Chu, Langford, et al. 2011. Unbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algorithms.” In Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM-11).
Li, Lisha, Jamieson, DeSalvo, et al. 2016. Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits.” arXiv:1603.06560 [Cs, Stat].
———, et al. 2017. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” The Journal of Machine Learning Research.
Mania, Guy, and Recht. 2018. Simple Random Search Provides a Competitive Approach to Reinforcement Learning.” arXiv:1803.07055 [Cs, Math, Stat].
Parisotto, and Salakhutdinov. 2017. Neural Map: Structured Memory for Deep Reinforcement Learning.” arXiv:1702.08360 [Cs].
Pfau, and Vinyals. 2016. Connecting Generative Adversarial Networks and Actor-Critic Methods.” arXiv:1610.01945 [Cs, Stat].
Powell. 2009. What You Should Know about Approximate Dynamic Programming.” Naval Research Logistics (NRL).
Roy, Gordon, and Thrun. 2005. Finding Approximate POMDP Solutions Through Belief Compression.” Journal of Artificial Intelligence Research.
Russo, Roy, Kazerouni, et al. 2018. A Tutorial on Thompson Sampling.” Foundations and Trends® in Machine Learning.
Salimans, Ho, Chen, et al. 2017. Evolution Strategies as a Scalable Alternative to Reinforcement Learning.” arXiv:1703.03864 [Cs, Stat].
Scott. 2010. A Modern Bayesian Look at the Multi-Armed Bandit.” Applied Stochastic Models in Business and Industry.
Shibata, Yoshinaka, and Chikayama. 2006. Probabilistic Generalization of Simple Grammars and Its Application to Reinforcement Learning.” In Algorithmic Learning Theory. Lecture Notes in Computer Science 4264.
Si. 2004. Handbook of Learning and Approximate Dynamic Programming.
Silver, Singh, Precup, et al. 2021. Reward Is Enough.” Artificial Intelligence.
Simchi-Levi, and Xu. 2021. Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits Under Realizability.”
Slivkins. 2019. Introduction to Multi-Armed Bandits.” Foundations and Trends® in Machine Learning.
Song, Xie, and Pokutta. 2015. Sequential Information Guided Sensing.” arXiv:1509.00130 [Cs, Math, Stat].
Strehl, Langford, Li, et al. 2011. Learning from Logged Implicit Exploration Data.” In Advances in Neural Information Processing Systems 23 (NIPS-10).
Sutton, Richard S, and Barto. 1998. Reinforcement Learning.
Sutton, Richard S., McAllester, Singh, et al. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation.” In Advances in Neural Information Processing Systems.
Su, Wang, Santacatterina, et al. 2018. CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning.” arXiv:1811.02672 [Cs, Stat].
Swaminathan, and Joachims. 2015. Counterfactual Risk Minimization.” In Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion.
Thrun. 1992. Efficient Exploration In Reinforcement Learning.”
Whittle. 1988. Restless Bandits: Activity Allocation in a Changing World.” Journal of Applied Probability.