Professor Aditya Gopalan
Department of Electrical Communication Engineering, Indian Institute of Science

Topic: Multi-armed Bandits and Reinforcement Learning

Abstract:

These lectures will discuss models and algorithms for sequential decision-making in the face of uncertainty, often under an information constraint that leads to an exploration-exploitation tradeoff for the learning agent. The quintessential example of such a problem is the multi-armed bandit, a special case of the more general reinforcement learning problem. Sequential learning problems arise in many modern data-driven intelligent systems (think Internet recommendation engines, hyperparameter tuning for machine learning algorithms, game playing, financial portfolio allocation, resource allocation in communication systems, etc.), in which one wants the ability to adapt continuously to changing conditions.

The plan for the lectures is as follows:

1) Introduction to multi-armed bandit problems, the stochastic k-armed bandit model, performance metrics: regret, best-arm identification.

2) Algorithms: Explore-Then-Commit and Epsilon-Greedy, UCB, Thompson Sampling, Median Elimination, Successive Rejects.

3) Lower bounds: Lai-Robbins lower bound on asymptotic regret, minimax lower bounds.

4) More general RL problems (linear and contextual bandits, Markov decision processes).

In the first session, we will present the basic stochastic bandit model and discuss various performance metrics that quantify what it means to learn optimal behavior in this setting. Learning a bandit problem can be framed either as optimization (maximizing net reward) or as adaptive sequential hypothesis testing (reliably identifying the best arm).
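For concreteness, the central optimization metric, expected regret, can be written as follows; the notation below (arm means \mu_a, arm A_t played at round t over a horizon of T rounds) is the standard one for the stochastic k-armed bandit and is assumed here rather than taken from the abstract:

    \[
      R_T \;=\; T\,\mu^\ast \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{T} \mu_{A_t} \right],
      \qquad
      \mu^\ast \;=\; \max_{1 \le a \le k} \mu_a .
    \]

An algorithm can be said to learn the bandit when R_T grows sublinearly in T, i.e., its per-round reward approaches that of the best arm.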

In the second session, we will look at some common algorithmic principles for balancing exploration and exploitation in bandit problems. These include forced exploration (popularly known as epsilon-greedy in RL), optimism in the face of uncertainty (e.g., the Upper Confidence Bound, or UCB, algorithm), and Bayesian-inspired methods such as Thompson Sampling. Time permitting, we will also discuss sampling algorithms for the best-arm identification problem, such as Median Elimination and Successive Rejects.
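As a concrete illustration of optimism in the face of uncertainty, here is a minimal UCB1 sketch on a simulated Bernoulli bandit; the environment, arm means, and horizon below are illustrative assumptions, not material from the lectures:

    import math
    import random

    def ucb1(means, horizon, seed=0):
        """Run UCB1 on a simulated Bernoulli bandit; return total reward."""
        rng = random.Random(seed)
        k = len(means)
        counts = [0] * k    # number of plays per arm
        sums = [0.0] * k    # cumulative reward per arm
        total = 0.0
        for t in range(1, horizon + 1):
            if t <= k:
                arm = t - 1  # play each arm once to initialize estimates
            else:
                # optimism: empirical mean plus a confidence-width bonus
                arm = max(range(k),
                          key=lambda a: sums[a] / counts[a]
                          + math.sqrt(2.0 * math.log(t) / counts[a]))
            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            total += reward
        return total

    # Regret relative to always playing the best arm:
    means, T = [0.4, 0.5, 0.6], 10_000
    print(max(means) * T - ucb1(means, T))

The same loop with the index replaced by a posterior sample per arm gives Thompson Sampling; with a fixed exploration probability, it gives epsilon-greedy.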

Next, wearing an information theorist's hat, we will turn to quantifying the performance limits, or price of learning, of bandit algorithms. We will show how information-theoretic tools can be used to derive the minimum regret rate (or sample complexity for reliable best-arm identification) achievable by any sequential strategy.
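For reference, the Lai-Robbins bound can be stated as follows, in standard (assumed) notation where \Delta_a = \mu^\ast - \mu_a is arm a's suboptimality gap, \nu_a its reward distribution, and KL the Kullback-Leibler divergence: every uniformly good strategy satisfies

    \[
      \liminf_{T \to \infty} \frac{R_T}{\log T}
      \;\ge\;
      \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}(\nu_a, \nu^\ast)} .
    \]

In words, regret logarithmic in the horizon, with a distribution-dependent constant, is asymptotically unavoidable, and UCB-type algorithms attain this logarithmic rate.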

Finally, we will broaden the modeling scope to more general decision-making problems: bandits with structured arm sets (linear bandits, contextual bandits), and models in which the reward distribution can change with time and with the actions played (Markov decision processes). We will discuss how some of the algorithmic principles from the simpler bandit models extend to these more general settings.
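To give a flavor of how the optimism principle carries over to structured arms, below is a rough LinUCB-style sketch for a linear bandit with a finite arm set; the arm features, noise model, and tuning constant alpha are all illustrative assumptions:

    import numpy as np

    def linucb(features, theta_star, horizon, alpha=1.0, noise=0.1, seed=0):
        """Sketch of LinUCB on a linear bandit with fixed arm features."""
        rng = np.random.default_rng(seed)
        d = features.shape[1]
        A = np.eye(d)      # regularized design matrix
        b = np.zeros(d)    # accumulated feature-weighted rewards
        for _ in range(horizon):
            theta_hat = np.linalg.solve(A, b)   # ridge estimate of theta*
            A_inv = np.linalg.inv(A)
            # optimistic index: estimated reward plus uncertainty bonus
            ucb = features @ theta_hat + alpha * np.sqrt(
                np.einsum('ij,jk,ik->i', features, A_inv, features))
            x = features[int(np.argmax(ucb))]
            r = x @ theta_star + noise * rng.standard_normal()
            A += np.outer(x, x)
            b += r * x
        return np.linalg.solve(A, b)

    # Example with 5 random arms in R^3 and a hypothetical theta*:
    rng = np.random.default_rng(1)
    arms = rng.standard_normal((5, 3))
    print(linucb(arms, theta_star=np.array([1.0, 0.0, -1.0]), horizon=2000))

Contextual bandits replace the fixed arm set with features that change each round; the same confidence-ellipsoid idea applies.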

Bio:

Aditya Gopalan is an Associate Professor at the Indian Institute of Science, Bengaluru, in the Department of Electrical Communication Engineering (ECE). He received the Ph.D. degree in electrical engineering from The University of Texas at Austin and was a postdoctoral fellow at the Technion-Israel Institute of Technology. His research interests lie in statistical learning, with a focus on online and reinforcement learning, optimization and control, and statistical inference. He has served as an editor of the IEEE/ACM Transactions on Networking and is an Associate Fellow of the Indian National Science Academy.

