Multi-armed Bandits (4)

Proof of Thompson Sampling

对于 Bernoulli Bandit 来说,它的 Bayesian regret 就是 $$ \operatorname{BR}(T) = \sum_{t=1}^T\mathbb{E}_{\theta \sim \text{prior}} \left[ \, \mathbb{E}[\mu(a^\ast)-\mu(a_t) \mid \mathcal{F}_t \, ] \, \right] $$ 对任意的 $U_t$,有 $$ \mathbb{E}[U_t(a^\ast) \mid \mathcal{F}_t] = \mathbb{E}[U_t(a_t) \mid \mathcal{F}_t] $$ 所以 $$ \begin{aligned} \mathbb{E}[\mu(a^\ast)-\mu(a_t) \mid \mathcal{F}_t \, ] & = \mathbb{E}[\mu(a^\ast) - U_t(a^\ast) + U_t(a_t) - \mu(a_t) \mid \mathcal{F}_t] \\ & = \mathbb{E}[\mu(a^\ast) - U_t(a^\ast) \mid \mathcal{F}_t ] + \mathbb{E}[U_t(a_t) - \mu(a_t) \mid \mathcal{F}_t] \end{aligned} $$ 并且 $$ \operatorname{BR}(T) = \sum_{t=1}^T \left[ \, \mathbb{E}[\mu(a^\ast) - U_t(a^\ast) ] + \mathbb{E}[U_t(a_t) - \mu(a_t) ] \, \right] $$ 假设存在 $U_t, L_t, \gamma > 0$ 使得对任意的 $a, t$ 成立 $$ \mathbb{E}[(U_t(a) - \mu(a))^{-}] \leq \frac{\gamma}{TK}, \qquad \mathbb{E}[(\mu(a) - L_t(a))^{-}] \leq \frac{\gamma}{TK} $$ 那么 $$ \begin{aligned} \mathbb{E}[\mu(a^\ast) - U_t(a^\ast) ] & \leq \mathbb{E}[(\mu(a^\ast) - U_t(a^\ast))^{+} ] \\ & \leq \sum_{a \in \mathcal{A}} \mathbb{E}[(\mu(a) - U_t(a))^{+}] \\ & \leq \sum_{a \in \mathcal{A}} \mathbb{E}[(U_t(a) - \mu(a) )^{-}] \\ & \leq \frac{\gamma}{T} \end{aligned} $$ 另一方面 $$ \begin{aligned} \mathbb{E}[U_t(a_t) - \mu(a_t) ] & = \mathbb{E}[U_t(a_t) - L_t(a_t) + L_t(a_t) - \mu(a_t) ] \\ & = \mathbb{E}[U_t(a_t) - L_t(a_t) ] + \mathbb{E}[ L_t(a_t) - \mu(a_t)]\\ & \leq \mathbb{E}[U_t(a_t) - L_t(a_t) ] + \mathbb{E}[(L_t(a_t) - \mu(a_t))^{+}] \\ & \leq \mathbb{E}[U_t(a_t) - L_t(a_t) ] + \sum_{a \in \mathcal{A}}\mathbb{E}[(L_t(a) - \mu(a))^{+}] \\ & \leq \mathbb{E}[U_t(a_t) - L_t(a_t) ] + \frac{\gamma}{T} \end{aligned} $$ 于是 $$ \operatorname{BR}(T) \leq 2 \gamma+\sum_{t=1}^T \mathbb{E}\left[U_t\left(a_t\right)-L_t\left(a_t\right)\right] $$ 取 $$ \begin{aligned} & \gamma = 2\\ & U_t(a)=\bar{\mu}_t(a)+\sqrt{\frac{2 \log (T)}{n_t(a)}} \\ & L_t(a)=\bar{\mu}_t(a)-\sqrt{\frac{2 \log (T)}{n_t(a)}} \end{aligned} $$

因为 $$ \sum_{t=1}^T \sqrt{\frac{1}{n_t(a)}} =\sum_{ a} \sum_{t: a_t=a} \frac{1}{\sqrt{n_t(a)}} =\sum_{\operatorname{arms} a} \sum_{j=1}^{n(a)} \frac{1}{\sqrt{j}} =\sum_{ a} {O}(\sqrt{n(a)}) \leq O (\sqrt{KT}) $$

我们最后就有: $$ \operatorname{BR}(T) \leq O( \sqrt{KT \log T}) $$

updatedupdated2025-12-162025-12-16