Sequential decision processes


These are transcribed almost directly from the professor's own lecture notes, with very few annotations since I don't really know what's going on. The handwriting is also difficult to read. Wish I had taken Mathematical Finance instead.

1 With discounting

Sequential decision processes with discount factor $\alpha$ (where $0 < \alpha < 1$), solved by value iteration over a finite horizon and by policy iteration (the value-determination and policy-improvement routines below) in the infinite-horizon case.

1.1 Formulas

$$f_i(n) = \max_k \left [q_i^k + \alpha \sum_{j=1}^N p_{ij}^k f_j(n-1) \right ]$$

(Infinite horizon)

$$\lim_{n\to \infty} f_i(n) = f_i \qquad \lim_{n \to \infty} f_j(n-1) = f_j$$
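As a concrete reading of the recursion above, here is a minimal value-iteration sketch (my own, not from the notes), assuming rewards are stored as a numpy array `q[i, k]` and transition probabilities as `P[k, i, j]`:

```python
import numpy as np

def value_iteration(q, P, alpha, n_steps):
    """Compute f_i(n) for n = 1..n_steps, starting from f_i(0) = 0."""
    N, K = q.shape                    # N states, K alternatives per state
    f = np.zeros(N)                   # f_i(0) = 0
    for _ in range(n_steps):
        # q_i^k + alpha * sum_j p_ij^k f_j(n-1), for every state i and alternative k
        candidates = q + alpha * np.einsum('kij,j->ik', P, f)
        f = candidates.max(axis=1)    # maximise over k
    return f
```

For large `n_steps`, `f` approaches the infinite-horizon values $f_i$ mentioned above.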

1.1.1 Value-determination operation

For a given policy $k$, solve

$$v_i = q_i + \alpha \sum_{j=1}^N p_{ij} v_j \tag{$i=1, 2, \ldots N$}$$
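In matrix form this is $\underline v = [I - \alpha P]^{-1} \underline q$, so the value-determination operation is just a linear solve. A minimal sketch, assuming `P` and `q` already correspond to the chosen policy:

```python
import numpy as np

def value_determination(q, P, alpha):
    """Solve (I - alpha P) v = q for the fixed policy's values v."""
    N = len(q)
    return np.linalg.solve(np.eye(N) - alpha * P, q)
```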

1.1.2 Policy improvement routine

The resulting P.I.R. is as follows: for each state $i$, find the alternative $k'$ that maximises

$$q_i^k + \alpha \sum_{j=1}^N p_{ij}^k v_j$$

using the present values $v_j$ from the previous policy. Then $k'$ becomes the new decision in the $i$th state. $q_i^{k'}$ becomes $q_i$, and $p_{ij}^{k'}$ becomes $p_{ij}$. Stop when there is no change in the policy. This whole section of his notes is impossible to read.
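A minimal sketch of one improvement pass, reusing the array layout assumed earlier (`q[i, k]`, `P[k, i, j]`) and the values `v` from the value-determination step:

```python
import numpy as np

def policy_improvement(q, P, alpha, v):
    """Return, for each state i, the alternative k' maximising the test quantity."""
    test = q + alpha * np.einsum('kij,j->ik', P, v)   # q_i^k + alpha * sum_j p_ij^k v_j
    return test.argmax(axis=1)
```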

1.1.3 Proof of iteration cycle

Consider policy $A$ and its successor policy $B$ produced by the P.I.R.:

$$q_i^B + \alpha\sum_{j=1}^N p_{ij}^B v_j^A \geq q_i^A + \alpha \sum_{j=1}^N p_{ij}^A v_j^A \tag{1}$$

in every state $i$. We also know for policies taken individually that:

$$\begin{align} v_i^A & = q_i^A + \alpha \sum_{j=1}^N p_{ij}^A v_j^A \quad \text{or} \quad \underline v^A = [I-\alpha P^A]^{-1} \underline q^A \tag{2} \\ \therefore \, \underline v^A & = \underline q^A +\alpha P^A\underline v^A \tag{not sure about the $\therefore$} \\ v_i^B & = q_i^B + \alpha\sum_{j=1}^N p_{ij}^B v_j^B \quad \text{or} \quad \underline v^B = [I-\alpha P^B]^{-1}\underline q^B \tag{3} \\ \therefore \, \underline v^B & = \underline q^B + \alpha P^B\underline v^B \end{align}$$

(Lowercase $p_{ij}$ is an individual transition probability; uppercase $P$ is the matrix $[p_{ij}]$, so the case matters: componentwise equations use $p_{ij}$ and the matrix forms use $P$.)

Now we define $\gamma_i$ (the improvement in the test quantity for state $i$ when switching from $A$ to $B$) as follows:

$$\begin{align} \gamma_i & = q_i^B + \alpha\sum_{j=1}^N p_{ij}^B v_j^A - q_i^A - \alpha\sum_{j=1}^N p_{ij}^A v_j^A \\ \therefore \, v_i^B - v_i^A & = q_i^B - q_i^A + \alpha \sum_{j=1}^N p_{ij}^B v_j^B - \alpha \sum_{j=1}^N p_{ij}^A v_j^A \\ & = \gamma_i - \alpha\sum_{j=1}^N p_{ij}^B v_j^A + \cancel{\alpha \sum_{j=1}^N p_{ij}^A v_j^A} + \alpha \sum_{j=1}^N p_{ij}^B v_j^B - \cancel{\alpha\sum_{j=1}^N p_{ij}^A v_j^A} \end{align}$$

Let $\Delta v_i = v_i^B - v_i^A$. In terms of $\gamma_i$, we can write

$$\Delta v_i = \gamma_i + \alpha\sum_{j=1}^N p_{ij}^B \Delta v_j$$

In vector/matrix form, we have:

$$\Delta \underline{v} = \underline{\gamma} + \alpha P^B\Delta \underline{v}$$

Thus we can write

$$[I-\alpha P^B]\Delta \underline{v} = \underline{\gamma} \quad \text{or} \quad \Delta \underline{v} = [I-\alpha P^B]^{-1} \underline{\gamma}$$

But $[I-\alpha P^B]^{-1} = \sum_{n \geq 0} (\alpha P^B)^n$ has non-negative elements, and every $\gamma_i \geq 0$ by (1). Hence no $\Delta v_i$ can be less than zero, and if some $\gamma_i > 0$ then at least one $\Delta v_i$ must be greater than zero.

In other words, the P.I.R. cannot decrease the value of any state and must strictly increase at least one of them whenever a strict improvement exists; when $\Delta \underline{v} = 0$, the algorithm has converged.
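Putting the two routines together gives the full iteration cycle. A sketch of how I'd arrange it (same assumed array layout as above; this is not the professor's code):

```python
import numpy as np

def policy_iteration(q, P, alpha):
    """Alternate value determination and policy improvement until the policy is stable."""
    N, K = q.shape
    policy = np.zeros(N, dtype=int)
    while True:
        q_pol = q[np.arange(N), policy]          # q_i under the current policy
        P_pol = P[policy, np.arange(N), :]       # p_ij under the current policy
        v = np.linalg.solve(np.eye(N) - alpha * P_pol, q_pol)
        new_policy = (q + alpha * np.einsum('kij,j->ik', P, v)).argmax(axis=1)
        if np.array_equal(new_policy, policy):   # no change in the policy: converged
            return policy, v
        policy = new_policy
```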

2 Without discounting

Another set of notes, for the undiscounted case.

2.1 Formulas

$$f_n(i) = \max_k \left [ q_i^k + \sum_{j=1}^N p_{ij}^k f_{n-1}(j) \right ], \quad f_1(i) = \max_k q_i^k \tag{I think}$$

Also, $\displaystyle \lim_{n\to \infty}f_n(i) = f(i)$.

For a given policy, we have:

$$v_n(i) = q_i + \sum_{j=1}^N p_{ij} v_{n-1}(j) \tag{not sure about this}$$

For large $n$, write $v_n(i) = ng + c_i$: a gain term that grows linearly in $n$ plus a relative value $c_i$. Substituting this into the formula above and using $\sum_{j=1}^N p_{ij} = 1$, we have:

$$\begin{align} ng+c_i & = q_i + \sum_{j=1}^N p_{ij} \big [ (n-1)g + c_j \big ] \\ g + c_i & = q_i + \sum_{j=1}^N p_{ij} c_j \tag{$N$ equations in the $N+1$ unknowns $g, c_1, \ldots, c_N$} \end{align}$$
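To see the $v_n(i) = ng + c_i$ behaviour numerically, one can iterate the recursion for a fixed policy and watch the per-step increase settle to $g$. A sketch, assuming a well-behaved (ergodic, aperiodic) chain, with `P` and `q` for the fixed policy:

```python
import numpy as np

def iterate_policy_values(q, P, n_steps):
    """Iterate v_n(i) = q_i + sum_j p_ij v_{n-1}(j); the per-step increases tend to g."""
    v_prev = np.zeros(len(q))
    for n in range(1, n_steps + 1):
        v = q + P @ v_prev
        print(n, v - v_prev)   # each entry approaches the gain g
        v_prev = v
    return v_prev
```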

2.1.1 Value-determination operation

Use $p_{ij}$ and $q_i$ for a given policy to solve the following:

$$g+c_i = q_i + \sum_{j=1}^N p_{ij} c_j \tag{$i=1, 2, \ldots, N$}$$

Set $c_N = 0$ (one relative value can be fixed arbitrarily, since only the differences between the $c_i$ matter).
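A minimal sketch of this solve, with the unknowns ordered as $(c_1, \ldots, c_{N-1}, g)$ and `P`, `q` already taken for the given policy (my own arrangement):

```python
import numpy as np

def value_determination_avg(q, P):
    """Solve g + c_i = q_i + sum_j p_ij c_j with the normalisation c_N = 0."""
    N = len(q)
    A = np.eye(N) - P           # coefficients of c_1, ..., c_N
    A[:, -1] = 1.0              # replace the c_N column by the g column (c_N = 0)
    x = np.linalg.solve(A, q)   # x = (c_1, ..., c_{N-1}, g)
    g = x[-1]
    c = np.append(x[:-1], 0.0)
    return g, c
```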

2.1.2 Policy improvement routine

For each state $i$, find the alternative $k'$ that maximises

$$q_i^{k} + \sum_{j=1}^N p_{ij}^k c_j$$

Using the relative values $c_j$ obtained from the previous policy's value-determination step, $k'$ becomes the new decision in state $i$. Furthermore, $q_i^{k'}$ becomes $q_i$, and $p_{ij}^{k'}$ becomes $p_{ij}$.

We stop iterating when there is no change in $g$.
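A sketch of the corresponding improvement pass, reusing the `q[i, k]`, `P[k, i, j]` layout and the relative values `c` from the step above:

```python
import numpy as np

def policy_improvement_avg(q, P, c):
    """Return, for each state i, the alternative k' maximising q_i^k + sum_j p_ij^k c_j."""
    test = q + np.einsum('kij,j->ik', P, c)
    return test.argmax(axis=1)
```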

2.1.3 Proof of policy-iteration method

Suppose we have evaluated policy $A$ for the operation of the system (i.e., run the value-determination operation on it), and the P.I.R. has produced policy $B$.

We seek to prove that $g^B \geq g^A$, i.e., the gain using policy $B$ is at least the gain using policy $A$ (so each iteration of the P.I.R. can only improve the gain, or leave it unchanged at convergence).

Since $B$ was chosen over $A$, the following must be true:

$$q_i^B + \sum_{j=1}^N p_{ij}^Bv_j^A \geq q_i^A + \sum_{j=1}^N p_{ij}^A v_j^A \quad i = 1, 2, \ldots, N \tag{1}$$

Next, we define $\gamma_i$ as follows:

$$\gamma_i = q_i^B + \left ( \sum_{j=1}^N p_{ij}^B v_j^A \right ) - q_i^A - \left (\sum_{j=1}^N p_{ij}^A v_j^A \right ) \tag{2}$$

For policies $A$ and $B$ individually, we have the following:

$$\begin{align} g^B + v_i^B & = q_i^B + \sum_{j=1}^N p_{ij}^B v_j^B \quad i=1, 2, \ldots, N \tag{3} \\ g^A + v_i^A & = q_i^A + \sum_{j=1}^N p_{ij}^A v_j^A \quad i=1, 2, \ldots, N \tag{4} \end{align}$$

If we subtract (4) from (3), we get:

$$g^B - g^A + v_i^B - v_i^A = q_i^B - q_i^A + \left (\sum_{j=1}^N p_{ij}^B v_j^B\right ) - \left (\sum_{j=1}^N p_{ij}^A v_j^A \right ) \tag{5}$$

Now, substituting (2) into (5), i.e., replacing $q_i^B - q_i^A$ using the definition of $\gamma_i$, we obtain:

$$\begin{align} g^B - g^A + v_i^B - v_i^A & = \gamma_i - \left (\sum_{j=1}^N p_{ij}^B v_j^A \right ) + \cancel{\left ( \sum_{j=1}^N p_{ij}^A v_j^A \right )} + \left ( \sum_{j=1}^N p_{ij}^B v_j^B \right ) - \cancel{\left ( \sum_{j=1}^N p_{ij}^A v_j^A \right )} \\ & = \gamma_i + \sum_{j=1}^N p_{ij}^B (v_j^B - v_j^A) \tag{6} \end{align}$$

Let $\Delta g = g^B - g^A$ and let $\Delta v_i = v_i^B - v_i^A$. Then, (6) becomes:1

$$\Delta g + \Delta v_i = \gamma_i + \sum_{j=1}^N p_{ij}^B \Delta v_j$$

Thus we have

$$\Delta g = \sum_{i=1}^N \pi_i^B \gamma_i$$

where $\pi_i^B$ are the limiting state probabilities under policy $B$; the rest of this section justifies this step.

In the value-determination operation, we obtained

$$g + v_i = q_i + \sum_{j=1}^N p_{ij} v_j, \quad i = 1, 2, \ldots, N \tag{7}$$

where $g$ is the gain per period of time in the steady state, i.e., the long-run average reward per transition. In other words, we have

$$g = \sum_{i=1}^N \pi_i q_i$$

where $\pi_i$ is the limiting (steady-state) probability of being in state $i$.

Hence, the equation

$$\Delta g + \Delta v_i = \gamma_i + \sum_{j=1}^N p_{ij}^B \Delta v_j$$

has the same form as equation (7), with $\Delta g$ replacing $g$, $\Delta v_i$ replacing $v_i$, and $\gamma_i$ replacing $q_i$. By the same argument that gave $g = \sum_i \pi_i q_i$, we get $\Delta g = \sum_i \pi_i^B \gamma_i$. Since every $\gamma_i$ is non-negative by (1) and (2), $\Delta g \geq 0$, and each iteration improves the gain until it becomes steady, that is, until $\Delta g = 0$.
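As a small sanity check of the relation $g = \sum_i \pi_i q_i$ used above, here is a sketch that computes the limiting probabilities as the left eigenvector of $P$ with eigenvalue 1 (assuming an ergodic chain, with `P` and `q` for a fixed policy):

```python
import numpy as np

def gain(q, P):
    """Return g = sum_i pi_i q_i, with pi the stationary distribution of P."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    return pi @ q
```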

  1. Not sure if $v_j$ or $c_j$: it originally said $v_j$, but he crossed it out and replaced it with $c_j$ (and a bunch of other $c$'s were crossed out and replaced with $v$'s). Since the $c_i$ from the value-determination operation and the $v_i$ in this proof are the same relative values, $\Delta c_j = \Delta v_j$ and either letter works; the equations above use $v$ throughout. Is add-drop over already?