Thursday, March 13, 2014 Statistical trials, quicksort

1Chernoff bounds continued¶

Recall what we had from the end of last class:

Let $X_1, x_2, \ldots, X_k$ be independent Bernoulli trials such that $P(X_i = 1) = p_i$. If $\displaystyle X = \sum_i X_i$, then

$$P(X > (1+\delta)\mu) \leq \left ( \frac{e^\delta}{(1+\delta)^{(1+\delta)}} \right)^\mu$$

where $\displaystyle \mu = E(X) = \sum_i P_i$.

Today we'll introduce some corollaries, which are simpler and often easier to use.

1.1A corollary¶

For $1 > \delta > 0$, we have that

$$P(x > (1+\delta)\mu) \leq e^{-\frac{1}{3} \mu \delta^2}$$

We had to prove this for question 4(b) in assignment 4. It's a somewhat long proof (at least if you write out all the steps), but the general idea is to use the Taylor series expansion of $\ln(1+\delta)$, multiply that by $(1+\delta)$, and ignore some of the higher-order terms in the expansion (as they just make the sum smaller) to find a lower bound for $(1+\delta)\ln(1+\delta)$. Then, use some inequalities derived from the fact that $1 > \delta > 0$ to get the desired inequality.

1.2Another corollary¶

Also, for $b \geq 6\mu$, we have:

$$P(X > b) \leq 2^{-b}$$

We had to prove this for question 4(a) in assignment 4. Start from the standard Chernoff bound, with the assumption that $(1+\delta) \geq 6$. Furthermore, since $e < 3$, $2e < 6$ and so $1+\delta > 2e$. Also, $(1+\delta) > \delta > 0$. So we can write

$$\frac{e^\delta}{(1+\delta)^{(1+\delta)}} \leq \frac{e^{(1+\delta)}}{(1+\delta)^{(1+\delta)}} = \left ( \frac{e}{(1+\delta)} \right )^{(1+\delta)} \leq \left ( \frac{e}{2e} \right )^{(1+\delta)}$$

which is just $2^{-(1+\delta)}$, and if we raise both sides to the power of $\mu$, everything falls nicely into place.

1.3Examples¶

With $n$ balls and $n$ bins. Given that $\mu = 1$, what's the probability that $P(X > 2\log(n))$ (where $X$ is, I don't know, the number of balls in a bin? Or vice versa?)?

Well, when $2\log(n) > 6$, we can use the first corollary above, resulting in

$$P(X > 2 \log(n) \leq 2^{-2\log(n)} = \frac{1}{n^2} \tag{when 2\log(n) > 6}$$

Another example, with coin-tossing. Suppose we toss $n$ coins. What is the probability that we get more than $n/2 + 2 \sqrt{n}\sqrt{\log n}$ heads?

Well, the probability of getting heads is $p=\frac{1}{2}$, thus the expected value is $\mu = \frac{n}{2}$. Consequently, we can rewrite the probability we are interested in as follows:

\begin{align} P\left (X > \frac{n}{2} + 2 \sqrt{\log(n)}\sqrt{n} \right ) & = P \left ( X > \mu + \frac{4\mu}{\sqrt{n}} \sqrt{\log(n)} \right ) \tag{as 2\sqrt{n} = \frac{4\mu}{\sqrt{n}}} \\ & = P\left (X > \underbrace{ \left(1 + \frac{4\sqrt{\log(n)}}{\sqrt{n}} \right )}_{= \delta, \, 0 < \delta < 1} \mu \right ) \\ & \leq e^{-\frac{1}{3} \mu \cdot \delta^2} \tag{by the first corollary above} \\ & = e^{-\frac{1}{3} \cdot \frac{n}{2} \cdot 16 \cdot \frac{\log(n)}{n}} \\ & = e^{-\frac{8}{3} \log(n)} \\ & = n^{-8/3} \end{align}

2Statistical trials¶

Suppose we have $n$ patients ($p_1, p_2, \ldots, p_n$) in a medical trial, split into 2 groups (one a control group). Suppose each patient has $m$ boolean characteristics ($c_1, c_2, \ldots, c_m$), which we can represent in a table:

Patients $c_1$ $c_2$ $\ldots$ $c_m$
$p_1$ 1 0 $\ldots$ 0
$p_2$ 0 0 $\ldots$ 1
$\vdots$ $\vdots$ $\vdots$ $\ddots$ $\vdots$
$p_n$ 1 1 $\ldots$ 0

Clearly, when designing a statistical trial, we'd want the characteristics to be similar across groups. For instance, suppose $n_j < n$ people have characteristic $j$. We'd want there to be $\sim n_j/2$ people in the test group and the same number in the control group. Similarly, we'd want $(n-n_j)/2$ people without the characteristic in each of the test and the control groups.

Now, $\frac{n_j}/2 < \frac{n}{2}$ and $\frac{n-n_j}{2} < \frac{n}{2}$, so by the coin-tossing example from above, the probability that there is more than a $2\sqrt{n}\sqrt{\log n}$ discrepancy from the expectation is less than $2n^{-8/3}$ (twice the upper bound we found in the coin-tossing argument, though, presumably because we have to account for above $n/2$ and below $n/2$).

By Booles' inequality (union bound), if $m \leq n^2$, say, then the probability that any characteristic has a large deviation is $\leq 2n^{-8/3}$.

So, with high probability, any random sample is going to be balanced!

3Analysis of quicksort¶

Recall that quicksort works by randomly choosing a pivot. Knowing this, we can perform a probabilistic analysis of the quicksort algorithm, by representing it as a tree $T$ where nodes are labelled by the pivot elements.

3.1Number of comparisons¶

First claim: The number of comparisons made is equal to $\displaystyle \sum_{v \in T} \text{depth}(v)$.

Proof: the pivot $v$ is compared to $\text{depth}(v)$ pivot elements before it becomes the pivot itself. $\blacksquare$

Note that we define $\text{depth}(T) = \max_{v \in T} \text{depth}(v)$, as one would expect.

A trivial corollary is that the number of comparisons $\leq n \cdot \text{depth}(T)$.

3.2Average depth¶

So what is $\text{depth}(T)$ in the average case? For any arbitrary choice of $v$, consider the size of the groups containing $v$ as we go down the tree. For example, suppose that $v$ is in some group $X$ with probability $1/2$. The next pivot for the group is in the range $[\frac{1}{4}|X|, \frac{3}{4}|X|]$ (i.e., the middle half of the group). Call this a "good" pivot. If we get a good pivot, then the two new subgroups have size at most $\frac{3}{4}|X|$.

Thus, if we get $4\ln(n)$ good pivots on a path containing $V$, then the group size is at most

$$\left ( \frac{3}{4} \right)^{4\ln(n)} \cdot n = \frac{n}{\left ( \left (\frac{4}{3} \right )^4 \right )^{\ln n}} < 1$$

By then, you'd have reached a leaf in $T$. We claim that this happens with high probability before the depth is $32 \ln n$. _(Proof?)

Now, the probability that a pivot is bad is $\frac{1}{2}$. So $\mu = E(\text{number of bad pivots}) = 16\ln n$. If $G$ is the number of good pivots, and $B$ the number of bad pivots, then:

\begin{align} P(G < 4\ln n) & = P(B \geq 28\ln n) \tag{treating the 32 \ln n figure as fact} \\ & = P\left(B \geq \frac{7}{4} 16\ln n\right) \\ & = P\left(B \geq \frac{7}{4} \mu\right) \\ & = P\left(B \geq \left(1+\underbrace{\frac{3}{4}}_{=\delta}\right)\mu\right) \\ & < e^{-\frac{1}{3} \delta^2\mu} \tag{by the first Chernoff bound corollary} \\ & = e^{-\frac{1}{3} \cdot \frac{9}{16} \cdot 16 \ln n} \\ & = e^{-3\ln n} \\ & = n^{-3} \end{align}

There are $n$ numbers, so the total number of paths to leaves is $\leq n$. Thus, with probability of at least

$$1 - \frac{1}{n^2}$$

all the paths reach leaves by a depth of $32\ln n$.

3.2.1Upper bound on expected depth¶

$$E(\text{depth}) \leq \frac{(1-\frac{1}{n^2}) \cdot 32 \ln n}{1/n^2 \cdot n} \leq 32 \ln n + 1$$

(Not sure where this came from)