Wikinotes

Maintainer: admin

1 Chernoff bounds continued
2 Statistical trials
3 Analysis of quicksort
- 3.1 Number of comparisons
- 3.2 Average depth
  - 3.2.1 Upper bound on expected depth

Recall what we had from the end of last class:

Let $X_1, X_2, \ldots, X_k$ be independent Bernoulli trials such that $P(X_i = 1) = p_i$. If $\displaystyle X = \sum_i X_i$, then

$$P(X > (1+\delta)\mu) \leq \left ( \frac{e^\delta}{(1+\delta)^{(1+\delta)}} \right)^\mu$$

where $\displaystyle \mu = E(X) = \sum_i p_i$.
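As a quick sanity check, the bound can be compared against an exact tail probability. The sketch below assumes the simplest case, where all $p_i$ are equal, so that $X$ is binomial; the function names and the values of `n`, `p` and `delta` are chosen purely for illustration.

```python
import math

def chernoff_upper(mu: float, delta: float) -> float:
    """Chernoff bound on P(X > (1 + delta) * mu), for delta > 0."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def binomial_upper_tail(n: int, p: float, threshold: float) -> float:
    """Exact P(X > threshold) for X ~ Binomial(n, p)."""
    start = math.floor(threshold) + 1  # smallest integer strictly above threshold
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(start, n + 1))

# Illustrative values only: 100 fair coins, 20% above the mean.
n, p, delta = 100, 0.5, 0.2
mu = n * p
exact = binomial_upper_tail(n, p, (1 + delta) * mu)
bound = chernoff_upper(mu, delta)
print(f"exact tail = {exact:.4g}, Chernoff bound = {bound:.4g}")
```

The bound is far from tight for small $\delta$, but it always sits above the exact tail.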

Today we'll introduce some corollaries, which are simpler and often easier to use.

For $1 > \delta > 0$, we have that

$$P(X > (1+\delta)\mu) \leq e^{-\frac{1}{3} \mu \delta^2}$$

We had to prove this for question 4(b) in assignment 4. It's a somewhat long proof (at least if you write out all the steps), but the general idea is to take the Taylor series expansion of $\ln(1+\delta)$, multiply it by $(1+\delta)$, and drop some of the higher-order terms (dropping them only makes the sum smaller) to obtain a lower bound for $(1+\delta)\ln(1+\delta)$. Then, use some inequalities derived from the fact that $1 > \delta > 0$ to get the desired inequality.
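Taking logarithms, the corollary follows from the general bound exactly when $(1+\delta)\ln(1+\delta) - \delta \geq \delta^2/3$. The sketch below checks this inequality numerically on a grid of $\delta$ values in $(0, 1)$. It is only a sanity check, not a proof, and the helper name `exponent_gap` is made up for this example.

```python
import math

def exponent_gap(delta: float) -> float:
    """(1 + delta) * ln(1 + delta) - delta - delta^2 / 3.

    A nonnegative value means the general Chernoff bound is at most
    exp(-mu * delta^2 / 3) for this delta.
    """
    return (1 + delta) * math.log1p(delta) - delta - delta**2 / 3

deltas = [k / 1000 for k in range(1, 1000)]  # grid over (0, 1)
worst = min(exponent_gap(d) for d in deltas)
print(f"smallest gap on the grid: {worst:.3e}")
```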

Also, for $b \geq 6\mu$, we have:

$$P(X > b) \leq 2^{-b}$$

We had to prove this for question 4(a) in assignment 4. Start from the standard Chernoff bound and write $b = (1+\delta)\mu$, so that $b \geq 6\mu$ means $(1+\delta) \geq 6$. Furthermore, since $e < 3$, $2e < 6$ and so $1+\delta > 2e$. Also, $(1+\delta) > \delta > 0$, so $e^\delta < e^{(1+\delta)}$. So we can write

$$\frac{e^\delta}{(1+\delta)^{(1+\delta)}} \leq \frac{e^{(1+\delta)}}{(1+\delta)^{(1+\delta)}} = \left ( \frac{e}{(1+\delta)} \right )^{(1+\delta)} \leq \left ( \frac{e}{2e} \right )^{(1+\delta)}$$

which is just $2^{-(1+\delta)}$, and if we raise both sides to the power of $\mu$, we get $2^{-(1+\delta)\mu} = 2^{-b}$, as required.

Consider throwing $n$ balls into $n$ bins. Given that $\mu = 1$, what is $P(X > 2\log(n))$? (Here $X$ is presumably the number of balls in one fixed bin, which is consistent with $\mu = n \cdot \frac{1}{n} = 1$.)

Well, when $2\log(n) \geq 6 = 6\mu$, we can use the second corollary above (taking $\log$ to be base 2), resulting in

$$P(X > 2 \log(n)) \leq 2^{-2\log(n)} = \frac{1}{n^2} \tag{when $2\log(n) \geq 6$}$$
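The bound $P(X > b) \leq 2^{-b}$ can be checked against exact values in this example. The sketch below assumes that $X$ counts the balls landing in one fixed bin (so $X$ is binomial with parameters $n$ and $1/n$) and that $\log$ is base 2; both are assumptions about what the notes intend, and the function name is made up.

```python
import math

def tail_balls_in_bin(n: int) -> float:
    """Exact P(X > 2 * log2(n)), where X is the number of balls in one
    fixed bin after throwing n balls uniformly at random into n bins."""
    p = 1 / n
    start = math.floor(2 * math.log2(n)) + 1  # smallest count above 2*log2(n)
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(start, n + 1))

# n >= 8 ensures 2 * log2(n) >= 6, as the corollary requires.
for n in (8, 64, 512):
    print(n, tail_balls_in_bin(n), 1 / n**2)
```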

Another example, with coin-tossing. Suppose we toss $n$ coins. What is the probability that we get more than $n/2 + 2 \sqrt{n}\sqrt{\log n}$ heads?

Well, the probability of getting heads is $p=\frac{1}{2}$, thus the expected value is $\mu = \frac{n}{2}$. Consequently, we can rewrite the probability we are interested in as follows:

$$\begin{align} P\left (X > \frac{n}{2} + 2 \sqrt{\log(n)}\sqrt{n} \right ) & = P \left ( X > \mu + \frac{4\mu}{\sqrt{n}} \sqrt{\log(n)} \right ) \tag{as $2\sqrt{n} = \frac{4\mu}{\sqrt{n}}$} \\ & = P\left (X > \left(1 + \underbrace{\frac{4\sqrt{\log(n)}}{\sqrt{n}}}_{= \delta, \, 0 < \delta < 1} \right ) \mu \right ) \\ & \leq e^{-\frac{1}{3} \mu \cdot \delta^2} \tag{by the first corollary above} \\ & = e^{-\frac{1}{3} \cdot \frac{n}{2} \cdot 16 \cdot \frac{\log(n)}{n}} \\ & = e^{-\frac{8}{3} \log(n)} \\ & = n^{-8/3} \end{align}$$
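For a feel of how loose this is, the exact tail can be computed with integer arithmetic and compared to $n^{-8/3}$. The sketch assumes $\log$ is the natural logarithm (which the last step of the derivation needs) and uses values of $n$ large enough that $\delta < 1$; the function name is illustrative.

```python
import math

def coin_tail(n: int) -> float:
    """Exact P(X > n/2 + 2*sqrt(n * ln n)) for X = heads in n fair tosses."""
    threshold = n / 2 + 2 * math.sqrt(n * math.log(n))
    start = math.floor(threshold) + 1  # smallest head count above the threshold
    favourable = sum(math.comb(n, k) for k in range(start, n + 1))
    return favourable / 2**n  # exact integer ratio, rounded once to a float

for n in (100, 400, 1000):
    print(n, coin_tail(n), n ** (-8 / 3))
```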

Suppose we have $n$ patients ($p_1, p_2, \ldots, p_n$) in a medical trial, split into 2 groups (one a control group). Suppose each patient has $m$ boolean characteristics ($c_1, c_2, \ldots, c_m$), which we can represent in a table:

Patients	$c_1$	$c_2$	$\ldots$	$c_m$
$p_1$	1	0	$\ldots$	0
$p_2$	0	0	$\ldots$	1
$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$
$p_n$	1	1	$\ldots$	0

Clearly, when designing a statistical trial, we'd want the characteristics to be similar across groups. For instance, suppose $n_j < n$ people have characteristic $j$. We'd want there to be $\sim n_j/2$ people in the test group and the same number in the control group. Similarly, we'd want $(n-n_j)/2$ people without the characteristic in each of the test and the control groups.

Now, $\frac{n_j}{2} < \frac{n}{2}$ and $\frac{n-n_j}{2} < \frac{n}{2}$, so by the coin-tossing example from above, the probability that there is more than a $2\sqrt{n}\sqrt{\log n}$ discrepancy from the expectation is less than $2n^{-8/3}$. This is twice the upper bound from the coin-tossing argument, presumably because we have to account for deviations both above and below the expectation.

By Boole's inequality (the union bound), if $m \leq n^2$, say, then the probability that any characteristic has a large deviation is $\leq m \cdot 2n^{-8/3} \leq 2n^{-2/3}$.

So, with high probability, any random sample is going to be balanced!
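A small simulation makes this concrete. The sketch below rests on several assumptions that are not in the notes: each patient is assigned to the test group by an independent fair coin, each characteristic is present with probability $1/2$, and the values of `n` and `m` are arbitrary. It reports the largest deviation of the test-group count from $n_j/2$ over all characteristics.

```python
import math
import random

def max_discrepancy(n: int, m: int, rng: random.Random) -> float:
    """Largest |test-group count - n_j / 2| over m characteristics,
    for one random split of n patients (fair coin per patient)."""
    # Hypothetical data: each characteristic present with probability 1/2.
    traits = [[rng.random() < 0.5 for _ in range(m)] for _ in range(n)]
    in_test = [rng.random() < 0.5 for _ in range(n)]
    worst = 0.0
    for j in range(m):
        n_j = sum(row[j] for row in traits)
        test_j = sum(row[j] for row, t in zip(traits, in_test) if t)
        worst = max(worst, abs(test_j - n_j / 2))
    return worst

rng = random.Random(0)  # fixed seed so the run is reproducible
n, m = 400, 50
threshold = 2 * math.sqrt(n * math.log(n))
print("largest discrepancy:", max_discrepancy(n, m, rng))
print("threshold 2*sqrt(n ln n):", threshold)
```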

Recall that quicksort works by randomly choosing a pivot. Knowing this, we can perform a probabilistic analysis of the quicksort algorithm, by representing it as a tree $T$ where nodes are labelled by the pivot elements.

First claim: The number of comparisons made is equal to $\displaystyle \sum_{v \in T} \text{depth}(v)$.

Proof: the pivot $v$ is compared to $\text{depth}(v)$ pivot elements before it becomes the pivot itself. $\blacksquare$

Note that we define $\text{depth}(T) = \max_{v \in T} \text{depth}(v)$, as one would expect.

A trivial corollary is that the number of comparisons $\leq n \cdot \text{depth}(T)$.
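Both the claim and its corollary can be checked on a concrete run. The sketch below is one possible implementation of randomized quicksort on distinct keys, not necessarily the one assumed in class; it counts one comparison per non-pivot element in each group and takes the root to be at depth 0.

```python
import random

def quicksort_stats(items, rng, depth=0):
    """Return (comparisons, sum of pivot depths, tree depth) for
    randomized quicksort on a list of distinct items."""
    if not items:
        return 0, 0, 0
    pivot = rng.choice(items)
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    comparisons = len(items) - 1  # every other element meets the pivot once
    c1, s1, d1 = quicksort_stats(smaller, rng, depth + 1)
    c2, s2, d2 = quicksort_stats(larger, rng, depth + 1)
    return comparisons + c1 + c2, depth + s1 + s2, max(depth, d1, d2)

n = 200
comparisons, depth_sum, tree_depth = quicksort_stats(list(range(n)), random.Random(0))
print(comparisons, depth_sum, tree_depth)
```

The first two numbers printed should agree, and both should be at most `n * tree_depth`.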

So what is $\text{depth}(T)$ in the average case? For an arbitrary choice of $v$, consider the sizes of the groups containing $v$ as we go down the tree. For example, suppose that $v$ is in some group $X$. With probability $1/2$, the rank of the next pivot within the group lies in the range $[\frac{1}{4}|X|, \frac{3}{4}|X|]$ (i.e., the middle half of the group). Call this a "good" pivot. If we get a good pivot, then the two new subgroups have size at most $\frac{3}{4}|X|$.

Thus, if we get $4\ln(n)$ good pivots on a path containing $v$, then the group size is at most

$$\left ( \frac{3}{4} \right)^{4\ln(n)} \cdot n = \frac{n}{\left ( \left (\frac{4}{3} \right )^4 \right )^{\ln n}} < 1$$

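The last inequality holds because $(4/3)^4 \approx 3.16 > e$, so the denominator exceeds $e^{\ln n} = n$. By then, you'd have reached a leaf in $T$. We claim that, with high probability, this happens before the depth reaches $32 \ln n$; the proof is the Chernoff bound argument that follows.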
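The group-size bound after $4\ln(n)$ good pivots is easy to evaluate numerically. The sketch below just computes $(3/4)^{4\ln n} \cdot n$ for a few arbitrary values of $n$; the function name is made up.

```python
import math

def group_size_bound(n: int) -> float:
    """Upper bound on the group size after 4*ln(n) good pivots,
    starting from a group of n elements."""
    return (3 / 4) ** (4 * math.log(n)) * n

for n in (2, 10, 1000, 10**6):
    print(n, group_size_bound(n))
```

Each value is below 1, and the bound shrinks slowly as $n$ grows, since it equals $n^{1 - 4\ln(4/3)}$ with a slightly negative exponent.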

Now, the probability that a pivot is bad is $\frac{1}{2}$. So among the first $32 \ln n$ pivots on a path, $\mu = E(\text{number of bad pivots}) = 16\ln n$. If $G$ is the number of good pivots among these, and $B$ the number of bad pivots, then:

$$\begin{align} P(G < 4\ln n) & = P(B \geq 28\ln n) \tag{treating the $32 \ln n$ figure as fact} \\ & = P\left(B \geq \frac{7}{4} 16\ln n\right) \\ & = P\left(B \geq \frac{7}{4} \mu\right) \\ & = P\left(B \geq \left(1+\underbrace{\frac{3}{4}}_{=\delta}\right)\mu\right) \\ & < e^{-\frac{1}{3} \delta^2\mu} \tag{by the first Chernoff bound corollary} \\ & = e^{-\frac{1}{3} \cdot \frac{9}{16} \cdot 16 \ln n} \\ & = e^{-3\ln n} \\ & = n^{-3} \end{align}$$

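There are $n$ numbers, so the total number of paths to leaves is $\leq n$. By the union bound, the probability that some path has too few good pivots is at most $n \cdot n^{-3} = n^{-2}$. Thus, with probability of at least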

$$1 - \frac{1}{n^2}$$

all the paths reach leaves by a depth of $32\ln n$.

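$$E(\text{depth}) \leq \left(1-\frac{1}{n^2}\right) \cdot 32 \ln n + \frac{1}{n^2} \cdot n \leq 32 \ln n + 1$$

This comes from splitting the expectation into two cases. With probability at least $1 - \frac{1}{n^2}$, the depth is at most $32 \ln n$. In the remaining case, which has probability at most $\frac{1}{n^2}$, the depth is still at most $n$, since the tree has only $n$ nodes.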
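The $32 \ln n$ figure is generous, and a simulation shows how much slack there is. The sketch below tracks only group sizes rather than actual keys, which is enough to get the depth of the pivot tree; it assumes distinct keys and uniformly random pivots, takes the root to be at depth 0, and uses arbitrary values for `n` and `trials`.

```python
import math
import random

def quicksort_depth(n: int, rng: random.Random) -> int:
    """Depth of the pivot tree when randomized quicksort sorts n >= 1
    distinct keys. Only group sizes are tracked, not the keys."""
    depth = 0
    level = [n]  # sizes of the groups whose pivots sit at the current depth
    while True:
        next_level = []
        for size in level:
            rank = rng.randrange(size)  # pivot rank, uniform in 0..size-1
            for child in (rank, size - 1 - rank):
                if child > 0:
                    next_level.append(child)
        if not next_level:
            return depth
        level = next_level
        depth += 1

rng = random.Random(0)  # fixed seed so the run is reproducible
n, trials = 1000, 50
depths = [quicksort_depth(n, rng) for _ in range(trials)]
print("max depth:", max(depths), "mean depth:", sum(depths) / trials)
print("32 ln n  :", 32 * math.log(n))
```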

Thursday, March 13, 2014
Statistical trials, quicksort
