Course review CC-BY-NC

Maintainer: admin

Will eventually include a review of absolutely everything in the entire course. Sources: Professor Anderson's notes on WebCT, and the Mathematical Statistics with Applications textbook by Wackerly, Mendenhall, and Scheaffer.

Stuff to cover:

  • Lecture notes (everything except stuff marked with a †)
  • The following sections from Mathematical Statistics with Applications:
    • 6.7
    • 7.2
    • Chapter 8
    • 9.1-9.7 plus exercise 9.8
    • Chapter 10
    • 11.1-11.7
    • 13.1-13.6, 13.8-13.9
    • 14.1-14.4

Problems from the assignments will be covered in the HTSEFP, and things that we need to memorise can be found on the Formulas to memorise page.

Under construction

1Sampling distributions

Chapter 7 in the textbook, section 1 in Anderson's notes.

1.1The basic distributions

1.1.1The chi-square distribution

Density function

The density function of a random variable $X$ with a chi-square distribution with $n$ degrees of freedom ($X \sim \chi^2_n$, $n \geq 1$) is as follows:

$$f(x) = \begin{cases} \frac{1}{2^{n/2}\Gamma(n/2)} x^{(n/2)-1}e^{-x/2} & \text{if }x > 0, \\ 0 & \text{otherwise}\end{cases}$$

Graph

Chi-square distribution - probability density function

Properties

Incidentally, this is a special case of the gamma distribution, with $\alpha = n/2$ and $\beta = 2$. We can consequently derive the distribution's expected value, variance, and moment-generating function:

$$\begin{align*}E(X) & = n \\ Var(X) & = 2n \\ M_X(t) & = \frac{1}{(1-2t)^{n/2}}\end{align*}$$

Relation to other distributions

  • For large $n$, the central limit theorem lets us map a chi-square random variable $X$ with $n$ degrees of freedom to (approximately) the normal distribution with mean 0 and variance 1 ($N(0, 1)$), by the following formula:

$$\frac{X-n}{\sqrt{2n}}$$

  • The sum of independent chi-square random variables has a chi-square distribution. Its number of degrees of freedom is equal to the sum of the d.f. of the random variables. Or, in mathematical terms, if $X = X_1 + \ldots + X_m$ where the $X_i$ are independent chi-square random variables with $n_i$ degrees of freedom, then $X \sim \chi^2_n$ where $n = n_1 + \ldots + n_m$. Proof of this is accomplished by writing out the moment-generating functions. (Note that this also implies that if $X$ is the sum of two independent chi-square random variables, and we know the d.f. of $X$ and of one of them, then the d.f. of the other is given by the difference between the two.)
  • The sum of the squares of independent standard normal random variables has a chi-square distribution. Its number of degrees of freedom is equal to the number of random variables. Or, in mathematical terms, if $Y = X_1^2 + \ldots + X_n^2$ where $X_i \sim N(0, 1)$, then $Y \sim \chi^2_n$. Proof of this is accomplished by algebraically manipulating $P(Z^2 \leq w)$ until it becomes $2P(0 \leq Z \leq \sqrt{w})$ and then integrating the normal distribution's density function. (Both facts are checked numerically in the sketch below.)
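A quick numerical sanity check of those last two facts (a minimal sketch using numpy/scipy; the degrees of freedom, sample counts and seed are arbitrary choices, not from the notes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Sum of squares of n independent N(0,1) variables should be chi-square with n d.f.
n = 5
z = rng.standard_normal((200_000, n))
y = (z ** 2).sum(axis=1)
print(y.mean(), y.var())  # should be close to n and 2n

# Sum of independent chi-squares: the degrees of freedom add up
a = rng.chisquare(df=3, size=200_000)
b = rng.chisquare(df=4, size=200_000)
s = a + b
print(np.quantile(s, [0.25, 0.5, 0.75]))        # empirical quartiles of the sum
print(stats.chi2(df=7).ppf([0.25, 0.5, 0.75]))  # quartiles of chi-square with 3 + 4 = 7 d.f.
```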

Critical value

$$P(X > \chi_{\alpha, n}^2) = \alpha$$

Insert diagram later

1.1.2The t-distribution

Density function

The density function of a random variable $T$ with a $t$-distribution with $n$ degrees of freedom ($n \geq 1, T \sim t_n$) is given by:

$$f(t) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{\pi n}\Gamma(\frac{n}{2})}\left ( 1 + \frac{t^2}{n} \right )^{-\frac{n+1}{2}}$$

Graph

t-distribution - probability density function

Notice that as $n \to \infty$, the graph looks more and more like the normal distribution (the one in black).

Properties

None at the moment

Relation to other distributions

If $X \sim N(0,1)$ and $Y \sim \chi^2_n$ are independent, then

$$T = \frac{X}{\sqrt{Y/n}}$$

gives us a $t$ distribution with $n$ degrees of freedom (the same $n$ as the $\chi^2$ distribution).

Critical value

$$P(T > t_{\alpha, n}) = \alpha$$

Insert diagram later

1.1.3The F-distribution

Density function

The density function of a random variable $Y$ with an $F$-distribution with $n_1$ numerator and $n_2$ denominator degrees of freedom ($n_1, n_2 \geq 1, Y \sim F_{n_1, n_2}$) is given by:

$$g(y) = \begin{cases} \frac{\Gamma(\frac{n_1+n_2}{2})}{\Gamma(\frac{n_1}{2})\Gamma(\frac{n_2}{2})} \left ( \frac{n_1}{n_2} \right )^{n_1/2} y^{n_1/2 - 1} \left ( 1 + \frac{n_1}{n_2} y \right )^{-\frac{n_1+n_2}{2}}, & \text{if } y > 0, \\ 0, & \text{if } y \leq 0 \end{cases}$$

Graph

F-distribution - probability density function

Properties

  • The reciprocal of an $F$-distribution with $n_1$ numerator and $n_2$ denominator df is also an $F$ distribution, but with $n_1$ denominator and $n_2$ numerator d.f. Mathematically: if $Y \sim F_{n_1,n_2}$, then $1 / Y \sim F_{n_2, n_1}$.
  • The reciprocal of the critical value is also the critical value of what? Come back to this later

Relation to other distributions

If $X_1 \sim \chi^2_{n_1}$ and $X_2 \sim \chi^2_{n_2}$ are independent, then

$$Y = \frac{X_1/n_1}{X_2/n_2}$$

gives us an $F$-distribution with $n_1$ numerator and $n_2$ denominator degrees of freedom.

Critical value

$$P(Y > F_{\alpha, n_1,n_2}) = \alpha$$

Draw diagram later

1.1.4Distribution summary table

Below is an overview of the above distributions (and the normal distribution), presented in tabular form:

Distribution Formula d.f. Properties Relation to others
Normal - $N(\mu,\sigma^2)$ $\displaystyle \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left (\frac{x-\mu}{\sigma} \right)^2}$ N/A $E(X) = \mu$, $Var(X) = \sigma^2$, $M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2t^2}$ N/A
Chi-square - $\chi^2_n$ $\displaystyle\frac{1}{2^{n/2}\Gamma(n/2)}x^{n/2-1}e^{-x/2}$, $x > 0$ $n$ $E(X) = n$, $Var(X) = 2n$, $M_X(t) = \frac{1}{(1-2t)^{n/2}}$ Special case of gamma distribution ($\alpha = n/2$, $\beta = 2$). Mapped to $N(0,1)$ by $\frac{X-n}{\sqrt{2n}}$. Sum of $\chi^2$ distributions is a $\chi^2$ distribution (d.f.: sum of d.f.s). Sum of square of normal distributions is a $\chi^2$ distribution (d.f.: number of random variables).
t - $t_n$ $\displaystyle \frac{\Gamma(\frac{n+1}{2})}{\sqrt{\pi n} \Gamma(\frac{n}{2})} \left ( 1+ \frac{t^2}{n} \right )^{-\frac{n+1}{2}}$ $n$ As $n \to \infty$, looks more like $N(0, 1)$ $\displaystyle T = \frac{X}{\sqrt{Y/n}}$ where $X \sim N(0,1)$ and $Y \sim \chi^2_n$ ($T \sim t_n$)
F - $F_{n_1, n_2}$ $\displaystyle \frac{\Gamma(\frac{n_1+n_2}{2})}{\Gamma(\frac{n_1}{2})\Gamma(\frac{n_2}{2})} \left ( \frac{n_1}{n_2} \right )^{n_1/2} y^{n_1/2 -1} \left ( 1 + \frac{n_1}{n_2} y \right )^{-\frac{n_1+n_2}{2}}$, $y > 0$ $n_1$ numerator, $n_2$ denominator Reciprocal is also $F$, reverse the d.f. $\displaystyle Y = \frac{X_1/n_1}{X_2/n_2}$ where $X_1 \sim \chi^2_{n_1}$, $X_2 \sim \chi^2_{n_2}$ ($Y \sim F_{n_1, n_2}$)
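The critical values that keep coming up below (normally read off the tables at the back of the textbook) can also be computed with scipy's inverse-CDF (`ppf`) functions. A minimal sketch; the $\alpha$ and the degrees of freedom are arbitrary choices:

```python
from scipy import stats

alpha = 0.05

# chi^2_{alpha, n}: the point with upper-tail area alpha
print(stats.chi2(df=4).ppf(1 - alpha))   # ~ 9.488, the table value for 4 d.f.

# t_{alpha, n}
print(stats.t(df=4).ppf(1 - alpha))      # ~ 2.132, the table value for 4 d.f.

# F_{alpha, n1, n2} (n1 numerator, n2 denominator d.f.)
print(stats.f(dfn=3, dfd=16).ppf(1 - alpha))

# Reciprocal property: F_{1-alpha, n2, n1} = 1 / F_{alpha, n1, n2}
print(stats.f(dfn=16, dfd=3).ppf(alpha))
print(1 / stats.f(dfn=3, dfd=16).ppf(1 - alpha))
```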

1.2Applications of some distributions

First, we make note of the following lemma:

Let ${x_1, \ldots, x_n}$ and ${y_1, \ldots, y_n}$ be two sets of $n$ numbers each, and let $\bar x$ and $\bar y$ be the means of each respective set. Then

$$ \sum_{i=1}^n (x_i - c) (y_i - d) = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) + n(\bar x - c)(\bar y - d)$$

for any numbers $c$ and $d$. If the two sets are the same ($y_i = x_i$) and $c = d = \mu$, then we have the following special case:

$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar x)^2 + n(\bar x - \mu)^2$$

The above can be proved using simple algebraic manipulation (expand both sides and compare).

1.2.1Sample mean and variance

Let's say we have $n$ independent random variables - $X_1, \ldots, X_n$ - all having the same distribution, with a mean of $\mu$ and a variance of $\sigma^2$. We can define the terms sample mean and sample variance as follows:

$$\text{Sample mean } = \overline X = \frac{X_1 + \ldots + X_n}{n}, \quad \text{Sample variance } = s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline X)^2$$

Now, these metrics possess some interesting properties that are independent of the specific distribution of the random variables:

1.2.1.1Expected value of the sample mean

$$E(\overline X) = E\left ( \frac{X_1 + \ldots + X_n}{n} \right ) = \frac{1}{n} E\left( \sum_{i=1}^n X_i \right) = \frac{1}{n} \sum_{i=1}^n E(X_i) = \frac{1}{n}\sum_{i=1}^n \mu = \frac{\mu n}{n} = \mu$$

where $E(X_i) = \mu$ comes from the fact that each random variable has a distribution with a mean of $\mu$.

1.2.1.2Variance of the sample mean

$$Var(\overline X) = Var\left ( \frac{X_1 + \ldots + X_n}{n} \right ) = \frac{1}{n^2} Var\left( \sum_{i=1}^n X_i \right) = \frac{1}{n^2}\sum_{i=1}^n Var(X_i) = \frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{\sigma^2 n}{n^2} = \frac{\sigma^2}{n}$$

where $Var(X_i) = \sigma^2$ comes from the fact that each random variable has a distribution with a variance of $\sigma^2$.

1.2.1.3Relation to the normal distribution

If the samples all have a normal distribution, then $\overline X \sim N(\mu, \sigma^2/n)$. This can be proven using moment-generating functions. (Later.)

If they have some other distribution (an $F$ distribution, say), then this is only approximately true when $n$ is large, by the central limit theorem.

1.2.1.4Expected value of the sample variance

First, recall that $Var(X) = E((X - \mu)^2)$. Then, rewriting the lemma above so that the sum we need is on the left, we have:

$$\sum_{i=1}^n (X_i-\overline X)^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\overline X - \mu)^2$$

(Note the switched order.) Then, we can derive the expected value of the sample variance as follows:

$$\begin{align*}E(s^2) & = E \left (\frac{1}{n-1} \sum_{i=1}^n (X_i - \overline X)^2 \right ) = \frac{1}{n-1} E\left (\sum_{i=1}^n (X_i - \overline X)^2 \right) = \frac{1}{n-1} E\left ( \sum_{i=1}^n (X_i - \mu)^2 - n(\overline X - \mu)^2 \right ) \\ & = \frac{1}{n-1} \left ( \sum_{i=1}^n E((X_i - \mu)^2) - nE((\overline X- \mu)^2) \right ) = \frac{1}{n-1} \left ( \sum_{i=1}^n Var(X_i) - n Var(\overline X) \right ) = \frac{1}{n-1} \left ( n \sigma^2 - n \cdot \frac{\sigma^2}{n} \right ) \\ & = \frac{1}{n-1} (n\sigma^2 - \sigma^2) = \frac{(n-1)\sigma^2}{n-1} \\ & = \sigma^2 \end{align*}$$

Which is, well, kind of beautiful imo.
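A small simulation of the results above ($E(\overline X) = \mu$, $Var(\overline X) = \sigma^2/n$, $E(s^2) = \sigma^2$); this is a sketch only, and the exponential distribution, sample size and seed are arbitrary choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 2.0, 10, 200_000            # exponential with mean 2, so sigma^2 = 4

samples = rng.exponential(scale=mu, size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)          # the 1/(n-1) sample variance

print(xbar.mean())   # ~ mu             (E(Xbar) = mu = 2)
print(xbar.var())    # ~ sigma^2 / n    (Var(Xbar) = 4/10 = 0.4)
print(s2.mean())     # ~ sigma^2        (E(s^2) = 4)
```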

1.2.1.5Correlation between the sample mean and the individual samples

Recall that the formula for covariance is:

$$Cov(X, Y) = E((X - E(X))(Y-E(Y)))$$

and that the following properties hold:

$$Cov(X, Y - Z) = Cov(X, Y) - Cov(X, Z) \quad Cov(X, X) = Var(X)$$

We can then show that the sample mean, $\overline X$, and the difference between the sample mean and an individual sample, $X_i - \overline X$, are uncorrelated:

$$\begin{align*}Cov(\overline X, X_i - \overline X) & = Cov(\overline X, X_i) - Cov(\overline X, \overline X) = Cov(\overline X, X_i) - Var(\overline X) = Cov\left (\frac{X_1 + \ldots + X_n}{n}, X_i \right) - \frac{\sigma^2}{n} \\ & = \frac{1}{n} Cov \left (\sum_{j=1}^n X_j , X_i \right ) - \frac{\sigma^2}{n} = \frac{Cov(X_i, X_i)}{n} - \frac{\sigma^2}{n} = \frac{Var(X_i)}{n} - \frac{\sigma^2}{n} = \frac{\sigma^2}{n} - \frac{\sigma^2}{n} \\ & = 0\end{align*}$$

(We use the fact that the samples are independent to deduce that $Cov(X_i, X_j) = 0$ when $i \neq j$, and so can simplify $\displaystyle Cov \left (\sum_{j=1}^n X_j , X_i \right )$ to $Cov(X_i, X_i)$.) This is also kind of pretty.

1.2.1.6Independence of the sample mean and sample variance

The fact that $\overline X$ and $X_i - \overline X$ are uncorrelated implies here that they are also independent. That's not true in general, but when the samples are normal, $\overline X$ and $X_i - \overline X$ are jointly (bivariate) normal (both are linear combinations of the $X_j$'s), and for jointly normal random variables uncorrelated does imply independent. And if those two are independent, then so are $\overline X$ and $(X_i - \overline X)^2$, and so $\overline X$ and $s^2$ are independent.

1.2.1.7Relation to the chi square distribution

If the random samples come from a normal distribution, then the sample variance relates to the chi-square distribution as follows:

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

We can prove this using the previously used lemma. First, we rewrite the above statistic by substituting in the equation for $s^2$:

$$\frac{(n-1)s^2}{\sigma^2} = \frac{(n-1)}{\sigma^2} \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline X)^2 = \frac{\displaystyle \sum (X_i - \overline X)^2}{\sigma^2}$$

Next, let's divide each side of the equals sign in the lemma by $\sigma^2$:

$$\frac{\displaystyle \sum (X_i-\overline X)^2}{\sigma^2} = \frac{\displaystyle \sum (X_i - \mu)^2}{\sigma^2} - \frac{n(\overline X - \mu)^2}{\sigma^2}$$
$$\therefore \frac{\displaystyle \sum (X_i - \overline X)^2}{\sigma^2} = \underbrace{\sum_{i=1}^n \left ( \frac{X_i - \mu}{\sigma} \right )^2}_{\chi^2_n} - \underbrace{\left ( \frac{\overline X- \mu}{\sigma/\sqrt{n}} \right )^2}_{\chi^2_1}$$

and since the term on the left is equivalent to the statistic we're looking at, it has a $\chi^2$ distribution with $n-1$ degrees of freedom (see the "relation to other distributions" section of the chi-square distribution).

(We know that the two terms on the right have $\chi^2$ distributions because they're squares of standard normals: each $\frac{X_i - \mu}{\sigma} \sim N(0, 1)$, so the first term is the sum of $n$ squared independent standard normals, hence $\chi^2_n$; and $\frac{\overline X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$, so the second term is the square of a single standard normal, hence $\chi^2_1$.)

1.2.1.8Relation to the t distribution

$$\frac{\overline X - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

This follows from the relation between the $t$ distribution and the $\chi^2$ distribution: we can rewrite the statistic as

$$\frac{\overline X - \mu}{s/\sqrt{n}} = \frac{(\overline X - \mu)/(\sigma/\sqrt{n})}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}}$$

which has the form $\displaystyle \frac{X}{\sqrt{Y/(n-1)}}$ with $X = \frac{\overline X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$ and $Y = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$ independent (by the previous sections), which is exactly the $t$-distribution construction from section 1.1.2, with $n-1$ degrees of freedom.
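Both of these sampling-distribution facts can be checked by simulation. A minimal sketch (the $\mu$, $\sigma$, $n$ and seed are arbitrary choices): draw many normal samples, compute $(n-1)s^2/\sigma^2$ and $(\overline X - \mu)/(s/\sqrt{n})$ for each, and compare their quantiles with the $\chi^2_{n-1}$ and $t_{n-1}$ quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 8, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)

u = (n - 1) * s**2 / sigma**2          # should follow chi-square with n-1 d.f.
t = (xbar - mu) / (s / np.sqrt(n))     # should follow t with n-1 d.f.

print(np.quantile(u, [0.5, 0.95]), stats.chi2(df=n - 1).ppf([0.5, 0.95]))
print(np.quantile(t, [0.5, 0.95]), stats.t(df=n - 1).ppf([0.5, 0.95]))
```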

1.3Order statistics

The $k$th order statistic ($Y_k$) is the $k$th smallest value among the random sample $X_1, \ldots, X_n$ from a continuous distribution. (Here $\omega$ denotes an outcome in the sample space; each $X_i$ is a function of $\omega$, so the ordering can change from outcome to outcome.) For example, if $X_1(\omega) = 5$, $X_2(\omega) = 10$, $X_3(\omega) = 1$ then $Y_1(\omega) = 1$, $Y_2(\omega) = 5$ and $Y_3(\omega) = 10$.

Density function

$$g_k(y) = \frac{n!}{(k-1)!(n-k)!} (F(y))^{k-1} f(y) [1-F(y)]^{n-k}$$

where $F(y)$ is the distribution function, $f(y)$ is the density function and $g_k(y)$ is the density function for the $k$th order statistic ($Y_k$). This is apparently proved using the trinomial distribution and taking the limit of the probability density as $h \to 0$.
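To see the formula in action, here is a sketch (my own illustration, not from the notes) that checks it for the uniform distribution on $(0, 1)$, where $F(y) = y$ and $f(y) = 1$, using the 2nd smallest of 5 observations:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
n, k = 5, 2                      # 2nd order statistic out of 5 uniforms on (0, 1)

samples = np.sort(rng.uniform(size=(200_000, n)), axis=1)
yk = samples[:, k - 1]           # simulated values of Y_k

# Density from the formula, with F(y) = y and f(y) = 1 on (0, 1)
def g_k(y):
    c = factorial(n) / (factorial(k - 1) * factorial(n - k))
    return c * y**(k - 1) * (1 - y)**(n - k)

# Compare a histogram of the simulated order statistic with the formula
hist, edges = np.histogram(yk, bins=20, range=(0, 1), density=True)
mid = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - g_k(mid))))   # should be small (simulation/binning error only)
```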

1.3.1Sample median

Basically the value that is smack in the middle. So if there are 5 values then $Y_3$, if there are 6 then the mean of $Y_3$ and $Y_4$. Not going to define it mathematically because it's kind of obvious.

Density function

For an odd sample size of $2m+1$, the sample median is the $(m+1)$th order statistic, $Y_{m+1}$, so its density function is:

$$g_{m+1}(y) = \frac{(2m+1)!}{m!\,m!} (F(y))^m f(y) (1-F(y))^m$$

1.3.2Expected values and variance

To find the expected value of an order statistic, compute $E(Y_k) = \int y\, g_k(y)\,dy$. Remember that $f(y)$, the density function, is the derivative of the distribution function $F(y)$.

2Estimation

Given some value $x$ from a distribution $X$ which depends on a parameter $\theta$, there are two main methods by which we can draw some conclusions about $\theta$:

Estimation
guessing $\theta$
Hypothesis testing
we use $x$ to help us decide between two hypotheses about $\theta$ (either it's one thing, or it's not)
Statistic
A function of $X$; if used to estimate $\theta$, it is called an estimator ($\hat \theta$), and its value is called the estimate

Usually, $X$ will be a random sample ($X_1, \ldots, X_n$) and $\theta$ will be either a scalar value or a vector (e.g. $(\mu, \sigma^2)$).

Some commonly used estimators of $\mu$ and $\sigma^2$:

$$\hat \mu = \overline X = \frac{1}{n} \sum_{i=1}^n X_i, \quad \hat \sigma^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline X)^2$$

2.1Methods of estimation

In this section: how to derive a suitable estimator for some given parameter.

2.1.1Maximum likelihood estimation

Use the likelihood function $L(\theta)$, and choose the value of $\theta$ for which $L(\theta)$ is at a maximum (can be found by taking the log and determining where the derivative is equal to 0).

Note that for the normal distribution, the MLEs for $\mu$ and $\sigma^2$ are just the sample mean and variance, respectively.

Properties

  • MLEs are not unique
  • May not always exist, and may not be found by differentiation (for example, for the uniform distribution on $(0, \theta)$ the log likelihood function has a derivative of $\displaystyle -\frac{n}{\theta}$, and so there is no value of $\theta$ at which the derivative is 0, since we know that $n > 0$; the likelihood is largest at the boundary, at $\theta = \max_i x_i$)
  • Invariance: if $\hat \theta$ is an MLE of $\theta$, then $f(\hat \theta)$ is an MLE of $f(\theta)$ where $f$ is some function

Examples

When $\theta = 0$, $x$ is always 2. When $\theta = 1$, $x$ is sometimes 1 but usually 2. If we observe that $x = 1$, then the MLE of $\theta$ is 1; if we observe that $x=2$, then the MLE of $\theta$ is 0.

Another example (from the lecture notes)

  • When $\theta = 0$, $P(X = 1) = 0.1, P(X = 2) = 0.4, P(X = 3) = 0.5$
  • When $\theta = 1/4$, $P(X = 1) = 0.2, P(X = 2) = 0.4, P(X=3) = 0.4$
  • When $\theta = 1/2$, $P(X=1) = 0, P(X=2) = 0.3, P(X=3) =0.7$

Consequently, when $x = 1$, the MLE of $\theta$ is $1/4$. When $x = 2$, the MLE of $\theta$ is either $0$ or $1/4$. When $x = 3$, the MLE of $\theta$ is $1/2$.

A solved example from the lecture notes using a continuous likelihood function, where the density function is given by

$$g_{\theta}(t) = \frac{1}{\theta} e^{-t/\theta} \quad \text{when } t > 0; 0 \text{ otherwise}$$

and we want to estimate $\theta$. First, we find the likelihood function:

$$L(\theta) = f_{\theta}(t_1, \ldots, t_n) = g_{\theta}(t_1) \times \ldots \times g_{\theta}(t_n) = \frac{1}{\theta^n}e^{-n\overline t/\theta}$$

when all $t_i > 0$ and 0 otherwise. We then take the logarithm of the above to get the log likelihood function:

$$\mathcal{L}(\theta) = \log L(\theta) = \log \left ( \frac{1}{\theta^n}e^{-n\overline t/\theta} \right ) = -\frac{n \overline t}{\theta} - n\log \theta$$

If we differentiate this with respect to $\theta$, we get

$$\mathcal{L}'(\theta) = \frac{n\overline t}{\theta^2} - \frac{n}{\theta} = \frac{n}{\theta} \left ( \frac{\overline t}{\theta} - 1 \right)$$

We see that if $\overline t = \theta$, then the derivative becomes 0, indicating that the MLE is $\hat \theta = \overline t$.
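A quick numerical check of this example (a sketch; the true $\theta$, sample size and seed are arbitrary choices): maximise the log likelihood numerically and confirm that the maximiser agrees with $\overline t$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
t = rng.exponential(scale=3.0, size=50)     # data with true theta = 3 (arbitrary)

def neg_log_lik(theta):
    # -log L(theta) = n log(theta) + (sum of t_i) / theta
    return len(t) * np.log(theta) + t.sum() / theta

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100), method="bounded")
print(res.x, t.mean())    # the numerical maximiser should match the sample mean
```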

Similarly, if the likelihood function is

$$L(\theta) = \frac{n}{\theta} e^{-ny/\theta}$$

(when $y > 0$) then the derivative of the log likelihood function is

$$\frac{d}{d\theta} \left ( \log \left ( \frac{n}{\theta} e^{-ny/\theta} \right ) \right ) = \frac{d}{d\theta} \left ( \log(n) - \log(\theta) - \frac{ny}{\theta} \right ) = -\frac{1}{\theta} + \frac{ny}{\theta^2} = \frac{1}{\theta} \left ( \frac{ny}{\theta} - 1 \right )$$

so when $ny = \theta$, then the derivative is 0 and so the MLE of $\theta$ is $ny$.

Another example from the lecture notes, using the binomial distribution, whose likelihood function is given by

$$L(\theta) = \binom{n}{k} \theta^k(1-\theta)^{n-k}$$

where $\theta$ is what we usually call $p$ (i.e. the probability of "success"). We can find the MLE of $\theta$ by differentiating the log likelihood function:

$$\frac{d}{d\theta} (\log L) = \frac{d}{d\theta} \left ( \log \binom{n}{k} + k\log \theta + (n-k)\log(1-\theta)\right ) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = \frac{k(1-\theta) - \theta(n -k)}{\theta(1-\theta)} = \frac{k - n\theta}{\theta(1-\theta)}$$

we see that if $\displaystyle \theta = \frac{k}{n}$, then the derivative is 0 and so the MLE of $\theta$ is $\displaystyle \frac{k}{n}$.

2.1.2Method of moments

Based on the strong law of large numbers:

$$\frac{1}{n} \sum_{i=1}^n X_i^k \to \mu'_k \,\,\text{as } n \to \infty$$

where $X$ is a random sample and $\mu_k'$ is the $k$th moment ($E(X^k)$). So if $n$ is large enough, we have that

$$m_k = \frac{1}{n} \sum_{i=1}^n X_i^k \approx \mu_k'$$

If we set them to be equal, then we can solve the equations for the first few moments until we are able to identify $\theta$ (or whatever the parameters are).

Examples

For the exponential distribution with mean $\theta$: $\mu_1' = \theta$, $m_1 = \overline X$ so $\hat \theta = \overline X$.

For the Bernoulli distribution with $P(X=0) = 1-\theta$ and $P(X=1) = \theta$: $m_1 = \frac{1}{n} \sum_{i=1}^n X_i = \overline X$ and $\mu_1' = \theta$, so again we have $\hat \theta = \overline X$.

For the gamma distribution with parameters $\alpha$ and $\beta$, recall that the probability density function is given by

$$f(y) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} y^{\alpha -1}e^{-y/\beta}, \quad y > 0$$

So $\mu_1' = E(X) = \alpha\beta$ and $\mu_2' = E(X^2) = E(X^2) - (E(X))^2 + (E(X))^2 = Var(X) + (E(X))^2 = \alpha\beta^2 + (\alpha\beta)^2 = \alpha\beta^2 + \alpha^2\beta^2 = \alpha\beta^2(1+\alpha)$.

We don't know the true moments, so we equate them to the sample moments: $\mu_1' = m_1 = \alpha\beta$ and $\mu_2' = m_2 = \alpha\beta^2(1+\alpha)$. Solving, we obtain the following estimators for $\alpha$ and $\beta$:

$$\hat \alpha = \frac{m_1^2}{m_2 - m_1^2} \quad \hat \beta = \frac{m_2 - m_1^2}{m_1}$$
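A sketch of the method on simulated data (the true parameter values, sample size and seed are arbitrary choices): estimate $\alpha$ and $\beta$ from the first two sample moments using the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0                      # "true" parameters, chosen arbitrarily
x = rng.gamma(shape=alpha, scale=beta, size=100_000)

m1 = x.mean()                               # first sample moment
m2 = (x**2).mean()                          # second sample moment

alpha_hat = m1**2 / (m2 - m1**2)
beta_hat = (m2 - m1**2) / m1
print(alpha_hat, beta_hat)                  # should be close to (3, 2)
```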

2.1.3Bayesian estimation

Chapter 16 in the textbook.

In this case, $\theta$ is a random variable with a distribution called the prior distribution. This essentially means that we have some information about $\theta$ (perhaps we believe it to be around 0.50, or 0.01, or something) but there is some uncertainty, which we handle by treating $\theta$ as a random variable. For example, if we believe $\theta$ to be about 0.10, we may use the beta distribution as a prior, with $\alpha = 1$, $\beta = 9$ (as the mean of the beta distribution is given by $\mu = \alpha/(\alpha+\beta)$). The density function for $\theta$ would then just be the beta distribution with $\alpha = 1$, $\beta=9$.

We then define a "loss" function $L(\theta, \hat \theta)$ which gives the loss/penalty when the true value of the parameter is $\theta$ and our estimate is $\hat \theta$. Examples of loss functions: taking the difference between the two and squaring it (the squared loss function), taking the absolute value of the difference between the two.

Consequently, to estimate $\theta$, we minimise the expected value of the loss function by finding the optimal estimator, $\hat \theta = t(X)$. For example, for the squared loss function, the Bayes estimator for $\theta$ is given by

$$\hat \theta = t(x) = E(\theta|X = x)$$

that is, the expected value of $\theta$, given that $X = x$. This conditional probability, called the posterior density function, is calculated as follows

$$\phi(\theta|x) = \frac{f_{\theta}(x)\, h(\theta)}{g(x)}$$

where $h(\theta)$ is the prior density/probability function of $\theta$, $f_{\theta}(x)$ is the conditional density/probability function $f(x|\theta)$, and $g(x)$ is the marginal density/probability function of $X$ (the denominator just normalises the posterior).

The textbook's definition is slightly different (though it's the same idea). If $L$ is the likelihood function and $g$ is the prior density function, then the posterior density of $\theta | y_1, \ldots, y_n$ is given by

$$g^*(\theta | y_1,\ldots y_n) = \frac{L(y_1, \ldots, y_n | \theta) \times g(\theta)}{\displaystyle \int_{-\infty}^{\infty} L(y_1,\ldots,y_n |\theta) \times g(\theta)\,d\theta}$$

Usually, a Bayesian estimation question will involve finding the posterior distribution for $\theta$ given the prior distribution.
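As an illustration of the posterior formula (a standard conjugate example sketched here, not taken from the notes): with the Beta(1, 9) prior mentioned above and a binomial likelihood, the posterior is again a beta distribution, Beta($1+x$, $9+n-x$), and under squared loss the Bayes estimate is the posterior mean. The data below ($x = 4$ successes in $n = 20$ trials) is made up for the illustration.

```python
import numpy as np
from scipy import stats

a, b = 1, 9          # beta prior, prior mean = a/(a+b) = 0.1
n, x = 20, 4         # hypothetical data: 4 successes in 20 trials

# Posterior on a grid, via  posterior ~ likelihood * prior
theta = np.linspace(1e-6, 1 - 1e-6, 10_001)
unnorm = stats.binom.pmf(x, n, theta) * stats.beta.pdf(theta, a, b)
dtheta = theta[1] - theta[0]
post = unnorm / (unnorm.sum() * dtheta)

print((theta * post).sum() * dtheta)   # posterior mean, computed numerically
print((a + x) / (a + b + n))           # conjugate answer: mean of Beta(a+x, b+n-x)
```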

2.2Properties of estimators

Four important ways of measuring "goodness" of estimators - unbiasedness, efficiency, consistency, sufficiency.

2.2.1Unbiasedness and efficiency

Unbiased: if the expected value of $\hat \theta = \theta$. The bias is given by $B(\hat \theta) = E(\hat \theta) - \theta$. The mean square error is given by $MSE(\hat \theta) = E(\hat \theta - \theta)^2$, which we can rewrite in terms of the variance and bias of $\hat \theta$:

$$\begin{align*} MSE(\hat \theta) & = E((\hat \theta - \theta)^2) = \underbrace{E(((\hat \theta - E(\hat \theta)) + (E(\hat \theta) - \theta))^2)}_{\text{adding and subtracting } E(\hat \theta)} = E\left((\hat \theta - E(\hat \theta))^2 + 2(\hat \theta - E(\hat \theta))\underbrace{(E(\hat \theta) - \theta)}_{B(\hat\theta)} + \underbrace{(E(\hat \theta) - \theta)^2}_{B(\hat\theta)^2}\right) \\ & = \underbrace{E((\hat \theta - E(\hat \theta))^2)}_{Var(\hat \theta)} + 2\underbrace{E(\hat \theta - E(\hat \theta))}_{E(\hat \theta) - E(\hat \theta) = 0}B(\hat \theta) + (B(\hat \theta))^2 \\ & = Var(\hat \theta) + (B(\hat \theta))^2 \end{align*}$$

The standard error is the standard deviation of $\hat \theta$, that is, the square root of its variance.

The relative efficiency of $\hat \theta_1$ with respect to $\hat \theta_2$ is given by

$$\frac{E(\hat \theta_2 - \theta)^2}{E(\hat \theta_1 - \theta)^2}$$

(Basically the ratio of their MSEs - a lower MSE means the estimator is more efficient.) The higher this value is, the more efficient $\hat \theta_1$ is relative to $\hat \theta_2$. If both are unbiased, then their relative efficiency is just the ratio of their variances.

A minimum variance unbiased estimator is the unbiased estimator with the lowest variance. Note that there may be more than one, or there may be no unbiased estimator with minimum variance (not sure when this would happen?).

2.2.1.1Cramer-Rao inequality

Let $\hat \theta = t(X)$ be an estimator of $\theta$, and let $f_{\theta}(x)$ be the likelihood function. Then we have

$$Var(\hat \theta) \geq \frac{|\phi'(\theta)|^2}{\displaystyle E\left [ \left (\frac{\partial \log (f_{\theta}(X))}{\partial \theta} \right )^2 \right ]}$$

where $\displaystyle \phi(\theta) = E(t(X)) = \int t(x) f_{\theta}(x)\,dx$ is the expected value of the estimator, as a function of $\theta$ (if the estimator is unbiased, this is equal to $\theta$, and so $\phi'(\theta) = 1$).

If $X$ is a random sample $X_1, \ldots, X_n$ from a distribution with probability or density function $f_{\theta}(x)$, then the denominator (which is incidentally called the Fisher information of the observation $X$) is multiplied by $n$ and the $X$ is replaced by $X_1$ (I'm assuming because they all have the same probability/density function).

This can be proved with the assistance of the Cauchy-Schwarz inequality, i.e. that $|Cov(U, V)|^2 \leq Var(U)Var(V)$ (which incidentally shows up in MATH 223 as well).

If equality holds, then $\hat \theta$ is said to be an efficient estimator of $\theta$. Note that equality holds if and only if $\hat \theta$ and the derivative of the log likelihood function are linearly related, meaning that the likelihood function must be of the form $f_{\theta}(x) = h(x)c(\theta)e^{w(\theta)t(x)}$ where $w(\theta) = \int a(\theta)\,d\theta$ (terminology: of exponential type). Not sure how important this is or what this really means.

We can use this theorem to show that an estimator is an MVUE: an unbiased estimator whose variance attains the Cramer-Rao lower bound is automatically an MVUE. For example, to show that the estimator $\hat \mu = \overline X$, where $X_1,\ldots,X_n \sim N(\mu, \sigma^2)$, is an MVUE of $\mu$, we have to show that equality holds (since we've already shown in Sample mean and variance that $E(\overline X) = \mu$).

First, we find the log likelihood function:

$$\log g(x_1) = \log \left (\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2} (x_1-\mu)^2} \right ) = \log\left ( \frac{1}{\sigma\sqrt{2\pi}} \right ) + \left ( -\frac{1}{2\sigma^2} (x_1-\mu)^2\right ) = -\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2} (x_1-\mu)^2$$

Then we differentiate it with respect to $\mu$:

$$\frac{\partial \log g(x_1)}{\partial \mu} = \frac{\partial }{\partial \mu} \left ( -\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2} (x_1-\mu)^2 \right ) = \frac{\partial }{\partial \mu} \left (-\frac{(x_1-\mu)^2}{2\sigma^2} \right ) = \frac{2(x_1-\mu)}{2\sigma^2} = \frac{x_1-\mu}{\sigma^2}$$

Now, the Fisher information in this case is

$$nE \left ( \frac{\partial \log g(X_1)}{\partial \mu}\right )^2 = nE\left ( \frac{X_1-\mu}{\sigma^2} \right )^2 = n \frac{E((X_1-\mu)^2)}{\sigma^4} = \frac{nVar(X_1)}{\sigma^4} = \frac{n\sigma^2}{\sigma^4} = \frac{n}{\sigma^2}$$

Since $\overline X$ is unbiased, the numerator $\phi'(\mu) = 1$. Now, we know that the variance of $\overline X$ is $\frac{\sigma^2}{n}$, so we get

$$\frac{\sigma^2}{n} \text{ on the left and } \frac{1}{n/\sigma^2} = \frac{\sigma^2}{n} \text{ on the right. }\,\blacksquare$$

2.2.2Consistency

An estimator $\hat \theta$ given a random sample consisting of $n$ variables is consistent if it converges (in probability) to $\theta$ as $n$ approaches infinity. A sufficient condition: it is unbiased (or its bias goes to 0) and its variance approaches 0 as $n \to \infty$ (proven by Chebychev's inequality).

Example: show that the sample variance $\displaystyle s^2 = \frac{1}{n-1}\sum (X_i-\overline X)^2$ (with random samples from $N(\mu, \sigma^2)$) is consistent for $\sigma^2$. To do this, we use the fact that $(n-1)s^2/\sigma^2$ has a chi-square distribution with $n-1$ degrees of freedom and variance $2(n-1)$. We can then rewrite the formula for the variance of $s^2$ as follows:

$$Var(s^2) = Var\left ( \frac{\sigma^2}{n-1} \cdot \frac{(n-1)s^2}{\sigma^2} \right ) = \left ( \frac{\sigma^2}{n-1} \right )^2 Var\left ( \frac{(n-1)s^2}{\sigma^2} \right ) = \frac{\sigma^4}{(n-1)^2} \cdot 2(n-1) = \frac{2\sigma^4}{n-1}$$

We then take the limit as $n \to \infty$:

$$\lim_{n\to\infty} \frac{2\sigma^4}{n-1} = 0$$

2.2.3Sufficiency

A statistic is sufficient if it uses all the information in the sample that is useful for estimating $\theta$. There is a theorem we can use here (I remember using it for an assignment): the Neyman factorisation criterion. I suppose this can be left for the HTSEFP.

The statistic $t(X)$ is sufficient for $\theta$ if it is possible to factor the likelihood as

$$f_{\theta}(x) = g_{\theta}[t(x)]h(x)$$

For example, from the normal distribution, if $\sigma^2$ is known, then $\overline X$ is sufficient for $\mu$. We can prove this by first finding the likelihood function:

$$L(x_1, \ldots, x_n) = \prod_{i=1}^n \left ( \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2} (x_i-\mu)^2} \right ) = \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2} \sum (x_i-\mu)^2}$$

We then use the lemma from section 1.2:

$$L(x_1, \ldots, x_n) = \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2} \sum(x_i - \overline x)^2} \cdot e^{-\frac{n}{2\sigma^2} (\overline x - \mu)^2}$$

So we can split it up into the product of two functions, one of which ($h(x_1,\ldots,x_n)$) is a function of the individual random variables but does NOT contain the thing we're trying to estimate, the other of which ($g_{\mu}(\overline x)$) is a function of the estimator $\overline x$ and which CAN contain the thing we're trying to estimate. (Since $\sigma^2$ is known here, the constant involving $\sigma$ can go in either factor.)

$$\therefore g_{\mu}(\overline x) = e^{-\frac{n}{2\sigma^2} (\overline x - \mu)^2}, \quad h(x_1, \ldots, x_n) = \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2} \sum(x_i - \overline x)^2}$$

Another example, when neither $\sigma^2$ nor $\mu$ are known, and we want to estimate both. We'd have to be able to split it up into the product of two functions, one of which ($h(x_1, \ldots, x_n)$) is a function of the individual random variables but does NOT contain either $\sigma$ or $\mu$, the other of which ($g_{\mu, \sigma^2}(s^2, \overline x)$) is a function of both estimators and CAN contain $\mu$ and $\sigma^2$ (but not the individual random variables). We can use the fact that

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline x)^2 \quad \therefore \sum_{i=1}^n (x_i - \overline x)^2 = (n-1)s^2$$

to obtain

$$g_{\mu, \sigma^2}(s^2, \overline x) = \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{n}{2\sigma^2} (\overline x - \mu)^2 - \frac{(n-1)s^2}{2\sigma^2}} \quad h(x_1,\ldots,x_n) = 1$$

and so the factorisation is complete.

2.2.4Exponential families

A family of density/probability functions is called exponential if it can be expressed in the form $f_{\theta}(x) = h(x)c(\theta)e^{w(\theta)t(x)}$ (the same form that came up under the Cramer-Rao inequality). Examples: normal, binomial. Uniform is not.

  • use of order statistics?

2.3Minimum variance revisited

  • We can reduce variance of an unbiased estimator by conditioning on a sufficient statistic
  • MVUE must be a function of "the" sufficient statistic. Why "the"?
  • Rao-Blackwell
  • Lehmann-Scheffe
  • What does it mean for a statistic to be complete?
  • fuck this section, gotta cut your losses at some point

3Confidence intervals

Interval estimation (as opposed to point estimation, which is what was covered in the last section). If $\hat \theta$ is a point estimator, then the bound on the error of estimation is $b = 2\sigma_{\hat \theta}$. If $\hat \theta$ is unbiased, then Chebychev's inequality tells us that the probability that $\hat \theta $ and $\theta$ are at most $b$ apart is at least 0.75; if the distribution is normal, then the probability is at least 0.95.

For example, if we're given $n = 50$, $\overline x= 11.5$ and $s=3.5$, then our estimate for $\mu$ would be 11.5, and our bound on the error of estimation would be $2\sigma/\sqrt{n} = 2s/\sqrt{n} = 7/\sqrt{50} = 0.99$ (we can approximate $\sigma$ by $s$ since $n$ is large).

For interval estimation, we need to find an interval $[\theta_L, \theta_U]$ for which $P(\theta_L \leq \theta \leq \theta_U) = 1 - \alpha$ for a $(1-\alpha) \times 100\%$ confidence interval.

A confidence interval is derived by using a pivot, which is a random variable of the form $g(X, \theta)$ whose distribution is completely known (it does not depend on $\theta$ or on any other unknown parameter), and where the function $g(x, \theta)$ is, for each $x$, a monotonic function of the parameter $\theta$. The known distribution lets us make an exact probability statement about the pivot, and the monotonicity lets us invert that statement into an interval for $\theta$.

For example, for the normal distribution with $\sigma^2$ known, to find the confidence interval for $\mu$, we use the pivot

$$\frac{\overline X - \mu}{\sigma/\sqrt{n}}$$

which has a $N(0, 1)$ distribution. We use this pivot because $\overline X \sim N(\mu, \sigma^2/n)$ when the sample comes from a normal distribution (section 1.2.1.3), so standardising it gives exactly $N(0, 1)$; for other distributions the same pivot is approximately $N(0, 1)$ for large $n$ by the central limit theorem.

If $\sigma^2$ is unknown, then the pivot is

$$\frac{\overline X - \mu}{s/\sqrt{n}}$$

Which has a $t$ distribution with $n-1$ d.f.

If $\mu$ is unknown, and we want to find the confidence interval for $\sigma^2$:

$$\frac{(n-1)s^2}{\sigma^2}$$

which has a $\chi^2$ distribution with $n-1$ d.f. So the lower bound is $\displaystyle \frac{(n-1)s^2}{\chi^2_{\alpha/2, n-1}}$ and the upper bound is $\displaystyle \frac{(n-1)s^2}{\chi^2_{1-\alpha/2, n-1}}$

Examples

Given the random sample 785, 805, 790, 793, 802 from $N(\mu, \sigma^2)$. We don't know either $\mu$ or $\sigma^2$. To find a 90% confidence interval for $\mu$, we use the fact that

$$\frac{\overline X - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

We need to find numbers $\mu_L$ and $\mu_U$ such that $P(\mu_L \leq \mu \leq \mu_U) = 0.90$, which means cutting off 5% in each tail. First, we figure out $\overline X$ and $s$:

$$\overline X = 795 \quad s = \sqrt{\frac{1}{n-1} \sum (X_i - \overline X)^2} = \frac{1}{2} \sqrt{(10^2 + 10^2 + 5^2 + 2^2 + 7^2)} = 8.34$$

Anyway, the critical value for $\alpha = 0.05$ is 2.132 (from the $t$-distribution table in the back of the exam/textbook, with 4 degrees of freedom). So

$$\frac{\overline X - \mu_L}{s/\sqrt{n}} = \frac{795 - \mu_L}{8.34 / \sqrt{5}} = 2.132$$

So the confidence interval is given by

$$795 \pm \frac{2.132 \times 8.34}{\sqrt{5}} = 795 \pm 7.95 = [787.05, 802.95]$$

To find one for $\sigma^2$, we use the fact that

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

From the table in the back of the textbook for chi square, with 4 degrees of freedom, the critical value for $\alpha = 0.05$ is 9.48773 (the lower bound). Since this distribution is not symmetrical, we also need the critical value for the upper bound: 0.710721. So the confidence interval is given by

$$\frac{4 \times 69.5}{9.48773} \leq \sigma^2 \leq \frac{4 \times 69.5}{0.710721}, \quad \text{i.e. } \sigma^2 \in [29.3, 391.15]$$
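The two intervals in this example can be reproduced with a few lines of scipy (this just re-does the calculation above, with the critical values coming from `ppf` instead of the tables):

```python
import numpy as np
from scipy import stats

x = np.array([785, 805, 790, 793, 802])
n = len(x)
xbar, s2 = x.mean(), x.var(ddof=1)      # 795 and 69.5
s = np.sqrt(s2)

# 90% CI for mu:  xbar +/- t_{0.05, n-1} * s / sqrt(n)
t_crit = stats.t(df=n - 1).ppf(0.95)
print(xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))

# 90% CI for sigma^2:  [(n-1)s^2 / chi^2_{0.05, n-1}, (n-1)s^2 / chi^2_{0.95, n-1}]
lo = (n - 1) * s2 / stats.chi2(df=n - 1).ppf(0.95)
hi = (n - 1) * s2 / stats.chi2(df=n - 1).ppf(0.05)
print(lo, hi)                            # ~ [29.3, 391.15]
```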

For a binomial distribution:

$$\text{Lower bound } = \hat p - z_{\alpha/2} \sqrt{\frac{\hat p (1-\hat p)}{n}} \quad \text{Upper bound } = \hat p + z_{\alpha/2} \sqrt{\frac{\hat p (1-\hat p)}{n}}$$

where $\hat p = \frac{x}{n}$ (number of successes over number of trials) and $z_{\alpha/2}$ comes from the normal distribution etc.

For the difference between the means of two normal distributions whose variances are known, the pivot is

$$\frac{\overline X_1 - \overline X_2 -(\mu_1 - \mu_2)}{\displaystyle \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

so the bounds are

$$(\overline x_1 - \overline x_2) \pm \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \times z_{\alpha/2}$$

If the variances are unknown, then we have to assume that the variances are the same, and use the pooled sample variance instead of $\sigma_1^2$ and $\sigma_2^2$:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 +(n_2 - 1)s_2^2}{n_1+n_2-2}$$

The distribution of the pivot is $t$ with $n_1 + n_2 - 2$ degrees of freedom. The confidence interval is then given by

$$(\overline x_1 - \overline x_2) \pm \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} \times t_{\alpha/2, n_1+n_2-2}$$

Numerical example: $\overline x_1 = 64$, $\overline x_2 = 69$, $s_1^2 = 52$, $s_2^2 = 71$, $n_1 = 11$, $n_2 = 14$ and we want to find the difference between the means, $\mu_1-\mu_2$. First, let's find the pooled sample variance:

$$s_p^2 = \frac{(11-1)(52) + (13)(71)}{11+14-2} = \frac{520+ 923}{23} = 62.74$$

Now let's find a 95% confidence interval, given that we have a $t$ distribution with $11 + 14 - 2 = 23$ degrees of freedom. From the table in the back of the book we have $t_{0.025, 23} = 2.069$. So the confidence interval is given by

$$(\overline x_1 - \overline x_2) \pm t_{0.025, 23}\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = -5 \pm 2.069 \sqrt{\frac{62.74}{11} + \frac{62.74}{14}} = -5 \pm 6.60 = [-11.60, 1.60]$$
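The same numbers, reproduced in code (just a check of the arithmetic above):

```python
import numpy as np
from scipy import stats

xbar1, xbar2 = 64, 69
s1_sq, s2_sq = 52, 71
n1, n2 = 11, 14

sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
t_crit = stats.t(df=n1 + n2 - 2).ppf(0.975)           # t_{0.025, 23}
half_width = t_crit * np.sqrt(sp_sq / n1 + sp_sq / n2)

diff = xbar1 - xbar2
print(sp_sq)                                 # ~ 62.74
print(diff - half_width, diff + half_width)  # ~ [-11.60, 1.60]
```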

Confidence interval for the ratio of variances

Two normal distributions, means and variances unknown, we want to find the ratio $\sigma_1^2/\sigma_2^2$. The pivot is

$$\frac{s_2^2/\sigma_2^2}{s_1^2/\sigma_1^2}$$

which has an $F$ distribution (NOT SYMMETRICAL!!) with $n_2-1$ numerator and $n_1-1$ denominator degrees of freedom. The confidence interval is therefore given by

$$\text{Lower bound } = \frac{s_1^2}{s_2^2} F_{1-\alpha/2, n_2-1, n_1-1} \quad \text{Upper bound } = \frac{s_1^2}{s_2^2} F_{\alpha/2, n_2-1, n_1-1}$$

The fact that

$$F_{1-\alpha/2, n_2-1, n_1-1} = \frac{1}{F_{\alpha/2, n_1-1, n_2-1}}$$

(NOTE THE SWITCHED NUMERATOR AND DENOMINATOR D.F.) may be useful.

Difference between binomial parameters

$\hat p_1 = x_1/n_1$, $\hat p_2 = x_2/n_2$. The pivot used here is

$$\frac{\hat p_1 - \hat p_2 - (p_1 - p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}} \sim N(0, 1)$$

where we estimate $p_1$ and $p_2$ under the square root sign using $\hat p_1$ and $\hat p_2$. The final confidence interval is given by

$$\hat p_1 - \hat p_2 \pm z_{\alpha/2} \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}$$

Mean of an exponential distribution

$X_i$ is a random sample from an exponential distribution with mean $\theta$ ($n$ samples total), which we want to find a confidence interval for. Let the pivot be

$$\frac{2}{\theta} \sum_{i=1}^n X_i$$

which has a chi-square distribution with $2n$ degrees of freedom. So the confidence interval is

$$\left [\frac{2\sum x_i}{\chi^2_{\alpha/2, 2n}}, \frac{2\sum x_i}{\chi^2_{1-\alpha/2, 2n}} \right ]$$

Large samples

When the sample size is large, we can assume that

$$\frac{\hat \theta - \theta}{\sigma_{\hat \theta}} \sim N(0, 1)$$

where $\sigma_{\hat \theta}$ is the standard error (standard deviation of the estimator $\hat \theta$). So the confidence interval is

$$\hat \theta \pm z_{\alpha/2} \sigma_{\hat \theta}$$

Bayesian credible sets

Bayesian analog of confidence intervals. Do we need to know this?

4Theory of hypothesis testing

4.1Introduction and definitions

Two hypotheses: the null hypothesis $H_0$, and the alternative hypothesis $H_1$. We select a critical region $C$ of the sample space, and if the random variable's outcome is in $C$, then we reject $H_0$ and accept $H_1$; else, we accept $H_0$ and reject $H_1$.

Of course, it's possible that we will falsely reject or accept a hypothesis. If $H_0$ is true but we reject it, that's a type I error; if $H_1$ is true but we reject it, that's a type II error. Pretty easy to remember. The probability of making a type I error is $\alpha = P(X \in C|H_0 \text{ is true})$, the probability of making a type II error is $\beta = P(X \notin C|H_1 \text{ is true})$. The power of the test is $1-\beta$.

Binomial distribution example

7 marbles, $\theta$ are red and the rest are green. We randomly select 3 without replacement to try and test the hypotheses

$$H_0:\, \theta = 3 \quad H_1:\,\theta = 5$$

Our rule is, if 2 or more of the marbles are red, reject $H_0$ and accept $H_1$; otherwise, accept $H_0$ and reject $H_1$. Let's calculate $\alpha$ and $\beta$ for this test.

$\alpha$ is the probability that, when the urn contains 3 red marbles and 4 green ones (i.e. $H_0$ is true), our sample of 3 contains 2 or 3 red marbles. So we find the sum of the probabilities of selecting all reds and of selecting exactly one green:

$$\text{All red } = \frac{3}{7}\cdot \frac{2}{6} \cdot \frac{1}{5} = \frac{1}{35} \quad \text{One green } = \frac{3 \times 4 \times 3 \times 2}{7 \times 6 \times 5} = \frac{12}{35} \quad \therefore \alpha = \frac{13}{35}$$

$\beta$ is the probability that, when the urn contains 5 red marbles and 2 green ones (i.e. $H_1$ is true), our sample of 3 contains only 1 red marble and 2 green marbles. In other words, we select both of the green marbles. This probability is

$$\beta = 3 \cdot \frac{2}{7} \cdot \frac{1}{6} = \frac{1}{7} = \frac{5}{35}$$

so the power of the test is $1 - \beta = \frac{30}{35}$. (Since we're sampling without replacement, these are really hypergeometric probabilities rather than binomial ones; there's a numerical check after the power function table below.)

If the hypotheses were instead

$$H_0:\, \theta \leq 3 \quad H_1:\,\theta > 3$$

then the hypotheses would be composite instead of simple. We would then have to calculate $\alpha$ and $\beta$ values for each possible value of $\theta$ ($\alpha(\theta), \beta(\theta)$). The power function of the test (same as regular power, just, it depends on the value of $\theta$) is given by:

$$P(\theta) = P_{\theta}(\text{reject }H_0) = \begin{cases} \alpha(\theta) & \text{if } \theta \text{ satisfies } H_0 \\ 1 - \beta(\theta) & \text{if } \theta \text{ satisfies } H_1 \end{cases}$$

The size of the test is the maximum value of $\alpha(\theta)$ over the values of $\theta$ allowed under $H_0$; a test is of level $\alpha$ if its size is at most $\alpha$. Apparently the terms are used confusingly in the textbook.

Anyway, for this situation, we can construct a table for the power function (remember that the rule is to reject $H_0$ if there are 2 red marbles in the sample):

$\theta$ 0 1 2 3 4 5 6 7
$P(\theta)$ 0 0 $\frac{1}{7}$ $\frac{13}{35}$ $\frac{22}{35}$ $\frac{30}{35}$ 1 1
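Since the marbles are drawn without replacement, all of these probabilities are hypergeometric, so they can be checked directly with scipy (a sketch; `hypergeom(M, n, N)` takes the population size, the number of red marbles, and the number drawn):

```python
from scipy import stats

M, draws = 7, 3       # 7 marbles in the urn, 3 drawn without replacement

# alpha = P(at least 2 red in the sample | 3 red in the urn)
print(stats.hypergeom(M, 3, draws).sf(1))    # ~ 13/35 = 0.371

# beta = P(at most 1 red in the sample | 5 red in the urn)
print(stats.hypergeom(M, 5, draws).cdf(1))   # ~ 5/35 = 0.143

# The whole power function P(theta) = P(reject H0 | theta red marbles)
for theta in range(8):
    print(theta, stats.hypergeom(M, theta, draws).sf(1))
```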

Normal distribution example

We have a sample of size 10 from a distribution with variance $\sigma^2=625$. The hypotheses are:

$$H_0:\,\mu = 100 \quad H_1:\, \mu = 70$$

Rule: reject $H_0$ if $\overline X < 80$. Let's calculate $\alpha$ and $\beta$.

$\alpha$ is the probability that $\mu = 100$ and $\overline X < 80$. We can use the fact that

$$\frac{\overline X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

for this:

$$\alpha = Pr_{H_0}\{X \in C\} = Pr_{\mu = 100} \{\overline X < 80 \} = Pr \left \{ \frac{\overline X - \mu}{\sigma/\sqrt{n}} < \frac{80 - 100}{25/\sqrt{10}} \right \} = Pr \{ Z < -2.53 \} = 0.0057$$

(using the table at the back of the book). To calculate $\beta$:

$$\beta = Pr_{H_1} \{X \notin C\} = Pr_{\mu = 70} \{ \overline X \geq 80 \} = Pr \left \{ \frac{\overline X - \mu}{\sigma/\sqrt{n}} \geq \frac{80 - 70}{25/\sqrt{10}} \right \} = Pr \left \{ Z > 1.26 \right \} = 0.1038$$
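The same two numbers in code (the values differ slightly from the table-based ones only because $z$ isn't rounded here):

```python
import numpy as np
from scipy import stats

sigma, n = 25, 10
se = sigma / np.sqrt(n)

alpha = stats.norm.cdf((80 - 100) / se)   # P(Xbar < 80 | mu = 100), ~ 0.0057
beta = stats.norm.sf((80 - 70) / se)      # P(Xbar >= 80 | mu = 70),  ~ 0.103
print(alpha, beta, 1 - beta)
```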

If we wanted to test the composite hypotheses

$$H_0:\, \mu \geq 100 \quad H_1:\,\mu < 100$$

it would be the same idea, except we'd have to do it for every possible value of $\mu$, which I don't think is important.

If we wanted to find a critical region of a size $\alpha$, we use

$$\frac{k - \mu}{\sigma/\sqrt{n}} = -z_{\alpha}$$

and solve for $k$ (which is the cutoff defining the critical region $\{\overline X < k\}$). This gives us a new test and we can figure out the power for that.

4.2Choosing the critical region

4.2.1Simple hypotheses

First, we decide on a value of $\alpha$ (the level). Then, we choose the test that has maximum power for that level (i.e. we minimise $\beta$). This is called a most powerful test (critical region) of level $\alpha$.

For example, if $f_0(x)$ is the likelihood function of observation $X$ under hypothesis $H_0$ and $f_1(X)$ is the likelihood function of observation $X$ under hypothesis $H_1$ then, given a table or something, we can figure out the most powerful critical region for a given $\alpha$ level. Basically, we identify all those regions of size $\alpha$ or less, then find the power ($1-\beta$) for each region. Note that there can be a more powerful critical region with a smaller size, which is why we use levels etc.

How we do choose a most powerful critical region? We can use the Neyman-Pearson lemma:

Let $C$ be a critical region of size $\alpha$. If there exists a positive constant $k$ such that

$$f_0(x) \leq kf_1(x) \text{ when } x \in C \text{ and } f_0(x) > kf_1(x) \text{ when } x \notin C$$

then $C$ is a most powerful critical region of level $\alpha$.

Using this lemma, we can construct a most powerful region of size $\alpha$ (which will be the most powerful test of level $\alpha$).

Normal distribution example

Known variance of $\sigma^2$, random sample of size $n$, we want to test

$$H_0:\, \mu = \mu_0 \quad H_1: \,\mu=\mu_1$$

where $\mu_1 < \mu_0$. How do we find the most powerful critical region of size $\alpha$?

First, we find the likelihood function:

$$f_{\mu}(x) = f(x_1,\ldots,x_n) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2} (x_i-\mu)^2} = \frac{1}{(\sigma \sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2} = \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2}[\sum (x_i - \overline x)^2 + n(\overline x - \mu)^2]}$$

(making use of the lemma from section 1.2). Then we take the ratio of the hypotheses' likelihood functions:

$$\begin{align*} \frac{f_0(x)}{f_1(x)} & = \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2}[\sum (x_i - \overline x)^2 + n(\overline x - \mu_0)^2]} / \frac{1}{(\sigma\sqrt{2\pi})^n} e^{-\frac{1}{2\sigma^2}[\sum (x_i - \overline x)^2 + n(\overline x - \mu_1)^2]} = e^{-\frac{1}{2\sigma^2}[\sum (x_i - \overline x)^2 + n(\overline x - \mu_0)^2] + \frac{1}{2\sigma^2}[\sum (x_i - \overline x)^2 + n(\overline x - \mu_1)^2]} \\ & = e^{\displaystyle -\frac{n}{2\sigma^2}[(\overline x - \mu_0)^2 - (\overline x - \mu_1)^2]} = e^{\displaystyle \frac{n(\mu_0-\mu_1)}{2\sigma^2}[2\overline x - (\mu_0 + \mu_1)]} \end{align*}$$

If we take the log of the above and successively absorb the constants on the right into $k'$, $k''$, etc., we see that the critical region is given by

$$C = \left \{ \frac{f_0(x)}{f_1(x)} \leq k \right \} = \left \{ \frac{n(\mu_0-\mu_1)}{2\sigma^2}[2\overline x - \mu_0 - \mu_1] \leq \log k \right \} = \{ 2\overline x - \mu_0 - \mu_1 \leq k' \} = \{ \overline x \leq k'' \}$$

(Note that to get the first $k'$, we divided both sides by the fraction on the left; since $\mu_1 < \mu_0$, that factor is strictly positive, so the inequality's direction is preserved and it's all valid.)

Anyway, we use $\displaystyle \frac{\overline X - \mu}{\sigma/\sqrt{n}}$ as usual, resulting in

$$Pr_{\mu_0} \left \{ Z \leq \frac{k'' - \mu_0}{\sigma/\sqrt{n}} \right \} = \alpha$$

which implies that

$$\frac{k''-\mu_0}{\sigma/\sqrt{n}} = -z_{\alpha}$$

and so the value we use for the critical region is

$$\mu_0 - z_{\alpha} \frac{\sigma}{\sqrt{n}}$$

Or, we could calculate

$$z = \frac{\overline x- \mu_0}{\sigma/\sqrt{n}}$$

and reject $H_0$ at level $\alpha$ if $z \leq -z_{\alpha}$.

What if we wanted to test hypotheses involving $\sigma^2$? The derivation is similar, and the ratio of the two likelihood functions becomes

$$\frac{f_{\sigma^2_0}(x)}{f_{\sigma_1^2}(x)} = \left ( \frac{\sigma_1^2}{\sigma_0^2} \right )^{n/2} e^{-\frac{1}{2}(\sigma_0^{-2} - \sigma_1^{-2} ) \sum (x_i-\mu)^2}$$

Let's use $\hat \sigma^2$ to estimate the variance:

$$\hat \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i-\mu)^2$$

So we reject $H_0$ at level $\alpha$ if

$$\frac{n\hat\sigma^2}{\sigma_0^2} \geq \chi^2_{\alpha, n}$$

Binomial distribution example

We want to test

$$H_0:\,\theta = \theta_0 \quad H_1:\,\theta=\theta_1$$

The likelihood function for a binomial distribution is

$$f_{\theta}(x) = \binom{n}{x} \theta^x (1-\theta)^{n-x}$$

The derivation is pretty long and messy but in the end you choose the smallest $k$ such that $P(X \geq k) \leq \alpha$ or $P(X < k) \geq 1-\alpha$. (We choose the "smallest" to maximise the power). We'd have to use the binomial tables at the back of the book to find the actual value of $k$. For a numerical example, let $n = 20$, $\alpha = 0.05$, and we want to test $H_0: \theta = 0.3$ against $H_1: \theta = 0.5$. We need to find the smallest value of $k$ such that $P(X < k) \geq 1 - \alpha$, which is 0.95. We see that $P(X < 10) = 0.952$, so, $k$ is 10.
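A sketch of the same calculation in code: find the smallest $k$ with $P(X < k \mid \theta_0) \geq 1 - \alpha$, then compute the power of the resulting test under $\theta_1$ (the power value itself isn't quoted in the notes).

```python
from scipy import stats

n, alpha = 20, 0.05
theta0, theta1 = 0.3, 0.5

# Smallest k with P(X <= k-1 | theta0) >= 1 - alpha
k = next(k for k in range(n + 1) if stats.binom.cdf(k - 1, n, theta0) >= 1 - alpha)
print(k, stats.binom.cdf(k - 1, n, theta0))   # k = 10, cdf ~ 0.952

# Power of the test "reject H0 if X >= k" when theta = theta1
print(stats.binom.sf(k - 1, n, theta1))       # ~ 0.588
```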

4.2.2Composite hypotheses

A test is uniformly most powerful if, for every value of the parameter allowed under $H_1$, it is at least as powerful as any other test of level $\alpha$. Although these tests do not always exist, there is a method of finding them when they do, called the likelihood ratio method. It's basically the same as before, except that for each hypothesis you plug in the value of $\theta$ that maximises the likelihood under that hypothesis.

Normal distribution example

Random sample of size $n$ from a normal distribution where $\sigma^2$ is known. We want to test

$$H_0:\,\mu=\mu_0 \quad H_1:\,\mu\neq \mu_0$$

Skipping the derivation, we have that

$$z = \frac{\overline x - \mu_0}{\sigma/\sqrt{n}}$$

and we reject $H_0$ if $|z| \geq z_{\alpha/2}$.

There were a bunch of optional examples skipped at the end of this section.

4.3Hypothesis testing examples

Recall that $H_0$ is the null hypothesis, i.e. the "normal state of affairs". $H_1$ is the hypothesis which has the burden of proof on it, so that we only accept it if there is sufficient evidence for it. $H_1$ is therefore the statement to be proved, and the probability that we mistakenly accept $H_1$ is $\alpha$.

4.3.1p-levels

The p-level (or p-value) of a test is the smallest value of $\alpha$ at which we would reject $H_0$ given the observed data - the value of $\alpha$ that discriminates between rejecting $H_0$ and accepting $H_0$. Also called the observed/attained level of significance. Example:

$$H_0: \,\mu \geq 3.5 \quad H_1:\, \mu < 3.5$$

where $n = 50$, $\overline X = 3.3$ and the sample standard deviation $s = 1.1$ (since $n$ is large, we take $s = \sigma$). First, at $\alpha = 0.05$ (note $z_{\alpha} = 1.645$), what can we conclude? The formula is

$$z = \frac{\overline X - \mu}{\sigma/\sqrt{n}} = \frac{3.3 - 3.5}{1.1/\sqrt{50}} = -1.286$$

We reject $H_0$ only if $z$ is less than $-z_{\alpha} = -1.645$. It's not, so we don't reject it. However, if $\alpha$ were $0.10$, then $z_{\alpha} = 1.282$, so we could reject it at that level, i.e. with 90% confidence.

The $p$-value is the probability, assuming $H_0$ is true, of observing a test statistic at least as extreme as the one we got. To calculate the $p$-value here, we just set the observed statistic equal to the critical value,

$$-1.286 = -z_{p}$$

so $p = P(Z > 1.286) = 0.0992$.

We can also find the probability of making a type II error (and from it the power of the test) for a given value of $\mu$. That is, we find the probability of accepting $H_0$ (falling outside the critical region) when $H_1$ is true. For example, if $\mu = 3.4$, we first find the critical region:

$$\overline x < \mu - z_{\alpha}\frac{\sigma}{\sqrt{n}} = 3.5 - 1.645 \frac{1.1}{\sqrt{50}} = 3.244$$

so the probability of it falling outside the critical region given that $\mu = 3.4$ is given by

$$\begin{align*} \beta & = P_{\mu = 3.4} [ \overline X \geq 3.244 ] = P\left (Z \geq \frac{3.244 - 3.4}{1.1/\sqrt{50}} \right ) = P(Z \geq -1.003) \\ & = 1-P(Z < -1.003) = 1- P(Z > 1.003) = 1 - 0.1579 = 0.8421\end{align*}$$

So the power at $\mu = 3.4$ is $1-\beta = 0.1579$ - with a true mean this close to 3.5, the test is unlikely to reject $H_0$.
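The whole example in code (reproducing the $z$ statistic, the $p$-value, the cutoff and $\beta$ computed above):

```python
import numpy as np
from scipy import stats

n, xbar, s = 50, 3.3, 1.1
mu0 = 3.5
se = s / np.sqrt(n)

z = (xbar - mu0) / se
p_value = stats.norm.cdf(z)                  # left-tailed test, ~ 0.099
print(z, p_value)

# Type II error probability at mu = 3.4 for the alpha = 0.05 test
cutoff = mu0 - stats.norm.ppf(0.95) * se     # reject H0 if xbar < cutoff (~ 3.244)
beta = stats.norm.sf((cutoff - 3.4) / se)    # P(Xbar >= cutoff | mu = 3.4)
print(cutoff, beta, 1 - beta)                # beta ~ 0.84, power ~ 0.16
```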

Note that the "some last topics" section is optional which means I will ignore it

There are some examples here - should probably be in the HTSEFP or formulas to memorise

5Applications of hypothesis testing

5.1The bivariate normal distribution

Two random variables $X$, $Y$ have a bivariate normal distribution if their joint density function is some really long thing. Does this matter? $\rho$ is one of the parameters, and $Cov(X, Y) = \rho \sigma_1\sigma_2$.

5.2Correlation analysis

Whatever, not worth it

that's all for section 5 (ignore the bit at the end)

6Linear models

6.1Linear regression

Equation for the line of best fit: $y = \beta_0 + \beta_1 x$. However, since the points might not fall exactly on the line, we also have an equation for the relationship between $x_i$ and $y_i$ for the $i$th point:

$$y_i = \beta_0 + \beta_1 x_i + e_i$$

where $e_i$ is the "error" (the vertical distance of the point from the line). This chapter will be about using the data $(x_1, y_1),\ldots, (x_n, y_n)$ to estimate the parameters $\beta_0$ and $\beta_1$.

Actually, I seem to remember doing this in Calc A. The derivation is probably not important, so here's the gist: the values we need are

$$\sum x_i \quad \sum y_i \quad \sum x_i y_i \quad \sum x_i^2 \quad \sum y_i^2$$

The solutions are

$$\hat \beta_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i )(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2} \quad \hat \beta_1 = \frac{n(\sum x_i y_i) - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$$

MEMORISE THESE. Just do it
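A quick way to check the two formulas (a sketch on made-up data; `np.polyfit` is used only as an independent cross-check):

```python
import numpy as np

# Made-up data, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

n = len(x)
sx, sy, sxy, sxx = x.sum(), y.sum(), (x * y).sum(), (x**2).sum()

beta1_hat = (n * sxy - sx * sy) / (n * sxx - sx**2)
beta0_hat = (sxx * sy - sx * sxy) / (n * sxx - sx**2)
print(beta0_hat, beta1_hat)

# Cross-check against numpy's least-squares line fit
print(np.polyfit(x, y, 1))      # returns [slope, intercept]
```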

Anyway, the values of $\beta_0$ and $\beta_1$ that we just got are estimates. It would be nice to be able to obtain confidence intervals and test hypotheses for these values. To be able to do that, we make certain assumptions:

  • $x_i$'s are fixed (not the values of random variables)
  • $x_i$ and $y_i$ are related by $y_i = \beta_0 + \beta_1 x_i + e_i$ where $e_i \sim N(0, \sigma^2)$ (the random variable)

Since there's a distribution involved, we can use another method to estimate $\beta_0$ and $\beta_1$: maximum likelihood estimation. First, we find the likelihood function by noting that $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, so each factor is the normal density evaluated at $y_i - \beta_0 - \beta_1 x_i$:

$$\prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2} (y_i - \beta_0 - \beta_1x_i)^2}$$

and by taking the log and setting its partial derivatives (with respect to $\beta_0$ and $\beta_1$) to 0 we get the same estimates for $\beta_0$ and $\beta_1$. Which is kind of neat.

6.1.1Finding distribution for the betas

$$\hat \beta_1 \sim N\left (\beta_1, \frac{\sigma^2}{S_{xx}} \right) \quad \hat \beta_0 \sim N\left ( \beta_0, \sigma^2 \left [ \frac{1}{n} + \frac{\overline x^2}{S_{xx}} \right ] \right )$$

where $S_{xx} = \sum_{i=1}^n (x_i - \overline x)^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$.

Also, their covariance is given by

$$Cov(\hat \beta_0, \hat \beta_1) = -\frac{\sigma^2 \overline x}{S_{xx}}$$

so the estimated line $y = \hat \beta_0 + \hat\beta_1 x$ has distribution

$$\hat \beta_0 + \hat \beta_1 x \sim N\left ( \beta_0 + \beta_1 x, \sigma^2 \left [ \frac{1}{n} + \frac{(x-\overline x)^2}{S_{xx}} \right ] \right )$$

(Derivations omitted)

Now, let $y = \hat \beta_0 + \hat\beta_1 x + e$ be the value resulting from some measurement, and let $\hat y = \hat \beta_0 + \hat \beta_1x$ be our prediction for $y$. Now, $E(y - \hat y) = 0$ (because of the above) and the variance is given by

$$Var(y - \hat y) = Var(y) + Var(\hat y) - \underbrace{2Cov(y, \hat y)}_{0} = \sigma^2 + \sigma^2\left [\frac{1}{n} + \frac{(x - \overline x)^2}{S_{xx}} \right ]$$

Consequently, the difference between the estimate and the true value is given by

$$y - \hat y \sim N\left(0, \sigma^2 \left [ 1 + \frac{1}{n} + \frac{(x-\overline x)^2}{S_{xx}} \right ]\right)$$

However, the value of $\sigma^2$ is not likely to be known, so we have to estimate it. Let's define a value called the least squares minimum by

$$SS(Res) = \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 = S_{yy} - \hat \beta_1 S_{xy}$$

where (with $S_{xx}$ as above)

$$S_{yy} = \sum_{i=1}^n (y_i - \overline y)^2 \quad S_{xy} = \sum_{i=1}^n (x_i - \overline x)(y_i - \overline y)$$

The following gives us a $\chi^2$ distribution, with $n-2$ d.f., independent of the betas:

$$\frac{SS(Res)}{\sigma^2} \sim \chi^2_{n-2}$$

So we can get an unbiased estimator of $\sigma^2$ with

$$\hat \sigma^2 = \frac{SS(Res)}{n-2}$$

We can get a $t$ distribution with $n-2$ d.f. by dividing any of the standard normal quantities above by $\sqrt{\hat \sigma^2/\sigma^2} = \sqrt{\frac{SS(Res)/\sigma^2}{n-2}}$, which amounts to replacing $\sigma$ with $\hat \sigma$. In any case, the following all have $t$ distributions with $n-2$ d.f.:

$$\frac{(\hat \beta_1 - \beta_1)\sqrt{S_{xx}}}{\hat \sigma} \quad \frac{\hat \beta_0 - \beta_0}{\hat \sigma \sqrt{\frac{1}{n} + \frac{\overline x^2}{S_{xx}}}} \quad \frac{\hat \beta_0 + \hat \beta_1 x - \beta_0 - \beta_1 x}{\hat \sigma\sqrt{\frac{1}{n} + \frac{(x-\overline x)^2}{S_{xx}}}} \quad \frac{y-\hat y}{\hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x-\overline x)^2}{S_{xx}}}}$$

There are some numerical examples here - go through them later, maybe in the HTSEFP

6.2Experimental design

One-way analysis of variance (experiment designs involving a single factor)

6.2.1Completely randomised designs

Let's say we want to test the effects of $k$ different fertilisers on wheat. So we plant a certain number of wheat plants in a bunch of different plots for each type of fertiliser. The yield from each plot is dependent on the effect of the fertiliser used as well as on the random error for that plot.

Rather than deriving the full model, here's what goes in an ANOVA table:

| Source of variation | Degrees of freedom | Sum of squares | Mean square | F |
| --- | --- | --- | --- | --- |
| Treatments | $k - 1$ | SS(Tr) | MS(Tr) | $F_{Tr}$ |
| Errors | $N - k$ | SS(E) | MS(E) | -- |
| Total | $N - 1$ | SS(T) | -- | -- |

There are $k$ treatments in total (e.g. fertilisers). There are $N$ test plots in total (or plant pots, or test tubes, or whatever). The sum of squares column is given as follows:

$$SS(Tr) = \sum_{i=1}^k \frac{x_{i.}^2}{n_i} - \frac{x_{..}^2}{N} \quad SS(E) = \sum_{i=1}^k\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^k \frac{x_{i.}^2}{n_i} \quad SS(T) = SS(E) + SS(Tr)$$

Now, what does all of that actually mean? In words:

$$\begin{align*}SS(Tr) & = \sum \underbrace{\frac{(\text{sum of yields for this fertiliser})^2}{\text{number of plots for this fertiliser}}}_{\text{for all the fertilisers}} - \frac{(\text{sum of all yields})^2}{\text{total number of plots}} \\ SS(E) & = (\text{sum of the squared yields over all plots}) - \sum \underbrace{\frac{(\text{sum of yields for this fertiliser})^2}{\text{number of plots for this fertiliser}}}_{\text{for all the fertilisers}}\end{align*}$$

The mean square column is as follows:

$$MS(Tr) = \frac{SS(Tr)}{k-1} \quad MS(E) = \frac{SS(E)}{N- k}$$

(basically you divide the sum of squares by the degrees of freedom)

The last column, $F_{Tr}$, is just $MS(Tr)$ divided by $MS(E)$. If this value is greater than $F_{\alpha, k-1, N-k}$ then we reject $H_0$ (no treatment effects, in this case).
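The critical value $F_{\alpha, k-1, N-k}$ comes from a table or from software; here's a minimal sketch of the decision rule (the numbers are placeholders, not from any particular dataset):

```python
from scipy import stats

# Reject H0 (no treatment effects) when F_Tr exceeds F_{alpha, k-1, N-k}
alpha, k, N = 0.05, 4, 20               # placeholder values
f_crit = stats.f.ppf(1 - alpha, dfn=k - 1, dfd=N - k)

F_Tr = 2.5                              # hypothetical MS(Tr) / MS(E)
print(f_crit, "reject H0" if F_Tr > f_crit else "cannot reject H0")
```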

In this situation, the things we would need to compute/note are:

  • $k$: the number of fertilisers used
  • $x_{1.}$: the sum of yields for fertiliser 1
  • $x_{2.}$: the sum of yields for fertiliser 2
  • $x_{3.}$: the sum of yields for fertiliser 3
  • $x_{..}$: the sum of yields among all the fertilisers
  • $n_1$: the number of plots used by fertiliser 1
  • $n_2$: the number of plots used by fertiliser 2
  • $n_3$: the number of plots used by fertiliser 3
  • $N$: the number of plots used, total, among all the fertilisers
  • $\displaystyle \sum_{i=1}^k \sum_{j=1}^{n_i} x_{ij}^2$: sum of the squared yields over all the plots
  • $\displaystyle \sum_{i=1}^k \frac{x_{i.}^2}{n_i}$: sum of $\frac{(\text{sum of yields for this fertiliser})^2}{\text{number of plots for this fertiliser}}$ over all the fertilisers

omg done

Example

Number of minutes taken to drive along 4 routes on different days of the week (thing we're testing: route)

| -- | Route 1 | Route 2 | Route 3 | Route 4 |
| --- | --- | --- | --- | --- |
| Monday | 22 | 25 | 26 | 26 |
| Tuesday | 26 | 27 | 29 | 28 |
| Wednesday | 25 | 28 | 33 | 27 |
| Thursday | 25 | 26 | 30 | 30 |
| Friday | 31 | 29 | 33 | 30 |

  • $k = 4$
  • $x_{1.} = 129$ (total for route 1)
  • $x_{2.} = 135$ (total for route 2)
  • $x_{3.} = 151$ (total for route 3)
  • $x_{4.} = 141$ (total for route 4)
  • $x_{..} = 556$ (total for all routes)
  • $n_i = 5$ for each route (the number of trips per route, all the same)
  • $\sum \sum x_{ij}^2 = 15610$ (sum of the squared times over all 20 trips)
  • $\sum x_{i.}^2/n_i = 15509.6$ (sum of the squared route totals, each divided by 5)
  • $SS(Tr) = 15509.6 - 556^2/20 = 15509.6 - 15456.8 = 52.8$ (sum of squares for treatment)
  • $SS(E) = 15610 - 15509.6 = 100.4$ (sum of squares for errors)
  • $SS(T) = 52.8 + 100.4 = 153.2$ (sum of squares, total)
  • $MS(Tr) = 52.8 / 3 = 17.6$
  • $MS(E) = 100.4 / 16 = 6.275$
  • $F_{Tr} = 17.6 / 6.275 = 2.804$

The ANOVA table is:

| Source of variation | Degrees of freedom | Sum of squares | Mean square | F |
| --- | --- | --- | --- | --- |
| Treatments | 3 | 52.8 | 17.6 | 2.804 |
| Errors | 16 | 100.4 | 6.275 | -- |
| Total | 19 | 153.2 | -- | -- |

The distribution is $F_{3, 16}$. At $\alpha = 0.05$, $F_{0.05, 3, 16} = 3.24$. Since $F_{Tr} = 2.804$ does not exceed this, we can't reject the null hypothesis at the 5% significance level: the data don't give evidence that the time taken depends on the route.
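The whole one-way table can be double-checked with scipy.stats.f_oneway, which computes the same $F$ statistic straight from the raw data:

```python
from scipy import stats

# Driving times for the four routes (Monday through Friday), from the table above
routes = [
    [22, 26, 25, 25, 31],
    [25, 27, 28, 26, 29],
    [26, 29, 33, 30, 33],
    [26, 28, 27, 30, 30],
]

F, p = stats.f_oneway(*routes)
print(F, p)   # F should come out near 2.80, with p > 0.05, so H0 is not rejected
```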

6.2.2Randomised block designs

Treatments are assigned randomly within blocks. Used when there are extraneous factors that could introduce bias (confounding), e.g. the terrain of the plots in the fertiliser example.

The most important thing here is that there's a new ANOVA table to memorise:

| Source of variation | Degrees of freedom | Sum of squares | Mean square | F |
| --- | --- | --- | --- | --- |
| Treatments | $k - 1$ | SS(Tr) | MS(Tr) | $F_{Tr}$ |
| Blocks | $n - 1$ | SS(Bl) | MS(Bl) | $F_{Bl}$ |
| Errors | $(n-1)(k-1)$ | SS(E) | MS(E) | -- |
| Total | $nk - 1$ | SS(T) | -- | -- |

$n$ is the number of blocks, obviously. Everything else is a little bit different now:

$$SS(T) = \sum_{i=1}^k \sum_{j=1}^n \left (x_{ij} - \frac{x_{..}}{kn} \right )^2 \quad SS(Tr) = n\sum_{i=1}^k \left ( \frac{x_{i.}}{n} - \frac{x_{..}}{kn} \right )^2 \quad SS(Bl) = k\sum_{j=1}^n \left ( \frac{x_{.j}}{k} - \frac{x_{..}}{kn} \right )^2 \quad SS(E) = \sum_{i=1}^k\sum_{j=1}^n \left ( x_{ij} - \frac{x_{i.}}{n} - \frac{x_{.j}}{k} + \frac{x_{..}}{kn} \right )^2$$

The mean square column is pretty simple, just the sum of squares divided by the degrees of freedom for that row.

The last column is the mean square in that row divided by $MS(E)$. It's only available for the first two rows, obviously.

Now let's redo the table for the routes thing:

  • $k = 4$, as before
  • $x_{1.} = 129$ (total for route 1)
  • $x_{2.} = 135$ (total for route 2)
  • $x_{3.} = 151$ (total for route 3)
  • $x_{4.} = 141$ (total for route 4)
  • $x_{..} = 556$ (total for all routes)
  • $x_{.1} = 99$ (total for Monday)
  • $x_{.2} = 110$ (total for Tuesday)
  • $x_{.3} = 113$ (total for Wednesday)
  • $x_{.4} = 111$ (total for Thursday)
  • $x_{.5} = 123$ (total for Friday)
  • $n = 5$ (5 blocks, as in 5 days of the week)
  • $SS(Tr) = 5 \cdot (4 + 0.64 + 5.76 + 0.16) = 52.8$ — the same as before, because the treatment totals haven't changed; blocking only splits up what used to be the error sum of squares
  • $SS(Bl) = 4 \cdot (9.3025 + 0.09 + 0.2025 + 0.0025 + 8.7025) = 73.2$
  • $SS(T) = 153.2$, also the same as before, so $SS(E) = 153.2 - 52.8 - 73.2 = 27.2$

| Source of variation | Degrees of freedom | Sum of squares | Mean square | F |
| --- | --- | --- | --- | --- |
| Treatments | 3 | 52.8 | 17.6 | 7.75 |
| Blocks | 4 | 73.2 | 18.3 | 8.06 |
| Errors | 12 | 27.2 | 2.27 | -- |
| Total | 19 | 153.2 | -- | -- |

To reject the null hypothesis for treatments, the $F$ value must be greater than $F_{0.05, 3, 12} = 3.49$; since it is, we can reject it. To reject the null hypothesis for blocks, the $F$ value must be greater than $F_{0.05, 4, 12} = 3.26$; since it is, we can reject it too.
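Here's a sketch that recomputes the randomised block table for the same route data directly from the sum-of-squares formulas above:

```python
import numpy as np
from scipy import stats

# Rows = blocks (days), columns = treatments (routes), from the table above
data = np.array([
    [22, 25, 26, 26],
    [26, 27, 29, 28],
    [25, 28, 33, 27],
    [25, 26, 30, 30],
    [31, 29, 33, 30],
], dtype=float)

n_blocks, k = data.shape
grand_mean = data.mean()

ss_total = np.sum((data - grand_mean) ** 2)
ss_tr = n_blocks * np.sum((data.mean(axis=0) - grand_mean) ** 2)   # treatment (column) means
ss_bl = k * np.sum((data.mean(axis=1) - grand_mean) ** 2)          # block (row) means
ss_e = ss_total - ss_tr - ss_bl

ms_tr = ss_tr / (k - 1)
ms_bl = ss_bl / (n_blocks - 1)
ms_e = ss_e / ((k - 1) * (n_blocks - 1))

print(ss_tr, ss_bl, ss_e)                 # expect roughly 52.8, 73.2, 27.2
print(ms_tr / ms_e, ms_bl / ms_e)         # F values for treatments and blocks
print(stats.f.ppf(0.95, 3, 12), stats.f.ppf(0.95, 4, 12))   # critical values
```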

7Chi-square tests

What is a chi-square test? Roughly: a test that compares observed counts with the counts expected under a null hypothesis, using a statistic that approximately follows a chi-square distribution.

7.1Independent binomial populations

$X_1,\ldots, X_k$ are observations from $k$ independent binomial distributions with parameters $(n_i, p_i)$. We want to test the hypothesis that the $p_i$'s are all equal to some given value (or just that they're all equal), against the alternative that they're not. Since $X_i$ is a binomial random variable, if $n_i$ is large enough, then

$$\frac{X_i - n_i p_i}{\sqrt{n_i p_i(1-p_i)}} \sim N(0, 1)$$

and so

$$\frac{(X_i - n_ip_i)^2}{n_ip_i(1-p_i)} \sim \chi^2_{1}$$

and thus, if all the $n_i$ are large enough (a common rule of thumb is that $n_ip_i$ and $n_i(1-p_i)$ are both at least 5), then, by independence:

$$\sum_{i=1}^k \frac{(X_i - n_i p_i)^2}{n_ip_i(1-p_i)} \sim \chi^2_{k}$$

The $\alpha$-level test is: reject $H_0$ if $\chi^2 \geq \chi^2_{\alpha, k}$ where $\chi^2$ in this case is the value given by

$$\chi^2 = \sum_{i=1}^k \frac{(X_i - n_i p_{i, 0})^2}{n_ip_{i, 0}(1-p_{i, 0})}$$

where $p_{i, 0}$ is the value of $p_i$ specified by the null hypothesis $H_0$.

For example, we want to test

$$H_0: \, p_1 = p_2 = p_3 = 0.3 \quad H_1: \text{ not } H_0$$

at $\alpha = 0.05$ (so $p_{1, 0} = 0.3$ and $1 - p_{1, 0} = 0.7$). The observations are:

$$x_1 = 155 \quad x_2 = 118 \quad x_3 = 87 \quad n_1 = 250 \quad n_2 = 200 \quad n_3 = 150$$

so

$$\chi^2 = \left ( \frac{(155 - 250 \cdot 0.3)^2}{250 \cdot 0.3 \cdot 0.7} \right ) +\left ( \frac{(118 - 200 \cdot 0.3)^2}{200 \cdot 0.3 \cdot 0.7} \right ) + \left ( \frac{(87-150\cdot 0.3)^2}{150 \cdot 0.3 \cdot 0.7} \right ) = 258$$

This is greater than $\chi^2_{0.05, 3} = 7.81$, so we reject the null hypothesis $H_0$.

Alternatively, if we just want to test if the $p$s are equal (and don't know the actual value) then we have to estimate $p$ by the pooled estimate:

$$\hat p = \frac{x_1 + \ldots + x_k}{n_1 + \ldots + n_k}$$

Same idea as before, except you lose a degree of freedom because the common value is estimated from the data by $\hat p$ (each parameter estimated from the sample costs one degree of freedom), so the statistic is compared against $\chi^2_{\alpha, k-1}$.

In the previous example, we would estimate $p$ to be

$$\hat p = \frac{155 + 118 + 87}{250+200+150} = 0.6$$

and so the value $\chi^2$ becomes

$$\chi^2 = \left ( \frac{(155 - 250 \cdot 0.6)^2}{250 \cdot 0.6 \cdot 0.4} \right ) +\left ( \frac{(118 - 200 \cdot 0.6)^2}{200 \cdot 0.6 \cdot 0.4} \right ) + \left ( \frac{(87-150\cdot 0.6)^2}{150 \cdot 0.6 \cdot 0.4} \right ) = 0.75$$

This is less than $\chi^2_{0.05, 2} = 5.99$ so we cannot reject $H_0$. So it's not unlikely that all the $p$s are the same.
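Both versions of the test (specified $p$ and pooled $\hat p$) are easy to reproduce with numpy, using the numbers from the example:

```python
import numpy as np
from scipy import stats

x = np.array([155, 118, 87])
n = np.array([250, 200, 150])

# H0: every p_i equals the specified value 0.3 -> chi-square with k d.f.
p0 = 0.3
chi2_specified = np.sum((x - n * p0) ** 2 / (n * p0 * (1 - p0)))
print(chi2_specified, stats.chi2.ppf(0.95, df=len(x)))       # ~258 vs 7.81

# H0: the p_i are equal but unspecified -> pool them, lose one d.f.
p_hat = x.sum() / n.sum()
chi2_pooled = np.sum((x - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat)))
print(chi2_pooled, stats.chi2.ppf(0.95, df=len(x) - 1))      # ~0.75 vs 5.99
```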

7.2Multinomial distributions

thank god almost done

Suppose the random vector $(X_1, \ldots, X_k)$ has the multinomial distribution

$$P(X_1 = x_1, \ldots, X_k = x_k) = \begin{cases} \frac{n!}{x_1!\ldots x_k!}p_1^{x_1} \ldots p_k^{x_k} & \text{ if } x_1 + \ldots + x_k = n \\ 0 & \text{ otherwise} \end{cases}$$

with parameters $n$ (positive) and $p_1, \ldots, p_k$. Then we have a statistic called Pearson's chi-square statistic, given by

$$\chi^2 = \sum_{i=1}^k \frac{(X_i - np_i)^2}{np_i}$$

As $n \to \infty$, the distribution of this statistic tends to the chi-square distribution with $k-1$ degrees of freedom, so for large enough $n$ (rule of thumb: every $np_i \geq 5$) we can approximate it by that. (This works partly because of the central limit theorem. We lose a degree of freedom because the $p_i$'s sum to 1, so only $k-1$ of them are free.)

The tests are pretty much the same as before, except for the loss of a degree of freedom. For example:

| Door | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Number of rats which chose the door | 23 | 36 | 31 | 30 | 30 |

Test the hypotheses:

$$H_0: \, p_1 = p_2 = p_3 = p_4 = p_5 = 0.2 \quad H_1: \, \text{not } H_0$$

using $\alpha = 0.01$.

First, we calculate $\chi^2$ (the total number of rats is $23+36+31+30+30 = 150$, so the expected count for each door is $150 \cdot 0.2 = 30$):

$$\chi^2 = \left (\frac{(23-30)^2}{30} \right ) + \left (\frac{(36-30)^2}{30} \right ) + \left (\frac{(31-30)^2}{30} \right ) + \left (\frac{(30-30)^2}{30} \right ) + \left (\frac{(30-30)^2}{30} \right ) = \frac{49}{30} + \frac{36}{30} + \frac{1}{30} = \frac{86}{30} = 2.87$$

Now, $\chi^2_{0.01, 4} = 13.28$. So we can't reject the hypothesis based on the data provided (i.e. the data are not sufficient to show that the rats show a preference for certain doors).

Wow, that was cool.
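The same calculation can be done with scipy.stats.chisquare, which takes the observed and expected counts:

```python
import numpy as np
from scipy import stats

observed = np.array([23, 36, 31, 30, 30])
expected = np.full(5, observed.sum() * 0.2)     # 150 * 0.2 = 30 rats per door under H0

chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2_stat, p_value)                       # ~2.87, p well above 0.01
print(stats.chi2.ppf(0.99, df=4))               # critical value, ~13.28
```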

Incidentally, we can use this statistic to test hypotheses for other types of distributions, too. We have "observed cell frequencies" and "expected cell frequencies". The statistic becomes

$$\chi^2 = \sum_{i=1}^k \frac{(n_i-e_i)^2}{e_i}$$

where $e_i$ is the number expected in that cell. Same $\chi^2_{k-1}$ distribution. This is pretty straightforward.

Note that for each independent parameter that is estimated from the sample data, the degrees of freedom drops by 1.

7.3Goodness of fit

Hypothesis testing for whether a hypothesised distribution is correct. Use the formula above. For example, to test whether data fit a Poisson distribution with a supplied value of $\lambda$: calculate the expected count for each cell from the Poisson probabilities, $e_i = n \cdot P(X = i)$, then use Pearson's chi-square statistic with $k-1$ degrees of freedom. If $\chi^2 \geq \chi^2_{\alpha, k-1}$ then we reject $H_0$. (Cells with small expected counts, typically less than 5, are combined with neighbouring cells so that the chi-square approximation holds; the degrees of freedom are then based on the number of cells after combining.)

If we just want to test whether it fits a Poisson distribution without a value for $\lambda$, then we estimate $\lambda$ with the sample mean $\overline x$. Then, same idea, and since we had to estimate one parameter we drop the d.f. by one more (so now it's $k-2$).

In any case, even if we can't reject $H_0$ (which is the null hypothesis here: that the hypothesised distribution fits), we can't actually accept it. All we can say is that the distribution with the given parameter provides a good fit to the data, or that the data are not inconsistent with the assumption of that distribution with that parameter.

An example ($n = 100, \alpha = 0.05$):

| Cell | # of hours | $n_i$ | $e_i$ |
| --- | --- | --- | --- |
| 1 | 0-5 | 37 | 39.34 |
| 2 | 5-10 | 20 | 23.87 |
| 3 | 10-15 | 17 | 14.47 |
| 4 | 15-20 | 13 | 8.78 |
| 5 | 20-25 | 8 | 5.32 |
| 6 | 25+ | 5 | 8.20 |

where the $e_i$'s come from the exponential distribution with mean $\hat \theta$, by integrating its density over the limits given in the "# of hours" column. We estimate $\hat \theta = \overline x = 10$ from the grouped data: multiply the midpoint of each interval by $n_i$ (treating the last cell as $25$-$30$), add these up, and divide by $n = 100$. We then find the chi-square statistic:

$$\chi^2 = \sum_{i=1}^6 \frac{(n_i - e_i)^2}{e_i} = \frac{2.34^2}{39.34} + \frac{3.87^2}{23.87} + \frac{2.53^2}{14.47} + \frac{4.22^2}{8.78} + \frac{2.68^2}{5.32} + \frac{3.20^2}{8.20} = 5.84$$

Now, $\chi^2_{0.05, 4} = 9.49$ (4 degrees of freedom: $k - 1 = 5$, minus another 1 because we estimated $\hat\theta$ from the data), so we can't reject the hypothesis.
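A sketch that rebuilds the expected counts from an exponential distribution with mean $\hat\theta = 10$ and recomputes the statistic:

```python
import numpy as np
from scipy import stats

observed = np.array([37, 20, 17, 13, 8, 5])
edges = np.array([0, 5, 10, 15, 20, 25, np.inf])   # the "# of hours" cell boundaries
theta_hat = 10.0                                    # mean estimated from the grouped data

# Expected counts: n * P(cell) under an exponential with mean theta_hat
probs = np.diff(stats.expon.cdf(edges, scale=theta_hat))
expected = observed.sum() * probs

chi2_stat = np.sum((observed - expected) ** 2 / expected)
print(expected.round(2))                            # roughly [39.35, 23.87, 14.47, 8.78, 5.32, 8.21]
print(chi2_stat, stats.chi2.ppf(0.95, df=len(observed) - 2))   # ~5.84 vs 9.49
```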

7.4Contingency tables

Example:

| -- | Low interest in mathematics | Average interest | High interest |
| --- | --- | --- | --- |
| Low ability in stats | 65 | 40 | 15 |
| Average ability | 54 | 63 | 29 |
| High ability | 12 | 45 | 27 |

We want to test $H_0:$ ability in stats and interest in math are independent and $H_1:$ not $H_0$. So what we do first is calculate the row and column sums and write those down in the table margins:

| -- | Low interest in mathematics | Average interest | High interest | Row sums |
| --- | --- | --- | --- | --- |
| Low ability in stats | 65 | 40 | 15 | 120 |
| Average ability | 54 | 63 | 29 | 146 |
| High ability | 12 | 45 | 27 | 84 |
| Column sums | 131 | 148 | 71 | 350 |

Then we make an expected cell frequency table, where each cell is the product of its row and column sums, divided by $n$ (350 in this case):

| -- | Low interest in mathematics | Average interest | High interest |
| --- | --- | --- | --- |
| Low ability in stats | 44.91 | 50.74 | 24.34 |
| Average ability | 54.65 | 61.74 | 29.62 |
| High ability | 31.44 | 35.52 | 17.04 |

Then we calculate the chi-square statistic:

$$\chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$

which is pretty much the same thing as before (remember to divide by the expected frequencies, as usual). We then just compare it with $\chi^2_{\alpha, (r-1)(c-1)}$, where $r$ and $c$ are the numbers of rows and columns.

In this specific case, working through the sums gives $\chi^2 = 35.26$, and $\chi^2_{0.01, 4} = 13.28$, so we reject $H_0$, meaning ability in stats and interest in maths are probably not independent. (The notes give $\chi^2 = 31.67$, which appears to be a miscalculation.)
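scipy.stats.chi2_contingency does the whole calculation (marginal sums, expected frequencies, statistic, degrees of freedom) in one call:

```python
import numpy as np
from scipy import stats

# Rows: low/average/high ability in stats; columns: low/average/high interest in maths
table = np.array([
    [65, 40, 15],
    [54, 63, 29],
    [12, 45, 27],
])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2_stat, dof)                   # ~35.26 with 4 d.f.
print(expected.round(2))                # matches the expected-frequency table above
print(stats.chi2.ppf(0.99, df=dof))     # ~13.28, so H0 (independence) is rejected
```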

8Non-parametric methods of inference

IGNORED THANK GOD