Fall 2009 Final CC-BY-NC

Maintainer: admin

Exam and solutions available on WebCT.

Under construction

1 Question 1

An urn contains 10 marbles, of which θ are red and the rest are green. Three marbles are chosen at random, resulting in one green and two red marbles. Find the maximum likelihood estimate of θ if

(a) sampling is with replacement,
(b) sampling is without replacement.

1.1 Solution

(a) There are three samples, $X_1$, $X_2$, and $X_3$. If we define "success" as getting a red marble, the likelihood function for $\theta$ is the probability that we got 1 green marble and 2 red marbles:

$$L(\theta) = \binom{3}{2} \left ( \frac{\theta}{10} \right )^2 \left ( \frac{10-\theta}{10} \right ) = \frac{3!}{2!1!} \frac{1}{10^3} \theta^2(10-\theta) = \frac{3}{10^3} \underbrace{\theta^2(10-\theta)}_{f(\theta)}$$

(We can factor out the constant as we're looking for the value of $\theta$ that results in the maximum $L(\theta)$; we don't need to know the specific value of $L(\theta)$.) As $\theta$ can only take the integer values $0, 1, \ldots, 10$, we can write this out in table format:

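| $\theta$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $f(\theta)$ | 0 | 9 | 32 | 63 | 96 | 125 | 144 | 147 | 128 | 81 | 0 |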

So $\hat \theta = 7$, as $f(\theta)$ and thus $L(\theta)$ is maximised when $\theta = 7$.

Alternatively, we could treat $\theta$ as continuous and differentiate: $f'(\theta) = 20\theta - 3\theta^2 = \theta(20 - 3\theta)$, which vanishes at $\theta = \frac{20}{3}$. Since $f$ is increasing on $(0, \frac{20}{3})$ and decreasing on $(\frac{20}{3}, 10)$, the integer maximiser must be 6 or 7, and comparing $f(6) = 144$ with $f(7) = 147$ again gives $\hat \theta = 7$. The table method is probably simpler.
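The table for part (a) can be checked with a few lines of Python. This is only a sketch: the names `N_MARBLES`, `kernel_with_replacement`, `table_a` and `theta_hat_a` are made up for illustration, and the constant factor $3/10^3$ is dropped as in the text.

```python
# Kernel of the likelihood for sampling WITH replacement:
# f(theta) = theta^2 * (10 - theta), for 2 red and 1 green in 3 draws.
N_MARBLES = 10  # urn size from the question


def kernel_with_replacement(theta: int) -> int:
    """Return f(theta), the likelihood without its constant factor."""
    return theta**2 * (N_MARBLES - theta)


# theta can only be an integer between 0 and 10, so just enumerate.
table_a = {theta: kernel_with_replacement(theta) for theta in range(N_MARBLES + 1)}

# The MLE is the theta with the largest kernel value.
theta_hat_a = max(table_a, key=table_a.get)

print(table_a)
print("MLE with replacement:", theta_hat_a)
```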

(b) The likelihood function is as follows:

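$$L(\theta) = \frac{\binom{\theta}{2} \binom{10-\theta}{1}}{\binom{10}{3}} = \frac{1}{2\binom{10}{3}} \cdot \underbrace{\theta(\theta-1)(10-\theta)}_{f(\theta)}$$

This is the hypergeometric probability: of the $\binom{10}{3}$ equally likely ways to draw 3 marbles without replacement, $\binom{\theta}{2}\binom{10-\theta}{1}$ consist of 2 of the $\theta$ red marbles and 1 of the $10-\theta$ green ones.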

Anyways, we draw out the table again:

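| $\theta$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $f(\theta)$ | 0 | 0 | 16 | 42 | 72 | 100 | 120 | 126 | 112 | 72 | 0 |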

and again $\hat \theta = 7$.
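For part (b), the full likelihood can be enumerated directly from binomial coefficients. This is a minimal sketch; `likelihood_without_replacement`, `likelihoods_b` and `theta_hat_b` are illustrative names, not anything from the course material.

```python
from math import comb


def likelihood_without_replacement(theta: int, n_marbles: int = 10) -> float:
    """P(2 red, 1 green) when 3 marbles are drawn without replacement.

    math.comb(n, k) returns 0 when k > n, so theta = 0 and theta = 1
    correctly give likelihood 0.
    """
    return comb(theta, 2) * comb(n_marbles - theta, 1) / comb(n_marbles, 3)


# Enumerate every possible number of red marbles in the urn.
likelihoods_b = {theta: likelihood_without_replacement(theta) for theta in range(11)}

# The MLE is the theta with the largest likelihood.
theta_hat_b = max(likelihoods_b, key=likelihoods_b.get)

print(likelihoods_b)
print("MLE without replacement:", theta_hat_b)
```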

I'm having a hard time figuring out how to generalise this for the HTSEFP (or even where this is covered in the notes/textbook). Help plz?

1.2 Accuracy and discussion

Same as solutions, but, how????

2 Question 2

Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution $N(\mu, \sigma^2)$, and let

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline X)^2$$

Show that $s^2$ is an unbiased, consistent estimator of $\sigma^2$.

2.1 Solution

To show that it's unbiased, we need to show that

$$E(s^2) = \sigma^2$$

Now, recall the lemma from section 1.2 of the course review:

$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i- \overline x)^2 + n(\overline x -\mu)^2$$

We can rewrite it with the term of interest on the left side of the equation:

$$\sum_{i=1}^n (X_i-\overline X)^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\overline X - \mu)^2$$

(Note the switched order.) Then, we can derive the expected value of the sample variance as follows:

$$\begin{align*}E(s^2) & = E \left (\frac{1}{n-1} \sum_{i=1}^n (X_i - \overline X)^2 \right ) = \frac{1}{n-1} E\left (\sum_{i=1}^n (X_i - \overline X)^2 \right) = \frac{1}{n-1} E\left ( \sum_{i=1}^n (X_i - \mu)^2 - n(\overline X - \mu)^2 \right ) \\ & = \frac{1}{n-1} \left ( \sum_{i=1}^n E((X_i - \mu)^2) - n E((\overline X- \mu)^2) \right ) = \frac{1}{n-1} \left ( \sum_{i=1}^n Var(X_i) - n Var(\overline X) \right ) = \frac{1}{n-1} \left ( n \sigma^2 - n \cdot \frac{\sigma^2}{n} \right ) \\ & = \frac{1}{n-1} (n\sigma^2 - \sigma^2) = \frac{(n-1)\sigma^2}{n-1} \\ & = \sigma^2 \end{align*}$$

(The above comes from the course review. Link later.)

Now, we need to show that it's consistent. To do this, we use the fact that $(n-1)s^2/\sigma^2$ has a chi-square distribution (HOW DO WE KNOW THIS???) with $n-1$ degrees of freedom and variance $2(n-1)$. We can then rewrite the formula for the variance of $s^2$ as follows:

$$Var(s^2) = Var\left ( \frac{\sigma^2}{n-1} \cdot \frac{(n-1)s^2}{\sigma^2} \right ) = \left ( \frac{\sigma^2}{n-1} \right )^2 Var\left ( \frac{(n-1)s^2}{\sigma^2} \right ) = \frac{\sigma^4}{(n-1)^2} \cdot 2(n-1) = \frac{2\sigma^4}{n-1}$$

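We then take the limit as $n \to \infty$:

$$\lim_{n\to\infty} \frac{2\sigma^4}{n-1} = 0$$

Since $s^2$ is unbiased and its variance tends to 0, Chebyshev's inequality gives $P(|s^2 - \sigma^2| \geq \epsilon) \leq Var(s^2)/\epsilon^2 \to 0$ for every $\epsilon > 0$, so $s^2$ is consistent.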

(Again from the course review, panicked interjection and all.)
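As a sanity check (not a proof), a small seeded simulation can compare the average and variance of $s^2$ against $\sigma^2$ and $2\sigma^4/(n-1)$. The values of `MU`, `SIGMA`, `N` and `TRIALS` below are arbitrary choices for illustration, not anything from the exam.

```python
import random


def sample_variance(xs):
    """Sample variance with the n - 1 denominator, i.e. s^2."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


random.seed(0)  # fixed seed so the run is reproducible

# Illustrative parameters (assumptions, not from the exam).
MU, SIGMA, N, TRIALS = 0.0, 2.0, 5, 20_000

# Draw TRIALS normal samples of size N and compute s^2 for each.
s2_values = [
    sample_variance([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(TRIALS)
]

mean_s2 = sum(s2_values) / TRIALS    # should be close to sigma^2
var_s2 = sample_variance(s2_values)  # should be close to 2*sigma^4/(n-1)

print("average of s^2:", mean_s2, "target:", SIGMA**2)
print("variance of s^2:", var_s2, "target:", 2 * SIGMA**4 / (N - 1))
```

Increasing `N` in the sketch should shrink the variance of $s^2$, which is the behaviour the consistency argument relies on.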

2.2 Accuracy and discussion

The consistent part is the same as in the solutions. The unbiased part is different. Why?

3 Question 3

Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution $N(\mu, \sigma^2)$ with known mean $\mu$.

(a) Use the Neyman-Pearson lemma to find a most powerful critical region of size $\alpha$ for testing the hypotheses

$$\begin{align*} H_0\, & :\,\sigma^2=\sigma_0^2 \\ H_1\, & : \, \sigma^2 = \sigma_1^2\end{align*}$$

where $\sigma_1^2 > \sigma_0^2$.

(b) Suppose in (a) that $\mu = 4$, $\sigma_0^2 = 5$ and $\sigma_1^2 = 10$. What conclusion at level $\alpha = 0.05$ would you make in (a) if $n=3$ and the observations are 2, 5, and 8?

3.1 Solution

Later

3.2 Accuracy and discussion

Later

4 Question 4

The performance of a soft drink bottling machine is to be tested. Five ten-ounce bottles are filled by the machine and the resulting contents in fluid ounces were measured as

$$8.0, 8.3, 7.2, 7.7, 8.1.$$

Assume that the data come from a normal population $N(\mu, \sigma^2)$.

(a) Are the data sufficient to disprove the manufacturer's claim that $\sigma \leq 0.2$? Test at level $\alpha = 0.05$.
(b) Find a 99% confidence interval for $\mu$.

4.1 Solution

(a) Use the following distribution:

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

where $n = 5$, $s^2 = 0.183$, and $\chi^2_{0.05, 4} = 9.49$. We can reject the claim that $\sigma \leq 0.2$ at level $\alpha = 0.05$ if

$$\frac{(n-1)s^2}{\sigma^2} > \chi^2_{0.05, 4}$$

Substituting the boundary value $\sigma = 0.2$ and the observed $s^2$:

$$\frac{4 \cdot 0.183}{0.2^2} = 18.3 > 9.49$$

The inequality holds, so we reject the claim that $\sigma \leq 0.2$.
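The arithmetic for the variance test can be reproduced as follows. This is a sketch only: the critical value 9.49 is copied from the text rather than computed, and the variable names are illustrative.

```python
# Bottle contents in fluid ounces, from the question.
data = [8.0, 8.3, 7.2, 7.7, 8.1]
n = len(data)

# Sample mean and sample variance (n - 1 denominator).
mean = sum(data) / n
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

# Test statistic (n - 1) s^2 / sigma_0^2 at the boundary sigma_0 = 0.2.
SIGMA_0 = 0.2
chi2_stat = (n - 1) * s2 / SIGMA_0**2

# Upper 5% point of chi-square with 4 d.f., taken from the text.
CHI2_CRIT = 9.49

reject_claim = chi2_stat > CHI2_CRIT

print("mean:", mean, "s^2:", s2)
print("test statistic:", chi2_stat, "reject claim:", reject_claim)
```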

4.2 Accuracy and discussion

Should be right

5 Question 5

A random check on a famous tourist location shows that on a particular day, it was visited by the following numbers of people:

| Region | Canada | USA | South America | Europe | Others |
|---|---|---|---|---|---|
| Numbers | 60 | 100 | 40 | 30 | 20 |

Test the hypothesis, at level $\alpha = 0.05$, that the true proportions of visitors from the different regions are all equal.

5.1 Solution

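This one's really trivial. Under the hypothesis of equal proportions, each of the 5 regions has probability $1/5$, so with 250 visitors in total the expected count for each region is 50. Here's the table with the addition of the "expected" values:

| Region | Canada | USA | South America | Europe | Others |
|---|---|---|---|---|---|
| Numbers | 60 | 100 | 40 | 30 | 20 |
| Expected | 50 | 50 | 50 | 50 | 50 |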

So we take

$$\chi^2 = \sum_{i=1}^5 \frac{(n_i - e_i)^2}{e_i} = \frac{10^2+50^2+10^2+20^2+30^2}{50} = 80$$

Since $\chi^2 = 80 > \chi_{0.05, 4}^2 = 9.49$, we reject the hypothesis that the true proportions of visitors are all equal.
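The goodness-of-fit statistic is easy to recompute. In this sketch the critical value 9.49 is taken from the text rather than computed, and the names are illustrative.

```python
# Observed visitor counts: Canada, USA, South America, Europe, Others.
observed = [60, 100, 40, 30, 20]

# Under equal proportions, every region has the same expected count.
expected = sum(observed) / len(observed)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2_gof = sum((o - expected) ** 2 / expected for o in observed)

# Degrees of freedom: number of categories minus one.
df = len(observed) - 1

# Upper 5% point of chi-square with 4 d.f., taken from the text.
CHI2_CRIT_GOF = 9.49

reject_equal = chi2_gof > CHI2_CRIT_GOF

print("expected per region:", expected)
print("chi-square:", chi2_gof, "d.f.:", df, "reject:", reject_equal)
```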

5.2 Accuracy and discussion

Agrees with solutions. Easy points.

6 Question 6

The midterm and final exam scores of five students in a statistics course are given by:

| Midterm (x) | 67 | 70 | 64 | 74 | 82 |
|---|---|---|---|---|---|
| Final (y) | 73 | 83 | 79 | 91 | 94 |

It has been decided to fit the linear model $y_i = \beta_0 + \beta_1 x_i + e_i$ to this data:

(a) Find least squares estimates of $\beta_0$ and $\beta_1$, and the predicted value of $y$ when $x=80$.
(b) Assuming that the $e_i$s are independent and have distributions $N(0, \sigma^2)$, find a 95% confidence interval for $\beta_0$. Recall that

$$t = \frac{\hat \beta_0 - \beta_0}{\displaystyle \hat \sigma \sqrt{\frac{1}{n} + \frac{\overline x^2}{S_{xx}}}}, \quad \hat \sigma^2 = \frac{\text{least squares minimum}}{n-2}$$

6.1 Solution

(a) First, acquire the following figures (using your calculator most likely):

$$\sum x_i = 357 \quad \sum x_i^2 = 25685 \quad \sum x_iy_i = 30199 \quad \sum y_i = 420 \quad \sum y_i^2 = 35576$$

Then, calculate the corrected sums of squares and cross-products:

$$S_{xx} = \frac{n(\sum x^2) - (\sum x)^2}{n} = \frac{976}{5} = 195.2 \quad S_{xy} = \frac{n(\sum xy) - (\sum x)(\sum y)}{n} = \frac{1055}{5} = 211 \quad S_{yy} = \frac{n(\sum y^2) - (\sum y)^2}{n} = \frac{1480}{5} = 296$$

Then we can calculate the estimates:

$$\hat \beta_0 = \frac{(\sum x^2)(\sum y) - (\sum x)(\sum xy)}{n S_{xx}} = \frac{6657}{976} = 6.82 \quad \hat \beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{211}{195.2} = 1.08$$

So the line of best fit is $y = 6.82 + 1.08 x$. When $x = 80$, the predicted value is $y = 93.3$.
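The estimates in part (a) can be recomputed without rounding. This sketch uses the equivalent form $\hat \beta_0 = \overline y - \hat \beta_1 \overline x$ for the intercept rather than the formula above, and all variable names are illustrative.

```python
# Midterm (x) and final (y) scores from the question.
x = [67, 70, 64, 74, 82]
y = [73, 83, 79, 91, 94]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products.
s_xx = sum(xi**2 for xi in x) - sum(x) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Least squares slope and intercept.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Predicted final score for a midterm score of 80.
y_pred_80 = beta0_hat + beta1_hat * 80

print("S_xx:", s_xx, "S_xy:", s_xy)
print("beta0_hat:", beta0_hat, "beta1_hat:", beta1_hat)
print("prediction at x = 80:", y_pred_80)
```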

(b) First, we calculate the least squares minimum:

$$SS(Res) = S_{yy} - \hat \beta_1 S_{xy} = 296 - (1.08) \cdot 211 = 68.12$$

Then we calculate the estimate for $\sigma^2$:

$$\hat \sigma^2 = \frac{68.12}{3} = 22.7 \quad \therefore \hat \sigma = \sqrt{22.7} = 4.77$$

Then, we substitute the remaining values, noting that $t_{0.025, 3} = 3.18$ and $\overline x = 357/5 = 71.4$:

$$|\hat \beta_0 - \beta_0| \leq 3.18 \cdot 4.77 \sqrt{\frac{1}{5} + \frac{71.4^2}{195.2}} = 3.18 \cdot 4.77 \cdot 5.13 = 77.8$$

so the 95% confidence interval for $\beta_0$ is $6.82 \pm 77.8$, which is very wide.
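Part (b) can be redone without intermediate rounding. This is a sketch under stated assumptions: the critical value $t_{0.025,3} = 3.18$ is copied from the text rather than computed, and the names are illustrative. Using the unrounded slope gives a residual sum of squares of about 67.9 instead of 68.12, so the half-width comes out slightly below the hand-computed 77.8.

```python
from math import sqrt

# Midterm (x) and final (y) scores from the question.
x = [67, 70, 64, 74, 82]
y = [73, 83, 79, 91, 94]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Least squares minimum and the estimate of sigma.
ss_res = s_yy - beta1_hat * s_xy
sigma_hat = sqrt(ss_res / (n - 2))

# t critical value with n - 2 = 3 d.f., taken from the text.
T_CRIT = 3.18

half_width = T_CRIT * sigma_hat * sqrt(1 / n + x_bar**2 / s_xx)
ci_lower, ci_upper = beta0_hat - half_width, beta0_hat + half_width

print("SS(Res):", ss_res, "sigma_hat:", sigma_hat)
print("95% CI for beta_0:", (ci_lower, ci_upper))
```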

6.2 Accuracy and discussion

Works. The numbers for part (b) are a bit off because I rounded (incorrectly?) at some point

7 Question 7

Consider the linear model

$$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \ldots, n.$$

(a) Derive the least squares estimators $\hat \beta_0$ and $\hat \beta_1$ of $\beta_0$ and $\beta_1$.
(b) Assuming the $x_i$s are fixed numbers and the errors $e_i$, $i = 1,\ldots,n$ are independent random variables with zero means and variance $\sigma^2$, prove that $\hat \beta_1$ is unbiased and has variance $\sigma^2/S_{xx}$.

7.1 Solution

7.2 Accuracy and discussion

8 Question 8

Three varieties of wheat are field tested by observing their yields in 5, 4, 4 plots respectively. Complete the following analysis of variance table and test the hypothesis at level 0.01 that the expected yields of the three varieties are the same.

| Variation due to | d.f. | Sum of squares | Mean square | F-ratio |
|---|---|---|---|---|
| Varieties | | 60 | | |
| Residual | | | | |
| Total | | 100 | | |

8.1 Solution

| Variation due to | d.f. | Sum of squares | Mean square | F-ratio |
|---|---|---|---|---|
| Varieties | 2 | 60 | 30 | 7.5 |
| Residual | 10 | 40 | 4 | |
| Total | 12 | 100 | | |

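The degrees of freedom are $3 - 1 = 2$ for varieties and $13 - 1 = 12$ in total, leaving 10 for the residual; the residual sum of squares is $100 - 60 = 40$. Since $F = 7.5 < F_{0.01,2,10} = 7.56$, we don't reject the hypothesis that the expected yields are the same.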
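Filling in the ANOVA table is mechanical, as this sketch shows. The critical value 7.56 is taken from the text rather than computed, and the variable names are illustrative.

```python
# Plots per wheat variety and the two sums of squares given in the question.
plots_per_variety = [5, 4, 4]
ss_varieties = 60
ss_total = 100

k = len(plots_per_variety)      # number of varieties
n_total = sum(plots_per_variety)  # total number of plots

# Degrees of freedom.
df_varieties = k - 1
df_total = n_total - 1
df_residual = df_total - df_varieties

# Residual sum of squares by subtraction, then the mean squares.
ss_residual = ss_total - ss_varieties
ms_varieties = ss_varieties / df_varieties
ms_residual = ss_residual / df_residual

f_ratio = ms_varieties / ms_residual

# Upper 1% point of F with (2, 10) d.f., taken from the text.
F_CRIT = 7.56

reject_equal_yields = f_ratio > F_CRIT

print("d.f.:", df_varieties, df_residual, df_total)
print("mean squares:", ms_varieties, ms_residual)
print("F:", f_ratio, "reject:", reject_equal_yields)
```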

8.2 Accuracy and discussion

Easy, actually