Course Review CC-BY-NC

Maintainer: admin

1 Testing hypotheses about population means

Testing a hypothesis about the true population mean from a sample:

$$Z = \frac{\bar{X}-\mu_0}{S/\sqrt{n}}$$

Use the t-distribution instead of Z if there are fewer than 30 observations.
Assumptions: data are sampled independently and identically distributed, and (for small samples) the data are drawn from a normal distribution.

Testing if two samples have the same proportion:
$$p = \frac{n_1p_1 + n_2p_2}{n_1+n_2}$$
$$Z = \frac{p_1-p_2}{\sqrt{pq/n_1 + pq/n_2}}$$

Testing if two dependent samples have the same mean:
$$Z = \frac{\bar{X}_D-\mu_0}{S_D/\sqrt{n_D}}$$

Testing if two independent samples have the same mean:
$$Z = \frac{\bar{X}_A-\bar{X}_B-\mu_0}{\sqrt{S^2_A/n_A+S^2_B/n_B}}$$

Testing if two small independent samples have the same mean (pooled variance):

$$Z = \frac{\bar{X}_A-\bar{X}_B-\mu_0}{\sqrt{S^2_P/n_A+S^2_P/n_B}}$$
$$S^2_p = \frac{(n_A-1)s^2_A+(n_B-1)s^2_B}{n_A+n_B-2}$$

Testing about the real proportion of one sample:
$$Z = \frac{p-p_0}{\sqrt{p_0q_0/n}}$$

Two samples with small n but assume same pop. variance:
$$s^2_p=\frac{(n_A-1)s^2_A+(n_B-1)s^2_B}{n_A+n_B-2}$$
$$df = n_A+n_B-2$$

When describing, always list assumptions:

  • Large: Independent, identically distributed, large n for CLT to be valid
  • Small: Independent, identically distributed, population is normally distributed
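
A minimal sketch of these tests in R, using made-up data (none of the numbers below come from the notes):

x = c(5.1, 4.9, 5.6, 5.3, 4.8, 5.2, 5.0, 5.4)
y = c(4.7, 5.8, 5.5, 5.9, 6.1, 5.2, 5.6)
t.test(x, mu = 5)                        # one-sample t-test of H0: mu = 5 (small n, assumes normality)
t.test(x, y, var.equal = TRUE)           # pooled two-sample t-test (assumes equal variances)
prop.test(c(40, 55), c(100, 120))        # two-sample proportion test, based on the pooled estimate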

2 Testing hypotheses about a single variance

Testing the true variance (or standard deviation) of a population:
$$W=\frac{(n-1)S^2}{\sigma^2_0}$$
W is distributed under a chi-square distribution with $n-1$ degrees of freedom

Assumptions:

  • Random sample from target population
  • The population is normally distributed -> very important

Testing if two population variances are the same:
$$F= \frac{S^2_1}{S^2_2}$$
If F is close to 1, the two variances are plausibly equal. F follows an F-distribution with $n_1 -1$ and $n_2 -1$ degrees of freedom.
We should be sure that the two groups are normally distributed.
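
A hedged R sketch of both tests, reusing the made-up x and y from the previous sketch and a hypothetical null variance sigma0sq:

sigma0sq = 4                                        # hypothesized population variance
W = (length(x) - 1) * var(x) / sigma0sq             # chi-square statistic for one variance
pchisq(W, df = length(x) - 1, lower.tail = FALSE)   # upper-tail p-value
var.test(x, y)                                      # F-test of equal variances for two normal samples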

3 Experimental Design and Analysis of Variance

3.1 Definitions

Response variable
Dependent variable, random and out of our control
Factor variable
Independent variable/covariates, we can control it, can be quantitative or qualitative
Factor levels
Possible values of the factors
Treatments
If there is only a single factor, the different levels of that factor are called treatments. If there are multiple factors, the combinations of factors and their levels are the treatments
Experimental units
Objects on which the response and factor variables are observed
Designed experiment
The experimenter could control the assignment of treatments to experimental units
Observational experiment/study
The experimenter can only observe

3.2 Completely randomized designs

We want to examine qualitative factor variable vs quantitative response variable.

We have discrete levels of the factor variable, for each level it has a mean corresponding to that level.

If there's no association between the factor and the response, the treatment means are all the same: $\mu_1 = \mu_2 = \cdots = \mu_K$; if any $\mu_i$ is different, then the two variables are dependent.

Therefore $H_0: \mu_1 = \mu_2 = \cdots = \mu_K$

We assign experimental units randomly to the treatments. => completely randomized design

If K=2, then it would be a 2 sample means test, but with K > 2 it is more complicated.

We assume the variances of the response are the same for each treatment AND the response is normally distributed.

If we let $Y_k$ be the mean of the response in the $k$-th treatment group, then if $H_0$ is true the responses are distributed as $N(\mu, \sigma^2)$. In that case we can estimate $\mu$ by averaging ALL the observations, and that overall mean should be close to each $Y_k$

One way to measure how close each $Y_k$ is to the mean is looking at the sum of the squared deviations from the overall mean, weighted by the number of observations in each group:

$$SST = \sum_{k=1}^K n_k(Y_k - \hat Y)^2$$

This gives us an overall measure of how far the means are from the overall mean.

To simplify, assume $n_1 = n_2 ... = n_k = m$, so we have mK total experimental units. Because experimental units are independently allocated to groups, $Y_j$ is independent from $Y_k$
Therefore we can calculate the sample variance of $Y_k$:
$$\frac{\sum_{k=1}^{K}(Y_k-\hat Y)^2}{K-1}$$
If we multiply by $m$, we get $MST = \frac{SST}{K-1}$, which estimates the population variance $\sigma^2$ when $H_0$ is true. We still need an estimate of $\sigma^2$ that doesn't rely on $H_0$.

If we estimate $\sigma^2$ within EACH treatment group, we get
$$s_k^2 = \frac{\sum_{i=1}^{n_k}(Y_{ik}-Y_k)^2}{n_k-1}$$
This holds whether or not $H_0$ is true, so we can combine ALL the $s_k^2$ together to estimate the common $\sigma^2$; just like the pooled two-sample t-test, we generalize it to K samples:
$$SSE = \sum_{k=1}^K\sum_{i=1}^{n_k}(Y_{ik} - Y_k)^2$$
$$s_p^2 = MSE = \frac{SSE}{n-K}$$
This will be valid whether $H_0$ is true or not.

Now we can just compare those two estimators by taking a ratio of those two:
$$F=\frac{MST}{MSE}$$
If $H_0$ is true, then they should have a ratio of close to 1.
If $H_0$ is not true, then MST will tend to be too large, so we only reject for large values of F.
F has $K - 1$ and $n - K$ degrees of freedom.
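
A hand-computed version of the same F-statistic, using a made-up response y and grouping factor g (it matches what summary(aov(y ~ g)) reports):

y = c(23, 25, 21, 30, 33, 29, 40, 38, 41)        # hypothetical response
g = factor(rep(c("A", "B", "C"), each = 3))      # hypothetical treatment labels
ybar  = mean(y)
means = tapply(y, g, mean)                       # group means Y_k
ns    = tapply(y, g, length)                     # group sizes n_k
K = nlevels(g); n = length(y)
SST = sum(ns * (means - ybar)^2);  MST = SST / (K - 1)
SSE = sum((y - ave(y, g))^2);      MSE = SSE / (n - K)
Fstat = MST / MSE
pf(Fstat, K - 1, n - K, lower.tail = FALSE)      # p-value of the one-way ANOVA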

3.3 The ANOVA table

Source | df | SS | MS | F | p-value
Treatments | $K-1$ | $SST$ | $MST=\frac{SST}{K-1}$ | $\frac{MST}{MSE}$ | Pr(F* > F)
Error | $n-K$ | $SSE$ | $MSE=\frac{SSE}{n-K}$ | |
Total | $n-1$ | $TSS$ | | |

$$TSS = \sum_{k=1}^K\sum_{i=1}^{n_k}(Y_{ik} - \hat Y)^2 = (n-1)s^2$$

The R command is

model = aov(response~factor)
summary(model)

We can assess normality by looking at the deviations of each observation from its group's sample mean (these deviations are called residuals).

3.4 Comparing multiple means

ANOVA only detects whether at least one mean differs among the treatment groups; it doesn't tell us which ones differ.

If there are $K$ treatment means, then there are $K(K-1)/2$ pairwise comparisons

If $K = 2$ it's easy: if we have $n_1$ samples from group 1 and $n_2$ samples from group 2, and we assume:

  • Equal variances
  • Normality of the outcome variable

Then we can construct a $100(1-\alpha)\%$ confidence interval based on:
$$y_1 - y_2 \pm t_{\alpha/2,n_1+n_2-2}\sqrt{\frac{s^2_p}{n_1}+\frac{s^2_p}{n_2}}$$

We can use the above equation to compute a confidence interval for any 2 groups i,j.

We use MSE as the estimate of $s^2_p$ because the usual ANOVA assumption is that the variances within each treatment group are approximately equal. Thus the confidence interval is:
$$y_1-y_2\pm t_{\alpha/2,n-K}\sqrt{\frac{MSE}{n_i}+\frac{MSE}{n_j}}$$

Note: it now has $n - K$ degrees of freedom rather than $n_1+n_2-2$.

If we compute $100(1-\alpha)\%$ confidence intervals for each pair, we can calculate the probability that at least one interval does not contain the true difference.
The probability of at least one interval NOT containing the true difference is 1 minus the probability of ALL the intervals containing the true difference (treating the intervals as independent):

Thus it is $1-(1-\alpha)^{K(K-1)/2}$

We can control for multiple comparisons for various kinds of statistical models:

  • The Bonferroni method
  • Tukey's Honest Significant Differences (HSD) method
  • Scheffe's method for contrasts

The Bonferroni method is the most conservative and the most general.

Recall: Pr(at least one wrong interval) = $1-(1-\alpha)^{K(K-1)/2}$

It can be shown that $1-(1-\alpha)^{K(K-1)/2} \approx \alpha K(K-1)/2$ for small $\alpha$. Thus we should divide $\alpha$ by $K(K-1)/2$ to get an experimentwise error rate of $\alpha$.

Bonferroni can be overly conservative because our intervals are not independent: it gives us MORE simultaneous coverage than the nominal level suggests.

Tukey's HSD uses the fact that the groups are sampled independently to find the distribution of the ranked sample means. Tukey's HSD requires all groups to be the same size, and it is only useful for pairwise comparisons.

Scheffe's method applied to ANOVA is a more general method that allows us to calculate confidence intervals for linear combinations of population means.
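
In R, Tukey's HSD and Bonferroni-adjusted pairwise t-tests are both built in (model, response, and factor are the same placeholder names used in the other R snippets in these notes):

model = aov(response ~ factor)
TukeyHSD(model, conf.level = 0.95)                                # HSD intervals for every pair of levels
pairwise.t.test(response, factor, p.adjust.method = "bonferroni") # Bonferroni-adjusted pairwise p-values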

3.5 Randomized block designs and two-way ANOVA

For one-way ANOVA, the model was:
$$Y_{ij} = \mu + \tau_i + \epsilon_{ij}$$
$\epsilon_{ij}$ are uncorrelated (independent), and $\mu + \tau_i = \mu_i$

We impose the restriction that the $\tau_i$ sum to 0, so each $\tau_i$ represents a treatment-level contrast about the overall mean of the data.
That means $H_0 : \mu_1 = \cdots = \mu_K$ is equivalent to $H_0 : \tau_1 = \cdots = \tau_K = 0$. Thus $H_a$ is that $\tau_i \neq 0$ for at least one $i$.

We assume $\epsilon_{ij}$ are normally distributed.

Block
a group of experimental units that, together, receives all levels of treatment exactly once
Randomized block design
when the treatments are randomly assigned to the units in each block

If B is the number of blocks and K is the number of treatments, then we have $B \times K$ total observations. We want experimental units within the same block to be as homogeneous as possible. We also want the order of the treatments to be randomly assigned to each subject.
If there are 5 treatments, ideally we want 5! blocks for all possible orderings, if we can't get that many, we can use a Latin Square design where each treatment appears one time for each order, thus reducing it to 5 blocks.

The real goal is to choose blocks which are as similar as possible, so any differences between experimental units within a block are due only to the treatments.

Suppose we have B blocks and K treatments
$$Y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij}$$
$Y_{ij}$ is the observation on treatment i in block j,

$\mu$ is the overall mean

$\tau_i$ is the non random effect of treatment i with the sum of all $\tau$ = 0

$\beta_j$ is the non random effect of block j where the sum is 0

$\epsilon_{ij}$ is the random error term where it's distributed under $N(0,\sigma^2)$.

We have two factors now, the factor of interest with K treatments, and the blocking factor with B levels.

If all the $\tau_i$ and $\beta_j$ are treated as fixed constants, this is a fixed block effect model.

If the block effects are instead random variables (uncorrelated with $\epsilon_{ij}$), the model is called a random block effect model.

We decompose the total sum of squares into components.

We now have a sums of squares due to differences in means between treatments and differences in means between blocks. Thus:
$$TSS = \sum_{i=1}^K\sum_{j=1}^B(Y_{ij}-Y)^2$$
$$SST = B\sum_{i=1}^K(Y_{i-} - Y)^2$$
$$SSB = K\sum_{j=1}^B(Y_{-j}-Y)^2$$
$$SSE = TSS - SSB - SST = \sum_{i=1}^{K}\sum_{j=1}^{B}(Y_{ij}-Y_{i-}-Y_{-j}+Y)^2$$
where $Y_{i-}$ is the mean for the $i$-th treatment averaged over blocks, $Y_{-j}$ is the mean for the $j$-th block averaged over all treatments, and $Y$ is the overall mean of the $n = B \times K$ observations

3.5.1 The Two-Way ANOVA table

Source | df | SS | MS | F
Treatments | $K-1$ | $SST$ | $MST=\frac{SST}{K-1}$ | $\frac{MST}{MSE}$
Blocks | $B-1$ | $SSB$ | $MSB=\frac{SSB}{B-1}$ | $\frac{MSB}{MSE}$
Error | $n-K-B+1$ | $SSE$ | $MSE=\frac{SSE}{n-K-B+1}$ |
Total | $n-1$ | $TSS$ |

One can perform F-tests for both $H_0 : \tau_i = 0$ and $H_0 : \beta_j = 0$ using the two F-statistics. The degrees of freedom must add up to $n-1$.

You can do this in R:

model = aov(response~factor+block)
summary(model)

3.6 Factorial experiments and two-way ANOVA

Suppose we have two factor variables of interest. If the association of one factor (A) with the response depends on the level of a second factor (B), we say that A interacts with B (or B interacts with A).

In order to test the interaction of two factor variables and their association with the response variable, we need to use a complete factorial experiment

Complete factorial experiment
a factorial experiment in which every factor-level combination is used

If there are two factors, A with K levels and B with J levels, then we need observations at each of the $K \times J$ combinations of the different levels.

We assume that the factorial is balanced: the same number of experimental units (R) for each of the $K \times J$ factor combinations.

We will assume there are n = $R \times K \times J$ total experimental units.

The statistical model will be:
$$Y_{ijr} = \mu + \alpha_i + \beta_j + \tau_{ij} + \epsilon_{ijr}$$
where
$Y_{ijr}$ is the r-th observation at level i of factor A and level j of factor B

$\mu$ is the overall mean

$\alpha_i$ is the non random main effect of factor A at level i, with sum of zero

$\beta_j$ is the non random main effect of factor B at level j, with sum of zero

$\tau_{ij}$ is the non random interaction effect of levels i and j and the sum of zero

$\epsilon_{ijr}$ is the random error term, distributed as $N(0,\sigma^2)$

If two factors interact, it means the association of each factor with the outcome depends on the other factor.
As long as $\tau_{ij} \neq 0$ for any i or j, there is an interaction.

Once there is an interaction, the main effects can't be interpreted on their own anymore, since the association always depends on the other factor.

We can think of each combination of factor levels as a different group. Within each treatment group the observations share a common mean, so within each group we can estimate $\sigma^2$ by:
$$s_{ij}^2 = \frac{1}{R-1}\sum_{r=1}^{R}(Y_{ijr} - \bar Y_{ij-})^2$$
$\bar Y_{ij-}$ is the mean of the i,j-th treatment group. But we assume the variance is the same in each treatment, so a pooled estimate would be better.

$$SSE = \sum_{i=1}^{K}\sum_{j=1}^{J}\sum_{r=1}^{R}(Y_{ijr} - Y_{ij-})^2$$
$$MSE = \frac{SSE}{n-KJ}$$

$$SST = R\sum_{i=1}^{K}\sum_{j=1}^{J}(Y_{ij-}-Y)^2$$
$$MST = \frac{SST}{JK-1}$$

We can do this in R:

model = aov(response~factor1:factor2)
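# factor1:factor2 treats each combination of the two factors as a single treatment group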
summary(model)

If the result is statistically significant, we need to decompose the treatment sums of squares into their individual pieces to figure out what the differences are caused by (either of the two factors or their interaction).

We can take the sums of squares due to treatment(SST) and decompose these into three sources: corresponding to $\alpha_i$, $\beta_j$, or $\tau_{ij}$

$$SS(A) = RJ\sum_{i=1}^{K}(\bar y_{i--} - \bar y_{---})^2$$
$$MS(A) = \frac{SS(A)}{K-1}$$
$$SS(B) = RK\sum_{j=1}^{J}(\bar y_{-j-} - \bar y_{---})^2$$
$$MS(B) = \frac{SS(B)}{J-1}$$
$$SS(AB) = R\sum_{j=1}^J\sum_{i=1}^{K}(\bar y_{ij-} - \bar y_{i--} - \bar y_{-j-} + \bar y_{---})^2$$
$$MS(AB) = \frac{SS(AB)}{KJ-J-K+1}$$

The first test is always the test for interaction; the separate tests for A and B are meaningless if an interaction exists.

The null hypothesis of no interaction is $H_0(AB) : \tau_{ij} = 0$ for all $i,j$; under $H_0(AB)$:
$$F_{AB} = \frac{MS(AB)}{MSE}$$
with $KJ-J-K+1$ and $n-JK$ degrees of freedom.

If we reject $H_0$ for the interaction, then we're done; if we don't, we go on to test both factors for association with the response.
$$F_A = \frac{MS(A)}{MSE}$$
with $K-1$ and $n-JK$ degrees of freedom, where K is the number of levels of factor A; the test for B is the same with J in place of K.
We may reject $H_0$ for both factors, for only one of them, or for neither.

We can organize this into another ANOVA table:

Source | df | SS | MS | F
$A$ | $K-1$ | $SS(A)$ | $MS(A)=\frac{SS(A)}{K-1}$ | $\frac{MS(A)}{MSE}$
$B$ | $J-1$ | $SS(B)$ | $MS(B)=\frac{SS(B)}{J-1}$ | $\frac{MS(B)}{MSE}$
$A \times B$ | $KJ - K - J + 1$ | $SS(AB)$ | $MS(AB)=\frac{SS(AB)}{KJ-K-J+1}$ | $\frac{MS(AB)}{MSE}$
Error | $n-KJ$ | $SSE$ | $MSE = \frac{SSE}{n-KJ}$ |
Total | $n-1$ | $TSS$ |

To break it down in R, we use:

model = aov(response~factor1*factor2)
summary(model)

4 Regression

Regression
modelling the expected (average) value of an outcome or response variable

Purposes of regression:

  • Prediction
  • Covariate selection, determine which variables are associated with an outcome
  • Model specification, choosing a most appropriate model for the relationship between the outcome and the covariate
  • Parameter estimation, estimating the parameters from a hypothesized model

4.1 Simple linear regression model

Simple linear regression model assumes
$$Y_i = \beta_0 + \beta_1x_i + \epsilon_i$$
where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon_i$ is the model error.

We assume:

  • Error in the measurement of $x_i$ is negligible ($x$ is not a random variable)
  • $\epsilon_i$ are independent random variables with mean=0 and variance of $\sigma^2$
  • $Y_i$ are random variables such that $E(Y_i|X_i = x_i) = \beta_0 + \beta_1x_i$ and $Var(Y_i|X_i = x_i)=\sigma^2$

This means $\beta_1$ is the change in the mean of $Y_i$ per unit change in $x_i$, $\beta_0$ is the mean of an observation with the covariates being zero. This is just a rough measurement of the overall level of the response.

We want to estimate $\beta_0$ and $\beta_1$ such that the estimate $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i$ is close to the observed $y_i$, thus we want $y_i - \hat{y}_i$ to be as small as possible.

We want to minimize this equation:
$$\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^n(y_i-[\hat{\beta}_0 + \hat{\beta}_1x_i])^2$$
With the power of math unknown to us plebs, it can be shown that:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n(y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^n(x_i-\bar{x})^2} = \frac{SS_{XY}}{SS_{XX}}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

These values of $\hat{\beta}_1$ and $\hat{\beta}_0$ give us the best predictions. To calculate them in R:

model = lm(response~factor)
coef(model)

If our model is correct, then $E(\hat{\beta}_1) = \beta_1$ and same with $\beta_0$

Through fancy math, it turns out that the standard deviation(error) of $\hat{\beta}_1$ is:
$$\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{SS_{XX}}}$$
and the standard error of $\hat{\beta}_0$ is equal to
$$\sigma_{\hat{\beta}_0} = \sqrt{\sigma^2(\frac{1}{n} + \frac{\bar{x}^2}{SS_{XX}})}$$
Not many fucks will be given about $\beta_0$.

We still need to estimate the residual variance, $\sigma^2$. We estimate it with:
$$s^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2} = \frac{SSE}{n-2}$$
$$SSE = SS_{YY} - \hat{\beta}_1SS_{XY}$$

We need the errors to be normally distributed (or a large enough sample for the CLT to kick in), and the data to be independent. The residuals also need to have mean zero and constant variance.

4.2 Hypothesis Testing

The hypothesis of interest is that the covariate is linearly associated with the response variable.

We say there is a linear association between them when $\beta_1 \neq 0$, thus $H_0 : \beta_1 = 0$

Since $\hat{\beta}_1$ is distributed as $N(\beta_1, \sigma^2_{\hat{\beta}_1})$, where $\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{SS_{XX}}}$, we can test $H_0$ by testing whether its true mean is 0.
Therefore we can use the T-statistic:
$$T = \frac{\hat{\beta}_1 - 0}{\sigma/\sqrt{SS_{XX}}}$$
we use $s$ to estimate $\sigma$, and the statistic follows a t-distribution with $n - 2$ degrees of freedom.

To construct a confidence interval for $\hat{\beta}_1$, we would simply use:
$$\hat{\beta}_1 \pm t_{n-2,\alpha/2}\frac{s}{\sqrt{SS_{XX}}}$$
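
The same test and interval can be read straight off the fitted model in R (response and factor are the usual placeholder names):

model = lm(response ~ factor)
summary(model)$coefficients    # estimates, standard errors, t-statistics, p-values
confint(model, level = 0.95)   # t-based confidence intervals for beta_0 and beta_1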

4.3 Correlation

correlation
a measure of linear association that is symmetric (the correlation between X and Y is the same as between Y and X). It differs from regression, which is not symmetric

Through mathemagics, we derive an equation for r:
$$r = \frac{SS_{XY}}{\sqrt{SS_{XX}SS_{YY}}}$$

where
$$SS_{XY} = \sum_{i=1}^n(y_i - \bar{y})(x_i - \bar{x})$$
$$SS_{XX} = \sum_{i=1}^n(x_i - \bar{x})^2$$
$$SS_{YY} = \sum_{i=1}^n(y_i - \bar{y})^2$$
r must be between -1 and +1, and it is scale-free, whereas $\beta_1$ depends on the scales of both x and y.

r is an estimator for the population correlation, $\rho$. The test for no association between two random variables would be $H_0 : \rho = 0$, and this is equivalent to $\beta_1 = 0$.
However, distribution for r under $H_0$ is a bit more difficult to work with.

To calculate r in R(lol):

cor(response,factor)

Through algebraic manipulation:
$$r = \hat{\beta}_1\frac{s_x}{s_y}$$
where $s_x$ and $s_y$ are the sample std dev for x and y.

If SSE is the residual sum of squares ($\sum_{i=1}^n(y_i - \hat{y}_i)^2$), we can derive:
$$r^2 = 1 - \frac{SSE}{SS_{YY}}$$
$r^2$ is often written as $R^2$, and it is the proportion of the variance of Y explained by X. To calculate $R^2$ in R:

model = lm(response~factor)
summary(model)

To construct a confidence interval for the population correlation coefficient ($\rho$), we can use:
$$r \pm t_{\alpha/2,n-2}\sqrt{(1-r^2)/(n-2)}$$
However, the above interval can be unreliable in small samples, in which case we use Fisher's z transformation:
We first transform r to:
$$Z = \frac{1}{2}\ln{\frac{1+r}{1-r}}$$
Then we can build a confidence interval using Z of the form:
$$Z \pm z_{\alpha/2}/\sqrt{(n-3)} = (c_L, c_U)$$
Then we reverse the transformation and get the confidence interval for $\rho$:
$$\left[\frac{\exp(2c_L) - 1}{\exp(2c_L) + 1},\frac{\exp(2c_U) - 1}{\exp(2c_U) + 1}\right]$$

We can do this in R:

library(psych)
r.con(rValue, n, p=0.95)   # p is the confidence level
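
Equivalently, a manual sketch with R's built-in atanh/tanh (assuming r and n are already in hand; atanh(r) is exactly $\frac{1}{2}\ln\frac{1+r}{1-r}$):

z  = atanh(r)                                   # Fisher's z transform
ci = z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)  # CI on the z scale
tanh(ci)                                        # back-transform to the correlation scale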

4.4 Two kinds of prediction

We distinguish between the two types of predicted values:

  • Estimated mean of the response for a particular $x_0$ value
  • Predicted value of the response for a particular $x_0$ value

For any $\hat{\beta}_0$ and $\hat{\beta}_1$ and $x_0$ we have $\hat{y}(x_0) = \hat{\beta}_0 + \hat{\beta}_1x_0$

Then $E(\hat{y}(x_0))$ is obviously $\beta_0 + \beta_1x_0$, and using math beyond us mere mortals, we can show:
$$Var(\hat{y}(x_0)) = Var(\bar{y} + \hat{\beta}_1(x_0 - \bar{x})) = \sigma^2(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{XX}})$$
$$s_{\hat y(x_0)} = s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{SS_{XX}}}$$
This gives point-wise confidence bands for the mean of the response, not confidence bands for actual response values.

We substitute $s^2$ for $\sigma^2$ to get the estimated standard error, and since $s^2$ is estimated, we must use t-quantiles with $n-2$ degrees of freedom rather than normal quantiles for confidence intervals.

If we want actual prediction intervals, rather than a confidence interval for the mean, we need to take into account additional variability.
The variance of deviation from $\hat{y}(x_0)$ is:
$$Var(y_0 - \hat{y}(x_0)) = \sigma^2 + \sigma^2(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{XX}})$$
We can estimate the prediction error with
$$S_{\tilde{y}(x_0)} = s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{XX}}}$$
This is just the previous variance with an extra $\sigma^2$ for a single new observation, with $s$ plugged in for $\sigma$. We again use a t-distribution.

To find a 90% confidence interval for the mean for a given covariate value in R we can do:

model = lm(response~factor)
predict(model, newdata=data.frame(factor=n), se.fit=T, interval=c("confidence"), level=0.90)

To find a 90% confidence interval for the predicted response for a given covariate value in R:

model = lm(response~factor)
predict(model, newdata=data.frame(factor=n), se.fit=T, interval=c("prediction"), level=0.90)

5 Multiple Regression

Sometimes in life, we have a desire to find the association between a response variable and multiple covariates. Having more than one covariate could improve our prediction.

  • Parameter estimation - we have more parameters, how do we estimate and interpret them?
  • Hypothesis testing - how do our hypothesis tests change?
  • Diagnosis of residuals - think about our assumptions

Issues in multiple linear regression:

  • Which covariates should we use?
  • Can we test hypothesis about multiple parameters at once?
  • Multi-collinearity - what if our covariates are associated with each other?
  • Qualitative covariates
  • Interactions - association of one covariate with response depends on the values of other covariates

Covariates still have linear association with the response variable, the model will be:
$$Y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_Kx_{iK} + \epsilon_i$$
We will assume that the response variable is a non random linear function + random error.

5.1 What is linear?

A linear model is a model that is linear in the parameters, not necessarily in the covariates. => all $\beta_i$ are linear, and no term is a product of more than one $\beta$.

5.2 Interpreting the multiple linear regression model

Basic assumptions of the model:

  • $E(\epsilon_i) = 0$ for all i
  • $Var(\epsilon_i) = \sigma^2$ (the variance of the response). The error variance doesn't depend on any covariate $X_j$ and is the same for all observations $i$
  • Model errors are independently distributed
  • $\epsilon_i$ are normally distributed. For large samples this only needs to be approximately true

$\beta_j$ is the increase in the mean of $Y_i$ observed for a unit increase in $x_{ij}$, holding all other values for $x_{ik},\:k\neq j$ constant
$\beta_j$ is the association of $X_j$ with $Y$ while accounting for the associations of all the other covariates with Y

The model for multiple regression, in matrix form, is:
$$y = X\beta + \epsilon$$
$$y = (Y_1,...,Y_n)$$
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1K} \\ 1 & x_{21} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nK} \\\end{pmatrix}$$
$$\beta = (\beta_0,\beta_1, ... , \beta_K)$$
$$\epsilon = (\epsilon_1, ... \epsilon_n)$$

X is called the model matrix/design matrix.

We're still trying to find parameter estimates for $\beta$ that minimize the distance between the predicted and observed values, thus we're trying to minimize:
$$\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
If we just biject it to a thrice-differentiable monoid whose elements are clopen generating functions, we can see that the vector of coefficients that minimizes the above equation is equal to:
$$\hat{\beta} = (X^tX)^{-1}X^ty$$
Where $\hat{\beta}$ is a vector containing the parameter estimates. It is much easier if we use the computer:

model = lm(response~factor1+factor2+factor3)
coef(model)

The estimate we obtain for each factor will be different than if we were to simply do:

model = lm(response~factor1)
coef(model)

Because we are not adjusting for other covariates in the above case.
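
For the curious, the matrix formula can be checked against lm directly; a sketch with hypothetical y, x1, and x2:

X = model.matrix(~ x1 + x2)                  # design matrix with a leading column of 1s
betahat = solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
cbind(betahat, coef(lm(y ~ x1 + x2)))        # the two columns should match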

For each coefficient, we have $E(\hat{\beta}_j) = \beta_j$

The variances of each $\hat{\beta}_j$ are difficult to express individually, so we write them in a matrix:
$$Var(\hat{\beta}) = \sigma^2(X^tX)^{-1}$$
where $\sigma$ is a constant. The variances of the individual $\hat{\beta}_j$ are the diagonal elements of this matrix.

In general you don't really need to give a fuck about this because we use the computer to do the math.

However, we still need to estimate $\sigma^2$; we can use the residual sum of squares in the numerator:
$$s^2 = \frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n-(K+1)}$$
Maybe it's magic, maybe it's Maybelline, but really the denominator is $n-(K+1)$ because we estimate $K+1$ parameters (the intercept plus $K$ slopes). For the simple linear regression case, $K = 1$, so we divided by $n - 2$.

5.3 Single Hypothesis tests

Our hypothesis tests will be the form:
$$ H_0 : \beta_j = 0$$
$$ H_a : \beta_j \neq 0$$
If we failed to reject $H_0$ for some j, it doesn't mean the covariate $X_j$ is not associated with the response, it just means it is not associated with the response after adjusting for the association of the other covariates with the response.

More generally we can test:
$$ H_0 : \beta_j = \beta^*_j$$
$$ H_a : \beta_j \neq \beta^*_j$$
The general form of the T-statistic is:
$$t = \frac{\hat{\beta}_j - \beta^*_j}{s\sqrt{c_{jj}}}$$
$c_{jj}$ is the $j$-th diagonal element of $(X^tX)^{-1}$, so $s\sqrt{c_{jj}}$ is the estimated standard error of $\hat{\beta}_j$.

To do all of this in R:

model = lm(response~factor1+factor2+factor3)
summary(model)

If we use a Type I error rate of $\alpha = 0.05$ for each covariate, to have an overall Type I error rate of 0.05, we need to correct the $\alpha$ values used, like using Bonferroni.

5.4 Measuring the fit of the model

Same idea as simple linear regression: we use the percentage of variance in the response explained by the model.
We use residual sums of squares for error:
$$ SSE = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$
and we compare this to the total sum of squares for the response:
$$ SS_{yy} = \sum_{i=1}^n(y_i - \bar{y})^2$$
Then we can calculate $R^2$:
$$R^2 = 1 - \frac{SSE}{SS_{yy}}$$

How do we choose which covariates to include in the model? If we compare two models that are nested(one model contains all the covariates of the other and more), the larger model will have an $R^2$ value which is at least as big as the other model's.

We will never obtain a worse fitting model than the one we have by adding additional covariates, thus by this logic, we would include all the covariates!

We can use an adjusted coefficient of determination(adjusted $R^2$)
$$R^2_a = 1 - [\frac{n-1}{n-(K+1)}](\frac{SSE}{SS_{yy}})$$
The more covariates there are, the higher K is, and it lowers $R^2_a$ if the fit is not improved.
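
Both quantities are available from summary(); a quick sketch with the same placeholder model:

fit = lm(response ~ factor1 + factor2 + factor3)
summary(fit)$r.squared        # R^2
summary(fit)$adj.r.squared    # adjusted R^2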

To remove a covariate in R:

model = lm(response~factor1+factor2+factor3)
updatedModel = update(model,.~.-factor1)

5.5 Testing the global fit

We test for a global association of our set of covariates to the response using an F-test:
$$F = \frac{(SS_{YY} - SSE)/K}{SSE/[n-(K+1)]} = \frac{R^2/K}{(1-R^2)/(n-(K+1))} = \frac{MSR}{MSE}$$
This is distributed with K and n - (K+1) degrees of freedom. It's possible to reject $H_0$ without rejecting a single null hypothesis of the form $H_0:\beta_j = 0$; in that case, covariates go in, response comes out, we can't explain that.
But if we do not reject this $H_0$, then none of the covariates will be able to reject their $H_0$.
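
summary() reports this overall F-statistic; a sketch of pulling it out explicitly (same placeholders):

s = summary(lm(response ~ factor1 + factor2 + factor3))
s$fstatistic                                   # F value, numerator df, denominator df
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)   # p-value of the global test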

5.6 Prediction and confidence intervals for future values

The fitted value for a particular set of covariates will equal to:
$$\hat{y}_0 = \hat{\beta}_0 +x_{10}\hat{\beta}_1 + ... + x_{k0}\hat{\beta}_k$$
$\hat{y}_0$ could be used to estimate:

  • The mean of observations at a particular set of covariate values
  • The future observation at a particular set of covariate values

We're too stupid to understand the formulae for the standard errors for these, so we just use R for everything here.

To obtain an interval for the mean of a future observation:

predict(model,newdata=data.frame(factor1=a,factor2=b,factor3=c),interval="confidence",se.fit=T)

And an interval for a single future observation:

predict(model,newdata=data.frame(factor1=a,factor2=b,factor3=c),interval="prediction",se.fit=T)

We need to be careful about extrapolating because as soon as one factor value is outside of what we have observed, we're at a risk of making a critical error.

5.7 Interactions

The concept of interactions here will be similar to that from ANOVA, they are symmetric.

If $X_1$ and $X_2$ are our covariates of interest, the regression model including the interaction will be:
$$Y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i1}x_{i2}\beta_3 + \epsilon_i$$
We refer to $\beta_3$ as the coefficient for the interaction between $X_1$ and $X_2$.
Similar to ANOVA, if we believe there is an interaction between the covariates, we want to test for its presence first; if we reject the null hypothesis of no interaction, then we CANNOT interpret the coefficients for the main effects on their own.
If we conclude $\beta_3 \neq 0$, then we cannot interpret $\beta_1$ as the change in the mean of Y for a single unit change in $X_1$, because that change now depends on $\beta_3$ and the value of $X_2$.

If we want to include the interaction term in R, we do:

model = lm(response~factor1*factor2)
summary(model)

5.8 Polynomial Regression

There are also times in life when a linear regression just isn't enough. In those times, we can adjust our model to take into account non-linearity.
We can add a quadratic covariate in X to the model:
$$Y_i = \beta_0 + \beta_1x_i+\beta_2x^2_i + \epsilon_i$$
This is still the same as before, and we still use our usual least squares approach to obtain $\hat{\beta}$

The hypothesis is:
$$H_0 : \beta_2 = 0$$

To do this in R:

model = lm(response~factor+I(factor^2))
summary(model)

We can keep trying to improve the fit by adding higher order polynomial terms.

5.9 Using qualitative variables in regression models

Sometimes we have binary (Yes/No) variables in our model. The key problem is that a qualitative variable doesn't take numerical values (a Y/N answer is not a real number), so it can't enter the regression directly.
To get around this, we let $Z=0$ if No and $Z=1$ if Yes. We can then use the indicator variable Z in our regression:
$$Y_i = \beta_0 + \beta_1Z_i + \epsilon_i$$
Therefore the coefficient $\beta_1$ measures the difference in mean between the two options. We can also measure the difference by taking the difference of the means of the two groups, and they're the same!

It turns out, $\hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0$.
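
A quick sketch verifying this with made-up data (z is a hypothetical 0/1 indicator, y a hypothetical response):

z = rep(c(0, 1), each = 5)                                 # hypothetical group indicator
y = c(3.1, 2.8, 3.5, 3.0, 3.2, 4.0, 4.4, 3.9, 4.2, 4.1)    # hypothetical response
coef(lm(y ~ z))["z"]                                       # beta_1 hat
mean(y[z == 1]) - mean(y[z == 0])                          # the same number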

What if we have more than two groups? We have to use multiple Z's. Suppose we now have 3 groups, the model would be:
$$Y_i = \beta_0 + \beta_1Z_1 + \beta_2Z_2 + \epsilon_i$$
$\beta_0$ would be the mean of the first group, $\beta_1$ would be the difference between the first and second group, and $\beta_2$ would be the difference between the first and third group.
The difference in means between the second and third group would be $\beta_2 - \beta_1$
If $\beta_1 = \beta_2 = 0$, it means that all three groups have the same mean.

A one-way ANOVA also tests for a difference among 3 group means; here we would conduct an F-test to test for any difference among the three groups.
We would use our overall F-test for the regression model, and since the overall F-test tests whether ALL regression coefficients are equal to zero or not, it tests whether the groups are all the same or not.
For two groups we can use either a t-test on the coefficient or the F-test; for 3 or more groups we MUST use the F-test. The overall F-test is equivalent to the ANOVA.

If the design is balanced, we can use the anova command to test for the significance of factors, similar to one-way ANOVA:

model = lm(response~factor)
anova(model)

We can also look for interactions between the two qualitative variables, and it's the same thing as interactions between two quantitative variables.

5.10 Using qualitative and quantitative variables together in perfect harmony

We interpret these coefficients in the same way we interpret regression coefficients from a usual multiple linear regression model. We can even look for interactions between qualitative and quantitative variables.

5.11 Comparing Nested Models

How do we compare nested models which have different covariates?
That is, our null hypothesis is: $H_0 : \beta_j = 0$ for some j, which means that covariate gives no statistically significant improvements to the model.

Suppose that a model $M_0$ is nested inside of another model $M_1$. We know that the SSE of $M_1$ is at least as small as $M_0$. This tells us that the improvement in the SSE by using the larger model is $SSE_{M_0} - SSE_{M_1}$.
If that term is large, it means adding the terms into $M_1$ explains a significantly larger amount of variance, and vice versa.

We can build a test statistic around this difference.

Suppose:
$$M_0:\beta_0 + \beta_1x_1 + ... + \beta_gx_g$$
$$M_1:\beta_0 + \beta_1x_1 + ... + \beta_gx_g + \beta_{g+1}x_{g+1} + ... + \beta_kx_k$$

Then the null hypothesis would be $M_1$ does not improve over the fit of $M_0$:

$$H_0 :\beta_{g+1} = \beta_{g+2} = ... = \beta_k = 0$$

This is a simultaneous test of the parameters of the larger model.

The F statistic is:
$$F = \frac{(SSE_{M_0} - SSE_{M_1})/(k-g)}{SSE_{M_1}/(n-(k+1))}$$
and it has $k-g$ and $n-(k+1)$ degrees of freedom.
If $H_0$ is true, we expect that the reduction in SSE is approximately equal to $(k-g)\sigma^2$

To do this in R, we can simply pass two models to anova:

model1 = lm(response~factor)
model2 = lm(response~factor*factor1)
anova(model1,model2)

We can use the anova function for testing nested models with unbalanced designs. We couldn't use it on a single unbalanced ANOVA model.

5.12 Stepwise regression

Stepwise regression via hypothesis testing is when you compare nested models with an F-test, then remove or add covariates based on the test. Backward stepwise variable selection starts with all the possible covariates and removes them one by one, until every remaining term has a p-value below $\alpha$, or until all the terms have been deleted.
We can also do forward stepwise regression, by starting with the simplest model containing only one factor and adding factors until no candidate term meets the significance level. This will sometimes give different results than the backward approach. Generally forward selection gives a model with fewer covariates, because if a term is not statistically significant, even if its interaction term would be, we cannot add it in.
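
R's step() function automates this kind of search; note it adds and drops terms by AIC rather than by F-tests, so its choices may differ from the hand-run procedure described above (placeholder names again):

full = lm(response ~ factor1 * factor2 + factor3)
step(full, direction = "backward")                                      # backward elimination by AIC
step(lm(response ~ 1), scope = formula(full), direction = "forward")    # forward selection by AIC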

5.13 Residual analysis

Our regression model contains these assumptions:

  • $E(\epsilon_i) = 0$
  • $\epsilon_i$ are all independent
  • $Var(\epsilon_i) = \sigma^2$
  • $\epsilon_i$ are all normally distributed

We need to check for these assumptions.
We can guess the value of $\epsilon_i$:
$$e_i = y_i - \hat{y}_i$$
Where $\hat{y}_i$ is the fitted value for the $i^{th}$ observation. $e_i$ is the sample residual/regression residual for the ith observation.
We can obtain the residuals in R by doing:

model.residuals = resid(model)

We can then check the normality of these residuals by using the boxplot, histogram, and the Q-Q normal plot.
It is better to standardize the residuals by dividing them by their standard error; this allows us to detect outliers more accurately:
$$e^{std}_i = \frac{e_i}{s}$$
Or in R:

library(MASS)   # stdres() comes from the MASS package
model.stdresiduals = stdres(model)

If we're using $\alpha = 0.05$, residuals can fall outside the $\pm1.96$ range even when they really are normally distributed; we would expect about 5% of them to.
We correct for multiple comparisons using Bonferroni: with $n$ observations in the dataset, we look for standardized residuals beyond the z-quantile for a two-sided level of $0.05/n$, i.e. beyond $\pm z_{1-(0.05/n)/2}$.
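
A sketch of that Bonferroni-style screen in R (stdres() comes from the MASS package; model is the fitted lm/aov object):

library(MASS)
n = length(resid(model))
cutoff = qnorm(1 - (0.05 / n) / 2)          # Bonferroni-adjusted two-sided z cutoff
which(abs(stdres(model)) > cutoff)          # observations flagged as potential outliers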

We can then decide to exclude those data points if they are special cases, but if the entire sample is randomly selected and the case isn't special, then we should NOT exclude it.

We still need to test our assumptions about constant variance and zero mean. To examine those, we should plot our residuals against our fitted values. We can do this in R:

model.stdresiduals = stdres(model)
model.fitted = fitted(model)
plot(model.fitted, model.stdresiduals,xlab="Fitted values",ylab="Standardized residuals")

If we see any special pattern, it usually means those assumptions fail.
[Figure: crescent-shaped residual plot]

This means that we're overestimating at the lower and upper ends and underestimating in the middle of the range, which violates the assumption that the residuals have mean zero across the whole range. To fix this we might add polynomial terms, or add more covariates/interaction terms.

Sometimes we might also see a problem with the constant variance assumption:

[Figure: fan-shaped residual plot, variance increasing with the fitted values]

It clearly shows a much larger variance for larger fitted values; this is often a feature of the data itself, where large values tend to be more variable.

Another shape we might see is the "football" pattern:

[Figure: football-shaped residual plot]

This happens when there are extreme values at each end of the dataset, and they pull the fitted line towards them.

5.14 Other issues in multiple regression

The covariates must be sufficiently different from each other in order to see their associations with the response.

If two covariates are highly correlated, it's difficult to tell what the association of each covariate with the response is. When two covariates in a multiple linear regression analysis are highly correlated with each other, we say there is a problem of multicollinearity.
When two variables are highly correlated, they compete for explanatory power and inflate each other's standard errors. When two covariates are perfectly collinear (r=1), the multiple regression breaks down completely.

We also assumed that our errors are independent, but sometimes that's not the case. For example, data points are often collected at different points in time; if we don't include time as a variable, it shows up in the error term, and the errors may end up correlated with each other.

6 Categorical data

6.1 The multinomial distribution

Suppose we have a random variable that can take on k possible values, and a defined probability for each value. Then suppose there are $n$ independent/identically distributed random variables with that qualitative distribution.
Let $(X_1, ..., X_k)$ be the number of observations that fall in each category.

The set of random variables $(X_1, ..., X_k)$ is called a multinomial random variable, and it has the joint probability mass function:

$$Pr(X_1 = x_1, X_2 = x_2, \ldots, X_k=x_k) = \frac{n!}{x_1!x_2!\cdots x_k!}p^{x_1}_1p^{x_2}_2\cdots p^{x_k}_k$$

Also,
$$E(X_j) = np_j$$
$$Var(X_j) = np_j(1-p_j)$$
$$Cov(X_s,X_t) = -np_sp_t$$
($Cov(X_s,X_t)$ is the covariance between two of the counts; it is negative because the counts must sum to $n$.)

6.2 Hypotheses

We want to test that $p_j = p_j^{(0)}$ for every category $j$.

We will use the Chi-Square test, which tests the deviation from what would be expected under the $H_0$.

If we have a multinomial random variable such that $E(X_j) = np_j$, and we observed a count of $X_j = n_j$ for each category, then the deviation of our observed value would be
$$n_j - np_j$$
We can perform this for all $k$ categories.
Our hypotheses are:
$$H_0 : p_j = p_j^{(0)}$$
$$H_a : p_j \neq p_j^{(0)}$$

Our test statistic is:
$$\chi^2 = \sum_{j=1}^k\frac{(n_j - np_j^{(0)})^2}{np_j^{(0)}}$$
or
$$\chi^2 = \sum_{j=1}^k\frac{(observed - expected)^2}{expected}$$
The degrees of freedom is the difference between k - 1 and the number of unspecified probabilities in $H_0$.
For example:
If our $H_0$ were $p_1 = 0.25, p_2 = 0.10$ and there are 4 categories, there is only 1 unspecified probability, because once $p_3$ is specified, $p_4$ is determined by $1-p_1-p_2-p_3$. Thus the df for the chi-square is $3 - 1 = 2$.
Then we can just look up the chi-square distribution and test our hypothesis.
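
When $H_0$ specifies all of the category probabilities, chisq.test does this directly; the counts and probabilities below are made up. (With only some probabilities specified, as in the example above, the statistic has to be computed by hand and the degrees of freedom adjusted.)

observed = c(18, 55, 45, 82)                          # hypothetical counts for 4 categories
chisq.test(observed, p = c(0.10, 0.30, 0.20, 0.40))   # goodness-of-fit test against fully specified p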

6.3 Contingency tables

We now want to analyze bivariate qualitative data.
Suppose for each subject we have 2 random variables, $X_1$ and $X_2$. $X_1$ can take on values $1, 2, ..., r$ and $X_2$ can take on $1, 2, ..., c$; that means for each subject we observe $X_{i1} = j$ and $X_{i2} = k$.
We can represent an estimate of the bivariate distribution of $X_1$ and $X_2$ in the form of a contingency table.

A d-way contingency table is a table listing all the possible combinations of $X_1, ..., X_d$ in a table.
For example, a 2-way contingency table for $X_1$ with 2 levels and $X_2$ with 3 levels, we would have:

x1\x2 | 1 | 2 | 3 | Total
1 | $n_{11}$ | $n_{12}$ | $n_{13}$ | $n_{1-}$
2 | $n_{21}$ | $n_{22}$ | $n_{23}$ | $n_{2-}$
Total | $n_{-1}$ | $n_{-2}$ | $n_{-3}$ | $n$

We can also think of this as just a normal multinomial random variable where each $X_1X_2$ category is a level.

6.4 Independence

Under the null hypothesis of independence, $X_1$ and $X_2$ are independent, so $Pr(X_1 = j, X_2 = k) = Pr(X_1=j)Pr(X_2 = k)$.
This restricts the joint probabilities: $X_1$ has $(r-1)$ unrestricted probabilities and $X_2$ has $(c-1)$ unrestricted probabilities, so under independence there are $r + c - 2$ free parameters, rather than the $rc - 1$ of the totally unrestricted model.

If our null hypothesis was that $X_1$ is independent of $X_2$, then $E(n_{jk}) = n\hat{p}_{j-}\hat{p}_{-k}$, where $\hat{p}_{j-}$ is the probability for the $j^{th}$ row, and $\hat{p}_{-k}$ is the probability for the $k^{th}$ column.

The degrees of freedom would equal to $rc - 1 - (r+c-2) = (r-1)(c-1)$, because $rc-1$ is the df for the completely unrestricted model, and $r+c-2$ is the df for the model which $X_1$ is independent from $X_2$.

To do this in R:

chisq.test(data)
# to get the expected value
chisq.test(data)$expected
# to get the residuals
resid(chisq.test(data))

What if either the row or column totals are fixed?
Suppose the row totals are fixed and we want to test that the proportions for the columns are the same; then our null hypothesis is:
$$H_0 : p^*_1 = p^*_2 =p^*_3$$
where $p^*_j$ is the probability of column j. This is the same as a multinomial test: we just add up the rows within each column and treat the column totals as a multinomial random variable.

6.5 Issues with the chi-square test

If the expected count for any cell in the contingency table is too small (< 5), the distribution of the statistic might not be well approximated by the chi-square.
The observations must be independently and identically distributed.

7 Non-parametric statistics

Non-parametric statistics is hypothesis testing based on minimal assumptions, which are that the data are independently and identically distributed according to some cumulative distribution $F(x) = Pr(X \le x)$

If we have $X_1, ..., X_n$ independently and identically distributed according to a cdf F(x), and we want to test the general central tendency of F(X), we can use the population median, $\eta$.

Thus the median satisfies $Pr(X \le \eta) = F(\eta) = 0.50$.

We want to test if the population median is some value:

$$H_0 : \eta = \eta_0$$
$$H_a : \eta \neq \eta_0$$

We could use the sample median to try to test the hypothesis, but it turns out that the variance of the sample median depends on the value of the probability density function at the median, which is hard to get.
That is why we normally prefer the sample mean, whose behaviour depends only on the mean and variance of the population.

However, we can look at the differences between each data value and $\eta_0$ ($D_i = X_i - \eta_0$). We know that if $H_0$ is true, then $Pr(D_i \ge 0) = 0.50$
If $\eta < \eta_0$, then most $D_i$ would be negative and vice versa.
Therefore our hypothesis test is based on $W$ = the number of positive $D_i$.
$W$ is just a binomially distributed random variable with parameters $n$ and $p$.
Under $H_0$ we know that $p = 0.5$, and we must assume that no $D_i = 0$; shit goes bad if there are ties (in practice, zero differences are dropped).
Then we can just use a binomial test:

binom.test(positiveDiCount, n=total, alternative="greater")  #could be "less" or "two.sided"

7.1 Two sample shift model

We can extend this to paired data as well. We want to test if two samples have the same distribution.
We assume a shift model $Y_i = X_i + \theta$, where $X_i$ and $Y_i$ are the two members of each pair.
If the two samples are identically distributed, then $\theta = 0$

We can calculate the differences: $D_i = X_i - Y_i$, then we can just test if the median of $D_i$ is zero. That test would just be the same as before.
This is the same thing as testing that the median of $Y$ is the same as the median of $X$.

7.2 Wilcoxon Paired Rank Sum test for matched pairs

The problem with the sign test is that we reduce every data point to its sign and throw away its magnitude.
An alternative approach is to use the sample ranks of the absolute values of the differences: we rank all $|D_i| = |X_i - Y_i|$ from 1 to $n$.
We sum up the ranks of the positive $D_i$ ($T^+$) and the ranks of the negative $D_i$ ($T^-$); if Y is shifted to the right (bigger than X), then we expect $T^+$ to be small, and vice versa.
The rejection values for small sample sizes are tabulated, and the exact distribution of the test statistic is foreign and exotic, like mangos.

If the sample size is 25 or more, we can use a normal approximation:
$$E(T^+) = \frac{n(n+1)}{4}$$
$$Var(T^+) = \frac{n(n+1)(2n+1)}{24}$$

The expected value is just the sum of all the ranks from 1 to n divided by 2, which means that under $H_0$, $T^+$ should be about half of the total rank sum.
The test statistic is:

$$Z=\frac{T^+-E(T^+)}{\sqrt{Var(T^+)}}$$

We can do this in R:

wilcox.test(X1,Y1,paired=TRUE)

The biggest problem with the Wilcoxon test is ties: tied $|D_i|$ produce tied ranks, which affects the accuracy of the test.

7.3 Wilcoxon rank sum test for independent random samples

If we have two independent samples with different sizes, we use the Wilcoxon rank sum test.
We pool all the observations together and rank the two samples against each other, and compare the ranks of the first group relative to the second group.
We count the number of observations in the first sample that precede each observation in the second sample. Suppose X contains 3, 9, 11, and Y contains 1, 4, 5, 12:

Value | 1 | 3 | 4 | 5 | 9 | 11 | 12
Group | y | x | y | y | x | x | y
Count | 0 | - | 1 | 1 | - | - | 3

The Mann-Whitney U statistic is $U = 0 + 1 + 1 + 3 = 5$
$U$ can also be calculated as $U = n_1n_2 + \frac{n_1(n_1 + 1)}{2} - W$, where W is the sum of ranks for the first sample.

$U$ is symmetric around $n_1n_2/2$

If both $n_1$ and $n_2$ are larger than 10, we can use a normal approximation. The Z statistic is calculated as follows:
$$Z = \frac{U-(n_1n_2/2)}{\sqrt{n_1n_2(n_1+n_2+1)/12}}$$
There is another way we can calculate Z, but the sign changes:
$$Z = \frac{W-n_1(n_1+n_2+1)/2}{\sqrt{n_1n_2(n_1+n_2+1)/12}}$$
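
In R, wilcox.test carries out this test; X and Y below are the tiny example from the table. Note that the W reported by R counts the pairs in the opposite direction, so here it equals $n_1n_2 - U$ rather than $U$ itself.

X = c(3, 9, 11); Y = c(1, 4, 5, 12)
wilcox.test(X, Y)                  # exact rank sum (Mann-Whitney) test for small samples
wilcox.test(X, Y, exact = FALSE)   # normal approximation, as in the Z formula above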

7.4 Kruskal-Wallis test

If we want to see if more than 2 groups have the same distribution, we need a non-parametric test that is analogous to one-way ANOVA. The test is called Kruskal-Wallis H-test
The null hypothesis is that all the observations come from the same distribution.

Under $H_0$, if we pool the groups together, no group's observations will tend to be larger or smaller than any other group's. So we can pool all the groups together and take the average rank within each group.
Under $H_0$, the averages should be approximately the same.

If $\bar{R}_j$ is the average rank of the $j^{th}$ group, under $H_0$ $\bar{R}_j$ would be approximately equal to $\bar{R} = \frac{n+1}{2}$
The test statistic is:

$$H = \frac{12}{n(n+1)}\sum_{j=1}^Kn_j(\bar{R}_j - \bar{R})^2$$
H is distributed according to a $\chi^2$ distribution, with $K-1$ degrees of freedom.

This test is generally accurate if each of the group sample sizes is at least 5; ranks are averaged for ties.

We do this in R:

kruskal.test(response, groups)

7.5 Friedman Fr-test for randomized blocks

Friedman's Fr-test statistic looks for deviations of the ranks of a treatment within a block across all blocks.
The null hypothesis is that the average rank of each of the $K$ treatment levels, taken across the $B$ blocks, is the same.
If $\bar{R}_j$ is the average rank of the $j^{th}$ level of treatment across the $B$ blocks, the $F_r$ statistic would be:
$$F_r = \frac{12B}{K(K+1)}\sum_{j=1}^K(\bar{R}_j-\bar{R})^2$$
Again it follows a $\chi^2$ distribution with $K-1$ degrees of freedom, provided either $B$ or $K$ is larger than 5, and we reject for large values of $F_r$.
To do this in R:

friedman.test(response,factor,block)

7.6 Spearman rank correlation

We need a non-parametric way of assessing association between two quantitative variables.

Suppose we have pairs of variables, $(X_i, Y_i)$, from $n$ independently and identically distributed random variables. If they are highly associated, we would see small values of X paired with small values of Y, and large values of X paired with large values of Y.
We can just consider using the correlation of the ranks of X and Y.

This measure of correlation is known as Spearman's rank correlation coefficient

Let $u_i$,$v_i$ be the rank of the $i^{th}$ observation in $X$ and $Y$ respectively. We can calculate the sample correlation of the ranks by:
$$r_s = \frac{SS_{uv}}{\sqrt{SS_{uu}SS_{vv}}}$$
Where:

$$SS_{uv} = \sum_{i=1}^nu_iv_i - n\bar{u}\bar{v}$$
$$SS_{uu} = \sum_{i=1}^nu_i^2 - n\bar{u}^2$$
$$SS_{vv} = \sum_{i=1}^nv_i^2 - n\bar{v}^2$$
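
In R, Spearman's coefficient and the corresponding test are available through cor and cor.test (x and y are placeholder vectors):

cor(x, y, method = "spearman")        # r_s, the correlation of the ranks
cor.test(x, y, method = "spearman")   # test of H0: no rank association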