Regression and Correlation
- Regression and correlation are used to analyze the relationship between two continuous
variables.
- If we are thinking of a cause and effect relationship between the two random variables
we analyze the data using regression.
- For example, we might be interested in the effect of fat intake on risk of breast
cancer. There are two continuous random variables (fat intake and cancer risk) and we
think of fat intake affecting cancer risk, not the other way around.
- The variable being affected is called the response variable or the dependent
variable. It is typically denoted by the letter Y. In the above example the
dependent variable is cancer risk.
- The variable that is affecting the dependent variable is called the independent
variable. The independent variable is typically denoted by the letter X.
In the above example the independent variable is fat intake.
- When we aren't thinking of a cause and effect relationship we analyze in terms of
correlation.
- Scattergrams are plots used to examine two continuous variables at the
same time.
- As an example we'll look at monthly sales of leaded gasoline in Massachusetts (Gas)
and monthly average lead levels measured in newborn umbilical cord blood (LIUC).
| Month | Gas | LIUC |
|---|---|---|
| March, 1980 | 141 | 6.4 |
| April, 1980 | 166 | 6.1 |
| May, 1980 | 161 | 5.7 |
| June, 1980 | 170 | 6.9 |
| July, 1980 | 148 | 7.0 |
| August, 1980 | 136 | 7.2 |
| September, 1980 | 169 | 6.6 |
| October, 1980 | 109 | 5.7 |
| November, 1980 | 117 | 5.7 |
| December, 1980 | 87 | 5.3 |
| January, 1981 | 105 | 4.9 |
| February, 1981 | 73 | 5.4 |
| March, 1981 | 82 | 4.5 |
| April, 1981 | 75 | 6.0 |
We think the sales of leaded gasoline affect the LIUC, so Gas is the independent
variable and LIUC is the dependent variable.
- We can think of the value of a dependent variable as resulting from two separate
relationships:
- A deterministic model, and
- random error
- The deterministic model relates a value of the independent variable (X) to the
mean value of the dependent variable (Y) at that value of X. For instance, the mean
percent change in cholesterol might have a linear relationship with the dosage of
Lovastatin (mg/d). The formula for this line might be y = -0.5x. This tells us that at a
dosage of 40 mg/d (x = 40) the mean change in cholesterol level would be -20%
(-0.5 * 40), that is, a 20% reduction.
- Of course not everyone who got 40 mg/d of Lovastatin would get exactly a 20% reduction
in cholesterol levels. There would be some variation around that mean. This is the random
error component.
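The split between the deterministic part and the random error can be illustrated with a small simulation. This is only a sketch: the line y = -0.5x comes from the example above, but the normal error distribution, its standard deviation of 5 percentage points, and the function name `simulate_responses` are assumptions made for illustration.

```python
import random

def simulate_responses(dose, n, slope=-0.5, error_sd=5.0, seed=0):
    """Draw n responses at one dose: deterministic mean plus random error.

    error_sd is an assumed value chosen for illustration, not an estimate.
    """
    rng = random.Random(seed)
    mean_response = slope * dose  # deterministic part: y = -0.5x
    return [mean_response + rng.gauss(0.0, error_sd) for _ in range(n)]

responses = simulate_responses(dose=40, n=1000)
sample_mean = sum(responses) / len(responses)

print(f"deterministic mean at x = 40: {-0.5 * 40}")
print(f"mean of simulated responses:  {sample_mean:.2f}")
print(f"range of simulated responses: {min(responses):.1f} to {max(responses):.1f}")
```

Individual responses scatter on both sides of -20, while their average stays close to the value given by the line.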
- We will only consider regression models when the deterministic part of the model is
linear so the analysis is called linear regression (as opposed to non-linear
regression). We will also limit ourselves to models with only one independent
variable, so the analysis can more precisely be called simple linear regression
(as opposed to multiple regression).
- In simple linear regression the deterministic part of the model can be written:
$E(y) = \beta_0 + \beta_1 x$
- The full model is written as $y = \beta_0 + \beta_1 x + \varepsilon$ where
$\varepsilon$ is the random error.
- Estimating the parameters
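- $\beta_0$ and $\beta_1$ are unknown parameters. We use the data to estimate these
parameters. The estimates are labeled $\hat{\beta}_0$ and $\hat{\beta}_1$.
- Estimates of $E(y)$ are labeled $\hat{y}$.
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$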
- Residuals are the differences between a value of y (an actual observation at a value of
x) and $\hat{y}$ (the predicted value of y at that value of x).
- The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that are selected are those that
result in a line with the minimum possible sum of squared residuals. The estimates are
called "least squares estimates".
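As a sketch of how the least squares estimates can be computed, the code below applies the standard closed-form solution to the Gas and LIUC data from the table: the slope estimate is Sxy/Sxx (the sum of cross-products of deviations over the sum of squared x deviations) and the intercept estimate is the mean of y minus the slope times the mean of x. The function and variable names are chosen here for illustration.

```python
# Monthly data from the table: leaded gasoline sales (x) and LIUC (y).
gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]

def least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared differences between observed y and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = least_squares(gas, liuc)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(gas, liuc)]
sse = sum_squared_residuals(gas, liuc, b0, b1)

print(f"intercept estimate: {b0:.4f}")
print(f"slope estimate:     {b1:.5f}")
print(f"sum of squared residuals: {sse:.4f}")
```

Any other line, for example one with a slightly different slope, gives a larger sum of squared residuals on the same data.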
- Assumptions
- There are no assumptions necessary for the least squares estimates to be valid. However,
assumptions are necessary to conduct any hypothesis tests or to form any confidence
intervals.
- The assumptions are:
- The model must be correct. There must really be a linear relationship between the
dependent and independent variables.
- The random errors satisfy $\varepsilon \sim N(0, \sigma^2)$.
- The data comes from a random sample
- The important points are that the variance is constant throughout the range of x values
and that the errors are normally distributed.
- Hypothesis testing
- We are typically interested in testing the hypotheses:
$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$
- The above hypothesis test is really a test of whether the two variables are linearly
related or not. If the slope is zero then knowing the value of X does not give you any
information about Y. The mean of Y equals the y-intercept in that case regardless of the
value of X.
- Sometimes we are interested in one-sided (left or right) hypotheses about the slope.
Sometimes we are interested in tests about the y-intercept. The above hypothesis is by far
the most common though.
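The notes do not spell out the test statistic, so the following is a sketch of the usual t test for the slope under the assumptions listed earlier: t is the slope estimate divided by its standard error sqrt(MSE / Sxx), with n - 2 degrees of freedom. It uses the Gas and LIUC data from the table. The critical value 2.179 (two-sided, 5% level, 12 degrees of freedom) is taken from a standard t table, and the function name is chosen here for illustration.

```python
import math

gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]

def slope_t_statistic(x, y):
    """Return (t statistic, degrees of freedom) for testing slope = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)              # estimate of the error variance
    se_slope = math.sqrt(mse / sxx)  # standard error of the slope estimate
    return slope / se_slope, n - 2

t_stat, df = slope_t_statistic(gas, liuc)

# Two-sided 5% critical value for 12 degrees of freedom, from a t table.
T_CRITICAL_12_DF = 2.179
reject_null = abs(t_stat) > T_CRITICAL_12_DF

print(f"t = {t_stat:.3f} on {df} degrees of freedom")
print("reject H0 at the 5% level" if reject_null else "do not reject H0 at the 5% level")
```

A one-sided test would use the same statistic with a one-sided critical value instead.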
- R-Squared
- Consider two numbers:
- the sum of squared deviations of each y from $\bar{y}$. This is called SSTO (sum of
squares total). The sample variance of y is SSTO/(n-1).
- the sum of squared residuals (the deviation of each y from $\hat{y}$). This is called
SSE (sum of squares error). The estimate of $\sigma^2$ is SSE/(n-2) and is called MSE
(mean squared error).
- $R^2$ is (SSTO - SSE)/SSTO. Roughly speaking (and the following is good enough
for an exam) $R^2$ is the proportion of the variability of y that is explained by
the regression analysis.
- When $R^2$ = 0 then SSTO = SSE, meaning the estimated slope was zero. This means x
explains nothing about y.
- When $R^2$ = 1 then SSE = 0. This means that every data point falls on the line.
This means that x explains all of y.
- You can have a situation where the hypothesis test for the slope concludes that
$\beta_1 \neq 0$ but the $R^2$ value is very small. This tells us that the mean response
of y changes with x but that there is a lot of variability on top of that.
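The quantities in this section can be computed directly from their definitions. The sketch below does so for the Gas and LIUC data from the table; the helper name `sums_of_squares` is chosen here for illustration.

```python
gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]

def sums_of_squares(x, y):
    """Return (SSTO, SSE) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ssto = sum((yi - y_bar) ** 2 for yi in y)  # deviations from the mean of y
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return ssto, sse

n = len(gas)
ssto, sse = sums_of_squares(gas, liuc)
sample_variance = ssto / (n - 1)
mse = sse / (n - 2)
r_squared = (ssto - sse) / ssto

print(f"SSTO = {ssto:.4f}, sample variance of y = {sample_variance:.4f}")
print(f"SSE  = {sse:.4f}, MSE = {mse:.4f}")
print(f"R-squared = {r_squared:.3f}")
```

Because SSE can never exceed SSTO for a least squares line with an intercept, the result always lies between 0 and 1.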
- Correlation
- Correlation measures the linear relationship between two variables
without assuming a cause/effect relationship. The statistic that is calculated is called a
correlation coefficient and is labeled r.
- r = 1 means there is a perfect increasing linear relationship between the two
variables. r = -1 means there is a perfect decreasing (inverse) linear relationship
between the two variables. r = 0 means that there is no linear relationship between the
two variables.
- If the two variables have a non-linear relationship then the correlation coefficient is
not a meaningful summary of it, because r only measures the linear part of the
relationship.
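The notes do not give a formula for r. A common definition, the Pearson correlation coefficient, is r = Sxy / sqrt(Sxx * Syy), and the sketch below assumes that definition. The small made-up data sets are for illustration only: they show r = 1, r = -1, and a case where y is completely determined by x through a parabola and yet r = 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data chosen to make each case exact.
xs = [-3, -2, -1, 0, 1, 2, 3]
increasing = [2 * x + 1 for x in xs]  # perfect increasing line
decreasing = [-x for x in xs]         # perfect decreasing line
parabola = [x ** 2 for x in xs]       # perfect but non-linear relationship

r_increasing = pearson_r(xs, increasing)
r_decreasing = pearson_r(xs, decreasing)
r_parabola = pearson_r(xs, parabola)

print(f"increasing line: r = {r_increasing:.3f}")
print(f"decreasing line: r = {r_decreasing:.3f}")
print(f"parabola:        r = {r_parabola:.3f}")
```

The parabola case is why a scattergram should be examined before interpreting r.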