
Regression and Correlation

  1. Regression and correlation are used to analyze the relationship between two continuous variables.
     
    1. If we are thinking of a cause-and-effect relationship between the two random variables, we analyze the data using regression.

      1. For example, we might be interested in the effect of fat intake on the risk of breast cancer. There are two continuous random variables (fat intake and cancer risk), and we think of fat intake affecting cancer risk, not the other way around.

      2. The variable being affected is called the response variable or the dependent variable. It is typically denoted by the letter Y. In the above example the dependent variable is cancer risk.

      3. The variable that is affecting the dependent variable is called the independent variable. It is typically denoted by the letter X. In the above example the independent variable is fat intake.

    2. When we aren't thinking of a cause-and-effect relationship, we analyze the data in terms of correlation.
       
  2. Scattergrams are plots used to examine two continuous variables at the same time.
     
    1. As an example, we'll look at monthly sales of leaded gasoline in Massachusetts (Gas) and monthly average lead levels measured in newborn umbilical cord blood (LIUC).
       
      Month              Gas   LIUC
      March, 1980        141   6.4
      April, 1980        166   6.1
      May, 1980          161   5.7
      June, 1980         170   6.9
      July, 1980         148   7.0
      August, 1980       136   7.2
      September, 1980    169   6.6
      October, 1980      109   5.7
      November, 1980     117   5.7
      December, 1980      87   5.3
      January, 1981      105   4.9
      February, 1981      73   5.4
      March, 1981         82   4.5
      April, 1981         75   6

       
      We think the sales of leaded gasoline affect LIUC, so Gas is the independent variable and LIUC is the dependent variable.
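
A scattergram of this data can be sketched without a plotting package. The following is a minimal sketch in plain Python; the `text_scatter` helper and the grid size are illustrative choices and not part of the course materials. It places one `*` per month on a character grid, with Gas on the horizontal axis and LIUC on the vertical axis.

```python
# Monthly leaded gasoline sales (Gas) and cord blood lead levels (LIUC),
# March 1980 through April 1981, copied from the table above.
gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]


def text_scatter(x, y, width=50, height=15):
    """Return a crude character-grid scattergram as a list of strings."""
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each point to a column (left to right) and a row (top to bottom).
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - yi) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    # Add a left border to every row and a bottom border under the grid.
    return ["|" + "".join(row) for row in grid] + ["+" + "-" * width]


plot_lines = text_scatter(gas, liuc)
print("LIUC (vertical) versus Gas (horizontal)")
print("\n".join(plot_lines))
```

Points that drift from the lower left to the upper right of the grid suggest that months with higher gasoline sales tended to have higher cord blood lead levels.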
       

  3. We can think of the value of a dependent variable as resulting from two separate relationships:
     
    1. A deterministic model, and
       
    2. random error
       
    3. The deterministic model relates a value of the independent variable (X) to the mean value of the dependent variable (Y) at that value of X. For instance, the mean percent change in cholesterol might have a linear relationship with the dosage of Lovastatin (mg/d). The formula for this line might be y = -0.5x. This tells us that at a dosage of 40 mg/d (x = 40) the mean change in cholesterol levels would be -20% (-0.5 * 40), that is, a 20% reduction.

    4. Of course, not everyone who got 40 mg/d of Lovastatin would get exactly a 20% reduction in cholesterol levels. There would be some variation around that mean. This is the random error component.
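
The two components can be illustrated with a small simulation. This is a sketch only: the line y = -0.5x comes from the example above, but the normal shape of the error and its standard deviation of 5 percentage points are assumptions made purely for illustration, as are the function names.

```python
import random
import statistics

random.seed(110)  # fixed seed so the sketch is repeatable

SLOPE = -0.5      # from the example line y = -0.5x
ERROR_SD = 5.0    # hypothetical spread of the random error (an assumption)


def mean_response(dose):
    """Deterministic part: mean % change in cholesterol at a given dose."""
    return SLOPE * dose


def simulate_patient(dose):
    """One observed response = deterministic mean + random error."""
    return mean_response(dose) + random.gauss(0.0, ERROR_SD)


dose = 40
responses = [simulate_patient(dose) for _ in range(10_000)]

print("mean response from the model:", mean_response(dose))
print("average of simulated patients:", round(statistics.mean(responses), 2))
print("SD of simulated patients:", round(statistics.stdev(responses), 2))
```

Individual simulated patients land above and below -20, but their average settles close to the value given by the deterministic line.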
       
  4. We will only consider regression models in which the deterministic part of the model is linear, so the analysis is called linear regression (as opposed to non-linear regression). We will also limit ourselves to models with only one independent variable, so the analysis can more precisely be called simple linear regression (as opposed to multiple regression).
     
    1. In simple linear regression the deterministic part of the model can be written:
       
      $\mu_x = \beta_0 + \beta_1 x$
       
    2. The full model is written as $y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon$ is the random error.
       
  5. Estimating the parameters
     
    1. $\beta_0$ and $\beta_1$ are unknown parameters. We use the data to estimate these parameters. The estimates are labeled $\hat{\beta}_0$ and $\hat{\beta}_1$.

    2. Estimates of $\mu_x$ are labeled $\hat{y}$.

    3. $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

    4. Residuals are the difference between a value of y (an actual observation at a value of x) and $\hat{y}$ (the predicted value of y at that value of x).

    5. The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that are selected are those that result in a line with the minimum possible sum of squared residuals. The estimates are called "least squares estimates".
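
For simple linear regression the least squares estimates have a closed form: the slope is the sum of cross-products of the deviations of x and y from their means, divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies these standard formulas to the gasoline data from the scattergram example. The function names are illustrative choices, not something the notes define.

```python
# Gasoline sales (x) and cord blood lead levels (y) from the table above.
gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # line passes through (x_bar, y_bar)
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared differences between observed y and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


b0_hat, b1_hat = least_squares(gas, liuc)
sse = sum_squared_residuals(gas, liuc, b0_hat, b1_hat)
print(f"intercept estimate: {b0_hat:.4f}")
print(f"slope estimate:     {b1_hat:.5f}")
print(f"sum of squared residuals: {sse:.4f}")

# Nudging the intercept away from its estimate gives a worse (larger) sum.
sse_other = sum_squared_residuals(gas, liuc, b0_hat + 0.1, b1_hat)
print(f"sum of squared residuals for a shifted line: {sse_other:.4f}")
```

The last two lines are a quick check of the "minimum possible sum of squared residuals" property: any line other than the fitted one has a larger sum.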
       
  6. Assumptions
     
    1. No assumptions are necessary to compute the least squares estimates. However, assumptions are necessary to conduct hypothesis tests or to form confidence intervals.
       
    2. The assumptions are:
       
      1. The model must be correct. There must really be a linear relationship between the dependent and independent variables.
         
      2. $\epsilon \sim N(0, \sigma^2)$
         
      3. The data come from a random sample.
         
    3. The important points are that the variance is constant throughout the range of x values and that the errors are normally distributed.
       
  7. Hypothesis testing
     
    1. We are typically interested in testing the hypotheses:

      $H_0: \beta_1 = 0$
      $H_a: \beta_1 \neq 0$
       
    2. The above is really a test of whether the two variables are linearly related. If the slope is zero, then knowing the value of X does not give you any information about Y. In that case the mean of Y equals the y-intercept regardless of the value of X.

    3. Sometimes we are interested in one-sided (left or right) hypotheses about the slope. Sometimes we are interested in tests about the y-intercept. The two-sided test of the slope above is by far the most common, though.
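
One common way to carry out the two-sided test of the slope is a t test with n - 2 degrees of freedom, where the test statistic is the estimated slope divided by its standard error. The notes do not spell out the test statistic, so treat the following as a sketch of the standard approach rather than the course's own procedure. The data are the gasoline example; the 5% significance level is an assumption, and 2.179 is the tabled 0.975 quantile of the t distribution with 12 degrees of freedom.

```python
import math

gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]

n = len(gas)
x_bar = sum(gas) / n
y_bar = sum(liuc) / n

# Least squares estimates of the slope and intercept.
sxx = sum((x - x_bar) ** 2 for x in gas)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gas, liuc))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Estimate the error variance from the residuals (n - 2 degrees of freedom).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(gas, liuc))
mse = sse / (n - 2)

# Standard error of the slope and the t statistic for H0: slope = 0.
se_b1 = math.sqrt(mse / sxx)
t_stat = b1 / se_b1

T_CRITICAL = 2.179  # t table value: 0.975 quantile, 12 df (assumed 5% level)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"slope = {b1:.5f}, standard error = {se_b1:.5f}")
print(f"t = {t_stat:.3f} on {n - 2} degrees of freedom")
print("reject H0 at the 5% level:", reject_h0)
```

A one-sided test would use the same t statistic but compare it with a one-tailed critical value instead.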
       
  8. R-Squared
     
    1. Consider two numbers:
       
      1. the sum of squared deviations of each y from $\bar{y}$. This is called SSTO (sum of squares total). The sample variance of y is SSTO/(n-1).

      2. the sum of squared residuals (the deviation of each y from $\hat{y}$). This is called SSE (sum of squares error). The estimate of $\sigma^2$ is SSE/(n-2) and is called MSE (mean squared error).
    2. $R^2$ is (SSTO - SSE)/SSTO. Roughly speaking (and the following is good enough for an exam), $R^2$ is the proportion of the variability of y that is explained by the regression analysis.

    3. When $R^2 = 0$, then SSTO = SSE, meaning the estimated slope is zero. This means x explains nothing about y.

    4. When $R^2 = 1$, then SSE = 0. This means that every data point falls on the line. This means that x explains all of y.

    5. You can have a situation where the hypothesis test for the slope concludes that $\beta_1 \neq 0$ but the $R^2$ value is very small. This tells us that the mean response of y changes with x but that there is a lot of variability on top of that.
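
The quantities in this section can be computed directly from their definitions. The sketch below does so for the gasoline data and for two tiny made-up data sets (hypothetical values chosen only to hit the extreme cases of $R^2 = 1$ and $R^2 = 0$). The function names are illustrative.

```python
gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]


def fit_line(x, y):
    """Least squares (intercept, slope) for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sums_of_squares(x, y):
    """Return (SSTO, SSE) for the least squares line through the data."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ssto = sum((yi - y_bar) ** 2 for yi in y)                      # about y-bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # about y-hat
    return ssto, sse


def r_squared(x, y):
    """R-squared = (SSTO - SSE) / SSTO."""
    ssto, sse = sums_of_squares(x, y)
    return (ssto - sse) / ssto


n = len(gas)
ssto, sse = sums_of_squares(gas, liuc)
sample_variance = ssto / (n - 1)
mse = sse / (n - 2)
r2_gas = r_squared(gas, liuc)

print(f"SSTO = {ssto:.4f}, sample variance of y = {sample_variance:.4f}")
print(f"SSE  = {sse:.4f}, MSE = {mse:.4f}")
print(f"R-squared for the gasoline data = {r2_gas:.3f}")

# Hypothetical extreme cases: points exactly on a line, and a zero slope.
r2_perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])
r2_flat = r_squared([1, 2, 3], [1, 2, 1])
print("R-squared, points on a line:", r2_perfect)
print("R-squared, zero slope:      ", r2_flat)
```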
       
  9. Correlation
     
    1. Correlation measures the strength of the linear relationship between two variables without assuming a cause-and-effect relationship. The statistic that is calculated is called a correlation coefficient and is labeled r.

    2. r = 1 means there is a perfect increasing linear relationship between the two variables. r = -1 means there is a perfect decreasing (inverse) linear relationship between the two variables. r = 0 means that there is no linear relationship between the two variables.
       
    3. If the two variables have a non-linear relationship then the correlation coefficient is meaningless.
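
The notes do not write out a formula for r. The sketch below uses the usual Pearson formula (the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations), applied to the gasoline data and to a few small made-up data sets. The `correlation` function name and the made-up values are illustrative assumptions.

```python
import math

gas = [141, 166, 161, 170, 148, 136, 169, 109, 117, 87, 105, 73, 82, 75]
liuc = [6.4, 6.1, 5.7, 6.9, 7.0, 7.2, 6.6, 5.7, 5.7, 5.3, 4.9, 5.4, 4.5, 6.0]


def correlation(x, y):
    """Pearson correlation coefficient r between x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


r_gas = correlation(gas, liuc)
print(f"r for Gas and LIUC: {r_gas:.3f}")
print(f"r squared:          {r_gas ** 2:.3f}")

# Hypothetical data: perfect increasing and perfect decreasing lines.
xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2 * x + 1 for x in xs])
r_down = correlation(xs, [-x for x in xs])
print("perfect increasing line:", r_up)
print("perfect decreasing line:", r_down)

# Hypothetical data: y is completely determined by x, but not linearly.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)
print("y = x squared on a symmetric range:", r_curve)
```

In simple linear regression the square of r equals the R-squared of the fitted line, and swapping x and y leaves r unchanged, which matches the idea that correlation does not single out a dependent variable. The last example shows why r can mislead for a non-linear relationship: y is an exact function of x, yet r is zero.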

 

E-mail Mr. Callahan at stat110@edcallahan.com with questions or comments about this web site or about the class itself.

This page was last modified on December 07, 1999.