Summary Statistics

 

  1. First, some notation. Say we have a data set comprised of the numbers 8, 4, 7, 1 and 9.

A shorthand way of writing "sum up the observations" is:

$\sum x_i = 8 + 4 + 7 + 1 + 9 = 29$

A shorthand way of writing "square each observation and sum those numbers up" is:

$\sum x_i^2 = 8^2 + 4^2 + 7^2 + 1^2 + 9^2 = 64 + 16 + 49 + 1 + 81 = 211$
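As a quick check of the two sums above, here is a minimal Python sketch. The variable names are just for illustration.

```python
# The example data set from these notes.
data = [8, 4, 7, 1, 9]

# "Sum up the observations."
total = sum(data)

# "Square each observation and sum those numbers up."
total_of_squares = sum(x ** 2 for x in data)

print(total)             # 29
print(total_of_squares)  # 211
```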

  2. Measures of central tendency
     
    1. mean: divide the sum of the observations by the number of observations.

$\bar{x} = \frac{\sum x_i}{n}$

    1. The notation for the sample mean is always $\bar{x}$. The notation for the population mean is always $\mu$. The population mean is generally unknown since, unless you have taken a census, you don't have data from the entire population.
       
    2. median: the middle number when observations are ordered (to be precise, at least half the data is less than or equal to the median and at least half the data is greater than or equal to the median)

      If the data is:    the median is:
      4, 6, 9            6
      4, 6, 7, 9         6.5
      4, 6, 6, 6         6
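The mean and median can be checked with Python's standard `statistics` module. This is only a sketch; it uses the data set from the notation section for the mean and the three small data sets from the table for the median.

```python
from statistics import mean, median

# Data set from the notation section of these notes.
data = [8, 4, 7, 1, 9]
sample_mean = mean(data)  # sum of the observations divided by n

# The three small data sets from the median table.
examples = [[4, 6, 9], [4, 6, 7, 9], [4, 6, 6, 6]]
medians = [median(d) for d in examples]

print(sample_mean)  # 5.8
print(medians)
```

With an even number of observations, `median` averages the two middle values, which is how 4, 6, 7, 9 gives 6.5.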


       

    3. Skewness
       
      1. mean of NBA salary data: 2.1 mill
      2. median of NBA salary data: 1.3 mill
      3. difference due to skewness
      4. left skewed means long left tail, mean < median
      5. right skewed means long right tail, mean > median
      6. symmetric distribution means median = mean
         
    4. Robustness
       
      1. Bulls: mean 4.1 mill, median 1.4 mill
      2. Bulls w/out Michael Jordan: mean 2.0 mill, median 1.3 mill
      3. NBA: (453 players) mean 2.1 mill, median 1.3 mill
      4. NBA w/out > 6 mill players (20): mean 1.7 mill, median 1.2 mill
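To see robustness in action without the real salary files, here is a small sketch. The salaries below are made up for illustration (they are NOT the actual Bulls or NBA figures); the point is only that removing one very large value moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical salaries in millions of dollars (invented for this sketch,
# not the actual Bulls or NBA data).
salaries = [0.5, 0.8, 1.0, 1.3, 1.5, 2.0, 30.0]

# Drop the single very large salary.
without_top = [s for s in salaries if s < 30.0]

mean_shift = mean(salaries) - mean(without_top)
median_shift = median(salaries) - median(without_top)

print(round(mean(salaries), 2), round(mean(without_top), 2))
print(median(salaries), median(without_top))
print(mean_shift > median_shift)  # True: the mean is the less robust measure
```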
         
    5. Which measure of central tendency do you prefer? What if your data is symmetric and without outliers?
       
    6. During a labor strike, would the union use the mean or the median when talking about the "average wages" of the employees? Which would management use when talking about "average wages"?
  3. Measures of variation
     
    1. sample variance: first, subtract the mean from each observation. Square and then sum those differences. Divide the sum by one less than the number of observations.

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
 

    1. For instance, consider the data 3, 5, 7, 9 and 11. The mean is 7. So

      $s^2 = \frac{(3-7)^2 + (5-7)^2 + (7-7)^2 + (9-7)^2 + (11-7)^2}{5-1} = \frac{16 + 4 + 0 + 4 + 16}{4} = \frac{40}{4} = 10$
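The same recipe (subtract the mean, square, sum, divide by n - 1) can be sketched step by step in Python for the data 3, 5, 7, 9 and 11. The standard library's `statistics.variance` uses the same n - 1 divisor, so it serves as a cross-check.

```python
from statistics import variance

data = [3, 5, 7, 9, 11]
n = len(data)

# Step 1: the sample mean.
x_bar = sum(data) / n

# Step 2: subtract the mean from each observation and square the difference.
squared_diffs = [(x - x_bar) ** 2 for x in data]

# Step 3: sum the squared differences and divide by one less than n.
s2 = sum(squared_diffs) / (n - 1)

print(x_bar)           # 7.0
print(s2)              # 10.0
print(variance(data))  # library cross-check, also divides by n - 1
```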

    2. The sample variance is denoted as $s^2$ and the population variance is denoted as $\sigma^2$.
       
    3. Why do we divide by (n-1) instead of n? We'll worry about this later. Basically, when we use $s^2$ as an estimate of $\sigma^2$ the estimate would tend to be too small if we divided by n instead of (n-1).
       
    4. The standard deviation is the square root of the variance. Standard deviations have the same units of measurement as the data. The sample standard deviation is denoted as $s$ and the population standard deviation is denoted as $\sigma$.
       
    5. range: max observation - minimum observation (notice that the range is a single number). The range of the firearm death data is 41.6 - 3.6 = 38
    6. width of the inter-quartile range (IQR)
       
      25th percentile (lower quartile): one fourth of the data is less than or equal to the lower quartile, and three fourths is greater than or equal to it.

      75th percentile (upper quartile): three fourths of the data is less than or equal to the upper quartile, and one fourth is greater than or equal to it.

      inter-quartile range: (lower quartile, upper quartile). (notice that the inter-quartile range is given as 2 numbers, the smaller first).

      width of the inter-quartile range = upper quartile - lower quartile
       
    7. The IQR for the firearm death data is (8.65, 15.75). The IQR width is 7.1
       
    8. You will not be asked to calculate lower or upper quartiles by hand.
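Since quartiles are left to software, here is a rough sketch of how they might be computed in Python with `statistics.quantiles`. The data are hypothetical (not the firearm death data), and the choice of `method="inclusive"` is an assumption: different packages use slightly different quartile rules, so results can differ a little from other software.

```python
from statistics import quantiles

# Hypothetical data, invented for this sketch.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]

# n=4 splits the data into quarters and returns the three cut points.
lower_q, med, upper_q = quantiles(data, n=4, method="inclusive")

data_range = max(data) - min(data)   # a single number
iqr = (lower_q, upper_q)             # two numbers, the smaller first
iqr_width = upper_q - lower_q        # upper quartile - lower quartile

print(data_range)
print(iqr)
print(iqr_width)
```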
       
    9. The IQR width is more robust than the range or the standard deviation. The standard deviation is more robust than the range. Here are the 3 measures of variation for the firearm death data with and without the District of Columbia.
       
                            With DC    Without DC
      Range                 38         19.6
      Standard deviation    6.4        5.0
      IQR width             7.1        6.9


       

  4. Chebychev's Theorem
     
    1. For any dataset or population, at least 75% of the data or population will be within 2 standard deviations of the mean. At least 8/9 (approximately 89%) of the data or population will be within 3 standard deviations of the mean.
       
    2. For example, for the 51 observations in the firearm death data the mean is 13.2 and the standard deviation is 6.4. So at least 75% of the data will be between 13.2 - 2*6.4 = 0.4 and 13.2 + 2*6.4 = 26. In actuality 50 of the 51 observations (98%) fall within this range.
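The firearm death calculation above can be sketched in Python. The notes only state the 2 and 3 standard deviation cases; the sketch uses the usual general form of the theorem, at least 1 - 1/k^2 of the data within k standard deviations for k > 1, which reproduces both. The function name is hypothetical.

```python
def chebyshev_interval(mean, sd, k):
    """Return (low, high, min_fraction) for k standard deviations, k > 1.

    min_fraction is the general Chebyshev bound 1 - 1/k**2.
    """
    return mean - k * sd, mean + k * sd, 1 - 1 / k ** 2

# Firearm death data summary from the notes: mean 13.2, standard deviation 6.4.
low, high, frac = chebyshev_interval(13.2, 6.4, 2)

print(round(low, 1), round(high, 1))  # 0.4 26.0
print(frac)                           # 0.75

# Three standard deviations gives the 8/9 bound.
_, _, frac3 = chebyshev_interval(13.2, 6.4, 3)
print(round(frac3, 3))                # 0.889
```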
       
  5. Measure of position
     
    1. z-score
       
       $z = \frac{x - \bar{x}}{s}$
       
    2. z-scores are unit-less. A z-score is the number of standard deviations a subject is from the mean.
       
    3. z-score allows comparing apples and oranges
       
    4. For instance, was Michael Jordan worth his pay? He made 33 million dollars and averaged 28.7 points per game. Unless you know a lot about basketball it's hard to know how great the points per game statistic is. The z-score for points per game is (28.7 - 7.9)/5.8 = 3.6. That's a large z-score. But his z-score for salary is (33 - 2.1)/2.8 = 11.0. This is an insanely large z-score. By this comparison he was paid way too much money for the points the Bulls got from him. Of course this is only one way of addressing this question.
       
    5. Chebychev's rule can be restated: at least 75% of subjects will have a z-score between -2 and 2, and at least 8/9 of subjects will have a z-score between -3 and 3.
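The Michael Jordan comparison can be reproduced with a short Python sketch. The means and standard deviations are the ones quoted in the notes; the function name is just for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Points per game: 28.7, against a mean of 7.9 and standard deviation of 5.8.
z_points = z_score(28.7, 7.9, 5.8)

# Salary in millions: 33, against a mean of 2.1 and standard deviation of 2.8.
z_salary = z_score(33, 2.1, 2.8)

print(round(z_points, 1))  # 3.6
print(round(z_salary, 1))  # 11.0
```

Because both results are in standard deviation units, they can be compared directly even though one started out in points and the other in dollars.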

 

E-mail Mr. Callahan at stat110@edcallahan.com with questions or comments about this web site or about the class itself.

This page was last modified on October 17, 1999.