Summary Statistics
- First, some notation. Say we have a data set comprising the numbers 8, 4, 7, 1 and 9.
A shorthand way of writing "sum up the observations" is:

$$\sum x = 8 + 4 + 7 + 1 + 9 = 29$$

A shorthand way of writing "square each observation and sum those numbers up"
is:

$$\sum x^2 = 8^2 + 4^2 + 7^2 + 1^2 + 9^2 = 64 + 16 + 49 + 1 + 81 = 211$$
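The two sums can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the notes.

```python
# Example data set from the notes.
data = [8, 4, 7, 1, 9]

# "Sum up the observations".
sum_x = sum(data)

# "Square each observation and sum those numbers up".
sum_x_squared = sum(x ** 2 for x in data)

print(sum_x)          # 29
print(sum_x_squared)  # 211
```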
- Measures of central tendency
- mean: divide the sum of the observations by the number of observations, n.

$$\bar{x} = \frac{\sum x}{n}$$

- The notation for the sample mean is always $\bar{x}$. The notation for the population
mean is always $\mu$. The population mean is generally unknown since, unless you have
taken a census, you don't have data from the entire population.
- median: the middle number when observations are
ordered (to be precise, at least half the data is less than or equal to the median and at
least half the data is greater than or equal to the median)
| If the data is: | the median is: |
| 4, 6, 9 | 6 |
| 4, 6, 7, 9 | 6.5 |
| 4, 6, 6, 6 | 6 |
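As a rough illustration, the mean and median definitions above can be reproduced with Python's standard `statistics` module. The sketch reuses the data set from the notation example and the three data sets from the median table; the variable names are assumptions made for this example.

```python
from statistics import mean, median

# Data set from the notation example: sum of observations / number of observations.
data = [8, 4, 7, 1, 9]
sample_mean = mean(data)

# The three data sets from the median table.
medians = [median(d) for d in ([4, 6, 9], [4, 6, 7, 9], [4, 6, 6, 6])]

print(sample_mean)  # 5.8
print(medians)      # [6, 6.5, 6.0]
```

With an even number of observations, `median` averages the two middle values, which is how 6.5 arises for 4, 6, 7, 9.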
- Skewness
- mean of NBA salary data: 2.1 mill
- median of NBA salary data: 1.3 mill
- difference due to skewness
- left skewed means long left tail, mean < median
- right skewed means long right tail, mean > median
- symmetric distribution: median = mean
- Robustness
- Bulls: mean 4.1 mill, median 1.4 mill
- Bulls w/out Michael Jordan: mean 2.0 mill, median 1.3 mill
- NBA: (453 players) mean 2.1 mill, median 1.3 mill
- NBA w/out > 6 mill players (20): mean 1.7 mill, median 1.2 mill
- Which measure of central tendency do you prefer? What if your data is symmetric and
without outliers?
- During a labor strike, would the union use the mean or the median when they talk
about "average wages" of the employees? Which would management use when talking
about "average wages"?
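To see robustness in action, the sketch below compares how far the mean and the median move when a single extreme value is dropped. The salary figures are hypothetical, made up for this illustration; they are not the real Bulls or NBA data.

```python
from statistics import mean, median

# Hypothetical salaries in millions of dollars (NOT the real Bulls data):
# eleven modest salaries plus one extreme outlier at the end.
salaries = [0.3, 0.5, 0.8, 1.0, 1.2, 1.3, 1.5, 1.8, 2.0, 2.5, 3.0, 33.0]
without_outlier = salaries[:-1]

# How much each measure changes when the outlier is included.
mean_shift = mean(salaries) - mean(without_outlier)
median_shift = median(salaries) - median(without_outlier)

print(f"mean:   {mean(salaries):.2f} with outlier, {mean(without_outlier):.2f} without")
print(f"median: {median(salaries):.2f} with outlier, {median(without_outlier):.2f} without")
```

In this made-up data the mean moves by more than 2 million while the median moves by about 0.1 million, which is the sense in which the median is the more robust measure.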
- Measures of variation
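- sample variance: first, subtract the mean from each observation. Square and then sum
those differences. Divide the sum by one less than the number of observations.

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

where $\bar{x}$ is the sample mean and n is the number of observations.
For instance, consider the data 3, 5, 7, 9 and 11. The mean is 7. So

$$s^2 = \frac{(3-7)^2 + (5-7)^2 + (7-7)^2 + (9-7)^2 + (11-7)^2}{5 - 1} = \frac{16 + 4 + 0 + 4 + 16}{4} = \frac{40}{4} = 10$$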
- The sample variance is denoted as $s^2$ and the population variance is
denoted as $\sigma^2$.
- Why do we divide by (n-1) instead of n? We'll worry about this later. Basically, when we
use $s^2$ as an estimate of $\sigma^2$, the estimate would tend to be too small if we
divided by n instead of (n-1).
- The standard deviation is the square root of the variance. Standard deviations have the
same units of measurement as the data. The sample standard deviation is denoted as s
and the population standard deviation is denoted as $\sigma$.
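The variance recipe can be followed step by step in code. This is a minimal sketch using the data 3, 5, 7, 9 and 11 from the example; the variable names are illustrative only.

```python
from math import sqrt

data = [3, 5, 7, 9, 11]
n = len(data)
x_bar = sum(data) / n  # sample mean, 7.0

# Subtract the mean from each observation, square, and sum the results.
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)  # 40.0

# Divide by one less than the number of observations.
sample_variance = sum_sq_dev / (n - 1)  # 10.0

# The standard deviation is the square root of the variance.
sample_sd = sqrt(sample_variance)  # about 3.16

# Dividing by n instead of (n - 1) gives a smaller number.
divide_by_n = sum_sq_dev / n  # 8.0

print(sample_variance, round(sample_sd, 2), divide_by_n)
```

Python's `statistics.variance` and `statistics.stdev` use the same (n-1) divisor, so they should agree with the hand calculation here.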
- range: maximum observation - minimum observation (notice
that the range is a single number). The range of the firearm
death data is 41.6 - 3.6 = 38.
- width of the inter-quartile range (IQR)
- lower quartile (25th percentile): one fourth of the data is less than or equal to the
lower quartile, three fourths greater than or equal to the lower quartile
- upper quartile (75th percentile): three fourths of the data is less than or equal to
the upper quartile, one fourth greater than or equal to the upper quartile
- inter-quartile range: (lower quartile, upper quartile). Notice that the inter-quartile
range is given as 2 numbers, the smaller first.
- width of the inter-quartile range = upper quartile - lower quartile
- The IQR for the firearm death data is (8.65, 15.75).
The IQR width is 7.1
- You will not be asked to calculate lower or upper quartiles by hand.
- The IQR width is more robust than the range or the standard deviation. The standard
deviation is more robust than the range. Here are the 3 measures of variation for the
firearm death data with and without the District of Columbia.
| | With DC | Without DC |
| Range | 38 | 19.6 |
| Standard Deviation | 6.4 | 5.0 |
| IQR width | 7.1 | 6.9 |
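The same pattern can be reproduced on a small example. The sketch below computes the range, the sample standard deviation and the IQR width with and without one outlier. The data is hypothetical (it is not the firearm death data), the helper name `spread` is made up for this example, and `statistics.quantiles` (Python 3.8+) with `method="inclusive"` is just one of several quartile conventions, so other software may give slightly different quartiles.

```python
from statistics import quantiles, stdev

def spread(values):
    """Return (range, sample standard deviation, IQR width) for a list of numbers."""
    # quantiles(..., n=4) returns three cut points: lower quartile, median, upper quartile.
    lower_q, _, upper_q = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), stdev(values), upper_q - lower_q

# Hypothetical data (NOT the firearm death data): nine typical values plus one outlier.
typical = [4, 6, 8, 10, 12, 14, 16, 18, 20]
with_outlier = typical + [60]

range_a, sd_a, iqr_a = spread(with_outlier)
range_b, sd_b, iqr_b = spread(typical)

# How many times larger each measure becomes when the outlier is included.
range_ratio = range_a / range_b
sd_ratio = sd_a / sd_b
iqr_ratio = iqr_a / iqr_b

print(f"range: x{range_ratio:.2f}, sd: x{sd_ratio:.2f}, IQR width: x{iqr_ratio:.2f}")
```

For this made-up data the range grows the most, the standard deviation somewhat less, and the IQR width barely changes, matching the ordering of robustness described above.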
- Chebychev's Theorem
- For any dataset or population, at least 75% of the data or population
will be within 2 standard deviations of the mean. At least 8/9 (approximately
89%) of the data or population will be within 3 standard deviations of the mean.
- For example, for the 51 observations in the firearm death
data the mean is 13.2 and the standard deviation is 6.4. So at least 75% of the data
will be between 13.2 - 2*6.4 = 0.4 and 13.2 + 2*6.4 = 26.0. In actuality 50 of the 51
observations (98%) fall within this range.
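The 75% and 8/9 figures are the k = 2 and k = 3 cases of the usual general statement of the theorem: at least 1 - 1/k² of the data lies within k standard deviations of the mean. That general formula is not spelled out in these notes, so treat the sketch below as an illustration under that assumption; the function names are made up for this example.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, sd, k):
    """Return the interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

bound_2 = chebyshev_lower_bound(2)  # 0.75
bound_3 = chebyshev_lower_bound(3)  # 8/9, about 0.889

# Firearm death data summary from the notes: mean 13.2, standard deviation 6.4.
low, high = chebyshev_interval(13.2, 6.4, 2)
print(f"at least {bound_2:.0%} of the data lies between {low:.1f} and {high:.1f}")
```

The bound is a guaranteed minimum, which is why the observed 98% can sit well above 75%.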
- Measure of position
- z-score

$$z = \frac{x - \bar{x}}{s}$$
- z-scores are unit-less. A z-score is the number of standard deviations a
subject is from the mean.
- z-scores allow comparing apples and oranges
- For instance, was Michael Jordan worth his pay? He made 33 million dollars and
averaged 28.7 points per game. Unless you know a lot about basketball, it's hard to know
how great the points per game statistic is. The z-score for points per game is
(28.7 - 7.9)/5.8 = 3.6. That's a large z-score. But his z-score for salary is
(33 - 2.1)/2.8 = 11.0. This is an insanely large z-score. By this comparison he was paid
way too much money for the points the team got from him. Of course this is only one way
of addressing this question.
- Chebychev's rule can be restated: at least 75% of subjects will have a
z-score between -2 and 2. At least 8/9 of subjects will have z-scores between -3 and 3.
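The Michael Jordan comparison can be reproduced with a small helper. This is a minimal sketch; the function name `z_score` is illustrative, and the means and standard deviations are the figures quoted in the example above.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Figures quoted in the notes for Michael Jordan and the NBA.
z_points = z_score(28.7, mean=7.9, sd=5.8)  # points per game
z_salary = z_score(33.0, mean=2.1, sd=2.8)  # salary in millions of dollars

print(f"points per game: z = {z_points:.1f}")  # 3.6
print(f"salary:          z = {z_salary:.1f}")  # 11.0
```

Because both results are in units of standard deviations, the two numbers can be compared directly even though one started as points and the other as dollars.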