Terms and Definitions
Simple statistic
A number that carries some information.
Descriptive Statistics
Development of numerical and graphical summaries of data.
Inferential Statistics
Uses data collected from a sample to make inferences about a population.
Population
A set of subjects of interest
Sample
A subset of a population on which observations are made
Data
The set of numerical information collected on variables of the
subjects of a sample
Variable
A characteristic or property of an individual subject
Census
A study in which every member of a population is included in the sample. The data
consist of measurements of variables taken on every member of the population.
Reliability
A measure of the uncertainty of a statistical inference
Target population
The population one wants to make inferences about
Sampled population
The population that is actually sampled
Representative sample
A sample that reflects the characteristics of the target population
Sample of Convenience
A sample collected without a statistical design
Quota Sampling
Separate samples of convenience are collected within each stratum of a population.
The sample size within a stratum is proportional to that stratum's prevalence in the
population.
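As a rough illustration of the proportional allocation described for quota sampling, the
sketch below computes a quota for each stratum. The stratum names, counts, and the helper
quota_sizes are hypothetical, chosen only for this example.

```python
def quota_sizes(stratum_counts, total_sample_size):
    """Allocate a sample across strata in proportion to each stratum's size."""
    population_size = sum(stratum_counts.values())
    return {
        name: round(total_sample_size * count / population_size)
        for name, count in stratum_counts.items()
    }

# Hypothetical population of 10,000 people split into three age strata.
strata = {"under 30": 3000, "30 to 59": 5000, "60 and over": 2000}
quotas = quota_sizes(strata, 200)
print(quotas)  # {'under 30': 60, '30 to 59': 100, '60 and over': 40}
```

With less convenient numbers the rounded quotas may not add up exactly to the requested
total, so a real allocation would need a rule for the leftover units.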
Strata
Subdivisions of a population (singular: stratum)
Random Sampling
Each member of the population has the same probability of being included in the sample
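A minimal sketch of drawing such a sample with Python's standard library, assuming the
population can be listed as ID numbers 1 to 100 (a made-up population). The fixed seed is
only there to make the example repeatable.

```python
import random

# Hypothetical population: subjects identified by the numbers 1..100.
population = list(range(1, 101))

# A seeded generator keeps the example reproducible.
rng = random.Random(0)

# sample() draws without replacement; every member is equally likely to be chosen.
sample = rng.sample(population, 10)
print(sorted(sample))
```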
Probability Sampling
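Sampling in which every member of the population has a known, nonzero probability of
being included in the sample. Random sampling, as defined above, is the equal-probability
case.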
Sampling error
Error resulting from the sampled population not being the same as the target population
Non-response error
Occurs when those who respond to the survey have different traits than those who
do not respond. Results in sampling error.
Reporting error
Occurs when respondents do not answer questions honestly or accurately
Volunteer error
Occurs when respondents volunteer to participate in a poll or experiment and those
respondents are not representative of the target population. Results in sampling error.
Observational study
The "treatment" is a trait of the subject rather than something assigned by the researcher
Confounding factors
Variables that are correlated with the treatment (or differ between treatment groups)
and that affect the response variable.
Designed experiment
The "treatment" is assigned to the subject by the researcher
Double blind study
A designed experiment where neither the researcher nor the subjects know which subjects
have received treatment and which are in the control group.
Control
In a designed experiment, the group of untreated subjects to which the treated subjects
are compared.
Placebo
A "fake" treatment that simulates the treatment but actually has no effect.
Placebos make double blind studies possible. For instance, in a drug trial a sugar pill is
used as the placebo for the control group.
Quantitative variable
Measurements recorded on a naturally occurring numerical scale, such as height,
weight, or the number of people attending a protest.
Qualitative variable
Measurements that cannot be naturally measured on a numerical scale; they can only be
classified into categories. Examples include sex (M/F), race (Caucasian, Hispanic, etc.),
and satisfaction level (1-5, for instance).
Continuous variable
Any value within some interval is a possible outcome, for instance height.
Discrete variable
Possible outcomes are countable, such as the number of accidents at an intersection
(0, 1, 2, 3, ...).
Ordinal variable
Qualitative data that can be naturally put into order, such as satisfaction level (1-5)
or age category (under 21, 21-30, 31-40, over 40).
Nominal variable
Qualitative data that cannot be naturally put into order, such as sex (M/F) or race
(Caucasian, Hispanic, etc.).
Class
One of the categories into which a qualitative variable can be classified
Class frequency
Number of observations in a dataset falling within a particular class
Class relative frequency
Class frequency divided by the number of observations in a dataset
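The three class definitions above can be sketched in a few lines. The observations below
are invented for the example, with the letters standing in for the classes of some
qualitative variable.

```python
from collections import Counter

# Hypothetical observations of a qualitative variable with classes A, B, and C.
observations = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

# Class frequency: the number of observations falling in each class.
class_frequency = Counter(observations)

# Class relative frequency: class frequency divided by the number of observations.
n = len(observations)
class_relative_frequency = {c: f / n for c, f in class_frequency.items()}

print(dict(class_frequency))     # {'A': 5, 'B': 3, 'C': 2}
print(class_relative_frequency)  # {'A': 0.5, 'B': 0.3, 'C': 0.2}
```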
Skewness
Tendency of a distribution or dataset to have one tail longer than the other. A
left-skewed distribution has a long left tail; a right-skewed distribution has a long
right tail.
Robustness
A statistic is robust if changing only a few observations in a dataset does not affect
that statistic too much. For instance, the median is more robust than the mean.
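To make the comparison of the median and the mean concrete, the sketch below changes a
single observation in a small, made-up dataset and recomputes both statistics.

```python
from statistics import mean, median

# Hypothetical dataset, and a copy in which one observation is changed from 6 to 60.
data = [2, 3, 4, 5, 6]
contaminated = [2, 3, 4, 5, 60]

print(mean(data), median(data))                  # 4 4
print(mean(contaminated), median(contaminated))  # 14.8 4
```

Changing one value moves the mean from 4 to 14.8, while the median stays at 4.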
Percentile
The p'th percentile is an observation for which at least p% of the data is less than or
equal to it and at least (100-p)% of the data is greater than or equal to it.
For instance, if 5 is the 25th percentile of a data set, at least 25% of the
data is less than or equal to 5 and at least 75% of the data is greater than or
equal to 5.
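The percentile condition can be checked directly. The helper is_pth_percentile and the
dataset below are hypothetical; the function only tests whether a value satisfies the
"at least p% at or below, at least (100-p)% at or above" condition and is not a standard
library routine.

```python
def is_pth_percentile(value, data, p):
    """Return True if value meets the percentile condition for the given p."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= value)
    at_or_above = sum(1 for x in data if x >= value)
    # Compare counts scaled by 100 so that only integer arithmetic is used.
    return at_or_below * 100 >= p * n and at_or_above * 100 >= (100 - p) * n

data = [1, 3, 5, 5, 7, 8, 9, 12]
print(is_pth_percentile(5, data, 25))  # True: 4 of 8 are <= 5, and 6 of 8 are >= 5
print(is_pth_percentile(7, data, 25))  # False: only 4 of 8 are >= 7
```

Under this condition more than one observation can qualify (here 3 also passes for
p = 25), which is one reason software packages use slightly different percentile
conventions.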
Median
50th percentile
Lower quartile
The 25th percentile
Upper quartile
The 75th percentile
Inter-quartile range
Upper quartile - lower quartile
Range
max observation - min observation
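A short sketch of these summaries using Python's statistics module, on a made-up dataset.
It assumes the quartiles are taken as the 25th and 75th percentile values and the
inter-quartile range as their difference. The exact quartile values depend on the
interpolation convention, so no specific quartile output is claimed here.

```python
from statistics import median, quantiles

# Hypothetical dataset.
data = [1, 3, 5, 5, 7, 8, 9, 12]

# quantiles(..., n=4) returns the three cut points that split the data into quarters.
lower_quartile, middle, upper_quartile = quantiles(data, n=4)

interquartile_range = upper_quartile - lower_quartile
data_range = max(data) - min(data)  # max observation - min observation

print(lower_quartile, middle, upper_quartile)
print(interquartile_range, data_range)
```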
Random variable
A variable that assumes a numerical value associated with the random outcome of an
experiment. Only one outcome allowed per experiment.
Chance
If an experiment were repeated a very large number of times (in the limit, infinitely
many), the chance of a certain outcome is the percentage of times that outcome would occur.
Probability
Chance divided by 100.
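The long-run idea behind chance and probability can be approximated by simulation. The
sketch below assumes a fair six-sided die and a fixed random seed, both chosen only for
illustration, and estimates the chance of rolling a six.

```python
import random

rng = random.Random(42)  # seeded so the run is repeatable
trials = 100_000

# Repeat the experiment many times and count how often the outcome occurs.
sixes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)

chance_percent = 100 * sixes / trials  # chance: percent of times the outcome occurred
probability = chance_percent / 100     # probability: chance divided by 100

print(chance_percent, probability)     # close to 16.7 and 1/6, not exactly equal
```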
Independence
Two events are independent if the outcome of one does not affect the outcome of
the other. For instance, coin flips are independent since the probability of getting a head
on any flip does not depend on the outcome of previous flips.
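The coin-flip example can be verified by listing every outcome of two flips, assuming a
fair coin. The sketch compares the probability of a head on the second flip with the same
probability computed only among outcomes where the first flip was a head.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin flips: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Exact probability of an event, given as a true/false test on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_first_head = prob(lambda o: o[0] == "H")
p_second_head = prob(lambda o: o[1] == "H")
p_both_heads = prob(lambda o: o[0] == "H" and o[1] == "H")

# Probability of a head on the second flip, given a head on the first.
p_second_given_first = p_both_heads / p_first_head

print(p_second_head, p_second_given_first)  # 1/2 1/2
```

The two probabilities agree, which is what independence means in this example.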
Probability distribution
A graph, table or formula that specifies the probability associated with each possible
outcome or measurement.
Population parameter
An attribute of the population probability distribution function, usually the
population mean or variance. Generally the value of the population parameter is unknown
and we want to estimate it.
Standard Normal Distribution
Normal distribution with mean 0 and standard deviation (sigma) 1.
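For readers who want to experiment, Python's statistics.NormalDist can represent this
distribution. The sketch below evaluates two familiar probabilities. The cutoff 1.96 is
simply a commonly used value, not something taken from this glossary.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Half of the probability lies below the mean.
below_mean = standard_normal.cdf(0)

# Probability of falling within 1.96 standard deviations of the mean (about 0.95).
central = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(below_mean, round(central, 3))  # 0.5 0.95
```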
Sample statistic
A statistic calculated from a sample. Generally an estimate of a population parameter
(for instance the sample mean is an estimate of the population mean).
Sampling distribution
The probability distribution of a sample statistic over repeated samples.
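As a small worked case, the sampling distribution of the sample mean can be listed exactly
when the sample is two rolls of a fair die. The die and the sample size of 2 are
assumptions made for this sketch.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Every equally likely sample of two rolls (36 in total).
samples = list(product(faces, repeat=2))

# Count how often each value of the sample mean occurs across all samples.
counts = Counter(Fraction(sum(s), 2) for s in samples)

# Sampling distribution: each possible sample mean paired with its probability.
sampling_distribution = {
    m: Fraction(c, len(samples)) for m, c in sorted(counts.items())
}

for sample_mean, probability in sampling_distribution.items():
    print(float(sample_mean), probability)
```

The most likely sample mean is 3.5, with probability 1/6, which is also the population
mean of a single roll.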
Bias
A sample statistic is a biased estimate of a population parameter if the mean of its
sampling distribution is not equal to the population parameter. For instance, suppose we
want to estimate the variance of a population. If the average value of the sample
variance is equal to the population variance then the sample variance is an unbiased
estimate of the population variance. Otherwise it is a biased estimator.
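Bias can be checked exactly on a tiny example. The sketch below assumes a made-up
population of four values and samples of size 2 drawn with replacement, then averages the
sample variance over every possible sample, once with divisor n - 1 and once with
divisor n.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical population
mu = Fraction(sum(population), len(population))
population_variance = sum((x - mu) ** 2 for x in population) / len(population)

def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by divisor."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / divisor

# All 16 equally likely samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

# Mean of the sampling distribution for each version of the estimator.
mean_with_n_minus_1 = sum(sample_variance(s, 1) for s in samples) / len(samples)
mean_with_n = sum(sample_variance(s, 2) for s in samples) / len(samples)

print(population_variance, mean_with_n_minus_1, mean_with_n)  # 5/4 5/4 5/8
```

In this setup the n - 1 version averages to the population variance (unbiased), while the
n version averages to a smaller value (biased).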
Standard error
The population standard error is the standard deviation of the sampling distribution of
the mean. The sample standard error is the estimate of the population standard error and is
equal to the sample standard deviation divided by the square root of n, the sample size.
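The sample standard error can be computed in a couple of lines. The sample values below
are invented so that the arithmetic is easy to follow by hand.

```python
import math
from statistics import stdev

# Hypothetical sample of n = 8 observations with mean 6.
sample = [4, 8, 6, 5, 3, 7, 9, 6]
n = len(sample)

# Sample standard deviation (divisor n - 1): sqrt(28 / 7) = 2.
s = stdev(sample)

# Sample standard error of the mean: s divided by the square root of n.
standard_error = s / math.sqrt(n)

print(s, round(standard_error, 4))  # 2.0 0.7071
```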
Null Hypothesis
The status quo hypothesis. The hypothesis you are trying to disprove.
Alternative Hypothesis
The research hypothesis. The hypothesis you are trying to prove.
p-value
The probability of getting a value at least as extreme as the observed test statistic in
the direction of the alternative hypothesis if the null hypothesis were actually true. Reject
the null hypothesis when the p-value is less than the Type I error rate, alpha.
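A minimal worked example, under assumptions chosen for illustration: the null hypothesis
is that a coin is fair, the alternative is that heads are more likely, the test statistic
is the number of heads in 10 flips, and 9 heads were observed. The p-value counts outcomes
at least as extreme as the observed one in the direction of the alternative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips = 10
observed_heads = 9
null_p = 0.5   # probability of heads under the null hypothesis
alpha = 0.05   # chosen Type I error rate

# Probability of 9 or 10 heads if the null hypothesis were true: 11/1024.
p_value = sum(
    binomial_pmf(k, n_flips, null_p) for k in range(observed_heads, n_flips + 1)
)
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)  # 0.0107 True
```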
Type I error rate
The probability of rejecting the null hypothesis when the null hypothesis is true.
Type II error rate
The probability of not rejecting the null hypothesis when the null hypothesis is false.
Power of the test
The probability of rejecting the null hypothesis when the null hypothesis is false.
Equal to 1 minus the Type II error rate.
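The three error-rate definitions above can be tied together with a small exact
calculation. The setup is hypothetical: flip a coin 10 times, reject the null hypothesis
of a fair coin when 9 or more heads appear, and suppose the true probability of heads is
0.8.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k_min, n, p):
    """Probability of k_min or more successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

n_flips = 10
critical_value = 9  # reject the null hypothesis when heads >= 9

# Type I error rate: probability of rejecting when the null (p = 1/2) is true.
type_i_error_rate = prob_at_least(critical_value, n_flips, Fraction(1, 2))

# Power: probability of rejecting when the assumed alternative (p = 4/5) is true.
power = prob_at_least(critical_value, n_flips, Fraction(4, 5))

# Type II error rate: probability of not rejecting when the null is false.
type_ii_error_rate = 1 - power

print(float(type_i_error_rate), float(power), float(type_ii_error_rate))
```

Here the Type I error rate is 11/1024 (about 0.011), the power is about 0.376, and the
Type II error rate is about 0.624.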