
Example Data

Let’s start with some example data. Create a series of 20 numbers distributed normally about a mean value of 100 and with a standard deviation of 5:

import numpy as np

# Set a seed for the random number generator
# (so we get the same random numbers each time)
np.random.seed(20210710)

# Create fake data
mean = 100
standard_deviation = 5
sample_size = 20
x = np.random.normal(mean, standard_deviation, sample_size)

print([f'{value:.1f}' for value in sorted(x)])
## ['88.6', '89.9', '91.6', '94.4', '95.7', '97.4', '97.6', '98.1', '98.2', '99.4', '99.8', '100.0', '101.7', '101.8', '102.2', '104.3', '105.4', '106.7', '107.0', '109.5']

This is what they look like on a number line:

import matplotlib.pyplot as plt

# Formatting options for plots
A = 6  # Want figure to be A6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A) * 0.3])
plt.rc('text', usetex=True)  # Use LaTeX
plt.rc('font', family='serif')  # Use a serif font
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')  # Load Greek letters

# Create plot axes
ax = plt.axes()
# Add jitter to separate out the points
y = np.ones(len(x)) + np.random.uniform(-0.2, 0.2, size=len(x))
# Create scatter plot
ax.scatter(x, y, s=10)
ax.set_title('Example Data: 20 Random Measurements')
# Remove axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
# Add arrows on x-axis
ax.arrow(100, 0, 11.5, 0, head_width=0.2, color='k', clip_on=False)
ax.arrow(100, 0, -11.5, 0, head_width=0.2, color='k', clip_on=False)
# Axes' settings
ax.set_ylim(0, 2)
ax.set_xlim(88.5, 111.5)
ax.tick_params(axis='y', left=False, labelleft=False)
# Finish
plt.subplots_adjust(left=0.1, bottom=0.2, right=0.9, top=0.8)
plt.show()
plt.close()

These 20 numbers are a random sample drawn from the full population of numbers that have a true mean of 100 and a true standard deviation (the population standard deviation) of 5. Here’s a graph representing the full population (more correctly, this is a graph of the probability density function - the function describing the relative likelihood of each value in the population from which our 20 numbers were drawn):

import scipy.stats as st

# Create data
x_pdf = np.linspace(84, 116, 1000)
mean = 100
std = 5
y_pdf = st.norm.pdf(x_pdf, mean, std)

# Create plot
ax = plt.axes()
ax.plot(x_pdf, y_pdf)
ax.set_title('Probability Density Function')
ax.set_ylabel('Relative Likelihood')
ax.set_xlabel('Value')
ax.set_ylim(0, np.max(y_pdf) * 1.08)
ax.set_xlim(84, 116)
# Minus one standard deviation
y = st.norm.pdf(mean - std, mean, std)
ax.vlines(mean - std, 0, y, colors='k', ls='dotted')
plt.text(mean - std, y * 1.05, r'$\mu - \sigma$', ha='right')
# Mean
y = st.norm.pdf(mean, mean, std)
ax.vlines(mean, 0, y, colors='k', ls='dashed')
plt.text(mean, y * 1.02, r'Mean, $\mu$', ha='center')
# Plus one standard deviation
y = st.norm.pdf(mean + std, mean, std)
ax.vlines(mean + std, 0, y, colors='k', ls='dotted')
plt.text(mean + std, y * 1.05, r'$\mu + \sigma$', ha='left')
# Finish
plt.show()
plt.close()

Best Estimates

Of course, the only reason we know that the true mean and true standard deviation of our numbers are 100 and 5, respectively, is because we’ve artificially created this data. If these were real-world measurements we wouldn’t know these values and so could only estimate them from the sample we have:

# Sample mean
x_bar = np.mean(x)
# Sample standard deviation
s = np.std(x, ddof=1)  # Use ddof=1 to get the sample standard deviation

print(rf'Sample mean = {x_bar:4.1f}; sample standard deviation = {s:3.1f}')
## Sample mean = 99.5; sample standard deviation = 5.7

So our best estimate for the true mean is the sample mean, \(\bar x\), and is 99.5. Similarly, our best estimate for the true standard deviation is the sample standard deviation, \(s\), and is 5.7.

In summary: we use the sample mean and standard deviation when trying to estimate the population (‘true’) mean and standard deviation from that sample.

Note that we use Roman letters (\(\bar x\) and \(s\)) when referring to sample statistics and Greek letters (\(\mu\) and \(\sigma\)) when referring to population statistics.

Confidence Interval of the Mean of a Small Sample

Now, having an estimate of the true population mean is fine and all but we can do better: we can calculate a confidence interval, \(CI\), around our estimate and we can have a certain amount of confidence, \(C\), that this interval will contain the true mean. If we want 95% confidence (\(C = 0.95\)), it means that there will be a 95% chance that the interval contains the true population mean. Expressed mathematically:

\[\begin{equation} \tag{1} P \left( \bar x - CI < \mu < \bar x + CI \right) = 0.95 \end{equation}\]

In other words, there is a 95% chance that both of the following hold:

  • Our sample mean plus the confidence interval will be larger than the population mean (\(\mu < \bar x + CI\)), and:
  • Our sample mean minus the confidence interval will be smaller than the population mean (\(\bar x - CI < \mu\))

Now that we know the definition of the confidence interval, how do we calculate it? We do so by using the t-statistic. The t-statistic is the difference between an estimated value and its true value divided by its standard error, so in our case it’s the difference between our best estimate of the true mean (the sample mean, \(\bar x\)) and the actual true mean (the population mean, \(\mu\)) as a ratio of the standard error of our estimate of the mean (\(SE_{\bar x}\)):

\[t = \dfrac{\bar x - \mu}{SE_{\bar x}}\]

Given that the standard error is the ratio of the standard deviation to the square root of the sample size (\(SE = s / \sqrt{n}\) where \(n\) is the sample size) we can re-write this as:

\[\begin{equation} \tag{2} t = \dfrac{\bar x - \mu}{s / \sqrt{n}} \end{equation}\]
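
As a quick check of this building block on our own data, here’s a minimal sketch; note that `st.sem` is SciPy’s standard-error-of-the-mean function and uses `ddof=1` by default, matching our sample standard deviation:

# Standard error of the mean for our sample, calculated manually
se = s / np.sqrt(len(x))
# SciPy's built-in equivalent (uses ddof=1 by default)
se_scipy = st.sem(x)

print(f'Standard error of the mean = {se:.1f} (SciPy agrees: {np.isclose(se, se_scipy)})')
## Standard error of the mean = 1.3 (SciPy agrees: True)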

The values of t-statistics follow a Student’s t-distribution with \(n - 1\) degrees of freedom. This is a useful fact, because it is known that we can calculate the endpoints of intervals of a Student’s t-distribution that correspond to any proportion \(C\) of that distribution. For example, we could calculate the critical t-value, \(t^*\), that would correspond to the endpoints of the central interval of the t-distribution that contains \(C = 0.95\) of it. This implies that, for a t-statistic \(t\) randomly sampled from the t-distribution, we would have:

\[P\left(-t^* < t < t^*\right) = 0.95\]

Substituting in equation (2) - the equation for the t-statistic - we get:

\[P\left(-t^* < \dfrac{\bar x - \mu}{s / \sqrt{n}} < t^*\right) = 0.95\]

Re-arranging:

\[P\left(\bar x - t^* \times \dfrac{s}{\sqrt{n}} < \mu < \bar x + t^* \times \dfrac{s}{\sqrt{n}}\right) = 0.95\]

Comparing this with equation (1) - the definition of the 95% confidence interval - shows that:

\[CI = t^* \times \dfrac{s}{\sqrt{n}}\]

and so the full confidence interval for the sample mean \(\bar x\) is:

\[\bar x \pm t^* \times \dfrac{s}{\sqrt{n}}\]

Of course, while this equation was derived for a 95% confidence interval, it is valid for any confidence level for which you calculate \(t^*\).

In our worked example, we already have \(\bar x\), \(s\) and \(n\) so we just need to get \(t^*\) from the Student’s t-distribution:

# Sample size
n = len(x)
# Confidence level
C = 0.95  # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Degrees of freedom
dof = n - 1
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
# Confidence interval
ci_upper = x_bar + t_star * s / np.sqrt(n)
ci_lower = x_bar - t_star * s / np.sqrt(n)

print(f'We are 95% sure that the true mean lies between {ci_lower:4.1f} and {ci_upper:5.1f}')
## We are 95% sure that the true mean lies between 96.8 and 102.1

A more correct format for reporting the confidence interval is: <answer> <unit>, 95% CI [<lower limit>, <upper limit>]

print(f'Mean = {x_bar:.1f}, 95% CI [{ci_lower:.1f}, {ci_upper:.1f}]')
## Mean = 99.5, 95% CI [96.8, 102.1]
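
To get a feel for what “95% confidence” actually means, here’s a purely illustrative simulation sketch (the repeat count and the use of a separate random number generator are choices made for this example, not part of the method itself): repeatedly draw new samples of 20 from the same population and count how often the t-based interval captures the true mean of 100.

# Use a separate random number generator so the seeded sequence above is unaffected
rng = np.random.default_rng(20210710)

n_repeats = 10_000
hits = 0
for _ in range(n_repeats):
    # A new sample of 20 from the same population
    sample = rng.normal(100, 5, 20)
    # The t-based 95% confidence interval for this sample's mean
    lower, upper = st.t.interval(0.95, 19, loc=np.mean(sample), scale=st.sem(sample))
    # Count it as a 'hit' if the interval contains the true mean
    if lower < 100 < upper:
        hits += 1

# This proportion should come out close to 0.95
print(f'Proportion of intervals that contain the true mean: {hits / n_repeats:.3f}')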

Confidence Interval of the Mean of a Large Sample

If the sample size is 30 or greater, the sample standard deviation and the estimated population standard deviation will be close enough in value that you can consider them to be equal:

\[\sigma = s\]

More specifically, for this sample size (or greater) the effect of Bessel’s correction - subtracting 1 from the sample size to get the degrees of freedom (the ‘delta degrees of freedom’, ddof, in NumPy) as part of the sample standard deviation calculation but not as part of the population standard deviation calculation - is negligible. Is this a good assumption? Here’s a quick example with \(n = 30\) to get a feel:

# Set a seed for the random number generator
# (so we get the same random numbers each time)
np.random.seed(20210710)

# Create fake data
mean = 100
standard_deviation = 5
sample_size = 30
x = np.random.normal(mean, standard_deviation, sample_size)

# Sample size
n = len(x)
# Sample mean
x_bar = np.mean(x)
# Population standard deviation
σ = np.std(x, ddof=0)  # Use ddof=0 to estimate the population standard deviation
# Sample standard deviation
s = np.std(x, ddof=1)  # Use ddof=1 to get the sample standard deviation

print(f's = {np.round(s, 2)}; σ = {np.round(σ, 2)}')
## s = 5.13; σ = 5.05

The difference between these values for the standard deviation is 0.09, or 1.70% of their mean value. So our assumption that they are equal can be considered valid if you think that this is sufficiently small. In the “Interval Width” section below we will see that using one or the other will lead to about a 6% difference in the width of the confidence interval when the sample size is 30.

In addition to allowing us to set \(\sigma\) equal to \(s\), the fact that we now have a large sample size means that the central limit theorem becomes valid. This theorem states that:

The distribution of the sample means is approximately normally distributed

So, the facts that we now:

  • Have a value for \(\sigma\)
  • Can assume the central limit theorem

means that we don’t have to use the t-statistic and the t-distribution, but can instead use the z-score and the normal distribution. This will produce tighter confidence intervals, which makes sense because larger sample sizes lead to more precise results.

Note that this means we are now producing our example data using the normal distribution and also calculating the confidence intervals using the normal distribution. Don’t confuse these two separate uses of the same distribution. Also note that the central limit theorem can often apply even if the original data source is not normally distributed. So: sometimes confidence intervals are calculated on normally distributed data using the normal distribution, sometimes not.
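
As a quick (and purely illustrative) sketch of that last point, here the means of many samples drawn from a clearly non-normal distribution - an exponential one - are compared with the raw data; the choice of distribution, its scale and the sample counts are arbitrary choices for this demonstration:

# Use a separate random number generator so the seeded sequence above is unaffected
rng = np.random.default_rng(20210710)

# 10,000 samples of size 30 from a heavily skewed (exponential) distribution
samples = rng.exponential(scale=10, size=(10_000, 30))
# The mean of each sample
sample_means = samples.mean(axis=1)

# The raw data is strongly skewed but the sample means are much more symmetric
# (their skewness is much closer to zero), as the central limit theorem predicts
print(f'Skewness of the raw data:     {st.skew(samples.flatten()):.2f}')
print(f'Skewness of the sample means: {st.skew(sample_means):.2f}')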

The derivation of the formula for this z-score-based confidence interval is the same as it was for the t-statistic-based confidence interval, changing only what is different between the two distributions:

\[\begin{equation} \tag{3} z = \dfrac{\bar x - \mu}{\sigma / \sqrt{n}} \end{equation}\]

Note the similarities between equation (3) and equation (2). Continuing the derivation from there we have:

\[P\left(-z^* < z < z^*\right) = 0.95\]

where \(z^*\) is the critical z-score that corresponds to the endpoints of the central interval of the normal distribution that contains \(C\) of it, eg \(C = 0.95\) for 95%. Substituting in equation (3) we get:

\[P\left(-z^* < \dfrac{\bar x - \mu}{\sigma / \sqrt{n}} < z^*\right) = 0.95\]

Re-arranging:

\[P\left(\bar x - z^* \times \dfrac{\sigma}{\sqrt{n}} < \mu < \bar x + z^* \times \dfrac{\sigma}{\sqrt{n}}\right) = 0.95\]

Comparing this with equation (1) - the definition of the 95% confidence interval - shows that:

\[CI = z^* \times \dfrac{\sigma}{\sqrt{n}}\]

and so the full confidence interval for the sample mean \(\bar x\) is:

\[\bar x \pm z^* \times \dfrac{\sigma}{\sqrt{n}}\]

As the normal distribution is not dependent on the sample size (unlike the t-distribution), the value of \(z^*\) for a 95% confidence level will remain constant. Let’s calculate it:

# Confidence level
C = 0.95  # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical z-score, calculated using the percent-point function (aka the
# quantile function) of the normal distribution
z_star = st.norm.ppf(q)

print(z_star)
## 1.959963984540054

This is why you will often see the endpoints of 95% confidence intervals described as being “1.96 standard errors away from the sample mean”:

\[\bar x \pm 1.96 \times \dfrac{\sigma}{\sqrt{n}}\]

In our worked example, we already have \(\bar x\), \(\sigma\) and \(n\) so we can calculate the confidence interval immediately:

ci = x_bar + np.array(st.norm.interval(C)) * σ / np.sqrt(n)

print(f'We are 95% sure that the true mean lies between {ci[0]:4.1f} and {ci[1]:5.1f}')
## We are 95% sure that the true mean lies between 97.6 and 101.2

In the more correct format:

print(f'Mean = {x_bar:.1f}, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}]')
## Mean = 99.4, 95% CI [97.6, 101.2]

Interval Width

In the above example with a ‘large’ (≥30) sample size, the endpoints of the confidence interval were calculated using the normal distribution and the estimated population standard deviation. If we instead use the t-distribution and the sample standard deviation (ie we use the formula for small sample sizes) we get:

# Degrees of freedom
dof = n - 1
# Confidence interval
ci = x_bar + np.array(st.t.interval(C, dof)) * s / np.sqrt(n)

print(f'Mean = {x_bar:.1f}, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}]')
## Mean = 99.4, 95% CI [97.5, 101.3]

This confidence interval is very slightly wider than when we used the formula for large sample sizes:

ci_z = x_bar + np.array(st.norm.interval(C)) * σ / np.sqrt(n)
ci_width_z = ci_z[1] - ci_z[0]
ci_t = x_bar + np.array(st.t.interval(C, dof)) * s / np.sqrt(n)
ci_width_t = ci_t[1] - ci_t[0]

print(f'Confidence interval widths: {ci_width_z:4.2f} vs {ci_width_t:4.2f}')
## Confidence interval widths: 3.61 vs 3.83

Presented visually:

import matplotlib as mpl

# Re-do the formatting options
mpl.rcParams.update(mpl.rcParamsDefault)
A = 6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A) * 0.3])
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')

# Create plot
ax = plt.axes()
y = np.ones(len(x)) + np.random.uniform(-0.2, 0.2, size=len(x))
ax.scatter(x, y, s=10)
ax.set_title(r'95\% Confidence Intervals for the True Mean')
# Vertical lines
ax.axvline(100, c='k', ls='--')
ax.vlines(ci_t[0], 1, 2, colors='g', linestyles='dashed')
ax.vlines(ci_t[1], 1, 2, colors='g', linestyles='dashed')
ax.vlines(ci_z[0], 0, 1, colors='b', linestyles='dashed')
ax.vlines(ci_z[1], 0, 1, colors='b', linestyles='dashed')
# Arrows
kwargs = {'head_width': 0.15, 'color': 'g', 'length_includes_head': True}
ax.arrow(ci_t[0], 1.5, ci_t[1] - ci_t[0], 0, **kwargs)
ax.arrow(ci_t[1], 1.5, ci_t[0] - ci_t[1], 0, **kwargs)
kwargs = {'head_width': 0.15, 'color': 'b', 'length_includes_head': True}
ax.arrow(ci_z[0], 0.5, ci_z[1] - ci_z[0], 0, **kwargs)
ax.arrow(ci_z[1], 0.5, ci_z[0] - ci_z[1], 0, **kwargs)
# Remove axes and add arrows on x-axis
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.arrow(100, 0, 16, 0, head_width=0.2, color='k', clip_on=False)
ax.arrow(100, 0, -11.5, 0, head_width=0.2, color='k', clip_on=False)
# Axes' settings
ax.set_ylim(0, 2)
ax.set_xlim(88.5, 116)
ax.tick_params(axis='y', left=False, labelleft=False)
# Legend
ax.plot(0, 0, 'g', label='Using $s$')
ax.plot(0, 0, 'b', label=r'Using $\sigma$')
ax.legend()
# Finish
plt.subplots_adjust(left=0.1, bottom=0.2, right=0.9, top=0.8)
plt.show()

Notice that the true mean (100) lies inside both of the confidence intervals.

This difference in the widths of the confidence intervals is about 6% of their mean:

diff = ci_width_t - ci_width_z
mean = (ci_width_t + ci_width_z) / 2

print(f'Difference relative to the mean: {diff / mean * 100:4.2f}%')
## Difference relative to the mean: 5.95%

This exact figure won’t always be the same; it just happens to be the value for this particular random sample of 30 numbers drawn from a normal distribution with a mean of 100 and a standard deviation of 5. However, a confidence interval calculated using the t-statistic and the t-distribution (‘the small sample formula’) will always be wider than one calculated using the z-score and the normal distribution (‘the large sample formula’). As already mentioned, this makes sense: for smaller samples the margins of error are larger and so the associated statistics are less precise. Said another way: if you have more information (eg an estimate of the true standard deviation as opposed to just the sample standard deviation) you would expect a more precise estimate of the true mean.

Here’s a demonstration of the above: for each sample size from 2 to 40, confidence intervals for the mean are calculated using both the ‘small sample’ and the ‘large sample’ method and the difference between the widths of these intervals (as a percentage of the mean of the widths) is plotted:

# Initialise lists that will be plotted
sample_sizes = []
ci_differences = []

# Iterate over various sample sizes
for n in range(2, 41):
    # Create fake data
    mean = 100
    standard_deviation = 5
    sample_size = n
    x = np.random.normal(mean, standard_deviation, sample_size)

    # Sample size
    n = len(x)
    # Sample mean
    x_bar = np.mean(x)

    # Confidence level
    C = 0.95  # 95%
    # Significance level, α
    alpha = 1 - C
    # Number of tails
    tails = 2
    # Quantile (the cumulative probability)
    q = 1 - (alpha / tails)

    # Degrees of freedom
    dof = n - 1
    # Critical t-statistic, calculated using the percent-point function (aka
    # the quantile function) of the t-distribution
    t_star = st.t.ppf(q, dof)
    # Sample standard deviation
    s = np.std(x, ddof=1)  # Use ddof=1 to get the sample standard deviation
    # Confidence interval
    ci_upper = x_bar + t_star * s / np.sqrt(n)
    ci_lower = x_bar - t_star * s / np.sqrt(n)
    ci_width_t = ci_upper - ci_lower

    # Critical z-score, calculated using the percent-point function (aka the
    # quantile function) of the normal distribution
    z_star = st.norm.ppf(q)
    # Population standard deviation
    σ = np.std(x, ddof=0)  # Use ddof=0 to estimate the population SD
    # Confidence interval
    ci_upper = x_bar + z_star * σ / np.sqrt(n)
    ci_lower = x_bar - z_star * σ / np.sqrt(n)
    ci_width_z = ci_upper - ci_lower

    # Calculate the difference in the widths of the confidence intervals
    diff = ci_width_t - ci_width_z
    mean = (ci_width_t + ci_width_z) / 2
    # print(n, diff, diff / mean * 100)
    sample_sizes.append(n)
    ci_differences.append(diff / mean * 100)

# Re-do the formatting options
mpl.rcParams.update(mpl.rcParamsDefault)
A = 6  # Want figure to be A6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')

# Create plot
ax = plt.axes()
ax.axhline(6, c='0.7', ls='--')
ax.vlines(30, 0, 6, colors='0.7', ls='--')
ax.scatter(sample_sizes, ci_differences, c='k', s=5)
# Labels
ax.set_title('The Difference Between the Methods Decreases as $n$ Increases')
ax.set_ylabel(r"\% Difference Between Confidence Intervals' Widths")
ax.set_xlabel('Sample Size, $n$')
# Axes' settings
ax.set_ylim(0, max(ci_differences) * 1.05)
ax.set_xlim(0, max(sample_sizes))
# Text
s = 'At around $n = 30$, the\ndifference drops to below 6\\%'
ax.text(30, 15, s, ha='center')

plt.show()

In other words, for sample sizes below 30 it’s important not to use the normal distribution when calculating confidence intervals because the central limit theorem cannot be relied upon at these sample sizes, and this makes a big difference to the result. For sample sizes of 30 or greater the two methods will yield similar results, although the intervals calculated using the normal distribution will always be slightly tighter than those calculated using the t-distribution. As the central limit theorem holds for these sample sizes, the normal-distribution method is the one that should be used.
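
Pulling these rules together, here’s a minimal helper sketch; the function name and the hard cut-off at 30 samples are choices made for this illustration rather than anything from a standard library:

def ci_of_mean(data, C=0.95):
    """Confidence interval for the mean, using the rules described above."""
    data = np.asarray(data)
    n = len(data)
    x_bar = np.mean(data)
    # Quantile (the cumulative probability) for a two-tailed interval
    q = 1 - (1 - C) / 2
    if n < 30:
        # Small sample: t-distribution with the sample standard deviation
        s = np.std(data, ddof=1)
        margin = st.t.ppf(q, n - 1) * s / np.sqrt(n)
    else:
        # Large sample: normal distribution with the population standard deviation
        sigma = np.std(data, ddof=0)
        margin = st.norm.ppf(q) * sigma / np.sqrt(n)
    return x_bar - margin, x_bar + margin

# eg: lower, upper = ci_of_mean(x)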

Sample Size Calculations

If you want to set up an experiment, how many samples should you include?

Let’s imagine you are running an experiment with two groups - a control group and an experimental group - and take one measurement from each participant in both groups. This will allow you to calculate a mean, and a confidence interval for that mean, for that variable for each group. Of course, you don’t know beforehand whether there is a true difference between the two groups, but if there is a difference you want to be able to detect it (and if there isn’t one you want to be confident in your failure to detect it). As such, having confidence intervals that are too wide could be a problem: if they overlap we can’t be sure whether the means are different, or even which one is larger! Thus we want to ensure that, if the means of our groups are indeed different, our confidence intervals will be smaller than the difference between them. We can’t control beforehand what the sample standard deviation (\(s\)) or the difference between the means will turn out to be, but what we can do is estimate what they might be (and decide what a meaningful difference between the means, \(\delta\), would actually be) and plug those into the formula for the width of a confidence interval:

\[t^* \times\dfrac{s}{\sqrt{n}} < \delta\]

In other words, we want the width of the confidence interval to be smaller than the smallest meaningful difference between the means. This can be re-arranged to be in terms of the sample size (\(n\)):

\[n > \left( t^* \times\frac{s}{\delta}\right)^2\]

The above formula can be used as a guide when choosing the sample size for an experiment.
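
As a worked sketch of using this formula (the anticipated standard deviation and the smallest meaningful difference below are invented numbers purely for illustration), note that \(t^*\) itself depends on \(n\) through the degrees of freedom, so the smallest \(n\) that satisfies the inequality can be found by searching:

# Anticipated sample standard deviation (an assumed value for this example)
s = 5
# Smallest meaningful difference between the means (also an assumed value)
delta = 3
# Confidence level
C = 0.95
# Quantile (the cumulative probability) for a two-tailed interval
q = 1 - (1 - C) / 2

# t* depends on n (via the degrees of freedom), so search for the smallest n
# that satisfies n > (t* * s / delta)^2
n = 2
while n <= (st.t.ppf(q, n - 1) * s / delta)**2:
    n += 1

print(f'Required sample size: at least {n}')
## Required sample size: at least 14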

