The unpaired two-sample t-test is one of Student’s t-tests that can be used to compare the means of two groups of numerical data. These groups need to be independent of each other - a sample in one group must not be paired with or affect the value of any sample in the other group.
In the above plot the black dashed lines represent the mean values of two groups of independent data. Is the mean of Group 0 significantly different from that of Group 1?
The code on this page uses the NumPy, SciPy, scikit-learn, Matplotlib and Seaborn packages. These can be installed from the terminal:
# `python3.11` corresponds to the version of Python you have installed and are using
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install scikit-learn
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
Once finished, import these packages into your Python script as follows:
import numpy as np
from scipy import stats as st
from sklearn import datasets
from matplotlib import pyplot as plt
from matplotlib import lines
import seaborn as sns
This page will use the Breast Cancer Wisconsin dataset, a toy dataset from scikit-learn.
# Load the dataset
breast_cancer = datasets.load_breast_cancer(as_frame=True)
# Extract the feature and target data together
df = breast_cancer['frame']
# Clean the raw data
df['target'] = df['target'].apply(lambda x: breast_cancer['target_names'][x])
This dataset contains numerical results obtained from 569 samples of breast cancer tumours that were either benign or malignant. For this example we are only going to use the mean smoothness values of the tumours (the mean local variations in radius, which we will simply call smoothness) and the target variable (the disease classification):
# Extract the data
cols = ['mean smoothness', 'target']
df = df[cols]
# Rename
df = df.rename(columns={'mean smoothness': 'smoothness'})
print(df.tail())
##      smoothness     target
## 564     0.11100  malignant
## 565     0.09780  malignant
## 566     0.08455  malignant
## 567     0.11780  malignant
## 568     0.05263     benign
Re-plotting the graph from above with the proper labels added in:
# Plot
ax = plt.axes()
sns.boxplot(
   df, x='target', y='smoothness', color='lightgrey', whis=[0, 100],
   showmeans=True, meanline=True, meanprops={'color': 'black'}
)
sns.stripplot(
   df, x='target', y='smoothness',
   color='lightgrey', edgecolor='black', linewidth=1
)
ax.set_title('Breast Cancer Wisconsin Dataset')
ax.set_ylabel('Tumour Smoothness')
ax.set_ylim([0, 0.17])
ax.set_xlabel('')
ax.set_xticklabels(['Malignant', 'Benign'])
handles = [lines.Line2D([0], [0], color='k', linewidth=1, linestyle='--')]
ax.legend(handles, ['Group Means'], loc='lower left')
plt.show()
Looking at the plotted data above leads to a question: is the average smoothness of a tumour different if it is malignant as opposed to if it is benign? If this were to be true it might be medically useful, and it may even help in the early detection of cancer. Wording this as a research question gives us:
Are the smoothness values of malignant tumours the same as those of benign tumours?
There are a number of things that this question does not specify: are we interested in comparing the means, medians or distributions of the two groups? Are we interested in any difference between them or specifically if one group is larger than the other? Or are we interested in if they are equivalent (which is different to them being the same)? We need to take a closer look at the data before we can fully formulate our research question as a hypothesis, but this can be considered a first draft of that hypothesis.
A number of questions need to be addressed before we can choose a test:
First, we need to decide if we should use parametric or non-parametric statistics to address this research question. Parametric statistics are based on assumptions about the data’s distribution while non-parametric statistics are not. You should choose parametric statistics if:
Let’s check if we have sufficiently large groups:
# Sample sizes
n = df['target'].value_counts()
print(n)
## target
## benign       357
## malignant    212
## Name: count, dtype: int64
For the malignant group we have \(n = 212\) and for the benign group it is \(n = 357\), so in both cases we have \(n > 15\). Eye-balling the boxplots shows that the groups are not particularly skewed (the mean and median lines are close together) and that there are not too many outliers. We could, if we wanted to be more rigorous, actually test for normality (see here), skewness and outliers but for the purpose of this example we already have enough information to conclude that a parametric test is appropriate.
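For readers who do want the more rigorous check, here is a minimal sketch of one way it might be done. It assumes the Shapiro-Wilk test (`st.shapiro()`) and the sample skewness (`st.skew()`) are acceptable measures; the helper name `describe_shape()` and the synthetic data are illustrative and are not part of the tumour dataset. Bear in mind that with several hundred samples a normality test can flag departures that are too small to matter in practice.

```python
import numpy as np
from scipy import stats as st


def describe_shape(values):
    """Return the Shapiro-Wilk p-value and the sample skewness of one group."""
    values = np.asarray(values, dtype=float)
    # Shapiro-Wilk tests the null hypothesis that the data are normally distributed
    _, p_normal = st.shapiro(values)
    # Skewness is close to 0 for symmetric data
    skewness = st.skew(values)
    return p_normal, skewness


# Synthetic stand-in data (an assumption for illustration, NOT the tumour data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.1, scale=0.013, size=200)

p_normal, skewness = describe_shape(sample)
print(f'Shapiro-Wilk p = {p_normal:.3f}, skewness = {skewness:.3f}')
```

With the real data, the same function could be called once per group, for example on `df.loc[df['target'] == 'benign', 'smoothness']`.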
The spread of the data should be roughly equal for both groups. Despite being called homogeneity of variance you can actually look at either variance or standard deviation to assess this. This is because the ratio of variances is - by definition - the square of the ratio of standard deviations and the square is a monotonic function, so order is preserved (for positive numbers).
With that said, here are some rules-of-thumb as to what standard deviations can be considered ‘roughly equal’:
where \(s_0\), \(s_1\) are the sample standard deviations for the two groups. Let’s check our data:
# Homogeneity of variance (using sample standard deviations)
s = df.groupby('target')['smoothness'].std(ddof=1)
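# groupby() sorts the groups alphabetically, so index by label rather than by position
ratio = s['benign'] / s['malignant']
print(f'Ratio of standard deviations: {ratio:.2f}')
## Ratio of standard deviations: 1.07
This is between 0.9 and 1.1, so we can conclude that the variances are ‘equal enough’.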
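As a complement to the rule-of-thumb, the sketch below illustrates two things on synthetic data: that the ratio of variances really is the square of the ratio of standard deviations, and how a formal test of equal variances could be run. Levene's test (`st.levene()`) is used here as one possible choice; it is not something this page's method depends on, and the group names and values are made up for illustration.

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-in groups (an assumption for illustration, NOT the tumour data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.09, scale=0.013, size=300)
group_b = rng.normal(loc=0.10, scale=0.013, size=200)

# Ratio of sample standard deviations and ratio of sample variances
sd_ratio = np.std(group_a, ddof=1) / np.std(group_b, ddof=1)
var_ratio = np.var(group_a, ddof=1) / np.var(group_b, ddof=1)
print(f'SD ratio: {sd_ratio:.3f}, variance ratio: {var_ratio:.3f}')

# Levene's test: the null hypothesis is that the groups have equal variances
statistic, p_levene = st.levene(group_a, group_b)
print(f"Levene's test: W = {statistic:.3f}, p = {p_levene:.3f}")
```

A large p-value from Levene's test means there is no evidence against equal variances; it does not prove that they are equal.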
We have enough insight into our research question (our ‘draft hypothesis’) and our data to be able to now follow a statistical test selection process:
Looking at the flowchart below tells us that we should thus be using the unpaired two-sample t-test (or Welch’s t-test, but that is used when the variances are unequal and we have established that they are close enough to be considered equal):
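We can now formulate the research question properly as a null hypothesis with a corresponding alternate hypothesis:
\(H_0\): the true mean smoothness is the same for both groups of tumours
\(H_1\): the true mean smoothness is not the same for both groups of tumours
More mathematically:
\(H_0: \mu_0 = \mu_1\)
\(H_1: \mu_0 \neq \mu_1\)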
where \(\mu_0\), \(\mu_1\) are the true means - of all tumours in the population, not just in the sample - for the two groups. Notice that this is a two-tailed test: we’re interested in finding any difference if one exists, not just a specific difference (\(\mu_0 > \mu_1\) or \(\mu_0 < \mu_1\)). Note also that if our research question had instead been “are the true mean smoothness values equivalent?” or “are the true mean smoothness values the same to within a certain tolerance?” then the hypotheses above would not be the correct formulations. Examples that have research questions like these are over here.
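To make the difference between a two-tailed and a one-tailed test concrete, here is a small sketch using the `alternative` parameter of `st.ttest_ind()` (available in SciPy 1.6 and later). The two groups are synthetic stand-ins, not the tumour data, and the names `group_0` and `group_1` are chosen only to mirror the notation above.

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-in groups (an assumption for illustration, NOT the tumour data)
rng = np.random.default_rng(2)
group_0 = rng.normal(loc=0.103, scale=0.013, size=100)
group_1 = rng.normal(loc=0.092, scale=0.013, size=100)

# Two-tailed: H1 is mu_0 != mu_1
two_sided = st.ttest_ind(group_0, group_1, alternative='two-sided')
# One-tailed: H1 is mu_0 > mu_1
greater = st.ttest_ind(group_0, group_1, alternative='greater')
# One-tailed: H1 is mu_0 < mu_1
less = st.ttest_ind(group_0, group_1, alternative='less')

print(f'two-sided p = {two_sided.pvalue:.2e}')
print(f'greater   p = {greater.pvalue:.2e}')
print(f'less      p = {less.pvalue:.2e}')
```

All three tests share the same t-statistic; only the p-value changes. The two one-tailed p-values sum to 1, and the two-tailed p-value is twice the smaller of them.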
We’ve already looked at the sample sizes, the standard deviations and the homogeneity of variance but we can also look at the following to get a better understanding of our data:
Sample means (\(\bar{x}_0\), \(\bar{x}_1\)):
# Sample means
x_bar = df.groupby('target')['smoothness'].mean()
print(x_bar)
## target
## benign       0.092478
## malignant    0.102898
## Name: smoothness, dtype: float64
Difference between the sample means (\(\bar{x}_0 - \bar{x}_1\)):
# Difference between the sample means
diff_btwn_means = x_bar.diff()
# diff() leaves NaN in the first row; the remaining row is malignant minus benign
diff_btwn_means = diff_btwn_means.dropna().iloc[0]
print(f'Difference between the means: {diff_btwn_means:.4f}')
## Difference between the means: 0.0104
Pooled estimate of the common standard deviation of the two samples (\(s_p\)) for equal or unequal sample sizes and similar variances:
# Degrees of freedom
dof = n['benign'] + n['malignant'] - 2
# Pooled standard deviation (indexing by label rather than by position)
s_p = np.sqrt(
    ((n['benign'] - 1) * s['benign']**2 + (n['malignant'] - 1) * s['malignant']**2) / dof
)
print(f'Pooled standard deviation: {s_p:.4f}')
## Pooled standard deviation: 0.0131
Standard error (SE) of the difference between the sample means (\(SE \left( \bar{x}_0 - \bar{x}_1 \right)\)):
# Standard error (SE) of the difference between sample means
se = s_p * np.sqrt((1 / n['benign']) + (1 / n['malignant']))
print(f'Standard error (SE) of the difference between sample means: {se:.5f}')
## Standard error (SE) of the difference between sample means: 0.00114
Confidence interval (CI) for the difference between the sample means:
# Confidence interval
upper = diff_btwn_means + 1.96 * se
lower = diff_btwn_means - 1.96 * se
print(f'Difference between means: {diff_btwn_means:.3f} ({lower:.3f} to {upper:.3f})')
## Difference between means: 0.010 (0.008 to 0.013)
These equations can be found on Wikipedia, Boston University’s site and Statology.
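The interval above uses the normal-distribution multiplier of 1.96, which is a good approximation when the groups are large. For smaller groups the critical value of the t-distribution is the more usual choice. The sketch below wraps the same pooled-variance formulas in a function and uses `st.t.ppf()` for the multiplier; the function name `pooled_ci()` and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy import stats as st


def pooled_ci(a, b, confidence=0.95):
    """Difference in means (a minus b) with a pooled-variance t-based CI."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Pooled standard deviation
    s_p = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof)
    # Standard error of the difference between the sample means
    se = s_p * np.sqrt(1 / n_a + 1 / n_b)
    diff = a.mean() - b.mean()
    # Two-tailed critical t-value for the requested confidence level
    t_star = st.t.ppf(1 - (1 - confidence) / 2, dof)
    return diff, diff - t_star * se, diff + t_star * se


# Synthetic stand-in groups (an assumption for illustration, NOT the tumour data)
rng = np.random.default_rng(3)
group_a = rng.normal(loc=0.103, scale=0.013, size=20)
group_b = rng.normal(loc=0.092, scale=0.013, size=25)

diff, lower, upper = pooled_ci(group_a, group_b)
print(f'Difference between means: {diff:.4f} ({lower:.4f} to {upper:.4f})')
```

The critical t-value is always larger than the corresponding z-value, so the t-based interval is slightly wider; the gap shrinks as the degrees of freedom grow.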
For the record, the value of 1.96 used in the confidence interval calculation comes from the fact that 95% of the values in a normal distribution lie within about this number of standard deviations of the mean (see here). The exact number can be calculated as follows:
# Confidence level
C = 0.95  # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical z-score, calculated using the percent-point function (aka the
# quantile function) of the normal distribution
z_star = st.norm.ppf(q)
print(f'95% of values lie within {z_star:.5f} standard deviations of the mean')
## 95% of values lie within 1.95996 standard deviations of the mean
After all that hard work above we are only one step away from calculating the t-statistic:
# t-statistic
t_statistic = diff_btwn_means / se
print(f't = {t_statistic:.3f}')
## t = 9.146
…and the corresponding p-value:
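# p-value (two-tailed), using the survival function sf() = 1 - cdf(), which
# keeps its precision when the cdf is extremely close to 1
p_value = st.t.sf(abs(t_statistic), dof) * 2
print(f'p = {p_value:.3f}')
## p = 0.000
Note that this p-value is not actually zero: it is so small that it rounds to 0.000 when shown to three decimal places.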
A better (and much simpler) option is to use SciPy’s ttest_ind() function:
# Separate out the samples
malignant = df.groupby('target').get_group('malignant')
benign = df.groupby('target').get_group('benign')
# Unpaired two-sample Student's t-test
t_statistic, p_value = st.ttest_ind(malignant['smoothness'], benign['smoothness'])
print(f'Two-sample t-test: t = {t_statistic:5.3f}, p = {p_value:.2e}')
## Two-sample t-test: t = 9.146, p = 1.05e-18
This p-value is (very!) small - much less than 0.05 - which means that we can reject \(H_0\) and conclude that there is a statistically significant difference between the smoothness of the two tumour types (p < 0.05), with the mean smoothness value being 0.0104 units, 95% CI [0.0082, 0.0127], greater on average in malignant tumours than in benign tumours.
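As a cross-check, the manual calculation and `ttest_ind()` should give the same answer. The sketch below gathers the manual steps into one function (the name `pooled_t_test()` is hypothetical) and compares it against SciPy on synthetic stand-in data rather than the tumour measurements.

```python
import numpy as np
from scipy import stats as st


def pooled_t_test(a, b):
    """Unpaired two-sample Student's t-test (equal variances), two-tailed."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Pooled standard deviation
    s_p = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof)
    # Standard error of the difference between the sample means
    se = s_p * np.sqrt(1 / n_a + 1 / n_b)
    t_statistic = (a.mean() - b.mean()) / se
    # Two-tailed p-value from the survival function of the t-distribution
    p_value = 2 * st.t.sf(abs(t_statistic), dof)
    return t_statistic, p_value


# Synthetic stand-in groups (an assumption for illustration, NOT the tumour data)
rng = np.random.default_rng(4)
group_a = rng.normal(loc=0.103, scale=0.013, size=212)
group_b = rng.normal(loc=0.092, scale=0.013, size=357)

t_manual, p_manual = pooled_t_test(group_a, group_b)
t_scipy, p_scipy = st.ttest_ind(group_a, group_b)
print(f'Manual: t = {t_manual:.3f}, p = {p_manual:.2e}')
print(f'SciPy:  t = {t_scipy:.3f}, p = {p_scipy:.2e}')
```

Taking the absolute value of the t-statistic means the function gives the correct two-tailed p-value whichever group is passed first.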
Often, you will see this type of result reported using asterisks to indicate the significance level (α) associated with it. Additionally, if the p-value is very small it’s usually just reported as “<0.001” rather than as an exact value. Here are functions to add in this formatting:
def get_significance(p):
    """Get the significance of a p-value as a string of stars."""
    if p <= 0.001:
        return '***'
    elif p <= 0.01:
        return '**'
    elif p <= 0.05:
        return '*'
    elif p <= 0.1:
        return '.'
    else:
        return ''
def round_p_value(p):
    """Round a small p-value so that it is human-readable."""
    if p < 0.001:
        return '<0.001'
    else:
        return f'{p:.3f}'
p_rounded = round_p_value(p_value)
significance = get_significance(p_value)
print(f'The p-value is {p_rounded} ({significance})')
## The p-value is <0.001 (***)
If you’re interested in adding significance bars to boxplots after having used the unpaired two-sample t-test to find a significant difference, look over here.
As already mentioned, the standard unpaired two-sample Student’s t-test makes the assumption that the variances of your two groups are equal. This is why we checked that the two standard deviations are ‘equal enough’. If they are not sufficiently similar in value we can perform Welch’s t-test which is like Student’s t-test except it does not make this equal variance assumption.
The ttest_ind() function has a parameter equal_var which allows you to choose if this assumption is made or not, and hence which test is performed. It is set to True by default and hence a Student’s t-test is run, but the ability to set equal_var=False means you have the option to perform Welch’s t-test if it is more suitable.
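Here is a short sketch of the two options side by side. The data are synthetic and deliberately given very different spreads and sample sizes (an assumption made purely for illustration) so that the two tests disagree noticeably. The manual calculation of Welch's statistic is included only to show that it uses the unpooled standard error.

```python
import numpy as np
from scipy import stats as st

# Synthetic groups with clearly unequal variances (NOT the tumour data)
rng = np.random.default_rng(5)
narrow = rng.normal(loc=0.0, scale=1.0, size=30)
wide = rng.normal(loc=0.5, scale=4.0, size=120)

# Student's t-test: assumes equal variances (the default)
student = st.ttest_ind(narrow, wide, equal_var=True)
# Welch's t-test: does not assume equal variances
welch = st.ttest_ind(narrow, wide, equal_var=False)
print(f"Student's: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch's:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")

# Welch's statistic uses each group's own variance instead of a pooled one
se_welch = np.sqrt(narrow.var(ddof=1) / len(narrow) + wide.var(ddof=1) / len(wide))
t_welch_manual = (narrow.mean() - wide.mean()) / se_welch
print(f'Manual Welch t = {t_welch_manual:.3f}')
```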
Another type of test is the “two one-sided t-test” (TOST) which involves, as the name suggests, performing two one-sided unpaired two-sample t-tests. While this is a hypothesis test, its purpose is to test equivalence: it asks “do the means of two populations differ by less than a certain amount?” as opposed to “are the means the same?”. For this reason it is discussed on its own page in the ‘agreement’ section.
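To give a flavour of the idea, here is a minimal sketch of a TOST built from two calls to `st.ttest_ind()` with the `alternative` parameter (SciPy 1.6 or later). The function name `tost_ind()`, the symmetric equivalence margin and the synthetic data are all assumptions made for illustration; this is not necessarily how the dedicated page implements it.

```python
import numpy as np
from scipy import stats as st


def tost_ind(a, b, margin):
    """Two one-sided t-tests for equivalence within +/- margin (equal variances)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Test 1: H0 is mean(a) - mean(b) <= -margin, H1 is that it is greater
    p_lower = st.ttest_ind(a + margin, b, alternative='greater').pvalue
    # Test 2: H0 is mean(a) - mean(b) >= +margin, H1 is that it is less
    p_upper = st.ttest_ind(a - margin, b, alternative='less').pvalue
    # Equivalence is only claimed if BOTH tests reject, so report the larger p
    return max(p_lower, p_upper)


# Synthetic groups drawn from the same distribution (NOT the tumour data)
rng = np.random.default_rng(6)
group_a = rng.normal(loc=0.0, scale=1.0, size=500)
group_b = rng.normal(loc=0.0, scale=1.0, size=500)

p_wide = tost_ind(group_a, group_b, margin=0.5)
p_tight = tost_ind(group_a, group_b, margin=0.001)
print(f'Margin 0.5:   p = {p_wide:.2e}')
print(f'Margin 0.001: p = {p_tight:.3f}')
```

A small p-value here supports equivalence to within the chosen margin, which is the reverse of the usual t-test logic. The margin must be chosen on subject-matter grounds before looking at the data.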