⇦ Back

If quadratic regression involves an alternative model of degree 2 and linear regression involves an alternative model of degree 1, then a statistical test that involves an alternative model of degree 0 should be called constant regression.

Constant regression is mathematically equivalent to:

Before we take a closer look, let’s get some data to play with:

1 Example Data

Use the Longley dataset, a 1967 set of US macroeconomic variables that is in the public domain. This is one of the datasets built into the R programming language that can be imported into Python via the Statsmodels package by using get_rdataset() as follows:

import statsmodels.api as sm

# Longley's Economic Regression Data
longley = sm.datasets.get_rdataset('longley')

We’ll use the employment data, specifically the year-on-year change in employment:

# Extract the data
longley = longley.data
# Subset
df = longley['Employed']
# Take the differences
df = df.diff().dropna()

print(df)
## 1948    0.799
## 1949   -0.951
## 1950    1.016
## 1951    2.034
## 1952    0.418
## 1953    1.350
## 1954   -1.228
## 1955    2.258
## 1956    1.838
## 1957    0.312
## 1958   -1.656
## 1959    2.142
## 1960    0.909
## 1961   -0.233
## 1962    1.220
## Name: Employed, dtype: float64

2 Exploring the Data

Use graphs and descriptive statistics to make an initial decision as to what summary model to use:

import matplotlib.pyplot as plt

# Make figures A6 in size
A = 6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
# Image quality
plt.rc('figure', dpi=141)
# Be able to add Latex
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')

# Plot
plt.scatter(df.index, df.values, c='k', s=10)
plt.axhline(0, c='k', lw=0.5)
plt.title('Longley Dataset')
plt.xlabel('Year')
plt.ylabel(r'Change in Employment [\%]')

Descriptive statistics:

import numpy as np

# Sample size
n = df.shape[0]
# Mean
y_bar = df.values.mean()
# Sample standard deviation
s = df.values.std(ddof=1)
# Standard error of the mean
sem = s / np.sqrt(n)

print(f'n = {n}; mean = {y_bar:.3f}%, std dev = {s:.2f}, std error of the mean = {sem:.3f}')
## n = 15; mean = 0.682%, std dev = 1.24, std error of the mean = 0.320

Calculate 95% confidence intervals:

from scipy import stats

# Significance level
alpha = 0.05
# Degrees of freedom
dof = n - 1
# Two-tailed test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - (alpha / tails), dof)
# Intervals
upper_ci = y_bar + t_critical * sem
lower_ci = y_bar - t_critical * sem

print(f'The sample mean is {y_bar:.3f}%, 95% CI [{lower_ci:.3f}, {upper_ci:.3f}]')
## The sample mean is 0.682%, 95% CI [-0.005, 1.368]

You can be 95% confident that the population mean falls between -0.005% and 1.368%.

3 Defining the Model

Having a look at the data, it seems that a constant regression model (a flat line) might be more accurate than a linear regression model (a sloped line). Our null model is thus that the true change in employment over time is 0 (and any measured variation is just noise) whereas the alternative model is that the true change in employment over time is equal to the average (mean) change in employment:

  • Null model: y = 0 + ε
  • Alternative model: y = μ + ε

where ‘y’ is the dependent variable (change in employment), ε is the unexplained residual (‘noise’) and μ is the mean change in employment. Effectively, the sample mean is the alternative model. The associated hypotheses are:

  • Null hypothesis, H₀: μ = 0
  • Alternative hypothesis, H₁: μ ≠ 0

Here’s what they look like graphically:

3.1 Null Model

Our null model is that there is no true change in employment over time: change in employment = unexplained residual (‘noise’) ie y = ε:

plt.scatter(df.index, df.values, c='k', s=10)
plt.axhline(0, c='lightblue', ls='--', lw=2)
plt.title('Null Model')
plt.xlabel('Year')
plt.ylabel(r'Change in Employment [\%]')

3.2 Alternative Model

Our alternative model is that there is a true change in employment over time: change in employment = mean change in employment + unexplained residual (‘noise’) ie y = μ + ε:

plt.scatter(df.index, df.values, c='k', s=10)
plt.axhline(0, c='k', lw=0.5)
plt.axhline(df.values.mean(), c='lightblue', ls='--', lw=2)
plt.title('Alternative Model')
plt.xlabel('Year')
plt.ylabel(r'Change in Employment [\%]')

4 Fitting the Model

4.1 Tests of Between-Subject Effects

For this, we can use the Pingouin package:

import pingouin as pg

# Convert the data from a series to a data frame
df = df.to_frame()
# Add a column of constants
df['ones'] = [1] * n

# Perform a one-way ANOVA
aov = pg.anova(data=df, dv='Employed', between='ones', ss_type=3, detailed=True)
print(aov)
##    Source         SS  DF        MS  np2
## 0    ones   0.000000   0       NaN  0.0
## 1  Within  21.516912  14  1.536922  NaN

In the output data frame, the ‘Source’ column contains the factor names, ‘SS’ contains the sums of squares, ‘DF’ the degrees of freedom, ‘MS’ the mean squares and ‘np2’ the partial eta-square effect sizes. The first row (Source = ‘ones’) is the ‘corrected model’ and the second row (Source = ‘Within’) is the ‘error’.

The above can also be done with ordinary least-squares (ols) via the Statsmodels package:

import statsmodels.formula.api as smf

# Convert the data from a series to a data frame
df = df.to_frame()
# Add a column of constants
df['ones'] = [1] * n

# Perform a one-way ANOVA
model = smf.ols('Employed ~ ones', data=df)
results = model.fit()
aov = sm.stats.anova_lm(results, typ=2)

print(aov)
##              sum_sq    df         F    PR(>F)
## ones       6.974132   1.0  4.537726  0.051375
## Residual  21.516912  14.0       NaN       NaN

The first (unnamed) column is the source of between-subject effects/the factors: ‘ones’ is the intercept parameter (not the corrected model like it was in the Pingouin result above) and ‘Residual’ is the error (the same as ‘Within’ in the Pingouin result). ‘PR(>F)’ is the p-value from the statistical test of the null hypothesis. The sum of the rows is useful if you’re interested in the total between-subject effects:

print(aov.sum())
## sum_sq    28.491044
## df        15.000000
## F          4.537726
## PR(>F)     0.051375
## dtype: float64

4.2 Parameter Estimates

Estimate the parameters of the model and get the associated p-value, again using Statsmodels (although this time use the OLS() (uppercase letters) implementation of ordinary least-squares optimisation):

import statsmodels.api as sm

model = sm.OLS(df['Employed'], df['ones'])
results = model.fit()
intercept = results.params[0]
conf_int = results.conf_int()
conf_int = list(conf_int.values[0].round(3))

print(f'The intercept is {intercept:.3f}%, 95% CI {conf_int}')
## The intercept is 0.682%, 95% CI [-0.005, 1.368]

This is identical to the sample mean and its confidence interval as calculated earlier, which is expected because the horizontal line of the null model is the sample mean. Similarly, the standard error of the intercept is the same as the previously-calculated standard error of the mean:

se = results.bse[0]

print(f'Std error of the regression coefficient (the intercept) = {se:.3f}')
## Std error of the regression coefficient (the intercept) = 0.320

Finally, let’s look at the results of the statistical test:

t = results.tvalues[0]
p = results.pvalues[0]

print(f't-statistic = {t:.2f}, p-value = {p:.4f}')
## t-statistic = 2.13, p-value = 0.0514

This p-value is the same as the ‘PR(>F)’ value from the ANOVA above. It is greater than 0.05, which means that there is just not enough evidence to suggest that μ ≠ 0 (in other words, we fail to reject H₀ at the 0.05 significance level). It seems that the year-on-year change in employment is best modelled as background noise rather than as its mean value. Note that this conclusion is only valid for the domain under consideration: the years 1947-48 to 1961-62. Any extrapolations outside of this time frame might not be valid. Indeed, breaking any of the assumptions implicit in this model (eg using it to make conclusions about the employment rates in other places) would require it to be re-examined.

5 Constant Regression vs the One-Sample t-Test

As mentioned previously, constant regression is mathematically equivalent to the one-sample t-test. This is because they both have the same null and alternative hypotheses by default. To confirm this, let’s do a one-sample t-test with the same data:

# Hypothesis Test: One-Sample t-Test
# H₀: μ = 0
# H₁: μ ≠ 0
popmean = 0
statistic, pvalue = stats.ttest_1samp(df['Employed'], popmean)

print(f't-statistic = {statistic:.2f}, p-value = {pvalue:.4f}')
## t-statistic = 2.13, p-value = 0.0514

These are identical results to what we got previously.

mean_difference = y_bar - popmean

print(f'Mean difference: {mean_difference:.3f}%')
## Mean difference: 0.682%

We have already calculated the CI in an earlier step, but we can do it again:

# Sample size
n = df.shape[0]
# Significance level
alpha = 0.05
# Degrees of freedom
dof = n - 1
# Two-tailed test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - (alpha / tails), dof)
# Margin of error
d = t_critical * sem
# Confidence interval
upper_ci = mean_difference + d
lower_ci = mean_difference - d
# Adjusted confidence interval
upper_ci = upper_ci - popmean
lower_ci = lower_ci - popmean

print(f'95% confidence interval of the difference: {lower_ci:.3f} to {upper_ci:.3f}')
## 95% confidence interval of the difference: -0.005 to 1.368

⇦ Back