
The Statsmodels package provides datasets that can be used as example data in minimal working examples (MWEs) and ‘Hello World’ scripts that test functionality. These can be imported via the Statsmodels API and come as modules inside the datasets object. Here’s how to list all 28 of the available datasets:

from statsmodels import api as sm

# List all 28 of the statsmodels datasets
print('Name               Title')
print('----               -----')
for attribute in dir(sm.datasets):
    # If the attribute is a module
    if str(type(getattr(sm.datasets, attribute))) == "<class 'module'>":
        # The utils module is not a dataset
        if attribute == 'utils':
            continue
        title = getattr(sm.datasets, attribute).TITLE
        print(f'{attribute:18s} {title}')
## Name               Title
## ----               -----
## anes96             American National Election Survey 1996
## cancer             Breast Cancer Data
## ccard              Bill Greene's credit scoring data.
## china_smoking      Smoking and lung cancer in eight cities in China.
## co2                Mauna Loa Weekly Atmospheric CO2 Data
## committee          First 100 days of the US House of Representatives 1995
## copper             World Copper Market 1951-1975 Dataset
## cpunish            US Capital Punishment dataset.
## danish_data        Danish Money Demand Data
## elnino             El Nino - Sea Surface Temperatures
## engel              Engel (1857) food expenditure data
## fair               Affairs dataset
## fertility          World Bank Fertility Data
## grunfeld           Grunfeld (1950) Investment Data
## heart              Transplant Survival Data
## interest_inflation (West) German interest and inflation rate 1972-1998
## longley            Longley dataset
## macrodata          United States Macroeconomic data
## modechoice         Travel Mode Choice
## nile               Nile River flows at Ashwan 1871-1970
## randhie            RAND Health Insurance Experiment Data
## scotland           Taxation Powers Vote for the Scottish Parliament 1997
## spector            Spector and Mazzeo (1980) - Program Effectiveness Data
## stackloss          Stack loss data
## star98             Star98 Educational Dataset
## statecrime         Statewide Crime Data 2009
## strikes            U.S. Strike Duration Data
## sunspots           Yearly sunspots data 1700-2008

Some more packages that will be used on this page:

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model

The datasets usually have exog and endog attributes that hold the exogenous and endogenous variables, respectively. These are economics terms for what are essentially the independent and dependent variables, or the ‘features’ and the ‘target’ in machine learning terminology.

anes96

American National Election Survey 1996

Documentation: https://www.statsmodels.org/devel/datasets/generated/anes96.html

Exogenous Variables:

  • logpopul : log(popul + .1) where popul is census place population in 1000s
  • selfLR : Respondent’s self-reported political leanings from “Left” to “Right”
  • age : Age of respondent
  • educ : Education level of respondent
  • income : Income of household

Endogenous Variable:

  • PID : Party identification of respondent

Example Usage:

# Load the data
dataset = sm.datasets.anes96.load_pandas()
X = dataset.exog
y = dataset.endog

# Plot
plt.title('American National Election Survey 1996')
df = pd.concat([X['age'], y], axis=1)
# Sort the party identification codes so the boxes line up with the tick labels
boxes = [df.loc[df['PID'] == val, 'age'] for val in sorted(df['PID'].unique())]
plt.boxplot(boxes)
plt.xlabel('Party identification of respondent')
plt.ylabel('Age')
identification = [
    'Strong Democrat', 'Weak Democrat', 'Independent-Democrat',
    'Independent-Independent', 'Independent-Republican', 'Weak Republican',
    'Strong Republican'
]
plt.xticks(ticks=range(1, 8), labels=identification, rotation=15)
plt.tight_layout()
plt.show()

cancer

Breast Cancer Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/cancer.html

Exogenous Variable:

  • population : The population of the county

Endogenous Variable:

  • cancer : The number of breast cancer observances

Example Usage:

# Load the data
dataset = sm.datasets.cancer.load_pandas()
X = dataset.exog
y = dataset.endog

# Create a linear regression model
model = linear_model.LinearRegression()
# Train the model
model.fit(X.values, y.values)

# Use the model to make a prediction
X_fitted = np.array([np.min(X), np.max(X)]).reshape(-1, 1)
y_fitted = model.predict(X_fitted)

# Plot
fig = plt.figure()
plt.title('Breast Cancer Data')
plt.scatter(X, y, ec='k', fc='gray')
# Show the intercept with an explicit sign so the equation reads correctly
label = f'y = {model.coef_[0]:.3f}x {model.intercept_:+.3f}'
plt.plot(X_fitted, y_fitted, c='gray', label=label)
plt.xlabel('The population of the county')
plt.ylabel('The number of breast cancer observances')
plt.legend(frameon=False)
plt.tight_layout()
plt.show()

ccard

Bill Greene’s Credit Scoring Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/ccard.html

Exogenous Variables:

  • AGE : Age in years + 12ths of a year
  • INCOME : Income, divided by 10,000
  • INCOMESQ : INCOME squared
  • OWNRENT : Individual owns (1) or rents (0) home

Endogenous Variable:

  • AVGEXP : Avg monthly credit card expenditure

Example Usage:

# Load the data
dataset = sm.datasets.ccard.load_pandas()
X = dataset.exog
y = dataset.endog

# Plot
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("Bill Greene's Credit Scoring Data")
ax1.scatter(X['AGE'], y)
ax1.set_xlabel('Age in years + 12ths of a year')
ax1.set_ylabel('Avg monthly credit card expenditure')
ax2.scatter(X['INCOME'], y)
ax2.set_xlabel('Income, divided by 10,000')
ax2.set_ylabel('Avg monthly credit card expenditure')
ax3.scatter(X['INCOMESQ'], y)
ax3.set_xlabel('Income squared')
ax3.set_ylabel('Avg monthly credit card expenditure')
ax4.scatter(X['OWNRENT'], y)
ax4.set_xlabel('Individual owns (1) or rents (0) home')
ax4.set_ylabel('Avg monthly credit card expenditure')
plt.tight_layout()
plt.show()

china_smoking

Smoking and Lung Cancer in Eight Cities in China

Documentation: https://www.statsmodels.org/devel/datasets/generated/china_smoking.html

Variables:

  • Location : Name of the city
  • smoking : Yes or no, according to a person’s smoking behavior
  • cancer : Yes or no, according to a person’s lung cancer status

Example Usage:

# Load the data
dataset = sm.datasets.china_smoking.load_pandas()
# Create contingency tables
contingency_tables = {}
for i, row in dataset['raw_data'].iterrows():
    dct = {
        'Cancer': [row['smoking_yes_cancer_yes'], row['smoking_no_cancer_yes']],
        'Healthy': [row['smoking_yes_cancer_no'], row['smoking_no_cancer_no']],
    }
    df = pd.DataFrame(dct, index=['Smoker', 'Non-Smoker'])
    contingency_tables[row['Location']] = df
    print(row['Location'] + ':')
    print(df)
    print('')
## Beijing:
##             Cancer  Healthy
## Smoker         126      100
## Non-Smoker      35       61
## 
## Shanghai:
##             Cancer  Healthy
## Smoker         908      688
## Non-Smoker     497      807
## 
## Shenyang:
##             Cancer  Healthy
## Smoker         913      747
## Non-Smoker     336      598
## 
## Nanjng:
##             Cancer  Healthy
## Smoker         235      172
## Non-Smoker      58      121
## 
## Harbin:
##             Cancer  Healthy
## Smoker         402      308
## Non-Smoker     121      215
## 
## Zhengzhou:
##             Cancer  Healthy
## Smoker         182      156
## Non-Smoker      72       98
## 
## Taiyuan:
##             Cancer  Healthy
## Smoker          60       99
## Non-Smoker      11       43
## 
## Nanchang:
##             Cancer  Healthy
## Smoker         104       89
## Non-Smoker      21       36

co2

Mauna Loa Weekly Atmospheric CO2 Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/co2.html

Index:

  • date : Sample date in YYYY-MM-DD format. There is only one sample reported every 7 days as each represents a weekly average.

Column:

  • co2 : CO₂ concentration in ppmv (parts per million by volume)

Example Usage:

# Load the data
dataset = sm.datasets.co2.load_pandas()
# Plot
plt.figure()
plt.plot(dataset['data'])
plt.show()

committee

First 100 days of the US House of Representatives 1995

Documentation: https://www.statsmodels.org/devel/datasets/generated/committee.html

Exogenous Variables:

  • SIZE : Number of members on the committee
  • SUBS : Number of subcommittees
  • STAFF : Number of staff members assigned to the committee
  • PRESTIGE : Dummy variable indicating whether the committee is a high prestige committee (PRESTIGE == 1 is a high prestige committee)
  • BILLS103 : The number of bill assignments in the first 100 days of the US’s 103rd House of Representatives

Endogenous Variable:

  • BILLS104 : The number of bill assignments in the first 100 days of the US’s 104th House of Representatives over 20 Committees

Example Usage:

# Load the data
dataset = sm.datasets.committee.load_pandas()
# Plot
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('First 100 days of the US House of Representatives 1995')
ax1.scatter(dataset['exog']['SIZE'], dataset['endog'])
ax1.set_xlabel('Number of members on the committee')
ax1.set_ylabel('Bill assignments')
ax2.scatter(dataset['exog']['SUBS'], dataset['endog'])
ax2.set_xlabel('Number of subcommittees')
ax2.set_ylabel('Bill assignments')
ax3.scatter(dataset['exog']['STAFF'], dataset['endog'])
ax3.set_xlabel('Number of staff members assigned to the committee')
ax3.set_ylabel('Bill assignments')
ax4.scatter(dataset['exog']['BILLS103'], dataset['endog'])
ax4.set_xlabel('Bill assignments (103rd House of Representatives)')
ax4.set_ylabel('Bill assignments (104th House of Representatives)')
plt.tight_layout()
plt.show()

copper

World Copper Market 1951-1975 Dataset

Documentation: https://www.statsmodels.org/devel/datasets/generated/copper.html

Exogenous Variables:

  • COPPERPRICE : Constant dollar adjusted price of copper
  • INCOMEINDEX : An index of real per capita income (base 1970)
  • ALUMPRICE : The price of aluminium
  • INVENTORYINDEX : A measure of annual manufacturer inventory trend
  • TIME : A time trend

Endogenous Variable:

  • WORLDCONSUMPTION : World consumption of copper (in 1000 metric tons)

Example Usage:

# Load the data
dataset = sm.datasets.copper.load_pandas()
# Plot
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('World Copper Market 1951-1975 Dataset')
ax1.scatter(dataset['exog']['COPPERPRICE'], dataset['endog'])
ax1.set_xlabel('Constant dollar adjusted price of copper')
ax1.set_ylabel('World consumption of copper')
ax2.scatter(dataset['exog']['INCOMEINDEX'], dataset['endog'])
ax2.set_xlabel('An index of real per capita income (base 1970)')
ax2.set_ylabel('World consumption of copper')
ax3.scatter(dataset['exog']['ALUMPRICE'], dataset['endog'])
ax3.set_xlabel('The price of aluminium')
ax3.set_ylabel('World consumption of copper')
ax4.scatter(dataset['exog']['INVENTORYINDEX'], dataset['endog'])
ax4.set_xlabel('A measure of annual manufacturer inventory trend')
ax4.set_ylabel('World consumption of copper')
plt.tight_layout()
plt.show()

cpunish

US Capital Punishment Dataset

Documentation: https://www.statsmodels.org/devel/datasets/generated/cpunish.html

Exogenous Variables:

  • INCOME : Median per capita income in 1996 dollars
  • PERPOVERTY : Percent of the population classified as living in poverty
  • PERBLACK : Percent of black citizens in the population
  • VC100k96 : Rate of violent crimes per 100,000 residents for 1996
  • SOUTH : SOUTH == 1 indicates a state in the South
  • DEGREE : An estimate of the proportion of the state population with a college degree of some kind

Endogenous Variable:

  • EXECUTIONS : Executions in 1996

Example Usage:

# Load the data
dataset = sm.datasets.cpunish.load_pandas()
# Plot
fig, ax = plt.subplots(3, 2, figsize=(10, 10))
ax = ax.flatten()
fig.suptitle('US Capital Punishment Dataset')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Executions in 1996')
plt.tight_layout()
plt.show()

danish_data

Danish Money Demand Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/danish_data.html

Sample Size: 55

Variables:

  • lrm : Log real money
  • lry : Log real income
  • lpy : Log prices
  • ibo : Bond rate
  • ide : Deposit rate

Example Usage:

# Load the data
dataset = sm.datasets.danish_data.load_pandas()
data = dataset['data']
# Plot
fig = plt.figure(figsize=(10, 10))
shape = (3, 4)
ax1 = plt.subplot2grid(shape, (0, 0), colspan=2, fig=fig)
ax2 = plt.subplot2grid(shape, (0, 2), colspan=2, fig=fig)
ax3 = plt.subplot2grid(shape, (1, 0), colspan=2, fig=fig)
ax4 = plt.subplot2grid(shape, (1, 2), colspan=2, fig=fig)
ax5 = plt.subplot2grid(shape, (2, 1), colspan=2, fig=fig)
fig.suptitle('Danish Money Demand Data')
ax1.scatter(data.index, data['lrm'])
ax1.set_xlabel('Period')
ax1.set_ylabel('Log real money')
ax2.scatter(data.index, data['lry'])
ax2.set_xlabel('Period')
ax2.set_ylabel('Log real income')
ax3.scatter(data.index, data['lpy'])
ax3.set_xlabel('Period')
ax3.set_ylabel('Log prices')
ax4.scatter(data.index, data['ibo'])
ax4.set_xlabel('Period')
ax4.set_ylabel('Bond rate')
ax5.scatter(data.index, data['ide'])
ax5.set_xlabel('Period')
ax5.set_ylabel('Deposit rate')
plt.tight_layout()
plt.show()

elnino

El Nino - Sea Surface Temperatures

Documentation: https://www.statsmodels.org/devel/datasets/generated/elnino.html

Sample Size: 61 years x 12 months

Variable:

  • TEMPERATURE : Average sea surface temperature in degrees Celsius

Example Usage:

# Load the data
dataset = sm.datasets.elnino.load_pandas()
data = dataset['data']
# Convert wide to long
months = [
    'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
    'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'
]
df = data.melt(id_vars=['YEAR'], value_vars=months, var_name='MONTH')
# Combine year and month into a single 'DATE' column
df['YEAR'] = df['YEAR'].astype(int)
df['MONTH'] = df['MONTH'].apply(lambda x: months.index(x) + 1)
df['DATE'] = df['YEAR'].astype(str) + '-' + df['MONTH'].astype(str)
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y-%m')
# Sort the data
df = df.sort_values('DATE')
# Plot
fig, ax = plt.subplots()
ax.plot(df['DATE'], df['value'])
ax.set_title('El Nino - Sea Surface Temperatures')
ax.set_ylabel('Average sea surface temperature (°C)')
_ = ax.set_xlim([df['DATE'].min(), df['DATE'].max()])
plt.tight_layout()
plt.show()

engel

Engel (1857) Food Expenditure Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/engel.html

Sample Size: 235

Variables:

  • income : Annual household income (Belgian francs)
  • foodexp : Annual household food expenditure (Belgian francs)

Example Usage:

# Load the data
dataset = sm.datasets.engel.load_pandas()
data = dataset['data']
# Plot
plt.scatter(data['income'], data['foodexp'])
plt.title('Engel (1857) Food Expenditure Data')
plt.ylabel('Annual household food expenditure (Belgian francs)')
plt.xlabel('Annual household income (Belgian francs)')
plt.tight_layout()
plt.show()

fair

Affairs Dataset

Documentation: https://www.statsmodels.org/devel/datasets/generated/fair.html

Sample Size: 6366

Exogenous Variables:

  • rate_marriage : Rating of marriage
  • age : Age
  • yrs_married : Number of years married - interval approximations
  • children : Number of children
  • religious : How religious
  • educ : Level of education
  • occupation : Occupation
  • occupation_husb : Husband’s occupation

Endogenous Variable:

  • affairs : Measure of time spent in extramarital affairs

Example Usage:

# Load the data
dataset = sm.datasets.fair.load_pandas()
# Plot
fig, ax = plt.subplots(4, 2, figsize=(8, 12))
ax = ax.flatten()
fig.suptitle('Affairs Dataset')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Time in extramarital affairs')
plt.tight_layout()
plt.show()

fertility

World Bank Fertility Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/fertility.html

Sample Size: 219 countries/regions

Variable: The fertility rate for the given year

Example Usage:

# Load the data
dataset = sm.datasets.fertility.load_pandas()
# Convert from wide to long
years = list(range(1960, 2014))
years = [str(year) for year in years]
df = pd.melt(dataset['data'], id_vars=['Country Name'], value_vars=years)
# Sort
df = df.sort_values('variable')
# Plot
fig, ax = plt.subplots()
countries = [
    'Yemen, Rep.', 'Barbados', 'Tonga', 'Macao SAR, China', 'Niger', 'Mongolia'
]
for country in countries:
    subset = df[df['Country Name'] == country]
    plt.scatter(subset['variable'].astype(int), subset['value'], label=country)
plt.title('World Bank Fertility Data')
plt.ylabel('Fertility Rate')
plt.xlabel('Year')
plt.legend()
plt.tight_layout()
plt.show()

grunfeld

Grunfeld (1950) Investment Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/grunfeld.html

Sample Size: 220 (20 years for 11 firms)

Exogenous Variables:

  • value : Market value as of December 31 in 1947 dollars
  • capital : Stock of plant and equipment in 1947 dollars
  • firm : General Motors, US Steel, General Electric, Chrysler, Atlantic Refining, IBM, Union Oil, Westinghouse, Goodyear, Diamond Match, American Steel
  • year : 1935 - 1954

Endogenous Variable:

  • invest : Gross investment in 1947 dollars

Example Usage:

# Load the data
dataset = sm.datasets.grunfeld.load_pandas()
# Plot
fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax = ax.flatten()
fig.suptitle('Grunfeld (1950) Investment Data')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    if exog_name == 'firm':
        ax[i].set_xticks(range(11))
        ax[i].set_xticklabels(range(11))
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Gross Investment (1947 dollars)')
plt.tight_layout()
plt.show()

heart

Transplant Survival Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/heart.html

Sample Size: 69

Exogenous Variables:

  • censors : Indicates if an observation is censored; 1 is uncensored
  • age : Age at the time of surgery

Endogenous Variable:

  • survival : Days after surgery until death

Example Usage:

# Load the data
dataset = sm.datasets.heart.load_pandas()
df = dataset['data']
# Plot
fig, ax = plt.subplots()
plt.scatter(dataset['exog'], dataset['endog'])
plt.title('Transplant Survival Data')
plt.ylabel('Days after surgery until death')
plt.xlabel('Age at the time of surgery')
plt.tight_layout()
plt.show()

interest_inflation

(West) German interest and inflation rate 1972-1998

Documentation: https://www.statsmodels.org/devel/datasets/generated/interest_inflation.html

Sample Size: 107

Variables:

  • year : Q2 1972 - Q4 1998
  • quarter : 1-4
  • Dp : Delta log GDP deflator
  • R : Nominal long term interest rate

Example Usage:

# Load the data
dataset = sm.datasets.interest_inflation.load_pandas()
df = dataset['data']
# Re-format the date information
df['year'] = df['year'].astype(int)
df['quarter'] = df['quarter'].astype(int)
# Map each quarter (1-4) to the month it starts in (1, 4, 7, 10)
month = 3 * df['quarter'] - 2
df['date'] = df['year'].astype(str) + '-' + month.astype(str)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m')
df = df.sort_values('date')
# Plot
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.suptitle('(West) German interest and inflation rate 1972-1998')
ax[0].scatter(df['date'], df['Dp'])
ax[0].set_ylabel('Delta Log GDP Deflator')
ax[0].set_xlabel('Quarter')
ax[1].scatter(df['date'], df['R'])
ax[1].set_ylabel('Nominal Long Term Interest Rate')
ax[1].set_xlabel('Quarter')
plt.tight_layout()
plt.show()

longley

Longley Dataset

Documentation: https://www.statsmodels.org/devel/datasets/generated/longley.html

Sample Size: 16

Exogenous Variables:

  • GNPDEFL : GNP deflator
  • GNP : GNP
  • UNEMP : Number of unemployed
  • ARMED : Size of armed forces
  • POP : Population
  • YEAR : Year (1947 - 1962)

Endogenous Variable:

  • TOTEMP : Total employment

Example Usage:

# Load the data
dataset = sm.datasets.longley.load_pandas()
# Plot
fig, ax = plt.subplots(3, 2, figsize=(10, 10))
ax = ax.flatten()
fig.suptitle('Longley Dataset')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Total Employment')
plt.tight_layout()
plt.show()

macrodata

United States Macroeconomic Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/macrodata.html

Sample Size: 203

Variables:

  • year : 1959q1 - 2009q3
  • quarter : 1-4
  • realgdp : Real gross domestic product
  • realcons : Real personal consumption expenditures
  • realinv : Real gross private domestic investment
  • realgovt : Real federal consumption expenditures & gross investment
  • realdpi : Real private disposable income
  • cpi : End of the quarter consumer price index for all urban consumers: all items
  • m1 : End of the quarter M1 nominal money stock (Seasonally adjusted)
  • tbilrate : Quarterly monthly average of the monthly 3-month treasury bill
  • unemp : Seasonally adjusted unemployment rate (%)
  • pop : End of the quarter total population: all ages incl. armed forces overseas
  • infl : Inflation rate (ln(cpi_{t}/cpi_{t-1}) * 400)
  • realint : Real interest rate (tbilrate - infl)

Example Usage:

# Load the data
dataset = sm.datasets.macrodata.load_pandas()
# Clean
df = dataset['data']
df = df.drop('year', axis=1)
df = df.drop('quarter', axis=1)
# Plot
# matshow creates its own figure, so set the title afterwards
plt.matshow(df.corr())
plt.title('United States Macroeconomic Data')
plt.yticks(range(len(list(df))), list(df))
plt.xticks(range(len(list(df))), list(df), rotation=36)
plt.show()

modechoice

Travel Mode Choice

Documentation: https://www.statsmodels.org/devel/datasets/generated/modechoice.html

Sample Size: 210

Exogenous Variables:

  • ttme : Terminal waiting time for plane, train and bus (minutes); 0 for car
  • invc : In vehicle cost for all stages (dollars)
  • invt : Travel time (in-vehicle time) for all stages (minutes)
  • gc : Generalized cost measure: invc + (invt × value of travel time savings) (dollars)
  • hinc : Household income ($1000s)
  • psize : Traveling group size in mode chosen

Endogenous Variable:

  • choice : Yes (1) or no (0)

Example Usage:

# Load the data
dataset = sm.datasets.modechoice.load_pandas()
# Clean the data
df = dataset['data']
df['mode'] = df['mode'].replace({1: 'air', 2: 'train', 3: 'bus', 4: 'car'})
# Plot
fig, ax = plt.subplots(3, 2, figsize=(10, 10))
ax = ax.flatten()
fig.suptitle('Travel Mode Choice')
for i, exog_name in enumerate(dataset['exog_name']):
    for mode in df['mode'].unique():
        subset = df[df['mode'] == mode]
        ax[i].scatter(subset[exog_name], subset['choice'], label=mode)
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Choice')
    ax[i].set_yticks([0, 1])
    ax[i].set_yticklabels(['No', 'Yes'])
    ax[i].legend()
plt.tight_layout()
plt.show()

nile

Nile River flows at Ashwan 1871-1970

Documentation: https://www.statsmodels.org/devel/datasets/generated/nile.html

Sample Size: 100

Variables:

  • year : The year of the observations
  • volume : The discharge at Aswan in m³ × 10⁸

Example Usage:

# Load the data
dataset = sm.datasets.nile.load_pandas()
df = dataset['data']
# Plot
fig = plt.figure()
plt.title('Nile River Flows at Ashwan 1871-1970')
plt.scatter(df['year'], df['volume'], ec='k', fc='gray')
plt.ylabel('Volume [10⁸ m³]')
plt.xlabel('Year')
plt.tight_layout()
plt.show()

randhie

RAND Health Insurance Experiment Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/randhie.html

Sample Size: 20,190

Variables:

  • mdvis : Number of outpatient visits to an MD
  • lncoins : ln(coinsurance + 1), 0 <= coinsurance <= 100
  • idp : 1 if individual deductible plan, 0 otherwise
  • lpi : ln(max(1, annual participation incentive payment))
  • fmde : 0 if idp = 1; ln(max(1, MDE/(0.01 coinsurance))) otherwise
  • physlm : 1 if the person has a physical limitation
  • disea : Number of chronic diseases
  • hlthg : 1 if self-rated health is good
  • hlthf : 1 if self-rated health is fair
  • hlthp : 1 if self-rated health is poor

Example Usage:

# Load the data
dataset = sm.datasets.randhie.load_pandas()
df = dataset['data']
# Plot
# matshow creates its own figure, so set the title afterwards
plt.matshow(df.corr())
plt.title('RAND Health Insurance Experiment Data')
plt.yticks(range(len(list(df))), list(df))
plt.xticks(range(len(list(df))), list(df), rotation=36)
plt.show()

scotland

Taxation Powers Vote for the Scottish Parliament 1997

Documentation: https://www.statsmodels.org/devel/datasets/generated/scotland.html

Sample Size: 32

Exogenous Variables:

  • COUTAX : Amount of council tax collected in pounds sterling as of April 1997
  • UNEMPF : Female percentage of total unemployment benefits claims as of January 1998
  • MOR : The standardized mortality rate (UK is 100)
  • ACT : Labor force participation (Short for active)
  • GDP : GDP per county
  • AGE : Percentage of children aged 5 to 15 in the county
  • COUTAX_FEMALEUNEMP : Interaction between COUTAX and UNEMPF

Endogenous Variable:

  • YES : Proportion voting yes to granting taxation powers to the Scottish parliament

Example Usage:

# Load the data
dataset = sm.datasets.scotland.load_pandas()
# Plot
fig, ax = plt.subplots(4, 2, figsize=(8, 12))
ax = ax.flatten()
fig.suptitle('Taxation Powers Vote for the Scottish Parliament 1997')
for i, exog_name in enumerate(dataset['exog_name'][:-1]):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Proportion Voting Yes')
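# Remove the two unused axes in the bottom row so they do not overlap the centred plot
ax[6].remove()
ax[7].remove()
ax1 = plt.subplot2grid((4, 4), (3, 1), colspan=2, fig=fig)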
ax1.scatter(dataset['exog']['COUTAX_FEMALEUNEMP'], dataset['endog'])
ax1.set_xlabel('COUTAX_FEMALEUNEMP')
ax1.set_ylabel('Proportion Voting Yes')
plt.tight_layout()
plt.show()

spector

Spector and Mazzeo (1980) - Program Effectiveness Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/spector.html

Sample Size: 32

Exogenous Variables:

  • TUCE : Test score on economics test
  • PSI : Participation in program
  • GPA : Student’s grade point average

Endogenous Variable:

  • GRADE : Binary variable indicating whether or not a student’s grade improved; 1 indicates an improvement

Example Usage:

# Load the data
dataset = sm.datasets.spector.load_pandas()
data = dataset['data']
data['GRADE'] = data['GRADE'].replace({0: 'No', 1: 'Yes'})
data['PSI'] = data['PSI'].replace({0: "Didn't participate", 1: 'Participated'})
# Plot
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.suptitle('Spector and Mazzeo (1980) - Program Effectiveness Data')
for participation in data['PSI'].unique():
    subset = data[data['PSI'] == participation]
    ax[0].scatter(subset['GPA'], subset['GRADE'], label=participation)
    ax[0].set_xlabel("Student's grade point average")
    ax[0].set_ylabel('Improvement')
    ax[1].scatter(subset['TUCE'], subset['GRADE'], label=participation)
    ax[1].set_xlabel('Test Score on Economics Test')
    ax[1].set_ylabel('Improvement')
ax[0].legend(loc='center right')
ax[1].legend(loc='center right')
plt.tight_layout()
plt.show()

stackloss

Stack Loss Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/stackloss.html

Sample Size: 21

Exogenous Variables:

  • AIRFLOW : Rate of operation of the plant
  • WATERTEMP : Cooling water temperature in the absorption tower
  • ACIDCONC : Acid concentration of circulating acid minus 50 times 10

Endogenous Variable:

  • STACKLOSS : 10 times the percentage of ammonia going into the plant that escapes from the absorption column

Example Usage:

# Load the data
dataset = sm.datasets.stackloss.load_pandas()
# Plot
fig = plt.figure(figsize=(8, 6))
shape = (2, 4)
ax1 = plt.subplot2grid(shape, (0, 0), colspan=2, fig=fig)
ax2 = plt.subplot2grid(shape, (0, 2), colspan=2, fig=fig)
ax3 = plt.subplot2grid(shape, (1, 1), colspan=2, fig=fig)
fig.suptitle('Stack Loss Data')
ax1.scatter(dataset['exog']['AIRFLOW'], dataset['endog'])
ax1.set_xlabel('Rate of operation of the plant')
ax1.set_ylabel('Stack Loss')
ax2.scatter(dataset['exog']['WATERTEMP'], dataset['endog'])
ax2.set_xlabel('Cooling water temperature')
ax2.set_ylabel('Stack Loss')
ax3.scatter(dataset['exog']['ACIDCONC'], dataset['endog'])
ax3.set_xlabel('Acid concentration')
ax3.set_ylabel('Stack Loss')
plt.tight_layout()
plt.show()

star98

Star98 Educational Dataset

Documentation: https://www.statsmodels.org/devel/datasets/generated/star98.html

Sample Size: 303

Exogenous Variables:

  • LOWINC : Percentage of low income students
  • PERASIAN : Percentage of Asian students
  • PERBLACK : Percentage of black students
  • PERHISP : Percentage of Hispanic students
  • PERMINTE : Percentage of minority teachers
  • AVYRSEXP : Sum of teachers’ years in educational service divided by the number of teachers
  • AVSALK : Total salary budget including benefits divided by the number of full-time teachers
  • PERSPENK : Per-pupil spending
  • PTRATIO : Pupil-teacher ratio
  • PCTAF : Percentage of students taking UC/CSU prep courses
  • PCTCHRT : Percentage of charter schools
  • PCTYRRND : Percentage of year-round schools

Interaction Terms:

  • PERMINTE_AVYRSEXP
  • PERMINTE_AVSAL
  • AVYRSEXP_AVSAL
  • PERSPEN_PTRATIO
  • PERSPEN_PCTAF
  • PTRATIO_PCTAF
  • PERMINTE_AVYRSEXP_AVSAL
  • PERSPEN_PTRATIO_PCTAF

Endogenous Variables:

  • NABOVE : Total number of students above the national median for the math section
  • NBELOW : Total number of students below the national median for the math section

Example Usage:

# Load the data
dataset = sm.datasets.star98.load_pandas()
df = dataset['data']
# Plot
fig = plt.figure(figsize=(8, 8))
ax = plt.subplot()
ax.matshow(df.corr())
ax.set_yticks(range(len(list(df))), list(df))
ax.set_xticks(range(len(list(df))), list(df), rotation=90)
plt.subplots_adjust(left=0.3, top=0.7)
plt.show()

statecrime

Statewide Crime Data 2009

Documentation: https://www.statsmodels.org/devel/datasets/generated/statecrime.html

Sample Size: 51

Exogenous Variables:

  • urban : % of population in Urbanized Areas
  • poverty : % of individuals below the poverty line
  • hs_grad : Percent of population having graduated from high school or higher
  • single : A variable related to household type

Endogenous Variable:

  • murder : Murders per 100,000 population

Example Usage:

# Load the data
dataset = sm.datasets.statecrime.load_pandas()
# Plot
fig, ax = plt.subplots(2, 2, figsize=(9, 7))
ax = ax.flatten()
fig.suptitle('Statewide Crime Data 2009')
for i, exog_name in enumerate(dataset['exog_name']):
    for state in dataset['exog'].index:
        x = dataset['exog'][exog_name][state]
        y = dataset['endog'][state]
        ax[i].scatter(x, y, label=state)
        ax[i].set_xlabel(dataset['exog_name'][i])
        ax[i].set_ylabel('Murders per 100,000 people')
ax[1].legend(bbox_to_anchor=(1.05, 1.3), fontsize=6)
plt.subplots_adjust(left=0.06, right=0.84)
plt.show()

strikes

U.S. Strike Duration Data

Documentation: https://www.statsmodels.org/devel/datasets/generated/strikes.html

Sample Size: 62

Exogenous Variables:

  • iprod : Unanticipated industrial production

Endogenous Variable:

  • duration : Duration of the strike in days

Example Usage:

# Load the data
dataset = sm.datasets.strikes.load_pandas()
# Plot
fig, ax = plt.subplots()
ax.scatter(dataset['exog'], dataset['endog'])
ax.set_title('U.S. Strike Duration Data')
ax.set_ylabel('Duration of the strike in days')
ax.set_xlabel('Unanticipated industrial production')
plt.tight_layout()
plt.show()

sunspots

Yearly sunspots data 1700-2008

Documentation: https://www.statsmodels.org/devel/datasets/generated/sunspots.html

Sample Size: 309 (from 1700 to 2008 inclusive)

Variables:

  • YEAR : Year
  • SUNACTIVITY : Number of sunspots for each year

Example Usage:

# Load the data
dataset = sm.datasets.sunspots.load_pandas()
# Plot
fig, ax = plt.subplots()
ax.scatter(dataset['data']['YEAR'], dataset['data']['SUNACTIVITY'])
ax.set_title('Yearly Sunspots Data 1700-2008')
ax.set_ylabel('Number of Sunspots')
ax.set_xlabel('Year')
plt.tight_layout()
plt.show()
