The Statsmodels package provides datasets that can be used as example data in MWEs and ‘Hello World’ scripts that test functionality. These can be imported via the Statsmodels API and come as ‘modules’ inside the datasets object. Here’s how to list all 28 of the available datasets:
from statsmodels import api as sm
# List all 28 of the statsmodels datasets
print('Name               Title')
print('----               -----')
for attribute in dir(sm.datasets):
    # If the attribute is a module
    if str(type(getattr(sm.datasets, attribute))) == "<class 'module'>":
        # The utils module is not a dataset
        if attribute == 'utils':
            continue
        title = getattr(sm.datasets, attribute).TITLE
        print(f'{attribute:18s} {title}')
## Name               Title
## ----               -----
## anes96             American National Election Survey 1996
## cancer             Breast Cancer Data
## ccard              Bill Greene's credit scoring data.
## china_smoking      Smoking and lung cancer in eight cities in China.
## co2                Mauna Loa Weekly Atmospheric CO2 Data
## committee          First 100 days of the US House of Representatives 1995
## copper             World Copper Market 1951-1975 Dataset
## cpunish            US Capital Punishment dataset.
## danish_data        Danish Money Demand Data
## elnino             El Nino - Sea Surface Temperatures
## engel              Engel (1857) food expenditure data
## fair               Affairs dataset
## fertility          World Bank Fertility Data
## grunfeld           Grunfeld (1950) Investment Data
## heart              Transplant Survival Data
## interest_inflation (West) German interest and inflation rate 1972-1998
## longley            Longley dataset
## macrodata          United States Macroeconomic data
## modechoice         Travel Mode Choice
## nile               Nile River flows at Ashwan 1871-1970
## randhie            RAND Health Insurance Experiment Data
## scotland           Taxation Powers Vote for the Scottish Parliament 1997
## spector            Spector and Mazzeo (1980) - Program Effectiveness Data
## stackloss          Stack loss data
## star98             Star98 Educational Dataset
## statecrime         Statewide Crime Data 2009
## strikes            U.S. Strike Duration Data
## sunspots           Yearly sunspots data 1700-2008
Some more packages and settings that will be used by this page:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model
The datasets usually have exog and endog attributes that hold the exogenous and endogenous variables, respectively. These are economics terms for what are essentially the independent and dependent variables, or the ‘features’ and the ‘target’ in machine learning terminology.
Documentation: https://www.statsmodels.org/devel/datasets/generated/anes96.html
Exogenous Variables:
logpopul: log(popul + .1) where popul is census place population in 1000s
selfLR: Respondent’s self-reported political leanings from “Left” to “Right”
age: Age of respondent
educ: Education level of respondent
income: Income of household
Endogenous Variable:
PID: Party identification of respondent
Example Usage:
# Load the data
dataset = sm.datasets.anes96.load_pandas()
X = dataset.exog
y = dataset.endog
# Plot
plt.title('American National Election Survey 1996')
df = pd.concat([X['age'], y], axis=1)
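# Sort the PID values so the boxes line up with the tick labels below
pid_values = sorted(df['PID'].unique())
boxes = [df.loc[df['PID'] == val, 'age'] for val in pid_values]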
plt.boxplot(boxes)
plt.xlabel('Party identification of respondent')
plt.ylabel('Age')
identification = [
    'Strong Democrat', 'Weak Democrat', 'Independent-Democrat',
    'Independent-Independent', 'Independent-Republican', 'Weak Republican',
    'Strong Republican'
]
plt.xticks(ticks=range(1, 8), labels=identification, rotation=15)
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/cancer.html
Exogenous Variable:
population: The population of the county
Endogenous Variable:
cancer: The number of breast cancer observances
Example Usage:
# Load the data
dataset = sm.datasets.cancer.load_pandas()
X = dataset.exog
y = dataset.endog
# Create a linear regression model
model = linear_model.LinearRegression()
# Train the model
model.fit(X.values, y.values)
# Use the model to make a prediction
X_fitted = np.array([X.values.min(), X.values.max()]).reshape(-1, 1)
y_fitted = model.predict(X_fitted)
# Plot
fig = plt.figure()
plt.title('Breast Cancer Data')
plt.scatter(X, y, ec='k', fc='gray')
# The '+' format flag shows the sign of the intercept explicitly
label = f'y = {model.coef_[0]:.3f}x {model.intercept_:+.3f}'
plt.plot(X_fitted, y_fitted, c='gray', label=label)
plt.xlabel('The population of the county')
plt.ylabel('The number of breast cancer observances')
plt.legend(frameon=False)
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/ccard.html
Exogenous Variables:
AGE: Age in years + 12ths of a year
INCOME: Income, divided by 10,000
INCOMESQ: INCOME squared
OWNRENT: Individual owns (1) or rents (0) home
Endogenous Variable:
AVGEXP: Avg monthly credit card expenditure
Example Usage:
# Load the data
dataset = sm.datasets.ccard.load_pandas()
X = dataset.exog
y = dataset.endog
# Plot
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("Bill Greene's Credit Scoring Data")
ax1.scatter(X['AGE'], y)
ax1.set_xlabel('Age in years + 12ths of a year')
ax1.set_ylabel('Avg monthly credit card expenditure')
ax2.scatter(X['INCOME'], y)
ax2.set_xlabel('Income, divided by 10,000')
ax2.set_ylabel('Avg monthly credit card expenditure')
ax3.scatter(X['INCOMESQ'], y)
ax3.set_xlabel('Income squared')
ax3.set_ylabel('Avg monthly credit card expenditure')
ax4.scatter(X['OWNRENT'], y)
ax4.set_xlabel('Individual owns (1) or rents (0) home')
ax4.set_ylabel('Avg monthly credit card expenditure')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/china_smoking.html
Variables:
Location: Name of the city
smoking: Yes or no, according to a person’s smoking behavior
cancer: Yes or no, according to a person’s lung cancer status
Example Usage:
# Load the data
dataset = sm.datasets.china_smoking.load_pandas()
# Create contingency tables
contingency_tables = {}
for i, row in dataset['raw_data'].iterrows():
    dct = {
        'Cancer': [row['smoking_yes_cancer_yes'], row['smoking_no_cancer_yes']],
        'Healthy': [row['smoking_yes_cancer_no'], row['smoking_no_cancer_no']],
    }
    df = pd.DataFrame(dct, index=['Smoker', 'Non-Smoker'])
    contingency_tables[row['Location']] = df
    print(row['Location'] + ':')
    print(df)
    print('')
## Beijing:
##             Cancer  Healthy
## Smoker         126      100
## Non-Smoker      35       61
##
## Shanghai:
##             Cancer  Healthy
## Smoker         908      688
## Non-Smoker     497      807
##
## Shenyang:
##             Cancer  Healthy
## Smoker         913      747
## Non-Smoker     336      598
##
## Nanjng:
##             Cancer  Healthy
## Smoker         235      172
## Non-Smoker      58      121
##
## Harbin:
##             Cancer  Healthy
## Smoker         402      308
## Non-Smoker     121      215
##
## Zhengzhou:
##             Cancer  Healthy
## Smoker         182      156
## Non-Smoker      72       98
##
## Taiyuan:
##             Cancer  Healthy
## Smoker          60       99
## Non-Smoker      11       43
##
## Nanchang:
##             Cancer  Healthy
## Smoker         104       89
## Non-Smoker      21       36
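One simple thing to do with a 2×2 contingency table like these is to compute an odds ratio. The sketch below does this by hand for the Beijing table, with the four counts copied from the printed output above; it is only an illustration and not part of the original example.

```python
import pandas as pd

# Beijing contingency table, copied from the printed output above
table = pd.DataFrame(
    {'Cancer': [126, 35], 'Healthy': [100, 61]},
    index=['Smoker', 'Non-Smoker'],
)

# Odds of cancer within each group
odds_smoker = table.loc['Smoker', 'Cancer'] / table.loc['Smoker', 'Healthy']
odds_non_smoker = (
    table.loc['Non-Smoker', 'Cancer'] / table.loc['Non-Smoker', 'Healthy']
)

# Odds ratio: values above 1 mean higher odds of cancer among smokers
odds_ratio = odds_smoker / odds_non_smoker
print(f'Odds ratio for Beijing: {odds_ratio:.3f}')
```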
Documentation: https://www.statsmodels.org/devel/datasets/generated/co2.html
Index:
date: Sample date in YYYY-MM-DD format. There is only one sample reported every 7 days as each represents a weekly average.
Column:
co2: CO₂ concentration in ppmv (parts per million by volume)
Example Usage:
# Load the data
dataset = sm.datasets.co2.load_pandas()
# Plot
plt.figure()
plt.plot(dataset['data'])
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/committee.html
Exogenous Variables:
SIZE: Number of members on the committee
SUBS: Number of subcommittees
STAFF: Number of staff members assigned to the committee
PRESTIGE: Dummy variable indicating whether the committee is a high prestige committee (PRESTIGE == 1 is a high prestige committee)
BILLS103: The number of bill assignments in the first 100 days of the US’s 103rd House of Representatives
Endogenous Variable:
BILLS104: The number of bill assignments in the first 100 days of the US’s 104th House of Representatives over 20 Committees
Example Usage:
# Load the data
dataset = sm.datasets.committee.load_pandas()
# Plot
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('First 100 days of the US House of Representatives 1995')
ax1.scatter(dataset['exog']['SIZE'], dataset['endog'])
ax1.set_xlabel('Number of members on the committee')
ax1.set_ylabel('Bill assignments')
ax2.scatter(dataset['exog']['SUBS'], dataset['endog'])
ax2.set_xlabel('Number of subcommittees')
ax2.set_ylabel('Bill assignments')
ax3.scatter(dataset['exog']['STAFF'], dataset['endog'])
ax3.set_xlabel('Number of staff members assigned to the committee')
ax3.set_ylabel('Bill assignments')
ax4.scatter(dataset['exog']['BILLS103'], dataset['endog'])
ax4.set_xlabel('Bill assignments (103rd House of Representatives)')
ax4.set_ylabel('Bill assignments (104th House of Representatives)')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/copper.html
Exogenous Variables:
COPPERPRICE: Constant dollar adjusted price of copper
INCOMEINDEX: An index of real per capita income (base 1970)
ALUMPRICE: The price of aluminium
INVENTORYINDEX: A measure of annual manufacturer inventory trend
TIME: A time trend
Endogenous Variable:
WORLDCONSUMPTION: World consumption of copper (in 1000 metric tons)
Example Usage:
# Load the data
dataset = sm.datasets.copper.load_pandas()
# Plot
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('World Copper Market 1951-1975 Dataset')
ax1.scatter(dataset['exog']['COPPERPRICE'], dataset['endog'])
ax1.set_xlabel('Constant dollar adjusted price of copper')
ax1.set_ylabel('World consumption of copper')
ax2.scatter(dataset['exog']['INCOMEINDEX'], dataset['endog'])
ax2.set_xlabel('An index of real per capita income (base 1970)')
ax2.set_ylabel('World consumption of copper')
ax3.scatter(dataset['exog']['ALUMPRICE'], dataset['endog'])
ax3.set_xlabel('The price of aluminium')
ax3.set_ylabel('World consumption of copper')
ax4.scatter(dataset['exog']['INVENTORYINDEX'], dataset['endog'])
ax4.set_xlabel('A measure of annual manufacturer inventory trend')
ax4.set_ylabel('World consumption of copper')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/cpunish.html
Exogenous Variables:
INCOME: Median per capita income in 1996 dollars
PERPOVERTY: Percent of the population classified as living in poverty
PERBLACK: Percent of black citizens in the population
VC100k96: Rate of violent crimes per 100,000 residents for 1996
SOUTH: SOUTH == 1 indicates a state in the South
DEGREE: An estimate of the proportion of the state population with a college degree of some kind
Endogenous Variable:
EXECUTIONS: Executions in 1996
Example Usage:
# Load the data
dataset = sm.datasets.cpunish.load_pandas()
# Plot
fig, ax = plt.subplots(3, 2, figsize=(10, 10))
ax = ax.flatten()
fig.suptitle('US Capital Punishment Dataset')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Executions in 1996')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/danish_data.html
Sample Size: 55
Variables:
lrm: Log real money
lry: Log real income
lpy: Log prices
ibo: Bond rate
ide: Deposit rate
Example Usage:
# Load the data
dataset = sm.datasets.danish_data.load_pandas()
data = dataset['data']
# Plot
fig = plt.figure(figsize=(10, 10))
shape = (3, 4)
ax1 = plt.subplot2grid(shape, (0, 0), colspan=2, fig=fig)
ax2 = plt.subplot2grid(shape, (0, 2), colspan=2, fig=fig)
ax3 = plt.subplot2grid(shape, (1, 0), colspan=2, fig=fig)
ax4 = plt.subplot2grid(shape, (1, 2), colspan=2, fig=fig)
ax5 = plt.subplot2grid(shape, (2, 1), colspan=2, fig=fig)
fig.suptitle('Danish Money Demand Data')
ax1.scatter(data.index, data['lrm'])
ax1.set_xlabel('Period')
ax1.set_ylabel('Log real money')
ax2.scatter(data.index, data['lry'])
ax2.set_xlabel('Period')
ax2.set_ylabel('Log real income')
ax3.scatter(data.index, data['lpy'])
ax3.set_xlabel('Period')
ax3.set_ylabel('Log prices')
ax4.scatter(data.index, data['ibo'])
ax4.set_xlabel('Period')
ax4.set_ylabel('Bond rate')
ax5.scatter(data.index, data['ide'])
ax5.set_xlabel('Period')
ax5.set_ylabel('Deposit rate')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/elnino.html
Sample Size: 61 years x 12 months
Variable:
TEMPERATURE: Average sea surface temperature in degrees Celsius
Example Usage:
# Load the data
dataset = sm.datasets.elnino.load_pandas()
data = dataset['data']
# Convert wide to long
months = [
    'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
    'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'
]
df = data.melt(id_vars=['YEAR'], value_vars=months, var_name='MONTH')
# Combine year and month into a single 'DATE' column
df['YEAR'] = df['YEAR'].astype(int)
df['MONTH'] = df['MONTH'].apply(lambda x: months.index(x) + 1)
df['DATE'] = df['YEAR'].astype(str) + '-' + df['MONTH'].astype(str)
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y-%m')
# Sort the data
df = df.sort_values('DATE')
# Plot
fig, ax = plt.subplots()
ax.plot(df['DATE'], df['value'])
ax.set_title('El Nino - Sea Surface Temperatures')
ax.set_ylabel('Average sea surface temperature (°C)')
_ = ax.set_xlim([df['DATE'].min(), df['DATE'].max()])
plt.tight_layout()
plt.show()
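The wide-to-long conversion used above can be easier to follow on a tiny table. The sketch below applies the same melt() call to a made-up two-year, two-month frame (the values are hypothetical, not taken from the El Nino data).

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per month
wide = pd.DataFrame({
    'YEAR': [2000, 2001],
    'JAN': [23.1, 24.0],
    'FEB': [25.2, 26.1],
})

# melt() stacks the month columns into (YEAR, MONTH, value) rows
long_df = wide.melt(id_vars=['YEAR'], value_vars=['JAN', 'FEB'], var_name='MONTH')
print(long_df)
```

Each month column becomes a block of rows, which is why the real example sorts by date afterwards.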
Documentation: https://www.statsmodels.org/devel/datasets/generated/engel.html
Sample Size: 235
Variables:
income: Annual household income (Belgian francs)
foodexp: Annual household food expenditure (Belgian francs)
Example Usage:
# Load the data
dataset = sm.datasets.engel.load_pandas()
data = dataset['data']
# Plot
plt.scatter(data['income'], data['foodexp'])
plt.title('Engel (1857) Food Expenditure Data')
plt.ylabel('Annual household food expenditure (Belgian francs)')
plt.xlabel('Annual household income (Belgian francs)')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/fair.html
Sample Size: 6366
Exogenous Variables:
rate_marriage: Rating of marriage
age: Age
yrs_married: Number of years married - interval approximations
children: Number of children
religious: How religious
educ: Level of education
occupation: Occupation
occupation_husb: Husband’s occupation
Endogenous Variable:
affairs: Measure of time spent in extramarital affairs
Example Usage:
# Load the data
dataset = sm.datasets.fair.load_pandas()
# Plot
fig, ax = plt.subplots(4, 2, figsize=(8, 12))
ax = ax.flatten()
fig.suptitle('Affairs Dataset')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Time in extramarital affairs')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/fertility.html
Sample Size: 219 countries/regions
Variable: The fertility rate for the given year
Example Usage:
# Load the data
dataset = sm.datasets.fertility.load_pandas()
# Convert from wide to long
years = list(range(1960, 2014))
years = [str(year) for year in years]
df = pd.melt(dataset['data'], id_vars=['Country Name'], value_vars=years)
# Sort
df = df.sort_values('variable')
# Plot
fig, ax = plt.subplots()
countries = [
    'Yemen, Rep.', 'Barbados', 'Tonga', 'Macao SAR, China', 'Niger', 'Mongolia'
]
for country in countries:
    subset = df[df['Country Name'] == country]
    plt.scatter(subset['variable'].astype(int), subset['value'], label=country)
plt.title('World Bank Fertility Data')
plt.ylabel('Fertility Rate')
plt.xlabel('Year')
plt.legend()
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/grunfeld.html
Sample Size: 220 (20 years for 11 firms)
Exogenous Variables:
value: Market value as of December 31 in 1947 dollars
capital: Stock of plant and equipment in 1947 dollars
firm: General Motors, US Steel, General Electric, Chrysler, Atlantic Refining, IBM, Union Oil, Westinghouse, Goodyear, Diamond Match, American Steel
year: 1935 - 1954
Endogenous Variable:
invest: Gross investment in 1947 dollars
Example Usage:
# Load the data
dataset = sm.datasets.grunfeld.load_pandas()
# Plot
fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax = ax.flatten()
fig.suptitle('Grunfeld (1950) Investment Data')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    if exog_name == 'firm':
        ax[i].set_xticks(range(11))
        ax[i].set_xticklabels(range(11))
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Gross Investment (1947 dollars)')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/heart.html
Sample Size: 69
Exogenous Variables:
censors: Indicates if an observation is censored; 1 is uncensored
age: Age at the time of surgery
Endogenous Variable:
survival: Days after surgery until death
Example Usage:
# Load the data
dataset = sm.datasets.heart.load_pandas()
df = dataset['data']
# Plot
fig, ax = plt.subplots()
plt.scatter(dataset['exog'], dataset['endog'])
plt.title('Transplant Survival Data')
plt.ylabel('Days after surgery until death')
plt.xlabel('Age at the time of surgery')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/interest_inflation.html
Sample Size: 107
Variables:
year: Q2 1972 - Q4 1998
quarter: 1-4
Dp: Delta log GDP deflator
R: Nominal long term interest rate
Example Usage:
# Load the data
dataset = sm.datasets.interest_inflation.load_pandas()
df = dataset['data']
# Re-format the date information (quarters 1-4 start in months 1, 4, 7, 10)
df['year'] = df['year'].astype(int)
df['quarter'] = df['quarter'].astype(int)
parts = pd.DataFrame(
    {'year': df['year'], 'month': 3 * df['quarter'] - 2, 'day': 1}
)
df['date'] = pd.to_datetime(parts)
df = df.sort_values('date')
# Plot
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.suptitle('(West) German interest and inflation rate 1972-1998')
ax[0].scatter(df['date'], df['Dp'])
ax[0].set_ylabel('Delta Log GDP Deflator')
ax[0].set_xlabel('Quarter')
ax[1].scatter(df['date'], df['R'])
ax[1].set_ylabel('Nominal Long Term Interest Rate')
ax[1].set_xlabel('Quarter')
plt.tight_layout()
plt.show()
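Year and quarter columns like the ones in this dataset can also be represented as pandas quarterly periods, which keeps the quarter information explicit. The sketch below uses a few made-up rows in the same layout; the column name date and the label format are illustrative choices.

```python
import pandas as pd

# Hypothetical rows in the same year/quarter layout as the dataset
df = pd.DataFrame({'year': [1972, 1972, 1972], 'quarter': [2, 3, 4]})

# Build labels such as '1972Q2' and parse them as quarterly periods
labels = df['year'].astype(str) + 'Q' + df['quarter'].astype(str)
periods = pd.PeriodIndex(labels, freq='Q')

# Convert each period to the timestamp at the start of that quarter
df['date'] = periods.to_timestamp()
print(df)
```

Here quarter 2 maps to 1 April, quarter 3 to 1 July and quarter 4 to 1 October, so the points are spaced three months apart on a time axis.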
Documentation: https://www.statsmodels.org/devel/datasets/generated/longley.html
Sample Size: 16
Exogenous Variables:
GNPDEFL: GNP deflator
GNP: GNP
UNEMP: Number of unemployed
ARMED: Size of armed forces
POP: Population
YEAR: Year (1947 - 1962)
Endogenous Variable:
TOTEMP: Total employment
Example Usage:
# Load the data
dataset = sm.datasets.longley.load_pandas()
# Plot
fig, ax = plt.subplots(3, 2, figsize=(10, 10))
ax = ax.flatten()
fig.suptitle('Longley Dataset')
for i, exog_name in enumerate(dataset['exog_name']):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Total Employment')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/macrodata.html
Sample Size: 203
Variables:
year: 1959q1 - 2009q3
quarter: 1-4
realgdp: Real gross domestic product
realcons: Real personal consumption expenditures
realinv: Real gross private domestic investment
realgovt: Real federal consumption expenditures & gross investment
realdpi: Real private disposable income
cpi: End of the quarter consumer price index for all urban consumers: all items
m1: End of the quarter M1 nominal money stock (Seasonally adjusted)
tbilrate: Quarterly monthly average of the monthly 3-month treasury bill
unemp: Seasonally adjusted unemployment rate (%)
pop: End of the quarter total population: all ages incl. armed forces over seas
infl: Inflation rate (ln(cpi_{t}/cpi_{t-1}) * 400)
realint: Real interest rate (tbilrate - infl)
Example Usage:
# Load the data
dataset = sm.datasets.macrodata.load_pandas()
# Clean
df = dataset['data']
df = df.drop('year', axis=1)
df = df.drop('quarter', axis=1)
# Plot
plt.matshow(df.corr())
# Set the title after matshow(), which opens a new figure
plt.title('United States Macroeconomic Data')
plt.yticks(range(len(list(df))), list(df))
plt.xticks(range(len(list(df))), list(df), rotation=36)
plt.show()
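The definitions of infl and realint given in the variable list can be reproduced directly from the cpi and tbilrate columns. The sketch below does so on a few made-up quarterly values (not taken from the dataset), so it only illustrates the formulas.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly values, not taken from the dataset
df = pd.DataFrame({
    'cpi': [100.0, 101.0, 102.5],
    'tbilrate': [3.0, 3.2, 3.1],
})

# Annualised quarterly inflation: ln(cpi_t / cpi_{t-1}) * 400
df['infl'] = np.log(df['cpi'] / df['cpi'].shift(1)) * 400
# Real interest rate: nominal T-bill rate minus inflation
df['realint'] = df['tbilrate'] - df['infl']
print(df)
```

The first row has no previous quarter to compare with, so its infl and realint values are NaN.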
Documentation: https://www.statsmodels.org/devel/datasets/generated/modechoice.html
Sample Size: 210
Exogenous Variables:
ttme: Terminal waiting time for plane, train and bus (minutes); 0 for car
invc: In vehicle cost for all stages (dollars)
invt: Travel time (in-vehicle time) for all stages (minutes)
gc: Generalized cost measure: invc + (invt × value of travel time savings) (dollars)
hinc: Household income ($1000s)
psize: Traveling group size in mode chosen
Endogenous Variable:
choice: Yes (1) or no (0)
Example Usage:
# Load the data
dataset = sm.datasets.modechoice.load_pandas()
# Clean the data
df = dataset['data']
df['mode'] = df['mode'].replace({1: 'air', 2: 'train', 3: 'bus', 4: 'car'})
# Plot
fig, ax = plt.subplots(3, 2, figsize=(10, 10))
ax = ax.flatten()
fig.suptitle('Travel Mode Choice')
for i, exog_name in enumerate(dataset['exog_name']):
    for mode in df['mode'].unique():
        subset = df[df['mode'] == mode]
        ax[i].scatter(subset[exog_name], subset['choice'], label=mode)
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Choice')
    ax[i].set_yticks([0, 1])
    ax[i].set_yticklabels(['No', 'Yes'])
    ax[i].legend()
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/nile.html
Sample Size: 100
Variables:
year: The year of the observations
volume: The discharge at Aswan in m³ × 10⁸
Example Usage:
# Load the data
dataset = sm.datasets.nile.load_pandas()
df = dataset['data']
# Plot
fig = plt.figure()
plt.title('Nile River Flows at Ashwan 1871-1970')
plt.scatter(df['year'], df['volume'], ec='k', fc='gray')
plt.ylabel('Volume [10⁸ m³]')
plt.xlabel('Year')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/randhie.html
Sample Size: 20,190
Variables:
mdvis: Number of outpatient visits to an MD
lncoins: ln(coinsurance + 1), 0 <= coinsurance <= 100
idp: 1 if individual deductible plan, 0 otherwise
lpi: ln(max(1, annual participation incentive payment))
fmde: 0 if idp = 1; ln(max(1, MDE/(0.01 coinsurance))) otherwise
physlm: 1 if the person has a physical limitation
disea: Number of chronic diseases
hlthg: 1 if self-rated health is good
hlthf: 1 if self-rated health is fair
hlthp: 1 if self-rated health is poor
Example Usage:
# Load the data
dataset = sm.datasets.randhie.load_pandas()
df = dataset['data']
# Plot
plt.matshow(df.corr())
# Set the title after matshow(), which opens a new figure
plt.title('RAND Health Insurance Experiment Data')
plt.yticks(range(len(list(df))), list(df))
plt.xticks(range(len(list(df))), list(df), rotation=36)
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/scotland.html
Sample Size: 32
Exogenous Variables:
COUTAX: Amount of council tax collected in pounds sterling as of April 1997
UNEMPF: Female percentage of total unemployment benefits claims as of January 1998
MOR: The standardized mortality rate (UK is 100)
ACT: Labor force participation (Short for active)
GDP: GDP per county
AGE: Percentage of children aged 5 to 15 in the county
COUTAX_FEMALEUNEMP: Interaction between COUTAX and UNEMPF
Endogenous Variable:
YES: Proportion voting yes to granting taxation powers to the Scottish parliament
Example Usage:
# Load the data
dataset = sm.datasets.scotland.load_pandas()
# Plot
fig, ax = plt.subplots(4, 2, figsize=(8, 12))
ax = ax.flatten()
fig.suptitle('Taxation Powers Vote for the Scottish Parliament 1997')
for i, exog_name in enumerate(dataset['exog_name'][:-1]):
    ax[i].scatter(dataset['exog'][exog_name], dataset['endog'])
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Proportion Voting Yes')
ax1 = plt.subplot2grid((4, 4), (3, 1), colspan=2, fig=fig)
ax1.scatter(dataset['exog']['COUTAX_FEMALEUNEMP'], dataset['endog'])
ax1.set_xlabel('COUTAX_FEMALEUNEMP')
ax1.set_ylabel('Proportion Voting Yes')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/spector.html
Sample Size: 32
Exogenous Variables:
TUCE: Test score on economics test
PSI: Participation in program
GPA: Student’s grade point average
Endogenous Variable:
GRADE: Binary variable indicating whether or not a student’s grade improved; 1 indicates an improvement
Example Usage:
# Load the data
dataset = sm.datasets.spector.load_pandas()
data = dataset['data']
data['GRADE'] = data['GRADE'].replace({0: 'No', 1: 'Yes'})
data['PSI'] = data['PSI'].replace({0: "Didn't participate", 1: 'Participated'})
# Plot
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.suptitle('Spector and Mazzeo (1980) - Program Effectiveness Data')
for participation in data['PSI'].unique():
    subset = data[data['PSI'] == participation]
    ax[0].scatter(subset['GPA'], subset['GRADE'], label=participation)
    ax[0].set_xlabel("Student's grade point average")
    ax[0].set_ylabel('Improvement')
    ax[1].scatter(subset['TUCE'], subset['GRADE'], label=participation)
    ax[1].set_xlabel('Test Score on Economics Test')
    ax[1].set_ylabel('Improvement')
ax[0].legend(loc='center right')
ax[1].legend(loc='center right')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/stackloss.html
Sample Size: 21
Exogenous Variables:
AIRFLOW: Rate of operation of the plant
WATERTEMP: Cooling water temperature in the absorption tower
ACIDCONC: Acid concentration of circulating acid minus 50 times 10
Endogenous Variable:
STACKLOSS: 10 times the percentage of ammonia going into the plant that escapes from the absorption column
Example Usage:
# Load the data
dataset = sm.datasets.stackloss.load_pandas()
# Plot
fig = plt.figure(figsize=(8, 6))
shape = (2, 4)
ax1 = plt.subplot2grid(shape, (0, 0), colspan=2, fig=fig)
ax2 = plt.subplot2grid(shape, (0, 2), colspan=2, fig=fig)
ax3 = plt.subplot2grid(shape, (1, 1), colspan=2, fig=fig)
fig.suptitle('Stack Loss Data')
ax1.scatter(dataset['exog']['AIRFLOW'], dataset['endog'])
ax1.set_xlabel('Rate of operation of the plant')
ax1.set_ylabel('Stack Loss')
ax2.scatter(dataset['exog']['WATERTEMP'], dataset['endog'])
ax2.set_xlabel('Cooling water temperature')
ax2.set_ylabel('Stack Loss')
ax3.scatter(dataset['exog']['ACIDCONC'], dataset['endog'])
ax3.set_xlabel('Acid concentration')
ax3.set_ylabel('Stack Loss')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/star98.html
Sample Size: 303
Exogenous Variables:
LOWINC: Percentage of low income students
PERASIAN: Percentage of Asian students
PERBLACK: Percentage of black students
PERHISP: Percentage of Hispanic students
PERMINTE: Percentage of minority teachers
AVYRSEXP: Sum of teachers’ years in educational service divided by the number of teachers
AVSALK: Total salary budget including benefits divided by the number of full-time teachers
PERSPENK: Per-pupil spending
PTRATIO: Pupil-teacher ratio
PCTAF: Percentage of students taking UC/CSU prep courses
PCTCHRT: Percentage of charter schools
PCTYRRND: Percentage of year-round schools
Interaction Terms:
PERMINTE_AVYRSEXP
PERMINTE_AVSAL
AVYRSEXP_AVSAL
PERSPEN_PTRATIO
PERSPEN_PCTAF
PTRATIO_PCTAF
PERMINTE_AVYRSEXP_AVSAL
PERSPEN_PTRATIO_PCTAF
Endogenous Variables:
NABOVE: Total number of students above the national median for the math section
NBELOW: Total number of students below the national median for the math section
Example Usage:
# Load the data
dataset = sm.datasets.star98.load_pandas()
df = dataset['data']
# Plot
fig = plt.figure(figsize=(8, 8))
ax = plt.subplot()
ax.matshow(df.corr())
ax.set_yticks(range(len(list(df))), list(df))
ax.set_xticks(range(len(list(df))), list(df), rotation=90)
plt.subplots_adjust(left=0.3, top=0.7)
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/statecrime.html
Sample Size: 51
Exogenous Variables:
urban: % of population in Urbanized Areas
poverty: % of individuals below the poverty line
hs_grad: Percent of population having graduated from high school or higher
single: A variable related to household type
Endogenous Variable:
murder: Murders per 100,000 population
Example Usage:
# Load the data
dataset = sm.datasets.statecrime.load_pandas()
# Plot
fig, ax = plt.subplots(2, 2, figsize=(9, 7))
ax = ax.flatten()
fig.suptitle('Statewide Crime Data 2009')
for i, exog_name in enumerate(dataset['exog_name']):
    for state in dataset['exog'].index:
        x = dataset['exog'][exog_name][state]
        y = dataset['endog'][state]
        ax[i].scatter(x, y, label=state)
    ax[i].set_xlabel(dataset['exog_name'][i])
    ax[i].set_ylabel('Murders per 100,000 people')
ax[1].legend(bbox_to_anchor=(1.05, 1.3), fontsize=6)
plt.subplots_adjust(left=0.06, right=0.84)
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/strikes.html
Sample Size: 62
Exogenous Variable:
iprod: Unanticipated industrial production
Endogenous Variable:
duration: Duration of the strike in days
Example Usage:
# Load the data
dataset = sm.datasets.strikes.load_pandas()
# Plot
fig, ax = plt.subplots()
ax.scatter(dataset['exog'], dataset['endog'])
ax.set_title('U.S. Strike Duration Data')
ax.set_ylabel('Duration of the strike in days')
ax.set_xlabel('Unanticipated industrial production')
plt.tight_layout()
plt.show()
Documentation: https://www.statsmodels.org/devel/datasets/generated/sunspots.html
Sample Size: 309 (from 1700 to 2008 inclusive)
Variables:
YEAR: Year
SUNACTIVITY: Number of sunspots for each year
Example Usage:
# Load the data
dataset = sm.datasets.sunspots.load_pandas()
# Plot
fig, ax = plt.subplots()
ax.scatter(dataset['data']['YEAR'], dataset['data']['SUNACTIVITY'])
ax.set_title('Yearly Sunspots Data 1700-2008')
ax.set_ylabel('Number of Sunspots')
ax.set_xlabel('Year')
plt.tight_layout()
plt.show()