
AirPassengers

PyDataset Documentation (adopted from R Documentation):

Monthly Airline Passenger Numbers 1949-1960

The classic Box & Jenkins airline data. Monthly totals of international airline passengers, 1949 to 1960.

A monthly time series, in thousands.

Source:

First 5 rows of the dataset:

##           time  AirPassengers
## 1  1949.000000            112
## 2  1949.083333            118
## 3  1949.166667            132
## 4  1949.250000            129
## 5  1949.333333            121
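The `time` column stores each month as a decimal year (January is `.000000`, February is `.083333`, and so on). As a minimal sketch, assuming that each step is exactly 1/12 of a year, the stamps can be turned back into calendar months. The helper name `decimal_year_to_month` is hypothetical and not part of PyDataset.

```python
def decimal_year_to_month(t):
    """Split a decimal-year stamp such as 1949.083333 into (year, month)."""
    year = int(t)
    # Each month spans 1/12 of a year; rounding absorbs the six-digit truncation.
    month = round((t - year) * 12) + 1
    return year, month


# The five time stamps shown in the table above.
times = [1949.000000, 1949.083333, 1949.166667, 1949.250000, 1949.333333]
labels = [decimal_year_to_month(t) for t in times]
print(labels)
```

Rounding rather than truncating matters here: `0.083333 * 12` is slightly below 1, so truncation would map February back to January.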


BJsales

PyDataset Documentation (adopted from R Documentation):

Sales Data with Leading Indicator

The sales time series BJsales and leading indicator BJsales.lead each contain 150 observations. The objects are of class "ts".

Source:

References:

First 5 rows of the dataset:

##    time  BJsales
## 1     1    200.1
## 2     2    199.5
## 3     3    199.4
## 4     4    198.9
## 5     5    199.0


BOD

PyDataset Documentation (adopted from R Documentation):

Biochemical Oxygen Demand

The BOD data frame has 6 rows and 2 columns giving the biochemical oxygen demand versus time in an evaluation of water quality.

This data frame contains the following columns:

Source:

Originally from:

First 5 rows of the dataset:

##    Time  demand
## 1     1     8.3
## 2     2    10.3
## 3     3    19.0
## 4     4    16.0
## 5     5    15.6


Formaldehyde

PyDataset Documentation (adopted from R Documentation):

Determination of Formaldehyde

These data are from a chemical experiment to prepare a standard curve for the determination of formaldehyde by the addition of chromotropic acid and concentrated sulphuric acid and the reading of the resulting purple color on a spectrophotometer.

A data frame with 6 observations on 2 variables.

Source:

References:

First 5 rows of the dataset:

##    carb  optden
## 1   0.1   0.086
## 2   0.3   0.269
## 3   0.5   0.446
## 4   0.6   0.538
## 5   0.7   0.626
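A standard curve of this kind is usually a straight line relating the known amount to the instrument reading. The sketch below fits an ordinary least-squares line to the five rows shown above only (the full data frame has six observations, so the fitted values are illustrative). Treating `carb` as the known amount and `optden` as the spectrophotometer reading is an assumption based on the column names; `fit_line` is a hypothetical helper.

```python
# The five rows shown above: (carb, optden).
carb = [0.1, 0.3, 0.5, 0.6, 0.7]
optden = [0.086, 0.269, 0.446, 0.538, 0.626]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line ys ~ xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(carb, optden)
print(f"optden ~ {intercept:.4f} + {slope:.4f} * carb")
```

With these five points the slope is close to 0.9 and the intercept is close to zero, which is what one would hope for in a calibration line.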


HairEyeColor

PyDataset Documentation (adopted from R Documentation):

Hair and Eye Color of Statistics Students

Distribution of hair and eye color and sex in 592 statistics students.

A 3-dimensional array resulting from cross-tabulating 592 observations on 3 variables. The variables and their levels are as follows:

The Hair x Eye table comes from a survey of students at the University of Delaware reported by Snee (1974). The split by Sex was added by Friendly (1992a) for didactic purposes.

This data set is useful for illustrating various techniques for the analysis of contingency tables, such as the standard chi-squared test or, more generally, log-linear modelling, and graphical methods such as mosaic plots, sieve diagrams or association plots.

Source:

Snee (1974) gives the two-way table aggregated over Sex. The Sex split of the ‘Brown hair, Brown eye’ cell was changed to agree with that used by Friendly (2000).

References:

First 5 rows of the dataset:

##     Hair    Eye   Sex  Freq
## 1  Black  Brown  Male    32
## 2  Brown  Brown  Male    53
## 3    Red  Brown  Male    10
## 4  Blond  Brown  Male     3
## 5  Black   Blue  Male    11


InsectSprays

PyDataset Documentation (adopted from R Documentation):

Effectiveness of Insect Sprays

The counts of insects in agricultural experimental units treated with different insecticides.

A data frame with 72 observations on 2 variables.

Source:

Reference:

First 5 rows of the dataset:

##    count spray
## 1     10     A
## 2      7     A
## 3     20     A
## 4     14     A
## 5     14     A


JohnsonJohnson

PyDataset Documentation (adopted from R Documentation):

Quarterly Earnings per Johnson & Johnson Share

Quarterly earnings (dollars) per Johnson & Johnson share 1960–80.

A quarterly time series.

Source:

First 5 rows of the dataset:

##       time  JohnsonJohnson
## 1  1960.00            0.71
## 2  1960.25            0.63
## 3  1960.50            0.85
## 4  1960.75            0.44
## 5  1961.00            0.61


LakeHuron

PyDataset Documentation (adopted from R Documentation):

Level of Lake Huron 1875–1972

Annual measurements of the level, in feet, of Lake Huron 1875–1972.

A time series of length 98.

Sources:

First 5 rows of the dataset:

##    time  LakeHuron
## 1  1875     580.38
## 2  1876     581.86
## 3  1877     580.97
## 4  1878     580.80
## 5  1879     579.79


LifeCycleSavings

PyDataset Documentation (adopted from R Documentation):

Intercountry Life-Cycle Savings Data

Data on the savings ratio 1960–1970.

A data frame with 50 observations on 5 variables:

Under the life-cycle savings hypothesis as developed by Franco Modigliani, the savings ratio (aggregate personal saving divided by disposable income) is explained by per-capita disposable income, the percentage rate of change in per-capita disposable income, and two demographic variables: the percentage of population less than 15 years old and the percentage of the population over 75 years old. The data are averaged over the decade 1960–1970 to remove the business cycle or other short-term fluctuations.

Source:

The data were obtained from Belsley, Kuh and Welsch (1980). They in turn obtained the data from Sterling (1977).

References:

First 5 rows of the dataset:

##               sr  pop15  pop75      dpi  ddpi
## Australia  11.43  29.35   2.87  2329.68  2.87
## Austria    12.07  23.32   4.41  1507.99  3.93
## Belgium    13.17  23.80   4.43  2108.47  3.82
## Bolivia     5.75  41.89   1.67   189.13  0.22
## Brazil     12.88  42.19   0.83   728.47  4.56


Nile

PyDataset Documentation (adopted from R Documentation):

Flow of the River Nile

Measurements of the annual flow of the river Nile at Aswan 1871–1970.

A time series of length 100.

Source:

References:

First 5 rows of the dataset:

##    time  Nile
## 1  1871  1120
## 2  1872  1160
## 3  1873   963
## 4  1874  1210
## 5  1875  1160


OrchardSprays

PyDataset Documentation (adopted from R Documentation):

Potency of Orchard Sprays

An experiment was conducted to assess the potency of various constituents of orchard sprays in repelling honeybees, using a Latin square design.

A data frame with 64 observations on 4 variables.

Individual cells of dry comb were filled with measured amounts of lime sulphur emulsion in sucrose solution. Seven different concentrations of lime sulphur ranging from a concentration of 1/100 to 1/1,562,500 in successive factors of 1/5 were used as well as a solution containing no lime sulphur.
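The dilution series described above can be checked with exact arithmetic. This is a small sketch, not part of the original documentation: it starts at 1/100, multiplies by 1/5 six times, and confirms that the seventh concentration is 1/1,562,500. Adding the lime-sulphur-free solution gives eight treatments, which matches the 8 x 8 layout and the 64 observations.

```python
from fractions import Fraction

# Seven concentrations: 1/100, then successive factors of 1/5.
concentrations = [Fraction(1, 100) * Fraction(1, 5) ** k for k in range(7)]
for c in concentrations:
    print(c)

# Seven lime sulphur solutions plus one solution containing none.
n_treatments = len(concentrations) + 1
n_cells = n_treatments ** 2  # one cell per (row, column) position of the square
print(n_treatments, n_cells)
```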

The responses for the different solutions were obtained by releasing 100 bees into the chamber for two hours, and then measuring the decrease in volume of the solutions in the various cells.

An 8 x 8 Latin square design was used and the treatments were coded as follows:

Source:

Reference:

First 5 rows of the dataset:

##    decrease  rowpos  colpos treatment
## 1        57       1       1         D
## 2        95       2       1         E
## 3         8       3       1         B
## 4        69       4       1         H
## 5        92       5       1         G


PlantGrowth

PyDataset Documentation (adopted from R Documentation):

Results from an Experiment on Plant Growth

Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

A data frame of 30 cases on 2 variables:

Source:

First 5 rows of the dataset:

##    weight group
## 1    4.17  ctrl
## 2    5.58  ctrl
## 3    5.18  ctrl
## 4    6.11  ctrl
## 5    4.50  ctrl


Puromycin

PyDataset Documentation (adopted from R Documentation):

Reaction Velocity of an Enzymatic Reaction

The Puromycin data frame has 23 rows and 3 columns of the reaction velocity versus substrate concentration in an enzymatic reaction involving untreated cells or cells treated with Puromycin.

This data frame contains the following columns:

Data on the velocity of an enzymatic reaction were obtained by Treloar (1974). The number of counts per minute of radioactive product from the reaction was measured as a function of substrate concentration in parts per million (ppm) and from these counts the initial rate (or velocity) of the reaction was calculated (counts/min/min). The experiment was conducted once with the enzyme treated with Puromycin, and once with the enzyme untreated.

Source:

First 5 rows of the dataset:

##    conc  rate    state
## 1  0.02    76  treated
## 2  0.02    47  treated
## 3  0.06    97  treated
## 4  0.06   107  treated
## 5  0.11   123  treated
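Because the rate is recorded more than once at some substrate concentrations, a natural first step is to average the replicates at each `conc`. The sketch below does this for the five rows shown above only (all from the treated group), so the numbers are illustrative rather than a summary of the full 23-row data frame.

```python
from collections import defaultdict

# The five rows shown above: (conc, rate), all with state == "treated".
rows = [(0.02, 76), (0.02, 47), (0.06, 97), (0.06, 107), (0.11, 123)]

# Collect the replicate rates observed at each concentration.
by_conc = defaultdict(list)
for conc, rate in rows:
    by_conc[conc].append(rate)

# Average the replicates; a concentration with one row keeps that value.
mean_rate = {conc: sum(rates) / len(rates) for conc, rates in by_conc.items()}
print(mean_rate)
```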


Titanic

PyDataset Documentation (adopted from R Documentation):

Survival of passengers on the Titanic

This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.

A 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables. The variables and their levels are as follows:

  1. Class: 1st, 2nd, 3rd, Crew
  2. Sex: Male, Female
  3. Age: Child, Adult
  4. Survived: No, Yes

The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts—from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class—are reflected in the survival rates for various classes of passenger.

These data were originally collected by the British Board of Trade in their investigation of the sinking. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.

Due in particular to the very successful film ‘Titanic’, recent years have seen a rise in public interest in the Titanic. Very detailed data about the passengers are now available on the Internet, at sites such as Encyclopedia Titanica (http://www.rmplc.co.uk/eduweb/sites/phind).

Source:

The source provides a data set recording class, sex, age, and survival status for each person on board the Titanic, and is based on data originally collected by the British Board of Trade and reprinted in:

First 5 rows of the dataset:

##   Class     Sex    Age Survived  Freq
## 1   1st    Male  Child       No     0
## 2   2nd    Male  Child       No     0
## 3   3rd    Male  Child       No    35
## 4  Crew    Male  Child       No     0
## 5   1st  Female  Child       No     0
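The rows above are the 4-dimensional array flattened into one row per cell, with `Class` changing fastest and `Sex` next. As a sketch, assuming the same pattern continues for `Age` and `Survived` (only the first five rows are visible here), the full list of cells can be enumerated as follows.

```python
from itertools import product

# Variables and levels as listed in the description above.
levels = {
    "Class": ["1st", "2nd", "3rd", "Crew"],
    "Sex": ["Male", "Female"],
    "Age": ["Child", "Adult"],
    "Survived": ["No", "Yes"],
}

names = list(levels)
# itertools.product varies its LAST argument fastest, so pass the variables
# in reverse order and flip each tuple back to (Class, Sex, Age, Survived).
cells = [
    combo[::-1]
    for combo in product(*(levels[name] for name in reversed(names)))
]

print(len(cells))
for cell in cells[:5]:
    print(cell)
```

The count is 4 x 2 x 2 x 2 = 32 cells, and the `Freq` values over all cells add up to the 2201 observations mentioned above.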


ToothGrowth

PyDataset Documentation (adopted from R Documentation):

The Effect of Vitamin C on Tooth Growth in Guinea Pigs

The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).

A data frame with 60 observations on 3 variables.
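The 60 observations follow directly from the design: two delivery methods crossed with three dose levels, with 10 measurements in each combination. A short sketch of that count is below. The code `"VC"` appears in the table; `"OJ"` for orange juice is assumed here because it does not appear in the rows shown.

```python
from itertools import product

supps = ["VC", "OJ"]     # ascorbic acid, orange juice ("OJ" is an assumed code)
doses = [0.5, 1.0, 2.0]  # mg of Vitamin C
per_cell = 10            # measurements per (supp, dose) combination

# Every (delivery method, dose) combination in the experiment.
design = list(product(supps, doses))
n_obs = len(design) * per_cell
print(design)
print(n_obs)
```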

Source:

References:

First 5 rows of the dataset:

##     len supp  dose
## 1   4.2   VC   0.5
## 2  11.5   VC   0.5
## 3   7.3   VC   0.5
## 4   5.8   VC   0.5
## 5   6.4   VC   0.5


UCBAdmissions

PyDataset Documentation (adopted from R Documentation):

Student Admissions at UC Berkeley

Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

A 3-dimensional array resulting from cross-tabulating 4526 observations on 3 variables. The variables and their levels are as follows:

  1. Admit, factor: Admission status (Admitted or Rejected)
  2. Gender, factor: Male or Female
  3. Dept, factor: Department (A, B, C, D, E, F)
  4. Freq, numeric: Number of applicants with this combination of factors

This data set is frequently used for illustrating Simpson’s paradox; see Bickel et al. (1975). At issue is whether the data show evidence of sex bias in admission practices. There were 2691 male applicants, of whom 1198 (44.5%) were admitted, compared with 1835 female applicants, of whom 557 (30.4%) were admitted. This gives a sample odds ratio of about 1.84, indicating that the odds of admission for males were almost twice those for females. In fact, graphical methods or log-linear modelling show that the apparent association between admission and sex stems from differences in the tendency of males and females to apply to the individual departments (females tended to apply more to departments with higher rejection rates).
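The aggregate admission rates and the sample odds ratio can be recomputed from the four counts quoted in the paragraph above. This is a minimal sketch using only those counts; the helper `odds` is a hypothetical name. From the unrounded counts the odds ratio comes to roughly 1.84.

```python
# Aggregate counts quoted in the paragraph above.
male_total, male_admitted = 2691, 1198
female_total, female_admitted = 1835, 557


def odds(admitted, total):
    """Odds of admission: admitted applicants per rejected applicant."""
    return admitted / (total - admitted)


male_rate = male_admitted / male_total
female_rate = female_admitted / female_total
odds_ratio = odds(male_admitted, male_total) / odds(female_admitted, female_total)

print(f"male rate:   {male_rate:.1%}")
print(f"female rate: {female_rate:.1%}")
print(f"odds ratio:  {odds_ratio:.2f}")
```

Note that an odds ratio compares odds, not probabilities: the ratio of the two admission rates themselves is smaller, about 1.47.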

This data set can also be used for illustrating methods for graphical display of categorical data, such as the general-purpose mosaic plot or the fourfold display for 2-by-2-by-k tables. See the home page of Michael Friendly (http://www.math.yorku.ca/SCS/friendly.html) for further information.

References:

First 5 rows of the dataset:

##       Admit  Gender Dept  Freq
## 1  Admitted    Male    A   512
## 2  Rejected    Male    A   313
## 3  Admitted  Female    A    89
## 4  Rejected  Female    A    19
## 5  Admitted    Male    B   353
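The first four rows above cover Department A completely, which is enough to see the reversal behind Simpson’s paradox in one department. The sketch below uses only those four rows; `admission_rate` is a hypothetical helper. Within Department A the female admission rate is higher than the male one, even though the aggregate rates point the other way.

```python
# Department A counts from the first four rows shown above.
dept_a = {
    ("Admitted", "Male"): 512,
    ("Rejected", "Male"): 313,
    ("Admitted", "Female"): 89,
    ("Rejected", "Female"): 19,
}


def admission_rate(gender):
    """Share of Department A applicants of the given gender who were admitted."""
    admitted = dept_a[("Admitted", gender)]
    rejected = dept_a[("Rejected", gender)]
    return admitted / (admitted + rejected)


for gender in ("Male", "Female"):
    print(f"{gender}: {admission_rate(gender):.1%}")
```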


UKDriverDeaths

PyDataset Documentation (adopted from R Documentation):

Road Casualties in Great Britain 1969–84

UKDriverDeaths is a time series giving the monthly totals of car drivers in Great Britain killed or seriously injured from Jan 1969 to Dec 1984. Compulsory wearing of seat belts was introduced on 31 Jan 1983.

Seatbelts provides more information on the same problem.

Seatbelts is a multiple time series, with columns

Sources:

Reference:

First 5 rows of the dataset:

##           time  UKDriverDeaths
## 1  1969.000000            1687
## 2  1969.083333            1508
## 3  1969.166667            1507
## 4  1969.250000            1385
## 5  1969.333333            1632


UKgas

PyDataset Documentation (adopted from R Documentation):

UK Quarterly Gas Consumption

Quarterly UK gas consumption from 1960Q1 to 1986Q4, in millions of therms.

A quarterly time series of length 108.

Source:

First 5 rows of the dataset:

##       time  UKgas
## 1  1960.00  160.1
## 2  1960.25  129.7
## 3  1960.50   84.8
## 4  1960.75  120.1
## 5  1961.00  160.1
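For quarterly series the `time` column advances in steps of 0.25, so a decimal stamp maps directly to a year and quarter. The sketch below, which uses a hypothetical helper `decimal_year_to_quarter`, relabels the five rows shown above and checks that 1960Q1 to 1986Q4 does span 108 quarters.

```python
def decimal_year_to_quarter(t):
    """Turn a decimal-year stamp such as 1960.25 into a label like '1960Q2'."""
    year = int(t)
    # Each quarter spans 1/4 of a year: .00 -> Q1, .25 -> Q2, .50 -> Q3, .75 -> Q4.
    quarter = round((t - year) * 4) + 1
    return f"{year}Q{quarter}"


# The five time stamps shown in the table above.
times = [1960.00, 1960.25, 1960.50, 1960.75, 1961.00]
labels = [decimal_year_to_quarter(t) for t in times]
print(labels)

# 1960 through 1986 inclusive is 27 years of four quarters each.
n_quarters = (1986 - 1960 + 1) * 4
print(n_quarters)
```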


USAccDeaths

PyDataset Documentation (adopted from R Documentation):

Accidental Deaths in the US 1973–1978

A time series giving the monthly totals of accidental deaths in the USA. The values for the first six months of 1979 are 7798 7406 8363 8460 9217 9316.

First 5 rows of the dataset:

##           time  USAccDeaths
## 1  1973.000000         9007
## 2  1973.083333         8106
## 3  1973.166667         8928
## 4  1973.250000         9137
## 5  1973.333333        10017


USArrests

PyDataset Documentation (adopted from R Documentation):

Violent Crime Rates by US State

This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

A data frame with 50 observations on 4 variables:

Sources:

Reference:

First 5 rows of the dataset:

##             Murder  Assault  UrbanPop  Rape
## Alabama       13.2      236        58  21.2
## Alaska        10.0      263        48  44.5
## Arizona        8.1      294        80  31.0
## Arkansas       8.8      190        50  19.5
## California     9.0      276        91  40.6

tips

PyDataset Documentation (adopted from R Documentation):

Tipping data

One waiter recorded information about each tip he received over a period of a few months working in one restaurant.

A data frame with 244 rows and 7 variables:

In all he recorded 244 tips. The data was reported in a collection of case studies for business statistics (Bryant & Smith 1995).

Reference:

First 5 rows of the dataset:

##    total_bill   tip     sex smoker  day    time  size
## 1       16.99  1.01  Female     No  Sun  Dinner     2
## 2       10.34  1.66    Male     No  Sun  Dinner     3
## 3       21.01  3.50    Male     No  Sun  Dinner     3
## 4       23.68  3.31    Male     No  Sun  Dinner     2
## 5       24.59  3.61  Female     No  Sun  Dinner     4

