⇦ Back

1 Example Data

Here are the final log standings from the 2018-19 and 2019-20 English Premier League seasons, sorted alphabetically by team:

EPL_2018_19 <- data.frame(
    "Team" = c(
        "Arsenal", "Bournemouth", "Brighton & Hove Albion", "Burnley", "Cardiff City", "Chelsea", "Crystal Palace",
        "Everton", "Fulham", "Huddersfield Town", "Leicester City", "Liverpool", "Manchester City", "Manchester United",
        "Newcastle United", "Southampton", "Tottenham Hotspur", "Watford", "West Ham United", "Wolverhampton Wanderers"
    ),
    "Points" = c(70, 45, 36, 40, 34, 72, 49, 54, 26, 16, 52, 97, 98, 66, 45, 39, 71, 50, 52, 57)
)
EPL_2019_20 <- data.frame(
    "Team" = c(
        "Arsenal", "Aston Villa", "Bournemouth", "Brighton & Hove Albion", "Burnley", "Chelsea", "Crystal Palace",
        "Everton", "Leicester City", "Liverpool", "Manchester City", "Manchester United", "Newcastle United",
        "Norwich City", "Sheffield United", "Southampton", "Tottenham Hotspur", "Watford", "West Ham United",
        "Wolverhampton Wanderers"
    ),
    "Points" = c(56, 35, 34, 41, 54, 66, 43, 49, 62, 99, 81, 66, 44, 21, 54, 52, 59, 34, 39, 59)
)

library(kableExtra)

kable_input <- kable(EPL_2018_19)
kable_styling(
    kable_input, bootstrap_options = c("striped", "scale_down", "condensed"), full_width = F, font_size = 11,
    position = "float_left"
)
kable_input <- kable(EPL_2019_20)
kable_styling(
    kable_input, bootstrap_options = c("striped", "scale_down", "condensed"), full_width = F, font_size = 11,
    position = "center"
)
Team Points
Arsenal 70
Bournemouth 45
Brighton & Hove Albion 36
Burnley 40
Cardiff City 34
Chelsea 72
Crystal Palace 49
Everton 54
Fulham 26
Huddersfield Town 16
Leicester City 52
Liverpool 97
Manchester City 98
Manchester United 66
Newcastle United 45
Southampton 39
Tottenham Hotspur 71
Watford 50
West Ham United 52
Wolverhampton Wanderers 57
Team Points
Arsenal 56
Aston Villa 35
Bournemouth 34
Brighton & Hove Albion 41
Burnley 54
Chelsea 66
Crystal Palace 43
Everton 49
Leicester City 62
Liverpool 99
Manchester City 81
Manchester United 66
Newcastle United 44
Norwich City 21
Sheffield United 54
Southampton 52
Tottenham Hotspur 59
Watford 34
West Ham United 39
Wolverhampton Wanderers 59

2 Concatenation

Concatenating two data frames is actually not the same as merging them, it’s simply combining them by sticking one next to the other:

Append a data frame to the bottom of another with the rbind() (row bind) function:

total <- rbind(EPL_2018_19, EPL_2019_20)
print(total)
##                       Team Points
## 1                  Arsenal     70
## 2              Bournemouth     45
## 3   Brighton & Hove Albion     36
## 4                  Burnley     40
## 5             Cardiff City     34
## 6                  Chelsea     72
## 7           Crystal Palace     49
## 8                  Everton     54
## 9                   Fulham     26
## 10       Huddersfield Town     16
## 11          Leicester City     52
## 12               Liverpool     97
## 13         Manchester City     98
## 14       Manchester United     66
## 15        Newcastle United     45
## 16             Southampton     39
## 17       Tottenham Hotspur     71
## 18                 Watford     50
## 19         West Ham United     52
## 20 Wolverhampton Wanderers     57
## 21                 Arsenal     56
## 22             Aston Villa     35
## 23             Bournemouth     34
## 24  Brighton & Hove Albion     41
## 25                 Burnley     54
## 26                 Chelsea     66
## 27          Crystal Palace     43
## 28                 Everton     49
## 29          Leicester City     62
## 30               Liverpool     99
## 31         Manchester City     81
## 32       Manchester United     66
## 33        Newcastle United     44
## 34            Norwich City     21
## 35        Sheffield United     54
## 36             Southampton     52
## 37       Tottenham Hotspur     59
## 38                 Watford     34
## 39         West Ham United     39
## 40 Wolverhampton Wanderers     59

Append a data frame to the side of another with the cbind() (column bind) function:

total <- cbind(EPL_2018_19, EPL_2019_20)
print(total)
##                       Team Points                    Team Points
## 1                  Arsenal     70                 Arsenal     56
## 2              Bournemouth     45             Aston Villa     35
## 3   Brighton & Hove Albion     36             Bournemouth     34
## 4                  Burnley     40  Brighton & Hove Albion     41
## 5             Cardiff City     34                 Burnley     54
## 6                  Chelsea     72                 Chelsea     66
## 7           Crystal Palace     49          Crystal Palace     43
## 8                  Everton     54                 Everton     49
## 9                   Fulham     26          Leicester City     62
## 10       Huddersfield Town     16               Liverpool     99
## 11          Leicester City     52         Manchester City     81
## 12               Liverpool     97       Manchester United     66
## 13         Manchester City     98        Newcastle United     44
## 14       Manchester United     66            Norwich City     21
## 15        Newcastle United     45        Sheffield United     54
## 16             Southampton     39             Southampton     52
## 17       Tottenham Hotspur     71       Tottenham Hotspur     59
## 18                 Watford     50                 Watford     34
## 19         West Ham United     52         West Ham United     39
## 20 Wolverhampton Wanderers     57 Wolverhampton Wanderers     59

3 Merging

There are four types of merging:

  • Keep all rows in both data frames (don’t delete any data)
  • Keep all rows in the left data frame (but delete those in the right data frame that don’t appear in the left)
  • Keep all rows in the right data frame (but delete those in the left data frame that don’t appear in the right)
  • Don’t keep all rows, only keep those that are common to both data frames


(1) Keep all rows in both data frames, (2) keep all rows in the left data frame but only those in the right that are also in the left, (3) keep all rows in the right data frame but only those in the left that are also in the right, (4) keep only the rows that appear in both data frames

3.1 Keep all rows in both data frames

Merge the data from both seasons by team name:

outer <- merge(EPL_2018_19, EPL_2019_20, by = "Team", all = TRUE)
print(outer)
##                       Team Points.x Points.y
## 1                  Arsenal       70       56
## 2              Aston Villa       NA       35
## 3              Bournemouth       45       34
## 4   Brighton & Hove Albion       36       41
## 5                  Burnley       40       54
## 6             Cardiff City       34       NA
## 7                  Chelsea       72       66
## 8           Crystal Palace       49       43
## 9                  Everton       54       49
## 10                  Fulham       26       NA
## 11       Huddersfield Town       16       NA
## 12          Leicester City       52       62
## 13               Liverpool       97       99
## 14         Manchester City       98       81
## 15       Manchester United       66       66
## 16        Newcastle United       45       44
## 17            Norwich City       NA       21
## 18        Sheffield United       NA       54
## 19             Southampton       39       52
## 20       Tottenham Hotspur       71       59
## 21                 Watford       50       34
## 22         West Ham United       52       39
## 23 Wolverhampton Wanderers       57       59

As you can see, the teams that were in the league for both seasons (eg Arsenal and Bournemouth) have values in the “Points” columns for both seasons, whereas teams such as Aston Villa and Cardiff City which were promoted/relegated after the first season only have values for one season. For the other season there is an “NA” (not available). That is why there are 23 rows in the data frame when there are only 20 teams in the Premier League in any one season.

3.2 Keep all rows in the left data frame

To see the points for all the teams in the 2018-19 season and the season after:

left <- merge(EPL_2018_19, EPL_2019_20, by = "Team", all.x = TRUE)
print(left)
##                       Team Points.x Points.y
## 1                  Arsenal       70       56
## 2              Bournemouth       45       34
## 3   Brighton & Hove Albion       36       41
## 4                  Burnley       40       54
## 5             Cardiff City       34       NA
## 6                  Chelsea       72       66
## 7           Crystal Palace       49       43
## 8                  Everton       54       49
## 9                   Fulham       26       NA
## 10       Huddersfield Town       16       NA
## 11          Leicester City       52       62
## 12               Liverpool       97       99
## 13         Manchester City       98       81
## 14       Manchester United       66       66
## 15        Newcastle United       45       44
## 16             Southampton       39       52
## 17       Tottenham Hotspur       71       59
## 18                 Watford       50       34
## 19         West Ham United       52       39
## 20 Wolverhampton Wanderers       57       59

The three teams that got promoted into the Premier League are not shown here as their names do not appear in the left data frame.

3.3 Keep all rows in the right data frame

To see the points for all the teams in the 2019-20 season and the season before:

right <- merge(EPL_2018_19, EPL_2019_20, by = "Team", all.y = TRUE)
print(right)
##                       Team Points.x Points.y
## 1                  Arsenal       70       56
## 2              Aston Villa       NA       35
## 3              Bournemouth       45       34
## 4   Brighton & Hove Albion       36       41
## 5                  Burnley       40       54
## 6                  Chelsea       72       66
## 7           Crystal Palace       49       43
## 8                  Everton       54       49
## 9           Leicester City       52       62
## 10               Liverpool       97       99
## 11         Manchester City       98       81
## 12       Manchester United       66       66
## 13        Newcastle United       45       44
## 14            Norwich City       NA       21
## 15        Sheffield United       NA       54
## 16             Southampton       39       52
## 17       Tottenham Hotspur       71       59
## 18                 Watford       50       34
## 19         West Ham United       52       39
## 20 Wolverhampton Wanderers       57       59

The three teams that got relegated from the Premier League are not shown here as their names do not appear in the right data frame.

3.4 Keep only the rows that appear in both data frames

To see the points for all the teams that were in the Premier League for both seasons:

inner <- merge(EPL_2018_19, EPL_2019_20, by = "Team")
print(inner)
##                       Team Points.x Points.y
## 1                  Arsenal       70       56
## 2              Bournemouth       45       34
## 3   Brighton & Hove Albion       36       41
## 4                  Burnley       40       54
## 5                  Chelsea       72       66
## 6           Crystal Palace       49       43
## 7                  Everton       54       49
## 8           Leicester City       52       62
## 9                Liverpool       97       99
## 10         Manchester City       98       81
## 11       Manchester United       66       66
## 12        Newcastle United       45       44
## 13             Southampton       39       52
## 14       Tottenham Hotspur       71       59
## 15                 Watford       50       34
## 16         West Ham United       52       39
## 17 Wolverhampton Wanderers       57       59

The teams that got promoted and relegated from the Premier League are not shown here as their names do not appear in both data frames. As a results only 17 rows remain. This is the default behaviour of the merge() function.

4 Merge on Common Columns

In the previous four examples the two data frames only had one column in common - team name - and so that was the column we merged on. However, if we have data frames with two (or more) columns in common and we merge on just one of those columns we get duplication of the other(s):

EPL_2020_21 <- data.frame(
    team = c(
        "Manchester City", "Manchester United", "Liverpool", "Chelsea", "Leicester City", "West Ham United",
        "Tottenham Hotspur", "Arsenal", "Leeds United", "Everton", "Aston Villa", "Newcastle United",
        "Wolverhampton Wanderers", "Crystal Palace", "Southampton", "Brighton & Hove Albion", "Burnley",
        "Fulham", "West Bromwich Albion", "Sheffield United"
    ),
    location = c(
        "Manchester", "Manchester", "Liverpool", "London", "Leicester", "London", "London", "London",
        "Leeds", "Liverpool", "Birmingham", "Newcastle upon Tyne", "Wolverhampton", "London", "Southampton",
        "Falmer", "Burnley", "London", "West Bromwich", "Sheffield"
    ),
    points_2020_21 = c(
        86, 74, 69, 67, 66, 65, 62, 61, 59, 59, 55, 45, 45, 44, 43, 41, 39, 28, 26, 23
    )
)

EPL_2021_22 <- data.frame(
    team = c(
        "Manchester City", "Liverpool", "Chelsea", "Tottenham Hotspur", "Arsenal", "Manchester United",
        "West Ham United", "Leicester City", "Brighton & Hove Albion", "Wolverhampton Wanderers",
        "Newcastle United", "Crystal Palace", "Brentford", "Aston Villa", "Southampton", "Everton",
        "Leeds United", "Burnley", "Watford", "Norwich City"
    ),
    location = c(
        "Manchester", "Liverpool", "London", "London", "London", "Manchester", "London", "Leicester", "Falmer",
        "Wolverhampton", "Newcastle upon Tyne", "London", "London", "Birmingham", "Southampton", "Liverpool",
        "Leeds", "Burnley", "Watford", "Norwich"
    ),
    points_2021_22 = c(
        93, 92, 74, 71, 69, 58, 56, 52, 51, 51, 49, 48, 46, 45, 40, 39, 38, 35, 23, 22
    )
)

df <- merge(EPL_2020_21, EPL_2021_22, by = "team", all = T)
print(df)
##                       team          location.x points_2020_21          location.y points_2021_22
## 1                  Arsenal              London             61              London             69
## 2              Aston Villa          Birmingham             55          Birmingham             45
## 3                Brentford                <NA>             NA              London             46
## 4   Brighton & Hove Albion              Falmer             41              Falmer             51
## 5                  Burnley             Burnley             39             Burnley             35
## 6                  Chelsea              London             67              London             74
## 7           Crystal Palace              London             44              London             48
## 8                  Everton           Liverpool             59           Liverpool             39
## 9                   Fulham              London             28                <NA>             NA
## 10            Leeds United               Leeds             59               Leeds             38
## 11          Leicester City           Leicester             66           Leicester             52
## 12               Liverpool           Liverpool             69           Liverpool             92
## 13         Manchester City          Manchester             86          Manchester             93
## 14       Manchester United          Manchester             74          Manchester             58
## 15        Newcastle United Newcastle upon Tyne             45 Newcastle upon Tyne             49
## 16            Norwich City                <NA>             NA             Norwich             22
## 17        Sheffield United           Sheffield             23                <NA>             NA
## 18             Southampton         Southampton             43         Southampton             40
## 19       Tottenham Hotspur              London             62              London             71
## 20                 Watford                <NA>             NA             Watford             23
## 21    West Bromwich Albion       West Bromwich             26                <NA>             NA
## 22         West Ham United              London             65              London             56
## 23 Wolverhampton Wanderers       Wolverhampton             45       Wolverhampton             51

You can see what has gone wrong here: the ‘location’ information has effectively been duplicated and columns ‘location.x’ and ‘location.y’ have been created. To fix this we can merge on the common columns manually by listing them in a vector that gets given to the by keyword argument:

df <- merge(EPL_2020_21, EPL_2021_22, by = c("team", "location"), all = T)
print(df)
##                       team            location points_2020_21 points_2021_22
## 1                  Arsenal              London             61             69
## 2              Aston Villa          Birmingham             55             45
## 3                Brentford              London             NA             46
## 4   Brighton & Hove Albion              Falmer             41             51
## 5                  Burnley             Burnley             39             35
## 6                  Chelsea              London             67             74
## 7           Crystal Palace              London             44             48
## 8                  Everton           Liverpool             59             39
## 9                   Fulham              London             28             NA
## 10            Leeds United               Leeds             59             38
## 11          Leicester City           Leicester             66             52
## 12               Liverpool           Liverpool             69             92
## 13         Manchester City          Manchester             86             93
## 14       Manchester United          Manchester             74             58
## 15        Newcastle United Newcastle upon Tyne             45             49
## 16            Norwich City             Norwich             NA             22
## 17        Sheffield United           Sheffield             23             NA
## 18             Southampton         Southampton             43             40
## 19       Tottenham Hotspur              London             62             71
## 20                 Watford             Watford             NA             23
## 21    West Bromwich Albion       West Bromwich             26             NA
## 22         West Ham United              London             65             56
## 23 Wolverhampton Wanderers       Wolverhampton             45             51

…or, we can find these common columns automatically. To do this, use the intersect() function on the vectors of the column names of the two data frames to get the intersection of these two sets:

print(intersect(colnames(EPL_2020_21), colnames(EPL_2021_22)))
## [1] "team"     "location"

The merging step can now look like this:

df <- merge(EPL_2020_21, EPL_2021_22, by = intersect(colnames(EPL_2020_21), colnames(EPL_2021_22)), all = T)
print(df)
##                       team            location points_2020_21 points_2021_22
## 1                  Arsenal              London             61             69
## 2              Aston Villa          Birmingham             55             45
## 3                Brentford              London             NA             46
## 4   Brighton & Hove Albion              Falmer             41             51
## 5                  Burnley             Burnley             39             35
## 6                  Chelsea              London             67             74
## 7           Crystal Palace              London             44             48
## 8                  Everton           Liverpool             59             39
## 9                   Fulham              London             28             NA
## 10            Leeds United               Leeds             59             38
## 11          Leicester City           Leicester             66             52
## 12               Liverpool           Liverpool             69             92
## 13         Manchester City          Manchester             86             93
## 14       Manchester United          Manchester             74             58
## 15        Newcastle United Newcastle upon Tyne             45             49
## 16            Norwich City             Norwich             NA             22
## 17        Sheffield United           Sheffield             23             NA
## 18             Southampton         Southampton             43             40
## 19       Tottenham Hotspur              London             62             71
## 20                 Watford             Watford             NA             23
## 21    West Bromwich Albion       West Bromwich             26             NA
## 22         West Ham United              London             65             56
## 23 Wolverhampton Wanderers       Wolverhampton             45             51

Note that this intersect() function only works as expected on vectors, not on the data frames themselves:

print(intersect(EPL_2020_21, EPL_2021_22))
## data frame with 0 columns and 0 rows

We didn’t get anything useful from that!

⇦ Back