Here are the final log standings from the 2018-19 and 2019-20 English Premier League seasons, sorted alphabetically by team:
EPL_2018_19 <- data.frame(
"Team" = c(
"Arsenal", "Bournemouth", "Brighton & Hove Albion", "Burnley", "Cardiff City", "Chelsea", "Crystal Palace",
"Everton", "Fulham", "Huddersfield Town", "Leicester City", "Liverpool", "Manchester City", "Manchester United",
"Newcastle United", "Southampton", "Tottenham Hotspur", "Watford", "West Ham United", "Wolverhampton Wanderers"
),
"Points" = c(70, 45, 36, 40, 34, 72, 49, 54, 26, 16, 52, 97, 98, 66, 45, 39, 71, 50, 52, 57)
)
EPL_2019_20 <- data.frame(
"Team" = c(
"Arsenal", "Aston Villa", "Bournemouth", "Brighton & Hove Albion", "Burnley", "Chelsea", "Crystal Palace",
"Everton", "Leicester City", "Liverpool", "Manchester City", "Manchester United", "Newcastle United",
"Norwich City", "Sheffield United", "Southampton", "Tottenham Hotspur", "Watford", "West Ham United",
"Wolverhampton Wanderers"
),
"Points" = c(56, 35, 34, 41, 54, 66, 43, 49, 62, 99, 81, 66, 44, 21, 54, 52, 59, 34, 39, 59)
)
library(kableExtra)
kable_input <- kable(EPL_2018_19)
kable_styling(
kable_input, bootstrap_options = c("striped", "scale_down", "condensed"), full_width = F, font_size = 11,
position = "float_left"
)
kable_input <- kable(EPL_2019_20)
kable_styling(
kable_input, bootstrap_options = c("striped", "scale_down", "condensed"), full_width = F, font_size = 11,
position = "center"
)
Team | Points |
---|---|
Arsenal | 70 |
Bournemouth | 45 |
Brighton & Hove Albion | 36 |
Burnley | 40 |
Cardiff City | 34 |
Chelsea | 72 |
Crystal Palace | 49 |
Everton | 54 |
Fulham | 26 |
Huddersfield Town | 16 |
Leicester City | 52 |
Liverpool | 97 |
Manchester City | 98 |
Manchester United | 66 |
Newcastle United | 45 |
Southampton | 39 |
Tottenham Hotspur | 71 |
Watford | 50 |
West Ham United | 52 |
Wolverhampton Wanderers | 57 |
Team | Points |
---|---|
Arsenal | 56 |
Aston Villa | 35 |
Bournemouth | 34 |
Brighton & Hove Albion | 41 |
Burnley | 54 |
Chelsea | 66 |
Crystal Palace | 43 |
Everton | 49 |
Leicester City | 62 |
Liverpool | 99 |
Manchester City | 81 |
Manchester United | 66 |
Newcastle United | 44 |
Norwich City | 21 |
Sheffield United | 54 |
Southampton | 52 |
Tottenham Hotspur | 59 |
Watford | 34 |
West Ham United | 39 |
Wolverhampton Wanderers | 59 |
Concatenating two data frames is actually not the same as merging them, it’s simply combining them by sticking one next to the other:
Append a data frame to the bottom of another with the rbind()
(row bind) function:
total <- rbind(EPL_2018_19, EPL_2019_20)
print(total)
## Team Points
## 1 Arsenal 70
## 2 Bournemouth 45
## 3 Brighton & Hove Albion 36
## 4 Burnley 40
## 5 Cardiff City 34
## 6 Chelsea 72
## 7 Crystal Palace 49
## 8 Everton 54
## 9 Fulham 26
## 10 Huddersfield Town 16
## 11 Leicester City 52
## 12 Liverpool 97
## 13 Manchester City 98
## 14 Manchester United 66
## 15 Newcastle United 45
## 16 Southampton 39
## 17 Tottenham Hotspur 71
## 18 Watford 50
## 19 West Ham United 52
## 20 Wolverhampton Wanderers 57
## 21 Arsenal 56
## 22 Aston Villa 35
## 23 Bournemouth 34
## 24 Brighton & Hove Albion 41
## 25 Burnley 54
## 26 Chelsea 66
## 27 Crystal Palace 43
## 28 Everton 49
## 29 Leicester City 62
## 30 Liverpool 99
## 31 Manchester City 81
## 32 Manchester United 66
## 33 Newcastle United 44
## 34 Norwich City 21
## 35 Sheffield United 54
## 36 Southampton 52
## 37 Tottenham Hotspur 59
## 38 Watford 34
## 39 West Ham United 39
## 40 Wolverhampton Wanderers 59
Append a data frame to the side of another with the cbind()
(column bind) function:
total <- cbind(EPL_2018_19, EPL_2019_20)
print(total)
## Team Points Team Points
## 1 Arsenal 70 Arsenal 56
## 2 Bournemouth 45 Aston Villa 35
## 3 Brighton & Hove Albion 36 Bournemouth 34
## 4 Burnley 40 Brighton & Hove Albion 41
## 5 Cardiff City 34 Burnley 54
## 6 Chelsea 72 Chelsea 66
## 7 Crystal Palace 49 Crystal Palace 43
## 8 Everton 54 Everton 49
## 9 Fulham 26 Leicester City 62
## 10 Huddersfield Town 16 Liverpool 99
## 11 Leicester City 52 Manchester City 81
## 12 Liverpool 97 Manchester United 66
## 13 Manchester City 98 Newcastle United 44
## 14 Manchester United 66 Norwich City 21
## 15 Newcastle United 45 Sheffield United 54
## 16 Southampton 39 Southampton 52
## 17 Tottenham Hotspur 71 Tottenham Hotspur 59
## 18 Watford 50 Watford 34
## 19 West Ham United 52 West Ham United 39
## 20 Wolverhampton Wanderers 57 Wolverhampton Wanderers 59
There are four types of merging:
Merge the data from both seasons by team name:
outer <- merge(EPL_2018_19, EPL_2019_20, by = "Team", all = TRUE)
print(outer)
## Team Points.x Points.y
## 1 Arsenal 70 56
## 2 Aston Villa NA 35
## 3 Bournemouth 45 34
## 4 Brighton & Hove Albion 36 41
## 5 Burnley 40 54
## 6 Cardiff City 34 NA
## 7 Chelsea 72 66
## 8 Crystal Palace 49 43
## 9 Everton 54 49
## 10 Fulham 26 NA
## 11 Huddersfield Town 16 NA
## 12 Leicester City 52 62
## 13 Liverpool 97 99
## 14 Manchester City 98 81
## 15 Manchester United 66 66
## 16 Newcastle United 45 44
## 17 Norwich City NA 21
## 18 Sheffield United NA 54
## 19 Southampton 39 52
## 20 Tottenham Hotspur 71 59
## 21 Watford 50 34
## 22 West Ham United 52 39
## 23 Wolverhampton Wanderers 57 59
As you can see, the teams that were in the league for both seasons (eg Arsenal and Bournemouth) have values in the “Points” columns for both seasons, whereas teams such as Aston Villa and Cardiff City which were promoted/relegated after the first season only have values for one season. For the other season there is an “NA” (not available). That is why there are 23 rows in the data frame when there are only 20 teams in the Premier League in any one season.
To see the points for all the teams in the 2018-19 season and the season after:
left <- merge(EPL_2018_19, EPL_2019_20, by = "Team", all.x = TRUE)
print(left)
## Team Points.x Points.y
## 1 Arsenal 70 56
## 2 Bournemouth 45 34
## 3 Brighton & Hove Albion 36 41
## 4 Burnley 40 54
## 5 Cardiff City 34 NA
## 6 Chelsea 72 66
## 7 Crystal Palace 49 43
## 8 Everton 54 49
## 9 Fulham 26 NA
## 10 Huddersfield Town 16 NA
## 11 Leicester City 52 62
## 12 Liverpool 97 99
## 13 Manchester City 98 81
## 14 Manchester United 66 66
## 15 Newcastle United 45 44
## 16 Southampton 39 52
## 17 Tottenham Hotspur 71 59
## 18 Watford 50 34
## 19 West Ham United 52 39
## 20 Wolverhampton Wanderers 57 59
The three teams that got promoted into the Premier League are not shown here as their names do not appear in the left data frame.
To see the points for all the teams in the 2019-20 season and the season before:
right <- merge(EPL_2018_19, EPL_2019_20, by = "Team", all.y = TRUE)
print(right)
## Team Points.x Points.y
## 1 Arsenal 70 56
## 2 Aston Villa NA 35
## 3 Bournemouth 45 34
## 4 Brighton & Hove Albion 36 41
## 5 Burnley 40 54
## 6 Chelsea 72 66
## 7 Crystal Palace 49 43
## 8 Everton 54 49
## 9 Leicester City 52 62
## 10 Liverpool 97 99
## 11 Manchester City 98 81
## 12 Manchester United 66 66
## 13 Newcastle United 45 44
## 14 Norwich City NA 21
## 15 Sheffield United NA 54
## 16 Southampton 39 52
## 17 Tottenham Hotspur 71 59
## 18 Watford 50 34
## 19 West Ham United 52 39
## 20 Wolverhampton Wanderers 57 59
The three teams that got relegated from the Premier League are not shown here as their names do not appear in the right data frame.
To see the points for all the teams that were in the Premier League for both seasons:
inner <- merge(EPL_2018_19, EPL_2019_20, by = "Team")
print(inner)
## Team Points.x Points.y
## 1 Arsenal 70 56
## 2 Bournemouth 45 34
## 3 Brighton & Hove Albion 36 41
## 4 Burnley 40 54
## 5 Chelsea 72 66
## 6 Crystal Palace 49 43
## 7 Everton 54 49
## 8 Leicester City 52 62
## 9 Liverpool 97 99
## 10 Manchester City 98 81
## 11 Manchester United 66 66
## 12 Newcastle United 45 44
## 13 Southampton 39 52
## 14 Tottenham Hotspur 71 59
## 15 Watford 50 34
## 16 West Ham United 52 39
## 17 Wolverhampton Wanderers 57 59
The teams that got promoted and relegated from the Premier League are not shown here as their names do not appear in both data frames. As a results only 17 rows remain. This is the default behaviour of the merge()
function.
In the previous four examples the two data frames only had one column in common - team name - and so that was the column we merged on. However, if we have data frames with two (or more) columns in common and we merge on just one of those columns we get duplication of the other(s):
EPL_2020_21 <- data.frame(
team = c(
"Manchester City", "Manchester United", "Liverpool", "Chelsea", "Leicester City", "West Ham United",
"Tottenham Hotspur", "Arsenal", "Leeds United", "Everton", "Aston Villa", "Newcastle United",
"Wolverhampton Wanderers", "Crystal Palace", "Southampton", "Brighton & Hove Albion", "Burnley",
"Fulham", "West Bromwich Albion", "Sheffield United"
),
location = c(
"Manchester", "Manchester", "Liverpool", "London", "Leicester", "London", "London", "London",
"Leeds", "Liverpool", "Birmingham", "Newcastle upon Tyne", "Wolverhampton", "London", "Southampton",
"Falmer", "Burnley", "London", "West Bromwich", "Sheffield"
),
points_2020_21 = c(
86, 74, 69, 67, 66, 65, 62, 61, 59, 59, 55, 45, 45, 44, 43, 41, 39, 28, 26, 23
)
)
EPL_2021_22 <- data.frame(
team = c(
"Manchester City", "Liverpool", "Chelsea", "Tottenham Hotspur", "Arsenal", "Manchester United",
"West Ham United", "Leicester City", "Brighton & Hove Albion", "Wolverhampton Wanderers",
"Newcastle United", "Crystal Palace", "Brentford", "Aston Villa", "Southampton", "Everton",
"Leeds United", "Burnley", "Watford", "Norwich City"
),
location = c(
"Manchester", "Liverpool", "London", "London", "London", "Manchester", "London", "Leicester", "Falmer",
"Wolverhampton", "Newcastle upon Tyne", "London", "London", "Birmingham", "Southampton", "Liverpool",
"Leeds", "Burnley", "Watford", "Norwich"
),
points_2021_22 = c(
93, 92, 74, 71, 69, 58, 56, 52, 51, 51, 49, 48, 46, 45, 40, 39, 38, 35, 23, 22
)
)
df <- merge(EPL_2020_21, EPL_2021_22, by = "team", all = T)
print(df)
## team location.x points_2020_21 location.y points_2021_22
## 1 Arsenal London 61 London 69
## 2 Aston Villa Birmingham 55 Birmingham 45
## 3 Brentford <NA> NA London 46
## 4 Brighton & Hove Albion Falmer 41 Falmer 51
## 5 Burnley Burnley 39 Burnley 35
## 6 Chelsea London 67 London 74
## 7 Crystal Palace London 44 London 48
## 8 Everton Liverpool 59 Liverpool 39
## 9 Fulham London 28 <NA> NA
## 10 Leeds United Leeds 59 Leeds 38
## 11 Leicester City Leicester 66 Leicester 52
## 12 Liverpool Liverpool 69 Liverpool 92
## 13 Manchester City Manchester 86 Manchester 93
## 14 Manchester United Manchester 74 Manchester 58
## 15 Newcastle United Newcastle upon Tyne 45 Newcastle upon Tyne 49
## 16 Norwich City <NA> NA Norwich 22
## 17 Sheffield United Sheffield 23 <NA> NA
## 18 Southampton Southampton 43 Southampton 40
## 19 Tottenham Hotspur London 62 London 71
## 20 Watford <NA> NA Watford 23
## 21 West Bromwich Albion West Bromwich 26 <NA> NA
## 22 West Ham United London 65 London 56
## 23 Wolverhampton Wanderers Wolverhampton 45 Wolverhampton 51
You can see what has gone wrong here: the ‘location’ information has effectively been duplicated and columns ‘location.x’ and ‘location.y’ have been created. To fix this we can merge on the common columns manually by listing them in a vector that gets given to the by
keyword argument:
df <- merge(EPL_2020_21, EPL_2021_22, by = c("team", "location"), all = T)
print(df)
## team location points_2020_21 points_2021_22
## 1 Arsenal London 61 69
## 2 Aston Villa Birmingham 55 45
## 3 Brentford London NA 46
## 4 Brighton & Hove Albion Falmer 41 51
## 5 Burnley Burnley 39 35
## 6 Chelsea London 67 74
## 7 Crystal Palace London 44 48
## 8 Everton Liverpool 59 39
## 9 Fulham London 28 NA
## 10 Leeds United Leeds 59 38
## 11 Leicester City Leicester 66 52
## 12 Liverpool Liverpool 69 92
## 13 Manchester City Manchester 86 93
## 14 Manchester United Manchester 74 58
## 15 Newcastle United Newcastle upon Tyne 45 49
## 16 Norwich City Norwich NA 22
## 17 Sheffield United Sheffield 23 NA
## 18 Southampton Southampton 43 40
## 19 Tottenham Hotspur London 62 71
## 20 Watford Watford NA 23
## 21 West Bromwich Albion West Bromwich 26 NA
## 22 West Ham United London 65 56
## 23 Wolverhampton Wanderers Wolverhampton 45 51
…or, we can find these common columns automatically. To do this, use the intersect()
function on the vectors of the column names of the two data frames to get the intersection of these two sets:
print(intersect(colnames(EPL_2020_21), colnames(EPL_2021_22)))
## [1] "team" "location"
The merging step can now look like this:
df <- merge(EPL_2020_21, EPL_2021_22, by = intersect(colnames(EPL_2020_21), colnames(EPL_2021_22)), all = T)
print(df)
## team location points_2020_21 points_2021_22
## 1 Arsenal London 61 69
## 2 Aston Villa Birmingham 55 45
## 3 Brentford London NA 46
## 4 Brighton & Hove Albion Falmer 41 51
## 5 Burnley Burnley 39 35
## 6 Chelsea London 67 74
## 7 Crystal Palace London 44 48
## 8 Everton Liverpool 59 39
## 9 Fulham London 28 NA
## 10 Leeds United Leeds 59 38
## 11 Leicester City Leicester 66 52
## 12 Liverpool Liverpool 69 92
## 13 Manchester City Manchester 86 93
## 14 Manchester United Manchester 74 58
## 15 Newcastle United Newcastle upon Tyne 45 49
## 16 Norwich City Norwich NA 22
## 17 Sheffield United Sheffield 23 NA
## 18 Southampton Southampton 43 40
## 19 Tottenham Hotspur London 62 71
## 20 Watford Watford NA 23
## 21 West Bromwich Albion West Bromwich 26 NA
## 22 West Ham United London 65 56
## 23 Wolverhampton Wanderers Wolverhampton 45 51
Note that this intersect()
function only works as expected on vectors, not on the data frames themselves:
print(intersect(EPL_2020_21, EPL_2021_22))
## data frame with 0 columns and 0 rows
We didn’t get anything useful from that!