Here are the final log standings from the 2018-19 and 2019-20 English Premier League seasons, sorted alphabetically by team:
import pandas as pd
EPL_2018_19 = pd.DataFrame({
'Team': [
'Arsenal', 'Bournemouth', 'Brighton & Hove Albion', 'Burnley', 'Cardiff City', 'Chelsea', 'Crystal Palace',
'Everton', 'Fulham', 'Huddersfield Town', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester United',
'Newcastle United', 'Southampton', 'Tottenham Hotspur', 'Watford', 'West Ham United', 'Wolverhampton Wanderers'
],
'Points': [70, 45, 36, 40, 34, 72, 49, 54, 26, 16, 52, 97, 98, 66, 45, 39, 71, 50, 52, 57]
})
EPL_2019_20 = pd.DataFrame({
'Team': [
'Arsenal', 'Aston Villa', 'Bournemouth', 'Brighton & Hove Albion', 'Burnley', 'Chelsea', 'Crystal Palace',
'Everton', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester United', 'Newcastle United',
'Norwich City', 'Sheffield United', 'Southampton', 'Tottenham Hotspur', 'Watford', 'West Ham United',
'Wolverhampton Wanderers'
],
'Points': [56, 35, 34, 41, 54, 66, 43, 49, 62, 99, 81, 66, 44, 21, 54, 52, 59, 34, 39, 59]
})
Team | Points |
---|---|
Arsenal | 70 |
Bournemouth | 45 |
Brighton & Hove Albion | 36 |
Burnley | 40 |
Cardiff City | 34 |
Chelsea | 72 |
Crystal Palace | 49 |
Everton | 54 |
Fulham | 26 |
Huddersfield Town | 16 |
Leicester City | 52 |
Liverpool | 97 |
Manchester City | 98 |
Manchester United | 66 |
Newcastle United | 45 |
Southampton | 39 |
Tottenham Hotspur | 71 |
Watford | 50 |
West Ham United | 52 |
Wolverhampton Wanderers | 57 |
Team | Points |
---|---|
Arsenal | 56 |
Aston Villa | 35 |
Bournemouth | 34 |
Brighton & Hove Albion | 41 |
Burnley | 54 |
Chelsea | 66 |
Crystal Palace | 43 |
Everton | 49 |
Leicester City | 62 |
Liverpool | 99 |
Manchester City | 81 |
Manchester United | 66 |
Newcastle United | 44 |
Norwich City | 21 |
Sheffield United | 54 |
Southampton | 52 |
Tottenham Hotspur | 59 |
Watford | 34 |
West Ham United | 39 |
Wolverhampton Wanderers | 59 |
In the real world, you would probably use data stored in spreadsheets. These can be imported into Python as data frames using the following code:
# Import Excel spreadsheets as data frames
df_1 = pd.read_excel('First Spreadsheet.xlsx')
df_2 = pd.read_excel('Second Spreadsheet.xlsx')
Or, if you are using CSVs:
# Import CSVs as data frames
df_1 = pd.read_csv('First Spreadsheet.csv')
df_2 = pd.read_csv('Second Spreadsheet.csv')
Concatenating two data frames is actually not the same as merging them, it’s simply combining them by sticking one on top of the other. It’s done using the concat()
function:
total = pd.concat([EPL_2018_19, EPL_2019_20])
Which produces the following:
print(total)
## Team Points
## 0 Arsenal 70
## 1 Bournemouth 45
## 2 Brighton & Hove Albion 36
## 3 Burnley 40
## 4 Cardiff City 34
## 5 Chelsea 72
## 6 Crystal Palace 49
## 7 Everton 54
## 8 Fulham 26
## 9 Huddersfield Town 16
## 10 Leicester City 52
## 11 Liverpool 97
## 12 Manchester City 98
## 13 Manchester United 66
## 14 Newcastle United 45
## 15 Southampton 39
## 16 Tottenham Hotspur 71
## 17 Watford 50
## 18 West Ham United 52
## 19 Wolverhampton Wanderers 57
## 0 Arsenal 56
## 1 Aston Villa 35
## 2 Bournemouth 34
## 3 Brighton & Hove Albion 41
## 4 Burnley 54
## 5 Chelsea 66
## 6 Crystal Palace 43
## 7 Everton 49
## 8 Leicester City 62
## 9 Liverpool 99
## 10 Manchester City 81
## 11 Manchester United 66
## 12 Newcastle United 44
## 13 Norwich City 21
## 14 Sheffield United 54
## 15 Southampton 52
## 16 Tottenham Hotspur 59
## 17 Watford 34
## 18 West Ham United 39
## 19 Wolverhampton Wanderers 59
Append a data frame to the side of another with the concat()
function by setting the axis='columns'
or axis='1'
keyword argument (these mean the same thing):
total = pd.concat([EPL_2018_19, EPL_2019_20], axis='columns')
print(total)
## Team Points Team Points
## 0 Arsenal 70 Arsenal 56
## 1 Bournemouth 45 Aston Villa 35
## 2 Brighton & Hove Albion 36 Bournemouth 34
## 3 Burnley 40 Brighton & Hove Albion 41
## 4 Cardiff City 34 Burnley 54
## 5 Chelsea 72 Chelsea 66
## 6 Crystal Palace 49 Crystal Palace 43
## 7 Everton 54 Everton 49
## 8 Fulham 26 Leicester City 62
## 9 Huddersfield Town 16 Liverpool 99
## 10 Leicester City 52 Manchester City 81
## 11 Liverpool 97 Manchester United 66
## 12 Manchester City 98 Newcastle United 44
## 13 Manchester United 66 Norwich City 21
## 14 Newcastle United 45 Sheffield United 54
## 15 Southampton 39 Southampton 52
## 16 Tottenham Hotspur 71 Tottenham Hotspur 59
## 17 Watford 50 Watford 34
## 18 West Ham United 52 West Ham United 39
## 19 Wolverhampton Wanderers 57 Wolverhampton Wanderers 59
There are four types of merging:
Merge the data from both seasons by team name:
outer = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='outer')
print(outer)
## Team Points_x Points_y
## 0 Arsenal 70.0 56.0
## 1 Bournemouth 45.0 34.0
## 2 Brighton & Hove Albion 36.0 41.0
## 3 Burnley 40.0 54.0
## 4 Cardiff City 34.0 NaN
## 5 Chelsea 72.0 66.0
## 6 Crystal Palace 49.0 43.0
## 7 Everton 54.0 49.0
## 8 Fulham 26.0 NaN
## 9 Huddersfield Town 16.0 NaN
## 10 Leicester City 52.0 62.0
## 11 Liverpool 97.0 99.0
## 12 Manchester City 98.0 81.0
## 13 Manchester United 66.0 66.0
## 14 Newcastle United 45.0 44.0
## 15 Southampton 39.0 52.0
## 16 Tottenham Hotspur 71.0 59.0
## 17 Watford 50.0 34.0
## 18 West Ham United 52.0 39.0
## 19 Wolverhampton Wanderers 57.0 59.0
## 20 Aston Villa NaN 35.0
## 21 Norwich City NaN 21.0
## 22 Sheffield United NaN 54.0
As you can see, the teams that were in the league for both seasons (eg Arsenal and Bournemouth) have values in the “Points” columns for both seasons, whereas teams such as Aston Villa and Cardiff City which were promoted/relegated after the first season only have values for one season. For the other season there is an “NaN” (not a number). That is why there are 23 rows in the data frame when there are only 20 teams in the Premier League in any one season.
For all teams that were in the Premier League for the 2018-19 season, see their points for that season and the one afterwards:
left = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='left')
print(left)
## Team Points_x Points_y
## 0 Arsenal 70 56.0
## 1 Bournemouth 45 34.0
## 2 Brighton & Hove Albion 36 41.0
## 3 Burnley 40 54.0
## 4 Cardiff City 34 NaN
## 5 Chelsea 72 66.0
## 6 Crystal Palace 49 43.0
## 7 Everton 54 49.0
## 8 Fulham 26 NaN
## 9 Huddersfield Town 16 NaN
## 10 Leicester City 52 62.0
## 11 Liverpool 97 99.0
## 12 Manchester City 98 81.0
## 13 Manchester United 66 66.0
## 14 Newcastle United 45 44.0
## 15 Southampton 39 52.0
## 16 Tottenham Hotspur 71 59.0
## 17 Watford 50 34.0
## 18 West Ham United 52 39.0
## 19 Wolverhampton Wanderers 57 59.0
The three teams that got promoted into the Premier League are not shown here as their names do not appear in the left data frame.
For all teams that were in the Premier League for the 2019-20 season, see their points for that season and the previous one:
right = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='right')
print(right)
## Team Points_x Points_y
## 0 Arsenal 70.0 56
## 1 Aston Villa NaN 35
## 2 Bournemouth 45.0 34
## 3 Brighton & Hove Albion 36.0 41
## 4 Burnley 40.0 54
## 5 Chelsea 72.0 66
## 6 Crystal Palace 49.0 43
## 7 Everton 54.0 49
## 8 Leicester City 52.0 62
## 9 Liverpool 97.0 99
## 10 Manchester City 98.0 81
## 11 Manchester United 66.0 66
## 12 Newcastle United 45.0 44
## 13 Norwich City NaN 21
## 14 Sheffield United NaN 54
## 15 Southampton 39.0 52
## 16 Tottenham Hotspur 71.0 59
## 17 Watford 50.0 34
## 18 West Ham United 52.0 39
## 19 Wolverhampton Wanderers 57.0 59
The three teams that got relegated from the Premier League are not shown here as their names do not appear in the right data frame.
To see the points for all the teams that were in the Premier League for both seasons:
inner = pd.merge(EPL_2018_19, EPL_2019_20, on='Team')
print(inner)
## Team Points_x Points_y
## 0 Arsenal 70 56
## 1 Bournemouth 45 34
## 2 Brighton & Hove Albion 36 41
## 3 Burnley 40 54
## 4 Chelsea 72 66
## 5 Crystal Palace 49 43
## 6 Everton 54 49
## 7 Leicester City 52 62
## 8 Liverpool 97 99
## 9 Manchester City 98 81
## 10 Manchester United 66 66
## 11 Newcastle United 45 44
## 12 Southampton 39 52
## 13 Tottenham Hotspur 71 59
## 14 Watford 50 34
## 15 West Ham United 52 39
## 16 Wolverhampton Wanderers 57 59
The teams that got promoted and relegated from the Premier League are not shown here as their names do not appear in both data frames. As a result, only 17 rows remain. This is the default behaviour of the merge()
function.
By performing a merge with the indicator=True
option set, you can get an extra column (called “_merge
”) that tells you from where that particular piece of data came:
df = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='outer', indicator=True)
print(df)
## Team Points_x Points_y _merge
## 0 Arsenal 70.0 56.0 both
## 1 Bournemouth 45.0 34.0 both
## 2 Brighton & Hove Albion 36.0 41.0 both
## 3 Burnley 40.0 54.0 both
## 4 Cardiff City 34.0 NaN left_only
## 5 Chelsea 72.0 66.0 both
## 6 Crystal Palace 49.0 43.0 both
## 7 Everton 54.0 49.0 both
## 8 Fulham 26.0 NaN left_only
## 9 Huddersfield Town 16.0 NaN left_only
## 10 Leicester City 52.0 62.0 both
## 11 Liverpool 97.0 99.0 both
## 12 Manchester City 98.0 81.0 both
## 13 Manchester United 66.0 66.0 both
## 14 Newcastle United 45.0 44.0 both
## 15 Southampton 39.0 52.0 both
## 16 Tottenham Hotspur 71.0 59.0 both
## 17 Watford 50.0 34.0 both
## 18 West Ham United 52.0 39.0 both
## 19 Wolverhampton Wanderers 57.0 59.0 both
## 20 Aston Villa NaN 35.0 right_only
## 21 Norwich City NaN 21.0 right_only
## 22 Sheffield United NaN 54.0 right_only
The teams that were in both data frames, only the left data frame or only the right data frame are labelled as such in the _merge
column.
There is no way to do a ‘not inner’ merge in one step (ie keep the teams that were in one season only), but it can be done in two steps using this indicator:
df = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='outer', indicator=True)
df = df[df['_merge'] != 'both']
print(df)
## Team Points_x Points_y _merge
## 4 Cardiff City 34.0 NaN left_only
## 8 Fulham 26.0 NaN left_only
## 9 Huddersfield Town 16.0 NaN left_only
## 20 Aston Villa NaN 35.0 right_only
## 21 Norwich City NaN 21.0 right_only
## 22 Sheffield United NaN 54.0 right_only
These are the six teams that played in the Premier League for only one of the two seasons.
Let’s imagine we have a data frame containing out-of-date information that we want to update with data from a different data frame. For example, here is a database of Arsenal and Burnley’s log points for three seasons, the last of which is incomplete:
import pandas as pd
df = pd.DataFrame({
'Team': [
'Arsenal', 'Arsenal', 'Arsenal',
'Burnley', 'Burnley', 'Burnley'
],
'Season': ['2018_19', '2019_20', '2020_21', '2018_19', '2019_20', '2020_21'],
'Points': [70, 56, 49, 40, 54, 36]
})
print(df)
## Team Season Points
## 0 Arsenal 2018_19 70
## 1 Arsenal 2019_20 56
## 2 Arsenal 2020_21 49
## 3 Burnley 2018_19 40
## 4 Burnley 2019_20 54
## 5 Burnley 2020_21 36
After Arsenal won their next game, their tally of points for the 2020-21 season increased from 49 to 52. So the third row of the data frame (index 2) should look like this:
# The object containing the update must be a data frame (a series won't work)
update = pd.DataFrame({
'Team': ['Arsenal'],
'Season': ['2020_21'],
'Points': [52]
})
print(update)
## Team Season Points
## 0 Arsenal 2020_21 52
We want to edit the database and put in this new, correct information. In other words, we want to merge these two data frames but also overwrite the old one. This is known as updating and, usefully, the function that does this in Pandas is called update()
. However, in order for it to work, the index of the new row (in the data frame update
) needs to match the index of the old row (in the data frame df
). That is how Python will know which row to overwrite. At the moment we have a problem because the original row (row 3 of data frame df
) has index 2 while the new row (row 1 of data frame update
) has index 0.
Now, we could fix this manually by individually changing only these two indexes using hard-coded values, but it’s much easier to change the indexes completely and instead use overlapping columns as the indexes. Then, the data frames can be ‘merged’ on the values they have in common and the information in the remaining columns updated accordingly:
# Use the team names and the seasons as the indexes
df = df.set_index(['Team', 'Season'])
update = update.set_index(['Team', 'Season'])
# Merge and overwrite
df.update(update)
# Change back to using sequential numbers as the indexes
df = df.reset_index()
print(df)
## Team Season Points
## 0 Arsenal 2018_19 70
## 1 Arsenal 2019_20 56
## 2 Arsenal 2020_21 52
## 3 Burnley 2018_19 40
## 4 Burnley 2019_20 54
## 5 Burnley 2020_21 36
Now we can see that Arsenal have 52 points (so far) in the 2020-21 season.