1 Example Data

Here are the final log standings from the 2018-19 and 2019-20 English Premier League seasons, sorted alphabetically by team:

import pandas as pd

EPL_2018_19 = pd.DataFrame({
    'Team': [
        'Arsenal', 'Bournemouth', 'Brighton & Hove Albion', 'Burnley', 'Cardiff City', 'Chelsea', 'Crystal Palace',
        'Everton', 'Fulham', 'Huddersfield Town', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester United',
        'Newcastle United', 'Southampton', 'Tottenham Hotspur', 'Watford', 'West Ham United', 'Wolverhampton Wanderers'
    ],
    'Points': [70, 45, 36, 40, 34, 72, 49, 54, 26, 16, 52, 97, 98, 66, 45, 39, 71, 50, 52, 57]
})
EPL_2019_20 = pd.DataFrame({
    'Team': [
        'Arsenal', 'Aston Villa', 'Bournemouth', 'Brighton & Hove Albion', 'Burnley', 'Chelsea', 'Crystal Palace',
        'Everton', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester United', 'Newcastle United',
        'Norwich City', 'Sheffield United', 'Southampton', 'Tottenham Hotspur', 'Watford', 'West Ham United',
        'Wolverhampton Wanderers'
    ],
    'Points': [56, 35, 34, 41, 54, 66, 43, 49, 62, 99, 81, 66, 44, 21, 54, 52, 59, 34, 39, 59]
})

Team	Points
Arsenal	70
Bournemouth	45
Brighton & Hove Albion	36
Burnley	40
Cardiff City	34
Chelsea	72
Crystal Palace	49
Everton	54
Fulham	26
Huddersfield Town	16
Leicester City	52
Liverpool	97
Manchester City	98
Manchester United	66
Newcastle United	45
Southampton	39
Tottenham Hotspur	71
Watford	50
West Ham United	52
Wolverhampton Wanderers	57

Team	Points
Arsenal	56
Aston Villa	35
Bournemouth	34
Brighton & Hove Albion	41
Burnley	54
Chelsea	66
Crystal Palace	43
Everton	49
Leicester City	62
Liverpool	99
Manchester City	81
Manchester United	66
Newcastle United	44
Norwich City	21
Sheffield United	54
Southampton	52
Tottenham Hotspur	59
Watford	34
West Ham United	39
Wolverhampton Wanderers	59

In the real world, you would probably use data stored in spreadsheets. These can be imported into Python as data frames using the following code:

# Import Excel spreadsheets as data frames
df_1 = pd.read_excel('First Spreadsheet.xlsx')
df_2 = pd.read_excel('Second Spreadsheet.xlsx')

Or, if you are using CSVs:

# Import CSVs as data frames
df_1 = pd.read_csv('First Spreadsheet.csv')
df_2 = pd.read_csv('Second Spreadsheet.csv')

2 Concatenation

Concatenating two data frames is actually not the same as merging them, it’s simply combining them by sticking one on top of the other. It’s done using the concat() function:

total = pd.concat([EPL_2018_19, EPL_2019_20])

Which produces the following:

print(total)

##                        Team  Points
## 0                   Arsenal      70
## 1               Bournemouth      45
## 2    Brighton & Hove Albion      36
## 3                   Burnley      40
## 4              Cardiff City      34
## 5                   Chelsea      72
## 6            Crystal Palace      49
## 7                   Everton      54
## 8                    Fulham      26
## 9         Huddersfield Town      16
## 10           Leicester City      52
## 11                Liverpool      97
## 12          Manchester City      98
## 13        Manchester United      66
## 14         Newcastle United      45
## 15              Southampton      39
## 16        Tottenham Hotspur      71
## 17                  Watford      50
## 18          West Ham United      52
## 19  Wolverhampton Wanderers      57
## 0                   Arsenal      56
## 1               Aston Villa      35
## 2               Bournemouth      34
## 3    Brighton & Hove Albion      41
## 4                   Burnley      54
## 5                   Chelsea      66
## 6            Crystal Palace      43
## 7                   Everton      49
## 8            Leicester City      62
## 9                 Liverpool      99
## 10          Manchester City      81
## 11        Manchester United      66
## 12         Newcastle United      44
## 13             Norwich City      21
## 14         Sheffield United      54
## 15              Southampton      52
## 16        Tottenham Hotspur      59
## 17                  Watford      34
## 18          West Ham United      39
## 19  Wolverhampton Wanderers      59

Append a data frame to the side of another with the concat() function by setting the axis='columns' or axis='1' keyword argument (these mean the same thing):

total = pd.concat([EPL_2018_19, EPL_2019_20], axis='columns')
print(total)

##                        Team  Points                     Team  Points
## 0                   Arsenal      70                  Arsenal      56
## 1               Bournemouth      45              Aston Villa      35
## 2    Brighton & Hove Albion      36              Bournemouth      34
## 3                   Burnley      40   Brighton & Hove Albion      41
## 4              Cardiff City      34                  Burnley      54
## 5                   Chelsea      72                  Chelsea      66
## 6            Crystal Palace      49           Crystal Palace      43
## 7                   Everton      54                  Everton      49
## 8                    Fulham      26           Leicester City      62
## 9         Huddersfield Town      16                Liverpool      99
## 10           Leicester City      52          Manchester City      81
## 11                Liverpool      97        Manchester United      66
## 12          Manchester City      98         Newcastle United      44
## 13        Manchester United      66             Norwich City      21
## 14         Newcastle United      45         Sheffield United      54
## 15              Southampton      39              Southampton      52
## 16        Tottenham Hotspur      71        Tottenham Hotspur      59
## 17                  Watford      50                  Watford      34
## 18          West Ham United      52          West Ham United      39
## 19  Wolverhampton Wanderers      57  Wolverhampton Wanderers      59

3 Merging

There are four types of merging:

Outer: keep all teams in both data frames (don’t delete any data)
Left: keep all teams in the left data frame (but delete those in the right data frame that don’t appear in the left)
Right: keep all teams in the right data frame (but delete those in the left data frame that don’t appear in the right)
Inner: only keep the teams that are common to both data frames

(1) Keep all rows in both data frames, (2) keep all rows in the left data frame but only those in the right that are also in the left, (3) keep all rows in the right data frame but only those in the left that are also in the right, (4) keep only the rows that appear in both data frames

3.1 Outer Merge (Keep all rows in both data frames)

Merge the data from both seasons by team name:

outer = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='outer')
print(outer)

##                        Team  Points_x  Points_y
## 0                   Arsenal      70.0      56.0
## 1               Bournemouth      45.0      34.0
## 2    Brighton & Hove Albion      36.0      41.0
## 3                   Burnley      40.0      54.0
## 4              Cardiff City      34.0       NaN
## 5                   Chelsea      72.0      66.0
## 6            Crystal Palace      49.0      43.0
## 7                   Everton      54.0      49.0
## 8                    Fulham      26.0       NaN
## 9         Huddersfield Town      16.0       NaN
## 10           Leicester City      52.0      62.0
## 11                Liverpool      97.0      99.0
## 12          Manchester City      98.0      81.0
## 13        Manchester United      66.0      66.0
## 14         Newcastle United      45.0      44.0
## 15              Southampton      39.0      52.0
## 16        Tottenham Hotspur      71.0      59.0
## 17                  Watford      50.0      34.0
## 18          West Ham United      52.0      39.0
## 19  Wolverhampton Wanderers      57.0      59.0
## 20              Aston Villa       NaN      35.0
## 21             Norwich City       NaN      21.0
## 22         Sheffield United       NaN      54.0

As you can see, the teams that were in the league for both seasons (eg Arsenal and Bournemouth) have values in the “Points” columns for both seasons, whereas teams such as Aston Villa and Cardiff City which were promoted/relegated after the first season only have values for one season. For the other season there is an “NaN” (not a number). That is why there are 23 rows in the data frame when there are only 20 teams in the Premier League in any one season.

3.2 Left Merge (Keep all rows in the left data frame)

For all teams that were in the Premier League for the 2018-19 season, see their points for that season and the one afterwards:

left = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='left')
print(left)

##                        Team  Points_x  Points_y
## 0                   Arsenal        70      56.0
## 1               Bournemouth        45      34.0
## 2    Brighton & Hove Albion        36      41.0
## 3                   Burnley        40      54.0
## 4              Cardiff City        34       NaN
## 5                   Chelsea        72      66.0
## 6            Crystal Palace        49      43.0
## 7                   Everton        54      49.0
## 8                    Fulham        26       NaN
## 9         Huddersfield Town        16       NaN
## 10           Leicester City        52      62.0
## 11                Liverpool        97      99.0
## 12          Manchester City        98      81.0
## 13        Manchester United        66      66.0
## 14         Newcastle United        45      44.0
## 15              Southampton        39      52.0
## 16        Tottenham Hotspur        71      59.0
## 17                  Watford        50      34.0
## 18          West Ham United        52      39.0
## 19  Wolverhampton Wanderers        57      59.0

The three teams that got promoted into the Premier League are not shown here as their names do not appear in the left data frame.

3.3 Right Merge (Keep all rows in the right data frame)

For all teams that were in the Premier League for the 2019-20 season, see their points for that season and the previous one:

right = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='right')
print(right)

##                        Team  Points_x  Points_y
## 0                   Arsenal      70.0        56
## 1               Aston Villa       NaN        35
## 2               Bournemouth      45.0        34
## 3    Brighton & Hove Albion      36.0        41
## 4                   Burnley      40.0        54
## 5                   Chelsea      72.0        66
## 6            Crystal Palace      49.0        43
## 7                   Everton      54.0        49
## 8            Leicester City      52.0        62
## 9                 Liverpool      97.0        99
## 10          Manchester City      98.0        81
## 11        Manchester United      66.0        66
## 12         Newcastle United      45.0        44
## 13             Norwich City       NaN        21
## 14         Sheffield United       NaN        54
## 15              Southampton      39.0        52
## 16        Tottenham Hotspur      71.0        59
## 17                  Watford      50.0        34
## 18          West Ham United      52.0        39
## 19  Wolverhampton Wanderers      57.0        59

The three teams that got relegated from the Premier League are not shown here as their names do not appear in the right data frame.

3.4 Inner Merge (Keep only the rows that appear in both data frames)

To see the points for all the teams that were in the Premier League for both seasons:

inner = pd.merge(EPL_2018_19, EPL_2019_20, on='Team')
print(inner)

##                        Team  Points_x  Points_y
## 0                   Arsenal        70        56
## 1               Bournemouth        45        34
## 2    Brighton & Hove Albion        36        41
## 3                   Burnley        40        54
## 4                   Chelsea        72        66
## 5            Crystal Palace        49        43
## 6                   Everton        54        49
## 7            Leicester City        52        62
## 8                 Liverpool        97        99
## 9           Manchester City        98        81
## 10        Manchester United        66        66
## 11         Newcastle United        45        44
## 12              Southampton        39        52
## 13        Tottenham Hotspur        71        59
## 14                  Watford        50        34
## 15          West Ham United        52        39
## 16  Wolverhampton Wanderers        57        59

The teams that got promoted and relegated from the Premier League are not shown here as their names do not appear in both data frames. As a result, only 17 rows remain. This is the default behaviour of the merge() function.

3.5 Indicators

By performing a merge with the indicator=True option set, you can get an extra column (called “_merge”) that tells you from where that particular piece of data came:

df = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='outer', indicator=True)
print(df)

##                        Team  Points_x  Points_y      _merge
## 0                   Arsenal      70.0      56.0        both
## 1               Bournemouth      45.0      34.0        both
## 2    Brighton & Hove Albion      36.0      41.0        both
## 3                   Burnley      40.0      54.0        both
## 4              Cardiff City      34.0       NaN   left_only
## 5                   Chelsea      72.0      66.0        both
## 6            Crystal Palace      49.0      43.0        both
## 7                   Everton      54.0      49.0        both
## 8                    Fulham      26.0       NaN   left_only
## 9         Huddersfield Town      16.0       NaN   left_only
## 10           Leicester City      52.0      62.0        both
## 11                Liverpool      97.0      99.0        both
## 12          Manchester City      98.0      81.0        both
## 13        Manchester United      66.0      66.0        both
## 14         Newcastle United      45.0      44.0        both
## 15              Southampton      39.0      52.0        both
## 16        Tottenham Hotspur      71.0      59.0        both
## 17                  Watford      50.0      34.0        both
## 18          West Ham United      52.0      39.0        both
## 19  Wolverhampton Wanderers      57.0      59.0        both
## 20              Aston Villa       NaN      35.0  right_only
## 21             Norwich City       NaN      21.0  right_only
## 22         Sheffield United       NaN      54.0  right_only

The teams that were in both data frames, only the left data frame or only the right data frame are labelled as such in the _merge column.

3.6 Bonus: the “Exclusive Or” (XOR) Merge

There is no way to do a ‘not inner’ merge in one step (ie keep the teams that were in one season only), but it can be done in two steps using this indicator:

df = pd.merge(EPL_2018_19, EPL_2019_20, on='Team', how='outer', indicator=True)
df = df[df['_merge'] != 'both']
print(df)

##                  Team  Points_x  Points_y      _merge
## 4        Cardiff City      34.0       NaN   left_only
## 8              Fulham      26.0       NaN   left_only
## 9   Huddersfield Town      16.0       NaN   left_only
## 20        Aston Villa       NaN      35.0  right_only
## 21       Norwich City       NaN      21.0  right_only
## 22   Sheffield United       NaN      54.0  right_only

These are the six teams that played in the Premier League for only one of the two seasons.

4 Updating a Row

Let’s imagine we have a data frame containing out-of-date information that we want to update with data from a different data frame. For example, here is a database of Arsenal and Burnley’s log points for three seasons, the last of which is incomplete:

import pandas as pd

df = pd.DataFrame({
    'Team': [
        'Arsenal', 'Arsenal', 'Arsenal',
        'Burnley', 'Burnley', 'Burnley'
    ],
    'Season': ['2018_19', '2019_20', '2020_21', '2018_19', '2019_20', '2020_21'],
    'Points': [70, 56, 49, 40, 54, 36]
})

print(df)

##       Team   Season  Points
## 0  Arsenal  2018_19      70
## 1  Arsenal  2019_20      56
## 2  Arsenal  2020_21      49
## 3  Burnley  2018_19      40
## 4  Burnley  2019_20      54
## 5  Burnley  2020_21      36

After Arsenal won their next game, their tally of points for the 2020-21 season increased from 49 to 52. So the third row of the data frame (index 2) should look like this:

# The object containing the update must be a data frame (a series won't work)
update = pd.DataFrame({
    'Team': ['Arsenal'],
    'Season': ['2020_21'],
    'Points': [52]
})

print(update)

##       Team   Season  Points
## 0  Arsenal  2020_21      52

We want to edit the database and put in this new, correct information. In other words, we want to merge these two data frames but also overwrite the old one. This is known as updating and, usefully, the function that does this in Pandas is called update(). However, in order for it to work, the index of the new row (in the data frame update) needs to match the index of the old row (in the data frame df). That is how Python will know which row to overwrite. At the moment we have a problem because the original row (row 3 of data frame df) has index 2 while the new row (row 1 of data frame update) has index 0.

Now, we could fix this manually by individually changing only these two indexes using hard-coded values, but it’s much easier to change the indexes completely and instead use overlapping columns as the indexes. Then, the data frames can be ‘merged’ on the values they have in common and the information in the remaining columns updated accordingly:

# Use the team names and the seasons as the indexes
df = df.set_index(['Team', 'Season'])
update = update.set_index(['Team', 'Season'])
# Merge and overwrite
df.update(update)
# Change back to using sequential numbers as the indexes
df = df.reset_index()

print(df)

##       Team   Season  Points
## 0  Arsenal  2018_19      70
## 1  Arsenal  2019_20      56
## 2  Arsenal  2020_21      52
## 3  Burnley  2018_19      40
## 4  Burnley  2019_20      54
## 5  Burnley  2020_21      36

Now we can see that Arsenal have 52 points (so far) in the 2020-21 season.

⇦ Back

Data Handling in Python:Merging