In Python, the intraclass correlation coefficient (ICC) can be calculated using the `intraclass_corr()` function from the `pingouin` library. This function’s documentation tells us that:
The intraclass correlation assesses the reliability of ratings by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects
…and the Wikipedia page says that:
The intraclass correlation coefficient (ICC) is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other.
Note also that an *interclass* correlation coefficient exists as well; it is related to the ICC but is not the same thing.
As mentioned, the `pingouin` library will be used to calculate the ICC, and the `pandas` library will also be needed. These can be installed from the terminal with:
python3.11 -m pip install pingouin
python3.11 -m pip install pandas
After this they can be imported into Python scripts with:
import pingouin as pg
import pandas as pd
This example comes from the Real Statistics site, although it has also been included in Pingouin as a built-in example.
Let’s imagine that there are four judges each tasting 8 different types of wine and rating them from 0 to 9. The results of their assessments have been included in Pingouin, so there is a function to import this raw data directly:
data = pg.read_dataset('icc')
print(data)
## Wine Judge Scores
## 0 1 A 1
## 1 2 A 1
## 2 3 A 3
## 3 4 A 6
## 4 5 A 6
## 5 6 A 7
## 6 7 A 8
## 7 8 A 9
## 8 1 B 2
## 9 2 B 3
## 10 3 B 8
## 11 4 B 4
## 12 5 B 5
## 13 6 B 5
## 14 7 B 7
## 15 8 B 9
## 16 1 C 0
## 17 2 C 3
## 18 3 C 1
## 19 4 C 3
## 20 5 C 5
## 21 6 C 6
## 22 7 C 7
## 23 8 C 9
## 24 1 D 1
## 25 2 D 2
## 26 3 D 4
## 27 4 D 3
## 28 5 D 6
## 29 6 D 2
## 30 7 D 9
## 31 8 D 8
Pivoting this data table will make it more readable, although it’s actually more usable in its original un-pivoted (or ‘long’) format, so we won’t assign the pivoted table to a new variable:
print(pd.pivot_table(data, index='Judge', columns='Wine').T)
## Judge A B C D
## Wine
## Scores 1 1 2 0 1
## 2 1 3 3 2
## 3 3 8 1 4
## 4 6 4 3 3
## 5 6 5 5 6
## 6 7 5 6 2
## 7 8 7 7 9
## 8 9 9 9 8
The above table matches the one given in the original example, so we can be sure we’re starting from the right place with this worked example.
In order to use the `intraclass_corr()` function we need to give it four inputs:

- `data` - the input dataframe in long format (ie un-pivoted)
- `targets` - the name of the column in `data` that contains the names of the things being rated
- `raters` - the name of the column in `data` that contains the names of the things doing the rating
- `ratings` - the name of the column in `data` that contains the values of the ratings

The first of these is a dataframe and the other three are strings (as they are column names). In our example, the things being rated are Wines, the raters are the Judges and the ratings are the Scores, so here’s how to calculate the ICC:
results = pg.intraclass_corr(data=data, targets='Wine', raters='Judge', ratings='Scores')
# Pandas display options
pd.set_option('display.max_columns', 8)
pd.set_option('display.width', 200)
# Show results
print(results)
## Type Description ICC F df1 df2 pval CI95%
## 0 ICC1 Single raters absolute 0.727521 11.680026 7 24 0.000002 [0.43, 0.93]
## 1 ICC2 Single random raters 0.727689 11.786693 7 21 0.000005 [0.43, 0.93]
## 2 ICC3 Single fixed raters 0.729487 11.786693 7 21 0.000005 [0.43, 0.93]
## 3 ICC1k Average raters absolute 0.914384 11.680026 7 24 0.000002 [0.75, 0.98]
## 4 ICC2k Average random raters 0.914450 11.786693 7 21 0.000005 [0.75, 0.98]
## 5 ICC3k Average fixed raters 0.915159 11.786693 7 21 0.000005 [0.75, 0.98]
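As a sanity check, the ICC1 value in the first row can be reproduced by hand from the one-way ANOVA mean squares, using the standard single-rater formula ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the between-target and within-target mean squares. A minimal plain-Python sketch using the same wine scores:

```python
# Scores: rows = wines (targets), columns = judges A-D (raters)
scores = [
    [1, 2, 0, 1],
    [1, 3, 3, 2],
    [3, 8, 1, 4],
    [6, 4, 3, 3],
    [6, 5, 5, 6],
    [7, 5, 6, 2],
    [8, 7, 7, 9],
    [9, 9, 9, 8],
]
n = len(scores)     # number of targets (wines)
k = len(scores[0])  # number of raters (judges)

grand_mean = sum(sum(row) for row in scores) / (n * k)
row_means = [sum(row) / k for row in scores]

# One-way ANOVA: between-target and within-target mean squares
ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
ms_within = sum(
    (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
) / (n * (k - 1))

icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc1, 3))  # 0.728
```

Note that `ms_between / ms_within` also reproduces the F statistic of 11.68 shown in the ICC1 row above.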
This output is quite verbose: you get a whole table when you probably only want one number. The different types of ICC models are detailed briefly on the Wikipedia page, but here’s a summary:

- ICC1: one-way random effects; each target is rated by a different, random set of raters
- ICC2: two-way random effects; the same raters rate every target, and the raters are treated as a random sample from a larger population
- ICC3: two-way mixed effects; the same raters rate every target, and those raters are the only ones of interest
- The ‘k’ variants (ICC1k, ICC2k, ICC3k) give the reliability of the *average* of the k raters rather than of a single rater

For this example the Real Statistics page uses ICC2 (single random raters), which is correct here: a group of four judges does not represent the entire population of people who could rate wine, each judge tasted all 8 wines, and we want to know the reliability of the raters as individuals. Here’s how to select that value from the results table:
results = results.set_index('Description')
icc = results.loc['Single random raters', 'ICC']
print(icc.round(3))
## 0.728
This is the same value as in the original example.
The function also gives the 95% confidence interval:
lower_ci = results.loc['Single random raters', 'CI95%'][0]
upper_ci = results.loc['Single random raters', 'CI95%'][1]
print(f'ICC = {icc:.3f}, 95% CI [{lower_ci}, {upper_ci}]')
## ICC = 0.728, 95% CI [0.43, 0.93]
The source code for this function might be useful if you want to take a look at how exactly it works; it can be found in the Pingouin repository on GitHub.
Again we can look at the Wikipedia page for help, as it gives this guide for interpreting the ICC (reproduced from Cicchetti1):
| Inter-rater agreement | Intraclass correlation |
|---|---|
| Poor | Less than 0.40 |
| Fair | Between 0.40 and 0.59 |
| Good | Between 0.60 and 0.74 |
| Excellent | Between 0.75 and 1.00 |
…plus this alternative one from Koo and Li2:
| Inter-rater agreement | Intraclass correlation |
|---|---|
| Poor | Less than 0.50 |
| Moderate | Between 0.50 and 0.75 |
| Good | Between 0.75 and 0.90 |
| Excellent | Between 0.90 and 1.00 |
These can be coded up into functions as follows:
def interpret_icc_cicchetti(icc):
    """Interpret the inter-rater agreement."""
    if icc < 0.4:
        return 'poor'
    elif icc < 0.6:
        return 'fair'
    elif icc < 0.75:
        return 'good'
    elif icc <= 1:
        return 'excellent'
    else:
        raise ValueError(f'Invalid value for the ICC: {icc}')


def interpret_icc_koo_li(icc):
    """Interpret the inter-rater agreement."""
    if icc < 0.5:
        return 'poor'
    elif icc < 0.75:
        return 'moderate'
    elif icc < 0.9:
        return 'good'
    elif icc <= 1:
        return 'excellent'
    else:
        raise ValueError(f'Invalid value for the ICC: {icc}')
Our result of 0.728 can now be interpreted automatically:
icc = results.loc['Single random raters', 'ICC']
agreement = interpret_icc_cicchetti(icc)
print(f"An inter-rater agreement of {icc.round(3)} is {agreement}")
## An inter-rater agreement of 0.728 is good
icc = results.loc['Single random raters', 'ICC']
agreement = interpret_icc_koo_li(icc)
print(f"An inter-rater agreement of {icc.round(3)} is {agreement}")
## An inter-rater agreement of 0.728 is moderate
If you are only interested in the agreement amongst a subset of the raters you can filter the dataset accordingly. Here’s the agreement between Judges A and B:
data = data[data['Judge'].isin(['A', 'B'])]
results = pg.intraclass_corr(data, 'Wine', 'Judge', 'Scores')
results = results.set_index('Type')
icc = results.loc['ICC1', 'ICC']
print(icc.round(3))
## 0.671
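The same idea extends to every pair of judges. Here’s a hand-rolled sketch that loops over all pairs using the one-way single-rater ICC formula directly (the `icc1()` helper below is illustrative, not part of Pingouin; for the A–B pair it reproduces the 0.671 above):

```python
from itertools import combinations

# Wide-format scores for the four judges (same data as above)
judges = {
    'A': [1, 1, 3, 6, 6, 7, 8, 9],
    'B': [2, 3, 8, 4, 5, 5, 7, 9],
    'C': [0, 3, 1, 3, 5, 6, 7, 9],
    'D': [1, 2, 4, 3, 6, 2, 9, 8],
}

def icc1(columns):
    """One-way random-effects ICC (single rater) from raw score columns."""
    rows = list(zip(*columns))  # one row per wine
    n, k = len(rows), len(columns)
    grand = sum(map(sum, rows)) / (n * k)
    means = [sum(r) / k for r in rows]
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for r, m in zip(rows, means) for x in r
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Agreement for every pair of judges
for pair in combinations(judges, 2):
    print(pair, round(icc1([judges[j] for j in pair]), 3))
```

This avoids re-running `intraclass_corr()` on each filtered subset, although in practice the Pingouin route shown above is the more robust choice.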
Often you will have raw data that is in wide format:
dct = {
'Judge A': [1, 1, 3, 6, 6, 7, 8, 9],
'Judge B': [2, 3, 8, 4, 5, 5, 7, 9],
'Judge C': [0, 3, 1, 3, 5, 6, 7, 9],
'Judge D': [1, 2, 4, 3, 6, 2, 9, 8],
}
df = pd.DataFrame(dct)
print(df)
## Judge A Judge B Judge C Judge D
## 0 1 2 0 1
## 1 1 3 3 2
## 2 3 8 1 4
## 3 6 4 3 3
## 4 6 5 5 6
## 5 7 5 6 2
## 6 8 7 7 9
## 7 9 9 9 8
It will need to be converted into long format before `intraclass_corr()` can be used. This can be done by creating a new column that will form the targets and then converting to long format with the `melt()` function from Pandas:
df['index'] = df.index
df = pd.melt(df, id_vars=['index'], value_vars=list(df)[:-1])
print(df)
## index variable value
## 0 0 Judge A 1
## 1 1 Judge A 1
## 2 2 Judge A 3
## 3 3 Judge A 6
## 4 4 Judge A 6
## 5 5 Judge A 7
## 6 6 Judge A 8
## 7 7 Judge A 9
## 8 0 Judge B 2
## 9 1 Judge B 3
## 10 2 Judge B 8
## 11 3 Judge B 4
## 12 4 Judge B 5
## 13 5 Judge B 5
## 14 6 Judge B 7
## 15 7 Judge B 9
## 16 0 Judge C 0
## 17 1 Judge C 3
## 18 2 Judge C 1
## 19 3 Judge C 3
## 20 4 Judge C 5
## 21 5 Judge C 6
## 22 6 Judge C 7
## 23 7 Judge C 9
## 24 0 Judge D 1
## 25 1 Judge D 2
## 26 2 Judge D 4
## 27 3 Judge D 3
## 28 4 Judge D 6
## 29 5 Judge D 2
## 30 6 Judge D 9
## 31 7 Judge D 8
The ICC can then be calculated as usual:
results = pg.intraclass_corr(df, 'index', 'variable', 'value')
results = results.set_index('Description')
icc = results.loc['Single random raters', 'ICC']
print(icc.round(3))
## 0.728
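As an aside, the same wide-to-long conversion can be written more compactly by chaining `reset_index()` (which turns the row index into an `index` column) with `melt()`. This is just an alternative sketch of the step above, not what the original example uses:

```python
import pandas as pd

# Same wide-format data as above: one column of scores per judge
df = pd.DataFrame({
    'Judge A': [1, 1, 3, 6, 6, 7, 8, 9],
    'Judge B': [2, 3, 8, 4, 5, 5, 7, 9],
    'Judge C': [0, 3, 1, 3, 5, 6, 7, 9],
    'Judge D': [1, 2, 4, 3, 6, 2, 9, 8],
})

# reset_index() adds the targets column; melt() stacks the judge
# columns into the long-format 'variable'/'value' pair
long_df = df.reset_index().melt(id_vars='index')
print(long_df.shape)  # (32, 3)
```

The resulting `long_df` has the same `index`, `variable` and `value` columns as before, so it can be passed straight to `intraclass_corr()`.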