Inter-rater reliability

    Solution

    The method for calculating inter-rater reliability will depend on the type of data (categorical, ordinal, or continuous) and the number of coders.

    Suppose this is your data set. It consists of 30 cases rated by three coders, and is a subset of a data set that ships with the irr package.
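    As a minimal sketch, assuming the diagnoses data set that ships with irr (30 subjects rated by six coders), the three-coder subset dat used below could be built like this:

    # Assumed setup: keep the first three raters from irr's diagnoses data set
    library(irr)
    data(diagnoses)
    dat <- diagnoses[, 1:3]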

    Two raters: Cohen’s Kappa

    This calculates Cohen's Kappa for two coders, in this case raters 1 and 2.

    kappa2(dat[,c(1,2)], "unweighted")
    #> Cohen's Kappa for 2 Raters (Weights: unweighted)
    #>
    #>  Subjects = 30
    #>    Raters = 2
    #>     Kappa = 0.651
    #>
    #>         z = 7
    #>   p-value = 2.63e-12

    N raters: Fleiss’s Kappa, Conger’s Kappa

    kappam.fleiss(dat)
    #> Fleiss' Kappa for m Raters
    #>
    #>  Subjects = 30
    #>    Raters = 3
    #>     Kappa = 0.534
    #>
    #>         z = 9.89
    #>   p-value = 0

    It is also possible to use Conger’s (1980) exact Kappa. (Note that it is not clear to me when it is better or worse to use the exact method.)
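    As a sketch, the exact variant is requested through the exact argument of kappam.fleiss; the call would look like this (output not shown):

    # Conger's exact Kappa for the same three raters
    kappam.fleiss(dat, exact=TRUE)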

    If the data is ordinal, then it may be appropriate to use a weighted Kappa. For example, if the possible values are low, medium, and high, then if a case were rated medium and high by the two coders, they would be in better agreement than if the ratings were low and high.

    We will use a subset of the anxiety data set from the irr package.

    library(irr)
    data(anxiety)
    dfa <- anxiety[,c(1,2)]
    dfa
    #>    rater1 rater2
    #> 1       3      3
    #> 2       3      6
    #> 3       3      4
    #> 4       4      6
    #> 5       5      2
    #> 6       5      4
    #> 7       2      2
    #> 8       3      4
    #> 9       5      3
    #> 10      2      3
    #> 11      2      2
    #> 12      6      3
    #> 13      1      3
    #> 14      5      3
    #> 15      2      2
    #> 16      2      2
    #> 17      1      1
    #> 18      2      3
    #> 19      4      3
    #> 20      3      4
    # Compare raters 1 and 2 with squared weights
    kappa2(dfa, "squared")
    #> Cohen's Kappa for 2 Raters (Weights: squared)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.297
    #>
    #>         z = 1.34
    #>   p-value = 0.18

    # Compare raters 1 and 2 with linear (equal) weights
    kappa2(dfa, "equal")
    #> Cohen's Kappa for 2 Raters (Weights: equal)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.189
    #>
    #>         z = 1.42
    #>   p-value = 0.157

    Compare the results above to the unweighted calculation (used in the tests for non-ordinal data above), which treats all disagreements as equally severe. A sketch of that call (output not shown):
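    # Unweighted Kappa: a medium/high disagreement counts the same as low/high
    kappa2(dfa, "unweighted")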

    Weighted Kappa with factors

    The data above is numeric, but a weighted Kappa can also be calculated for factors. Note that the factor levels must be in the correct order, or results will be wrong.

    # Make a factor-ized version of the data
    dfa2 <- dfa
    dfa2$rater1 <- factor(dfa2$rater1, levels=1:6, labels=LETTERS[1:6])
    dfa2$rater2 <- factor(dfa2$rater2, levels=1:6, labels=LETTERS[1:6])
    dfa2
    #>    rater1 rater2
    #> 1       C      C
    #> 2       C      F
    #> 3       C      D
    #> 4       D      F
    #> 5       E      B
    #> 6       E      D
    #> 7       B      B
    #> 8       C      D
    #> 9       E      C
    #> 10      B      C
    #> 11      B      B
    #> 12      F      C
    #> 13      A      C
    #> 14      E      C
    #> 15      B      B
    #> 16      B      B
    #> 17      A      A
    #> 18      B      C
    #> 19      D      C
    #> 20      C      D

    # The factor levels must be in the correct order:
    levels(dfa2$rater1)
    #> [1] "A" "B" "C" "D" "E" "F"
    levels(dfa2$rater2)
    #> [1] "A" "B" "C" "D" "E" "F"

    # The results are the same as with the numeric data, above
    kappa2(dfa2, "squared")
    #> Cohen's Kappa for 2 Raters (Weights: squared)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.297
    #>
    #>         z = 1.34
    #>   p-value = 0.18

    # Use linear weights
    kappa2(dfa2, "equal")
    #> Cohen's Kappa for 2 Raters (Weights: equal)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.189
    #>
    #>         z = 1.42
    #>   p-value = 0.157

    Continuous data: Intraclass correlation coefficient

    When the variable is continuous, the intraclass correlation coefficient (ICC) should be computed. From the documentation for icc:

    • Should only the subjects be considered as random effects (the "oneway" model, the default), or are both subjects and raters randomly drawn from a larger pool of persons (the "twoway" model)?
    • If differences in judges' mean ratings are of interest, interrater "agreement" should be computed instead of "consistency" (the default).
    • If the unit of analysis is a mean of several ratings, unit should be changed to "average". In most cases, though, single values (unit="single", the default) are used; a sketch of the "average" variant appears after the example below.

    We will use the anxiety data set from the irr package.
    library(irr)
    data(anxiety)
    anxiety
    #>    rater1 rater2 rater3
    #> 1       3      3      2
    #> 2       3      6      1
    #> 3       3      4      4
    #> 4       4      6      4
    #> 5       5      2      3
    #> 6       5      4      2
    #> 7       2      2      1
    #> 8       3      4      6
    #> 9       5      3      1
    #> 10      2      3      1
    #> 11      2      2      1
    #> 12      6      3      2
    #> 13      1      3      3
    #> 14      5      3      3
    #> 15      2      2      1
    #> 16      2      2      1
    #> 17      1      1      3
    #> 18      2      3      3
    #> 19      4      3      2
    #> 20      3      4      2

    # Just one of the many possible ICC coefficients
    icc(anxiety, model="twoway", type="agreement")
    #> Single Score Intraclass Correlation
    #>
    #>    Model: twoway
    #>    Type : agreement
    #>
    #>    Subjects = 20
    #>      Raters = 3
    #>    ICC(A,1) = 0.198
    #>
    #>  F-Test, H0: r0 = 0 ; H1: r0 > 0
    #>    F(19,39.7) = 1.83 , p = 0.0543
    #>
    #>  95%-Confidence Interval for ICC Population Values:
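
    If the unit of analysis were the mean of the several ratings rather than a single rating, the same call could be made with unit="average". A sketch (output not shown):

    # Average-score ICC: treats the mean of the three raters as the unit of analysis
    icc(anxiety, model="twoway", type="agreement", unit="average")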