Inter-rater reliability

    Solution

    The method for calculating inter-rater reliability will depend on the type of data (categorical, ordinal, or continuous) and the number of coders.

    Suppose this is your data set. It consists of 30 cases rated by three coders, and is a subset of a data set that ships with the irr package.
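    As a minimal sketch, assuming the diagnoses data set that ships with irr (30 subjects rated by six coders), the three-coder subset dat used below could be built like this:

    # Assumed setup: keep the first three raters from irr's diagnoses data set
    library(irr)
    data(diagnoses)
    dat <- diagnoses[, 1:3]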

    Two raters: Cohen’s Kappa

    This calculates Cohen's Kappa for two coders, in this case raters 1 and 2.

    kappa2(dat[,c(1,2)], "unweighted")
    #> Cohen's Kappa for 2 Raters (Weights: unweighted)
    #>
    #>  Subjects = 30
    #>    Raters = 2
    #>     Kappa = 0.651
    #>
    #>         z = 7
    #>   p-value = 2.63e-12

    N raters: Fleiss’s Kappa, Conger’s Kappa

    kappam.fleiss(dat)
    #> Fleiss' Kappa for m Raters
    #>
    #>  Subjects = 30
    #>    Raters = 3
    #>     Kappa = 0.534
    #>
    #>         z = 9.89
    #>   p-value = 0

    It is also possible to use Conger’s (1980) exact Kappa. (Note that it is not clear to me when it is better or worse to use the exact method.)
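    As a sketch, the exact variant is requested through the exact argument of kappam.fleiss; the call would look like this (output not shown):

    # Conger's exact Kappa for the same three raters
    kappam.fleiss(dat, exact=TRUE)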

    If the data is ordinal, then it may be appropriate to use a weighted Kappa. For example, if the possible values are low, medium, and high, then if a case were rated medium and high by the two coders, they would be in better agreement than if the ratings were low and high.

    We will use a subset of the anxiety data set from the irr package.

    library(irr)
    data(anxiety)
    dfa <- anxiety[,c(1,2)]
    dfa
    #>    rater1 rater2
    #> 1       3      3
    #> 2       3      6
    #> 3       3      4
    #> 4       4      6
    #> 5       5      2
    #> 6       5      4
    #> 7       2      2
    #> 8       3      4
    #> 9       5      3
    #> 10      2      3
    #> 11      2      2
    #> 12      6      3
    #> 13      1      3
    #> 14      5      3
    #> 15      2      2
    #> 16      2      2
    #> 17      1      1
    #> 18      2      3
    #> 19      4      3
    #> 20      3      4
    # Compare raters 1 and 2 with squared weights
    kappa2(dfa, "squared")
    #> Cohen's Kappa for 2 Raters (Weights: squared)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.297
    #>
    #>         z = 1.34
    #>   p-value = 0.18

    # Compare raters 1 and 2 with linear (equal) weights
    kappa2(dfa, "equal")
    #> Cohen's Kappa for 2 Raters (Weights: equal)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.189
    #>
    #>         z = 1.42
    #>   p-value = 0.157

    Compare the results above to the unweighted calculation (used in the tests for non-ordinal data above), which treats all disagreements as equally severe. A sketch of that call (output not shown):
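    # Unweighted Kappa: a medium/high disagreement counts the same as low/high
    kappa2(dfa, "unweighted")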

    Weighted Kappa with factors

    The data above is numeric, but a weighted Kappa can also be calculated for factors. Note that the factor levels must be in the correct order, or results will be wrong.

    # Make a factor-ized version of the data
    dfa2 <- dfa
    dfa2$rater1 <- factor(dfa2$rater1, levels=1:6, labels=LETTERS[1:6])
    dfa2$rater2 <- factor(dfa2$rater2, levels=1:6, labels=LETTERS[1:6])
    dfa2
    #>    rater1 rater2
    #> 1       C      C
    #> 2       C      F
    #> 3       C      D
    #> 4       D      F
    #> 5       E      B
    #> 6       E      D
    #> 7       B      B
    #> 8       C      D
    #> 9       E      C
    #> 10      B      C
    #> 11      B      B
    #> 12      F      C
    #> 13      A      C
    #> 14      E      C
    #> 15      B      B
    #> 16      B      B
    #> 17      A      A
    #> 18      B      C
    #> 19      D      C
    #> 20      C      D

    # The factor levels must be in the correct order:
    levels(dfa2$rater1)
    #> [1] "A" "B" "C" "D" "E" "F"
    levels(dfa2$rater2)
    #> [1] "A" "B" "C" "D" "E" "F"

    # The results are the same as with the numeric data, above
    kappa2(dfa2, "squared")
    #> Cohen's Kappa for 2 Raters (Weights: squared)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.297
    #>
    #>         z = 1.34
    #>   p-value = 0.18

    # Use linear weights
    kappa2(dfa2, "equal")
    #> Cohen's Kappa for 2 Raters (Weights: equal)
    #>
    #>  Subjects = 20
    #>    Raters = 2
    #>     Kappa = 0.189
    #>
    #>         z = 1.42
    #>   p-value = 0.157

    Continuous data: Intraclass correlation coefficient

    When the variable is continuous, the intraclass correlation coefficient (ICC) should be computed. From the documentation for icc:

    • Should only the subjects be considered as random effects (the "oneway" model, the default), or are both subjects and raters randomly drawn from a larger pool of persons (the "twoway" model)?
    • If differences in judges' mean ratings are of interest, interrater "agreement" should be computed instead of "consistency" (the default).
    • If the unit of analysis is a mean of several ratings, unit should be changed to "average". In most cases, though, single values (unit="single", the default) are used; a sketch of the "average" variant appears after the example below.

    We will use the anxiety data set from the irr package.
    library(irr)
    data(anxiety)
    anxiety
    #>    rater1 rater2 rater3
    #> 1       3      3      2
    #> 2       3      6      1
    #> 3       3      4      4
    #> 4       4      6      4
    #> 5       5      2      3
    #> 6       5      4      2
    #> 7       2      2      1
    #> 8       3      4      6
    #> 9       5      3      1
    #> 10      2      3      1
    #> 11      2      2      1
    #> 12      6      3      2
    #> 13      1      3      3
    #> 14      5      3      3
    #> 15      2      2      1
    #> 16      2      2      1
    #> 17      1      1      3
    #> 18      2      3      3
    #> 19      4      3      2
    #> 20      3      4      2

    # Just one of the many possible ICC coefficients
    icc(anxiety, model="twoway", type="agreement")
    #> Single Score Intraclass Correlation
    #>
    #>    Model: twoway
    #>    Type : agreement
    #>
    #>    Subjects = 20
    #>      Raters = 3
    #>    ICC(A,1) = 0.198
    #>
    #>  F-Test, H0: r0 = 0 ; H1: r0 > 0
    #>    F(19,39.7) = 1.83 , p = 0.0543
    #>
    #>  95%-Confidence Interval for ICC Population Values:
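
    If the unit of analysis were the mean of the several ratings rather than a single rating, the same call could be made with unit="average". A sketch (output not shown):

    # Average-score ICC: treats the mean of the three raters as the unit of analysis
    icc(anxiety, model="twoway", type="agreement", unit="average")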