Dialogue Evaluation

Reliability of subjective judgments
- "Assessing agreement on classification tasks: the kappa statistic". J. Carletta. Computational Linguistics. 1996
- "K coefficient measures agreement among a set of coders making category judgments, correcting for expected chance agreement."
- K = (P(A) - P(E)) / (1 - P(E)), where P(A) stands for the proportion of times the coders agree and P(E) is the proportion of times that we would expect them to agree by chance.
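
The formula above can be sketched for the simplest case of two coders. This is a minimal illustration, not the procedure from the cited paper: the function name `kappa` and the toy labels are hypothetical, and P(E) is estimated here from each coder's own label distribution (the Cohen-style estimate), which is an assumption. Other estimates of P(E), such as pooling the label distribution across coders, can give slightly different values.

```python
from collections import Counter


def kappa(labels_a, labels_b):
    """Two-coder K = (P(A) - P(E)) / (1 - P(E)).

    Assumption: P(E) is estimated from each coder's own label
    distribution (Cohen-style); pooled estimates are also used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # P(A): proportion of items on which the two coders agree.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(E): probability that both coders pick the same category by chance,
    # summed over categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("K is undefined when P(E) = 1")
    return (p_a - p_e) / (1 - p_e)


# Hypothetical toy data: two coders labelling ten items.
coder_a = ["y", "y", "y", "y", "y", "n", "n", "n", "n", "n"]
coder_b = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]
k = kappa(coder_a, coder_b)
```

With these toy labels the coders agree on 8 of 10 items, so P(A) = 0.8, and each coder uses each category half the time, so P(E) = 0.5, which gives K of about 0.6.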
