Conclusions
Agreement among human coders is evaluated to ensure the reliability of such studies and systems.
We propose comparing system performance against human judgment by the same means used to evaluate agreement among human coders.
Kappa and per-class agreement can be used to measure agreement between coders and systems.
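As an illustration of this evaluation scheme, the following is a minimal sketch (not taken from the paper) that treats the system as one more coder: it computes Cohen's kappa, defined as k = (P_o - P_e) / (1 - P_e) where P_o is observed and P_e chance agreement, and a per-class breakdown. The label lists human_labels and system_labels are hypothetical placeholders, and scikit-learn is assumed to be available.

    # Minimal sketch: comparing a system to a human coder with the same
    # statistics used for inter-coder agreement. Labels below are hypothetical.
    from sklearn.metrics import cohen_kappa_score, classification_report

    human_labels  = ["pos", "neg", "pos", "neu", "neg", "pos"]   # hypothetical human annotations
    system_labels = ["pos", "neg", "neu", "neu", "neg", "pos"]   # hypothetical system output

    # Cohen's kappa: chance-corrected agreement, as used for pairs of human coders.
    kappa = cohen_kappa_score(human_labels, system_labels)
    print(f"kappa = {kappa:.3f}")

    # Per-class agreement: precision/recall/F1 per label, with the human coder as reference.
    print(classification_report(human_labels, system_labels))

The same calculation can be repeated for each human coder paired with the system, so that system-human agreement is directly comparable to the human-human agreement reported for the study.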