Second HAREM: Evaluation



Scores

Each NE receives one of three scores (correct, missing or spurious), according to how it aligns with the NEs in the Golden Collection (GC).
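As an illustration only, the sketch below (in Python, using a hypothetical span-based representation of NEs) shows how this three-way scoring could be derived by aligning the system's NEs with those of the GC; the actual HAREM alignment also handles partial matches and is more elaborate.

```python
# Hedged sketch: classify each NE as correct, missing or spurious by exact
# span alignment against the Golden Collection. The function name and the
# dict keys ("start", "end") are assumptions made for this illustration.

def score_alignment(system_nes, gc_nes):
    gc_spans = {(ne["start"], ne["end"]) for ne in gc_nes}
    sys_spans = {(ne["start"], ne["end"]) for ne in system_nes}

    correct = sys_spans & gc_spans    # identified by the system and present in the GC
    spurious = sys_spans - gc_spans   # returned by the system but absent from the GC
    missing = gc_spans - sys_spans    # present in the GC but not identified
    return correct, missing, spurious
```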

Measures

In the Second HAREM, the combined semantic classification (CSC) measure will be used.

In general, a NE in the GC can be vague among several interpretations (N of them). The system's output can also contain vague NEs, which can give rise to spurious classifications (M of them).

Each NE will be evaluated by the following equation:

CSC = 1 + Σ(i = 1..N) { α*(1 - 1/num-cats)*cat-certa + β*(1 - 1/num-tipos)*tipo-certo + γ*(1 - 1/num-subtipos)*subtipo-certo }
          - Σ(j = 1..M) { α*(1/num-cats)*cat-espuria + β*(1/num-tipos)*tipo-espurio + γ*(1/num-subtipos)*subtipo-espurio }

num-cats = number of possible values for the particular CATEGory (10 in the complete scenario, but it can be fewer in selective scenarios)
num-tipos = number of possible values for TIPO for the given CATEGory
num-subtipos = number of possible values for SUBTIPO for the given pair CATEG/TIPO.

cat-certa = 1 (if CATEG is correct) or 0 (if CATEG is wrong)
cat-espuria = 1 (if CATEG is spurious) or 0 (if it's not)
tipo-certo = 1 (if TIPO is correct) or 0 (if TIPO is wrong)
tipo-espurio = 1 (if TIPO is spurious) or 0 (if it's not)
subtipo-certo = 1 (if SUBTIPO is correct) or 0 (if SUBTIPO is wrong)
subtipo-espurio = 1 (if SUBTIPO is spurious) or 0 (if it's not)

α, β and γ are parameters to be adjusted later, corresponding to different weights for CATEGories, TIPOs and SUBTIPOs.
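A minimal sketch of this computation follows, assuming a hypothetical representation in which each alternative classification carries the binary flags and the counts defined above; only the arithmetic mirrors the equation.

```python
# Hedged sketch of the combined semantic classification (CSC) score for one
# aligned NE. The lists of alternatives and their dict keys are assumptions
# for illustration; they are not the official HAREM data format.

def combined_score(gc_alternatives, spurious_alternatives, alpha, beta, gamma):
    score = 1.0
    # N alternative interpretations carried by the (possibly vague) GC NE
    for alt in gc_alternatives:
        score += alpha * (1 - 1 / alt["num_cats"]) * alt["cat_certa"]
        score += beta * (1 - 1 / alt["num_tipos"]) * alt["tipo_certo"]
        score += gamma * (1 - 1 / alt["num_subtipos"]) * alt["subtipo_certo"]
    # M spurious classifications produced by vagueness in the system's output
    for alt in spurious_alternatives:
        score -= alpha * (1 / alt["num_cats"]) * alt["cat_espuria"]
        score -= beta * (1 / alt["num_tipos"]) * alt["tipo_espurio"]
        score -= gamma * (1 / alt["num_subtipos"]) * alt["subtipo_espurio"]
    return score
```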

If a NE is vague neither in the GC nor in the system's output, the previous equation simplifies to:

CSC = 1 + α*(1 - 1/num-cats)*cat-certa + β*(1 - 1/num-tipos)*tipo-certo + γ*(1 - 1/num-subtipos)*subtipo-certo - α*(1/num-cats)*cat-espuria - β*(1/num-tipos)*tipo-espurio - γ*(1/num-subtipos)*subtipo-espurio
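As a purely illustrative worked example, take α = β = γ = 1 and a non-vague NE whose CATEG is correct (num-cats = 10) and whose TIPO is also correct, assuming hypothetically 5 possible TIPOs for that CATEG and no SUBTIPO: the spurious terms are 0, the SUBTIPO terms drop out, and CSC = 1 + (1 - 1/10) + (1 - 1/5) = 2.7.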

In addition to this measure for the correctly identified NEs, the numbers of missing and spurious NEs will also be collected.

Metrics

The metrics (corresponding to the aggregation of the measure values over all the NEs) will be the usual ones:

Precision

Precision measures the quality of the system's output: it is the proportion of correct answers among all the answers returned by the system.

Precision = Number of correctly classified NEs / Number of NEs classified by the system

Precision = Σ of the scores obtained by each NE / Maximum score attainable if all the system's classifications were correct

Recall

Recall measures the proportion of solutions in the GC that the system was able to reproduce.

Recall = Number of correctly classified NEs / Number of classified NEs in the GC

Recall = Σ score obtained by each NE / maximum score in the GC

F-measure

The F-measure is the harmonic mean of precision and recall.

F-measure = (2 * Precision * Recall) / (Precision + Recall)
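The snippet below sketches how the three metrics above could be computed from the per-NE scores; the argument names are assumptions chosen only for illustration.

```python
# Illustrative computation of precision, recall and F-measure from summed
# per-NE scores: obtained scores, the maximum attainable over the system's
# classifications, and the maximum attainable over the GC.

def precision_recall_f(obtained_scores, max_system_scores, max_gc_scores):
    precision = sum(obtained_scores) / sum(max_system_scores)
    recall = sum(obtained_scores) / sum(max_gc_scores)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure
```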

Overgeneration

Overgeneration measures the proportion of spurious results in the system's output.

Overgeneration = Number of spurious NEs / Number of NEs classified by the system

Undergeneration

Undergeneration measures the proportion of NEs in the GC that the system failed to identify, compared to the GC.

Undergeneration = Number of missing NEs / Number of NEs in the Golden Collection
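Again only as an illustration, both error-oriented metrics reduce to simple ratios over the counts already collected; the function and variable names below are hypothetical.

```python
# Minimal sketch of overgeneration and undergeneration from raw counts.

def over_under_generation(n_spurious, n_missing, n_system, n_gc):
    overgeneration = n_spurious / n_system   # spurious NEs among the system's classifications
    undergeneration = n_missing / n_gc       # GC NEs the system failed to produce
    return overgeneration, undergeneration
```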

Last update: 1 April 2008.