Second HAREM: Evaluation



Scores

Each NE receives one of three scores (correct, missing or spurious), according to how it aligns with the NEs in the Golden Collection (GC).
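As an illustration only, the sketch below (in Python, using a hypothetical span-based representation of NEs) shows how this three-way scoring could be derived by aligning the system's NEs with those of the GC; the actual HAREM alignment also handles partial matches and is more elaborate.

```python
# Hedged sketch: classify each NE as correct, missing or spurious by exact
# span alignment against the Golden Collection. The function name and the
# dict keys ("start", "end") are assumptions made for this illustration.

def score_alignment(system_nes, gc_nes):
    gc_spans = {(ne["start"], ne["end"]) for ne in gc_nes}
    sys_spans = {(ne["start"], ne["end"]) for ne in system_nes}

    correct = sys_spans & gc_spans    # identified by the system and present in the GC
    spurious = sys_spans - gc_spans   # returned by the system but absent from the GC
    missing = gc_spans - sys_spans    # present in the GC but not identified
    return correct, missing, spurious
```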

Measures

In the Second HAREM, the combined semantic classification (CSC) measure will be used.

In general, a NE in the GC can be vague among several interpretations (N of them). The system's output can also contain vague NEs, which can give rise to spurious classifications (M of them).

Each NE will be evaluated by the following equation:

CSC = 1 + Σ(i = 1..N) { α*(1 - 1/num-cats)*cat-certa + β*(1 - 1/num-tipos)*tipo-certo + γ*(1 - 1/num-subtipos)*subtipo-certo }
          - Σ(j = 1..M) { α*(1/num-cats)*cat-espuria + β*(1/num-tipos)*tipo-espurio + γ*(1/num-subtipos)*subtipo-espurio }

num-cats = number of possible values for the particular CATEGory (10 in the complete scenario, but it can be fewer in selective scenarios)
num-tipos = number of possible values for TIPO for the given CATEGory
num-subtipos = number of possible values for SUBTIPO for the given pair CATEG/TIPO.

cat-certa = 1 (if CATEG is correct) or 0 (if CATEG is wrong)
cat-espuria = 1 (if CATEG is spurious) or 0 (if it's not)
tipo-certo = 1 (if TIPO is correct) or 0 (if TIPO is wrong)
tipo-espurio = 1 (if TIPO is spurious) or 0 (if it's not)
subtipo-certo = 1 (if SUBTIPO is correct) or 0 (if SUBTIPO is wrong)
subtipo-espurio = 1 (if SUBTIPO is spurious) or 0 (if it's not)

α, β and γ are parameters to be adjusted later, corresponding to different weights for CATEGories, TIPOs and SUBTIPOs.
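A minimal sketch of this computation follows, assuming a hypothetical representation in which each alternative classification carries the binary flags and the counts defined above; only the arithmetic mirrors the equation.

```python
# Hedged sketch of the combined semantic classification (CSC) score for one
# aligned NE. The lists of alternatives and their dict keys are assumptions
# for illustration; they are not the official HAREM data format.

def combined_score(gc_alternatives, spurious_alternatives, alpha, beta, gamma):
    score = 1.0
    # N alternative interpretations carried by the (possibly vague) GC NE
    for alt in gc_alternatives:
        score += alpha * (1 - 1 / alt["num_cats"]) * alt["cat_certa"]
        score += beta * (1 - 1 / alt["num_tipos"]) * alt["tipo_certo"]
        score += gamma * (1 - 1 / alt["num_subtipos"]) * alt["subtipo_certo"]
    # M spurious classifications produced by vagueness in the system's output
    for alt in spurious_alternatives:
        score -= alpha * (1 / alt["num_cats"]) * alt["cat_espuria"]
        score -= beta * (1 / alt["num_tipos"]) * alt["tipo_espurio"]
        score -= gamma * (1 / alt["num_subtipos"]) * alt["subtipo_espurio"]
    return score
```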

If a NE is vague neither in the GC nor in the system's output, the previous equation simplifies to:

CSC = 1 + α*(1 - 1/num-cats)*cat-certa + β*(1 - 1/num-tipos)*tipo-certo + γ*(1 - 1/num-subtipos)*subtipo-certo - α*(1/num-cats)*cat-espuria - β*(1/num-tipos)*tipo-espurio - γ*(1/num-subtipos)*subtipo-espurio
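As a purely illustrative worked example, take α = β = γ = 1 and a non-vague NE whose CATEG is correct (num-cats = 10) and whose TIPO is also correct, assuming hypothetically 5 possible TIPOs for that CATEG and no SUBTIPO: the spurious terms are 0, the SUBTIPO terms drop out, and CSC = 1 + (1 - 1/10) + (1 - 1/5) = 2.7.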

In addition to this measure for the correctly identified NEs, the numbers of missing and spurious NEs will also be collected.

Metrics

The metrics (corresponding to the aggregation of the measure values over all the NEs) will be the usual ones:

Precision

Precision measures the quality of the system's output: it is the proportion of correct answers among all the answers returned by the system.

Precision = Number of correctly classified NEs / Number of NEs classified by the system

Precision = Σ of the scores obtained by each NE / Maximum score attainable if all the system's classifications were correct

Recall

Recall measures the proportion of solutions in the GC that the system was able to reproduce.

Recall = Number of correctly classified NEs / Number of classified NEs in the GC

Recall = Σ score obtained by each NE / maximum score in the GC

F-measure

The F-measure is the harmonic mean of precision and recall.

F-measure = (2 * Precision * Recall) / (Precision + Recall)
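The snippet below sketches how the three metrics above could be computed from the per-NE scores; the argument names are assumptions chosen only for illustration.

```python
# Illustrative computation of precision, recall and F-measure from summed
# per-NE scores: obtained scores, the maximum attainable over the system's
# classifications, and the maximum attainable over the GC.

def precision_recall_f(obtained_scores, max_system_scores, max_gc_scores):
    precision = sum(obtained_scores) / sum(max_system_scores)
    recall = sum(obtained_scores) / sum(max_gc_scores)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure
```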

Overgeneration

Overgeneration measures the proportion of spurious results in the system's output.

Overgeneration = Number of spurious NEs / Number of NEs classified by the system

Undergeneration

Undergeneration measures the proportion of NEs in the GC that the system failed to identify, compared to the GC.

Undergeneration = Number of missing NEs / Number of NEs in the Golden Collection
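Again only as an illustration, both error-oriented metrics reduce to simple ratios over the counts already collected; the function and variable names below are hypothetical.

```python
# Minimal sketch of overgeneration and undergeneration from raw counts.

def over_under_generation(n_spurious, n_missing, n_system, n_gc):
    overgeneration = n_spurious / n_system   # spurious NEs among the system's classifications
    undergeneration = n_missing / n_gc       # GC NEs the system failed to produce
    return overgeneration, undergeneration
```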

Last update: 1 April 2008.