HAREM: NER for Portuguese

An updated version of the package with all final Second HAREM resources is available from https://www.linguateca.pt/HAREM/PacoteRecursosSegundoHAREM.zip (see Readme.txt), together with a relation glossary. The package includes:

a new version of the ReRelEM CD (CDSegundoHAREMReRelEM.xml) covering the full HAREM golden collection;
an updated version of the TEMPO CD (CDSegundoHAREM_TEMPO.xml), with just a few small changes, fully compatible with the above CD.

For compatibility reasons, we also make available here the golden collection with NE annotation only, CDSegundoHAREMclassico.xml, as well as LÂMPADA 1.0, for anyone interested in repeating the Second HAREM exactly.

The book on the Second HAREM has been published.

What is HAREM?

HAREM is an evaluation contest for named entity recognition in Portuguese. Its first edition (First HAREM) started in September 2004, comprised two evaluation events, and officially ended at the First HAREM Workshop in Porto, 15 July 2006.

The current edition of HAREM (Second HAREM) is taking place now (see the schedule below).

Who organizes HAREM?

Linguateca organizes HAREM in the scope of its IRE model (Information, Resources, and Evaluation).

The Second HAREM is currently organized by the following members of the Linguateca team: Diana Santos (coord.), Cláudia Freitas, Hugo Oliveira, David Cruz, Paula Carvalho, Luís Miguel Cabral and Cristina Mota (since May 2008).

The First HAREM had Diana Santos and Nuno Cardoso as coordinators, and Nuno Seco, Rui Vilela, Paulo Rocha, Susana Afonso and Anabela Barreiro as further members of the organization.

Guidelines

So far, the guidelines are only available in Portuguese.

Basic Second HAREM guidelines: html (12 March 2008), set of examples: pdf (4 March 2008), table of categories, types and subtypes (24 March 2008)
Description of time (category TEMPO), proposed by Hagège, Baptista and Mamede: pdf (13 April 2008)
ReRelEM pilot track: Recognition of semantic relations between NEs: html (10 April 2008)

We strongly encourage participants to make heavy use of the input validator provided by the organization.

A full description of the syntax (in Portuguese) and the list of lower-case words accepted as part of an NE are also available (sintaxe and minusculas).

Evaluation measures

They are described and exemplified in Second HAREM: Evaluation.

Briefly, we use a generalization of First HAREM's CSC for scoring semantic classification, together with the usual measures of precision, recall, overgeneration, undergeneration and F-measure.
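As a rough illustration of the usual measures named above, the following Python sketch computes them from three counts of named entities. It is not the official HAREM scorer: the names Counts, safe_div and measures are hypothetical, the definitions are the textbook ones (overgeneration as the share of spurious system NEs, undergeneration as the share of missed golden-collection NEs), and the CSC weighting of semantic classification is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical container for identification counts (not a HAREM structure)."""
    correct: int   # system NEs that match the golden collection
    spurious: int  # system NEs with no counterpart in the golden collection
    missing: int   # golden-collection NEs the system did not produce


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def measures(c: Counts) -> dict[str, float]:
    """Textbook definitions; assumed here, not taken from the HAREM documentation."""
    produced = c.correct + c.spurious  # everything the system output
    expected = c.correct + c.missing   # everything in the golden collection
    precision = safe_div(c.correct, produced)
    recall = safe_div(c.correct, expected)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
        "overgeneration": safe_div(c.spurious, produced),
        "undergeneration": safe_div(c.missing, expected),
    }


if __name__ == "__main__":
    for name, value in measures(Counts(correct=6, spurious=2, missing=4)).items():
        print(f"{name}: {value:.3f}")
```

Because partially correct NEs are no longer considered, the sketch treats every system NE as either correct or spurious; the official definitions are those in Second HAREM: Evaluation.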

The two main differences from the First HAREM are that (i) partially correct NEs are no longer considered, and (ii) different possible delimitations are systematically coded through the ALT tag, which systems are also encouraged to use in their output. In fact, in addition to vague classifications such as <EM CATEG="PESSOA|ORGANIZACAO">, we also expect systems to code more than one alternative identification with the <ALT> syntax.
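To make the vague classification notation concrete, here is a minimal Python sketch that reads the CATEG attribute of an opening EM tag and splits it on "|" into its alternative categories. The function name categories and the regular expression are illustrative assumptions; the sketch only handles the attribute form shown above and does not attempt to parse the <ALT> syntax, which is specified in the guidelines.

```python
import re

# Matches the CATEG attribute in an opening EM tag, e.g. CATEG="PESSOA|ORGANIZACAO".
CATEG_RE = re.compile(r'CATEG="([^"]*)"')


def categories(em_open_tag: str) -> list[str]:
    """Return the alternative categories of a (possibly vague) EM tag.

    Hypothetical helper: assumes alternatives are separated by '|' inside
    the CATEG attribute, as in the example given in the text.
    """
    match = CATEG_RE.search(em_open_tag)
    if match is None:
        return []
    return [c.strip() for c in match.group(1).split("|") if c.strip()]


if __name__ == "__main__":
    print(categories('<EM CATEG="PESSOA|ORGANIZACAO">'))  # ['PESSOA', 'ORGANIZACAO']
    print(categories('<EM CATEG="PESSOA">'))              # ['PESSOA']
```

A scorer could then accept a system answer if its category matches any of the listed alternatives, although how vague classifications are actually scored is defined by the CSC-based measure, not by this sketch.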

See some detailed examples of evaluation here:

Finally, we have also provided separate evaluation measures for:

The new attributes under the TEMPO category (10 April 2008)
The new ReRelEM track (10 April 2008)

HAREM resources

The Second HAREM collection is already available, as well as information about source, language variety, origin and text genre:

Second HAREM Collection

Metadata

The only small differences relative to the one provided to the participants are:

The removal of some repeated documents, listed in repeticoes.
The cleaning of some spurious XML sequences (<P />, <P>) from documents relkj7666 and 2ght33.

Also, the golden collection has been made available:

Second HAREM Golden Collection
Detailed documentation of the options taken in its creation (in Portuguese)
Subset of Golden Collection for the complete TEMPO
Detailed documentation of the options taken in its creation (in Portuguese)
Subset of Golden Collection for ReRelEM, for inspection by the participants
Detailed documentation of the options taken in its creation (in Portuguese)
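For readers who want a first look at an annotated collection, the following Python sketch counts EM elements per category with the standard library XML parser. It runs on a small made-up sample: only the EM element and its CATEG attribute come from this page, while the wrapper element sample, the sentence itself and the function name count_categories are hypothetical. The real golden collection files may have additional structure (for instance ALT blocks) that this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample; the <sample> wrapper is NOT part of the HAREM format.
SAMPLE = """<sample>
  <EM CATEG="PESSOA">Diana Santos</EM> coordinates HAREM at
  <EM CATEG="ORGANIZACAO">Linguateca</EM>, according to
  <EM CATEG="PESSOA|ORGANIZACAO">the organization</EM>.
</sample>"""


def count_categories(xml_text: str) -> Counter:
    """Count EM elements per category; a vague CATEG counts once per alternative."""
    root = ET.fromstring(xml_text)
    counts: Counter = Counter()
    for em in root.iter("EM"):
        for categ in em.get("CATEG", "").split("|"):
            if categ:
                counts[categ] += 1
    return counts


if __name__ == "__main__":
    for categ, n in sorted(count_categories(SAMPLE).items()):
        print(categ, n)
```

To try it on a downloaded file, one could read the file contents and pass them to count_categories, provided the file is well-formed XML.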

Training resources for the Second HAREM

We have developed some examples of the full syntax of the Second HAREM:

Of the collection for Second HAREM, as will be distributed to participants: just text (25 January 2008)
Of the very same collection annotated according to the guidelines: annotated (26 March 2008)

Currently, you can access the (preliminary) resources compiled in the First HAREM and transformed according to the Second HAREM guidelines (basic, no SUBTIPO, no TEMPO yet):

Golden collection of MiniHAREM in First HAREM: just text, annotated (25 January 2008)
Golden collection of the first event of the First HAREM: just text, annotated (11 February 2008)
The full First HAREM collection (including both the above), just text: ColeccaoHAREM.xml

But please note that not every problem in these golden collections has been solved, so when the training material disagrees with the guidelines, the guidelines will take precedence.

Finally, the TEMPO group also provided us with the first 10% of the previous golden collection from MiniHAREM:

Training material for full TEMPO guidelines: tempo.xml (14 April 2008)

Evaluation programs

Due to the change of measures and the change of HAREM syntax from the First to the Second HAREM, the programs had to be modified and, in some cases, written from scratch.

User's manual for the HAREM programs (classical HAREM) (in Portuguese)
User's manual for the HAREM programs (TEMPO track) (in Portuguese)
User's manual for the HAREM programs (ReRelEM track) (in Portuguese)
User's manual for individual reports (in Portuguese)
Documentation on selective scenarios and their evaluation (in Portuguese)
The evaluation programs (for classical HAREM, TEMPO and ReRelEM, and report generation) (last update: 13 November 2008)

Results

Currently, you can inspect the results for

the main track, the TEMPO track and the ReRelEM track
performance measures provided by the systems themselves (running time and execution environment)
each system's individual results

Schedule

Until 10th November 2007: Registration as a prospective participant open: 22 groups registered for the Second HAREM in response to the call for interest.
Until 30th November 2007: General discussion about what the Second HAREM should look like.
December 2007: Preliminary guidelines and example collections made available.
January 2008: Final guidelines available, together with the evaluation architecture, and (training) evaluation resources conforming to those guidelines.
14 - 28 April 2008: Evaluation contest took place (submissions only open for 48h after download of the collection): See the final participating teams (10 systems) out of the 16 systems enrolled for the Second HAREM.
8 May 2008: Second HAREM collection and its metadata were made available.
16 May 2008: First version of TEMPO golden collection available for inspection.
4 June 2008: First version of TEMPO golden collection available for inspection. Final version of the Second HAREM golden collection (classical mode) available.
6 June 2008: First version of ReRelEM golden collection available for inspection.
12 June 2008: Final version of TEMPO golden collection.
19 June 2008: HAREM results (main track, classical mode) available.
25 June 2008: TEMPO results made available.
31 July 2008: Final version of ReRelEM golden collection available.
6 August 2008: ReRelEM results made available.
8 August 2008: Final individual reports (except for ReRelEM) made available.
21 August 2008: New results of ReRelEM made available.
7 September 2008: HAREM workshop, as a satellite event of PROPOR 2008.
12 October 2008: Papers for the book on the Second HAREM due.
17 November 2008: Final resources packaged.
25 July 2009: The book on the Second HAREM was made publicly available.
7 April 2010: New ReRelEM golden collection, covering the complete HAREM golden collection, available.
27 April 2010: LÂMPADA 2.0 delivered.

More information about HAREM

We have published the following book on the Second HAREM:

Cristina Mota & Diana Santos (eds.). Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: O Segundo HAREM. Linguateca, 2008. https://www.linguateca.pt/LivroSegundoHAREM/

Funding

Last update: 6 May 2010.