HAREM - evaluation contest for named entity recognizers in Portuguese

Linguateca - 2006.

HAREM Golden Collection


The HAREM Golden Collection can be downloaded from:

HAREM Golden Collection (ZIP, 217 KB) (Last review: 4th of November, 2005)

It contains approximately 93,000 words from 129 texts, covering several genres and language varieties.

The distribution by genre and language variety is shown in the figures below.

Figure 1: Distribution by genre


Figure 2: Distribution by variant


The collection is annotated with the following categories and sub-categories; the table below gives the number of named entities (NEs) in each.

Table 1: Distribution of categories and sub-categories in the GC

Category       Sub-type        English gloss        Nr. NEs
PESSOA         INDIVIDUAL      individual person        856
               CARGO           title of employment       79
               MEMBRO          members                   10
               GRUPOIND        group of people           10
               GRUPOCARGO      group of titles           19
               GRUPOMEMBRO     group of members         137
ORGANIZACAO    ADMINISTRACAO   administration           224
               INSTITUICAO     institution              462
               EMPRESA         company                  230
               SUB             sub-organization          61
TEMPO          DATA            date                     335
               HORA            time                      39
               PERIODO         period                    62
               CICLICO         cyclic                     5
LOCAL          CORREIO         address                   17
               ADMINISTRATIVO  administrative           906
               GEOGRAFICO      geographic                86
               VIRTUAL         virtual                  126
               ALARGADO        extended                 161
OBRA           PRODUTO         product                   74
               REPRODUZIDA     reproducible work         89
               ARTE            unique work               10
               PUBLICACAO      publication               51
ACONTECIMENTO  EFEMERIDE       unique                    23
               ORGANIZADO      large event               62
               EVENTO          atomic event              45
ABSTRACCAO     DISCIPLINA      subject                  228
               MARCA           brandname                 36
               ESTADO          condition                 34
               ESCOLA          school                    14
               IDEIA           ideal                     45
               PLANO           plan                      40
               OBRA            complete works             4
               NOME            name                      76
COISA          OBJECTO         object                    39
               SUBSTANCIA      substance                  9
               CLASSE          class                     37
VALOR          CLASSIFICACAO   classification            62
               QUANTIDADE      amount                   370
               MOEDA           money                     53
VARIADO                        other                     42
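
Table 1 lists counts per sub-category only. As a rough aid, the per-category totals can be tallied from those rows. The sketch below is illustrative rather than part of the HAREM distribution: the counts are copied from Table 1, while the dictionary layout, the "(none)" placeholder key for VARIADO, and the function name category_totals are assumptions made here.

```python
# Counts copied from Table 1. The nested-dict layout is an assumption
# made for illustration; it is not a format used by the HAREM files.
TABLE_1 = {
    "PESSOA": {"INDIVIDUAL": 856, "CARGO": 79, "MEMBRO": 10,
               "GRUPOIND": 10, "GRUPOCARGO": 19, "GRUPOMEMBRO": 137},
    "ORGANIZACAO": {"ADMINISTRACAO": 224, "INSTITUICAO": 462,
                    "EMPRESA": 230, "SUB": 61},
    "TEMPO": {"DATA": 335, "HORA": 39, "PERIODO": 62, "CICLICO": 5},
    "LOCAL": {"CORREIO": 17, "ADMINISTRATIVO": 906, "GEOGRAFICO": 86,
              "VIRTUAL": 126, "ALARGADO": 161},
    "OBRA": {"PRODUTO": 74, "REPRODUZIDA": 89, "ARTE": 10,
             "PUBLICACAO": 51},
    "ACONTECIMENTO": {"EFEMERIDE": 23, "ORGANIZADO": 62, "EVENTO": 45},
    "ABSTRACCAO": {"DISCIPLINA": 228, "MARCA": 36, "ESTADO": 34,
                   "ESCOLA": 14, "IDEIA": 45, "PLANO": 40, "OBRA": 4,
                   "NOME": 76},
    "COISA": {"OBJECTO": 39, "SUBSTANCIA": 9, "CLASSE": 37},
    "VALOR": {"CLASSIFICACAO": 62, "QUANTIDADE": 370, "MOEDA": 53},
    # VARIADO has no sub-type in the table; "(none)" is a placeholder key.
    "VARIADO": {"(none)": 42},
}


def category_totals(table):
    """Sum the sub-category counts of each category."""
    return {category: sum(subtypes.values())
            for category, subtypes in table.items()}


totals = category_totals(TABLE_1)

# Print categories from most to least frequent, then the sum of all rows.
for category, count in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{category:<15}{count:>6}")
print(f"{'All rows':<15}{sum(totals.values()):>6}")
```

These sums reflect only the rows as printed in Table 1; whether an entity can be counted under more than one row is not stated on this page.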
Last page update: 18/11/2005 10:17:12