Making the CHAVE collection available

Linguateca



CHAVE is a Portuguese text collection for information retrieval (IR) and question answering (QA), created for CLEF in 2004 and updated every year (see the CLEF website, as well as a paper [Santos & Rocha 2004] describing its creation in some detail).

Since April 2007, a version syntactically annotated by PALAVRAS (Bick, 2000) has been available.

Since January 2010, a version of CHAVE annotated with named entities by REMBRANDT 0.7 (Cardoso, 2008) has also been available.

In addition to a large set of documents, namely the full 1994 and 1995 editions of the Público and Folha de São Paulo newspapers, we make available the following resources, related to two different tracks:


This resource is organized as follows:

Information about the texts

The newspaper collections consist of the 1994 and 1995 editions of two major Portuguese-language daily newspapers: PÚBLICO and Folha de São Paulo.

For the CLEF 2004 edition, only PÚBLICO 1995 was used for ad hoc IR, and PÚBLICO 1994 and 1995 for QA@CLEF. For CLEF 2005 and 2006, both newspapers and both years were used for QA@CLEF and ad hoc IR. In 2006 they were also used for GeoCLEF. In 2007 they continued to be used in QA@CLEF and GeoCLEF, as well as in the robust track.

The following table presents a quantitative description of the collections:

Collections          Público        Folha de São Paulo
Years                1994-1995      1994-1995
Editions             726            730
Documents            106,821        103,913
Size                 348,078 kB     226,690 kB
Units (total)        64,222,797     42,109,286
Units (different)    500,197        426,469
Words (total)        54,947,072     35,699,765
Words (different)    472,817        393,885
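
The figures in the table can be combined into a few derived measures, such as documents per edition or words per document. The following is a minimal sketch in Python that uses only the numbers given above. The function name `derived_stats` and the dictionary keys are illustrative, and reading "Different" as the number of distinct word forms (so that the last ratio is a type/token ratio) is an assumption, not something the table states.

```python
# Figures copied from the table above.
COLLECTIONS = {
    "Público": {
        "editions": 726,
        "documents": 106_821,
        "words_total": 54_947_072,
        "words_different": 472_817,
    },
    "Folha de São Paulo": {
        "editions": 730,
        "documents": 103_913,
        "words_total": 35_699_765,
        "words_different": 393_885,
    },
}


def derived_stats(counts):
    """Return simple ratios computed from one collection's counts."""
    return {
        "docs_per_edition": counts["documents"] / counts["editions"],
        "words_per_document": counts["words_total"] / counts["documents"],
        # ASSUMPTION: "Different" is read as the number of distinct word forms.
        "type_token_ratio": counts["words_different"] / counts["words_total"],
    }


if __name__ == "__main__":
    for name, counts in COLLECTIONS.items():
        stats = derived_stats(counts)
        print(
            f"{name}: "
            f"{stats['docs_per_edition']:.1f} documents per edition, "
            f"{stats['words_per_document']:.1f} words per document, "
            f"type/token ratio {stats['type_token_ratio']:.4f}"
        )
```

On these numbers, both newspapers average roughly 140 to 150 documents per edition, while PÚBLICO documents are noticeably longer on average than those of Folha de São Paulo.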

Notes:

Here is a sample document from PÚBLICO (SGML 351KB, gz 135KB, with the corresponding DTD) and from Folha de São Paulo (SGML, gz and corresponding DTD).
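
For readers who want to inspect a downloaded sample, here is a minimal sketch of how a (possibly gzipped) SGML file could be split into documents. It is not the official reader for the collection. It assumes that each document is wrapped in `<DOC>...</DOC>` and carries a `<DOCNO>` identifier, as is common in CLEF-style collections, and that the files are encoded in ISO-8859-1; both points should be checked against the DTD. The function names and the file name in the usage note are hypothetical.

```python
import gzip
import re
from typing import Iterator, Tuple

# ASSUMPTION: CLEF-style markup with <DOC> and <DOCNO> elements.
# Check the DTD distributed with the collection before relying on these names.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL | re.IGNORECASE)


def read_collection(path: str, encoding: str = "iso-8859-1") -> str:
    """Read a plain or gzipped SGML file into one string.

    ASSUMPTION: the default encoding is a guess, not taken from the source page.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding=encoding) as handle:
        return handle.read()


def iter_documents(sgml_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (document id, raw document body) pairs.

    The id is an empty string when no <DOCNO> element is found.
    """
    for match in DOC_RE.finditer(sgml_text):
        body = match.group(1).strip()
        docno = DOCNO_RE.search(body)
        yield (docno.group(1) if docno else "", body)
```

A typical use would be `for doc_id, body in iter_documents(read_collection("sample.sgml.gz")): ...`, where `sample.sgml.gz` stands for whichever sample file was downloaded. A regular expression is used instead of an SGML parser only to keep the sketch short; it does not validate the file against the DTD.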

Conditions for use

In order to comply with CLEF tradition, we request that users of CHAVE obey the following conditions:
  1. Register in order to get the collection
  2. Reference the following facts: that the collection consists of the 1994 and 1995 complete editions of Público newspaper (www.publico.pt) and Folha de São Paulo (www.folha.com.br), that it was compiled by Linguateca (www.linguateca.pt), and that this compilation occurred in the framework of CLEF (www.clef-campaign.org)
  3. Use it for research and development only; not for reselling or making profit from its direct distribution (on-line or off-line)
  4. Do not invoke CLEF's name for results obtained outside the official CLEF campaigns in a way that implies that the system was assessed within CLEF; that is, it is not acceptable to compare such results with campaign results without clearly stating that they were obtained outside the campaign. Ideally, one should simply refer to the CHAVE collection.
  5. Mention that the syntactic annotation was made by PALAVRAS (Bick, 2000) and the NE-annotation by REMBRANDT (Cardoso, 2008).
Please also note that CHAVE is part of a much larger multilingual collection (to be) distributed by ELRA, which we strongly encourage anyone interested in CLIR to obtain.
Last update: 29 January 2010.