Making the CHAVE collection available
Linguateca
CHAVE is a Portuguese collection for IR and Q&A created for CLEF in 2004 and updated every year (see the CLEF website, as well as a paper [Santos & Rocha 2004] describing its creation in some detail).
Since April 2007, a version syntactically annotated by PALAVRAS (Bick, 2000) has been available.
Since January 2010, a version of CHAVE annotated with named entities by REMBRANDT 0.7 (Cardoso, 2008) has also been available.
In addition to a large set of documents, namely the full 1994 and 1995 editions of the Público and Folha de São Paulo newspapers, we make available the following resources, related to three different tracks:
- Information Retrieval (IR) Ad Hoc Track (2004, 2005, 2006, Robust: 2007)
  - a list of topics in Portuguese, compiled cooperatively with the other CLEF organizers
  - a pool of (binarily) judged documents for each topic
- Question Answering Evaluation QA@CLEF (2004, 2005, 2006, 2007, 2008)
  - a list of questions and answers in Portuguese, compiled cooperatively with the other QA@CLEF organizers (data for 2004 can be downloaded directly from the organization)
  - a (non-exhaustive) set of document ids that support the answer(s) for a subset of the above
- Geographic Information Retrieval GeoCLEF (2006, 2007, 2008)
  - a list of topics in Portuguese, compiled cooperatively with the other CLEF organizers
  - a pool of (binarily) judged documents for each topic
This resource is organized as follows:
- Textos - Folder containing the complete texts of the newspapers PÚBLICO and Folha de São Paulo of 1994 and 1995. Since April 2007, it also contains a version syntactically annotated by PALAVRAS (Bick, 2000); those files can be identified by the cg. prefix.
- 2004 - The Portuguese resources concerning CLEF2004
- 2005 - The Portuguese resources concerning CLEF2005
- 2006 - The Portuguese resources concerning CLEF2006
- 2007 - The Portuguese resources concerning CLEF2007
- 2008 - The Portuguese resources concerning CLEF2008
- 200x/Monte - Document pools for each topic relative to CLEF200x
- 200x/PerguntasRespostas - Questions and answers compiled by the organization of CLEF200x
- 200x/Topicos - Topics in Portuguese, compiled by the organization of CLEF200x
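The layout above can be navigated programmatically. The following is a minimal sketch, assuming the folder names listed above are used literally and that every year folder has the same three subfolders (which may not hold for every edition). The root name `CHAVE`, the function names and the sample file names are hypothetical and only serve as illustration.

```python
from pathlib import PurePosixPath

# Year folders and per-year subfolders, as listed in the description above.
YEARS = (2004, 2005, 2006, 2007, 2008)
SUBFOLDERS = ("Monte", "PerguntasRespostas", "Topicos")


def year_folders(root, year):
    """Return the expected subfolder paths for one CLEF edition.

    `root` is wherever the collection was unpacked (an assumption).
    """
    if year not in YEARS:
        raise ValueError(f"no CHAVE resources listed for {year}")
    base = PurePosixPath(root) / str(year)
    return {name: base / name for name in SUBFOLDERS}


def is_palavras_annotated(filename):
    """True if the file name carries the cg. prefix of the annotated version."""
    return PurePosixPath(filename).name.startswith("cg.")
```

For example, `year_folders("CHAVE", 2006)["Monte"]` gives the path `CHAVE/2006/Monte`, and `is_palavras_annotated("Textos/cg.sample.sgml")` is true for a (hypothetical) annotated file.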
Information about the texts
The newspaper collections correspond to two major daily newspapers in Portuguese in 1994 and 1995: PÚBLICO and Folha de São Paulo.
For the CLEF 2004 edition, only PÚBLICO 1995 was used for ad hoc IR, and PÚBLICO 1994 and 1995 for QA@CLEF. For CLEF 2005 and 2006, both newspapers and both years were used for QA@CLEF and ad hoc IR. In 2006 they were also used for GeoCLEF. In 2007 they continue to be used in QA@CLEF and GeoCLEF, as well as for the robust track.
The following table presents a quantitative description of the collections:
| Collections | Público | Folha de São Paulo |
| Years | 1994-1995 | 1994-1995 |
| Editions | 726 | 730 |
| Documents | 106,821 | 103,913 |
| Size | 348,078 kB | 226,690 kB |
| Units (total) | 64,222,797 | 42,109,286 |
| Units (different) | 500,197 | 426,469 |
| Words (total) | 54,947,072 | 35,699,765 |
| Words (different) | 472,817 | 393,885 |
Notes:
- Público is not published on Christmas Day or New Year's Day, so it has 4 fewer editions over the two years.
- A word was defined as a letter followed by a (possibly empty) sequence of letters and hyphens.
- Markup was not counted.
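The word definition in the notes above can be sketched as a regular expression. This is only an illustration under stated assumptions, not the procedure actually used to produce the table: markup is taken to mean anything between `<` and `>`, counting is case-sensitive, and the letter class `[^\W\d_]` is an approximation of "letter" in Python's `re` module. The function name `count_words` is hypothetical.

```python
import re
from collections import Counter

# Approximation of "a letter": a word character that is neither a digit
# nor an underscore (assumption; covers accented Portuguese letters).
LETTER = r"[^\W\d_]"

# A letter followed by a possibly empty sequence of letters and hyphens.
WORD_RE = re.compile(rf"{LETTER}(?:{LETTER}|-)*")

# Assumed notion of markup: anything enclosed in angle brackets.
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text):
    """Return (total words, different words), ignoring markup."""
    # Replace tags with a space so adjacent words are not glued together.
    plain = TAG_RE.sub(" ", sgml_text)
    counts = Counter(WORD_RE.findall(plain))
    return sum(counts.values()), len(counts)
```

Under these assumptions, a hyphenated form such as "guarda-chuva" counts as a single word and a number such as "1995" is not counted at all.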
Here is a sample document from PÚBLICO (SGML 351KB, gz 135KB, with the corresponding DTD) and from Folha de São Paulo (SGML, gz and corresponding DTD).
Conditions for use
In order to comply with CLEF tradition, we request that users of CHAVE obey the following conditions:
- Register in order to get the collection
- Reference the following facts: that the collection consists of the 1994 and 1995 complete editions of Público newspaper (www.publico.pt) and Folha de São Paulo (www.folha.com.br), that it was compiled by Linguateca (www.linguateca.pt), and that this compilation occurred in the framework of CLEF (www.clef-campaign.org)
- Use it for research and development only; not for reselling or making profit from its direct distribution (on-line or off-line)
- Results obtained outside the official CLEF campaigns must not invoke CLEF's name in a way that implies that the system was assessed within CLEF; that is, it is not acceptable to compare them with official campaign results without clearly stating that they were obtained outside the campaign. Ideally, one should simply refer to the CHAVE collection.
- Mention that the syntactic annotation was made by PALAVRAS (Bick, 2000) and the NE-annotation by REMBRANDT (Cardoso, 2008).
Please also note that CHAVE is part of a much larger (multilingual) collection (to be) distributed by ELRA, which we strongly encourage anyone interested in CLIR to obtain.
Last update: 29 January 2010.