The Floresta Sintá(c)tica project
Floresta Sintá(c)tica (syntactic forest) is a publicly available treebank for Portuguese, created as a collaboration project
between the VISL project, http://visl.sdu.dk, and Linguateca (formerly the Computational Processing of Portuguese project), http://www.linguateca.pt.
The Floresta is based on human revision of the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000) at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in Bick, Eckhard. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus University Press, 2000.
The textual material of Floresta comes from the CETEMPúblico and CETENFolha corpora (more precisely, the first million words of each).
Please see the paper at LREC'2002 for a general description of the project.
All information in English so far is listed below:
- Portuguese VISL category set
- The Constraint Grammar category set of "Palavras"
- Grammatical categories (tags) used in the Floresta project.
- Susana Afonso, Eckhard Bick & Ana Raquel Marchi. Notational and terminological guide-lines
- Susana Afonso, Eckhard Bick & Ana Raquel Marchi. Documentation of the choices in the treebank project
- We have also created a FAQ
- Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos. "'Floresta sintá(c)tica': a treebank for Portuguese", in Manuel González Rodríguez & Carmen Paz Suárez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), ELRA, 2002, pp. 1698-1703. rtf ps Associated poster: ps
- Some examples prepared jointly by Susana Afonso and Diana Santos for the presentation at LREC'2002, 31 May 2002.
- Some non-trivial cases, prepared by Susana Afonso, June 2002.
- Santos, Diana. "The Floresta experience", presentation at the Swedish Treebank Symposium (Växjö University, 28-29 November 2002). PowerPoint slides in Postscript format
- Santos, Diana. "Timber! Issues in treebank building and use", in Nuno J. Mamede, Jorge Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.), Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR 2003, Faro, 26-27 June 2003, Proceedings, Springer Verlag, 2003, pp.151-8. (c) Springer-Verlag. ps rtf
- Bick, Eckhard. "Treebank Troubles", presentation at Avalon'2003 (Faro, 28 June 2003), PowerPoint slides in Postscript format
Team
Project leaders: Diana Santos and Eckhard Bick.
Linguistic revision
Susana Afonso (November 2000 to the present)
Raquel Marchi (November 2000 to September 2001; January 2003 to the present)
Anabela Barreiro Colasuonno (May-December 2002)
Tool development
Renato Haber (November 2000 to September 2001)
Luís Sarmento (November-December 2002)
Rui Vilela (August 2004 to the present)
Results
The Floresta Sintá(c)tica project has so far produced:
- Bosque, a subset of Floresta, fully revised by the linguistic team (version 7.4, 22 December 2005): 9,431 trees, corresponding to 1,962 extracts of CETEMPúblico and CETENFolha, 9,368 distinct sentences, 215,003 tokens and ca. 184,773 words
- Floresta Virgem, a set of trees automatically created from the CG output of the PALAVRAS parser, corresponding to the first million words of the CETEMPúblico corpus. Size: 41,382 trees (roughly corresponding to 7,913 extracts, 41,406 sentences and 1,072,857 tokens). It includes the contents of Bosque without manual revision.
Each tree of our treebank corresponds to three different objects:
- CG representation in text format
- Phrase tree in text format
- Phrase tree in graphical format
The two phrase-tree objects (text and graphical) contain exactly the same information and differ only in presentation mode, while the CG representation contains neither constituent nor attachment information (only dependency). We have some example sentences to illustrate the three objects.
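As a rough illustration of how a phrase tree in text format might be read programmatically, here is a minimal Python sketch. It rests on assumptions this page does not state: that each node sits on its own line, that nesting depth is marked by the number of leading "=" signs, and that a node is written as FUNCTION:form, with terminal nodes followed by a tab and the word itself. The sample lines and the names Node and parse_tree_line are hypothetical, made up for the example; they are not taken from the treebank files.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    depth: int            # number of leading "=" signs (assumed nesting marker)
    function: str         # syntactic function, e.g. "SUBJ" (assumed notation)
    form: str             # form/category part, e.g. "np"
    word: Optional[str]   # surface word for terminals, None for non-terminals


def parse_tree_line(line: str) -> Node:
    """Split one line of an (assumed) text-format phrase tree into its parts."""
    stripped = line.lstrip("=")
    depth = len(line) - len(stripped)
    label, _, word = stripped.partition("\t")
    function, _, form = label.partition(":")
    return Node(depth, function, form, word or None)


# Invented sample, for illustration only.
SAMPLE_TREE = [
    "STA:fcl",
    "=SUBJ:np",
    "==>N:art('o' M S)\tO",
    "==H:n('governo' M S)\tgoverno",
    "=P:v-fin('cair' PS 3S IND)\tcaiu",
]

nodes = [parse_tree_line(line) for line in SAMPLE_TREE]
for node in nodes:
    # Indent by depth to make the constituent structure visible.
    print("  " * node.depth + node.function, node.form, node.word or "")
```

Because depth is stored on every node, the constituent structure can be rebuilt afterwards with a simple stack if a nested representation is needed.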
Access
Bosque
One can download the phrase trees that constitute the Bosque:
They can also be individually inspected in graphical format at the VISL site, Portuguese zone, choosing, under "Non-automatic parse", "Floresta sintá(c)tica treebank", http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.
Alternatively, at the VISL site, Portuguese zone, choose, under "Non-automatic parse", "Pre-analysed Portuguese sentences", then "Newspaper corpus treebank (Floresta)", http://visl.sdu.dk/visl/pt/treebank.html, and click on the figure preceding each sentence.
The Bosque is also available in the Penn Treebank and TIGER formats, in XML, through the work of the Braga node of Linguateca, see Floresta page at Braga.
Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing. We are grateful to Sabine Buchholz for processing Bosque and making it available for the CoNLL-X exercise. The data provided here were prepared by her and her team; we simply make them available as is.
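For readers who want to load the shared-task version, the sketch below reads the CoNLL-X layout: one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and a blank line between sentences. The function name read_conllx and the sample sentence, including its tags, are invented for illustration and are not taken from the distributed files.

```python
from typing import Dict, Iterable, Iterator, List

CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Dict[str, object]]]:
    """Yield sentences as lists of token dicts from CoNLL-X formatted lines."""
    sentence: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token: Dict[str, object] = dict(zip(CONLLX_FIELDS, cols))
        # ID and HEAD are integers; leave a placeholder such as "_" untouched.
        for key in ("id", "head"):
            value = str(token[key])
            if value.isdigit():
                token[key] = int(value)
        sentence.append(token)
    if sentence:
        yield sentence


# Invented sample, for illustration only.
SAMPLE = (
    "1\tO\to\tart\tart\tM|S\t2\t>N\t_\t_\n"
    "2\tgoverno\tgoverno\tn\tn\tM|S\t3\tSUBJ\t_\t_\n"
    "3\tcaiu\tcair\tv\tv-fin\tPS|3S|IND\t0\tSTA\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
for tok in sentences[0]:
    print(tok["id"], tok["form"], "->", tok["head"], tok["deprel"])
```

A HEAD value of 0 marks the root of the sentence, so the dependency tree can be rebuilt by following each token's head index.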
Finally, they can be queried through
Floresta Virgem
Floresta Virgem can also be queried through Águia, a tool for searching the Floresta treebank, as well as obtained as two single files:
- FlorestaVirgem_CP_3.0: CETEMPúblico's first million words, automatically annotated and transformed into trees by PALAVRAS 2.0 in September 2006: AD format, VISL AD format, CG format
- FlorestaVirgem_CF_2.1: CETENFolha's first million words, automatically annotated and transformed into trees by PALAVRAS 2.0 in September 2006: AD format, VISL AD format, CG format
Last update: 26 September 2006.
Comments and suggestions about the Floresta treebank