The Floresta Sintá(c)tica project (old page)

Floresta Sintá(c)tica (syntactic forest) is a publicly available treebank for Portuguese, created as a collaboration between the VISL project (http://visl.sdu.dk) and Linguateca, formerly the Computational Processing of Portuguese project (http://www.linguateca.pt).

The Floresta is based on human revision of the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000) at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in: Bick, Eckhard. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, 2000.

The Floresta Sintá(c)tica project comprises three corpora:

Floresta Virgem: a set of trees automatically created from the CG output of the PALAVRAS parser, corresponding to the first million words of the CETEMPúblico corpus. Size: approx. 1,639,585 words.
Bosque: a subset of Floresta, fully revised by the linguistic team. Size: 162,484 tokens (approx. 140,278 lexical units).
Selva: scientific, literary and transcribed spoken texts, each further subdivided into approximately equal shares of Portuguese and Brazilian texts. Size: approx. 300,000 words. Selva is intended to be a partially reviewed corpus.

These three corpora can be queried through Milhafre, a user-friendly search interface.

Documentation

Please see the paper at LREC'2002 for a general description of the project. All information in English so far is listed below:

Portuguese VISL category set
The Constraint Grammar category set of "Palavras"
Grammatical categories (tags) used in the Floresta project.
Susana Afonso, Eckhard Bick & Ana Raquel Marchi. Notational and terminological guide-lines
Susana Afonso, Eckhard Bick & Ana Raquel Marchi. Documentation of the choices in the treebank project
We have also created a FAQ
Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos. "'Floresta sintá(c)tica': a treebank for Portuguese", in Manuel González Rodríguez & Carmen Paz Suárez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), ELRA, 2002, pp. 1698-1703. Formats: rtf, ps. Associated poster: ps.
Some examples prepared jointly by Susana Afonso and Diana Santos for the presentation at LREC'2002, 31 May 2002.
Some non-trivial cases, prepared by Susana Afonso, June 2002.
Santos, Diana. "The Floresta experience", presentation at the Swedish Treebank Symposium (Växjö University, 28-29 November 2002). PowerPoint slides in Postscript format
Santos, Diana. "Timber! Issues in treebank building and use", in Nuno J. Mamede, Jorge Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.), Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR 2003, Faro, 26-27 June 2003, Proceedings, Springer-Verlag, 2003, pp. 151-8. © Springer-Verlag. Formats: ps, rtf.
Bick, Eckhard. "Treebank Troubles", presentation at Avalon'2003 (Faro, 28 June 2003), PowerPoint slides in Postscript format

Team

Project leaders: Diana Santos (to September 2007) and Eckhard Bick.

Linguistic revision
Susana Afonso (November 2000 to 2005)
Raquel Marchi (November 2000 to September 2001; January 2003 to 2005)
Anabela Barreiro Colasuonno (May-December 2002)
Cláudia Freitas (June 2007 to present)

Tool development
Renato Haber (November 2000 to September 2001)
Luís Sarmento (November-December 2002)
Rui Vilela (August 2004 to December 2005)
Paulo Rocha (June 2007 to present)

Results

The Floresta Sintá(c)tica project has so far produced:

Bosque: a subset of Floresta, fully revised by the linguistic team. Version 7.6 (8 July 2007) contains 9,369 reviewed trees from 1,962 extracts, with 162,484 tokens (approx. 140,278 lexical units).
Floresta Virgem: a set of trees automatically created from the CG output of the PALAVRAS parser, corresponding to the first million words of the CETEMPúblico corpus. Size: 87,702 trees (roughly corresponding to 7,913 extracts, 41,406 sentences, 1,917,648 tokens and 1,648,289 lexical units). It does not include the sentences that are part of Bosque.
Selva: a corpus of around 300,000 words and 30,000 sentences, divided into three roughly equal shares of scientific, literary and transcribed spoken texts, each further subdivided into approximately equal shares of Portuguese and Brazilian texts. Selva is intended to be a partially reviewed corpus, in which selected characteristics are reviewed one by one across the corpus, instead of the complete annotation being revised tree by tree as in Bosque.

Each tree of our treebank corresponds to three different objects:

1. CG representation in text format
2. Phrase tree in text format
3. Phrase tree in graphical format

Objects 2 and 3 contain exactly the same information and differ only in presentation mode, while object 1 contains neither constituents nor attachment information (only dependency). We have some example sentences to illustrate the three objects.
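As a rough illustration of why the text and graphical phrase trees can carry the same information, the sketch below reads a depth-marked text tree into a nested structure from which a drawing could be derived. It assumes a convention in which each line holds a `FUNCTION:form` label, an optional tab-separated word, and a number of leading `=` characters equal to the node's depth. The sample lines and the names `Node`, `parse_tree` and `leaves` are invented for illustration and are not taken from the Floresta files.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # e.g. "SUBJ:np" (function:form)
    word: str = ""                  # surface word, only present on leaves
    children: list = field(default_factory=list)


def parse_tree(lines):
    """Build a nested tree from lines whose leading '=' count is the depth."""
    root = None
    stack = []                      # stack[d] is the last node seen at depth d
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip("="))
        label, _, word = line[depth:].partition("\t")
        node = Node(label=label.strip(), word=word.strip())
        if depth == 0:
            root = node
        else:
            # Attach to the most recent node one level up.
            stack[depth - 1].children.append(node)
        del stack[depth:]           # forget deeper nodes from earlier branches
        stack.append(node)
    return root


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    if not node.children:
        return [node.word]
    return [w for child in node.children for w in leaves(child)]


# Hypothetical sample, written for this sketch only.
SAMPLE = [
    "STA:fcl",
    "=SUBJ:np",
    "==>N:art\tO",
    "==H:n\tgoverno",
    "=P:v-fin\tcaiu",
]

tree = parse_tree(SAMPLE)
print(leaves(tree))
```

The nested `Node` structure holds everything needed either to print the indented text form again or to lay the tree out graphically, which is the sense in which the two presentations are equivalent.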

Access

Bosque

The phrase trees that constitute Bosque (v7.6) can be downloaded in several formats from the main page.

They can also be individually inspected in graphical format at the VISL site, Portuguese zone, by choosing "Floresta sintá(c)tica treebank" under "Non-automatic parse": http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.

Alternatively, at the VISL site, Portuguese zone, choose "Non-automatic parse", then "Pre-analysed Portuguese sentences", then "Newspaper corpus treebank (Floresta)" (http://visl.sdu.dk/visl/pt/treebank.html), and click on the tree icon preceding each sentence.

The Bosque is also available in the Penn Treebank and TIGER formats, in XML, through the work of the Braga node of Linguateca, see Floresta page at Braga.

Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing. We are grateful to Sabine Buchholz for processing Bosque and making it available for the CoNLL-X exercise. The data provided here were prepared by her and her team; we simply make them available as is.
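For readers working with the CoNLL-X version, a minimal reading sketch follows. It assumes the standard CoNLL-X layout: one token per line with ten tab-separated fields (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. The two-token sample, its tag values and the function name `read_conllx` are invented for illustration and are not taken from the distributed data.

```python
FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines):
    """Yield sentences as lists of token dicts from CoNLL-X formatted lines."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(values)}")
        token = dict(zip(FIELDS, values))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])   # 0 marks the root
        sentence.append(token)
    if sentence:                             # file may lack a final blank line
        yield sentence


# Hypothetical sample, written for this sketch only.
SAMPLE = (
    "1\tO\to\tart\tart\t_\t2\t>N\t_\t_\n"
    "2\tgoverno\tgoverno\tn\tn\t_\t0\tSTA\t_\t_\n"
    "\n"
)

for sent in read_conllx(SAMPLE.splitlines()):
    for tok in sent:
        head = "ROOT" if tok["head"] == 0 else sent[tok["head"] - 1]["form"]
        print(tok["form"], "<-", head)
```

Keeping the fields as plain dictionaries makes it easy to compare against the CG representation, which likewise records dependency links rather than constituents.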

Finally, they can be queried through:

Milhafre, a user-friendly tool for searching the Floresta treebanks
Águia, another tool for searching the Floresta treebank (older versions of the corpus; see also a preliminary tutorial)
tgrep, over the Penn Treebank format, at http://corp.hum.sdu.dk/tgrepeye_pt.html (older version of the corpus)

Floresta Virgem

Most of Floresta Virgem can also be queried through Milhafre and Águia. The following versions are available:

FlorestaVirgem_CP_3.0: CETEMPúblico's first million words, automatically annotated and transformed into trees by PALAVRAS 2.0 in June 2008: AD format, VISL AD format, CG format
FlorestaVirgem_CF_3.0: CETENFolha's first million words, automatically annotated and transformed into trees by PALAVRAS 2.0 in June 2008: AD format, VISL AD format, CG format

Last update: 8 July 2008

Comments and suggestions about the Floresta treebank