The Floresta Sintá(c)tica project

Presentation
Contents
Documentation
Team

Presentation

Floresta Sintá(c)tica (syntactic forest) is a publicly available treebank for Portuguese, created as a collaboration project between the VISL project, http://visl.sdu.dk, and Linguateca (formerly the Computational Processing of Portuguese project), http://www.linguateca.pt.

Floresta is based on the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000) at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in Bick, Eckhard. The Parsing System Palavras, Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus University Press, 2000.

Strictly speaking, the Floresta treebank can be regarded as a merger of three distinct components:

the texts themselves
the automatic annotation produced by PALAVRAS
the human revision produced by the team

While the resulting treebank is publicly available for download on the Web, formal licensing requires the consent of the text owners, of the PALAVRAS owner, and of the Floresta project team (Linguateca and VISL).

The Floresta Sintá(c)tica treebank comprises four parts:

Floresta Virgem: a set of trees automatically created from the CG output of the PALAVRAS parser, corresponding roughly to the first million tokens of each of the CETEMPúblico and CETENFolha corpora, with a size of ca. 1,640,000 words. The text material comes from the PÚBLICO newspaper and the Folha de São Paulo newspaper, respectively.
Bosque: a subset of Floresta Virgem, fully revised and corrected within the scope of the project, with a current size of 162,484 lexical units.
Selva: a set of trees automatically created from the output of the PALAVRAS parser, from a set of scientific, literary and transcribed spoken texts, subdivided into approximately equal shares of Portuguese and Brazilian texts and partially revised by the Floresta team, with a size of ca. 300,000 words.
Amazônia: a set of trees automatically created from the output of the PALAVRAS parser from the Brazilian cultural blog site Overmundo, not revised, with a size of ca. 3.5 million words.

These four parts can be queried through:

Milhafre, a user-friendly search interface created by Paulo Rocha and Cláudia Freitas
Águia, an older search interface created by Diana Santos (currently only Bosque and Floresta Virgem)
CorpusEye, another search interface in VISL created by Eckhard Bick (currently only Bosque and Floresta Virgem)

They can also be downloaded in several formats (see the example sentences for a detailed explanation of each format):

Bosque v8.0, 9,368 trees, corresponding to 1,962 different extracts, featuring 162,484 tokens, approx. 140 thousand words: in CG (constraint grammar), AD (phrase structure tree), graphical, tgrep, Penn Treebank and TIGER formats
Floresta Virgem, v3.0, 87,702 trees, corresponding to 7,913 extracts, 41,406 sentences, 1,917,648 tokens and 1,648,289 lexical units: in CG (constraint grammar), AD (phrase structure tree), graphical, and tgrep formats
Selva, v1.0, in CG, AD and tgrep format
Amazônia, v2.0, ca. 3,500,000 words: in CG, AD and tgrep format

They can also be individually inspected in graphical format at the VISL site, Portuguese zone, choosing, under "Non-automatic parse", "Floresta sintá(c)tica treebank", http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.

Alternatively, they can be inspected at the VISL site, Portuguese zone, choosing, under "Non-automatic parse", "Pre-analysed Portuguese sentences", "Newspaper corpus treebank (Floresta)", http://visl.sdu.dk/visl/pt/treebank.html, and clicking on the graphical tree figure preceding each sentence.

The Bosque is also available in the Penn Treebank and TIGER formats, in XML, through the work of the Braga node of Linguateca; see the Floresta page at Braga.

The Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing. We are grateful to Sabine Buchholz for processing Bosque and making it available for the CoNLL-X exercise. The data provided here were prepared by her and her team; we simply make them available here as is.
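For readers who want to work with the CoNLL-X version, the following is a minimal sketch of how such a file could be read in Python. It relies only on the general CoNLL-X data format (ten tab-separated columns per token, one token per line, sentences separated by a blank line). The function names read_conllx and dependency_arcs are hypothetical, and the sample sentence with its tags is invented for illustration; it is not taken from Bosque and should not be read as the treebank's actual annotation.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten columns of the CoNLL-X shared task data format, in order.
CONLLX_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]

Token = Dict[str, str]

# Invented example sentence; tags and relations are illustrative only.
SAMPLE = (
    "1\tO\to\tart\tart\tM|S\t2\t>N\t_\t_\n"
    "2\tgato\tgato\tn\tn\tM|S\t3\tSUBJ\t_\t_\n"
    "3\tdorme\tdormir\tv\tv-fin\tPR|3S|IND\t0\tSTA\t_\t_\n"
    "4\t.\t.\tpunc\tpunc\t_\t3\tPUNC\t_\t_\n"
    "\n"
)


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, values)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


def dependency_arcs(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) triples.

    A head value of 0 marks the root and is rendered as "ROOT".
    """
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [(tok["form"], tok["deprel"], forms.get(tok["head"], "ROOT"))
            for tok in sentence]


if __name__ == "__main__":
    for sent in read_conllx(SAMPLE.splitlines()):
        for arc in dependency_arcs(sent):
            print(arc)
```

To read a downloaded file instead of the sample, an open file object can be passed to read_conllx directly, since it accepts any iterable of lines.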

Documentation

Please see the paper at LREC'2002 for a general description of the project. The full documentation can be found through Linguateca's publication catalogue, searching for items with the tag floresta; information available in English is listed below:

Portuguese VISL category set
The Constraint Grammar category set of PALAVRAS
Grammatical categories (tags) used in the Floresta project
Eckhard Bick, Susana Afonso & Ana Raquel Marchi. Notational and terminological guide-lines
Eckhard Bick, Susana Afonso & Ana Raquel Marchi. Documentation of the choices in the treebank project
We have also created an English FAQ
Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos. "'Floresta sintá(c)tica': a treebank for Portuguese", in Manuel González Rodríguez & Carmen Paz Suárez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), ELRA, 2002, pp. 1698-1703. pdf Poster
Some examples prepared jointly by Susana Afonso and Diana Santos for the presentation at LREC'2002, 31 May 2002.
Santos, Diana. "The Floresta experience", presentation at the Swedish Treebank Symposium (Växjö University, 28-29 November 2002). pdf
Santos, Diana. "Timber! Issues in treebank building and use", in Nuno J. Mamede, Jorge Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.), Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR 2003, Faro, 26-27 June 2003, Proceedings, Springer Verlag, 2003, pp. 151-8. pdf
Bick, Eckhard. "Treebank Troubles", presentation at Avalon'2003 (Faro, 28 June 2003), pdf
Cláudia Freitas, Paulo Rocha & Eckhard Bick. "Floresta Sintá(c)tica: Bigger, Thicker and Easier". In António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008), Springer Verlag, 2008, pp. 216-219. Poster pdf

A full description of the linguistic choices we made can be found, in Portuguese, in what we have called the Bíblia Florestal.

Team

Project leaders

Eckhard Bick
Diana Santos (except for the period from September 2008 to March 2010)

Linguistic revision

Cláudia Freitas (since June 2007)
Susana Afonso (November 2000 to 2005)
Raquel Marchi (November 2000 to September 2001; January 2003 to 2005)
Anabela Barreiro Colasuonno (May-December 2002)

Tool development

Paulo Rocha (June 2007 to December 2008)
Rui Vilela (August 2004 to 2005)
Luís Sarmento (November-December 2002)
Renato Haber (November 2000 to September 2001)

Last update of this page: 3 August 2010.

Comments and suggestions about the Floresta treebank
