The Floresta Sintá(c)tica project
Floresta Sintá(c)tica (syntactic forest) is a publicly
available treebank for Portuguese, created as a collaboration project
between the VISL project, http://visl.sdu.dk, and Linguateca (formerly the Computational Processing of Portuguese project), http://www.linguateca.pt.
Floresta is based on the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000) at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in Bick, Eckhard. The Parsing System Palavras, Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus University Press, 2000.
Strictly speaking, the Floresta treebank can be considered a merger of three distinct components:
- the texts themselves
- the automatic annotation produced by PALAVRAS
- the human revision produced by the team
While the final result is publicly available for download on the Web, formal licensing requires the consent of the text owners, of the PALAVRAS owner, and of the Floresta project team (Linguateca and VISL).
The Floresta Sintá(c)tica treebank comprises four parts:
- Floresta Virgem: a set of trees automatically created from the CG output of the PALAVRAS parser, corresponding roughly to the first million tokens of the CETEMPúblico and CETENFolha corpora, with a size of ca. 1,640,000 words. The text material comes from the PÚBLICO and Folha de São Paulo newspapers, respectively.
- Bosque: a subset of Floresta Virgem, fully revised and corrected in the scope of the project, with a current size of 162,484 lexical units.
- Selva: a set of trees automatically created from the output of the PALAVRAS parser, from a set of scientific, literary and transcribed spoken texts, subdivided into approximately equal shares of Portuguese and Brazilian texts, partially revised by the Floresta team, with a size of ca. 300,000 words.
- Amazônia: a set of trees automatically created from the output of the PALAVRAS parser from the Brazilian cultural blog site Overmundo, not revised, with a size of ca. 3.5 million words.
These four parts can be queried through:
- Milhafre, a user-friendly search interface created by Paulo Rocha and Cláudia Freitas
- Águia, an older search interface created by Diana Santos (currently only Bosque and Floresta Virgem)
- CorpusEye, another search interface in VISL created by Eckhard Bick (currently only Bosque and Floresta Virgem)
They can also be downloaded in several formats (see example sentences for a detailed explanation of each format):
- Bosque v8.0, 9,368 trees, corresponding to 1,962 different extracts, featuring 162,484 tokens, approx. 140 thousand words: in CG (constraint grammar), AD (phrase structure tree), graphical, tgrep, Penn Treebank and TIGER formats
- Floresta Virgem, v3.0, 87,702 trees, corresponding to 7,913 extracts, 41,406 sentences, 1,917,648 tokens and 1,648,289 lexical units: in CG (constraint grammar), AD (phrase structure tree), graphical and tgrep formats
- Selva, v1.0: in CG, AD and tgrep formats
- Amazônia, v2.0, ca. 3,500,000 words: in CG, AD and tgrep formats
The trees can also be inspected individually in graphical format at the
VISL site, Portuguese zone, by choosing, under "Non-automatic parse",
"Floresta sintá(c)tica treebank", http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.
Alternatively, at the VISL site, Portuguese zone, choose, under "Non-automatic parse",
"Pre-analysed Portuguese sentences" and then "Newspaper corpus treebank (Floresta)", http://visl.sdu.dk/visl/pt/treebank.html, and click on the
figure preceding each sentence.
The Bosque is also available in the Penn Treebank and TIGER formats,
in XML, through the work of the Braga node of Linguateca, see Floresta page at Braga.
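For readers who want to process a Penn Treebank-style export programmatically, the following is a minimal Python sketch of a reader for bracketed trees. It assumes the usual Penn Treebank bracketing, in which each node is written as `(LABEL child ...)`. The function names, the sample sentence and its category labels are made up for illustration and are not taken from Bosque or from the Floresta tag set.

```python
import re
from typing import List, Union

# A tree is either a word (str) or a list whose first item is the node label.
Tree = Union[str, List["Tree"]]


def parse_brackets(text: str) -> Tree:
    """Parse one bracketed tree such as "(S (NP (N gato)) (VP (V dorme)))"."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == ")":
            raise ValueError("unexpected ')'")
        if tok != "(":
            return tok  # a label or a word
        node: List[Tree] = []
        while pos < len(tokens) and tokens[pos] != ")":
            node.append(read())
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the words of a tree, left to right, skipping node labels."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:  # tree[0] is the node label
        words.extend(leaves(child))
    return words


# Illustrative sentence with made-up labels (not the Floresta category set).
SAMPLE_TREE = "(S (NP (ART O) (N gato)) (VP (V dorme)))"
print(leaves(parse_brackets(SAMPLE_TREE)))
```

Nested lists keep the sketch free of dependencies; for real work, a treebank library with its own tree class is likely to be more convenient.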
Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing.
We are grateful to Sabine Buchholz for processing Bosque and making it
available for the CoNLL-X exercise. The data provided here were
prepared by her and her team; we simply make them available here as is.
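As an illustration of how the dependency version can be read, the sketch below parses text in the CoNLL-X shared task layout: one token per line, ten tab-separated columns, and a blank line between sentences. The function names are our own, and the sample sentence and its tags are invented for the example rather than copied from the released data.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten columns defined by the CoNLL-X shared task format.
CONLLX_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]

Token = Dict[str, str]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences one at a time, each as a list of token dicts."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, values)))
    if sentence:  # the file may not end with a blank line
        yield sentence


def dependency_triples(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (dependent, relation, head) triples; head id 0 is shown as ROOT."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [(tok["form"], tok["deprel"], forms.get(tok["head"], "ROOT"))
            for tok in sentence]


# Invented three-word sentence; the tags are illustrative only.
SAMPLE = (
    "1\tO\to\tart\tart\t_\t2\t>N\t_\t_\n"
    "2\tgato\tgato\tn\tn\t_\t3\tSUBJ\t_\t_\n"
    "3\tdorme\tdormir\tv\tv-fin\t_\t0\tSTA\t_\t_\n"
    "\n"
)

for sent in read_conllx(SAMPLE.splitlines()):
    for triple in dependency_triples(sent):
        print(triple)
```

To read a downloaded file, pass an open file object instead of `SAMPLE.splitlines()`, for example `read_conllx(open(path, encoding="utf-8"))`, where `path` is wherever the file was saved.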
Please see the LREC 2002 paper for a general description of the project. The full documentation can be found through Linguateca's publication catalogue, by searching for items with the tag floresta; information available in English is listed below:
- Portuguese VISL category set
- The Constraint Grammar category set of PALAVRAS
- Grammatical categories (tags) used in the Floresta project
- Eckhard Bick, Susana Afonso & Ana Raquel Marchi. Notational and terminological guide-lines
- Eckhard Bick, Susana Afonso & Ana Raquel Marchi. Documentation of the choices in the treebank project
- We have also created an English FAQ
- Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos. "'Floresta sintá(c)tica': a treebank for Portuguese", in Manuel González Rodríguez & Carmen Paz Suárez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), ELRA, 2002, pp. 1698-1703. pdf Poster
- Some examples prepared jointly by Susana Afonso and Diana Santos for the presentation at LREC 2002, 31 May 2002.
- Santos, Diana. "The Floresta experience", presentation at the Swedish Treebank Symposium (Växjö University, 28-29 November 2002). pdf
- Santos, Diana. "Timber! Issues in treebank building and use", in Nuno J. Mamede, Jorge Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.), Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR 2003, Faro, 26-27 June 2003, Proceedings, Springer Verlag, 2003, pp.151-8. pdf
- Bick, Eckhard. "Treebank Troubles", presentation at Avalon'2003 (Faro, 28 June 2003), pdf
- Cláudia Freitas, Paulo Rocha & Eckhard Bick. "Floresta Sintá(c)tica: Bigger, Thicker and Easier". In António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008), Springer Verlag, 2008, pp. 216-219. Poster pdf
A full description of the linguistic choices we made can be found, in Portuguese, in what we have called the Bíblia Florestal.
Project leaders
- Eckhard Bick
- Diana Santos (except for the period from September 2008 to March 2010)
Linguistic revision
- Cláudia Freitas (since June 2007)
- Susana Afonso (November 2000 to 2005)
- Raquel Marchi (November 2000 to September 2001; January 2003 to 2005)
- Anabela Barreiro Colasuonno (May-December 2002)
Tool development
- Paulo Rocha (June 2007 to December 2008)
- Rui Vilela (August 2004 to 2005)
- Luís Sarmento (November-December 2002)
- Renato Haber (November 2000 to September 2001)
Last update of this page: 3 August 2010.
Comments and suggestions about the Floresta treebank