The Floresta Sintá(c)tica project



Presentation

Floresta Sintá(c)tica ("syntactic forest") is a publicly available treebank for Portuguese, created as a collaboration between the VISL project, http://visl.sdu.dk, and Linguateca (formerly the Computational Processing of Portuguese project), http://www.linguateca.pt.

Floresta is based on the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000) at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in: Bick, Eckhard. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, 2000.

Strictly speaking, the Floresta treebank can be regarded as the merging of three distinct components:

While the resulting treebank is publicly available for download on the Web, formal licensing requires the consent of the text owners, of the PALAVRAS owner, and of the Floresta project team (Linguateca and VISL).

Contents

The Floresta Sintá(c)tica treebank comprises four parts:

These four parts can be queried through:

They can also be downloaded in several formats (see the example sentences for a detailed explanation of each format):

They can also be individually inspected in graphical format at the VISL site, Portuguese zone, choosing, under "Non-automatic parse", "Floresta sintá(c)tica treebank", http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.

Alternatively, they can be inspected at the VISL site, Portuguese zone, choosing, under "Non-automatic parse", "Non-automatic parse", "Pre-analysed Portuguese sentences", "Newspaper corpus treebank (Floresta)", http://visl.sdu.dk/visl/pt/treebank.html, and clicking on the graphical tree icon preceding each sentence.

The Bosque is also available in the Penn Treebank and TIGER formats, in XML, through the work of the Braga node of Linguateca; see the Floresta page at Braga.

The Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing. We are grateful to Sabine Buchholz for processing Bosque and making it available for the CoNLL-X exercise. The data provided here were prepared by her and her team; we simply make them available here as is.
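
For readers working with the shared-task version of Bosque, the following is a minimal Python sketch of how such a file might be read. It assumes the standard shared-task layout of ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The function names read_conllx and dependency_arcs and the sample sentence are hypothetical, introduced only for illustration; the tags and relation labels in the sample are invented and are not taken from the Bosque data.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Column names of the shared-task data format (ten tab-separated fields).
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]

Token = Dict[str, str]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLX_FIELDS):
            raise ValueError(
                f"expected {len(CONLLX_FIELDS)} columns, got {len(columns)}"
            )
        sentence.append(dict(zip(CONLLX_FIELDS, columns)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


def dependency_arcs(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) triples.

    A HEAD value of 0 has no matching token and is reported as "ROOT".
    """
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [
        (tok["form"], tok["deprel"], forms.get(tok["head"], "ROOT"))
        for tok in sentence
    ]


# Invented two-token sample; tags and labels are illustrative only.
SAMPLE = (
    "1\tEla\tela\tpron\tpron\t_\t2\tSUBJ\t_\t_\n"
    "2\tdorme\tdormir\tv\tv\t_\t0\tROOT\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
arcs = dependency_arcs(sentences[0])
```

To read a downloaded file instead of the inline sample, an open file handle can be passed directly to read_conllx, since it only needs an iterable of lines. The character encoding of the distributed files should be checked before opening them.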

Documentation

Please see the paper at LREC'2002 for a general description of the project. The full documentation can be found through Linguateca's publication catalogue, searching for items with the tag floresta. Information available in English is listed below:

A full description of the linguistic choices we made can be found, in Portuguese, in what we have called the Bíblia Florestal.

Team

Project leaders

Linguistic revision

Tool development

Last update of this page: 3 August 2010.