The Floresta Sintá(c)tica project (old page)
Floresta Sintá(c)tica (syntactic forest) is a publicly
available treebank for Portuguese, created as a collaboration project
between the VISL project, http://visl.sdu.dk, and Linguateca (formerly the Computational Processing of Portuguese project), http://www.linguateca.pt.
The Floresta is based on human revision of the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000)
at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in Bick, Eckhard. The Parsing System
Palavras, Automatic Grammatical Analysis of Portuguese in a Constraint
Grammar Framework, Aarhus University Press, 2000.
The Floresta Sintá(c)tica project comprises three corpora:
- Floresta Virgem: a set of trees automatically created from the CG output of the PALAVRAS parser, corresponding to the first million words of the CETEMPúblico corpus. Size: approx. 1,639,585 words.
- Bosque: a subset of Floresta, fully revised by the linguistic team. Size: approx. 162,484 tokens (about 140,278 lexical units).
- Selva: scientific, literary and transcribed spoken texts, further subdivided into approximately equal shares of Portuguese and Brazilian texts. Size: approx. 300,000 words. Selva is intended to be a partially reviewed corpus.
These three corpora can be queried through Milhafre, a user-friendly search interface.
Please see the paper at LREC'2002 for a general description of the project.
All information in English so far is listed below:
- Portuguese VISL category set
- The Constraint Grammar category set of "Palavras"
- Grammatical categories (tags) used in the Floresta project.
- Susana Afonso, Eckhard Bick & Ana Raquel Marchi. Notational and terminological guide-lines
- Susana Afonso, Eckhard Bick & Ana Raquel Marchi. Documentation of the choices in the treebank project
- We have also created a FAQ
- Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos. "'Floresta sintá(c)tica': a treebank for Portuguese", in Manuel González Rodríguez & Carmen Paz Suárez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), ELRA, 2002, pp. 1698-1703. An associated poster is also available.
- Some examples prepared jointly by Susana Afonso and Diana Santos for the presentation at LREC'2002, 31 May 2002.
- Some non-trivial cases, prepared by Susana Afonso, June 2002.
- Santos, Diana. "The Floresta experience",
presentation at the Swedish Treebank Symposium (Växjö University,
28-29 November 2002). PowerPoint slides in Postscript format
- Santos, Diana. "Timber! Issues in treebank building and use", in Nuno J. Mamede, Jorge Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.), Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR 2003, Faro, 26-27 June 2003, Proceedings, Springer-Verlag, 2003, pp. 151-158. (c) Springer-Verlag.
- Bick, Eckhard. "Treebank Troubles", presentation at Avalon'2003 (Faro, 28 June 2003), PowerPoint slides in Postscript format
Team
Project leaders: Diana Santos (to September 2007) and Eckhard Bick.
Linguistic revision
Susana Afonso (November 2000 to 2005)
Raquel Marchi (November 2000 to September 2001; Jan 2003 to 2005)
Anabela Barreiro Colasuonno (May-December 2002)
Cláudia Freitas (June 2007 to present)
Tool development
Renato Haber (November 2000 to September 2001)
Luís Sarmento (November-December 2002)
Rui Vilela (August 2004 to December 2005)
Paulo Rocha (June 2007 to present)
Results
The Floresta Sintá(c)tica project has so far produced:
- Bosque: a subset of Floresta, fully revised by the linguistic team. Version 7.6 (8 July 2007): 9,369 reviewed trees from 1,962 extracts, 162,484 tokens, approx. 140,278 lexical units.
- Floresta Virgem: a set of trees automatically created from the CG output of the PALAVRAS parser, corresponding to the first million words of the CETEMPúblico corpus. Size: 87,702 trees (roughly corresponding to 7,913 extracts, 41,406 sentences, 1,917,648 tokens and 1,648,289 lexical units). It does not include the sentences that belong to Bosque.
- Selva: a corpus of around 300,000 words and 30,000 sentences, divided into three roughly equal shares of scientific, literary and transcribed spoken texts, each further subdivided into approximately equal shares of Portuguese and Brazilian texts. Selva is intended to be a partially reviewed corpus, in which some characteristics of the corpus are reviewed one by one, instead of the complete annotation being revised tree by tree as in Bosque.
Each tree of our treebank corresponds to three different objects:
1. CG representation in text format
2. Phrase tree in text format
3. Phrase tree in graphical format
Objects 2 and 3 contain exactly the same information and differ only in presentation mode, while object 1 contains neither constituents nor attachment information (only dependency). We have some example sentences to illustrate the three objects.
Access
Bosque
The phrase trees that constitute the Bosque (v7.6) can be downloaded, in several formats, from the main page.
They can also be inspected individually in graphical format at the VISL site, Portuguese zone, by choosing "Floresta sintá(c)tica treebank" under "Non-automatic parse", http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.
Alternatively, at the VISL site, Portuguese zone, choose, under "Non-automatic parse", "Pre-analysed Portuguese sentences" and then "Newspaper corpus treebank (Floresta)", http://visl.sdu.dk/visl/pt/treebank.html, and click on the figure preceding each sentence.
The Bosque is also available in the Penn Treebank and TIGER formats,
in XML, through the work of the Braga node of Linguateca, see Floresta page at Braga.
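For readers unfamiliar with the Penn Treebank style of notation, the following is a minimal sketch of how a bracketed tree can be read into nested tuples. It assumes the plain parenthesised layout "(LABEL child child ...)"; the function names `parse_bracketed` and `leaves`, the sample sentence and its category labels are all hypothetical, made up for illustration, and are not taken from the Bosque files, whose exact layout may differ.

```python
from typing import List, Tuple, Union

# A node is (label, children); a child is either another node or a word string.
Node = Tuple[str, List[Union["Node", str]]]


def parse_bracketed(text: str) -> Node:
    """Parse one bracketed tree such as "(S (NP x) (VP y))"."""
    # Pad the brackets with spaces so that a plain split() tokenises the input.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]  # first token after '(' is the category label
        pos += 1
        children: List[Union[Node, str]] = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ')'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(node: Node) -> List[str]:
    """Return the words of a tree in left-to-right order."""
    words: List[str] = []
    for child in node[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical sentence and labels, for illustration only.
SAMPLE_TREE = "(S (NP (N Maria)) (VP (V dorme)))"
```

Calling `leaves(parse_bracketed(SAMPLE_TREE))` recovers the words of the sentence in order, which is a quick sanity check when converting between formats.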
Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing.
We are grateful to Sabine Buchholz for processing Bosque and making it available for the CoNLL-X exercise. The data provided here were prepared by her and her team; we simply make them available as is.
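As a rough illustration of how dependency data in the CoNLL-X layout can be read, here is a minimal sketch. It relies on the shared task's general format (one token per line, ten tab-separated columns, a blank line between sentences); the function names `read_conllx` and `heads`, and the sample sentence with its tags and relation labels, are hypothetical and are not taken from the Bosque release.

```python
from typing import Dict, Iterator, List, Tuple

# The ten column names of the CoNLL-X data format.
CONLLX_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLX_FIELDS):
            raise ValueError(
                f"expected 10 tab-separated columns, got {len(columns)}")
        sentence.append(dict(zip(CONLLX_FIELDS, columns)))
    if sentence:
        # Input that does not end with a blank line still yields its last sentence.
        yield sentence


def heads(sentence: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Return (word form, head index) pairs; head 0 marks the root."""
    return [(token["form"], int(token["head"])) for token in sentence]


# Hypothetical two-token sentence, made up for illustration (not from Bosque).
SAMPLE = (
    "1\tMaria\tMaria\tprop\tprop\tF|S\t2\tSUBJ\t_\t_\n"
    "2\tdorme\tdormir\tv\tv-fin\tPR|3S|IND\t0\tSTA\t_\t_\n"
    "\n"
)
```

Iterating over `read_conllx(SAMPLE)` gives one sentence, and `heads` applied to it shows which token each word depends on.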
Finally, they can be queried through
- Milhafre, a user-friendly tool for searching the Floresta treebanks
- Águia, another tool for searching the Floresta treebank (see also a preliminary tutorial) (older versions of the corpus);
- tgrep, over the Penn Treebank format, in http://corp.hum.sdu.dk/tgrepeye_pt.html (older version of the corpus)
Floresta Virgem
Most of Floresta Virgem can also be queried through Milhafre and Águia:
- FlorestaVirgem_CP_3.0: CETEMPúblico's first million words, automatically annotated and transformed into trees by PALAVRAS 2.0 in June 2008. Formats: AD format, VISL AD format, CG format.
- FlorestaVirgem_CF_3.0: CETENFolha's first million words, automatically annotated and transformed into trees by PALAVRAS 2.0 in June 2008. Formats: AD format, VISL AD format, CG format.
Last update: 8 July 2008