Floresta FAQ

Frequently asked questions (FAQ) about the Floresta Sintá(c)tica

What is the difference between form and function in the annotation scheme?

In the node tag, function is written in upper-case letters before the colon and form in lower-case letters after it: FUNCTION:form, e.g. SUBJ:np.

In content terms, function can be

(1) a clause level function, like subject (SUBJ), direct object (ACC), adverbial (ADVL) or predicator (P)

(2) a group level function, e.g. head (H) or dependents like nominal modifiers (>N, N<), adverbial modifiers (>A, A<) or argument of preposition (P<). The verbal functions of main verb (MV) and auxiliary (AUX) within a predicator group are a special case.

(3) the paratactic function of conjunct (CJT)

(4) utterance level function, like statement (STA) or question (QUE)

Form can be:

(a) word class, like noun (n), article (art) or adjective (adj). Subclasses are added with a hyphen, e.g. finite verbs (v-fin) or gerunds (v-ger), or as secondary tags in angle brackets, e.g. definite article <artd> or indefinite article <arti>, relatives <rel> or interrogatives <interr>. Morphological form is added to word classes in parentheses, e.g. (M P) for masculine plural.

(b) group forms, like noun phrase (np) or prepositional phrase (pp)

(d) the paratactic form of coordinated unit (cu)
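The FUNCTION:form convention can be illustrated with a minimal Python sketch. The names NodeTag and parse_node_tag are hypothetical and not part of any Floresta tool; the sketch only assumes that the first colon separates function from form, as in the examples above.

```python
from typing import NamedTuple


class NodeTag(NamedTuple):
    function: str  # upper-case part before the colon, e.g. "SUBJ", ">N", "P<"
    form: str      # lower-case part after the colon, e.g. "np", "v-fin", "art"


def parse_node_tag(tag: str) -> NodeTag:
    """Split a FUNCTION:form node tag at its first colon."""
    function, sep, form = tag.partition(":")
    if not sep or not function or not form:
        raise ValueError(f"not a FUNCTION:form tag: {tag!r}")
    return NodeTag(function, form)


print(parse_node_tag("SUBJ:np"))
print(parse_node_tag(">N:art"))
```

Secondary tags such as <artd> and morphology such as (M P) are not handled here; they would need their own fields.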

What is the meaning of the arrows left and right of some function tags? Why only on some function tags and not on all?

The arrows are dependency markers, as used in the Constraint Grammar tradition. The arrow points to the dependency head of the constituent in question (if unmarked, to the clause's verb). The arrow base may be marked for function (if unmarked, function is implied by dependency). In the Floresta notation, arrows are retained at group level in particular, where no distinction is made in functional terms other than position and head type. One could say that dependency here IS function, i.e. the >A tag in adjective and adverb phrases receives its intensifier function from the very fact that it modifies an adjective or adverb. The importance of position can be seen in the meaning difference between um grande homem and um homem grande, where preposed grande (>N) has a different meaning from postposed grande (N<). At clause level, on the other hand, many individual functions are distinguished, and a simple >V or V< instead of SUBJ, ACC etc. would not make much sense.
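As a rough illustration of how the arrow encodes both head type and position, the following Python sketch reads the bare group-level tags mentioned in this FAQ (>N, N<, >A, A<, P<). The function name head_direction is hypothetical, and tags whose arrow base is additionally marked for function are deliberately left out.

```python
from typing import Optional, Tuple


def head_direction(function_tag: str) -> Optional[Tuple[str, str]]:
    """Return (head_type, direction) for an arrow-marked tag, else None.

    The arrow points at the head: ">N" has its nominal head to the right,
    "N<" has its nominal head to the left.
    """
    if len(function_tag) > 1 and function_tag.startswith(">"):
        return function_tag[1:], "right"
    if len(function_tag) > 1 and function_tag.endswith("<"):
        return function_tag[:-1], "left"
    return None  # e.g. SUBJ or ACC, where the function is named explicitly


for tag in (">N", "N<", "P<", "SUBJ"):
    print(tag, head_direction(tag))
```

In um grande homem, grande carries >N and so looks rightwards for its head homem; in um homem grande it carries N< and looks leftwards.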

Why is em lugar de read as one unit, while da and do are split into two units?

The Floresta treebank is a syntactically annotated corpus, which is why tokens bearing function tags must be constituent units. Do and da are fused words consisting of a preposition (the head of a prepositional phrase, H) and an article, i.e. half a noun phrase (or less). The article's function is that of prenominal modifier/dependent (>N) of some noun to the right, not that of argument of preposition (P<), which pertains to the np consisting of both the article and that noun. By splitting do and da, it is possible to arrive at a standard structure like the following:

...:pp
=H:prp de
=P<:np
==>N:art o
==H:n ...
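For readers who want to process this notation, here is a minimal Python sketch that turns such lines into a nested structure. It assumes that depth is given by the number of leading = signs, that each line has the shape FUNCTION:form followed by an optional word, and that the input is well-formed; Node and parse_tree are hypothetical names. The elided parts of the example (...) are kept as placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    function: str
    form: str
    word: str = ""
    children: List["Node"] = field(default_factory=list)


def parse_tree(lines: List[str]) -> Node:
    """Build a tree from lines whose depth is the number of leading '='."""
    root = None
    stack: List[Node] = []  # stack[d] is the most recent node at depth d
    for line in lines:
        body = line.lstrip("=")
        depth = len(line) - len(body)
        tag, _, word = body.partition(" ")
        function, _, form = tag.partition(":")
        node = Node(function, form, word)
        if depth == 0:
            root = node
        else:
            stack[depth - 1].children.append(node)  # attach to parent
        del stack[depth:]  # forget deeper or same-depth siblings
        stack.append(node)
    if root is None:
        raise ValueError("no root line (depth 0) found")
    return root


tree = parse_tree(["...:pp", "=H:prp de", "=P<:np", "==>N:art o", "==H:n ..."])
for child in tree.children:
    print(child.function, child.form, [c.word for c in child.children])
```

Note that in ==>N:art o only the two = signs count as depth; the > belongs to the function tag.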

In order to maintain orthographic and corpus fidelity, the parts of split tokens are marked by <sam-> (first half) and <-sam> (second half).
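The splitting and its reversal can be sketched in Python as follows. The lookup table is a hypothetical stand-in that covers only do and da, the two contractions named in this FAQ; the function names are likewise illustrative and do not describe the actual Floresta tools.

```python
from typing import List, Tuple

# Hypothetical table: only the two contractions discussed in this FAQ.
CONTRACTIONS = {"do": ("de", "o"), "da": ("de", "a")}
FUSED = {parts: fused for fused, parts in CONTRACTIONS.items()}


def split_fused(token: str) -> List[Tuple[str, str]]:
    """Return (word, marker) pairs; marker is "" for tokens left unsplit."""
    parts = CONTRACTIONS.get(token.lower())
    if parts is None:
        return [(token, "")]
    first, second = parts
    return [(first, "<sam->"), (second, "<-sam>")]


def rejoin(pairs: List[Tuple[str, str]]) -> List[str]:
    """Restore the original orthography from <sam-> / <-sam> marked halves."""
    out, i = [], 0
    while i < len(pairs):
        word, marker = pairs[i]
        if marker == "<sam->" and i + 1 < len(pairs) and pairs[i + 1][1] == "<-sam>":
            out.append(FUSED[(word, pairs[i + 1][0])])
            i += 2
        else:
            out.append(word)
            i += 1
    return out


marked = split_fused("do") + [("homem", "")]
print(marked)
print(rejoin(marked))
```

The sketch lower-cases its input for lookup, so capitalisation of a sentence-initial contraction would have to be stored separately to be fully reversible.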

Polylexical fusion of expressions like em_lugar_de, on the other hand, always maintains constituent integrity (i.e. concerns whole constituents only) and is done partly to facilitate the work of the syntactic CG parser, partly to be able to assign function to a (synthetic) constituent that has no meaningful analytical reading, as in the case of conjunctional do_que, as opposed to the relative de o que. Ultimately, which expressions are read as synthetic tokens (polylexicals) is governed by the parser's lexicon. Most synthetic tokens are semantic units with one-word-translations into other languages, or syntactic units that in traditional Portuguese grammar are termed locuções (especially complex conjunctions, prepositions and adverbs).
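A lexicon-driven fusion step of this kind might look like the Python sketch below. The one-entry lexicon and the name join_polylexicals are assumptions made for illustration; context-dependent cases such as do_que versus de o que cannot be decided by simple lookup and are not attempted.

```python
from typing import List

# Hypothetical mini-lexicon; the real list is held by the parser's lexicon.
POLYLEXICALS = {("em", "lugar", "de")}
MAX_LEN = max(len(entry) for entry in POLYLEXICALS)


def join_polylexicals(tokens: List[str]) -> List[str]:
    """Greedily replace known word sequences by one underscore-joined token."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word sequences.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in POLYLEXICALS:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:  # no polylexical starts at position i
            out.append(tokens[i])
            i += 1
    return out


print(join_polylexicals(["chá", "em", "lugar", "de", "café"]))
```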

Which corpus was used for the Floresta treebank?

The first million words of the CETEMPúblico corpus, a 180-million-word running-text corpus from the Portuguese daily newspaper PÚBLICO, divided into 2-paragraph extracts (see http://cgi.portugues.mct.pt/cetempublico/). The Floresta team manually revised this material for sentence separation and removed some non-interesting material; it was then automatically analysed by PALAVRAS and the VISL tree-building programs. The Floresta treebank is built from this material. The whole corpus without human revision is what we call the Floresta Virgem; the final Floresta is progressing by manually revising it. Its first 1,067 sentences constitute what we call the Bosque. Ideally all of it will eventually be revised, though as an intermediate solution we are also considering a tripartition scheme with 10% fully revised, 50% partially revised (main constituents) and 50% revised only indirectly by corresponding principled changes in the automatic analysis.

We also hope to work on a corresponding corpus for Brazilian Portuguese soon.

When looking at graphical Floresta trees at the VISL site, what is the difference between 'VISL default' and 'CG-style'?

CG-style notation uses the original CG and Floresta tags, whereas VISL default and the other VISL notations use a cross-language system with different complexity levels, used for grammar teaching at different university and school levels.

For what practical purpose can this tool be used?

1. The Floresta corpus allows linguistic research involving many levels of grammar - morphology, word class, syntactic function, structural patterns. Users can extract examples for - and run statistics on - both word based categories and larger grammatical or lexical structures and contexts.

2. The treebank can help arrive at a gold standard of correct analysis, for evaluation across parsers, or to automatically create and/or induce computational grammars for such parsers.

3. The Floresta data can provide a training basis for all kinds of applications involving an automatic parser as a building block (question-answering systems, grammar checkers, information retrieval etc.) by providing them with an error-free first step on which to deploy the rest of the systems.

4. At the cross-language VISL site, a large proportion of the regular visitors use Floresta-type databases for grammar teaching/learning, assisted by different add-on programs such as games and tree manipulators.

Furthermore, the very process of designing and building a treebank can be seen as an exercise in consensus building about the syntactic analysis of a given language (among computational and descriptive syntacticians), which may also be of value as a kind of "reference grammar".
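As a small illustration of the first use (statistics on grammatical categories), the Python sketch below counts function and form tags in lines written in the =-indented tree notation used elsewhere in this FAQ. The sample lines for um grande homem, including its SUBJ role, are invented for illustration, and tag_counts is a hypothetical helper rather than part of the Floresta search interface.

```python
from collections import Counter
from typing import Iterable, Tuple


def tag_counts(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count function tags and form tags in '='-indented tree lines."""
    functions, forms = Counter(), Counter()
    for line in lines:
        tag = line.lstrip("=").split(" ", 1)[0]
        function, sep, form = tag.partition(":")
        if sep:  # skip lines that carry no FUNCTION:form tag
            functions[function] += 1
            forms[form] += 1
    return functions, forms


# Illustrative lines only; the SUBJ role of the np is an assumption.
sample = ["SUBJ:np", "=>N:art um", "=>N:adj grande", "=H:n homem"]
functions, forms = tag_counts(sample)
print(functions.most_common())
print(forms.most_common())
```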

Can I add my own corpus to the Floresta, then use the same tools for corpus searching and tree visualisation?

Yes, any Portuguese text can be automatically annotated with the Floresta annotation scheme, using the PALAVRAS parser and its filters. Just mail your corpus to lineb@hum.au.dk, and we will process it and make it accessible. Please make sure you have copyright clearance and can pass the corpus on to us for web use.

One word of caution: unless you have manpower for proof-reading (in which case we would be happy to provide a start-up workshop), there will be a certain percentage of errors, probably about 1% of part-of-speech readings and 4% of syntactic functions in a corpus without spelling errors, plus a proportion of syntactic bracketing errors depending on sentence length and complexity. Even with proof-reading, our experience shows that error frequency increases with structural complexity (embedded hyphenated or bracketed sentences, quotes, book titles, lists etc.).
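To get a feeling for what these rates mean in practice, the following Python sketch converts them into rough error counts. It assumes that both percentages apply per word and uses the approximate figures quoted above as defaults; bracketing errors are not modelled, since no rate is given for them.

```python
def expected_errors(n_words: int, pos_rate: float = 0.01, syn_rate: float = 0.04) -> dict:
    """Estimate tag errors in an unrevised corpus from approximate per-word rates."""
    return {
        "part_of_speech": round(n_words * pos_rate),
        "syntactic_function": round(n_words * syn_rate),
    }


print(expected_errors(10_000))
```

For a corpus of 10,000 words this gives roughly 100 part-of-speech errors and 400 syntactic function errors to be found by proof-reading.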

Which varieties of Portuguese are treated?

The Floresta treebank consists of samples of written European Portuguese, with a Brazilian Portuguese section without human proof-reading. The genre is in both cases modern newspaper text.

However, the underlying CG parser has also been used successfully on other varieties of Portuguese, such as literary prose, historical texts, urban Brazilian speech data and dialectal speech data from Portugal. Cf. the annotated Portuguese corpora at http://cgi.portugues.mct.pt and http://corpora.hum.sdu.dk . These corpora are not annotated with constituent tree structure as such, but their word-based tags offer the same range of form and function categories as the Floresta corpus.

What is a Constraint Grammar, and what advantages does it have as opposed to a Constituent Grammar?

Constraint Grammar (CG), as introduced by Fred Karlsson in 1992-1995, is a methodologically reductionist paradigm, in which linguistic information arises through a context-governed mapping and disambiguation process at different annotational levels. Traditionally, Constraint Grammars are lexicon- and morphology-based, and express syntactic surface structure by means of word-based function and dependency tags.
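The reductionist idea can be made concrete with a toy Python sketch: every word starts with a set of candidate readings, and contextual rules discard readings without ever removing the last one. The single rule, the simplified readings for o canto and all names below are invented for illustration; they are not taken from PALAVRAS or from any actual CG rule file.

```python
from typing import List, Set, Tuple

Cohort = Tuple[str, Set[str]]  # (word form, set of candidate readings)


def remove_reading(cohorts: List[Cohort], target: str, left_context: str) -> List[Cohort]:
    """Remove `target` from a word if its left neighbour is unambiguously `left_context`.

    The last remaining reading of a word is never removed.
    """
    result = []
    for i, (word, readings) in enumerate(cohorts):
        left_ok = i > 0 and cohorts[i - 1][1] == {left_context}
        if left_ok and target in readings and len(readings) > 1:
            readings = readings - {target}
        result.append((word, readings))
    return result


# Simplified, illustrative readings: "canto" as noun or finite verb.
sentence = [("o", {"art"}), ("canto", {"n", "v-fin"})]
print(remove_reading(sentence, target="v-fin", left_context="art"))
```

Because a rule may only narrow down what the lexicon offered, every word keeps at least one reading however unusual the input is.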

It is never easy to express a just and simple opinion concerning the advantages of one grammatical paradigm over another. Thus, the answer given here was worded, somewhat provocatively, by one of the Floresta-initiators (Eckhard Bick), and is not shared by the other (Diana Santos):

In the CG camp itself it is generally held that the advantages of CG as a method of automatic analysis are that it is robust and fast, and always gives at least one analysis, however complex (or erroneous) the input. Published performance figures for a number of languages are much better than for probabilistic systems. As a descriptive paradigm, CG has the additional advantage of "elegant underspecification", i.e. of allowing the expression of ambiguity in a concise, word-based form. Thus it allows, without further special adaptations, flat coordination, ambiguous postnominal attachment and the coupling of verb chain elements across intervening clause-level constituents.

By comparison, Constituent Grammars have the descriptive capacity, but also the methodological need, to make a choice within the above-mentioned ambiguity classes, to recognise discontinuous constituents as one, etc. Thus, what may appear to be a pedagogical or explanatory descriptive advantage may become a methodological disadvantage. Also, the methodological parent of Constituent Grammars, Phrase Structure Grammar in the Chomskyan tradition, is not in itself a robust tool for automatic analysis: there is a good chance of getting no complete analysis at all for unorthodox, complex or wrong sentences.

For the Floresta project, it was decided to make the best of both worlds and use an existing Constraint Grammar parser (PALAVRAS) for the analysis per se, then add constituent structure by means of an added transformation grammar, which would then enjoy the luxury of using not lexical words, but ready-made categories (PoS, subject, object etc.) as terminals, thus exploiting CG robustness to the highest level possible while maintaining the "graphical" expressivity of Constituent Grammar, as desired by e.g. the teacher target group at VISL.

I don't have internet at home. Can I use the system off-line?

While the Floresta corpus can be downloaded and distributed for off-line work, neither the search interface at cgi.portugues.mct.pt nor the graphical tree manipulator and teaching tools at visl.sdu.dk are available - as yet - for off-line users. However, given sufficient user interest, this could well be a future option.

Where can I get a tag list and definitions for all grammatical categories used in the corpus?

At http://visl.sdu.dk/visl/pt/symbolset-floresta.html .