Notational and terminological guidelines
Project: Floresta Sintá(c)tica
Last update: 14 February 2001
Ana Raquel Marchi
The notational principles described below are adopted by the VISL project and consequently by Floresta Sintá(c)tica, since it makes use of the VISL tools, for the following reasons:
* robust disambiguation using CG;
* easy notational filtering: differences of grammatical terminology and tradition can and should, to a large extent, be handled by annotational filtering rather than by creating parallel annotations with different parsers or by double manual annotation. Drawing on all its disambiguated annotational levels (morphology, syntax, valency, secondary subcategories), a good (and ideally manually corrected) CG analysis should be able to support a large variety of output or search conventions, though not all of them.
So far PALAVRAS-based filtering experiments have been conducted with regard to the following projects/annotation conventions:
- VISL teaching annotation (form-function-trees, predicators, disjunct constituents);
- the NILC tag set (certain complex tags, experimental incorporation of valency in word classes);
- the Tycho Brahe project (historical written Portuguese);
- the CORDIAL-SIN project (dialectal transcribed speech).
The following table sums up the topics. The links lead to further information on each of the topics:
Topics Related information
Modular tags *
a) function tags @F>> attachment not to the nearest verb (the default)
b) form tags Subclass of adverbs * : ADV <kc>, conjunctional adverb of the coordinating type; ADV <ks>, conjunctional relative adverb; ADV <prp>, conjunctional adverb of prepositional type
- Levels of constituency
a) Function vs. form (FUNCTION:form) * a.1.) form Maintenance of word class (os pobres F:adj);
Phrase type marked at the non-terminal node.
a.2.) Function Functions in agreement with group phrase (if F:np, then
functions of the dependents relate to np, for instance,
N< , >N)
a.3.) Underspecification Use of ?:form for function underspecification;
FUNCTION:? for form underspecification *
b) Non-terminal nodes * No zero / empty constituents;
No one-member nodes.
a) vertical trees * Indentation marking tree depth: x equal signs mean
x levels below the top node (EXCEPTION: first line below
the top node is not indented);
One node / word per line.
b) ambiguity / alternative analyses *
b.1.) morphological ambiguity Use of slash (/)
b.2.) syntactic ambiguity One node alternatives: use of slash (/);
Local structural ambiguity: [+/- n], [-n] being the number of
equal signs up / left in the tree and [+n] the number of equal signs
down / right in the tree, placed after the function (before the colon);
Global structural ambiguity: several trees A1 ... An, depending on
the number of possible analyses (n being the number of trees).
- Base form / lema *
Lemas represent the surface form of the word, even though
an alternative lema may apply (ex: 'oiro' / 'ouro').
When lematisation cannot be disambiguated from the linguistic context,
both alternative lemas are expressed, using the slash notation
(/) (ex: 'ir' / 'ser').
a) nouns Lema: gender of the word, singular
b) proper nouns Lema: gender and number of the word
c) pronouns Lema: nominative case
d) specifiers (variable forms) Lema: masculine singular
e) adverbs
e.1.) full form (-mente) Lema: -mente form
e.2.) coordinated adverbs (clara e eficazmente) Lema: -mente form for both coordinated adverbs
- Definitions *
a) group Complex form: more than one non-verbal constituent;
Internal structure: head / modifiers.
b) clause Complex form: more than one constituent, normally one of them
verbal (fcl finite clause / acl averbal clause / icl infinite clause).
c) word Simple form, no constituents
c.1.) polylexicals or multi-word expressions Simple form with complex internal structure, that is, the
complex structure integrates one unit, notated with equal
signs in CG format (São=Paulo) and by underscore in tree format.
- Word classes
a) Proper nouns * Personal, institutional, topological names, titles,
b) Personal pronouns * 4 cases for personal pronouns established: NOM,
ACC, DAT, PIV
c) Prepositions and articles: contractions * In the VISL system, contractions are unfolded: 'no'
appears divided into its parts: em + o
d) Verb * Verb + clitic (ex: Julguei-te) are separated, with the hyphen
attached to the verb;
Verb (future tense or conditional) + clitic (ex: Julgar-te-ia)
are separated, the verb appearing in its full form, that is,
inflected for tense and person (julgaria-)
|modular tags||composite tag (here: "mixing" base form and word class)|
|é "ser" V PR 3S IND||é SER-PR3S|
|dorme "dormir" V PR 3S IND||dorme V-PR3S|
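The contrast between modular and composite tags can be illustrated with a minimal sketch of a notational filter. It only reproduces the two example rows above. The function name `composite_tag` and the set `LEMMA_AS_CLASS` are hypothetical and are not part of PALAVRAS or any VISL tool.

```python
# Hypothetical set of lemmas that the composite scheme spells out
# in place of the word class (assumption based on the "ser" example).
LEMMA_AS_CLASS = {"ser"}


def composite_tag(lemma: str, word_class: str, tense: str, person_number: str) -> str:
    """Collapse a modular reading into one composite tag.

    The mood tag (e.g. IND) is dropped, as in the example rows.
    """
    prefix = lemma.upper() if lemma in LEMMA_AS_CLASS else word_class
    return f"{prefix}-{tense}{person_number}"


print(composite_tag("ser", "V", "PR", "3S"))     # SER-PR3S
print(composite_tag("dormir", "V", "PR", "3S"))  # V-PR3S
```

Because the modular reading keeps lemma, word class and inflexion apart, a filter of this kind can derive the composite tag, while the reverse direction would need a lookup.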
|"noun-hood" marked at np-head function level, retaining (morphological) word class||"noun-hood" marked by changing word class|
|morphological level: case; syntactic level: @P< (argument of preposition) or e.g. @SUBJ> (subject)|
|mim = "eu" PERS P(preposit)IV(e) @P<; eu = "eu" PERS NOM(inative) @SUBJ>||mim = PERS OBL(íquo); eu = PERS RE(c)TO|
The distinction of annotation levels allows linguistic progression in steps and clear definitions, and makes it easier to "reconcile" certain conflicting truths or views by expressing them all, but on different levels (for example, adjectives that turn into nouns by being np-heads, verbs used as subjects, as in o "estar", or nouns used adverbially, as in "vem domingo").
|without zero constituent (surface syntax annotation)||with zero constituent ("deep" syntax annotation)|
|no one-member-nodes||one member "complex" clause constituents|
Contractions in Portuguese appear in their uncontracted form in the VISL system. Once a contraction is separated into its units, it is possible, according to the VISL tree formalism, to form a group whose head is the preposition. In the case of contractions of personal pronouns, the need for unfolding is even more pressing, because the elements of the contraction have two distinct syntactic functions (in lho, lhe has dative function while o has accusative function). Therefore, contractions are decomposed into their elements:
nesse- em esse/isso/este
disso- de isso/isto/esse/este
do- de o/a
ao- a o ; à- a a
comigo- com mim
lho- lhe o/a
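The unfolding listed above can be sketched as a plain lookup. This is a minimal illustration and not the VISL implementation: the table `CONTRACTIONS` and the function `unfold` are hypothetical names, and the table holds only the forms listed above, each with its first listed reading.

```python
# Hypothetical lookup table built only from the contractions listed above.
CONTRACTIONS = {
    "no": ["em", "o"],
    "nesse": ["em", "esse"],
    "disso": ["de", "isso"],
    "do": ["de", "o"],
    "ao": ["a", "o"],
    "à": ["a", "a"],
    "comigo": ["com", "mim"],
    "lho": ["lhe", "o"],
}


def unfold(tokens):
    """Replace each known contraction by its parts; keep other tokens as they are."""
    out = []
    for tok in tokens:
        parts = CONTRACTIONS.get(tok.lower())
        out.extend(parts if parts else [tok])
    return out


print(unfold("Maria mora no Porto".split()))
# ['Maria', 'mora', 'em', 'o', 'Porto']
```

A blind lookup like this cannot choose between alternative readings of a form (for example o/a in lho); that choice depends on disambiguation in context.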
The following examples illustrate the above:
i. prepositional group (H:prp)
Maria mora no Porto
SUBJ:prop('Maria' F S) Maria
P:v-fin('morar' PR IND 3S) mora
=H:prp ('em') em
==>N:art ('o' M S) o
==H:prop('Porto' M S) Porto
ii. two distinct syntactic functions
Se se tivessem esquecido, o mundo lho teria lembrado.
=ACC:pron-pers('se' M/F 3P ACC) se
==AUX:v-fin('ter' <ink> IMPF 3P SUBJ) tivessem
=>N:art('o' <artd> M S) o
=H:n('mundo' M S) mundo
DAT:pron-pers('ele'/'eles' <sam-> M/F 3S/P DAT) lhe/lhes
ACC:pron-pers('ele' <-sam> M 3S ACC) o
=AUX:v-fin('ter' COND 3S) teria
FUNCTION1 / FUNCTION2 / FUNCTION x : form1 /form2/form x
e.g. In the sentence "Encontrou-se um gato", the personal pronoun "se" can be read either as subject (Someone found a cat) or as direct object (A cat was found), and the analysis of the np um gato changes accordingly:
P:v-fin('encontrar' PS 3S IND) Encontrou-
SUBJ/ACC:pron-pers('se' M/F 3S/P ACC) se
=>N:art('um' <arti> M S) um
=H:n('gato' M S) gato
ii. local structural ambiguity/alternative analysis
The attachment of the alternative node is signaled by [ +/- n] where [+n] represents 'n' movements down (or right in indentation) and [-n] represents 'n' movements up towards the top node (or left in indentation).
FUNCTION1 / FUNCTION 2 [+/- n] : form
e.g. In the sentence 'Vi o gato da rua', two analyses are possible for the prepositional group da rua: either a post-nominal reading (an alley cat) or an ADVL reading (From where did you see the cat?), the latter implying a change in indentation of one level up, that is, to the left.
'Vi o gato da rua'
P:v-fin('ver' PS 1S IND) Vi
=>N:art ('o' <artd> DET M S) o
=H:n('gato' M S) gato
=N< / ADVL [-1]:pp
===>N:art('a' <artd> DET F S) a
===H:n('rua' F S) rua
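The slash and [+/- n] conventions shown in the two examples above can be read mechanically. The following is a minimal sketch under stated assumptions: `parse_alternatives` is a hypothetical helper, and it assumes that function labels contain neither a slash nor a colon.

```python
import re

# One alternative: a function label, optionally followed by [+n] or [-n].
ALT = re.compile(r"^\s*(\S+?)\s*(?:\[\s*([+-]\s*\d+)\s*\])?\s*$")


def parse_alternatives(line):
    """Return (function, equal_sign_count) for each alternative on a tree line.

    [-n] moves the alternative n equal signs up/left, [+n] n down/right.
    """
    body = line.strip().lstrip("=")
    equals = len(line.strip()) - len(body)
    head = body.split(":", 1)[0]  # everything before the colon
    result = []
    for piece in head.split("/"):
        m = ALT.match(piece)
        if m is None:
            raise ValueError(f"cannot read alternative: {piece!r}")
        offset = int(m.group(2).replace(" ", "")) if m.group(2) else 0
        result.append((m.group(1), equals + offset))
    return result


print(parse_alternatives("=N< / ADVL [-1]:pp"))
print(parse_alternatives("SUBJ/ACC:pron-pers('se' M/F 3S/P ACC) se"))
```

For the pp line, the N< reading stays at one equal sign while the ADVL reading moves to zero equal signs, that is, directly under the top node.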
iii. global structural ambiguity / alternative analysis
When the ambiguity or alternative analysis involves larger chunks of the sentence, especially interdependent changes in several places, or discontinuous constituents, the A1, A2, ... convention is used as a default, with several complete analyses for the sentence in question.
e.g. The sentence "Estavam repletas de lixo, copos de plástico sujos de café" has two readings: either the noun phrase copos de plástico predicates the noun lixo, or lixo and copos de plástico are coordinated elements of a compound unit. Because the alternative analysis involves changes that cannot be represented by the slash notation, a second full sentence analysis (A2) is necessary:
P:v-fin('estar' IMPF 3P IND) Estavam
=H:adj('repleto' F P) repletas
===CJT:n('lixo' M S) lixo
====H:n('copo' M P) copos
======H:n('plástico' M S) plástico
=======H:adj('sujo' M P) sujos
========P<:n('café' M S) café
P:v-fin('estar' IMPF 3P IND) Estavam
=H:adj('repleto' F P) repletas
===H:n('lixo' M S) lixo
====H:n('copo' M P) copos
=====P<:n('plástico' M S) plástico
=====H:adj('sujo' M P) sujos
======P<:n('café' M S) café
A [a] <artd> DET F S @>N
deputada [deputada] N F S @SUBJ>
vai [ir] V PR 3S IND VFIN @FAUX
abster- [abster] <hyphen> V INF @IMV @#ICL-AUX<
se [se] <refl> PERS F 3S ACC @<ACC
|A [a] <artd> DET F S @>N
deputada [deputada] N F S @SUBJ>
não [não] ADV @ADVL>
se [se] PERS F 3S ACC @ACC>>
fez [fazer] <fmc> V PS 3S IND VFIN @FMV
ouvir [ouvir] V INF @IMV @#ICL-<ACC
@FUNCTION>> is the notation to indicate that the attachment in this case is not to the nearest main verb (as in the default case). In particular, the personal pronoun 'se' attaches to 'ouvir' and not to the nearest verb.
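The CG lines above follow a regular layout: word form, base form in square brackets, secondary tags in angle brackets, morphological tags, and function tags starting with @. The sketch below reads that layout. The names `parse_cg_line` and `attaches_beyond_nearest_verb` are hypothetical, and the parser is only an illustration of the listings in this document, not of the full CG output format.

```python
import re

# word form, then [base form], then the remaining tags
CG_LINE = re.compile(r"^(\S+?)\s*\[([^\]]*)\]\s*(.*)$")


def parse_cg_line(line):
    """Split one CG line into word, lemma, secondary, morphological and function tags."""
    m = CG_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a CG line: {line!r}")
    word, lemma, rest = m.groups()
    reading = {"word": word, "lemma": lemma,
               "secondary": [], "morph": [], "functions": []}
    for tok in rest.split():
        if tok.startswith("@"):
            reading["functions"].append(tok)
        elif tok.startswith("<") and ">" in tok:
            tag, _, tail = tok[1:].partition(">")
            reading["secondary"].append(tag)
            if tail:  # e.g. "<hyphen>V" written without a space
                reading["morph"].append(tail)
        else:
            reading["morph"].append(tok)
    return reading


def attaches_beyond_nearest_verb(reading):
    """True if a function tag ends in '>>' (attachment not to the nearest main verb)."""
    return any(f.endswith(">>") for f in reading["functions"])


se = parse_cg_line("se [se] PERS F 3S ACC @ACC>>")
print(se["functions"], attaches_beyond_nearest_verb(se))
```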
Proper nouns refer in the Portuguese VISL system to:
- personal and institutional names (ex. António Guterres, Polícia
Judiciária, Banco de Portugal);
- titles (ex. O Sítio do Pica-pau amarelo);
- topological names (ex. São Paulo, Gare do Oriente);
- abbreviations (ex: ONU, NILC).
Proper nouns (PROP) are treated as a category separate from nouns in general, as they are defined differently from the category noun (N) in terms of gender and number. For nouns, gender is a lexeme category, which is invariable, and number is a word form category, which is variable; for proper nouns, both gender and number are lexeme categories (and invariant). In practical terms this means that a noun like enfermeira would be tagged as N F S, having the feminine form as base form, while a proper noun like Lisboas (let's think of an example like As Lisboas de Pessoa) would be tagged as PROP F P, with the base form Lisboas.
When proper nouns have a complex internal structure, they form complex tokens called polylexicals, that is, they make up one unit, which is notated with equal signs in CG format (O=Sítio=do=Pica-pau=amarelo) and with underscores in tree format (O_sítio_do_pica-pau_amarelo). This notation can also be found with common nouns, adjectives or adverbs, such as Expo 98 or máquina de escrever. These are considered single tokens because they are semantically motivated complex structures.
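The two notations for polylexicals differ only in the joining character, which a minimal sketch can show. The function names are hypothetical. The sketch does not change letter case, although the tree-format example above is written in lower case.

```python
def cg_to_tree_token(token: str) -> str:
    """Turn a CG-format polylexical (parts joined by '=') into tree format ('_')."""
    return token.replace("=", "_")


def polylexical_parts(token: str) -> list:
    """List the parts of a CG-format polylexical; hyphens stay inside their part."""
    return token.split("=")


print(cg_to_tree_token("São=Paulo"))
print(polylexical_parts("O=Sítio=do=Pica-pau=amarelo"))
```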
- complex form with more than one constituent, normally including a verbal constituent (predicator): finite clause (fcl) and infinite clause (icl);
- clauses with no verbal constituent have a clause header constituent (subordinator or relative): averbal clause (acl).
|finite clause:||Em Portugal, esta novela não foi ainda editada.|
|infinite clause:||Apesar das tentativas feitas por os agentes da Direcção de Investigação de Corrupção, o árbitro não fez qualquer revelação que pudesse incriminar outras pessoas.|
|averbal clause:||Segundo a «Newsweek», as duas salas dos mapas foram o palco principal das negociações.|
- complex form with more than one constituent;
- does not have a predicate or clause header constituent;
- two main categories: head (H) and dependents (>N ; N<; N<PRED, A<; >A)
- simple form, holding no constituents (n; adj; adv; v-fin, etc.)
Particularly when coordinating conjunctions are involved, there are cases where constituents seem to come together to form a syntactic unit that does not fall within our list of recognized form and/or function labels. The most common cases involve shared constituency, in other words, cases where, in a coordination, some constituents semantically belong to both parts of the compound unit, the conjoints (CJT). Since the VISL notation does not allow empty nodes, the place of such a constituent in the second conjoint, which would be empty because it is implied, cannot be represented. Therefore, in these cases, the question mark is used:
|? : form||FUNCTION : ?|
That is, the constituents SUBJ, P and ADVL, because they are immediately under the top node (STA:fcl), exhibit no indentation, despite the fact that they have a mother node (the top node) and are therefore one level below it. However, the dependents of SUBJ:np are indented.
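The indentation convention can be sketched as a small line reader for vertical trees. The names `TreeLine` and `parse_tree_line` are hypothetical. The sketch assumes that a line with x equal signs sits x + 1 levels below the top node, which follows from the unindented first level. The top node line itself also has no equal signs, so the caller has to treat the first line of a tree separately.

```python
from typing import NamedTuple


class TreeLine(NamedTuple):
    equals: int    # number of leading equal signs
    level: int     # assumed levels below the top node (equals + 1)
    function: str  # label before the colon, e.g. SUBJ or >N
    form: str      # label after the colon, e.g. np or art


def parse_tree_line(line: str) -> TreeLine:
    """Read one FUNCTION:form line of a vertical tree (not the top node line)."""
    stripped = line.strip()
    body = stripped.lstrip("=")
    equals = len(stripped) - len(body)
    function, sep, rest = body.partition(":")
    if not sep:
        raise ValueError(f"no FUNCTION:form pair in {line!r}")
    form_part = rest.split("(", 1)[0].split()
    form = form_part[0] if form_part else ""
    return TreeLine(equals, equals + 1, function.strip(), form)


for text in ["SUBJ:prop('Maria' F S) Maria",
             "=H:prp ('em') em",
             "==>N:art ('o' M S) o"]:
    print(parse_tree_line(text))
```

In the last line the first two equal signs count as indentation and the remaining >N is the function label, so a dependent marker is not mistaken for depth.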
The PALAVRAS parser has many subcategories for adverbs, which are filtered away in the tree annotation. However, the conjunctional subclasses are retained, since they are treated as different word classes in traditional Portuguese grammar:
ADV <kc> is a conjunctional adverb of the coordinating type: contudo, todavia, pois, entretanto, porém, no=entanto
The reason for using the class of "conjunctional adverb" (ADV <kc>) rather than "adverbial conjunction" (KC <adv>) is the wish to retain maximal compatibility with international traditional word classes: words like the above are treated as adverbs in both Romance and Germanic languages, for instance English, German, Spanish and French. Ordinary coordinating conjunctions (KC) are a small class in our system: e, ou, mas, nem.
ADV <ks> is a conjunctional relative adverb of the subordinating
type: segundo in segundo dizem os meus amigos, ...
ADV <prp> is a conjunctional adverb of the prepositional type: segundo in segundo o meu professor
The reason for using adverbial subclasses for words like segundo, rather than independent word classes, is the principle of distinguishing between (unchanging) morphological word class and (changing) syntactic function. What makes segundo a "preposition" or a "conjunction" here is function rather than form.
The reflexive pronoun "se" is marked as ACC in relation to case.
(1) Language variety:
Lemas can have alternative forms where European Portuguese (PE) and Brazilian Portuguese (PB) differ. The cases where this might occur are the sequences ct (sintá(c)tica), cç (dire(c)ção), pt (óptimo) and pç (ado(p)ção), and lexeme variation like estresse (PB) and stress (PE).
There is also free variation within both the PB and PE variants, like ou / oi (touro / toiro; loiro / louro).
The lema will correspond to the surface form of the word and will not express the variety (will not present both forms). This means that if the word occurring in a sentence is direcção, the lema will be direcção.
(2) Form ambiguity:
Some forms, especially verbal forms, can be ambiguous with respect to the lema, for instance forms shared by the verbs ser and ir. Normally the linguistic context allows disambiguation. When it does not, both possible lemas are represented, following the general formalism for ambiguous forms, which is the use of the slash. For instance, the linguistic context in Foi em 1923 is not enough to determine whether the lema of the form Foi is the verb ser or ir. So, the lema will reflect both possibilities: 'ir' / 'ser'.
(3) Word class:
Depending on the word class and its definition, lemas are represented differently. Going from case to case in particular:
- nouns (N): nouns have invariant gender as lexeme category and number as variant word category. The lema will have the gender of the word and the default for number is singular. For instance: the lema for gatas is gata.
- proper nouns (PROP): proper nouns have invariant gender and number as lexeme category. The lema will keep the information of gender and number that is encoded in the word. For instance: Todas as Petrogais deveriam ser desmanteladas. The lema is Petrogais.
- pronouns (PERS): the lema of a pronoun is its nominative form, whatever case the pronoun occurs in. For instance, in Fomo-nos embora the lema of the clitic nos is its nominative form, nós.
- specifiers (SPEC):
* variable forms of indefinite pronouns: the lema takes the default gender and number, masculine singular. For instance, algumas has the lema algum.
- adverbs (ADV) in coordination: adverbs are not considered to be lexically derived from adjectives, which means that the lema for an adverb ending in -mente is the adverb form itself (claramente has claramente as its lema). In Portuguese, though, if two adverbs are coordinated, the suffix -mente of the first adverb is elided (what we have called morphological ellipsis), as in clara e inteligentemente. The reason for the ellipsis is purely syntactic, so the lema takes the full form of the adverb, as if it were not coordinated with another adverb. In short, the lema for clara is claramente.
- adjectives (ADJ): adjectives have both gender and number as variant word category. Therefore, the lema for any adjective is the singular and masculine default. For instance, the lema of trabalhosas will be trabalhoso. This also holds for ordinal numerals (primeiras, sétimo)
- verbs (V): the lema of the verb forms is the infinitive form of the verb in question. Cantámos, for example, will have as lema cantar.
- numerals (NUM): numerals follow the adjective lema rules (as in Ela tem duas maçãs) or the noun (N) lema rules (as in O 13 é aziago!)
- determiners (DET):
* articles: the lema preserves the gender of the article. The number is the default: singular. For instance: As casas estão pintadas, the lema is a.
* determiners: the lema has the gender and number default of the word form (masculine and singular). For instance the lema of esta(s) is este .
* attributive quantifiers: gender and number default (masculine singular) in the lema of, for instance, poucas (lema: pouco).
Compound nouns, hyphenated, are not separated. Therefore, guarda-chuva, anti-motim, Expo-98 are one unit.
* Splitting cases:
The hyphen can be a separating feature. Verbs occurring with clitics in Portuguese are hyphenated. For instance, marcou-me. The separation is processed in the following way:
marcou- [marcar] <hyfen> <fmc> V PS 3S IND VFIN
me [eu] PERS M/F 1S ACC @<ACC
In more complex forms of verbs co-occurring with clitics, especially in European Portuguese, such as the future and conditional (Esperar-te-ei; Esperar-te-ia), where the clitic appears hyphenated between the verb theme and the tense and person inflexion, the separation differs from the above case. Here, the separation only occurs between the verb and the clitic, and the verb therefore takes the tense and person inflexion:
Esperaria- [esperar] <hyfen> <fmc> V COND 1/3S VFIN @FMV
te- [tu] <hyfen> PERS M/F 2S ACC @<ACC
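The two splitting cases, and the rule that hyphenated compounds stay whole, can be sketched as follows. This is a naive illustration rather than the PALAVRAS tokenizer: the clitic and ending lists are assumptions made for the example, the function name `split_clitic` is hypothetical, and forms in which the verb theme changes before the clitic are not handled.

```python
# Assumed, non-exhaustive lists for illustration only.
CLITICS = {"me", "te", "se", "nos", "vos", "lhe", "lhes",
           "o", "a", "os", "as", "lo", "la", "los", "las"}
# Future and conditional endings that can follow a clitic placed inside the verb.
ENDINGS = {"ei", "ás", "á", "emos", "eis", "ão",
           "ia", "ias", "íamos", "íeis", "iam"}


def split_clitic(form: str) -> list:
    """Split verb + clitic forms; leave other hyphenated words untouched."""
    parts = form.split("-")
    if len(parts) == 2 and parts[1].lower() in CLITICS:
        # marcou-me -> marcou- + me (the hyphen stays on the verb)
        return [parts[0] + "-", parts[1]]
    if (len(parts) == 3 and parts[1].lower() in CLITICS
            and parts[2].lower() in ENDINGS):
        # Esperar-te-ia -> Esperaria- + te- (the verb regains its inflexion)
        return [parts[0] + parts[2] + "-", parts[1] + "-"]
    return [form]


for word in ["marcou-me", "Esperar-te-ia", "guarda-chuva"]:
    print(word, split_clitic(word))
```

Checking the second part against a clitic list is what keeps compounds such as guarda-chuva or Expo-98 as one unit in this sketch.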