Notational and terminological guidelines


Project:  Floresta Sintá(c)tica
Last update: 14 February 2001

Eckhard Bick
Susana Afonso
Ana Raquel Marchi

The notational principles described below are adopted by the VISL project and consequently by Floresta Sintá(c)tica, since it makes use of the VISL tools, for the following reasons:

        * robust disambiguation using Constraint Grammar (CG);
        * easy notational filtering: differences of grammatical terminology and tradition can, and to a large extent should, be handled by filtering the annotation rather than by creating parallel annotations, whether by building different parsers or by annotating manually twice. Drawing on all its disambiguated annotational levels (morphology, syntax, valency, secondary subcategories), a good (and preferably corrected) CG analysis should carry enough information to support a large variety of output or search conventions, though not all of them.

So far, PALAVRAS-based filtering experiments have been conducted for the following projects and annotation conventions:

            - VISL teaching annotation (form-function-trees, predicators, disjunct constituents);
            - the NILC tag set (certain complex tags, experimental incorporation of valency in word classes);
            - the Tycho Brahe project (historical written Portuguese);
            - the CORDIAL-SIN project (dialectal transcribed speech).

The following table sums up the topics. The links lead to further information on each of the topics:
Related information
  • Tagging
        Modular tags *
        a) function tags: @F>> attachment not to the nearest verb (the default case) *
        b) form tags: Subclass of adverbs *: ADV <kc>, conjunctional relative adverb KC <adv>, conjunctional adverb of prepositional type
  • Levels of constituency
        a) Function vs. form (FUNCTION:form) *
              a.1.) form: Maintenance of word class (os pobres F:adj); phrase type marked at the non-terminal node.
              a.2.) Function: Functions agree with the phrase type of the group (if F:np, then the functions of the dependents relate to np, for instance N<, >N).
              a.3.) Underspecification: Use of ?:form for function underspecification; FUNCTION:? for form underspecification *
        b) Non-terminal nodes *: No zero / empty constituents; no one-member nodes.
  • Representation
        a) vertical trees *: Indentation marks tree depth: x equal signs mean x levels below the top node (EXCEPTION: the first line below the top node is not indented); one node / word per line.
        b) ambiguity / alternative analyses *
              b.1.) morphological ambiguity: Use of slash (/).
              b.2.) syntactic ambiguity: One-node alternatives: use of slash (/). Local structural ambiguity: [+/- n], -n being the number of equal signs up / left the tree and +n the number of equal signs down / left the tree, placed after the function (before the colon and form). Global structural ambiguity: several trees An, depending on the number of possible analyses (n being the number of trees).
  • Base form / lema *
        Lemas represent the word as it is expressed on the surface, even though an alternative lema may apply (ex: 'oiro' / 'ouro').
        Where the lematisation cannot be disambiguated from the linguistic context, both alternative lemas are given, using the slash notation (/) (ex: 'ir' / 'ser').
        a) nouns: Lema: gender of the word, singular
        b) proper nouns: Lema: gender and number of the word
        c) pronouns: Lema: nominative case
        d) specifiers (variable forms): Lema: masculine singular
        e) adverbs
              e.1.) full form (-mente): Lema: -mente form
              e.2.) coordinated adverbs (clara e eficazmente): Lema: -mente form for both coordinated adverbs
  • Definitions *
        a) group: Complex form: more than one non-verbal constituent; internal structure: head / modifiers.
        b) clause: Complex form: more than one constituent, one of them is verbal (fcl finite clause / acl averbal clause / icl infinite
        c) word: Simple form, no constituents
              c.1.) polylexicals or multi-word expressions: Simple form with complex internal structure, that is, the complex structure forms one unit, notated with equal signs in CG format (São=Paulo) and with an underscore in tree format (São_Paulo).
  • Word classes
        a) Proper nouns *: Personal, institutional, topological names, titles,
        b) Personal pronouns *: 4 cases for personal pronouns established: NOM,
        c) Prepositions and articles: contractions *: In the VISL system, contractions are unfolded: no appears divided into its parts, em + o.
        d) Verb *: Verb + clitic (ex: Julguei-te) are separated, with the hyphen attached to the verb. Verb (future tense or conditional) + clitic (ex: Julgar-te-ia) are separated, the verb appearing in its full form, that is, inflected for tense and person (julgaria-).
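
The vertical tree convention summarised in the table (equal signs marking depth, one node or word per line, FUNCTION:form labels) can be read mechanically. The following is a minimal sketch under those assumptions only; the function name `parse_vertical_tree` and the sample labels are illustrative and are not taken from the VISL tools.

```python
def parse_vertical_tree(lines):
    """Return (depth, function, form) triples for a vertical tree.

    Depth is the number of leading equal signs. A line without any
    equal sign is the first line below the top node (depth 0).
    """
    nodes = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        label = text.lstrip("=")
        depth = len(text) - len(label)
        # FUNCTION:form, split at the first colon
        function, _, form = label.partition(":")
        nodes.append((depth, function, form))
    return nodes


# Hypothetical sample tree, invented for illustration.
sample = [
    "STA:fcl",
    "=SUBJ:np",
    "==>N:art",
    "==H:n",
    "=P:v-fin",
]

for depth, function, form in parse_vertical_tree(sample):
    print(depth, function, form)
```

Underspecified labels such as ?:form or FUNCTION:? pass through unchanged, since the question mark is simply kept as the function or the form.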
            modular tags                     composite tag (here: "mixing" base form and word class)
            é "ser" V PR 3S IND
            dorme "dormir" V PR 3S IND       dorme V-PR3S
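
Notational filtering from modular tags to a composite tag can be sketched as below. The rule used here (drop the base form and the mood, then join word class, tense and person-number) is an assumption inferred from the single dorme example; it is not a documented filter, and the function names are hypothetical.

```python
def composite_tag(modular_tags):
    """Fuse modular verb tags into one composite tag.

    Assumed rule: WORDCLASS-TENSE+PERSON, e.g. V PR 3S IND -> V-PR3S.
    Any tag after the third (here the mood, IND) is dropped.
    """
    word_class, tense, person_number = modular_tags[:3]
    return f"{word_class}-{tense}{person_number}"


def filter_reading(reading):
    """Turn 'word "base form" TAG TAG ...' into 'word COMPOSITE'."""
    parts = reading.split()
    word = parts[0]
    # The base form is the quoted item; it is left out of the output.
    tags = [p for p in parts[1:] if not p.startswith('"')]
    return f"{word} {composite_tag(tags)}"


print(filter_reading('dorme "dormir" V PR 3S IND'))
```

Because the modular analysis keeps every level separate, a different target convention only needs a different `composite_tag` function, not a different parser.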

            In the Portuguese VISL system, personal pronouns exhibit four cases: nominative (NOM), accusative (ACC), dative (DAT) and prepositional (PIV). The syntactic function a personal pronoun can hold is thus embedded in its morphological information (subject: NOM; direct object: ACC; indirect object: DAT; prepositional object: PIV).
            The personal pronouns are distributed across the cases as follows:
            NOM: eu, tu, ele/ela, nós, vós, eles/elas
            ACC: me, te, o/a, se, nos, vos, os/as
            DAT: +(me, te, nos, vos)
            PIV: mim, ti
                 migo, tigo, sigo

The reflexive pronoun "se" is marked as ACC in relation to case.
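
The link between case and syntactic function can be written down as a small lookup. This is a sketch only: the assignment of the pronoun rows to NOM, ACC, DAT and PIV (in that order, with migo, tigo, sigo counted as PIV) is an assumption based on the order in which the cases are introduced, and all names are hypothetical.

```python
# Case -> syntactic function, as stated in the text.
CASE_FUNCTION = {
    "NOM": "subject",
    "ACC": "direct object",
    "DAT": "indirect object",
    "PIV": "prepositional object",
}

# Pronoun forms per case. The row-to-case assignment is an assumption.
PRONOUNS_BY_CASE = {
    "NOM": ["eu", "tu", "ele", "ela", "nós", "vós", "eles", "elas"],
    "ACC": ["me", "te", "o", "a", "se", "nos", "vos", "os", "as"],
    "DAT": ["me", "te", "nos", "vos"],
    "PIV": ["mim", "ti", "migo", "tigo", "sigo"],
}


def possible_cases(form):
    """Return every case in which the given pronoun form is listed."""
    return [case for case, forms in PRONOUNS_BY_CASE.items() if form in forms]


def possible_functions(form):
    """Return the syntactic functions implied by the possible cases."""
    return [CASE_FUNCTION[case] for case in possible_cases(form)]


print(possible_cases("me"))       # listed under both ACC and DAT
print(possible_functions("se"))   # se is ACC only
```

Forms listed under more than one case, such as me, are exactly the ones that need disambiguation from context.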

              Every word and token has a base form. In CG format, base forms appear in square brackets: [lema]; in tree format, they appear in single quotation marks: 'lema'. Three aspects have to be taken into consideration regarding lemas: language variety, ambiguity and word class. For the invariable word classes (conjunctions, invariable indefinite pronouns, adverbs (except in coordination), interjections and prepositions), the lema has the same form as the word itself.

            (1) Language variety:

                    Lemas may have alternative forms where European Portuguese (PE) and Brazilian Portuguese (PB) differ. This occurs with the sequences ct (sintá(c)tica; dire(c)ção) and pt (óptimo; ado(p)ção), and with lexeme variation such as estresse (PB) and stress (PE).
There is also free variation within both the PB and PE varieties, as in ou / oi (touro / toiro; louro / loiro).

                    The lema corresponds to the surface form of the word and does not express the variety, that is, it does not present both forms. If the word occurring in a sentence is direcção, the lema is direcção.

            (2) Form ambiguity:

                    Some forms, especially verbal forms, are ambiguous as to their lema; certain forms of the verbs ser and ir are an example. Normally the linguistic context allows disambiguation. When it does not, both possible lemas are given, following the general formalism for ambiguous forms, which is the slash. For instance, the linguistic context in Foi em 1923 is not enough to determine whether the lema of the form Foi is ser or ir, so the lema reflects both possibilities: 'ir'/'ser'.
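
The two lema notations and the slash for ambiguous lemas can be illustrated with two small helpers. This is a sketch; the function names are hypothetical, and how an ambiguous lema is written in CG format is not stated here, so only the tree format accepts alternatives.

```python
def cg_lema(lema):
    """CG format: base form in square brackets."""
    return f"[{lema}]"


def tree_lema(*lemas):
    """Tree format: base form in single quotation marks.

    Several lemas are joined with a slash, the general notation
    for alternatives that the context cannot disambiguate.
    """
    return "/".join(f"'{lema}'" for lema in lemas)


print(cg_lema("marcar"))
print(tree_lema("cantar"))
print(tree_lema("ir", "ser"))   # Foi em 1923
```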

            (3) Word class:

                Lemas are represented differently depending on the word class and its definition. Case by case:

                        - nouns (N): nouns have invariant gender as lexeme category and number as variant word category. The lema will have the gender of the word and the default for number is singular. For instance: the lema for gatas is gata.

                        - proper nouns (PROP): proper nouns have invariant gender and number as lexeme category. The lema will keep the information of gender and number that is encoded in the word. For instance: Todas as Petrogais deveriam ser desmanteladas. The lema is Petrogais.

                        - pronouns (PERS): the lema of a pronoun is its nominative form, whichever case the pronoun holds. For instance, in Fomo-nos embora the lema of the clitic nos is its nominative form, nós.

                        - specifiers (SPEC):

                                        * variable forms of indefinite pronouns: the lema takes the default gender and number, masculine singular. For instance, algumas has the lema algum.

                        - adverbs (ADV) in coordination: adverbs are not considered to be lexically derived from adjectives, which means that the lema of an adverb ending in -mente is the adverb itself (claramente has claramente as its lema). In Portuguese, though, when two adverbs are coordinated, the suffix -mente is omitted from the first adverb (what we have called morphological ellipsis), as in clara e inteligentemente. Since the reason for the ellipsis is purely syntactic, the lema takes the full form of the adverb, as if it were not coordinated with another adverb. In short, the lema for clara is claramente.

                        - adjectives (ADJ): adjectives have both gender and number as variant word categories. Therefore, the lema of any adjective is the masculine singular default. For instance, the lema of trabalhosas is trabalhoso. This also holds for ordinal numerals (primeiras, sétimo).

                        - verbs (V): the lema of the verb forms is the infinitive form of the verb in question.  Cantámos, for example, will have as lema cantar.

                        - numerals (NUM): numerals follow the adjective lema rules (as in Ela tem duas maçãs) or the noun (N) lema rules (as in O 13 é aziago!)

                        - determiners (DET):

                                            * articles: the lema preserves the gender of the article; the number is the default, singular. For instance, in As casas estão pintadas the lema of As is a.

                                            * determiners: the lema takes the default gender and number (masculine singular). For instance, the lema of esta(s) is este.

                                            * attributive quantifiers: the lema takes the default gender and number (masculine singular). For instance, the lema of poucas is pouco.
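
The gender and number rules listed above can be summarised in one table. The sketch below only computes which gender and number the lema carries; producing the lema string itself would need a lexicon, which is outside its scope. The function name is hypothetical, and the label ART is an invented shorthand for the articles that the list places under DET.

```python
# (gender rule, number rule) per word class; "keep" preserves the
# value of the word form, anything else is a fixed default.
LEMA_RULES = {
    "N":    ("keep", "S"),     # gatas -> gata
    "PROP": ("keep", "keep"),  # Petrogais -> Petrogais
    "ADJ":  ("M", "S"),        # trabalhosas -> trabalhoso
    "SPEC": ("M", "S"),        # algumas -> algum
    "ART":  ("keep", "S"),     # As -> a   (ART is an illustrative label)
    "DET":  ("M", "S"),        # estas -> este, poucas -> pouco
}


def lema_features(word_class, gender, number):
    """Return the (gender, number) that the lema of a word form carries."""
    gender_rule, number_rule = LEMA_RULES[word_class]
    lema_gender = gender if gender_rule == "keep" else gender_rule
    lema_number = number if number_rule == "keep" else number_rule
    return lema_gender, lema_number


print(lema_features("N", "F", "P"))     # gatas
print(lema_features("PROP", "F", "P"))  # Petrogais
print(lema_features("ADJ", "F", "P"))   # trabalhosas
```

Numerals are left out of the table because they follow either the adjective rule or the noun rule depending on their use.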

             * Non-splitting cases:

                          Hyphenated compounds are not separated. Therefore, guarda-chuva, anti-motim and Expo-98 each count as one unit.

             * Splitting cases:

                        The hyphen can be a separating feature. Verbs occurring with clitics in Portuguese are hyphenated, for instance marcou-me. The separation is carried out as follows:

                marcou-    [marcar] <hyfen> <fmc> V PS 3S IND VFIN @FMV
                me                [eu] PERS M/F 1S ACC @<ACC
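
The distinction between non-splitting compounds and verb + clitic sequences can be sketched as a tokenisation step. This is an illustration under stated assumptions, not the VISL tokeniser: the clitic list is a small hand-made set, the function name is hypothetical, and forms whose verb stem changes before the clitic are not handled.

```python
# Illustrative clitic list (an assumption, not an exhaustive inventory).
CLITICS = {"me", "te", "se", "o", "a", "os", "as", "lhe", "lhes", "nos", "vos"}


def split_enclitic(token):
    """Split verb-clitic tokens; leave other hyphenated tokens intact.

    The hyphen stays attached to the verb, as in the analysis of
    marcou-me (marcou- + me).
    """
    head, sep, tail = token.rpartition("-")
    if sep and head and tail.lower() in CLITICS:
        return [head + "-", tail]
    return [token]


for token in ["marcou-me", "guarda-chuva", "anti-motim", "Expo-98"]:
    print(token, split_enclitic(token))
```

A token such as Esperar-te-ia is left whole by this function, because its last part is a verb ending rather than a clitic.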

                 In more complex verb forms co-occurring with clitics, especially in European Portuguese, such as the future and the conditional (Esperar-te-ei; Esperar-te-ia), the clitic appears hyphenated between the verb theme and the tense and person inflexion, and the separation differs from the case above. Here the separation only occurs between the verb and the clitic, and the verb therefore takes the tense and person inflexion:

                Esperaria- [esperar] <hyfen>  <fmc> V COND 1/3S VFIN @FMV
                te- [tu] <hyfen> PERS M/F 2S ACC @<ACC
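
The rejoining of verb theme and inflexion shown above can be sketched as follows. This is a minimal illustration, assuming that simple concatenation of theme and ending yields the full verb form, which holds for the Esperar-te-ia example but not for forms whose theme changes before the clitic. The clitic and ending lists and the function name are assumptions made for the example.

```python
# Illustrative lists (assumptions, not taken from the VISL tools).
CLITICS = {"me", "te", "se", "o", "a", "os", "as", "lhe", "lhes", "nos", "vos"}
# Future and conditional person endings.
ENDINGS = {"ei", "ás", "á", "emos", "eis", "ão",
           "ia", "ias", "íamos", "íeis", "iam"}


def split_mesoclitic(token):
    """Split theme-clitic-ending tokens into full verb form + clitic.

    Both output tokens keep a trailing hyphen, as in
    Esperaria- / te-.
    """
    parts = token.split("-")
    if len(parts) == 3:
        theme, clitic, ending = parts
        if clitic.lower() in CLITICS and ending.lower() in ENDINGS:
            return [theme + ending + "-", clitic + "-"]
    return [token]


print(split_mesoclitic("Esperar-te-ia"))
print(split_mesoclitic("Esperar-te-ei"))
print(split_mesoclitic("marcou-me"))   # not a mesoclitic form, left whole
```

Checking the last part against the list of endings keeps three-part hyphenated words that merely contain a clitic-like middle part from being split.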