Documentation of the choices in the treebank project



Projecto Floresta sintá(c)tica
Last update:  24 November  2001

Eckard Bick
Susana Afonso
Ana Raquel Marchi
 


 

This document is the result of the discussions that took place while the trees were being revised. The topics listed below record the analyses that were adopted. It is important to document these choices so that users of the Floresta are aware of the solutions found for linguistic problems that are either not treated in traditional grammars or whose structure poses a problem for tree representation (levels of attachment).
The topics are divided into sections according to the linguistic area they concern (part of speech, syntax, etc., but also punctuation and how to deal with human mistakes in the original sentences of the corpus). They are illustrated with examples and, where possible, with links to further discussion and other examples related to the topic.
 
 



 

A. Introduction of Codes 


To make the corpus searchable, some codes were introduced. They are mentioned in the sections below, together with the phenomena they refer to. Formally, a code is either enclosed in < > and placed next to the constituent in the tree, or preceded by # and placed in the source line. For example, for três elementos essenciais, clara e inteligentemente valorizados por..., the code #E appears in the source line:

SOURCE: CETEMPúblico n=195 sec=clt sem=93a  #E

and <Em> next to the constituent:

 (...) 
 ,
N<:icl
 =>A:cu
 ==CJT:adv('claro' <Em>)   clara 
 ==CO:conj-c('e')  e
 ==CJT:adv('inteligentemente') inteligentemente
 =H:v-pcp('valorizar' M P)        valorizados
  (...)
 

Not every topic listed below has a code, since for some of them a tag or the syntactic notation is enough for searching (e.g. nominalised adjectives, tag <n>, or discontinuous constituents, marked by a hyphen on the form, F:f-, and/or on the function, -F:f).
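The two surface conventions described above (a #-code in the SOURCE line, an angle-bracket tag inside a tree node) can be picked out with simple pattern matching. The following is only a minimal sketch: the function names and regular expressions are illustrative assumptions, not part of the Floresta tools. Tags are searched only inside the parentheses of a node, because function labels such as N< or >N also contain angle brackets.

```python
import re

# Hypothetical helpers; they assume only the conventions described above.
SOURCE_CODE = re.compile(r"#([A-Z])\b")   # e.g. #E, #D, #L, #W
NODE_TAG = re.compile(r"<([^<>\s]+)>")    # e.g. <Em>, <artd>, <sam->


def source_codes(source_line: str) -> list:
    """Return the sentence-level codes found in a SOURCE line."""
    return ["#" + code for code in SOURCE_CODE.findall(source_line)]


def node_tags(tree_line: str) -> list:
    """Return the angle-bracket tags inside the parentheses of a tree node."""
    start = tree_line.find("(")
    end = tree_line.rfind(")")
    if start == -1 or end == -1:
        return []  # no morphological information on this line
    return NODE_TAG.findall(tree_line[start:end + 1])


codes = source_codes("SOURCE: CETEMPúblico n=195 sec=clt sem=93a  #E")
tags = node_tags(" ==CJT:adv('claro' <Em>)   clara ")
```

With the example from this section, `codes` holds the sentence-level code and `tags` the constituent-level tag, so a search for morphological ellipsis could first filter sentences by #E and then look for <Em> in the tree.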
 

i. Ellipsis: #E

       Sentences that exhibit elliptic elements are marked with the general code #E. In addition, depending on the type of ellipsis, tags in angle brackets are applied:
 
  • Morphologic ellipsis: <Em>  (check section B1. iii. for examples and details)
  • Syntactic ellipsis (description in section B2.i.):
                ¤ syntactic ellipsis:  <Es>
                ¤ group ellipsis: <Eg>
 

ii. Discourse structures (section B2.v.): #D

iii. Listing structures (section B2.vi.): #L

iv. Unconventional / ungrammatical constructions: #W

This code is used if the sentence is either ungrammatical or syntactically correct but odd in some way. The situations in which the revisers used #W have not been systematised yet.
 

B. Linguistic areas:

B1. Part of speech

This section discusses certain part-of-speech topics that were problematic for the syntactic analysis. In order to make the corpus more consistent, we either chose one option over others (e.g. how to analyse the first adverb in a coordination, whose form lacks the suffix -mente) or introduced new elements, for instance new tags.

i. Adjective, head of a noun phrase: introduction of the tag <n>

The introduction of the tag <n> follows the prototypical head system. In other words, if a noun phrase is at stake, it is expected that its head is a noun. However, nominalised adjectives are still tagged as adjectives in the system. For instance, both instances of the lexeme jovem in "O jovem tornou-se um herói" and "O jovem herói foi relembrado" are regarded as adjectives, even though jovem in the first sentence is nominalised (as the preceding article shows).

<n> is then applied if nominalised adjectives imply a noun-head (e.g. jovem, velho (...)); 

Adjectives whose nominal form denotes an abstract concept (and therefore does not imply a noun-head, as is the case of frio; see the discussion in C165-6) are lexicalised as nouns.
 

The tag <n> is added automatically to mark adjective heads of noun phrases in a semi-morphological way. Functionally, this information is equivalent to the combination '=H:adj' within an np. This also covers the special cases where two adjectives make up an np, one being head, the other modifier (e.g. O velho doente).

For further discussion, please refer to sentences C165-6 and C142-1.

ii. The case of dado: preposition / past participle

Initially, dado was tagged exclusively as a past participle by the tagger-parser. However, after a corpus search showed that it does not always inflect in gender and number, a second analysis was put forward, treating it as a preposition, like visto. Because the past participle reading already existed and is the preferred one, the prepositional reading was introduced only for cases with no inflection.
Accordingly, dado is analysed as:
  • Preposition, if its form is fixed (no inflection, e.g. ..., dado as circunstâncias);
  • Past participle, if it inflects in number and gender (e.g. ..., dadas as circunstâncias), including the otherwise underspecified masculine singular (M S) (e.g. dado o caso,...).
Check sentence C2-4 for discussion on the topic and examples.
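The decision rule above can be written out as a small function. This is a sketch under stated assumptions: the function name, its arguments and the returned labels ('prp' and 'v-pcp', borrowed from the form symbols used in the trees) are illustrative, and the gender and number of the following noun phrase are assumed to be known already.

```python
def classify_dado(form: str, np_gender: str, np_number: str) -> str:
    """Return 'prp' or 'v-pcp' for a form of 'dado' preceding a noun phrase.

    np_gender is 'M' or 'F' and np_number is 'S' or 'P'; both describe the
    noun phrase that follows. All names here are hypothetical.
    """
    key = form.lower()
    if key in {"dada", "dados", "dadas"}:
        return "v-pcp"  # visibly inflected: past participle
    if key == "dado":
        # 'dado' + masculine singular keeps the preferred participle reading;
        # 'dado' + anything else shows no agreement: preposition.
        return "v-pcp" if (np_gender, np_number) == ("M", "S") else "prp"
    raise ValueError(f"not a form of 'dado': {form!r}")
```

For instance, `classify_dado("dado", "F", "P")` corresponds to dado as circunstâncias, and `classify_dado("dadas", "F", "P")` to dadas as circunstâncias.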

iii. Morphological ellipsis: coordinated adverbs 

Apparently this is the only case where one can speak of morphological ellipsis. In Portuguese, when regular adverbs (ending in -mente) are coordinated, only the last adverb in the coordination appears in its full form. The other adverbs in the coordination have an implicit but unmarked derivation (stemming from the Latin ablative case: mens alta). An example: João fez tudo clara, inteligente e eficientemente.
In such cases, the first element is morphologically tagged as a derived adverb (<adv> ADJ F S).
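As a rough illustration of the pattern, the sketch below marks which conjuncts of an adverb coordination lack an overt -mente. It assumes that the caller has already identified the words as conjuncts of one adverbial coordination; the function name is hypothetical, and the suffix test is deliberately naive (a word that merely happens to end in "mente" would be misjudged).

```python
def mark_elided_adverbs(conjuncts: list) -> list:
    """Pair each conjunct with True if its -mente suffix is elided.

    Only applies when the last conjunct is a full -mente adverb;
    otherwise nothing is marked.
    """
    if not conjuncts or not conjuncts[-1].lower().endswith("mente"):
        return [(word, False) for word in conjuncts]
    return [(word, not word.lower().endswith("mente")) for word in conjuncts]


marked = mark_elided_adverbs(["clara", "inteligente", "eficientemente"])
```

Conjuncts flagged True are the ones that would carry the <Em> tag described in section A.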

iv. Preposition / Verb


The parser's default for the verb haver in há + <expression of time> (João gosta da Maria há muito tempo) is to treat the third person singular present form as a preposition, head of an adverbial (a prepositional phrase in form).
When the same expression is placed at the beginning of a sentence or phrase, 'que' is added (há / havia + <expression of time> + que, as in Há muito tempo que a Maria gosta do João). In such cases a verbal analysis proved more satisfactory: há / havia + <expression of time> + que is then, syntactically, an adverbial in which há / havia is the main verb (P:v-fin in the tree below), the expression of time the ACC and que a focus adverb (FOC:adv):

                           A1 
                           STA:fcl 
                          ADVL:fcl
                    =P:v-fin('haver' PR 3S IND)    Há
                    =ACC:n('muito_tempo' M S)        muito_tempo
                    FOC:adv('que' <foc>)    que
                           SUBJ:np
                           =>N:art('a' <artd> F S) a
                           =H:prop('Maria' F S)    Maria
                           P:v-fin('gostar' PR 3S IND)     gosta
                           PIV:pp
                           =H:prp('de' <sam->)     de
                           =P<:np
                           ==>N:art('o' <-sam>  M S)       o
                           ==H:prop('João' M S)    João

This analysis is to be used in all cases, thereby abandoning the prepositional reading of the verb given by the parser.
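Tree listings like the one above encode depth with leading '=' signs, followed by FUNCTION:form, an optional group with the base form and tags, and the surface word. The following sketch turns such lines into nested nodes. It is an illustration only: the `Node` class, the regular expression and `parse_tree` are assumptions made for this example, lines that do not look like a node (punctuation, blank lines) are simply skipped, and the input is assumed to be consistently indented.

```python
import re
from dataclasses import dataclass, field

# depth ('=' signs), FUNCTION, form, optional ('lemma' tags), optional word
NODE = re.compile(
    r"^\s*(?P<depth>=*)"
    r"(?P<function>[^:\s]+):"
    r"(?P<form>[^\s(]+)"
    r"(?:\((?P<morph>[^)]*)\))?"
    r"\s*(?P<word>.*?)\s*$"
)


@dataclass
class Node:
    function: str
    form: str
    morph: str = ""
    word: str = ""
    children: list = field(default_factory=list)


def parse_tree(lines):
    """Build nested nodes from tree lines; returns the top-level nodes."""
    roots, stack = [], []  # stack[i] is the currently open node at depth i
    for line in lines:
        m = NODE.match(line)
        if m is None:
            continue  # punctuation or other non-node line
        depth = len(m["depth"])
        node = Node(m["function"], m["form"], m["morph"] or "", m["word"])
        del stack[depth:]  # close nodes at this depth or deeper
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


SAMPLE = """
PIV:pp
=H:prp('de' <sam->)     de
=P<:np
==>N:art('o' <-sam>  M S)       o
==H:prop('João' M S)    João
"""
roots = parse_tree(SAMPLE.splitlines())
```

Here `roots[0]` is the PIV:pp node, with the preposition and the np as its two children, and the article and the proper noun nested under the np.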

An alternative verbal analysis differs from the previous one in the treatment of 'que', which is considered an adverbial relative pronoun. Under this analysis, há / havia + <expression of time> + que is not an adverbial phrase but belongs to the clausal level: há / havia is the predicator (P:v-fin), the expression of time the ACC, and 'que' an adverbial relative pronoun introducing a relative clause:

                           A2
                           STA:fcl
                    =P:v-fin('haver' PR 3S IND)    Há
                    =ACC:np
                    ==H:n('muito_tempo' M S)        muito_tempo
                    ==N<:fcl
                    ===ADVL:adv('que' <foc>)    que
                           ===SUBJ:np
                           ====>N:art('a' <artd> F S) a
                           ====H:prop('Maria' F S)    Maria
                           ===P:v-fin('gostar' PR 3S IND)     gosta
                           ===PIV:pp
                           ====H:prp('de' <sam->)     de
                           ====P<:np
                           =====>N:art('o' <-sam>  M S)       o
                           =====H:prop('João' M S)    João
                           .
 
 

Further discussion and examples in haver_time
 

v. Numerals: gender and number

The default for numerals is the same as for determiners: they agree in gender and number with the head they modify. Take the example:

A Maria tem dez bons amigos, seis são de longa data. 

Two different situations arise here:

  • dez is a modifier of the head amigos and therefore has the same number and gender (NUM M P);
  • seis is itself the head, but the sentence context allows disambiguation; moreover, any number above one is plural unless the number is considered in isolation (as in 13 é aziago!, where the gender and number of the adjective also disambiguate the numeral).
However, there are situations where neither the sentence window nor the extra-linguistic context is enough to disambiguate the gender and number of the numeral in question. In this case, isolated arabic numbers are regarded as non-disambiguatable with regard to gender and number, and a default is used for anything larger than 1: NUM M/F P, for instance when the number occurs in parentheses. The following example illustrates the case: Maria Carvalho (9) e João Cunha (9) têm à sua frente um longo caminho a percorrer.

Dates are analysed as masculine singular (NUM M S), following the same principle as in 13 é aziago! above. For instance: 2001 foi imprevisível (both tests, o ano de 2001 and O imprevisível 2001, support this decision).
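The two fallbacks for isolated arabic numerals can be summarised as follows. This is a sketch only: the function name and the `is_date` flag are hypothetical, and deciding whether a number is a date is assumed to happen elsewhere. The text does not state a default for values of 1 or less, so the sketch returns None there rather than guessing.

```python
from typing import Optional


def default_numeral_tag(token: str, is_date: bool = False) -> Optional[str]:
    """Fallback morphology for an isolated arabic numeral without context."""
    if not token.isdigit():
        raise ValueError(f"not an arabic numeral: {token!r}")
    if is_date:
        return "NUM M S"    # dates: masculine singular (o ano de 2001)
    if int(token) > 1:
        return "NUM M/F P"  # gender left open, plural
    return None             # case not covered by the text
```

For the examples above, `default_numeral_tag("9")` gives the tag used for Maria Carvalho (9), and `default_numeral_tag("2001", is_date=True)` the one used for 2001 foi imprevisível.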
 
 

vi. Coordinating adverbs

The sequences:
 
  •  tanto...como
  •  não só ...como 
  • não só ... mas também
  • ...bem como
are parallel to nem...nem, that is, a coordination (...:cu) with coordinating adverbs (CO:adv).

Semantically there seems to be a proximity between the coordination (either expressed by the coordinating conjunctions or adverbs) and the use of mais and menos in the following cases:

...uma segunda compilação da Kaos que vai incluir todos os lançamentos não contidos em " Totally Kaos ", mais três edições exteriores à editora (= vai incluir todos os lançamentos(...) e três edições...)
...foi um elogio tudo menos   inocente. (= foi um elogio tudo mas um elogio inocente)

The fact that the adverbs mais and menos can be replaced by a copulative and an adversative coordinating conjunction, respectively, supports the same approach for mais and menos: they are treated as coordinating adverbs.

Extended discussion on coordinating adverbs

vii. Base forms: insertion of <prop>

The trees present the base form of each lexeme before the morphological information; for nouns, F:f('base form' gender number).
In principle this poses no major difficulty, except where the orthographic conventions of the variants of Portuguese differ. For months, Brazilian Portuguese uses a lower-case initial letter, while in European Portuguese months are considered proper nouns and are therefore capitalised (novembro versus Novembro).
Because the tagger-parser is more acquainted with Brazilian Portuguese, the base forms were inserted following the Brazilian convention. To be more faithful to European Portuguese, since the corpus used is Portuguese, a secondary tag <prop> was added, indicating that although the base form has a lower-case initial letter, the noun in question is to be regarded as having proper noun characteristics.

B2. Syntax

i. Ellipsis: (#E and  <Eg>, for group ellipsis or <Es>, for syntactic ellipsis)

The elements that were considered elliptic are those which could be easily recovered by the context of the sentence. The following cases were handled as elliptic:

         - elliptic verb phrase / predicator (recovered elements of the ellipsis shown in parentheses, e.g. (são), (foi filmado))

              a) total elliptic predicator:  e.g. Paulo Sá pedia ainda uma acareação entre o industrial portuense Manuel Macedo, Ramiro Moreira e o tenente da Marinha Pedro Menezes, todos (são) testemunhas neste caso.

              b)  partial elliptic verb phrase: e.g. «A Marca de Fogo» foi filmado em 1914 em simultâneo com «The Golden Chance, o primeiro (foi filmado) de dia e o segundo (foi filmado) de noite.

         - an element that is present in one string and would be repeated in another becomes elliptic in the latter (recovered elements of the ellipsis shown in parentheses, e.g. (tema), (dezenas de))

e.g. Os quatro primeiros temas destinam-se a mostrar o papel de Portugal no mundo e o quinto (tema), o único (tema) sem relação com a história nacional, é justificado por a experiência de Barcelona (Port Aventura) , que regista assinalável sucesso. 

       Dezenas de timorenses e (dezenas de) portugueses «ocupam» pacificamente o pavilhão indonésio da Expo-92, em Sevilha.

But not all cases that can be recovered from the context are treated as elliptic. One example is lexical valency, where a lexeme governs a preposition. The preposition in these cases is considered obligatory and therefore cannot be elliptical (cf. Note, section  D. Human mistakes). Some examples: convencer de algo/que... as in e.g. Estava convencido de que só eu a via, só eu a imaginava vista de cima naufragando no meio de os horríveis autocarros lisboetas ; certeza de que / algo as in Tinha a certeza de que tudo estava a correr bem. When the preposition de is missing (Tinha a certeza que; Estava convencido que), the ellipsis code (#E) is not applied.
 

Briefly, ellipses are treated either:

  • by keeping them visible, assigning them the syntactic function they would hold if they were embedded in a clause that is not elliptic;
or
  • especially in cases involving determiners, by making the determiner the HEAD of the phrase, which prevents empty heads. Cf. also the <n>-marked "noun-like" adjective np-heads (B1.i).
For further discussion refer to sentence: C170-12 or C179-7

ii. Clausal and group discontinuities

By discontinuity we mean a constituent, either a clause or a group, that is split into two parts by an intervening element (also a clause or group).  The general convention is to mark the first part of a discontinuous constituent with a right-attached hyphen, and the second with a left-attached hyphen. Possible middle parts receive a hyphen on both sides of the FUNCTION:form symbol. The formalism is described in the terminological guide-lines.

In the corpus used for the Floresta Sintá(c)tica, 2 types of discontinuity were found:

 

1. Discontinuous finite clauses, split by an intervening clause.

Cases were observed where the main clause is embedded in the subordinate clause as in the following example:

Devido à acção de Ames, explicou o actual director da CIA, foi muito mais difícil para os EUA compreender o que se passava na URSS durante aquele período crítico, ...

The main clause is explicou o actual director da CIA; the clause that surrounds it is actually the object of the main verb of that higher-level clause.
In terms of notation, the two parts of the split clause are broadly represented as follows:

FUNCTION:fcl-     (corresponding to the first part of the split clause)
FUNCTION:form   (intervening element: clause, group)
-FUNCTION:fcl    (corresponding to the second part of the split clause)

(The hyphen represents the link between the two parts that were split)

Specifically,  in the above case one would have:

ACC:fcl-        Devido à acção de Ames
P:v-fin             explicou
SUBJ:np          o actual director da CIA
-ACC:fcl        foi muito mais difícil (...)
 

For further discussion, refer to sentence C159-5
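Given the sequence of FUNCTION:form labels at one level of a tree, the two halves of a discontinuous constituent can be matched by their hyphens. The sketch below is an assumption-laden illustration, not the project's own tooling: the function names are hypothetical, parts are matched on the function symbol only (the form of the two halves is not required to be identical), and middle parts with hyphens on both sides are skipped.

```python
def function_of(label: str) -> str:
    """Return the function symbol of a label such as 'ACC:fcl-' or '-ACC:fcl'."""
    return label.strip("-").split(":", 1)[0]


def pair_discontinuous(labels: list) -> list:
    """Return (first_index, second_index) pairs of split constituents."""
    pending = {}  # function symbol -> index of the still-open first part
    pairs = []
    for i, label in enumerate(labels):
        if label.startswith("-") and label.endswith("-"):
            continue  # middle part of a three-way split
        if label.endswith("-"):
            pending[function_of(label)] = i
        elif label.startswith("-"):
            start = pending.pop(function_of(label), None)
            if start is not None:
                pairs.append((start, i))
    return pairs


pairs = pair_discontinuous(["ACC:fcl-", "P:v-fin", "SUBJ:np", "-ACC:fcl"])
```

For the example above, the first and the last label are paired, with the predicator and the subject of the main clause intervening. Hyphens inside a form symbol (as in v-fin) do not interfere, because only leading and trailing hyphens are inspected.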

2. Discontinuous groups

                   2.2. noun phrases: 

                              2.2.1. noun phrases with embedded higher level clauses

                 These are the cases where a noun phrase is split by the main clause. The following example illustrates the case: 

                 Que jogos, que sites na Internet.......se podem fazer que traduzam o título do workshop?

                                   The non-discontinuous counterpart of the above sentence would be: Que jogos, que sites na Internet que traduzam o título do workshop se podem fazer?

                           In terms of notation, the sentence would be represented as:

SUBJ:cu-
=H:cu
==CJT:np 
===>N:pron-det('que' <interr> M P) que
===H:n('jogo' M P) jogos
==,
==CJT:np 
===>N:pron-det('que' <interr> M P) que
===` 
===H:n('site' M P) sites
===' 
===N<:pp 
====H:prp('em' <sam->) em
====P<:np 
=====>N:art('a' <-sam> F S) a
=====H:prop('Internet' F S) Internet
ACC:pron-pers('se' M 3P ACC) se
P:vp 
=AUX:v-fin('poder' PR 3P IND) podem
=MV:v-inf('fazer') fazer
-SUBJ:np
=N<:fcl 
==SUBJ:pron-indp('que' <rel> M P) que
==P:v-fin('traduzir' PR 3P SUBJ) traduzam
==ACC:np 
===>N:art('o' M S) o
===H:n('título' M S) título
===N<:pp
====H:prp('de' <sam->) de
====P<:np
=====>N:art('o' <-sam> M S) o
=====` 
=====H:n('workshop' M S) workshop
=====' 
?

(in this case, the noun phrase is actually a compound unit- coordinated noun phrases- and so, the hyphen is placed next to the form ....:cu-) 
 

                              2.2.2. noun phrases with high level clause constituents 

The sentence illustrating the case is the following:

integra constantemente o elemento insólito , da existência humana e urbana, na paisagem quotidiana: a rotina dos gestos e dos comportamentos

The noun phrase has the following internal structure (if not discontinuous):

                           ACC:np
                           =>N:art('o' <artd> M S) o
                           =H:n('elemento' M S)    elemento
                           =N<:adj('insólito' M S) insólito
                           =N<:pp
                           ==H:prp('de' <sam->)   de
                           ==P<:np
                           ===>N:art('a' <-sam>  F S)     a
                           ===H:n('existência' F S)       existência
                           ===N<:cu
                           ====CJT:adj('humano' F S)      humana
                           ====CO:conj-c('e' <co-postnom>)        e
                           ====CJT:adj('urbano' F S)      urbana
                           ===:
                           ===APP:np
                           ====>N:art('a' <artd> F S)     a
                           ====H:n('rotina' F S)  rotina
                           ====N<:cu
                           =====CJT:pp
                           ======H:prp('de' <sam->)        de
                           ======P<:np
                           =======>N:art('o' <-sam>  M P)  os
                           =======H:n('gesto' M P) gestos
                           =====CO:conj-c('e' <co-postnom>)        e
                           =====CJT:pp
                           ======H:prp('de' <sam->)        de
                           ======P<:np
                           =======>N:art('o' <-sam>  M P)  os
                           =======H:n('comportamento' M P) comportamentos

However, in the actual noun phrase a clause-level constituent, ADVL (em a paisagem quotidiana), is embedded, which splits the noun phrase.
The constituent is inserted and the split of the noun phrase is represented by hyphens (as in any case of discontinuity):

                          ACC:np-
                           =>N:art('o' <artd> M S) o
                           =H:n('elemento' M S)    elemento
                           =N<:adj('insólito' M S) insólito
                           =N<:pp
                           ==H:prp('de' <sam->)   de
                           ==P<:np
                           ===>N:art('a' <-sam>  F S)     a
                           ===H:n('existência' F S)       existência
                           ===N<:cu
                           ====CJT:adj('humano' F S)      humana
                           ====CO:conj-c('e' <co-postnom>)        e
                           ====CJT:adj('urbano' F S)      urbana
                       ADVL: pp                       na paisagem quotidiana
                          :
                           -ACC:np
                           =APP:np
                           ==>N:art('a' <artd> F S)     a
                           ==H:n('rotina' F S)  rotina
                           ==N<:cu
                           ===CJT:pp
                           ====H:prp('de' <sam->)        de
                           ====P<:np
                           =====>N:art('o' <-sam>  M P)  os
                           =====H:n('gesto' M P) gestos
                           ===CO:conj-c('e' <co-postnom>)        e
                           ===CJT:pp
                           ====H:prp('de' <sam->)        de
                           ====P<:np
                           =====>N:art('o' <-sam>  M P)  os
                           =====H:n('comportamento' M P) comportamentos

                    2.2.3. relative constructions

Prototypically, relative constructions are post nominal clauses introduced by a relative pronoun, in which all the internal elements hold clausal functions. A clear example is the following: Nascida no seio da estética M-Base, que encontrou em Cassandra a única porta-voz vocal,...

STA:fcl 
PRED:icl
=P:v-pcp('nascer' F S) Nascida
=ADVL:pp
==H:prp('em' <sam->) em
==P<:np
===>N:art('o' <-sam> M S) o
===H:n('seio' M S) seio
===N<:pp 
====H:prp('de' <sam->) de
====P<:np 
=====>N:art('a' <-sam> F S) a
=====>N:adj('estético' F S) estética
=====H:prop('M-Base' F S) M-Base
=====,
=====N<:fcl
======SUBJ:pron-indp('que' <rel> M S)        que
======P:v-fin('encontrar' PS 3S IND) encontrou
======ADVL:pp
=======H:prp('em')   em
=======P<:prop('Cassandra' F S)      Cassandra
======ACC:np
=======>N:art('a' <artd> F S)        a
=======>N:adj('único' F S)   única
=======H:n('porta-voz' M S)  porta-voz
=======N<:adj('vocal' M S)   vocal
,
(...)

However, relative constructions may have a different configuration, which is not a problem in CG format (surface structure) but is one in tree representation, where an attachment problem arises.
This is the case of the relative construction where the relative pronoun does not have a clausal function but is instead inserted in a noun phrase which is itself a post nominal. Because it is a relative construction, the post nominal is fronted, which causes an attachment problem in the tree representation: discontinuity.

A case illustrating the above is, for instance, Nascida no seio da estética M-Base, de que se tornou a única porta-voz vocal, Cassandra...

The preposition de is the head of a prepositional phrase that holds the function of post nominal of the head porta-voz. The relative pronoun, referring back to estética (in the non-finite clause preceding it), is actually the complement of the preposition in the noun phrase. If the post nominal containing the relative clause were not fronted, the sentence would be:

Nascida no seio da estética M-Base , Cassandra tornou-se a única porta-voz vocal de que (estética M-Base).

And the analysis:

(...)
,
SUBJ:prop('Cassandra' F S)      Cassandra
P:v-fin('tornar' PS 3S IND)     tornou-
ACC:pron-pers('se' <refl> F 3S ACC)     se
 SC:np
 =>N:art('a' <artd> F S) a
 =>N:adj('único' F S)    única
 =H:n('porta-voz' M S)   porta-voz
 =N<:adj('vocal' M S)    vocal
 =N<:pp
 ==H:prp('de' <sam->)     de
 ==P<:pp
 ===>N:artd('o' <-sam>  F S)     a
 ===H:pron-indp('que' F S)         que
  .
 

Once the post nominal is fronted (de que se tornou a única porta-voz vocal), the subject complement (SC:np) is split and can only be represented in the tree structure by discontinuity:

(...)
,
SC:np-
=N<:pp                             de que
ACC:pron-pers('se' <refl> F 3S ACC)     se
P:v-fin('tornar' PS 3S IND)     tornou- 
-SC:np
 =>N:art('a' <artd> F S) a
 =>N:adj('único' F S)    única
 =H:n('porta-voz' M S)   porta-voz
 =N<:adj('vocal' M S)    vocal
,
(...)
 
 

iii. Non-finite attributive participle clauses as opposed to participle groups

Clauses and groups are defined in the Notational and terminological guide-lines.
Under a top level node (finite clause), both clauses and groups can occur. When a verbal element heads the constituent, it is a clause; otherwise, a <head, dependent> structure forms a group.

       O João comeu um bolo feito pela namorada (feito pela namorada is a non-finite clause)

       O João comeu um bolo de laranja (de laranja, prepositional group - de (head); laranja (dependent)- no verbal element)

Extended comment in C150-4.

iv. Direct object clause (ACC:fcl) without the presence of the subordinating conjunction 'que'

This applies especially to quotes in connection with speech verbs, as in the following example:
«Eles fizeram da nossa aldeia um cemitério», contou uma velha (que)

Although the subordinating conjunction is not present, one can deduce from the context that the clause in quotation marks is a subordinate clause, whose function is direct object of the verb in the main clause.

Examples and further discussion in sentence : C165-5
 

v. Syntactic value of punctuation marks (#D for discourse structure within the window)

In general, our system does not explicitly mark syntactic function on punctuation symbols. However, punctuation is still used by the parser to assign function to other constituents. Thus, when dealing with commas, the parser heuristically tries to establish a coordinating relation between the constituents. There are cases, though, where coordination would be syntactically acceptable but not semantically.
If the punctuation mark can be fully replaced by a conjunction, the clause in question retains the function suggested by that conjunction (usually ADVL:fcl), as in the following sentence:

Agora, o financiamento do projecto foi muito complicado, (=porque) tentei na Suécia e mais tarde consegui na Alemanha --

A comma or semi-colon (cf. criteria for sentence-separation) within our analysis window that really marks different chunks of discourse, rather than coordination, is treated by using utterance function tags (like STA:fcl for 'statement') instead of the normal conjunct tag (CJT:fcl), while keeping the top node form of a coordinated unit (UTT:cu or STA:cu). The following sentence is a clear example of this situation:

Penso que o fundamental é que «Where In The World é o primeiro álbum mesmo «da banda», é só isso.    (the comma might indicate a pause in speech)

Further details can be found in sentences: C172-2; C176-5

vi. Apposition vs. Postnominal predicative (#L, for listing structure within the sentence)

We use two different functions for postnominal material that is separated by punctuation, but still treated as belonging within the same np.

1. @APP (apposition)

The prototypical apposition is a name or definite np, identifying the np-head it postmodifies: "Jerónimo, o grande cacique" or "o seu advogado, Marco da Silva".

2. @N<PRED (postnominal predicative)

The prototypical postnominal predicative is an adjective, attributive participle or indefinite np, predicating something about the np-head it postmodifies, typically with the semantic relation of 'IS' (=):
"Jorge Gomes, funcionário" or "Jorge Gomes, contente com a vida"

In a newspaper corpus there is much parenthetic information within parentheses, and we treat these cases as @APP and @N<PRED in the sense defined above. An interesting borderline case of @N<PRED arises, where there is no IS-relation, but some other kind of predication, possibly involving elliptic prepositions or subclause material:

Miguel Castro (57)
Miguel Castro (Campinas) = Miguel Castro (de Campinas)
Miguel Castro (piano) = Miguel Castro (ao piano)
Marta Suplicy (PT) = Marta Suplicy (do PT)
A conferência de Barcelona (Novembro de 1995) = A conferência de Barcelona (que aconteceu em Novembro de 1995)

If, however, an abbreviation follows what it is an abbreviation for, we would tag it @APP:

Partido da Terra (PT)

vii. Syntactic function of the personal pronoun '-se'

Eight categories were considered:

1. <refl> @ACC (lexical reflexive)

                  Ele comportou-se muito bem! 
2. <obj> @ACC (direct object reflexive)
                  O João já se lavou. 
3. <coll> @ACC (collective reflexives)
                   Os ministros reuniram-se ontem de emergência.
4. <reci> @ACC (reciprocal reflexives)
                   Eles amam-se!
5. @ACC-PASS
                    Vendem-se casas.
6. @SUBJ
                    Vende-se casas
In case of speech verbs, both @ACC-PASS and @SUBJ readings are possible. The parser opts for the @SUBJ reading (e.g. Diz-se que é possível conciliar carreira e família)
7. @DAT
                  Chamaram-se a si mesmos revolucionários!
8. @VOK 
                  Faça-se uma revolução!
    • only with verbs in the subjunctive;
    • with verbs in singular or plural inflection;
    • beginning of the sentence.

viii. Prepositional form of obligatory complements: @SC / @OC and @ADV ( @ADVS ; @ADVO)

The arguments of copula verbs (marked <vK> and <vtK> for valency) recognised by the tagger-parser are @SC (subject complement) or @OC (object complement): e.g. sou de Lisboa; tomo-o por bom profissional.

Some verbs, although not categorised as copula verbs, behave like copula verbs in terms of argument selection. Verbs of this type select obligatory adverbial arguments, tagged in the corpus as @ADV.
It was decided to divide the tag into two subcategories:

    • @ADVS, if subject related (e.g. mora em Lisboa; ficou abaixo das expectativas);
    • @ADVO, if object related (e.g. pousou o livro na mesa)
Note that 'estar' and 'ficar' have a valency potential for both subject complements (@SC) and subject related adverbial arguments (@ADVS). The difference can be tested by replacement with 'tal'/adjective (for @SC) or with pronoun adverbs (lá, aqui, hoje, muito, for @ADV).
Discussion and examples: sentence C1-4

ix. Focus markers

The focus marker secondary tag <meta> is used in conjunction with adverbs that have a focusing scope over a syntactic constituent immediately to the right, while the syntactic function @FOC is used for focusing constructions involving etymologically/morphologically verbal material, often involving que-clefting. An example of the <meta> case is "Até o José protestou"; examples of @FOC are "Esse computador escreve é devagar." and "É de peixe que gosta mesmo.", the latter involving a focus bracket (two @FOC).
 
1.  Focus adverbs: the tag <meta> is used to mark a number of adverbs (até, nem, não, já, ...) when used to focus a constituent in their scope. At the clause level, the associated syntactic function would typically be @ADVL, sometimes @>S; at group level the function marker would be one of the dependency markers @>A, @>N, @>P.
     Morphological ambiguity may arise between the word classes preposition and adverb. Words like "até" can be considered either a preposition or an adverb, which in the latter case would mean a focus adverb:
  • as a preposition, the syntactic analysis is @ADVL or @ADV, depending on the nature of the verb;
  •  as an adverb, the syntactic analysis is @>P (CG format) and FOC in tree representation.
    e.g. Penso que é mesmo possível chegar até ao Big Bang.


2. Focus reading of eis=que:

Eis que is treated as a single token with the function FOC and the form adverb. The reasons are that a) an analytical reading with copula + predicative que-clause would be odd due to the fixed constituent order (no inversion), and b) the description matches the one used for é=que focusing constructions, which sometimes separate constituents that do not allow predicative readings ("de peixe é=que gosta", where the fact of liking can't be made of fish ...).
 

x.  Eis

Eis alone, followed by a noun or noun phrase (@SC), has a different reading (as a copula verb), inspired by the word's etymology. This analysis makes it possible to assign the nominal constituent a predicative function.
 
 

xi. Shared constituents

In coordination, the conjoints may share constituents even though these are displayed in only one of the conjoints, for instance subjects (which in Portuguese do not have to be expressed).
As described in Notational and terminological guide-lines, there is not yet a tag to label constituents that form a syntactically coordinated unit but do not fall under any of the established terminology; therefore undetermined function and form (represented by ?) are used.
The initial, automatic solution did not make explicit that the constituents were shared by every conjoint in the compound unit: the (shared) constituents depended on one of the CJT nodes (the one nearest to the constituent).

The following cases were found in the Floresta:

1. Shared subject(s) and adverbial(s)

The subject / adverbial(s) is/are the same in every conjoint of the compound unit. For example:
 

  • Shared subject(s):
            O Presidente cancela todos os compromissos e fecha-se na Casa Branca.
 
  • Shared adverbial(s):
            Naquele ano, as brigadas vermelhas (BR) estavam no auge da actividade terrorista, o líder cristão democrata Aldo Moro acabara de ser raptado, e o princípe - proibido de entrar na Itália desde o exílio do pai em 1946- teria mesmo recebido ameaças da BR. 

The solution for both cases was to place the subject or adverbial at the sentence level, as a sister of the coordinated node whose label is underspecified (?:cu), that is, one level above the conjoints.

For instance, taking the shared subject:
 
Shared constituency not explicit:

                           A1
                           STA:cu
                           CJT:fcl
                           =SUBJ:np
                           ==>N:art('o' <artd> M S)        O
                           ==H:n('presidente' <prop> M S)  Presidente
                           =P:v-fin('cancelar' PR 3S IND)  cancela
                           =ACC:np
                           ==>N:pron-det('todo_o' <quant> M P)     todos_os
                           ==H:n('compromisso' M P)        compromissos
                           CO:conj-c('e' <co-vfin> )       e
                           CJT:fcl
                           =P:v-fin('fechar' PR 3S IND)    fecha-
                           =SUBJ:pron-pers('se' M/F 3S/P ACC)      se
                           =ADVL:pp
                           ==H:prp('em' <sam->)    em
                           ==P<:np
                           ===>N:art('a' <-sam>  F S)      a
                           ===H:prop('Casa_Branca' F S)    Casa_Branca
                           .

Shared constituency explicit:

                           A1
                           STA:cu
                           =SUBJ:np
                           ==>N:art('o' <artd> M S)        O
                           ==H:n('presidente' <prop> M S)  Presidente
                           =?:cu
                           ==CJT:?
                           ===P:v-fin('cancelar' PR 3S IND)  cancela
                           ===ACC:np
                           ====>N:pron-det('todo_o' <quant> M P)     todos_os
                           ====H:n('compromisso' M P)        compromissos
                           ==CO:conj-c('e' <co-vfin> )       e
                           ==CJT:?
                           ===P:v-fin('fechar' PR 3S IND)    fecha-
                           ===SUBJ:pron-pers('se' M/F 3S/P ACC)      se
                           ===ADVL:pp
                           ====H:prp('em' <sam->)    em
                           ====P<:np
                           =====>N:art('a' <-sam>  F S)      a
                           =====H:prop('Casa_Branca' F S)    Casa_Branca
                           .
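
The level notation used in these trees (one "=" per level of depth, followed by FUNCTION:form) can be read mechanically. The following is a minimal sketch in Python, not part of the Floresta tools: the names NODE and parse_tree are hypothetical, the line format is inferred only from the trees shown in this document, and the sample is an abbreviated version of the explicit tree. It checks that the shared subject and the underspecified ?:cu node are sisters directly under the top node.

```python
import re

# Assumed line format (inferred from the trees in this document):
#   <zero or more '='> FUNCTION ':' form [ '(' tags ')' ] [ word ]
NODE = re.compile(r"^(?P<depth>=*)(?P<func>[^:=\s]+):(?P<form>[^\s(]+)")

def parse_tree(lines):
    """Return (depth, function, form) per node line; skip ids and punctuation."""
    nodes = []
    for raw in lines:
        match = NODE.match(raw.strip())
        if match:
            depth = len(match.group("depth"))  # number of '=' signs
            nodes.append((depth, match.group("func"), match.group("form")))
    return nodes

# Abbreviated copy of the tree with explicit shared constituency.
EXPLICIT = """\
A1
STA:cu
=SUBJ:np
==>N:art('o' <artd> M S)        O
==H:n('presidente' <prop> M S)  Presidente
=?:cu
==CJT:?
===P:v-fin('cancelar' PR 3S IND)  cancela
==CO:conj-c('e' <co-vfin> )       e
==CJT:?
===P:v-fin('fechar' PR 3S IND)    fecha-
.
""".splitlines()

nodes = parse_tree(EXPLICIT)
# Nodes directly under the top node: the shared subject and the coordination.
top_level = [(func, form) for depth, func, form in nodes if depth == 1]
print(top_level)
```

With the explicit analysis, the list of depth-1 nodes contains both SUBJ:np and ?:cu, which is what makes the sharing searchable.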

 

2. Shared apposition and dependent(s):

This case is more difficult to handle than the previous one as it involves an attachment problem. The issue is how to attach an apposition  and dependents in general to two coordinated constituents.
Example:
 

             já depois da derrota continuou  a tentar na comunicação social e nos jornalistas, os bodes expiatórios da derrota...

            E apelava ao "idealismo e ao pioneirismo" da América como o antídoto capaz de dar sentido ao seu enorme poder.

The problem was the repetition of the preposition. The most satisfactory solution, although quite complex to implement because it involves discontinuity, was to:

     a) mark the preposition as discontinuous;
     b) mark the complement of the preposition as discontinuous;
     c) attach the APP / dependents in general to the discontinuous complement of the preposition.

The tree would look like this:

E apelava ao idealismo...

 Further discussion on the topic
 

B3. Pragmatics

i. Utterance function

The system uses the following utterance functions:
  • UTT = utterance (the underspecified default)
  • STA = statement (parser-default: sentence not ending in '?' or '!')
  • QUE = question (parser-default: sentence ending in '?')
  • EXC = exclamation (parser-default: sentence ending in '!')
  • COM = command (parser-default: sentence ending in '!' and containing imperative verb form, V IMP, or vocative function, @VOK)
As a consequence of our sentence separation principles, which yield individual sentences as the window of analysis, utterance function tags would appear only at the top node (usually ...:fcl, sometimes ...:acl). Sometimes, however, a non-separator punctuation mark, especially a comma, occurs in the corpus with the function of discourse chunk separator. Here, the top node form will be coordinated unit (cu), and its daughter nodes will also carry utterance function tags.
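
The parser defaults listed above can be summarised as a small decision function. This is only an illustrative sketch in Python, not the parser's actual code: the function name and the two boolean flags (standing in for the presence of an imperative verb form, V IMP, and of a vocative function, @VOK) are assumptions made for the example.

```python
def default_utterance_function(sentence, has_imperative=False, has_vocative=False):
    """Return the parser-default utterance function tag for one sentence.

    has_imperative / has_vocative are hypothetical inputs that would come
    from the morphological and syntactic analysis (V IMP, @VOK).
    """
    text = sentence.rstrip()
    if text.endswith("!"):
        # '!' plus imperative or vocative gives a command, otherwise an exclamation.
        return "COM" if (has_imperative or has_vocative) else "EXC"
    if text.endswith("?"):
        return "QUE"
    # Everything not ending in '?' or '!' defaults to statement.
    return "STA"

print(default_utterance_function("Por favor, feche a porta!", has_imperative=True))
print(default_utterance_function("O Presidente cancela todos os compromissos."))
```

The underspecified tag UTT is not produced by this sketch, since the list above gives no punctuation-based default for it.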

C. Punctuation


There is not yet a standard, robust set of rules for punctuation, and so far there has been no thorough discussion of the topic. However, some regularities can already be documented.

Final sentence punctuation (full stop, exclamation/interrogation mark, colon, semi-colon): top level

Inner sentence punctuation:  punctuation chunking constituents (double commas, parentheses, quotes, hyphens) should be placed at the same level (i.e. with the same indentation) as the "chunked" constituent or word. The opening punctuation of a chunk is placed before (i.e. outside) the highest node in the chunk. The same holds for separators (hyphen, colon, comma), which are placed at the same level (i.e. with the same indentation) as the units they separate. Here, too, node lines go with what's inside the node, separators are kept outside nodes.
 

D. Human mistakes: <new> and <nil>


The criterion established in principle was that errors in the original corpus should not be corrected. To handle such errors where they occur in the corpus, secondary tags were introduced:

<new> 

Whenever misspellings influence the syntactic analysis, especially when they make it impossible, the mistake is corrected and the correction is signaled with the tag <new>.

Example:  Poe favor, feche a porta! 

The automatic analysis would take Poe, which should correctly be por, as a finite main verb (pôr), which is obviously not a satisfactory analysis. The tag <new> marks the correction of this sentence in the corpus.

<nil> 

When an analysis is still possible, the mistake is not corrected but signaled with the tag <nil>, meaning that it is a mistake and should not have been there in the first place.

Example: Os queixosos não deixam de nunca de remeter as culpas para o governo.
 

The preposition de in de nunca is wrongly used. However, it does not compromise the analysis of the sentence, since the ADVL, which in the correct form would be a bare adverb (nunca), here forms a group headed by the preposition de. The tag <nil> indicates that the preposition is wrongly used, but the mistake is not corrected.
 

Note that the same tags are applied to missing or unnecessary punctuation (for instance, a missing full stop before a capitalised letter or two adjacent commas). However, misuse of punctuation determined by personal interpretation will not be corrected.

In addition to these tags, the code #W is introduced, indicating that the sentence contains an odd construction.

Discussion in: C163-2
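
Since codes such as #W sit in the source line and secondary tags such as <new> and <nil> sit next to the constituent, both can be retrieved with simple pattern matching. The sketch below, in Python, is an assumption-laden illustration rather than an official search tool: the function names are hypothetical, the SOURCE line is the one quoted in section A, and the node line carrying <nil> is invented for the example.

```python
import re

CODE = re.compile(r"#(\w+)")        # codes in the source line, e.g. #E, #W
PARENS = re.compile(r"\((.*)\)")    # the parenthesised tag part of a node line
TAG = re.compile(r"<([^<>\s]+)>")   # secondary tags, e.g. <new>, <nil>, <Em>

def source_codes(source_line):
    """Return all '#'-prefixed codes found in a SOURCE line."""
    return CODE.findall(source_line)

def secondary_tags(node_line):
    """Return the <...> tags inside the parentheses of a tree node line.

    Only the parenthesised part is searched, so function labels such as
    'P<' or '>N' are never mistaken for tags.
    """
    inside = PARENS.search(node_line)
    return TAG.findall(inside.group(1)) if inside else []

print(source_codes("SOURCE: CETEMPúblico n=195 sec=clt sem=93a  #E"))
print(secondary_tags("==CJT:adv('claro' <Em>)   clara"))
# Invented node line, for illustration only:
print(secondary_tags("==H:prp('de' <nil>)  de"))
```

A search for sentences with uncorrected mistakes would then keep the trees in which some node line yields "nil".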