Documentation of the choices in the treebank project
Projecto Floresta sintá(c)tica
Last update: 24 November 2001
Eckard Bick
Susana Afonso
Ana Raquel Marchi
This document is the result of the discussion that took place while the trees were being revised. The topics listed below record the analyses that were adopted. Documenting these choices is important so that users of the Floresta are aware of the solutions found for certain types of linguistic problems that either are not treated in traditional grammars or, because of their structure, pose a problem for tree representation (levels of attachment).
The topics are divided into sections according to the linguistic area they concern (part of speech, syntax, etc., but also punctuation and how to deal with human mistakes in the original corpus sentences). They are illustrated with examples and, where possible, with links to further discussion and to other examples related to the topic.
A. Introduction of Codes
To make the corpus searchable, a number of codes were introduced. They are mentioned in the sections below, together with the phenomena they refer to. Formally, a code is either enclosed in < > and placed next to the constituent in the tree, or preceded by # and placed in the source line. For example, for três elementos essenciais, clara e inteligentemente valorizados por..., the code #E appears in the source line:
SOURCE: CETEMPúblico n=195 sec=clt sem=93a #E
and <Em> next to the constituent:
(...)
,
N<:icl
=>A:cu
==CJT:adv('claro' <Em>) clara
==CO:conj-c('e') e
==CJT:adv('inteligentemente') inteligentemente
=H:v-pcp('valorizar' M P) valorizados
(...)
Not all the topics listed below have a code: for some of them, a tag or the syntactic notation is enough for searching (e.g. nominalised adjectives, tag <n>; or discontinuous constituents, identified by the hyphenated form, F:f-, and/or function, -F:f).
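As an illustration of how these two kinds of code could be retrieved automatically, the following minimal Python sketch extracts the #-codes from a source line and the angle-bracket tags from a tree line. The function names and the regular expressions are assumptions made for this example; they are not part of the Floresta tools.

```python
import re


def source_codes(source_line):
    """Return the codes that follow '#' in a SOURCE line (hypothetical helper)."""
    return re.findall(r"#(\w+)", source_line)


def node_tags(node_line):
    """Return the secondary tags in angle brackets on one tree line.

    Only the text inside the parentheses is searched, so that function
    symbols such as 'N<' or '>N' are not mistaken for tags.
    """
    inside = re.search(r"\(([^()]*)\)", node_line)
    if inside is None:
        return []
    return re.findall(r"<([^<>\s]+)>", inside.group(1))


print(source_codes("SOURCE: CETEMPúblico n=195 sec=clt sem=93a #E"))
print(node_tags("==CJT:adv('claro' <Em>) clara"))
```

Restricting the tag search to the parenthesised part of the line is a design choice of this sketch: it avoids false matches on the dependency arrows used in function symbols.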
i. Ellipsis: #E
Sentences that exhibit elliptic elements are marked with the code #E. In addition, depending on the type of ellipsis, tags in angle brackets are applied:
- Morphological ellipsis: <Em> (see section B1.iii. for examples and details)
- Syntactic ellipsis (described in section B2.i.):
¤ syntactic ellipsis: <Es>
¤ group ellipsis: <Eg>
ii. Discourse structures (section B2.v.): #D
iii. Listing structures (section B2.vi.): #L
iv. Unconventional / ungrammatical constructions: #W
This code is used if the sentence is either ungrammatical or, while syntactically correct, odd in some way. The situations in which the revisers used #W have not yet been systematised.
B. Linguistic areas:
B1. Part of speech
This section discusses certain part-of-speech topics that were problematic for the syntactic analysis. To make the corpus more consistent, we either chose one option over the others (e.g. how to analyse the first adverb in a coordination when its form lacks the suffix -mente) or introduced new elements, for instance new tags.
i. Adjective, head of a noun phrase: introduction of the tag <n>
The introduction of the tag <n> follows the prototypical head system: the head of a noun phrase is expected to be a noun. Nominalised adjectives, however, are still tagged as adjectives in the system. For instance, both instances of the lexeme jovem in "O jovem tornou-se um herói" and "O jovem herói foi relembrado" are regarded as adjectives, even though jovem in the first sentence is nominalised (as the preceding article shows).
<n> is therefore applied if the nominalised adjective implies a noun head (e.g. jovem, velho (...)).
Adjectives whose nominal form denotes an abstract concept (and which therefore do not imply a noun head, as is the case of frio; see the discussion in C165-6) are lexicalised as nouns.
The tag <n> is added automatically to mark adjective heads of noun phrases in a semi-morphological way. Functionally, this information is equivalent to the combination '=H:adj' within an np. This also covers the special cases where two adjectives make up an np, one being the head and the other a modifier (e.g. O velho doente).
For further discussion, please refer to sentences C165-6 and C142-1.
ii. The case of dado: preposition / past participle
Initially, dado was tagged exclusively as a past participle by the tagger-parser. However, after a corpus search provided evidence that it does not always inflect for gender and number, a second analysis was put forward, treating it as a preposition, like visto. Because the past participle reading already existed and is the preferred one, the prepositional reading was introduced only for the cases where no inflection is present.
Accordingly, dado is analysed as:
- Preposition, if its form is fixed (no inflection, e.g. ..., dado as circunstâncias);
- Past participle, if it inflects for number and gender (e.g. ..., dadas as circunstâncias), including the otherwise underspecified masculine singular (M S) (e.g. dado o caso,...).
See sentence C2-4 for discussion of the topic and examples.
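The decision rule above can be sketched in Python as follows. This is only one possible reading of the rule: the function name, the form table and the idea of comparing dado with the gender and number of the following noun are assumptions made for the illustration, not a description of the tagger-parser.

```python
# Gender and number expressed by each inflected form of 'dado'.
DADO_FORMS = {
    "dado": ("M", "S"),
    "dada": ("F", "S"),
    "dados": ("M", "P"),
    "dadas": ("F", "P"),
}


def classify_dado(form, noun_gender, noun_number):
    """Return 'v-pcp', 'prp' or None for one occurrence of dado (hypothetical helper).

    'v-pcp': the form agrees with the noun, including masculine singular.
    'prp':   the fixed form 'dado' is used although the noun is not M S.
    None:    an inflected form that does not agree; the rule does not cover it.
    """
    key = form.lower()
    if key not in DADO_FORMS:
        raise ValueError(f"not a form of 'dado': {form!r}")
    if DADO_FORMS[key] == (noun_gender, noun_number):
        return "v-pcp"
    if key == "dado":
        return "prp"
    return None


print(classify_dado("dado", "F", "P"))   # dado as circunstâncias
print(classify_dado("dadas", "F", "P"))  # dadas as circunstâncias
print(classify_dado("dado", "M", "S"))   # dado o caso
```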
iii. Morphological ellipsis: coordinated adverbs
This appears to be the only case where one can speak of morphological ellipsis. In Portuguese, when regular adverbs (ending in -mente) are coordinated, only the last adverb in the coordination appears in its full form. The other adverbs in the coordination have an implicit but unmarked derivation (stemming from the Latin ablative case: mens alta). The following sentence is an example: João fez tudo clara, inteligente e eficientemente.
In such cases, the first element is morphologically tagged as a derived adverb (<adv> ADJ F S).
iv. Preposition / Verb
By default, the parser handles the verb haver in há + <expression of time> (João gosta da Maria há muito tempo) by treating há, the third person singular present form of the verb, as a preposition heading an adverbial (a prepositional phrase in form).
When the same expression is placed at the beginning of a sentence or phrase, its form changes in that 'que' is added (há / havia + <expression of time> + que, as in Há muito tempo que a Maria gosta do João). In cases like this, a verbal analysis proved more satisfactory: há / havia + <expression of time> + que is then, syntactically, an adverbial, where há / havia is the predicator (P:v-fin), the expression of time is the ACC, and que is a focus adverb (FOC:adv):
A1
STA:fcl
ADVL:fcl
=P:v-fin('haver' PR 3S IND) Há
=ACC:n('muito_tempo' M S) muito_tempo
FOC:adv('que' <foc>) que
SUBJ:np
=>N:art('a' <artd> F S) a
=H:prop('Maria' F S) Maria
P:v-fin('gostar' PR 3S IND) gosta
PIV:pp
=H:prp('de' <sam->) de
=P<:np
==>N:art('o' <-sam> M S) o
==H:prop('João' M S) João
This analysis is to be applied in all cases, thereby abandoning the prepositional reading of the verb given by the parser.
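To show how the tree notation used above can be read mechanically, here is a minimal Python sketch that splits one tree line into its depth (the number of leading = signs), its FUNCTION, its form and the remainder of the line. The regular expression and the function name are assumptions made for this example only.

```python
import re

# Leading '=' signs, then FUNCTION, ':', form, then whatever follows.
NODE = re.compile(r"^(=*)([^:\s(]+):([^\s(]+)(.*)$")


def parse_node(line):
    """Split one tree line into (depth, function, form, rest).

    Returns None for lines that are not FUNCTION:form nodes,
    for instance punctuation lines such as ',' or '?'.
    """
    match = NODE.match(line.strip())
    if match is None:
        return None
    depth, function, form, rest = match.groups()
    return len(depth), function, form, rest.strip()


print(parse_node("=P:v-fin('haver' PR 3S IND) Há"))
print(parse_node("==>N:art('o' <-sam> M S) o"))
print(parse_node("STA:fcl"))
```

Note that in a line such as ==>N:art(...) the sketch reads two = signs as depth and >N as the function, which matches the way dependency arrows are written in the trees of this document.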
An alternative verbal analysis differs from the previous one in its treatment of 'que', which is considered to be an adverbial relative pronoun. Under this analysis, há / havia + <expression of time> + que is not an adverbial phrase but belongs to the clausal level: há / havia is the predicator (P:v-fin), the expression of time is the ACC, and 'que' is an adverbial relative pronoun introducing a relative clause:
A2
STA:fcl
=P:v-fin('haver' PR 3S IND) Há
=ACC:np
==H:n('muito_tempo' M S) muito_tempo
==N<:fcl
===ADVL:adv('que' <foc>) que
===SUBJ:np
====>N:art('a' <artd> F S) a
====H:prop('Maria' F S) Maria
===P:v-fin('gostar' PR 3S IND) gosta
===PIV:pp
====H:prp('de' <sam->) de
====P<:np
=====>N:art('o' <-sam> M S) o
=====H:prop('João' M S) João
.
Further discussion and examples in haver_time
v. Numerals: gender and number
By default, numerals are analysed like determiners: they inflect for gender and number according to the gender and number of the head they modify. Take the example:
A Maria tem dez bons amigos, seis são de longa data.
There are two different situations here:
- dez is a modifier of the head amigos and therefore has the same number and gender (NUM M P);
- seis is itself the head, but the sentence context allows disambiguation. Moreover, any number above one is plural, unless the number is considered in isolation (as in 13 é aziago!, where the gender and number of the adjective also disambiguate the numeral).
However, there are situations where neither the context within the sentence window nor the extra-linguistic context is enough to disambiguate the gender and number of the numeral in question. Isolated Arabic numbers of this kind are regarded as non-disambiguatable with regard to gender and number, and a default is used for anything larger than 1: NUM M/F P, for instance when the number occurs in parentheses. The following example illustrates the case:
Maria Carvalho (9) e João Cunha (9) têm à sua frente um longo caminho a percorrer.
As for dates, we decided to analyse them as masculine singular (NUM M S), following the same principle adopted above for 13 é aziago! For instance: 2001 foi imprevisível (both tests, o ano de 2001 and O imprevisível 2001, support the decision taken).
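The defaults for isolated numerals described in this section can be summarised in a small Python sketch. The function name and the is_date flag are assumptions made for the illustration; since the text does not state a default for isolated values of 1 or below, the sketch returns None for them rather than guessing.

```python
def default_numeral_morphology(value, is_date=False):
    """Default tag for an isolated numeral that context cannot disambiguate.

    Hypothetical helper: 'is_date' must be decided by the caller.
    """
    if is_date:
        return "NUM M S"      # e.g. 2001 foi imprevisível
    if value > 1:
        return "NUM M/F P"    # e.g. Maria Carvalho (9)
    return None               # no default is stated for these values


print(default_numeral_morphology(9))
print(default_numeral_morphology(2001, is_date=True))
```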
vi. Coordinating adverbs
The sequences:
- tanto...como
- não só ...como
- não só ... mas também
- ...bem como
are parallel to nem...nem, that is, a coordination (...:cu) with coordinating adverbs (CO:adv).
Semantically, there seems to be a proximity between coordination (whether expressed by coordinating conjunctions or by adverbs) and the use of mais and menos in the following cases:
...uma segunda compilação da Kaos que vai incluir todos
os lançamentos não contidos em " Totally Kaos ", mais
três edições exteriores à editora (= vai
incluir todos os lançamentos(...) e três edições...)
...foi um elogio tudo menos inocente. (=
foi um elogio tudo mas um elogio inocente)
The fact that the adverbs mais and menos can be replaced by the copulative and the adversative coordinating conjunction, respectively, supports treating mais and menos in the same way, as coordinating adverbs.
Extended discussion on coordinating adverbs
vii. Base forms: insertion of <prop>
The trees present the base form of each lexeme before the morphological information; for nouns: F:f('base form' gender number).
In principle, this does not pose any major difficulty, except with regard to the orthographic conventions adopted in each variant of Portuguese. For months, Brazilian Portuguese uses a lower-case initial letter, while in European Portuguese months are considered proper nouns and therefore take a capital initial letter (novembro versus Novembro).
Because the tagger-parser is more familiar with Brazilian Portuguese, the base forms were inserted according to the Brazilian orthographic norm. To be more faithful to European Portuguese, since the corpus used is Portuguese, a secondary tag <prop> was added, indicating that although the base form has a lower-case initial letter, the noun in question is to be regarded as having proper noun characteristics.
B2. Syntax
i. Ellipsis (#E, with <Eg> for group ellipsis or <Es> for syntactic ellipsis)
The elements considered elliptic are those that can easily be recovered from the context of the sentence. The following cases were handled as elliptic:
- elliptic verb phrase / predicator (the recovered elements of the ellipsis are given in parentheses)
a) fully elliptic predicator: e.g. Paulo Sá pedia ainda uma acareação entre o industrial portuense Manuel Macedo, Ramiro Moreira e o tenente da Marinha Pedro Menezes, todos (são) testemunhas neste caso.
b) partially elliptic verb phrase: e.g. «A Marca de Fogo» foi filmado em 1914 em simultâneo com «The Golden Chance», o primeiro (foi filmado) de dia e o segundo (foi filmado) de noite.
- an element present in one string and repeated in another becomes elliptic in the latter (the recovered elements of the ellipsis are given in parentheses)
e.g. Os quatro primeiros temas destinam-se a mostrar o papel de Portugal
no mundo e o quinto (tema), o único (tema) sem relação
com a história nacional, é justificado por a experiência
de Barcelona (Port Aventura) , que regista assinalável sucesso.
Dezenas de timorenses e (dezenas
de) portugueses «ocupam» pacificamente o pavilhão
indonésio da Expo-92, em Sevilha.
However, not all cases that can be recovered from the context are handled as elliptic. One exception is lexical valency, where a lexeme governs a preposition. The preposition in these cases is considered obligatory and therefore cannot be elliptic (cf. Note, section D. Human mistakes). Some examples: convencer de algo/que..., as in Estava convencido de que só eu a via, só eu a imaginava vista de cima naufragando no meio de os horríveis autocarros lisboetas; certeza de que / algo, as in Tinha a certeza de que tudo estava a correr bem. When the preposition de is missing (Tinha a certeza que; Estava convencido que), the ellipsis code (#E) is not applied.
In short, ellipses are treated in one of two ways:
- by keeping them visible and assigning them the syntactic function they would hold if they were embedded in a non-elliptic clause;
or
- especially in cases involving determiners, by making the determiner the HEAD of the phrase, thus preventing empty heads. Cf. also the <n>-marked "noun-like" adjective np-heads (section B1.i.).
ii. Clausal and group discontinuities
By discontinuity, we mean constituents of a sentence, either clauses or groups, that are split into two parts by an intervening element (also a clause or group). The general convention is to mark the first part of a discontinuous constituent with a right-attached hyphen and the second with a left-attached hyphen. Any middle parts receive a hyphen on both sides of the FUNCTION:form symbol. The formalism is described in the terminological guide-lines.
In the corpus used for the Floresta Sintá(c)tica, two types of discontinuity were found:
1. Discontinuous finite clauses, split by an intervening clause.
Cases were observed where the main clause is embedded in the subordinate clause, as in the following example:
Devido à acção de Ames, explicou o actual director da CIA, foi muito mais difícil para os EUA compreender o que se passava na URSS durante aquele período crítico, ...
The main clause is explicou o actual director da CIA; the clause surrounding it is actually the object of the main verb of the higher-level clause.
In terms of notation, the two parts of the split clause are broadly represented as follows:
FUNCTION:fcl- (first part of the split clause)
FUNCTION:form (intervening element: clause or group)
-FUNCTION:fcl (second part of the split clause)
(The hyphen represents the link between the two parts that were split.)
Specifically, in the above case one would have:
ACC:fcl- Devido à acção de Ames
P:v-fin explicou
SUBJ:np o actual director da CIA
-ACC:fcl foi muito mais difícil (...)
For further discussion, refer to sentence C159-5.
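The hyphen convention for discontinuous constituents lends itself to a simple check. The Python sketch below classifies a FUNCTION:form symbol as the first, middle or last part of a discontinuous constituent, based only on where the hyphen is attached. The function name and the returned labels are assumptions made for this illustration.

```python
def discontinuity_part(label):
    """Classify a FUNCTION:form symbol by its attached hyphens.

    Returns 'first', 'middle', 'last', or None for a continuous constituent.
    Hyphens inside the symbol (as in 'v-fin') are not affected.
    """
    opens = label.endswith("-")     # right-attached hyphen
    closes = label.startswith("-")  # left-attached hyphen
    if opens and closes:
        return "middle"
    if opens:
        return "first"
    if closes:
        return "last"
    return None


for symbol in ["ACC:fcl-", "P:v-fin", "SUBJ:np", "-ACC:fcl"]:
    print(symbol, discontinuity_part(symbol))
```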
2. Discontinuous groups
2.2. noun phrases:
2.2.1. noun phrases with embedded higher level clauses
These are the cases where a noun phrase is split by the main clause.
The following example illustrates the case:
Que jogos, que sites na Internet.......se podem
fazer que traduzam o título do workshop?
The non-discontinuous counterpart of the above sentence would be: Que
jogos, que sites na Internet que traduzam o título do workshop se
podem fazer?
In terms of notation, the sentence would be represented as:
SUBJ:cu-
=H:cu
==CJT:np
===>N:pron-det('que' <interr> M P) que
===H:n('jogo' M P) jogos
==,
==CJT:np
===>N:pron-det('que' <interr> M P) que
===`
===H:n('site' M P) sites
==='
===N<:pp
====H:prp('em' <sam->) em
====P<:np
=====>N:art('a' <-sam> F S) a
=====H:prop('Internet' F S) Internet
ACC:pron-pers('se' M 3P ACC) se
P:vp
=AUX:v-fin('poder' PR 3P IND) podem
=MV:v-inf('fazer') fazer
-SUBJ:np
=N<:fcl
==SUBJ:pron-indp('que' <rel> M P) que
==P:v-fin('traduzir' PR 3P SUBJ) traduzam
==ACC:np
===>N:art('o' M S) o
===H:n('título' M S) título
===N<:pp
====H:prp('de' <sam->) de
====P<:np
=====>N:art('o' <-sam> M S) o
=====`
=====H:n('workshop' M S) workshop
====='
?
(In this case, the noun phrase is actually a compound unit, i.e. coordinated noun phrases, and so the hyphen is placed next to the form: ...:cu-)
2.2.2. noun phrases with high level clause constituents
The sentence illustrating the case is the following:
integra constantemente o elemento insólito, da existência humana e urbana, na paisagem quotidiana: a rotina dos gestos e dos comportamentos
If it were not discontinuous, the noun phrase would have the following internal structure:
ACC:np
=>N:art('o' <artd> M S) o
=H:n('elemento' M S) elemento
=N<:adj('insólito' M S) insólito
=N<:pp
==H:prp('de' <sam->) de
==P<:np
===>N:art('a' <-sam> F S) a
===H:n('existência' F S) existência
===N<:cu
====CJT:adj('humano' F S) humana
====CO:conj-c('e' <co-postnom>) e
====CJT:adj('urbano' F S) urbana
===:
===APP:np
====>N:art('a' <artd> F S) a
====H:n('rotina' F S) rotina
====N<:cu
=====CJT:pp
======H:prp('de' <sam->) de
======P<:np
=======>N:art('o' <-sam> M P) os
=======H:n('gesto' M P) gestos
=====CO:conj-c('e' <co-postnom>) e
=====CJT:pp
======H:prp('de' <sam->) de
======P<:np
=======>N:art('o' <-sam> M P) os
=======H:n('comportamento' M P) comportamentos
However, in the actual noun phrase a clause-level constituent, ADVL (em a paisagem quotidiana), is embedded, causing the noun phrase to be split.
The constituent is inserted, and the splitting of the noun phrase is represented by hyphens (as in any case of discontinuity):
ACC:np-
=>N:art('o' <artd> M S) o
=H:n('elemento' M S) elemento
=N<:adj('insólito' M S) insólito
=N<:pp
==H:prp('de' <sam->) de
==P<:np
===>N:art('a' <-sam> F S) a
===H:n('existência' F S) existência
===N<:cu
====CJT:adj('humano' F S) humana
====CO:conj-c('e' <co-postnom>) e
====CJT:adj('urbano' F S) urbana
ADVL:pp na paisagem quotidiana
:
-ACC:np
=APP:np
==>N:art('a' <artd> F S) a
==H:n('rotina' F S) rotina
==N<:cu
===CJT:pp
====H:prp('de' <sam->) de
====P<:np
=====>N:art('o' <-sam> M P) os
=====H:n('gesto' M P) gestos
===CO:conj-c('e' <co-postnom>) e
===CJT:pp
====H:prp('de' <sam->) de
====P<:np
=====>N:art('o' <-sam> M P) os
=====H:n('comportamento' M P) comportamentos
2.2.3. relative constructions
Prototypically, relative constructions are postnominal clauses introduced by a relative pronoun, in which all the internal elements hold clausal functions. A clear example is the following: Nascida no seio da estética M-Base, que encontrou em Cassandra a única porta-voz vocal,...
STA:fcl
PRED:icl
=P:v-pcp('nascer' F S) Nascida
=ADVL:pp
==H:prp('em' <sam->) em
==P<:np
===>N:art('o' <-sam> M S) o
===H:n('seio' M S) seio
===N<:pp
====H:prp('de' <sam->) de
====P<:np
=====>N:art('a' <-sam> F S) a
=====>N:adj('estético' F S) estética
=====H:prop('M-Base' F S) M-Base
=====,
=====N<:fcl
======SUBJ:pron-indp('que' <rel> M S) que
======P:v-fin('encontrar' PS 3S IND) encontrou
======ADVL:pp
=======H:prp('em') em
=======P<:prop('Cassandra' F S) Cassandra
======ACC:np
=======>N:art('a' <artd> F S) a
=======>N:adj('único' F S) única
=======H:n('porta-voz' M S) porta-voz
=======N<:adj('vocal' M S) vocal
,
(...)
However, relative constructions may have a different configuration, which may not be a problem in CG format (surface structure) but is one in tree representation, where an attachment problem arises.
This is the case of relative constructions where the relative pronoun does not have a clausal function but is instead embedded in a constituent that is itself a postnominal. Because it is a relative construction, the postnominal is fronted, which causes an attachment problem in tree representation: discontinuity.
A case illustrating this is, for instance, Nascida no seio da estética M-Base, de que se tornou a única porta-voz vocal, Cassandra...
The preposition de is the head of a prepositional phrase that holds the function of postnominal of the head porta-voz. The relative pronoun, referring back to estética (in the non-finite clause preceding it), is actually the complement of the preposition in the noun phrase. If the postnominal containing the relative pronoun were not fronted, the sentence would be:
Nascida no seio da estética M-Base , Cassandra tornou-se a
única porta-voz vocal de que (estética M-Base).
And the analysis:
(...)
,
SUBJ:prop('Cassandra' F S) Cassandra
P:v-fin('tornar' PS 3S IND) tornou-
ACC:pron-pers('se' <refl> F 3S ACC) se
SC:np
=>N:art('a' <artd> F S) a
=>N:adj('único' F S) única
=H:n('porta-voz' M S) porta-voz
=N<:adj('vocal' M S) vocal
=N<:pp
==H:prp('de' <sam->) de
==P<:pp
===>N:artd('o' <-sam> F S) a
===H:pron-indp('que' F S) que
.
Once the postnominal is fronted (de que se tornou a única porta-voz vocal), the subject complement (SC:np) is split, and it can only be represented in tree structure by means of discontinuity:
(...)
,
SC:np-
=N<:pp de que
ACC:pron-pers('se' <refl> F 3S ACC) se
P:v-fin('tornar' PS 3S IND) tornou-
-SC:np
=>N:art('a' <artd> F S) a
=>N:adj('único' F S) única
=H:n('porta-voz' M S) porta-voz
=N<:adj('vocal' M S) vocal
,
(...)
iii. Non-finite attributive participle clauses as opposed to participle
groups
Clauses and groups are defined in the Notational and terminological guide-lines. Both clauses and groups can occur under a top-level node (finite clause). When a verbal element heads the constituent, it is a clause; otherwise, a <head, dependent> structure forms a group.
O João comeu um bolo feito pela namorada (feito pela namorada is a non-finite clause)
O João comeu um bolo de laranja (de laranja is a prepositional group: de (head), laranja (dependent), with no verbal element)
Extended comment in C150-4.
iv. Direct object clause (ACC:fcl) without the presence of the subordinating
conjunction 'que'
This occurs especially with quotes in connection with speech verbs, as in the following example:
«Eles fizeram da nossa aldeia um cemitério», contou uma velha (que)
Although the subordinating conjunction is not present, one can deduce from the context that the clause in quotation marks is a subordinate clause functioning as direct object of the verb in the main clause.
Examples and further discussion in sentence C165-5.
v. Syntactic value of punctuation marks (#D for discourse structure within
the window)
In general, no syntactic function is explicitly marked on punctuation symbols in our system. However, punctuation is still used by the parser to assign functions to other constituents. Thus, when dealing with commas, the parser heuristically tries to establish a coordinating relation between the constituents. There are cases, though, where coordination would be syntactically acceptable but not semantically.
If the punctuation mark can be fully replaced by a conjunction, the clause in question retains the function suggested by that conjunction (usually ADVL:fcl), as in the following sentence:
Agora, o financiamento do projecto foi muito complicado, (=porque) tentei na Suécia e mais tarde consegui na Alemanha --
A comma or semi-colon (cf. criteria for sentence-separation) within our analysis window that really marks different chunks of discourse, rather than coordination, is treated by using utterance function tags (like STA:fcl for 'statement') instead of the normal conjunct tag (CJT:fcl), while keeping the top node form of coordinated unit (UTT:cu or STA:cu). The following sentence is a clear example of this situation:
Penso que o fundamental é que «Where In The World é o primeiro álbum mesmo «da banda», é só isso. (the comma might indicate a pause in speech)
Further details can be found in sentences C172-2 and C176-5.
vi. Apposition vs. Postnominal predicative (#L, for listing structure within
the sentence)
We use two different functions for postnominal material that is separated
by punctuation, but still treated as belonging within the same np.
1. @APP (apposition)
The prototypical apposition is a name or definite
np, identifying the np-head it postmodifies: "Jerónimo, o grande
cacique" or "o seu advogado, Marco da Silva".
2. @N<PRED (postnominal predicative)
The prototypical postnominal predicative is an adjective,
attributive participle or indefinite np, predicating something about the
np-head it postmodifies, typically with the semantic relation of 'IS' (=):
"Jorge Gomes, funcionário" or "Jorge Gomes, contente com a vida"
In a newspaper corpus there is much parenthetic information within parentheses,
and we treat these cases as @APP and @N<PRED in the sense defined above.
An interesting borderline case of @N<PRED arises, where there is no
IS-relation, but some other kind of predication, possibly involving elliptic
prepositions or subclause material:
Miguel Castro (57)
Miguel Castro (Campinas) = Miguel Castro (de Campinas)
Miguel Castro (piano) = Miguel Castro (ao piano)
Marta Suplicy (PT) = Marta Suplicy (do PT)
A conferência de Barcelona (Novembro de 1995) = A conferência
de Barcelona (que aconteceu em Novembro de 1995)
If, however, an abbreviation follows what it is an abbreviation for,
we would tag it @APP:
Partido da Terra (PT)
vii. Syntactic function of the personal pronoun '-se'
Eight categories were considered:
1. <refl> @ACC (lexical reflexive)
Ele comportou-se muito bem!
2. <obj> @ACC (direct object reflexive)
O João já se lavou.
3. <coll> @ACC (collective reflexives)
Os ministros reuniram-se ontem de emergência.
4. <reci> @ACC (reciprocal reflexives)
Eles amam-se!
5. @ACC-PASS
Vendem-se casas.
6. @SUBJ
Vende-se casas
In case of speech verbs, both @ACC-PASS and @SUBJ readings are possible. The parser opts for the @SUBJ reading (e.g. Diz-se que é possível conciliar carreira e família)
7. @DAT
Chamaram-se a si mesmos revolucionários!
8. @VOK
Faça-se uma revolução!
- only with verbs in the subjunctive;
- with verbs in singular or plural inflection;
- beginning of the sentence.
viii. Prepositional form of obligatory complements: @SC / @OC and @ADV
( @ADVS ; @ADVO)
The arguments of copula verbs (marked <vK> and <vtK> for valency) recognised by the tagger-parser are @SC (subject complement) or @OC (object complement): e.g. sou de Lisboa; tomo-o por bom profissional.
There are verbs that, although not categorised as copula verbs, behave similarly to copula verbs in terms of argument selection. Verbs of this type select obligatory adverbial arguments, tagged in the corpus as @ADV.
It was decided to divide the tag into two subcategories:
- @ADVS, if subject related (e.g. mora em Lisboa; ficou abaixo das expectativas);
- @ADVO, if object related (e.g. pousou o livro na mesa)
Note that 'estar' and 'ficar' have a valency potential for both subject complements (@SC) and subject related adverbial arguments (@ADVS). The difference can be tested by replacement with 'tal'/adjective (for @SC) or with pronoun adverbs (lá, aqui, hoje, muito, for @ADV).
Discussion and examples: sentence C1-4
ix. Focus markers
The focus marker secondary tag <meta> is used in conjunction with adverbs that have a focusing scope over a syntactic constituent immediately to the right, while the syntactic function @FOC is used for focusing constructions involving etymologically/morphologically verbal material, often involving que-clefting. An example of the <meta> case is "Até o José protestou"; examples of @FOC are "Esse computador escreve é devagar." and "É de peixe que gosta mesmo.", the latter involving a focus bracket (two @FOC).
1. Focus adverbs: the tag <meta> is used to mark a number of adverbs (até, nem, não, já, ...) when they are used to focus a constituent in their scope. At the clause level, the associated syntactic function is typically @ADVL, sometimes @>S; at group level the function marker is one of the dependency markers @>A, @>N, @>P.
Morphological ambiguity may arise between the word classes preposition and adverb. Words like "até" can be considered either a preposition or an adverb, which in the latter case means a focus adverb:
- as a preposition, the syntactic analysis is @ADVL or @ADV, depending on the nature of the verb;
- as an adverb, the syntactic analysis is @>P (CG format) and FOC in tree representation.
e.g. Penso que é mesmo possível chegar até ao Big Bang.
2. Focus reading of eis=que:
Eis que is a single token with the function FOC and the form adverb, the reasons being that a) an analytical reading with copula + predicative que-clause would be odd due to the fixed constituent order (no inversion), and b) the description matches the one used for é=que focusing constructions, which sometimes separate constituents that do not allow predicative readings ("de peixe é=que gosta", where the fact of tasting can't be made of a fish ...).
x. Eis
Eis on its own, followed by a noun or noun phrase (@SC), has a different reading (as a copula verb), inspired by the word's etymology. This usage makes it possible to assign the nominal constituent a predicative function.
xi. Shared constituents
In a coordination, the conjuncts may share constituents even though these are expressed in only one of the conjuncts, for instance subjects (which in Portuguese do not have to be expressed).
As described in the Notational and terminological guide-lines, there is not yet a tag to label constituents that form a syntactic coordinated unit but do not fall under any of the established terminology; therefore, undetermined function and form (represented by ?) are used.
The initial, automatic solution did not make explicit that the constituents were shared by every conjunct in the compound unit: the shared constituents were dependent on one of the CJT nodes (the one nearest to the constituent).
The following cases were found in the Floresta:
1. Shared subject(s) and adverbial(s)
The subject / adverbial(s) is/are the same in every conjunct of the compound unit. For example:
O Presidente cancela todos os compromissos e fecha-se na Casa Branca.
Naquele ano, as brigadas vermelhas (BR) estavam no auge da actividade terrorista, o líder cristão democrata Aldo Moro acabara de ser raptado, e o princípe - proibido de entrar na Itália desde o exílio do pai em 1946- teria mesmo recebido ameaças da BR.
The solution for both cases was to place the subject or adverbial at the sentence level, at the same level as the coordinated node, whose label is underspecified (?:cu), i.e. one level above the conjuncts.
For instance, taking the shared subject:
Shared constituency not explicit:
A1
STA:cu
CJT:fcl
=SUBJ:np
==>N:art('o' <artd> M S) O
==H:n('presidente' <prop> M S) Presidente
=P:v-fin('cancelar' PR 3S IND) cancela
=ACC:np
==>N:pron-det('todo_o' <quant> M P) todos_os
==H:n('compromisso' M P) compromissos
CO:conj-c('e' <co-vfin> ) e
CJT:fcl
=P:v-fin('fechar' PR 3S IND) fecha-
=SUBJ:pron-pers('se' M/F 3S/P ACC) se
=ADVL:pp
==H:prp('em' <sam->) em
==P<:np
===>N:art('a' <-sam> F S) a
===H:prop('Casa_Branca' F S) Casa_Branca
.
Shared constituency explicit:
A1
STA:cu
SUBJ:np
==>N:art('o' <artd> M S) O
==H:n('presidente' <prop> M S) Presidente
=?:cu
==CJT:?
===P:v-fin('cancelar' PR 3S IND) cancela
===ACC:np
====>N:pron-det('todo_o' <quant> M P) todos_os
====H:n('compromisso' M P) compromissos
==CO:conj-c('e' <co-vfin> ) e
==CJT:?
===P:v-fin('fechar' PR 3S IND) fecha-
===SUBJ:pron-pers('se' M/F 3S/P ACC) se
===ADVL:pp
====H:prp('em' <sam->) em
====P<:np
=====>N:art('a' <-sam> F S) a
=====H:prop('Casa_Branca' F S) Casa_Branca
.
2. Shared apposition and dependent(s):
This case is more difficult to handle than the previous one as it involves
an attachment problem. The issue is how to attach an apposition and
dependents in general to two coordinated constituents.
Example:
já depois da derrota continuou a tentar na comunicação
social e nos jornalistas, os bodes expiatórios da derrota...
E apelava ao "idealismo e ao pioneirismo" da América como
o antídoto capaz de dar sentido ao seu enorme poder.
The problem was the repetition of the preposition. The most satisfactory
solution, although quite complex to implement because it involves
discontinuity, was to:
a) mark the preposition as discontinuous
b) mark the complement of the preposition as discontinuous
c) attach the APP (or dependents in general)
to the discontinuous complement of the preposition.
The tree would look like this:
Further discussion on the topic
B3. Pragmatics
i. Utterance function
The system uses the following utterance functions:
- UTT = utterance (the underspecified default)
- STA = statement (parser-default: sentence not ending in '?' or '!')
- QUE = question (parser-default: sentence ending in '?')
- EXC = exclamation (parser-default: sentence ending in '!')
- COM = command (parser-default: sentence ending in '!' and containing imperative
verb form, V IMP, or vocative function, @VOK)
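The parser defaults listed above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the parser's actual code: the function name and the two boolean flags are hypothetical stand-ins for whatever the parser knows about an imperative verb form (V IMP) or a vocative (@VOK) in the sentence.

```python
def default_utterance_function(sentence, has_imperative=False, has_vocative=False):
    """Return the parser-default utterance tag for a sentence.

    has_imperative / has_vocative are hypothetical flags standing in for
    the presence of a V IMP form or an @VOK function in the analysis.
    """
    text = sentence.rstrip()
    if text.endswith("?"):
        return "QUE"
    if text.endswith("!"):
        # '!' plus imperative or vocative is read as a command,
        # otherwise as an exclamation.
        return "COM" if (has_imperative or has_vocative) else "EXC"
    # Anything not ending in '?' or '!' defaults to a statement.
    return "STA"

print(default_utterance_function("Por favor, feche a porta!", has_imperative=True))
```

UTT never results from these defaults; as the underspecified tag it would only be assigned where no more specific function is chosen.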
As a consequence of our sentence separation
principles, which yield individual sentences as the window of analysis,
utterance function tags normally appear only at the top node (usually ...:fcl,
sometimes ...:acl). Sometimes, however, a non-separator punctuation mark,
especially a comma, occurs in the corpus with the function of a discourse
chunk separator. In that case, the top node form is a coordinated unit (cu),
and its daughter nodes also carry utterance function tags.
C. Punctuation
There is as yet no standard, robust set of rules for punctuation,
and the topic has not been discussed thoroughly so far. However, some
regularities can already be documented.
Final sentence punctuation (full stop, exclamation/question
mark, colon, semicolon) is placed at the top level.
Inner sentence punctuation: punctuation that chunks constituents
(double commas, parentheses, quotes, hyphens) should be placed at the same
level (i.e. with the same indentation) as the "chunked" constituent or
word. The opening punctuation of a chunk is placed before (i.e. outside)
the highest node in the chunk. The same holds for separators (hyphen, colon,
comma), which are placed at the same level (i.e. with the same indentation)
as the units they separate. Here, too, node lines go with what is inside
the node, and separators are kept outside nodes.
D. Human mistakes: <new> and <nil>
In principle, errors in the original corpus are not corrected. To handle
the errors that do occur in the corpus, secondary tags were introduced:
<new>
Whenever misspellings affect the syntactic analysis, especially when they
make it impossible, the mistake is corrected and the correction is
signalled with the tag <new>.
Example: Poe favor, feche a porta!
The automatic analysis would treat Poe, which should correctly be por,
as a finite main verb (pôr), which is obviously not a satisfactory analysis.
The word is therefore corrected, and the correction in this particular
sentence is marked with the tag <new>.
<nil>
When an analysis is still possible, the mistake is not corrected but
signalled with the tag <nil>, meaning that the element is a mistake and
should not have been there in the first place.
Example: Os queixosos não deixam de nunca de remeter as culpas
para o governo.
The preposition de in de nunca is wrongly used. However, it does not
compromise the analysis of the sentence: the ADVL, which in the correct
form would be a plain adverb (nunca), instead forms a group headed by the
preposition de. The tag <nil> indicates that the preposition is wrongly
used, but the mistake itself is not corrected.
Note that the same tags are applied to missing or unnecessary punctuation
(for instance, a missing full stop before a capitalised letter or two adjacent
commas). However, misuse of punctuation determined by personal interpretation
will not be corrected.
On top of these tags, the code #W is also introduced, indicating
that the sentence contains an odd construction.
Discussion in: C163-2
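To show how the secondary tags and the #W code could be used in a search, here is a minimal Python sketch. It assumes that <new> and <nil> appear inside the parenthesised tag list of a node line and that codes such as #W appear on the SOURCE line, preceded by #. The function name, the sample SOURCE line and the sample tree lines are hypothetical and are not taken from the corpus.

```python
import re

SECONDARY = re.compile(r"<(new|nil)>")   # secondary tags on node lines
SOURCE_CODE = re.compile(r"#(\w+)")      # codes such as #W on the SOURCE line

def scan_sentence(lines):
    """Collect source-line codes and the tokens carrying <new> or <nil>."""
    report = {"codes": [], "new": [], "nil": []}
    for line in lines:
        if line.startswith("SOURCE:"):
            report["codes"].extend(SOURCE_CODE.findall(line))
            continue
        for tag in SECONDARY.findall(line):
            # The token is assumed to follow the closing parenthesis.
            token = line.rsplit(")", 1)[-1].strip()
            report[tag].append(token)
    return report

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "SOURCE: hypothetical-corpus n=0 #W",
    "=P:v-fin('deixar' PR 3P IND) deixam",
    "==H:prp('de' <nil>) de",
    "==P<:adv('nunca') nunca",
]

print(scan_sentence(SAMPLE))
```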