Building COMPARA
Ana Frankenberg-Garcia, Diana Santos & Rosário Silva
This page provides a description of the steps involved in the building of the corpus.
- Copyright permission
- Digitization
- Paragraph alignment
- Sentence separation
- Sentence alignment: automatic alignment, alignment revision and markup
- Grammatical annotation
- Semantic annotation
Copyright permission
COMPARA is publicly available online, so we need permission to store the texts we choose for the corpus on our server. After selecting the texts we want to use, we identify their copyright holders. Texts whose authors died 70 or more years ago are in the public domain and require no copyright permission. Permission is needed for all other texts. For example, a source text by Henry James does not require copyright permission, but if its translation into Portuguese is recent, we must apply for permission to use the translation.
Permission requests are sent to the copyright holders of the texts we want to use. They can be authors, translators, publishers or a combination of them. If the copyright holders are deceased, permission requests are sent to their heirs. We begin by asking for permission to use source texts, and only apply for translation permissions after the source-text permissions have been granted.
In the permission requests for COMPARA, perhaps the most important thing we did was explain, in simple and non-technical language, what a corpus is, what it is for and how it works. In our initial contact with authors, translators and publishers, it was also important to clarify that the texts used in COMPARA - usually a 30% extract - although available for online searches, could not be downloaded in full. Another factor that may have contributed to a positive response from copyright holders was that, with each new permission request, we listed the names of all those who had already granted us permission. In the case of publishers with online bookshops, we also offered to include a link to their catalogues on our website. To obtain a model of the letters we wrote, click here.
Although the process of applying for copyright permissions is slow and time-consuming, most of our requests were granted. However, some copyright holders never replied, and a few refused permission.
Digitization
The texts that are not available in electronic format are scanned and submitted to an optical character recognition program, whose output is revised manually. These texts and the ones it was possible to obtain in electronic format are then processed as follows:
Page numbers, columns, figures, diagrams and other extra-linguistic elements are removed.
Chapter titles and sub-titles are marked <chaptitle>. If they happen to be in capital letters, they are converted to lower case (only the first letter of the first word in the title and the first letter of proper names remain in upper case). For example, the chapter LOOKING-GLASS INSECTS is marked <chaptitle>Looking-glass insects</chaptitle>.
Any obvious misprints detected are corrected and recorded in a separate file.
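The title rule above can be sketched in a few lines of Python. This is only an illustration, not the tool used to build COMPARA: the function name mark_chapter_title and the proper_names parameter are hypothetical, and proper names are matched as whole whitespace-separated words only.

```python
def mark_chapter_title(title, proper_names=()):
    """Wrap a chapter title in <chaptitle> marks.

    If the title is entirely in capital letters, it is lower-cased, keeping
    upper case only for the first letter of the first word and for the
    proper names supplied by the caller (hypothetical parameter).
    """
    if title.isupper():
        keep = {name.lower(): name for name in proper_names}
        words = []
        for i, word in enumerate(title.split()):
            lower = word.lower()
            if lower in keep:
                words.append(keep[lower])          # proper name: restore its casing
            elif i == 0:
                words.append(lower[:1].upper() + lower[1:])
            else:
                words.append(lower)
        title = " ".join(words)
    return "<chaptitle>" + title + "</chaptitle>"
```

Titles that are not in capital letters are wrapped unchanged. A proper name with attached punctuation (for instance a possessive) would not be recognised by this sketch and would need manual revision.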
The spelling of earlier Portuguese-language editions is updated to conform to current orthographic norms.
Hyphens, travessões, dashes and bullets
Portuguese travessões and m-dashes are rewritten as double hyphens (--); hyphens and bullets receive the n-dash mark (-).
Quotation marks and apostrophes
Double quotes are marked («) to open and (») to close. Single quotes are represented by the grave accent (`) to open and the acute accent (´) to close. Apostrophes are rewritten as single, non-directional quotation marks (').
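As a rough illustration of the dash and quotation-mark conventions above, the following Python sketch rewrites typographic characters into the marks used in the corpus. It rests on assumptions that are not part of the original description: the input uses Unicode directional quotes, a right single quote between two letters is an apostrophe, and the function name normalise_punctuation is hypothetical. Straight, non-directional quotes are left untouched because their direction cannot be decided character by character.

```python
import re

# Typographic character -> corpus mark (the choice of input code points is an assumption).
CHAR_MAP = str.maketrans({
    "\u2014": "--",      # em dash
    "\u2015": "--",      # horizontal bar, sometimes used as a travessão
    "\u2022": "-",       # bullet
    "\u201c": "\u00ab",  # opening double quote -> «
    "\u201d": "\u00bb",  # closing double quote -> »
    "\u2018": "`",       # opening single quote -> grave accent
    "\u2019": "\u00b4",  # closing single quote -> acute accent
})

# A right single quote between two word characters is treated as an apostrophe.
APOSTROPHE = re.compile(r"(?<=\w)\u2019(?=\w)")


def normalise_punctuation(text):
    """Rewrite dashes, bullets and directional quotes as corpus marks."""
    text = APOSTROPHE.sub("'", text)   # must run before the quote mapping
    return text.translate(CHAR_MAP)
```

Word-final apostrophes (as in plural possessives) look exactly like closing single quotes to this sketch, so its output would still need the kind of manual revision described for the rest of the digitization step.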
Authors' notes are marked <anote> and inserted immediately after the sentences where their identifying marks appear.
Translators' notes are marked <tnote> and inserted in the place of their identifying marks.
Text that has been underlined or highlighted in capital letters, bold, italics, a different type of font or by indentation is identified by one of the following tags: <title>, <foreign>, <named>, <voice> and <emph>. Note that words within quotation marks are not considered as highlighted and are therefore not marked.
Common nouns, verbs, prepositions, adjectives, etc. highlighted with capital letters (instead of bold, italics, different font, etc.) are changed to lower case. This rule does not apply to acronyms, which should remain in upper case. Highlighted proper names (for example, titles of books and named entities) maintain the first letter in upper case.
Although they are in capital letters, acronyms (e.g., WHO, UNESCO, AIDS) are not considered to be highlighted unless they are also in bold, italics, different type of font, etc.
<title> is used to mark both real and fictional titles of books, newspapers, magazines, films, plays, television programmes, songs, etc. For example:
EBDL1 - We'd been to an early-evening showing of <title>Reservoir Dogs</title> .
Note that this mark only identifies titles cited in the corpus texts, and not titles or sub-titles of the texts themselves.
<foreign> is used to mark words and expressions in a language other than the main language of the corpus text. For example:
EBDL2 - Back in the <foreign>en suite</foreign> bathroom, he briskly cleans his teeth and brushes his hair.
Proper names are only marked <foreign> if they contain common nouns, for example:
EBJB1 - Besides, I remember the end of <title><foreign>L' Education sentimentale </foreign></title> .
Note that <foreign> is not used for proper names like Macbeth, but is used for proper names that are made of or include common nouns, like Bouvard et Pécuchet, which is considered foreign because the French conjunction et can give rise to the translations Bouvard and Pécuchet (En) and Bouvard e Pécuchet (Pt). Likewise, in a Portuguese text, Benson and Hedges is considered foreign because the English conjunction and can give rise to the translation Benson e Hedges. Note, however, that a name like Luís de Camões, which contains the Portuguese preposition de, cannot be marked <foreign> because the name cannot give rise to the translation *Luís of Camões.
Proper names used to identify brands, shops, hotels, companies, products, doctrines, etc. are marked <named>. For example:
PPCP1 - On a sweat-smudged label was written <named> Minerva Wardrobes</named> .
<voice> is used to mark citations and changes of voice in the narrative, indicating that a character is thinking, writing or reminiscing, or that the voice of another character is intruding. For example:
EBDL2 - The fox stopped and turned his head to look at Vic for a moment, as if to say, <voice>Yes?</voice> and then proceeded calmly on his way, his brush swaying in the air behind him.
<emph> is used to mark words and expressions highlighted for emphasis. Given the amount of subjectivity associated with this category, it is only used when no other highlight mark is applicable. For example:
EBJT1 - And then he went and <emph>died</emph>, the sod.
Whenever there are lists of titles, foreign words, named entities and so on, separate tags are used for each element in the list. For example:
PBPM1 - <foreign> Urutus </foreign>, <foreign> jararacas </foreign>, <foreign> cascavéis </foreign>, <foreign> jararacuçus </foreign>, <foreign> surucutingas </foreign>, <foreign> cotiaras </foreign> -- I saw these and many other serpents in the slides that Melissa projected during her talk.
The titles, foreign words, proper names, changes of voice and emphatic expressions that have not been highlighted by authors or translators are not marked in the corpus texts.
Words and expressions in regular font inserted within a longer, highlighted stretch of text to indicate a contrast within a contrast were also considered as highlighted. For example:
EBLC1 - And you won't hurt me, though I <emph>am</emph> an insect.´
The corpus texts are stored in ISO 8859-1 (Latin-1, Western European) encoding.
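One practical consequence of this storage format is that every character in a corpus text must exist in the ISO 8859-1 repertoire. The marks « » and ´ do, whereas typographic dashes and curly quotes do not. The following Python sketch (the function name latin1_problems is hypothetical) lists the characters of a text that could not be stored in that encoding.

```python
def latin1_problems(text):
    """Return the characters in text that ISO 8859-1 cannot represent."""
    problems = []
    for char in text:
        try:
            char.encode("iso-8859-1")
        except UnicodeEncodeError:
            problems.append(char)
    return problems
```

An empty result means the text can be written out with encoding="iso-8859-1" without loss.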
Paragraph alignment
After a source text and its translation have undergone the above digitization procedure, the two texts are aligned paragraph by paragraph. If there happens to be a mismatch, the alignment follows the paragraph divisions in the source texts.
Carriage returns that have been removed from the translation texts in the alignment process are marked <Pout>. Carriage returns that have been inserted in the translation texts during the alignment are marked <Pin>.
This intermediate markup allows us to insert <P> marks automatically in the translated texts.
Full-paragraph mismatches
In the few exceptional cases where our team detected substantial differences between source texts and translations at the level of the paragraph (namely, entire paragraphs missing from the translation or complete paragraphs added to the translation without corresponding source text), we chose to remove the affected parts of the source texts and translations from the corpus. This procedure does not compromise the spirit in which the corpus was created, since the basic alignment unit of COMPARA is the sentence and not entire texts.
Sentence separation
Source texts and translations are submitted to the tokenization and sentence separation tools developed by the AC/DC project (see Atomização, in Portuguese). As shown below, however, some of the sentence separation criteria adopted are specific to COMPARA.
A sentence is defined as a word or sequence of words beginning with a capital letter and ending with a full stop, ellipsis, exclamation mark or question mark, followed by a new sequence of words beginning with a capital letter, or by no text at all in the case of the end of a paragraph. The paragraph below illustrates the sentence separation criteria adopted. Sentence beginnings are marked <s>:
EURZ1 (five sentences)
<s>«You shouldn't listen to me,» Simon suddenly sighs. <s>«I'm an old fool who no longer has any courage. <s>But for Master Abraham's sake I will try to face the truth, if you like. <s>Now tell me, you believe he was murdered by someone who knew him... a New Christian?» <s>His questioning eyes seem almost hopeful, as if death by a Jew's hand is preferable to Uncle having been murdered by a follower of the Nazarene.
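The basic definition can be approximated with a regular expression. The Python sketch below is a simplification written for illustration and is not the AC/DC sentence separation tool: it assumes the quotation marks have already been normalised to « » and ` ´, it works on one paragraph at a time, and it implements only the general rule stated above (terminal punctuation followed by a capital letter), without any of the special criteria for direct speech, colons or dashes. The names BOUNDARY and split_sentences are hypothetical.

```python
import re

# Terminal punctuation, an optional closing quote, whitespace, and then
# (optionally after an opening quote) a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?…])([»´]?)\s+(?=[«`]?[A-ZÀ-ÖØ-Þ])")


def split_sentences(paragraph):
    """Split one paragraph into sentences using the basic rule only."""
    text = paragraph.strip()
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.start() + len(match.group(1))   # keep the closing quote
        sentences.append(text[start:end])
        start = match.end()                          # skip the whitespace
    sentences.append(text[start:])
    return sentences
```

Applied to the EURZ1 paragraph above (without its <s> marks), the sketch yields the same five sentences, because the ellipsis is followed by a lower-case word. It would, however, wrongly split direct speech that is followed by a reporting clause, which is one reason why COMPARA-specific criteria and manual inspection are needed.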
In cases of direct speech followed or preceded by reporting verbs (such as say, tell, whisper, suggest, etc.), there can be words beginning with capital letters after the punctuation marks mentioned above without any resulting sentence separation. For example:
EBJT1 (one sentence)
<s>`You OK?´ Robin's daughter said, standing close to him, but not touching.
Note that when direct speech is not followed or preceded by reporting verbs, sentence separation is maintained. In the example below, a new sentence begins after the second question mark because realise is not a reporting verb:
PBCB2 (three sentences)
<s>Then asks `What happened to Osbenio? <s>And to Clauir?´ <s> I realise he was expecting someone else, a relative, someone or other.
The colon is only considered a sentence separator if it marks the end of a paragraph:
EBDL3 (two sentences)
<s>From long practice Philip was able to follow his drift pretty well, and therefore answered confidently:
<s>«Oh, no, I couldn't leave Hilary behind to cope on her own.
If there is no end of paragraph, there is no sentence separation, no matter whether or not the word after the colon begins with a capital letter:
EBJB1 (one sentence)
<s>Flaubert wanted them to be: few writers believed more in the objectivity of the written text and the insignificance of the writer's personality; yet still we disobediently pursue.
EUHJ1 (one sentence)
<s>But she did not commit herself, and in a moment she asked: «Now that he has come back, will he stay here always?»
It is worth noting that there may be a new line without there being any sentence separation, as is particularly notable in poetry. In such cases, new lines are simply marked <br>:
EBLC1 (three sentences)
<s>`Humpty Dumpty sat on a wall:
<s>Humpty Dumpty had a great fall.
<s>All the King's horses and all the King's men
<br>Couldn't put Humpty Dumpty in his place again.´
Special cases
Some authors use dashes followed by words beginning with capital letters idiosyncratically. For alignment purposes, these were treated in the same way as separate sentences. For example:
ESNG4 (two sentences)
<s>-- There -- there -- <s> The herdsman draws back from his own hand as if to hold something at bay.
EBJT2 (two sentences)
<s>It's your baby -- ´
<s>` Yes, but you're my niece and we've always been particular friends.
For alignment purposes, we also chose to treat like separate sentences the parts of José Saramago's texts in which there are commas followed by direct speech beginning with a capital letter. For example:
PPJSA1 (two sentences)
<s>The woman guided her husband to an empty chair, and since all the other chairs were occupied, she remained standing beside him, <s>We'll have to wait, she whispered in his ear.
Sentence alignment
Source texts and translations are then automatically aligned using EasyAlign 1.0, an alignment program for the IMS Corpus Workbench system (for further information, contact Stefan Evert).
With the help of a word processor, the automatic alignment output is edited so as to conform to COMPARA's alignment criteria, whereby an alignment unit consists of a source text sentence (see Sentence separation above) and the corresponding text in the translation, whether it is one, more than one or only part of a sentence. Sentences that have not been translated are aligned with empty units. Sentences that have been added to translations without any corresponding equivalent in the source text are marked <add> and fitted into the immediately preceding alignment unit. For example:
Sentence preserved in translation (1:1)
EBJT2 (source) | EBJT2 (translation) |
<s>He still said, though less angrily now, that she had deceived him. | <s>Ele ainda afirmava, embora menos encolerizado, que ela o tinha desiludido. |
Sentence split in translation (1:2)
EBDL3T1 (source) | EBDL3T1 (translation) |
<s>«Spare me the narrow misses, Bill, what have you got?» | <s>«Não me fale do que perdi, Bill. <s>O que é que ainda tem? » |
Sentences joined in translation (1:½)
PBPM1 (source) | PBPM1 (translation) |
<s>Muito bem. | <s2>So then, |
<s>O casal vem chegando, dentro do automóvel. | <s2>the couple arrives in the automobile. |
Sentence deleted in translation (1:0)
PBAD1 (source) | PBAD1 (translation) |
<s>A cara impenetrável, os olhos não diziam nada. | <s>Zito's face was inscrutable, his eyes said nothing. |
<s>Não estava mais ali quem falou. | <s> |
<s>Ele agora atendia uma freguesa que queria três metros de morim. | <s>Now he was serving a customer who wanted three metres of cambric. |
Sentence added to translation (1:1+1ad)
PPCP1 (source) | PPCP1 (translation) |
<s>«Porquê, acha que é assim de deitar fora?» | <s>«But why should we waste them? <add>Why?</add>» |
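The alignment types illustrated above can be read off the markup of the translation side of each unit. The following Python sketch shows one way of doing so; it is an illustration under stated assumptions rather than COMPARA's own code. The AlignmentUnit class and the alignment_type function are hypothetical names, and the sketch assumes that <s2> marks a translation that covers only part of a sentence, as in the joined-sentences example.

```python
import re
from dataclasses import dataclass


@dataclass
class AlignmentUnit:
    source: str        # exactly one source-text sentence, with its <s> mark
    translation: str   # the corresponding translation text, possibly empty


def alignment_type(unit):
    """Return (type, has_added_text) for one alignment unit."""
    body = unit.translation
    has_add = "<add>" in body
    if "<s2>" in body:
        kind = "1:½"                                   # part of a joined sentence
    elif not re.sub(r"<[^>]+>", "", body).strip():
        kind = "1:0"                                   # nothing left but markup
    else:
        kind = "1:" + str(body.count("<s>"))           # one or more full sentences
    return kind, has_add
```

For instance, a unit whose translation side is just <s> is reported as ("1:0", False), and the PPCP1 unit above as ("1:1", True).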
The sentences that have been reordered in translation follow the same alignment rules, and the reordering is marked separately. The reordered sentence is marked <reord> and the place where the translator chose to insert it is marked <place>:
Sentence reordered in translation
EBOW1 (source) | EBOW1 (translation) |
<s>The picture had to be concealed. | <s><reord 3> Era preciso esconder o retrato.</reord> |
<s>There was no help for it. | <s> Não havia remédio. <place 3> |
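To make the reordering markup more concrete, the Python sketch below rebuilds the order in which the translator wrote the text from a sequence of aligned translation units. It is only an illustration and assumes, without confirmation from the description above, that the number inside <reord N> and <place N> simply links a reordered sentence to its insertion point. The function name translator_order is hypothetical.

```python
import re

REORD = re.compile(r"<reord (\d+)>(.*?)</reord>")
PLACE = re.compile(r"<place (\d+)>")


def translator_order(translation_units):
    """Rebuild the translator's sentence order from aligned units."""
    text = " ".join(translation_units)
    moved = {}

    def lift(match):
        # Remember the reordered sentence and remove it from its aligned slot.
        moved[match.group(1)] = match.group(2).strip()
        return ""

    text = REORD.sub(lift, text)
    # Put each remembered sentence back where the translator placed it.
    text = PLACE.sub(lambda m: moved.get(m.group(1), ""), text)
    text = text.replace("<s>", "")
    return " ".join(text.split())
```

For the EBOW1 example, the two units come out as "Não havia remédio. Era preciso esconder o retrato.", that is, in the opposite order to the source sentences.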
The markup for sentences that have been joined, added to and reordered in translation requires human interpretation and is therefore inserted manually. The alignment markup for sentences that have been preserved, deleted and split in translation is done automatically. However, sentence splits need to be inspected manually, for the automatic alignment markup is not sensitive to some of the sentence separation criteria involving direct speech.
Grammatical annotation
Grammatical annotation in COMPARA is a two-step process: first, we apply an automatic parser and then we revise its output manually.
The automatic parser used for Portuguese is PALAVRAS and for English, CLAWS.
For details concerning the revision of the annotation, see Documentação da anotação da parte portuguesa do COMPARA (in Portuguese only) and COMPARA's English annotation with CLAWS C7: revision criteria.
Further details on how syntactic information is added to COMPARA are available in annotation workflow.
Semantic annotation
We have also started semantic annotation of COMPARA in both languages by adding the positional attribute "sem".
Since there are currently no available automatic semantic analysers that we know of, we decided to use a lexically-driven approach followed by human revision.
The first semantic field was colour, corresponding to a five-fold classification in the attribute "sem": colour, colour:race, colour:human, colour:wine and colour:original for words denoting colour.
Other related attributes such as race (sem="race") and ripeness (sem="unripe") have also undergone some unsystematic human annotation.
We have also created specific attributes to classify colour words in English (sem="colour") into several groups: Blue, Red, Yellow, Green, Orange, Brown, Beige, Black, White, Grey, Pink, Purple, Gold, Silver, Other, Multiple and Unspecified, as the value of the attribute "colour". For Portuguese, the corresponding attribute "cor" includes the possible classifications: Amarelo, Azul, Branco, Castanho, Cinzento, Creme, Dourado, Laranja, Prateado, Preto, Rosa, Roxo, Verde, Vermelho, Outras, Multipla and Naoespecificada.
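A lexically-driven first pass of this kind can be pictured as a dictionary lookup whose output is then revised by hand. The Python sketch below is a minimal illustration for English only. The group names come from the list above, but the lexicon entries, the dictionary-based token representation and the function name annotate_colour are hypothetical and do not reproduce COMPARA's actual lexicon or corpus format.

```python
COLOUR_GROUPS = {
    "Blue", "Red", "Yellow", "Green", "Orange", "Brown", "Beige", "Black",
    "White", "Grey", "Pink", "Purple", "Gold", "Silver", "Other",
    "Multiple", "Unspecified",
}

# Tiny illustrative lexicon (hypothetical entries): word form -> colour group.
LEXICON = {
    "blue": "Blue", "navy": "Blue",
    "red": "Red", "scarlet": "Red",
    "grey": "Grey", "gray": "Grey",
}


def annotate_colour(tokens):
    """Attach sem and colour attributes to tokens found in the lexicon."""
    annotated = []
    for token in tokens:
        attributes = {"word": token}
        group = LEXICON.get(token.lower())
        if group is not None:
            attributes["sem"] = "colour"     # candidate value, to be revised by hand
            attributes["colour"] = group
        annotated.append(attributes)
    return annotated
```

A lookup like this cannot tell a plain colour from the finer values such as colour:wine or colour:human, which is why the automatic step is followed by human revision.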
See Silva et al. (in Portuguese) for more details.
We expect that researchers interested in other semantic fields will provide us with their lexical core and help us with the human revision of the corresponding result, thus helping to build a richer and more varied semantic resource for the contrast of Portuguese and English.