The DISPARA system
Technical description: The DISPARA system
Diana Santos
This page describes the technical details of making COMPARA searchable on the Web, and also gives a general overview of DISPARA, a general system for DIStributing PARAllel corpora on the Web that was first put to use in connection with COMPARA.
While the Web interface is what users see, behind the scenes there is a full computational system with its own workflow. It is inseparable from the interface, which is why it is described here.
- The process of building the COMPARA corpus
- The DISPARA Web interface
- The DISPARA general architecture
- Instantiation to serving COMPARA
- Tokenization in COMPARA
The process of building the COMPARA corpus
After getting the raw texts in electronic text form, the process of building the COMPARA corpus comprises several phases as follows:
- As described in Building COMPARA, the texts are manually paragraph aligned (while some markup is added, such as specifying graphically marked titles, emphasis, named entities and the like, as well as identifying translator's and author's notes).
- Then, a set of corpus tools developed for the AC/DC project is applied, in order to perform tokenization and sentence separation. Each pair of texts is then automatically aligned with the EasyAlign tool (v. 1.0), a sentence aligner included in the IMS Corpus Workbench.
- The result of the alignment is converted into a human-readable version, which is then carefully revised to conform to COMPARA's principle of exactly one source-text sentence per alignment unit. At this stage a human expert also marks cases where a sentence must be split (because the translator joined the translation of more than one source sentence into a single sentence), as well as changes in sentence order and additions.
- Then, a special program is run that automatically adds several other pieces of information:
  - the alignment type for each alignment unit not yet encoded in the previous step (i.e., for 1-0, 1-1 and 1-to-many cases)
  - the corresponding text pair (so that, later, it is possible to know the origin of each concordance retrieved by the system)
  - the language variety of source and translation for each parallel concordance
  - whether the language is translated or original
  - the date of the source's first edition, and of the translation used
- Two corpora (one for each language) are created in the input format for IMS-CWB (see an example here), specially catering for:
  - reordered material
  - zero (no translation) alignment units
  - translation notes
  - human corrections to the (automatically discovered) alignment type, in the form of correction files
- The alignment process is run again, but this time telling the program to use the alignment correspondences already existing in the new files.
- The two corpora are then encoded into the IMS Corpus Workbench internal format, with the IMS CWB programs, and become available for querying through the tools in this workbench.
- As the final step, some elaborate programs count the corpus contents and automatically produce the Quantitative summary and Bibliographic references pages, one in each language.
At this stage, and as soon as the two aligned corpora are copied to Linguateca's Web server, whoever has an Internet connection can query them through the DISPARA Web interface.
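The automatic step that fills in the alignment type for units the human reviser left unmarked can be sketched as follows. This is only an illustration, not the actual program: the AlignmentUnit class and the fill_alignment_type function are hypothetical names, and the "1-N" label format is inferred from the 1-0, 1-1 and 1-2 examples on this page.

```python
from dataclasses import dataclass, field


@dataclass
class AlignmentUnit:
    """Hypothetical in-memory form of one alignment unit."""
    unit_id: str
    source: str                                   # exactly one source sentence
    targets: list = field(default_factory=list)   # zero or more translated sentences
    tipo: str = ""                                # empty when no human has set a type


def fill_alignment_type(unit):
    """Set tipo to '1-N' unless a type was already encoded by hand."""
    if not unit.tipo:
        unit.tipo = f"1-{len(unit.targets)}"
    return unit


unit = AlignmentUnit("PBMA4-37", "Source sentence.", ["First part.", "Second part."])
print(fill_alignment_type(unit).tipo)
```

A unit with no translated sentence gets the type 1-0, and a type that was already set during the manual revision is left unchanged.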
The DISPARA Web interface
The DISPARA Web interface was originally developed to give access to COMPARA, but we have striven from the beginning to make it general enough to be used for other parallel corpora.
So, we distinguish below between what we call the DISPARA general architecture, and its instantiation for COMPARA.
The DISPARA general architecture
DISPARA was conceived to work with the IMS Corpus Workbench, which has been singled out on previous occasions as the best corpus system available, given the general context of the Computational Processing of Portuguese/Linguateca project. Several descriptions of it, as well as the motivation for its use, can be found elsewhere (see e.g. the examples).
DISPARA is based on the concept of the alignment unit (in source texts, a source sentence; in translated texts, whatever translates that source sentence), which implies that the corpora are encoded with the structural attribute ua (instead of, e.g., s for sentence). Since we want to identify each ua, we associate it with an id attribute and a tipo attribute (the kind of alignment), as in <ua id="PBMA4-37" tipo="1-2">.
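As a small illustration of how the id and tipo values can be read back from such a tag, here is a minimal Python sketch. It is not part of DISPARA: the parse_ua function is a hypothetical helper, and the pattern assumes that id always precedes tipo, as in the example above.

```python
import re

# Matches an opening ua tag of the form shown in the text.
# Assumption: the attributes appear in the order id, tipo.
UA_TAG = re.compile(r'<ua\s+id="(?P<id>[^"]+)"\s+tipo="(?P<tipo>[^"]+)"\s*>')


def parse_ua(line):
    """Return (id, tipo) for an opening ua tag, or None if the line is not one."""
    match = UA_TAG.search(line)
    if match is None:
        return None
    return match.group("id"), match.group("tipo")


print(parse_ua('<ua id="PBMA4-37" tipo="1-2">'))  # ('PBMA4-37', '1-2')
```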
As already mentioned, several further kinds of information are associated with every token, as values of the positional attributes fonte (source id), varport (Portuguese variety), varing (English variety) and oritrad (whether the text is original or translated). None of these attributes is required (and others may be added), but it is important to realize that their existence constrains what kinds of distributions and sub-selections a user can ask for.
Let us briefly mention the most conspicuous features of DISPARA:
- DISPARA allows the display of several kinds of information as the result of a query, namely: concordance; distribution of forms; and distribution of all kinds of information encoded in the attributes above (sources, varieties, etc.).
- In addition, a unique feature of DISPARA is the "Combined distribution of Portuguese and English search expressions", a complex query built from a set of hidden queries and whose purpose is to provide a snapshot of the distribution of the items involved in the two languages.
- Another key functionality of DISPARA is allowing queries by alignment type (such as addition of sentences in the translation, joining or separating sentences, as well as reordering).
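The idea of a distribution over an encoded attribute can be sketched in a few lines of Python. This is an illustration only, not the DISPARA implementation: the hits list, its dictionary layout and all attribute values are made-up placeholders, and only the attribute names fonte, varport and oritrad come from the description above.

```python
from collections import Counter

# Hypothetical query hits: one dict per matched token, keyed by the
# positional attributes named in the text. All values are placeholders.
hits = [
    {"word": "casa", "fonte": "TEXT-A", "varport": "VAR-1", "oritrad": "original"},
    {"word": "casa", "fonte": "TEXT-A", "varport": "VAR-1", "oritrad": "original"},
    {"word": "casas", "fonte": "TEXT-B", "varport": "VAR-2", "oritrad": "translated"},
]


def distribution(hits, attribute):
    """Count how many hits carry each value of the given attribute.

    Hits that lack the attribute are skipped, since no attribute is required.
    """
    return Counter(hit[attribute] for hit in hits if attribute in hit)


print(distribution(hits, "word"))    # distribution of forms
print(distribution(hits, "fonte"))   # distribution by source
```

Asking for an attribute that is not encoded in the corpus simply yields an empty count, which mirrors the point that the available attributes constrain the distributions a user can request.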
Anyone with two IMS-CWB-encoded aligned corpora with ua attributes should be able to make use of DISPARA with very little adaptation: the content of the Web pages, the different messages served by the program and the layout of the results would have to be rewritten, and the names of the corpora substituted in the right places. Naturally, other options could also be added.
Instantiation to serving COMPARA
The instance of DISPARA that we are concerned with here, the one serving COMPARA, has quite a rich set of options, as a consequence of several design decisions about this corpus and the way its user interface was conceived:
- It allows the user to ask specifically for the display of translation notes (which are internally encoded as the value of the structural attribute note)
- It allows querying for the presence of sentence-internal markup (except for corr, which was not deemed relevant for user queries). Some markup is highlighted by the interface (we use italics for emph, foreign, named or title; and bold for the content of notes).
As a final remark, it should be acknowledged that it is not always clear what pertains to the general architecture and what is specific to COMPARA, given that the corpus interface and the DISPARA system were developed in tandem.
As a good illustration of this, we have recently started the process of POS-annotating COMPARA and, in that connection, are also expanding the DISPARA system.
Tokenization in COMPARA
To deal with Portuguese, we used the set of programs developed to process Portuguese corpora in the AC/DC project. For English, we used the Portuguese tokenizer, changing some details and adding (as a wrapper) some functionality.
It is well known that the most difficult (or non-consensual) issues in the tokenization of English are contractions, possessive markings and single quotes. So, in order to ease the processing that follows, and to allow an unambiguous distinction among the different uses of the character ', we require the corpus compilers to encode opening and closing single quotes differently, using characters other than '. (Incidentally, we do the same for double quotes, encoding opening and closing quotes differently. But notice that this is transparent to the ordinary user, who sees the ordinary English quotes in the concordances.)
This significantly simplifies tokenization, since the new quote markers are uniquely identified as punctuation signs (and therefore constitute separate tokens), while ' in COMPARA is known to denote either possessive case or contraction, neither of which we consider to give rise to a new token. In other words, in COMPARA, don't, o'clock, Peter's and students' are each considered one token, and can therefore be queried as such.
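The effect of this policy can be illustrated with a short Python sketch. It is not the AC/DC tokenizer: the tokenize function and the regular expression are illustrative, and the choice of U+2018 and U+2019 as the distinct opening and closing quote markers is an assumption, since the actual characters used by the corpus compilers are not stated here.

```python
import re

# Assumed stand-ins for the distinct quote markers required of compilers.
OPEN_QUOTE = "\u2018"
CLOSE_QUOTE = "\u2019"

# A word may contain internal apostrophes (don't, o'clock, Peter's) and may
# end in one (students'); any other non-space symbol is its own token.
TOKEN = re.compile(r"\w+(?:'\w+)*'?|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping the plain apostrophe inside words."""
    return TOKEN.findall(text)


sample = OPEN_QUOTE + "Peter's students' don't leave at five o'clock," + CLOSE_QUOTE + " she said."
print(tokenize(sample))
```

Because the quote markers are never the plain apostrophe, the pattern does not need to guess whether a given ' closes a quotation or marks a possessive.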
As far as numerical expressions are concerned, our policy is as follows: when a currency unit precedes the number, we consider currency and amount as one token (as in £60 or $300.00); when it follows the numerical expression, the expression is treated like any other numerical expression with units, with separate tokens. So 30 réis and 40,000 francs each constitute two tokens in COMPARA, analogously to any other measure, e.g. 600 miles per hour, which has four tokens.
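The numerical policy can be sketched in the same way. Again, this is only an illustration under stated assumptions, not the actual tokenizer: the tokenize_numbers function is a hypothetical name, and the pattern only knows the two currency symbols used in the examples above.

```python
import re

# A number is a run of digits with optional . or , groups (40,000 or 300.00).
# A leading currency symbol is glued to the number; everything else is split
# into words and single punctuation marks.
NUM_TOKEN = re.compile(r"[£$]?\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize_numbers(text):
    """Tokenize text, keeping a preceding currency symbol with its amount."""
    return NUM_TOKEN.findall(text)


for example in ["£60", "$300.00", "30 réis", "40,000 francs", "600 miles per hour"]:
    tokens = tokenize_numbers(example)
    print(example, "->", len(tokens), tokens)
```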
Finally, a few abbreviations were removed and others added, to avoid conflict between Portuguese and English.