Computational processing of Portuguese: working memo

Diana Santos
This is an English translation, at times shortened, of the Portuguese original dated 9th February, created jointly by Diana Santos and Signe Oksefjell on 23rd March. This document was last reviewed on 13th April, reflecting the last update of the Portuguese version.
The only way to keep a language from being neglected in the information society of the future is to invest in the computational processing of that language. By investing in the Portuguese language we ensure that knowledge may be communicated and taught, as well as accessed, in Portuguese. The man in the street should be able to live with the computer without having to give up his culture or language.

This document is intended as a point of departure for a thorough discussion about the future of the computational processing of Portuguese.


Introduction

At present, there is a tendency towards world-wide homogenization at all levels. Technology, with its "limitations", is often seen as the driving force behind this situation. Technologists know, however, that it is not the technology in itself that is limiting, but rather the interests of the technology owners.

However, a paradoxical situation has arisen, due to the fact that information is becoming all-important: since information for the most part is encoded in some natural language -- be it in books, documents, or simply in spoken interaction -- one needs to take into account the diversity of the planet in order to obtain more information. That is, we need to take into consideration the many languages of the world, the several writing systems, and a number of different communication cultures (or ways to communicate) to keep up with the information flow.

Every enterprise of the information society is aware of the need for localization; one should go one step further and think of "originalization". Instead of adapting systems conceived by foreign experts for a foreign market, one should devise tools, advertisements, and large systems for a Portuguese audience (or better, for a Portuguese-speaking audience).

[...]

In a world strongly influenced by the American view of communication it is not surprising that the importance of the differences between languages is minimized, also at the level of the computational processing of natural languages.

This latter point can result in a statement such as "It is an advantage to consider French as misspelled English for multilingual information retrieval", which was actually made by a reputed member of the NLP community! (See http://www.cst.ku.dk/projects/eagles2/workshop/TRECkaren.html.)

We have to consider our native language as a major factor for development policies, and take into account the specificity of the Portuguese culture as reflected in the language and the communicative patterns. Hence, we must look at the Portuguese (language) reality and, from there, set out to develop Portuguese-aware systems. It is hoped that both man-machine communication and communication between people mediated by the computer will be improved.

The state of a scientific domain cannot be changed by decree, or due to a government's goodwill. It is necessary that the partners involved come together and reflect upon the situation, express their views, and suggest concrete measures. The present document reflects a first informal discussion with the members of the Portuguese R&D community who were willing to share their ideas with me and send suggestions and criticisms. In order to represent the wishes and opinions of everyone who works (or would like to work) on the computational processing of Portuguese, more collaboration is needed. We therefore ask you to send suggestions for improvement and further contributions to projecto@informatics.sintef.no.

All contributions will be made available from our site.


Defining the computational processing of Portuguese as a priority

To ensure that there is continuity in existing and future R&D groups, this research area should be strongly supported, both politically and financially. This is also extremely important in order for Portuguese language engineering to achieve the status of a realistic professional choice.

At present the majority of R&D groups have serious difficulties both in funding projects and in recruiting people, precisely because this area has not been recognized as being important.

To be able to make significant developments, projects within NLP require more than a 2-3 year time-frame. This is needed both to achieve a continued improvement of resources and to guarantee some continuity in basic research. This is not to say that periodic reviews and possibly funding readjustments should have the same time restrictions as the projects.

Also, one or several measures to evaluate the health of the area should be defined. It would then be possible to check whether one is actually contributing to the field's progress and to correct steps which turn out to be dead ends.

For this last goal, one could

Making language resources available (in many ways)

So far the R&D community (with very few exceptions) has kept every resource they develop as a well-kept secret, which brings about, among other things,

To change this situation, without harming the resource developers, a framework where sharing is encouraged and rewarded should be established. At the same time, flexible payment schemes based on use should be developed.

Beyond making existing language resources available, it is necessary to develop the many more that our language still lacks, and to guarantee that their development can be followed by all interested partners, thus avoiding the risk of their future unavailability.

Some examples of what is needed for our language:

Some suggestions on how to achieve these resources:

It should also be noted that a "documentation standpoint" would be very advantageous regarding resource compilation, i.e., in order to distribute and describe the resources, classification schemes (taxonomies, thesauri) are needed. Furthermore, the encoding of information in portable formats such as XML or those suggested by the TEI should be encouraged. The resource should at least be well-documented.
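As a rough illustration of what a documented, portable encoding could look like, the sketch below serializes a tiny resource description as XML using Python's standard library. The element and attribute names (`resource`, `name`, `description`, `samples`, `s`, `lang`) are invented for this example; they are not taken from the TEI guidelines or from any existing scheme.

```python
import xml.etree.ElementTree as ET


def build_resource_record(name, language, description, sample_sentences):
    """Return an XML element describing a language resource.

    All element and attribute names here are hypothetical, chosen only
    to show documentation travelling together with the data.
    """
    resource = ET.Element("resource", attrib={"lang": language})
    ET.SubElement(resource, "name").text = name
    ET.SubElement(resource, "description").text = description
    samples = ET.SubElement(resource, "samples")
    for sentence in sample_sentences:
        ET.SubElement(samples, "s").text = sentence
    return resource


record = build_resource_record(
    name="Toy corpus",
    language="pt",
    description="Two hand-written sentences, for illustration only.",
    sample_sentences=[
        "A língua portuguesa é falada em vários países.",
        "Não há recursos suficientes.",
    ],
)

# Serialize as UTF-8 bytes so that accented characters survive transport.
xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))

# Any XML-aware tool can read the record back without custom parsing code.
parsed = ET.fromstring(xml_bytes)
```

The point of the sketch is only that the description and the data sit in one self-describing file; a real resource would follow an agreed scheme rather than ad hoc element names.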

Evaluation and quality control

Due to lack of common resources and lack of communication between research groups, there is no consensus on how to evaluate a given tool, or data, as a Portuguese language resource. In most cases, it is simply not possible to evaluate the work in the field.

It is therefore essential to develop methods of testing, evaluation and comparison designed specifically for Portuguese, along the lines of the TREC (see http://trec.nist.gov/) or SENSEVAL (see http://www.itri.bton.ac.uk/events/senseval/cfp2.html) contests.

Also, it is necessary to define and publish standards of product acceptance as far as Portuguese is concerned, in areas as different as operating systems, systems to support linguistic activities (translation workbenches, text processors), CSCW environments, and large systems in government agencies.

Attention should also be drawn to problems related to international standards, such as the lack of support for accented characters in most internet protocols. These are problems that need to be addressed in the international arena.
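A minimal sketch of the accented-character problem follows. It assumes a hypothetical channel that accepts only 7-bit ASCII and compares it with UTF-8; the sample sentence and the helper name `fits_in_ascii` are made up for illustration.

```python
import unicodedata

# A made-up Portuguese sentence, normalized so that each accented letter
# is a single precomposed code point.
text = unicodedata.normalize("NFC", "A informação não está disponível.")


def fits_in_ascii(s):
    """Return True if the string can be sent unchanged over an ASCII-only channel."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Over an ASCII-only channel the accented letters cannot be represented;
# replacing them with "?" shows what the reader would actually receive.
lossy = text.encode("ascii", errors="replace").decode("ascii")

# An encoding that covers Portuguese round-trips the text without loss.
round_trip = text.encode("utf-8").decode("utf-8")

print(fits_in_ascii(text))
print(lossy)
print(round_trip == text)
```

Every accented letter is lost in the ASCII version, which is enough to change or obscure the meaning of ordinary Portuguese words.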

It would be useful, then, to have a public "portuguezation" service of the technology (instead of simply making things "sound" Portuguese, by adapting tools or resources for other languages to Portuguese). Such an institution should organize the evaluation contests, inform the R&D community, provide resource distribution, develop or commission quality tests, and represent the country in international committees (see further the next section, on "Services for the development of resources and tools").

Services for the development of resources and tools

There is a set of services that should be available for Portuguese:

How should these services be obtained?

In some cases, they may belong to the public administration; in other cases, it will be sufficient for the groups and centres involved to devote part of their time to this kind of activity, provided they get enough funding for that goal.

Furthermore, the users themselves must be represented in these networks, in order to evaluate the service provided.

Please note that these services should not be concentrated in a single node but rather be distributed over several locations in Portugal and worldwide. This would counteract monopolizing tendencies and coordinate the human potential that is geographically distributed. We must not forget the advantages of a collaboration with Brazil, as well as with international groups.

Before issuing laws and creating "paper" networks, it is important that the scientific community reconsiders its organization, for which funding should be provided.

In addition, project proposals should be required to include plans for distribution of end products and resource evaluation, so that everyone involved would take seriously the activities of testing, validation, and service providing, in addition to the already recognized activities of R&D, teaching and popularization.

Reinforcement of empirical methods

One subject that must be carefully considered is the use of empirical methods in computational linguistics (and in particular in the processing of Portuguese).

Matters such as evaluation, coverage, precision, testing of hypotheses, version control, comparison of different systems, and objective measures, ought to be stressed.
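To make the notion of objective measures concrete, here is a minimal sketch of how precision, recall, and coverage might be computed against a hand-checked reference. The definitions are the common textbook ones; treating coverage as the share of running words found in a lexicon is an assumption of this example, and all data below is invented.

```python
def precision_recall(system_items, gold_items):
    """Compare a system's output with a hand-checked gold standard.

    precision = correct items / items the system proposed
    recall    = correct items / items in the gold standard
    """
    system, gold = set(system_items), set(gold_items)
    correct = system & gold
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


def coverage(tokens, lexicon):
    """Share of running words that are found in the lexicon (assumed definition)."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in lexicon)
    return known / len(tokens)


# Invented example: place names proposed by a hypothetical tool versus
# the names a human annotator marked in the same text.
gold = {"Lisboa", "Porto", "Brasil", "Coimbra"}
system = {"Lisboa", "Porto", "Tejo"}
precision, recall = precision_recall(system, gold)

# Invented example: a five-word sentence checked against a tiny lexicon.
sentence = ["O", "gato", "comeu", "o", "peixe"]
lexicon = {"o", "gato", "peixe"}
sentence_coverage = coverage(sentence, lexicon)

print(f"precision={precision:.2f} recall={recall:.2f} coverage={sentence_coverage:.2f}")
```

Figures like these only become comparable across groups when everyone measures against the same reference data, which is why shared evaluation resources matter.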

Until now, as mentioned in the above section "Making language resources available", there has been very little work that can be considered as evaluation. Similarly, very little has been done to measure a system's adequacy, that is, the extent to which it successfully solves its task.

For example, how does one evaluate a Portuguese dictionary? By its size? By a list of bugs discovered? Or by its user friendliness? And what about a speech synthesis system? Or a search engine on the Web?

These questions are at least as important as the development of the systems or the resources themselves, and it is necessary to devote a lot of attention to them, particularly because they have seldom been focussed on before.

As pointed out earlier, there is a strong need to develop evaluation resources, such as large corpora, annotated corpora, etc.

Some measures to remedy the lack of empirical methods in the computational processing of Portuguese are the following:

Linking basic research and technology

Since natural language processing is supposed to process speech or text, it is essential that research in NLP is accompanied by programs that actually do this. We need such programs.

Only with systems that demonstrate what one is after, and allow one to test and change them, is it possible to

In order to develop systems that execute a given task, it will often be necessary to use other systems, developed by other groups, as building blocks for larger systems. Such a practical need would encourage collaboration between experts in different areas (that is, different sub-areas of NLP and speech).

Development of applications related to the daily work in an information society

The main challenge within the field of computational processing of Portuguese is to take the step from an academic activity to a reality felt at all levels in our information society.

The ultimate goal of this investment is that a certain level of Portuguese language processing ("portugware") will be felt to be just as necessary as an operating system, and that, consequently, it will soon become inconceivable to use any device or equipment whose error messages, training, and help are not issued in our language. In the near future, we should be able to

[...]

For this purpose it is necessary, on the one hand, to contact international companies and establish concrete action plans to help them integrate Portuguese in their products.

On the other hand, it is necessary to foster awareness of and expertise in language technology in Portuguese companies, offering incentive schemes (such as tax reduction) to favour the companies which use modern language technology in their work.

To enforce a "portuguezation" by law should, however, be avoided. The natural advantage of working in one's native language should in itself trigger a preference for Portuguese-aware products. The way to proceed is to set an example or present a model.

In this vein, we could imagine a special program constructed for some key companies or institutions, such as publishers, mass media, libraries, museums, etc. so that they could invest, and seriously so, in the computational processing of Portuguese before it is too late.

This pairs well with the need to fund cultural institutions (such as libraries, museums, and all institutions that own large bodies of interesting data), so that they can make their collections, or resources, available for easy manipulation and access. This funding could also contribute to a better knowledge of the collections themselves.

Some successful examples would cause a domino effect, and other companies or public departments would actively seek financing for similar activities.

Incidentally, it should be noted that not only work, but also leisure, would benefit from a serious investment in the computational processing of Portuguese. In fact, we only need to look at the tremendous demand for Portuguese and Brazilian lyrics on the Internet to understand that investment in culture sometimes also pays off.

Education policy in the field of computational processing of Portuguese

One of the major problems of Portuguese language technology is its lack of recognition and even identity. Neither engineering schools nor arts faculties recognize the field as a priority, and in neither case is appropriate education provided in Portugal. (There are some positive signs, though: a degree in language and knowledge engineering, an MA program in computational linguistics, and several courses within different MA programs; see http://www.portugues.mct.pt/ensino.html.)

However, quite a few central issues are missing (note that I do not have access to all university curricula, and therefore the following statements may not be accurate, but simply reflect the general impression that people involved in speech processing conveyed to me):

  1. It appears that there is no specific education on phonetics for the purposes of speech processing.
  2. Although a considerable part of NLP research is based on statistical methods, there is no specific education on the subject "statistical methods in NLP" in Portugal (there seem to be only general introductory courses on statistics). This contrasts with a heavy bias towards formal methods in general and logic in particular.
  3. There are very few courses on speech processing; most courses offered cover only digital signal processing and/or neural networks.

Another problem is that researchers come to the field with different, not to say opposite, perspectives, which makes communication between engineers and linguists extremely difficult: while the former see NLP as a specific area of computer science whose raw material is language, the latter see it as an application of linguistics.

Ideally, there should be a basic NLP course in all computer science degrees (with the option of more advanced ones), or even in all engineering degrees. Likewise, there should be one or more computer science courses in all arts curricula, not only to make the student familiar with the computational methods and resources used in his particular branch of the Humanities, but also to offer some insight into how those systems and tools were developed.

In our time, there is already a common technological basis (a technological infrastructure) which caters for communication between different disciplines. It would therefore be a good idea for several departments/institutes to cooperate in order to share curricula, courses, and teaching staff, and to engage in common projects. Some researchers have suggested a reorganization of the R&D groups so that a cross-fertilization would be possible; see Proposta de Estruturação da Área do Processamento Computacional do Português.

Furthermore, it seems appropriate to create a Web-based course (or even a laboratory) for the computational processing of Portuguese, since one of the uses of NLP is precisely in computer-aided education, and more specifically on the Web. This would require all members of the community to participate (provided there is public funding for these activities) in order to create teaching materials and provide the necessary tools. Their knowledge and systems would then reach a wider audience, and be of great benefit to everyone who follows the courses.

Related areas

Some areas that deserve special interest at the present moment, from an NLP perspective, are:

It is necessary to define, and protect, the public status of a language, and clarify the questions of copyright regarding publishers, authors and compilers of collections (or knowledge repositories such as dictionaries and encyclopædias).

One of the most exciting and fast growing areas (also in NLP) is the Web. This should also be reflected in our language, and requires experts that are familiar both with NLP and Web technology.

Large collections of information in special formats exist, in databases or knowledge bases. Traditionally, the two areas are separate, which leads to a duplication of efforts in resource compilation, in processing, and even in the development of human-machine interfaces. Needless to say, this duplication should be avoided.

By establishing models of communication over distance and emphasizing the student's process of self-teaching, the Internet also gives rise to new pedagogic models which should naturally be language- and culture-aware. This is an area in which the computational processing of Portuguese may be essential for the user friendliness of the system and consequently for the fulfilment of the system's goals.

Communication and participation

If people involved in this area fail to communicate, there can be no hope that significant developments will take place in the computational processing of Portuguese.

Until now, the area has lived in a world of secret negotiations, intrigues, and personal invitations instead of open meetings or calls. Little has been done in the vein of collaboration, and few resources and tools are freely available.

There are several concrete measures that could be taken to change this situation, at least for publicly funded projects.


Other documents

The present text has gained significantly from the many suggestions, comments and ideas that we have received (by e-mail, phone or face-to-face communication) since we announced on our Web pages that we were in charge of writing the present document.

We gratefully acknowledge the receipt, in January/February, of the following documents for discussion (in Portuguese), to which we point in our own text:

The complete set of contributions and reactions to the present document, some of which unfortunately arrived too late to influence it, is available from http://www.portugues.mct.pt/branco/reaccoes.html.

It should also be acknowledged that many of the ideas stated in the present document have been taken from other documents on the subject or on NLP in general, most of which are available on the Web. We have tried to make our sources available to everyone in an alphabetical list at http://www.portugues.mct.pt/atalhos1.html.

The only non-electronic source was

O'Hagan, Minako. The Coming Industry of Teletranslation. Clevedon / Philadelphia / Adelaide: Multilingual Matters Ltd., 1996.


Towards an area profile

It is not possible to create, semi-automatically, a profile of an area which is not recognized by the funding agencies. The present selection tries to bring together a series of different activities under a common goal, in order to invest in a particular area. To a much higher degree than in the case of traditional research fields, the choice is subjective.

I have attempted to describe the area in Portugal according to the way it may be described in the future, making use of three different information sources provided by the Fundação da Ciência e da Tecnologia and the Observatório das Ciências e das Tecnologias:

From these data, three lists are presented to the R&D community:
  1. A list of projects, classified as either "in the area" or "related to the area"
  2. A list of scholarships, also classified as above
  3. A list of people having a PhD which can be considered at least potentially related to the area
as well as some comments on the methodology employed.

In parallel, using the information present on the Web, our project tried to create a catalogue of resources and of actors in the area of the computational processing of Portuguese (not restricted to Portuguese soil). We have been working on this catalogue since July 1998 and it can be consulted from our site, http://www.portugues.mct.pt/


Send questions, comments and suggestions.