Computational processing of Portuguese: working memo

Diana Santos

This is an English translation, at times shortened, of the Portuguese original dated 9th February, created jointly by Diana Santos and Signe Oksefjell on 23rd March. This document was last reviewed on 13th April, reflecting the last update of the Portuguese version.

The only way to keep a language from being neglected in the information society of the future is to invest in the computational processing of that language. By investing in the Portuguese language we ensure that knowledge may be communicated and taught, as well as accessed, in Portuguese. The man in the street should be able to live with the computer without having to give up his culture or language.

This document is intended as a point of departure for a thorough discussion about the future of the computational processing of Portuguese.

Introduction
Defining the computational processing of Portuguese as a priority
Making language resources available
Evaluation and quality control relative to Portuguese
Services for the development of language resources and tools
Reinforcement of empirical methods
Linking basic research and technology
Development of applications related to the daily work in an information society
Education policy in the field of computational processing of Portuguese
Related areas
Communication and participation
Other documents
Towards an area profile

Introduction

At present, there is a tendency towards world-wide homogenization at all levels. Technology, with its "limitations", is often seen as the driving force behind this situation. Technologists know, however, that it is not the technology in itself that is limiting, but rather the interests of the technology owners.

We need highways, but who owns the fast cars?
Anyone can make a home video, but more than 90% of all videos are produced by the film industry
Anyone with some knowledge of computer science can write his/her own text processing program, but some software companies earn billions and have a near monopoly of such products
Almost any person or enterprise can publish freely on the WWW, but are the search engines impartial?

However, a paradoxical situation has arisen, due to the fact that information is becoming all-important: since information for the most part is encoded in some natural language -- be it in books, documents, or simply in spoken interaction -- one needs to take into account the diversity of the planet in order to obtain more information. That is, to keep up with the information flow we need to take into consideration the many languages of the world, the several writing systems, and a number of different communication cultures (or ways to communicate).

Every enterprise of the information society is aware of the need for localization; one should go one step further and think of "originalization". Instead of adapting systems conceived by foreign experts for a foreign market, one should devise tools, advertisements, and large systems for a Portuguese audience (or better, for a Portuguese-speaking audience).

[...]

In a world strongly influenced by the American view of communication it is not surprising that the importance of the differences between languages is minimized, also at the level of the computational processing of natural languages:

either by following "universal" conceptions of language
or by adapting applications and methodologies originally developed for English to other languages

This latter point can result in a statement such as "It is an advantage to consider French as misspelled English for multilingual information retrieval", which was actually made by a reputed member of the NLP community! (See http://www.cst.ku.dk/projects/eagles2/workshop/TRECkaren.html.)

We have to consider our native language as a major factor for development policies, and take into account the specificity of the Portuguese culture as reflected in the language and the communicative patterns. Hence, we must look at the Portuguese (language) reality and, from there, set out to develop Portuguese-aware systems. It is hoped that both man-machine communication and communication between people mediated by the computer will be improved.

The state of a scientific domain cannot be changed by decree, or due to a government's goodwill. It is necessary that the partners involved come together and reflect upon the situation, express their views, and suggest concrete measures. The present document reflects a first informal discussion with the members of the Portuguese R&D community who were willing to share their ideas with me and send suggestions and criticisms. In order to represent the wishes and opinions of everyone who works (or would like to work) on the computational processing of Portuguese, more collaboration is needed. We therefore ask you to send suggestions for improvement and further contributions to projecto@informatics.sintef.no.

All contributions will be made available from our site.

Defining the computational processing of Portuguese as a priority

To ensure that there is continuity in existing and future R&D groups, this research area should be strongly supported, both politically and financially. This is also extremely important in order for Portuguese language engineering to achieve the status of a realistic professional choice.

At present the majority of R&D groups have serious difficulties both in funding projects and in recruiting people, precisely because this area has not been recognized as being important.

To be able to make significant developments, projects within NLP require more than a 2-3 year time-frame. This is needed both to achieve a continued improvement of resources and to guarantee some continuity in basic research. This is not to say that periodic reviews and possibly funding readjustments should have the same time restrictions as the projects.

Also, one or several measures to evaluate the health of the area should be defined. It would then be possible to check whether one is actually contributing to the field's progress and to correct steps which turn out to be dead ends.

For this last goal, one could

create an international evaluation committee
define an objective measure for the "health of Portuguese in the information society" (see A questão da defesa do português)
create a nation-wide (or international) discussion forum to deal with problems related to the computational processing of Portuguese

Making language resources available (in many ways)

So far the R&D community (with very few exceptions) has treated every resource it develops as a closely guarded secret, which results in, among other things,

lack of communication between researchers
impossibility of evaluation or comparison of results
unnecessary repetition of work, ignoring already existing national resources and competence
the fact that Portuguese lags well behind other languages

To change this situation, without harming the resource developers, a framework where sharing is encouraged and rewarded should be established. At the same time, flexible payment schemes based on use should be developed.

Beyond making existing language resources available, it is necessary to develop the many that are still lacking for our language, and to guarantee that their development can be followed by all interested partners, thus avoiding the risk of their future unavailability.

Some examples of what is needed for our language:

tagged corpora
parsed corpora
aligned corpora
terminological databases in most domains
machine dictionaries with subcategorization information
computerized thesauri
frequency studies
corpus-based grammars
corpus-based dictionaries
semantic networks
dictionaries of idioms and fixed phrases
contrastive dictionaries of Portuguese variants

Some suggestions on how to achieve these resources:

Make availability (at all stages) a necessary condition for public funding
Make collaboration between institutions a preferred condition
If the institution in charge does not work according to the plan (and/or does not make available the promised resources), allow its substitution by another
Provide economic support to the already existing resources on condition that they are made publicly available
Make laws that forbid the ownership of the Portuguese language, without preventing its commercial exploitation (by publishers for example)
Launch calls for proposals aimed at the creation of such resources
Develop a legal and technical framework that supports on-line subscription to language resources, as well as a financial framework that allows for this kind of cost in the budget of R&D groups

It should also be noted that a "documentation standpoint" would be very advantageous regarding resource compilation, i.e., in order to distribute and describe the resources, classification schemes (taxonomies, thesauri) are needed. Furthermore, the encoding of information in portable formats such as XML or those suggested by the TEI should be encouraged. The resource should at least be well-documented.
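To make the idea of portable encoding concrete, here is a minimal sketch in Python of how one annotated sentence might be written out as XML. It is only an illustration under stated assumptions: the element names `s` and `w` and the attributes `lemma` and `pos` are loosely modelled on TEI practice, while the function `encode_sentence`, the sample sentence, and the tag labels are hypothetical and do not describe any existing Portuguese resource.

```python
import xml.etree.ElementTree as ET


def encode_sentence(tokens):
    """Build an <s> element holding one <w> element per (form, lemma, pos) token."""
    sentence = ET.Element("s")
    for form, lemma, pos in tokens:
        word = ET.SubElement(sentence, "w", {"lemma": lemma, "pos": pos})
        word.text = form
    return sentence


# Hypothetical annotation of "A língua portuguesa"; the tag labels are illustrative only.
tokens = [
    ("A", "o", "DET"),
    ("língua", "língua", "N"),
    ("portuguesa", "português", "ADJ"),
]

# encoding="unicode" returns a str, so accented characters are kept as they are.
xml_text = ET.tostring(encode_sentence(tokens), encoding="unicode")
print(xml_text)
```

Because the result is plain XML, any group could read it back with a standard parser, which is the kind of portability the paragraph above argues for.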

Evaluation and quality control

Due to lack of common resources and lack of communication between research groups, there is no consensus on how to evaluate a given tool, or data, as a Portuguese language resource. In most cases, it is simply not possible to evaluate the work in the field.

It is therefore essential to develop methods of testing, evaluation and comparison designed specifically for Portuguese, along the lines of the TREC (see http://trec.nist.gov/) or SENSEVAL (see http://www.itri.bton.ac.uk/events/senseval/cfp2.html) contests.

Also, it is necessary to define and publish standards of product acceptance as far as Portuguese is concerned, in areas as different as operating systems, systems to support linguistic activities (translation workbenches, text processors), CSCW environments, and large systems in government agencies.

Attention should also be drawn to problems related to international standards, such as the absence of accented characters in most internet protocols. These are problems that need to be tackled in the international arena.

It would be useful, then, to have a public "portuguezation" service of the technology (instead of simply making things "sound" Portuguese, by adapting tools or resources for other languages to Portuguese). Such an institution should organize the evaluation contests, inform the R&D community, provide resource distribution, develop or commission quality tests, and represent the country in international committees (see further the next section, on "Services for the development of resources and tools").

Services for the development of resources and tools

There is a set of services that should be available for Portuguese:

A network for translation activities, whose goal would be to help translators and researchers, by providing information and data, and testing or producing test materials for translation-related products. In addition, such a network should make publicly available bilingual data in order to improve, make compatible, and promote the quality of translation.
A network for terminological work, whose goal would be to make resources available and support terminologists as well as create, or contribute to the creation of, new terminological databases.
A network for speech processing of Portuguese, whose goal would be to make resources available, carry out evaluation, and create specific tools to manipulate the resources. This network should also develop and make public speech databases covering several regions in order to guarantee nationwide coverage of training material for speech recognition systems.
A network for the computational processing of Portuguese, whose goal would be to make resources available, carry out evaluation, and develop specific tools to evaluate or apply those resources. Another goal would be to develop computational lexicons, parsing and generation modules, to be used in the R&D activities of other elements which might benefit from them, such as companies, R&D groups in other areas, international projects. (Please note that "available" does not necessarily mean "free".)

How should these services be obtained?

In some cases, they may belong to the public administration; in other cases, it will be sufficient for the groups and centres involved to devote part of their time to this kind of activity, provided they get enough funding for that goal.

Furthermore, the users themselves must be represented in these networks, in order to evaluate the service provided.

Please note that these services should not be concentrated in a single node but rather be distributed over several locations in Portugal and worldwide. This would counteract monopolizing tendencies and coordinate the human potential that is geographically distributed. We must not forget the advantages of a collaboration with Brazil, as well as with international groups.

Before issuing laws and creating "paper" networks, it is important that the scientific community reconsiders its organization, for which funding should be provided.

In addition, project proposals should be required to include plans for distribution of end products and resource evaluation, so that everyone involved would take seriously the activities of testing, validation, and service provision, in addition to the already recognized activities of R&D, teaching and popularization.

Reinforcement of empirical methods

One subject that must be carefully considered is the use of empirical methods in computational linguistics (and in particular in the processing of Portuguese).

Matters such as evaluation, coverage, precision, testing of hypotheses, version control, comparison of different systems, and objective measures, ought to be stressed.
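As an illustration of what two of these objective measures could look like in practice, the sketch below computes coverage and precision for a system that labels tokens and is allowed to abstain. The definitions used here are assumptions made for the example (coverage as the share of tokens that receive an answer, precision as the share of answers that match a hand-checked gold standard); the function name, the tag labels, and the data are hypothetical.

```python
def coverage_and_precision(predicted, gold):
    """Return (coverage, precision) for one system output against a gold standard.

    predicted: one label per token, or None where the system gave no answer.
    gold: the hand-checked label for each token.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # Keep only the tokens for which the system committed to an answer.
    answered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    coverage = len(answered) / len(gold) if gold else 0.0
    correct = sum(1 for p, g in answered if p == g)
    precision = correct / len(answered) if answered else 0.0
    return coverage, precision


# Hypothetical five-token example: one token unanswered, one answered wrongly.
gold = ["DET", "N", "ADJ", "V", "N"]
predicted = ["DET", "N", None, "V", "ADJ"]

coverage, precision = coverage_and_precision(predicted, gold)
print(f"coverage={coverage:.2f} precision={precision:.2f}")
```

Reporting the two figures together matters: a system can reach high precision simply by answering only the easy cases, which a low coverage figure would reveal.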

Until now, as mentioned in the section "Making language resources available" above, there has been very little work that can be considered evaluation. Similarly, very little has been done to measure a system's adequacy: to what extent does it successfully solve its task?

For example, how does one evaluate a Portuguese dictionary? By its size? By a list of bugs discovered? Or by its user friendliness? And what about a speech synthesis system? Or a search engine on the Web?

These questions are at least as important as the development of the systems or the resources themselves, and it is necessary to devote a lot of attention to them, particularly because they have seldom been focussed on before.

As pointed out earlier, there is a strong need to develop evaluation resources, such as large corpora, annotated corpora, etc.

Some measures to remedy the lack of empirical methods in the computational processing of Portuguese are the following:

Stimulate and develop resources for evaluation purposes
Make it compulsory for every publicly financed project to include an evaluation part – i.e., a description of how to evaluate the results, and when. Furthermore, the evaluation should preferably be externally controlled (or at least repeatable).
Insist on evaluation and empirical methods in all educational actions
Publish and foster work on the evaluation of tools and resources, in order to increase the interest in such work

Linking basic research and technology

Since natural language processing is supposed to process speech or text, it is essential that research in NLP is accompanied by programs that actually do this. We need such programs.

Only with systems that demonstrate what one is after, and allow one to test and change them, is it possible to

teach NLP or speech processing
make progress in the field
realize what needs to be done
have input from possible users so that one can develop a product, system, or service that will satisfy their needs

In order to develop systems that execute a given task, it will often be necessary to use other systems, developed by other groups, as building blocks for larger systems. Such a practical need would encourage collaboration between experts in different areas (that is, different sub-areas of NLP and speech).

Development of applications related to the daily work in an information society

The main challenge within the field of computational processing of Portuguese is to take the step from an academic activity to a reality felt at all levels in our information society.

The ultimate goal of this investment is that a certain level of Portuguese language processing ("portugware") will be felt to be just as necessary as an operating system, and that, consequently, it will soon become inconceivable for any device or equipment not to offer its error messages, training, and help in our language. In the near future, we should be able to

give spoken orders in Portuguese (in addition to menu-based interaction)
talk to a machine over the phone without having to speak slowly in English
write or ask questions in Portuguese, instead of having to learn an artificial search or query language

[...]

For this purpose it is necessary, on the one hand, to contact international companies and establish concrete action plans to help them integrate Portuguese in their products.

On the other hand, it is necessary to foster awareness of and expertise in language technology in Portuguese companies, offering incentive schemes (such as tax reduction) to favour the companies which use modern language technology in their work.

Enforcing "portuguezation" by law should, however, be avoided. The natural advantage of working in one's native language should in itself trigger a preference for Portuguese-aware products. The way to proceed is to set an example or present a model.

In this vein, we could imagine a special program constructed for some key companies or institutions, such as publishers, mass media, libraries, museums, etc. so that they could invest, and seriously so, in the computational processing of Portuguese before it is too late.

This pairs well with the need to fund cultural institutions (such as libraries, museums, and all institutions that own large bodies of interesting data), so that they can make their collections, or resources, available for easy manipulation and access. This funding could also contribute to a better knowledge of the collections themselves.

Some successful examples would cause a domino effect, and other companies or public departments would actively seek financing for similar activities.

Incidentally, it should be noted that not only work, but also leisure, would benefit from a serious investment in the computational processing of Portuguese. In fact, we only need to look at the tremendous demand for Portuguese and Brazilian lyrics on the Internet to understand that investment in culture sometimes also pays off.

Education policy in the field of computational processing of Portuguese

One of the major problems of Portuguese language technology is its lack of recognition and even identity. Neither engineering schools nor arts faculties recognize the field as a priority, and in neither case is appropriate education provided in Portugal. (There are some positive signs, though: a degree in language and knowledge engineering, an MA program in computational linguistics, and several courses within different MA programs; see http://www.portugues.mct.pt/ensino.html.)

However, quite a few central issues are missing (note that I do not have access to all university curricula, and therefore the following statements may not be accurate, but simply reflect the general impression that people involved in speech processing conveyed to me):

It appears that there is no specific education on phonetics for the purposes of speech processing.
Although a considerable part of NLP research is based on statistical methods, there is no specific education on the subject "statistical methods in NLP" in Portugal (there seem to be only general introductory courses on statistics). This contrasts with a heavy bias towards formal methods in general and logic in particular.
There are very few courses on speech processing; most courses offered cover only digital signal processing and/or neural networks.

Another problem is that researchers come to the field with different, not to say opposite, perspectives, which makes communication between engineers and linguists extremely difficult: while the former see NLP as a specific area of computer science whose raw material is language, the latter see it as an application of linguistics.

Ideally, there should be a basic NLP course in all computer science degrees (with the option of more advanced ones), or even in all engineering degrees. Likewise, there should be one or more computer science courses in all arts curricula, not only to make the student familiar with the computational methods and resources used in his particular branch of the Humanities, but also to offer some insight into how those systems and tools were developed.

In our time, there is already a common technological basis (a technological infrastructure) which caters for communication between different disciplines. It would therefore be a good idea for several departments or institutes to cooperate (sharing curricula, courses, and teaching staff, and engaging in common projects). Some researchers have suggested a reorganization of the R&D groups so that cross-fertilization would be possible; see Proposta de Estruturação da Área do Processamento Computacional do Português.

Furthermore, it seems appropriate to create a Web-based course (or even a laboratory) for the computational processing of Portuguese, since one of the uses of NLP is precisely in computer-aided education, and more specifically on the Web. This would require all members of the community to participate (provided there was public funding for these activities) in order to create teaching materials and provide the necessary tools. Their knowledge and systems would then reach a wider audience, and be of great benefit to everyone who followed the courses.

Related areas

Some areas that deserve special interest at the present moment, from an NLP perspective, are:

Legal issues and language technology

It is necessary to define, and protect, the public status of a language, and to clarify the questions of copyright regarding publishers, authors and compilers of collections (or knowledge repositories such as dictionaries and encyclopædias).

Language and the Internet (especially WWW)

One of the most exciting and fast-growing areas (also in NLP) is the Web. This should also be reflected in our language, and requires experts who are familiar with both NLP and Web technology.

The relationship between the database and the NLP communities

Large collections of information in special formats exist, in databases or knowledge bases. Traditionally, the two areas are separate, which leads to a duplication of efforts in resource compilation, in processing, and even in the development of human-machine interfaces. Needless to say, this duplication should be avoided.

The partial automation of education and its relationship with language and culture

By establishing models of communication over distance and emphasizing the student's process of self-teaching, the Internet also gives rise to new pedagogic models which should naturally be language- and culture-aware. This is an area in which the computational processing of Portuguese may be essential for the user friendliness of the system and consequently for the fulfilment of the system's goals.

Communication and participation

If people involved in this area fail to communicate, there can be no hope that significant developments will take place in the computational processing of Portuguese.

Until now, the area has lived in a world of secret negotiations, intrigues, and personal invitations instead of open meetings or calls. Little has been done in the vein of collaboration, and few resources and tools are freely available.

There are several concrete measures that could be taken to change this situation, at least for publicly funded projects.

Before approval or rejection of the project proposals, an open discussion should take place, so that the projects could obtain some consensus, feedback, and suggestions from anyone interested. This would improve the final result significantly, and other interested partners with a wish to work on specific parts could take part in the project. Technically, it would be enough to require that project proposals be made publicly available on the Web (with a schedule for discussion and feedback). Likewise, the progress reports, as well as the final report, should be made available on the Web.
Projects that do not achieve funding on a particular round could apply again on later occasions, already having benefitted from previous criticism and suggestions.
There should be a clear separation between the evaluation of the project proposal and the group that suggests it, following criteria that should be made public by the funding agencies. The results of the evaluation, and the committee who performs it, should also be made known, or at least available, to the people who submitted the proposals.
Projects that plan to make use of data or tools developed by other groups should be preferred.

Towards an area profile

It is not possible to create semi-automatically a profile of an area which is not recognized by the funding agencies. The present choice tries to bring together under a common goal a series of different activities, in order to invest in a particular area. To a much higher degree than in the case of traditional research fields, the choice is subjective.

My attempt was to describe the area in Portugal according to the way it may be described in the future, making use of three different information sources provided by the Fundação da Ciência e da Tecnologia and the Observatório das Ciências e das Tecnologias:

A database of projects financed by MCT in 1994-1996
A database of human resources holding a PhD in Portugal (having taken the degree in Portugal or abroad), graduated from 1970 to 1997, in the three fields "Linguistics", "Computer Science and Electrotechnical Engineering", and "Communication Sciences"
A database of (doctoral or post-doctoral) scholarships from 1994 to 1997, also in the same three fields

From these data, three lists are presented to the R&D community:

A list of projects, classified as either "in the area" or "related to the area"
A list of scholarships, also classified as above
A list of people whose PhD can be considered at least potentially related to the area

as well as some comments on the methodology employed.

In parallel, using the information present on the Web, our project tried to create a catalogue of resources and of actors in the area of the computational processing of Portuguese (not restricted to Portuguese soil). We have been working on this catalogue since July 1998 and it can be consulted from our site, http://www.portugues.mct.pt/

Send questions, comments and suggestions.