Main interests

Diana Santos

My main interests in brief.

Machine Translation

I was responsible for the development of PORTUGA (Mentor/P), a broad-coverage MT prototype from English to Portuguese. Its development took place at the IBM-INESC Scientific Group, 1987 - 1989.

The project resulted in an MT shell, written in PLNLP, and an instantiation for the English to Portuguese pair.
The shell was also used in a pilot project in IBM Norway (MT from Norwegian Bokmål to Nynorsk).
Its most interesting idea was the treatment of idioms and lexical gaps.

Some relevant publications are:

Santos, Diana.: "Lexical gaps and idioms in Machine Translation", Hans Karlgren (ed.), Proceedings of COLING'90 (Helsinki, August 1990), Vol 2, pp.330-5.
Santos, Diana.: "Broad-coverage machine translation", in K. Jensen, G. Heidorn & S. Richardson, Natural Language Processing: The PLNLP Approach, Kluwer Academic Press, 1992.

Computational processing of Portuguese

Unfortunately, Portuguese is well behind the other major languages of the world as far as its computational processing is concerned. My efforts in the field have included:

Corpus processing
- gathering an initial sentence corpus,
- studying some of its properties,
- annotating it with major parts of speech,
- developing a corpus browser
Creation of a morphological analyser of original design: Palavroso
Some studies in sentence separation using only morphological clues (no lexicon)
Development of various grammar fragments for Portuguese (unfortunately, only at an initial stage)
Studies of computational lexica for Portuguese
Studies of the alignment of Portuguese with English
Studies of the variation between European and Brazilian Portuguese

Some relevant publications are:

Medeiros, José Carlos, Rui Marques & Diana Santos.: "Português Quantitativo", Actas do 1.o Encontro de Processamento de Língua Portuguesa (Escrita e Falada) - EPLP'93, (Lisboa, 25-26 de Fevereiro de 1993), pp.33-8.
Barreiro, Anabela, Maria de Jesus Pereira & Diana Santos.: "Critérios e opções linguísticas no desenvolvimento do Palavroso, um sistema computacional de descrição morfológica do português", Relatório INESC num. RT/54-93, Dezembro de 1993.
Santos, Diana.: "Português Computacional", Actas do Congresso Internacional sobre o Português (Lisboa, 11-15 de Abril de 1994), Vol. 3, pp.167-184.

See also my activity in the project Computational Processing of Portuguese.

Semantics

I hold the following standpoints regarding semantics:

There is no such thing as language independent meaning.
Contrastive data (of the translation variety) are one of the best sources for semantics.
Vagueness is the most important property of natural language, and should be studied and modelled accordingly.

Some relevant publications are:

Santos, Diana.: "On the use of parallel texts in the comparison of languages", Actas do XI Encontro da Associação Portuguesa de Linguística (Lisboa, 2-4 de Outubro de 1995), pp.217-239.
Santos, Diana Maria de Sousa Marques Pinto dos.: "Tense and aspect in English and Portuguese: a contrastive semantical study", Tese de doutoramento, Instituto Superior Técnico, Universidade Técnica de Lisboa, Junho 1996.
Santos, Diana.: "The importance of vagueness in translation: Examples from English to Portuguese", Romansk Forum Nr. 5, Juni 1997, pp.43-69.

Corpus processing

I share the widely held belief that corpora are an excellent means of studying language, but that they are not a solution in themselves. In other words, methodological questions are among the most interesting subjects in corpus processing.

Some questions are:

Evaluation of NLP systems using corpora
Getting at a methodology for corpus-based contrastive studies
Identifying different strategies for using corpora (best examples, good-enough examples, all examples)
Tag policies versus lexicon development policies

Some relevant publications (not covering all the aspects above, though) are:

Bacelar do Nascimento, Maria Fernanda, Amália Mendes & Diana Santos.: "O corpus e a classificação sintáctica dos verbos", Actas do 1.o Encontro de Processamento de Língua Portuguesa (Escrita e Falada) - EPLP'93, (Lisboa, 25-26 de Fevereiro de 1993).
Santos, Diana.: "Bilingual alignment and tense", Proceedings of the Second Annual Workshop on Very Large Corpora (Kyoto, August 4th, 1994), extended version as INESC Report AR/10-94.
Santos, Diana.: "On grammatical translationese", in Short papers presented at the Tenth Scandinavian Conference on Computational Linguistics (Helsinki, 29-30th May 1995), compiled by Kimmo Koskenniemi, pp.59-66.

Evaluation

Evaluation has been my favourite subject since 1999: I have been working hard to bring the "evaluation contest" paradigm home to the Portuguese language processing community.

Net-based NLP services

My goal is to use the Web to make tools and language resources available, minimizing adaptation time for new users and focussing on the fundamental questions of user support.

The service for the Oslo Corpus of Bosnian Texts (OCBT) was created and implemented by me, in the framework of the net-based services provided by the Text laboratory.

A similar, though more ambitious, service is the one providing access to Portuguese corpora: the AC/DC project.

Relevant publications are:

Santos 98b: Santos, Diana. "Providing access to language resources through the World Wide Web: the Oslo Corpus of Bosnian Texts". In Antonio Rubio, Natividad Gallardo, Rosa Castro and Antonio Tejada (eds.), Proceedings of The First International Conference on Language Resources and Evaluation (Granada, 28-30 May 1998), Vol. 1, pp.475-481.
Santos 99b: Santos, Diana. "Disponibilização de corpora através da WWW". Actas do I Workshop sobre Linguística Computacional da Associação Portuguesa de Linguística (Lisboa, 25-27 de Maio de 1998), APL, 1999.

Contrastive studies

I see contrastive studies as a method for reaching a deeper understanding both of each language and of translation between them.

Basically, I am after methodologies to perform corpus-based contrastive studies. I am also interested in studying other languages' influence on my own.

In addition to my PhD thesis, relevant publications are:

Santos 97b: Santos, Diana. "O tradutês na literatura infantil traduzida em Portugal", Actas do XIII Encontro da Associação Portuguesa de Linguística (Lisboa, 1-3 de Outubro de 1997).
Santos 98c: Santos, Diana. "Perception verbs in English and Portuguese". In Johansson, Stig and Signe Oksefjell (eds.), Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, pp.319-342.
Santos 99a: Santos, Diana. "The Pluperfect in English and Portuguese: What Translation Patterns Show". In Hilde Hasselgaard & Signe Oksefjell (eds.), Out of Corpora: Studies in Honour of Stig Johansson, Amsterdam: Rodopi, pp.283-299.
Santos 99c: Santos, Diana. "Um olhar computacional sobre a tradução". Terminología y Traducción 2/99.
Santos and Oksefjell forthcoming: Santos, Diana & Signe Oksefjell. "Using a translation corpus to validate independent claims", Languages in Contrast.

Research policy in NLP

Having worked as an NLP group leader for quite a while, I am also interested in the following general questions:

How to make NLP work and data available while, at the same time, protecting both data and users?
How to avoid reinventing the wheel (as far as the lexicon, the grammar, etc. are concerned)?
How to build a working infrastructure as an NLP service?
How to minimize user ignorance as well as the time spent on user support?

After describing some problems with the way NLP research is organized in Portugal, I suggested some ways forward in a white paper, and made practical suggestions for collaborative work in several documents created in the Computational Processing of Portuguese project, now Linguateca.

Information retrieval

I believe that Web information retrieval is the best field in which to apply both NLP and evaluation techniques, in a real-world, "man in the street" context.

[Home Page | Publications ]

Last modified on 4 April 2003 by Diana Santos <Diana.Santos@sintef.no>
