Main interests
Diana Santos
My main interests in brief.
Machine Translation
I was responsible for the development of PORTUGA (Mentor/P), a
broad-coverage MT prototype from English to Portuguese. Its
development took place at the IBM-INESC Scientific Group, 1987 - 1989.
- The project resulted in an MT shell, written in PLNLP, and an
instantiation for the English to Portuguese pair.
- The shell was also used in a pilot project at IBM Norway (MT from
Norwegian Bokmål to Nynorsk).
- Its most interesting idea was the treatment of idioms and lexical
gaps.
Some relevant publications are:
- Santos, Diana.
- "Lexical gaps and idioms in Machine Translation", Hans Karlgren (ed.), Proceedings of COLING'90 (Helsinki, August 1990), Vol 2, pp.330-5.
- Santos, Diana.
- "Broad-coverage machine translation", in K. Jensen, G. Heidorn & S. Richardson, Natural Language Processing: The PLNLP Approach, Kluwer Academic Press, 1992.
Computational processing of Portuguese
Unfortunately, Portuguese is well behind the other major languages of
the world as far as its computational processing is concerned.
My efforts in the field have included:
- Corpus processing
- gathering an initial sentence corpus,
- studying some of its properties,
- annotating it with major parts of speech,
- developing a corpus browser
- Creation of a morphological analyser of original design: Palavroso
- Some studies in sentence separation using only morphological clues (no
lexicon)
- Development of various grammar fragments for Portuguese
(unfortunately, only at an initial stage)
- Studies of computational lexica for Portuguese
- Studies of the alignment of Portuguese with English
- Studies of the variation between European and Brazilian Portuguese
Some relevant publications are:
- Medeiros, José Carlos, Rui Marques & Diana Santos.
- "Português Quantitativo", Actas do 1.o Encontro de Processamento de Língua Portuguesa (Escrita e Falada) - EPLP'93, (Lisboa, 25-26 de Fevereiro de 1993), pp.33-8.
- Barreiro, Anabela, Maria de Jesus Pereira & Diana Santos.
- "Critérios e
opções linguísticas no desenvolvimento do Palavroso, um sistema
computacional de descrição morfológica do português", Relatório INESC
num. RT/54-93, Dezembro de 1993.
- Santos, Diana.
- "Português Computacional", Actas
do Congresso Internacional sobre o Português (Lisboa, 11-15 de
Abril de 1994), Vol. 3, pp.167-184.
See also my activity in the project Computational Processing of Portuguese.
I hold the following standpoints regarding semantics:
- There is no such thing as language independent meaning.
- Contrastive data (of the translation variety) are one of the best sources for semantics.
- Vagueness is the most important property of natural language, and should be studied and modelled accordingly.
Some relevant publications are:
- Santos, Diana.
- "On the use of parallel texts in the comparison of
languages", Actas do XI Encontro da Associação Portuguesa de
Linguística (Lisboa, 2-4 de Outubro de 1995), pp.217-239.
- Santos, Diana Maria de Sousa Marques Pinto dos.
- "Tense and aspect in English and Portuguese: a
contrastive semantical study", Tese de doutoramento, Instituto
Superior Técnico, Universidade Técnica de Lisboa, Junho 1996.
- Santos, Diana.
- "The importance of vagueness in translation:
Examples from English to Portuguese",
Romansk Forum Nr. 5, Juni 1997, pp.43-69.
I hold the (widely held) belief that corpora are an excellent means of
looking at language, but that they are not a solution in themselves.
In other words, methodological questions are among the most
interesting subjects of corpus processing.
Some questions are:
- Evaluation of NLP systems using corpora
- Getting at a methodology for corpus-based contrastive studies
- Identifying different strategies for using corpora (best examples,
good-enough examples, all examples)
- Tag policies versus lexicon development policies
Some relevant publications (not covering all the aspects above, though) are:
- Bacelar do Nascimento, Maria
Fernanda, Amália Mendes & Diana Santos.
- "O corpus e a classificação
sintáctica dos verbos", Actas do 1.o Encontro de Processamento de
Língua Portuguesa (Escrita e Falada) - EPLP'93, (Lisboa, 25-26 de
Fevereiro de 1993).
- Santos, Diana.
- "Bilingual alignment and tense", Proceedings of the
Second Annual Workshop on Very Large Corpora (Kyoto, August 4th,
1994), extended version as INESC Report AR/10-94.
- Santos, Diana.
- "On grammatical translationese", in Short papers
presented at the Tenth Scandinavian Conference on Computational
Linguistics (Helsinki, 29-30th May 1995), compiled by Kimmo
Koskenniemi, pp.59-66.
My favourite subject since 1999: I have been working hard to bring the "evaluation contest" paradigm home to the Portuguese language processing community.
My aim is to use the Web to make tools and language resources available, minimizing adaptation time for new users and focussing on the fundamental questions of user support.
I created and implemented the service for the Oslo Corpus of Bosnian Texts (OCBT), in the framework of the net-based services provided by the Text laboratory.
A similar, though more ambitious, service is the one providing access to
Portuguese corpora, the AC/DC project.
Relevant publications are:
- Santos 98b
- Santos, Diana. "Providing access to language
resources through the World Wide Web: the Oslo Corpus of Bosnian
Texts". In Antonio Rubio, Natividad Gallardo, Rosa Castro and Antonio Tejada (eds.),
Proceedings of The First International Conference on
Language Resources and Evaluation (Granada, 28-30 May 1998), Vol. 1, pp.475-481.
- Santos 99b
- Santos, Diana. "Disponibilização de corpora através da WWW". Actas do I Workshop sobre Linguística Computacional da Associação Portuguesa de Linguística (Lisboa, 25-27 de Maio de 1998), APL, 1999.
I see contrastive studies as a method to get at a deeper understanding both of each language and of translation between them.
Basically, I am after methodologies to perform corpus-based contrastive studies. I am also interested in studying other languages' influence on my own.
In addition to my PhD thesis, relevant publications are:
- Santos 97b
- Santos, Diana. "O tradutês na literatura infantil
traduzida em Portugal",
Actas do XIII Encontro da Associação Portuguesa de
Linguística (Lisboa, 1-3 de Outubro de 1997).
- Santos 98c
- Santos, Diana. "Perception verbs in English and
Portuguese". In Johansson, Stig and Signe Oksefjell (eds.), Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi,
pp.319-342.
- Santos 99a
- Santos, Diana. "The Pluperfect in English and Portuguese:
What Translation Patterns Show". In Hilde Hasselgaard & Signe Oksefjell (eds.), Out of Corpora: Studies in Honour of Stig Johansson, Amsterdam: Rodopi, pp.283-299.
- Santos 99c
- Santos, Diana. "Um olhar computacional sobre a
tradução". Terminología y Traducción 2/99.
- Santos and Oksefjell forthcoming
- Santos, Diana & Signe Oksefjell. "Using a translation corpus to validate
independent claims", Languages in Contrast.
Having worked as an NLP group leader for quite a
while, I am also interested in the following general questions:
- How to make NLP work and data available while, at the
same time, protecting both data and users?
- How to avoid reinventing the wheel (as far as the lexicon, the
grammar, etc. are concerned)?
- How to build a working infrastructure as an NLP service?
- How to minimize user ignorance as well as the time spent on user
support?
After describing some problems with the way NLP research is
organized in Portugal, I suggested some ways forward in a white paper and
made practical suggestions for collaborative work in several documents created in the Computational Processing of Portuguese project, now Linguateca.
I believe that Web information retrieval is the best field in which to apply both NLP and evaluation techniques, in a real-world, "man in the street" context.
Last modified on 4 April 2003 by Diana Santos <Diana.Santos@sintef.no>