A Proposal for Founding a National Language Technology Program in Finland

1998

1. Language technology

Language technology is the technology for digital processing of spoken or written human language. A central field of research in advancing language technology is computational linguistics, which combines the linguistic understanding of language structure with information technology and mathematical methods as well as with the constantly increasing computational and memory capacity of contemporary computers.

Language technology is needed when one wants to facilitate and intensify computer processing of human language. Language technology was first applied in automatic text processing, for automatic hyphenation and spell checking. Important current application fields of language technology are speech analysis and generation as well as information retrieval from large document collections.

David Nahamoo, a researcher at IBM, declared in the February 1998 issue of Business Week his belief in an impending breakthrough of language technology: ‘Without question, 1998 will be the year of natural-language products...’. Bill Gates, the chief executive of Microsoft, likewise announced in October 1997 in connection with ITxpo97 that research and development at Microsoft will concentrate in the near future on natural interface technologies, i.e. speech recognition and other language technologies.

The engagement and investment in language technology by one of the largest information technology firms in the world will bring language technology within the reach of millions of people in the coming years.

2. A survey of language technology in Finland

The national public funding agency for technology, the Tekes Technology Development Centre, granted on January 26, 1998 a grant of 122 000 FIM to the national computation centre, CSC-Centre for Scientific Computing Ltd, for a survey and feasibility project on language technology in Finland. CSC formed a steering group, whose members were Prof. Kimmo Koskenniemi, the chairman (University of Helsinki), Kaisa Häkkinen (Åbo Akademi, the Academy of Finland), Panu Korhonen (Nokia Research Center), Ulla Lehtiniemi, substituted by Taru Kuhanen (Tieto Corporation), Aimo Maanvilja (Research Center of Helsinki Telephone Company), Annu Jylhä-Pyykönen (Ministry of Education), Prof. Mikko Sams (Helsinki University of Technology), Matti Sihto (Tekes), Juha Telkkinen (Promentor Solutions Oy) and Hannele Vihermaa (Alma Media Oyj). Manne Miettinen (CSC) acted as the secretary of the steering group. The meetings of the steering group were also attended by the chairmen of the work groups of the survey seminar held on June 11: Prof. Lauri Carlson (University of Helsinki), Prof. Timo Honkela (University of Art and Design), substituted by Aarno Lehtola (VTT State Technology Research Centre / Information Technology), Prof. Jussi Karlgren (University of Helsinki, SICS), Kaarina Nazarenko (Sanoma Oy) and Prof. Kari Sajavaara (University of Jyväskylä).

The survey resulted in the publication of a 66-page report, Kieliteknologia Suomessa (Language Technology in Finland, CSC R02/98). The report briefly evaluates the state of the art and the facilities for language technology in Finland and presents an outline of the research and commercial work done in Finland. The report also proposes the foundation of a national program for language technology.

As a part of the survey, a theme seminar on language technology was organized on June 11, 1998. Over a hundred delegates from language technology research, language technology firms and firms that apply and exploit language technology participated in the seminar. The guest speaker of the seminar was Giovanni B. Varile, the head of the language technology unit at DG XIII of the Commission of the European Communities.

During the discussion in the seminar, the considerably wide range of activities in language technology research and commercial applications was noted, but also its scattered nature. The speakers found it important and desirable to start a common program to promote the collaboration of different parties and to exploit language technology more widely. The present proposal for founding a national language technology program is based on the report and the results of the seminar.

3. What do we need language technology for?

The need for language technology is closely connected with the ongoing social transformation that has already partly materialized in the form of the information society. A significant factor in this process is the fast growth of the digital information on offer. The social and economic well-being of citizens and companies in the information society depends essentially on how well they can exploit this information, most of which is expressed in domestic or foreign languages.

Language technology is a key technology in an equal information society, because it makes information more accessible and enables people to use their own language when using computers and information systems. Thanks to it, ordinary citizens who are not familiar with information technology can also master digital information in their mother tongue or in some other natural language.

Without language technology, digital information coded in the form of natural language is in danger of remaining unexploited to a great extent. This would weaken industrial and commercial efficiency and competitiveness and would limit the quality of life and the realization of reforms to improve the openness of society.

In the survey, seven fields or areas of language technology were defined, on which the national language technology program would be based:

1. Management of documents by their linguistic content poses many challenges and tasks for language technology, the most important of which are developing new information retrieval systems, automatic document classification and automatic summarization. Developing and refining these methods becomes all the more pressing as the size and number of documents increase. Those who benefit include companies whose products involve extensive documentation or large masses of text, as well as ordinary people, who can by these means find the information they need more easily.

2. Computer aided translation tools improve the quantity and quality of translation. The use of such tools is already quite widespread and significant, and its importance will obviously increase in the future. Localisation, or the adaptation of programs or systems originally designed in one language to local languages, is needed more and more, and there is a clear need for automating the tasks involved.

3. Computer aided language learning and electronic dictionaries are very important in an integrated Europe. Today’s technology offers excellent opportunities to create internationally important products in this quickly growing field.

4. Natural language user interfaces would facilitate the communication between man and machine especially in more complex applications where graphic interfaces do not work well. With these interfaces it is possible to offer services and products to a very wide range of users.

5. Speech signal processing is becoming one of the most important fields of language technology, because it is connected to many contemporary products and services, among others mobile phones. Many services demand high quality speech recognition and speech synthesis in order to reach a wide range of users.

6. Common corpora for linguistics and language technology are necessary tools and materials for developing these technologies and applications. Development of robust applications requires very large research databases, because linguistic variation is much more complex than we intuitively realize. In addition to developing commercial products, large text and speech corpora are needed to act as a foundation for the basic research required for future products.

7. Computer assisted text creation and editing, such as checking orthography, spelling, grammar and readability, aims to increase the productivity of text creation from the user’s point of view and, from the point of view of language technology companies, to generate new products for domestic and international markets.

4. The basis and need for a language technology program

Language technology has been researched in Finland since 1970. The research work has been successful, and in many fields international renown has been achieved. Different sectors of language technology have been explored in different universities and research centers by researchers from research groups of different fields.

On the basis of research results, language technology firms have been founded, some of which specialise in export and others mainly in domestic markets. The standard of methods and products of language technology can be considered internationally very high: for instance, Microsoft’s biggest language module supplier is a Finnish company, Lingsoft Oy.

In addition, some companies exploiting language technology, such as publishers and telecommunication companies, have made a good start in applying language technology to their own production processes and products.

The generation of new research processes and the application of already existing technology have been hampered by the scattered nature of language technology research in Finland. Because of its dispersed nature, language technology has not been regarded as an independent field, and the possibilities of its different sub-fields have not been known well enough among its potential users, or even among researchers in its other sub-fields. Low visibility is possibly the reason why even the recent large national research programs of information and communication technology have paid scant attention to the language technological point of view.

A national technology program would offer a suitable framework for coordinating research and development work in Finland. Within this framework, investments by different funding agencies could be channeled so that the Academy of Finland would finance basic research programs through its own channels, while Tekes would fund more immediate industrial research and development projects.

Finland takes an active part in the European Community’s research programs, where language technology has had an important role for a couple of decades already. Many countries have set up a national language technology program in the last few years, which is what the Commission particularly wishes to happen in Finland as well. The national research program would considerably enhance the opportunities and facilities to exploit EC related language technology research programs in the future.

5. Aim of the program

A successful language technology program would produce e.g. the following results and improvements:

6. Organization of the program

The national language technology program would consist of the following parts:

For each sub-area of the language technology program mentioned above, a super project would be founded when possible, which would bring together partners in its sub-area and would coordinate the research driven projects in its area. Each super project would have its own steering group, which would manage, coordinate and follow the work done in it.

Each super project would be in charge of the development of communication and cooperation between research groups, language technology companies and applications in its sub-area. The super projects should have sufficient resources to organize the meetings etc. that this kind of integration requires.

The super projects would be built up from tentative short statements of intent into actual project proposals, for instance so that the steering group of the language technology program would give recommendations if required about the extension of the parties involved or the definition of the tasks for the super project before drawing up the final project plan. Tekes would process and accept the super projects according to its own procedures.

The steering groups of the super projects would in turn receive project proposals and would guide the grouping and planning of sub-projects. On the basis of the proposals, new partners can be taken on to the super projects as far as the funding allows. Proposals concerning larger wholes will be worked into full project plans submitted to Tekes. The super projects will report to the technology program steering group with appropriate frequency.

Projects can also be adopted into the program on the basis of project plans submitted directly to Tekes, without the above-mentioned procedure. This option concerns especially product development projects, because the contents of their project plans are confidential (except for the title).

The language technology program would probably have mostly research driven projects initially, then mixed projects, and at a later stage mostly product development projects. Certain product development projects are very likely to be ready to start up at the outset of the program, while other sub-areas would first require research projects.

7. Funding of the program

The technology program could start at the beginning of the year 1999 and it could last three years. The total volume of the program would be about 80 million FIM, of which research and product development projects funded by Tekes could be about 60% or 50 million FIM. The program could include projects which get financing for instance from the research funding of the Academy of Finland, the Nordic Council of Ministers, the European Commission, and other institutions.

The projects of the language technology program would be divided into three main groups:

The share funded by Tekes could be divided so that about 60% of the volume would be allocated to research oriented projects and the rest to industrial projects. In terms of sub-areas, the distribution of the total financing volume of Tekes, companies, and other financiers could approximately go along the following guidelines (in units of 1000 FIM):

 

                                    basic research  applied research  product development  total
document management                           5700              4500                 5000  15200
translation tools                             2700              2500                 2000   7200
language learning and dictionaries            3600              2500                 3000   9100
natural language user interfaces              1800              1000                 2000   4800
speech signal processing                     13200             12000                10000  35200
common corpora                                1800              3000                    0   4800
writer’s aids                                 1200               500                 1500   3200
coordination                                     0                 0                    0    500
total (1000 FIM)                             30000             26000                23500  80000
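
The row and column totals of the table can be cross-checked with a short script. The following is a minimal Python sketch that is not part of the original proposal: the variable names are illustrative assumptions, and the figures (in units of 1000 FIM) are copied from the table. The coordination share of 500 appears only in the total column, so it is added separately.

```python
# Figures in units of 1000 FIM, copied from the allocation table.
# Each tuple holds (basic research, applied research, product development).
# All names below are illustrative and do not come from the proposal.
allocation = {
    "document management": (5700, 4500, 5000),
    "translation tools": (2700, 2500, 2000),
    "language learning and dictionaries": (3600, 2500, 3000),
    "natural language user interfaces": (1800, 1000, 2000),
    "speech signal processing": (13200, 12000, 10000),
    "common corpora": (1800, 3000, 0),
    "writer's aids": (1200, 500, 1500),
}
COORDINATION = 500  # listed only in the total column of the table

# Sum across each row (per sub-area) and down each column (per project type).
row_totals = {area: sum(cols) for area, cols in allocation.items()}
column_totals = tuple(sum(col) for col in zip(*allocation.values()))
grand_total = sum(row_totals.values()) + COORDINATION

for area, total in row_totals.items():
    print(f"{area:36s}{total:>8d}")
print(f"{'column totals':36s}{column_totals}")
print(f"{'grand total incl. coordination':36s}{grand_total:>8d}")
```

Run as written, the sums agree with the printed totals: 30000, 26000 and 23500 for the three project types, and 80000 overall once coordination is included.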

 

The quality of the individual project plans and the impact and importance of the proposed projects would naturally decide the concrete allocation of money to different sub-areas. Neither the Academy of Finland, the Nordic Council of Ministers nor the other financiers have made any decisions or commitments concerning language technology at this stage.

8. Sub-areas of the program

The national language technology program would consist of seven thematic sub-areas which are introduced below. The presentation of sub-areas is based on the recommendations of the final report of the feasibility study of the project and on the discussions and project ideas arising after the publication of the report. Initial project proposals are being collected within the scope of each sub-area. At the time of writing this text, the collection is still underway, and some of the proposed topics will be listed and described in a separate appendix.

8.1 Management of documents by linguistic content

The most important research topics in this sub-area are information retrieval, recognition of the language of the document, document classification, text condensation, machine-aided and automatic generation of hypertext and links, tools for producing terminology, and term recognition. Possible cooperation partners include scientific libraries and projects connected to FinElib (Finnish Electronic Library) program, Neural Networks Research Centre, Helsinki University of Technology, Research Unit for Multilingual Language Technology (University of Helsinki), Research center of Helsinki Telephone Company, VTT Information Technology, Department of Information Research (University of Tampere), Nokia Mobile Phones, Kone Oyj, Alma Media Oyj, Conexor oy, Sonera, Trantex oy, Republica oy, CSC Oy, Kielikone Oy, Lingsoft Oy.

8.2 Translation tools and localisation

In this sub-area, research is carried out and applied on computer assisted translation and on the production and management of multilingual documents. An important field of application is localisation tools, i.e. computer programs which expedite the localisation of computer programs for different languages and cultural areas. Possible national cooperation partners include the Department of Translation Studies (University of Helsinki), Trantex Oy, Brossco Oy, Lanser Data Oy, Lingsoft Oy, Conexor oy, Kielikone Oy, PasaNet Oy, Nokia Mobile Phones, Department of International Communication (University of Joensuu), Instrumentarium, TSK-Center of Technical Terminology ry, VTT Information Technology, and Institute of Digital Media (Technical University of Tampere).

8.3 Computer aided language learning and dictionaries

In the projects of this sub-area, research and development work is done on methods and applications connected to machine-aided language learning. Applicable language technology includes techniques for word inflection and syntactic analysis.

Information technology can be used in language education in many ways: in the actual teaching process, in creating study materials for both teacher directed and self study, as well as in the assessment of language skills. Computer corpora should also be exploited in language education both in teaching and in creating study materials.

Possible national cooperation partners include Kielikone Oy, Lanser Data Oy, Lingsoft Oy, Conexor oy, Centre for Applied Language Studies (University of Jyväskylä), Teleste Educational Oy, Laboratory of Computational Engineering (Helsinki University of Technology), Marketing Institute, Research Center of Helsinki Telephone Company, Gurusoft Oy, Pitchsystems Oy and Timehouse Oy.

8.4 Natural language user interfaces

In this sub-area research is done on speech interfaces, the use of controlled languages and modelling user interaction. National cooperation partners could include the Research Unit for Multilingual Language Technology (University of Helsinki), Department of Computer Science (University of Tampere), Research group on human multimodal information processing (Laboratory of Acoustics and sound processing, Laboratory of Computational Engineering at the Helsinki University of Technology), Nokia Research Center, Media Lab (University of Art and Design), Research Center of Helsinki Telephone Company, Conexor oy, Lingsoft Oy, VTT Information Technology, Alma Media Oyj, Dataestradi, Kielikone Oy.

8.5 Speech signal processing

Advanced speech recognition and speech synthesis are expected to drastically change the interaction between man and machine and between people. In the projects of this sub-area, research is done on recognizing and generating audiovisual speech (speech synthesis). Producing high quality individual and expressive speech by machine is extremely demanding, and the same is true of speech recognition when the number of speakers, the vocabulary and the topic are unlimited. Connecting traditional rule based language technology with speech technology concentrated on signal processing is an interesting challenge for the near future.

Possible national cooperation partners include the Neural Networks Research Centre (Helsinki University of Technology), Research Unit for Multilingual Language Technology (University of Helsinki), Department of Applied Physics (University of Turku), Acoustics Laboratory (Helsinki University of Technology), Laboratory of Computational Engineering (Helsinki University of Technology), Department of Phonetics (University of Helsinki), Department of Foreign Languages (University of Joensuu), Department of Signal Processing (Technical University of Tampere), Research Center of Helsinki Telephone Company, Lingsoft Oy, Gurusoft Oy, Pitchsystems Oy and Timehouse Oy.

8.6 Common corpora of linguistics and language technology

The aim of this sub-area is to develop a language resource center which serves linguistics and language technology. The center would collect electronic text corpora, speech databases, computer based dictionaries and analysis programs.

The projects of this sub-area could on the one hand collect through negotiation already existing national and foreign corpora for nation-wide use, and on the other hand start up large joint projects for generating more and new kinds of corpora. Although there is a shortage of all kinds of language material, there is a particular lack of spoken language material (audiovisual speech corpora and an up to date large spoken language corpus) and of multilingual (Finnish-some other language) parallel corpora.

National cooperation parties could include CSC-Centre for Scientific Computing Oy, Center of Domestic Languages, the Department of General Linguistics (University of Helsinki), Laboratory of Computational Engineering (Helsinki University of Technology) and the different language departments in Finland’s universities. Publishers, for instance Alma Media Oyj, Sanoma-WSOY, Otava and Edita, and the Library of the central coalition of visually disabled have significant quantities of computer readable research materials, which should be made available for academic and commercial research use.

Foreign cooperation partners could include the Linguistic Data Consortium (USA), Språkbanken (Sweden) and the EC funded ELRA (European Language Resources Association).

8.7 Writer's tools

The projects of this sub-area research and develop, among other things, tools to check grammar and to evaluate the readability of text, and extend more traditional methods like orthography and spell checking to new languages. Writer’s tools are the most traditional and most common application field of language technology, but the opportunities of this field have by no means been exhausted yet.

Possible national cooperation partners are, for instance, Research Unit for Multilingual Language Technology (University of Helsinki), different language departments, Lanser Data Oy, Conexor oy, Lingsoft Oy, Kielikone Oy, Sanoma-WSOY, Alma Media Oyj, Kone Oyj, Republica Oy.

 

Appendix 1. Participants of the seminar

 

Document management

Ahonen, Helena UH

Airola, Anu UH

Heinonen, Oskari UH

Hyppönen, Olli UH

Häkkinen, Kaisa Åbo Akademi/Academy of Finland

Juntunen, Jukka-Pekka Kielikone Oy

Karlgren, Jussi UH

Kostiainen, Kaisa VTT Information Technology

Lagus, Krista HUT

Lahti, Maria Sonera

Lahtinen, Timo UH

Laine, Anna Trantex Oy

Lounela, Mikko DomLang

Marjamäki, Kaija Nokia Mobile Phones

Murtomaa, Eeva UH

Ojanen, Eetu Republica Oy

Pekkarinen, Päivi TerKko

Pietiläinen, Pirkko U Oulu

Pitkänen, Kari K UH

Siltanen, Pekka VTT Information Technology

Vanhanen, Eleonoora Lanser Data Oy

Virtanen, Liisa UH

Yli-Jyrä, Anssi M. CSC

Ylinen, Markku Alma Media Oyj

Translator’s tools

Arnola, Harri Kielikone Oy

Blåberg, Olli Lanser Data Oy

Carlson, Lauri U Helsinki

Eriksson, Eira Lingsoft Oy

Läärä-Inutile, Päivi Instrumentarium

Niemelä, Merja Nokia Mobile Phones

Nykänen, Olli TSK ry

Piha, Sanna Trantex Oy

Reiman, Juhani PasaNet Oy

Romppainen, Birgitta Stockholms Universitet

Sarolahti, Pasi Instrumentarium

Sorva, Juha Instrumentarium

Tenni, Jarno VTT Information Technology

Tirkkonen-Condit, Sonja U Joensuu

Voutilainen, Atro Conexor oy

Computer aided language learning

Majorin, Ari Lingsoft Oy

Mäki-Knuutila, Liisa U Jyväskylä

Sajavaara, Kari U Jyväskylä

Stenman, Ulla Tampere UT

Telkkinen, Juha Promentor Solutions Ltd

Vasankari, Timo Teleste Educational Oy

Åminne, Rigmor Markkinointi-instituutti

Natural language interfaces

Boda, Peter Nokia Research Center

Hagelin, Ritva UH

Hakkarainen, Anni STAKES

Honkela, Timo U Art & Design

Kommonen, Kari-Hans U Art & Design

Kurronen, Joni-Pekka Dataestradi

Lehtola, Aarno Nokia Research Center

Sihto, Matti Tekes

Turunen, Markku U Tampere

Vihermaa, Hannele Alma Media Oyj

Speech signal processing

Alku, Paavo U Turku

Koppinen, Konsta Tampere UT

Lindén, Krister Lingsoft Oy

Sams, Mikko HUT

Toivonen, Raimo Pitchsystems Oy

Tuuli, Raimo Gurusoft Oy

Vainio, Martti UH

Werner, Stefan U Joensuu

Common corpora

Helasvuo, Marja-Liisa UH

Järvikivi, Juhani U Joensuu

Kalliokuusi, Virpi TSK ry

Lehtinen, Outi DomLang

Mauranen, Anna U Joensuu

Miettinen, Manne CSC

Niemi, Jussi U Joensuu

Piitulainen, Jussi UH

Rahikainen, Tarmo DomLang

Ryhänen, Pasi Lingsoft Oy

Salmisuo, Sari UH

Suihkonen, Pirkko UH/Academy of Finland

Writer’s tools

Arppe, Antti Lingsoft Oy

Heimonen, Esko Lanser Data Oy

Järvinen, Timo Conexor oy

Mäenpää, Jarmo Kone Oyj

Nazarenko, Kaarina Sanoma Oy

Saarikoski, Harri Republica Oy

Turpeinen, Marko Alma Media Oyj

Westerlund, Fredrik Lingsoft Oy

Present at seminar but not at group discussions

Koskenniemi, Kimmo UH

Laitinen, Sauli VTT Information Services

Launonen, Raimo VTT Information Technology

Lehti, Merja VTT Information Services

Lehtiniemi, Ulla TT Information Services

Luoto-Halvari, Anna Ministry of Education

Pirkola, Ari U Tampere

Pääsky, Timo Inforcon Oy

Räsänen, Maija Rautaruukki Oyj

Appendix 2. Project ideas

The project ideas listed below were presented in the discussions that arose after the June seminar on the mailing list kt-info@listserv.funet.fi. They are only sketches, and they do not represent the interests of the different parties and fields equally. The list of ideas nevertheless testifies to a lively interest in language technology and in the language technology program under consideration.