Abstract
Rocha & Santos 2000
Paulo Alexandre Rocha & Diana Santos. "CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa", in Maria das Graças Volpe Nunes (ed.), Actas do V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR'2000) (Atibaia, São Paulo, Brasil, 19 a 22 de Novembro de 2000), pp. 131-140.
Translation of the title: CETEMPúblico: A large corpus of Portuguese newspaper text
This paper reports on the creation of CETEMPúblico, the largest publicly available corpus
of Portuguese to date, containing 180 million words, created to boost research in language
engineering in Portuguese. After giving some background on its creation, we focus on
the processing required and explain in detail some of the choices made, namely:
- the division of articles into extracts;
- their random reordering and numbering in the final corpus;
- the marking of structural information such as sentence boundaries, titles and
author identification;
- the use of a partial system for content classification;
- and the distribution methods.
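The first two options listed above, dividing articles into extracts and then randomly reordering and numbering them, can be illustrated with a minimal sketch. It is not the procedure actually used for CETEMPúblico: the abstract does not state the extract size or the randomization method, so the function names, the `max_paragraphs` limit and the `seed` parameter are all hypothetical choices made for illustration.

```python
import random


def split_into_extracts(paragraphs, max_paragraphs=2):
    """Split one article (a list of paragraphs) into consecutive extracts.

    The extract size is an assumption; the abstract does not specify it.
    """
    return [
        paragraphs[i:i + max_paragraphs]
        for i in range(0, len(paragraphs), max_paragraphs)
    ]


def build_corpus(articles, max_paragraphs=2, seed=0):
    """Divide every article into extracts, shuffle them, and number them.

    Returns a list of (number, extract) pairs, numbered from 1 in the
    shuffled order, so that the original article order is not preserved.
    """
    extracts = []
    for article in articles:
        extracts.extend(split_into_extracts(article, max_paragraphs))

    # A seeded generator keeps the reordering reproducible (an assumption).
    rng = random.Random(seed)
    rng.shuffle(extracts)

    return list(enumerate(extracts, start=1))
```

Calling `build_corpus([["a1", "a2", "a3"], ["b1"]])` yields three numbered extracts in shuffled order. Because the extracts are shuffled, the numbers say nothing about where an extract sat in its source article.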