CETEMPúblico: a large corpus of Portuguese newspaper language

Linguateca, the Computational processing of Portuguese follow up

CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) is a corpus containing some 180 million words in European Portuguese, created by the Computational processing of Portuguese project after an agreement between the Portuguese Ministry for Science and Technology (MCT) and the Portuguese daily newspaper PÚBLICO was signed in April 2000.

Its first version, CETEMPúblico 1.0, came into existence on 25 July 2000. See the associated Readme file.

We make the corpus available in the following different ways:

Through the AC/DC project (granting Web access to corpora), available from here
Through http download (version 1.7). Those interested, please register in the Portuguese page.
Through the Linguistic Data Consortium (LDC), CETEMPúblico Version 1.7

FAQ - Frequently Asked Questions

Who are the envisaged users of CETEMPúblico?

This corpus is mainly aimed at those who develop computer programs that process the Portuguese language and who need raw material for their work. The text versions on CD were conceived for this kind of user.

On the other hand, we want the corpus to be useful to everyone who studies the Portuguese language and wishes to check their hypotheses in previously organized text material. The online and the CQP versions are meant for such users, who are, in any case, also welcome to get it on CD in order to process the corpus locally, possibly by means of the corpus processing system of their choice.

What is PÚBLICO?

PÚBLICO is a widely read daily Portuguese newspaper. It was founded in 1991 and was the first newspaper in Portugal to make available an online edition on the Web, Publico.pt.

Are there any restrictions to the use of CETEMPúblico?

As stated in the User Conditions file, CETEMPúblico can be used for research and technological development. Only its direct commercial exploitation is not allowed.

What are my duties as a user of CETEMPúblico?

The PÚBLICO newspaper should always be acknowledged as the source of the material in any presentation of work that makes use of CETEMPúblico, such as articles, theses and talks.

A free copy of any commercial products emerging from R&D projects using CETEMPúblico should be given to the PÚBLICO newspaper.

Am I allowed to reconstruct the full newspaper articles?

No. The agreement signed between MCT and PÚBLICO required us to chop up the articles into extracts and shuffle them so that no reconstruction would be possible. The corpus is not supposed to replace the newspaper's archives.

Does CETEMPúblico include all the text published by PÚBLICO?

No. On the one hand, several editions were missing from the material provided by the newspaper, and we excluded newspaper sections not considered relevant for the goals of the corpus, such as quotations from other Portuguese newspapers ("Diz-se"), the errata section ("O PÚBLICO errou"), and sports results in table format (classifications, rankings, results, etc.). On the other hand, CETEMPúblico includes a large number of articles that were never actually published for lack of space or opportunity.

Is the language of CETEMPúblico exclusively European Portuguese?

The vast majority is Portuguese from Portugal, although there are a few texts by Brazilian and African writers.

What is included in CETEMPúblico?

The corpus includes the text of around 2,600 editions of PÚBLICO, written (stored) between 1991 and 1998, amounting to approximately 180 million words.

CETEMPúblico 1.7 contains 1,504,258 extracts (CETEMPúblico 1.0 had 1,567,625), bearing the information about section of origin and semester. Each extract is divided in paragraphs and sentences, and titles and authors are marked as such. See some examples of extracts.

How were the words counted?

Tokens containing at least one letter or digit were considered words. Punctuation marks were not considered words.
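The counting rule above can be sketched in a few lines of Python. This is only an illustration of the stated definition, not the program actually used to compute the figures; the function names are hypothetical, and treating "letter or digit" as Python's Unicode-aware str.isalnum is an assumption.

```python
def is_word(token):
    """A token counts as a word if it contains at least one letter or digit.

    Assumption: 'letter or digit' is approximated with str.isalnum,
    which also accepts accented letters such as 'ú' or 'ã'.
    """
    return any(ch.isalnum() for ch in token)


def count_words(tokens):
    """Return (number of word tokens, number of distinct word types)."""
    words = [t for t in tokens if is_word(t)]
    return len(words), len(set(words))
```

For instance, count_words applied to a tokenized sentence ignores tokens such as "," or "--" and counts repeated words once in the type total.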

Some approximate numbers (computed 2001):

	Tokens	Types
Units	229,038,019	1,033,041
Words	191,687,833	999,059
Punctuation	13,065,151	33,982

"Punctuation" includes tokens with punctuation marks, such as (1993), a) or 17:53.

Structure	Number
Extracts <ext>	1,504,258
Paragraphs <p>	2,571,735
Sentences <s>	7,082,094
Titles <t>	655,059
Authors <a>	247,392
List elements <li>	80,060

The list of tokens in CETEMPúblico is available from the AC/DC project pages (word list, lemma list).

Further quantitative information is also available from the quantitative description page, which is updated for each AC/DC corpus whenever the programs change.

What is the corpus structure?

We specify the (non-annotated) corpus structure with the help of a small BNF grammar. Terminals appear in bold:

corpus = <corpus> extract+ </corpus>
extract = extract_id extract_contents </ext>
extract_contents = paragraph+
paragraph = title | author_id | <p> sentence+ </p> | list_element
title = <t> token+ </t>
author_id = <a> token+ </a>
list_element = <li> token+ </li>
sentence = ( <s> | <s tipo=frag> ) token+ </s>
token = | palavra | sinal_pontuação | identificador
X = ( *+ ) | *+
extract_id = <ext n=number sec=sec_id sem=semester >
number = [0-9]+
sec_id = soc | pol | clt | des | opi | eco | com | clt-soc | pol-soc | nd
semester = 91a | 91b | 92a | 92b | 93a | 93b | 94a | 94b | 95a | 95b | 96a | 96b | 97a | 97b | 98a | 98b

Notes:

The parentheses and the * in the definition of X are terminals (as opposed to all other occurrences).
number ranges from 1 to 1567625 and is unique (some numbers no longer exist).
palavra (word), sinal_pontuação (punctuation mark) and identificador (identifier) in the above grammar are not further analysed (this is left to a Portuguese tokenizer).

Alternatively, we provide a DTD for SGML parsers.
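As an illustration of the extract_id rule in the grammar, the following Python sketch reads the attributes of an opening <ext> tag. It assumes the tag appears exactly as the grammar writes it (unquoted attribute values, in the order n, sec, sem); the function name and the returned dictionary are hypothetical, and for real processing the DTD with an SGML parser is the safer route.

```python
import re

# Section identifiers listed under sec_id in the grammar.
SECTIONS = {"soc", "pol", "clt", "des", "opi", "eco", "com",
            "clt-soc", "pol-soc", "nd"}

# Semesters run from 91a to 98b, so the pattern accepts 91..98 plus a or b.
EXT_RE = re.compile(r"<ext\s+n=(\d+)\s+sec=([a-z-]+)\s+sem=(9[1-8][ab])\s*>")


def parse_extract_id(tag):
    """Return the n, sec and sem attributes of an <ext ...> tag, or None."""
    m = EXT_RE.fullmatch(tag.strip())
    if m is None or m.group(2) not in SECTIONS:
        return None
    return {"n": int(m.group(1)), "sec": m.group(2), "sem": m.group(3)}
```

A tag with an unknown section or a semester outside 91a to 98b yields None rather than raising an error, which makes it easy to log and skip malformed headers.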

Do the characters strictly reflect newspaper usage?

In some cases we made normalization decisions (the original material was encoded in Macintosh characters, while we chose the ISO-8859-1 character encoding standard). Some of the changes performed are:

Long dash was transformed into "--" (a sequence of two hyphens).
All quotes are encoded as « or ».
The "oe ligature" character was transformed into the sequence of the letters O and E as usual in ISO-8859-1 encoding.
The decimal character 127 (hexadecimal 7F) was replaced by hyphen.
The (few) cases of << and >> were transformed into their one-character equivalents, namely « and ».
The characters &, < and > were translated into the corresponding SGML entities, namely &amp;, &lt; and &gt;.
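Most of the normalization steps listed above can be sketched as plain string replacements. The Python function below is a hedged illustration, not the conversion script actually used: it assumes the text has already been decoded to Unicode, that the long dash is U+2014, and it leaves out the mapping of other quote characters because the list does not say which quote becomes which guillemet. The function name is hypothetical.

```python
def normalize(text):
    """Apply the character normalizations described in the list above."""
    text = text.replace("\u2014", "--")                        # long dash -> two hyphens
    text = text.replace("\u0153", "oe").replace("\u0152", "OE")  # oe ligature
    text = text.replace("\x7f", "-")                           # decimal 127 -> hyphen
    # Double angle brackets must be handled before < and > are escaped.
    text = text.replace("<<", "\u00ab").replace(">>", "\u00bb")
    # Escape & first so the entities produced next are not escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text
```

The order of the last steps matters: replacing << and >> after escaping would never match, and escaping & last would corrupt the entities already produced.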

Is all material included in CETEMPúblico in a valid format?

Although this was not the case with the previous versions, we have checked that this is true as far as version 1.7 is concerned.

Are there other known problems in CETEMPúblico?

There are some repeated articles (and consequently repeated extracts). Although from version 1.2 on we have tried to eliminate duplicated extracts (keeping just the first), there remain in the corpus cases of slightly different articles, which we take to be different versions of the same text. See an example of similar extracts here.
Paragraphs identified as titles or authors were always joined to the previous paragraph before extract separation. The articles were divided based on the "two paragraphs" criterion, and some articles included several short news items ("Breves"). Therefore, not only are some (sub)titles separated from the news they refer to, but in some cases they were joined to a completely different piece of information. See one case of incorrect title separation here.
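The "keep just the first" policy for repeated extracts can be illustrated with a short Python sketch. It is not the procedure used to build the corpus: it only catches extracts that are identical after whitespace is collapsed (an assumption made here), so the slightly different versions mentioned above would still pass through. The function name is hypothetical.

```python
def drop_repeated_extracts(extracts):
    """Return the extracts with exact repeats removed, keeping the first one."""
    seen = set()
    kept = []
    for text in extracts:
        # Assumption: differences in whitespace alone do not make a new extract.
        key = " ".join(text.split())
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```

Detecting near-duplicates would need a similarity measure instead of an exact key, which this sketch deliberately does not attempt.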

See also our ACL'2001 paper (see below) for precision and recall on structural markup concerning titles, author identification and sentence separation.

What is "CETEMPúblico's first million" (primeiro milhão do CETEMPúblico)?

As the name indicates, it is the first subset of CETEMPúblico (the first million words), which was created under our treebank project, Floresta Sintá(c)tica, and whose sentence separation was manually revised and redone (for text containing semicolons, colons and parentheses). It does not include only the earliest text (1991); rather, it should feature a balanced selection of years 1991 through 1999 as well as all categories included in the full corpus.

Access to this first million (also annotated) is being given through our AC/DC project.

What is the annotated CETEMPúblico (CETEMPúblico anotado)?

As for all other corpora of the AC/DC project, we have annotated CETEMPúblico with the PALAVRAS parser developed by Eckhard Bick. Due to the corpus's size, the annotation was done at Eckhard Bick's VISL project premises and not at Linguateca.

Currently users can query the annotation done in 2006 through the AC/DC project interface. Note that, due to efficiency problems, users are strongly advised to use a cut clause in their concordance queries, as in [word="como" & pos="V.*"] cut 20.

Annotated CETEMPúblico of 2006 is also available for download. To get the access information please register in the Portuguese page.

Is there more information about CETEMPúblico?

You can read more about this corpus in two articles, available here in electronic form:

Paulo Rocha & Diana Santos. "CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa", in Maria das Graças Volpe Nunes (ed.), Actas do V Encontro para o processamento computacional da língua portuguesa escrita e falada, PROPOR'2000 (Atibaia, São Paulo, Brasil, 19 a 22 de Novembro de 2000), pp. 131-140, pdf
Diana Santos & Paulo Rocha. "Evaluating CETEMPúblico, a free resource for Portuguese", in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL'2001 (Toulouse, 9-11 July 2001), pp.442-449, pdf

How can I remain updated about future CETEMPúblico changes?

Whenever we learn about new problems with the corpus, we try to create patches to solve them. They will be available from CETEMPúblico's page. We will also update the corpus version to which we give access on the Web. So far (for users of version 1.0), we have made available 6 patches in Perl, named patch_cetempublico_1.0.x.pl that may be downloaded from the information page.

To remain updated about the corpus's progress, you can also subscribe to the CETEMPúblico mailing list by sending us a message. Note: you do not need to subscribe explicitly if you ordered the corpus through us, because registering to get a copy automatically adds you to this list.

Acknowledgements

At PÚBLICO, we heartily thank José Vítor Malheiros, director of the electronic version, without whom the corpus would not exist, and Paulo Almeida for technical support concerning the newspaper files.
We are grateful to Stefan Evert and Arne Fitschen (University of Stuttgart) for help and support as far as the IMS Corpus Workbench is concerned.
We thank Pedro Veiga for starting the whole project from the MCT side, as well as providing administrative facilities for the burning and distribution of the first batch of CDs.
We thank Miguel Andrade for having carried out the legal work necessary for the project.
We thank José João Dias de Almeida for valuable suggestions to handle the repeated extract problem.
We thank Andrew Cole at LDC for help in validating the corpus's SGML version.
And we thank Eckhard Bick for the many PALAVRAS-annotated versions of CETEMPúblico that he has provided us with over the years.

Last update: 10 September 2007.

