Rank and frequency in Portuguese

This page provides access to the token frequency lists of the AC/DC corpora and of the other corpora, specifically Web collections, that Linguateca also makes available.

The frequency lists of the AC/DC corpora were created with the cwb-lexdecode tool from Open CWB / IMS-CWB, from Stuttgart University. Lemma and part of speech were assigned in context by PALAVRAS, the Portuguese parser by Eckhard Bick (Bick, 2000). Be warned that the lists were computed from the automatically annotated versions of the corpora, and most of them have not been revised.
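As an illustration of how rank relates to frequency, the following minimal sketch derives ranks from a table of counts. It is not the procedure used to build the lists above: the function name `rank_by_frequency`, the toy sentence, and the choice to give tied items the same rank are all assumptions made for the example.

```python
from collections import Counter


def rank_by_frequency(counts):
    """Return (rank, item, frequency) tuples, most frequent item first.

    Assumption: items with equal frequency share a rank
    (competition ranking: 1, 2, 2, 4, ...).
    """
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    previous_freq = None
    rank = 0
    for position, (item, freq) in enumerate(ordered, start=1):
        if freq != previous_freq:
            rank = position
            previous_freq = freq
        ranked.append((rank, item, freq))
    return ranked


# Toy input, not taken from any of the corpora.
tokens = "a casa de a Maria e a casa de o João".split()
ranking = rank_by_frequency(Counter(tokens))

for rank, item, freq in ranking:
    print(rank, item, freq)
```

Other conventions for ties (for instance, consecutive ranks) are equally possible; the source does not say which one the service uses.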

Frequency and rank of wordforms and lemmas in the AC/DC corpora

The following service allows one to obtain frequencies and rank of lexical items and sublexical patterns, per corpus or all together:

Some comments on the choices made:

For proper nouns, the frequency provided was obtained by dividing the initial count by the length of the proper name: for example, Universidade do Porto, which in the corpora is annotated as three words with the lemma Universidade=do=Porto, has its lemma frequency divided by three.
For the all-corpora lists, we tried not to include repeated material: for example, only the 1995 Brazilian material of CHAVE was added, since the Portuguese part was already in CETEMPúblico and the 1994 Brazilian part in NILC/São Carlos. It is nevertheless possible that smaller stretches of the same material are repeated.
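The proper-noun adjustment described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `adjusted_lemma_frequency` and the raw count of 300 are hypothetical, and the source does not say whether the published values are rounded.

```python
def adjusted_lemma_frequency(lemma, raw_count):
    """Divide the raw lemma count by the number of words in the lemma.

    Multiword lemmas join their parts with "=", as in Universidade=do=Porto.
    Each occurrence of such a name contributes one count per word, so the
    raw count is divided by the name's length.
    """
    length = len(lemma.split("="))
    return raw_count / length


# Hypothetical raw count: 300 tokens carrying the lemma, i.e. 100 occurrences
# of the three-word name.
name_frequency = adjusted_lemma_frequency("Universidade=do=Porto", 300)
word_frequency = adjusted_lemma_frequency("casa", 7)
print(name_frequency, word_frequency)
```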

Corpus	Token frequency list	Lemma frequency list
All corpora	20,7 Mb	44,4 Mb
All corpora from Portugal	15,9 Mb	30,8 Mb
All corpora from Brazil	9,3 Mb	16,5 Mb
AmostRA-NILC	280 kb	118 kb
ANCIB	1,3 Mb	997 kb
Avante!	2,1 Mb	1,5 Mb
Corpus Brasileiro	94,1 Mb	168,3 Mb
CD HAREM	511 kb	263 kb
CETEMPúblico	16,8 Mb	25,4 Mb
CHAVE	12,5 Mb	15,8 Mb
Ciência Viva	715 kb	374 kb
Colonia	2,9 Mb	932 kb
CONDIVport	2,6 Mb	1,2 Mb
CONDIVport2	328 kb	165 kb
CoNE	806 kb	495 kb
C-Oral-Brasil	230 kb	92 kb
CORDIAL-SIN	486 kb	132 kb
CorpiRef	0 b	0 b
CorTrad, lado português	1,2 Mb	572 kb
DHBB	2,2 Mb	4,2 Mb
DiaCLAV	1,9 Mb	1,7 Mb
Diáspora TL-PT	47 kb	18 kb
DisPR	361 kb	128 kb
ECI-EBR	1021 kb	409 kb
ECI-EE	69 kb	29 kb
ENPCPUB (parte em português)	211 kb	79 kb
Floresta	2,8 Mb	2,1 Mb
FrasesPB	98 kb	44 kb
FrasesPP	83 kb	37 kb
Mariano Gago	538 kb	294 kb
LeMe	741 kb	526 kb
Literateca	9,8 Mb	3,9 Mb
Marielle, presente!	438 kb	251 kb
Moçambula	177 kb	74 kb
Museu da Pessoa	711 kb	311 kb
Natura/Minho	1,2 Mb	909 kb
NOBRE	4,3 Mb	1,2 Mb
OBras	4,0 Mb	1,1 Mb
PANTERA, lado português	787 kb	253 kb
P'lo Norte	130 kb	54 kb
Português Falado - Documentos Autênticos	138 kb	56 kb
ReLi	258 kb	75 kb
NILC/São Carlos	7,2 Mb	6,8 Mb
todos juntos	83,5 Mb	38,4 Mb
Tycho Brahe	2,4 Mb	1,1 Mb
Vercial	5,6 Mb	2,2 Mb

Partial searches, for a few tokens or lemmata:

Tokens Lemmata

You can search using Perl regular expressions.

To look for multiword lemmas, use a syntax like Belo=Horizonte, Castelo=Branco.
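To show what such a search amounts to, here is a minimal offline sketch over an in-memory frequency table. It uses Python's `re` module, whose syntax overlaps with Perl's for simple patterns but is not identical. The function name `search`, the counts, and the decision to match the whole item rather than a substring are assumptions; the service itself may behave differently.

```python
import re


def search(frequencies, pattern):
    """Return (item, frequency) pairs whose item fully matches the pattern.

    Assumption: the pattern must match the whole item, not just a part of it.
    """
    regex = re.compile(pattern)
    hits = [(item, n) for item, n in frequencies.items() if regex.fullmatch(item)]
    # Most frequent first, alphabetical order among ties.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical lemma counts, for illustration only.
lemma_frequencies = {
    "Belo=Horizonte": 120,
    "Castelo=Branco": 95,
    "castelo": 40,
    "belo": 33,
}

# Multiword lemmas: two parts joined by "=".
multiword = search(lemma_frequencies, r"[A-Z]\w+=\w+")
# Anything starting with "castelo", in either case.
castelo = search(lemma_frequencies, r"[Cc]astelo.*")
print(multiword)
print(castelo)
```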

Token frequencies of Web collections

Depending on the collection, different methods were used:

For WBR99, we used the tokenization provided by the collection
For WPT03, another tokenization was used
For WPT05, the tokenization was done by the tokeniza function from the Lingua::PT::PLNbase library back in March 2009.

In no case have we attempted to remove foreign words from the lists.

Collection	No. documents	No. words	Frequency list disregarding capitalization	No. different types disregarding capitalization	Frequency list keeping original capitalization	No. different types keeping capitalization
WPT-05	9.501.202	5.856.585.035	187M (gz)	25.237.118	206M (gz)	27.861.391
WPT-03	1.529.758	1.059.436.086	55,1M (tar.gz)	6.834.451
WBR-99	5.939.061	1.915.526.098	14M (tar.gz)	2.669.965
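The difference between the two kinds of list can be illustrated with a short sketch that merges a case-sensitive frequency table into a case-insensitive one. Lowercasing with `str.lower()` is an assumption about how capitalization is disregarded, and the counts are invented for the example.

```python
from collections import Counter


def fold_case(case_sensitive_counts):
    """Merge the counts of forms that differ only in capitalization."""
    folded = Counter()
    for form, n in case_sensitive_counts.items():
        folded[form.lower()] += n
    return folded


# Hypothetical counts keeping the original capitalization.
original = Counter({"Porto": 50, "porto": 20, "PORTO": 3, "casa": 10})
folded = fold_case(original)

print(len(original), "types keeping capitalization")
print(len(folded), "types disregarding capitalization")
print(folded["porto"])
```

The total number of tokens is unchanged by folding; only the number of distinct types can shrink, which is consistent with the smaller type count in the capitalization-independent column for WPT-05.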

Frequency lists of wordforms and lemmas by part of speech in the AC/DC corpora

Corpus	Tokens	Lemmas
	ADJ ADV NUM GRAM all	ADJ ADV NUM PROP GRAM all all/pos
CorTrad, lado português
ENPCPUB (parte em português)
PANTERA, lado português
Português Falado - Documentos Autênticos
all
all/pt
all/br

All tokens not belonging to any of the other categories were classified as GRAM (grammatical words). For that reason, these lists include an additional column specifying the category assigned by the parser.
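The grouping rule can be sketched as below. This is an illustration, not the code that produced the lists: the set `OPEN_CATEGORIES` is assembled from the column headers on this page, and the example tags (PRP, DET) and counts are assumed for the example.

```python
# Assumption: the categories that have their own lists, taken from the
# column headers on this page.
OPEN_CATEGORIES = {"N", "V", "ADJ", "ADV", "NUM", "PROP"}


def list_category(parser_pos):
    """Map a parser category to the list it belongs to; anything else is GRAM."""
    return parser_pos if parser_pos in OPEN_CATEGORIES else "GRAM"


def gram_rows(entries):
    """Build GRAM rows, keeping the parser category as an additional column."""
    return [
        (form, freq, pos)
        for form, pos, freq in entries
        if list_category(pos) == "GRAM"
    ]


# Hypothetical (form, parser category, frequency) entries.
entries = [
    ("de", "PRP", 1000),
    ("o", "DET", 900),
    ("bonito", "ADJ", 12),
]
rows = gram_rows(entries)
print(rows)
```

Because GRAM gathers several parser categories, the extra column is what allows, say, prepositions and determiners to be told apart within the same list.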

Similar information can be obtained for the Portuguese parts of COMPARA and CorTrad:

Corpus Tokens Lemmas
N ADJ ADV V all N ADJ ADV V Proper names all
COMPARA (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
CorTrad jorn (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
CorTrad literary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
CorTrad culinary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b

Further information (so far only in Portuguese):

[ Examples | Tokenization | Annotation | Corpora | Acknowledgements ]

Last update: 04 July 2016.

