Rank and frequency in Portuguese

AC/DC project, Linguateca


We give access here to the frequency lists of tokens from the AC/DC corpora and from the other corpora, specifically Web collections, also made available by Linguateca.

The frequency lists of the AC/DC corpora were created with the cwb-lexdecode tool from Open CWB / IMS-CWB from Stuttgart University. Lemma and part of speech were assigned in context by PALAVRAS, the Portuguese parser by Eckhard Bick (Bick, 2000). Be warned that the lists were computed from the automatically annotated versions of the corpora, most of which have not been revised.

Frequency and rank of wordforms and lemmas in the AC/DC corpora

The following service allows one to obtain frequencies and rank of lexical items and sublexical patterns, per corpus or all together:

Some comments on the choices made:

Corpus | Token frequency list | Lemma frequency list
All corpora 0 b 43,4 Mb
All corpora from Portugal 15,8 Mb 29,9 Mb
All corpora from Brazil 9,3 Mb 16,5 Mb
AmostRA-NILC 281 kb 120 kb
ANCIB 1,4 Mb 1,0 Mb
Avante! 2,1 Mb 1,5 Mb
Corpus Brasileiro 103,7 Mb 186,9 Mb
CD HAREM 511 kb 263 kb
CETEMPúblico 16,7 Mb 24,0 Mb
CHAVE 12,6 Mb 16,0 Mb
Ciência Viva 739 kb 398 kb
Colonia 2,9 Mb 1,1 Mb
CONDIVport 2,6 Mb 1,3 Mb
CONDIVport2 330 kb 168 kb
CoNE 798 kb 512 kb
C-Oral-Brasil 230 kb 99 kb
CORDIAL-SIN 486 kb 157 kb
DHBB 2,1 Mb 3,9 Mb
DiaCLAV 1,9 Mb 1,7 Mb
Diáspora TL-PT 47 kb 19 kb
ECI-EBR 1022 kb 434 kb
ECI-EE 71 kb 32 kb
ENPCPUB (Portuguese part) 211 kb 82 kb
Floresta 2,8 Mb 2,1 Mb
FrasesPB 98 kb 44 kb
FrasesPP 83 kb 38 kb
Mariano Gago 542 kb 297 kb
LeMe 756 kb 545 kb
Literateca 9,9 Mb 4,9 Mb
Marielle, presente! 441 kb 252 kb
Moçambula 176 kb 76 kb
Museu da Pessoa 711 kb 328 kb
Natura/Minho 1,2 Mb 924 kb
NOBRE 4,1 Mb 1,5 Mb
OBras 4,0 Mb 1,4 Mb
PANTERA, Portuguese side 771 kb 264 kb
P'lo Norte 131 kb 55 kb
Português Falado - Documentos Autênticos 138 kb 57 kb
ReLi 259 kb 80 kb
NILC/São Carlos 7,3 Mb 7,1 Mb
todos juntos 171,2 Mb 211,9 Mb
Tycho Brahe 1,2 Mb 545 kb
Vercial 5,8 Mb 2,7 Mb
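
As an illustration of how frequency and rank relate in lists like these, the following is a minimal sketch in Python. It does not read the downloadable files, whose exact layout is not described here; it simply counts a toy token sequence. The function name `rank_by_frequency` and the choice to give equally frequent items the same rank are assumptions made for this example, not a description of how the AC/DC service assigns ranks.

```python
from collections import Counter


def rank_by_frequency(freqs):
    """Map each item to (frequency, rank).

    Rank 1 is the most frequent item. Items with the same frequency
    share the rank of the first of them (an assumption of this sketch).
    """
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = {}
    prev_freq = None
    rank = 0
    for position, (item, freq) in enumerate(ordered, start=1):
        if freq != prev_freq:
            rank = position
            prev_freq = freq
        ranked[item] = (freq, rank)
    return ranked


# Toy token sequence, invented for the example.
tokens = "a casa de a Maria e a casa de o Rui".split()
ranked = rank_by_frequency(Counter(tokens))

for item, (freq, rank) in sorted(ranked.items(), key=lambda kv: kv[1][1]):
    print(rank, freq, item)
```

With shared ranks, the four tokens that occur once all receive rank 4, since three items precede them in the ordering.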

Partial searches, for a few tokens or lemmata:

Tokens Lemmata

  • You can search using Perl regular expressions.
  • To look for multiword lemmas, use a syntax like Belo=Horizonte, Castelo=Branco.
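
The two points above can be sketched offline in Python. This is only an approximation: Python's `re` module is used in place of Perl (the two agree on common patterns but are not identical), the function `search_items` is hypothetical, the counts are invented, and whether the service matches whole items or substrings is not stated here, so whole-item matching is an assumption.

```python
import re


def search_items(freqs, pattern):
    """Return (item, frequency) pairs whose item matches the whole pattern,
    most frequent first."""
    regex = re.compile(pattern)
    hits = [(item, n) for item, n in freqs.items() if regex.fullmatch(item)]
    return sorted(hits, key=lambda kv: (-kv[1], kv[0]))


# Invented lemma counts; multiword lemmas join their parts with "=".
lemma_freqs = {
    "Belo=Horizonte": 12,
    "Castelo=Branco": 7,
    "castelo": 30,
    "belo": 18,
}

# All multiword lemmas: something, an "=", something.
multiword = search_items(lemma_freqs, r".+=.+")

# Every lemma starting with "castelo", in either capitalization.
castelos = search_items(lemma_freqs, r"[Cc]astelo.*")

print(multiword)
print(castelos)
```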

    Token frequencies of Web collections

    Depending on the collection, different methods were used:
    • For WBR99, we used the tokenization provided by the collection
    • For WPT03, another tokenization was used
    • For WPT05, the tokenization was done with the tokeniza function from the Lingua::PT::PLNbase library, in March 2009.

    In no case have we attempted to remove foreign words from the lists.

    Collection | No. documents | No. words | Frequency list disregarding capitalization | No. different types disregarding capitalization | Frequency list keeping original capitalization | No. different types keeping capitalization
    WPT-05 9.501.202 5.856.585.035 187M (gz) 25.237.118 206M (gz) 27.861.391
    WPT-03 1.529.758 1.059.436.086 55,1M (tar.gz) 6.834.451
    WBR-99 5.939.061 1.915.526.098 14M (tar.gz) 2.669.965
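
The difference between the two kinds of list (and why the number of different types drops when capitalization is disregarded) can be shown with a small sketch. It assumes that "disregarding capitalization" amounts to lowercasing each wordform and summing the counts; the function `fold_case` and the counts are invented for the example, and the procedure actually used for the Web collections may differ.

```python
from collections import Counter


def fold_case(freqs):
    """Merge the counts of wordforms that differ only in capitalization."""
    folded = Counter()
    for form, n in freqs.items():
        folded[form.lower()] += n
    return folded


# Invented counts keeping the original capitalization.
cased = Counter({"Casa": 2, "casa": 5, "CASA": 1, "Porto": 4, "porto": 3})
folded = fold_case(cased)

print("types keeping capitalization:", len(cased))
print("types disregarding capitalization:", len(folded))
print(dict(folded))
```

The total number of tokens is unchanged by the folding; only the number of distinct types decreases.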

    Frequency lists of wordforms and lemmas by part of speech in the AC/DC corpora

    Corpus | Tokens (N, ADJ, ADV, V, NUM, GRAM, all) | Lemmas (N, ADJ, ADV, V, NUM, PROP, GRAM, all, all/pos)
    AmostRA-NILC 66 kb 30 kb 4 kb 64 kb 2 kb 6 kb 281 kb 51 kb 21 kb 4 kb 19 kb 2 kb 20 kb 1 kb 120 kb 135 kb
    ANCIB 271 kb 112 kb 13 kb 200 kb 54 kb 99 kb 1,4 Mb 195 kb 59 kb 10 kb 40 kb 56 kb 707 kb 4 kb 1,0 Mb 1,1 Mb
    Avante! 376 kb 234 kb 31 kb 568 kb 40 kb 16 kb 2,1 Mb 242 kb 116 kb 25 kb 61 kb 52 kb 1,0 Mb 3 kb 1,5 Mb 1,6 Mb
    Corpus Brasileiro 20,7 Mb 7,0 Mb 359 kb 7,0 Mb 11,9 Mb 905 kb 103,7 Mb 18,7 Mb 5,4 Mb 310 kb 2,5 Mb 15,2 Mb 145,5 Mb 89 kb 186,9 Mb 198,3 Mb
    CD HAREM 111 kb 48 kb 7 kb 97 kb 7 kb 7 kb 511 kb 77 kb 30 kb 5 kb 23 kb 7 kb 119 kb 2 kb 263 kb 291 kb
    CETEMPúblico 2,4 Mb 1,3 Mb 111 kb 2,8 Mb 1,1 Mb 120 kb 16,7 Mb 1,8 Mb 738 kb 86 kb 290 kb 1,3 Mb 19,8 Mb 8 kb 24,0 Mb 25,5 Mb
    CHAVE 2,0 Mb 1,0 Mb 90 kb 2,2 Mb 912 kb 88 kb 12,6 Mb 1,4 Mb 575 kb 69 kb 240 kb 1008 kb 12,8 Mb 7 kb 16,0 Mb 17,1 Mb
    Ciência Viva 153 kb 85 kb 11 kb 146 kb 13 kb 19 kb 739 kb 111 kb 48 kb 9 kb 27 kb 15 kb 187 kb 2 kb 398 kb 436 kb
    Colonia 552 kb 292 kb 38 kb 1023 kb 8 kb 24 kb 2,9 Mb 383 kb 152 kb 32 kb 114 kb 14 kb 407 kb 5 kb 1,1 Mb 1,2 Mb
    CONDIVport 504 kb 317 kb 38 kb 658 kb 52 kb 34 kb 2,6 Mb 337 kb 164 kb 30 kb 74 kb 64 kb 689 kb 4 kb 1,3 Mb 1,5 Mb
    CONDIVport2 71 kb 30 kb 5 kb 65 kb 6 kb 7 kb 330 kb 53 kb 19 kb 4 kb 17 kb 6 kb 68 kb 1 kb 168 kb 187 kb
    CoNE 162 kb 65 kb 7 kb 103 kb 39 kb 25 kb 798 kb 106 kb 33 kb 6 kb 23 kb 42 kb 302 kb 2 kb 512 kb 558 kb
    C-Oral-Brasil 55 kb 20 kb 3 kb 48 kb 3 kb 10 kb 230 kb 44 kb 14 kb 2 kb 14 kb 4 kb 20 kb 2 kb 99 kb 115 kb
    CORDIAL-SIN 118 kb 30 kb 3 kb 156 kb 1 kb 10 kb 486 kb 80 kb 18 kb 2 kb 23 kb 4 kb 24 kb 2 kb 157 kb 176 kb
    DHBB 345 kb 193 kb 20 kb 456 kb 103 kb 15 kb 2,1 Mb 226 kb 101 kb 15 kb 49 kb 108 kb 3,5 Mb 3 kb 3,9 Mb 4,2 Mb
    DiaCLAV 353 kb 191 kb 21 kb 474 kb 47 kb 14 kb 1,9 Mb 222 kb 95 kb 17 kb 54 kb 56 kb 1,3 Mb 3 kb 1,7 Mb 1,8 Mb
    Diáspora TL-PT 9 kb 3 kb 1 kb 10 kb 565 b 3 kb 47 kb 8 kb 2 kb 1 kb 3 kb 567 b 3 kb 1 kb 19 kb 22 kb
    ECI-EBR 215 kb 120 kb 16 kb 270 kb 5 kb 10 kb 1022 kb 149 kb 69 kb 12 kb 44 kb 13 kb 145 kb 2 kb 434 kb 482 kb
    ECI-EE 15 kb 9 kb 2 kb 14 kb 1 kb 3 kb 71 kb 12 kb 6 kb 2 kb 5 kb 1 kb 3 kb 1 kb 32 kb 35 kb
    ENPCPUB (Portuguese part) 45 kb 21 kb 5 kb 54 kb 1020 b 6 kb 211 kb 35 kb 15 kb 4 kb 15 kb 1 kb 10 kb 1 kb 82 kb 93 kb
    Floresta 554 kb 268 kb 30 kb 571 kb 49 kb 76 kb 2,8 Mb 372 kb 138 kb 26 kb 79 kb 52 kb 1,4 Mb 8 kb 2,1 Mb 2,3 Mb
    FrasesPB 26 kb 10 kb 2 kb 18 kb 451 b 4 kb 98 kb 22 kb 8 kb 2 kb 8 kb 412 b 2 kb 1 kb 44 kb 50 kb
    FrasesPP 21 kb 9 kb 2 kb 15 kb 514 b 3 kb 83 kb 17 kb 7 kb 2 kb 6 kb 490 b 2 kb 1 kb 38 kb 42 kb
    Mariano Gago 109 kb 58 kb 10 kb 135 kb 7 kb 9 kb 542 kb 76 kb 34 kb 8 kb 24 kb 7 kb 145 kb 2 kb 297 kb 325 kb
    LeMe 149 kb 110 kb 8 kb 98 kb 33 kb 33 kb 756 kb 108 kb 72 kb 6 kb 25 kb 37 kb 282 kb 20 kb 545 kb 595 kb
    Literateca 1,8 Mb 966 kb 105 kb 3,3 Mb 39 kb 55 kb 9,9 Mb 1,3 Mb 545 kb 88 kb 380 kb 68 kb 2,6 Mb 12 kb 4,9 Mb 5,4 Mb
    Marielle, presente! 90 kb 42 kb 6 kb 94 kb 11 kb 11 kb 441 kb 61 kb 25 kb 5 kb 21 kb 12 kb 127 kb 2 kb 252 kb 277 kb
    Moçambula 39 kb 18 kb 4 kb 40 kb 1 kb 5 kb 176 kb 30 kb 12 kb 3 kb 13 kb 1 kb 13 kb 1 kb 76 kb 85 kb
    Museu da Pessoa 159 kb 69 kb 9 kb 192 kb 4 kb 9 kb 711 kb 110 kb 40 kb 7 kb 32 kb 6 kb 134 kb 2 kb 328 kb 367 kb
    Natura/Minho 230 kb 123 kb 15 kb 252 kb 34 kb 12 kb 1,2 Mb 148 kb 66 kb 12 kb 41 kb 41 kb 617 kb 3 kb 924 kb 997 kb
    NOBRE 724 kb 453 kb 57 kb 1,5 Mb 11 kb 42 kb 4,1 Mb 506 kb 238 kb 49 kb 160 kb 20 kb 521 kb 6 kb 1,5 Mb 1,6 Mb
    OBras 716 kb 403 kb 52 kb 1,5 Mb 15 kb 25 kb 4,0 Mb 481 kb 199 kb 44 kb 137 kb 23 kb 506 kb 6 kb 1,4 Mb 1,5 Mb
    PANTERA, Portuguese side 146 kb 76 kb 13 kb 242 kb 3 kb 11 kb 771 kb 106 kb 44 kb 11 kb 37 kb 4 kb 58 kb 3 kb 264 kb 297 kb
    P'lo Norte 29 kb 12 kb 3 kb 26 kb 1 kb 5 kb 131 kb 22 kb 8 kb 2 kb 8 kb 1 kb 10 kb 1 kb 55 kb 62 kb
    Português Falado - Documentos Autênticos 33 kb 12 kb 2 kb 35 kb 556 b 4 kb 138 kb 27 kb 8 kb 2 kb 10 kb 2 kb 5 kb 1 kb 57 kb 66 kb
    ReLi 44 kb 28 kb 6 kb 52 kb 1 kb 44 kb 259 kb 33 kb 17 kb 4 kb 15 kb 1 kb 11 kb 2 kb 80 kb 94 kb
    NILC/São Carlos 1,2 Mb 638 kb 57 kb 1,3 Mb 615 kb 41 kb 7,3 Mb 837 kb 334 kb 46 kb 157 kb 660 kb 5,1 Mb 6 kb 7,1 Mb 7,6 Mb
    todos juntos 22,9 Mb 8,0 Mb 430 kb 9,2 Mb 13,0 Mb 84,1 Mb 171,2 Mb 20,6 Mb 6,1 Mb 371 kb 2,9 Mb 16,5 Mb 166,2 Mb 50,2 Mb 211,9 Mb 282,1 Mb
    Tycho Brahe 247 kb 112 kb 16 kb 347 kb 4 kb 13 kb 1,2 Mb 178 kb 66 kb 13 kb 61 kb 9 kb 213 kb 3 kb 545 kb 604 kb
    Vercial 998 kb 533 kb 61 kb 2,0 Mb 27 kb 29 kb 5,8 Mb 686 kb 277 kb 50 kb 206 kb 41 kb 1,5 Mb 8 kb 2,7 Mb 3,0 Mb
    all 4,9 Mb 2,4 Mb 193 kb 5,4 Mb 2,0 Mb 402 kb 0 b 3,7 Mb 1,5 Mb 160 kb 708 kb 2,3 Mb 35,2 Mb 35 kb 43,4 Mb 59,0 Mb
    all/pt 3,7 Mb 1,9 Mb 164 kb 4,6 Mb 1,2 Mb 218 kb 0 b 2,7 Mb 1,1 Mb 131 kb 584 kb 1,5 Mb 24,0 Mb 32 kb 29,9 Mb 41,0 Mb
    all/br 2,3 Mb 1,1 Mb 103 kb 2,8 Mb 965 kb 209 kb 9,3 Mb 1,7 Mb 649 kb 86 kb 336 kb 1,0 Mb 12,8 Mb 12 kb 16,5 Mb 22,4 Mb

    All tokens not belonging to any of the other categories were classified as GRAM (grammatical words). For that reason, these lists include an additional column specifying the category assigned by the parser.
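
The fallback described above can be sketched as follows. This is a hedged illustration, not the procedure actually used: the set of categories with their own list is taken from the column headers of the table, and the parser tags in the sample data (PRP, DET, KC) and all counts are merely illustrative. The helper names `list_category` and `gram_rows` are hypothetical.

```python
# Categories that get their own list (assumed from the table headers).
OPEN_CLASSES = {"N", "ADJ", "ADV", "V", "NUM", "PROP"}


def list_category(parser_tag, open_classes=OPEN_CLASSES):
    """Return the list a tag belongs to; anything else falls back to GRAM."""
    return parser_tag if parser_tag in open_classes else "GRAM"


def gram_rows(tagged_counts):
    """Build the GRAM list, keeping the parser's category as an extra column.

    tagged_counts maps (item, parser_tag) to a frequency.
    """
    rows = [
        (item, tag, n)
        for (item, tag), n in tagged_counts.items()
        if list_category(tag) == "GRAM"
    ]
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Invented sample; the tags stand for whatever the parser assigned.
tagged = {("de", "PRP"): 10, ("casa", "N"): 4, ("o", "DET"): 8, ("e", "KC"): 6}

for item, tag, n in gram_rows(tagged):
    print(item, tag, n)
```

Because several distinct parser categories end up in the same GRAM list, the extra column is what lets a reader tell them apart.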

    Similar information can be obtained for the Portuguese parts of COMPARA and CorTrad:

    Corpus | Tokens (N, ADJ, ADV, V, all) | Lemmas (N, ADJ, ADV, V, Proper names, all)
    COMPARA (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad jorn (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad literary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad culinary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b

    Further information (so far only in Portuguese):

    [ Examples | Tokenization | Annotation | Corpora | Acknowledgements ]


    Last update: 4 July 2016.
    We would like to receive your feedback:
    Comments, requests and suggestions