Rank and frequency in Portuguese

AC/DC project, Linguateca


We provide access here to the frequency lists of tokens for the AC/DC corpora and for the other corpora, specifically Web collections, also made available by Linguateca.

The frequency lists of the AC/DC corpora were created with the cwb-lexdecode tool from Open CWB / IMS-CWB, from Stuttgart University. Lemma and part of speech were assigned in context by PALAVRAS, the Portuguese parser by Eckhard Bick (Bick, 2000). Be warned that the lists were computed from the automatically annotated versions of the corpora, most of which have not been revised.
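As an illustration of how rank relates to frequency, the following minimal sketch reads a frequency list and assigns ranks. It assumes each line holds a frequency and an item separated by a tab; the layout of the downloadable files may differ. The function names are hypothetical, and giving tied items the same rank is a choice made here, not something stated on this page.

```python
from typing import Iterable, List, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse 'frequency<TAB>item' lines (an assumed layout) into (item, frequency) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, item = line.split("\t", 1)
        pairs.append((item, int(freq)))
    return pairs


def rank_items(pairs: List[Tuple[str, int]]) -> List[Tuple[int, str, int]]:
    """Return (rank, item, frequency) triples; rank 1 is the most frequent item.

    Items with equal frequency share a rank, and the next distinct
    frequency takes its position in the sorted list (1, 2, 2, 4, ...).
    """
    ordered = sorted(pairs, key=lambda p: (-p[1], p[0]))
    ranked = []
    rank = 0
    previous_freq = None
    for position, (item, freq) in enumerate(ordered, start=1):
        if freq != previous_freq:
            rank = position
            previous_freq = freq
        ranked.append((rank, item, freq))
    return ranked


# Hypothetical sample data, not taken from any of the corpora.
sample = ["10\tde", "7\ta", "7\to", "3\tque"]
for rank, item, freq in rank_items(parse_frequency_lines(sample)):
    print(rank, item, freq)
```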

Frequency and rank of wordforms and lemmas in the AC/DC corpora

The following service returns the frequency and rank of lexical items and sublexical patterns, per corpus or for all corpora together:

Some comments on the choices made:

Corpus | Token frequency list | Lemma frequency list
All corpora 20,5 Mb 44,1 Mb
All corpora from Portugal 15,9 Mb 30,7 Mb
All corpora from Brazil 9,2 Mb 16,3 Mb
AmostRA-NILC 280 kb 118 kb
ANCIB 1,3 Mb 997 kb
Avante! 2,1 Mb 1,5 Mb
Corpus Brasileiro 100,9 Mb 185,1 Mb
CD HAREM 511 kb 263 kb
CETEMPúblico 16,8 Mb 25,4 Mb
CHAVE 12,5 Mb 15,8 Mb
Ciência Viva 715 kb 374 kb
Colonia 2,9 Mb 932 kb
CONDIVport 2,6 Mb 1,2 Mb
CONDIVport2 328 kb 165 kb
CoNE 806 kb 495 kb
C-Oral-Brasil 230 kb 92 kb
CORDIAL-SIN 486 kb 132 kb
CorTrad, Portuguese side 0 b 0 b
DHBB 2,2 Mb 4,2 Mb
DiaCLAV 1,9 Mb 1,7 Mb
Diáspora TL-PT 47 kb 18 kb
DisPR 361 kb 128 kb
ECI-EBR 1021 kb 409 kb
ECI-EE 69 kb 29 kb
ENPCPUB (Portuguese part) 211 kb 79 kb
Floresta 2,8 Mb 2,1 Mb
FrasesPB 98 kb 44 kb
FrasesPP 83 kb 37 kb
Mariano Gago 538 kb 294 kb
LeMe 741 kb 526 kb
Literateca 9,7 Mb 3,9 Mb
Marielle, presente! 438 kb 251 kb
Moçambula 177 kb 74 kb
Museu da Pessoa 710 kb 311 kb
Natura/Minho 1,2 Mb 909 kb
NOBRE 4,2 Mb 1,1 Mb
OBras 4,0 Mb 1,1 Mb
PANTERA, Portuguese side 771 kb 264 kb
P'lo Norte 130 kb 54 kb
Português Falado - Documentos Autênticos 138 kb 56 kb
ReLi 258 kb 75 kb
NILC/São Carlos 7,2 Mb 6,8 Mb
todos juntos 83,5 Mb 38,4 Mb
Tycho Brahe 2,4 Mb 1,1 Mb
Vercial 5,5 Mb 2,2 Mb

Partial searches, for a few tokens or lemmata:

Tokens Lemmata

  • You can search using Perl regular expressions.
  • To look for multiword lemmas, join the words with an equals sign, as in Belo=Horizonte or Castelo=Branco.
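A search of this kind can be sketched as follows. This is only an illustration: it uses Python's re module, whose syntax overlaps largely but not completely with Perl's, and it assumes that a pattern must match the whole item unless stated otherwise, which may not be how the service behaves. The function name and the sample counts are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def search_items(pattern: str, freqs: Dict[str, int],
                 full_match: bool = True) -> List[Tuple[str, int]]:
    """Return (item, frequency) pairs whose item matches the pattern.

    With full_match=True the pattern must cover the whole item (an
    assumption); otherwise a match anywhere in the item is enough.
    Results are sorted by decreasing frequency, then alphabetically.
    """
    regex = re.compile(pattern)
    matcher = regex.fullmatch if full_match else regex.search
    hits = [(item, n) for item, n in freqs.items() if matcher(item)]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical lemma counts; multiword lemmas are joined with "=".
lemma_freqs = {"Belo=Horizonte": 12, "Castelo=Branco": 8, "castelo": 30, "belo": 5}

# All two-word lemmas: "\w" does not match "=", so it separates the parts.
print(search_items(r"\w+=\w+", lemma_freqs))

# Lemmas starting with lower-case "cast".
print(search_items(r"cast.*", lemma_freqs))
```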

    Token frequencies of Web collections

    Depending on the collection, different methods were used:
    • For WBR99, we used the tokenization provided by the collection.
    • For WPT03, another tokenization was used.
    • For WPT05, tokenization was done with the tokeniza function from the Lingua::PT::PLNbase library, in March 2009.

    In no case have we attempted to remove foreign words from the lists.

    Collection | No. documents | No. words | Frequency list disregarding capitalization | No. different types disregarding capitalization | Frequency list keeping original capitalization | No. different types keeping capitalization
    WPT-05 9.501.202 5.856.585.035 187M (gz) 25.237.118 206M (gz) 27.861.391
    WPT-03 1.529.758 1.059.436.086 55,1M (tar.gz) 6.834.451
    WBR-99 5.939.061 1.915.526.098 14M (tar.gz) 2.669.965
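The difference between the two kinds of list can be illustrated with a short sketch that merges word forms differing only in capitalization. It assumes that "disregarding capitalization" amounts to lower-casing every form before counting; the folding actually applied to the collections is not described here, and the sample counts are hypothetical.

```python
from collections import Counter


def fold_case(freqs: Counter) -> Counter:
    """Merge the counts of forms that differ only in capitalization."""
    folded = Counter()
    for form, n in freqs.items():
        folded[form.lower()] += n
    return folded


# Hypothetical counts keeping the original capitalization.
original = Counter({"Lisboa": 4, "lisboa": 1, "LISBOA": 2, "casa": 3})
folded = fold_case(original)

# The number of word occurrences is unchanged, while the number of
# different types can only stay the same or shrink.
print(len(original), len(folded))
print(sum(original.values()), sum(folded.values()))
```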

    Frequency lists of wordforms and lemmas by part of speech in the AC/DC corpora

    Corpus | Tokens: N, ADJ, ADV, V, NUM, GRAM, all | Lemmas: N, ADJ, ADV, V, NUM, PROP, GRAM, all, all/pos
    AmostRA-NILC 67 kb 30 kb 4 kb 64 kb 1 kb 6 kb 280 kb 50 kb 20 kb 4 kb 19 kb 2 kb 20 kb 2 kb 118 kb 133 kb
    ANCIB 253 kb 114 kb 14 kb 205 kb 50 kb 20 kb 1,3 Mb 133 kb 50 kb 11 kb 34 kb 56 kb 716 kb 6 kb 997 kb 1,1 Mb
    Avante! 382 kb 233 kb 35 kb 563 kb 39 kb 28 kb 2,1 Mb 212 kb 93 kb 31 kb 51 kb 54 kb 1,1 Mb 14 kb 1,5 Mb 1,6 Mb
    Corpus Brasileiro 19,7 Mb 6,9 Mb 427 kb 7,6 Mb 11,8 Mb 1,3 Mb 100,9 Mb 17,7 Mb 5,1 Mb 360 kb 2,2 Mb 15,2 Mb 145,4 Mb 208 kb 185,1 Mb 196,5 Mb
    CD HAREM 111 kb 48 kb 7 kb 97 kb 7 kb 7 kb 511 kb 77 kb 30 kb 5 kb 23 kb 7 kb 119 kb 2 kb 263 kb 291 kb
    CETEMPúblico 2,5 Mb 1,2 Mb 140 kb 2,7 Mb 1,1 Mb 192 kb 16,8 Mb 1,6 Mb 420 kb 111 kb 141 kb 1,3 Mb 21,9 Mb 184 kb 25,4 Mb 27,2 Mb
    CHAVE 2,0 Mb 1018 kb 107 kb 2,1 Mb 887 kb 121 kb 12,5 Mb 1,1 Mb 337 kb 88 kb 123 kb 1009 kb 13,3 Mb 97 kb 15,8 Mb 17,0 Mb
    Ciência Viva 144 kb 85 kb 12 kb 146 kb 11 kb 10 kb 715 kb 92 kb 41 kb 10 kb 26 kb 13 kb 191 kb 3 kb 374 kb 414 kb
    Colonia 539 kb 285 kb 41 kb 1,0 Mb 10 kb 31 kb 2,9 Mb 286 kb 110 kb 35 kb 86 kb 17 kb 398 kb 7 kb 932 kb 1,0 Mb
    CONDIVport 510 kb 312 kb 41 kb 656 kb 49 kb 37 kb 2,6 Mb 246 kb 116 kb 33 kb 59 kb 68 kb 712 kb 10 kb 1,2 Mb 1,3 Mb
    CONDIVport2 73 kb 30 kb 5 kb 65 kb 6 kb 6 kb 328 kb 50 kb 18 kb 4 kb 17 kb 6 kb 69 kb 2 kb 165 kb 185 kb
    CoNE 160 kb 65 kb 8 kb 105 kb 37 kb 29 kb 806 kb 83 kb 30 kb 6 kb 22 kb 40 kb 315 kb 4 kb 495 kb 542 kb
    C-Oral-Brasil 55 kb 19 kb 3 kb 49 kb 3 kb 12 kb 230 kb 39 kb 12 kb 2 kb 14 kb 4 kb 20 kb 2 kb 92 kb 108 kb
    CORDIAL-SIN 117 kb 29 kb 3 kb 155 kb 1 kb 12 kb 486 kb 63 kb 14 kb 2 kb 21 kb 4 kb 24 kb 3 kb 132 kb 151 kb
    CorTrad, Portuguese side 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    DHBB 360 kb 197 kb 22 kb 455 kb 93 kb 27 kb 2,2 Mb 263 kb 81 kb 19 kb 44 kb 101 kb 3,7 Mb 31 kb 4,2 Mb 4,5 Mb
    DiaCLAV 366 kb 190 kb 23 kb 470 kb 44 kb 24 kb 1,9 Mb 208 kb 79 kb 19 kb 46 kb 55 kb 1,3 Mb 16 kb 1,7 Mb 1,8 Mb
    Diáspora TL-PT 9 kb 3 kb 1 kb 10 kb 579 b 4 kb 47 kb 8 kb 2 kb 1 kb 3 kb 590 b 2 kb 1 kb 18 kb 22 kb
    DisPR 68 kb 47 kb 9 kb 109 kb 2 kb 7 kb 361 kb 47 kb 25 kb 9 kb 20 kb 2 kb 22 kb 2 kb 128 kb 142 kb
    ECI-EBR 217 kb 119 kb 16 kb 271 kb 5 kb 12 kb 1021 kb 137 kb 61 kb 14 kb 41 kb 13 kb 141 kb 4 kb 409 kb 458 kb
    ECI-EE 15 kb 9 kb 2 kb 14 kb 1 kb 3 kb 69 kb 11 kb 5 kb 2 kb 5 kb 1 kb 2 kb 1 kb 29 kb 33 kb
    ENPCPUB (Portuguese part) 45 kb 21 kb 5 kb 54 kb 993 b 6 kb 211 kb 34 kb 15 kb 4 kb 15 kb 873 b 8 kb 1 kb 79 kb 90 kb
    Floresta 554 kb 268 kb 30 kb 571 kb 49 kb 76 kb 2,8 Mb 372 kb 138 kb 26 kb 79 kb 52 kb 1,4 Mb 8 kb 2,1 Mb 2,3 Mb
    FrasesPB 27 kb 10 kb 2 kb 18 kb 433 b 4 kb 98 kb 21 kb 7 kb 2 kb 8 kb 395 b 2 kb 1 kb 44 kb 50 kb
    FrasesPP 21 kb 9 kb 2 kb 15 kb 495 b 3 kb 83 kb 17 kb 7 kb 2 kb 6 kb 491 b 2 kb 1 kb 37 kb 42 kb
    Mariano Gago 110 kb 58 kb 11 kb 132 kb 6 kb 11 kb 538 kb 73 kb 31 kb 9 kb 23 kb 7 kb 147 kb 4 kb 294 kb 322 kb
    LeMe 152 kb 109 kb 8 kb 100 kb 31 kb 15 kb 741 kb 96 kb 52 kb 7 kb 23 kb 37 kb 302 kb 21 kb 526 kb 589 kb
    Literateca 1,6 Mb 879 kb 119 kb 3,4 Mb 38 kb 103 kb 9,7 Mb 798 kb 284 kb 99 kb 193 kb 72 kb 2,5 Mb 19 kb 3,9 Mb 4,4 Mb
    Marielle, presente! 91 kb 42 kb 6 kb 95 kb 9 kb 10 kb 438 kb 55 kb 23 kb 5 kb 20 kb 10 kb 134 kb 3 kb 251 kb 275 kb
    Moçambula 40 kb 18 kb 4 kb 40 kb 1 kb 6 kb 177 kb 29 kb 12 kb 4 kb 13 kb 1 kb 13 kb 1 kb 74 kb 83 kb
    Museu da Pessoa 161 kb 69 kb 10 kb 192 kb 4 kb 11 kb 710 kb 98 kb 36 kb 8 kb 30 kb 6 kb 133 kb 4 kb 311 kb 350 kb
    Natura/Minho 238 kb 122 kb 16 kb 252 kb 32 kb 15 kb 1,2 Mb 137 kb 56 kb 13 kb 36 kb 41 kb 625 kb 7 kb 909 kb 984 kb
    NOBRE 692 kb 421 kb 65 kb 1,5 Mb 11 kb 77 kb 4,2 Mb 340 kb 146 kb 55 kb 99 kb 20 kb 524 kb 9 kb 1,1 Mb 1,3 Mb
    OBras 671 kb 394 kb 57 kb 1,5 Mb 13 kb 33 kb 4,0 Mb 320 kb 133 kb 51 kb 93 kb 22 kb 521 kb 8 kb 1,1 Mb 1,3 Mb
    PANTERA, Portuguese side 146 kb 76 kb 13 kb 242 kb 3 kb 11 kb 771 kb 106 kb 44 kb 11 kb 37 kb 4 kb 58 kb 3 kb 264 kb 297 kb
    P'lo Norte 29 kb 12 kb 3 kb 26 kb 1 kb 5 kb 130 kb 21 kb 8 kb 2 kb 8 kb 1 kb 10 kb 1 kb 54 kb 61 kb
    Português Falado - Documentos Autênticos 33 kb 12 kb 3 kb 35 kb 544 b 5 kb 138 kb 25 kb 8 kb 2 kb 10 kb 2 kb 5 kb 2 kb 56 kb 64 kb
    ReLi 44 kb 27 kb 6 kb 51 kb 1006 b 45 kb 258 kb 30 kb 15 kb 4 kb 14 kb 1 kb 10 kb 2 kb 75 kb 88 kb
    NILC/São Carlos 1,2 Mb 622 kb 64 kb 1,3 Mb 596 kb 60 kb 7,2 Mb 601 kb 215 kb 52 kb 96 kb 664 kb 5,3 Mb 37 kb 6,8 Mb 7,4 Mb
    todos juntos 4,5 Mb 2,4 Mb 244 kb 25,2 Mb 2,8 Mb 92,0 Mb 83,5 Mb 2,7 Mb 741 kb 179 kb 445 kb 2,2 Mb 32,3 Mb 50,6 Mb 38,4 Mb 99,2 Mb
    Tycho Brahe 487 kb 222 kb 34 kb 724 kb 16 kb 43 kb 2,4 Mb 278 kb 94 kb 27 kb 80 kb 30 kb 662 kb 7 kb 1,1 Mb 1,3 Mb
    Vercial 914 kb 495 kb 67 kb 2,0 Mb 27 kb 47 kb 5,5 Mb 458 kb 169 kb 58 kb 122 kb 43 kb 1,4 Mb 12 kb 2,2 Mb 2,5 Mb
    all 4,8 Mb 2,3 Mb 239 kb 5,5 Mb 2,0 Mb 438 kb 20,5 Mb 2,9 Mb 782 kb 186 kb 321 kb 2,3 Mb 37,9 Mb 298 kb 44,1 Mb 59,7 Mb
    all/pt 3,7 Mb 1,8 Mb 204 kb 4,7 Mb 1,2 Mb 321 kb 20,5 Mb 2,2 Mb 606 kb 159 kb 258 kb 1,5 Mb 26,2 Mb 230 kb 30,7 Mb 41,8 Mb
    all/br 2,3 Mb 1,1 Mb 123 kb 2,8 Mb 943 kb 173 kb 9,2 Mb 1,2 Mb 401 kb 100 kb 184 kb 1,0 Mb 13,4 Mb 86 kb 16,3 Mb 22,1 Mb

    All tokens not belonging to any of the other categories were classified as GRAM (grammatical words). For that reason, these lists include an additional column specifying the category assigned by the parser.
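The fallback to GRAM can be sketched as below. The set of categories with their own list follows the column headers of the table above, but the parser tags used in the sample (such as PRP or KS) and the function names are assumptions for illustration; the actual mapping used to build the lists may differ.

```python
from typing import FrozenSet, List, Tuple

# Categories that have their own list; taken from the table headers above.
LIST_CATEGORIES: FrozenSet[str] = frozenset({"N", "ADJ", "ADV", "V", "NUM", "PROP"})


def list_category(parser_pos: str,
                  categories: FrozenSet[str] = LIST_CATEGORIES) -> str:
    """Map a parser category to a list category, falling back to GRAM."""
    return parser_pos if parser_pos in categories else "GRAM"


def gram_rows(entries: List[Tuple[str, str, int]]) -> List[Tuple[str, str, int]]:
    """Keep (item, parser category, frequency) rows that fall into GRAM.

    The parser category stays in the row, mirroring the additional
    column that the GRAM lists carry.
    """
    return [(item, pos, n) for item, pos, n in entries
            if list_category(pos) == "GRAM"]


# Hypothetical entries: (item, category assigned by the parser, frequency).
entries = [("casa", "N", 10), ("de", "PRP", 50), ("que", "KS", 30)]
print(gram_rows(entries))
```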

    Similar information can be obtained for the Portuguese parts of COMPARA and CorTrad:

    Corpus | Tokens: N, ADJ, ADV, V, all | Lemmas: N, ADJ, ADV, V, Proper names, all
    COMPARA (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad jorn (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad literary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad culinary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b

    Further information (so far only in Portuguese):

    [ Examples | Tokenization | Annotation | Corpora | Acknowledgements ]


    Last update: 4 July 2016.
    We would like to receive your feedback:
    Comments, requests and suggestions