Rank and frequency in Portuguese

AC/DC project, Linguateca



Here we give access to the frequency lists of tokens for the AC/DC corpora and for the other corpora, specifically Web collections, also made available by Linguateca.

The frequency lists of the AC/DC corpora were created with the cwb-lexdecode tool from Open CWB / IMS-CWB, from Stuttgart University. Lemma and part of speech were assigned in context by PALAVRAS, the Portuguese parser by Eckhard Bick (Bick, 2000). Be warned that the lists were computed from the automatically annotated versions of the corpora, most of which have not been revised.

Frequency and rank of wordforms and lemmas in the AC/DC corpora

A dedicated service allows one to obtain the frequency and rank of lexical items and sublexical patterns, per corpus or for all corpora together.

Some comments on the choices made:

Corpus | Token frequency list | Lemma frequency list
All corpora 20,3 Mb 42,8 Mb
All corpora from Portugal 15,5 Mb 29,4 Mb
All corpora from Brazil 9,0 Mb 16,3 Mb
AmostRA-NILC 282 kb 121 kb
ANCIB 1,4 Mb 1,0 Mb
Avante! 2,1 Mb 1,5 Mb
Corpus Brasileiro 104,1 Mb 187,1 Mb
CD HAREM 511 kb 263 kb
CETEMPúblico 16,8 Mb 24,0 Mb
CHAVE 12,7 Mb 16,0 Mb
Ciência Viva 740 kb 398 kb
Colonia 2,9 Mb 1,1 Mb
CONDIVport 2,6 Mb 1,3 Mb
CONDIVport2 329 kb 165 kb
CoNE 798 kb 520 kb
C-Oral-Brasil 230 kb 99 kb
CORDIAL-SIN 438 kb 130 kb
DHBB 2,1 Mb 3,9 Mb
DiaCLAV 1,8 Mb 1,7 Mb
Diáspora TL-PT 47 kb 19 kb
ECI-EBR 1,0 Mb 430 kb
ECI-EE 71 kb 31 kb
ENPCPUB (Portuguese part) 211 kb 82 kb
Floresta 3,0 Mb 2,5 Mb
FrasesPB 98 kb 44 kb
FrasesPP 83 kb 38 kb
Mariano Gago 542 kb 297 kb
Literateca 9,3 Mb 4,6 Mb
Marielle, presente! 441 kb 252 kb
Moçambula 175 kb 77 kb
Museu da Pessoa 711 kb 335 kb
Natura/Minho 1,1 Mb 958 kb
NOBRE 3,8 Mb 1,3 Mb
OBras 3,4 Mb 1,1 Mb
PANTERA, Portuguese side 0 b 0 b
P'lo Norte 130 kb 55 kb
Português Falado - Documentos Autênticos 97 kb 41 kb
ReLi 286 kb 91 kb
NILC/São Carlos 7,3 Mb 7,1 Mb
todos juntos 169,3 Mb 212,0 Mb
Tycho Brahe 2,4 Mb 1,3 Mb
Vercial 5,6 Mb 2,6 Mb
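
Once one of the lists above has been downloaded, ranks can be derived from the frequencies. The following is only a minimal sketch: it assumes each line has the form "count, TAB, item", which may not match the actual layout of the distributed files, and the sample counts are invented for illustration. Items with the same frequency share a rank (so-called competition ranking), which is one possible convention and not necessarily the one used by the AC/DC service.

```python
from typing import Iterable


def parse_frequency_lines(lines: Iterable[str]) -> dict[str, int]:
    """Parse lines of the ASSUMED form '<count><TAB><item>' into a dict."""
    counts: dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count_text, item = line.split("\t", 1)
        counts[item] = counts.get(item, 0) + int(count_text)
    return counts


def rank_items(counts: dict[str, int]) -> list[tuple[int, str, int]]:
    """Return (rank, item, count) rows; items with equal counts share a rank."""
    ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    rows = []
    previous_count = None
    rank = 0
    for position, (item, count) in enumerate(ordered, start=1):
        if count != previous_count:
            rank = position  # a new frequency value starts a new rank
            previous_count = count
        rows.append((rank, item, count))
    return rows


# Invented toy data, not taken from any AC/DC list.
sample = ["  120\tde", "   95\ta", "   95\to", "    7\tcasa"]
rows = rank_items(parse_frequency_lines(sample))
for rank, item, count in rows:
    print(rank, item, count)
```

Here "a" and "o" both receive rank 2 and "casa" receives rank 4, because three items precede it.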

Partial searches, for a few tokens or lemmata:

Tokens Lemmata

  • You can search using Perl regular expressions.
  • To look for multiword lemmas, use a syntax like Belo=Horizonte, Castelo=Branco.
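
The two search options above can be imitated offline on a downloaded list. This sketch uses Python's re module, whose syntax is close to Perl's for simple patterns but not identical. It assumes that a pattern must match the whole item, which may differ from how the online search behaves, and the counts are invented. Multiword lemmas are written with "=" between their parts, as in the examples above.

```python
import re


def search_items(counts: dict[str, int], pattern: str) -> list[tuple[str, int]]:
    """Return (item, count) pairs whose item fully matches the pattern,
    most frequent first."""
    regex = re.compile(pattern)
    matches = [(item, count) for item, count in counts.items()
               if regex.fullmatch(item)]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


# Invented toy lemma list; real counts will differ.
toy = {
    "casa": 40,
    "casas": 12,
    "casamento": 5,
    "belo": 7,
    "Belo=Horizonte": 9,
    "Castelo=Branco": 3,
}

print(search_items(toy, r"casas?"))   # singular or plural form only
print(search_items(toy, r".+=.+"))    # every multiword lemma
print(search_items(toy, r"Belo=Horizonte"))
```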

    Token frequencies of Web collections

    Depending on the collection, different methods were used:
    • For WBR99, we used the tokenization provided by the collection.
    • For WPT03, another tokenization was used.
    • For WPT05, tokenization was done with the tokeniza function from the Lingua::PT::PLNbase library, in March 2009.

    In no case have we attempted to remove foreign words from the lists.

    Collection | No. documents | No. words | Frequency list disregarding capitalization | No. different types disregarding capitalization | Frequency list keeping original capitalization | No. different types keeping capitalization
    WPT-05 9.501.202 5.856.585.035 187M (gz) 25.237.118 206M (gz) 27.861.391
    WPT-03 1.529.758 1.059.436.086 55,1M (tar.gz) 6.834.451
    WBR-99 5.939.061 1.915.526.098 14M (tar.gz) 2.669.965
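
The difference between the two type counts in the table comes from merging forms that differ only in capitalization. The sketch below illustrates the idea on invented data; it assumes that "disregarding capitalization" means simple lowercasing, which the source does not state.

```python
from collections import Counter


def fold_case(counts: Counter) -> Counter:
    """Merge the counts of forms that differ only in capitalization.
    Lowercasing is an ASSUMPTION about how capitalization was disregarded."""
    folded = Counter()
    for form, count in counts.items():
        folded[form.lower()] += count
    return folded


# Invented toy frequency list keeping original capitalization.
cased = Counter({"Lisboa": 10, "lisboa": 2, "LISBOA": 1, "casa": 8, "Casa": 3})
folded = fold_case(cased)

print("types keeping capitalization:", len(cased))
print("types disregarding capitalization:", len(folded))
print("tokens (unchanged):", sum(cased.values()), sum(folded.values()))
```

The number of tokens stays the same under folding; only the number of distinct types can shrink.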

    Frequency lists of wordforms and lemmas by part of speech in the AC/DC corpora

    Corpus | Tokens: N, ADJ, ADV, V, NUM, GRAM, all | Lemmas: N, ADJ, ADV, V, NUM, PROP, GRAM, all, all/pos
    AmostRA-NILC 67 kb 31 kb 4 kb 64 kb 1 kb 6 kb 282 kb 51 kb 22 kb 4 kb 20 kb 2 kb 20 kb 2 kb 121 kb 136 kb
    ANCIB 266 kb 111 kb 13 kb 198 kb 53 kb 99 kb 1,4 Mb 182 kb 61 kb 10 kb 40 kb 56 kb 712 kb 4 kb 1,0 Mb 1,1 Mb
    Avante! 372 kb 234 kb 31 kb 567 kb 41 kb 15 kb 2,1 Mb 241 kb 118 kb 26 kb 61 kb 50 kb 1,1 Mb 3 kb 1,5 Mb 1,7 Mb
    Corpus Brasileiro 20,7 Mb 7,1 Mb 359 kb 7,1 Mb 11,9 Mb 915 kb 104,1 Mb 18,9 Mb 5,6 Mb 325 kb 2,5 Mb 15,2 Mb 145,5 Mb 87 kb 187,1 Mb 198,7 Mb
    CD HAREM 111 kb 48 kb 7 kb 97 kb 7 kb 7 kb 511 kb 77 kb 30 kb 5 kb 23 kb 7 kb 119 kb 2 kb 263 kb 291 kb
    CETEMPúblico 2,4 Mb 1,3 Mb 111 kb 2,8 Mb 1,1 Mb 123 kb 16,8 Mb 1,8 Mb 741 kb 85 kb 290 kb 1,3 Mb 19,8 Mb 8 kb 24,0 Mb 25,5 Mb
    CHAVE 2,0 Mb 1,1 Mb 90 kb 2,2 Mb 912 kb 90 kb 12,7 Mb 1,4 Mb 601 kb 69 kb 241 kb 1008 kb 12,8 Mb 7 kb 16,0 Mb 17,1 Mb
    Ciência Viva 152 kb 85 kb 11 kb 147 kb 13 kb 20 kb 740 kb 110 kb 49 kb 9 kb 27 kb 15 kb 187 kb 2 kb 398 kb 436 kb
    Colonia 545 kb 305 kb 38 kb 1,0 Mb 8 kb 24 kb 2,9 Mb 377 kb 162 kb 32 kb 115 kb 14 kb 409 kb 5 kb 1,1 Mb 1,2 Mb
    CONDIVport 482 kb 314 kb 37 kb 646 kb 50 kb 36 kb 2,6 Mb 311 kb 166 kb 31 kb 74 kb 60 kb 684 kb 5 kb 1,3 Mb 1,4 Mb
    CONDIVport2 71 kb 30 kb 5 kb 65 kb 6 kb 8 kb 329 kb 50 kb 19 kb 4 kb 17 kb 6 kb 68 kb 2 kb 165 kb 185 kb
    CoNE 162 kb 65 kb 7 kb 103 kb 39 kb 26 kb 798 kb 112 kb 33 kb 6 kb 23 kb 42 kb 303 kb 2 kb 520 kb 566 kb
    C-Oral-Brasil 54 kb 20 kb 3 kb 48 kb 3 kb 10 kb 230 kb 44 kb 14 kb 2 kb 14 kb 4 kb 20 kb 2 kb 99 kb 116 kb
    CORDIAL-SIN 95 kb 22 kb 3 kb 154 kb 1 kb 10 kb 438 kb 61 kb 12 kb 2 kb 20 kb 4 kb 24 kb 2 kb 130 kb 145 kb
    DHBB 345 kb 194 kb 20 kb 474 kb 102 kb 15 kb 2,1 Mb 227 kb 101 kb 15 kb 54 kb 108 kb 3,5 Mb 3 kb 3,9 Mb 4,2 Mb
    DiaCLAV 341 kb 189 kb 21 kb 469 kb 48 kb 13 kb 1,8 Mb 216 kb 97 kb 17 kb 53 kb 54 kb 1,3 Mb 3 kb 1,7 Mb 1,9 Mb
    Diáspora TL-PT 9 kb 3 kb 1 kb 10 kb 565 b 3 kb 47 kb 8 kb 2 kb 1 kb 3 kb 574 b 3 kb 1 kb 19 kb 22 kb
    ECI-EBR 215 kb 121 kb 16 kb 271 kb 5 kb 12 kb 1,0 Mb 147 kb 70 kb 13 kb 44 kb 9 kb 146 kb 3 kb 430 kb 477 kb
    ECI-EE 15 kb 9 kb 2 kb 14 kb 1 kb 3 kb 71 kb 11 kb 6 kb 2 kb 5 kb 1 kb 3 kb 1 kb 31 kb 35 kb
    ENPCPUB (Portuguese part) 44 kb 21 kb 5 kb 55 kb 1005 b 6 kb 211 kb 35 kb 15 kb 4 kb 15 kb 1 kb 10 kb 1 kb 82 kb 93 kb
    Floresta 631 kb 301 kb 33 kb 640 kb 56 kb 20 kb 3,0 Mb 430 kb 155 kb 29 kb 87 kb 59 kb 1,7 Mb 8 kb 2,5 Mb 2,7 Mb
    FrasesPB 26 kb 10 kb 2 kb 18 kb 451 b 4 kb 98 kb 22 kb 8 kb 2 kb 8 kb 419 b 2 kb 1 kb 44 kb 50 kb
    FrasesPP 21 kb 9 kb 2 kb 15 kb 514 b 3 kb 83 kb 17 kb 7 kb 2 kb 6 kb 490 b 2 kb 1 kb 38 kb 42 kb
    Mariano Gago 109 kb 58 kb 10 kb 136 kb 7 kb 9 kb 542 kb 76 kb 34 kb 8 kb 24 kb 7 kb 145 kb 2 kb 297 kb 325 kb
    Literateca 1,7 Mb 942 kb 104 kb 3,9 Mb 59 kb 110 kb 9,3 Mb 1,2 Mb 522 kb 84 kb 403 kb 66 kb 2,4 Mb 12 kb 4,6 Mb 5,2 Mb
    Marielle, presente! 90 kb 42 kb 6 kb 95 kb 11 kb 11 kb 441 kb 61 kb 25 kb 5 kb 21 kb 12 kb 127 kb 2 kb 252 kb 277 kb
    Moçambula 39 kb 18 kb 4 kb 40 kb 1 kb 5 kb 175 kb 30 kb 12 kb 3 kb 13 kb 1 kb 14 kb 1 kb 77 kb 86 kb
    Museu da Pessoa 157 kb 69 kb 9 kb 193 kb 4 kb 9 kb 711 kb 109 kb 40 kb 7 kb 32 kb 5 kb 142 kb 2 kb 335 kb 373 kb
    Natura/Minho 220 kb 121 kb 14 kb 251 kb 36 kb 11 kb 1,1 Mb 145 kb 67 kb 12 kb 40 kb 40 kb 653 kb 2 kb 958 kb 1,0 Mb
    NOBRE 679 kb 425 kb 53 kb 1,4 Mb 10 kb 42 kb 3,8 Mb 477 kb 226 kb 45 kb 154 kb 18 kb 463 kb 6 kb 1,3 Mb 1,5 Mb
    OBras 603 kb 351 kb 45 kb 1,3 Mb 14 kb 23 kb 3,4 Mb 398 kb 169 kb 38 kb 117 kb 21 kb 372 kb 6 kb 1,1 Mb 1,2 Mb
    PANTERA, Portuguese side 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    P'lo Norte 29 kb 12 kb 3 kb 26 kb 1 kb 5 kb 130 kb 22 kb 8 kb 2 kb 8 kb 1 kb 10 kb 1 kb 55 kb 62 kb
    Português Falado - Documentos Autênticos 23 kb 8 kb 2 kb 24 kb 436 b 3 kb 97 kb 19 kb 6 kb 2 kb 8 kb 1 kb 3 kb 1 kb 41 kb 47 kb
    ReLi 48 kb 30 kb 6 kb 56 kb 1 kb 50 kb 286 kb 36 kb 18 kb 4 kb 16 kb 1 kb 15 kb 2 kb 91 kb 105 kb
    NILC/São Carlos 1,2 Mb 640 kb 57 kb 1,3 Mb 615 kb 41 kb 7,3 Mb 839 kb 336 kb 45 kb 157 kb 660 kb 5,1 Mb 6 kb 7,1 Mb 7,6 Mb
    todos juntos 23,4 Mb 8,2 Mb 435 kb 11,6 Mb 14,9 Mb 72,1 Mb 169,3 Mb 20,6 Mb 6,2 Mb 379 kb 3,0 Mb 16,5 Mb 166,3 Mb 45,8 Mb 212,0 Mb 277,8 Mb
    Tycho Brahe 500 kb 255 kb 30 kb 698 kb 15 kb 28 kb 2,4 Mb 372 kb 159 kb 26 kb 130 kb 28 kb 639 kb 5 kb 1,3 Mb 1,5 Mb
    Vercial 946 kb 524 kb 61 kb 2,0 Mb 26 kb 31 kb 5,6 Mb 638 kb 269 kb 50 kb 194 kb 41 kb 1,5 Mb 8 kb 2,6 Mb 2,9 Mb
    all 4,7 Mb 2,4 Mb 190 kb 5,3 Mb 2,0 Mb 391 kb 20,3 Mb 3,6 Mb 1,5 Mb 161 kb 686 kb 2,3 Mb 34,8 Mb 18 kb 42,8 Mb 58,2 Mb
    all/pt 3,5 Mb 1,8 Mb 163 kb 4,6 Mb 1,2 Mb 207 kb 20,3 Mb 2,7 Mb 1,1 Mb 136 kb 565 kb 1,4 Mb 23,7 Mb 15 kb 29,4 Mb 40,3 Mb
    all/br 2,2 Mb 1,1 Mb 99 kb 2,7 Mb 964 kb 209 kb 9,0 Mb 1,6 Mb 637 kb 84 kb 325 kb 1,0 Mb 12,7 Mb 12 kb 16,3 Mb 22,2 Mb

    All tokens not belonging to any of the other categories were classified as GRAM (grammatical words). For that reason, these lists include an additional column specifying the category assigned by the parser.
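
The fallback rule just described can be sketched as follows. The set of named categories is taken from the column headers of the lemma lists; the parser tags PRP and KS and all counts are illustrative assumptions, not data from the lists. Rows that end up in GRAM keep the parser's own tag as an additional column.

```python
# Categories with their own list, per the lemma column headers on this page.
NAMED_CATEGORIES = {"N", "ADJ", "ADV", "V", "NUM", "PROP"}


def list_category(parser_tag: str) -> str:
    """Map a parser tag to a list category, with GRAM as the fallback."""
    return parser_tag if parser_tag in NAMED_CATEGORIES else "GRAM"


def group_by_category(entries):
    """Group (form, parser_tag, count) entries by list category.
    GRAM rows keep the parser tag as an extra column."""
    grouped = {}
    for form, tag, count in entries:
        category = list_category(tag)
        row = (form, tag, count) if category == "GRAM" else (form, count)
        grouped.setdefault(category, []).append(row)
    return grouped


# Invented toy entries; the tags PRP and KS are assumed examples of parser tags.
entries = [
    ("casa", "N", 40),
    ("de", "PRP", 120),
    ("que", "KS", 60),
    ("bonito", "ADJ", 5),
]
grouped = group_by_category(entries)
for category, rows in grouped.items():
    print(category, rows)
```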

    Similar information can be obtained for the Portuguese parts of COMPARA and CorTrad:

    Corpus | Tokens: N, ADJ, ADV, V, all | Lemmas: N, ADJ, ADV, V, Proper names, all
    COMPARA (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad jorn (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad literary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad culinary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b

    Further information (so far only in Portuguese):

    [ Examples | Tokenization | Annotation | Corpora | Acknowledgements ]


    Last update: 4 July 2016.