Rank and frequency in Portuguese

AC/DC project, Linguateca



This page gives access to the token frequency lists of the AC/DC corpora and of the other corpora (specifically Web collections) that Linguateca also makes available.

The frequency lists of the AC/DC corpora were created with the cwb-lexdecode tool from Open CWB / IMS-CWB from Stuttgart University. Lemma and part of speech were assigned in context by PALAVRAS, the Portuguese parser by Eckhard Bick (Bick, 2000). Be warned that the lists were computed from the automatically annotated versions of the corpora, most of which have not been revised.

Frequency and rank of wordforms and lemmas in the AC/DC corpora

The following service allows one to obtain frequencies and rank of lexical items and sublexical patterns, per corpus or all together:

Some comments on the choices made:

Corpus | Token frequency list | Lemma frequency list
All corpora 20,1 Mb 41,4 Mb
All corpora from Portugal 16,1 Mb 30,4 Mb
All corpora from Brazil 8,3 Mb 14,2 Mb
AmostRA-NILC 280 kb 120 kb
ANCIB 1,3 Mb 1,2 Mb
Avante! 2,1 Mb 1,5 Mb
Corpus Brasileiro 98,6 Mb 204,9 Mb
CD HAREM 511 kb 263 kb
CETEMPúblico 16,7 Mb 25,0 Mb
CHAVE 12,5 Mb 16,5 Mb
Colonia 2,9 Mb 1,2 Mb
CONDIVport 2,5 Mb 1,5 Mb
CONDIVport2 0 b 0 b
CoNE 743 kb 570 kb
C-Oral-Brasil 235 kb 101 kb
DiaCLAV 1,8 Mb 1,7 Mb
Diáspora TL-PT 47 kb 19 kb
ECI-EBR 1022 kb 431 kb
ECI-EE 71 kb 30 kb
ENPCPUB (Portuguese part) 211 kb 83 kb
Floresta 3,0 Mb 2,5 Mb
FrasesPB 98 kb 44 kb
FrasesPP 83 kb 38 kb
Mariano Gago 528 kb 309 kb
Em progresso 282 kb 155 kb
Moçambula 175 kb 76 kb
Museu da Pessoa 711 kb 335 kb
Natura/Minho 1,1 Mb 925 kb
NOBRE 1,9 Mb 638 kb
OBras 2,7 Mb 898 kb
P'lo Norte 130 kb 56 kb
Português Falado - Documentos Autênticos 97 kb 41 kb
ReLi 232 kb 95 kb
NILC/São Carlos 7,0 Mb 7,3 Mb
todos juntos 111,8 Mb 231,9 Mb
Tycho Brahe 2,1 Mb 1,1 Mb
Vercial 6,8 Mb 2,8 Mb
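
As a rough illustration of how rank relates to frequency in lists like these, the sketch below counts tokens and assigns ranks in decreasing order of frequency. It is not the procedure used to build the lists above. The function name `rank_items`, the toy sentence, and the tie-handling rule (items with equal frequency share the rank of the first of them) are all assumptions made for the example, since this page does not say how ties are ranked.

```python
from collections import Counter


def rank_items(freqs):
    """Map each item to (frequency, rank), with rank 1 for the most frequent.

    Assumption: items with the same frequency share a rank, and the next
    distinct frequency takes the rank given by its position in the list.
    """
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = {}
    rank = 0
    previous = None
    for position, (item, freq) in enumerate(ordered, start=1):
        if freq != previous:
            rank = position
            previous = freq
        ranked[item] = (freq, rank)
    return ranked


# Toy input, not taken from any of the corpora.
tokens = "a casa de a Maria e a casa de o Pedro".split()
ranks = rank_items(Counter(tokens))

for item, (freq, rank) in ranks.items():
    print(rank, freq, item)
```

Other tie-handling conventions (for example, giving every item a distinct rank) are equally possible; only the sort key and the rank update would change.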

Partial searches, for a few tokens or lemmas:

Tokens | Lemmas

  • You can search using Perl regular expressions.
  • To look for multiword lemmas, use a syntax like Belo=Horizonte, Castelo=Branco.
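
The two points above can be sketched offline on a downloaded lemma list. The example below is only an approximation of the search service: it uses Python's `re` module, which agrees with Perl for simple patterns like these but is not identical to it, and it assumes that a pattern must match the whole item. The function name `search` and the counts in `freqs` are made up for the illustration.

```python
import re


def search(freqs, pattern):
    """Return the entries of freqs whose key matches pattern in full.

    Assumption: the pattern is anchored to the whole item; the real
    service may behave differently.
    """
    rx = re.compile(pattern)
    return {item: n for item, n in freqs.items() if rx.fullmatch(item)}


# Hypothetical counts; multiword lemmas are joined with "=".
freqs = {
    "Belo=Horizonte": 12,
    "Castelo=Branco": 7,
    "castelo": 30,
    "belo": 5,
}

print(search(freqs, r"Castelo=.*"))  # multiword lemmas starting with Castelo
print(search(freqs, r".+=.+"))       # every multiword lemma
print(search(freqs, r"cast.*"))      # case-sensitive: matches only "castelo"
```

Since "=" has no special meaning in a regular expression, multiword lemmas can be matched literally without escaping.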

    Token frequencies of Web collections

    Depending on the collection, different methods were used:
    • For WBR-99, we used the tokenization provided by the collection.
    • For WPT-03, another tokenization was used.
    • For WPT-05, the tokenization was done by the tokeniza function from the Lingua::PT::PLNbase library, in March 2009.

    In no case have we attempted to remove foreign words from the lists.

    Collection | No. documents | No. words | Frequency list disregarding capitalization | No. different types disregarding capitalization | Frequency list keeping original capitalization | No. different types keeping capitalization
    WPT-05 9.501.202 5.856.585.035 187M (gz) 25.237.118 206M (gz) 27.861.391
    WPT-03 1.529.758 1.059.436.086 55,1M (tar.gz) 6.834.451
    WBR-99 5.939.061 1.915.526.098 14M (tar.gz) 2.669.965
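
The difference between the two kinds of type counts in the table can be shown with a small sketch. It assumes that "disregarding capitalization" means lowercasing each token before counting; this page does not state how the case folding was actually done, and the function name `type_counts` and the token sample are invented for the example.

```python
from collections import Counter


def type_counts(tokens):
    """Return (types keeping capitalization, types disregarding it).

    Assumption: capitalization is disregarded by lowercasing each token.
    """
    cased = Counter(tokens)
    folded = Counter(token.lower() for token in tokens)
    return len(cased), len(folded)


# Toy sample, not taken from any of the Web collections.
sample = ["Porto", "porto", "PORTO", "de", "De", "vinho"]
keeping, disregarding = type_counts(sample)
print(keeping, disregarding)
```

The count that keeps capitalization can never be smaller than the one that disregards it, which agrees with the two WPT-05 figures above (27.861.391 versus 25.237.118).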

    Frequency lists of wordforms and lemmas by part of speech in the AC/DC corpora

    Corpus | Tokens: N, ADJ, ADV, V, NUM, GRAM, all | Lemmas: N, ADJ, ADV, V, NUM, PROP, GRAM, all, all/pos
    AmostRA-NILC 67 kb 30 kb 4 kb 64 kb 1 kb 5 kb 280 kb 52 kb 21 kb 3 kb 20 kb 2 kb 19 kb 1 kb 120 kb 134 kb
    ANCIB 348 kb 101 kb 13 kb 197 kb 57 kb 9 kb 1,3 Mb 265 kb 56 kb 10 kb 43 kb 58 kb 778 kb 3 kb 1,2 Mb 1,3 Mb
    Avante! 372 kb 234 kb 31 kb 567 kb 41 kb 15 kb 2,1 Mb 241 kb 118 kb 26 kb 61 kb 50 kb 1,1 Mb 3 kb 1,5 Mb 1,7 Mb
    Corpus Brasileiro 20,2 Mb 6,5 Mb 331 kb 7,0 Mb 10,9 Mb 976 kb 98,6 Mb 18,1 Mb 5,0 Mb 269 kb 2,8 Mb 13,4 Mb 165,8 Mb 268 kb 204,9 Mb 216,6 Mb
    CD HAREM 111 kb 48 kb 7 kb 97 kb 7 kb 7 kb 511 kb 77 kb 30 kb 5 kb 23 kb 7 kb 119 kb 2 kb 263 kb 291 kb
    CETEMPúblico 2,4 Mb 1,2 Mb 105 kb 2,7 Mb 1,1 Mb 316 kb 16,7 Mb 1,8 Mb 731 kb 84 kb 299 kb 1,2 Mb 21,0 Mb 8 kb 25,0 Mb 26,6 Mb
    CHAVE 1,9 Mb 1020 kb 86 kb 2,1 Mb 872 kb 246 kb 12,5 Mb 1,3 Mb 566 kb 68 kb 232 kb 950 kb 13,4 Mb 8 kb 16,5 Mb 17,6 Mb
    Colonia 525 kb 300 kb 36 kb 1016 kb 10 kb 20 kb 2,9 Mb 358 kb 161 kb 32 kb 122 kb 13 kb 503 kb 5 kb 1,2 Mb 1,3 Mb
    CONDIVport 481 kb 298 kb 36 kb 647 kb 49 kb 18 kb 2,5 Mb 317 kb 160 kb 28 kb 79 kb 58 kb 898 kb 4 kb 1,5 Mb 1,6 Mb
    CONDIVport2 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CoNE 189 kb 56 kb 7 kb 97 kb 38 kb 7 kb 743 kb 135 kb 30 kb 5 kb 24 kb 40 kb 334 kb 2 kb 570 kb 615 kb
    C-Oral-Brasil 53 kb 20 kb 3 kb 48 kb 5 kb 10 kb 235 kb 43 kb 14 kb 2 kb 14 kb 4 kb 22 kb 2 kb 101 kb 117 kb
    DiaCLAV 356 kb 189 kb 21 kb 478 kb 44 kb 11 kb 1,8 Mb 222 kb 93 kb 16 kb 57 kb 49 kb 1,2 Mb 3 kb 1,7 Mb 1,8 Mb
    Diáspora TL-PT 9 kb 3 kb 1 kb 10 kb 563 b 3 kb 47 kb 8 kb 2 kb 1 kb 3 kb 579 b 3 kb 1 kb 19 kb 23 kb
    ECI-EBR 218 kb 119 kb 16 kb 272 kb 5 kb 9 kb 1022 kb 151 kb 68 kb 12 kb 45 kb 8 kb 145 kb 2 kb 431 kb 478 kb
    ECI-EE 15 kb 10 kb 2 kb 14 kb 1 kb 3 kb 71 kb 11 kb 6 kb 2 kb 5 kb 1 kb 3 kb 1 kb 30 kb 34 kb
    ENPCPUB (Portuguese part) 45 kb 21 kb 5 kb 55 kb 933 b 5 kb 211 kb 35 kb 14 kb 4 kb 15 kb 1 kb 10 kb 1 kb 83 kb 94 kb
    Floresta 631 kb 301 kb 33 kb 640 kb 56 kb 19 kb 3,0 Mb 428 kb 155 kb 28 kb 87 kb 59 kb 1,7 Mb 8 kb 2,5 Mb 2,7 Mb
    FrasesPB 27 kb 10 kb 2 kb 18 kb 442 b 3 kb 98 kb 22 kb 8 kb 2 kb 8 kb 407 b 2 kb 1 kb 44 kb 50 kb
    FrasesPP 21 kb 9 kb 2 kb 15 kb 499 b 3 kb 83 kb 17 kb 7 kb 2 kb 6 kb 467 b 2 kb 1 kb 38 kb 42 kb
    Mariano Gago 103 kb 55 kb 9 kb 129 kb 7 kb 10 kb 528 kb 74 kb 32 kb 8 kb 23 kb 7 kb 161 kb 2 kb 309 kb 335 kb
    Em progresso 57 kb 25 kb 5 kb 61 kb 5 kb 10 kb 282 kb 41 kb 15 kb 4 kb 16 kb 5 kb 72 kb 1 kb 155 kb 171 kb
    Moçambula 39 kb 18 kb 4 kb 40 kb 1 kb 5 kb 175 kb 30 kb 12 kb 3 kb 13 kb 1 kb 14 kb 1 kb 76 kb 86 kb
    Museu da Pessoa 157 kb 69 kb 9 kb 193 kb 4 kb 9 kb 711 kb 109 kb 40 kb 7 kb 32 kb 5 kb 142 kb 2 kb 335 kb 373 kb
    Natura/Minho 231 kb 121 kb 14 kb 254 kb 34 kb 9 kb 1,1 Mb 148 kb 65 kb 11 kb 42 kb 38 kb 621 kb 2 kb 925 kb 998 kb
    NOBRE 348 kb 216 kb 32 kb 645 kb 4 kb 29 kb 1,9 Mb 236 kb 111 kb 28 kb 78 kb 6 kb 175 kb 4 kb 638 kb 713 kb
    OBras 461 kb 269 kb 36 kb 978 kb 8 kb 140 kb 2,7 Mb 305 kb 130 kb 31 kb 91 kb 12 kb 319 kb 4 kb 898 kb 986 kb
    P'lo Norte 29 kb 12 kb 3 kb 27 kb 1 kb 4 kb 130 kb 22 kb 8 kb 2 kb 8 kb 1 kb 11 kb 1 kb 56 kb 63 kb
    Português Falado - Documentos Autênticos 23 kb 8 kb 2 kb 24 kb 436 b 3 kb 97 kb 19 kb 6 kb 2 kb 8 kb 1 kb 3 kb 1 kb 41 kb 47 kb
    ReLi 46 kb 30 kb 5 kb 58 kb 1 kb 6 kb 232 kb 37 kb 18 kb 4 kb 16 kb 1 kb 17 kb 2 kb 95 kb 108 kb
    NILC/São Carlos 1,2 Mb 589 kb 55 kb 1,3 Mb 567 kb 24 kb 7,0 Mb 804 kb 314 kb 43 kb 173 kb 598 kb 5,5 Mb 5 kb 7,3 Mb 7,9 Mb
    todos juntos 22,5 Mb 7,3 Mb 400 kb 11,7 Mb 12,6 Mb 6,7 Mb 111,8 Mb 20,0 Mb 5,6 Mb 345 kb 3,4 Mb 14,4 Mb 188,3 Mb 1,4 Mb 231,9 Mb 246,2 Mb
    Tycho Brahe 426 kb 210 kb 25 kb 593 kb 12 kb 27 kb 2,1 Mb 316 kb 133 kb 22 kb 122 kb 18 kb 559 kb 5 kb 1,1 Mb 1,3 Mb
    Vercial 932 kb 527 kb 58 kb 2,0 Mb 27 kb 1,2 Mb 6,8 Mb 629 kb 276 kb 50 kb 194 kb 35 kb 1,7 Mb 5 kb 2,8 Mb 3,1 Mb
    all 4,3 Mb 2,0 Mb 162 kb 4,8 Mb 1,8 Mb 2,9 Mb 20,1 Mb 3,2 Mb 1,3 Mb 134 kb 590 kb 2,0 Mb 33,5 Mb 1,2 Mb 41,4 Mb 56,7 Mb
    all/pt 3,4 Mb 1,7 Mb 144 kb 4,2 Mb 1,1 Mb 1,5 Mb 20,1 Mb 2,5 Mb 1,0 Mb 117 kb 520 kb 1,3 Mb 25,1 Mb 11 kb 30,4 Mb 41,6 Mb
    all/br 2,0 Mb 999 kb 87 kb 2,4 Mb 837 kb 1,3 Mb 8,3 Mb 1,4 Mb 570 kb 73 kb 285 kb 890 kb 10,3 Mb 1,2 Mb 14,2 Mb 19,9 Mb

    All tokens not belonging to any of the other categories were classified as GRAM (grammatical words). For that reason, these lists include an additional column specifying the category assigned by the parser.
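
The GRAM rule can be sketched as follows. This is an illustration only, not the code that produced the lists: the function name `split_by_category` is invented, the category set is copied from the lemma columns of the table above, and the tags `PRP` and `DET` merely stand in for whatever categories the parser assigns to grammatical words.

```python
from collections import Counter, defaultdict

# Assumption: the categories that get a list of their own (lemma columns above).
LEMMA_CATEGORIES = ("N", "ADJ", "ADV", "V", "NUM", "PROP")


def split_by_category(tagged, categories=LEMMA_CATEGORIES):
    """Build one frequency list per category from (item, pos) pairs.

    Items whose category is not listed go to GRAM, keyed by (item, pos)
    so that the category assigned by the parser is kept as an extra column.
    """
    lists = defaultdict(Counter)
    for item, pos in tagged:
        if pos in categories:
            lists[pos][item] += 1
        else:
            lists["GRAM"][(item, pos)] += 1
    return lists


# Toy annotated input; the tags are illustrative.
tagged = [
    ("casa", "N"), ("de", "PRP"), ("a", "DET"),
    ("a", "PRP"), ("casa", "N"), ("grande", "ADJ"),
]
lists = split_by_category(tagged)

for (item, pos), count in lists["GRAM"].items():
    print(item, pos, count)
```

Keeping the parser category in the GRAM list is what lets the same form, such as "a" in the toy input, appear once per category instead of being merged into a single count.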

    Similar information can be obtained for the Portuguese parts of COMPARA and CorTrad:

    Corpus | Tokens: N, ADJ, ADV, V, all | Lemmas: N, ADJ, ADV, V, Proper names, all
    COMPARA (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad jorn (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad literary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b
    CorTrad culinary (Portuguese) 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b 0 b

    Further information (so far only in Portuguese):

    [ Examples | Tokenization | Annotation | Corpora | Acknowledgements ]


    Last update: 4 July 2016.