The frequency lists of the AC/DC corpora were created with the cwb-lexdecode tool from Open CWB / IMS-CWB from Stuttgart University. Lemma and part of speech were assigned in context by PALAVRAS, the Portuguese parser by Eckhard Bick (Bick, 2000). Be warned that the lists were computed from the automatically annotated versions of the corpora, most of which have not been revised.
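As a rough illustration of how such a list can be read and ranked, here is a minimal Python sketch. It assumes one item per line in the form count, tab, form; the actual layout of the downloadable files may differ. The function names and the sample lines are invented for the example.

```python
from collections import Counter


def parse_frequency_lines(lines):
    """Parse lines of the assumed form '<count><TAB><form>' into a Counter."""
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        count_text, _, form = stripped.partition("\t")
        counts[form] += int(count_text)
    return counts


def rank_items(counts):
    """Return (rank, form, frequency) tuples, most frequent first.

    Ties are broken by code point order of the form, so ranks are stable.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, form, freq)
            for rank, (form, freq) in enumerate(ordered, start=1)]


# Invented sample lines, not taken from any AC/DC corpus.
sample = ["  120\tde", "   45\tcasa", "   45\tCasa", "    3\tsoftware"]
ranked = rank_items(parse_frequency_lines(sample))
for rank, form, freq in ranked:
    print(rank, form, freq)
```

Tied items receive consecutive ranks here; a service could just as well give them the same rank, so the numbers should not be read as what the AC/DC interface returns.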
The following service returns the frequency and rank of lexical items and sublexical patterns, either per corpus or across all corpora:
Some comments on the choices made:
In no case have we attempted to remove foreign words from the lists.
| Collection | No. documents | No. words | Frequency list disregarding capitalization | No. different types disregarding capitalization | Frequency list keeping original capitalization | No. different types keeping capitalization |
| WPT-05 | 9.501.202 | 5.856.585.035 | 187M (gz) | 25.237.118 | 206M (gz) | 27.861.391 |
| WPT-03 | 1.529.758 | 1.059.436.086 | 55,1M (tar.gz) | 6.834.451 | | |
| WBR-99 | 5.939.061 | 1.915.526.098 | 14M (tar.gz) | 2.669.965 | | |
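The table shows fewer types when capitalization is disregarded, because forms that differ only in case are merged and their counts added. The sketch below illustrates that effect under the assumption that case is folded by plain lowercasing; the procedure actually used for these lists is not stated here, and the counts are invented.

```python
from collections import Counter


def fold_case(counts):
    """Merge forms that differ only in capitalization (assumed: lowercasing)."""
    folded = Counter()
    for form, freq in counts.items():
        folded[form.lower()] += freq
    return folded


# Invented counts, for illustration only.
original = Counter({"Casa": 45, "casa": 120, "Lisboa": 30, "de": 900})
folded = fold_case(original)

print(len(original), "types keeping capitalization")
print(len(folded), "types disregarding capitalization")
print(folded["casa"])  # sum of the counts for "Casa" and "casa"
```

The number of tokens stays the same under folding; only the number of types shrinks.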
All tokens not belonging to any of the other categories were classified as GRAM (grammatical words). For that reason, these lists include an additional column specifying the category assigned by the parser.
Similar information can be obtained for the Portuguese parts of COMPARA and CorTrad:
| Corpus | Tokens: N | Tokens: ADJ | Tokens: ADV | Tokens: V | Tokens: all | Lemmas: N | Lemmas: ADJ | Lemmas: ADV | Lemmas: V | Lemmas: proper names | Lemmas: all |
| COMPARA (Portuguese) | 545 kb | 276 kb | 42 kb | 774 kb | 1,7 Mb | 327 kb | 141 kb | 37 kb | 202 kb | 219 kb | 937 kb |
| CorTrad jorn (Portuguese) | 177 kb | 104 kb | 10 kb | 190 kb | 805 kb | 122 kb | 54 kb | 7 kb | 32 kb | 235 kb | 464 kb |
| CorTrad literary (Portuguese) | 39 kb | 19 kb | 4 kb | 54 kb | 186 kb | 31 kb | 12 kb | 3 kb | 15 kb | 9 kb | 73 kb |
| CorTrad culinary (Portuguese) | 30 kb | 15 kb | 1 kb | 31 kb | 120 kb | 23 kb | 8 kb | 1 kb | 10 kb | 4 kb | 48 kb |