The frequency lists of the AC/DC corpora were created with the cwb-lexdecode tool from Open CWB / IMS-CWB, developed at Stuttgart University. Lemma and part of speech were assigned in context by PALAVRAS, the Portuguese parser by Eckhard Bick (Bick, 2000). Be warned that the lists were computed from the automatically annotated versions of the corpora, most of which have not been revised.
The following service allows one to obtain frequencies and rank of lexical items and sublexical patterns, per corpus or all together:
Some comments on the choices made:
we have in no case attempted to remove foreign words from the lists.
Collection | No. documents | No. words | Frequency list disregarding capitalization | No. different types disregarding capitalization | Frequency list keeping original capitalization | No. different types keeping capitalization |
WPT-05 | 9.501.202 | 5.856.585.035 | 187M (gz) | 25.237.118 | 206M (gz) | 27.861.391 |
WPT-03 | 1.529.758 | 1.059.436.086 | 55,1M (tar.gz) | 6.834.451 | | |
WBR-99 | 5.939.061 | 1.915.526.098 | 14M (tar.gz) | 2.669.965 | | |
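To illustrate how a list that disregards capitalization relates to one that keeps it, the sketch below lowercases each form, sums the counts of forms that collapse together, and then reports the number of distinct types and the rank of each form. It assumes, purely for illustration, that a list consists of lines holding a count and a form separated by a tab; the layout of the downloadable files may differ, and all function names and sample counts here are hypothetical.

```python
from collections import Counter


def parse_frequency_lines(lines):
    """Parse 'count<TAB>form' lines (an assumed layout) into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        count, form = line.split("\t", 1)
        counts[form] += int(count)
    return counts


def fold_case(counts):
    """Merge forms that differ only in capitalization, summing their counts."""
    folded = Counter()
    for form, count in counts.items():
        folded[form.lower()] += count
    return folded


def ranked(counts):
    """Return (rank, form, count) tuples, most frequent first, ranks from 1."""
    return [
        (rank, form, count)
        for rank, (form, count) in enumerate(counts.most_common(), start=1)
    ]


# Invented sample data, not taken from any of the collections.
sample = ["12\tLisboa", "3\tlisboa", "7\tCasa", "20\tcasa", "5\tde"]

original = parse_frequency_lines(sample)
folded = fold_case(original)

print("types keeping capitalization:", len(original))
print("types disregarding capitalization:", len(folded))
for rank, form, count in ranked(folded):
    print(rank, form, count)
```

Summing after lowercasing is why the number of types disregarding capitalization can never exceed the number of types keeping it, as in the WPT-05 row.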
All tokens not belonging to any of the other categories were classified as GRAM (grammatical words). For that reason, these lists include an additional column specifying the category assigned by the parser.
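The fallback rule can be sketched as follows. The tag names are assumptions made for illustration: `PROP` stands in for proper names, and `PRP` and `DET` are example tags outside the named categories, none of them checked against the PALAVRAS tagset documentation. On one reading of the paragraph above, rows that fall into GRAM keep the parser's original tag as an extra column, and that is what this sketch does.

```python
from collections import defaultdict

# Categories that get their own list; everything else falls back to GRAM.
# "PROP" is an assumed tag name for proper names.
NAMED_CATEGORIES = {"N", "ADJ", "ADV", "V", "PROP"}


def split_by_category(rows):
    """Group (form, pos_tag, count) rows into one list per category.

    Rows whose tag is not a named category go to the GRAM list and keep
    the parser's tag as an additional third column.
    """
    lists = defaultdict(list)
    for form, pos_tag, count in rows:
        if pos_tag in NAMED_CATEGORIES:
            lists[pos_tag].append((form, count))
        else:
            lists["GRAM"].append((form, count, pos_tag))
    return dict(lists)


# Invented sample rows with hypothetical tags and counts.
rows = [
    ("casa", "N", 20),
    ("de", "PRP", 50),
    ("o", "DET", 40),
    ("rapidamente", "ADV", 4),
]

lists = split_by_category(rows)
for category, entries in sorted(lists.items()):
    print(category, entries)
```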
Similar information can be obtained for the Portuguese parts of COMPARA and CorTrad:
Corpus | Tokens: N | Tokens: ADJ | Tokens: ADV | Tokens: V | Tokens: all | Lemmas: N | Lemmas: ADJ | Lemmas: ADV | Lemmas: V | Lemmas: Proper names | Lemmas: all |
COMPARA (Portuguese) | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b |
CorTrad jorn (Portuguese) | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b |
CorTrad literary (Portuguese) | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b |
CorTrad culinary (Portuguese) | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b | 0 b |