The physical address for this service, launched on 23 September 1999, is /ACDC/.
The underlying corpus management system is CWB (a new version of the IMS corpus workbench).
(Table below last changed 16 April 2025)
Corpus | Size (units) | Size (words) | Size (sentences) | Variety | Short description |
---|---|---|---|---|---|
AmostRA-NILC | 128.517 | 99.267 | 4.965 | BR | AmostRA-NILC: pos-tagged sample of NILC Corpus |
ANCIB | 1.698.148 | 1.257.785 | 83.504 | BR | Brazilian discussion list (moderated) on library science |
Avante! | 7.790.515 | 6.611.270 | 193.107 | PT | Articles from Portuguese party-political newspaper Avante!, 1997-2002 |
Corpus Brasileiro | 1.164.063.447 | 983.190.976 | 44.522.504 | BR | One thousand million words of Brazilian Portuguese from several genres |
CD HAREM | 290.001 | 225.766 | 12.558 | PT BR | Golden collection of the First and Second HAREMs |
CETEMPúblico | 239.113.359 | 195.231.421 | 7.017.260 | PT | Two-paragraph excerpts from a major Portuguese daily newspaper, PÚBLICO, 1991-1998 |
CHAVE | 127.881.425 | 101.324.906 | 4.762.290 | PT BR | Articles from major daily newspapers PÚBLICO and Folha de São Paulo, 1994-1995 |
Ciência Viva | 805.307 | 663.487 | 27.270 | PT | Science dissemination texts written in Portuguese newspapers |
Colonia | 6.965.290 | 5.196.107 | 299.171 | PT BR | Historical corpus of Brazilian and Portuguese texts from XVIth to XXth centuries |
CONDIVport | 7.200.795 | 5.627.261 | 301.077 | PT BR | Articles from sports newspapers and fashion or health magazines from three periods (1950s, 1970s and 2000s), from the CONDIVport project |
CONDIVport2 | 212.075 | 175.277 | 6.533 | PT BR | Articles from daily newspapers from the 2010s, from the CONDIVport project |
CoNE | 921.366 | 681.377 | 31.563 | PT BR | Spam or general e-mail messages |
C-Oral-Brasil | 439.519 | 267.102 | 30.634 | BR | C-Oral-Brasil, Brazilian Portuguese informal speech |
CORDIAL-SIN | 1.494.736 | 857.066 | 98.010 | PT | Corpus CORDIAL-SIN, oral interviews in Portugal |
CorTrad, lado português | 1.739.029 | 1.307.745 | 65.370 | BR | CorTrad, Portuguese side, final translations or originals |
DHBB | 16.096.075 | 14.177.792 | 461.808 | BR | Encyclopedic entries in a Brazilian historical dictionary |
DiaCLAV | 7.854.974 | 6.701.348 | 210.373 | PT | Articles from four Portuguese regional newspapers, Diário de Coimbra, Diário de Leiria, Diário de Aveiro, Viseu Diário |
Diáspora TL-PT | 27.409 | 21.908 | 1.035 | TL | Diaspora TL-PT, interviews of East-Timorese in Portugal |
DisPR | 330.043 | 275.592 | 10.647 | PT BR | Presidential speeches in Portugal and Brazil |
ECI-EBR | 924.904 | 728.951 | 44.381 | BR | Corpus Borba-Ramsey of Brazilian Portuguese |
ECI-EE | 30.277 | 25.779 | 789 | PT | Call for the EU ESPRIT program |
ENPCPUB (parte em português) | 92.679 | 72.798 | 4.371 | PT BR | Translated fiction from English, subset of the ENPC corpus |
Floresta | 5.815.359 | 4.779.248 | 257.017 | PT BR | The Floresta treebank |
FrasesPB | 23.259 | 19.185 | 652 | BR | Individual sentences in Brazilian Portuguese |
FrasesPP | 20.030 | 16.266 | 594 | PT | Individual sentences in European Portuguese |
Mariano Gago | 693.884 | 569.843 | 22.931 | PT | Texts by and about José Mariano Gago |
LeMe | 3.496.795 | 2.581.509 | 178.686 | PT | Pharmaceutical literature |
Literateca | 52.083.367 | 37.326.463 | 2.311.471 | PT BR | Lusophone literary works in the public domain |
Marielle, presente! | 506.032 | 409.831 | 20.444 | BR | Texts by and about Marielle Franco |
Moçambula | 69.469 | 59.038 | 2.285 | MO | Messages from readers to Mozambican newspapers |
Museu da Pessoa | 1.847.292 | 1.431.277 | 93.466 | PT BR | Transcriptions of oral interviews from Museu da Pessoa |
Natura/Minho | 2.255.442 | 1.800.223 | 70.277 | PT | Unedited version of articles for Diário do Minho, a regional newspaper in Portugal |
NOBRE | 12.149.628 | 8.856.498 | 504.411 | PT | Portuguese literary works in the public domain |
OBras | 14.512.445 | 10.274.921 | 636.607 | BR | Brazilian literary works in the public domain |
PANTERA, lado português | 939.091 | 636.189 | 43.240 | todas | Works translated from and to Norwegian (excerpts) |
P'lo Norte | 52.751 | 41.226 | 2.381 | PT | Blogs about Norway written by Portuguese bloggers |
Português Falado - Documentos Autênticos | 148.582 | 107.215 | 7.569 | todas | Transcribed interviews from ten different locations where Portuguese is spoken |
ReLi | 157.560 | 128.784 | 7.231 | BR | ReLi, corpus of book appraisals |
NILC/São Carlos | 46.194.786 | 35.145.895 | 2.148.320 | BR | Various texts from the NILC Corpus: newspaper, commercial letters and educational texts |
todos juntos | 1.518.927.964 | 1.261.058.299 | 56.970.430 | todas | All corpora together |
Tycho Brahe | 4.220.057 | 3.341.892 | 135.842 | PT BR | Corpus Tycho Brahe, historical texts |
Vercial | 20.856.814 | 14.741.576 | 986.803 | PT | Portuguese fiction, XVIth to XXth century, from Projecto Vercial |
Total | 3.271.070.497 | 2.708.046.359 | 122.589.877 | PT BR | all corpora |
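The sizes in the table use a dot as the thousands separator (1.257.785 means 1,257,785). For readers who want to process the table programmatically, the following is a minimal Python sketch, assuming the rows are available as pipe-delimited text exactly as shown above. The function names `parse_size` and `parse_row` are hypothetical helpers introduced here for illustration; they are not part of AC/DC or CWB.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '1.257.785' (dot = thousands separator) to an int."""
    return int(text.strip().replace(".", ""))


def parse_row(line: str) -> dict:
    """Split one pipe-delimited table row into named fields.

    Assumes six cells per row and no '|' inside the description,
    which holds for the rows in the table above.
    """
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name, units, words, sentences, variety, description = cells
    return {
        "corpus": name,
        "units": parse_size(units),
        "words": parse_size(words),
        "sentences": parse_size(sentences),
        "variety": variety.split(),  # e.g. 'PT BR' -> ['PT', 'BR']
        "description": description,
    }


row = parse_row(
    "ANCIB | 1.698.148 | 1.257.785 | 83.504 | BR | "
    "Brazilian discussion list (moderated) on library science |"
)
# Average number of words per sentence for this corpus.
words_per_sentence = row["words"] / row["sentences"]
```

Splitting the variety cell on whitespace keeps corpora that cover several varieties (such as `PT BR`) easy to filter.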
We provide more extensive documentation, in Portuguese, about the corpora and about the processing and encoding of the several kinds of information present in them (tokenization, sentence separation and annotation).
Examples of how to invoke AC/DC can be found in exemplos (in Portuguese) and in a page with some old examples in English. Extensive frequency lists are also provided in the frequencies service, per corpus and in total, computed for both lemmata and forms, and also per PoS. On that page, so far available only in Portuguese, you can both download the lists and search them using regular expressions.
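Once a frequency list has been downloaded, it can also be searched locally with regular expressions. The sketch below assumes a simple layout of one entry per line, with the form and its count separated by a tab. This layout is an assumption, so check the downloaded file and adapt the parsing. The function `search_frequencies` is a hypothetical helper, and the sample entries are invented placeholders, not real AC/DC frequencies.

```python
import re
from typing import Iterable, Iterator, Tuple


def search_frequencies(
    lines: Iterable[str], pattern: str
) -> Iterator[Tuple[str, int]]:
    """Yield (form, count) pairs whose form matches the pattern in full.

    Assumed input layout (hypothetical): 'form<TAB>count' on each line.
    """
    regex = re.compile(pattern)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        form, _, count = line.partition("\t")
        if regex.fullmatch(form):
            yield form, int(count)


# Invented placeholder entries, for illustration only.
SAMPLE = ["casa\t120", "casas\t45", "caso\t80", "mesa\t30"]

matches = list(search_frequencies(SAMPLE, r"cas[ao]s?"))
# matches == [("casa", 120), ("casas", 45), ("caso", 80)]
```

Here `re.fullmatch` requires the whole form to match the pattern; use `regex.search` instead to match anywhere inside a form. The matching semantics of the online frequencies service may differ from either choice.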