Internet Access to Corpora: The AC/DC project

Linguateca



AC/DC stands for Acesso a corpora/Disponibilização de corpora ("access and availability of corpora"). The project is one of the activities of Linguateca, previously the Computational Processing of Portuguese project.

The physical address for this service, launched on 23 September 1999, is /ACDC/.

The underlying corpus management system is CWB (a new version of the IMS Corpus Workbench).

Main goals

One of the goals of Linguateca is to improve significantly the conditions for NLP of Portuguese, namely by

Corpora currently offered and their rough characterization

For each corpus, we present:

(Table below last changed 16 April 2025)

Corpus | Size (units) | Size (words) | Size (sentences) | Variety | Short description
AmostRA-NILC | 128.517 | 99.267 | 4.965 | BR | AmostRA-NILC: PoS-tagged sample of NILC Corpus
ANCIB | 1.698.148 | 1.257.785 | 83.504 | BR | Brazilian discussion list (moderated) on library science
Avante! | 7.790.515 | 6.611.270 | 193.107 | PT | Articles from Portuguese party-political newspaper Avante!, 1997-2002
Corpus Brasileiro | 1.164.063.447 | 983.190.976 | 44.522.504 | BR | One thousand million words of Brazilian Portuguese from several genres
CD HAREM | 290.001 | 225.766 | 12.558 | PT BR | Golden collection of the First and Second HAREMs
CETEMPúblico | 239.113.359 | 195.231.421 | 7.017.260 | PT | Two-paragraph excerpts from a major Portuguese daily newspaper, PÚBLICO, 1991-1998
CHAVE | 127.881.425 | 101.324.906 | 4.762.290 | PT BR | Articles from major daily newspapers PÚBLICO and Folha de São Paulo, 1994-1995
Ciência Viva | 805.307 | 663.487 | 27.270 | PT | Science dissemination texts written in Portuguese newspapers
Colonia | 6.965.290 | 5.196.107 | 299.171 | PT BR | Historical corpus of Brazilian and Portuguese texts from XVIth to XXth centuries
CONDIVport | 7.200.795 | 5.627.261 | 301.077 | PT BR | Articles from sports newspapers and fashion or health magazines from three periods (1950s, 1970s and 2000s), from the CONDIVport project
CONDIVport2 | 212.075 | 175.277 | 6.533 | PT BR | Articles from daily newspapers from the 2010s, from the CONDIVport project
CoNE | 921.366 | 681.377 | 31.563 | PT BR | Spam or general e-mail messages
C-Oral-Brasil | 439.519 | 267.102 | 30.634 | BR | C-Oral-Brasil, Brazilian Portuguese informal speech
CORDIAL-SIN | 1.494.736 | 857.066 | 98.010 | PT | Corpus CORDIAL-SIN, oral interviews in Portugal
CorTrad, lado português | 1.739.029 | 1.307.745 | 65.370 | BR | CorTrad, Portuguese side, final translations or originals
DHBB | 16.096.075 | 14.177.792 | 461.808 | BR | Encyclopedic entries in a Brazilian historical dictionary
DiaCLAV | 7.854.974 | 6.701.348 | 210.373 | PT | Articles from four Portuguese regional newspapers, Diário de Coimbra, Diário de Leiria, Diário de Aveiro, Viseu Diário
Diáspora TL-PT | 27.409 | 21.908 | 1.035 | TL | Diáspora TL-PT, interviews of East-Timorese in Portugal
DisPR | 330.043 | 275.592 | 10.647 | PT BR | Presidential speeches in Portugal and Brazil
ECI-EBR | 924.904 | 728.951 | 44.381 | BR | Corpus Borba-Ramsey of Brazilian Portuguese
ECI-EE | 30.277 | 25.779 | 789 | PT | Call for the EU ESPRIT program
ENPCPUB (parte em português) | 92.679 | 72.798 | 4.371 | PT BR | Translated fiction from English, subset of the ENPC corpus
Floresta | 5.815.359 | 4.779.248 | 257.017 | PT BR | The Floresta treebank
FrasesPB | 23.259 | 19.185 | 652 | BR | Individual sentences in Brazilian Portuguese
FrasesPP | 20.030 | 16.266 | 594 | PT | Individual sentences in European Portuguese
Mariano Gago | 693.884 | 569.843 | 22.931 | PT | Texts by and about José Mariano Gago
LeMe | 3.496.795 | 2.581.509 | 178.686 | PT | Pharmaceutic literature
Literateca | 52.083.367 | 37.326.463 | 2.311.471 | PT BR | Lusophone literary works in the public domain
Marielle, presente! | 506.032 | 409.831 | 20.444 | BR | Texts by and about Marielle Franco
Moçambula | 69.469 | 59.038 | 2.285 | MO | Messages from readers to Mozambican newspapers
Museu da Pessoa | 1.847.292 | 1.431.277 | 93.466 | PT BR | Transcriptions of oral interviews from Museu da Pessoa
Natura/Minho | 2.255.442 | 1.800.223 | 70.277 | PT | Unedited version of articles for Diário do Minho, a regional newspaper in Portugal
NOBRE | 12.149.628 | 8.856.498 | 504.411 | PT | Portuguese literary works in the public domain
OBras | 14.512.445 | 10.274.921 | 636.607 | BR | Brazilian literary works in the public domain
PANTERA, lado português | 939.091 | 636.189 | 43.240 | all | Works translated from and to Norwegian (excerpts)
P'lo Norte | 52.751 | 41.226 | 2.381 | PT | Blogs about Norway written by Portuguese bloggers
Português Falado - Documentos Autênticos | 148.582 | 107.215 | 7.569 | all | Transcribed interviews from ten different locations where Portuguese is spoken
ReLi | 157.560 | 128.784 | 7.231 | BR | ReLi, corpus of book appraisals
NILC/São Carlos | 46.194.786 | 35.145.895 | 2.148.320 | BR | Various texts from the NILC Corpus: newspaper, commercial letters and educational texts
todos juntos | 1.518.927.964 | 1.261.058.299 | 56.970.430 | all | All corpora together
Tycho Brahe | 4.220.057 | 3.341.892 | 135.842 | PT BR | Corpus Tycho Brahe, historical texts
Vercial | 20.856.814 | 14.741.576 | 986.803 | PT | Portuguese fiction, 16th to 20th century, from Projecto Vercial
Total | 3.271.070.497 | 2.708.046.359 | 122.589.877 | PT BR | All corpora
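The sizes in the table use a dot as the thousands separator, so they need a small conversion step before any arithmetic. The following is a minimal Python sketch of that conversion, applied to three rows copied from the table to derive an average sentence length in words. The function names parse_count and words_per_sentence are illustrative and are not part of AC/DC.

```python
def parse_count(text: str) -> int:
    """Convert a count such as '1.164.063.447' (dot = thousands separator) to an int."""
    return int(text.replace(".", ""))


def words_per_sentence(words: str, sentences: str) -> float:
    """Average number of words per sentence, from two table cells."""
    return parse_count(words) / parse_count(sentences)


# (corpus, units, words, sentences), copied from three rows of the table
ROWS = [
    ("AmostRA-NILC", "128.517", "99.267", "4.965"),
    ("ECI-EE", "30.277", "25.779", "789"),
    ("FrasesPB", "23.259", "19.185", "652"),
]

if __name__ == "__main__":
    for name, units, words, sentences in ROWS:
        ratio = words_per_sentence(words, sentences)
        print(f"{name}: {parse_count(words)} words, {ratio:.1f} words per sentence")
```

The same helper works for the largest figures in the table, since Python integers have no fixed upper bound.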

More extensive documentation is available, in Portuguese, about the corpora and about the processing and encoding of the several kinds of information they contain (tokenization, sentence separation and annotation).

Examples of how to query AC/DC can be found in exemplos (in Portuguese) and in a page with some older examples in English.

Extensive frequency lists are also provided in the frequencies service, per corpus and in total, computed for both lemmata and forms, and also per PoS. On that page, so far available only in Portuguese, you can either download the lists or search them using regular expressions.
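As an illustration of searching a downloaded frequency list with a regular expression, here is a minimal Python sketch. It assumes, purely for the example, that each line holds a form and its count separated by a tab; the actual file layout of the AC/DC frequency lists is not described here and may differ. The function name search_frequencies and the sample counts are invented for illustration.

```python
import re
from typing import Iterable, Iterator, Tuple


def search_frequencies(lines: Iterable[str], pattern: str) -> Iterator[Tuple[str, int]]:
    """Yield (form, count) for every entry whose form fully matches the pattern.

    Assumed (hypothetical) line format: form<TAB>count
    """
    regex = re.compile(pattern)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        form, _, count = line.partition("\t")
        if regex.fullmatch(form):
            yield form, int(count)


# Made-up sample entries, not real AC/DC frequencies
SAMPLE = ["casa\t120", "casas\t45", "casar\t12", "mesa\t30"]

if __name__ == "__main__":
    for form, count in search_frequencies(SAMPLE, r"casas?"):
        print(form, count)
```

Using fullmatch rather than search means the pattern must cover the whole form, so that casas? selects casa and casas but not casar.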

Related projects

The following are related projects also under the scope of Linguateca:



Last updated: 02 August 2016.