Access to resources and services

Linguateca


One of the goals of Linguateca is to significantly improve the conditions for NLP of Portuguese, namely to:

make existing resources more accessible
foster the development and public availability of new ones
provide programs to build corpora on the fly from the Internet
create sufficiently big corpora that can be used as a reference
make Portuguese corpus processing in general easier
create publicly available programs that can be reused by other researchers or developers

This page links to the main resources, services and programs developed within the scope of Linguateca.

AC/DC

Main goals of the AC/DC project

provide a single point of access to all corpora
further improve the information associated with these corpora
develop a good user interface

The corpora were annotated with Eckhard Bick's PALAVRAS parser, from the VISL project.

CETEMPúblico

CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) is a corpus of some 180 million words in European Portuguese, built by the project Computational Processing of Portuguese following an agreement between the Portuguese Ministry for Science and Technology (MCT) and the newspaper PÚBLICO.

CETENFolha

CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo) is a corpus of some 24 million words in Brazilian Portuguese, built by the project Computational Processing of Portuguese from the texts of Folha de S. Paulo that belong to the NILC/São Carlos corpus, compiled by Núcleo Interinstitucional de Lingüística Computacional (NILC).

COMPARA

COMPARA is a Portuguese-English parallel corpus project, developed in collaboration with Ana Frankenberg-Garcia, that includes a novel interface, DISPARA. It is an open-ended collection of Portuguese-English and English-Portuguese translations. One can use COMPARA to find out how translators have rendered words and expressions from Portuguese into English and from English into Portuguese.

Corpógrafo

Corpógrafo was created by the CLUP/FLUP node of Linguateca to facilitate the creation of specialized, "do-it-yourself" corpora. The system offers text preprocessing, terminology extraction and help in defining concepts. It provides a toolbox that lets users manage their own texts and terminological databases.

Esfinge

Esfinge is a general-domain question answering system that answers questions in Portuguese using the Web as its source.

Floresta sintá(c)tica

This project, carried out in collaboration with the VISL project, aims to create a syntactically annotated, human-revised treebank for Portuguese, in order to advance computational syntax and to provide a resource for future evaluation of tools for Portuguese.

PAPEL

PAPEL is a dictionary-based lexical ontology for Portuguese, derived from Porto Editora's Dicionário da Língua Portuguesa and developed mainly at the Coimbra node of Linguateca. It will be made publicly available.

REPENTINO

REPENTINO is a repository of textual named entity instances, i.e. a set of proper names, each denoting a specific entity and written in Portuguese with at least one capital letter, classified according to the kind of entity they denote (e.g., company, book title, place name). REPENTINO is organized into several major categories, which are in turn divided into subcategories.

Repositório

This space provides a kind of electronic Web shelf for all NLP resources for Portuguese that people want us to make available. We give access to IR collections, MT lexicons and corpora of summaries, among others.

WebJspell

WebJspell is a Web interface to Jspell, a morphological analyser and spell checker for Portuguese and English developed by Natura. Through WebJspell it is also possible to spellcheck entire Web pages by simply submitting their URL, and to propose new entries for the dictionaries. WebJspell was created by the Braga node of Linguateca.

WPT 03 and 05

The WPT 03 is a collection of Web pages created from a crawl of the entire Portuguese Web in 2003. As far as we know, the WPT 03 is the first and only freely available research collection that spans the entire Web of a country. It is the result of a crawl made between March and June 2003 by the crawlers of tumba!, a Web search engine for the Portuguese community. In addition, the log of the queries to tumba! in the period from 1st October 2003 is also provided, after being run through an anonymization procedure.

The WPT 05 is a corresponding collection for 2005.

Both WPT 03 and WPT 05 were created by the XLDB group and made available here.

See also the OLAC Language Resource Catalog.

Last update: 30 June 2016
