Gramateca: Corpus-based grammar for Portuguese

Here we report on Gramateca, a broadly encompassing project for corpus-based studies of Portuguese launched by Linguateca. If you read Portuguese, check the Portuguese site of Gramateca.

We plan to use as raw material primarily the corpora that we make available through AC/DC, but we accept any other corpus-based data that grammarians can get their hands on. The main goal is to make the data, and especially the linguistic analyses of those data, widely available to the whole community interested in Portuguese grammar.

Since the AC/DC corpora have been automatically parsed by PALAVRAS (Bick 2000), this parser will be instrumental in the grammar studies, and one of the by-products of the present endeavour is a more encompassing description of the categories used by the parser, as well as of the problems that automatic analysis necessarily involves. Still, we hope to be able to make use of several other parsers for Portuguese as well, and their contribution is most welcome.

One of the sources of inspiration for the present project was Biber et al.'s (1999) corpus-based grammar, but we expect to be able to improve on their results in at least the following two ways:

all materials (and underlying corpora) are publicly available, so studies should be replicable, and therefore easier both to argue against and to counter with detailed argumentation for other interpretations;
more developed statistical techniques will be employed.

Since we are dealing with another language, we will also make use of the outstanding corpus-based works on Portuguese grammar that have been developed through the years, most notably the NURC project in Brazil and the Português Fundamental project in Portugal, which started more than 50 years ago and have continued to this day in high-quality published grammars such as Castilho's Gramática do Português Culto falado no Brasil or Raposo et al.'s Gramática do Português.

In addition, we will make use of the human-revised semantic annotation that is available (and underway) at AC/DC.

Areas of work

There has already been work (see corresponding pages, mainly in Portuguese) on

Conditional constructions
Differences between speech and writing
The human body
Emotions in language
Reported speech
Distant reading of literature, in Literateca

See also the papers that have been published on these subjects in a Gramateca context, which can be found by querying Linguateca's publication catalogue for works tagged with gramateca.

Sharing and funding

Everyone who wants to join Gramateca has to find their own funding, but we hope that being part of a high-quality, collectively run project will eventually help individuals to raise that funding.

There is also no straitjacket structure of chapters and no required publisher; the point is simply that this infrastructure will allow more work to be shared and written.

We hope that people will publish on what they do in Gramateca, and that we will also be able to publish several volumes under a Creative Commons license; see e.g. Canning (2014).

We are in any case grateful for Linguateca's (previous) funding over the years, namely from FCCN, FCT, MCES and other Portuguese and European programs, as well as to the University of Oslo, ILOS and especially its Gruppe for forskningsinfrastruktur for cluster maintenance and support.

References

Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. Longman Grammar of Spoken and Written English. London: Longman, 1999.
Bick, Eckhard. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis, Aarhus University. Aarhus, Denmark: Aarhus University Press, November 2000.
Canning, John. Statistics for Humanities. www.statisticsforhumanities.net, 2014.

Last update: 14 October 2017.

Contact Gramateca's team.