Literateca: more than the literary subset of Gramateca

Given that a sizeable subset of Gramateca's texts are literary, we decided to embark on a statistically informed, corpus-based study of lusophone literature. The goal is to be able to answer questions of literary relevance using a corpus-based methodology. We call this combination of (linguistically annotated) literary texts and corpus tools Literateca.

To make it usable, we first had to solve two problems: removing duplicate texts and reconciling the multiple author identifiers used across the different corpora.

Unique identification of authors

The first issue we had to solve was the identification of unique authors, given that the different (literary) corpora identified authors and spelled their names in different ways.

So we produced a list of unique authors, or rather, a way to assign a unique identifier to the full set of author descriptions in our corpora.

These identifiers are not authoritative. We believe that defining authors is a prerogative of the national libraries of the lusophone countries. The identifiers are purely descriptive, so that our users know what they refer to in our corpora. In line with Linguateca's philosophy in all areas of activity, this is simply documentation of the author identifiers in Literateca. We are nevertheless grateful to users who report any flaws or inaccuracies they detect.
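One way to picture such a mapping is as a lookup from each corpus-specific author description to a single shared identifier. The Python sketch below is only an illustration: the normalization rule, the identifier format, and the sample name variants are all assumptions, not Literateca's actual procedure or data.

```python
import unicodedata

def normalize(description: str) -> str:
    """Reduce an author description to a comparison key (hypothetical rule)."""
    # Reorder "Surname, Forename" to "Forename Surname".
    if "," in description:
        surname, forename = description.split(",", 1)
        description = f"{forename} {surname}"
    # Strip diacritics, lowercase, and collapse whitespace.
    decomposed = unicodedata.normalize("NFKD", description)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(ascii_only.lower().split())

def assign_identifiers(descriptions):
    """Map every description to an identifier shared by all its variants."""
    key_to_id = {}
    result = {}
    for description in descriptions:
        key = normalize(description)
        if key not in key_to_id:
            # Hypothetical identifier format: "A" plus a running number.
            key_to_id[key] = f"A{len(key_to_id) + 1:04d}"
        result[description] = key_to_id[key]
    return result

# Invented variants of the kind different corpora might use.
variants = ["Eça de Queirós", "Queirós, Eça de", "ECA DE QUEIROS", "Machado de Assis"]
ids = assign_identifiers(variants)
```

A rule this simple only catches differences in word order, case, and diacritics. Genuinely different spellings or pseudonyms would still need manual curation.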

The first data

The first study in Literateca, in May 2017, collected the data in two (tab-separated) tables, which we make available here: table 1 and table 2 (last update: 28 March 2019).

Basically, for each work, we measured some features that could be used in exploratory studies and for visualization of the material.
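Because the tables are tab-separated, they can be read with standard tools. The Python sketch below parses a small inline sample. The column names (`work`, `corpus`, `year`, `tokens`) and the values are invented for illustration and do not reflect the actual layout of table 1 or table 2.

```python
import csv
import io

# Hypothetical sample standing in for one of the tab-separated tables.
SAMPLE = (
    "work\tcorpus\tyear\ttokens\n"
    "w1\tVercial\t1875\t120000\n"
    "w2\tOBras\t1899\t65000\n"
    "w3\tVercial\t1846\t98000\n"
)

def read_works(handle):
    """Parse tab-separated rows into dicts, converting numeric columns."""
    works = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["year"] = int(row["year"])
        row["tokens"] = int(row["tokens"])
        works.append(row)
    return works

works = read_works(io.StringIO(SAMPLE))

# Example exploratory summary: total tokens per corpus.
tokens_per_corpus = {}
for w in works:
    tokens_per_corpus[w["corpus"]] = tokens_per_corpus.get(w["corpus"], 0) + w["tokens"]
```

To read a downloaded file instead, the same function could be given an open file handle, for example `open(path, encoding="utf-8", newline="")`. The encoding is also an assumption.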

Chronological description of the material:

Colour-coded by corpus: Vercial (rose), OBras (green), NOBRE (red), Tycho Brahe (light blue), Colonia (black) and PANTERA (dark blue), which was the (initial) order in which we included the works.

Application of statistical techniques to the material

For the moment we can only present static results created in R, but our intention is to develop an environment in Gramateca where users themselves can create the figures they want, once they have chosen the features and techniques they wish to employ.

Until then, we are willing to produce any figures one may require from Literateca, which also helps us understand user requirements. Some examples:

Emotions per half-century (50-year periods):
Principal components analysis:
Correspondence analysis:
Factor analysis with varimax rotation:
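As a rough illustration of the grouping behind the half-century figure, the Python sketch below assigns each work to a 50-year period and averages a per-work score within each period. The field names and scores are hypothetical, and how emotions are actually measured in Literateca is not described here.

```python
from collections import defaultdict

def half_century(year: int) -> str:
    """Return the 50-year period containing `year`, e.g. 1875 -> '1850-1899'."""
    start = (year // 50) * 50
    return f"{start}-{start + 49}"

def mean_by_half_century(works, field):
    """Average `field` over all works falling in the same half-century."""
    groups = defaultdict(list)
    for work in works:
        groups[half_century(work["year"])].append(work[field])
    return {period: sum(vals) / len(vals) for period, vals in sorted(groups.items())}

# Invented per-work values, for illustration only.
sample = [
    {"year": 1846, "emotion_rate": 0.02},
    {"year": 1875, "emotion_rate": 0.03},
    {"year": 1899, "emotion_rate": 0.05},
]
means = mean_by_half_century(sample, "emotion_rate")
```

The period boundaries used here (1800-1849, 1850-1899, and so on) are one possible convention. Other cut points would work equally well with a small change to `half_century`.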

Last update: 30 March 2019.

Contact Linguateca's corpus-based grammar team.