Measuring the Web in Portuguese
Rachel Aires
SINTEF Telecom and Informatics
SINTEF Telecom and Informatics
"Everyone" has heard figures about Internet size and growth (although not necessarily confirmed the figures ). It has been stated that Portuguese is the sixth language in the world in terms of native speakers, the fourth most used in Internet interaction, and that Portuguese native speakers constitute 3% of Internet population . It is thus high time to be actively concerned with this increasingly large community of users. Our focus here is on its finer characterization. Note that we use "Portuguese" in this paper always referring to the language and not to the country.
Sheer size
Even a straightforward concept such as "size" is multidimensional. We are interested here in size in terms of (Web) pages written in Portuguese and, whenever possible, in terms of words (although word counting is fairly tricky, too). It should be obvious that the domain name is not a reliable indicator of language: Martins & Silva [3], in a preliminary overview of the .pt domain, found that only 57% of the pages had Portuguese content.
To compute size we performed three different experiments:
1) Using our Portuguese corpora [4], we created queries that contained several of the most frequent grammatical items in Portuguese (thereby avoiding the well-known problem of homography with items of other languages), exploiting the fact that they are not treated as stop words for English or for the Internet in general. We assumed that any text longer than three sentences would contain this combination of grammatical words, so that we miss only pages with extremely short textual content. We then used three search engines (Alltheweb, Altavista and Google) to get the number of indexed pages, which was 11.5, 2.3 and 3.0 million, respectively. We checked whether it would make a difference to restrict the search to pages in Portuguese only (a search engine option). The change was negligible for the first two, while Google surprisingly returned only a quarter of the previous number of pages.
2) We selected relatively infrequent terms (cegonha, austero, arrumação, cara de pau, etc.), again based on our corpora (200 million for European Portuguese; 50 million for Brazilian Portuguese), and checked their number of occurrences. The idea was to estimate the full size of the Portuguese Web text based on their approximate ranking order. However, the disparities between the search engines were too big to allow any reasonable conclusion.
3) Replicating Grefenstette's estimation method and Portuguese words [5] (com, uma, os, não, ao, mas, muito, seu, são, eu, foi, você, ele, pela, quando, pode, brasil, seus, um) on Altavista, we arrived at 5,090,230,228 words in early November 2002.
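The idea behind the third experiment can be sketched as follows. This is only a minimal illustration, assuming that each probe word has a known relative frequency in a reference corpus and that a search engine reports a count for it. The function name, the choice of a plain average to combine the per-word estimates, and all numbers below are hypothetical placeholders; they are not the figures or the exact procedure used in this study.

```python
from statistics import mean


def estimate_total_words(hit_counts, relative_freqs):
    """Estimate the number of words of a language on the Web.

    hit_counts:     word -> count reported by a search engine (assumed input).
    relative_freqs: word -> relative frequency in a reference corpus
                    (occurrences divided by corpus size).
    Each probe word yields its own estimate, count / frequency; the
    per-word estimates are then averaged (an assumption of this sketch).
    """
    estimates = {
        word: hit_counts[word] / relative_freqs[word]
        for word in hit_counts
        if relative_freqs.get(word, 0) > 0  # skip words without a usable frequency
    }
    if not estimates:
        raise ValueError("no probe word has a usable corpus frequency")
    return mean(estimates.values()), estimates


# Hypothetical placeholder values, for illustration only.
hits = {"não": 9_000_000, "uma": 8_000_000, "quando": 1_500_000}
freqs = {"não": 0.009, "uma": 0.008, "quando": 0.0015}

total, per_word = estimate_total_words(hits, freqs)
print(f"estimated size: {total:,.0f} words")
```

Inspecting the per-word estimates, and not only their average, shows how much the probe words disagree with each other, which is a useful sanity check on the final figure.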
User population
A user of Portuguese content is not the same as a Web user whose native language is Portuguese, even though one would expect the two groups to overlap considerably. Many Web users in Brazil and Portugal may simply (or mainly) visit English material on the Web; conversely, there may be some users coming from domains other than .pt or .br, such as emigrants, language learners, or visiting researchers, who access Portuguese content. There are several sources that characterize Web use in Portugal and Brazil in sociological terms (i.e., who the Web users of these two countries are), but as far as we know none that take the opposite view (who accesses Web content in Portuguese?). This is much more difficult to measure without access to logs. We just present some hints on how this might partially be done: count how many references to the Web in Portuguese there are in other domains; if link statistics are made available by public multilingual sites, count how many times Portuguese pages are accessed instead of their counterparts in other languages.
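The first hint, counting references from other domains, could be approximated as in the sketch below. It assumes that a list of URLs of pages linking to Portuguese content has already been collected by some means; the function names and the example.* URLs are hypothetical, and treating only .pt and .br as "home" domains is a simplifying assumption.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: only these top-level domains count as "home" for Portuguese.
HOME_TLDS = {"pt", "br"}


def top_level_domain(url):
    """Return the last label of the host name ('' if the URL has no host)."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if host else ""


def external_referrers(referring_urls):
    """Count referring pages per top-level domain, ignoring home domains."""
    counts = Counter(top_level_domain(url) for url in referring_urls)
    return {tld: n for tld, n in counts.items() if tld and tld not in HOME_TLDS}


# Hypothetical referring pages, for illustration only.
referrers = [
    "http://www.example.no/links.html",
    "http://www.example.pt/index.html",
    "http://lists.example.org/portuguese.html",
    "http://www.example.no/other.html",
]
print(external_referrers(referrers))
```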
Content distribution
"Content" here can be broadly taken as subject area, genre (advertisement, scientific, informative, etc.), variant (Brazilian or European Portuguese), host type (organization, educational, commercial, personal pages, etc.), and resource kind, among others. Contrary to Ide et al. [6], who claim that, for American English, "web language on the whole is dramatically skewed toward dense, academic-like prose" -- a conclusion that follows from their prior requirement of finding unambiguously American authors -- we want to find Portuguese in any variant first, and only then classify it by genre or variant. One indirect way to do this is to select terms specific to given areas and compare their relative sizes. For relative Web presence compared to other languages and cultures, see [7]. We want to do a similar experiment with well-known variant clues.
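An experiment with variant clues might start from something like the sketch below. The clue lists contain only a few well-known lexical differences between Brazilian and European Portuguese and are purely illustrative; a real experiment would need a much larger, validated list, and the function name is hypothetical.

```python
import re

# Illustrative lexical clues only (assumption): bus, train, mobile phone.
BRAZILIAN_CLUES = {"ônibus", "trem", "celular"}
EUROPEAN_CLUES = {"autocarro", "comboio", "telemóvel"}


def guess_variant(text):
    """Guess the variant of a text by counting clue words of each kind."""
    tokens = re.findall(r"\w+", text.lower())
    brazilian = sum(token in BRAZILIAN_CLUES for token in tokens)
    european = sum(token in EUROPEAN_CLUES for token in tokens)
    if brazilian > european:
        return "Brazilian"
    if european > brazilian:
        return "European"
    return "undecided"  # no clues found, or a tie


print(guess_variant("Apanhei o autocarro e depois o comboio."))
print(guess_variant("Peguei o ônibus e liguei do celular."))
```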
Parallel content
A particular way of classifying content is according to whether versions or translations exist in other languages: is a Portuguese page parallel or unique? Is the Portuguese content a replica of content in other languages, or does it appear "stand-alone"? This is relevant not only to get an idea of how "natural" (vs. translationese) the Portuguese encountered is, but also to identify genuine areas where CLIR (cross-language information retrieval) really makes sense -- for EU documents, it is probably enough to select the desired language version. We plan to use an adapted form of Resnik's approach [8] to gather bilingual corpora on the Web, but our hypothesis is that most bilingual sites are hosted by .pt and .br computers.
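One simple way to find candidate parallel pages is sketched below. It is not Resnik's procedure itself, only a simplified URL-based heuristic under a strong assumption: that language versions of a page live on the same host and differ only by a two-letter language code in the path (for example /pt/ vs /en/, or index_pt.html vs index_en.html). The function name and the example URLs are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: the language code is delimited by '/', '_', '-' or '.'.
LANG_MARK = re.compile(r"(?<=[/_.\-])(pt|en)(?=[/_.\-]|$)")


def candidate_pairs(urls):
    """Pair URLs whose paths differ only in a pt/en language marker."""
    groups = defaultdict(dict)
    for url in urls:
        parts = urlsplit(url)
        # Search the path only, so that a .pt host name is never mistaken
        # for a language marker.
        match = LANG_MARK.search(parts.path)
        if match is None:
            continue
        neutral_path = parts.path[:match.start()] + "*" + parts.path[match.end():]
        groups[(parts.netloc, neutral_path)][match.group(1)] = url
    return sorted(
        (versions["pt"], versions["en"])
        for versions in groups.values()
        if "pt" in versions and "en" in versions
    )


# Hypothetical URLs, for illustration only.
urls = [
    "http://www.example.pt/pt/sobre.html",
    "http://www.example.pt/en/sobre.html",
    "http://www.example.br/index_pt.html",
    "http://www.example.br/contato.html",
]
print(candidate_pairs(urls))
```

Pairs found this way are only candidates: whether two pages are really translations of each other still has to be verified from their content.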
Search engine coverage
It is common knowledge that different engines are better for different things. Can differences also be demonstrated with respect to language coverage? Our preliminary results strongly suggest so, but we want to do a more encompassing comparison that also includes search engines developed specifically for the Brazilian and Portuguese Web universes, to see whether their use pays off in terms of coverage.
References
[1] Hannemyr, Gisle. "The Internet as Hyperbole: A Critical Examination of Adoption Rates", NOKOBIT 2001, Norsk konferanse for organisasjoners bruk av IT (Tromsø, 26.-28. november 2001), Agder: Høgskolen i Agder, pp. 129-46.
[2] Global Internet Statistics (by Language), http://www.glreach.com/globstats/index.php3
[3] Martins, Bruno & Mário J. Silva. "Is it Portuguese? Language detection in large document collections", CRC'01 - 4ª Conferência de Redes de Computadores (Covilhã, Novembro de 2001), http://xldb.fc.ul.pt/referencias/CRC2001Final/IsPortuguese.pdf
[4] Santos, Diana & Eckhard Bick. "Providing Internet access to Portuguese corpora: the AC/DC project", in Gavriladou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation, LREC2000 (Athens, 31 May-2 June 2000), pp. 205-210.
[5] Grefenstette, Gregory & Julien Nioche. "Estimation of English and non-English Language Use on the WWW", RIAO'2000 (Paris, April 12-14, 2000), http://www.xrce.xerox.com/competencies/content-analysis/publications/Documents/P19137/content/RIAO2000gref.pdf
[6] Ide, N., R. Reppen & K. Suderman. "The American National Corpus: More Than the Web Can Provide", Proceedings of the Third Language Resources and Evaluation Conference (LREC) (Las Palmas, Canary Islands, Spain, 2002), pp. 839-44.
[7] Línguas e culturas latinas na Internet, http://www.unilat.org/dtil/lenguainternet/pt/l_latinas_pt.asp
[8] Resnik, Philip. "Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text", in Farwell et al. (eds.), Machine Translation and the Information Soup, Springer Verlag, 1998, pp. 72-82.
© Rachel Aires & Diana Santos