WPT 05 in English

This page is also available in Portuguese.

1 About the WPT 05 collection
2 WPT 05 metadata
- 2.1 Metadata contents
3 WPT 05 contents
4 WPT 05 Portuguese n-grams
- 4.1 N-grams contents
- 4.2 Statistics of the WPT 05 Portuguese n-grams collection
5 Terms of Use
6 Publications

About the WPT 05 collection

WPT 05 is a collection of over 10 million documents from the Portuguese web, obtained by the crawler of the Tumba! search engine and produced by the XLDB Node of Linguateca.

The contents were crawled in 2005 and harvested if they met either of the following criteria:

hosted in a .pt domain
written in Portuguese, hosted in a .com, .org, .net or .tv domain, and referenced by a hyperlink from at least one page hosted in a .pt domain.
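The two criteria can be read as a simple predicate over a page's URL, its detected language, and the hosts that link to it. The sketch below is only an illustration of that reading, not the crawler's actual code; the function names, the `linking_hosts` argument, and the use of the language code "pt" are assumptions.

```python
from urllib.parse import urlsplit

# Generic top-level domains accepted by the second criterion.
EXTRA_TLDS = (".com", ".org", ".net", ".tv")


def host_of(url):
    """Return the lower-cased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(url, language, linking_hosts):
    """Return True if a page meets either harvesting criterion.

    language      -- detected language code of the page (hypothetical input)
    linking_hosts -- hosts of pages known to link to this URL (hypothetical input)
    """
    host = host_of(url)
    # Criterion 1: the page is hosted in a .pt domain.
    if host.endswith(".pt"):
        return True
    # Criterion 2: Portuguese text, an accepted generic domain,
    # and at least one inbound link from a .pt host.
    linked_from_pt = any(h.lower().endswith(".pt") for h in linking_hosts)
    return language == "pt" and host.endswith(EXTRA_TLDS) and linked_from_pt
```

For example, `in_scope("http://example.com/a", "pt", ["www.di.fc.ul.pt"])` is true, while the same page with no inbound link from a .pt host is out of scope.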

The WPT 05 collection and related data are available in multiple versions and formats:

WPT 05 metadata: contains the attributes of each of the collected contents (including the automatically extracted text and the identified language) in the RDF/XML format.
WPT 05 contents: contains the harvested contents in raw form, as they were archived, in the Internet Archive ARC format.
WPT 05 Portuguese n-grams: includes the n-grams generated from the extracted text of the collected documents in Portuguese.

WPT 05 is the successor to WPT 03, a crawl from 2003 distributed by Linguateca since 2004.

WPT 05 metadata

The RDF/XML version of the WPT 05 collection uses RDF and the OAI-ORE specification to represent duplicates and the hierarchy of web pages. It includes the crawling metadata and the text extracted from each URL.

Characteristics of the WPT 05 metadata collection:

No duplicated text. The text of documents marked as duplicates is not included; for duplicated contents, we only provide the additional URL where we found the same contents.
Domain preservation. The hierarchy of pages within each domain is provided in the metadata.
Text-rich documents. Only the documents of the following MIME types are included: application/pdf, application/postscript, application/vnd.ms-office, text/html, text/plain, text/rtf.
UTF-8 encoded. All the extracted text is encoded in the UTF-8 format.
RDF/XML. Each file of the distributed collection is a valid XML file, so it can be handled by the tools commonly available for RDF and XML processing.
Language. We used ngramj to detect the language of each extracted text and provide the result in the <dc:language> tag.

Metadata contents

Below is an excerpt from a document representation in the WPT 05 collection:

 <rdf:Description rdf:about="http://www.di.fc.ul.pt/entrada.html">
   <ore:isAggregatedBy rdf:resource="http://www.di.fc.ul.pt"/>
   <wpt:ipAddr rdf:datatype="http://www.w3.org/2001/XMLSchema#string">194.117.22.87</wpt:ipAddr>

   <wpt:server rdf:datatype="http://www.w3.org/2001/XMLSchema#string">apache</wpt:server>
   <wpt:statusCode rdf:datatype="http://www.w3.org/2001/XMLSchema#int">200</wpt:statusCode>
   <dcterm:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-10-13T23:00:00Z</dcterm:modified>

   <wpt:fetched rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-11-23T12:32:44Z</wpt:fetched>
   <dc:format rdf:resource="text/html"/>
   <wpt:arcName rdf:resource="WPT-9-20080823090030-00857"/>
   <wpt:filteredText>Departamento de Informática - FCUL
&gt;
Logótipo DI DI DI Bem-vindo Somos o Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa.
...
  </wpt:filteredText>

  <dc:language>pt</dc:language>
</rdf:Description>

If a web page is a duplicate of another page, its representation would be as follows:

<rdf:Description rdf:about="http://www.di.fc.ul.pt/">
   <ore:isAggregatedBy rdf:resource="http://www.di.fc.ul.pt"/>

   <wpt:ipAddr rdf:datatype="http://www.w3.org/2001/XMLSchema#string">194.117.22.87</wpt:ipAddr>
   <wpt:server rdf:datatype="http://www.w3.org/2001/XMLSchema#string">apache</wpt:server>
   <wpt:statusCode rdf:datatype="http://www.w3.org/2001/XMLSchema#int">200</wpt:statusCode>

   <dcterm:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-10-13T23:00:00Z</dcterm:modified>
   <wpt:fetched rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-11-23T12:29:40Z</wpt:fetched>
   <dc:format rdf:resource="text/html"/>

   <wpt:arcName rdf:resource="WPT-9-20080819155514-00002"/>
   <wpt:duplicateOf>http://www.di.fc.ul.pt/entrada.html</wpt:duplicateOf>
 </rdf:Description>

For more information about the OAI-ORE specification, please check the OAI-ORE Primer or the OAI-ORE guide for XML.
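Because each file is valid XML, records like the two shown above can be read with a standard XML parser. The following is a minimal sketch using Python's standard library. It assumes the records sit inside an `rdf:RDF` root element, and the `wpt` namespace URI is a placeholder, since the excerpts do not show the namespace declarations; check the header of a real file for the actual values.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ore": "http://www.openarchives.org/ore/terms/",
    "wpt": "http://example.org/wpt#",  # ASSUMPTION: placeholder, not the real URI
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Shortened sample modelled on the excerpts; the root element is an assumption.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:ore="http://www.openarchives.org/ore/terms/"
    xmlns:wpt="http://example.org/wpt#">
 <rdf:Description rdf:about="http://www.di.fc.ul.pt/entrada.html">
   <ore:isAggregatedBy rdf:resource="http://www.di.fc.ul.pt"/>
   <wpt:filteredText>Departamento de Informática - FCUL</wpt:filteredText>
   <dc:language>pt</dc:language>
 </rdf:Description>
 <rdf:Description rdf:about="http://www.di.fc.ul.pt/">
   <ore:isAggregatedBy rdf:resource="http://www.di.fc.ul.pt"/>
   <wpt:duplicateOf>http://www.di.fc.ul.pt/entrada.html</wpt:duplicateOf>
 </rdf:Description>
</rdf:RDF>"""


def read_records(xml_text):
    """Return one dict per rdf:Description element in the given RDF/XML text."""
    root = ET.fromstring(xml_text)
    records = []
    for desc in root.findall("rdf:Description", NS):
        site = desc.find("ore:isAggregatedBy", NS)
        records.append({
            "url": desc.get(RDF_ABOUT),
            "site": site.get(RDF_RESOURCE) if site is not None else None,
            "language": desc.findtext("dc:language", default=None, namespaces=NS),
            "text": desc.findtext("wpt:filteredText", default=None, namespaces=NS),
            # Duplicates carry no text, only a pointer to the original URL.
            "duplicate_of": desc.findtext("wpt:duplicateOf", default=None, namespaces=NS),
        })
    return records


for record in read_records(SAMPLE):
    print(record["url"], record["language"], record["duplicate_of"])
```

A record with a non-empty `duplicate_of` has no text of its own; the text is found in the record whose URL it points to.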

WPT 05 contents

The WPT 05 contents collection contains the raw documents as they were crawled, without any sort of post-processing, such as filtering by type, elimination of duplicates, or encoding normalization.

To achieve this goal, we adopted the ARC format from the Internet Archive, designed for the specific purpose of preserving web pages as they were crawled.

For more information on the syntax and details of the ARC format, please check the ARC format specification.
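As a rough illustration of how such files can be read, the sketch below walks an uncompressed byte stream, assuming the ARC version 1 record layout: a header line of five space-separated fields (URL, IP address, archive date, content type, length in bytes), followed by exactly that many bytes of content and a separating newline. The sample records are built in memory for the demonstration and `make_record` is a hypothetical helper; the ARC specification remains the authority, and files that are gzip-compressed need to be decompressed first.

```python
import io


def iter_arc_records(stream):
    """Yield one dict per record from an uncompressed ARC v1 byte stream."""
    while True:
        header = stream.readline()
        if not header:
            return                      # end of stream
        if not header.strip():
            continue                    # blank line separating two records
        line = header.decode("latin-1").rstrip("\r\n")
        # Split from the right so that the last four fields are always
        # IP address, date, content type and length.
        url, ip, date, mime, length = line.rsplit(" ", 4)
        yield {"url": url, "ip": ip, "date": date, "mime": mime,
               "body": stream.read(int(length))}


def make_record(url, mime, body):
    """Build one sample record (hypothetical helper, for this demo only)."""
    header = f"{url} 194.117.22.87 20051123123244 {mime} {len(body)}\n"
    return header.encode("ascii") + body + b"\n"


SAMPLE = (
    make_record("http://www.di.fc.ul.pt/entrada.html", "text/html",
                b"<html>\n\nDI</html>")
    + make_record("http://www.di.fc.ul.pt/robots.txt", "text/plain",
                  b"User-agent: *\n")
)

for record in iter_arc_records(io.BytesIO(SAMPLE)):
    print(record["url"], record["mime"], len(record["body"]))
```

Reading the body by its declared length, rather than line by line, is what keeps the content byte-for-byte intact, including blank lines and arbitrary encodings.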

WPT 05 Portuguese n-grams

Contains the n-grams generated from the texts extracted from the WPT 05 harvested contents classified as Portuguese (7 million documents, or 26 gigabytes of text).

We generated up to 5-grams using the N-gram Statistics Package.
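To make "up to 5-grams" concrete, the sketch below counts every sequence of 1 to 5 consecutive tokens in a token list. It is a plain Python illustration of the idea, not the N-gram Statistics Package and not the procedure used to build the collection; the function name and the `max_n` parameter are assumptions.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Count all n-grams of length 1..max_n over a list of tokens.

    Returns a dict mapping n to a Counter of token tuples.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the sequence.
        for start in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[start:start + n])] += 1
    return counts


counts = count_ngrams("A detenção de cidadãos e a detenção de equipamentos".split())
print(counts[3][("detenção", "de", "cidadãos")])
```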

The following regular expressions were applied to tokenize the text:

 \w+                                                     # "word" character
 [\.,;:\?!]                                              # punctuation
 \w+\'\w+                                                # "word" connected by '
 \bn\.o                                                  # number
 [\w_.-]+ \@ [\w_.-]+\w                                  # emails
 \w+\.?[ºª�]\.?                                          # ordinals
 \d+(?:\/\d+)+                                           # dates or similar: 12/21/1
 \d+(?:[.,]\d+)+%?                                       # numbers
 \d+(?:\.[oa])+                                          # ordinals numbers: 12.o
 \d+\:\d+(\:\d+)?                                        # the time: 12:12:2
 ((https?|ftp|gopher)://|www)[\w_./~:-]+\w               # urls
 \w+\.(?:com|org|net|pt)                                 # simplified urls
 \w+(-\w+)+                                              # dá-lo-à
 \\\\unicode\{\d+\}                                      # unicode
 \w+\.(?:exe|html?|zip|jpg|gif|wav|mp3|png|t?gz|pl|xml)  # filenames

These regular expressions are part of the Perl extension for Portuguese NLP, which includes a tokenizer for Portuguese developed by Linguateca.
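For readers who do not use Perl, the sketch below shows how most of these expressions could be combined into a single tokenizer in Python. It is an approximation under stated assumptions, not a port of the Linguateca tokenizer: the patterns are joined as one alternation with the more specific ones first, so that a URL or a date is not split into words, and the ordinal-symbol and unicode patterns are left out.

```python
import re

# Verbose mode ignores whitespace and allows comments inside the pattern.
# ASSUMPTION: the ordering (specific patterns before generic ones) is ours.
TOKEN_RE = re.compile(r"""
    ((https?|ftp|gopher)://|www)[\w_./~:-]+\w                 # urls
  | [\w_.-]+ @ [\w_.-]+\w                                     # emails
  | \w+\.(?:exe|html?|zip|jpg|gif|wav|mp3|png|t?gz|pl|xml)    # filenames
  | \w+\.(?:com|org|net|pt)                                   # simplified urls
  | \d+(?:/\d+)+                                              # dates or similar
  | \d+:\d+(:\d+)?                                            # the time
  | \d+(?:[.,]\d+)+%?                                         # numbers
  | \d+(?:\.[oa])+                                            # ordinal numbers
  | \bn\.o                                                    # number
  | \w+(-\w+)+                                                # hyphenated forms
  | \w+'\w+                                                   # "word" connected by '
  | \w+                                                       # "word" characters
  | [.,;:?!]                                                  # punctuation
""", re.VERBOSE)


def tokenize(text):
    """Return the tokens found in text, in order of appearance."""
    # finditer is used instead of findall because the pattern has groups.
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


print(tokenize("Veja www.di.fc.ul.pt, às 12:30 de 23/11/2005."))
```

Characters that match none of the patterns, such as whitespace, are simply skipped.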

N-grams contents

The n-grams are available as a set of UTF-8 encoded files, containing the n-grams and their frequency, as shown below:

Example of 3-grams data:

   à Associação Montfort 4
   à Associação motivo 7
   à Associação Movimento 1
   à Associação Música 3
   à Associação Mulheres 2
   à Associação Mundial 4
   à Associação Municipal 3
   à Associação Municípios 1
   à Associação Museológica 1
   à Associação Musical 3

Example of 4-grams data:

   A detenção de Carlos 3
   A detenção de certas 1
   A detenção de Cães 1
   A detenção de cidadão 1
   A detenção de cidadãos 2
   A detenção de cinco 2
   A detenção de clérigos 1
   A detenção de Davoudi 4
   A detenção de equipamentos 1
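Lines in this layout can be split into the n-gram and its frequency with very little code. The sketch below assumes, based on the examples above, that fields are separated by whitespace, that tokens contain no whitespace, and that the frequency is always the last field; `parse_ngram_line` is a hypothetical helper name.

```python
def parse_ngram_line(line):
    """Split one line into (tuple of tokens, frequency)."""
    # Everything but the last whitespace-separated field is the n-gram.
    *tokens, frequency = line.split()
    return tuple(tokens), int(frequency)


print(parse_ngram_line("   A detenção de Carlos 3"))
```

When reading the files, open them with `encoding="utf-8"` so that accented characters are decoded correctly.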

Statistics of the WPT 05 Portuguese n-grams collection

After extraction, we discarded all n-grams containing tokens longer than 32 characters, as well as n-grams with a frequency below 5.
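This filtering rule can be written as a small predicate. The sketch below takes one reading of the rule, namely that an n-gram is dropped if any of its tokens exceeds 32 characters; the constant and function names are assumptions.

```python
MAX_TOKEN_LENGTH = 32   # tokens longer than this disqualify the n-gram
MIN_FREQUENCY = 5       # n-grams seen fewer times than this are dropped


def keep_ngram(tokens, frequency):
    """Return True if an n-gram passes both the length and frequency filters."""
    return (frequency >= MIN_FREQUENCY
            and all(len(token) <= MAX_TOKEN_LENGTH for token in tokens))
```

For instance, `keep_ngram(("A", "detenção", "de"), 5)` is true, while the same n-gram with a frequency of 4 is rejected.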

The n-gram counts are as follows:

Unigrams: 2 111 004 (25 MB)
Bigrams: 27 674 092 (432 MB)
Trigrams: 71 307 404 (1.4 GB)
Tetragrams: 89 668 947 (2.1 GB)
Pentagrams: 84 378 473 (2.3 GB)

The full n-grams collection has a total of 6.3 Gigabytes (1.6 Gigabytes compressed with bzip2).

Terms of Use

The WPT 05 collection is made available exclusively for research purposes. Commercial use is prohibited.

Linguateca and its XLDB node should be credited as the sources of the data in all public presentations of work that used this collection, including articles, theses, and communications at conferences and workshops.

As Linguateca's XLDB node organised the WPT 05 collection and Linguateca distributes it, the WPT 05 should be cited as follows:

A WPT 05 é um recurso criado pela Equipa de Investigação XLDB do LASIGE (http://xldb.di.fc.ul.pt/) em conjunto com a Linguateca.

The WPT 05 is a resource built by the XLDB Research Team of LASIGE (http://xldb.di.fc.ul.pt/) with Linguateca.

How to obtain the WPT 05

To obtain the collection, send us a message describing your interest in the WPT 05.

You will then have to fill in this form, which defines the terms of use of the collection, and send a signed copy to:

Fernando Ribeiro - Linguateca - FCCN

Apartado 50435

1708-001
Portugal

The form must be signed by the person in charge of the organization that will use the resource. Alternatively, you can send the form by fax to +351 21 847 21 67.

After receiving the form, we will provide the web site location and password for downloading a copy of the collection (available compressed with gzip or bzip2). Alternatively, you may request a copy of the collection on DVD (the WPT05-RDF/XML collection consists of two DVDs with gzip-compressed files).

Support and updates

We intend to support WPT 05 users. If you have questions about the collection, please send a message to Linguateca.

Similar collections (for Portuguese)

Linguateca publishes a catalog of collections for information retrieval in Portuguese.

Publications

For publications related to WPT, run the following search in Linguateca's catalogue.

Page in XLDB

XLDB has a twin page that can be accessed here.