WPT 05 in English
This page is also available in Portuguese.
About the WPT 05 collection
WPT 05 is a collection of over 10 million documents from the Portuguese web, obtained by the crawler of the Tumba! search engine and produced by the XLDB Node of Linguateca.
The contents were crawled in 2005 and harvested if they met either of the following criteria:
- hosted in a .pt domain
- written in Portuguese, hosted in a .com, .org, .net or .tv domain, and referenced by a hyperlink from at least one page hosted in a .pt domain.
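The two criteria can be read as a simple predicate over a URL. The sketch below is illustrative only and is not the Tumba! crawler's code: the function name `in_wpt_scope` and its `language` and `linked_from_pt` parameters are assumptions, standing in for the results of language identification and link analysis.

```python
from urllib.parse import urlsplit

# Generic top-level domains accepted by the second criterion.
GENERIC_TLDS = {"com", "org", "net", "tv"}


def top_level_domain(url: str) -> str:
    """Return the last label of the URL's host name, lower-cased."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def in_wpt_scope(url: str, language: str, linked_from_pt: bool) -> bool:
    """Hypothetical check mirroring the two harvesting criteria."""
    tld = top_level_domain(url)
    if tld == "pt":
        # First criterion: anything hosted in a .pt domain.
        return True
    # Second criterion: Portuguese text on a generic TLD, linked from a .pt page.
    return tld in GENERIC_TLDS and language == "pt" and linked_from_pt
```

For example, `in_wpt_scope("http://example.com/", "pt", False)` is rejected because no .pt page links to it, while any .pt URL is accepted regardless of language.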
The WPT 05 collection and related data are available in multiple versions and formats:
- WPT 05 metadata
- contains the attributes of each of the collected contents (including the automatically extracted text and the identified language) in the RDF/XML format.
- WPT 05 contents
- contains the harvested contents in raw form, as they have been archived, in the Internet Archive ARC format
- WPT 05 Portuguese n-grams
- includes the n-grams generated from the extracted text of the collected documents in Portuguese
WPT 05 is the successor to WPT 03, a 2003 crawl distributed by Linguateca since 2004.
WPT 05 metadata
The RDF/XML version of the WPT 05 collection uses RDF and the OAI-ORE specification to represent duplicates and the hierarchy of web pages. It includes the crawling metadata and the text extracted from each URL.
Characteristics of the WPT 05 metadata collection:
- No duplicated text. The text of documents marked as duplicates is not included; for duplicated contents, we only provide the additional URL where we found the same contents.
- Domain preservation. The hierarchy of pages within each domain is provided in the metadata.
- Text-rich documents. Only the documents of the following MIME types are included: application/pdf, application/postscript, application/vnd.ms-office, text/html, text/plain, text/rtf.
- UTF-8 encoded. All the extracted text is encoded in the UTF-8 format.
- RDF/XML. Each file of the distributed collection is a valid XML file, so it can be handled by the tools commonly available for RDF and XML processing.
- Language. We used ngramj to detect the language of each extracted text and provide the result in the <dc:language> tag.
Metadata contents
Below is an excerpt from a document representation in the WPT 05 collection:
<rdf:Description rdf:about="http://www.di.fc.ul.pt/entrada.html">
  <ore:isAggregatedBy rdf:resource="http://www.di.fc.ul.pt"/>
  <wpt:ipAddr rdf:datatype="http://www.w3.org/2001/XMLSchema#string">194.117.22.87</wpt:ipAddr>
  <wpt:server rdf:datatype="http://www.w3.org/2001/XMLSchema#string">apache</wpt:server>
  <wpt:statusCode rdf:datatype="http://www.w3.org/2001/XMLSchema#int">200</wpt:statusCode>
  <dcterm:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-10-13T23:00:00Z</dcterm:modified>
  <wpt:fetched rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-11-23T12:32:44Z</wpt:fetched>
  <dc:format rdf:resource="text/html"/>
  <wpt:arcName rdf:resource="WPT-9-20080823090030-00857"/>
  <wpt:filteredText>Departamento de Inform?tica - FCUL > Logótipo DI DI DI Bem-vindo Somos o Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa. ... </wpt:filteredText>
  <dc:language>pt</dc:language>
</rdf:Description>
If a web page is a duplicate of another page, it is represented as follows:
<rdf:Description rdf:about="http://www.di.fc.ul.pt/">
  <ore:isAggregatedBy rdf:resource="http://www.di.fc.ul.pt"/>
  <wpt:ipAddr rdf:datatype="http://www.w3.org/2001/XMLSchema#string">194.117.22.87</wpt:ipAddr>
  <wpt:server rdf:datatype="http://www.w3.org/2001/XMLSchema#string">apache</wpt:server>
  <wpt:statusCode rdf:datatype="http://www.w3.org/2001/XMLSchema#int">200</wpt:statusCode>
  <dcterm:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-10-13T23:00:00Z</dcterm:modified>
  <wpt:fetched rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2005-11-23T12:29:40Z</wpt:fetched>
  <dc:format rdf:resource="text/html"/>
  <wpt:arcName rdf:resource="WPT-9-20080819155514-00002"/>
  <wpt:duplicateOf>http://www.di.fc.ul.pt/entrada.html</wpt:duplicateOf>
</rdf:Description>
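Records of this shape can be read with a standard XML parser. The following is a minimal sketch, assuming the descriptions are wrapped in an `rdf:RDF` root element; the `wpt` namespace URI in the sample is a placeholder (the real one is declared in the distributed files), which is why child elements are matched by local name only. The helper names `read_records` and `text_for` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Small sample modelled on the excerpts above.
# The wpt namespace URI is a placeholder, not the one used by the collection.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wpt="http://example.org/wpt#">
  <rdf:Description rdf:about="http://www.di.fc.ul.pt/entrada.html">
    <wpt:filteredText>Bem-vindo</wpt:filteredText>
    <dc:language>pt</dc:language>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.di.fc.ul.pt/">
    <wpt:duplicateOf>http://www.di.fc.ul.pt/entrada.html</wpt:duplicateOf>
  </rdf:Description>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_records(xml_text):
    """Map each described URL to a dict of its properties, keyed by local name."""
    records = {}
    root = ET.fromstring(xml_text)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        url = desc.get(f"{{{RDF_NS}}}about")
        fields = {}
        for child in desc:
            # A value is either an rdf:resource attribute or the element text.
            value = child.get(f"{{{RDF_NS}}}resource")
            if value is None:
                value = (child.text or "").strip()
            fields[local_name(child.tag)] = value
        records[url] = fields
    return records


def text_for(url, records):
    """Return the extracted text for a URL, following duplicateOf if present."""
    fields = records[url]
    original = fields.get("duplicateOf")
    if original is not None:
        fields = records.get(original, {})
    return fields.get("filteredText")


records = read_records(SAMPLE)
print(text_for("http://www.di.fc.ul.pt/", records))
```

Because duplicate records carry no text of their own, `text_for` looks up the record named in `duplicateOf` and returns its text instead.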
For more information about the OAI-ORE specification, please check the OAI-ORE Primer or the OAI-ORE guide for XML.
WPT 05 contents
The WPT 05 contents collection contains the raw documents as they were crawled, without any sort of post-processing, such as filtering by type, elimination of duplicates, or encoding normalization.
To achieve this goal, we adopted the ARC format from the Internet Archive, designed for the specific purpose of preserving web pages as they were crawled.
For more information on the syntax and details of the ARC format, please check the ARC format specification.
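As a rough illustration of how such an archive can be traversed, the sketch below reads records from an uncompressed stream. It assumes the version 1 record layout (a header line with URL, IP address, archive date, content type and length, followed by that many bytes of content and a blank line); the layout of the distributed files should be checked against the specification. The function name `read_arc_records` and the sample record are hypothetical.

```python
import io


def read_arc_records(stream):
    """Yield (header, payload) pairs from an uncompressed ARC stream.

    Assumes five space-separated header fields per record (version 1 layout).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank line separating two records
        url, ip, date, mime, length = line.decode("latin-1").split()
        payload = stream.read(int(length))
        yield {"url": url, "ip": ip, "date": date, "mime": mime}, payload


# Synthetic single-record archive, built in memory for demonstration only.
body = b"HTTP/1.1 200 OK\r\n\r\n<html>ola</html>"
sample = (
    b"http://www.di.fc.ul.pt/ 194.117.22.87 20051123122940 text/html "
    + str(len(body)).encode("ascii")
    + b"\n"
    + body
    + b"\n"
)

for header, payload in read_arc_records(io.BytesIO(sample)):
    print(header["url"], header["mime"], len(payload))
```

The payload is returned as raw bytes, in keeping with the collection's goal of preserving documents without encoding normalization.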
WPT 05 Portuguese n-grams
Contains the n-grams generated from the text extracted from the WPT 05 harvested contents classified as Portuguese (7 million documents, or 26 gigabytes of text).
We generated up to 5-grams using the N-gram Statistics Package.
The following regular expressions were applied to tokenize the text:
\w+                                           # "word" character
[\.,;:\?!]                                    # punctuation
\w+\'\w+                                      # "word" connected by '
\bn\.o                                        # number
[\w_.-]+ \@ [\w_.-]+\w                        # emails
\w+\.?[ºª°]\.?                                # ordinals
\d+(?:\/\d+)+                                 # dates or similar: 12/21/1
\d+(?:[.,]\d+)+%?                             # numbers
\d+(?:\.[oa])+                                # ordinals numbers: 12.o
\d+\:\d+(\:\d+)?                              # the time: 12:12:2
((https?|ftp|gopher)://|www)[\w_./~:-]+\w     # urls
\w+\.(?:com|org|net|pt)                       # simplified urls
\w+(-\w+)+                                    # dá-lo-à
\\\\unicode\{\d+\}                            # unicode
\w+\.(?:exe|html?|zip|jpg|gif|wav|mp3|png|t?gz|pl|xml) # filenames
These regular expressions are part of the Perl extension for Portuguese NLP, which includes a tokenizer for Portuguese developed by Linguateca.
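To show how patterns like these behave in practice, here is a small Python sketch that ports a subset of them and combines them into a single alternation. It is not the Perl tokenizer itself: the choice of subset and the ordering (more specific patterns first, so that e-mail addresses and dates are not split by the generic word pattern) are assumptions made for this example.

```python
import re

# Subset of the patterns listed above, ported to Python syntax.
# All groups are non-capturing so that findall() returns whole tokens.
TOKEN_PATTERNS = [
    r"(?:(?:https?|ftp|gopher)://|www)[\w_./~:-]+\w",  # urls
    r"[\w_.-]+@[\w_.-]+\w",                            # emails
    r"\d+:\d+(?::\d+)?",                               # the time: 12:12:2
    r"\d+(?:/\d+)+",                                   # dates or similar
    r"\d+(?:[.,]\d+)+%?",                              # numbers
    r"\w+(?:-\w+)+",                                   # dá-lo-à
    r"\w+'\w+",                                        # "word" connected by '
    r"\w+",                                            # "word" characters
    r"[.,;:?!]",                                       # punctuation
]

TOKEN_RE = re.compile("|".join(TOKEN_PATTERNS))


def tokenize(text):
    """Return the tokens found in text, scanning left to right."""
    return TOKEN_RE.findall(text)


print(tokenize("Enviar para ana@example.pt até 12/10/2005, às 12:30."))
```

In Python 3, `\w` matches accented letters in `str` patterns by default, so words such as "até" and "dá-lo-à" are kept whole.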
N-grams contents
The n-grams are available as a set of UTF-8 encoded files, containing the n-grams and their frequency, as shown below:
Example of 3-grams data:
à Associação Montfort 4
à Associação motivo 7
à Associação Movimento 1
à Associação Música 3
à Associação Mulheres 2
à Associação Mundial 4
à Associação Municipal 3
à Associação Municípios 1
à Associação Museológica 1
à Associação Musical 3
Example of 4-grams data:
A detenção de Carlos 3
A detenção de certas 1
A detenção de Cães 1
A detenção de cidadão 1
A detenção de cidadãos 2
A detenção de cinco 2
A detenção de clérigos 1
A detenção de Davoudi 4
A detenção de equipamentos 1
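Entries like these can be split into tokens and a count with a few lines of Python. This sketch assumes one n-gram per line, whitespace-separated tokens, and the frequency as the last field; the function name `parse_ngram_line` is hypothetical.

```python
def parse_ngram_line(line):
    """Split 'tok1 tok2 ... count' into a tuple of tokens and an integer count."""
    *tokens, count = line.split()
    return tuple(tokens), int(count)


print(parse_ngram_line("A detenção de Carlos 3"))
```

Taking the count from the last field keeps the function independent of n, so the same code applies to every n-gram order.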
Statistics of the WPT 05 Portuguese n-grams collection
After extraction, all n-grams whose tokens have more than 32 characters were discarded, as were n-grams with frequencies below 5.
The n-gram counts are as follows:
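The counting and pruning steps can be sketched as follows. This is an illustration in Python, not the N-gram Statistics Package; it reads the length rule as "discard an n-gram if any of its tokens is longer than 32 characters", which is an assumption, and the names `count_ngrams` and `prune` are hypothetical.

```python
from collections import Counter

MAX_TOKEN_LENGTH = 32  # longest token kept, in characters
MIN_FREQUENCY = 5      # n-grams seen fewer times are dropped


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prune(counts):
    """Keep n-grams that are frequent enough and contain no over-long token."""
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq >= MIN_FREQUENCY
        and all(len(token) <= MAX_TOKEN_LENGTH for token in ngram)
    }


tokens = ["a", "b"] * 5 + ["x" * 33]
print(prune(count_ngrams(tokens, 2)))
```

In this toy input only the bigram ("a", "b") occurs five times, so it is the only one that survives pruning.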
- Unigrams: 2 111 004 (25 MB)
- Bigrams: 27 674 092 (432 MB)
- Trigrams: 71 307 404 (1.4 GB)
- Tetragrams: 89 668 947 (2.1 GB)
- Pentagrams: 84 378 473 (2.3 GB)
The full n-grams collection has a total of 6.3 Gigabytes (1.6 Gigabytes compressed with bzip2).
Terms of Use
The WPT 05 collection is made available exclusively for research purposes. Commercial use is prohibited.
Linguateca and its XLDB node should be cited as the sources of the data in all public presentations of work that has used this collection, including articles, theses, and communications at conferences and workshops.
As Linguateca's XLDB node organised the WPT 05 collection and Linguateca distributes it, the WPT 05 should be cited as follows:
How to obtain the WPT 05
You need to send us a message describing your interest in obtaining the WPT 05 collection:
You will have to fill in and send a signed copy of this form, which defines the terms of use of the collection, to:
1708-001
Portugal
The form must be signed by the person in charge of the organization that will use the resource. Alternatively, you can send the form to the fax number: +351 21 847 21 67.
After receiving the form, we will provide the web site location and password for transferring a copy of the collection (available compressed with gzip or bzip2). Alternatively, you may request a copy of the collection on DVD (the WPT05-RDF/XML collection consists of two DVDs with gzip-compressed files).
Support and updates
We intend to support WPT 05 users. If you have questions about the collection, please send a message to Linguateca.
Similar collections (for Portuguese)
Linguateca publishes a catalog of collections for information retrieval in Portuguese.
Publications
For publications related to WPT, run the following search in Linguateca's catalogue.
Page in XLDB
XLDB has a twin page that can be accessed here.