GikiP: GeoCLEF pilot for crosslingual geographic information retrieval from Wikipedia

Pilot organized by Linguateca, in the context of the GeoCLEF track of the CLEF campaign.

This is a pilot track initially suggested by Nuno Cardoso and accepted by the GeoCLEF organization during the GeoCLEF breakout session in Budapest in September 2007. The main organization was assigned to Diana Santos.

Task definition

Find Wikipedia entries / documents that answer a particular information need which requires geographical reasoning of some sort.

Participants should use the Wikipedia collection(s) used in the QA@CLEF main track (information on how to get them is provided on CLEF registration).

Fifteen topics were made available on 2 June from this site. Examples of topics (the exact formulations are available in three languages: English, German, and Portuguese):

  1. Which Dutch painters were famous for their portraits?
  2. Find European physicists who emigrated to the US between the two world wars of last century
  3. Which countries changed their borders in the XIX century?
  4. Places where Mozart lived
  5. Wars on Canadian soil

For each topic/question, the systems are supposed to return a list of Wikipedia articles, by providing their html.lst one line index. See fictive output format here.

Only answers / documents of the correct type are expected: for example, for topics 1 and 2, names of people (painters and scientists), not names of boats or countries; for topic 3, names of countries (not of wars or kings).

The maximum number of documents returned is 100, but systems are encouraged to try to return only the right ones (which will typically be far fewer than that).

Participants can submit at most two runs.


Systems' results will be evaluated according to the number of correct hits (N) and precision, using the simple formula mult*N*N/total for each topic, where mult rewards multilinguality.

A system's final score will be given by the average of the individual scores.

Evaluation will emphasize diversity and multilinguality: the best systems are those that can retrieve the most cases in the most languages. We will deliver (and assess) topics in Portuguese, German and English. An additional bonus, mult, will therefore be computed for multilinguality; it is 1, 2 or 3 depending on the number of languages with right answers for that topic.
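The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the organizers' evaluation script: the function and parameter names are hypothetical, and it assumes that "total" is the number of documents a system returned for the topic (so that N/total is the precision) and that a topic with nothing returned scores 0.

```python
from statistics import mean


def topic_score(correct: int, returned: int, mult: int) -> float:
    """Score for one topic: mult * N * N / total.

    correct  -- N, the number of correct hits
    returned -- total, ASSUMED to be the number of documents returned
    mult     -- 1, 2 or 3: number of languages with right answers
    """
    if returned == 0:
        # Assumption: an empty answer list scores zero.
        return 0.0
    return mult * correct * correct / returned


def final_score(topics: list[tuple[int, int, int]]) -> float:
    """Average of the per-topic scores; each item is (correct, returned, mult)."""
    if not topics:
        return 0.0
    return mean(topic_score(n, total, mult) for n, total, mult in topics)


# Hypothetical run: 6 correct out of 10 returned, right answers in 2 languages.
print(topic_score(6, 10, 2))  # 7.2
```

Because N is squared, the score equals N times the precision, scaled by mult: returning only right answers, and finding them in several languages, is rewarded over padding the list up to the 100-document limit.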

Participants in GikiP 2008

On schedule (deadline 12 June 2008), we had three systems participating:
REMBRANDT's Extended NER On IR Interactive retrievals. Nuno Cardoso, University of Lisbon, Faculty of Sciences, LaSIGE, XLDB (Portugal)
Iustin Dornescu, Research Group in Computational Linguistics (CLG) at the University of Wolverhampton (UK)
Sven Hartrumpf and Johannes Leveling, Intelligent Information and Communication Systems (IICS) at the FernUniversität in Hagen (Germany)

More information about GikiP's rationale and motivation

Where is this different from GeoCLEF with another collection?

Documents (i.e. Wikipedia articles) cannot be merely relevant to a topic; they need to constitute the right answer (and the most relevant one) to the user's information need as described in the topic formulation.

But obviously a set of highly interlinked encyclopaedia articles is different from a set of time-organized news, and this should be reflected in the kinds of topics one should ask.

Relationship to QA and IR

This is a hybrid between QA and IR, which can be described as open list questions, where the answers are the titles of Wikipedia articles, whose content provides justification for the answer.

The most similar task we know of is WiQA, in that one specific kind of question (the "other" questions) was tried on Wikipedia articles. The main difference is that we are interested in mining relevant geographic information, which is why this takes place as a GeoCLEF track.

Where's the geography?

The whole point of GikiP arose from considering that some kinds of questions issued in the GeoCLEF main task, such as "Rivers with vineyards", would be much better served by consulting an encyclopaedia than a set of news, which might mention such facts in passing but would hardly create news on that topic.

We also expect to go on to geographical topics with more complexity, such as those discussed in Gey et al. (2006), given that these facts are bound to be joined in entries about relevant subjects.


Current organization is by Diana Santos and Paula Carvalho from Linguateca. We are grateful to Yvonne Skalban from the University of Wolverhampton for translating the topics into German, to Ross Purves from the University of Zürich - Irchel for revising the English version, and to Sven Hartrumpf and his parser VOCADI, from the University of Hagen, for spotting several typos in the first version of the topics.


