The assessment process
======================
(from the GikiCLEF Webpage)

This section describes the assessment process in more detail. Note that the assessment automatically receives information from the answers already inserted in the system by the topic owners during topic development.

As mentioned in the previous section, all answers will be pooled, but the same answer with different justifications (that is, different pages in the justification field) has to be considered and assessed separately, in order to validate the justification presented by the system. For a subset of the answers, more than one assessor is called to judge.

In a first pass, assessors have to indicate, for each answer candidate, one of three alternatives: a) '''correct'''; b) '''incorrect'''; c) '''uncertain''' (= don't know). (While the correct and incorrect choices both presuppose definite knowledge of the topic on the part of the assessor, choosing the last alternative presupposes that they first tried to fairly assess the correctness by viewing the justifications offered. Consulting other external sources is neither required nor even desirable.)

Then they have to state whether the page itself, together with the justification chain, provides enough justification for that answer, by choosing between a) '''Justified''' or b) '''Not justified'''. They are also required to enter a comment in the (hopefully rare) cases where, although the answer has been (automatically) flagged as correct, the material in the page explicitly contradicts it. Any other comments on difficult choices or doubts are also welcome.

Then the conflict assessment procedure takes place. It is carried out by the administrators, who contact the opposing parties if needed or correct the verdicts directly. After this process, assessors are informed of all changes to their judgements so that they can complain or redo other choices.
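The two choices an assessor makes, and the notion of disagreement that triggers conflict resolution, can be sketched as below. This is only an illustrative model, not the actual GikiCLEF system: the names `Correctness`, `Judgement` and `in_conflict` are hypothetical, and treating any difference in either choice as a disagreement is an assumption, since the text does not define exactly what counts as a conflict.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Correctness(Enum):
    """First-pass choice made by the assessor."""
    CORRECT = "correct"
    INCORRECT = "incorrect"
    UNCERTAIN = "uncertain"  # = don't know


@dataclass(frozen=True)
class Judgement:
    """One assessor's verdict on one pooled answer (hypothetical model)."""
    assessor: str
    correctness: Correctness
    justified: Optional[bool] = None  # None: justification not assessed
    comment: str = ""

    def key(self):
        # For incorrect answers the justification choice is not needed,
        # so it is ignored when comparing verdicts.
        if self.correctness is Correctness.INCORRECT:
            return (self.correctness, None)
        return (self.correctness, self.justified)


def in_conflict(judgements):
    """Return True if the judgements of one answer do not all agree.

    Assumption: any difference in either choice counts as disagreement.
    """
    return len({j.key() for j in judgements}) > 1
```

With this model, two assessors who both mark an answer as correct but differ on the justification would be flagged for the administrators, while two "incorrect" verdicts never conflict.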
After all answers have been assessed and problematic cases discussed, the score for individual runs is computed by the GikiCLEF evaluation system.

Further information, below:
* Assessment in the GikiCLEF assessment system: We have developed a complex system to help a large number of assessors work cooperatively in scoring participant systems, whose internals and rationale are described here
* Precise guidelines for GikiCLEF assessors: Given a particular answer to a particular question, how to proceed, and a detailed example.
* First information about conflict resolution: When we discovered that the assessment process was not that easy after all, and some information on that

1. Assessment in the GikiCLEF assessment system
===============================================
After all submissions have been received, the assessment process starts by producing an answer pool, in which different justifications for the same answer are considered as different answers. This pooling mechanism handles differences between HTML and XML (that is, only one of the two remains in the pool) and automatically filters out pages which are easy to discard as invalid answers (such as disambiguation pages or redirects).

Then, in order to minimize the assessors' workload, the system automatically adds the information it already has about the topics, namely that some answers are correct and self-justified, or correct (but not self-justified), provided this information is already present in the topic management system.

An assessor is then presented with a list of answers for which s/he has to
- either check the correctness by inspecting the pages and the justifications
- or just check the justifications, because the system already knows that the answer is correct but not self-justified

The assessor can also add comments about interesting issues (incompatible information in different languages, incorrect Wikipedia link translations, etc.) which may have a bearing on the evaluation score.

The system allows the same answers to be evaluated by different assessors, and stores all assessments, indexed by assessor, so that a subsequent conflict-solving procedure can be started if they do not agree. While assessing, assessors have access neither to the other assessors' judgements nor to the comments already entered about the answer in question.

After all answers in the pool have been classified, it is time for the evaluation system to take control.

2. Precise guidelines for GikiCLEF assessors
============================================
Upon entering the system and choosing the option ''Assess answers'', each assessor receives a list of answers to assess, depending on the languages he or she registered for, and in fact also depending on the administrators' knowledge of his or her linguistic capabilities and on the pool size for each language. Some of these answers have been chosen on purpose to overlap with those of other assessors, but the assessor does not know which are which.

From the point of view of system design, note that:
* Assessors are only requested to assess answers which are not yet stored in the system as correct and self-justified.
* Also, and in order to minimize assessors' work, when the system already knows that a particular answer is correct but that the page in itself does not provide a justification for the answer, the assessor is not required to do any work.
* If, on the other hand, the answer is known by the system to be correct and in need of external justification, the assessor is only required to check the validity of the justification provided.

To make the assessment process easier, the assessor is able to see the pre-selected answers to the topic, although warned that they are not complete -- and in fact in a few cases they have even turned out to be wrong (and were accordingly removed).
The only exception is the single case where the GikiCLEF question was closed rather than open (name the five Italian regions), where only five answers can be accepted as correct.

Finally, if a particular answer was considered incorrect by the assessor, in principle no further identical answers (with different justifications) will be presented to that assessor for assessing. This has, however, not been implemented yet. An assessor is nevertheless free to comment on and check all answers if s/he wishes to, to redo the assessment as many times as necessary, and to fill in comments about things that are not or may not be totally clear.

At the bottom of the page for each answer to be assessed there was a link identifying the answer, which assessors could also use to report problems or discuss complex issues.

2.1 Specific check lists
========================
After repeated discussion with individual assessors, the following instructions were sent to the assessors' list -- unfortunately not in time to avoid a lot of reassessing work -- which may also be interesting to participants and the public in general:
* GikiCLEF assessment should be easy to perform, because in the first step you don't need to read the whole document to find the answer. If the answer (= title of the Wikipedia article) is not of the type that was asked for, the answer is INCORRECT. Only if it is correct do you need to read the page to find out whether the justification is enough. So, "flag of Peru" is never a right answer for a question like "Which countries", although it might be perfect as a justification.
* Note about topic 2: although images are not shown in the collection, we are assuming that a user would be able to see the flag at once and therefore decide based on the visual clue. So please try to visualize the flag when assessing that topic. The same goes for images of mountains and of rivers if they appear on the main page of the answer.
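The pooling and workload rules described in sections 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the GikiCLEF implementation: the names `Known`, `Task`, `PooledAnswer`, `build_pool` and `required_task` are hypothetical, the page titles in the usage note are invented, and the HTML/XML handling and the filtering of disambiguation pages and redirects are left out. The task mapping follows the description in section 1.

```python
from dataclasses import dataclass
from enum import Enum


class Known(Enum):
    """What the topic management system already knows about an answer."""
    NOTHING = "nothing"
    CORRECT_SELF_JUSTIFIED = "correct and self-justified"
    CORRECT_NOT_SELF_JUSTIFIED = "correct but not self-justified"


class Task(Enum):
    """What is left for the assessor to do."""
    NONE = "nothing to assess"
    CHECK_JUSTIFICATION = "check the justification only"
    CHECK_BOTH = "check correctness and justification"


@dataclass(frozen=True)
class PooledAnswer:
    topic: str
    answer: str
    justification: tuple  # pages given in the justification field


def build_pool(submissions):
    """Pool (topic, answer, justification) triples from all runs.

    Exact duplicates are merged; the same answer with a different
    justification stays in the pool as a separate item.
    """
    pool, seen = [], set()
    for topic, answer, justification in submissions:
        item = PooledAnswer(topic, answer, tuple(justification))
        if item not in seen:
            seen.add(item)
            pool.append(item)
    return pool


def required_task(known):
    """Map prior knowledge to the work asked of the assessor."""
    if known is Known.CORRECT_SELF_JUSTIFIED:
        return Task.NONE
    if known is Known.CORRECT_NOT_SELF_JUSTIFIED:
        return Task.CHECK_JUSTIFICATION
    return Task.CHECK_BOTH
```

For instance, `build_pool([("GC-2009-01", "Page_A", []), ("GC-2009-01", "Page_A", ["Page_B"])])` keeps both items, because the justifications differ.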
2.2 Detailed example
====================
Take the following answer to GC-2009-01 "List the Italian places where Ernest Hemingway visited during his life." and imagine you are assessing the answer (which is correct). In the text of this page it says: which is an obvious justification for claiming that Hemingway WAS there.

So you click on '''Correct''' as far as the answer is concerned (or the system has already done that for you).

Then you proceed to assess whether the answer is justified, and read: This means: the justification presented by the system is the document itself, and no further proof is supposed to be necessary. And this is true in this case. So please click on '''Justified''' and marvel at such a miracle.

If the answer was incorrect, mark the answer as '''Incorrect''', and you don't need to do anything about the justification.

If you are uncertain but are sure that the document itself and/or the further "justifications" presented do not let you decide, please mark the answer as '''Uncertain''' and '''Not justified'''.

Now imagine that you had got this very same answer to assess in another language, let us say Portuguese, where nothing was said about Hemingway, but you already knew that it was correct. Then you could classify it as '''Correct''', but '''Not justified'''.

On the other hand, you could have been asked to assess it BEFORE you had read the English or Italian one, and you are not an expert on Hemingway. Then '''Uncertain''' would be the right answer. '''You are not required to go and reassess this answer if you learned it afterwards.''' But you could, of course. Both choices would lead to the same overall result for GikiCLEF systems.

3. First information about conflict resolution
==============================================
The (first) conflict resolution was done as follows.
The first impression, after looking at the conflicts concerning the very same answer in the same language and with the same justification, was that we could identify four different cases:
# cases where it is easy to decide who is right (such as a type mismatch, or information in the page that was not understood by a non-native speaker but leaves no doubt)
# cases where we do not know where the values "correct" or "incorrect" come from, and have to check how people arrived at that classification: did they really know, or did they just expect that the predetermined answers were absolutely correct -- the all and only? If they did, this must be revised :-(
# cases where there are genuine disagreements -- these will be sent to the two or three parties involved, possibly also with a copy to the topic owner
# cases where different language Wikipedias actually say different things, so that each answer has to be considered individually. Strictly speaking this is not a conflict between assessors but between information sources, but we have to deal with it with special care.

And these were the kinds of decisions taken during conflict resolution, which may still have to be revised by the assessors' group:
* If the answer is contained in the question, it is considered incorrect, so Italy is not a fair answer to "List the Italian places"
* If there is principled disagreement about vague, complex categories and different people have strong reasons for disagreement, for GikiCLEF we accept the union of all
** do poets speaking/writing in languages other than Romanian count as Romanian poets? YES
** studying in a place, taking a short visit to another place and coming back in love to that place, does it qualify as a place where someone falls in love? YES
** if a cyclist won the junior Tour de Flandres and then the adult one, is s/he considered a winner two times? YES
* Very slight differences which still strongly suggest a yes are accepted, because we would expect most people (except lawyers and logicians) to accept them:
** eight-thousanders accept a 50 m deviation (if a mountain is higher than 7950 m)
** Norwegian musicians convicted for burning (even if it is not mentioned that they burned churches)
** people who wrote ballads and published a lot of volumes of poetry
** people who have two residences, one in Switzerland and the other somewhere else, can be considered to have moved to Switzerland at some time in their lives
* If two of three sisters died of tuberculosis and the cause of death of the third is not certain, is "the sisters Bronte" correct? YES
* Still, although it can be used metonymically for the places where people were imprisoned, most of which were above the Arctic, Gulag was not considered a place.
* Also, fictional countries were not considered correct for fictional works, even if they were created in a written fictional work.

Due to these decisions, a new reassessment had to be performed.

================================
Main author/editor: Diana Santos