Initiative for the Evaluation of XML retrieval




  • Good programming abilities in Java (compulsory)
  • Familiarity with XML

Task description

Query formulation and reformulation are recognized as among the most difficult tasks that users of information retrieval systems are asked to perform. To support this activity, related terms are to be computed from the given corpus using well-known corpus-based methods. Based on these similarities, methods for term clustering, for which we already have implementations, can also be integrated. In addition, a KWIC (keyword in context) index [Paynter & Witten 01] shall be integrated to give the user some context while selecting terms. The KWIC index should link to the corresponding passages, and the passages should link back to the index, to support navigation in both directions.
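The KWIC idea can be illustrated with a minimal Java sketch. It is not taken from the project or from [Paynter & Witten 01]; the class name KwicSketch, the Entry fields, and the whitespace tokenization are assumptions made purely for illustration. Each entry keeps an element identifier and a token position, which is one possible way to link a KWIC line to its passage.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal keyword-in-context (KWIC) sketch; all names here are hypothetical. */
class KwicSketch {

    /** One KWIC line: where the term occurred and the words around it. */
    static final class Entry {
        final String elementId; // assumed identifier of the XML element (passage)
        final int position;     // token offset of the term inside that element
        final String context;   // left context, [term], right context

        Entry(String elementId, int position, String context) {
            this.elementId = elementId;
            this.position = position;
            this.context = context;
        }
    }

    /** Collects every occurrence of term in text, with up to window tokens on each side. */
    static List<Entry> kwic(String elementId, String text, String term, int window) {
        List<Entry> entries = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return entries;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            // Compare case-insensitively, ignoring leading and trailing punctuation.
            String normalized = tokens[i].replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (!normalized.equalsIgnoreCase(term)) {
                continue;
            }
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length, i + window + 1);
            StringBuilder line = new StringBuilder();
            for (int j = from; j < to; j++) {
                if (j > from) {
                    line.append(' ');
                }
                // Mark the matched term with brackets so it stands out in the line.
                line.append(j == i ? "[" + tokens[j] + "]" : tokens[j]);
            }
            entries.add(new Entry(elementId, i, line.toString()));
        }
        return entries;
    }
}
```

For example, KwicSketch.kwic("sec1", "XML retrieval needs good XML queries.", "xml", 1) yields two entries with the contexts "[XML] retrieval" and "good [XML] queries.". The backward direction (from a passage to its KWIC lines) could be served by grouping entries by elementId.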

Sub goals:
  1. Document-element-based term extraction
  2. Similarity computation
  3. Application of a clustering technique
  4. Formative evaluation of the above-mentioned approach with document-based related terms, using a small group of users
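
The first two sub goals could be sketched as follows. The task description does not name a specific similarity measure, so this sketch assumes the Dice coefficient over element-level co-occurrence as one well-known choice; the class RelatedTermsSketch, its methods, and the simple tokenization are hypothetical and not part of the existing implementations mentioned above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch of element-based term extraction and term similarity; names are hypothetical. */
class RelatedTermsSketch {

    /** Sub goal 1: maps each lower-cased term to the indices of the elements containing it. */
    static Map<String, Set<Integer>> postings(List<String> elementTexts) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int i = 0; i < elementTexts.size(); i++) {
            // Split on anything that is neither a letter nor a digit (assumed tokenization).
            String[] tokens = elementTexts.get(i).toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(token, key -> new HashSet<>()).add(i);
            }
        }
        return postings;
    }

    /** Sub goal 2: Dice coefficient 2|A ∩ B| / (|A| + |B|) over the element sets of two terms. */
    static double dice(Map<String, Set<Integer>> postings, String a, String b) {
        Set<Integer> elementsA = postings.getOrDefault(a, Collections.<Integer>emptySet());
        Set<Integer> elementsB = postings.getOrDefault(b, Collections.<Integer>emptySet());
        if (elementsA.isEmpty() || elementsB.isEmpty()) {
            return 0.0; // an unseen term is treated as unrelated to everything
        }
        Set<Integer> shared = new HashSet<>(elementsA);
        shared.retainAll(elementsB);
        return 2.0 * shared.size() / (elementsA.size() + elementsB.size());
    }
}
```

With the three element texts "XML retrieval systems", "XML query languages" and "retrieval evaluation", the terms "xml" and "retrieval" share one of their two elements each, giving a Dice value of 0.5. Pairwise values of this kind could serve as input to a clustering step (sub goal 3).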


Paynter, G. W.; Witten, I. H. (2001).
A Combined Phrase and Thesaurus Browser for Large Document Collections. In Research and Advanced Technology for Digital Libraries. Proc. European Conference on Digital Libraries (ECDL 2001), volume 2163 of Lecture Notes in Computer Science, pages 25-36. Springer, Heidelberg et al.