Test your programming skills
Contents
A frequent task in Information Retrieval (IR) is the calculation
of term frequencies: for every term, count how often it occurs
in a text. Here a term is defined as the stem of a word.
Examples:
| word | -> | word stem (term) |
| going | -> | go |
| apple | -> | appl |
| apples | -> | appl |
In this task the document in question (the first scene of Shakespeare's Hamlet) is
in XML format, so the file must first be downloaded and
parsed. Only the contents of the <LINE> elements are to be
used, meaning everything enclosed in <LINE>...</LINE>. From
this text the term frequencies (after stemming) are to be
calculated. The output of the program is a list of all terms,
together with their respective occurrence frequencies within
<LINE> elements. The output should look
like this:
| word | count |
| go | 7 |
| appl | 2 |
| situat | 5 |
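The extraction step described above (keeping only what is enclosed in <LINE>...</LINE>) could be sketched as follows. This is only an illustration, not a reference solution: it uses the standard JAXP DOM API (javax.xml.parsers), which Xerces also implements. The class name LineExtractor, the method extractLines and the sample XML (including the SCENE, SPEECH and SPEAKER element names) are made up for this example; only <LINE> comes from the task description.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class LineExtractor {

    // Returns the text content of every <LINE> element, in document order.
    // Parsing problems are rethrown as an unchecked exception to keep the sketch short.
    static List<String> extractLines(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("LINE");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() also includes the text of any nested elements.
                lines.add(nodes.item(i).getTextContent());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; the real task uses the downloaded Hamlet file.
        String xml = "<SCENE><TITLE>Sample</TITLE>"
                + "<SPEECH><SPEAKER>A</SPEAKER><LINE>Who's there?</LINE></SPEECH>"
                + "<SPEECH><SPEAKER>B</SPEAKER><LINE>Nay, answer me.</LINE></SPEECH>"
                + "</SCENE>";
        for (String line : extractLines(xml)) {
            System.out.println(line);
        }
    }
}
```

The returned strings can then be passed to the tokenizing, stemming and counting steps.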
It is recommended to implement term counting on plain text
first, and afterwards extend the program towards XML parsing.
Feel free to complete this task in your favourite language.
Further resources for solving the problem in the most
important languages are below. Feel free to contact us if
you have questions concerning this task. If you want us to
check your results, please send us the code and the output
of your running program.
Resources
Parsing of XML can be done with Xerces. For word stem reduction
there is a variant of the famous Porter Stemming Algorithm
available. For counting the occurrence frequencies one can use
java.util.StringTokenizer and
java.util.Hashtable.
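As a rough sketch of the counting step on plain text, the two classes named above could be combined like this. The class name TermCounter, the delimiter set and the stem method are assumptions made for illustration. In particular, stem is only a placeholder that returns the word unchanged; a real solution would call a Porter stemmer implementation there.

```java
import java.util.Hashtable;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

class TermCounter {

    // Hypothetical placeholder: returns the word unchanged.
    // A real solution would apply the Porter stemming algorithm here.
    static String stem(String word) {
        return word;
    }

    // Counts how often each (stemmed, lower-cased) token occurs in the text.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new Hashtable<>();
        // Split on whitespace and common punctuation (the delimiter set is an assumption).
        StringTokenizer tokens = new StringTokenizer(
                text.toLowerCase(Locale.ROOT), " \t\r\n.,;:!?\"()[]-");
        while (tokens.hasMoreTokens()) {
            String term = stem(tokens.nextToken());
            // Insert 1 for a new term, otherwise add 1 to the existing count.
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTerms("Go, go! The apples; the apple.");
        // Hashtable does not guarantee any particular iteration order.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println("| " + entry.getKey() + " | " + entry.getValue() + " |");
        }
    }
}
```

With the placeholder stemmer, "apples" and "apple" are still counted separately; once a real stemmer is plugged in, both would be counted under the term "appl".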