Universität Duisburg-Essen

Text filters

Text filters modify objects (in most cases, strings) in a uniform way. In the interface de.unidu.is.text.Filter, the method Iterator apply(Object) converts an object into a sequence of other objects, returned as an iterator. A filter can also be called with an iterator (method Iterator apply(Iterator)); the filter is then applied to each object returned by that iterator.
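The two apply methods can be illustrated with a minimal sketch. The interface below is a local stand-in that only copies the two signatures named above (with generics added for readability); it is not the library source. WhitespaceSplitter, FilterDemo and drain are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Local stand-in for the interface described in the text (assumption, not library code).
interface Filter {
    // Converts one object into zero or more objects.
    Iterator<Object> apply(Object value);

    // Applies the filter to every object delivered by the iterator.
    Iterator<Object> apply(Iterator<?> values);
}

// Hypothetical example filter: splits a string at whitespace.
class WhitespaceSplitter implements Filter {
    @Override
    public Iterator<Object> apply(Object value) {
        List<Object> tokens = new ArrayList<>();
        for (String token : value.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.iterator();
    }

    @Override
    public Iterator<Object> apply(Iterator<?> values) {
        List<Object> result = new ArrayList<>();
        while (values.hasNext()) {
            // Each incoming object may yield several outgoing objects.
            Iterator<Object> converted = apply(values.next());
            while (converted.hasNext()) {
                result.add(converted.next());
            }
        }
        return result.iterator();
    }
}

class FilterDemo {
    // Collects the remaining elements of an iterator into a list.
    static List<Object> drain(Iterator<Object> it) {
        List<Object> items = new ArrayList<>();
        while (it.hasNext()) {
            items.add(it.next());
        }
        return items;
    }

    public static void main(String[] args) {
        Filter splitter = new WhitespaceSplitter();
        // Single object in, several objects out.
        System.out.println(drain(splitter.apply("text filters")));   // [text, filters]
        // Iterator in: the filter is applied to each element in turn.
        System.out.println(drain(splitter.apply(Arrays.asList("a b", "c").iterator())));   // [a, b, c]
    }
}
```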

The sub-interface de.unidu.is.text.SingleItemFilter can be used if each object is converted into exactly one other object or into null. Its advantage is that only the method Object run(Object) has to be implemented, which simplifies using the filter where no iterator handling is wanted. Most of the filters described below are single-item filters.
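A single-item filter can be sketched as follows. Only the signature Object run(Object) is taken from the text; the interface is a local stand-in, and LowercaseSketch, SingleItemDemo and applyAll are hypothetical names. How the library itself derives iterator handling from run is not stated here, so the adapter is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Local stand-in for the sub-interface described in the text (assumption, not library code).
interface SingleItemFilter {
    // Returns the converted object, or null if nothing is produced for it.
    Object run(Object value);
}

// Hypothetical single-item filter: lowercases strings and drops empty ones.
class LowercaseSketch implements SingleItemFilter {
    @Override
    public Object run(Object value) {
        String s = value.toString();
        return s.isEmpty() ? null : s.toLowerCase(Locale.ROOT);
    }
}

class SingleItemDemo {
    // Assumed adapter: applies run(Object) to each element and skips null results.
    static List<Object> applyAll(SingleItemFilter filter, Iterator<?> values) {
        List<Object> result = new ArrayList<>();
        while (values.hasNext()) {
            Object converted = filter.run(values.next());
            if (converted != null) {
                result.add(converted);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SingleItemFilter lower = new LowercaseSketch();
        // Direct use without any iterator handling.
        System.out.println(lower.run("Text"));   // text
        // The empty string maps to null and is therefore skipped.
        System.out.println(applyAll(lower, Arrays.asList("Text", "", "FILTERS").iterator()));   // [text, filters]
    }
}
```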

The package de.unidu.is.text contains several predefined filters for information retrieval:

LowercaseFilter
converts a string into lowercase
SoundexFilter
converts a word into its Soundex code
StemmerFilter
converts a word into its stemmed form
StopwordFilter
removes stop words
UntagFilter
removes XML/HTML tags from a string
HTMLFilter
extracts text from HTML strings, and converts entities
WordSplitterFilter
splits a string into tokens: all non-letter characters are converted into whitespace, the resulting string is split at the whitespace, and only tokens with a user-specified minimum length are returned
WordConcatenatorFilter
concatenates strings into one single string, separated by spaces
CounterFilter
counts the occurrences of strings, and returns (object,frequency) tuples (instances of de.unidu.is.util.Tuple)
ParserFilter
chains the word splitter, lowercase, stemmer and stopword filters (for convenient text parsing)
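The processing steps described for WordSplitterFilter, LowercaseFilter, StopwordFilter and CounterFilter can be sketched in plain Java. This is an illustration of the described behaviour only, not the library's implementation: it does not use the de.unidu.is.text classes, it omits stemming, and ParsingSketch with its methods is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ParsingSketch {
    // Word splitting as described: non-letters become spaces, then split at whitespace.
    static List<String> splitWords(String text, int minLength) {
        String lettersOnly = text.replaceAll("[^\\p{L}]", " ");
        List<String> tokens = new ArrayList<>();
        for (String token : lettersOnly.trim().split("\\s+")) {
            if (!token.isEmpty() && token.length() >= minLength) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Counting as described: one (string, frequency) pair per distinct string.
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String token : tokens) {
            frequencies.merge(token, 1, Integer::sum);
        }
        return frequencies;
    }

    // Split, lowercase, remove stop words, then count (stemming left out of this sketch).
    static Map<String, Integer> parse(String text, int minLength, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : splitWords(text, minLength)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!stopwords.contains(lower)) {
                kept.add(lower);
            }
        }
        return count(kept);
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(parse("The cat, the CAT and a dog!", 2, stopwords));   // {cat=2, dog=1}
    }
}
```

With a minimum length of 2 the token "a" is dropped by the splitter, while "the" and "and" are removed by the stop word step.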

Custom filters can be built on two abstract classes in the package de.unidu.is.text:

AbstractFilter
defines an abstract filter class which allows for chaining filters; subclasses only have to implement Iterator filter(Object)
AbstractSingleItemFilter
defines an abstract single-item filter class which allows for chaining filters
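One way such a chainable base class might work is sketched below. Only the method name Iterator filter(Object) comes from the text; the constructor argument, the order in which chained filters run, and the class names ChainableFilterSketch, SplitSketch, LowerSketch and ChainDemo are assumptions made for illustration and may differ from the library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Assumed design: each filter optionally wraps a previous filter that runs first.
abstract class ChainableFilterSketch {
    private final ChainableFilterSketch previous;   // may be null

    protected ChainableFilterSketch(ChainableFilterSketch previous) {
        this.previous = previous;
    }

    // The only method a subclass has to provide (name taken from the text).
    protected abstract Iterator<Object> filter(Object value);

    // Runs the previous filter (if any), then this filter on each of its results.
    public Iterator<Object> apply(Object value) {
        if (previous == null) {
            return filter(value);
        }
        List<Object> result = new ArrayList<>();
        Iterator<Object> upstream = previous.apply(value);
        while (upstream.hasNext()) {
            Iterator<Object> converted = filter(upstream.next());
            while (converted.hasNext()) {
                result.add(converted.next());
            }
        }
        return result.iterator();
    }
}

// Hypothetical subclass: splits a string at whitespace.
class SplitSketch extends ChainableFilterSketch {
    SplitSketch(ChainableFilterSketch previous) { super(previous); }

    @Override
    protected Iterator<Object> filter(Object value) {
        List<Object> tokens = new ArrayList<>();
        for (String token : value.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.iterator();
    }
}

// Hypothetical subclass: lowercases each string.
class LowerSketch extends ChainableFilterSketch {
    LowerSketch(ChainableFilterSketch previous) { super(previous); }

    @Override
    protected Iterator<Object> filter(Object value) {
        Object lower = value.toString().toLowerCase(Locale.ROOT);
        return Collections.singletonList(lower).iterator();
    }
}

class ChainDemo {
    // Applies a filter to one value and collects the results into a list.
    static List<Object> run(ChainableFilterSketch filter, Object value) {
        List<Object> items = new ArrayList<>();
        Iterator<Object> it = filter.apply(value);
        while (it.hasNext()) {
            items.add(it.next());
        }
        return items;
    }

    public static void main(String[] args) {
        // Splitting runs first, lowercasing second.
        ChainableFilterSketch chain = new LowerSketch(new SplitSketch(null));
        System.out.println(run(chain, "Text Filters"));   // [text, filters]
    }
}
```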