Text filters
Text filters modify objects (in most cases, strings) in a
uniform way. In the interface de.unidu.is.text.Filter, the
method Iterator apply(Object) converts an object into a
sequence of other objects, returned as an iterator. A filter
can also be called with an iterator (method
Iterator apply(Iterator)); the filter is then applied to
each object returned by that iterator.
The sub-interface de.unidu.is.text.SingleItemFilter
can be used if each object is converted into exactly one
other object or null. Its advantage is that only
the method Object run(Object) has to be
implemented, which makes the filter easier to use where no
iterator handling is wanted. Most of the filters
described below are single-item filters.
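The relationship between the two interfaces can be illustrated with a small, self-contained sketch. It does not reproduce the library source: only the method signatures are taken from the description above, with generics added. The default method bodies, the treatment of null as "no output", and the names FilterSketch, drain and LOWERCASE are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

class FilterSketch {

    /** Stand-in for the Filter contract described above; not the library source. */
    interface Filter {
        Iterator<Object> apply(Object value);

        // Assumption: the iterator variant filters each element and
        // concatenates the per-element results in order.
        default Iterator<Object> apply(Iterator<?> values) {
            List<Object> out = new ArrayList<>();
            while (values.hasNext()) {
                Object element = values.next();
                Iterator<Object> part = apply(element);
                while (part.hasNext()) {
                    out.add(part.next());
                }
            }
            return out.iterator();
        }
    }

    /** Stand-in for SingleItemFilter: only run(Object) must be implemented. */
    interface SingleItemFilter extends Filter {
        Object run(Object value);

        // Assumption: a null result means "no output for this object".
        @Override
        default Iterator<Object> apply(Object value) {
            Object result = run(value);
            if (result == null) {
                return Collections.emptyIterator();
            }
            return Collections.singletonList(result).iterator();
        }
    }

    /** Collects everything an iterator returns into a list. */
    static List<Object> drain(Iterator<Object> it) {
        List<Object> items = new ArrayList<>();
        it.forEachRemaining(items::add);
        return items;
    }

    // Hypothetical example filter: lowercases strings, drops everything else.
    static final SingleItemFilter LOWERCASE =
            v -> (v instanceof String) ? ((String) v).toLowerCase(Locale.ROOT) : null;

    public static void main(String[] args) {
        System.out.println(drain(LOWERCASE.apply("Hello"))); // prints [hello]
        Iterator<Object> input = Arrays.<Object>asList("Text", 42, "FILTERS").iterator();
        System.out.println(drain(LOWERCASE.apply(input)));   // prints [text, filters]
    }
}
```

In this sketch a single-item filter needs only a one-line lambda, while callers can still treat it as a general filter and pass it either an object or an iterator.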
The package de.unidu.is.text contains a number of
pre-defined filters for Information Retrieval:
-
LowercaseFilter
- converts a string into lowercase
-
SoundexFilter
- converts a word into the corresponding Soundex
value
-
StemmerFilter
- converts a word into its stemmed version
-
StopwordFilter
- removes stop words
-
UntagFilter
- removes XML/HTML tags from a string
-
HTMLFilter
- extracts text from HTML strings, and converts entities
-
WordSplitterFilter
- splits a string into tokens: all non-letter characters are
converted into whitespace, the resulting string is split at
whitespace, and only tokens with a user-specified minimum
length are returned
-
WordConcatenatorFilter
- concatenates strings into a single string, separated by
spaces
-
CounterFilter
- counts the occurrences of strings, and returns
(object,frequency) tuples (instances of
de.unidu.is.util.Tuple)
-
ParserFilter
- a chain of the word splitter, lowercase, stemmer and
stopword filters (for convenient text parsing)
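The processing steps listed above can be illustrated with the standard library alone. The sketch below does not call the de.unidu.is.text classes, since their constructors are not shown here. Instead it reproduces the described behaviour of the word splitter, lowercase, stopword and counter steps. Stemming and Soundex are left out, and the class name ParsingSketch, its method names and the stop word list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ParsingSketch {

    /** Word splitting as described: non-letters become whitespace, short tokens are dropped. */
    static List<String> splitWords(String text, int minLength) {
        StringBuilder letters = new StringBuilder();
        for (char c : text.toCharArray()) {
            letters.append(Character.isLetter(c) ? c : ' ');
        }
        List<String> tokens = new ArrayList<>();
        for (String token : letters.toString().trim().split("\\s+")) {
            if (!token.isEmpty() && token.length() >= minLength) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Counting as described: maps each token to its number of occurrences. */
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String token : tokens) {
            frequencies.merge(token, 1, Integer::sum);
        }
        return frequencies;
    }

    /** Split, lowercase, remove stop words, then count (no stemming in this sketch). */
    static Map<String, Integer> parse(String text, Set<String> stopwords, int minLength) {
        List<String> kept = new ArrayList<>();
        for (String token : splitWords(text, minLength)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!stopwords.contains(lower)) {
                kept.add(lower);
            }
        }
        return count(kept);
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and"));
        // prints {cat=2, hat=1}
        System.out.println(parse("The cat and the hat: the CAT!", stopwords, 2));
    }
}
```

The map entries play the role of the (object, frequency) tuples mentioned for CounterFilter. The library itself returns de.unidu.is.util.Tuple instances instead.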
Custom filters can be built on two abstract classes in
the package de.unidu.is.text:
-
AbstractFilter
- defines an abstract filter class which allows for chaining
filters; subclasses only have to implement
Iterator filter(Object)
-
AbstractSingleItemFilter
- defines an abstract single-item filter class which allows
for chaining filters
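How an abstract base class can provide chaining while subclasses implement only Iterator filter(Object) is sketched below. The text does not say how the library wires its chains, so the constructor argument that names the following filter is a hypothetical design. The classes ChainingSketch, WhitespaceSplitter and Uppercase are also invented for illustration, and the nested AbstractFilter is a stand-in rather than the library class.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

class ChainingSketch {

    /** Stand-in base class; the constructor argument is a hypothetical way to chain. */
    static abstract class AbstractFilter {
        private final AbstractFilter next; // following filter, or null at the end of the chain

        AbstractFilter(AbstractFilter next) {
            this.next = next;
        }

        /** The only method a subclass implements. */
        protected abstract Iterator<Object> filter(Object value);

        /** Runs this filter, then feeds every result into the rest of the chain. */
        public final Iterator<Object> apply(Object value) {
            Iterator<Object> own = filter(value);
            if (next == null) {
                return own;
            }
            List<Object> out = new ArrayList<>();
            while (own.hasNext()) {
                next.apply(own.next()).forEachRemaining(out::add);
            }
            return out.iterator();
        }
    }

    /** Example multi-item filter: one string in, several tokens out. */
    static class WhitespaceSplitter extends AbstractFilter {
        WhitespaceSplitter(AbstractFilter next) {
            super(next);
        }

        @Override
        protected Iterator<Object> filter(Object value) {
            List<Object> parts = new ArrayList<>();
            for (String part : value.toString().trim().split("\\s+")) {
                if (!part.isEmpty()) {
                    parts.add(part);
                }
            }
            return parts.iterator();
        }
    }

    /** Example one-to-one filter: uppercases its input. */
    static class Uppercase extends AbstractFilter {
        Uppercase(AbstractFilter next) {
            super(next);
        }

        @Override
        protected Iterator<Object> filter(Object value) {
            List<Object> one = new ArrayList<>();
            one.add(value.toString().toUpperCase(Locale.ROOT));
            return one.iterator();
        }
    }

    /** Applies a chain to one input and collects the results. */
    static List<Object> run(AbstractFilter chain, Object input) {
        List<Object> items = new ArrayList<>();
        chain.apply(input).forEachRemaining(items::add);
        return items;
    }

    public static void main(String[] args) {
        AbstractFilter chain = new WhitespaceSplitter(new Uppercase(null));
        System.out.println(run(chain, "to be or not")); // prints [TO, BE, OR, NOT]
    }
}
```

Keeping the chaining logic in the base class means that each concrete filter stays a few lines long and never handles its successor directly.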