Citation-Key:
Klas/Fuhr:00
Title:
A new Effective Approach for Categorizing Web Documents
Author(s):
Claus-Peter Klas
Norbert Fuhr
In:
Proceedings of the 22nd BCS-IRSG Colloquium on IR Research
Year:
2000

Abstract:
Categorization of Web documents poses a new challenge for automatic classification methods. In this paper, we present the megadocument approach for categorization. For each category, all corresponding document texts from the training sample are concatenated into a megadocument, which is indexed using standard methods. In order to classify a new document, the most similar megadocument determines the category to be assigned. Our evaluations show that for Web collections, the megadocument method clearly outperforms other classification methods. In contrast, for the Reuters collection, we only achieve mediocre results. Thus, our method seems to be well suited for heterogeneous document collections.
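
The procedure described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract says only that megadocuments are "indexed using standard methods", so TF-IDF weighting with cosine similarity is assumed here as a stand-in, and the whitespace tokenizer and all function names (`build_megadocuments`, `index_megadocuments`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


def build_megadocuments(training):
    """Concatenate all training texts of a category into one token list.

    `training` is an iterable of (category, text) pairs.
    """
    mega = defaultdict(list)
    for category, text in training:
        mega[category].extend(tokenize(text))
    return mega


def index_megadocuments(mega):
    """Build one TF-IDF vector per megadocument (assumed weighting scheme)."""
    n = len(mega)
    df = Counter()
    for tokens in mega.values():
        df.update(set(tokens))
    # Smoothed IDF so that terms occurring in every category keep a weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {}
    for category, tokens in mega.items():
        tf = Counter(tokens)
        vectors[category] = {t: c * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, vectors, idf):
    """Assign the category whose megadocument is most similar to `text`."""
    tf = Counter(tokenize(text))
    # Terms never seen in training are ignored.
    query = {t: c * idf[t] for t, c in tf.items() if t in idf}
    return max(vectors, key=lambda category: cosine(query, vectors[category]))
```

With a small training sample of (category, text) pairs, one would call `build_megadocuments`, pass the result to `index_megadocuments`, and then call `classify` on each new document. Note that a document sharing no terms with any megadocument scores zero against all categories, in which case this sketch simply returns the first category; a real system would need an explicit fallback.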
