hyspirit.application.indexing
Class MDSTools

java.lang.Object
  extended by hyspirit.application.indexing.MDSTools

public class MDSTools
extends java.lang.Object

Author:
Ingo Frommholz <ingo@is.informatik.uni-duisburg.de>

Created on 19-Oct-2005 16:18:49


Constructor Summary
MDSTools()
           
 
Method Summary
static void convertDoctermToIDF(java.lang.String doctermFile, java.lang.String idfFile, java.lang.String idfNorm, boolean optimisedStream)
          Reads an MDS file that is expected to contain document-term relations.
static void main(java.lang.String[] args)
          Some nifty tools for operations on MDS files
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

MDSTools

public MDSTools()
Method Detail

convertDoctermToIDF

public static void convertDoctermToIDF(java.lang.String doctermFile,
                                       java.lang.String idfFile,
                                       java.lang.String idfNorm,
                                       boolean optimisedStream)
Reads an MDS file that is expected to contain document-term relations. Any weight is ignored. The first column must contain the term, the second one the document. All other columns are ignored. Example entry:

0.78 ("test", "doc1")
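To make the entry layout concrete, here is a minimal sketch of how one such line could be split into term and document. The class DoctermEntry and its parse method are hypothetical helpers written for illustration; they are not part of HySpirit. The sketch assumes that the leading weight is optional and that names contain no quotes, parentheses or commas.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HySpirit): holds the term and document of one entry. */
final class DoctermEntry {
    // Assumed layout: an optional weight token, then ("term", "document").
    // Anything after the second column is not inspected.
    private static final Pattern ENTRY = Pattern.compile(
        "^\\s*(?:\\S+\\s+)?\\(\\s*\"([^\"(),]*)\"\\s*,\\s*\"([^\"(),]*)\"");

    final String term;
    final String document;

    DoctermEntry(String term, String document) {
        this.term = term;
        this.document = document;
    }

    /** Returns the parsed entry, or null if the line does not match the assumed layout. */
    static DoctermEntry parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            return null;
        }
        // The weight is skipped by the pattern and never stored.
        return new DoctermEntry(m.group(1), m.group(2));
    }
}
```

Calling DoctermEntry.parse on the example entry yields the term test and the document doc1; the weight 0.78 is discarded.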

Term and document names must not contain '"', '(', ')' or ','. The input stream from the docterm file should be optimised w.r.t. the second column: once the value in this column changes, the previous value must not show up again at a later position. For example,

0.78 ("test", "doc1")
0.43 ("test2", "doc1")
0.33 ("test", "doc2")

is optimised, since "doc1" does not appear after the second line, whereas

0.78 ("test", "doc1")
0.33 ("test", "doc2")
0.43 ("test2", "doc1")

is not optimised, since "doc1" appears both before and after "doc2". With an optimised stream, the algorithm does not have to store a list of all documents and terms already seen, but only the current ones, so the process will probably need much less memory than with a non-optimised stream.
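The memory argument can be illustrated with a small sketch of document-frequency counting over a stream grouped by document. This is not the actual MDSTools code; the class GroupedDocFreq and its count method are hypothetical, and the input is assumed to be already parsed into {term, document} pairs. Only the terms of the current document are remembered for duplicate detection, alongside the per-term counts that form the result.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch (not part of HySpirit): document frequencies from a grouped stream. */
final class GroupedDocFreq {

    /**
     * Counts in how many documents each term occurs.
     * Each element of pairs is {term, document}; all entries of one document
     * are assumed to be contiguous (the "optimised" ordering).
     */
    static Map<String, Integer> count(List<String[]> pairs) {
        Map<String, Integer> docFreq = new HashMap<>();
        Set<String> termsInCurrentDoc = new HashSet<>();
        String currentDoc = null;

        for (String[] pair : pairs) {
            String term = pair[0];
            String doc = pair[1];
            if (!doc.equals(currentDoc)) {
                // A new document starts: forget the terms of the previous one.
                currentDoc = doc;
                termsInCurrentDoc.clear();
            }
            // Count each term at most once per document.
            if (termsInCurrentDoc.add(term)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        return docFreq;
    }
}
```

On a non-optimised stream this sketch would count a term twice for a document whose entries are split into two runs, which is why that case would need a record of every (term, document) pair already seen.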

The method writes a corresponding IDF file using DocFreqList.

Parameters:
doctermFile - absolute name of the docterm file
idfFile - absolute name of the idf file
idfNorm - "max_idf" or "sum_idf"
optimisedStream - true if the input stream is optimised w.r.t. the document column, false otherwise
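The two idfNorm values are named here but not defined. The sketch below shows one plausible reading, stated purely as an assumption: idf(t) = log(N / df(t)) with N the number of documents, divided by the largest idf for "max_idf" or by the sum of all idf values for "sum_idf". The class IdfNormSketch is hypothetical, and the actual computation in MDSTools and DocFreqList may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not part of HySpirit): one assumed reading of the idfNorm values. */
final class IdfNormSketch {

    /** Returns normalised idf values; idfNorm must be "max_idf" or "sum_idf". */
    static Map<String, Double> normalise(Map<String, Integer> docFreq, int numDocs, String idfNorm) {
        Map<String, Double> idf = new HashMap<>();
        double max = 0.0;
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            // Assumed definition: idf(t) = log(N / df(t)).
            double value = Math.log((double) numDocs / e.getValue());
            idf.put(e.getKey(), value);
            max = Math.max(max, value);
            sum += value;
        }

        double denominator;
        if ("max_idf".equals(idfNorm)) {
            denominator = max;
        } else if ("sum_idf".equals(idfNorm)) {
            denominator = sum;
        } else {
            throw new IllegalArgumentException("Unknown idfNorm: " + idfNorm);
        }

        Map<String, Double> normalised = new HashMap<>();
        for (Map.Entry<String, Double> e : idf.entrySet()) {
            // Guard against division by zero when every term occurs in every document.
            normalised.put(e.getKey(), denominator == 0.0 ? 0.0 : e.getValue() / denominator);
        }
        return normalised;
    }
}
```

Under this reading, "max_idf" maps the rarest term to 1, while "sum_idf" makes the values sum to 1 across all terms.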

main

public static void main(java.lang.String[] args)
Some nifty tools for operations on MDS files

Parameters:
args -