README file for moVMF v. 1.0

USAGE: moVMF [switches] word-doc-file
-a algorithm
   s: moVMF algorithm (default)
-i [s|p|r|f]
  initialization method:
     s -- subsampling
     p -- random perturbation (default)
     r -- totally random
     f -- read from file
-c number-of-clusters
-e epsilon
-s suppress output
-v version-number
-n no dump
-d dump the clustering process
-p perturbation-magnitude
   the distance between each initial concept vector and the
   centroid will be less than this value
-N number-of-samples
-O the name of output matrix
-t scaling scheme
-K lower bound on Kappa
-Z upper bound on Kappa
-S Run SOFT moVMF (default is HARD moVMF)
-E encoding-scheme
   1: normalized term frequency (default)
   2: normalized term frequency inverse document frequency
-o objective-function
    1: nonweighted (default)
    2: weighted

File Format

The input file must be in the CCS sparse matrix format.
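For readers unfamiliar with CCS (compressed column storage), the sketch below shows how a small dense matrix maps to the three CCS arrays: column pointers, row indices, and nonzero values. It is a generic illustration of the storage scheme, not code from moVMF; the function name dense_to_ccs and the 0-based indexing are assumptions made for this example, and the exact on-disk layout the program expects is not shown here.

```python
def dense_to_ccs(matrix):
    """Convert a dense matrix (list of rows) to CCS arrays.

    Returns (col_ptr, row_idx, values) with 0-based indices.
    Hypothetical helper for illustration only.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    col_ptr = [0]   # col_ptr[j] is where column j starts in row_idx/values
    row_idx = []    # row index of each stored nonzero
    values = []     # the nonzero entries, column by column
    for j in range(n_cols):
        for i in range(n_rows):
            v = matrix[i][j]
            if v != 0:
                row_idx.append(i)
                values.append(v)
        col_ptr.append(len(values))
    return col_ptr, row_idx, values


# 4 words (rows) by 3 documents (columns)
word_doc = [
    [1, 0, 2],
    [0, 3, 0],
    [0, 0, 4],
    [5, 0, 0],
]
col_ptr, row_idx, values = dense_to_ccs(word_doc)
print(col_ptr)
print(row_idx)
print(values)
```

The nonzeros of column j are values[col_ptr[j]:col_ptr[j + 1]], so reading one document's word counts does not require scanning the rest of the matrix.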
In the input matrix, each column represents a document (or some other entity) and each row represents a word (or a feature).

The clustering is written to an output file whose format depends on whether soft or hard clustering is performed. For soft clustering, the output file has this format:
[num documents]
[p(c_1|d_1) ... p(c_k|d_1)]
[p(c_1|d_2) ... p(c_k|d_2)]
...
[p(c_1|d_n) ... p(c_k|d_n)]
That is, the clustering output file simply lists the posterior probability of each cluster given the document. With 'k' clusters, each row has 'k' values, and there are 'n' rows, one per document.
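As a rough sketch of how the soft-clustering output described above could be consumed, the Python below reads the document count and the rows of posteriors, then takes the most probable cluster for each document. It assumes the values are whitespace-separated and that the square brackets in the format description are placeholders rather than literal characters; the function names and the sample text are made up for this example.

```python
def parse_soft_clustering(text):
    """Parse soft-clustering output: first line n, then n rows of k posteriors."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_docs = int(lines[0])
    rows = [[float(x) for x in ln.split()] for ln in lines[1:1 + n_docs]]
    if len(rows) != n_docs:
        raise ValueError("expected %d rows, found %d" % (n_docs, len(rows)))
    return rows


def hard_labels(posteriors):
    """Index of the largest posterior in each row (0-based cluster id)."""
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]


# Made-up sample: 3 documents, 3 clusters
sample = "3\n0.7 0.2 0.1\n0.1 0.1 0.8\n0.3 0.4 0.3\n"
posteriors = parse_soft_clustering(sample)
labels = hard_labels(posteriors)
print(labels)
```

To use it on a real run, read the output file's contents into a string and pass that string to parse_soft_clustering.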

Sample commandline

Suppose we have a dataset called 'data'. The following files should then be in the current directory:
data_dim, data_txx_nz, data_col_ccs, data_row_ccs

moVMF -c 3 -S -K 1 -Z 100 -t txx -O clusters data

This command takes the dataset 'data' and produces 3 soft clusters. -K gives a lower bound on kappa; always supply it, because without this option the program will not produce correct results. -Z gives an upper bound on the estimate of the kappa parameter. The clustering is written to a file named 'clusters_txx_doctoclus.3'
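To give a feel for why bounds on kappa matter, the sketch below uses a commonly cited closed-form approximation for the von Mises-Fisher concentration, kappa ~ (r*d - r^3) / (1 - r^2), where r is the length of a cluster's mean vector and d is the dimension, and then clamps the result to [K, Z]. Whether moVMF computes and clamps kappa in exactly this way is an assumption; the function name and arguments are hypothetical.

```python
def estimate_kappa(r_bar, dim, kappa_min, kappa_max):
    """Approximate vMF concentration, clamped to [kappa_min, kappa_max].

    r_bar: mean resultant length of the cluster, 0 <= r_bar < 1
    dim:   dimensionality of the unit vectors
    Illustrative only; not taken from the moVMF source.
    """
    if not 0.0 <= r_bar < 1.0:
        raise ValueError("r_bar must lie in [0, 1)")
    kappa = (r_bar * dim - r_bar ** 3) / (1.0 - r_bar ** 2)
    return min(max(kappa, kappa_min), kappa_max)


# Bounds as in the sample command: -K 1 -Z 100
moderate = estimate_kappa(0.5, 3, 1.0, 100.0)      # inside the bounds
tight = estimate_kappa(0.9999, 1000, 1.0, 100.0)   # hits the upper bound
diffuse = estimate_kappa(0.0001, 3, 1.0, 100.0)    # hits the lower bound
print(moderate, tight, diffuse)
```

As r approaches 1 the unclamped estimate grows without limit, and as r approaches 0 it shrinks toward 0, which is where an upper and a lower bound respectively take effect.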
Suvrit Sra
Last modified: Mon Oct 31 11:22:47 CST 2005