
Apache Mahout

1 Development

1.1 Clustering

  1. Vectorization
  • An implementation of Vector is instantiated and filled in for each object.
  • All Vectors are written to a file in the SequenceFile format (from the Hadoop library), which the Mahout algorithms then read.

1.2 Write a reader for a Mahout output file

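import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.kmeans.Cluster;

...
    
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Open one part file written by the clustering job.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(
                "output/clusters-2/part-r-00000"), conf);

    // Reusable key and value objects; their types must match the file header.
    Text text = new Text();
    Cluster cluster = new Cluster();

    try {
        // next() fills both objects and returns false at the end of the file.
        while (reader.next(text, cluster)) {
            System.out.println(text + "   " + cluster);
        }
    } finally {
        reader.close();
    }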

In the case above, each record read is a (Text, Cluster) key-value pair. If we open the raw Hadoop output file with a text editor, the class names of the key and the value are visible at the head of the file:

SEQorg.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster...

Another example, from a file whose values are of type VectorWritable:

SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable...
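The header lines above can also be inspected programmatically, which helps when choosing the key and value types for a reader. The following is a minimal sketch in plain Java (no Hadoop dependency). It assumes the usual header layout: the three bytes "SEQ", one version byte, then the key and value class names, each preceded by its length. It also assumes each name is shorter than 128 bytes, so that the length fits in a single byte. Under that reading, the "+" and "%" in the headers above are simply the length bytes 43 and 37 of the value class names. The class name SeqHeaderPeek and the helper sampleHeader are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class SeqHeaderPeek {

    /**
     * Returns {keyClassName, valueClassName} parsed from the first bytes of a
     * SequenceFile. Assumption: each name is shorter than 128 bytes, so its
     * length prefix is a single non-negative byte.
     */
    public static String[] readClassNames(byte[] header) {
        if (header.length < 4 || header[0] != 'S' || header[1] != 'E' || header[2] != 'Q') {
            throw new IllegalArgumentException("not a SequenceFile header");
        }
        int pos = 4; // skip "SEQ" and the one-byte format version
        String[] names = new String[2];
        for (int i = 0; i < 2; i++) {
            if (pos >= header.length) {
                throw new IllegalArgumentException("truncated header");
            }
            int len = header[pos++];
            if (len < 0 || pos + len > header.length) {
                throw new IllegalArgumentException("unsupported or truncated header");
            }
            names[i] = new String(header, pos, len, StandardCharsets.UTF_8);
            pos += len;
        }
        return names;
    }

    /** Builds sample header bytes in the assumed layout, for demonstration only. */
    public static byte[] sampleHeader(String keyClass, String valueClass) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('S');
        out.write('E');
        out.write('Q');
        out.write(6); // version byte; the value is an assumption and is not checked by the parser
        for (String name : new String[] {keyClass, valueClass}) {
            byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
            out.write(bytes.length); // single-byte length, valid for names under 128 bytes
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] header = sampleHeader("org.apache.hadoop.io.Text",
                "org.apache.mahout.clustering.kmeans.Cluster");
        String[] names = readClassNames(header);
        System.out.println("key class:   " + names[0]);
        System.out.println("value class: " + names[1]);
    }
}
```

To use it on a real file, read the first few hundred bytes of a part file into a byte array and pass them to readClassNames. With Hadoop on the classpath, the same information should be available from the SequenceFile.Reader itself.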

2 Other

2.1 Install the Apache Maven plug-in for Eclipse

The Mahout project uses the Apache Maven build and release system.

There are many Maven plug-ins for Eclipse. We need to install the correct one:

m2e - Maven Integration for Eclipse http://eclipse.org/m2e/

Eclipse install URL: http://download.eclipse.org/technology/m2e/releases/

2.2 Check out the latest Mahout source code

Within Eclipse, we need to install Subclipse.

Install Subclipse for Eclipse

Eclipse install URL: http://subclipse.tigris.org/update_1.8.x

2.2.0.1 Problem: proxy server configuration

Behind a proxy server, the checkout fails with:

RA layer request failed

Configuring the proxy settings in the Eclipse preferences does not solve this problem.

We have to edit the file named “servers”, which is stored in the Subversion runtime configuration area.

On Windows: Open the Run dialog, enter %APPDATA%\Subversion and click OK, then open the servers file in a text editor.

On Linux:

    $ cd ~/.subversion/
    $ vim servers

Uncomment http-proxy-host and http-proxy-port under the [global] section.

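Make sure that the edited lines contain no spaces (a space left at the start of a line after removing the # is a common cause of the error below). That is, change: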

http-proxy-host = www-proxy.xxxx.se
http-proxy-port = 8080

to

http-proxy-host=www-proxy.xxxx.se
http-proxy-port=8080

Otherwise, Subversion gives an error message:

Malformed file
svn: C:\Users\xxxxxx\AppData\Roaming\Subversion\servers:144: Option expected

Check out the latest Mahout source code

SVN URL: http://svn.apache.org/repos/asf/mahout/trunk