A basic (and probably incomplete) step-by-step guide to classifying text documents with the machine learning platform Mahout
One time, after scraping a client site for extra data, I had a list of around 2000 long-tail pages that I needed to create ad groups for. Creating the keywords was easy(ish); it was an exercise in munging together seed words extracted from the page headers with a standard list of targeted keywords. Writing the ad text was much harder; the landing pages covered very different topics and I wanted to customize the advert depending on the topic. Life is too short to look through 2000 pages and assign them to categories, but it is not too short to spend hours messing around with the machine learning platform Mahout.
I used Mahout to build a Naive Bayes classifier that would tell me which category a landing page belonged to, based on the text on that page.
1. Install Mahout
Installation instructions are on the Mahout website.
2. Prepare Training Data
To train your model you need some data where you already know the classifications.
The tricky thing with Mahout is storing this stuff in a way that the program can understand. I extracted the page content to text files and then stored these files in directories named after the category:
    Category 1/
        Page 1
        . . .
    Category 2/
        Page 1
        . . .
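Building that layout by hand gets tedious, so here is a minimal sketch of how it could be scripted. It assumes a hypothetical tab-separated file (called labels.tsv here, which is my invention and not something Mahout needs) where each line holds a category name and the path to an already extracted text file. The function name is also just for illustration.

```shell
#!/usr/bin/env bash
# Sketch: arrange labelled page text into one directory per category.
# Assumes a hypothetical labels.tsv with lines of the form:
#   <category><TAB><path to text file>
set -euo pipefail

arrange_by_category() {
  local labels_file="$1" out_dir="$2"
  local category page_file
  # The "|| [ -n ... ]" part picks up a final line that has no trailing newline.
  while IFS=$'\t' read -r category page_file || [ -n "$category" ]; do
    # Skip blank lines.
    if [ -z "$category" ]; then
      continue
    fi
    mkdir -p "$out_dir/$category"
    cp "$page_file" "$out_dir/$category/"
  done < "$labels_file"
}

# Example call (both arguments are placeholders):
#   arrange_by_category labels.tsv training-data
```

The output directory is then what you would pass as the path to the training data in the next step.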
3. Training the Model
What we are trying to do here is very similar to the 20 Newsgroups example that Mahout have already prepared for us. We can piggyback on their work by running the following commands:
<PATH TO MAHOUT>/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p <PATH TO TRAINING DATA> \
  -o <PATH TO STORE OUTPUT> \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
The above command turns the data from our training set into a format that Mahout can use to train our model.
<PATH TO MAHOUT>/mahout trainclassifier \
  -i <PATH TO DATA (OUTPUT FROM ABOVE)> \
  -o <PATH TO STORE MODEL> \
  -type bayes \
  -ng 1 \
  -source hdfs
This command might take a while to run, but once it finishes you have successfully trained your model!
4. Testing the Model
Prepare test data in the same way as you prepared the training data (run it through the same PrepareTwentyNewsgroups command), but instead of running "trainclassifier" run the following:
<PATH TO MAHOUT>/mahout testclassifier \
  -m <PATH TO MODEL> \
  -d <PATH TO TEST DATA> \
  -type bayes \
  -ng 1 \
  -source hdfs \
  -method sequential
This command outputs a table telling you how accurate (and therefore how useful) your model is.
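The accuracy figure only means something if the test pages were kept out of training. One way to get a test set, sketched below, is to hold back every Nth page from each category directory before training. The function name and the default of every 5th page are my own choices for illustration, not anything Mahout prescribes.

```shell
#!/usr/bin/env bash
# Sketch: hold back every Nth page in each category as test data.
set -euo pipefail

hold_out_test_set() {
  local train_dir="$1" test_dir="$2" every="${3:-5}"
  local category_dir category page count
  for category_dir in "$train_dir"/*/; do
    [ -d "$category_dir" ] || continue
    category="$(basename "$category_dir")"
    mkdir -p "$test_dir/$category"
    count=0
    for page in "$category_dir"*; do
      [ -f "$page" ] || continue
      count=$((count + 1))
      # Move (not copy) every Nth file so no page ends up in both sets.
      if [ $((count % every)) -eq 0 ]; then
        mv "$page" "$test_dir/$category/"
      fi
    done
  done
}

# Example call (directory names are placeholders):
#   hold_out_test_set training-data test-data 5
```

Run this before step 3 so the held-back pages never reach the trainer. Picking every Nth file in name order is crude; if your file names are sorted by something meaningful you would want a random pick instead.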
5. Using the model
I am having trouble with this step. Given a text file with the contents of a page you can classify it with this command:
<PATH TO MAHOUT>/mahout org.apache.mahout.classifier.Classify \
  -m <PATH TO MODEL> \
  --classify <PATH TO TEXT FILE TO CLASSIFY> \
  --encoding UTF-8 \
  --analyzer org.apache.mahout.vectorizer.DefaultAnalyzer \
  --defaultCat unknown \
  -ng 1 \
  -type bayes \
  -source hdfs
This outputs a load of logging information along with the classification result, which is not that useful if you want the results for more than one page. I'm worried I'd have to write some Java to get this to do what I want.
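A stopgap short of writing Java might be to loop over the pages in a shell script and keep the raw output of each run. The sketch below reuses exactly the flags from the single-file command; the function name, the MAHOUT_BIN variable and the one-log-per-page layout are assumptions of mine. I have not pinned down what the result line looks like, so the script does not try to parse it.

```shell
#!/usr/bin/env bash
# Sketch: run the single-file Classify command once per page and keep the raw output.
set -euo pipefail

# MAHOUT_BIN is a hypothetical variable; by default "mahout" is expected on the PATH.
MAHOUT_BIN="${MAHOUT_BIN:-mahout}"

classify_pages() {
  local model_dir="$1" pages_dir="$2" results_dir="$3"
  local page
  mkdir -p "$results_dir"
  for page in "$pages_dir"/*; do
    [ -f "$page" ] || continue
    # Same flags as the single-file command; logging and result both land in one log per page.
    "$MAHOUT_BIN" org.apache.mahout.classifier.Classify \
      -m "$model_dir" \
      --classify "$page" \
      --encoding UTF-8 \
      --analyzer org.apache.mahout.vectorizer.DefaultAnalyzer \
      --defaultCat unknown \
      -ng 1 \
      -type bayes \
      -source hdfs \
      > "$results_dir/$(basename "$page").log" 2>&1 || echo "failed: $page" >&2
  done
}

# Example call (all three paths are placeholders):
#   classify_pages model pages-to-classify results
```

Each call starts a fresh JVM, so this will be slow for a couple of thousand pages, and you still have to dig the classification out of each log afterwards.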
It was always my plan to use this method in anger. However, when trying to outsource the generation of training data I found a freelancer who could do the whole list for $50.