Mahout 命令

最新推荐文章于 2019-02-24 16:32:05 发布

转载最新推荐文章于 2019-02-24 16:32:05 发布 · 772 阅读

mahout 专栏收录该内容

3 篇文章

订阅专栏

本文介绍使用Apache Mahout进行文本分类的全流程，包括从原始文本到序列文件的转换、生成TF-IDF向量、划分训练与测试集、训练及测试Naive Bayes分类器。

转载于：http://bit1129.iteye.com/blog/2213708

1. mahout seqdirectory

生成序列文件

Java代码  
$ mahout seqdirectory   
    --input (-i) input               Path to job input directory(原始文本文件).  
    --output (-o) output             The directory pathname for output.（<Text,Text>Sequence File）  
    -ow  

功能：将原始文本数据集转换为< Text, Text > SequenceFile

2. mahout seq2sparke

将序列文件转化为向量

功能： Convert and preprocesses the dataset（<Text,Text> SequenceFile） into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.

即根据Sequence File转换为tfidf向量文件

说明：If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization

Java代码  
mahout seq2sparse                           
  --output (-o) output             The directory pathname for output.          
  --input (-i) input               Path to job input directory.                
  --weight (-wt) weight            The kind of weight to use. Currently TF     
                                       or TFIDF. Default: TFIDF                    
  --norm (-n) norm                 The norm to use, expressed as either a      
                                       float or "INF" if you want to use the       
                                       Infinite norm.  Must be greater or equal    
                                       to 0.  The default is not to normalize      
  --overwrite (-ow)                If set, overwrite the output directory      
  --sequentialAccessVector (-seq)  (Optional) Whether output vectors should    
                                       be SequentialAccessVectors. If set true     
                                       else false                                  
  --namedVector (-nv)              (Optional) Whether output vectors should    
                                       be NamedVectors. If set true else false  

-i Sequence File文件目录

-o 向量文件输出目录

-wt 权重类型，支持TF或者TFIDF两种选项，默认TFIDF

-n 使用的正规化，使用浮点数或者"INF"表示，

-ow 指定该参数，将覆盖已有的输出目录

-seq 指定该参数，那么输出的向量是SequentialAccessVectors

-nv 指定该参数，那么输出的向量是NamedVectors

3. mahout split

将预处理的tfidf向量集转换为training和testing向量集

功能：Split the preprocessed dataset into training and testing sets.

Java代码  
$ mahout split   
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors   
    --trainingOutput ${WORK_DIR}/20news-train-vectors   
    --testOutput ${WORK_DIR}/20news-test-vectors    
    --randomSelectionPct 40   
    --overwrite --sequenceFiles -xm sequential  

说明：如上是将向量数据集分为训练数据和检测数据，以随机40-60拆分

3. mahout trainnb

功能：训练分类器

Java代码  
mahout trainnb  
  --input (-i) input               Path to job input directory.                   
  --output (-o) output             The directory pathname for output.                      
  --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0  
  --trainComplementary (-c)        Train complementary? Default is false.                          
  --labelIndex (-li) labelIndex    The path to store the label index in           
  --overwrite (-ow)                If present, overwrite the output directory     
                                       before running job                             
  --help (-h)                      Print out help                                 
  --tempDir tempDir                Intermediate output directory                  
  --startPhase startPhase          First phase to run                             
  --endPhase endPhase              Last phase to run  

-i 输入路径

-o 输出路径

-a

-c 补偿性训练

-li label index文件的目录

-ow 指定该参数，删除输出目录

tempDir MapReduce作业的中间结果

startPhase 运行的第一个阶段

endPhase 运行的最后一个阶段

4. mahout testnb

功能：检验Bayes分类器

Java代码  
mahout testnb     
  --input (-i) input               Path to job input directory.                    
  --output (-o) output             The directory pathname for output.              
  --overwrite (-ow)                If present, overwrite the output directory      
                                       before running job  
  
  --model (-m) model               The path to the model built during training     
  --testComplementary (-c)         Test complementary? Default is false.                            
  --runSequential (-seq)           Run sequential?                                 
  --labelIndex (-l) labelIndex     The path to the location of the label index     
  --help (-h)                      Print out help                                  
  --tempDir tempDir                Intermediate output directory                   
  --startPhase startPhase          First phase to run                              
  --endPhase endPhase              Last phase to run