转载于:http://bit1129.iteye.com/blog/2213708
1. mahout seqdirectory
生成序列文件
- $ mahout seqdirectory
- --input (-i) input Path to job input directory(原始文本文件).
- --output (-o) output The directory pathname for output.(<Text,Text>Sequence File)
- -ow
功能: 将原始文本数据集转换为< Text, Text > SequenceFile
2. mahout seq2sparke
将序列文件转化为向量
功能: Convert and preprocesses the dataset(<Text,Text> SequenceFile) into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
即根据Sequence File转换为tfidf向量文件
说明:If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization
- mahout seq2sparse
- --output (-o) output The directory pathname for output.
- --input (-i) input Path to job input directory.
- --weight (-wt) weight The kind of weight to use. Currently TF
- or TFIDF. Default: TFIDF
- --norm (-n) norm The norm to use, expressed as either a
- float or "INF" if you want to use the
- Infinite norm. Must be greater or equal
- to 0. The default is not to normalize
- --overwrite (-ow) If set, overwrite the output directory
- --sequentialAccessVector (-seq) (Optional) Whether output vectors should
- be SequentialAccessVectors. If set true
- else false
- --namedVector (-nv) (Optional) Whether output vectors should
- be NamedVectors. If set true else false
-i Sequence File文件目录
-o 向量文件输出目录
-wt 权重类型,支持TF或者TFIDF两种选项,默认TFIDF
-n 使用的正规化,使用浮点数或者"INF"表示,
-ow 指定该参数,将覆盖已有的输出目录
-seq 指定该参数,那么输出的向量是SequentialAccessVectors
-nv 指定该参数,那么输出的向量是NamedVectors
3. mahout split
将预处理的tfidf向量集转换为training和testing向量集
功能:Split the preprocessed dataset into training and testing sets.
- $ mahout split
- -i ${WORK_DIR}/20news-vectors/tfidf-vectors
- --trainingOutput ${WORK_DIR}/20news-train-vectors
- --testOutput ${WORK_DIR}/20news-test-vectors
- --randomSelectionPct 40
- --overwrite --sequenceFiles -xm sequential
说明:如上是将向量数据集分为训练数据和检测数据,以随机40-60拆分
3. mahout trainnb
功能:训练分类器
- mahout trainnb
- --input (-i) input Path to job input directory.
- --output (-o) output The directory pathname for output.
- --alphaI (-a) alphaI Smoothing parameter. Default is 1.0
- --trainComplementary (-c) Train complementary? Default is false.
- --labelIndex (-li) labelIndex The path to store the label index in
- --overwrite (-ow) If present, overwrite the output directory
- before running job
- --help (-h) Print out help
- --tempDir tempDir Intermediate output directory
- --startPhase startPhase First phase to run
- --endPhase endPhase Last phase to run
-i 输入路径
-o 输出路径
-a
-c 补偿性训练
-li label index文件的目录
-ow 指定该参数,删除输出目录
tempDir MapReduce作业的中间结果
startPhase 运行的第一个阶段
endPhase 运行的最后一个阶段
4. mahout testnb
功能:检验Bayes分类器
- mahout testnb
- --input (-i) input Path to job input directory.
- --output (-o) output The directory pathname for output.
- --overwrite (-ow) If present, overwrite the output directory
- before running job
- --model (-m) model The path to the model built during training
- --testComplementary (-c) Test complementary? Default is false.
- --runSequential (-seq) Run sequential?
- --labelIndex (-l) labelIndex The path to the location of the label index
- --help (-h) Print out help
- --tempDir tempDir Intermediate output directory
- --startPhase startPhase First phase to run
- --endPhase endPhase Last phase to run
-i 输入路径
-o 输出路径
-ow 覆盖输出目录
-c
- 积累更多,更有具代表性的样本;
- 在文本预处理阶段选择更好的分词算法;
- 在训练分类器时,对训练参数进行调整。
本文介绍使用Apache Mahout进行文本分类的全流程,包括从原始文本到序列文件的转换、生成TF-IDF向量、划分训练与测试集、训练及测试Naive Bayes分类器。

2884

被折叠的 条评论
为什么被折叠?



