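What is Mahout?
"The Apache Mahout™ project's goal is to build a scalable machine learning library."
Let me expand on that:
(1) Mahout is an open-source project under the Apache umbrella that bundles a large number of machine learning algorithms.
(2) Most of these algorithms can run on Hadoop and scale well, which makes machine learning on big data feasible.
This post looks at how to use the clustering tools in Mahout 0.9.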
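I. Data preparation
Mahout's clustering algorithms take a List of vectors as input, so every document to be clustered has to be represented as a vector.
In this post we use the classic Reuters-21578 text corpus and try to cluster its news articles.
1. Download the data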
```shell
axel -n 20 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
```
2. Unpack the data
```shell
# Create the target directory first, then unpack the archive into it.
mkdir -p ./reuters-sgm
tar -xzvf ./reuters21578.tar.gz -C ./reuters-sgm
```
After unpacking, reuters-sgm contains a number of *.sgm files, each of which holds many structured documents like this one:
```
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
F Y
f0708 reute
d f BC-STANDARD-OIL-&lt;SRD>-TO 02-26 0082</UNKNOWN>
<TEXT>
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE>    CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
    BP North America is a subsidiary of British Petroleum Co
Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil.
    The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.
 Reuter
</BODY></TEXT>
</REUTERS>
```
From here on we mainly use the text inside <TITLE> and <BODY>, that is, title plus body.
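To make "title plus body" concrete, here is a minimal Python sketch that pulls those two fields out of an SGML record like the one above. It is only an illustration built on regular expressions, not how Mahout's own extractor works; the function name `extract_title_and_body` and the shortened sample record are made up for this example.

```python
import html
import re

# A shortened, hypothetical record in the same shape as the sample above.
SAMPLE = """<REUTERS TOPICS="NO" NEWID="2">
<TEXT>
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE>
<BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture.
 Reuter
</BODY></TEXT>
</REUTERS>"""


def extract_title_and_body(sgml: str) -> list[tuple[str, str]]:
    """Return one (title, body) pair per <REUTERS> element."""
    docs = []
    for block in re.findall(r"<REUTERS.*?</REUTERS>", sgml, flags=re.S):
        title = re.search(r"<TITLE>(.*?)</TITLE>", block, flags=re.S)
        body = re.search(r"<BODY>(.*?)</BODY>", block, flags=re.S)
        docs.append((
            # html.unescape turns entities such as &lt; back into characters.
            html.unescape(title.group(1)).strip() if title else "",
            html.unescape(body.group(1)).strip() if body else "",
        ))
    return docs


docs = extract_title_and_body(SAMPLE)
print(docs[0][0])
```

Records without a title or body simply yield empty strings, so one malformed document does not abort the whole pass.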
3. Extract the documents
Mahout ships with an extractor for the Reuters corpus above, so we can use it directly:
```shell
mahout org.apache.lucene.benchmark.utils.ExtractReuters ./reuters-sgm ./reuters-out
```
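As the command shows, the extracted output lands in the ./reuters-out directory, with every document turned into its own file.
There are 21578 txt files in total, i.e. the dataset contains 21578 documents :-)
A note on naming: the file ./reuters-out/reut2-006.sgm-246.txt holds document number 246 (counting from 0) of ./reuters-sgm/reut2-006.sgm.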
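The naming rule is easy to reverse when you later need to trace a clustered document back to its source. A small Python sketch, assuming every file follows exactly the pattern described here (the helper name `parse_extracted_name` is hypothetical):

```python
import re
from pathlib import PurePosixPath


def parse_extracted_name(path: str) -> tuple[str, int]:
    """Split '<source>.sgm-<index>.txt' into (source file, 0-based index)."""
    name = PurePosixPath(path).name
    match = re.fullmatch(r"(?P<source>.+\.sgm)-(?P<index>\d+)\.txt", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return match.group("source"), int(match.group("index"))


source, index = parse_extracted_name("./reuters-out/reut2-006.sgm-246.txt")
print(source, index)  # reut2-006.sgm 246
```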
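4. Convert to SequenceFile
With a traditional text clustering algorithm, the next step would be to turn the text into a vector-space representation over words.
Not so fast, though.
Mahout runs on Hadoop, and HDFS is designed for large files. Copying the 21578 txt files above onto it as they are would be a poor fit.
Think about it: to cluster 10 million news articles, would we really copy 10 million files? The NameNode keeps the metadata of every file in memory, so that would bring it down.
This is why Mahout uses SequenceFile as its basic data exchange format.
The built-in seqdirectory command (badly named, in my opinion; directoryseq would make more sense) converts a directory of text files into a SequenceFile.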
```shell
mahout seqdirectory -i file://$(pwd)/reuters-out/ -o file://$(pwd)/reuters-seq/ -c UTF-8 -chunk 64 -xm sequential
```
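The command above hides two big pitfalls that other write-ups do not spell out:
(1) -xm sequential runs the conversion locally instead of as a MapReduce job. With MapReduce we would first have to upload all those small files to HDFS, and then what would be the point of the SequenceFile...
(2) Running in local mode does not make seqdirectory look on the local file system, however. It decides where the files live from the file system prefix of -i and -o, so by default it still looks on HDFS. The file:// prefix is therefore essential.
The other two parameters:
- -c UTF-8: the character encoding.
- -chunk 64: one chunk per 64 MB; this should equal the HDFS block size or be a multiple of it.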
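The idea behind the conversion, many small files folded into a few large containers of key/value records, can be sketched in a few lines of Python. This is emphatically not the real SequenceFile binary format; the length-prefixed layout and the function names `pack_records` and `unpack_records` are assumptions made purely for illustration.

```python
import struct


def pack_records(docs, chunk_bytes):
    """Pack (key, text) pairs into byte chunks of at most about chunk_bytes."""
    chunks, current = [], bytearray()
    for key, text in docs:
        k, v = key.encode("utf-8"), text.encode("utf-8")
        # Each record: 4-byte key length, 4-byte value length, key, value.
        record = struct.pack(">II", len(k), len(v)) + k + v
        if current and len(current) + len(record) > chunk_bytes:
            chunks.append(bytes(current))
            current = bytearray()
        current += record
    if current:
        chunks.append(bytes(current))
    return chunks


def unpack_records(chunk):
    """Yield the (key, text) pairs stored in one chunk."""
    offset = 0
    while offset < len(chunk):
        klen, vlen = struct.unpack_from(">II", chunk, offset)
        offset += 8
        key = chunk[offset:offset + klen].decode("utf-8")
        offset += klen
        value = chunk[offset:offset + vlen].decode("utf-8")
        offset += vlen
        yield key, value


# Ten tiny "files" end up in far fewer containers.
docs = [(f"/doc-{i}.txt", "some news text " * 5) for i in range(10)]
chunks = pack_records(docs, chunk_bytes=256)
restored = [pair for chunk in chunks for pair in unpack_records(chunk)]
print(len(docs), "documents ->", len(chunks), "chunks")
```

The file name becomes the key and the file content the value, which is the same key/value shape that appears later when the clustered points are dumped.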
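5. Convert to vectors
To cope with many kinds of data, clustering algorithms mostly take vectors in a vector space as input.
Since we already have the SequenceFile, everything from this step on can run on Hadoop.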
```shell
hadoop dfs -put reuters-seq /user/coder4
```
Now start the text -> vector conversion:
```shell
mahout seq2sparse -i /user/coder4/reuters-seq -o /user/coder4/reuters-sparse -ow --weight tfidf --maxDFPercent 85 --namedVector
```
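Input and output need no explanation. Mahout refers to this vector type as sparse.
The parameters are:
- -ow (or --overwrite): overwrite the output directory even if it already exists.
- --weight (or -wt) tfidf: the weighting scheme, the usual TF-IDF. The other choice is tf (recommended when running LDA).
- --maxDFPercent (or -x) 85: filters out high-frequency terms; a term whose DF exceeds 85% of the documents is no longer emitted as a feature.
- --namedVector (or -nv): the vectors carry extra information (the document name).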
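For readers who want to see what the weighting and the DF filter amount to, here is a minimal Python sketch. It uses the textbook formula tf × log(N / df); Mahout's exact TF-IDF formula may differ, and the function name `tfidf_vectors` and the toy documents are invented for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_df_percent=85):
    """Build a term dictionary and sparse {term id: weight} vectors."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Drop terms that occur in more than max_df_percent of the documents.
    kept = sorted(t for t, d in df.items() if 100.0 * d / n_docs <= max_df_percent)
    dictionary = {term: i for i, term in enumerate(kept)}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in dictionary)
        vectors.append({dictionary[t]: f * math.log(n_docs / df[t])
                        for t, f in tf.items()})
    return dictionary, vectors


docs = [
    "oil prices rise said".split(),
    "oil output falls said".split(),
    "bank rates rise said".split(),
    "wheat crop said".split(),
]
dictionary, vectors = tfidf_vectors(docs, max_df_percent=85)
print("said" in dictionary)  # False: it occurs in 100% of the documents
print(vectors[0])
```

In the toy corpus "said" plays the role of a stop word: it shows up everywhere, carries no discriminating power, and is removed by the 85% threshold.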
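Other options that may be useful:
- --analyzerName (or -a): use a different analyzer (tokenizer).
- --minDF: minimum DF threshold.
- --minSupport: minimum support threshold, default 2.
- --maxNGramSize (or -ng): the maximum n-gram size, default 1 (unigrams only). Setting it to 2 is usually enough.
- --minLLR (or -ml): the minimum Log Likelihood Ratio, default 1.0. With -ng > 1, set this to a fairly large value so that only meaningful n-grams are kept.
- --logNormalize (or -lnorm): whether to log-normalize the output vectors.
- --norm (or -n): the p-norm to normalize the output vectors with; no normalization by default.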
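Two of these options are easier to picture with a tiny example: what "n-grams up to size 2" produces, and what p-norm normalization does to a vector. The Python sketch below is only an illustration of the two ideas, with hypothetical helper names `ngrams` and `p_normalize`; it does not reproduce Mahout's LLR filtering or its exact token joining.

```python
def ngrams(tokens, max_size):
    """All n-grams of length 1..max_size, each joined with a space."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_size + 1)
            for i in range(len(tokens) - n + 1)]


def p_normalize(vector, p=2.0):
    """Scale a sparse {id: weight} vector so that its p-norm becomes 1."""
    norm = sum(abs(w) ** p for w in vector.values()) ** (1.0 / p)
    return {k: w / norm for k, w in vector.items()} if norm else dict(vector)


grams = ngrams(["standard", "oil", "co"], 2)
print(grams)  # ['standard', 'oil', 'co', 'standard oil', 'oil co']

unit = p_normalize({0: 3.0, 1: 4.0}, p=2.0)
print(unit)
```

Normalizing to unit length keeps long documents from dominating purely because they contain more words.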
Take a look at the output:
```shell
hadoop dfs -ls /user/coder4/reuters-sparse
Found 7 items
/user/coder4/reuters-sparse/df-count
/user/coder4/reuters-sparse/dictionary.file-0
/user/coder4/reuters-sparse/frequency.file-0
/user/coder4/reuters-sparse/tf-vectors
/user/coder4/reuters-sparse/tfidf-vectors
/user/coder4/reuters-sparse/tokenized-documents
/user/coder4/reuters-sparse/wordcount
```
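What each entry is for:
- dictionary.file-0: term text -> term id (int). Mapping terms to ids is common practice.
- frequency.file-0: term id -> collection frequency (cf).
- wordcount (directory): term text -> collection frequency (cf); this appears to be the count before any filtering.
- df-count (directory): term id -> document frequency (df).
- tf-vectors, tfidf-vectors (both directories): the term vectors, one row per document, in the form {term id: weight}, where the weight is tf or tfidf. They are stored as Mahout's built-in VectorWritable type, so they have to be inspected with the command "mahout vectordump -i" followed by the path.
- tokenized-documents: the documents after tokenization.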
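The difference between collection frequency (cf) and document frequency (df) is a frequent source of confusion, so here is a short Python sketch of the two counts over tokenized documents. The function name `count_cf_and_df` and the toy input are made up for this illustration.

```python
from collections import Counter


def count_cf_and_df(tokenized_docs):
    """Return (collection frequency, document frequency) per term."""
    cf, df = Counter(), Counter()
    for tokens in tokenized_docs:
        cf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts once per term
    return cf, df


docs = [["oil", "oil", "prices"], ["oil", "bank"]]
cf, df = count_cf_and_df(docs)
print(cf["oil"], df["oil"])  # 3 2
```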
II. KMeans
1. Run K-Means
```shell
mahout kmeans -i /user/coder4/reuters-sparse/tfidf-vectors -c /user/coder4/reuters-kmeans-clusters -o /user/coder4/reuters-kmeans -k 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 200 -ow --clustering
```
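The parameters are:
- -i: the input, i.e. the tfidf vectors produced above.
- -o: the result of every iteration is written here.
- -k: the number of clusters.
- -c: a slightly magical one. If -k is not set, the points in this directory are used as the initial cluster centers. Otherwise k points are picked at random as centers and written to this directory.
- -dm: the distance measure; cosine distance is recommended for text.
- -x: the maximum number of iterations.
- --clustering: after the iterations finish, assign every input point to a cluster; this is what produces clusteredPoints.
- --convergenceDelta: the convergence threshold, default 0.5, which is on the large side for cosine distance.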
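How these parameters interact is easiest to see in a toy implementation. The Python sketch below runs plain k-means (Lloyd's algorithm) with cosine distance on dense vectors; it is a single-machine illustration, not Mahout's MapReduce implementation. The initial centers are passed in explicitly to keep the run deterministic, and the names `kmeans`, `max_iterations`, and `convergence_delta` merely mirror -x and --convergenceDelta.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: cosine_distance(point, centers[i]))


def kmeans(points, centers, max_iterations=200, convergence_delta=0.5):
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: every point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: each center moves to the mean of its group.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        moved = max(cosine_distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        # Stop once no center moved farther than the threshold.
        if moved <= convergence_delta:
            break
    return centers, [nearest(p, centers) for p in points]


points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
centers, assignments = kmeans(points, centers=[points[0], points[2]],
                              convergence_delta=0.001)
print(assignments)  # [0, 0, 1, 1]
```

Cosine distance only ranges from 0 to 2 (0 to 1 for non-negative tf-idf vectors), which is why a threshold of 0.5 stops the loop very early; the example uses 0.001 instead.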
Output 1, the randomly chosen initial centers:
```shell
hadoop dfs -ls /user/coder4/reuters-kmeans-clusters
Found 1 items
/user/coder4/reuters-kmeans-clusters/part-randomSeed
```
Output 2, the clustering process and results:
```shell
hadoop dfs -ls /user/coder4/reuters-kmeans
Found 5 items
/user/coder4/reuters-kmeans/_policy
/user/coder4/reuters-kmeans/clusteredPoints
/user/coder4/reuters-kmeans/clusters-0
/user/coder4/reuters-kmeans/clusters-1
/user/coder4/reuters-kmeans/clusters-2-final
```
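Here, clusters-N (with the -final suffix on the last one) holds the 20 cluster centers after each iteration.
clusteredPoints stores the cluster id -> document id mapping.
2. Inspect the clusters
First, use clusterdump to look at the k (20) clusters.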
```shell
# Get to Local
hadoop dfs -get /user/coder4/reuters-kmeans/ ./
hadoop dfs -get /user/coder4/reuters-sparse/ ./
# View ..
mahout clusterdump -i /user/coder4/reuters-kmeans/clusters-2-final -d ./reuters-sparse/dictionary.file-0 -dt sequencefile -o ./reuters-kmeans-cluster-dump/ -n 20
```
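Note that clusterdump only seems to run locally, so download the data to the local machine first.
Parameters:
- -i: we only look at the clusters from the final iteration.
- -d: the term -> term id dictionary, so that the output shows the highest-weighted terms of each cluster as text rather than as term ids.
- -dt: the type of the dictionary above; ours is the SequenceFile generated by seq2sparse, hence sequencefile.
- -o: where the final output goes.
- -n: print only the 20 highest-weighted terms of each cluster.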
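What -d and -n do together can be sketched in Python: take a cluster center as a sparse {term id: weight} vector, keep the n largest weights, and translate the ids back to text with the dictionary. The function name `top_terms` is hypothetical and the weights are rounded, illustrative values, not real output.

```python
import heapq


def top_terms(centroid, id_to_term, n=20):
    """Return the n highest-weighted (term, weight) pairs of a center."""
    best = heapq.nlargest(n, centroid.items(), key=lambda item: item[1])
    return [(id_to_term[term_id], weight) for term_id, weight in best]


# Hypothetical dictionary (already inverted to id -> term) and center.
id_to_term = {0: "he", 1: "said", 2: "oil", 3: "zurich"}
centroid = {0: 3.105, 1: 2.876, 2: 1.674, 3: 0.006}

result = top_terms(centroid, id_to_term, n=3)
for term, weight in result:
    print(f"{term:<12}=> {weight}")
```

Without the dictionary you would only see term ids, which is why -d matters for a readable dump.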
Now for the dump itself:
There are 20 entries, one per cluster, each of this form:
```
VL-12722{n=1305 c=[.... zorinsky's:0.011, zurich:0.006...], r=[.... yuan:1.055, yugoslav:1.027, ...]}
    Top Terms:
        he          =>  3.105303428364896
        said        =>  2.8756448350190205
        would       =>  2.6413800148214874
        have        =>  2.1552908992401942
        government  =>  1.8426488105364687
        which       =>  1.749669294978467
        economic    =>  1.7431561736768233
        has         =>  1.7429241635333532
        prices      =>  1.7182022383386604
        oil         =>  1.673632335845538
        from        =>  1.64287882106971
        u.s         =>  1.6223870217115028
        had         =>  1.602064758607711
        more        =>  1.5874425666999086
        last        =>  1.561653600890061
        we          =>  1.5274837373316974
        been        =>  1.4653439554674872
        year        =>  1.4279387724353894
        could       =>  1.4152588548331426
        minister    =>  1.4146991936183066
```
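The leading 12722 is the cluster ID, and n=1305 is the number of documents in the cluster. c is the cluster's center vector, written as term:weight (the coordinates of the center); r is the cluster's radius vector, written as term:radius.
The Top Terms below it are the feature terms picked out for the cluster.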
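To get a feeling for n, c and r, here is a Python sketch that derives all three from the points of one cluster. It assumes the radius is the per-dimension standard deviation around the center, which is my reading of the dump; Mahout's exact computation may differ. The function name `centroid_and_radius` and the three toy points are hypothetical.

```python
import math


def centroid_and_radius(points):
    """Return (n, center vector, per-dimension radius) for dense points."""
    n = len(points)
    dims = len(points[0])
    center = [sum(p[d] for p in points) / n for d in range(dims)]
    # Assumed definition: population standard deviation in each dimension.
    radius = [math.sqrt(sum((p[d] - center[d]) ** 2 for p in points) / n)
              for d in range(dims)]
    return n, center, radius


n, c, r = centroid_and_radius([[1.0, 0.0], [3.0, 0.0], [2.0, 6.0]])
print(n, c, r)
```

A large radius entry for a term means the documents in the cluster disagree strongly about that term's weight.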
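3. Inspect the clustering result
What matters even more in a clustering result is which cluster each document ended up in.
Unfortunately, many references never explain this. As mentioned earlier, the cluster id -> document id mapping is saved under clusteredPoints, again in a Mahout built-in type. We can view it with the seqdumper command.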
```shell
mahout seqdumper -i /user/coder4/reuters-kmeans/clusteredPoints/
```
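Unlike clusterdump, this command is given no -d or -dt dictionary, so terms appear as term ids.
Without -o the output goes to the screen and looks like this: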
```
Key: 4255: Value: wt: 1.0 distance: 0.7752480913348985  vec: /reut2-000.sgm-0.txt = [14:4.670, 35:7.545, ... 11278:6.394, 11288:6.731]
```
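This output is in fact a SequenceFile, so you can also read it with a program of your own.
The Key is the cluster ID, already covered in the clusterdump section above.
The Value is the clustering result for one document: wt is the probability that the document belongs to the cluster, which is always 1.0 for kmeans; /reut2-000.sgm-0.txt identifies the document (this is where the -nv option of seq2sparse pays off); what follows are the term ids and weights of this point.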
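If all you need is "which documents are in which cluster", one low-tech route is to parse the text that seqdumper prints rather than reading the SequenceFile directly. The Python sketch below assumes each record is printed on one line in the layout shown in this post; the exact spacing of real output may vary, so the pattern is whitespace-tolerant. The function name `group_by_cluster` and the second sample line are invented for this example.

```python
import re
from collections import defaultdict

# Matches: Key: <cluster>: Value: wt: <w> distance: <d> vec: <doc> = [
LINE = re.compile(
    r"^Key:\s*(?P<cluster>\d+):\s*Value:\s*wt:\s*(?P<wt>\S+)\s+"
    r"distance:\s*(?P<dist>\S+)\s+vec:\s*(?P<doc>\S+)\s*=\s*\["
)


def group_by_cluster(lines):
    """Map cluster id -> list of (document name, distance to center)."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE.match(line)
        if match:  # silently skip lines that are not records
            clusters[int(match.group("cluster"))].append(
                (match.group("doc"), float(match.group("dist")))
            )
    return dict(clusters)


sample = [
    "Key: 4255: Value: wt: 1.0 distance: 0.7752480913348985  "
    "vec: /reut2-000.sgm-0.txt = [14:4.670, 35:7.545]",
    "some unrelated line that is not a record",
]
clusters = group_by_cluster(sample)
print(clusters)
```

Sorting each cluster's list by distance then gives the documents closest to the center first, which is a handy way to eyeball what a cluster is about.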
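III. Fuzzy-KMeans
KMeans is a simple and effective clustering method, but it has some drawbacks.
For example, a point can belong to only one cluster, which is known as hard clustering. In many cases soft clustering is the more sensible model: Harry Potter is a novel and also a film. Fuzzy-KMeans achieves soft clustering by introducing a "degree of membership".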
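The "degree of membership" can be made concrete with the standard fuzzy c-means formula, where a point's membership in cluster i is 1 / Σ_k (d_i / d_k)^(2/(m-1)) and m > 1 is the fuzziness factor. The Python sketch below implements just that formula for a single point; it is the textbook definition, not necessarily what Mahout computes internally, and the function name `memberships` is hypothetical.

```python
def memberships(distances, m=2.0):
    """Fuzzy c-means membership of one point, given its distance to each center."""
    # A point sitting exactly on a center belongs fully to that center.
    zeros = [i for i, d in enumerate(distances) if d == 0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(distances))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]


u = memberships([1.0, 3.0], m=2.0)
print(u)  # roughly [0.9, 0.1]
```

The memberships always sum to 1, and the closer center gets the larger share; as m approaches 1 the result approaches the hard assignment of plain KMeans.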