
Naive Bayes

Text classification with Naive Bayes

Preparing the data

Building word vectors from text

Let's write functions that convert a list of tokenized postings into word vectors:

def LoadDataSet():
    """ Load a small hand-crafted data set of tokenized postings.
    Returns:
        posting_list: the list of tokenized postings.
        class_vec: the class label for each posting.
    """
    posting_list = [['my', 'dog', 'has', 'flea',
                     'problems', 'help', 'please'],
                    ['maybe', 'not', 'take', 'him',
                     'to', 'dog', 'park', 'stupid'],
                    ['my', 'dalmation', 'is', 'so', 'cute',
                     'I', 'love', 'him'],
                    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                    ['mr', 'licks', 'ate', 'my', 'steak', 'how',
                     'to', 'stop', 'him'],
                    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    class_vec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return posting_list, class_vec


def CreateVocabList(data_set):
    """ Create a vocabulary list from the data set.
    Arguments:
        data_set: the data source.
    Returns:
        vocab_list: the vocabulary list.
    """
    vocab_set = set()
    for document in data_set:
        vocab_set = vocab_set | set(document)  # union of all words seen so far
    return list(vocab_set)


def SetOfWords2Vec(vocab_list, input_set):
    """ Convert a posting's word list into a binary (set-of-words) vector.
    Arguments:
        vocab_list: the vocabulary list.
        input_set: the posting to convert.
    Returns:
        return_vec: the resulting vector.
    """
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary" % word)
    return return_vec

Let's test these functions:

In [13]: import bayes
In [14]: list_of_posts, list_classes = bayes.LoadDataSet()
In [15]: my_vocab_list = bayes.CreateVocabList(list_of_posts)
In [16]: my_vocab_list
Out[16]:
['help', 'worthless', 'I', 'take', 'love', 'maybe', 'stupid', 'to', 'not',
 'please', 'quit', 'park', 'posting', 'dog', 'dalmation', 'steak', 'my',
 'how', 'food', 'so', 'stop', 'is', 'garbage', 'flea', 'problems', 'has',
 'buying', 'ate', 'him', 'licks', 'mr', 'cute']
In [17]: bayes.SetOfWords2Vec(my_vocab_list, list_of_posts[0])
Out[17]:
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
 1, 0, 0, 0, 0, 0, 0]
In [18]: bayes.SetOfWords2Vec(my_vocab_list, list_of_posts[3])
Out[18]:
[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
 0, 0, 0, 0, 0, 0, 0]

Everything seems to work, so we can move on to the next step.

Training

Computing probabilities from word vectors

Let's look at Bayes' theorem:

$$p(c_i | w) = \frac{p(w | c_i) p(c_i)}{p(w)}$$

Here $w$ denotes the word vector. The prior $p(c_i)$ is easy to estimate (it is just the fraction of training documents in class $c_i$). The key point is that, under the naive Bayes conditional independence assumption:

$$p(w | c_i) = p(w_0, w_1, w_2, \cdots, w_N | c_i) = p(w_0 | c_i)p(w_1 | c_i)p(w_2 | c_i) \cdots p(w_N | c_i)$$

When we need to predict the class of a new sample, we pick the class with the largest posterior; since the denominator $p(w)$ is the same for every class, it can be dropped:

$$\hat{c} = \arg\max_{c_i} \, p(c_i) \prod_{j} p(w_j | c_i)$$

Now everything is clear. Here is the pseudocode:

count the number of documents in each class
for each document:
    for each class:
        if a token appears in the document:
            increase the count for that token
            increase the total token count
for each class:
    for each token:
        divide that token's count by the total token count to get the conditional probability
return the conditional probabilities for each class

Note that $p(w_j | c_i)$ is estimated over the whole training set. The implementation:

from numpy import zeros, ones, log, array  # ones/log/array are used by the smoothing fix and the classifier below


def TrainNaiveBayes0(train_matrix, train_category):
    """ The training function.
    Arguments:
        train_matrix: the matrix of training word vectors.
        train_category: the training labels.
    Returns:
        p0_vect: the conditional probabilities p(w_j | c0).
        p1_vect: the conditional probabilities p(w_j | c1).
        p_abusive: the prior probability of class 1 (abusive).
    """
    num_train_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_abusive = sum(train_category) / num_train_docs
    p0_num = zeros(num_words)
    p1_num = zeros(num_words)
    p0_denom = 0.0
    p1_denom = 0.0
    for i in range(num_train_docs):
        if train_category[i] == 1:
            # accumulate per-word counts and the total token count for class 1
            p1_num += train_matrix[i]
            p1_denom += sum(train_matrix[i])
        else:
            # accumulate per-word counts and the total token count for class 0
            p0_num += train_matrix[i]
            p0_denom += sum(train_matrix[i])
    p1_vect = p1_num / p1_denom
    p0_vect = p0_num / p0_denom
    return p0_vect, p1_vect, p_abusive

Again, let's test it:

In [25]: train_mat = []
In [26]: for post_in_doc in list_of_posts:
...: train_mat.append(bayes.SetOfWords2Vec(my_vocab_list, post_in_doc))
...:
In [27]: p0_v, p1_v, p_ab = bayes.TrainNaiveBayes0(train_mat, list_classes)
In [28]: p_ab
Out[28]: 0.5
In [29]: p0_v
Out[29]:
array([ 0.04166667, 0. , 0.04166667, 0. , 0.04166667,
0. , 0. , 0.04166667, 0. , 0.04166667,
0. , 0. , 0. , 0.04166667, 0.04166667,
0.04166667, 0.125 , 0.04166667, 0. , 0.04166667,
0.04166667, 0.04166667, 0. , 0.04166667, 0.04166667,
0.04166667, 0. , 0.04166667, 0.08333333, 0.04166667,
0.04166667, 0.04166667])
In [30]: p1_v
Out[30]:
array([ 0. , 0.10526316, 0. , 0.05263158, 0. ,
0.05263158, 0.15789474, 0.05263158, 0.05263158, 0. ,
0.05263158, 0.05263158, 0.05263158, 0.10526316, 0. ,
0. , 0. , 0. , 0.05263158, 0. ,
0.05263158, 0. , 0.05263158, 0. , 0. ,
0. , 0.05263158, 0. , 0.05263158, 0. ,
0. , 0. ])
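
As a quick sanity check against the raw data: the word 'stupid' appears in all three abusive posts, which contain 19 word tokens in total after the set-of-words conversion, and 'my' appears in all three normal posts, which contain 24 tokens, so

$$p(\mathrm{stupid} \mid c_1) = \tfrac{3}{19} \approx 0.1579, \qquad p(\mathrm{my} \mid c_0) = \tfrac{3}{24} = 0.125,$$

which match the largest entries of p1_v and p0_v above.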

However, the code above has some flaws. First, $p(w_j | c_i)$ is estimated as 0 for any word that never appears in class $c_i$, and a single zero makes the whole product zero, so the counts need to be smoothed (why 2? one way to read the initialization is as add-one / Laplace smoothing, treating each word as a binary feature with two possible values, present or absent):

p0_num = ones(num_words)
p1_num = ones(num_words)
p0_denom = 2.0
p1_denom = 2.0
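
With this change, the estimate the modified code computes is

$$\hat{p}(w_j \mid c_i) = \frac{\operatorname{count}(w_j, c_i) + 1}{N_{c_i} + 2},$$

where $N_{c_i}$ is the total number of tokens observed in class $c_i$, so no conditional probability can ever be exactly zero.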

The other problem is numerical underflow: the product of many small probabilities can become too small to represent, so we switch to log probabilities instead:

p1_vect = log(p1_num / p1_denom)
p0_vect = log(p0_num / p0_denom)
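
To see why this matters, here is a tiny standalone sketch (not part of bayes.py): multiplying a few hundred small probabilities underflows to 0.0 in double precision, while the equivalent sum of logs is perfectly representable.

from math import log

probs = [0.05] * 300              # 300 hypothetical word probabilities
product = 1.0
for p in probs:
    product *= p
print(product)                    # 0.0 -- the product has underflowed

log_sum = sum(log(p) for p in probs)
print(log_sum)                    # about -898.7, no underflow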

Testing

Finally, let's write a classification function and a test function:

def ClassifyNaiveBayes(vec_to_classify, p0_vect, p1_vect, p_abusive):
    """ Classify a word vector.
    Arguments:
        vec_to_classify: the word vector to classify.
        p0_vect: the log conditional probabilities log p(w_j | c0).
        p1_vect: the log conditional probabilities log p(w_j | c1).
        p_abusive: the prior probability of class 1.
    Returns:
        0: the predicted class is 0.
        1: the predicted class is 1.
    """
    # log posterior up to a constant: sum of log likelihoods plus log prior
    p1 = sum(vec_to_classify * p1_vect) + log(p_abusive)
    p0 = sum(vec_to_classify * p0_vect) + log(1 - p_abusive)
    if p1 > p0:
        return 1
    else:
        return 0


def TestNaiveBayes():
    """ A test method.
    """
    list_of_posts, list_of_classes = LoadDataSet()
    my_vocab_list = CreateVocabList(list_of_posts)
    train_mat = []
    for post_in_doc in list_of_posts:
        train_mat.append(SetOfWords2Vec(my_vocab_list, post_in_doc))
    p0_v, p1_v, p_ab = TrainNaiveBayes0(
        array(train_mat), array(list_of_classes))
    test_entry = ['love', 'my', 'dalmation']
    this_doc = array(SetOfWords2Vec(my_vocab_list, test_entry))
    print(test_entry, 'classified as: ',
          ClassifyNaiveBayes(this_doc, p0_v, p1_v, p_ab))
    test_entry = ['stupid', 'garbage']
    this_doc = array(SetOfWords2Vec(my_vocab_list, test_entry))
    print(test_entry, 'classified as: ',
          ClassifyNaiveBayes(this_doc, p0_v, p1_v, p_ab))

Test

In [32]: bayes.TestNaiveBayes()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1
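
We can check the second result by hand. With the smoothed, log-transformed estimates, only the vocabulary positions set in the input vector, 'stupid' (count 3 in class 1, 0 in class 0) and 'garbage' (count 1 in class 1, 0 in class 0), contribute to the sums; the class-1 posts contain 19 tokens and the class-0 posts 24, hence denominators of $19+2$ and $24+2$:

$$p_1 = \ln\tfrac{4}{21} + \ln\tfrac{2}{21} + \ln 0.5 \approx -4.70, \qquad p_0 = \ln\tfrac{1}{26} + \ln\tfrac{1}{26} + \ln 0.5 \approx -7.21,$$

so $p_1 > p_0$ and the post is classified as abusive.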

Ok, bravo!

The bag-of-words model

So far we have only used whether or not a word appears in a document as a feature; this is known as the set-of-words model. If a word can occur more than once in a document and we want to capture that, we need the bag-of-words model (a small usage sketch follows the code below).

def BagOfWords2Vec(vocab_list, input_set):
    """ Convert a posting's word list into a count (bag-of-words) vector.
    Arguments:
        vocab_list: the vocabulary list.
        input_set: the posting to convert.
    Returns:
        return_vec: the resulting vector of word counts.
    """
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] += 1
        else:
            print("the word: %s is not in my Vocabulary" % word)
    return return_vec
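
Here is a small usage sketch of the difference, using a made-up post with a repeated word (it assumes the functions above are in scope, e.g. defined in or imported from bayes):

posts, _ = LoadDataSet()
vocab = CreateVocabList(posts)

repeated_post = ['stupid', 'stupid', 'dog']      # made-up example with a repeat
set_vec = SetOfWords2Vec(vocab, repeated_post)
bag_vec = BagOfWords2Vec(vocab, repeated_post)

print(set_vec[vocab.index('stupid')])            # 1 -- presence only
print(bag_vec[vocab.index('stupid')])            # 2 -- the actual count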

Gaussian Naive Bayes

The plain naive Bayes algorithm assumes discrete input features, so it cannot handle continuous inputs directly. The usual approach is to assume that, within each class, every continuous feature follows a normal distribution, so that $p(w_j | c_i)$ can still be evaluated:

$$p(w_j | c_i) = \frac{1}{\sqrt{2 \pi \sigma_{i,j}^2}} \exp\left( -\frac{(w_j - \mu_{i,j})^2}{2 \sigma_{i,j}^2} \right)$$

where the maximum likelihood estimates of $\mu_{i,j}$ and $\sigma_{i,j}^2$ are simply the sample mean and sample variance of feature $j$ over the training examples of class $c_i$.

Let's run a quick experiment with scikit-learn:

In [33]: from sklearn import datasets
...: iris = datasets.load_iris()
...: from sklearn.naive_bayes import GaussianNB
...: gnb = GaussianNB()
...: y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
...: print("Number of mislabeled points out of a total %d points : %d"
...: % (iris.data.shape[0],(iris.target != y_pred).sum()))
...:
Number of mislabeled points out of a total 150 points : 6
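
To make the estimation step explicit, here is a minimal from-scratch sketch of Gaussian naive Bayes on the same data. It is only an illustration under the assumptions above, not scikit-learn's implementation (which additionally applies variance smoothing), but the number of mislabeled points should come out close to the 6 reported above.

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)

# MLE estimates: class priors, per-class feature means and variances
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def predict(x):
    # log p(c) + sum_j log N(x_j; mu_cj, sigma_cj^2), maximized over classes
    log_post = np.log(priors) - 0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)
    return classes[np.argmax(log_post)]

y_pred = np.array([predict(x) for x in X])
print((y != y_pred).sum())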