
Decision tree

Decision Tree (ID3)

Building the decision tree

The first problem to solve when constructing a decision tree is: every time a branch is split, which feature should the split be based on? We need to define some criterion, evaluate every feature against that criterion, and then split on the feature with the best score.

After a split, the original data set is divided into several subsets. If all the data under a branch belong to the same class, the algorithm stops for that branch; otherwise, the splitting process is repeated.

Pseudocode (creating a branch)

createBranch()
    Check whether every item in the data set belongs to the same class:
        If so return the class label
        Else
            find the best feature to split the data set
            split the data set
            create a branch node
            for each split subset
                call createBranch recursively and add the result to the branch node
            return the branch node

The key question, then, is how to find the best feature to split on. Here we use the splitting criterion of the ID3 algorithm, i.e. splitting according to entropy.

Information gain

The core idea of splitting the data is to turn disordered data into more ordered data. The degree of order of a data set can be quantified as information, and its measure is the entropy. Clearly, the information contained in the data set changes before and after a split; this change is called the information gain.

Entropy is defined as the expected value of the information. The information itself is defined as follows, with respect to one particular class among several classes:

$$l(x_i) = -\log_{2}p(x_i)$$

where $x_i$ denotes a class and $p(x_i)$ is the probability of choosing that class among all classes.

The entropy is then defined as:

$$H = \sum_{i=1}^{n}p(x_i)l(x_i)=-\sum_{i=1}^{n}p(x_i)\log_{2}p(x_i)$$
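
As a quick sanity check, for a data set with two classes occurring with probabilities 2/5 and 3/5 (exactly the toy data set used below), the entropy is

$$H = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$$

which matches the value returned by the demo later in this post.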

The information gain is defined as:

$$IG(S|T) = H(S) - \sum_{v \in values(T)} \frac{|S_v|}{|S|} H(S_v)$$

where $S$ is the full sample set, $values(T)$ is the set of all values taken by attribute $T$, $v$ is one of the values of $T$, $S_v$ is the subset of $S$ whose attribute $T$ equals $v$, $|S_v|$ is the number of samples in $S_v$, and $|S|$ is the number of samples in $S$.
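
For the toy data set introduced below (2 'yes' and 3 'no' samples), splitting on the first feature ('no surfacing') gives

$$IG = 0.971 - \frac{3}{5}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) - \frac{2}{5}\cdot 0 \approx 0.420$$

while splitting on the second feature ('flippers') gives $0.971 - \frac{4}{5}\cdot 1 - \frac{1}{5}\cdot 0 \approx 0.171$. The first feature is therefore the better split, which is exactly the root chosen by the tree built later in this post.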

Code implementation:

from math import log


def CalcShannonEnt(data_set):
    """ Calculate the Shannon entropy.
    Arguments:
        data_set: The object dataset.
    Returns:
        shannon_ent: The Shannon entropy of the object data set.
    """
    # Initiation
    num_entries = len(data_set)
    label_counts = {}
    # Count the frequency of each class in the dataset
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts.keys():
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    # Calculate the Shannon entropy
    shannon_ent = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / num_entries
        shannon_ent -= prob * log(prob, 2)
    return shannon_ent

For testing, and for running the algorithm later, we write a very naive data generation method:

def CreateDataSet():
    """ A naive data generation method.
    Returns:
        data_set: The data set, with the class label in the last column.
        labels: The names of the features.
    """
    data_set = [[1, 1, 'yes'],
                [1, 1, 'yes'],
                [1, 0, 'no'],
                [0, 1, 'no'],
                [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return data_set, labels

Note that labels here does not hold the class labels; 'yes' and 'no' are the class labels, while labels holds the feature names.

A simple demo:

In [22]: import trees
In [23]: my_dat, labels = trees.CreateDataSet()
In [24]: my_dat
Out[24]: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
In [25]: trees.CalcShannonEnt(my_dat)
Out[25]: 0.9709505944546686

The higher the entropy, the more mixed (disordered) the class distribution of the data set is.
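
As a quick check of this, continuing the session above, turning one of the labels into a third class makes the data more mixed and the entropy rises (a sketch; the number is worked out by hand rather than copied from a session):

my_dat[0][-1] = 'maybe'         # classes are now: 1 'maybe', 1 'yes', 3 'no'
trees.CalcShannonEnt(my_dat)    # -2*(1/5)*log2(1/5) - (3/5)*log2(3/5) ~ 1.371, up from 0.971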

Another way to measure disorder is the Gini impurity.

Gini impurity

The Gini impurity is defined as follows: at a node, pick a class label at random according to the class distribution; the Gini impurity is the probability that the picked label disagrees with the item's actual label, summed over all classes. Formally:

$$G = \sum_{i \ne j}p(x_i)p(x_j) = \sum_{i}p(x_i)\sum_{j \ne i}p(x_j) = \sum_{i}p(x_i)(1-p(x_i)) = \sum_{i}p(x_i) - \sum_{i}(p(x_i))^2 = 1 - \sum_{i}(p(x_i))^2$$
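
For the toy data set (2 'yes' and 3 'no' samples) this gives

$$G = 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2 = 1 - 0.16 - 0.36 = 0.48$$

which matches the demo below.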

Code implementation:

def CalcGiniImpurity(data_set):
    """ Calculate the Gini impurity.
    Arguments:
        data_set: The object dataset.
    Returns:
        gini_impurity: The Gini impurity of the object data set.
    """
    # Initiation
    num_entries = len(data_set)
    label_counts = {}
    # Count the frequency of each class in the dataset
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts.keys():
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    # Calculate the Gini impurity
    gini_impurity = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / num_entries
        gini_impurity += pow(prob, 2)
    gini_impurity = 1 - gini_impurity
    return gini_impurity

Again a simple demo:

In [4]: import trees
In [5]: my_dat, labels = trees.CreateDataSet()
In [6]: my_dat
Out[6]: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
In [7]: trees.CalcGiniImpurity(my_dat)
Out[7]: 0.48

Finally, one more way to measure disorder: the misclassification impurity.

Misclassification impurity

It is defined as:

$$M = 1 - \max_{i}(p(x_i))$$
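
For the toy data set (2 'yes' and 3 'no' samples):

$$M = 1 - \max\left(\frac{2}{5}, \frac{3}{5}\right) = 1 - 0.6 = 0.4$$

which matches the demo below.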

Code implementation:

def CalcMisClassifyImpurity(data_set):
    """ Calculate the misclassification impurity.
    Arguments:
        data_set: The object dataset.
    Returns:
        mis_classify_impurity: The misclassification impurity of the object data set.
    """
    # Initiation
    num_entries = len(data_set)
    label_counts = {}
    # Count the frequency of each class in the dataset
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts.keys():
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    # Calculate the misclassification impurity
    mis_classify_impurity = 0.0
    # float() keeps the division exact under Python 2 as well
    max_prob = float(max(label_counts.values())) / num_entries
    mis_classify_impurity = 1 - max_prob
    return mis_classify_impurity

A simple demo:

In [25]: reload(trees)
Out[25]: <module 'trees' from 'C:\\Users\\Ewan\\Documents\\GitHub\\hexo\\public\\2017\\02\\15\\Decision-tree\\trees.py'>
In [26]: my_dat, labels = trees.CreateDataSet()
In [27]: my_dat
Out[27]: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
In [28]: trees.CalcMisClassifyImpurity(my_dat)
Out[28]: 0.4

Finally, the following figure compares the curves of these three impurity measures in the two-class case

[Ref: http://www.cse.msu.edu/~cse802/DecisionTrees.pdf]:

[Figure: impurity_compare]

Splitting the data

Based on the above, the splitting procedure is: split on every value of each feature, compute the information gain before and after the split, and finally choose the feature with the largest gain to split on. Since we use the ID3 algorithm here, the information measure is the entropy.

Code implementation:

def SplitDataSet(data_set, axis, value):
    """ Split the data set according to the given axis and corresponding value.
    Arguments:
        data_set: Object data set.
        axis: The split-feature index.
        value: The value of the split-feature.
    Returns:
        ret_data_set: The split data set.
    """
    ret_data_set = []
    for feat_vec in data_set:
        if feat_vec[axis] == value:
            reduced_feat_vec = feat_vec[:axis]
            reduced_feat_vec.extend(feat_vec[axis + 1:])
            ret_data_set.append(reduced_feat_vec)
    return ret_data_set


def ChooseBestFeatureToSplit(data_set):
    """ Choose the best feature to split.
    Arguments:
        data_set: Object data set.
    Returns:
        best_feature: The index of the feature to split.
    """
    # Initiation
    # The last column is the class label, so it is not a candidate feature
    num_features = len(data_set[0]) - 1
    base_entropy = CalcShannonEnt(data_set)
    best_info_gain = 0.0
    best_feature = -1
    for i in range(num_features):
        # Collect the i-th feature of all data
        feat_list = [example[i] for example in data_set]
        # Drop the repeated feature value(s)
        unique_vals = set(feat_list)
        new_entropy = 0.0
        # Calculate the weighted Shannon entropy of the split data set
        for value in unique_vals:
            sub_data_set = SplitDataSet(data_set, i, value)
            prob = float(len(sub_data_set)) / len(data_set)
            new_entropy += prob * CalcShannonEnt(sub_data_set)
        # base_entropy is greater than or equal to new_entropy
        info_gain = base_entropy - new_entropy
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = i
    return best_feature
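
A quick sanity check of these two functions (a sketch assuming they are collected in trees.py as in the demos; the expected results are worked out by hand rather than copied from a session):

my_dat, labels = trees.CreateDataSet()
trees.SplitDataSet(my_dat, 0, 1)        # [[1, 'yes'], [1, 'yes'], [0, 'no']]
trees.SplitDataSet(my_dat, 0, 0)        # [[1, 'no'], [1, 'no']]
trees.ChooseBestFeatureToSplit(my_dat)  # 0, i.e. 'no surfacing' (IG 0.420 vs 0.171)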

As the code above (ID3) shows, the entropy is always computed from the last column, i.e. the class label. Can this naive selection scheme achieve the best result on average? Moreover, the splitting criterion only looks at the weighted entropy of the sub-datasets after a single split, which is a greedy strategy; can such a greedy strategy reach the optimal tree?

Building the decision tree recursively

Since this is a recursive algorithm, we must define when the recursion stops:

  1. All attributes have been used up
  2. All data under a branch belong to the same class

There is one remaining problem: if all attributes have been used but some branch still contains several classes, we usually resolve it by majority vote. The implementation is:

import operator


def majority_cnt(class_list):
    """ Decide the final class.
    When the split data does not belong to a single class although
    every feature has been handled, the final class is decided by
    majority vote.
    Arguments:
        class_list: The class list of the split data set.
    Returns:
        sorted_class_count[0][0]: The majority class.
    """
    class_count = {}
    for vote in class_list:
        if vote not in class_count.keys():
            class_count[vote] = 0
        class_count[vote] += 1
    sorted_class_count = sorted(
        class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
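
For example (assuming the function above):

majority_cnt(['yes', 'no', 'no'])   # 'no'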

Now we create the tree:

def CreateTree(data_set, labels):
    """ Create the decision tree.
    Arguments:
        data_set: The object data set.
        labels: The feature labels in the data_set.
    Returns:
        my_tree: A dict that represents the decision tree.
    """
    class_list = [example[-1] for example in data_set]
    # If the classes are all the same
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]
    # If every feature has been handled
    if len(data_set[0]) == 1:
        return majority_cnt(class_list)
    # Get the best split-feature and the corresponding label
    best_feat = ChooseBestFeatureToSplit(data_set)
    best_feat_label = labels[best_feat]
    # Build a recursive dict
    my_tree = {best_feat_label: {}}
    # Prepare the labels parameter for the next step
    del(labels[best_feat])
    # Next step starts
    feat_values = [example[best_feat] for example in data_set]
    unique_vals = set(feat_values)
    for value in unique_vals:
        sub_labels = labels[:]
        # Recursive calls
        my_tree[best_feat_label][value] = CreateTree(
            SplitDataSet(data_set, best_feat, value), sub_labels)
    return my_tree

A quick test:

In [27]: reload(trees)
Out[27]: <module 'trees' from 'C:\\Users\\Ewan\\Documents\\GitHub\\hexo\\public\\2017\\02\\15\\Decision-tree\\trees.py'>
In [28]: my_dat, labels = trees.CreateDataSet()
In [29]: my_tree = trees.CreateTree(my_dat, labels)
In [30]: my_tree
Out[30]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

The decision tree has been constructed successfully (visualized below). But this is clearly not enough: we want to use the decision tree to classify new samples.

[Figure: splitting_path]

Classifying with the decision tree

A demo (the Classify implementation itself is listed in the complete trees.py at the end of this post):

In [63]: reload(trees)
Out[63]: <module 'trees' from 'C:\\Users\\Ewan\\Documents\\GitHub\\hexo\\public\\2017\\02\\15\\Decision-tree\\trees.py'>
In [64]: my_dat, labels = trees.CreateDataSet()
In [65]: labels
Out[65]: ['no surfacing', 'flippers']
In [66]: my_tree = trees.CreateTree(my_dat, labels)
In [67]: labels
Out[67]: ['no surfacing', 'flippers']
In [68]: my_tree
Out[68]: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
In [69]: trees.Classify(my_tree, labels, [1, 0])
Out[69]: 'no'
In [70]: trees.Classify(my_tree, labels, [1, 1])
Out[70]: 'yes'
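
Tracing the first call by hand: the root key is 'no surfacing', whose index in labels is 0; test_vec[0] == 1 leads into the sub-tree {'flippers': {0: 'no', 1: 'yes'}}; there test_vec[1] == 0 hits a leaf, so 'no' is returned. For [1, 1] the same path ends at the leaf 'yes'.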

C4.5

The C4.5 algorithm is derived from ID3, with two main improvements:

  1. The best split attribute is chosen according to the information gain ratio (IGR)
  2. The algorithm supports continuous attributes

The split information IV and the information gain ratio IGR are defined as follows. In this post, the split information IV of a candidate split is taken to be the weighted entropy of the subsets produced by that split (computed exactly as in the worked example further below), and the gain ratio is

$$IGR = \frac{IG}{IV}$$

So we only need to make a few changes to the ID3 code. To stay compatible with ID3, the implementation is as follows:

import numpy as np


def ChooseBestFeatureToSplit(data_set, flag='ID3'):
    """ Choose the best feature to split.
    Arguments:
        data_set: Object data set.
        flag: Decide whether to use the information gain rate or not.
    Returns:
        best_feature: The index of the feature to split.
    """
    # Initiation
    # The last column is the class label, so it is not a candidate feature
    num_features = len(data_set[0]) - 1
    base_entropy = CalcShannon(data_set)
    method = 'ID3'
    best_feature = -1
    best_info_gain = 0.0
    best_info_gain_rate = 0.0
    for i in range(num_features):
        new_entropy = 0.0
        # Collect the i-th feature of all data
        feat_list = [example[i] for example in data_set]
        # Drop the repeated feature value(s)
        unique_vals = set(feat_list)
        # A feature with many distinct values is treated as continuous
        if len(unique_vals) > 3:
            method = 'C4.5'
        if method == 'ID3':
            # Calculate the weighted Shannon entropy of the split data set
            for value in unique_vals:
                sub_data_set = SplitDataSet(data_set, i, value)
                prob = len(sub_data_set) / len(data_set)
                new_entropy += prob * CalcShannon(sub_data_set)
        else:
            data_set = np.array(data_set)
            sorted_feat = np.argsort(feat_list)
            # Try every split position of the sorted continuous feature
            for index in range(len(sorted_feat) - 1):
                pre_sorted_feat, post_sorted_feat = np.split(
                    sorted_feat, [index + 1, ])
                pre_data_set = data_set[pre_sorted_feat]
                post_data_set = data_set[post_sorted_feat]
                pre_coff = len(pre_sorted_feat) / len(sorted_feat)
                post_coff = len(post_sorted_feat) / len(sorted_feat)
                # Calculate the split info
                iv = pre_coff * CalcShannon(pre_data_set) + \
                    post_coff * CalcShannon(post_data_set)
                if iv > new_entropy:
                    new_entropy = iv
        # base_entropy is greater than or equal to new_entropy
        info_gain = base_entropy - new_entropy
        if flag == 'C4.5':
            info_gain_rate = info_gain / new_entropy
            # print('index', i, 'info_gain_rate', info_gain_rate)
            if info_gain_rate > best_info_gain_rate:
                best_info_gain_rate = info_gain_rate
                best_feature = i
        if flag == 'ID3':
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_feature = i
    return best_feature

The next problem to solve is the continuous attributes. For convenience in the experiments, we change the naive data generation method (Ref: http://blog.csdn.net/lemon_tree12138/article/details/51840361 ):

def CreateDataSet(method='ID3'):
    """ A naive data generation method.
    Arguments:
        method: The algorithm class.
    Returns:
        data_set: The data set, with the class label in the last column.
        labels: The names of the features.
    """
    # Arguments check
    if method not in ('ID3', 'C4.5'):
        raise ValueError('invalid value: %s' % method)
    if method == 'ID3':
        data_set = [[1, 1, 'yes'],
                    [1, 1, 'yes'],
                    [1, 0, 'no'],
                    [0, 1, 'no'],
                    [0, 1, 'no']]
        labels = ['no surfacing', 'flippers']
    else:
        data_set = [[85, 85, 'no'],
                    [80, 90, 'no'],
                    [83, 78, 'yes'],
                    [70, 96, 'yes'],
                    [68, 80, 'yes'],
                    [65, 70, 'no'],
                    [64, 65, 'yes'],
                    [72, 95, 'no'],
                    [69, 70, 'yes'],
                    [75, 80, 'yes'],
                    [75, 70, 'yes'],
                    [72, 90, 'yes'],
                    [81, 75, 'yes'],
                    [71, 80, 'no']]
        labels = ['temperature', 'humidity']
    return data_set, labels

Suppose we pick the temperature attribute; the extracted key data, as (temperature, outcome) pairs, is:

[[85, No], [80, No], [83, Yes], [70, Yes], [68, Yes], [65, No], [64, Yes], [72, No], [69, Yes], [75, Yes], [75, Yes], [72, Yes], [81, Yes], [71, No]]

Now we sort these pairs in ascending order of temperature; the sorted data becomes:

[[64, Yes], [65, No], [68, Yes], [69, Yes], [70, Yes], [71, No], [72, No], [72, Yes], [75, Yes], [75, Yes], [80, No], [81, Yes], [83, Yes], [85, No]]

Plotted as the following figure:

[Figure: c4.5_data_sorted]

Given this sorted list of (temperature, outcome) pairs, we compute, for each possible split position, the split information of the left and right parts separated at that position. For example, the split information at index = 4 is:

$$IV(v_4) = IV([4, 1], [5, 4]) = \frac{5}{14}IV([4, 1]) + \frac{9}{14}IV([5, 4])$$

$$IV(v_4) = \frac{5}{14}(-\frac{4}{5} \log_{2} \frac{4}{5} - \frac{1}{5} \log_{2} \frac{1}{5}) + \frac{9}{14}(-\frac{5}{9} \log_{2} \frac{5}{9} - \frac{4}{9} \log_{2} \frac{4}{9})$$
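
A quick numeric check of this value (a standalone Python 3 sketch, not part of trees.py):

from math import log

def entropy(counts):
    """Shannon entropy of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total, 2) for c in counts if c)

# Split at index = 4: left part holds 4 'Yes' and 1 'No', right part holds 5 'Yes' and 4 'No'
iv = 5 / 14 * entropy([4, 1]) + 9 / 14 * entropy([5, 4])
print(iv)  # ~0.895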

The following figure shows the split information obtained at each split position:

[Figure: c4_5_data_split]

Finally, the complete implementation (the Classify method still has to be adapted for the C4.5 trees):

trees.py

from math import log
import operator
import numpy as np


def CalcShannon(data_set):
    """ Calculate the Shannon entropy.
    Arguments:
        data_set: The object dataset.
    Returns:
        shannon_ent: The Shannon entropy of the object data set.
    """
    # Initiation
    num_entries = len(data_set)
    label_counts = {}
    # Count the frequency of each class in the dataset
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts.keys():
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    # print(label_counts)
    # Calculate the Shannon entropy
    shannon_ent = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / num_entries
        shannon_ent -= prob * log(prob, 2)
    return shannon_ent


def CalcGiniImpurity(data_set):
    """ Calculate the Gini impurity.
    Arguments:
        data_set: The object dataset.
    Returns:
        gini_impurity: The Gini impurity of the object data set.
    """
    # Initiation
    num_entries = len(data_set)
    label_counts = {}
    # Count the frequency of each class in the dataset
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts.keys():
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    # Calculate the Gini impurity
    gini_impurity = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / num_entries
        gini_impurity += pow(prob, 2)
    gini_impurity = 1 - gini_impurity
    return gini_impurity


def CalcMisClassifyImpurity(data_set):
    """ Calculate the misclassification impurity.
    Arguments:
        data_set: The object dataset.
    Returns:
        mis_classify_impurity:
            The misclassification impurity of the object data set.
    """
    # Initiation
    num_entries = len(data_set)
    label_counts = {}
    # Count the frequency of each class in the dataset
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts.keys():
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    # Calculate the misclassification impurity
    mis_classify_impurity = 0.0
    # float() keeps the division exact under Python 2 as well
    max_prob = float(max(label_counts.values())) / num_entries
    mis_classify_impurity = 1 - max_prob
    return mis_classify_impurity


def CreateDataSet(method='ID3'):
    """ A naive data generation method.
    Arguments:
        method: The algorithm class.
    Returns:
        data_set: The data set, with the class label in the last column.
        labels: The names of the features.
    """
    # Arguments check
    if method not in ('ID3', 'C4.5'):
        raise ValueError('invalid value: %s' % method)
    if method == 'ID3':
        data_set = [[1, 1, 1],
                    [1, 1, 1],
                    [1, 0, 0],
                    [0, 1, 0],
                    [0, 1, 0]]
        labels = ['no surfacing', 'flippers']
    else:
        data_set = [[1, 85, 85, 0, 0],
                    [1, 80, 90, 1, 0],
                    [2, 83, 78, 0, 1],
                    [3, 70, 96, 0, 1],
                    [3, 68, 80, 0, 1],
                    [3, 65, 70, 1, 0],
                    [2, 64, 65, 1, 1],
                    [1, 72, 95, 0, 0],
                    [1, 69, 70, 0, 1],
                    [3, 75, 80, 0, 1],
                    [1, 75, 70, 1, 1],
                    [2, 72, 90, 1, 1],
                    [2, 81, 75, 0, 1],
                    [3, 71, 80, 1, 0]]
        labels = ['outlook', 'temperature', 'humidity', 'windy']
    return data_set, labels


def SplitDataSet(data_set, axis, value):
    """ Split the data set according to the given axis and corresponding value.
    Arguments:
        data_set: Object data set.
        axis: The split-feature index.
        value: The value of the split-feature.
    Returns:
        ret_data_set: The split data set.
    """
    ret_data_set = []
    for feat_vec in data_set:
        if feat_vec[axis] == value:
            reduced_feat_vec = feat_vec[:axis]
            reduced_feat_vec.extend(feat_vec[axis + 1:])
            ret_data_set.append(reduced_feat_vec)
    return ret_data_set


def ChooseBestFeatureToSplit(data_set, flag='ID3'):
    """ Choose the best feature to split.
    Arguments:
        data_set: Object data set.
        flag: Decide whether to use the information gain rate or not.
    Returns:
        best_feature: The index of the feature to split.
    """
    # Initiation
    # The last column is the class label, so it is not a candidate feature
    num_features = len(data_set[0]) - 1
    base_entropy = CalcShannon(data_set)
    method = 'ID3'
    best_feature = -1
    best_info_gain = 0.0
    best_info_gain_rate = 0.0
    for i in range(num_features):
        new_entropy = 0.0
        # Collect the i-th feature of all data
        feat_list = [example[i] for example in data_set]
        # Drop the repeated feature value(s)
        unique_vals = set(feat_list)
        # A feature with many distinct values is treated as continuous
        if len(unique_vals) > 3:
            method = 'C4.5'
        if method == 'ID3':
            # Calculate the weighted Shannon entropy of the split data set
            for value in unique_vals:
                sub_data_set = SplitDataSet(data_set, i, value)
                prob = len(sub_data_set) / len(data_set)
                new_entropy += prob * CalcShannon(sub_data_set)
        else:
            data_set = np.array(data_set)
            sorted_feat = np.argsort(feat_list)
            # Try every split position of the sorted continuous feature
            for index in range(len(sorted_feat) - 1):
                pre_sorted_feat, post_sorted_feat = np.split(
                    sorted_feat, [index + 1, ])
                pre_data_set = data_set[pre_sorted_feat]
                post_data_set = data_set[post_sorted_feat]
                pre_coff = len(pre_sorted_feat) / len(sorted_feat)
                post_coff = len(post_sorted_feat) / len(sorted_feat)
                # Calculate the split info
                iv = pre_coff * CalcShannon(pre_data_set) + \
                    post_coff * CalcShannon(post_data_set)
                if iv > new_entropy:
                    new_entropy = iv
        # base_entropy is greater than or equal to new_entropy
        info_gain = base_entropy - new_entropy
        if flag == 'C4.5':
            info_gain_rate = info_gain / new_entropy
            # print('index', i, 'info_gain_rate', info_gain_rate)
            if info_gain_rate > best_info_gain_rate:
                best_info_gain_rate = info_gain_rate
                best_feature = i
        if flag == 'ID3':
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_feature = i
    return best_feature


def majority_cnt(class_list):
    """ Decide the final class.
    When the split data does not belong to a single class although
    every feature has been handled, the final class is decided by
    majority vote.
    Arguments:
        class_list: The class list of the split data set.
    Returns:
        sorted_class_count[0][0]: The majority class.
    """
    class_count = {}
    for vote in class_list:
        if vote not in class_count.keys():
            class_count[vote] = 0
        class_count[vote] += 1
    sorted_class_count = sorted(
        class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]


def CreateTree(data_set, feat_labels, method='ID3'):
    """ Create the decision tree.
    Arguments:
        data_set: The object data set.
        feat_labels: The feature labels in the data_set.
        method: The algorithm class.
    Returns:
        my_tree: A dict that represents the decision tree.
    """
    # Arguments check
    if method not in ('ID3', 'C4.5'):
        raise ValueError('invalid value: %s' % method)
    labels = feat_labels.copy()
    class_list = [example[-1] for example in data_set]
    # If the classes are all the same
    print('class_list', class_list)
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]
    # If every feature has been handled
    if len(data_set[0]) == 1:
        return majority_cnt(class_list)
    if method == 'ID3':
        # Get the best split-feature and the corresponding label
        best_feat = ChooseBestFeatureToSplit(data_set)
        best_feat_label = labels[best_feat]
        # print(best_feat_label)
        # Build a recursive dict
        my_tree = {best_feat_label: {}}
        # Next step starts
        feat_values = [example[best_feat] for example in data_set]
        # Prepare the labels parameter for the next step
        del(labels[best_feat])
        unique_vals = set(feat_values)
        for value in unique_vals:
            sub_labels = labels[:]
            # Recursive calls
            my_tree[best_feat_label][value] = CreateTree(
                SplitDataSet(data_set, best_feat, value), sub_labels)
        return my_tree
    else:
        flag = 'ID3'
        # Get the best split-feature and the corresponding label
        best_feat = ChooseBestFeatureToSplit(data_set, 'C4.5')
        best_feat_label = labels[best_feat]
        print(best_feat_label)
        # Build a recursive dict
        my_tree = {best_feat_label: {}}
        # Next step starts
        feat_values = [example[best_feat] for example in data_set]
        del(labels[best_feat])
        unique_vals = set(feat_values)
        # A feature with many distinct values is treated as continuous
        if len(unique_vals) > 3:
            flag = 'C4.5'
        if flag == 'ID3':
            for value in unique_vals:
                sub_labels = labels[:]
                # Recursive calls
                my_tree[best_feat_label][value] = CreateTree(
                    SplitDataSet(data_set, best_feat, value),
                    sub_labels, 'C4.5')
            return my_tree
        else:
            data_set = np.array(data_set)
            best_iv = 0.0
            best_split_value = -1
            sorted_feat = np.argsort(feat_values)
            # Try every split position and keep the best threshold
            for i in range(len(sorted_feat) - 1):
                pre_sorted_feat, post_sorted_feat = np.split(
                    sorted_feat, [i + 1, ])
                pre_data_set = data_set[pre_sorted_feat]
                post_data_set = data_set[post_sorted_feat]
                pre_coff = len(pre_sorted_feat) / len(sorted_feat)
                post_coff = len(post_sorted_feat) / len(sorted_feat)
                # Calculate the split info
                iv = pre_coff * CalcShannon(pre_data_set) + \
                    post_coff * CalcShannon(post_data_set)
                if iv > best_iv:
                    best_iv = iv
                    best_split_value = feat_values[sorted_feat[i]]
            print(best_feat, best_split_value)
            # print(best_split_value)
            left_data_set = data_set[
                data_set[:, best_feat] <= best_split_value]
            left_data_set = np.delete(left_data_set, best_feat, axis=1)
            # if len(left_data_set) == 1:
            #     return left_data_set[0][-1]
            right_data_set = data_set[
                data_set[:, best_feat] > best_split_value]
            right_data_set = np.delete(right_data_set, best_feat, axis=1)
            # if len(right_data_set) == 1:
            #     return right_data_set[0][-1]
            sub_labels = labels[:]
            my_tree[best_feat_label]['<=' + str(best_split_value)] = CreateTree(
                left_data_set.tolist(), sub_labels, 'C4.5')
            my_tree[best_feat_label]['>' + str(best_split_value)] = CreateTree(
                right_data_set.tolist(), sub_labels, 'C4.5')
            # print('continuous tree', my_tree)
            return my_tree


def Classify(input_tree, feat_labels, test_vec):
    """ Classify using the given decision tree.
    Arguments:
        input_tree: The given decision tree.
        feat_labels: The labels of the corresponding features.
        test_vec: The test data.
    Returns:
        class_label: The class label that corresponds to the test data.
    """
    # Get the start feature label to split
    first_str = list(input_tree.keys())[0]
    # Get the sub-tree that corresponds to the start feature to split
    second_dict = input_tree[first_str]
    # Get the feature index whose label is the start feature label
    feat_index = feat_labels.index(first_str)
    # Start the recursive search
    for key in second_dict.keys():
        if test_vec[feat_index] == key:
            if type(second_dict[key]).__name__ == 'dict':
                # Recursive calls
                class_label = Classify(second_dict[key], feat_labels, test_vec)
            else:
                class_label = second_dict[key]
    return class_label

A small demo:

In [108]: reload(trees)
Out[108]: <module 'trees' from 'C:\\Users\\Ewan\\Documents\\GitHub\\hexo\\public\\2017\\02\\15\\Decision-tree\\trees.py'>
In [109]: my_dat, labels = trees.CreateDataSet('C4.5')
In [110]: my_tree_c = trees.CreateTree(my_dat, labels, 'C4.5')
class_list [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
outlook
class_list [0, 0, 0, 1, 1]
humidity
1 90
class_list [0, 0, 1, 1]
temperature
0 69
class_list [1]
class_list [0, 0, 1]
windy
class_list [0]
class_list [0, 1]
class_list [0]
class_list [1, 1, 1, 1]
class_list [1, 1, 0, 1, 0]
windy
class_list [1, 1, 1]
class_list [0, 0]
In [111]: my_tree_c
Out[111]:
{'outlook': {1: {'humidity': {'<=90': {'temperature': {'<=69': 1,
'>69': {'windy': {0: 0, 1: 0}}}},
'>90': 0}},
2: 1,
3: {'windy': {0: 1, 1: 0}}}}