# Decision Trees (ID3)

## Building the Decision Tree

### Information Gain

#### Entropy

$$l(x_i) = -\log_{2}p(x_i)$$

$$H = \sum_{i=1}^{n}p(x_i)l(x_i)=-\sum_{i=1}^{n}p(x_i)\log_{2}p(x_i)$$

$$IG(S|T) = H(S) - \sum_{v \in values(T)} \frac{|S_v|}{|S|} H(S_v)$$
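As a concrete illustration, the entropy and information-gain formulas above can be sketched in Python (the helper names `entropy` and `information_gain` are my own, not from `trees.py`):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum_i p(x_i) * log2 p(x_i), computed from class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """IG(S|T) = H(S) - sum_v |S_v|/|S| * H(S_v), splitting on column attr."""
    n = len(labels)
    branches = {}
    for row, y in zip(rows, labels):
        branches.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in branches.values())
```

For a node with 9 positive and 5 negative samples, `entropy` gives roughly 0.940.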

#### Gini Impurity

$$G = \sum_{i \ne j}p(x_i)p(x_j) = \sum_{i}p(x_i)\sum_{j \ne i}p(x_j) = \sum_{i}p(x_i)(1-p(x_i)) = \sum_{i}p(x_i) - \sum_{i}(p(x_i))^2 = 1 - \sum_{i}(p(x_i))^2$$
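The final form of the derivation, $G = 1 - \sum_{i}(p(x_i))^2$, is a one-liner in code (a sketch; the name `gini` is mine):

```python
from collections import Counter

def gini(labels):
    """G = 1 - sum_i p(x_i)^2: zero for a pure node, largest for an even mix."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```

A pure node gives 0; a two-class 50/50 node gives the maximum of 0.5.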

#### Misclassification Impurity

$$M = 1 - \max_{i}(p(x_i))$$
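The same pattern works for misclassification impurity: it is the error rate of always predicting the node's majority class (a sketch; the name is mine):

```python
from collections import Counter

def misclassification(labels):
    """M = 1 - max_i p(x_i): error rate of predicting the majority class."""
    return 1 - max(Counter(labels).values()) / len(labels)
```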

### Recursively Building the Tree

The recursion terminates when either condition holds:

1. All attributes have been exhausted (the leaf is then assigned the majority class)
2. All samples in a branch belong to the same class

Demo:
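Assuming the tree is represented as nested dicts `{attr_index: {value: subtree}}` with class labels at the leaves, the recursive construction with both stopping conditions might look like this (a hedged sketch, not the actual `trees.py` implementation; all names are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    n = len(labels)
    branches = {}
    for row, y in zip(rows, labels):
        branches.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in branches.values())

def id3(rows, labels, attrs):
    """Grow a tree as nested dicts; attrs is a list of usable column indices."""
    if len(set(labels)) == 1:          # stop 2: every sample has the same class
        return labels[0]
    if not attrs:                      # stop 1: attributes exhausted -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    node = {}
    for v in {row[best] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = zip(*sub)
        node[v] = id3(list(sub_rows), list(sub_labels),
                      [a for a in attrs if a != best])
    return {best: node}
```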

# C4.5

The C4.5 algorithm is derived from ID3, with two main improvements:

1. The best splitting attribute is chosen by the information gain ratio (IGR) instead of the raw information gain
2. Continuous attributes are supported

$$IGR = \frac{IG}{IV}$$

Here $IV$ (intrinsic value, also called split information) is the entropy of the partition that the attribute induces: $IV(T) = -\sum_{v \in values(T)} \frac{|S_v|}{|S|} \log_{2} \frac{|S_v|}{|S|}$. Dividing by $IV$ penalizes attributes with many distinct values.
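A sketch of the gain-ratio computation (function names are mine; the guard handles the degenerate case where the attribute takes a single value and $IV = 0$):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def intrinsic_value(rows, attr):
    """IV: entropy of the partition sizes the attribute induces on S."""
    return entropy([row[attr] for row in rows])

def gain_ratio(rows, labels, attr):
    """IGR = IG / IV for the attribute in column attr."""
    n = len(labels)
    branches = {}
    for row, y in zip(rows, labels):
        branches.setdefault(row[attr], []).append(y)
    ig = entropy(labels) - sum(len(g) / n * entropy(g)
                               for g in branches.values())
    iv = intrinsic_value(rows, attr)
    return ig / iv if iv > 0 else 0.0
```

For a binary split of 14 samples into groups of 5 and 9, `intrinsic_value` gives about 0.940.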

As an example of handling a continuous attribute, take 14 samples of the form [value, label]:

[[85, No], [80, No], [83, Yes], [70, Yes], [68, Yes], [65, No], [64, Yes], [72, No], [69, Yes], [75, Yes], [75, Yes], [72, Yes], [81, Yes], [71, No]]

C4.5 first sorts the samples by attribute value and then evaluates a candidate threshold between each pair of adjacent distinct values:

[[64, Yes], [65, No], [68, Yes], [69, Yes], [70, Yes], [71, No], [72, No], [72, Yes], [75, Yes], [75, Yes], [80, No], [81, Yes], [83, Yes], [85, No]]

For the candidate threshold $v_4$ between 70 and 71, the first 5 samples (4 Yes, 1 No) fall on one side and the remaining 9 (5 Yes, 4 No) on the other; the weighted entropy of this split is

$$H(v_4) = H([4, 1], [5, 4]) = \frac{5}{14}H([4, 1]) + \frac{9}{14}H([5, 4])$$

$$H(v_4) = \frac{5}{14}(-\frac{4}{5} \log_{2} \frac{4}{5} - \frac{1}{5} \log_{2} \frac{1}{5}) + \frac{9}{14}(-\frac{5}{9} \log_{2} \frac{5}{9} - \frac{4}{9} \log_{2} \frac{4}{9}) \approx 0.895$$
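The threshold search can be sketched as follows, reusing the sorted samples above (identifiers are mine; `split_entropy` computes the weighted entropy that the worked example evaluates for $v_4$):

```python
from collections import Counter
from math import log2

# sorted (value, label) pairs from the example above
data = [(64, 'Yes'), (65, 'No'), (68, 'Yes'), (69, 'Yes'), (70, 'Yes'),
        (71, 'No'), (72, 'No'), (72, 'Yes'), (75, 'Yes'), (75, 'Yes'),
        (80, 'No'), (81, 'Yes'), (83, 'Yes'), (85, 'No')]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(data, t):
    """Weighted entropy of the two subsets produced by threshold t."""
    left = [y for x, y in data if x <= t]
    right = [y for x, y in data if x > t]
    n = len(data)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# candidate thresholds: midpoints between consecutive distinct values
vals = sorted({x for x, _ in data})
candidates = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
best = min(candidates, key=lambda t: split_entropy(data, t))
```

Splitting at 70.5 reproduces the 5/9 partition evaluated above, about 0.895.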

See `trees.py` for the full code.