# Problem Description

1. For the discrete-discrete case, use the chi-square test.
2. For the continuous-discrete case, use ANOVA (analysis of variance).

# Chi-Square Test

## One-Dimensional Case

(Note: the expected frequencies are defined by you, according to the situation at hand.)

$$\frac{\text{observed} - \text{expected}}{\text{expected}}$$

gumpies: $\frac{89 - 100}{100} = -0.11$

sticklebarbs: $\frac{120 - 100}{100} = +0.20$

spotheads: $\frac{91 - 100}{100} = -0.09$

$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

gumpies: $\frac{(89 - 100)^2}{100} = 1.21$

sticklebarbs: $\frac{(120 - 100)^2}{100} = 4.0$

spotheads: $\frac{(91 - 100)^2}{100} = 0.81$

sum: $1.21 + 4.0 + 0.81 = 6.02$

$$\chi^2 = \sum \frac{(O - E) ^ 2}{E}$$
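As a quick check of the arithmetic above, here is a minimal Python sketch that evaluates the formula for the three observed counts. The function name `chi_square` is an illustrative choice and not part of the original notes.

```python
def chi_square(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [89, 120, 91]     # observed counts of the three categories
expected = [100, 100, 100]   # expected counts chosen for the example
chi2 = chi_square(observed, expected)
print(round(chi2, 2))  # 6.02
```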

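1. Randomly generate a sequence of length 300 in which each element is a, b, or c with equal probability.
2. Count the frequencies of a, b, and c in the sequence as the observed values, take a = b = c = 100 as the expected values, then compute and record the chi-square value.
3. Repeat steps 1-2 10,000 times and plot a histogram of the recorded chi-square values.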
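The simulation described above can be sketched as follows. This assumes each element of the sequence is drawn uniformly at random from a, b, c, and it prints a rough text histogram instead of drawing a plot. The function name, the seed, and the bin width of 1 are illustrative assumptions.

```python
import random
from collections import Counter


def simulate_chi_square(n_trials=10_000, seq_len=300, categories="abc", seed=0):
    """Draw random sequences and record the chi-square value of each one."""
    rng = random.Random(seed)                 # fixed seed so runs are repeatable
    expected = seq_len / len(categories)      # 100 per category for the defaults
    values = []
    for _ in range(n_trials):
        counts = Counter(rng.choices(categories, k=seq_len))
        chi2 = sum((counts[c] - expected) ** 2 / expected for c in categories)
        values.append(chi2)
    return values


values = simulate_chi_square()

# Text histogram: one '#' per 100 simulated values, bins of width 1.
bins = Counter(int(v) for v in values)
for b in sorted(bins):
    print(f"[{b}, {b + 1}): {'#' * (bins[b] // 100)}")
```

Most of the simulated values fall close to zero, with a long right tail. That shape is what makes a value such as 6.02 comparatively unusual under the equal-frequency assumption.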

## Two-Dimensional Case

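Alzheimer's onset during 5-year period, by whether estrogen was received:

| Received estrogen | Onset: no | Onset: yes | Total |
|---|---|---|---|
| yes | 147 | 9 | 156 |
| no | 810 | 158 | 968 |
| Total | 957 | 167 | 1124 |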

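(A) Alzheimer's onset during 5-year period, by (R) whether estrogen was received:

| (R) Received estrogen | (A) Onset: no | (A) Onset: yes | Total |
|---|---|---|---|
| yes | [cell a] | [cell b] | 156 |
| no | [cell c] | [cell d] | 968 |
| Total | 957 | 167 | 1124 |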

$$E_a = \frac{156}{1124} \times \frac{957}{1124} \times 1124$$

$$E_{cell} =\frac{R}{N} \times \frac{C}{N} \times N$$

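(A) Alzheimer's onset during 5-year period, by (R) whether estrogen was received, with expected frequencies:

| (R) Received estrogen | (A) Onset: no | (A) Onset: yes | Total |
|---|---|---|---|
| yes | $E_a = \frac{156 \times 957}{1124} = 132.82$ | $E_b = \frac{156 \times 167}{1124} = 23.18$ | 156 |
| no | $E_c = \frac{968 \times 957}{1124} = 824.18$ | $E_d = \frac{968 \times 167}{1124} = 143.82$ | 968 |
| Total | 957 | 167 | 1124 |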
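The expected frequencies can be reproduced with a few lines of Python. This is a minimal sketch; `expected_counts` is a hypothetical helper name, and the table is passed as a list of rows without the totals.

```python
def expected_counts(table):
    """Expected count of each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [[147, 9],      # received estrogen: onset no, onset yes
            [810, 158]]    # no estrogen:       onset no, onset yes
expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])
# [132.82, 23.18]
# [824.18, 143.82]
```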

$$\chi^2 = \sum \frac{(O - E) ^ 2}{E}$$

With Yates' continuity correction, commonly used for a 2×2 table:

$$\chi^2 = \sum \frac{(|O - E| - 0.5) ^ 2}{E}$$

$$df = (r - 1)(c - 1)$$

r = number of rows

c = number of columns
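To tie the formulas together, the sketch below computes the statistic for the estrogen table both with and without the 0.5 correction, along with the degrees of freedom. The function name `chi_square_2d` and its `yates` flag are assumptions made for illustration.

```python
def chi_square_2d(observed, yates=False):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n          # expected count
            diff = abs(o - e) - 0.5 if yates else o - e    # optional correction
            chi2 += diff ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


table = [[147, 9], [810, 158]]
chi2_plain, df = chi_square_2d(table)
chi2_yates, _ = chi_square_2d(table, yates=True)
print(round(chi2_plain, 2), round(chi2_yates, 2), df)  # 11.83 11.01 1
```

The correction always lowers the statistic slightly, so it is the more conservative of the two values.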

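# ANOVA (Analysis of Variance)

1. Randomly divide the mice into four groups: A, B, C, and D.
2. Group A is the control group and receives no drug; groups B, C, and D are injected with one, two, and three units of the drug, respectively.
3. Record the results; a lower value indicates a better effect.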

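| A | B | C | D | Total |
|---|---|---|---|---|
| 27.0 | 22.8 | 21.9 | 23.5 | |
| 26.2 | 23.1 | 23.4 | 19.6 | |
| 28.8 | 27.7 | 20.1 | 23.7 | |
| 33.5 | 27.6 | 27.8 | 20.8 | |
| 28.8 | 24.0 | 19.3 | 23.9 | |
| $M_a = 28.86$ | $M_b = 25.04$ | $M_c = 22.50$ | $M_d = 22.30$ | $M_T = 24.68$ |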

$$F = \frac{MS_{bg}}{MS_{wg}} = \frac{\text{between-group variability}}{\text{within-group variability}}$$

(1) First compute the following values:

| A | B | C | D | Total |
|---|---|---|---|---|
| $N_A = 5$ | $N_B = 5$ | $N_C = 5$ | $N_D = 5$ | $N_T = 20$ |
| $\sum X_{Ai} = 144.30$ | $\sum X_{Bi} = 125.20$ | $\sum X_{Ci} = 112.50$ | $\sum X_{Di} = 111.50$ | $\sum X_{Ti} = 493.50$ |
| $\sum X^2_{Ai} = 4196.57$ | $\sum X^2_{Bi} = 3158.50$ | $\sum X^2_{Ci} = 2576.51$ | $\sum X^2_{Di} = 2501.95$ | $\sum X^2_{Ti} = 12433.53$ |
| $SS_A = 32.07$ | $SS_B = 23.49$ | $SS_C = 45.26$ | $SS_D = 15.50$ | $SS_T = 256.42$ |

$$SS = \sum X^2_i - \frac{(\sum X_i)^2}{N}$$

(2) Compute $SS_{wg}$ and $SS_{bg}$

$$SS_{wg} = SS_A + SS_B + SS_C + SS_D$$

$$SS_{bg} = SS_T - SS_{wg}$$
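The sums of squares can be checked directly from the raw measurements. The sketch below applies the $SS$ formula to each group and to the pooled data; the names `groups` and `sum_of_squares` are illustrative.

```python
groups = {
    "A": [27.0, 26.2, 28.8, 33.5, 28.8],
    "B": [22.8, 23.1, 27.7, 27.6, 24.0],
    "C": [21.9, 23.4, 20.1, 27.8, 19.3],
    "D": [23.5, 19.6, 23.7, 20.8, 23.9],
}


def sum_of_squares(xs):
    """SS = sum(x^2) - (sum(x))^2 / N."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)


ss_groups = {name: sum_of_squares(xs) for name, xs in groups.items()}
all_values = [x for xs in groups.values() for x in xs]
ss_total = sum_of_squares(all_values)    # SS_T over all 20 values
ss_wg = sum(ss_groups.values())          # within-group sum of squares
ss_bg = ss_total - ss_wg                 # between-group sum of squares
print(round(ss_wg, 2), round(ss_bg, 2))  # 116.32 140.09
```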

(3) Compute the degrees of freedom

$$df_{bg} = k - 1 = 4 - 1 = 3$$

$$df_{wg} = (N_A - 1) + (N_B - 1) + (N_C - 1) + (N_D - 1) = 16$$

(4) Compute $MS_{bg}$ and $MS_{wg}$

$$MS_{bg} = \frac{SS_{bg}}{df_{bg}}$$

$$MS_{wg} = \frac{SS_{wg}}{df_{wg}}$$

(5) Compute $F$

$$F = \frac{MS_{bg}}{MS_{wg}}$$
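The remaining steps reduce to a few divisions. The sketch below starts from the rounded $SS$ values listed in the table of step (1) and carries them through to $F$; `anova_f` is a hypothetical helper name.

```python
def anova_f(ss_bg, ss_wg, group_sizes):
    """Return (F, df_bg, df_wg) for a one-way ANOVA."""
    df_bg = len(group_sizes) - 1                 # k - 1
    df_wg = sum(n - 1 for n in group_sizes)      # sum of (N_g - 1)
    ms_bg = ss_bg / df_bg
    ms_wg = ss_wg / df_wg
    return ms_bg / ms_wg, df_bg, df_wg


# SS values as listed (rounded to two decimals) in the table of step (1).
ss_per_group = [32.07, 23.49, 45.26, 15.50]
ss_total = 256.42
ss_wg = sum(ss_per_group)
ss_bg = ss_total - ss_wg

f_value, df_bg, df_wg = anova_f(ss_bg, ss_wg, [5, 5, 5, 5])
print(round(f_value, 2), df_bg, df_wg)  # 6.42 3 16
```

The resulting $F$ is then compared against the F distribution with $(df_{bg}, df_{wg})$ degrees of freedom to judge significance.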

# Decision Tree

## Split the data

array([ 0.64356436,  0.64356436,  0.64391855,  0.64391855,  0.64407713,
0.64407713,  0.64407713,  0.64407713,  0.64407713,  0.64407713])

## difficulties_cat

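| Feature | Imp |
|---|---|
| scr_ave | 0.553782 |
| medu_newcat | 0.217218 |
| age_m | 0.085820 |
| income_newcat | 0.048912 |
| edu_ave | 0.029740 |
| female | 0.026161 |
| onlychild | 0.021336 |
| mediacontact | 0.017030 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |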

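| Feature | Imp |
|---|---|
| scr_ave | 0.525587 |
| medu_newcat | 0.213067 |
| age_m | 0.084708 |
| scr_of_edu | 0.080575 |
| mediacontact | 0.029232 |
| female | 0.025661 |
| onlychild | 0.020929 |
| income_newcat | 0.020240 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |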

## emo_cat

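| Feature | Imp |
|---|---|
| scr_ave | 0.521694 |
| medu_newcat | 0.159798 |
| income_newcat | 0.147978 |
| edu_ave | 0.083233 |
| age_m | 0.040750 |
| female | 0.023781 |
| mediacontact | 0.019127 |
| mediacoview | 0.003639 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |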

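| Feature | Imp |
|---|---|
| scr_ave | 0.520711 |
| income_newcat | 0.155111 |
| medu_newcat | 0.146762 |
| scr_of_edu | 0.115745 |
| age_m | 0.023767 |
| female | 0.021924 |
| mediacontact | 0.011410 |
| mediacoview | 0.004571 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |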

## con_cat

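| Feature | Imp |
|---|---|
| scr_ave | 0.562325 |
| medu_newcat | 0.098819 |
| mediacontact | 0.095089 |
| edu_ave | 0.077572 |
| female | 0.064213 |
| income_newcat | 0.056334 |
| age_m | 0.023696 |
| onlychild | 0.013513 |
| mediacoview | 0.008440 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |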

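| Feature | Imp |
|---|---|
| scr_ave | 0.511282 |
| scr_of_edu | 0.110155 |
| medu_newcat | 0.097473 |
| mediacontact | 0.093793 |
| female | 0.082843 |
| income_newcat | 0.055566 |
| age_m | 0.035552 |
| onlychild | 0.013335 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |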

## hyp_cat

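| Feature | Imp |
|---|---|
| medu_newcat | 0.343448 |
| scr_ave | 0.265652 |
| onlychild | 0.165487 |
| edu_ave | 0.076194 |
| income_newcat | 0.054466 |
| age_m | 0.046774 |
| mediacontact | 0.020992 |
| mediacoview | 0.015532 |
| female | 0.011454 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |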

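| Feature | Imp |
|---|---|
| medu_newcat | 0.335192 |
| scr_ave | 0.257218 |
| onlychild | 0.161509 |
| scr_of_edu | 0.105471 |
| age_m | 0.065618 |
| income_newcat | 0.033129 |
| mediacontact | 0.022542 |
| mediacoview | 0.012901 |
| female | 0.006421 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |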

## pee_cat

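| Feature | Imp |
|---|---|
| scr_ave | 0.441246 |
| income_newcat | 0.210646 |
| female | 0.152568 |
| age_m | 0.151888 |
| medu_newcat | 0.033530 |
| mediacoview | 0.009625 |
| edu_ave | 0.000498 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacontact | 0.000000 |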

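| Feature | Imp |
|---|---|
| scr_ave | 0.430946 |
| income_newcat | 0.200188 |
| age_m | 0.151286 |
| female | 0.139104 |
| medu_newcat | 0.033289 |
| scr_of_edu | 0.033259 |
| mediacoview | 0.011927 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacontact | 0.000000 |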

## pro_cat

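| Feature | Imp |
|---|---|
| female | 0.295389 |
| scr_ave | 0.252091 |
| mediacontact | 0.145010 |
| age_m | 0.137453 |
| income_newcat | 0.085421 |
| edu_ave | 0.062697 |
| medu_newcat | 0.021938 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |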

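| Feature | Imp |
|---|---|
| female | 0.299316 |
| scr_ave | 0.204496 |
| mediacontact | 0.131972 |
| age_m | 0.125169 |
| scr_of_edu | 0.111909 |
| income_newcat | 0.080072 |
| medu_newcat | 0.047066 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |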

## Convert continuous variables to categorical variables

0.63948514409153423

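| Feature | Imp |
|---|---|
| scr_h_cat | 0.528976 |
| medu_newcat | 0.221779 |
| age_m | 0.092168 |
| income_newcat | 0.057032 |
| onlychild | 0.034724 |
| mediacontact | 0.027193 |
| female | 0.021036 |
| scr_ave | 0.011653 |
| mediacoview | 0.005439 |
| divorce | 0.000000 |
| edu_ave | 0.000000 |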

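| Feature | Imp |
|---|---|
| scr_h_cat | 0.507678 |
| medu_newcat | 0.215210 |
| age_m | 0.089438 |
| scr_of_edu | 0.042832 |
| mediacontact | 0.041463 |
| income_newcat | 0.035742 |
| onlychild | 0.033695 |
| female | 0.020413 |
| scr_ave | 0.013527 |
| divorce | 0.000000 |
| mediacoview | 0.000000 |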