
Correlation Analysis

Problem Description

I recently had a small assignment on correlation analysis. The problem is described as follows:

A child's emotional symptoms, conduct symptoms, prosocial behaviour, and so on are correlated with many factors. This assignment focuses on whether the amount of time a child spends watching videos each day has an important influence on these indicators, so we need to analyse the correlation between the input features and the output labels.

The attributes are described as follows:

features

As the attribute table shows, the attributes include both categorical and numerical types, so the correlation analysis must treat the two cases differently. The strategy adopted here is:

  1. For the discrete-discrete case, use the chi-square test
  2. For the continuous-discrete case, use the one-way ANOVA test

The two test methods are introduced below.
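Both tests are available directly in `scipy.stats`; here is a minimal sketch of the strategy, using made-up toy numbers rather than the assignment's data:

```python
import numpy as np
import scipy.stats as stats

# Discrete-discrete: chi-square test on a contingency table
# (rows: categories of the feature, columns: categories of the label).
table = np.array([[30, 10],
                  [20, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Continuous-discrete: one-way ANOVA across the groups defined by the label.
group_a = [1.2, 1.4, 1.1, 1.3]
group_b = [2.1, 2.3, 2.0, 2.2]
f, p_anova = stats.f_oneway(group_a, group_b)

print(chi2, p_chi2)   # small p: the two variables look dependent
print(f, p_anova)     # small p: the group means differ
```

The same two calls do the real work in the code later on: `chi2_contingency` in `CalChi2` and `f_oneway` in `ANOVATest`.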

Chi-Square Test

One-Dimensional Case

Suppose a river contains three species of fish: gumpies, sticklebarbs, and spotheads. If the river's ecosystem has not been disturbed, the three species occur in equal numbers (the third row of the table below). Now 300 fish are sampled from the river, and the observed counts are shown in the second row:

|  | gumpies | sticklebarbs | spotheads | Total |
| --- | --- | --- | --- | --- |
| Observed frequency | 89 | 120 | 91 | 300 |
| Expected frequency | 100 | 100 | 100 | 300 |

The question we need to answer is: has the river's ecosystem been disturbed? We take "the ecosystem is normal" as the null hypothesis.

(Note that the expected frequencies are defined according to the actual situation at hand.)

A natural idea is to construct a measure of how far the observations deviate from the null hypothesis, for example:

$$\frac{\text{observed} - \text{expected}}{\text{expected}}$$

Substituting the actual data gives the following results:

gumpies: $\frac{89 - 100}{100} = -0.11$

sticklebarbs: $\frac{120 - 100}{100} = +0.20$

spotheads: $\frac{91 - 100}{100} = -0.09$

This looks reasonable, but it can only measure the deviation of a single category, not the overall deviation, because the three values sum to zero. A simple modification of the measure fixes this:

$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

Substituting the data again:

gumpies: $\frac{(89 - 100)^2}{100} = 1.21$

sticklebarbs: $\frac{(120 - 100)^2}{100} = 4.0$

spotheads: $\frac{(91 - 100)^2}{100} = 0.81$

sum: $1.21 + 4.0 + 0.81 = 6.02$

This solves the problem, and it is exactly the approach the chi-square test takes. The sum is called the chi-square statistic, written $\chi^2$. To state it more formally, write the observed value as $O$ and the expected value as $E$; then:

$$\chi^2 = \sum \frac{(O - E) ^ 2}{E}$$
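Plugging the observed and expected fish counts into this formula reproduces the 6.02 computed above; `scipy.stats.chisquare` (a sketch, assuming SciPy is available) returns the same statistic together with its p-value:

```python
import scipy.stats as stats

observed = [89, 120, 91]    # gumpies, sticklebarbs, spotheads
expected = [100, 100, 100]

# chi^2 = sum over categories of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # 6.02

chi2_scipy, p = stats.chisquare(observed, expected)
print(chi2_scipy, p)  # the same 6.02, with p roughly 0.049
```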

The next question is: we have obtained the chi-square value, but is it good or bad? Does it support the null hypothesis, or argue for rejecting it?

Imagine the following scenario. Assume the fish in the river follow the null distribution, i.e. the three species are equally likely. Treat each batch of 300 draws as one experiment; after each experiment, record the frequency of each species among the 300 fish and compute the chi-square value. After many such experiments, we can draw a histogram showing the distribution of the chi-square values. Now take the chi-square value computed from the actual observations (the second row of the table above), 6.02, and look at what proportion of the simulated chi-square values are greater than or equal to it. If the proportion is large, the observed value could easily arise from a distribution close to the null hypothesis, which supports the null; conversely, if the proportion is very small, then under the null hypothesis the observed chi-square value would be nearly impossible, which indicates that the null hypothesis is wrong and we may reject it.

This is, in fact, the basic idea behind statistical testing. One problem remains: we cannot actually repeat the sampling (catching fish from the river), so we usually use computer simulation instead. The concrete steps are:

  1. Generate a sequence of a (for gumpies), b (for sticklebarbs), and c (for spotheads) with equal probability
  2. Count the frequencies of a, b, and c in a sequence of length 300 as the observed values, take a = b = c = 100 as the expected values, and compute and record the chi-square value
  3. Repeat steps 1-2 10000 times and draw the histogram below:

chi-square-distribution

As the histogram shows, only about 5% of the simulated values are greater than 6.02, so we can reject the null hypothesis at the 95% confidence level.
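The simulation described in steps 1-3 can be sketched as follows (a minimal version using NumPy; the seed and trial count are arbitrary, and the tail proportion comes out near 5%, matching the histogram):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_fish, k = 10000, 300, 3
expected = n_fish / k  # 100 of each species under the null hypothesis

chi2_values = np.empty(n_trials)
for i in range(n_trials):
    # Step 1: draw 300 fish, each of the 3 species equally likely.
    sample = rng.integers(0, k, size=n_fish)
    # Step 2: observed frequencies vs. the expected 100/100/100.
    counts = np.bincount(sample, minlength=k)
    chi2_values[i] = np.sum((counts - expected) ** 2 / expected)

# Step 3: the fraction of simulated values at least as large as 6.02.
p_sim = np.mean(chi2_values >= 6.02)
print(p_sim)  # roughly 0.05
```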

Two-Dimensional Case

In the two-dimensional case, the chi-square test is also called the chi-square test of association: it tests the degree of association (or independence) between two variables. Consider the following data table $O$:

|  | Alzheimer's onset (5-year period): no | Alzheimer's onset (5-year period): yes | Total |
| --- | --- | --- | --- |
| received estrogen: yes | 147 | 9 | 156 |
| received estrogen: no | 810 | 158 | 968 |
| Total | 957 | 167 | 1124 |

These are the observed values; to compute the chi-square statistic, we clearly need the expected values as well.

For convenience, rewrite the table in the following form:

|  | ($A$) onset: no | ($A$) onset: yes | Total |
| --- | --- | --- | --- |
| ($R$) received estrogen: yes | [cell a] | [cell b] | 156 |
| ($R$) received estrogen: no | [cell c] | [cell d] | 968 |
| Total | 957 | 167 | 1124 |

Take cell a as an example:

$$E_a = \frac{156}{1124} \times \frac{957}{1124} \times 1124$$

How should this formula be read? If the two variables are independent, then the events "$A$ takes the value $no$" and "$R$ takes the value $yes$" are independent, so the probability of landing in $[cell_a]$ is their product, i.e. the product of the first two factors on the right-hand side. The expected count is then that probability multiplied by the total number of observations.

The explanation above is informal, so here is a more formal statement. To compute the expected value of a $cell$, let $R$ be the marginal total of the row containing the $cell$, let $C$ be the marginal total of the column containing the $cell$, and let $N$ be the total number of observations. Then:

$$E_{cell} =\frac{R}{N} \times \frac{C}{N} \times N$$
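As a tiny sanity check of this formula (`expected_count` is just an illustrative helper name, not part of the assignment code):

```python
def expected_count(R, C, N):
    # (R / N) * (C / N) is the cell probability under independence;
    # multiplying by N turns it into an expected count, i.e. R * C / N.
    return (R / N) * (C / N) * N

# Cell a of the table above: row total 156, column total 957, N = 1124.
print(round(expected_count(156, 957, 1124), 2))  # 132.82
```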

This yields the following table of expected values $E$:

|  | ($A$) onset: no | ($A$) onset: yes | Total |
| --- | --- | --- | --- |
| ($R$) received estrogen: yes | $E_a = \frac{156 \times 957}{1124} = 132.82$ | $E_b = \frac{156 \times 167}{1124} = 23.18$ | 156 |
| ($R$) received estrogen: no | $E_c = \frac{968 \times 957}{1124} = 824.18$ | $E_d = \frac{968 \times 167}{1124} = 143.82$ | 968 |
| Total | 957 | 167 | 1124 |

Now we can apply the formula:

$$\chi^2 = \sum \frac{(O - E) ^ 2}{E}$$

As a special case, when the numbers of rows and columns are both 2, the formula is modified as follows (the Yates continuity correction):

$$\chi^2 = \sum \frac{(|O - E| - 0.5) ^ 2}{E}$$

The resulting chi-square value is $11.01$.

Finally, there is the matter of degrees of freedom; since it is simple, we only state the conclusion:

$$df = (r - 1)(c - 1)$$

where $r$ is the number of rows and $c$ is the number of columns.
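`scipy.stats.chi2_contingency` carries out exactly this computation, applying the continuity-corrected formula for 2x2 tables by default and reporting the degrees of freedom; a sketch against the Alzheimer's table:

```python
import numpy as np
import scipy.stats as stats

# Rows: received estrogen yes/no; columns: onset no/yes.
observed = np.array([[147, 9],
                     [810, 158]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(round(chi2, 2))  # 11.01, matching the hand computation
print(dof)             # (2 - 1) * (2 - 1) = 1
```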

ANOVA Test

For ease of comparison, we stay with the Alzheimer's example above. Research suggests that patients often go through periods of highly unstable mood after the disease sets in, because experiences of fear or anxiety from earlier in life persist in memory and trigger the unstable moods after onset.

Now suppose a research team has invented a drug that can relieve this mood problem and has tested it on mice. The experimental design is:

  1. Randomly divide the mice into four groups: A, B, C, and D
  2. Group A is the control group and receives no drug; groups B, C, and D are injected with one, two, and three units of the drug, respectively
  3. Record the results; lower values indicate a better effect

The experimental results are as follows:

| A | B | C | D | Total |
| --- | --- | --- | --- | --- |
| 27.0 | 22.8 | 21.9 | 23.5 |  |
| 26.2 | 23.1 | 23.4 | 19.6 |  |
| 28.8 | 27.7 | 20.1 | 23.7 |  |
| 33.5 | 27.6 | 27.8 | 20.8 |  |
| 28.8 | 24.0 | 19.3 | 23.9 |  |
| $M_a = 28.86$ | $M_b = 25.04$ | $M_c = 22.50$ | $M_d = 22.30$ | $M_T = 24.68$ |

We now run a correlation analysis to judge whether the drug helps relieve the symptoms.

Like the chi-square test, the ANOVA test ultimately produces a single statistic, denoted $F$ and defined as:

$$F = \frac{MS_{bg}}{MS_{wg}} = \frac{\text{between-group variability}}{\text{within-group variability}}$$

The concrete computation steps are as follows (the derivation is omitted):

(1) First compute the following values:

| A | B | C | D | Total |
| --- | --- | --- | --- | --- |
| $N_A = 5$ | $N_B = 5$ | $N_C = 5$ | $N_D = 5$ | $N_T = 20$ |
| $\sum X_{Ai} = 144.30$ | $\sum X_{Bi} = 125.20$ | $\sum X_{Ci} = 112.50$ | $\sum X_{Di} = 111.50$ | $\sum X_{Ti} = 493.50$ |
| $\sum X^2_{Ai} = 4196.57$ | $\sum X^2_{Bi} = 3158.50$ | $\sum X^2_{Ci} = 2576.51$ | $\sum X^2_{Di} = 2501.95$ | $\sum X^2_{Ti} = 12433.53$ |
| $SS_A = 32.07$ | $SS_B = 23.49$ | $SS_C = 45.26$ | $SS_D = 15.50$ | $SS_T = 256.42$ |

where:

$$SS = \sum X^2_i - \frac{(\sum X_i)^2}{N}$$

(2) Compute $SS_{wg}$ and $SS_{bg}$:

$$SS_{wg} = SS_A + SS_B + SS_C + SS_D$$

$$SS_{bg} = SS_T - SS_{wg}$$

(3) Compute the associated degrees of freedom:

$$df_{bg} = k - 1 = 4 - 1 = 3$$

$$df_{wg} = (N_A - 1) + (N_B - 1) + (N_C - 1) + (N_D - 1)$$

(4) Compute $MS_{bg}$ and $MS_{wg}$:

$$MS_{bg} = \frac{SS_{bg}}{df_{bg}}$$

$$MS_{wg} = \frac{SS_{wg}}{df_{wg}}$$

(5) Compute $F$:

The final result is $F = 6.42$ ($df = 3, 16$).
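Steps (1)-(5) can be sketched end to end in a few lines and cross-checked against `scipy.stats.f_oneway`, which reports both $F$ and its p-value:

```python
import scipy.stats as stats

A = [27.0, 26.2, 28.8, 33.5, 28.8]
B = [22.8, 23.1, 27.7, 27.6, 24.0]
C = [21.9, 23.4, 20.1, 27.8, 19.3]
D = [23.5, 19.6, 23.7, 20.8, 23.9]

def ss(xs):
    # SS = sum(x^2) - (sum(x))^2 / N
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)

ss_wg = ss(A) + ss(B) + ss(C) + ss(D)   # within groups
ss_bg = ss(A + B + C + D) - ss_wg       # between groups = total - within
df_bg, df_wg = 4 - 1, 4 * (5 - 1)
f_manual = (ss_bg / df_bg) / (ss_wg / df_wg)
print(round(f_manual, 2))  # 6.42

f, p = stats.f_oneway(A, B, C, D)
print(round(f, 2), p)  # the same F, together with its p-value
```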

f_distribution

Code Implementation

import os

import numpy as np
import scipy.stats as stats
import xlrd


def GetChildDetailLab(combine=True):
    """ Get the detail labels of each child record.
    Returns:
        child_detail_lab: the detail labels.
    """
    CHILDS_FEATURE_LAB = ['tid', 'age_m', 'female', 'onlychild', 'divorce',
                          'medu_newcat', 'income_newcat', 'scr_ave', 'edu_ave',
                          'edu_of_scr', 'scr_h_cat', 'mediacoview',
                          'mediacontact']
    CHILDS_CAT_LAB = ['emo_cat', 'con_cat', 'hyp_cat', 'pee_cat',
                      'difficulties_cat', 'pro_cat']
    if combine:
        return CHILDS_FEATURE_LAB + CHILDS_CAT_LAB
    else:
        return CHILDS_FEATURE_LAB, CHILDS_CAT_LAB


def ReadChildInfoFromExcel(
        file_name='屏幕暴露与SDQ.xlsx', sheet_name='data'):
    """ Read the screen-exposure vs. SDQ information of each child
    from the Excel file.
    Arguments:
        file_name: the name of the Excel file.
        sheet_name: the name of the sheet within the Excel file.
    Returns:
        child_scr_exp_sdq: an array that contains the detail
            information of each child.
        labs: the label corresponding to each column of the data.
    """
    CHILDS_FILE_NAME = 'child_scr_exp_sdq.npy'
    CHILDS_DETAIL_LAB = GetChildDetailLab()
    NOT_INT_LAB_INDICES = [CHILDS_DETAIL_LAB.index('age_m'),
                           CHILDS_DETAIL_LAB.index('scr_ave'),
                           CHILDS_DETAIL_LAB.index('edu_ave'),
                           CHILDS_DETAIL_LAB.index('edu_of_scr')]
    child_scr_exp_sdq = []
    if os.path.isfile(CHILDS_FILE_NAME):
        # Use the cached numpy file if it already exists.
        with open(CHILDS_FILE_NAME, 'rb') as f:
            child_scr_exp_sdq = np.load(f)
    else:
        workBook = xlrd.open_workbook(file_name)
        bookSheet = workBook.sheet_by_name(sheet_name)
        # Read from the second row because the first row holds the labels.
        for row in range(1, bookSheet.nrows):
            child_row = []
            for col in range(bookSheet.ncols):
                cel = bookSheet.cell(row, col)
                val = str(cel.value)
                # Mark missing cells with -1.
                if val == '':
                    val = '-1.0'
                try:
                    if col in NOT_INT_LAB_INDICES:
                        val = float(val)
                    else:
                        # In Excel, if cel.value is 1, then str(cel.value)
                        # is '1.0', so strip the fractional part first.
                        val = int(val.split('.')[0])
                except ValueError as e:
                    print(e)
                    val = -1
                child_row.append(val)
            child_scr_exp_sdq.append(child_row)
        child_scr_exp_sdq = np.array(child_scr_exp_sdq)
        with open(CHILDS_FILE_NAME, 'wb') as f:
            np.save(f, child_scr_exp_sdq)
    return child_scr_exp_sdq, CHILDS_DETAIL_LAB


def SplitDataSet(data_set, feature, cat):
    """ Split the data set into two columns: feature and cat.
    Arguments:
        data_set: the source data set.
        feature: the input feature.
        cat: the corresponding category.
    Returns:
        splited_data_set: self-explanatory.
    """
    CHILDS_DETAIL_LAB = GetChildDetailLab()
    feature_index = CHILDS_DETAIL_LAB.index(feature)
    cat_index = CHILDS_DETAIL_LAB.index(cat)
    return data_set[:, (feature_index, cat_index)]


def CalChi2(data_set):
    """ Calculate the chi-square value and p-value for the data set.
    Arguments:
        data_set: the object data set.
    Returns:
        chi2: the chi-square value.
        p: the p-value.
    """
    rows_number = len(set(data_set[:, -1]))
    columns_number = len(set(data_set[:, 0]))
    counts = np.zeros((rows_number, columns_number))
    for row in data_set:
        # Skip the records that contain missing values (-1).
        if row[-1] != -1 and row[0] != -1:
            try:
                counts[int(row[-1])][int(row[0])] += 1
            except IndexError:
                pass
    # Drop the rows whose items are all 0.
    del_row_index = [index for index, count in enumerate(counts)
                     if not count.any()]
    counts = np.delete(counts, tuple(del_row_index), axis=0)
    # Drop the columns whose items are all 0.
    del_col_index = [index for index, count in enumerate(counts.T)
                     if not count.any()]
    counts = np.delete(counts, tuple(del_col_index), axis=1)
    # Calculate the chi-square value and the corresponding p-value.
    chi2, p, dof, expected = stats.chi2_contingency(counts)
    return chi2, p


def ANOVATest(data_set):
    """ Run a one-way ANOVA test on the data set.
    Arguments:
        data_set: the object data set.
    Returns:
        f: the computed F-value of the test.
        p: the associated p-value from the F-distribution.
    """
    # Collect the samples of the three categories.
    normal = []
    critical = []
    abnormal = []
    for data in data_set:
        if data[0] != -1:
            if data[-1] == 0:
                normal.append(data[0])
            elif data[-1] == 1:
                critical.append(data[0])
            elif data[-1] == 2:
                abnormal.append(data[0])
    f, p = stats.f_oneway(normal, critical, abnormal)
    return f, p


def GenerateCoffMatrix(data_set):
    """ Calculate the statistic (chi-square or F) and p-value of each
    feature-category pair.
    Arguments:
        data_set: the source data set.
    Returns:
        coff_matrix: the final coefficient matrix.
    """
    coff_matrix = {}
    CHILDS_FEATURE_LAB, CHILDS_CAT_LAB = GetChildDetailLab(combine=False)
    # 'tid' is an id and is skipped; the remaining four are continuous
    # features and go through the ANOVA test instead of the chi-square test.
    CONTINUOUS_FEATURE_LAB = ['tid', 'age_m', 'scr_ave', 'edu_ave',
                              'edu_of_scr']
    for feature in CHILDS_FEATURE_LAB:
        if feature in CONTINUOUS_FEATURE_LAB:
            if feature != CONTINUOUS_FEATURE_LAB[0]:
                for cat in CHILDS_CAT_LAB:
                    splited_data_set = SplitDataSet(data_set, feature, cat)
                    f, p = ANOVATest(splited_data_set)
                    coff_matrix[feature + '-' + cat] = (f, p)
        else:
            for cat in CHILDS_CAT_LAB:
                splited_data_set = SplitDataSet(data_set, feature, cat)
                chi2, p = CalChi2(splited_data_set)
                coff_matrix[feature + '-' + cat] = (chi2, p)
    return coff_matrix


def SiftRelativeFeature(coff_matrix, conf=1e-5):
    """ Sift out the features that satisfy the confidence condition.
    Arguments:
        coff_matrix: the calculated coefficient matrix for
            all features and categories.
        conf: the confidence threshold on the p-value.
    Returns:
        relative_feature_matrix: the satisfying features with the
            corresponding category, statistic, and p-value.
    """
    relative_feature_matrix = {}
    for key in coff_matrix.keys():
        if coff_matrix[key][-1] <= conf:
            relative_feature_matrix[key] = coff_matrix[key]
    return relative_feature_matrix


def WriteResult(coff_matrix, file_name='result.txt'):
    """ Write the result to a file, sorted by p-value in ascending order.
    Arguments:
        coff_matrix: the coefficient matrix to write.
        file_name: the result file name.
    """
    sorted_coff_matrix = sorted(coff_matrix.items(),
                                key=lambda item: item[1][-1])
    with open(file_name, 'w') as f:
        for item in sorted_coff_matrix:
            f.write(str(item))
            f.write('\n')


if __name__ == '__main__':
    data_set, labels = ReadChildInfoFromExcel()
    # splited_data_set = SplitDataSet(data_set, 'female', 'difficulties_cat')
    # chi2, p = CalChi2(splited_data_set)
    coff_matrix = GenerateCoffMatrix(data_set)
    # relative_feature = SiftRelativeFeature(coff_matrix, 1)
    WriteResult(coff_matrix)

result

Factor Analysis

First, a word on the difference between factor analysis and principal component analysis; the following is quoted from Zhihu (https://www.zhihu.com/question/24524693):

pca_fa_1

pca_fa_2

The analysis itself was carried out in SPSS.

The steps are as follows (from SPSS数据分析从入门到精通 by 陈胜可):

spss_1

spss_2

The experimental results are as follows:

fa_result


2017.2.27 Updates

Decision Tree

import pandas as pd
from sklearn import tree
import numpy as np
from sklearn.model_selection import cross_val_score
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Read data

data = pd.read_excel('./scr_SDQ.xlsx', sheetname='data', index_col=0).dropna().sort_index()

Convert the data format

float_columns = ['scr_ave', 'edu_ave', 'scr_of_edu']
for column in data.columns:
    if column not in float_columns:
        data[column] = data[column].astype(int)

Split the data

data_feature_with_edu_ave = data.ix[:, :'mediacontact'].drop('scr_of_edu', axis=1)
data_feature_with_scr_of_edu = data.ix[:, :'mediacontact'].drop('edu_ave', axis=1)
data_classes = data.ix[:, 'emo_cat':]
# dtree = tree.DecisionTreeClassifier(min_samples_leaf=500)
# cross_val_score(dtree,data_feature_with_edu_ave, data_classes['difficulties_cat'], cv=10)
array([ 0.64356436,  0.64356436,  0.64391855,  0.64391855,  0.64407713,
        0.64407713,  0.64407713,  0.64407713,  0.64407713,  0.64407713])

difficulties_cat

dtree_with_edu_ave = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_edu_ave = dtree_with_edu_ave.fit(data_feature_with_edu_ave, data_classes['difficulties_cat'])
pd.DataFrame(dtree_with_edu_ave.feature_importances_, columns=["Imp"],
             index=data_feature_with_edu_ave.columns).sort_values(by='Imp', ascending=False)
dtree_with_scr_of_edu = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_scr_of_edu = dtree_with_scr_of_edu.fit(data_feature_with_scr_of_edu, data_classes['difficulties_cat'])
pd.DataFrame(dtree_with_scr_of_edu.feature_importances_, columns=["Imp"],
             index=data_feature_with_scr_of_edu.columns).sort_values(by='Imp', ascending=False)
# dtree_with_edu_ave_diff.png
with open('dtree_with_edu_ave_diff.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_edu_ave, out_file=dot_file, feature_names=data_feature_with_edu_ave.columns)
# dtree_with_scr_of_edu_diff.png
with open('dtree_with_scr_of_edu_diff.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_scr_of_edu, out_file=dot_file, feature_names=data_feature_with_scr_of_edu.columns)

| Feature | Imp |
| --- | --- |
| scr_ave | 0.553782 |
| medu_newcat | 0.217218 |
| age_m | 0.085820 |
| income_newcat | 0.048912 |
| edu_ave | 0.029740 |
| female | 0.026161 |
| onlychild | 0.021336 |
| mediacontact | 0.017030 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |

edu_ave_diff

| Feature | Imp |
| --- | --- |
| scr_ave | 0.525587 |
| medu_newcat | 0.213067 |
| age_m | 0.084708 |
| scr_of_edu | 0.080575 |
| mediacontact | 0.029232 |
| female | 0.025661 |
| onlychild | 0.020929 |
| income_newcat | 0.020240 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |

scr_of_edu_diff

emo_cat

dtree_with_edu_ave = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_edu_ave = dtree_with_edu_ave.fit(data_feature_with_edu_ave, data_classes['emo_cat'])
pd.DataFrame(dtree_with_edu_ave.feature_importances_, columns=["Imp"],
             index=data_feature_with_edu_ave.columns).sort_values(by='Imp', ascending=False)
dtree_with_scr_of_edu = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_scr_of_edu = dtree_with_scr_of_edu.fit(data_feature_with_scr_of_edu, data_classes['emo_cat'])
pd.DataFrame(dtree_with_scr_of_edu.feature_importances_, columns=["Imp"],
             index=data_feature_with_scr_of_edu.columns).sort_values(by='Imp', ascending=False)
# dtree_with_edu_ave_emo.png
with open('dtree_with_edu_ave_emo.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_edu_ave, out_file=dot_file, feature_names=data_feature_with_edu_ave.columns)
# dtree_with_scr_of_edu_emo.png
with open('dtree_with_scr_of_edu_emo.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_scr_of_edu, out_file=dot_file, feature_names=data_feature_with_scr_of_edu.columns)

| Feature | Imp |
| --- | --- |
| scr_ave | 0.521694 |
| medu_newcat | 0.159798 |
| income_newcat | 0.147978 |
| edu_ave | 0.083233 |
| age_m | 0.040750 |
| female | 0.023781 |
| mediacontact | 0.019127 |
| mediacoview | 0.003639 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |

edu_ave_emo

| Feature | Imp |
| --- | --- |
| scr_ave | 0.520711 |
| income_newcat | 0.155111 |
| medu_newcat | 0.146762 |
| scr_of_edu | 0.115745 |
| age_m | 0.023767 |
| female | 0.021924 |
| mediacontact | 0.011410 |
| mediacoview | 0.004571 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |

scr_of_edu_emo

con_cat

dtree_with_edu_ave = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_edu_ave = dtree_with_edu_ave.fit(data_feature_with_edu_ave, data_classes['con_cat'])
pd.DataFrame(dtree_with_edu_ave.feature_importances_, columns=["Imp"],
             index=data_feature_with_edu_ave.columns).sort_values(by='Imp', ascending=False)
dtree_with_scr_of_edu = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_scr_of_edu = dtree_with_scr_of_edu.fit(data_feature_with_scr_of_edu, data_classes['con_cat'])
pd.DataFrame(dtree_with_scr_of_edu.feature_importances_, columns=["Imp"],
             index=data_feature_with_scr_of_edu.columns).sort_values(by='Imp', ascending=False)
# dtree_with_edu_ave_con.png
with open('dtree_with_edu_ave_con.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_edu_ave, out_file=dot_file, feature_names=data_feature_with_edu_ave.columns)
# dtree_with_scr_of_edu_con.png
with open('dtree_with_scr_of_edu_con.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_scr_of_edu, out_file=dot_file, feature_names=data_feature_with_scr_of_edu.columns)

| Feature | Imp |
| --- | --- |
| scr_ave | 0.562325 |
| medu_newcat | 0.098819 |
| mediacontact | 0.095089 |
| edu_ave | 0.077572 |
| female | 0.064213 |
| income_newcat | 0.056334 |
| age_m | 0.023696 |
| onlychild | 0.013513 |
| mediacoview | 0.008440 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |

edu_ave_con

| Feature | Imp |
| --- | --- |
| scr_ave | 0.511282 |
| scr_of_edu | 0.110155 |
| medu_newcat | 0.097473 |
| mediacontact | 0.093793 |
| female | 0.082843 |
| income_newcat | 0.055566 |
| age_m | 0.035552 |
| onlychild | 0.013335 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |

scr_of_edu_con

hyp_cat

dtree_with_edu_ave = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_edu_ave = dtree_with_edu_ave.fit(data_feature_with_edu_ave, data_classes['hyp_cat'])
pd.DataFrame(dtree_with_edu_ave.feature_importances_, columns=["Imp"],
             index=data_feature_with_edu_ave.columns).sort_values(by='Imp', ascending=False)
dtree_with_scr_of_edu = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_scr_of_edu = dtree_with_scr_of_edu.fit(data_feature_with_scr_of_edu, data_classes['hyp_cat'])
pd.DataFrame(dtree_with_scr_of_edu.feature_importances_, columns=["Imp"],
             index=data_feature_with_scr_of_edu.columns).sort_values(by='Imp', ascending=False)
# dtree_with_edu_ave_hyp.png
with open('dtree_with_edu_ave_hyp.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_edu_ave, out_file=dot_file, feature_names=data_feature_with_edu_ave.columns)
# dtree_with_scr_of_edu_hyp.png
with open('dtree_with_scr_of_edu_hyp.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_scr_of_edu, out_file=dot_file, feature_names=data_feature_with_scr_of_edu.columns)

| Feature | Imp |
| --- | --- |
| medu_newcat | 0.343448 |
| scr_ave | 0.265652 |
| onlychild | 0.165487 |
| edu_ave | 0.076194 |
| income_newcat | 0.054466 |
| age_m | 0.046774 |
| mediacontact | 0.020992 |
| mediacoview | 0.015532 |
| female | 0.011454 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |

edu_ave_hyp

| Feature | Imp |
| --- | --- |
| medu_newcat | 0.335192 |
| scr_ave | 0.257218 |
| onlychild | 0.161509 |
| scr_of_edu | 0.105471 |
| age_m | 0.065618 |
| income_newcat | 0.033129 |
| mediacontact | 0.022542 |
| mediacoview | 0.012901 |
| female | 0.006421 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |

scr_of_edu_hyp

pee_cat

dtree_with_edu_ave = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_edu_ave = dtree_with_edu_ave.fit(data_feature_with_edu_ave, data_classes['pee_cat'])
pd.DataFrame(dtree_with_edu_ave.feature_importances_, columns=["Imp"],
             index=data_feature_with_edu_ave.columns).sort_values(by='Imp', ascending=False)
dtree_with_scr_of_edu = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_scr_of_edu = dtree_with_scr_of_edu.fit(data_feature_with_scr_of_edu, data_classes['pee_cat'])
pd.DataFrame(dtree_with_scr_of_edu.feature_importances_, columns=["Imp"],
             index=data_feature_with_scr_of_edu.columns).sort_values(by='Imp', ascending=False)
# dtree_with_edu_ave_pee.png
with open('dtree_with_edu_ave_pee.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_edu_ave, out_file=dot_file, feature_names=data_feature_with_edu_ave.columns)
# dtree_with_scr_of_edu_pee.png
with open('dtree_with_scr_of_edu_pee.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_scr_of_edu, out_file=dot_file, feature_names=data_feature_with_scr_of_edu.columns)

| Feature | Imp |
| --- | --- |
| scr_ave | 0.441246 |
| income_newcat | 0.210646 |
| female | 0.152568 |
| age_m | 0.151888 |
| medu_newcat | 0.033530 |
| mediacoview | 0.009625 |
| edu_ave | 0.000498 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacontact | 0.000000 |

edu_ave_pee

| Feature | Imp |
| --- | --- |
| scr_ave | 0.430946 |
| income_newcat | 0.200188 |
| age_m | 0.151286 |
| female | 0.139104 |
| medu_newcat | 0.033289 |
| scr_of_edu | 0.033259 |
| mediacoview | 0.011927 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacontact | 0.000000 |

scr_of_edu_pee

pro_cat

dtree_with_edu_ave = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_edu_ave = dtree_with_edu_ave.fit(data_feature_with_edu_ave, data_classes['pro_cat'])
pd.DataFrame(dtree_with_edu_ave.feature_importances_, columns=["Imp"],
             index=data_feature_with_edu_ave.columns).sort_values(by='Imp', ascending=False)
dtree_with_scr_of_edu = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_with_scr_of_edu = dtree_with_scr_of_edu.fit(data_feature_with_scr_of_edu, data_classes['pro_cat'])
pd.DataFrame(dtree_with_scr_of_edu.feature_importances_, columns=["Imp"],
             index=data_feature_with_scr_of_edu.columns).sort_values(by='Imp', ascending=False)
# dtree_with_edu_ave_pro.png
with open('dtree_with_edu_ave_pro.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_edu_ave, out_file=dot_file, feature_names=data_feature_with_edu_ave.columns)
# dtree_with_scr_of_edu_pro.png
with open('dtree_with_scr_of_edu_pro.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_with_scr_of_edu, out_file=dot_file, feature_names=data_feature_with_scr_of_edu.columns)

| Feature | Imp |
| --- | --- |
| female | 0.295389 |
| scr_ave | 0.252091 |
| mediacontact | 0.145010 |
| age_m | 0.137453 |
| income_newcat | 0.085421 |
| edu_ave | 0.062697 |
| medu_newcat | 0.021938 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |

edu_ave_pro

| Feature | Imp |
| --- | --- |
| female | 0.299316 |
| scr_ave | 0.204496 |
| mediacontact | 0.131972 |
| age_m | 0.125169 |
| scr_of_edu | 0.111909 |
| income_newcat | 0.080072 |
| medu_newcat | 0.047066 |
| onlychild | 0.000000 |
| divorce | 0.000000 |
| scr_h_cat | 0.000000 |
| mediacoview | 0.000000 |

scr_of_edu_pro

Convert continuous variables to categorical variables

data_category = data.copy()
for column in data_category.columns:
    if column in float_columns:
        data_category[column] = pd.cut(data_category[column], 10, labels=np.arange(10))
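A side note on `pd.cut` as used above: it splits a continuous column into equal-width bins and replaces the values with bin labels. A toy sketch (assuming pandas; the series values are made up):

```python
import numpy as np
import pandas as pd

scores = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0])
# 10 equal-width bins over the observed range, labelled 0..9.
binned = pd.cut(scores, 10, labels=np.arange(10))
labels_list = [int(x) for x in binned.tolist()]
print(labels_list)  # [0, 2, 4, 7, 9]
```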
data_cate_feature_with_edu_ave = data_category.ix[:, :'mediacontact'].drop('scr_of_edu', axis=1)
data_cate_feature_with_scr_of_edu = data_category.ix[:, :'mediacontact'].drop('edu_ave', axis=1)
data_cate_classes = data.ix[:, 'emo_cat':]
dtree_cate = tree.DecisionTreeClassifier(min_samples_leaf=50)
cross_val_score(dtree_cate,data_cate_feature_with_edu_ave, data_cate_classes['difficulties_cat'], cv=10).sum() / 10
0.63948514409153423
dtree_cate_with_edu_ave = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_cate_with_edu_ave = dtree_cate_with_edu_ave.fit(data_cate_feature_with_edu_ave, data_cate_classes['difficulties_cat'])
pd.DataFrame(dtree_cate_with_edu_ave.feature_importances_, columns=["Imp"],
             index=data_cate_feature_with_edu_ave.columns).sort_values(by='Imp', ascending=False)
dtree_cate_with_scr_of_edu = tree.DecisionTreeClassifier(min_samples_leaf=500)
dtree_cate_with_scr_of_edu = dtree_cate_with_scr_of_edu.fit(data_cate_feature_with_scr_of_edu, data_cate_classes['difficulties_cat'])
pd.DataFrame(dtree_cate_with_scr_of_edu.feature_importances_, columns=["Imp"],
             index=data_cate_feature_with_scr_of_edu.columns).sort_values(by='Imp', ascending=False)
# dtree_cate_with_edu_ave_diff.png
with open('dtree_cate_with_edu_ave_diff.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_cate_with_edu_ave, out_file=dot_file, feature_names=data_cate_feature_with_edu_ave.columns)
# dtree_cate_with_scr_of_edu_diff.png
with open('dtree_cate_with_scr_of_edu_diff.dot', 'w') as dot_file:
    tree.export_graphviz(dtree_cate_with_scr_of_edu, out_file=dot_file, feature_names=data_cate_feature_with_scr_of_edu.columns)

| Feature | Imp |
| --- | --- |
| scr_h_cat | 0.528976 |
| medu_newcat | 0.221779 |
| age_m | 0.092168 |
| income_newcat | 0.057032 |
| onlychild | 0.034724 |
| mediacontact | 0.027193 |
| female | 0.021036 |
| scr_ave | 0.011653 |
| mediacoview | 0.005439 |
| divorce | 0.000000 |
| edu_ave | 0.000000 |

| Feature | Imp |
| --- | --- |
| scr_h_cat | 0.507678 |
| medu_newcat | 0.215210 |
| age_m | 0.089438 |
| scr_of_edu | 0.042832 |
| mediacontact | 0.041463 |
| income_newcat | 0.035742 |
| onlychild | 0.033695 |
| female | 0.020413 |
| scr_ave | 0.013527 |
| divorce | 0.000000 |
| mediacoview | 0.000000 |

cate_scr_of_edu_diff