数据聚合与分组运算
|
|
|
|
|
|
GroupBy 机制
|
|
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
4 1.965781 1.246435 a one
|
|
<pandas.core.groupby.SeriesGroupBy object at 0x0000000008BAFA90>
变量groupby
是一个GroupBy
对象。它实际还没有进行任何计算,只有进行计算之后才能显示结果
|
|
key1
a 0.746672
b -0.537585
Name: data1, dtype: float64
|
|
key1 key2
a one 0.880536
two 0.478943
b one -0.519439
two -0.555730
Name: data1, dtype: float64
|
|
key2 one two
key1
a 0.880536 0.478943
b -0.519439 -0.555730
只要长度相同即可
|
|
California 2005 0.478943
2006 -0.519439
Ohio 2005 -0.380219
2006 1.965781
Name: data1, dtype: float64
只要数值型数据才会出现在结果中
|
|
data1 data2
key1
a 0.746672 0.910916
b -0.537585 0.525384
|
|
data1 data2
key1 key2
a one 0.880536 1.319920
two 0.478943 0.092908
b one -0.519439 0.281746
two -0.555730 0.769023
|
|
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
对分组进行迭代
显示分组数据(要通过这种迭代的方式才能显示)
|
|
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000034BC8CF8>
|
|
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
4 1.965781 1.246435 a one
a
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
4 1.965781 1.246435 a one
b
data1 data2 key1 key2
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
同样进行迭代才能显示结果
|
|
('a', 'one')
data1 data2 key1 key2
0 -0.204708 1.393406 a one
4 1.965781 1.246435 a one
('a', 'two')
data1 data2 key1 key2
1 0.478943 0.092908 a two
('b', 'one')
data1 data2 key1 key2
2 -0.519439 0.281746 b one
('b', 'two')
data1 data2 key1 key2
3 -0.55573 0.769023 b two
将分组结果转化成一个字典(要先转化为一个列表)
|
|
data1 data2 key1 key2
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
显示每一项的数据类型
|
|
data1 float64
data2 float64
key1 object
key2 object
dtype: object
对列进行分组…按照数据类型来?!
|
|
{dtype('float64'): data1 data2
0 -0.204708 1.393406
1 0.478943 0.092908
2 -0.519439 0.281746
3 -0.555730 0.769023
4 1.965781 1.246435, dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
选择一列或一组列
|
|
<pandas.core.groupby.SeriesGroupBy object at 0x0000000034BDC2E8>
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000015EDF2B0>
上述代码是以下代码的语法糖
|
|
<pandas.core.groupby.SeriesGroupBy object at 0x0000000034BDC8D0>
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000034BDC898>
|
|
data2
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
|
|
<pandas.core.groupby.SeriesGroupBy object at 0x0000000034BDC6A0>
|
|
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
Name: data2, dtype: float64
通过字典或Series进行分组
|
|
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
|
|
会跳过NA值
|
|
blue red
Joe 0.503905 1.063885
Steve 1.297183 -1.553778
Wes -1.021228 -1.116829
Jim 0.524712 1.770545
Travis -4.230992 -2.405455
上述功能同样可以通过Series实现
|
|
a red
b red
c blue
d blue
e red
f orange
dtype: object
|
|
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
通过函数进行分组
根据人名长度进行分组
|
|
a b c d e
3 0.591569 -0.993608 0.798764 -0.791374 2.119639
5 0.886429 -2.001637 -0.371843 1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
再加一个分组度量
|
|
a b c d e
3 one -0.539741 -1.296221 0.274992 -1.021228 -0.577087
two 0.124121 0.302614 0.523772 0.000940 1.343810
5 one 0.886429 -2.001637 -0.371843 1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
根据索引级别分组
|
|
cty US JP
tenor 1 3 5 1 3
0 0.560145 -1.265934 0.119827 -1.063512 0.332883
1 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
2 0.286350 0.377984 -0.753887 0.331286 1.349742
3 0.069877 0.246674 -0.011862 1.004812 1.327195
|
|
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3
数据聚合
|
|
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
4 1.965781 1.246435 a one
对分组后的数据进行相应操作
|
|
key1
a 1.668413
b -0.523068
Name: data1, dtype: float64
通过函数进行聚合操作
|
|
data1 data2
key1
a 2.170488 1.300498
b 0.036292 0.487276
列出分组后数据的一些常用属性
|
|
data1 data2
key1
a count 3.000000 3.000000
mean 0.746672 0.910916
std 1.109736 0.712217
min -0.204708 0.092908
25% 0.137118 0.669671
50% 0.478943 1.246435
75% 1.222362 1.319920
max 1.965781 1.393406
b count 2.000000 2.000000
mean -0.537585 0.525384
std 0.025662 0.344556
min -0.555730 0.281746
25% -0.546657 0.403565
50% -0.537585 0.525384
75% -0.528512 0.647203
max -0.519439 0.769023
导入一个数据集用于接下来更加高级的聚合操作
|
|
total_bill tip sex smoker day time size_ tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
5 25.29 4.71 Male No Sun Dinner 4 0.186240
面向列的多函数应用
根据性别以及是否吸烟进行分类
|
|
算出不同类型的顾客所给的小费占总花费的比例的平均值
|
|
sex smoker
Female No 0.156921
Yes 0.182150
Male No 0.160669
Yes 0.152771
Name: tip_pct, dtype: float64
同时算出比例的均值、标准差以及范围大小
|
|
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
起一个别名
|
|
foo bar
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
对分组后的数据的两个属性分别做三个不同的操作
|
|
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
提取出上述两个属性中的一个
|
|
count mean max
sex smoker
Female No 54 0.156921 0.252672
Yes 33 0.182150 0.416667
Male No 97 0.160669 0.291990
Yes 60 0.152771 0.710345
对多个属性进行多个操作的同时进行起别名的操作
|
|
tip_pct total_bill
Durchschnitt Abweichung Durchschnitt Abweichung
sex smoker
Female No 0.156921 0.001327 18.105185 53.092422
Yes 0.182150 0.005126 17.977879 84.451517
Male No 0.160669 0.001751 19.791237 76.152961
Yes 0.152771 0.008206 22.284500 98.244673
对不同的列进行不同的操作
|
|
size_ tip
sex smoker
Female No 140 5.2
Yes 74 6.5
Male No 263 9.0
Yes 150 10.0
对不同的列进行数量不同类型不同的操作
|
|
tip_pct size_
min max mean std sum
sex smoker
Female No 0.056797 0.252672 0.156921 0.036421 140
Yes 0.056433 0.416667 0.182150 0.071595 74
Male No 0.071804 0.291990 0.160669 0.041849 263
Yes 0.035638 0.710345 0.152771 0.090588 150
以无索引的形式返回聚合数据
|
|
sex smoker total_bill tip size_ tip_pct
0 Female No 18.105185 2.773519 2.592593 0.156921
1 Female Yes 17.977879 2.931515 2.242424 0.182150
2 Male No 19.791237 3.113402 2.711340 0.160669
3 Male Yes 22.284500 3.051167 2.500000 0.152771
|
|
total_bill tip size_ tip_pct
sex smoker
Female No 18.105185 2.773519 2.592593 0.156921
Yes 17.977879 2.931515 2.242424 0.182150
Male No 19.791237 3.113402 2.711340 0.160669
Yes 22.284500 3.051167 2.500000 0.152771
分组级运算和转换
|
|
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
4 1.965781 1.246435 a one
|
|
mean_data1 mean_data2
key1
a 0.746672 0.910916
b -0.537585 0.525384
保留原索引
|
|
data1 data2 key1 key2 mean_data1 mean_data2
0 -0.204708 1.393406 a one 0.746672 0.910916
1 0.478943 0.092908 a two 0.746672 0.910916
4 1.965781 1.246435 a one 0.746672 0.910916
2 -0.519439 0.281746 b one -0.537585 0.525384
3 -0.555730 0.769023 b two -0.537585 0.525384
另一个例子,以更简洁的方式实现上述功能
|
|
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
|
|
a b c d e
one -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
two 0.505275 -0.849512 0.075965 0.834983 0.452620
将聚合后的结果放回原来数据中合适的位置(标量进行广播)
|
|
a b c d e
Joe -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
Steve 0.505275 -0.849512 0.075965 0.834983 0.452620
Wes -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
Jim 0.505275 -0.849512 0.075965 0.834983 0.452620
Travis -0.082032 -1.063687 -1.047620 -0.884358 -0.028309
同时减去均值
|
|
a b c d e
Joe 1.089221 -0.232534 1.322612 1.113271 1.381226
Steve 0.381154 -1.152125 -0.447807 0.834043 -0.891190
Wes -0.457709 NaN NaN -0.136869 -0.548778
Jim -0.381154 1.152125 0.447807 -0.834043 0.891190
Travis -0.631512 0.232534 -1.322612 -0.976402 -0.832448
检验一下
|
|
a b c d e
one 0.000000e+00 -1.110223e-16 0.0 7.401487e-17 0.0
two -2.775558e-17 0.000000e+00 0.0 0.000000e+00 0.0
Apply: 一般性的 “拆分-应用-合并”
小费数据,根据某一个属性从大到小进行排序
|
|
total_bill tip sex smoker day time size_ tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
在分组后的数据集上进行上述排序操作,说明分组后的每一组都是一个DataFrame对象
|
|
total_bill tip sex smoker day time size_ tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
|
|
total_bill tip sex smoker day time size_ \
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2
Sat 212 48.33 9.00 Male No Sat Dinner 4
Sun 156 48.17 5.00 Male No Sun Dinner 6
Thur 142 41.19 5.00 Male No Thur Lunch 5
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4
Sat 170 50.81 10.00 Male Yes Sat Dinner 3
Sun 182 45.35 3.50 Male Yes Sun Dinner 3
Thur 197 43.11 5.00 Female Yes Thur Lunch 4
tip_pct
smoker day
No Fri 94 0.142857
Sat 212 0.186220
Sun 156 0.103799
Thur 142 0.121389
Yes Fri 95 0.117750
Sat 170 0.196812
Sun 182 0.077178
Thur 197 0.115982
获取分组后数据某一列的统计数据
|
|
smoker
No count 151.000000
mean 0.159328
std 0.039910
min 0.056797
25% 0.136906
50% 0.155625
75% 0.185014
max 0.291990
Yes count 93.000000
mean 0.163196
std 0.085119
min 0.035638
25% 0.106771
50% 0.153846
75% 0.195059
max 0.710345
Name: tip_pct, dtype: float64
|
|
smoker No Yes
count 151.000000 93.000000
mean 0.159328 0.163196
std 0.039910 0.085119
min 0.056797 0.035638
25% 0.136906 0.106771
50% 0.155625 0.153846
75% 0.185014 0.195059
max 0.291990 0.710345
grouped 根据性别以及是否吸烟进行分组
|
|
total_bill tip size_ tip_pct
sex smoker
Female No count 54.000000 54.000000 54.000000 54.000000
mean 18.105185 2.773519 2.592593 0.156921
std 7.286455 1.128425 1.073146 0.036421
min 7.250000 1.000000 1.000000 0.056797
25% 12.650000 2.000000 2.000000 0.139708
50% 16.690000 2.680000 2.000000 0.149691
75% 20.862500 3.437500 3.000000 0.181630
max 35.830000 5.200000 6.000000 0.252672
Yes count 33.000000 33.000000 33.000000 33.000000
mean 17.977879 2.931515 2.242424 0.182150
std 9.189751 1.219916 0.613917 0.071595
min 3.070000 1.000000 1.000000 0.056433
25% 12.760000 2.000000 2.000000 0.152439
50% 16.270000 2.880000 2.000000 0.173913
75% 22.120000 3.500000 2.000000 0.198216
max 44.300000 6.500000 4.000000 0.416667
Male No count 97.000000 97.000000 97.000000 97.000000
mean 19.791237 3.113402 2.711340 0.160669
std 8.726566 1.489559 0.989094 0.041849
min 7.510000 1.250000 2.000000 0.071804
25% 13.810000 2.000000 2.000000 0.131810
50% 18.240000 2.740000 2.000000 0.157604
75% 22.820000 3.710000 3.000000 0.186220
max 48.330000 9.000000 6.000000 0.291990
Yes count 60.000000 60.000000 60.000000 60.000000
mean 22.284500 3.051167 2.500000 0.152771
std 9.911845 1.500120 0.892530 0.090588
min 7.250000 1.000000 1.000000 0.035638
25% 15.272500 2.000000 2.000000 0.101845
50% 20.390000 3.000000 2.000000 0.141015
75% 28.572500 3.820000 3.000000 0.191697
max 50.810000 10.000000 5.000000 0.710345
禁止分组键
|
|
total_bill tip sex smoker day time size_ tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
|
|
total_bill tip sex smoker day time size_ tip_pct
88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
分位数与桶分析
|
|
0 (-1.23, 0.489]
1 (-2.956, -1.23]
2 (-1.23, 0.489]
3 (0.489, 2.208]
4 (-1.23, 0.489]
5 (0.489, 2.208]
6 (-1.23, 0.489]
7 (-1.23, 0.489]
8 (0.489, 2.208]
9 (0.489, 2.208]
Name: data1, dtype: category
Categories (4, object): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
|
|
count max mean min
data1
(-2.956, -1.23] 95.0 1.670835 -0.039521 -3.399312
(-1.23, 0.489] 598.0 3.260383 -0.002051 -2.989741
(0.489, 2.208] 297.0 2.954439 0.081822 -3.745356
(2.208, 3.928] 10.0 1.765640 0.024750 -1.929776
|
|
count max mean min
data1
0 100.0 1.670835 -0.049902 -3.399312
1 100.0 2.628441 0.030989 -1.950098
2 100.0 2.527939 -0.067179 -2.925113
3 100.0 3.260383 0.065713 -2.315555
4 100.0 2.074345 -0.111653 -2.047939
5 100.0 2.184810 0.052130 -2.989741
6 100.0 2.458842 -0.021489 -2.223506
7 100.0 2.954439 -0.026459 -3.056990
8 100.0 2.735527 0.103406 -3.745356
9 100.0 2.377020 0.220122 -2.064111
Example: 用特定分组的值填充缺失值
填一些缺失值进去
|
|
0 NaN
1 -0.125921
2 NaN
3 -0.884475
4 NaN
5 0.227290
dtype: float64
用均值填充缺失值
|
|
0 -0.261035
1 -0.125921
2 -0.261035
3 -0.884475
4 -0.261035
5 0.227290
dtype: float64
同样,填一些缺失值
|
|
Ohio 0.922264
New York -2.153545
Vermont NaN
Florida -0.375842
Oregon 0.329939
Nevada NaN
California 1.105913
Idaho NaN
dtype: float64
计算分组均值
|
|
East -0.535707
West 0.717926
dtype: float64
这里的g
指代调用apply
的主体,也就是data.groupby(group_key)
分组后的结果
|
|
Ohio 0.922264
New York -2.153545
Vermont -0.535707
Florida -0.375842
Oregon 0.329939
Nevada 0.717926
California 1.105913
Idaho 0.717926
dtype: float64
由于groupby
操作后得到的结果类似于一个字典,字典key
是组名,value
是一个DataFrame Object
|
|
Ohio 0.922264
New York -2.153545
Vermont 0.500000
Florida -0.375842
Oregon 0.329939
Nevada -1.000000
California 1.105913
Idaho -1.000000
dtype: float64
Example: 随机采样和排列
构造扑克牌
红桃Hearts
, 黑桃Spades
, 梅花Clubs
, 方片Diamonds
|
|
|
|
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
dtype: int64
随机抽牌
|
|
AD 1
8C 8
5H 5
KC 10
2C 2
dtype: int64
分类抽牌
|
|
C 2C 2
3C 3
D KD 10
8D 8
H KH 10
3H 3
S 2S 2
4S 4
dtype: int64
去掉分组键
|
|
KC 10
JC 10
AD 1
5D 5
5H 5
6H 6
7S 7
KS 10
dtype: int64
Example: 分组加权平均数和相关系数
|
|
category data weights
0 a 1.561587 0.957515
1 a 1.219984 0.347267
2 a -0.482239 0.581362
3 a 0.315667 0.217091
4 b -0.047852 0.894406
5 b -0.454145 0.918564
6 b -0.556774 0.277825
7 b 0.253321 0.955905
|
|
category
a 0.811643
b -0.122262
dtype: float64
stock
数据集
|
|
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL 2214 non-null float64
MSFT 2214 non-null float64
XOM 2214 non-null float64
SPX 2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
|
|
AAPL MSFT XOM SPX
2011-10-11 400.29 27.00 76.27 1195.54
2011-10-12 402.19 26.96 77.16 1207.25
2011-10-13 408.43 27.18 76.37 1203.66
2011-10-14 422.00 27.27 78.11 1224.58
计算相关系数
|
|
AAPL MSFT XOM SPX
2003 0.541124 0.745174 0.661265 1.0
2004 0.374283 0.588531 0.557742 1.0
2005 0.467540 0.562374 0.631010 1.0
2006 0.428267 0.406126 0.518514 1.0
2007 0.508118 0.658770 0.786264 1.0
2008 0.681434 0.804626 0.828303 1.0
2009 0.707103 0.654902 0.797921 1.0
2010 0.710105 0.730118 0.839057 1.0
2011 0.691931 0.800996 0.859975 1.0
lambda
看来有很大用处
|
|
2003 0.480868
2004 0.259024
2005 0.300093
2006 0.161735
2007 0.417738
2008 0.611901
2009 0.432738
2010 0.571946
2011 0.581987
dtype: float64
Example: 分组级线型回归
|
|
这样传参
|
|
SPX intercept
2003 1.195406 0.000710
2004 1.363463 0.004201
2005 1.766415 0.003246
2006 1.645496 0.000080
2007 1.198761 0.003438
2008 0.968016 -0.001110
2009 0.879103 0.002954
2010 1.052608 0.001261
2011 0.806605 0.001514
透视表和交叉表
|
|
total_bill tip sex smoker day time size_ tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
5 25.29 4.71 Male No Sun Dinner 4 0.186240
6 8.77 2.00 Male No Sun Dinner 2 0.228050
7 26.88 3.12 Male No Sun Dinner 4 0.116071
8 15.04 1.96 Male No Sun Dinner 2 0.130319
9 14.78 3.23 Male No Sun Dinner 2 0.218539
pivot_table
默认情况相当于分组后进行mean()
操作
|
|
size_ tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500
指定分组度量
|
|
tip_pct size_
smoker No Yes No Yes
sex day
Female Fri 0.165296 0.209129 2.500000 2.000000
Sat 0.147993 0.163817 2.307692 2.200000
Sun 0.165710 0.237075 3.071429 2.500000
Thur 0.155971 0.163073 2.480000 2.428571
Male Fri 0.138005 0.144730 2.000000 2.125000
Sat 0.162132 0.139067 2.656250 2.629630
Sun 0.158291 0.173964 2.883721 2.600000
Thur 0.165706 0.164417 2.500000 2.300000
增加ALL列
|
|
tip_pct size_
smoker No Yes All No Yes All
sex day
Female Fri 0.165296 0.209129 0.199388 2.500000 2.000000 2.111111
Sat 0.147993 0.163817 0.156470 2.307692 2.200000 2.250000
Sun 0.165710 0.237075 0.181569 3.071429 2.500000 2.944444
Thur 0.155971 0.163073 0.157525 2.480000 2.428571 2.468750
Male Fri 0.138005 0.144730 0.143385 2.000000 2.125000 2.100000
Sat 0.162132 0.139067 0.151577 2.656250 2.629630 2.644068
Sun 0.158291 0.173964 0.162344 2.883721 2.600000 2.810345
Thur 0.165706 0.164417 0.165276 2.500000 2.300000 2.433333
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672
更换一个分组度量
|
|
day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0
分组计数并填充(可能存在空组合)
|
|
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2 30 43 2
Yes 8 33 10 0
Male No 4 85 124 0
Yes 12 71 39 0
Lunch Female No 3 0 0 60
Yes 6 0 0 17
Male No 0 0 0 50
Yes 5 0 0 23
交叉表: crosstab
|
|
|
|
Sample Gender Handedness
0 1 Female Right-handed
1 2 Male Left-handed
2 3 Female Right-handed
3 4 Male Right-handed
4 5 Male Left-handed
5 6 Male Right-handed
6 7 Female Right-handed
7 8 Female Left-handed
8 9 Male Right-handed
9 10 Female Right-handed
交叉表就是在…计数…
|
|
Handedness Left-handed Right-handed All
Gender
Female 1 4 5
Male 2 3 5
All 3 7 10
|
|
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
Example: 2012 联邦选举委员会数据库
这个数据库包括赞助人的姓名,职业、雇主、地址以及出资额等信息
|
|
|
|
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id 1001731 non-null object
cand_id 1001731 non-null object
cand_nm 1001731 non-null object
contbr_nm 1001731 non-null object
contbr_city 1001712 non-null object
contbr_st 1001727 non-null object
contbr_zip 1001620 non-null object
contbr_employer 988002 non-null object
contbr_occupation 993301 non-null object
contb_receipt_amt 1001731 non-null float64
contb_receipt_dt 1001731 non-null object
receipt_desc 14166 non-null object
memo_cd 92482 non-null object
memo_text 97770 non-null object
form_tp 1001731 non-null object
file_num 1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
|
|
cmte_id C00431445
cand_id P80003338
cand_nm Obama, Barack
contbr_nm ELLMAN, IRA
contbr_city TEMPE
contbr_st AZ
contbr_zip 852816719
contbr_employer ARIZONA STATE UNIVERSITY
contbr_occupation PROFESSOR
contb_receipt_amt 50
contb_receipt_dt 01-DEC-11
receipt_desc NaN
memo_cd NaN
memo_text NaN
form_tp SA17A
file_num 772372
Name: 123456, dtype: object
输出全部候选人名单
|
|
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
'Perry, Rick'], dtype=object)
|
|
'Obama, Barack'
政党映射
|
|
|
|
123456 Obama, Barack
123457 Obama, Barack
123458 Obama, Barack
123459 Obama, Barack
123460 Obama, Barack
Name: cand_nm, dtype: object
|
|
123456 Democrat
123457 Democrat
123458 Democrat
123459 Democrat
123460 Democrat
Name: cand_nm, dtype: object
根据以上创建的映射,在原数据集中添加一列party
|
|
|
|
Democrat 593746
Republican 407985
Name: party, dtype: int64
看看出资额是正是负
|
|
True 991475
False 10256
Name: contb_receipt_amt, dtype: int64
调整出资额为正
|
|
筛选候选人
|
|
根据职业和雇主统计赞助信息
统计职业信息
|
|
RETIRED 233990
INFORMATION REQUESTED 35107
ATTORNEY 34286
HOMEMAKER 29931
PHYSICIAN 23432
INFORMATION REQUESTED PER BEST EFFORTS 21138
ENGINEER 14334
TEACHER 13990
CONSULTANT 13273
PROFESSOR 12555
Name: contbr_occupation, dtype: int64
筛选出一些不符合规格的信息映射到正常信息
|
|
以上巧妙运用了get方法
|
|
根据职业以及候选人政党分组,统计出资额总和
|
|
|
|
party Democrat Republican
contbr_occupation
ATTORNEY 11141982.97 7.477194e+06
CEO 2074974.79 4.211041e+06
CONSULTANT 2459912.71 2.544725e+06
ENGINEER 951525.55 1.818374e+06
EXECUTIVE 1355161.05 4.138850e+06
HOMEMAKER 4248875.80 1.363428e+07
INVESTOR 884133.00 2.431769e+06
LAWYER 3160478.87 3.912243e+05
MANAGER 762883.22 1.444532e+06
NOT PROVIDED 4866973.96 2.056547e+07
OWNER 1001567.36 2.408287e+06
PHYSICIAN 3735124.94 3.594320e+06
PRESIDENT 1878509.95 4.720924e+06
PROFESSOR 2165071.08 2.967027e+05
REAL ESTATE 528902.09 1.625902e+06
RETIRED 25305116.38 2.356124e+07
SELF-EMPLOYED 672393.40 1.640253e+06
|
|
<matplotlib.axes._subplots.AxesSubplot at 0x340fb4e0>
|
|
|
|
cand_nm contbr_occupation
Obama, Barack CONSULTANT 2459912.71
LAWYER 3160478.87
PHYSICIAN 3735124.94
HOMEMAKER 4248875.80
INFORMATION REQUESTED 4866973.96
ATTORNEY 11141982.97
RETIRED 25305116.38
Romney, Mitt C.E.O. 1968386.11
EXECUTIVE 2300947.03
PRESIDENT 2491244.89
ATTORNEY 5364718.82
HOMEMAKER 8147446.22
INFORMATION REQUESTED PER BEST EFFORTS 11396894.84
RETIRED 11508473.59
Name: contb_receipt_amt, dtype: float64
|
|
cand_nm contbr_employer
Obama, Barack MICROSOFT 215585.36
VOLUNTEER 257104.00
STUDENT 318831.45
SELF EMPLOYED 469290.00
SELF 1076531.20
HOMEMAKER 2605408.54
INFORMATION REQUESTED 5053480.37
NOT EMPLOYED 8586308.70
SELF-EMPLOYED 17080985.96
RETIRED 22694358.85
Romney, Mitt H.I.G. CAPITAL 139500.00
BARCLAYS CAPITAL 162750.00
GOLDMAN SACH & CO. 238250.00
MORGAN STANLEY 267266.00
CREDIT SUISSE 281150.00
STUDENT 496490.94
SELF-EMPLOYED 7409860.98
HOMEMAKER 8147196.22
RETIRED 11506225.71
INFORMATION REQUESTED PER BEST EFFORTS 12059527.24
Name: contb_receipt_amt, dtype: float64
根据出资额分组
不出意外果然要用到桶
|
|
411 (10, 100]
412 (100, 1000]
413 (100, 1000]
414 (10, 100]
415 (10, 100]
416 (10, 100]
417 (100, 1000]
418 (10, 100]
419 (100, 1000]
420 (10, 100]
Name: contb_receipt_amt, dtype: category
Categories (8, object): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
|
|
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 493.0 77.0
(1, 10] 40070.0 3681.0
(10, 100] 372280.0 31853.0
(100, 1000] 153991.0 43357.0
(1000, 10000] 22284.0 26186.0
(10000, 100000] 2.0 1.0
(100000, 1000000] 3.0 NaN
(1000000, 10000000] 4.0 NaN
|
|
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 318.24 77.00
(1, 10] 337267.62 29819.66
(10, 100] 20288981.41 1987783.76
(100, 1000] 54798531.46 22363381.69
(1000, 10000] 51753705.67 63942145.42
(10000, 100000] 59100.00 12700.00
(100000, 1000000] 1490683.08 NaN
(1000000, 10000000] 7148839.76 NaN
计算比例
|
|
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 0.805182 0.194818
(1, 10] 0.918767 0.081233
(10, 100] 0.910769 0.089231
(100, 1000] 0.710176 0.289824
(1000, 10000] 0.447326 0.552674
(10000, 100000] 0.823120 0.176880
(100000, 1000000] 1.000000 NaN
(1000000, 10000000] 1.000000 NaN
画个图看看
|
|
<matplotlib.axes._subplots.AxesSubplot at 0x14c4db00>
根据州统计赞助信息
|
|
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 281840.15 86204.24
AL 543123.48 527303.51
AR 359247.28 105556.00
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FL 7318178.58 8338458.81
Mitt is so...poorly...
|
|
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 0.765778 0.234222
AL 0.507390 0.492610
AR 0.772902 0.227098
AZ 0.443745 0.556255
CA 0.679498 0.320502
CO 0.585970 0.414030
CT 0.371476 0.628524
DC 0.810113 0.189887
DE 0.802776 0.197224
FL 0.467417 0.532583