Abracadabra

Python data analysis Learning note ch05

pandas入门

按照以下约定引用相关package

1
2
from pandas import Series, DataFrame
import pandas as pd
1
2
3
4
5
6
7
8
9
10
11
12
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pandas数据结构介绍

Series

Series是一种类似于一维数组的对象,由一组数据以及一组与之相关的数据标签(类似于字典的键)组成,所以可以看成是一个有序的字典

1
2
obj = Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
1
2
obj.values
obj.index
array([ 4,  7, -5,  3], dtype=int64)






RangeIndex(start=0, stop=4, step=1)
1
2
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
1
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
1
obj2['a']
-5
1
2
obj2['d'] = 6
obj2[['c', 'a', 'd']]
c    3
a   -5
d    6
dtype: int64

各种Numpy运算都是作用在数据上,同时索引与数据的链接会一直保持

1
obj2[obj2 > 0]
d    6
b    7
c    3
dtype: int64
1
obj2 * 2
d    12
b    14
a   -10
c     6
dtype: int64
1
np.exp(obj2)
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
1
'b' in obj2
True
1
'e' in obj2
False

因此可以直接根据Numpy的Dict来创建Series

1
2
3
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

如果传入了index参数的话,那么就会与传入的Dict做键匹配,没有匹配上的就设为NaN

1
2
3
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
1
pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
1
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
1
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Series有一个非常重要的数据对齐的功能

1
obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
1
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
1
obj3 + obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Series本身以及其索引都有一个叫做name的属性,这个属性十分重要,以后很多高级功能都会用到

1
2
3
obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

可以通过直接赋值的方式修改index属性

1
2
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

DataFrame是一个表格型的数据结构,可以看成由Series组成的字典,只不过这些Series共用一套索引

1
2
3
4
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

数据会被排序

1
frame














































popstateyear
01.5Ohio2000
11.7Ohio2001
23.6Ohio2002
32.4Nevada2001
42.9Nevada2002

1
DataFrame(data, columns=['year', 'state', 'pop'])














































yearstatepop
02000Ohio1.5
12001Ohio1.7
22002Ohio3.6
32001Nevada2.4
42002Nevada2.9

1
2
3
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
frame2




















































yearstatepopdebt
one2000Ohio1.5NaN
two2001Ohio1.7NaN
three2002Ohio3.6NaN
four2001Nevada2.4NaN
five2002Nevada2.9NaN

1
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

DataFrame每一个Key对应的Value都是一个Series

1
frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
1
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

注意到name属性也已经被设置好了

ix相当于一个行索引?

1
frame2.ix['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

可以利用Numpy的广播功能

1
2
frame2['debt'] = 16.5
frame2




















































yearstatepopdebt
one2000Ohio1.516.5
two2001Ohio1.716.5
three2002Ohio3.616.5
four2001Nevada2.416.5
five2002Nevada2.916.5

也可以赋值一个列表,但是长度必须匹配

1
2
frame2['debt'] = np.arange(5.)
frame2




















































yearstatepopdebt
one2000Ohio1.50.0
two2001Ohio1.71.0
three2002Ohio3.62.0
four2001Nevada2.43.0
five2002Nevada2.94.0

如果是赋值一个Series,则会匹配上索引,没有匹配上的就是置为NaN

1
2
3
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2




















































yearstatepopdebt
one2000Ohio1.5NaN
two2001Ohio1.7-1.2
three2002Ohio3.6NaN
four2001Nevada2.4-1.5
five2002Nevada2.9-1.7

也可以进行逻辑操作

1
2
frame2['eastern'] = frame2.state == 'Ohio'
frame2


























































yearstatepopdebteastern
one2000Ohio1.5NaNTrue
two2001Ohio1.7-1.2True
three2002Ohio3.6NaNTrue
four2001Nevada2.4-1.5False
five2002Nevada2.9-1.7False

del用于删除一列

1
2
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

只要是通过索引方式进行的操作,都是直接在原数据上进行的操作,不是一个副本

嵌套的字典也可直接生成DataFrame,只不过内层的键被当作index,外层的键被当作colums

1
2
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
1
2
frame3 = DataFrame(pop)
frame3






























NevadaOhio
2000NaN1.5
20012.41.7
20022.93.6

同样可以进行转置,这样的话index和column就会互换

1
frame3.T




























200020012002
NevadaNaN2.42.9
Ohio1.51.73.6

显式地指定索引,不匹配的会置为NaN

1
DataFrame(pop, index=[2001, 2002, 2003])






























NevadaOhio
20012.41.7
20022.93.6
2003NaNNaN

1
frame3






























NevadaOhio
2000NaN1.5
20012.41.7
20022.93.6

还可以这样构建

1
2
3
pdata = {'Ohio': frame3['Ohio'][:-1],
'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

























NevadaOhio
2000NaN1.5
20012.41.7

1
frame3






























NevadaOhio
2000NaN1.5
20012.41.7
20022.93.6

name属性也会在表格中显示出来

1
2
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3



































stateNevadaOhio
year
2000NaN1.5
20012.41.7
20022.93.6

values方法只返回数据,不返回index以及key

1
frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
1
frame2




















































yearstatepopdebt
one2000Ohio1.5NaN
two2001Ohio1.7-1.2
three2002Ohio3.6NaN
four2001Nevada2.4-1.5
five2002Nevada2.9-1.7

1
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

索引对象

Index是一个可以单独提取出来的对象

1
2
3
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
1
index[1:]
Index(['b', 'c'], dtype='object')

不可修改~!

1
index[1] = 'd'
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-52-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'


C:\Users\Ewan\Anaconda3\lib\site-packages\pandas\indexes\base.py in __setitem__(self, key, value)
   1243 
   1244     def __setitem__(self, key, value):
-> 1245         raise TypeError("Index does not support mutable operations")
   1246 
   1247     def __getitem__(self, key):


TypeError: Index does not support mutable operations

直接创建Index对象

1
2
3
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index
True
1
frame3



































stateNevadaOhio
year
2000NaN1.5
20012.41.7
20022.93.6

1
'Ohio' in frame3.columns
True
1
2003 in frame3.index
False

基本功能

重新索引

1
2
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

重排索引形成新对象

1
2
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
1
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
1
2
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill') # 前向填充
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
1
2
3
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'])
frame


































OhioTexasCalifornia
a012
c345
d678

1
2
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2








































OhioTexasCalifornia
a0.01.02.0
bNaNNaNNaN
c3.04.05.0
d6.07.08.0

1
2
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)


































TexasUtahCalifornia
a1NaN2
c4NaN5
d7NaN8

插值只能按行

1
2
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
columns=states)








































TexasUtahCalifornia
a1NaN2
b1NaN2
c4NaN5
d7NaN8

用ix方法进行重新索引操作会使得代码很简洁

1
frame.ix[['a', 'b', 'c', 'd'], states]








































TexasUtahCalifornia
a1.0NaN2.0
bNaNNaNNaN
c4.0NaN5.0
d7.0NaN8.0

丢弃指定轴上的项

1
frame.ix[['a', 'b', 'c', 'd'], states]








































TexasUtahCalifornia
a1.0NaN2.0
bNaNNaNNaN
c4.0NaN5.0
d7.0NaN8.0

1
2
3
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
1
obj.drop(['d', 'c'])
a    0.0
b    1.0
e    4.0
dtype: float64
1
2
3
4
data = DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data













































onetwothreefour
Ohio0123
Colorado4567
Utah891011
New York12131415

1
data.drop(['Colorado', 'Ohio'])































onetwothreefour
Utah891011
New York12131415

1
data.drop('two', axis=1)








































onethreefour
Ohio023
Colorado467
Utah81011
New York121415

1
data.drop(['two', 'four'], axis=1)



































onethree
Ohio02
Colorado46
Utah810
New York1214

索引,选取和过滤

多种索引方式

1
2
3
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
obj['b']
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64






1.0
1
obj[1]
1.0
1
obj[2:4]
c    2.0
d    3.0
dtype: float64
1
obj[['b', 'a', 'd']]
b    1.0
a    0.0
d    3.0
dtype: float64
1
obj[[1, 3]]
b    1.0
d    3.0
dtype: float64
1
obj[obj < 2] # 直接对data进行操作
a    0.0
b    1.0
dtype: float64

这种切片方式…
末端包含

1
obj['b':'c']
b    1.0
c    2.0
dtype: float64
1
2
obj['b':'c'] = 5
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
1
2
3
4
data = DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data













































onetwothreefour
Ohio0123
Colorado4567
Utah891011
New York12131415

1
data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
1
data[['three', 'one']]



































threeone
Ohio20
Colorado64
Utah108
New York1412

1
data[:2] # axis=0































onetwothreefour
Ohio0123
Colorado4567

1
data[data['three'] > 5]






































onetwothreefour
Colorado4567
Utah891011
New York12131415

1
data < 5













































onetwothreefour
OhioTrueTrueTrueTrue
ColoradoTrueFalseFalseFalse
UtahFalseFalseFalseFalse
New YorkFalseFalseFalseFalse

1
data[data < 5] = 0
1
data













































onetwothreefour
Ohio0000
Colorado0567
Utah891011
New York12131415

索引的另外一种选择

1
data.ix['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int32
1
data.ix[['Colorado', 'Utah'], [3, 0, 1]]




























fouronetwo
Colorado705
Utah1189

1
data.ix[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
1
data.ix[:'Utah', 'two']
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32
1
data.ix[data.three > 5, :3]


































onetwothree
Colorado056
Utah8910
New York121314

总结一下就是说,DataFrame是一个二维的数组,只不过每一维的索引方式除了序号之外,还可以用name属性来进行索引,且一切行为与序号无异

算术运算和数据对齐

1
2
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
1
s1
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
1
s2
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

数据对齐操作就是一种特殊的并集操作

1
s1 + s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
1
2
3
4
5
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1


































bcd
Ohio0.01.02.0
Texas3.04.05.0
Colorado6.07.08.0

1
df2








































bde
Utah0.01.02.0
Ohio3.04.05.0
Texas6.07.08.0
Oregon9.010.011.0

并且数据对齐操作是在所有维度上同时进行的

1
df1 + df2




















































bcde
ColoradoNaNNaNNaNNaN
Ohio3.0NaN6.0NaN
OregonNaNNaNNaNNaN
Texas9.0NaN12.0NaN
UtahNaNNaNNaNNaN

在算术方法中填充词

下面这种定义column的方式值得注意

1
2
3
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1






































abcd
00.01.02.03.0
14.05.06.07.0
28.09.010.011.0

1
df2


















































abcde
00.01.02.03.04.0
15.06.07.08.09.0
210.011.012.013.014.0
315.016.017.018.019.0

1
df1 + df2


















































abcde
00.02.04.06.0NaN
19.011.013.015.0NaN
218.020.022.024.0NaN
3NaNNaNNaNNaNNaN

要想填充值必须使用add方法

1
df1.add(df2, fill_value=0)


















































abcde
00.02.04.06.04.0
19.011.013.015.09.0
218.020.022.024.014.0
315.016.017.018.019.0

reindex方法与add方法还是存在差异的

1
df1.reindex(columns=df2.columns, fill_value=0)










































abcde
00.01.02.03.00
14.05.06.07.00
28.09.010.011.00

DataFrame和Series之间的运算

1
2
arr = np.arange(12.).reshape((3, 4))
arr
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
1
arr[0]
array([ 0.,  1.,  2.,  3.])

广播操作

1
arr - arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
1
2
3
4
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
frame








































bde
Utah0.01.02.0
Ohio3.04.05.0
Texas6.07.08.0
Oregon9.010.011.0

Series的name等于DataFrame的切片属性

1
series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

默认情况下,DataFrame和Series之间的算术运算会将Series的index匹配到DataFrame的column, 然后沿着行向下广播

1
frame - series








































bde
Utah0.00.00.0
Ohio3.03.03.0
Texas6.06.06.0
Oregon9.09.09.0

如果Series的index与DataFrame的column不匹配,则进行数据对齐

1
2
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2













































bdef
Utah0.0NaN3.0NaN
Ohio3.0NaN6.0NaN
Texas6.0NaN9.0NaN
Oregon9.0NaN12.0NaN

1
2
series3 = frame['d']
frame








































bde
Utah0.01.02.0
Ohio3.04.05.0
Texas6.07.08.0
Oregon9.010.011.0

1
series3
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

匹配行并且在列上进行广播, 就必须要指定axis

1
2
frame.sub(series3, axis=0)
frame.sub(series3, axis=1)








































bde
Utah-1.00.01.0
Ohio-1.00.01.0
Texas-1.00.01.0
Oregon-1.00.01.0





























































OhioOregonTexasUtahbde
UtahNaNNaNNaNNaNNaNNaNNaN
OhioNaNNaNNaNNaNNaNNaNNaN
TexasNaNNaNNaNNaNNaNNaNNaN
OregonNaNNaNNaNNaNNaNNaNNaN

函数应用和映射

1
2
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
1
frame








































bde
Utah-0.2047080.478943-0.519439
Ohio-0.5557301.9657811.393406
Texas0.0929080.2817460.769023
Oregon1.2464351.007189-1.296221

Numpy的元素级方法也可以应用到DataFrame上,直接把DataFrame当作二维的Numpy.array即可

1
np.abs(frame)








































bde
Utah0.2047080.4789430.519439
Ohio0.5557301.9657811.393406
Texas0.0929080.2817460.769023
Oregon1.2464351.0071891.296221

1
f = lambda x: x.max() - x.min()

apply方法将函数映射到由各行或者各列形成的一维数组上

1
frame.apply(f) # axis=0
b    1.802165
d    1.684034
e    2.689627
dtype: float64
1
frame.apply(f, axis=1)
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
1
2
3
4
def f(x):
return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
frame.apply(f, axis=1)




























bde
min-0.5557300.281746-1.296221
max1.2464351.9657811.393406




































minmax
Utah-0.5194390.478943
Ohio-0.5557301.965781
Texas0.0929080.769023
Oregon-1.2962211.246435

元素级的函数映射applymap, 之所以叫这个名字是因为Series有一个元素级的映射函数map

1
2
format = lambda x: '%.2f' % x
frame.applymap(format)








































bde
Utah-0.200.48-0.52
Ohio-0.561.971.39
Texas0.090.280.77
Oregon1.251.01-1.30

1
frame['e'].map(format)
Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object

排序和排名

对索引或者column进行(字典)排序

1
2
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int32
1
2
3
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame.sort_index()































dabc
one4567
three0123

1
frame.sort_index(axis=1)































abcd
three1230
one5674

降序

1
frame.sort_index(axis=1, ascending=False)































dcba
three0321
one4765

按照data进行排序

1
2
obj = Series([4, 7, -3, 2])
obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64

在排序时,任何缺失值默认都会被放到Series的末尾

1
2
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
1
2
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame



































ab
004
117
20-3
312

对指定index或者column进行排序

1
frame.sort_values(by='b')



































ab
20-3
312
004
117

或者根据multi-index亦或multi-column进行排序

1
frame.sort_values(by=['a', 'b'])



































ab
20-3
004
312
117

默认情况下,rank方法通过“为各组分配一个平均排名”的方式破坏平级关系。也就是说,如果有多个相同的值,则这些值的rank就是这些相同值rand的算术平均。

1
2
obj = Series([7, -5, 7, 4, 2, 0, 4, 4])
obj.rank()
0    7.5
1    1.0
2    7.5
3    5.0
4    3.0
5    2.0
6    5.0
7    5.0
dtype: float64

如果想按照一般方式排名

1
obj.rank(method='first')
0    7.0
1    1.0
2    8.0
3    4.0
4    3.0
5    2.0
6    5.0
7    6.0
dtype: float64

使用每个分组的最大排名

1
obj.rank(ascending=False, method='max')
0    2.0
1    8.0
2    2.0
3    5.0
4    6.0
5    7.0
6    5.0
7    5.0
dtype: float64
1
2
3
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame








































abc
004.3-2.0
117.05.0
20-3.08.0
312.0-2.5

指定维度

1
frame.rank(axis=1)








































abc
02.03.01.0
11.03.02.0
22.01.03.0
32.03.01.0

带有重复值的轴索引

1
2
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a    0
a    1
b    2
b    3
c    4
dtype: int32
1
obj.index.is_unique
False

返回一个Series

1
obj['a']
a    0
a    1
dtype: int32
1
obj['c']
4
1
2
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df








































012
a0.2749920.2289131.352917
a0.886429-2.001637-0.371843
b1.669025-0.438570-0.539741
b0.4769853.248944-1.021228

1
df.ix['b']




























012
b1.669025-0.438570-0.539741
b0.4769853.248944-1.021228

汇总和计算描述统计

1
2
3
4
5
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df



































onetwo
a1.40NaN
b7.10-4.5
cNaNNaN
d0.75-1.3

1
df.sum() # axis=0 skipna=True
one    9.25
two   -5.80
dtype: float64
1
df.sum(axis=1) # skipna=True
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
1
df.mean(axis=1, skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

返回的是索引

1
df.idxmax()
one    b
two    d
dtype: object
1
df.cumsum()



































onetwo
a1.40NaN
b8.50-4.5
cNaNNaN
d9.25-5.8

describe对于数值型和非数值型数据的行为不一样

1
df.describe()
C:\Users\Ewan\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)























































onetwo
count3.0000002.000000
mean3.083333-2.900000
std3.4936852.262742
min0.750000-4.500000
25%NaNNaN
50%NaNNaN
75%NaNNaN
max7.100000-1.300000

1
2
3
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj
obj.describe()
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object






count     16
unique     3
top        a
freq       8
dtype: object

相关系数和xi

1
2
3
4
5
6
7
8
9
10
import pandas_datareader.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
all_data[ticker] = web.get_data_yahoo(ticker)
price = DataFrame({tic: data['Adj Close']
for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
for tic, data in all_data.items()})
1
price[:10]






























































































AAPLGOOGIBMMSFT
Date
2010-01-0427.727039313.062468111.40500025.555485
2010-01-0527.774976311.683844110.05923225.563741
2010-01-0627.333178303.826685109.34428325.406859
2010-01-0727.282650296.753749108.96578625.142634
2010-01-0827.464034300.709808110.05923225.316031
2010-01-1127.221758300.255255108.90690324.994007
2010-01-1226.912110294.945572109.77324524.828866
2010-01-1327.291720293.252243109.53773525.060064
2010-01-1427.133657294.630868111.28724525.563741
2010-01-1526.680198289.710772110.84145825.481172

1
volume[:10]






























































































AAPLGOOGIBMMSFT
Date
2010-01-041234324003927000615530038409100
2010-01-051504762006031900684140049749600
2010-01-061380400007987100560530058182400
2010-01-0711928280012876600584060050559700
2010-01-081119027009483900419720051197400
2010-01-1111555740014479800573040068754700
2010-01-121486149009742900808150065912100
2010-01-1315147300013041800645540051863500
2010-01-141082235008511900711180063228100
2010-01-1514851690010909600849440079913200

price.pct_change

Signature: price.pct_change(periods=1, fill_method=’pad’, limit=None, freq=None, **kwargs)
Docstring:
Percent change over given number of periods.

Parameters

periods : int, default 1
Periods to shift for forming percent change

fill_method : str, default ‘pad’
How to handle NAs before computing percent changes

limit : int, default None
The number of consecutive NAs to fill before stopping

freq : DateOffset, timedelta, or offset alias string, optional
Increment to use from time series API (e.g. ‘M’ or BDay())

Returns

chg : NDFrame

Notes

By default, the percentage change is calculated along the stat
axis: 0, or Index, for DataFrame and 1, or minor for
Panel. You can change this with the axis keyword argument.

1
2
returns = price.pct_change()
returns.tail()



























































AAPLGOOGIBMMSFT
Date
2017-02-210.0072210.004335-0.002269-0.002012
2017-02-220.002999-0.0010820.004937-0.002016
2017-02-23-0.0042300.0006860.0027600.004040
2017-02-240.000952-0.003236-0.0016510.000000
2017-02-270.0019760.000772-0.010753-0.006035

1
returns.MSFT.corr(returns.IBM)
0.49525655865062668
1
returns.MSFT.cov(returns.IBM)
8.5880535146740545e-05
1
returns.corr()













































AAPLGOOGIBMMSFT
AAPL1.0000000.4095230.3813740.388875
GOOG0.4095231.0000000.4027810.470781
IBM0.3813740.4027811.0000000.495257
MSFT0.3888750.4707810.4952571.000000

1
returns.cov()













































AAPLGOOGIBMMSFT
AAPL0.0002690.0001050.0000750.000092
GOOG0.0001050.0002440.0000750.000106
IBM0.0000750.0000750.0001440.000086
MSFT0.0000920.0001060.0000860.000209

1
returns.corrwith(returns.IBM)
AAPL    0.381374
GOOG    0.402781
IBM     1.000000
MSFT    0.495257
dtype: float64
1
returns.corrwith(volume)
AAPL   -0.074055
GOOG   -0.009543
IBM    -0.194107
MSFT   -0.090724
dtype: float64

唯一值, 值计数以及成员资格

1
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
1
2
uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
1
obj.value_counts()
c    3
a    3
b    2
d    1
dtype: int64
1
pd.value_counts(obj.values, sort=False)
a    3
d    1
b    2
c    3
dtype: int64
1
2
mask = obj.isin(['b', 'c'])
mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
1
obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object
1
2
3
4
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
'Qu2': [2, 3, 1, 2, 3],
'Qu3': [1, 5, 2, 4, 4]})
data














































Qu1Qu2Qu3
0121
1335
2412
3324
4434

1
2
result = data.apply(pd.value_counts).fillna(0)
result














































Qu1Qu2Qu3
11.01.01.0
20.02.01.0
32.02.00.0
42.00.02.0
50.00.01.0

处理缺失数据

1
2
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
1
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
1
2
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool

滤除缺失数据

1
2
3
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
1
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
1
2
3
4
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data








































012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0

1
cleaned






















012
01.06.53.0

1
data.dropna(how='all')


































012
01.06.53.0
11.0NaNNaN
3NaN6.53.0

1
2
data[4] = NA
data













































0124
01.06.53.0NaN
11.0NaNNaNNaN
2NaNNaNNaNNaN
3NaN6.53.0NaN

1
data.dropna(axis=1, how='all')








































012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0

1
2
3
df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA; df.ix[:2, 2] = NA
df


























































012
0-0.204708NaNNaN
1-0.555730NaNNaN
20.092908NaNNaN
31.246435NaN-1.296221
40.274992NaN1.352917
50.886429-2.001637-0.371843
61.669025-0.438570-0.539741

1
df.dropna(thresh=2)








































012
31.246435NaN-1.296221
40.274992NaN1.352917
50.886429-2.001637-0.371843
61.669025-0.438570-0.539741

填充缺失数据

1
df.fillna(0)


























































012
0-0.2047080.0000000.000000
1-0.5557300.0000000.000000
20.0929080.0000000.000000
31.2464350.000000-1.296221
40.2749920.0000001.352917
50.886429-2.001637-0.371843
61.669025-0.438570-0.539741

1
df.fillna({1: 0.5, 3: -1})


























































012
0-0.2047080.500000NaN
1-0.5557300.500000NaN
20.0929080.500000NaN
31.2464350.500000-1.296221
40.2749920.5000001.352917
50.886429-2.001637-0.371843
61.669025-0.438570-0.539741

fillna默认返回新对象,但也可以对现有对象就地修改

1
2
3
# always returns a reference to the filled object
_ = df.fillna(0, inplace=True)
df


























































012
0-0.2047080.0000000.000000
1-0.5557300.0000000.000000
20.0929080.0000000.000000
31.2464350.000000-1.296221
40.2749920.0000001.352917
50.886429-2.001637-0.371843
61.669025-0.438570-0.539741

1
2
3
df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA; df.ix[4:, 2] = NA
df




















































012
00.4769853.248944-1.021228
1-0.5770870.1241210.302614
20.523772NaN1.343810
3-0.713544NaN-2.370232
4-1.860761NaNNaN
5-1.265934NaNNaN

1
df.fillna(method='ffill')




















































012
00.4769853.248944-1.021228
1-0.5770870.1241210.302614
20.5237720.1241211.343810
3-0.7135440.124121-2.370232
4-1.8607610.124121-2.370232
5-1.2659340.124121-2.370232

1
df.fillna(method='ffill', limit=2)




















































012
00.4769853.248944-1.021228
1-0.5770870.1241210.302614
20.5237720.1241211.343810
3-0.7135440.124121-2.370232
4-1.860761NaN-2.370232
5-1.265934NaN-2.370232

1
2
data = Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

层次索引

1
2
3
4
data = Series(np.random.randn(10),
index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data
a  1    0.332883
   2   -2.359419
   3   -0.199543
b  1   -1.541996
   2   -0.970736
   3   -1.307030
c  1    0.286350
   2    0.377984
d  2   -0.753887
   3    0.331286
dtype: float64
1
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
1
data['b']
1   -1.541996
2   -0.970736
3   -1.307030
dtype: float64
1
data['b':'c']
b  1   -1.541996
   2   -0.970736
   3   -1.307030
c  1    0.286350
   2    0.377984
dtype: float64
1
data.ix[['b', 'd']]
b  1   -1.541996
   2   -0.970736
   3   -1.307030
d  2   -0.753887
   3    0.331286
dtype: float64
1
data[:, 2]
a   -2.359419
b   -0.970736
c    0.377984
d   -0.753887
dtype: float64
1
data.unstack()








































123
a0.332883-2.359419-0.199543
b-1.541996-0.970736-1.307030
c0.2863500.377984NaN
dNaN-0.7538870.331286

1
data.unstack().stack()
a  1    0.332883
   2   -2.359419
   3   -0.199543
b  1   -1.541996
   2   -0.970736
   3   -1.307030
c  1    0.286350
   2    0.377984
d  2   -0.753887
   3    0.331286
dtype: float64
1
2
3
4
5
frame = DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
frame

















































OhioColorado
GreenRedGreen
a1012
2345
b1678
291011

1
2
3
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
























































stateOhioColorado
colorGreenRedGreen
key1key2
a1012
2345
b1678
291011

1
frame['Ohio']












































colorGreenRed
key1key2
a101
234
b167
2910

创建MultiIndex对象复用

1
2
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
names=['state', 'color'])

重排分级顺序

1
frame.swaplevel('key1', 'key2')


























































stateOhioColorado
colorGreenRedGreen
key2key1
1a012
2a345
1b678
2b91011

1
frame.sortlevel(1)


























































stateOhioColorado
colorGreenRedGreen
key1key2
a1012
b1678
a2345
b291011

1
frame.swaplevel(0, 1).sortlevel(0)
























































stateOhioColorado
colorGreenRedGreen
key2key1
1a012
b678
2a345
b91011

根据级别汇总统计

1
frame.sum(level='key2')







































stateOhioColorado
colorGreenRedGreen
key2
16810
2121416

1
frame.sum(level='color', axis=1)












































colorGreenRed
key1key2
a121
284
b1147
22010

使用DataFrame的列

1
2
3
4
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
frame


































































abcd
007one0
116one1
225one2
334two0
443two1
552two2
661two3

1
2
frame2 = frame.set_index(['c', 'd'])
frame2



























































ab
cd
one007
116
225
two034
143
252
361

1
frame.set_index(['c', 'd'], drop=False)













































































abcd
cd
one007one0
116one1
225one2
two034two0
143two1
252two2
361two3

1
frame2.reset_index()


































































cdab
0one007
1one116
2one225
3two034
4two143
5two252
6two361

拓展话题

整数索引

1
2
ser = Series(np.arange(3.))
ser.iloc[-1]
2.0
1
ser
0    0.0
1    1.0
2    2.0
dtype: float64
1
2
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]
2.0
1
ser.ix[:1]
0    0.0
1    1.0
dtype: float64
1
2
ser3 = Series(range(3), index=[-5, 1, 3])
ser3.iloc[2]
2
1
2
3
frame = DataFrame(np.arange(6).reshape((3, 2)), index=[2, 0, 1])
frame
frame.iloc[0]






























01
201
023
145

0    0
1    1
Name: 2, dtype: int32

面板数据

1
2
3
4
import pandas_datareader.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
1
pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 1820 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2010-01-04 00:00:00 to 2017-02-27 00:00:00
Minor_axis axis: Open to Adj Close
1
2
pdata = pdata.swapaxes('items', 'minor')
pdata['Adj Close'].iloc[:10]






























































































AAPLDELLGOOGMSFT
Date
2010-01-0427.72703914.06528313.06246825.555485
2010-01-0527.77497614.38450311.68384425.563741
2010-01-0627.33317814.10397303.82668525.406859
2010-01-0727.28265014.23940296.75374925.142634
2010-01-0827.46403414.36516300.70980825.316031
2010-01-1127.22175814.37483300.25525524.994007
2010-01-1226.91211014.56830294.94557224.828866
2010-01-1327.29172014.57797293.25224325.060064
2010-01-1427.13365714.22005294.63086825.563741
2010-01-1526.68019813.92985289.71077225.481172

1
pdata.ix[:, '6/1/2012', :]























































OpenHighLowCloseVolumeAdj Close
AAPL569.159996572.650009560.520012560.989983130246900.072.681610
DELL12.15000012.30000012.04500012.07000019397600.011.675920
GOOG571.790972572.650996568.350996570.9810006138700.0285.205295
MSFT28.76000028.95999928.44000128.45000156634300.024.942239

1
pdata.ix['Adj Close', '5/22/2012':, :].iloc[:10]






























































































AAPLDELLGOOGMSFT
Date
2012-05-2272.16078614.58765300.10041226.090721
2012-05-2373.92149412.08221304.42610625.520864
2012-05-2473.24260712.04351301.52897825.485795
2012-05-2572.85003812.05319295.47005025.477028
2012-05-28NaN12.05319NaNNaN
2012-05-2974.14304112.24666296.87364525.915380
2012-05-3075.03700512.14992293.82167425.722505
2012-05-3174.85044211.92743290.14035425.591000
2012-06-0172.68161011.67592285.20529524.942239
2012-06-0473.10915611.60821289.00648025.029908

1
2
stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
stacked






















































































































































































































































































































































































































































































































































































































OpenHighLowCloseVolumeAdj Close
Dateminor
2012-05-30AAPL569.199997579.989990566.559990579.169998132357400.075.037005
DELL12.59000012.70000012.46000012.56000019787800.012.149920
GOOG588.161028591.901014583.530999588.2309923827600.0293.821674
MSFT29.35000029.48000029.12000129.34000041585500.025.722505
2012-05-31AAPL580.740021581.499985571.460022577.730019122918600.074.850442
DELL12.53000012.54000012.33000012.33000019955600.011.927430
GOOG588.720982590.001032579.001013580.8609905958800.0290.140354
MSFT29.29999929.42000028.94000129.19000139134000.025.591000
2012-06-01AAPL569.159996572.650009560.520012560.989983130246900.072.681610
DELL12.15000012.30000012.04500012.07000019397600.011.675920
GOOG571.790972572.650996568.350996570.9810006138700.0285.205295
MSFT28.76000028.95999928.44000128.45000156634300.024.942239
2012-06-04AAPL561.500008567.499985548.499977564.289978139248900.073.109156
DELL12.11000012.11250011.80000012.00000017015700.011.608210
GOOG570.220958580.491016570.011006578.5909734883500.0289.006480
MSFT28.62000128.78000128.32000028.54999947926300.025.029908
2012-06-05AAPL561.269989566.470001558.330002562.83002597053600.072.920005
DELL11.95000012.24000011.95000012.16000015620900.011.762980
GOOG575.451008578.131003566.470986570.4109994697200.0284.920579
MSFT28.51000028.75000028.38999928.51000045715400.024.994841
2012-06-06AAPL567.770004573.849983565.499992571.460022100363900.074.038104
DELL12.21000012.28000012.09000012.21500020779900.011.816190
GOOG576.480979581.970971573.611004580.5709664207200.0289.995487
MSFT28.87999929.37000128.80999929.35000046860500.025.731273
2012-06-07AAPL577.290009577.320023570.500000571.72000194941700.074.071787
DELL12.32000012.41000012.12000012.13000020074000.011.733960
GOOG587.601014587.891038577.251006578.2309863530100.0288.826666
MSFT29.63999929.70000129.17000029.23000037792800.025.626067
2012-06-08AAPL571.599998580.580017568.999992580.31998486879100.075.185997
DELL12.13000012.22500012.02000012.12000018155600.011.724290
2017-02-13AAPL133.080002133.820007132.750000133.28999323035400.0133.289993
GOOG816.000000820.958984815.489990819.2399901198100.0819.239990
MSFT64.23999864.86000164.12999764.72000122920100.064.330000
2017-02-14AAPL133.470001135.089996133.250000135.02000432815500.0135.020004
GOOG819.000000823.000000816.000000820.4500121053600.0820.450012
MSFT64.41000464.72000164.01999764.57000023065900.064.570000
2017-02-15AAPL135.520004136.270004134.619995135.50999535501600.0135.509995
GOOG819.359985823.000000818.469971818.9799801304000.0818.979980
MSFT64.50000064.57000064.16000464.52999916917000.064.529999
2017-02-16AAPL135.669998135.899994134.839996135.35000622118000.0135.350006
GOOG819.929993824.400024818.979980824.1599731281700.0824.159973
MSFT64.73999865.23999864.44000264.51999720524700.064.519997
2017-02-17AAPL135.100006135.830002135.100006135.72000122084500.0135.720001
GOOG823.020020828.070007821.655029828.0700071597800.0828.070007
MSFT64.47000164.69000264.30000364.62000321234600.064.620003
2017-02-21AAPL136.229996136.750000135.979996136.69999724265100.0136.699997
GOOG828.659973833.450012828.349976831.6599731247700.0831.659973
MSFT64.61000164.94999764.44999764.48999819384900.064.489998
2017-02-22AAPL136.429993137.119995136.110001137.11000120745300.0137.110001
GOOG828.659973833.250000828.640015830.760010982900.0830.760010
MSFT64.33000264.38999964.05000364.36000119259700.064.360001
2017-02-23AAPL137.380005137.479996136.300003136.52999920704100.0136.529999
GOOG830.119995832.460022822.880005831.3300171470100.0831.330017
MSFT64.41999864.73000364.19000264.62000320235200.064.620003
2017-02-24AAPL135.910004136.660004135.279999136.66000421690900.0136.660004
GOOG827.729980829.000000824.200012828.6400151386600.0828.640015
MSFT64.52999964.80000364.13999964.62000321705200.064.620003
2017-02-27AAPL137.139999137.440002136.279999136.92999320196400.0136.929993
GOOG824.549988830.500000824.000000829.2800291099500.0829.280029
MSFT64.54000164.54000164.05000364.23000315850400.064.230003

3952 rows × 6 columns


1
stacked.to_panel()
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1207 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2017-02-27 00:00:00
Minor_axis axis: AAPL to MSFT