Abracadabra

Python data analysis learning notes, Ch. 10

Time series

from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 4))
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Date and time data types and tools

from datetime import datetime
now = datetime.now()
now
datetime.datetime(2017, 3, 8, 14, 47, 50, 32019)
now.year, now.month, now.day
(2017, 3, 8)

Subtracting two datetimes returns a timedelta, stored as (days, seconds)

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta
datetime.timedelta(926, 56700)
delta.days
926
delta.seconds
56700
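
As a small sketch of how the two stored fields relate to the full duration: a timedelta keeps whole days and leftover seconds separately, and total_seconds() combines them. The name total is only illustrative.

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# days and seconds are stored separately; seconds is always in [0, 86400)
total = delta.days * 86400 + delta.seconds

# total_seconds() folds both fields (and microseconds) into one float
print(total, delta.total_seconds())
```

Note that delta.seconds alone is not the length of the interval; it is only the part left over after the whole days.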

A timedelta can be added to or subtracted from a datetime; its first argument is a number of days

from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)
datetime.datetime(2011, 1, 19, 0, 0)
start - 2 * timedelta(12)
datetime.datetime(2010, 12, 14, 0, 0)

Converting between strings and datetime

stamp = datetime(2011, 1, 3)

Convert directly with str

str(stamp)
'2011-01-03 00:00:00'

Convert with a format string

stamp.strftime('%Y-%m-%d')
'2011-01-03'

The reverse conversion: parse a string into a datetime

value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')
datetime.datetime(2011, 1, 3, 0, 0)

Batch conversion

datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]

Writing a format string every time is tedious; dateutil's parser can parse the string directly

from dateutil.parser import parse
parse('2011-01-03')
datetime.datetime(2011, 1, 3, 0, 0)

It can parse most human-readable date formats

parse('Jan 31, 1997 10:45 PM')
datetime.datetime(1997, 1, 31, 22, 45)

Pass dayfirst=True when the day comes before the month

parse('6/12/2011', dayfirst=True)
datetime.datetime(2011, 12, 6, 0, 0)
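
To make the ambiguity concrete, here is a minimal sketch using only the standard library: the same string read month-first and day-first gives two different dates. The variable names are illustrative.

```python
from datetime import datetime

value = '6/12/2011'

# Month-first reading (US style): June 12
month_first = datetime.strptime(value, '%m/%d/%Y')

# Day-first reading (common outside the US): December 6
day_first = datetime.strptime(value, '%d/%m/%Y')

print(month_first.date(), day_first.date())
```

An explicit format string removes the guesswork, which is why it is the safer choice when the input format is known.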
datestrs
['7/6/2011', '8/6/2011']

The pandas API

pd.to_datetime(datestrs)
# note: output changed (no '00:00:00' anymore)
DatetimeIndex(['2011-07-06', '2011-08-06'], dtype='datetime64[ns]', freq=None)

None can be converted too; it simply becomes a missing value (NaT)

idx = pd.to_datetime(datestrs + [None])
idx
DatetimeIndex(['2011-07-06', '2011-08-06', 'NaT'], dtype='datetime64[ns]', freq=None)
idx[2]
NaT
pd.isnull(idx)
array([False, False,  True], dtype=bool)
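
The same NaT mechanism can absorb malformed strings. The sketch below assumes the input is expected in ISO format and uses errors='coerce' so that bad entries become NaT instead of raising; the list raw is made up for illustration.

```python
import pandas as pd

raw = ['2011-07-06', 'not a date', None]

# errors='coerce' turns anything unparseable into NaT instead of raising
idx = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# keep only the entries that parsed successfully
valid = idx[idx.notna()]
print(idx.isna().tolist(), len(valid))
```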

Time series basics

Use datetime objects as the row index; pandas stores them as timestamps

from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)
ts
2011-01-02   -0.296854
2011-01-05   -1.968663
2011-01-07   -0.484492
2011-01-08   -0.517927
2011-01-10   -0.348697
2011-01-12    0.102276
dtype: float64
type(ts)
# note: output changed to "pandas.core.series.Series"
pandas.core.series.Series

The index has a dedicated type, DatetimeIndex

ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

Arithmetic between series aligns automatically on matching timestamps

ts + ts[::2]
2011-01-02   -0.593708
2011-01-05         NaN
2011-01-07   -0.968984
2011-01-08         NaN
2011-01-10   -0.697394
2011-01-12         NaN
dtype: float64
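
A deterministic version of the same alignment, as a rough sketch with made-up values: timestamps present on only one side produce NaN, and add() with fill_value is one way to avoid that.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08'])
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# ts[::2] keeps every second row; timestamps missing from it yield NaN
aligned = ts + ts[::2]

# add() with fill_value treats a missing side as 0 instead of producing NaN
filled = ts.add(ts[::2], fill_value=0)
print(aligned)
print(filled)
```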

Timestamps are stored at nanosecond resolution

ts.index.dtype
# note: output changed from dtype('datetime64[ns]') to dtype('<M8[ns]')
dtype('<M8[ns]')

Each element of the index is a Timestamp object

stamp = ts.index[0]
stamp
# note: output changed from <Timestamp: 2011-01-02 00:00:00> to Timestamp('2011-01-02 00:00:00')
Timestamp('2011-01-02 00:00:00')

Indexing, selection, and subsetting

A timestamp index behaves like any other index

stamp = ts.index[2]
ts[stamp]
-0.4844920247591406

You can also index with a string that can be interpreted as a date in the index

ts['1/10/2011']
-0.34869693931763396

Other formats work too: the string is converted to a datetime automatically, so any format that resolves to the same timestamp is accepted

ts['20110110']
-0.34869693931763396

The periods argument sets how many dates to generate after the start

longer_ts = Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
longer_ts
2000-01-01    0.871808
2000-01-02   -0.025158
2000-01-03    0.132813
2000-01-04   -2.006494
2000-01-05   -0.988423
2000-01-06    0.775930
                ...   
2002-09-21   -0.186519
2002-09-22    0.881745
2002-09-23   -1.335826
2002-09-24    0.418774
2002-09-25    0.970405
2002-09-26    0.636320
Freq: D, dtype: float64

What is special about a timestamp index is that you can select by year or by month, much like a multi-level index

longer_ts['2001']
2001-01-01   -1.799866
2001-01-02    0.499890
2001-01-03   -0.409970
2001-01-04   -0.808111
2001-01-05   -1.220433
2001-01-06    0.581235
                ...   
2001-12-26   -0.312186
2001-12-27   -0.804940
2001-12-28   -0.572741
2001-12-29   -0.175605
2001-12-30    0.693675
2001-12-31   -0.196274
Freq: D, dtype: float64
longer_ts['2001-05']
2001-05-01   -2.783535
2001-05-02    1.386292
2001-05-03    0.153705
2001-05-04   -0.571590
2001-05-05   -0.933012
2001-05-06    0.579244
                ...   
2001-05-26    0.080809
2001-05-27    0.652650
2001-05-28    0.862616
2001-05-29   -0.967580
2001-05-30    0.907069
2001-05-31    0.551137
Freq: D, dtype: float64

Slicing works as well, ordered by time

ts[datetime(2011, 1, 7):]
2011-01-07   -0.484492
2011-01-08   -0.517927
2011-01-10   -0.348697
2011-01-12    0.102276
dtype: float64
ts
2011-01-02   -0.296854
2011-01-05   -1.968663
2011-01-07   -0.484492
2011-01-08   -0.517927
2011-01-10   -0.348697
2011-01-12    0.102276
dtype: float64

Slice endpoints do not have to exist in the index; specifying a time range is enough

ts['1/6/2011':'1/11/2011']
2011-01-07   -0.484492
2011-01-08   -0.517927
2011-01-10   -0.348697
dtype: float64

truncate is a built-in method that does the same thing

ts.truncate(after='1/9/2011')
2011-01-02   -0.296854
2011-01-05   -1.968663
2011-01-07   -0.484492
2011-01-08   -0.517927
dtype: float64
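
The two approaches can be compared on fixed values. This is a minimal sketch with integer data so the result is easy to check by eye; it uses .loc for the label slice, which is an equivalent spelling of the bracket slice above.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts = pd.Series(range(6), index=dates)

# Neither endpoint is in the index; both ends of a label slice are inclusive
window = ts.loc['2011-01-06':'2011-01-11']

# truncate(after=...) drops everything later than the given date
head = ts.truncate(after='2011-01-09')
print(window.tolist(), head.tolist())
```

Both rely on the index being sorted, which is the case for the series in these notes.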

The freq argument sets the frequency of the generated dates; here, every Wednesday

dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = DataFrame(np.random.randn(100, 4),
                    index=dates,
                    columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['5-2001']
            Colorado     Texas  New York      Ohio
2001-05-02  0.506207 -1.116218  0.656575  0.212606
2001-05-09 -1.306963 -0.054373 -1.165053 -1.319361
2001-05-16  0.891692 -0.463900  1.642267  0.644972
2001-05-23 -0.025283  2.363886 -0.367988  0.827882
2001-05-30 -1.501301 -2.534553  0.256369  0.268207

Time series with duplicate indices

Create a timestamp index directly

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)
dup_ts
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32
dup_ts.index.is_unique
False
dup_ts['1/3/2000'] # not duplicated
4

If a timestamp is duplicated, every matching entry is returned

dup_ts['1/2/2000'] # duplicated
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

Entries that share a timestamp can be aggregated by grouping on the index (level=0)

grouped = dup_ts.groupby(level=0)
grouped.mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
grouped.count()
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
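
If the goal is a unique index, there are at least two options, sketched here on the same data: drop the repeats with Index.duplicated, or collapse them with an aggregate. Which one is appropriate depends on what the duplicates mean in your data.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(range(5), index=dates)

# Option 1: keep only the first observation for each timestamp
first_only = dup_ts[~dup_ts.index.duplicated(keep='first')]

# Option 2: collapse duplicates with an aggregate, here the mean
collapsed = dup_ts.groupby(level=0).mean()
print(first_only.tolist(), collapsed.tolist())
```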

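Date ranges, frequencies, and shifting

Time series in pandas are generally assumed to be irregular, that is, without a fixed frequency. Sometimes, though, you need to analyze them at a relatively fixed frequency such as daily, monthly, or every 15 minutes (which naturally introduces missing values into the series). pandas has a full set of standard time series frequencies, plus tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.
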
ts
2011-01-02   -0.296854
2011-01-05   -1.968663
2011-01-07   -0.484492
2011-01-08   -0.517927
2011-01-10   -0.348697
2011-01-12    0.102276
dtype: float64

For example, the earlier time series can be converted to a fixed daily frequency simply by calling resample

ts.resample('D').mean()
2011-01-02   -0.296854
2011-01-03         NaN
2011-01-04         NaN
2011-01-05   -1.968663
2011-01-06         NaN
2011-01-07   -0.484492
2011-01-08   -0.517927
2011-01-09         NaN
2011-01-10   -0.348697
2011-01-11         NaN
2011-01-12    0.102276
Freq: D, dtype: float64

Generating date ranges

The date_range function, given a start and an end

index = pd.date_range('4/1/2012', '6/1/2012')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

Specify only the start and the number of periods

pd.date_range(start='4/1/2012', periods=20)
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

Specify only the end and the number of periods

pd.date_range(end='6/1/2012', periods=20)
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

Specify the start, the end, and a frequency; BM = business month end

pd.date_range('1/1/2000', '12/1/2000', freq='BM')
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')

By default each period is one day (freq='D'), and the time of day of the start is preserved

pd.date_range('5/2/2012 12:56:31', periods=5)
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

normalize=True drops the time of day, setting every timestamp to midnight

pd.date_range('5/2/2012 12:56:31', periods=5, normalize=True)
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

Frequencies and date offsets

An offset can be expressed as an object for a specific unit of time

from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour
<Hour>

Four hours: just pass an integer multiple

four_hours = Hour(4)
four_hours
<4 * Hours>

Sample every four hours

pd.date_range('1/1/2000', '1/3/2000 23:59', freq='4h')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

Two and a half hours

Hour(2) + Minute(30)
<150 * Minutes>

A frequency string can combine units in the same, almost natural-language way

pd.date_range('1/1/2000', periods=10, freq='1h30min')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')

Week of month (WOM) dates

The third Friday of each month

rng = pd.date_range('1/1/2012', '9/1/2012', freq='WOM-3FRI')
list(rng)
[Timestamp('2012-01-20 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', offset='WOM-3FRI')]
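
For intuition about what WOM-3FRI selects, the same dates can be computed by hand with the standard library. This is only an illustrative sketch; third_friday is a hypothetical helper, not part of pandas.

```python
from datetime import date, timedelta

def third_friday(year, month):
    """Return the third Friday of the given month."""
    first = date(year, month, 1)
    # weekday(): Monday=0 ... Friday=4; days from the 1st to the first Friday
    to_first_friday = (4 - first.weekday()) % 7
    return first + timedelta(days=to_first_friday + 14)

fridays = [third_friday(2012, m) for m in range(1, 9)]
print(fridays)
```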

Shifting (leading or lagging) data

ts = Series(np.random.randn(4),
            index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts
2000-01-31    1.294798
2000-02-29   -1.907732
2000-03-31   -1.407750
2000-04-30    0.544825
Freq: M, dtype: float64

Shift the data forward by two periods; the index stays the same

ts.shift(2)
2000-01-31         NaN
2000-02-29         NaN
2000-03-31    1.294798
2000-04-30   -1.907732
Freq: M, dtype: float64

Shift the data backward by two periods, somewhat like a bit-shift operation

ts.shift(-2)
2000-01-31   -1.407750
2000-02-29    0.544825
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

The shifted series still aligns on the index, which gives the percent change between consecutive periods

ts / ts.shift(1) - 1
2000-01-31         NaN
2000-02-29   -2.473382
2000-03-31   -0.262082
2000-04-30   -1.387018
Freq: M, dtype: float64
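
With round numbers the calculation is easier to follow. The sketch below uses made-up prices and compares the manual formula with pct_change, which computes the same quantity.

```python
import pandas as pd

idx = pd.date_range('2000-01-31', periods=3, freq=pd.offsets.MonthEnd())
prices = pd.Series([100.0, 110.0, 99.0], index=idx)

# shift(1) moves each value one row down, so row t is divided by row t-1
manual = prices / prices.shift(1) - 1

# pct_change() is the built-in spelling of the same calculation
builtin = prices.pct_change()
print(manual)
```

The first row is NaN in both cases because there is no earlier value to compare against.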

With freq, the timestamps in the index are shifted instead of the data

ts.shift(2, freq='M')
2000-03-31    1.294798
2000-04-30   -1.907732
2000-05-31   -1.407750
2000-06-30    0.544825
Freq: M, dtype: float64

Shift by a number of days

ts.shift(3, freq='D')
2000-02-03    1.294798
2000-03-03   -1.907732
2000-04-03   -1.407750
2000-05-03    0.544825
dtype: float64

Another way to do the same thing

ts.shift(1, freq='3D')
2000-02-03    1.294798
2000-03-03   -1.907732
2000-04-03   -1.407750
2000-05-03    0.544825
dtype: float64

A different frequency: 90 minutes

ts.shift(1, freq='90T')
2000-01-31 01:30:00    1.294798
2000-02-29 01:30:00   -1.907732
2000-03-31 01:30:00   -1.407750
2000-04-30 01:30:00    0.544825
Freq: M, dtype: float64

Shifting dates with offsets

from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 17)
now + 3 * Day()
Timestamp('2011-11-20 00:00:00')

Adding an anchored offset such as MonthEnd rolls the date forward to the end of the month; the shift is relative

now + MonthEnd()
Timestamp('2011-11-30 00:00:00')

The argument is the number of month ends to move forward

now + MonthEnd(2)
Timestamp('2011-12-31 00:00:00')

The same roll can be written the other way around, calling rollforward on the offset

offset = MonthEnd()
offset.rollforward(now)
Timestamp('2011-11-30 00:00:00')

Rolling back gives the end of the previous month

offset.rollback(now)
Timestamp('2011-10-31 00:00:00')
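
One difference between the two spellings shows up when the date is already a month end. The following sketch, using an illustrative date, shows that addition always moves on, while rollforward and rollback leave a date that is already on the offset untouched.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

on_boundary = pd.Timestamp('2011-11-30')
offset = MonthEnd()

# Adding the offset always advances, here to the next month end
added = on_boundary + offset

# rollforward/rollback only move dates that are not already a month end
rolled_forward = offset.rollforward(on_boundary)
rolled_back = offset.rollback(on_boundary)
print(added, rolled_forward, rolled_back)
```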

Grouping after rolling each date forward to its month end

ts = Series(np.random.randn(20),
            index=pd.date_range('1/15/2000', periods=20, freq='4d'))
ts.groupby(offset.rollforward).mean()
2000-01-31   -0.610639
2000-02-29    0.029121
2000-03-31   -0.089587
dtype: float64

resample gives the same result

ts.resample('M').mean()
2000-01-31   -0.610639
2000-02-29    0.029121
2000-03-31   -0.089587
Freq: M, dtype: float64

Time zone handling

List a few time zones

import pytz
pytz.common_timezones[-5:]
['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

Show the details of one time zone

tz = pytz.timezone('US/Eastern')
tz
<DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

Localization and conversion

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = Series(np.random.randn(len(rng)), index=rng)
ts
2012-03-09 09:30:00    0.065144
2012-03-10 09:30:00   -0.391505
2012-03-11 09:30:00    1.207495
2012-03-12 09:30:00    1.516354
2012-03-13 09:30:00   -0.253149
2012-03-14 09:30:00   -0.768138
Freq: D, dtype: float64

When no time zone is specified, tz is None

print(ts.index.tz)
None

Specify a time zone when generating the range

pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Localize a naive series to a time zone with tz_localize

ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-09 09:30:00+00:00    0.065144
2012-03-10 09:30:00+00:00   -0.391505
2012-03-11 09:30:00+00:00    1.207495
2012-03-12 09:30:00+00:00    1.516354
2012-03-13 09:30:00+00:00   -0.253149
2012-03-14 09:30:00+00:00   -0.768138
Freq: D, dtype: float64
ts_utc.index
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Once localized, convert to another time zone with tz_convert

ts_utc.tz_convert('US/Eastern')
2012-03-09 04:30:00-05:00    0.065144
2012-03-10 04:30:00-05:00   -0.391505
2012-03-11 05:30:00-04:00    1.207495
2012-03-12 05:30:00-04:00    1.516354
2012-03-13 05:30:00-04:00   -0.253149
2012-03-14 05:30:00-04:00   -0.768138
Freq: D, dtype: float64

Localize to US/Eastern instead, then convert to UTC

ts_eastern = ts.tz_localize('US/Eastern')
ts_eastern.tz_convert('UTC')
2012-03-09 14:30:00+00:00    0.065144
2012-03-10 14:30:00+00:00   -0.391505
2012-03-11 13:30:00+00:00    1.207495
2012-03-12 13:30:00+00:00    1.516354
2012-03-13 13:30:00+00:00   -0.253149
2012-03-14 13:30:00+00:00   -0.768138
Freq: D, dtype: float64

And convert once more, this time to Europe/Berlin

ts_eastern.tz_convert('Europe/Berlin')

A naive index must be localized before it can be converted

ts.index.tz_localize('Asia/Shanghai')

Working with time zone-aware Timestamp objects

Create a timestamp, localize it, and convert it to another time zone

stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('US/Eastern')
Timestamp('2011-03-11 23:00:00-0500', tz='US/Eastern')

Pass tz explicitly when creating the timestamp

stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
stamp_moscow
Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')

Nanoseconds since the UNIX epoch (January 1, 1970, UTC)

stamp_utc.value
1299902400000000000

This value is absolute: it does not change under time zone conversion

stamp_utc.tz_convert('US/Eastern').value
1299902400000000000
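
The rules above can be checked in a few lines. This sketch assumes the zone name America/New_York, which is the zone that the US/Eastern alias points to; the flag converted_naive is only there to record what happened.

```python
import pandas as pd

naive = pd.Timestamp('2011-03-12 04:00')

# tz_convert on a naive timestamp is an error: there is no zone to convert from
try:
    naive.tz_convert('America/New_York')
    converted_naive = True
except TypeError:
    converted_naive = False

# Localize first, then convert; the underlying instant (.value) is unchanged
utc = naive.tz_localize('UTC')
eastern = utc.tz_convert('America/New_York')
print(converted_naive, utc.value == eastern.value, eastern)
```

Because both objects represent the same instant, they also compare equal even though their wall-clock times differ.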
# 30 minutes before DST transition
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30', tz='US/Eastern')
stamp
Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')

Shift the time with an offset

stamp + Hour()
Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
# 90 minutes before DST transition
stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
stamp
Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp + 2 * Hour()
Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

Operations between different time zones

rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = Series(np.random.randn(len(rng)), index=rng)
ts
2012-03-07 09:30:00   -0.461750
2012-03-08 09:30:00    0.947394
2012-03-09 09:30:00    0.703239
2012-03-12 09:30:00    0.266519
2012-03-13 09:30:00    0.302334
2012-03-14 09:30:00   -0.000725
2012-03-15 09:30:00    0.305446
2012-03-16 09:30:00   -1.605358
2012-03-19 09:30:00    1.306474
2012-03-20 09:30:00    0.865511
Freq: B, dtype: float64

When two series in different time zones are combined, the result is in UTC

ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='B')

Periods and period arithmetic

p = pd.Period(2007, freq='A-DEC')
p
Period('2007', 'A-DEC')
p + 5
Period('2012', 'A-DEC')
p - 2
Period('2005', 'A-DEC')
pd.Period('2014', freq='A-DEC') - p
7
rng = pd.period_range('1/1/2000', '6/30/2000', freq='M')
rng
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')
Series(np.random.randn(6), index=rng)
2000-01    0.061389
2000-02    0.059265
2000-03    0.779627
2000-04   -0.068995
2000-05   -0.451276
2000-06   -1.531821
Freq: M, dtype: float64
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')
index
PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='int64', freq='Q-DEC')

Period frequency conversion

An annual period ending in December

p = pd.Period('2007', freq='A-DEC')
p.asfreq('M', how='start')
Period('2007-01', 'M')
p.asfreq('M', how='end')
Period('2007-12', 'M')

An annual period ending in June

p = pd.Period('2007', freq='A-JUN')
p.asfreq('M', 'start')
Period('2006-07', 'M')
p.asfreq('M', 'end')
Period('2007-06', 'M')

August 2007 belongs to the 2008 period when years end in June

p = pd.Period('Aug-2007', 'M')
p.asfreq('A-JUN')
Period('2008', 'A-JUN')
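
The rule behind this result is simple enough to write out by hand. The function below is a hypothetical helper, not a pandas API: a fiscal year is labeled by the calendar year in which it ends, so any month after the closing month already belongs to the next label.

```python
def fiscal_year(year, month, end_month=6):
    """Label of the fiscal year containing (year, month).

    The fiscal year is named after the calendar year in which it ends.
    """
    return year if month <= end_month else year + 1

# August 2007 comes after the June close, so it falls in fiscal 2008
print(fiscal_year(2007, 8))
# June 2007 is the last month of fiscal 2007
print(fiscal_year(2007, 6))
```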

Calling asfreq on a whole series converts every period at once

rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = Series(np.random.randn(len(rng)), index=rng)
ts
2006    0.634252
2007   -0.738716
2008    0.398145
2009   -1.226529
Freq: A-DEC, dtype: float64
ts.asfreq('M', how='start')
2006-01    0.634252
2007-01   -0.738716
2008-01    0.398145
2009-01   -1.226529
Freq: M, dtype: float64
ts.asfreq('B', how='end')
2006-12-29    0.634252
2007-12-31   -0.738716
2008-12-31    0.398145
2009-12-31   -1.226529
Freq: B, dtype: float64

Quarterly period frequencies

The fourth quarter of a fiscal year ending in January

p = pd.Period('2012Q4', freq='Q-JAN')
p
Period('2012Q4', 'Q-JAN')

The first day of the fourth quarter

p.asfreq('D', 'start')
Period('2011-11-01', 'D')

The last day

p.asfreq('D', 'end')
Period('2012-01-31', 'D')

4 PM on the second-to-last business day of the quarter

p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
p4pm
Period('2012-01-30 16:00', 'T')

Convert to a Timestamp object

p4pm.to_timestamp()
Timestamp('2012-01-30 16:00:00')
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = Series(np.arange(len(rng)), index=rng)
ts
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32

Convert the whole index to timestamps at once

new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
ts.index = new_rng.to_timestamp()
ts
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

Converting timestamps to periods (and back)

rng = pd.date_range('1/1/2000', periods=3, freq='M')
ts = Series(randn(3), index=rng)
pts = ts.to_period()
ts
2000-01-31    0.239752
2000-02-29   -0.469201
2000-03-31    2.835243
Freq: M, dtype: float64

With no argument, to_period uses the frequency of the index, here monthly

pts
2000-01    0.239752
2000-02   -0.469201
2000-03    2.835243
Freq: M, dtype: float64

Convert daily timestamps to monthly periods; duplicate periods are allowed

rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = Series(randn(6), index=rng)
ts2.to_period('M')
2000-01    1.126773
2000-01   -0.979309
2000-01   -0.784376
2000-02   -1.490820
2000-02    1.125043
2000-02    0.421830
Freq: M, dtype: float64
pts = ts.to_period()
pts
2000-01    0.239752
2000-02   -0.469201
2000-03    2.835243
Freq: M, dtype: float64

The reverse conversion

pts.to_timestamp(how='end')
2000-01-31    0.239752
2000-02-29   -0.469201
2000-03-31    2.835243
Freq: M, dtype: float64

Creating a PeriodIndex from arrays

data = pd.read_csv('ch08/macrodata.csv')
data.year
0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
5      1960.0
        ...  
197    2008.0
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, dtype: float64
data.quarter
0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
5      2.0
      ... 
197    2.0
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, dtype: float64

Combine the year and quarter columns into a PeriodIndex

index = pd.PeriodIndex(year=data.year, quarter=data.quarter, freq='Q-DEC')
index
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='int64', length=203, freq='Q-DEC')
data.index = index
data.infl
1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
1960Q2    0.14
          ... 
2008Q2    8.53
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, dtype: float64

Resampling and frequency conversion

Resampling is essentially a group-by operation

rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(randn(len(rng)), index=rng)
ts.resample('M').mean()
2000-01-31   -0.055153
2000-02-29    0.189412
2000-03-31   -0.075940
2000-04-30   -0.239036
Freq: M, dtype: float64

The same result with a period index

ts.resample('M', kind='period').mean()
2000-01   -0.055153
2000-02    0.189412
2000-03   -0.075940
2000-04   -0.239036
Freq: M, dtype: float64

Downsampling

Start from data sampled every minute

rng = pd.date_range('1/1/2000', periods=12, freq='T')
ts = Series(np.arange(12), index=rng)
ts
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

Downsample into 5-minute bins

ts.resample('5min').sum()
# note: output changed (as the default changed from closed='right', label='right' to closed='left', label='left'
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
ts.resample('5min', closed='left').sum()
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
ts.resample('5min', closed='left', label='left').sum()
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
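
All three calls above give the same result because closed='left', label='left' is the default for this frequency. For contrast, here is a sketch of the right-closed, right-labeled variant on the same twelve values; it uses the 'min' alias in place of 'T', which is an assumption about what newer pandas versions prefer.

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(range(12), index=rng)

# Default: bins are [00:00, 00:05), [00:05, 00:10), ... labeled by their left edge
left = ts.resample('5min').sum()

# Right-closed bins (00:00, 00:05], ... labeled by their right edge;
# the 00:00 observation lands in a bin of its own
right = ts.resample('5min', closed='right', label='right').sum()
print(left)
print(right)
```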

Shift the result labels by an offset with loffset

ts.resample('5min', loffset='-1s').sum()
1999-12-31 23:59:59    10
2000-01-01 00:04:59    35
2000-01-01 00:09:59    21
Freq: 5T, dtype: int32

Open-High-Low-Close (OHLC) downsampling

ts
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

In 5-minute bins

ts.resample('5min').ohlc()
# note: output changed because of changed defaults
                     open  high  low  close
2000-01-01 00:00:00     0     4    0      4
2000-01-01 00:05:00     5     9    5      9
2000-01-01 00:10:00    10    11   10     11

Resampling with GroupBy

rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(np.arange(100), index=rng)
ts.groupby(lambda x: x.month).mean()
1    15
2    45
3    75
4    95
dtype: int32
ts.groupby(lambda x: x.weekday).mean()
0    47.5
1    48.5
2    49.5
3    50.5
4    51.5
5    49.0
6    50.0
dtype: float64

Upsampling and interpolation

frame = DataFrame(np.random.randn(2, 4),
                  index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame
            Colorado     Texas  New York      Ohio
2000-01-05  0.360773  0.506429  1.166424  1.402336
2000-01-12 -0.587124  0.612993 -0.796000 -0.341138

df_daily = frame.resample('D').mean()
df_daily
            Colorado     Texas  New York      Ohio
2000-01-05  0.360773  0.506429  1.166424  1.402336
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.587124  0.612993 -0.796000 -0.341138

frame.resample('D').ffill()
            Colorado     Texas  New York      Ohio
2000-01-05  0.360773  0.506429  1.166424  1.402336
2000-01-06  0.360773  0.506429  1.166424  1.402336
2000-01-07  0.360773  0.506429  1.166424  1.402336
2000-01-08  0.360773  0.506429  1.166424  1.402336
2000-01-09  0.360773  0.506429  1.166424  1.402336
2000-01-10  0.360773  0.506429  1.166424  1.402336
2000-01-11  0.360773  0.506429  1.166424  1.402336
2000-01-12 -0.587124  0.612993 -0.796000 -0.341138

frame.resample('D').ffill(limit=2)
            Colorado     Texas  New York      Ohio
2000-01-05  0.360773  0.506429  1.166424  1.402336
2000-01-06  0.360773  0.506429  1.166424  1.402336
2000-01-07  0.360773  0.506429  1.166424  1.402336
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.587124  0.612993 -0.796000 -0.341138

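The effect of limit is easier to see on a tiny series. In this sketch with made-up values, two weekly observations are upsampled to daily frequency and each is carried forward for at most two days; the rest of the gap stays NaN.

```python
import pandas as pd

idx = pd.date_range('2000-01-05', periods=2, freq='7D')
weekly = pd.Series([1.0, 2.0], index=idx)

# Upsample to daily and forward-fill, but carry each value at most 2 rows
daily = weekly.resample('D').ffill(limit=2)
print(daily)
```

Capping the fill is a reasonable guard when a stale value should not be trusted for too long.
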
frame.resample('W-THU').ffill()
            Colorado     Texas  New York      Ohio
2000-01-06  0.360773  0.506429  1.166424  1.402336
2000-01-13 -0.587124  0.612993 -0.796000 -0.341138

Resampling with periods

frame = DataFrame(np.random.randn(24, 4),
                  index=pd.period_range('1-2000', '12-2001', freq='M'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]
         Colorado     Texas  New York      Ohio
2000-01 -0.254340  0.401110 -0.931350 -0.872552
2000-02  0.390968 -0.815357 -1.656213 -2.251621
2000-03  0.206297  0.197394  0.927518 -0.657257
2000-04 -0.451709  0.908598 -0.187902 -0.498082
2000-05 -0.215150 -0.042141 -0.738733  2.499246

Resample to annual frequency

annual_frame = frame.resample('A-DEC').mean()
annual_frame
      Colorado     Texas  New York      Ohio
2000 -0.049383  0.037021 -0.272851 -0.140984
2001 -0.183766 -0.291993  0.340941  0.209276

Resample to quarterly frequency

# Q-DEC: Quarterly, year ending in December
annual_frame.resample('Q-DEC').ffill()
# note: output changed, default value changed from convention='end' to convention='start' + 'start' changed to span-like
# also the following cells
        Colorado     Texas  New York      Ohio
2000Q1 -0.049383  0.037021 -0.272851 -0.140984
2000Q2 -0.049383  0.037021 -0.272851 -0.140984
2000Q3 -0.049383  0.037021 -0.272851 -0.140984
2000Q4 -0.049383  0.037021 -0.272851 -0.140984
2001Q1 -0.183766 -0.291993  0.340941  0.209276
2001Q2 -0.183766 -0.291993  0.340941  0.209276
2001Q3 -0.183766 -0.291993  0.340941  0.209276
2001Q4 -0.183766 -0.291993  0.340941  0.209276

annual_frame.resample('Q-DEC', convention='start').ffill()
        Colorado     Texas  New York      Ohio
2000Q1 -0.049383  0.037021 -0.272851 -0.140984
2000Q2 -0.049383  0.037021 -0.272851 -0.140984
2000Q3 -0.049383  0.037021 -0.272851 -0.140984
2000Q4 -0.049383  0.037021 -0.272851 -0.140984
2001Q1 -0.183766 -0.291993  0.340941  0.209276
2001Q2 -0.183766 -0.291993  0.340941  0.209276
2001Q3 -0.183766 -0.291993  0.340941  0.209276
2001Q4 -0.183766 -0.291993  0.340941  0.209276

annual_frame.resample('Q-MAR').ffill()
        Colorado     Texas  New York      Ohio
2000Q4 -0.049383  0.037021 -0.272851 -0.140984
2001Q1 -0.049383  0.037021 -0.272851 -0.140984
2001Q2 -0.049383  0.037021 -0.272851 -0.140984
2001Q3 -0.049383  0.037021 -0.272851 -0.140984
2001Q4 -0.183766 -0.291993  0.340941  0.209276
2002Q1 -0.183766 -0.291993  0.340941  0.209276
2002Q2 -0.183766 -0.291993  0.340941  0.209276
2002Q3 -0.183766 -0.291993  0.340941  0.209276

Time series plotting

close_px_all = pd.read_csv('ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
close_px = close_px.resample('B').ffill()
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2292 entries, 2003-01-02 to 2011-10-14
Freq: B
Data columns (total 3 columns):
AAPL    2292 non-null float64
MSFT    2292 non-null float64
XOM     2292 non-null float64
dtypes: float64(3)
memory usage: 71.6 KB

Plot the whole series; the x-axis is labeled by year

close_px['AAPL'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb1f85a080>

[Figure: AAPL closing price over the full date range]

Plot a single year; the x-axis is labeled by month

close_px.loc['2009'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb20c4d550>

[Figure: AAPL, MSFT, and XOM closing prices during 2009]

Plot three months; the x-axis is labeled by day

close_px['AAPL'].loc['01-2011':'03-2011'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb21235668>

[Figure: AAPL closing price, January through March 2011]

Plot quarterly data

appl_q = close_px['AAPL'].resample('Q-DEC').ffill()
appl_q.loc['2009':].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb21346c50>

png

Moving Window Functions

close_px = close_px.asfreq('B').ffill()

close_px.AAPL.plot()
# 250-day moving average of the AAPL price
close_px.AAPL.rolling(250).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb213b95f8>

[Figure: AAPL price with its 250-day moving average]

plt.figure()
<matplotlib.figure.Figure at 0x1fb212fb550>

# 250-day rolling standard deviation; a value is emitted once 10 observations are available
appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
appl_std250[5:12]
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64

appl_std250.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb21466ba8>

[Figure: AAPL 250-day rolling standard deviation]
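
In the output above, the first non-NaN value appears on 2003-01-15, the tenth business day, because min_periods=10. A small sketch of the same effect on a made-up 7-point series with a window of 5 (both numbers are chosen only to keep the output readable):

```python
import numpy as np
import pandas as pd

# Made-up series: 1.0, 2.0, ..., 7.0 on consecutive business days.
s = pd.Series(
    np.arange(1.0, 8.0),
    index=pd.date_range("2003-01-02", periods=7, freq="B"),
)

# Window of 5 observations, but emit a value once at least 3 are available.
std_min3 = s.rolling(5, min_periods=3).std()

# Default: min_periods equals the window size, so the first 4 results are NaN.
std_default = s.rolling(5).std()

print(pd.DataFrame({"min_periods=3": std_min3, "default": std_default}))
```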

# Define an expanding mean in terms of a rolling mean
expanding_mean = lambda x: x.rolling(len(x), min_periods=1).mean()
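
An expanding mean is a rolling mean whose window grows to cover everything seen so far. A minimal check, on a made-up four-point series, that a full-length rolling window with min_periods=1 appears to give the same numbers as the built-in `expanding()`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up values

# Rolling window as long as the series, emitting a value from the first point on.
via_rolling = s.rolling(len(s), min_periods=1).mean()

# Built-in expanding window.
via_expanding = s.expanding().mean()

print(via_expanding.tolist())
print(np.allclose(via_rolling, via_expanding))
```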
# 60-day moving average of all three stocks, plotted on a log y-axis
close_px.rolling(60).mean().plot(logy=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1fb21571208>

[Figure: 60-day moving averages of AAPL, MSFT, and XOM on a log y-axis]

plt.close('all')

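Exponentially Weighted Functions

An exponentially weighted average puts more weight on recent observations, so it tracks changes faster than an equal-weighted moving average.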

fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True,
                         figsize=(12, 7))
aapl_px = close_px.AAPL['2005':'2009']
# 60-day simple moving average vs. exponentially weighted average with span=60
ma60 = aapl_px.rolling(60, min_periods=50).mean()
ewma60 = aapl_px.ewm(span=60).mean()
aapl_px.plot(style='k-', ax=axes[0])
ma60.plot(style='k--', ax=axes[0])
aapl_px.plot(style='k-', ax=axes[1])
ewma60.plot(style='k--', ax=axes[1])
axes[0].set_title('Simple MA')
axes[1].set_title('Exponentially-weighted MA')
<matplotlib.text.Text at 0x1fb219efcc0>

[Figure: two panels, 'Simple MA' (top) and 'Exponentially-weighted MA' (bottom), AAPL 2005 to 2009]
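
The span parameter maps to a decay factor alpha = 2 / (span + 1). A minimal sketch that checks this against a hand-written recursion; it assumes adjust=False to keep the recursion simple, whereas the call above uses the pandas default adjust=True, and the data is made up:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])  # made-up values
span = 3
alpha = 2.0 / (span + 1)  # 0.5 for span=3

# With adjust=False pandas uses the plain recursion
#   y[0] = x[0];  y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
ewma = s.ewm(span=span, adjust=False).mean()

# The same recursion written out by hand.
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

print(ewma.tolist())
print(manual)
```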

Binary Moving Window Functions

close_px
spx_px = close_px_all['SPX']

              AAPL    MSFT     XOM
2003-01-02    7.40   21.11   29.22
2003-01-03    7.45   21.14   29.24
2003-01-06    7.45   21.52   29.96
2003-01-07    7.43   21.93   28.95
2003-01-08    7.28   21.31   28.83
2003-01-09    7.34   21.93   29.44
...            ...     ...     ...
2011-10-07  369.80   26.25   73.56
2011-10-10  388.81   26.94   76.28
2011-10-11  400.29   27.00   76.27
2011-10-12  402.19   26.96   77.16
2011-10-13  408.43   27.18   76.37
2011-10-14  422.00   27.27   78.11

2292 rows × 3 columns


spx_rets = spx_px / spx_px.shift(1) - 1
returns = close_px.pct_change()
# 125-day rolling correlation of AAPL returns with S&P 500 returns
corr = returns.AAPL.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb21a93438>

[Figure: 125-day rolling correlation of AAPL returns with SPX returns]

# Same correlation for every column of returns at once
corr = returns.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb22b21438>

[Figure: 125-day rolling correlation of AAPL, MSFT, and XOM returns with SPX returns]
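
Rolling correlation takes a second series as its argument. Called on a DataFrame, it correlates each column with that series. A minimal sketch on synthetic returns (the `market` series, the column names, and the noise levels are all made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2010-01-01", periods=300, freq="B")

# Made-up "market" returns, one stock that follows them and one that does not.
market = pd.Series(rng.normal(0, 0.01, len(idx)), index=idx)
returns = pd.DataFrame(
    {
        "A": market + rng.normal(0, 0.005, len(idx)),
        "B": rng.normal(0, 0.01, len(idx)),
    },
    index=idx,
)

# One column against the market: result is a Series.
corr_one = returns["A"].rolling(125, min_periods=100).corr(market)

# Every column against the market: result is a DataFrame with the same columns.
corr_all = returns.rolling(125, min_periods=100).corr(market)

print(corr_all.dropna().tail())
```

The first 99 rows are NaN because min_periods=100 requires 100 paired observations before a value is emitted.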

User-Defined Moving Window Functions

from scipy.stats import percentileofscore
# Percentile rank of a 2% return within each 250-day window
score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = returns.AAPL.rolling(250).apply(score_at_2percent)
result.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1fb3dbdd2e8>

[Figure: percentile rank of a 2% AAPL return over a 250-day rolling window]
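
Any function that reduces a window to a single number can be passed to `apply`. A NumPy-only sketch on made-up returns: the helper `pct_at_or_below` is hypothetical and counts values less than or equal to the score, which corresponds to `percentileofscore(..., kind='weak')` rather than the default `kind='rank'` used above.

```python
import numpy as np
import pandas as pd

def pct_at_or_below(window, score=0.02):
    """Percent of values in `window` that are <= score (hypothetical helper)."""
    return 100.0 * np.mean(window <= score)

rets = pd.Series([0.01, 0.03, -0.01, 0.05, 0.00])  # made-up returns

# raw=True hands each window to the function as a plain NumPy array.
result = rets.rolling(3).apply(pct_at_or_below, raw=True)

print(result)
```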

Performance and Memory Usage Notes

rng = pd.date_range('1/1/2000', periods=10000000, freq='10ms')
ts = Series(np.random.randn(len(rng)), index=rng)
ts
2000-01-01 00:00:00.000   -0.428577
2000-01-01 00:00:00.010    1.650203
2000-01-01 00:00:00.020   -0.064777
2000-01-01 00:00:00.030   -0.219433
2000-01-01 00:00:00.040    1.907433
2000-01-01 00:00:00.050    0.103347
                             ...   
2000-01-02 03:46:39.940    0.989446
2000-01-02 03:46:39.950    2.333137
2000-01-02 03:46:39.960    0.354455
2000-01-02 03:46:39.970    0.353224
2000-01-02 03:46:39.980   -0.862868
2000-01-02 03:46:39.990    2.007468
Freq: 10L, dtype: float64

ts.resample('15min').ohlc().info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 11112 entries, 2000-01-01 00:00:00 to 2000-04-25 17:45:00
Freq: 15T
Data columns (total 4 columns):
open     11112 non-null float64
high     11112 non-null float64
low      11112 non-null float64
close    11112 non-null float64
dtypes: float64(4)
memory usage: 434.1 KB

%timeit ts.resample('15min').ohlc()
10 loops, best of 3: 123 ms per loop

rng = pd.date_range('1/1/2000', periods=10000000, freq='1s')
ts = Series(np.random.randn(len(rng)), index=rng)
%timeit ts.resample('15s').ohlc()
1 loop, best of 3: 192 ms per loop
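
To see what the timed `ohlc()` call computes, here is a minimal sketch on six made-up one-second observations grouped into 3-second bars:

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=6, freq="s")
ts_small = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0], index=idx)  # made-up values

# Each 3-second bin is reduced to its first, max, min, and last value.
bars = ts_small.resample("3s").ohlc()

print(bars)
```

The first bar covers 3.0, 1.0, 4.0 and the second covers 1.0, 5.0, 9.0, so the result has two rows with open, high, low, and close columns.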