admin 管理员组

文章数量: 1087652

arima 数据预处理

版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:1120746959@qq,如有任何学术交流,可随时联系。

1 数据的预处理

时间序列数据生成 import pandas as pd

import numpy as np

date_range:

可以指定开始时间与周期

H:小时

D:天

M:月

# TIMES #2016 Jul 1 7/1/2016 1/7/2016 2016-07-01 2016/07/01

rng = pd.date_range('2016-07-01', periods = 10, freq = '3D')

rng

DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',

'2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',

'2016-07-25', '2016-07-28'],

dtype='datetime64[ns]', freq='3D')

time=pd.Series(np.random.randn(20),

index=pd.date_range(dt.datetime(2016,1,1),periods=20))

print(time)

2016-01-01 -0.129379

2016-01-02 0.164480

2016-01-03 -0.639117

2016-01-04 -0.427224

2016-01-05 2.055133

2016-01-06 1.116075

2016-01-07 0.357426

2016-01-08 0.274249

2016-01-09 0.834405

2016-01-10 -0.005444

2016-01-11 -0.134409

2016-01-12 0.249318

2016-01-13 -0.297842

2016-01-14 -0.128514

2016-01-15 0.063690

2016-01-16 -2.246031

2016-01-17 0.359552

2016-01-18 0.383030

2016-01-19 0.402717

2016-01-20 -0.694068

Freq: D, dtype: float64

复制代码

truncate过滤 time.truncate(before='2016-1-10')

2016-01-10 -0.005444

2016-01-11 -0.134409

2016-01-12 0.249318

2016-01-13 -0.297842

2016-01-14 -0.128514

2016-01-15 0.063690

2016-01-16 -2.246031

2016-01-17 0.359552

2016-01-18 0.383030

2016-01-19 0.402717

2016-01-20 -0.694068

Freq: D, dtype: float64

time.truncate(after='2016-1-10')

2016-01-01 -0.129379

2016-01-02 0.164480

2016-01-03 -0.639117

2016-01-04 -0.427224

2016-01-05 2.055133

2016-01-06 1.116075

2016-01-07 0.357426

2016-01-08 0.274249

2016-01-09 0.834405

2016-01-10 -0.005444

Freq: D, dtype: float64

print(time['2016-01-15':'2016-01-20'])

2016-01-15 0.063690

2016-01-16 -2.246031

2016-01-17 0.359552

2016-01-18 0.383030

2016-01-19 0.402717

2016-01-20 -0.694068

Freq: D, dtype: float64

data=pd.date_range('2010-01-01','2011-01-01',freq='M')

print(data)

DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',

'2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',

'2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],

dtype='datetime64[ns]', freq='M')

# 指定索引

rng = pd.date_range('2016 Jul 1', periods = 10, freq = 'D')

rng

pd.Series(range(len(rng)), index = rng)

2016-07-01 0

2016-07-02 1

2016-07-03 2

2016-07-04 3

2016-07-05 4

2016-07-06 5

2016-07-07 6

2016-07-08 7

2016-07-09 8

2016-07-10 9

Freq: D, dtype: int32

复制代码

指定索引 periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]

ts = pd.Series(np.random.randn(len(periods)), index = periods)

ts

2016-07-01 0

2016-07-02 1

2016-07-03 2

2016-07-04 3

2016-07-05 4

2016-07-06 5

2016-07-07 6

2016-07-08 7

2016-07-09 8

2016-07-10 9

Freq: D, dtype: int32

复制代码

时间戳和时间周期可以转换 ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods = 10, freq = 'H'))

ts

2016-07-10 08:00:00 0

2016-07-10 09:00:00 1

2016-07-10 10:00:00 2

2016-07-10 11:00:00 3

2016-07-10 12:00:00 4

2016-07-10 13:00:00 5

2016-07-10 14:00:00 6

2016-07-10 15:00:00 7

2016-07-10 16:00:00 8

2016-07-10 17:00:00 9

Freq: H, dtype: int32

ts_period = ts.to_period()

ts_period

2016-07-10 08:00 0

2016-07-10 09:00 1

2016-07-10 10:00 2

2016-07-10 11:00 3

2016-07-10 12:00 4

2016-07-10 13:00 5

2016-07-10 14:00 6

2016-07-10 15:00 7

2016-07-10 16:00 8

2016-07-10 17:00 9

Freq: H, dtype: int32

ts_period['2016-07-10 08:30':'2016-07-10 11:45']

2016-07-10 08:00 0

2016-07-10 09:00 1

2016-07-10 10:00 2

2016-07-10 11:00 3

Freq: H, dtype: int32

ts['2016-07-10 08:30':'2016-07-10 11:45']

2016-07-10 09:00:00 1

2016-07-10 10:00:00 2

2016-07-10 11:00:00 3

Freq: H, dtype: int32

复制代码

2 数据重采样

时间数据由一个频率转换到另一个频率

降采样

升采样 rng = pd.date_range('1/1/2011', periods=90, freq='D')

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts.head()

2011-01-01 -1.025562

2011-01-02 0.410895

2011-01-03 0.660311

2011-01-04 0.710293

2011-01-05 0.444985

Freq: D, dtype: float64

ts.resample('M').sum()

2011-01-31 2.510102

2011-02-28 0.583209

2011-03-31 2.749411

Freq: M, dtype: float64

ts.resample('3D').sum()

2011-01-01 0.045643

2011-01-04 -2.255206

2011-01-07 0.571142

2011-01-10 0.835032

2011-01-13 -0.396766

2011-01-16 -1.156253

2011-01-19 -1.286884

2011-01-22 2.883952

2011-01-25 1.566908

2011-01-28 1.435563

2011-01-31 0.311565

2011-02-03 -2.541235

2011-02-06 0.317075

2011-02-09 1.598877

2011-02-12 -1.950509

2011-02-15 2.928312

2011-02-18 -0.733715

2011-02-21 1.674817

2011-02-24 -2.078872

2011-02-27 2.172320

2011-03-02 -2.022104

2011-03-05 -0.070356

2011-03-08 1.276671

2011-03-11 -2.835132

2011-03-14 -1.384113

2011-03-17 1.517565

2011-03-20 -0.550406

2011-03-23 0.773430

2011-03-26 2.244319

2011-03-29 2.951082

Freq: 3D, dtype: float64

day3Ts = ts.resample('3D').mean()

day3Ts

2011-01-01 0.015214

2011-01-04 -0.751735

2011-01-07 0.190381

2011-01-10 0.278344

2011-01-13 -0.132255

2011-01-16 -0.385418

2011-01-19 -0.428961

2011-01-22 0.961317

2011-01-25 0.522303

2011-01-28 0.478521

2011-01-31 0.103855

2011-02-03 -0.847078

2011-02-06 0.105692

2011-02-09 0.532959

2011-02-12 -0.650170

2011-02-15 0.976104

2011-02-18 -0.244572

2011-02-21 0.558272

2011-02-24 -0.692957

2011-02-27 0.724107

2011-03-02 -0.674035

2011-03-05 -0.023452

2011-03-08 0.425557

2011-03-11 -0.945044

2011-03-14 -0.461371

2011-03-17 0.505855

2011-03-20 -0.183469

2011-03-23 0.257810

2011-03-26 0.748106

2011-03-29 0.983694

Freq: 3D, dtype: float64

## 下采样

print(day3Ts.resample('D').asfreq())

2011-01-01 0.015214

2011-01-02 NaN

2011-01-03 NaN

2011-01-04 -0.751735

2011-01-05 NaN

2011-01-06 NaN

2011-01-07 0.190381

2011-01-08 NaN

2011-01-09 NaN

2011-01-10 0.278344

2011-01-11 NaN

2011-01-12 NaN

2011-01-13 -0.132255

2011-01-14 NaN

2011-01-15 NaN

2011-01-16 -0.385418

2011-01-17 NaN

2011-01-18 NaN

2011-01-19 -0.428961

2011-01-20 NaN

2011-01-21 NaN

2011-01-22 0.961317

Freq: D, Length: 88, dtype: float64

复制代码

ffill 空值取前面的值

bfill 空值取后面的值

interpolate 线性取值 day3Ts.resample('D').ffill(1)

2011-01-01 0.015214

2011-01-02 0.015214

2011-01-03 NaN

2011-01-04 -0.751735

2011-01-05 -0.751735

2011-01-06 NaN

2011-01-07 0.190381

2011-01-08 0.190381

2011-01-09 NaN

2011-01-10 0.278344

2011-01-11 0.278344

day3Ts.resample('D').bfill(1)

2011-01-01 0.015214

2011-01-02 NaN

2011-01-03 -0.751735

2011-01-04 -0.751735

2011-01-05 NaN

2011-01-06 0.190381

2011-01-07 0.190381

2011-01-08 NaN

2011-01-09 0.278344

2011-01-10 0.278344

2011-01-11 NaN

2011-01-12 -0.132255

2011-01-13 -0.132255

day3Ts.resample('D').interpolate('linear')

2011-01-01 0.015214

2011-01-02 -0.240435

2011-01-03 -0.496085

2011-01-04 -0.751735

2011-01-05 -0.437697

2011-01-06 -0.123658

2011-01-07 0.190381

2011-01-08 0.219702

2011-01-09 0.249023

2011-01-10 0.278344

2011-01-11 0.141478

2011-01-12 0.004611

2011-01-13 -0.132255

2011-01-14 -0.216643

2011-01-15 -0.301030

复制代码

3 滑动窗

滑动窗计算 %matplotlib inline

import matplotlib.pylab

import numpy as np

import pandas as pd

df = pd.Series(np.random.randn(600), index = pd.date_range('7/1/2016', freq = 'D', periods = 600))

df.head()

2016-07-01 -0.192140

2016-07-02 0.357953

2016-07-03 -0.201847

2016-07-04 -0.372230

2016-07-05 1.414753

Freq: D, dtype: float64

r = df.rolling(window = 10)

#r.max, r.median, r.std, r.skew, r.sum, r.var

print(r.mean())

016-07-01 NaN

2016-07-02 NaN

2016-07-03 NaN

2016-07-04 NaN

2016-07-05 NaN

2016-07-06 NaN

2016-07-07 NaN

2016-07-08 NaN

2016-07-09 NaN

2016-07-10 0.300133

2016-07-11 0.284780

2016-07-12 0.252831

2016-07-13 0.220699

2016-07-14 0.167137

2016-07-15 0.018593

2016-07-16 -0.061414

2016-07-17 -0.134593

2016-07-18 -0.153333

2016-07-19 -0.218928

2016-07-20 -0.169426

2016-07-21 -0.219747

2016-07-22 -0.181266

2016-07-23 -0.173674

2016-07-24 -0.130629

2016-07-25 -0.166730

2016-07-26 -0.233044

2016-07-27 -0.256642

2016-07-28 -0.280738

2016-07-29 -0.289893

2016-07-30 -0.379625

...

2018-01-22 -0.211467

2018-01-23 0.034996

2018-01-24 -0.105910

2018-01-25 -0.145774

2018-01-26 -0.089320

2018-01-27 -0.164370

2018-01-28 -0.110892

2018-01-29 -0.205786

2018-01-30 -0.101162

2018-01-31 -0.034760

2018-02-01 0.229333

2018-02-02 0.043741

2018-02-03 0.052837

2018-02-04 0.057746

2018-02-05 -0.071401

2018-02-06 -0.011153

2018-02-07 -0.045737

2018-02-08 -0.021983

2018-02-09 -0.196715

2018-02-10 -0.063721

2018-02-11 -0.289452

2018-02-12 -0.050946

2018-02-13 -0.047014

2018-02-14 0.048754

2018-02-15 0.143949

2018-02-16 0.424823

2018-02-17 0.361878

2018-02-18 0.363235

2018-02-19 0.517436

2018-02-20 0.368020

Freq: D, Length: 600, dtype: float64

复制代码

可视化 import matplotlib.pyplot as plt

%matplotlib inline

plt.figure(figsize=(15, 5))

df.plot(style='r--')

df.rolling(window=10).mean().plot(style='b')

复制代码

4 ARIMA预测

数据的预处理 import pandas_datareader

import datetime

import matplotlib.pylab as plt

import seaborn as sns

from matplotlib.pylab import style

from statsmodels.tsa.arima_model import ARIMA

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

style.use('ggplot')

plt.rcParams['font.sans-serif'] = ['SimHei']

plt.rcParams['axes.unicode_minus'] = False

stockFile = 'data/T10yr.csv'

stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])

stock.head(10)

复制代码

stock_week = stock['Close'].resample('W-MON').mean()

stock_train = stock_week['2000':'2015']

stock_train.plot(figsize=(12,8))

plt.legend(bbox_to_anchor=(1.25, 0.5))

plt.title("Stock Close")

sns.despine()

复制代码

stock_diff = stock_train.diff()

stock_diff = stock_diff.dropna()

plt.figure()

plt.plot(stock_diff)

plt.title('一阶差分')

plt.show()

复制代码

acf = plot_acf(stock_diff, lags=20)

plt.title("ACF")

acf.show()

复制代码

pacf = plot_pacf(stock_diff, lags=20)

plt.title("PACF")

pacf.show()

复制代码

model = ARIMA(stock_train, order=(1, 1, 1),freq='W-MON')

result = model.fit()

#print(result.summary())

pred = result.predict('20140609', '20160701',dynamic=True, typ='levels')

print (pred)

2014-06-09 2.463559

2014-06-16 2.455539

2014-06-23 2.449569

2014-06-30 2.444183

2014-07-07 2.438962

2014-07-14 2.433788

2014-07-21 2.428627

2014-07-28 2.423470

2014-08-04 2.418315

2014-08-11 2.413159

2014-08-18 2.408004

2014-08-25 2.402849

2014-09-01 2.397693

2014-09-08 2.392538

2014-09-15 2.387383

plt.figure(figsize=(6, 6))

plt.xticks(rotation=45)

plt.plot(pred)

plt.plot(stock_train)

复制代码

本文标签: arima 数据预处理