pandas-datareader usage notes

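Introduction to pandas-datareader

The pandas-datareader package, a companion to pandas, provides an API dedicated to fetching financial data from finance websites. It can serve as another way to obtain stock data for quantitative trading.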

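The DataReader function

According to the pandas-datareader documentation, the first argument of DataReader is the ticker symbol; Apple's, for example, is "AAPL". Chinese stocks are written as the stock code followed by a market suffix: ".SS" for the Shanghai exchange and ".SZ" for the Shenzhen exchange. DataReader can pull stock data from several financial sites, such as Yahoo! Finance and Google Finance (which sources are available depends on the installed pandas-datareader version). Yahoo is used here and is passed as the second argument. The third and fourth arguments are the start and end dates of the requested period. The data is returned as a DataFrame.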
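The suffix rule described above can be sketched as a small helper. Note that `to_yahoo_symbol` and `SUFFIX_BY_EXCHANGE` are hypothetical names introduced here for illustration; they are not part of pandas-datareader.

```python
# Hypothetical helper, not part of pandas-datareader.
SUFFIX_BY_EXCHANGE = {"SH": ".SS", "SZ": ".SZ"}  # Shanghai, Shenzhen


def to_yahoo_symbol(code: str, exchange: str) -> str:
    """Build a Yahoo-style ticker such as '000001.SS' from a code and an exchange key."""
    try:
        suffix = SUFFIX_BY_EXCHANGE[exchange.upper()]
    except KeyError:
        raise ValueError(f"unknown exchange: {exchange!r}") from None
    return code + suffix


print(to_yahoo_symbol("000001", "SH"))  # 000001.SS
print(to_yahoo_symbol("000002", "sz"))  # 000002.SZ
```

The resulting string is what would be passed as the first argument to DataReader.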

import pandas as pd
from pandas_datareader import data

start_date = "2018-04-01"  # start of the date range
end_date = "2021-04-01"  # end of the date range

stock = data.DataReader(
"000001.SS", "yahoo", start_date, end_date
)
print(stock.head(5))
print(stock.tail(5), "\n")
print(stock.index)
print(stock.columns)
print(stock.shape)

Output:

                   High          Low         Open        Close  Volume    Adj Close
Date
2018-04-02 3192.340088 3159.986084 3169.779053 3163.178955 177700 3163.178955
2018-04-03 3144.332031 3119.132080 3130.012939 3136.633057 152200 3136.633057
2018-04-04 3163.340088 3128.866943 3147.049072 3131.111084 147000 3131.111084
2018-04-09 3146.093018 3110.302979 3125.441895 3138.293945 139600 3138.293945
2018-04-10 3190.648926 3139.081055 3144.257080 3190.322021 168200 3190.322021
High Low Open Close Volume Adj Close
Date
2021-03-26 3423.222900 3373.316895 3373.316895 3418.326904 274600 3418.326904
2021-03-29 3449.833984 3409.886963 3429.632080 3435.300049 284800 3435.300049
2021-03-30 3457.629883 3423.320068 3432.530029 3456.679932 285400 3456.679932
2021-03-31 3452.209961 3420.830078 3452.209961 3441.909912 283000 3441.909912
2021-04-01 3470.030029 3438.830078 3444.810059 3466.330078 275200 3466.330078

DatetimeIndex(['2018-04-02', '2018-04-03', '2018-04-04', '2018-04-09',
'2018-04-10', '2018-04-11', '2018-04-12', '2018-04-13',
'2018-04-16', '2018-04-17',
...
'2021-03-19', '2021-03-22', '2021-03-23', '2021-03-24',
'2021-03-25', '2021-03-26', '2021-03-29', '2021-03-30',
'2021-03-31', '2021-04-01'],
dtype='datetime64[ns]', name='Date', length=728, freq=None)
Index(['High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close'], dtype='object')
(728, 6)

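Data analysis

1. Print the first 5 and last 5 rows of the DataFrame (head(5) and tail(5) in the script above).
2. Print the index and column names of the DataFrame. The index is a time series; the columns hold the open, high, low, close, adjusted close, and volume.

print(stock.index)
print(stock.columns)

3. Print the shape of the DataFrame.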

print(stock.shape)

4. Print summary statistics for each column of the DataFrame, such as the minimum, maximum, mean, and standard deviation.

print(stock.describe())
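
To see what describe() returns without downloading anything, here is a minimal sketch on a synthetic DataFrame. The values are made up for illustration and are not market data.

```python
import pandas as pd

# Synthetic closing prices and volumes, for illustration only.
sample = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.0, 13.0], "Volume": [100, 200, 300, 400]},
    index=pd.date_range("2021-01-04", periods=4, freq="D"),
)

summary = sample.describe()  # one column of statistics per numeric column
print(summary)
print(summary.loc["mean", "Close"])  # individual statistics can be looked up by row label
```

The result is itself a DataFrame, so single statistics can be read out with .loc instead of being parsed from the printed text.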
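
5. Add a change column to the DataFrame. Percentage change = (today's Close - previous day's Close) / previous day's Close * 100%.

(1) Add a Change column holding the difference between the day's Close and the previous day's Close. April 1 is the first row and has no previous day, so its value is missing (NaN).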

change = stock.Close.diff()
stock['Change'] = change
print(stock.head(5))

'''
High Low Open Close Volume Adj Close Change
Date
2020-04-01 2773.364014 2731.079102 2743.541016 2734.521973 217300 2734.521973 NaN
2020-04-02 2780.637939 2719.904053 2720.228027 2780.637939 217900 2780.637939 46.115967
2020-04-03 2780.586914 2754.072998 2773.575928 2763.987061 200800 2763.987061 -16.650879
2020-04-07 2823.277100 2801.839111 2806.968018 2820.762939 270200 2820.762939 56.775879
2020-04-08 2823.214111 2800.295898 2805.916992 2815.368896 243500 2815.368896 -5.394043

'''

(2) Replace the missing value (NaN) in place with the mean of the changes.

change.fillna(change.mean(), inplace=True)
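
The fill step can be sketched on synthetic prices. This variant assigns the filled result back to the column instead of using inplace=True on a separate Series, which is one way to make sure the DataFrame column itself is updated; the numbers are made up for illustration.

```python
import pandas as pd

prices = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 105.0]})  # synthetic data
prices["Change"] = prices["Close"].diff()  # first row is NaN: no previous day

mean_change = prices["Change"].mean()  # NaN is skipped when averaging
prices["Change"] = prices["Change"].fillna(mean_change)
print(prices)
```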
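
(3) There are two ways to compute the percentage change. pct_change() works by taking each element from the second one onward, subtracting the previous element, and dividing by that previous element, which yields the series of percentage changes.

stock['pct_change'] = stock['Change'] / stock['Close'].shift(1)
stock['pct_change1'] = stock.Close.pct_change()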

                   High          Low         Open        Close  Volume    Adj Close     Change  pct_change  pct_change1
Date
2020-04-01 2773.364014 2731.079102 2743.541016 2734.521973 217300 2734.521973 NaN NaN NaN
2020-04-02 2780.637939 2719.904053 2720.228027 2780.637939 217900 2780.637939 46.115967 0.016864 0.016864
2020-04-03 2780.586914 2754.072998 2773.575928 2763.987061 200800 2763.987061 -16.650879 -0.005988 -0.005988
2020-04-07 2823.277100 2801.839111 2806.968018 2820.762939 270200 2820.762939 56.775879 0.020541 0.020541
2020-04-08 2823.214111 2800.295898 2805.916992 2815.368896 243500 2815.368896 -5.394043 -0.001912 -0.001912
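
As a quick check that the two methods agree, here is a minimal sketch on a synthetic price series (the values are made up for illustration).

```python
import pandas as pd

close = pd.Series([100.0, 110.0, 99.0])  # synthetic closing prices

manual = close.diff() / close.shift(1)  # (today - yesterday) / yesterday
builtin = close.pct_change()  # same quantity computed by pandas

print(manual.tolist())
print(builtin.tolist())
print((manual * 100).round(2).tolist())  # expressed in percent
```

Both series start with NaN, since the first row has no previous value, and match on every later row up to floating-point rounding.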

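7. Add a gap-size series to the DataFrame. A gap here means a breakaway gap within an up or down move: on a rising day, today's low being above yesterday's close is an up gap; on a falling day, yesterday's close being above today's high is a down gap. Every trading day is visited in turn, and the gap size is recorded for the days that meet either condition.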

import pandas as pd
from pandas_datareader import data
import numpy as np

start_date = "2020-04-01"  # start of the date range
end_date = "2021-04-01"  # end of the date range

stock = data.DataReader("000001.SS", "yahoo", start_date, end_date)
change = stock.Close.diff()
change.fillna(change.mean(), inplace=True)
stock["Change"] = change
stock["pct_change"] = stock["Change"] / stock["Close"].shift(1)
stock["pct_change1"] = stock.Close.pct_change()
# print(stock.head(5))

# Gap size per trading day; NaN where there is no gap.
jump_power = pd.Series(np.nan, index=stock.index)
for kl_index in np.arange(1, stock.shape[0]):
    today = stock.iloc[kl_index]
    pre_close = stock.iloc[kl_index - 1].Close
    if today["pct_change"] > 0 and (today.Low - pre_close) > 0:
        # Up gap: today's low is above yesterday's close.
        jump_power.iloc[kl_index] = today.Low - pre_close
    elif today["pct_change"] < 0 and (today.High - pre_close) < 0:
        # Down gap: today's high is below yesterday's close.
        jump_power.iloc[kl_index] = today.High - pre_close
stock["jump_power"] = jump_power
print(stock.loc["2020-04-01":"2021-04-01"])  # prints all columns by default
# Writing into a separate Series avoids the "A value is trying to be set on a copy
# of a slice from a DataFrame" warning that assigning into the row copy `today` triggered,
# and it does not rely on DataFrame.append, which pandas 2.0 removed.

Output:

                   High          Low         Open        Close  Volume    Adj Close     Change  pct_change  pct_change1  jump_power
Date
2020-04-01 2773.364014 2731.079102 2743.541016 2734.521973 217300 2734.521973 3.011556 NaN NaN NaN
2020-04-02 2780.637939 2719.904053 2720.228027 2780.637939 217900 2780.637939 46.115967 0.016864 0.016864 NaN
2020-04-03 2780.586914 2754.072998 2773.575928 2763.987061 200800 2763.987061 -16.650879 -0.005988 -0.005988 -0.051025
2020-04-07 2823.277100 2801.839111 2806.968018 2820.762939 270200 2820.762939 56.775879 0.020541 0.020541 37.852051
2020-04-08 2823.214111 2800.295898 2805.916992 2815.368896 243500 2815.368896 -5.394043 -0.001912 -0.001912 NaN
... ... ... ... ... ... ... ... ... ... ...
2021-03-26 3423.222900 3373.316895 3373.316895 3418.326904 274600 3418.326904 54.736816 0.016273 0.016273 9.726807
2021-03-29 3449.833984 3409.886963 3429.632080 3435.300049 284800 3435.300049 16.973145 0.004965 0.004965 NaN
2021-03-30 3457.629883 3423.320068 3432.530029 3456.679932 285400 3456.679932 21.379883 0.006224 0.006224 NaN
2021-03-31 3452.209961 3420.830078 3452.209961 3441.909912 283000 3441.909912 -14.770020 -0.004273 -0.004273 -4.469971
2021-04-01 3470.030029 3438.830078 3444.810059 3466.330078 275200 3466.330078 24.420166 0.007095 0.007095 NaN

[244 rows x 10 columns]
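
The same gap rule can also be written without a loop. The following is a sketch on synthetic prices, not the approach used above; the column names match the Yahoo data, but the numbers are made up so that the second day gaps up and the third day gaps down.

```python
import numpy as np
import pandas as pd

# Synthetic prices: day 2 gaps up, day 3 gaps down, day 4 has no gap.
df = pd.DataFrame(
    {
        "High": [10.5, 12.0, 10.8, 11.2],
        "Low": [9.5, 11.0, 10.0, 10.2],
        "Close": [10.0, 11.5, 10.5, 11.0],
    },
    index=pd.date_range("2021-01-04", periods=4, freq="D"),
)

pre_close = df["Close"].shift(1)  # yesterday's close, NaN on the first day
pct = df["Close"].pct_change()

up_gap = (pct > 0) & (df["Low"] > pre_close)  # rising day, low above yesterday's close
down_gap = (pct < 0) & (df["High"] < pre_close)  # falling day, high below yesterday's close

df["jump_power"] = np.where(
    up_gap,
    df["Low"] - pre_close,
    np.where(down_gap, df["High"] - pre_close, np.nan),
)
print(df)
```

Comparisons against NaN evaluate to False, so the first row, which has no previous close, ends up with NaN just as in the loop version.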

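8. Display the DataFrame values with two decimal places

format_2f = lambda x: '%.2f' % x  # renamed so it no longer shadows the built-in format()
stock = stock.applymap(format_2f)
# Prints all columns by default; the dates must fall inside the downloaded range.
print(stock.loc["2017-04-26":"2017-06-15"])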
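
Formatting with '%.2f' turns every value into a string, so the columns can no longer be used for arithmetic. As an alternative sketch (not the method used above), DataFrame.round keeps the values numeric. The two rows below reuse Close and Volume figures from the earlier output. Also note that recent pandas releases deprecate applymap in favor of DataFrame.map.

```python
import pandas as pd

df = pd.DataFrame(
    {"Close": [2734.521973, 2780.637939], "Volume": [217300, 217900]}
)

rounded = df.round(2)  # values stay numeric
as_text = df["Close"].map(lambda x: "%.2f" % x)  # values become strings

print(rounded)
print(as_text.tolist())
print(rounded["Close"].sum())  # arithmetic still works on the rounded frame
```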

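Visualizing the price data

Matplotlib is a very convenient library for plotting in Python. The plot here uses the Adj Close column, that is, the adjusted closing price.

As shown below, two lines of plotting code are enough to draw the price as a time series.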

import pandas as pd
from pandas_datareader import data
import numpy as np
import matplotlib.pyplot as plt

start_date = "2020-04-01"  # start of the date range
end_date = "2021-04-01"  # end of the date range

stock = data.DataReader("000001.SS", "yahoo", start_date, end_date)

stock['Adj Close'].plot(legend=True, figsize=(10,4))
plt.show()

[Figure: Adj Close plotted as a time series]
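
When no display is available, for example on a server, the same plot can be rendered to an image instead of a window. This is a minimal sketch on synthetic data; the Agg backend, the axis labels, and the in-memory buffer are choices made here for illustration, not part of the original script.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic adjusted closes, for illustration only.
adj_close = pd.Series(
    [3400.0, 3420.5, 3410.2, 3450.8, 3466.3],
    index=pd.date_range("2021-03-26", periods=5, freq="D"),
    name="Adj Close",
)

ax = adj_close.plot(legend=True, figsize=(10, 4))
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("Adj Close (synthetic data)")

buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")  # pass a file path instead to write to disk
plt.close(ax.figure)
png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```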

Worked example: pulling Yahoo Finance data with Python for analysis and visualization

Saving the data as CSV

import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime

df_csvsave = web.DataReader(
    "000001.SS", "yahoo", datetime.datetime(2019, 1, 1), datetime.date.today()
)
print(df_csvsave)
df_csvsave.to_csv(r'C:\Users\15461\Desktop\table.csv', columns=df_csvsave.columns, index=True)
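
To load the saved file back with its date index intact, read_csv can be told which column is the index and to parse it as dates. The sketch below uses a small synthetic frame and an in-memory buffer in place of the file path, so it runs without a download.

```python
import io

import pandas as pd

# Synthetic frame with a Date index, shaped like the downloaded data.
df = pd.DataFrame(
    {"Close": [3418.33, 3435.30], "Volume": [274600, 284800]},
    index=pd.DatetimeIndex(["2021-03-26", "2021-03-29"], name="Date"),
)

buffer = io.StringIO()  # stands in for a file path such as table.csv
df.to_csv(buffer, index=True)  # index=True writes the Date column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Date", parse_dates=True)
print(restored)
print(restored.index)
```

Without index_col and parse_dates, the dates would come back as an ordinary string column and date-based slicing with .loc would no longer work.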