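To minimize forecasting error, the paper treats stock price direction as a binary classification problem (up or down) and solves it with ensemble machine learning. It uses six common technical indicators, including RSI (Relative Strength Index), the stochastic oscillator (KD), and MACD, as classification features to train a random forest model. Two findings stand out: accuracy rises with the number of decision trees and tends to converge, and the longer the prediction window, the more accurate the model.
Research pipeline of the paper: data collection -> exponential smoothing -> feature extraction -> random forest ensemble learning -> stock market prediction.
Research content:
1. Data collection and preprocessing
The authors collected roughly 7,000 trading days of price data in total for Apple, Samsung, and GE, and predicted the price direction 30, 60, and 90 days ahead. To remove noise from the historical data and expose its underlying pattern, they preprocessed the prices with exponential smoothing: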
$$S_0 = Y_0$$
$$S_t = \alpha Y_t + (1 - \alpha) S_{t-1}, \quad t > 0$$
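α lies between 0 and 1 and is fairly close to 1, which gives recent data more weight. Recent movements are, to some extent, more likely to persist, so the larger weight goes to the most recent data.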
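The recursion can be sketched in a few lines of plain Python. This is only an illustration of the smoothing formula; the function name `exponential_smoothing` and the sample prices are assumptions, not taken from the paper or the post.

```python
def exponential_smoothing(values, alpha):
    """Return S where S[0] = Y[0] and S[t] = alpha * Y[t] + (1 - alpha) * S[t-1]."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not values:
        return []
    smoothed = [values[0]]  # S_0 = Y_0
    for y in values[1:]:
        smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical closing prices, chosen so the arithmetic is easy to follow.
closes = [10.0, 12.0, 11.0, 13.0]
print(exponential_smoothing(closes, alpha=0.5))  # [10.0, 11.0, 11.0, 12.0]
```

With pandas, `Series.ewm(alpha=alpha, adjust=False).mean()` should give the same recursion, which is convenient when the prices already live in a DataFrame.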
2. Feature extraction
(1) Label definition
$$\text{target}_i = \text{Sign}(\text{close}_{i+d} - \text{close}_i)$$
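d is the prediction window and Sign is the sign function. When target_i equals 1, the closing price d days after time i is higher than the closing price at time i, meaning the stock has risen after d days; when it equals -1, the stock has fallen. target is the quantity the model has to predict.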
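A small worked example may make the label definition concrete. The helper name `direction_labels` and the prices are hypothetical; note that the sign function yields 0 when the two closes are equal, a case the up/down description does not cover.

```python
def direction_labels(closes, d):
    """Label each day i with sign(closes[i + d] - closes[i]).

    The last d days have no label because their future close is unknown.
    """
    labels = []
    for i in range(len(closes) - d):
        diff = closes[i + d] - closes[i]
        labels.append((diff > 0) - (diff < 0))  # 1, 0, or -1
    return labels


# Hypothetical closing prices with a prediction window of d = 2 days.
closes = [10.0, 11.0, 9.0, 11.0, 12.0, 8.0]
print(direction_labels(closes, d=2))  # [-1, 0, 1, -1]
```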
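(2) Classification features
Technical indicators are important signals for judging bull and bear markets in stock analysis. The paper uses six of them as classification features for the random forest to learn. They are:
RSI (Relative Strength Index)
Stochastic Oscillator
Williams %R
MACD
Price Rate of Change
On Balance Volume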
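For reference, the commonly used textbook forms of these six indicators are sketched below. The symbols are assumptions introduced here: $C_t$ is the close, $V_t$ the volume, and $H_n$, $L_n$ the highest high and lowest low over the last $n$ periods. The paper's exact parameters and scaling may differ.

$$\text{RSI} = 100 - \frac{100}{1 + RS}, \quad RS = \frac{\text{average gain over } n \text{ periods}}{\text{average loss over } n \text{ periods}}$$
$$\%K = 100 \cdot \frac{C_t - L_n}{H_n - L_n}$$
$$\%R = -100 \cdot \frac{H_n - C_t}{H_n - L_n}$$
$$\text{MACD} = \text{EMA}_{12}(C) - \text{EMA}_{26}(C)$$
$$\text{ROC}_t = \frac{C_t - C_{t-n}}{C_{t-n}}$$
$$\text{OBV}_t = \text{OBV}_{t-1} + \text{Sign}(C_t - C_{t-1}) \cdot V_t$$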
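3. Modeling
The model is a random forest (the paper explains in detail why random forest was chosen).
4. Results
To demonstrate the advantage of the random forest algorithm, the authors compared it against SVM, logistic regression, Gaussian discriminant analysis, quadratic discriminant analysis, and other models.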
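For my own experiment I used a bit more than ten years of Shanghai Composite Index data and predicted whether the index would be up or down 10 days later. Accuracy only reached the 70-something percent range.
I see a few directions for improvement:
1: Add more features; the paper uses only 6.
2: The data set is small; minute-level or hourly data could be tried.
3: Tune the model parameters; they also have a large influence.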
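As an illustration of the third point, here is a minimal sketch of a parameter search with scikit-learn's `GridSearchCV` and `TimeSeriesSplit`. It is not part of the original experiment: the feature matrix is synthetic, and the candidate values in `param_grid` are arbitrary placeholders rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for six indicator features and up/down labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=300) > 0, 1, -1)

# Illustrative candidate values only.
param_grid = {"n_estimators": [20, 50], "min_samples_split": [2, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    # Each fold trains on earlier rows and validates on later ones,
    # which respects the time ordering of price data.
    cv=TimeSeriesSplit(n_splits=3),
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

A time-ordered splitter is used instead of shuffled k-fold so that validation rows never precede the rows the model was trained on.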
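# Import modules
import numpy as np
import pandas as pd

# get_price is not imported here; it is assumed to be a built-in of the
# research environment in which this notebook was run.
data = get_price("000001.XSHG", count=3000, end_date="2019-04-03", fields=["open", "close", "high", "low", "volume"])
print(data.head(5))
datas = data.reset_index(drop=True)
datas.head(5)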
               open    close     high      low        volume
2006-12-04  2103.82  2161.65  2164.59  2100.56  8.723279e+09
2006-12-05  2167.36  2173.28  2195.53  2159.32  8.815717e+09
2006-12-06  2175.38  2156.60  2192.90  2097.42  1.043478e+10
2006-12-07  2152.59  2156.75  2206.51  2145.79  9.927877e+09
2006-12-08  2133.36  2093.64  2164.02  2090.40  8.088972e+09
| | open | close | high | low | volume |
---|---|---|---|---|---|
0 | 2103.82 | 2161.65 | 2164.59 | 2100.56 | 8.723279e+09 |
1 | 2167.36 | 2173.28 | 2195.53 | 2159.32 | 8.815717e+09 |
2 | 2175.38 | 2156.60 | 2192.90 | 2097.42 | 1.043478e+10 |
3 | 2152.59 | 2156.75 | 2206.51 | 2145.79 | 9.927877e+09 |
4 | 2133.36 | 2093.64 | 2164.02 | 2090.40 | 8.088972e+09 |
Computing the features

# Compute the prediction label. With n = 10 this is the up/down label 10 days ahead.
def compute_prediction_int(df, n):
    pred = np.sign(df.shift(-n)['close'] - df['close'])
    pred = pred.iloc[:-n]
    return pred.astype(int)

## Compute the Stochastic Oscillator
def stochastic_oscillator_d(df, n):
    SOK = [0]
    for i in range(n, len(df)):
        high = df.loc[(i-n):i, 'high']
        low = df.loc[(i-n):i, 'low']
        SOK.append((df.loc[i, 'close'] - min(low)) / (max(high) - min(low)))
    # Note: this Series gets a fresh 0-based index, so the value computed
    # for row i is joined onto row i - n + 1 rather than row i.
    SOK = pd.Series(SOK, name='SOK')
    df = df.join(SOK)
    return df

## Compute Williams %R
def williams_R(df, n):
    R = [0]
    for i in range(n, len(df)):
        high = df.loc[(i-n):i, 'high']
        low = df.loc[(i-n):i, 'low']
        R.append((max(high) - df.loc[i, 'close']) / (max(high) - min(low))*(-100))
    # Note: same 0-based indexing as in stochastic_oscillator_d above.
    williams_R = pd.Series(R, name='williams_R')
    df = df.join(williams_R)
    return df
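Reading the two loop-based functions above, the result list starts with one placeholder 0 and is wrapped in a Series with a new 0-based index, so after `join` the value computed for row i appears to land on row i - n + 1. The following is a reader-side sketch, not the author's code, of an index-aligned variant built on pandas rolling windows. The function names and the demo data are hypothetical, and the numbers printed in this post come from the original functions.

```python
import pandas as pd


def stochastic_k_aligned(df, n):
    """Stochastic ratio over rows i-n..i, returned on df's own index."""
    # A window of n + 1 rows matches the inclusive slice df.loc[(i-n):i].
    highest = df["high"].rolling(n + 1).max()
    lowest = df["low"].rolling(n + 1).min()
    return ((df["close"] - lowest) / (highest - lowest)).rename("SOK")


def williams_r_aligned(df, n):
    """Williams %R over rows i-n..i, returned on df's own index."""
    highest = df["high"].rolling(n + 1).max()
    lowest = df["low"].rolling(n + 1).min()
    return ((highest - df["close"]) / (highest - lowest) * -100).rename("williams_R")


# Hypothetical prices; the first n rows are NaN because the window is incomplete.
demo = pd.DataFrame({
    "high": [10.0, 11.0, 12.0, 13.0, 14.0],
    "low": [8.0, 9.0, 10.0, 11.0, 12.0],
    "close": [9.0, 10.0, 11.0, 12.0, 13.0],
})
sok = stochastic_k_aligned(demo, n=2)
wr = williams_r_aligned(demo, n=2)
print(pd.concat([sok, wr], axis=1))
```

Because each value stays on the row it was computed for, a feature at row i only uses prices up to row i, which matters when the label looks a fixed number of days ahead.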
## Compute the rate of change
def rate_of_change(df, n):
    M = df['close'].diff(n - 1)
    N = df['close'].shift(n - 1)
    ROC = pd.Series(M / N, name='ROC_' + str(n))
    df = df.join(ROC)
    return df

## Compute RSI
# Note: this variant builds the up and down moves from highs and lows
# and returns a ratio between 0 and 1.
def relative_strength_index(df, n):
    i = 0
    UpI = [0]
    DoI = [0]
    while i + 1 <= df.index[-1]:
        UpMove = df.loc[i + 1, 'high'] - df.loc[i, 'high']
        DoMove = df.loc[i, 'low'] - df.loc[i + 1, 'low']
        if UpMove > DoMove and UpMove > 0:
            UpD = UpMove
        else:
            UpD = 0
        UpI.append(UpD)
        if DoMove > UpMove and DoMove > 0:
            DoD = DoMove
        else:
            DoD = 0
        DoI.append(DoD)
        i = i + 1
    UpI = pd.Series(UpI)
    DoI = pd.Series(DoI)
    PosDI = pd.Series(UpI.ewm(span=n, min_periods=n).mean())
    NegDI = pd.Series(DoI.ewm(span=n, min_periods=n).mean())
    RSI = pd.Series(PosDI / (PosDI + NegDI), name='RSI_' + str(n))
    df = df.join(RSI)
    return df

## Compute On Balance Volume
def on_balance_volume(df, n):
    i = 0
    OBV = [0]
    while i < df.index[-1]:
        if df.loc[i + 1, 'close'] - df.loc[i, 'close'] > 0:
            OBV.append(df.loc[i + 1, 'volume'])
        if df.loc[i + 1, 'close'] - df.loc[i, 'close'] == 0:
            OBV.append(0)
        if df.loc[i + 1, 'close'] - df.loc[i, 'close'] < 0:
            OBV.append(-df.loc[i + 1, 'volume'])
        i = i + 1
    OBV = pd.Series(OBV)
    # The feature is the n-day moving average of the signed volume.
    OBV_ma = pd.Series(OBV.rolling(n, min_periods=n).mean(), name='OBV_' + str(n))
    df = df.join(OBV_ma)
    return df

## Compute MACD
def macd(df, n_fast, n_slow):
    EMAfast = pd.Series(df['close'].ewm(span=n_fast, min_periods=n_slow).mean())
    EMAslow = pd.Series(df['close'].ewm(span=n_slow, min_periods=n_slow).mean())
    MACD = pd.Series(EMAfast - EMAslow, name='MACD_' + str(n_fast) + '_' + str(n_slow))
    df = df.join(MACD)
    return df
# Prepare the data set
def feature_extraction(data):
    data = relative_strength_index(data, n=14)
    data = stochastic_oscillator_d(data, n=14)
    data = rate_of_change(data, n=14)
    data = on_balance_volume(data, n=14)
    data = macd(data, 12, 26)
    data = williams_R(data, n = 14)
    # Drop the raw price and volume columns; 'close' is kept for the label.
    del(data['open'])
    del(data['high'])
    del(data['low'])
    del(data['volume'])
    return data

def prepare_data(df, horizon):
    data = feature_extraction(df).dropna().iloc[:-horizon]
    data['label'] = compute_prediction_int(data, n=horizon)
    del(data['close'])
    return data.dropna()

# Build the features and labels and merge them into one table
datas1 = prepare_data(datas, horizon=10)
features = [x for x in datas1.columns if x not in ['gain', 'label']]
print(datas1.head(5))
print(features)
      RSI_14       SOK    ROC_14  ...  MACD_12_26  williams_R  label
25  0.857233  0.813521  0.167247  ...  107.008562  -18.647937    1.0
26  0.640311  0.391454  0.138787  ...   99.453351  -60.854551    1.0
27  0.652436  0.388791  0.192446  ...  101.454929  -61.120866    1.0
28  0.685103  0.042725  0.158168  ...  103.734091  -95.727490    1.0
29  0.719127  0.005756  0.120646  ...  101.250194  -99.424404    1.0

[5 rows x 7 columns]
['RSI_14', 'SOK', 'ROC_14', 'OBV_14', 'MACD_12_26', 'williams_R']
# Split into training and test sets (first two thirds for training, in time order)
train_size = 2*len(datas1) // 3
train_df = datas1[:train_size]
test_df = datas1[train_size:]
print('len train', len(train_df))
print('len test', len(test_df))
print(train_df.head(5))
print(test_df.head(5))
len train 1961
len test 981
      RSI_14       SOK    ROC_14  ...  MACD_12_26  williams_R  label
25  0.857233  0.813521  0.167247  ...  107.008562  -18.647937    1.0
26  0.640311  0.391454  0.138787  ...   99.453351  -60.854551    1.0
27  0.652436  0.388791  0.192446  ...  101.454929  -61.120866    1.0
28  0.685103  0.042725  0.158168  ...  103.734091  -95.727490    1.0
29  0.719127  0.005756  0.120646  ...  101.250194  -99.424404    1.0

[5 rows x 7 columns]
        RSI_14       SOK    ROC_14  ...  MACD_12_26  williams_R  label
1986  0.448424  0.998331 -0.059932  ...   29.119562   -0.166869    1.0
1987  0.400803  0.743751  0.006476  ...   18.827430  -25.624891    1.0
1988  0.311296  0.801043 -0.030614  ...    5.713452  -19.895707    1.0
1989  0.307633  0.693099 -0.068748  ...   -3.093720  -30.690075    1.0
1990  0.359978  0.667756 -0.060344  ...   -6.251662  -33.224405    1.0

[5 rows x 7 columns]
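# Random forest training: import the package
from sklearn.ensemble import RandomForestClassifier

# Set up the model
# Note: max_features="auto" meant "sqrt" for classifiers and was removed in
# scikit-learn 1.3; use max_features="sqrt" on recent versions.
clf = RandomForestClassifier(n_estimators=65, max_features="auto", max_depth=30, min_samples_split=200)
clf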
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=30, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=200,
            min_weight_fraction_leaf=0.0, n_estimators=65, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, confusion_matrix, recall_score, accuracy_score

# Passing the label as a one-column DataFrame triggers the DataConversionWarning
# shown in the output; train_df['label'] (1-D) would avoid it.
clf.fit(train_df.iloc[:,:-1], train_df.iloc[:,-1:])
# Run the model
pre_train = clf.predict(train_df.iloc[:,:-1])
# print("Predictions on the training set:", pre_train)
print("accuracy_score on the training set:", accuracy_score(pre_train, train_df.iloc[:,-1:]))
pre_test = clf.predict(test_df.iloc[:,:-1])
# print("Predictions on the test set:", pre_test)
print("accuracy_score on the test set:", accuracy_score(pre_test, test_df.iloc[:,-1:]))

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  This is separate from the ipykernel package so we can avoid doing imports until

accuracy_score on the training set: 0.7812340642529322
accuracy_score on the test set: 0.746177370030581
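An accuracy figure is easier to interpret next to a naive baseline, such as always predicting the label that was most common in the training set. The sketch below shows one way to compute that baseline; the helper name `majority_baseline_accuracy` and the label lists are hypothetical and are not the data from this experiment.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)


# Hypothetical up (1) / down (-1) labels.
train = [1, 1, -1, 1, -1, 1]
test = [1, -1, 1, 1]
label, acc = majority_baseline_accuracy(train, test)
print(label, acc)  # 1 0.75
```

If the model's test accuracy is not clearly above this number, most of its apparent skill may come from the class imbalance rather than from the features.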