I tried every algorithm I could get my hands on, and in the end it seems the market may simply be unpredictable.
After thinking it over, there is a more precise way to put it:
in the short term, the market really is unpredictable.
Preface
I started looking at momentum because an HMM study I read got surprisingly good results using 1-day and 5-day returns as inputs. When I swapped in the features we usually care more about, such as closing price and volume, the predictions turned out to be remarkably bad. That made me realize that the 1-day and 5-day returns are, in essence, momentum.
So the plan is to start simple: use momentum features as inputs and see how well various algorithms can predict.
from sklearn.preprocessing import MinMaxScaler
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import cm
from matplotlib import pyplot as plt  # used as plt below
# Fetch raw CSI 300 (000300.XSHG) daily closes
startdate = '2005-06-01'
enddate = '2019-03-07'
df = get_price(['000300.XSHG'], start_date=startdate, end_date=enddate, frequency='daily',
               fields=['close'])
close = df['close']['000300.XSHG']
close.shape
(3350,)
# Input features: 1- to 250-day log returns, aligned so every sample has a full 250-day history and a 5-day-ahead target
r1 = (np.log(np.array(close[1:]))-np.log(np.array(close[:-1])))[(250-1):-5]
r5 = (np.log(np.array(close[5:]))-np.log(np.array(close[:-5])))[(250-5):-5]
r10 = (np.log(np.array(close[10:]))-np.log(np.array(close[:-10])))[(250-10):-5]
r20 = (np.log(np.array(close[20:]))-np.log(np.array(close[:-20])))[(250-20):-5]
r30 = (np.log(np.array(close[30:]))-np.log(np.array(close[:-30])))[(250-30):-5]
r50 = (np.log(np.array(close[50:]))-np.log(np.array(close[:-50])))[(250-50):-5]
r120 = (np.log(np.array(close[120:]))-np.log(np.array(close[:-120])))[(250-120):-5]
r250 = (np.log(np.array(close[250:]))-np.log(np.array(close[:-250])))[:-5]
# Targets: next-day (y1) and next-5-day (y5) log returns
y1 = (np.log(np.array(close[1:]))-np.log(np.array(close[:-1])))[(250-1+1):(-5+1)]
y5 = (np.log(np.array(close[5:]))-np.log(np.array(close[:-5])))[(250-5+5):]
X = np.column_stack([r1,r5,r10,r20,r30,r50,r120,r250])
#X = np.column_stack([r5,r20,r50,r120,r250])
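The eight slice expressions above all follow the same pattern. A small helper (a hypothetical momentum_features function, assuming the close series and the window lengths above) would build the same aligned matrix in one loop:

# Hypothetical helper: aligned momentum (log-return) features for several lookback windows
def momentum_features(close, windows, lookback=250, horizon=5):
    logp = np.log(np.array(close))
    cols = []
    for w in windows:
        r = logp[w:] - logp[:-w]                 # w-day log return
        cols.append(r[(lookback - w):-horizon])  # align: drop the warm-up days and the last `horizon` days
    return np.column_stack(cols)

X_alt = momentum_features(close, [1, 5, 10, 20, 30, 50, 120, 250])
# X_alt should reproduce the X assembled manually above, i.e. np.allclose(X_alt, X)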
# Scale the features to [0, 1]
mms = MinMaxScaler()
X = mms.fit_transform(X)
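One caveat: fit_transform above is fitted on the full history, so the scaler's min/max already reflect the test period. A leak-free variant (a sketch, reusing the 90/10 split that follows) fits the scaler on the training slice only:

# Sketch (alternative to the fit_transform above): fit the scaler on the first 90% of samples only,
# then apply it to the whole series, so no test-period information leaks into the scaling.
k = int(X.shape[0] * 0.9)
mms = MinMaxScaler().fit(X[:k])
X = mms.transform(X)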
# Split into training and test sets (first 90% / last 10%); targets are scaled by 10
k = int(X.shape[0] * 0.9)
X1 = X[:k]
X2 = X[k:]
y11 = 10*y1[:k]
y12 = 10*y1[k:]
y51 = 10*y5[:k]
y52 = 10*y5[k:]
record = pd.DataFrame(columns=['Algorithm', 'Train RMSE', 'Test RMSE'])
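The blocks below are kept as they were run, but they all repeat the same fit / predict / RMSE / record / plot steps. A small helper (a sketch, assuming the X1, X2, y51, y52 and record objects defined above, plus the plt alias from the imports) would condense the pattern:

# Hypothetical helper: fit a model, record its train/test RMSE, and plot the test period
def evaluate(name, model):
    model.fit(X1, y51)
    p1, p2 = model.predict(X1), model.predict(X2)
    train_e = np.sqrt(np.square(p1 - y51).mean())
    test_e = np.sqrt(np.square(p2 - y52).mean())
    record.loc[len(record), :] = (name, train_e, test_e)
    plt.plot(y52.cumsum(), 'r-')   # realized cumulative 5-day returns (test)
    plt.plot(p2.cumsum(), 'b-')    # predicted cumulative 5-day returns (test)
    return train_e, test_e

Each block below could then shrink to a single call such as evaluate('SVR', svm.SVR(kernel='rbf', C=1e1)).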
The input is 8-dimensional and presumably related to the target nonlinearly, so ordinary linear regression is not an obvious fit; it is still included below as a baseline.
from sklearn import svm
svrr = svm.SVR(kernel='rbf', C=1e1)
svrr.fit(X1,y51)
r_y52 = svrr.predict(X2)
r_y51 = svrr.predict(X1)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('SVR', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.40973969978302655  Test RMSE 0.3046188105020312
from sklearn.neural_network import MLPRegressor  # multi-layer perceptron regressor
# solver: the optimizer. L-BFGS tends to do well on small datasets, Adam is fairly robust,
# and SGD can perform best when its learning rate / iteration schedule is carefully tuned.
# alpha: L2 regularization strength (MLPRegressor regularizes with L2 by default); needs tuning.
# hidden_layer_sizes=(x, x): two hidden layers of x units each.
clf = MLPRegressor(solver='adam', alpha=1e-5, hidden_layer_sizes=(124, 124), random_state=1)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('Neural network (MLP)', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.40413568589060633  Test RMSE 0.30510380584932517
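The comments above note that alpha and the architecture need tuning, but nothing here was tuned. A minimal tuning sketch (hypothetical grid values, using time-ordered folds so validation data never precedes its training data):

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Sketch: small hypothetical grid, scored by MSE on time-ordered folds
param_grid = {'alpha': [1e-5, 1e-3, 1e-1],
              'hidden_layer_sizes': [(32,), (64, 64), (124, 124)]}
search = GridSearchCV(MLPRegressor(solver='adam', random_state=1),
                      param_grid,
                      scoring='neg_mean_squared_error',
                      cv=TimeSeriesSplit(n_splits=5))
search.fit(X1, y51)
print(search.best_params_, np.sqrt(-search.best_score_))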
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('Linear regression', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.41411735301396474  Test RMSE 0.30582235268289204
from sklearn import tree
clf = tree.DecisionTreeRegressor(max_depth=25)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('Decision tree', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.13403025976182179  Test RMSE 0.4323796945683956
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('Random forest', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.1364986899768852  Test RMSE 0.33043807910741707
from sklearn import neighbors
clf = neighbors.KNeighborsRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('KNN', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.27278574027555313  Test RMSE 0.34086055925114783
from sklearn import ensemble
clf = ensemble.GradientBoostingRegressor(n_estimators=300)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('GBRT', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.24569913706146893  Test RMSE 0.31092995595911593
from sklearn.ensemble import BaggingRegressor
clf = BaggingRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('Bagging', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.14380879115660758  Test RMSE 0.34497932345187565
from sklearn.tree import ExtraTreeRegressor
clf = ExtraTreeRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('Extra tree', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 1.7095137609083828e-05  Test RMSE 0.4812990001394652
from sklearn import ensemble
clf = ensemble.AdaBoostRegressor(n_estimators=50)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('AdaBoost', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.40497653122445065  Test RMSE 0.32104811458688914
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
X1.shape[1]
8
clf = Sequential()
# Reshape to (samples, timesteps, features); note the sequence length here is 1,
# so the LSTM only ever sees a single timestep per sample.
tX1 = X1.reshape(X1.shape[0], 1, X1.shape[1])
tX2 = X2.reshape(X2.shape[0], 1, X2.shape[1])
clf.add(LSTM(50, input_shape=(tX1.shape[1], tX1.shape[2])))
clf.add(Dense(1))
clf.compile(loss='mae', optimizer='adam')
clf.fit(tX1, y51, epochs=50, batch_size=250)
# Keras predict returns shape (n, 1); flatten so the RMSE below is computed element-wise
r_y51 = clf.predict(tX1).flatten()
r_y52 = clf.predict(tX2).flatten()
Epoch 1/50  2785/2785 [==============================] - loss: 0.3103
...
Epoch 50/50 2785/2785 [==============================] - loss: 0.3022
(training MAE drifts from 0.310 down to about 0.302 over 50 epochs)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training RMSE', train_e)
print('Test RMSE', test_e)
record.loc[len(record),:] = ('LSTM', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training RMSE 0.4283344503276845  Test RMSE 0.3079035993630009
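Because the inputs were reshaped to a sequence length of 1, the LSTM above never sees an actual sequence, which may partly explain why it behaves much like the linear models. A sketch (hypothetical lookback of 20 rows, reusing the scaled X and the y5 targets from above) of feeding it a real window of past feature rows instead:

# Sketch: rolling windows of the last `lookback` feature rows, target aligned with each window's last row
lookback = 20                                   # hypothetical window length
n = X.shape[0]
Xw = np.stack([X[i - lookback + 1:i + 1] for i in range(lookback - 1, n)])
yw = 10 * y5[lookback - 1:]                     # same x10 scaling as y51/y52
kw = int(Xw.shape[0] * 0.9)                     # same 90/10 time split

clf = Sequential()
clf.add(LSTM(50, input_shape=(lookback, Xw.shape[2])))
clf.add(Dense(1))
clf.compile(loss='mae', optimizer='adam')
clf.fit(Xw[:kw], yw[:kw], epochs=50, batch_size=250)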
record
| | Algorithm | Train RMSE | Test RMSE |
|---|---|---|---|
| 0 | SVR | 0.40974 | 0.304619 |
| 1 | Neural network (MLP) | 0.404136 | 0.305104 |
| 2 | Linear regression | 0.414117 | 0.305822 |
| 3 | Decision tree | 0.13403 | 0.43238 |
| 4 | Random forest | 0.136499 | 0.330438 |
| 5 | KNN | 0.272786 | 0.340861 |
| 6 | GBRT | 0.245699 | 0.31093 |
| 7 | Bagging | 0.143809 | 0.344979 |
| 8 | Extra tree | 1.70951e-05 | 0.481299 |
| 9 | AdaBoost | 0.404977 | 0.321048 |
| 10 | LSTM | 0.428334 | 0.307904 |
Observations:
1. Clearly, SVR, the neural network, linear regression and the LSTM all land in a narrow test-RMSE band of 0.30-0.31.
2. An error at that level, roughly on the same scale as the 5-day targets themselves, is practically a failed prediction (see the quick check after the conclusions below).
3. In terms of model complexity, the LSTM and linear regression are worlds apart, yet their results are almost identical.

Conclusions:
1. Momentum features have only limited predictive power.
2. Deep learning did not pull off an upset in this study...
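As a quick check on the "failed prediction" reading in observation 2, directional accuracy can be looked at instead of RMSE (a sketch, comparing the r_y52 predictions of whichever model was fitted last against the held-out y52); a hit rate near 50% means the sign of the 5-day move is predicted no better than a coin flip:

# Sketch: how often does the predicted sign of the 5-day return match the realized sign?
hit_rate = (np.sign(r_y52) == np.sign(y52)).mean()
print('Directional hit rate on the test set:', hit_rate)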