
Momentum study with various ML algorithms: no holy grail in trading?

Posted by 美联储 on May 9 at 20:58

I tried every algorithm I could get my hands on, and in the end it seems the market may simply be unpredictable.
Thinking it over again, it can be put another way:
In the short term, the market really is unpredictable.

Momentum Study¶

Preface

I decided to look at momentum after reading an HMM study that used 1-day and 5-day returns as inputs and got surprisingly good results. When I swapped the input features for the data we usually care more about, such as closing prices and volume, the predictions turned out remarkably poor. That made me realize that 1-day and 5-day returns are, in effect, momentum.
So the plan is to start simple, feed momentum features to a range of algorithms, and see how well each one predicts:
  • Inputs: the 1-, 5-, 10-, 20-, 30-, 50-, 120-, and 250-day returns
  • Targets: the following 1-day and 5-day returns
  • Train/test split: chronological, with the first 90% of the samples used for training and the last 10% for testing (as in the code below)
  • Results are evaluated with np.sqrt(np.square(np.abs(r_y51 - y51)).mean()), i.e. the RMSE (a small helper version is sketched below)
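
That last expression is just the root-mean-square error (RMSE). A minimal helper version, written here only for readability (the name rmse is mine, not part of the original notebook):

import numpy as np

def rmse(pred, true):
    # Root-mean-square error; np.abs before squaring is redundant but mirrors the formula in the post
    return np.sqrt(np.square(np.abs(pred - true)).mean())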

1. Getting the Data¶

from sklearn.preprocessing import MinMaxScaler
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import cm
from matplotlib import pyplot as plt  # imported as plt, which is the name used below

# Fetch the raw CSI 300 (000300.XSHG) daily close prices
startdate = '2005-06-01'
enddate = '2019-03-7'
df = get_price(['000300.XSHG'], start_date=startdate, end_date=enddate, frequency='daily', 
               fields=['close'])
close = df['close']['000300.XSHG']
close.shape
(3350,)
# Input features: 1- to 250-day log returns, each sliced so that every feature ends on the same day t
# (the first 250 days are dropped as warm-up; the last 5 days are held back for the forward target)
r1 = (np.log(np.array(close[1:]))-np.log(np.array(close[:-1])))[(250-1):-5]
r5 = (np.log(np.array(close[5:]))-np.log(np.array(close[:-5])))[(250-5):-5] 
r10 = (np.log(np.array(close[10:]))-np.log(np.array(close[:-10])))[(250-10):-5]
r20 = (np.log(np.array(close[20:]))-np.log(np.array(close[:-20])))[(250-20):-5]
r30 = (np.log(np.array(close[30:]))-np.log(np.array(close[:-30])))[(250-30):-5]
r50 = (np.log(np.array(close[50:]))-np.log(np.array(close[:-50])))[(250-50):-5]
r120 = (np.log(np.array(close[120:]))-np.log(np.array(close[:-120])))[(250-120):-5]
r250 = (np.log(np.array(close[250:]))-np.log(np.array(close[:-250])))[:-5]

# Targets: the forward 1-day and 5-day log returns starting at day t
y1 = (np.log(np.array(close[1:]))-np.log(np.array(close[:-1])))[(250-1+1):(-5+1)]
y5 = (np.log(np.array(close[5:]))-np.log(np.array(close[:-5])))[(250-5+5):]

X = np.column_stack([r1,r5,r10,r20,r30,r50,r120,r250])
#X = np.column_stack([r5,r20,r50,r120,r250])


# Scale the features to [0, 1]
# (note: fitting the scaler on the full sample lets a little test-period information leak into the scaling)
mms = MinMaxScaler()
tt = mms.fit_transform(X)
X = tt
# Chronological train/test split: first 90% for training, last 10% for testing
k = int(X.shape[0] * 0.9)
X1 = X[:k]
X2 = X[k:]
y11 = 10*y1[:k]   # targets are multiplied by 10 (a pure rescaling)
y12 = 10*y1[k:]
y51 = 10*y5[:k]
y52 = 10*y5[k:]
record = pd.DataFrame(columns=(['Algorithm','Train error','Test error']))

The input is 8-dimensional and its relationship to the target does not look linear, so an ordinary linear regression would not seem well suited (a quick correlation check is sketched below).
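
As a quick sanity check (this snippet is an addition, not part of the original notebook; the list feature_names simply labels the columns of X1), one can print the linear correlation between each momentum feature and the scaled 5-day target; if these come out small, that is consistent with a linear fit having little to work with:

feature_names = ['r1', 'r5', 'r10', 'r20', 'r30', 'r50', 'r120', 'r250']
for i, name in enumerate(feature_names):
    # Pearson correlation between one input column and the scaled 5-day forward return
    print(name, np.corrcoef(X1[:, i], y51)[0, 1])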

2. Prediction¶

SVR¶

from sklearn import svm
svrr = svm.SVR(kernel='rbf', C=1e1)
svrr.fit(X1,y51)
r_y52 = svrr.predict(X2)
r_y51 = svrr.predict(X1)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('SVR', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.40973969978302655
Test set error 0.3046188105020312

Neural Network (MLP)¶

from sklearn.neural_network import MLPRegressor  # multi-layer perceptron regressor
# solver: 'lbfgs' tends to work well on small datasets, 'adam' is fairly robust,
#         and 'sgd' can perform best when its parameters (learning rate, iterations) are well tuned.
# alpha: L2 regularization strength (MLPRegressor regularizes with L2 by default; tune as needed).
# hidden_layer_sizes=(x, x): two hidden layers of x units each.
clf = MLPRegressor(solver='adam', alpha=1e-5, hidden_layer_sizes=(124, 124), random_state=1)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('Neural network (MLP)', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.40413568589060633
Test set error 0.30510380584932517

Linear Regression¶

from sklearn import linear_model        
clf = linear_model.LinearRegression()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('Linear regression', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.41411735301396474
Test set error 0.30582235268289204

Decision Tree¶

from sklearn import tree
clf = tree.DecisionTreeRegressor(max_depth=25)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('Decision tree', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.13403025976182179
Test set error 0.4323796945683956

Random Forest¶

from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('Random forest', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.1364986899768852
Test set error 0.33043807910741707

KNN¶

from sklearn import neighbors
clf = neighbors.KNeighborsRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('KNN', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.27278574027555313
Test set error 0.34086055925114783

GBRT (Gradient Boosting) Regression¶

from sklearn import ensemble
clf = ensemble.GradientBoostingRegressor(n_estimators=300)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('GBRT', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.24569913706146893
Test set error 0.31092995595911593

Bagging Regression¶

from sklearn.ensemble import BaggingRegressor
clf = BaggingRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('Bagging', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.14380879115660758
Test set error 0.34497932345187565

Extremely Randomized Tree¶

from sklearn.tree import ExtraTreeRegressor  # a single extremely randomized tree (the ensemble version is ExtraTreesRegressor)
clf = ExtraTreeRegressor()
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('Extra tree', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 1.7095137609083828e-05
Test set error 0.4812990001394652

AdaBoost Regression¶

from sklearn import ensemble
clf = ensemble.AdaBoostRegressor(n_estimators=50)
clf.fit(X1, y51)
r_y51 = clf.predict(X1)
r_y52 = clf.predict(X2)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('AdaBoost', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.40497653122445065
Test set error 0.32104811458688914

LSTM (Keras)¶

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
X1.shape[1]
8
clf = Sequential()
# Reshape to (samples, timesteps, features): each sample here is a single timestep with 8 features
tX1 = X1.reshape(X1.shape[0], 1, X1.shape[1])
tX2 = X2.reshape(X2.shape[0], 1, X2.shape[1])
clf.add(LSTM(50, input_shape=(tX1.shape[1], tX1.shape[2])))
clf.add(Dense(1))
clf.compile(loss='mae', optimizer='adam')
clf.fit(tX1, y51, epochs=50, batch_size=250)
r_y51 = clf.predict(tX1).flatten()  # flatten Keras' (n, 1) output so the error below is computed element-wise
r_y52 = clf.predict(tX2).flatten()
Epoch 1/50
2785/2785 [==============================] - 1s 538us/step - loss: 0.3103
Epoch 50/50
2785/2785 [==============================] - 0s 103us/step - loss: 0.3022
(intermediate epochs omitted; the training loss plateaus around 0.302)
train_e = np.sqrt(np.square(np.abs(r_y51 - y51)).mean())
test_e = np.sqrt(np.square(np.abs(r_y52 - y52)).mean())
print('Training set error', train_e)
print('Test set error', test_e)
record.loc[len(record),:] = ('LSTM', train_e, test_e)
plt.plot(y52.cumsum(),'r-')
plt.plot(r_y52.cumsum(),'b-')
Training set error 0.4283344503276845
Test set error 0.3079035993630009
record
    Algorithm             Train error   Test error
0   SVR                   0.40974       0.304619
1   Neural network (MLP)  0.404136      0.305104
2   Linear regression     0.414117      0.305822
3   Decision tree         0.13403       0.43238
4   Random forest         0.136499      0.330438
5   KNN                   0.272786      0.340861
6   GBRT                  0.245699      0.31093
7   Bagging               0.143809      0.344979
8   Extra tree            1.70951e-05   0.481299
9   AdaBoost              0.404977      0.321048
10  LSTM                  0.428334      0.307904

Discussion¶

1. SVR, the MLP, linear regression, and LSTM all have test errors in the 0.30-0.31 range;
2. An error at that level essentially means the prediction has failed (see the baseline check below);
3. In terms of model complexity, LSTM and linear regression are worlds apart, yet their results are practically identical.
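
To make point 2 concrete, here is a quick check (added here, not in the original notebook): compare the errors above against a trivial forecast that always predicts zero. A model whose test RMSE is close to this baseline carries essentially no information about the 5-day return.

# Zero-prediction baseline, reusing y52 (the scaled 5-day test returns) defined above
baseline = np.sqrt(np.square(y52).mean())
print('Test-set RMSE of always predicting 0:', baseline)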

Conclusion¶

1. Momentum-based prediction has only limited predictive power;
2. Deep learning did not pull off an upset in this study…
