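Debugging (Python 3): Applying Support Vector Machine (SVM) Models to Multi-Factor Stock Selection

Author: 醒掌天下权 | 2019-05-10 16:47 | Source: FX168财经网 People channel

Thanks to 一号研究小组 (Research Group No. 1) for the excellent research! With a report this complete, I had to run it myself and learn from it. Running it did surface quite a few problems, so here is a brief summary of what I improved and which bugs I fixed.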

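  1. The original was written in Python 2; this version is ported to Python 3.
  2. Added: a rewritten method for finding the first and last trading day of each month over a period, calculate_FL(time_list).
  3. Added: a rewritten way of assembling cross-sectional factor data: build a pandas.Panel first, then take a slice.
  4. Updated: exception handling for reading pickle files under Python 3.
  5. Updated: under Python 3, slicing with sort_values/iloc requires integer positions.
  6. Updated: the installed sklearn version is 0.18, where GridSearchCV has no cv_results_ attribute.
  7. Updated: instead of the original author's trimmed-down backtest framework, the one from the JoinQuant (聚宽) course is used.
  8. Updated: param_values can be passed to group_backtest in two ways: convert the numbers to strings, or pass a quantile tuple as in the JoinQuant course.

Overall, everything now runs end to end. The research process is thorough and systematic, and I learned a lot.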

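Introduction¶

Research objective:
This article follows the Huatai research report 《人工智能选股之支持向量机模型》 (AI Stock Selection: Support Vector Machine Models). It analyzes the results in that report and studies the application of support vector machines (SVM) to multi-factor stock selection, building a nonlinear model between stock factors and returns and evaluating the model along several dimensions.

Research content:
A multi-factor model is essentially a linear regression between a stock's current factor exposures and its future returns. Because financial markets are complex and nonlinear, a linear model has considerable shortcomings, so we want to bring in machine-learning ideas to improve the traditional multi-factor model. A linear SVM solves linear classification problems, a kernel SVM mainly targets nonlinear classification, and support vector regression (SVR) handles regression. This report applies SVMs to multi-factor stock selection and focuses on the following questions:
(1) Model selection: common kernels include the linear, polynomial, and Gaussian kernels. Different kernels produce different models, so which kernel gives the better model?
(2) Parameter tuning: an SVM has two important parameters, the penalty coefficient C and gamma. How should the optimal values be determined?
(3) Portfolio construction: after measuring how the different SVM models perform, how should the model predictions be used to build strategy portfolios for backtesting?
(4) Model evaluation: after backtesting the portfolios built from the predictions, how should the backtest results be evaluated to judge the quality of the model?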
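
Question (2) can be illustrated with a small grid search. The following is only a minimal sketch on synthetic data, not the report's actual setup: the data set, the grid values, and the 3-fold split are all assumptions chosen so the example runs quickly. It uses `sklearn.model_selection.GridSearchCV`, which replaced the deprecated `sklearn.grid_search` module.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for (factor exposures, next-period up/down label).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values for the penalty coefficient C and the RBF kernel's gamma.
# These grid values are illustrative assumptions, not the report's grid.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Score each (C, gamma) pair by cross-validated AUC.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 3))
```

Swapping `kernel="rbf"` for `"linear"` or `"poly"` gives a simple way to compare kernels under the same scoring rule, which is the comparison raised in question (1).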

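Research conclusions:
(1) The universe is the HS300 constituents. Factors and next-period returns from 2010-2014 form the in-sample set, and data from 2014-2018 form the out-of-sample test set. For the HS300 stock-selection model, the Gaussian-kernel SVM reaches a cross-validation accuracy of 56.2% and an AUC of 0.561; on the out-of-sample test set the average accuracy is 55.1% and the average AUC is 0.550.
(2) With HS300 as the stock pool, stock-selection strategies are built from the Gaussian-kernel SVM model, both as an equal-weight strategy and as an industry-neutral strategy within the HS300 constituents. Overall, the Gaussian-kernel SVM separates the quantile groups clearly in return, Sharpe ratio, and maximum drawdown, which indicates that the model is effective.
(3) Comparing the predictive power of SVMs with different kernels, the Gaussian kernel generally earns higher returns than the other kernels.
(4) Comparing the Gaussian-kernel SVM classifier (SVC) with support vector regression (SVR), the Gaussian-kernel SVC earns higher returns than the SVR model most of the time.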

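Time required:
(1) Data preparation: about 8 h, mostly spent on data collection. To avoid long run times, the data has been precomputed; download: https://pan.baidu.com/s/1DSI7uc5yBNY3hzedu9L7mw
(2) Model testing: about 6 h, mostly spent on parameter cross-validation. This step is for reference only and can be skipped.
(3) Strategy construction: about 36 min, because there are many backtest portfolios and the industry-neutral case is more complex.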

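1 Data preparation¶

1.1 Getting the date list (about 3 s)¶

Factor data is extracted at the end of each month, so the month-end dates have to be collected first.
The input parameters are peroid, start_date, and end_date. peroid selects the period: week (W), month (M), or quarter (Q). start_date and end_date are the start and end dates.
The function returns the corresponding period-end dates. This article uses a start date of 2010-01-01 and an end date of 2018-01-01.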

from jqdata import *
from jqlib.technical_analysis import *
from jqfactor import get_factor_values
from jqfactor import winsorize_med
from jqfactor import standardlize
from jqfactor import neutralize
import datetime
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels import regression
import pickle
from six import StringIO
# Import PCA
from sklearn.decomposition import PCA
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import metrics

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

import seaborn as sns
/opt/conda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/opt/conda/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
import jqdatasdk
from jqdatasdk import *
auth('1866530448','')
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-106-6252468b5dad> in <module>()
----> 1 import jqdatasdk
      2 from jqdatasdk import *
      3 auth('1866530448','')

ImportError: No module named 'jqdatasdk'
import numpy as np
import pandas as pd
import datetime as datetime
stock_data = get_price('000001.XSHE','2010-01-01', '2018-01-01','daily',fields=['close'])
stock_data['date']=stock_data.index
stock_data.head(50)
close date
2010-01-04 7.89 2010-01-04
2010-01-05 7.75 2010-01-05
2010-01-06 7.62 2010-01-06
2010-01-07 7.54 2010-01-07
2010-01-08 7.52 2010-01-08
2010-01-11 7.52 2010-01-11
2010-01-12 7.47 2010-01-12
2010-01-13 6.97 2010-01-13
2010-01-14 6.98 2010-01-14
2010-01-15 7.13 2010-01-15
2010-01-18 7.14 2010-01-18
2010-01-19 7.40 2010-01-19
2010-01-20 7.11 2010-01-20
2010-01-21 7.55 2010-01-21
2010-01-22 7.68 2010-01-22
2010-01-25 7.38 2010-01-25
2010-01-26 7.34 2010-01-26
2010-01-27 7.29 2010-01-27
2010-01-28 7.24 2010-01-28
2010-01-29 7.22 2010-01-29
2010-02-01 7.07 2010-02-01
2010-02-02 7.11 2010-02-02
2010-02-03 7.51 2010-02-03
2010-02-04 7.39 2010-02-04
2010-02-05 7.32 2010-02-05
2010-02-08 7.17 2010-02-08
2010-02-09 7.32 2010-02-09
2010-02-10 7.41 2010-02-10
2010-02-11 7.36 2010-02-11
2010-02-12 7.47 2010-02-12
2010-02-22 7.35 2010-02-22
2010-02-23 7.17 2010-02-23
2010-02-24 7.22 2010-02-24
2010-02-25 7.37 2010-02-25
2010-02-26 7.47 2010-02-26
2010-03-01 7.47 2010-03-01
2010-03-02 7.69 2010-03-02
2010-03-03 7.75 2010-03-03
2010-03-04 7.68 2010-03-04
2010-03-05 7.74 2010-03-05
2010-03-08 7.93 2010-03-08
2010-03-09 7.92 2010-03-09
2010-03-10 7.78 2010-03-10
2010-03-11 7.87 2010-03-11
2010-03-12 7.63 2010-03-12
2010-03-15 7.45 2010-03-15
2010-03-16 7.50 2010-03-16
2010-03-17 7.71 2010-03-17
2010-03-18 7.67 2010-03-18
2010-03-19 7.66 2010-03-19
stock_data.head(50)[['close']].resample('W').mean()
close
2010-01-10 7.664
2010-01-17 7.214
2010-01-24 7.376
2010-01-31 7.294
2010-02-07 7.280
2010-02-14 7.346
2010-02-21 NaN
2010-02-28 7.316
2010-03-07 7.666
2010-03-14 7.826
2010-03-21 7.598
stock_data.head(50).resample('W').last()
close date
2010-01-10 7.52 2010-01-08
2010-01-17 7.13 2010-01-15
2010-01-24 7.68 2010-01-22
2010-01-31 7.22 2010-01-29
2010-02-07 7.32 2010-02-05
2010-02-14 7.47 2010-02-12
2010-02-21 NaN NaT
2010-02-28 7.47 2010-02-26
2010-03-07 7.74 2010-03-05
2010-03-14 7.63 2010-03-12
2010-03-21 7.66 2010-03-19
period_stock_data = stock_data.head(50).resample('M').last()
date=period_stock_data.index
date
DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31'], dtype='datetime64[ns]', freq='M')
date[0]
Timestamp('2010-01-31 00:00:00', freq='M')
date.to_pydatetime()
array([datetime.datetime(2010, 1, 31, 0, 0),
       datetime.datetime(2010, 2, 28, 0, 0),
       datetime.datetime(2010, 3, 31, 0, 0)], dtype=object)
np.vectorize(lambda x:x.strftime('%Y-%m-%d'))(date.to_pydatetime())
array(['2010-01-31', '2010-02-28', '2010-03-31'], dtype='<U10')
pd.Series(np.vectorize(lambda x:x.strftime('%Y-%m-%d'))(date.to_pydatetime())).values.tolist()
['2010-01-31', '2010-02-28', '2010-03-31']
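
The `to_pydatetime` plus `np.vectorize` chain above can usually be shortened, because a pandas `DatetimeIndex` has its own `strftime` method. A minimal sketch, using three hand-written month-end dates as stand-ins for the resampled index:

```python
import pandas as pd

# Stand-in for the index produced by the monthly resample above.
month_ends = pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31"])

# DatetimeIndex.strftime formats every element at once.
date_strings = month_ends.strftime("%Y-%m-%d").tolist()
print(date_strings)
```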
from jqdata import *
stock_data.resample('M').last()
date
date
2010-01-31 2010-01-29
2010-02-28 2010-02-26
2010-03-31 2010-03-31
2010-04-30 2010-04-30
2010-05-31 2010-05-31
2010-06-30 2010-06-30
2010-07-31 2010-07-30
2010-08-31 2010-08-31
2010-09-30 2010-09-30
2010-10-31 2010-10-29
2010-11-30 2010-11-30
2010-12-31 2010-12-31
2011-01-31 2011-01-31
2011-02-28 2011-02-28
2011-03-31 2011-03-31
2011-04-30 2011-04-29
2011-05-31 2011-05-31
2011-06-30 2011-06-30
2011-07-31 2011-07-29
2011-08-31 2011-08-31
2011-09-30 2011-09-30
2011-10-31 2011-10-31
2011-11-30 2011-11-30
2011-12-31 2011-12-30
2012-01-31 2012-01-31
2012-02-29 2012-02-29
2012-03-31 2012-03-30
2012-04-30 2012-04-27
2012-05-31 2012-05-31
2012-06-30 2012-06-29
2012-07-31 2012-07-31
2012-08-31 2012-08-31
2012-09-30 2012-09-28
2012-10-31 2012-10-31
2012-11-30 2012-11-30
2012-12-31 2012-12-31
'''
How to compute the last trading day of each month:
1. Get the trading dates in the range: either take them from index price data, or call get_trade_days directly to get all trading days in the range.
2. Aggregate the dates as needed: resample gives the end point of each bin, and those end points are collected.
3. The resulting datetime array can be formatted with np.vectorize: vectorize the function to apply, then pass the array to the vectorized function.
'''
from jqdata import *
import datetime
import numpy as np
import pandas as pd
# Get the list of period-end dates for a given period: 'W', 'M', or 'Q'
def get_period_date(peroid,start_date, end_date):
    # peroid is the resample rule: 'W' weekly, 'M' monthly, 'Q' quarterly, '5min' five minutes, '12D' twelve days
    # Alternative: take the trading dates from price data
#     stock_data = get_price('000001.XSHE',start_date,end_date,'daily',fields=['close'])
#     stock_data['date']=stock_data.index
    # Call get_trade_days directly for all trading days in the requested range
    # (the original hardcoded 2010-01-01 and 2018-01-01 here and ignored the arguments)
    stock_data = pd.DataFrame()
    stock_data['date'] = get_trade_days(start_date=start_date, end_date=end_date, count=None)
    stock_data.index = stock_data['date'].apply(lambda x:pd.to_datetime(x))
    # Resample so that each period keeps the value of its last trading day
    period_stock_data=stock_data.resample(peroid).last()
    # Note: the resampled index holds calendar period ends (e.g. 2010-01-31), not trading days
    date=period_stock_data.index
    pydate_array = date.to_pydatetime()
    date_only_array = list(np.vectorize(lambda s: s.strftime('%Y-%m-%d'))(pydate_array))
    # Prepend the day before start_date as the left edge of the first period
    start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    start_date=start_date-datetime.timedelta(days=1)
    start_date = start_date.strftime("%Y-%m-%d")
    date_only_array.insert(0,start_date)
    return date_only_array
np.array(get_period_date('M','2010-01-01', '2018-01-01'))
array(['2009-12-31', '2010-01-31', '2010-02-28', '2010-03-31',
       '2010-04-30', '2010-05-31', '2010-06-30', '2010-07-31',
       '2010-08-31', '2010-09-30', '2010-10-31', '2010-11-30',
       '2010-12-31', '2011-01-31', '2011-02-28', '2011-03-31',
       '2011-04-30', '2011-05-31', '2011-06-30', '2011-07-31',
       '2011-08-31', '2011-09-30', '2011-10-31', '2011-11-30',
       '2011-12-31', '2012-01-31', '2012-02-29', '2012-03-31',
       '2012-04-30', '2012-05-31', '2012-06-30', '2012-07-31',
       '2012-08-31', '2012-09-30', '2012-10-31', '2012-11-30',
       '2012-12-31', '2013-01-31', '2013-02-28', '2013-03-31',
       '2013-04-30', '2013-05-31', '2013-06-30', '2013-07-31',
       '2013-08-31', '2013-09-30', '2013-10-31', '2013-11-30',
       '2013-12-31', '2014-01-31', '2014-02-28', '2014-03-31',
       '2014-04-30', '2014-05-31', '2014-06-30', '2014-07-31',
       '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30',
       '2014-12-31', '2015-01-31', '2015-02-28', '2015-03-31',
       '2015-04-30', '2015-05-31', '2015-06-30', '2015-07-31',
       '2015-08-31', '2015-09-30', '2015-10-31', '2015-11-30',
       '2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31',
       '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31',
       '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
       '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
       '2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31',
       '2017-08-31', '2017-09-30', '2017-10-31', '2017-11-30',
       '2017-12-31'], dtype='<U10')
# Compute the first and last trading day of each month over a period
def calculate_FL(time_list):
    # time_list is a sequence of trading dates, e.g. from get_trade_days
    time_list_df = pd.DataFrame(time_list,columns=['time'])
    time_list_df['time_str'] = time_list_df['time'].apply(lambda x:x.strftime('%Y-%m-%d'))
    time_list_df['year'] = time_list_df['time_str'].apply(lambda x:int(x.split('-')[0]))
    time_list_df['month'] = time_list_df['time_str'].apply(lambda x:int(x.split('-')[1]))
    time_list_df['day'] = time_list_df['time_str'].apply(lambda x:int(x.split('-')[2]))
    # Months counted from the first year, so December -> January still differs by 1
    time_list_df['cum_year'] = time_list_df['year']-time_list_df['year'].iloc[0]
    time_list_df['cum_month'] = time_list_df['cum_year']*12 + time_list_df['month']
    # diff_month == 1 marks the first trading day of a new month
    time_list_df['diff_month'] = time_list_df['cum_month'].diff()
    # diff_shift_month == 1 marks the last trading day before the month changes
    time_list_df['diff_shift_month'] = time_list_df['diff_month'].shift(-1)
    trade_end = list(time_list_df[time_list_df['diff_shift_month']==1]['time_str'].values)
    trade_start = list(time_list_df[time_list_df['diff_month'] == 1]['time_str'].values)
    # The first date opens the first month; the last date closes the final month
    trade_start.append(time_list_df['time_str'].iloc[0])
    trade_start = sorted(trade_start)
    trade_end.append(time_list_df['time_str'].iloc[-1])
    return trade_start,trade_end
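
The same first/last trading day lookup can also be written as a group-by on calendar month. This is only a sketch of an alternative: `first_last_trading_days` is a hypothetical helper name, and `pd.bdate_range` stands in for a real trading calendar, so exchange holidays are ignored in the demo data.

```python
import pandas as pd

def first_last_trading_days(trade_days):
    """Return (first_days, last_days) per calendar month as 'YYYY-MM-DD' strings."""
    days = pd.Series(pd.to_datetime(list(trade_days))).sort_values()
    # Group the dates by their calendar month (e.g. 2010-01).
    by_month = days.groupby(days.dt.to_period("M"))
    first = by_month.min().dt.strftime("%Y-%m-%d").tolist()
    last = by_month.max().dt.strftime("%Y-%m-%d").tolist()
    return first, last

# Business days stand in for a real trading calendar (holidays are ignored).
demo_days = pd.bdate_range("2010-01-01", "2010-03-31")
trade_start, trade_end = first_last_trading_days(demo_days)
print(trade_start)
print(trade_end)
```

With real data, the output of `get_trade_days` would be passed in place of `demo_days`.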

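1.2 Getting the stock list (about 2 s)¶

Available stock pools: HS300, ZZ500, CSI 800 (中证 800), the ChiNext Index (创业板指), and all A-shares
Stock screening: remove ST stocks and stocks listed for less than 3 months; each stock is treated as one sample
Using HS300 as an example, take the constituents as of 2017-06-01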

# Remove stocks listed less than 3 months before beginDate
def delect_stop(stocks,beginDate,n=30*3):
    stockList=[]
    beginDate = datetime.datetime.strptime(beginDate, "%Y-%m-%d")
    for stock in stocks:
        start_date=get_security_info(stock).start_date
        if start_date<(beginDate-datetime.timedelta(days=n)).date():
            stockList.append(stock)
    return stockList
# Get the stock pool
def get_stock(stockPool,begin_date):
    if stockPool=='HS300':
        stockList=get_index_stocks('000300.XSHG',begin_date)
    elif stockPool=='ZZ500':
        stockList=get_index_stocks('399905.XSHE',begin_date)
    elif stockPool=='ZZ800':
        stockList=get_index_stocks('399906.XSHE',begin_date)
    elif stockPool=='CYBZ':
        stockList=get_index_stocks('399006.XSHE',begin_date)
    elif stockPool=='ZXBZ':
        stockList=get_index_stocks('399005.XSHE',begin_date)
    elif stockPool=='A':
        stockList=get_index_stocks('000002.XSHG',begin_date)+get_index_stocks('399107.XSHE',begin_date)
#         stockList = list(get_all_securities('stock').index)
    else:
        # Fail early instead of hitting an undefined stockList below
        raise ValueError('unknown stockPool: %s' % stockPool)
    # Remove ST stocks
    st_data=get_extras('is_st',stockList, count = 1,end_date=begin_date)
    stockList = [stock for stock in stockList if not st_data[stock].iloc[0]]
    # Remove recently listed stocks (delect_stop only checks the listing date)
    stockList=delect_stop(stockList,begin_date)
    return stockList
get_stock('HS300','2017-06-01')
[u'000001.XSHE',
 u'000002.XSHE',
 u'000008.XSHE',
 u'000009.XSHE',
 u'000027.XSHE',
 u'000039.XSHE',
 u'000060.XSHE',
 u'000061.XSHE',
 u'000063.XSHE',
 u'000069.XSHE',
 u'000100.XSHE',
 u'000156.XSHE',
 u'000157.XSHE',
 u'000166.XSHE',
 u'000333.XSHE',
 u'000338.XSHE',
 u'000402.XSHE',
 u'000413.XSHE',
 u'000415.XSHE',
 u'000423.XSHE',
 u'000425.XSHE',
 u'000503.XSHE',
 u'000538.XSHE',
 u'000540.XSHE',
 u'000555.XSHE',
 u'000559.XSHE',
 u'000568.XSHE',
 u'000623.XSHE',
 u'000625.XSHE',
 u'000627.XSHE',
 u'000630.XSHE',
 u'000651.XSHE',
 u'000671.XSHE',
 u'000686.XSHE',
 u'000709.XSHE',
 u'000712.XSHE',
 u'000718.XSHE',
 u'000725.XSHE',
 u'000728.XSHE',
 u'000738.XSHE',
 u'000750.XSHE',
 u'000768.XSHE',
 u'000776.XSHE',
 u'000778.XSHE',
 u'000783.XSHE',
 u'000792.XSHE',
 u'000793.XSHE',
 u'000800.XSHE',
 u'000826.XSHE',
 u'000839.XSHE',
 u'000858.XSHE',
 u'000876.XSHE',
 u'000895.XSHE',
 u'000917.XSHE',
 u'000938.XSHE',
 u'000963.XSHE',
 u'000977.XSHE',
 u'000983.XSHE',
 u'001979.XSHE',
 u'002007.XSHE',
 u'002008.XSHE',
 u'002024.XSHE',
 u'002027.XSHE',
 u'002049.XSHE',
 u'002065.XSHE',
 u'002074.XSHE',
 u'002081.XSHE',
 u'002085.XSHE',
 u'002129.XSHE',
 u'002131.XSHE',
 u'002142.XSHE',
 u'002146.XSHE',
 u'002152.XSHE',
 u'002153.XSHE',
 u'002174.XSHE',
 u'002183.XSHE',
 u'002195.XSHE',
 u'002202.XSHE',
 u'002230.XSHE',
 u'002236.XSHE',
 u'002241.XSHE',
 u'002252.XSHE',
 u'002292.XSHE',
 u'002299.XSHE',
 u'002304.XSHE',
 u'002310.XSHE',
 u'002385.XSHE',
 u'002415.XSHE',
 u'002424.XSHE',
 u'002426.XSHE',
 u'002450.XSHE',
 u'002456.XSHE',
 u'002465.XSHE',
 u'002466.XSHE',
 u'002470.XSHE',
 u'002475.XSHE',
 u'002500.XSHE',
 u'002568.XSHE',
 u'002594.XSHE',
 u'002673.XSHE',
 u'002714.XSHE',
 u'002736.XSHE',
 u'002739.XSHE',
 u'002797.XSHE',
 u'300002.XSHE',
 u'300015.XSHE',
 u'300017.XSHE',
 u'300024.XSHE',
 u'300027.XSHE',
 u'300033.XSHE',
 u'300058.XSHE',
 u'300059.XSHE',
 u'300070.XSHE',
 u'300072.XSHE',
 u'300085.XSHE',
 u'300104.XSHE',
 u'300124.XSHE',
 u'300133.XSHE',
 u'300144.XSHE',
 u'300146.XSHE',
 u'300168.XSHE',
 u'300182.XSHE',
 u'300251.XSHE',
 u'300315.XSHE',
 u'600000.XSHG',
 u'600008.XSHG',
 u'600009.XSHG',
 u'600010.XSHG',
 u'600015.XSHG',
 u'600016.XSHG',
 u'600018.XSHG',
 u'600019.XSHG',
 u'600021.XSHG',
 u'600023.XSHG',
 u'600028.XSHG',
 u'600029.XSHG',
 u'600030.XSHG',
 u'600031.XSHG',
 u'600036.XSHG',
 u'600037.XSHG',
 u'600038.XSHG',
 u'600048.XSHG',
 u'600050.XSHG',
 u'600060.XSHG',
 u'600061.XSHG',
 u'600066.XSHG',
 u'600068.XSHG',
 u'600074.XSHG',
 u'600085.XSHG',
 u'600089.XSHG',
 u'600100.XSHG',
 u'600104.XSHG',
 u'600109.XSHG',
 u'600111.XSHG',
 u'600115.XSHG',
 u'600118.XSHG',
 u'600150.XSHG',
 u'600153.XSHG',
 u'600157.XSHG',
 u'600170.XSHG',
 u'600177.XSHG',
 u'600188.XSHG',
 u'600196.XSHG',
 u'600208.XSHG',
 u'600221.XSHG',
 u'600252.XSHG',
 u'600256.XSHG',
 u'600271.XSHG',
 u'600276.XSHG',
 u'600297.XSHG',
 u'600309.XSHG',
 u'600332.XSHG',
 u'600340.XSHG',
 u'600352.XSHG',
 u'600362.XSHG',
 u'600369.XSHG',
 u'600372.XSHG',
 u'600373.XSHG',
 u'600376.XSHG',
 u'600383.XSHG',
 u'600406.XSHG',
 u'600415.XSHG',
 u'600446.XSHG',
 u'600482.XSHG',
 u'600485.XSHG',
 u'600489.XSHG',
 u'600498.XSHG',
 u'600518.XSHG',
 u'600519.XSHG',
 u'600535.XSHG',
 u'600547.XSHG',
 u'600549.XSHG',
 u'600570.XSHG',
 u'600582.XSHG',
 u'600583.XSHG',
 u'600585.XSHG',
 u'600588.XSHG',
 u'600606.XSHG',
 u'600637.XSHG',
 u'600648.XSHG',
 u'600649.XSHG',
 u'600660.XSHG',
 u'600663.XSHG',
 u'600666.XSHG',
 u'600674.XSHG',
 u'600685.XSHG',
 u'600688.XSHG',
 u'600690.XSHG',
 u'600703.XSHG',
 u'600704.XSHG',
 u'600705.XSHG',
 u'600718.XSHG',
 u'600737.XSHG',
 u'600739.XSHG',
 u'600741.XSHG',
 u'600754.XSHG',
 u'600783.XSHG',
 u'600795.XSHG',
 u'600804.XSHG',
 u'600816.XSHG',
 u'600820.XSHG',
 u'600827.XSHG',
 u'600837.XSHG',
 u'600839.XSHG',
 u'600867.XSHG',
 u'600871.XSHG',
 u'600873.XSHG',
 u'600875.XSHG',
 u'600886.XSHG',
 u'600887.XSHG',
 u'600893.XSHG',
 u'600895.XSHG',
 u'600900.XSHG',
 u'600958.XSHG',
 u'600959.XSHG',
 u'600999.XSHG',
 u'601006.XSHG',
 u'601009.XSHG',
 u'601018.XSHG',
 u'601021.XSHG',
 u'601088.XSHG',
 u'601099.XSHG',
 u'601111.XSHG',
 u'601118.XSHG',
 u'601127.XSHG',
 u'601155.XSHG',
 u'601166.XSHG',
 u'601169.XSHG',
 u'601186.XSHG',
 u'601198.XSHG',
 u'601211.XSHG',
 u'601216.XSHG',
 u'601225.XSHG',
 u'601258.XSHG',
 u'601288.XSHG',
 u'601318.XSHG',
 u'601328.XSHG',
 u'601333.XSHG',
 u'601336.XSHG',
 u'601377.XSHG',
 u'601390.XSHG',
 u'601398.XSHG',
 u'601555.XSHG',
 u'601600.XSHG',
 u'601601.XSHG',
 u'601607.XSHG',
 u'601608.XSHG',
 u'601611.XSHG',
 u'601618.XSHG',
 u'601628.XSHG',
 u'601633.XSHG',
 u'601668.XSHG',
 u'601669.XSHG',
 u'601688.XSHG',
 u'601718.XSHG',
 u'601727.XSHG',
 u'601766.XSHG',
 u'601788.XSHG',
 u'601800.XSHG',
 u'601818.XSHG',
 u'601857.XSHG',
 u'601866.XSHG',
 u'601872.XSHG',
 u'601877.XSHG',
 u'601888.XSHG',
 u'601899.XSHG',
 u'601901.XSHG',
 u'601919.XSHG',
 u'601928.XSHG',
 u'601933.XSHG',
 u'601939.XSHG',
 u'601958.XSHG',
 u'601985.XSHG',
 u'601988.XSHG',
 u'601989.XSHG',
 u'601998.XSHG',
 u'603000.XSHG',
 u'603885.XSHG',
 u'603993.XSHG']
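
The listing-age rule in `delect_stop` can be checked without a data connection. The sketch below assumes the listing dates are already available as a dictionary; `drop_recent_listings` and the two codes are hypothetical and serve only as an illustration.

```python
import datetime

def drop_recent_listings(listing_dates, begin_date, min_days=90):
    """Keep codes listed more than `min_days` calendar days before begin_date."""
    begin = datetime.datetime.strptime(begin_date, "%Y-%m-%d").date()
    cutoff = begin - datetime.timedelta(days=min_days)
    return [code for code, listed in listing_dates.items() if listed < cutoff]

# Hypothetical listing dates, for illustration only.
demo = {
    "AAA": datetime.date(2010, 1, 4),   # listed long before the cutoff
    "BBB": datetime.date(2017, 5, 2),   # listed about a month before begin_date
}
kept = drop_recent_listings(demo, "2017-06-01")
print(kept)
```

The cutoff is 90 calendar days, not trading days, which matches the `n=30*3` default in the function above.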

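1.3 Feature extraction and preprocessing (about 8 h)¶

Feature extraction: on the last trading day of each calendar month, compute the exposures of the 66 factors from the earlier report and use them as the raw features of each sample. Wind factor data cannot be retrieved and the shareholder factors have too many missing values, so both are left out for now.
Feature preprocessing:
(1) Median winsorization: let $D_i$ be the exposure series of a factor across all stocks in period T, $D_M$ the median of that series, and $D_{M1}$ the median of the series $|D_i - D_M|$. Every value in $D_i$ greater than $D_M + 5D_{M1}$ is reset to $D_M + 5D_{M1}$, and every value less than $D_M - 5D_{M1}$ is reset to $D_M - 5D_{M1}$. This article uses the JoinQuant library function winsorize_med() for this step.
(2) Missing-value handling: in the new exposure series, missing exposures are set to the mean over stocks in the same Shenwan (申万) level-1 industry, implemented by the function replace_nan_indu().
(3) Industry and market-cap neutralization: the filled exposures are regressed linearly on industry dummy variables and log market cap, and the residuals become the new factor exposures, implemented by the JoinQuant library function neutralize().
(4) Standardization: the neutralized exposure series has its current mean subtracted and is divided by its standard deviation, giving a new series that approximately follows N(0,1), implemented by the JoinQuant library function standardlize().
Note: because the run time is long, the data has been precomputed; download: https://pan.baidu.com/s/1DSI7uc5yBNY3hzedu9L7mw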
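
Steps (1) and (4) are simple enough to sketch in plain pandas. The functions below are illustrative stand-ins written from the description above. They are not the internals of winsorize_med() or standardlize(), whose exact conventions (for example, which standard-deviation estimator is used) are not shown in this article.

```python
import pandas as pd

def winsorize_mad(series, scale=5):
    """Clip values to median +/- scale * median absolute deviation."""
    median = series.median()
    mad = (series - median).abs().median()
    return series.clip(lower=median - scale * mad, upper=median + scale * mad)

def zscore(series):
    """Subtract the mean and divide by the (sample) standard deviation."""
    return (series - series.mean()) / series.std()

raw = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = winsorize_mad(raw)       # median 3, MAD 1, so bounds are [-2, 8]
standardized = zscore(clipped)
print(clipped.tolist())
```

In this toy series the outlier 100 is pulled back to the upper bound 8, while the other values are unchanged.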

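# Helper for linear regression; np.array is used explicitly instead of relying on a star import
def linreg(X,Y,columns=3):
    X=sm.add_constant(np.array(X))
    Y=np.array(Y)
    if len(Y)>1:
        results = regression.linear_model.OLS(Y, X).fit()
        return results.params
    else:
        # Too few observations to fit: return NaN for every coefficient
        return [float("nan")]*(columns+1)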
linreg([1],[1])
[nan, nan, nan, nan]
# Get the industries that contain a given stock
def get_industry_name(i_Constituent_Stocks, value):
    return [k for k, v in i_Constituent_Stocks.items() if value in v]

# Missing-value handling
def replace_nan_indu(factor_data,stockList,industry_code,date):
    # Replace NaN with the industry mean; if NaN remains, fall back to the mean over all stocks
    i_Constituent_Stocks={}
    data_temp=pd.DataFrame(index=industry_code,columns=factor_data.columns,dtype=float)
    for i in industry_code:
        temp = get_industry_stocks(i, date)
        i_Constituent_Stocks[i] = list(set(temp).intersection(set(stockList)))
        # Column-wise mean of each factor over the stocks in this industry
        data_temp.loc[i]=factor_data.loc[i_Constituent_Stocks[i],:].mean()
    for factor in data_temp.columns:
        # Industries whose mean is missing get the mean over all industries
        null_industry=list(data_temp.loc[pd.isnull(data_temp[factor]),factor].keys())
        for i in null_industry:
            data_temp.loc[i,factor]=data_temp[factor].mean()
        null_stock=list(factor_data.loc[pd.isnull(factor_data[factor]),factor].keys())
        for i in null_stock:
            industry=get_industry_name(i_Constituent_Stocks, i)
            if industry:
                factor_data.loc[i,factor]=data_temp.loc[industry[0],factor]
            else:
                # Stock not found in any industry: use the mean over all stocks
                factor_data.loc[i,factor]=factor_data[factor].mean()
    return factor_data
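
The industry-mean fill can also be expressed with a group-by, which avoids the nested loops. This is a sketch under simplifying assumptions: `fill_with_industry_mean` is a hypothetical helper, the industry labels are passed in as a Series rather than fetched with get_industry_stocks, and an industry whose values are all missing falls back to the all-stock mean rather than the mean of industry means used above.

```python
import numpy as np
import pandas as pd

def fill_with_industry_mean(factors, industry):
    """factors: DataFrame indexed by stock; industry: Series mapping stock -> label."""
    # Stocks without an industry label share a placeholder group.
    labels = industry.reindex(factors.index).fillna("unclassified")
    industry_mean = factors.groupby(labels).transform("mean")
    filled = factors.fillna(industry_mean)
    # Anything still missing falls back to the mean over all stocks.
    return filled.fillna(factors.mean())

factors = pd.DataFrame({"bp": [1.0, np.nan, 3.0, np.nan]}, index=["A", "B", "C", "D"])
industry = pd.Series({"A": "bank", "B": "bank", "C": "steel"})
filled = fill_with_industry_mean(factors, industry)
print(filled["bp"].tolist())
```

Stock B takes the bank mean (1.0), and stock D, which has no industry label, takes the all-stock mean (2.0).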
'''Build a Panel and take a slice (pd.Panel was removed in pandas 0.25, so this needs an older pandas)'''
from jqfactor import get_factor_values
factors = get_factor_values(['000001.XSHE','000002.XSHE'],['roe_ttm','roa_ttm','total_asset_turnover_rate',\
                               'net_operate_cash_flow_ttm','net_profit_ttm',\
                              'cash_to_current_liability','current_ratio',\
                             'gross_income_ratio','non_recurring_gain_loss',\
                            'operating_revenue_ttm','net_profit_growth_rate'],end_date='2018-01-01',count=1)
df_ = pd.Panel(factors)
df_.iloc[:,0,:]

'''Loop over the factors: take the first row of each factor's DataFrame and add it as a column of a DataFrame indexed by stock'''
factor_data=get_factor_values(['000001.XSHE','000002.XSHE'],['roe_ttm','roa_ttm','total_asset_turnover_rate',\
                               'net_operate_cash_flow_ttm','net_profit_ttm',\
                              'cash_to_current_liability','current_ratio',\
                             'gross_income_ratio','non_recurring_gain_loss',\
                            'operating_revenue_ttm','net_profit_growth_rate'],end_date='2018-01-01',count=1)
factor=pd.DataFrame(index=['000001.XSHE','000002.XSHE'])
for i in factor_data.keys():
    factor[i]=factor_data[i].iloc[0,:]
cash_to_current_liability current_ratio gross_income_ratio net_operate_cash_flow_ttm net_profit_growth_rate net_profit_ttm non_recurring_gain_loss operating_revenue_ttm roa_ttm roe_ttm total_asset_turnover_rate
000001.XSHE NaN NaN NaN 5.944002e+09 0.008274 2.303300e+10 -1.899981e+07 1.055800e+11 0.007341 0.105602 0.033651
000002.XSHE 0.131772 1.210249 0.319474 1.328647e+10 0.150733 3.187653e+10 1.291532e+09 2.405229e+11 0.031301 0.190052 0.236181
q = query(valuation,balance,cash_flow,income,indicator).filter(valuation.code.in_(['000001.XSHE','000002.XSHE']))
df = get_fundamentals(q, '2018-01-01')
df['market_cap']=df['market_cap']*100000000
df
id code pe_ratio turnover_ratio pb_ratio ps_ratio pcf_ratio capitalization market_cap circulating_cap ... inc_total_revenue_year_on_year inc_total_revenue_annual inc_revenue_year_on_year inc_revenue_annual inc_operation_profit_year_on_year inc_operation_profit_annual inc_net_profit_year_on_year inc_net_profit_annual inc_net_profit_to_shareholders_year_on_year inc_net_profit_to_shareholders_annual
0 16041672 000001.XSHE 9.9148 0.5810 1.1524 2.1630 -23.2837 1717041.125 2.283665e+11 1691799.0 ... -5.2906 -2.2799 -5.2906 -2.2799 1.7046 3.5437 2.6762 4.0852 2.6762 4.0852
1 16041673 000002.XSHE 14.3756 0.3968 2.9463 1.4255 20.6124 1103915.250 3.363281e+11 970916.5 ... 11.9039 -7.6750 11.9039 -7.6750 19.4354 -41.9464 13.5359 -46.7393 30.1334 -42.6654

2 rows × 237 columns

his_date = [pd.to_datetime('2018-01-01') - datetime.timedelta(90*i) for i in range(0, 4)]
his_date
[Timestamp('2018-01-01 00:00:00'),
 Timestamp('2017-10-03 00:00:00'),
 Timestamp('2017-07-05 00:00:00'),
 Timestamp('2017-04-06 00:00:00')]
stats.linregress([1,2,3], [2,3,4])
LinregressResult(slope=1.0, intercept=1.0, rvalue=1.0, pvalue=9.003163161571059e-11, stderr=0.0)
stats.linregress([1,2,3], [2,3,4])[:2]
(1.0, 1.0)
# Fetch turnover-ratio data
data_turnover_ratio=pd.DataFrame()
data_turnover_ratio['code']=['000001.XSHE','000002.XSHE']
trade_days=list(get_trade_days(end_date='2018-01-01', count=240*2))
q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(['000001.XSHE','000002.XSHE']))
temp = get_fundamentals(q, '2018-01-01')
# for i in trade_days:
#     q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(stock))
#     temp = get_fundamentals(q, i)
#     data_turnover_ratio=pd.merge(data_turnover_ratio, temp,how='left',on='code')
#     data_turnover_ratio=data_turnover_ratio.rename(columns={'turnover_ratio':i})
# data_turnover_ratio=data_turnover_ratio.set_index('code').T  
temp
code turnover_ratio
0 000001.XSHE 0.5810
1 000002.XSHE 0.3968
pd.merge(temp2, temp,how='left',on='code')
code turnover_ratio_x turnover_ratio_y
0 000001.XSHE 1.2304 1.2304
1 000002.XSHE 0.7039 0.7039
# Fetch turnover-ratio data, one query per trading day
data_turnover_ratio=pd.DataFrame()
data_turnover_ratio['code']=['000001.XSHE','000002.XSHE']
trade_days=list(get_trade_days(end_date='2018-01-01', count=240*2))

for i in trade_days:
    q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(['000001.XSHE','000002.XSHE']))
    temp = get_fundamentals(q, i)
    data_turnover_ratio=pd.merge(data_turnover_ratio, temp,how='left',on='code')
    data_turnover_ratio=data_turnover_ratio.rename(columns={'turnover_ratio':i})
data_turnover_ratio.set_index('code')
2016-01-14 2016-01-15 2016-01-18 2016-01-19 2016-01-20 2016-01-21 2016-01-22 2016-01-25 2016-01-26 2016-01-27 ... 2017-12-18 2017-12-19 2017-12-20 2017-12-21 2017-12-22 2017-12-25 2017-12-26 2017-12-27 2017-12-28 2017-12-29
code
000001.XSHE 0.5645 0.3797 0.3567 0.4245 0.5115 0.5135 0.3954 0.3189 0.5489 0.4821 ... 0.4761 1.4174 0.6539 0.8779 0.4391 0.9372 0.6642 0.8078 0.9180 0.5810
000002.XSHE 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.4339 0.3503 0.3389 0.6130 0.3525 0.5216 0.2872 0.4073 0.3783 0.3968

2 rows × 480 columns

data_turnover_ratio=data_turnover_ratio.set_index('code').T  
data_turnover_ratio
code 000001.XSHE 000002.XSHE
2016-01-14 0.5645 0.0000
2016-01-15 0.3797 0.0000
2016-01-18 0.3567 0.0000
2016-01-19 0.4245 0.0000
2016-01-20 0.5115 0.0000
2016-01-21 0.5135 0.0000
2016-01-22 0.3954 0.0000
2016-01-25 0.3189 0.0000
2016-01-26 0.5489 0.0000
2016-01-27 0.4821 0.0000
2016-01-28 0.2563 0.0000
2016-01-29 0.4612 0.0000
2016-02-01 0.3539 0.0000
2016-02-02 0.3127 0.0000
2016-02-03 0.2326 0.0000
2016-02-04 0.3161 0.0000
2016-02-05 0.2295 0.0000
2016-02-15 0.2359 0.0000
2016-02-16 0.3629 0.0000
2016-02-17 0.4957 0.0000
2016-02-18 0.3441 0.0000
2016-02-19 0.2702 0.0000
2016-02-22 0.5233 0.0000
2016-02-23 0.3608 0.0000
2016-02-24 0.2542 0.0000
2016-02-25 0.5270 0.0000
2016-02-26 0.3322 0.0000
2016-02-29 0.4803 0.0000
2016-03-01 0.3202 0.0000
2016-03-02 0.5732 0.0000
... ... ...
2017-11-20 1.6810 0.4069
2017-11-21 1.4750 1.1142
2017-11-22 1.5192 1.0845
2017-11-23 1.4358 0.5977
2017-11-24 1.5547 0.5400
2017-11-27 1.2201 0.5372
2017-11-28 1.0441 0.4179
2017-11-29 0.9245 0.9246
2017-11-30 0.8155 1.0306
2017-12-01 1.0551 0.5741
2017-12-04 0.8595 0.3913
2017-12-05 1.0188 0.5812
2017-12-06 0.9575 0.4015
2017-12-07 0.6986 0.6077
2017-12-08 0.7987 0.6058
2017-12-11 1.3370 0.5172
2017-12-12 1.0305 0.4816
2017-12-13 0.7621 0.3450
2017-12-14 0.5923 0.3935
2017-12-15 0.6502 0.4081
2017-12-18 0.4761 0.4339
2017-12-19 1.4174 0.3503
2017-12-20 0.6539 0.3389
2017-12-21 0.8779 0.6130
2017-12-22 0.4391 0.3525
2017-12-25 0.9372 0.5216
2017-12-26 0.6642 0.2872
2017-12-27 0.8078 0.4073
2017-12-28 0.9180 0.3783
2017-12-29 0.5810 0.3968

480 rows × 2 columns
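
The loop above grows the table by one merge per trading day. A common alternative is to stack the daily results into one long table and pivot once. The sketch below uses two hand-written daily frames as stand-ins for the get_fundamentals results; the values are copied from the last two rows of the table above.

```python
import pandas as pd

# Stand-ins for the per-day query results (values taken from the table above).
daily_frames = {
    "2017-12-28": pd.DataFrame({"code": ["000001.XSHE", "000002.XSHE"],
                                "turnover_ratio": [0.9180, 0.3783]}),
    "2017-12-29": pd.DataFrame({"code": ["000001.XSHE", "000002.XSHE"],
                                "turnover_ratio": [0.5810, 0.3968]}),
}

# Tag each daily frame with its date and stack them into one long table.
long_table = pd.concat(
    [frame.assign(date=day) for day, frame in daily_frames.items()],
    ignore_index=True,
)

# One pivot gives dates as rows and stock codes as columns.
wide = long_table.pivot(index="date", columns="code", values="turnover_ratio")
print(wide)
```

The result has the same layout as the transposed table above, and the cost of assembling it no longer grows with repeated merges.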

# Helper for linear regression; np.array is used explicitly instead of relying on a star import
def linreg(X,Y,columns=3):
    X=sm.add_constant(np.array(X))
    Y=np.array(Y)
    if len(Y)>1:
        results = regression.linear_model.OLS(Y, X).fit()
        return results.params
    else:
        # Too few observations to fit: return NaN for every coefficient
        return [float("nan")]*(columns+1)
# Get the industries that contain a given stock
def get_industry_name(i_Constituent_Stocks, value):
    return [k for k, v in i_Constituent_Stocks.items() if value in v]

# Missing-value handling
def replace_nan_indu(factor_data,stockList,industry_code,date):
    # Replace NaN with the industry mean; if NaN remains, fall back to the mean over all stocks
    i_Constituent_Stocks={}
    data_temp=pd.DataFrame(index=industry_code,columns=factor_data.columns,dtype=float)
    for i in industry_code:
        temp = get_industry_stocks(i, date)
        i_Constituent_Stocks[i] = list(set(temp).intersection(set(stockList)))
        # Column-wise mean of each factor over the stocks in this industry
        data_temp.loc[i]=factor_data.loc[i_Constituent_Stocks[i],:].mean()
    for factor in data_temp.columns:
        # Industries whose mean is missing get the mean over all industries
        null_industry=list(data_temp.loc[pd.isnull(data_temp[factor]),factor].keys())
        for i in null_industry:
            data_temp.loc[i,factor]=data_temp[factor].mean()
        null_stock=list(factor_data.loc[pd.isnull(factor_data[factor]),factor].keys())
        for i in null_stock:
            industry=get_industry_name(i_Constituent_Stocks, i)
            if industry:
                factor_data.loc[i,factor]=data_temp.loc[industry[0],factor]
            else:
                # Stock not found in any industry: use the mean over all stocks
                factor_data.loc[i,factor]=factor_data[factor].mean()
    return factor_data

# Data preprocessing
def data_preprocessing(factor_data,stockList,industry_code,date):
    # Median winsorization
    factor_data=winsorize_med(factor_data, scale=5, inf2nan=False,axis=0)
    # Missing-value handling
    factor_data=replace_nan_indu(factor_data,stockList,industry_code,date)
    # Industry and market-cap neutralization
    factor_data=neutralize(factor_data, how=['sw_l1', 'market_cap'], date=date, axis=0)
    # Standardization
    factor_data=standardlize(factor_data,axis=0)
    return factor_data
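
The neutralization step regresses each factor on industry dummies and log market cap and keeps the residual. The sketch below shows that textbook procedure with NumPy least squares. It is an assumption about what neutralize() does in spirit, not a reproduction of its implementation, and `neutralize_ols` and the six-stock sample are hypothetical.

```python
import numpy as np
import pandas as pd

def neutralize_ols(factor, industry, log_cap):
    """Residual of `factor` regressed on industry dummies and log market cap."""
    dummies = pd.get_dummies(industry).astype(float)
    # The industry dummies span the intercept, so no separate constant is added.
    X = np.column_stack([dummies.to_numpy(), log_cap.to_numpy()])
    y = factor.to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(y - X @ beta, index=factor.index)

idx = list("ABCDEF")
factor = pd.Series([0.5, 1.2, 0.9, 2.0, 1.7, 2.4], index=idx)
industry = pd.Series(["bank", "bank", "bank", "steel", "steel", "steel"], index=idx)
log_cap = pd.Series(np.log([1e10, 5e10, 2e10, 3e9, 8e9, 1e9]), index=idx)

residual = neutralize_ols(factor, industry, log_cap)
print(residual.round(4))
```

By construction the residuals sum to zero within each industry and are uncorrelated with log market cap, which is what "neutral" means here.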

#获取时间为date的全部因子数据
def get_factor_data(stock,date):
    data=pd.DataFrame(index=stock)
    q = query(valuation,balance,cash_flow,income,indicator).filter(valuation.code.in_(stock))
    df = get_fundamentals(q, date)
    df['market_cap']=df['market_cap']*100000000
    factor_data=get_factor_values(stock,['roe_ttm','roa_ttm','total_asset_turnover_rate',\
                               'net_operate_cash_flow_ttm','net_profit_ttm',\
                              'cash_to_current_liability','current_ratio',\
                             'gross_income_ratio','non_recurring_gain_loss',\
                            'operating_revenue_ttm','net_profit_growth_rate'],end_date=date,count=1)
    factor=pd.DataFrame(index=stock)
    for i in factor_data.keys():
        factor[i]=factor_data[i].iloc[0,:]
    df.index = df['code']
    del df['code'],df['id']
    # Merge into one wide table
    df=pd.concat([df,factor],axis=1)
    # Net profit (TTM) / total market cap
    data['EP']=df['net_profit_ttm']/df['market_cap']
    # Net assets / total market cap
    data['BP']=1/df['pb_ratio']
    # Operating revenue (TTM) / total market cap
    data['SP']=1/df['ps_ratio']
    # Net cash flow (TTM) / total market cap
    data['NCFP']=1/df['pcf_ratio']
    # Operating cash flow (TTM) / total market cap
    data['OCFP']=df['net_operate_cash_flow_ttm']/df['market_cap']
    # Net profit (TTM) year-over-year growth rate / PE_TTM
    data['G/PE']=df['net_profit_growth_rate']/df['pe_ratio']
    # ROE_ttm
    data['roe_ttm']=df['roe_ttm']
    # ROE_YTD
    data['roe_q']=df['roe']
    # ROA_ttm
    data['roa_ttm']=df['roa_ttm']
    # ROA_YTD
    data['roa_q']=df['roa']
    # Gross margin TTM
    data['grossprofitmargin_ttm']=df['gross_income_ratio']
    # Gross margin YTD
    data['grossprofitmargin_q']=df['gross_profit_margin']

    # Net profit margin excluding non-recurring items, YTD
    data['profitmargin_q']=df['adjusted_profit']/df['operating_revenue']
    # Asset turnover TTM
    data['assetturnover_ttm']=df['total_asset_turnover_rate']
    # Asset turnover YTD: operating revenue / total assets
    data['assetturnover_q']=df['operating_revenue']/df['total_assets']
    # Operating cash flow / net profit, TTM
    data['operationcashflowratio_ttm']=df['net_operate_cash_flow_ttm']/df['net_profit_ttm']
    # Operating cash flow / net profit, YTD
    data['operationcashflowratio_q']=df['net_operate_cash_flow']/df['net_profit']
    # Net assets
    df['net_assets']=df['total_assets']-df['total_liability']
    # Total assets / net assets
    data['financial_leverage']=df['total_assets']/df['net_assets']
    # Non-current liabilities / net assets
    data['debtequityratio']=df['total_non_current_liability']/df['net_assets']
    # Cash ratio = (cash + marketable securities) / current liabilities
    data['cashratio']=df['cash_to_current_liability']
    # Current ratio = current assets / current liabilities * 100%
    data['currentratio']=df['current_ratio']
    # Log of total market cap
    data['ln_capital']=np.log(df['market_cap'])
    # Dates needed for the TTM sums
    his_date = [pd.to_datetime(date) - datetime.timedelta(90*i) for i in range(0, 4)]
    tmp = pd.DataFrame()
    tmp['code']=stock
    for i in his_date:
        tmp_adjusted_dividend = get_fundamentals(query(indicator.code, indicator.adjusted_profit, \
                                                     cash_flow.dividend_interest_payment).
                                               filter(indicator.code.in_(stock)), date = i)
        tmp=pd.merge(tmp,tmp_adjusted_dividend,how='outer',on='code')

        tmp=tmp.rename(columns={'adjusted_profit':'adjusted_profit'+str(i.month), \
                                'dividend_interest_payment':'dividend_interest_payment'+str(i.month)})
    tmp=tmp.set_index('code')
    tmp_columns=tmp.columns.values.tolist()
    tmp_adjusted=sum(tmp[[i for i in tmp_columns if 'adjusted_profit'in i ]],1)
    tmp_dividend=sum(tmp[[i for i in tmp_columns if 'dividend_interest_payment'in i ]],1)
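    # Net profit excluding non-recurring items (TTM) / total market cap
    data['EPcut']=tmp_adjusted/df['market_cap']
    # Cash dividends over the last 12 months (by ex-dividend date) / total market cap
    data['DP']=tmp_dividend/df['market_cap']
    # Net profit margin excluding non-recurring items, TTM
    data['profitmargin_ttm']=tmp_adjusted/df['operating_revenue_ttm']
    # Year-over-year growth rates (YTD):
    # current values are compared with the values from one year earlier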
    his_date = pd.to_datetime(date) - datetime.timedelta(365)
    name=['operating_revenue','net_profit','net_operate_cash_flow','roe']
    temp_data=df[name]
    his_temp_data = get_fundamentals(query(valuation.code, income.operating_revenue,income.net_profit,\
                                            cash_flow.net_operate_cash_flow,indicator.roe).
                                      filter(valuation.code.in_(stock)), date = his_date)
    his_temp_data=his_temp_data.set_index('code')
    # Rename the his_temp_data columns with a last_year suffix
    for i in name:
        his_temp_data=his_temp_data.rename(columns={i:i+'last_year'})

    temp_data =pd.concat([temp_data,his_temp_data],axis=1)
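    # Operating revenue (YTD) year-over-year growth rate
    data['sales_g_q']=temp_data['operating_revenue']/temp_data['operating_revenuelast_year']-1
    # Net profit (YTD) year-over-year growth rate
    data['profit_g_q']=temp_data['net_profit']/temp_data['net_profitlast_year']-1
    # Operating cash flow (YTD) year-over-year growth rate
    data['ocf_g_q']=temp_data['net_operate_cash_flow']/temp_data['net_operate_cash_flowlast_year']-1
    # ROE (YTD) year-over-year growth rate
    data['roe_g_q']=temp_data['roe']/temp_data['roelast_year']-1
    # Intercept (alpha) and beta from regressing each stock's returns over 60 months on the SSE Composite Index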
    stock_close=get_price(stock, count = 60*20+1, end_date=date, frequency='daily', fields=['close'])['close']
    SZ_close=get_price('000001.XSHG', count = 60*20+1, end_date=date, frequency='daily', fields=['close'])['close']
    stock_pchg=stock_close.pct_change().iloc[1:]
    SZ_pchg=SZ_close.pct_change().iloc[1:]
    beta=[]
    stockalpha=[]
    for i in stock:
        temp_beta, temp_stockalpha = stats.linregress(SZ_pchg, stock_pchg[i])[:2]
        beta.append(temp_beta)
        stockalpha.append(temp_stockalpha)
    # alpha and beta are plain lists here
    data['alpha']=stockalpha
    data['beta']=beta

    # Momentum
    data['return_1m']=stock_close.iloc[-1]/stock_close.iloc[-20]-1
    data['return_3m']=stock_close.iloc[-1]/stock_close.iloc[-60]-1
    data['return_6m']=stock_close.iloc[-1]/stock_close.iloc[-120]-1
    data['return_12m']=stock_close.iloc[-1]/stock_close.iloc[-240]-1

    # Fetch turnover-ratio data
    data_turnover_ratio=pd.DataFrame()
    data_turnover_ratio['code']=stock
    trade_days=list(get_trade_days(end_date=date, count=240*2))
    for i in trade_days:
        q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(stock))
        temp = get_fundamentals(q, i)
        data_turnover_ratio=pd.merge(data_turnover_ratio, temp,how='left',on='code')
        data_turnover_ratio=data_turnover_ratio.rename(columns={'turnover_ratio':i})
    data_turnover_ratio=data_turnover_ratio.set_index('code').T   

    # Arithmetic mean over the last N months of daily turnover ratio times daily return, per stock
    data['wgt_return_1m']=mean(stock_pchg.iloc[-20:]*data_turnover_ratio.iloc[-20:])
    data['wgt_return_3m']=mean(stock_pchg.iloc[-60:]*data_turnover_ratio.iloc[-60:])
    data['wgt_return_6m']=mean(stock_pchg.iloc[-120:]*data_turnover_ratio.iloc[-120:])
    data['wgt_return_12m']=mean(stock_pchg.iloc[-240:]*data_turnover_ratio.iloc[-240:])
    # Arithmetic mean over the last N months of daily turnover ratio times exp(-x_i/N/4) times daily return, per stock
    temp_data=pd.DataFrame(index=data_turnover_ratio[-240:].index,columns=stock)
    temp=[]
    for i in range(240):
        if i/20<1:
            temp.append(exp(-i/1/4))
        elif i/20<3:
            temp.append(exp(-i/3/4))
        elif i/20<6:
            temp.append(exp(-i/6/4))
        elif i/20<12:
            temp.append(exp(-i/12/4))  
    temp.reverse()
    for i in stock:
        temp_data[i]=temp
    data['exp_wgt_return_1m']=mean(stock_pchg.iloc[-20:]*temp_data.iloc[-20:]*data_turnover_ratio.iloc[-20:])
    data['exp_wgt_return_3m']=mean(stock_pchg.iloc[-60:]*temp_data.iloc[-60:]*data_turnover_ratio.iloc[-60:])
    data['exp_wgt_return_6m']=mean(stock_pchg.iloc[-120:]*temp_data.iloc[-120:]*data_turnover_ratio.iloc[-120:])
    data['exp_wgt_return_12m']=mean(stock_pchg.iloc[-240:]*temp_data.iloc[-240:]*data_turnover_ratio.iloc[-240:])

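    # Idiosyncratic volatility
    # Residuals from the Fama-French three-factor regression
    LoS=len(stock)
    # Python 3: use sort_values instead of the removed DataFrame.sort, and // so slice bounds stay integers
    S=df.sort_values('market_cap')[:LoS//3].index
    B=df.sort_values('market_cap')[LoS-LoS//3:].index
    df['BTM']=df['total_owner_equities']/df['market_cap']
    L=df.sort_values('BTM')[:LoS//3].index
    H=df.sort_values('BTM')[LoS-LoS//3:].index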
    df_temp=stock_pchg.iloc[-240:]
    # Compute the factor values
    SMB=sum(df_temp[S].T)/len(S)-sum(df_temp[B].T)/len(B)
    HMI=sum(df_temp[H].T)/len(H)-sum(df_temp[L].T)/len(L)
    # Use the CSI 300 (HS300) as the market benchmark
    dp=get_price('000300.XSHG',count=12*20+1,end_date=date,frequency='daily', fields=['close'])['close']
    RM=dp.pct_change().iloc[1:]-0.04/252
    # Assemble the computed factors into one frame
    X=pd.DataFrame({"RM":RM,"SMB":SMB,"HMI":HMI})
    resd=pd.DataFrame()
    for i in stock:
        temp=df_temp[i]-0.04/252
        t_r=linreg(X,temp)
        resd[i]=list(temp-(t_r[0]+X.iloc[:,0]*t_r[1]+X.iloc[:,1]*t_r[2]+X.iloc[:,2]*t_r[3]))
    data['std_FF3factor_1m']=resd[-1*20:].std()
    data['std_FF3factor_3m']=resd[-3*20:].std()
    data['std_FF3factor_6m']=resd[-6*20:].std()
    data['std_FF3factor_12m']=resd[-12*20:].std()

    # Volatility
    data['std_1m']=stock_pchg.iloc[-20:].std()
    data['std_3m']=stock_pchg.iloc[-60:].std()
    data['std_6m']=stock_pchg.iloc[-120:].std()
    data['std_12m']=stock_pchg.iloc[-240:].std()

    # Share price
    data['ln_price']=np.log(stock_close.iloc[-1])

    # Turnover ratio
    data['turn_1m']=mean(data_turnover_ratio.iloc[-20:])
    data['turn_3m']=mean(data_turnover_ratio.iloc[-60:])
    data['turn_6m']=mean(data_turnover_ratio.iloc[-120:])
    data['turn_12m']=mean(data_turnover_ratio.iloc[-240:])

    data['bias_turn_1m']=mean(data_turnover_ratio.iloc[-20:])/mean(data_turnover_ratio)-1
    data['bias_turn_3m']=mean(data_turnover_ratio.iloc[-60:])/mean(data_turnover_ratio)-1
    data['bias_turn_6m']=mean(data_turnover_ratio.iloc[-120:])/mean(data_turnover_ratio)-1
    data['bias_turn_12m']=mean(data_turnover_ratio.iloc[-240:])/mean(data_turnover_ratio)-1
    # Technical indicators
    data['PSY']=pd.Series(PSY(stock, date, timeperiod=20))
    data['RSI']=pd.Series(RSI(stock, date, N1=20))
    data['BIAS']=pd.Series(BIAS(stock,date, N1=20)[0])
    dif,dea,macd=MACD(stock, date, SHORT = 10, LONG = 30, MID = 15)
    data['DIF']=pd.Series(dif)
    data['DEA']=pd.Series(dea)
    data['MACD']=pd.Series(macd)
    return data
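
The `alpha` and `beta` columns above come from `stats.linregress(SZ_pchg, stock_pchg[i])[:2]`, whose first two return values are the slope and the intercept. A minimal pure-Python sketch of the same one-variable least-squares fit, using made-up returns and a hypothetical helper name `ols_slope_intercept`, shows what those two numbers represent:

```python
def ols_slope_intercept(x, y):
    """Least-squares fit y = intercept + slope * x for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up daily returns: the stock moves 1.5x the index plus a constant 0.001
index_ret = [0.01, -0.02, 0.005, 0.015, -0.01]
stock_ret = [0.001 + 1.5 * r for r in index_ret]

# slope -> beta (sensitivity to the index), intercept -> alpha
beta, alpha = ols_slope_intercept(index_ret, stock_ret)
```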

peroid = 'M'
start_date = '2010-01-01'
end_date = '2018-01-01'
industry_old_code = ['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
                    '801160','801170','801180','801200','801210','801230']
industry_new_code = ['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
                    '801160','801170','801180','801200','801210','801230','801710','801720','801730','801740','801750',\
                   '801760','801770','801780','801790','801880','801890']

dateList = get_period_date(peroid,start_date, end_date)
factor_origl_data = {}
factor_solve_data = {}
for date in dateList:
    # Select the industry classification codes in effect on this date
    if datetime.datetime.strptime(date,"%Y-%m-%d").date()<datetime.date(2014,2,21):
        industry_code=industry_old_code
    else:
        industry_code=industry_new_code
    stockList=get_stock('HS300',date)

    factor_origl_data[date] = get_factor_data(stockList,date)
    factor_solve_data[date] = data_preprocessing(factor_origl_data[date],stockList,industry_code,date)
content = pickle.dumps(factor_solve_data) 
write_file('factor_solve_data.pkl', content, append=False)
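
The driver loop above switches from the 17-code industry list to the 28-code list for dates on or after 2014-02-21. A small sketch of that date test, with shortened placeholder code lists and a hypothetical helper `pick_industry_codes`:

```python
import datetime

CUTOVER = datetime.date(2014, 2, 21)  # cutover date used in the loop above

def pick_industry_codes(date_str, old_codes, new_codes):
    """Return old_codes for dates strictly before CUTOVER, else new_codes."""
    day = datetime.datetime.strptime(date_str, "%Y-%m-%d").date()
    return old_codes if day < CUTOVER else new_codes

old_codes = ['801010', '801020']            # shortened placeholder lists
new_codes = ['801010', '801020', '801710']

before = pick_industry_codes('2014-01-31', old_codes, new_codes)
after = pick_industry_codes('2014-02-28', old_codes, new_codes)
```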
    

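1.4 Loading the data (about 3 s)¶

After preprocessing, the feature data is saved as a pkl file to make later processing easier. Loading the file gives the following structure: the object is a dict whose keys are dates and whose values are the factor data for each period, each a DataFrame with the stock list as index and the factors as columns.
Note: the pkl file can be uploaded directly to the research environment and read from there.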

import pickle
# Load the saved data object from file with the pickle module
with open("factor_solve_data.pkl",'rb') as f:
    # Earlier approach, kept for reference:
    # pkl_file_read = read_file("factor_solve_data.pkl")
    # factor_data = pickle.loads(StringIO(pkl_file_read))
    factor_data = pickle.load(f,encoding='iso-8859-1')
factor_data['2009-12-31'].head()
EP BP SP NCFP OCFP G/PE roe_ttm roe_q roa_ttm roa_q ... bias_turn_1m bias_turn_3m bias_turn_6m bias_turn_12m PSY RSI BIAS DIF DEA MACD
000001.XSHE 0.028950 1.131116 0.197401 -1.261756 -2.181723 -0.684842 0.025820 1.701937 -0.272718 -0.194391 ... -0.698740 -0.666942 -0.357962 0.349568 1.148705 1.291735 -0.202865 -0.362807 -0.079355 -0.788347
000002.XSHE 0.246983 -0.479406 0.331701 0.110568 0.679921 -0.513610 -0.181344 -1.024590 -0.225544 -0.825526 ... 0.752829 0.393239 0.429813 0.190625 -0.942532 -0.768319 0.125205 0.148512 0.022209 0.357588
000009.XSHE -0.262468 -1.249222 -0.017061 0.214158 0.451111 -0.044448 -0.110604 -0.697583 -0.042721 -0.269572 ... -0.395701 -0.566032 -0.833514 -0.112768 -0.802378 -0.713272 -0.262686 0.283520 0.059975 0.630850
000012.XSHE 0.467468 0.901849 0.214317 -0.304663 0.498606 -0.487198 0.518937 1.500490 0.641853 1.789205 ... -0.586362 0.393510 0.987122 1.588373 1.631216 1.555538 0.422617 0.208753 0.557065 -0.721810
000021.XSHE 0.574750 1.552436 2.686058 -0.022736 0.135909 -0.522318 0.279071 0.115994 0.730127 0.608440 ... -0.660390 -1.038726 -1.097826 1.312336 0.818352 1.350547 0.126273 -0.598198 -0.618886 -0.184690

5 rows × 66 columns
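
Pickles written by Python 2 can fail to load in Python 3 under the default ASCII decoding, which is why the cell above passes `encoding='iso-8859-1'`. A hedged sketch of a loader that tries the default first and then falls back to other encodings; `load_pickle_compat` is a hypothetical helper, and the demo writes its own small throwaway file rather than the real `factor_solve_data.pkl`:

```python
import os
import pickle
import tempfile

def load_pickle_compat(path, encodings=("ASCII", "latin1", "bytes")):
    """Try each encoding in turn; re-raise the last decode error if all fail."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, "rb") as f:
                return pickle.load(f, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error

# Demo with a throwaway file: a dict keyed by date, as described above
sample = {"2009-12-31": {"000001.XSHE": 0.02895}}
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump(sample, tmp, protocol=2)   # protocol 2 is readable by Python 2 and 3
    tmp_path = tmp.name

loaded = load_pickle_compat(tmp_path)
os.remove(tmp_path)
```

`latin1` is another name for `iso-8859-1`; it maps every byte to a character, so it never raises a decode error, at the cost of possibly mis-decoding non-Latin text.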

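2 Model testing¶

For the theory behind SVM models, see the 量化课堂 (Quant Classroom) article "SVM原理入门" (an introduction to SVM principles); it is not repeated here.

2.1 Label assignment and dataset split (about 10 s)¶

This article mainly uses the SVM as a classifier, so returns have to be converted into labels: at each month-end cross-section, the stocks ranked in the top 30% and the bottom 30% by next-month return are taken as positive (y = 1) and negative (y = −1) examples respectively.
To test the model the dataset has to be split: the first 4 years of samples are merged into the training set, and the last 4 years of samples form the test set.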
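
The labelling rule can be sketched in pure Python as follows. The returns are made up, and `label_extremes` is a hypothetical helper. Note that the notebook's own code labels each retained stock by comparing its return with the mean of the retained returns, whereas this sketch labels by rank group directly, which is a simplification.

```python
def label_extremes(returns, fraction=0.3):
    """Map the bottom `fraction` of stocks to -1 and the top `fraction` to +1.

    `returns` is a dict {stock_code: next_period_return}; stocks in the
    middle are dropped, mirroring the description above.
    """
    ranked = sorted(returns, key=returns.get)      # ascending by return
    k = int(len(ranked) * fraction)                # int() keeps slice bounds integral
    labels = {code: -1 for code in ranked[:k]}
    labels.update({code: 1 for code in ranked[len(ranked) - k:]})
    return labels

# Made-up next-month returns for ten hypothetical stocks S00..S09
next_ret = {"S%02d" % i: r for i, r in enumerate(
    [-0.12, 0.03, 0.08, -0.05, 0.15, 0.00, -0.02, 0.10, 0.01, -0.09])}
labels = label_extremes(next_ret)
```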

peroid='M'
start_date='2010-01-01'
end_date='2018-01-01'
dateList=get_period_date(peroid,start_date, end_date)
dateList
/opt/conda/lib/python3.5/site-packages/ipykernel_launcher.py:20: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).last()
['2009-12-31',
 '2010-01-31',
 '2010-02-28',
 '2010-03-31',
 '2010-04-30',
 '2010-05-31',
 '2010-06-30',
 '2010-07-31',
 '2010-08-31',
 '2010-09-30',
 '2010-10-31',
 '2010-11-30',
 '2010-12-31',
 '2011-01-31',
 '2011-02-28',
 '2011-03-31',
 '2011-04-30',
 '2011-05-31',
 '2011-06-30',
 '2011-07-31',
 '2011-08-31',
 '2011-09-30',
 '2011-10-31',
 '2011-11-30',
 '2011-12-31',
 '2012-01-31',
 '2012-02-29',
 '2012-03-31',
 '2012-04-30',
 '2012-05-31',
 '2012-06-30',
 '2012-07-31',
 '2012-08-31',
 '2012-09-30',
 '2012-10-31',
 '2012-11-30',
 '2012-12-31',
 '2013-01-31',
 '2013-02-28',
 '2013-03-31',
 '2013-04-30',
 '2013-05-31',
 '2013-06-30',
 '2013-07-31',
 '2013-08-31',
 '2013-09-30',
 '2013-10-31',
 '2013-11-30',
 '2013-12-31',
 '2014-01-31',
 '2014-02-28',
 '2014-03-31',
 '2014-04-30',
 '2014-05-31',
 '2014-06-30',
 '2014-07-31',
 '2014-08-31',
 '2014-09-30',
 '2014-10-31',
 '2014-11-30',
 '2014-12-31',
 '2015-01-31',
 '2015-02-28',
 '2015-03-31',
 '2015-04-30',
 '2015-05-31',
 '2015-06-30',
 '2015-07-31',
 '2015-08-31',
 '2015-09-30',
 '2015-10-31',
 '2015-11-30',
 '2015-12-31',
 '2016-01-31',
 '2016-02-29',
 '2016-03-31',
 '2016-04-30',
 '2016-05-31',
 '2016-06-30',
 '2016-07-31',
 '2016-08-31',
 '2016-09-30',
 '2016-10-31',
 '2016-11-30',
 '2016-12-31',
 '2017-01-31',
 '2017-02-28',
 '2017-03-31',
 '2017-04-30',
 '2017-05-31',
 '2017-06-30',
 '2017-07-31',
 '2017-08-31',
 '2017-09-30',
 '2017-10-31',
 '2017-11-30',
 '2017-12-31']
dateList[4*12:-1]
['2013-12-31',
 '2014-01-31',
 '2014-02-28',
 '2014-03-31',
 '2014-04-30',
 '2014-05-31',
 '2014-06-30',
 '2014-07-31',
 '2014-08-31',
 '2014-09-30',
 '2014-10-31',
 '2014-11-30',
 '2014-12-31',
 '2015-01-31',
 '2015-02-28',
 '2015-03-31',
 '2015-04-30',
 '2015-05-31',
 '2015-06-30',
 '2015-07-31',
 '2015-08-31',
 '2015-09-30',
 '2015-10-31',
 '2015-11-30',
 '2015-12-31',
 '2016-01-31',
 '2016-02-29',
 '2016-03-31',
 '2016-04-30',
 '2016-05-31',
 '2016-06-30',
 '2016-07-31',
 '2016-08-31',
 '2016-09-30',
 '2016-10-31',
 '2016-11-30',
 '2016-12-31',
 '2017-01-31',
 '2017-02-28',
 '2017-03-31',
 '2017-04-30',
 '2017-05-31',
 '2017-06-30',
 '2017-07-31',
 '2017-08-31',
 '2017-09-30',
 '2017-10-31',
 '2017-11-30']
peroid='M'
start_date='2010-01-01'
end_date='2018-01-01'
industry_old_code=['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
                    '801160','801170','801180','801200','801210','801230']
industry_new_code=['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
                    '801160','801170','801180','801200','801210','801230','801710','801720','801730','801740','801750',\
                   '801760','801770','801780','801790','801880','801890']

dateList=get_period_date(peroid,start_date, end_date)

# Training set
train_data=pd.DataFrame()
for date in dateList[:4*12]:
    traindf=factor_data[date]
    stockList=list(traindf.index)
    # Fetch return data
    data_close=get_price(stockList,date,dateList[dateList.index(date)+1],'1d','close')['close']
    traindf['pchg']=data_close.iloc[-1]/data_close.iloc[0]-1
    # Drop rows with missing values
    traindf=traindf.dropna()
    # Keep the top and bottom 30% of stocks and drop the noisy middle
    traindf=traindf.sort_values(by=['pchg'])
    traindf=traindf.iloc[:int(len(traindf['pchg'])/10*3),:].append(traindf.iloc[int(len(traindf['pchg'])/10*7):,:])
    traindf['label']=list(traindf['pchg'].apply(lambda x:1 if x>np.mean(list(traindf['pchg'])) else -1))
    if train_data.empty:
        train_data=traindf
    else:
        train_data=train_data.append(traindf)

# Test set
test_data={}
for date in dateList[4*12:-1]:
    testdf=factor_data[date]
    stockList=list(testdf.index)
    # Fetch return data
    data_close=get_price(stockList,date,dateList[dateList.index(date)+1],'1d','close')['close']
    testdf['pchg']=data_close.iloc[-1]/data_close.iloc[0]-1
    # Drop rows with missing values
    testdf=testdf.dropna()
    # Keep the top and bottom 30% of stocks and drop the noisy middle
    testdf=testdf.sort_values(by=['pchg'])
    testdf=testdf.iloc[:int(len(testdf['pchg'])/10*3),:].append(testdf.iloc[int(len(testdf['pchg'])/10*7):,:])
    testdf['label']=list(testdf['pchg'].apply(lambda x:1 if x>np.mean(list(testdf['pchg'])) else -1))
    test_data[date]=testdf
        
/opt/conda/lib/python3.5/site-packages/ipykernel_launcher.py:20: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).last()
train_data
EP BP SP NCFP OCFP G/PE roe_ttm roe_q roa_ttm roa_q ... bias_turn_6m bias_turn_12m PSY RSI BIAS DIF DEA MACD pchg label
000898.XSHE -1.339908 -0.823496 -1.553743 -0.336050 -0.851616 0.901782 -0.771852 -0.184476 -0.898484 0.285664 ... -0.633324 -1.247146 -0.200697 -0.492782 0.901998 1.187574 1.229229 0.335472 -0.261379 -1
600348.XSHG 0.253238 -0.709708 0.314576 0.811228 0.250254 2.415528 1.252358 1.103479 0.755359 0.671009 ... -1.384773 -2.431529 -0.407887 -0.139634 1.312277 1.146901 0.426139 2.016046 -0.237435 -1
600104.XSHG -0.204334 0.840664 1.827264 1.702015 0.651325 -0.555469 -0.157791 1.409792 -0.058212 1.288909 ... -0.801224 -0.777448 1.023477 1.353981 -0.148693 1.551631 1.985120 -0.419854 -0.229304 -1
601699.XSHG 0.604046 -0.584376 -0.158244 -0.084201 0.236129 1.474444 1.198756 0.311401 1.020940 0.241325 ... -0.735090 0.417693 -0.632352 -0.401398 -0.448861 0.466879 0.787098 -0.596676 -0.220944 -1
000983.XSHE -0.609660 -0.945398 -0.696665 -0.149154 -0.316074 -0.460934 -0.025241 0.189220 0.104082 0.251961 ... -0.281636 0.369459 -0.473663 -0.442339 -0.194350 -0.497980 -0.042027 -1.263650 -0.219931 -1
000718.XSHE -0.563044 -1.551998 -0.612245 -0.651606 -0.641480 -0.756999 0.474626 2.170109 0.320176 1.700314 ... -0.880877 -0.290465 -0.609921 -0.725955 -1.224751 -0.567860 -0.102092 -1.248264 -0.218586 -1
600019.XSHG -0.280231 -0.208837 -0.612423 -0.914817 0.927408 -0.772553 -0.188920 -0.331904 -0.191782 0.030894 ... -0.852683 -2.015797 -0.478094 -0.260537 1.681595 1.102547 1.020271 0.583499 -0.216066 -1
600307.XSHG 0.246073 -1.547046 -0.076332 -0.064550 0.121182 0.049713 -0.004440 -0.212457 -0.049140 -0.108843 ... -1.608684 -1.235297 -0.296960 -0.732394 -0.860267 -0.680652 -0.553390 -0.550320 -0.215686 -1
000800.XSHE 0.645978 0.607839 1.395095 0.411715 0.600378 0.346006 1.096834 1.134758 1.199610 1.662074 ... -1.339557 -0.307526 1.607137 1.595132 0.375372 2.855703 2.827991 -1.548921 -0.205201 -1
600582.XSHG 1.220935 0.161788 0.330084 -0.074172 0.013625 0.475748 1.955091 1.553555 2.021309 2.160534 ... 0.769649 1.451860 1.009850 1.643493 0.645583 1.386249 1.525338 0.189815 -0.205186 -1
601166.XSHG 1.628138 1.097977 -0.069126 -2.607464 2.469360 -0.238659 1.363768 1.259437 -0.165464 -0.228330 ... -0.410905 -0.162214 1.398375 1.219716 0.515697 -0.040880 -0.079556 0.058303 -0.202107 -1
600741.XSHG 1.597167 2.126435 1.309559 2.473835 0.868861 2.448113 0.558987 1.170686 0.839599 2.343310 ... 0.395003 0.780068 1.830805 1.656058 1.410046 1.219544 1.117727 0.673861 -0.201711 -1
000932.XSHE -0.432491 0.962227 0.044747 0.702973 1.556231 0.351090 -0.103027 -0.540683 -0.066951 -0.641312 ... 0.209975 -0.186155 -0.905888 -0.577609 -0.299507 0.005007 -0.004132 0.013794 -0.198175 -1
000825.XSHE -0.560819 -1.161042 0.018134 -0.397080 -0.630818 0.568659 -0.470182 -0.139794 -0.369891 -0.224836 ... 0.438879 0.197424 -0.534888 -0.585478 -0.141694 -0.032672 0.156913 -0.451530 -0.198091 -1
000937.XSHE -0.357993 0.346295 -0.115037 0.358325 0.117128 -1.037641 -0.622369 -0.831234 -0.905939 -1.125563 ... -0.277565 -0.145597 -0.804577 -0.340420 -0.709074 0.287962 0.765937 -1.020075 -0.197540 -1
600395.XSHG -0.482925 -0.461043 -0.642086 0.048169 -0.162035 0.495222 -0.396606 0.563451 0.029798 1.538482 ... -0.223589 -0.998191 -0.389464 -0.410738 -0.433025 -0.383082 -0.404235 -0.137799 -0.197498 -1
601169.XSHG 1.259979 1.324686 -0.157433 -1.067213 1.206745 -0.017963 0.905450 0.696879 -0.108399 -0.173770 ... -0.885053 -1.486276 1.504733 1.409171 0.710609 0.232750 0.140371 0.283704 -0.194399 -1
000709.XSHE 0.157932 0.954871 0.040987 -0.831024 0.092326 -1.025100 0.021088 -0.595044 -0.015583 -0.577060 ... -0.189408 0.198160 -0.704704 -0.817105 -1.586145 0.198305 0.319341 -0.209632 -0.191523 -1
601001.XSHG 0.970052 -0.123389 -0.341525 -0.969352 -0.225479 0.331777 0.276158 -0.287747 1.065197 0.290460 ... -0.971557 -0.568488 -0.396240 -0.459546 -0.728115 -0.487051 -0.080269 -1.146645 -0.191341 -1
000060.XSHE -0.520340 -0.746671 -0.513973 -0.269318 -0.635668 -0.061345 -0.684667 -0.468283 -0.685499 -0.437695 ... -0.204781 -0.116877 -0.662184 -0.440292 1.365643 0.550537 -0.156007 1.770859 -0.190367 -1
600028.XSHG -0.445205 -1.906611 0.606283 -0.929019 -0.693458 1.847548 -0.538408 -0.670272 0.125637 0.015970 ... -0.640204 -1.910889 -1.910722 -1.908922 0.560512 1.024769 0.744422 0.978541 -0.189959 -1
600997.XSHG -1.014771 -0.482487 -0.130176 0.351810 -0.235677 -0.779942 -1.231717 -0.289891 -1.650070 -0.873043 ... 0.372966 0.352345 -0.180321 -0.426032 -0.704594 -0.632989 -0.159991 -1.350788 -0.189551 -1
600325.XSHG 0.323981 -0.204300 -0.103861 1.415510 1.612112 0.433534 -0.058768 -0.797225 -0.243463 -0.748118 ... 0.013669 -0.426368 -0.599389 -0.666574 -1.248069 -0.299024 0.287645 -1.422291 -0.187717 -1
000630.XSHE -0.033863 -0.349515 1.671445 0.158159 0.445587 -0.975289 -0.074184 -0.462556 -0.202445 -0.742567 ... 0.829676 0.960215 -0.454961 -0.421783 0.684300 0.807681 0.639520 0.645877 -0.187708 -1
000717.XSHE -1.413005 -0.240964 0.052592 -0.789472 1.573190 0.922596 -1.764376 0.181048 -2.029510 -0.000950 ... 1.825339 0.469849 -0.893000 -0.681956 -0.817644 -0.386848 -0.290421 -0.370801 -0.186846 -1
600690.XSHG -0.371019 -0.971475 -0.024128 0.813279 0.868150 0.005171 -0.462605 -0.519816 -0.160385 -0.128200 ... -0.669995 0.335240 -0.450148 -0.439052 -0.634005 0.329262 0.584221 -0.464916 -0.186667 -1
600585.XSHG 0.824933 1.472747 0.480301 -0.610967 1.244304 -0.231784 0.531591 0.566293 1.011384 1.132269 ... -1.270613 -1.195534 1.754538 1.533549 1.038025 1.287030 1.215400 0.629031 -0.186502 -1
000878.XSHE -2.191582 -0.989264 -0.170909 -1.137625 0.528887 1.906376 -2.248274 0.536849 -2.456989 -0.594343 ... -0.961349 -1.031000 -0.882532 -0.601176 0.314222 -2.081045 -2.358223 0.048406 -0.184907 -1
000933.XSHE -1.071980 -0.686538 0.060221 1.306697 0.665096 -0.907191 -1.236326 0.072424 -1.882922 -1.295668 ... -0.335325 0.277423 -0.175778 -0.057790 0.853452 1.960233 2.006145 0.580587 -0.184879 -1
002155.XSHE -0.208642 -0.757256 -0.922840 -0.468557 -0.656105 -0.679338 -0.225711 -0.508898 -0.249386 -0.060348 ... 1.204158 0.945165 -0.850366 -0.716255 -0.634589 -0.997182 -1.082108 -0.212759 -0.184638 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
002603.XSHE -0.740411 -0.630370 -0.636225 -0.524613 -0.313795 -0.629744 -1.450573 -1.227070 -1.352312 -1.097113 ... -0.286062 0.668399 -0.154614 -0.476288 0.439482 0.553265 0.158740 0.413281 0.069565 1
600703.XSHG -0.567762 -1.093158 -0.687669 -1.003212 -0.539549 -0.534213 -0.067931 0.109176 0.019778 0.160198 ... -0.663827 -0.733325 0.052621 -0.321021 1.310997 0.777162 0.941886 -0.138748 0.070718 1
002431.XSHE 0.289960 0.106463 0.120107 0.047904 -0.437662 0.208937 1.231325 0.800984 1.118577 0.849554 ... 0.026242 0.973704 1.462412 1.344508 0.455285 0.939335 0.740287 0.139487 0.073976 1
000651.XSHE -0.419611 -1.230606 -1.425879 1.118061 0.307694 0.027272 0.762115 1.286906 -0.008618 0.675532 ... -0.543453 -0.098476 -0.489475 -0.727966 -0.642664 0.796666 0.845183 -0.330869 0.078773 1
002038.XSHE -0.538867 -0.949778 -0.732675 -0.053173 -0.254428 -0.252169 0.452339 0.548469 0.980556 1.095278 ... -0.227217 -0.625242 -0.949553 -0.873113 -1.117598 -1.763377 -1.502315 -0.763865 0.081422 1
600859.XSHG 0.174558 0.249184 0.709174 -0.005873 0.631249 -0.715380 -0.417121 -0.272449 -0.387305 -0.232204 ... 1.466069 0.859221 -0.846247 -0.867853 -1.397736 -1.568108 -0.739348 -1.260990 0.087922 1
002353.XSHE -0.134379 -0.418818 -0.297253 0.093242 -0.104265 0.393295 2.301361 2.666142 2.910887 3.185530 ... -0.194131 -0.764993 1.775606 1.182167 -0.520576 2.061677 2.245906 -2.157168 0.090355 1
002344.XSHE -0.412449 -1.305909 -2.005687 -0.343428 -0.194461 0.466977 1.539826 1.052538 1.963830 1.458686 ... -2.220716 -2.164643 -0.483827 -0.769876 -1.025165 -1.264064 -0.517739 -1.160696 0.092265 1
600062.XSHG 0.015265 -0.145719 -0.078995 0.190691 0.007572 -0.336422 -0.599105 -0.476129 -0.380900 -0.202776 ... -0.740355 -1.241207 -0.731070 -0.451970 -0.189394 1.987875 1.345734 0.637467 0.092443 1
002051.XSHE 0.662932 0.417339 0.448513 0.274389 -0.021163 -0.015847 1.196050 1.579642 0.764441 0.937565 ... 0.498952 0.817594 0.866168 0.792453 0.174041 -2.305717 -2.517494 2.258645 0.092629 1
600873.XSHG -0.907022 -0.120058 -0.241906 0.248707 0.020900 -0.438278 -1.939781 -1.726519 -2.222759 -2.098000 ... 0.128432 -0.197061 -0.650798 -0.228685 1.694538 0.739207 0.847563 -0.694428 0.092784 1
600875.XSHG 1.280535 1.308062 1.607978 1.432229 0.851162 -0.598115 0.971294 1.137612 0.377022 0.443275 ... -0.366564 0.113943 1.234419 1.316257 0.530772 0.470040 0.061041 0.626221 0.093190 1
600010.XSHG -0.685397 -2.128134 -1.757104 0.036846 0.138641 -0.304107 -0.488046 -0.385950 -0.447053 -0.392533 ... -0.979394 -1.056849 -0.874804 -1.054628 -1.635823 -0.436518 -0.085497 -0.675888 0.100000 1
600383.XSHG 0.109267 0.007560 0.325999 -0.366492 0.448903 -1.953387 -0.735331 -0.643937 -0.349126 -0.406657 ... -0.630525 -0.118691 -0.275433 -0.466084 0.072946 0.390424 0.429765 -0.132612 0.107570 1
002385.XSHE -0.058794 -0.763792 -0.443226 -0.176786 -0.195929 -0.524683 0.998812 0.083053 1.440451 0.321642 ... -1.108461 -0.984246 -0.688405 -0.657357 -0.260509 -0.833264 -0.617463 -0.535587 0.115570 1
600085.XSHG -0.320429 -0.760672 -0.417198 1.383557 -0.102754 -0.376286 -0.652079 -0.779401 -0.573246 -0.724716 ... -0.589341 -0.838624 -0.761572 -0.851642 -0.230226 -1.766027 -1.504032 1.161345 0.116658 1
600600.XSHG -0.872722 -0.753274 -0.264229 0.472731 0.020361 0.039284 -0.947307 -0.045656 -0.991207 0.039137 ... -0.405104 -0.834415 -1.140018 -0.677486 -0.465365 1.531872 2.004210 -1.557847 0.118047 1
000876.XSHE 1.100259 0.190652 1.993477 0.171924 0.409285 -1.854602 0.836407 0.677926 0.931414 1.078163 ... -0.985522 -0.052069 -0.881026 -0.637804 -0.412150 -0.773408 -0.659267 -0.348566 0.123862 1
600196.XSHG -0.093481 -0.313940 -0.542384 0.061656 -0.230569 0.249560 -0.721332 -1.055811 -0.780183 -1.146487 ... 2.026238 0.940875 -0.987536 -0.673922 -1.545417 2.572600 3.268202 -3.334972 0.126536 1
000883.XSHE -0.942596 -0.042609 -0.637517 -2.401748 -0.383065 0.314144 -0.859761 -1.115513 -0.591986 -0.654318 ... -1.269979 -1.566752 -0.087526 -0.596491 -0.195732 0.007835 0.077862 -0.204332 0.126923 1
002422.XSHE 0.011247 -0.038333 -0.422562 0.596048 0.010152 -0.250485 -0.770060 -0.632135 -1.059059 -0.842909 ... 0.289315 -0.112635 -0.553314 -1.049855 -0.954339 -1.280748 -1.501790 1.017469 0.131336 1
002007.XSHE -0.503363 -0.720398 -0.700548 0.315512 -0.017371 -0.259966 -0.675518 -0.299809 0.179116 0.630590 ... -0.020485 0.177289 0.048873 -0.775451 -1.004383 -0.679244 -0.138067 -1.178437 0.138740 1
600150.XSHG -0.799910 0.893093 0.279119 -2.269365 -1.316620 0.838701 -0.812839 -0.508223 -0.675602 -0.592327 ... -0.028588 0.079554 1.810216 1.850153 2.880992 2.072110 2.252665 2.340815 0.139073 1
000963.XSHE -0.245027 -0.856743 0.319901 -0.022001 -0.393461 -0.267133 0.706364 0.331815 -0.468074 -0.424644 ... 0.105847 0.275388 -0.939725 -0.918321 -1.507453 -0.958342 -0.014469 -1.909879 0.140575 1
000596.XSHE -0.250955 -0.343144 -0.271222 -1.967373 -0.041948 -0.446417 -0.538039 -0.932088 0.065171 -0.770943 ... 1.522942 1.295109 -0.436997 -0.467456 -0.400809 2.300280 2.827565 -0.220746 0.141176 1
600276.XSHG -0.603900 -0.963671 -0.675125 0.108870 -0.160117 -0.417082 0.005417 -0.114954 0.953771 1.066472 ... 0.505386 0.631655 -0.806648 -0.555916 0.198815 1.703565 1.044202 0.710243 0.152260 1
000729.XSHE -0.728202 0.258428 0.033015 -0.000422 0.122282 -0.461471 -1.976899 -0.823760 -1.871123 -0.509208 ... -0.733784 -0.453398 -0.852614 -0.339944 0.354340 0.933502 1.102975 -0.848701 0.155096 1
000581.XSHE 0.261958 0.277667 -0.129266 -0.088535 0.091233 -0.001918 0.826967 0.670129 1.761931 1.648232 ... 1.480982 1.478227 1.618169 1.387193 0.341498 2.073535 2.253588 -0.277483 0.166022 1
600588.XSHG 0.156879 0.007058 0.047764 0.449100 0.387652 -0.360739 1.136623 -1.186450 1.219026 -1.192913 ... 1.531776 1.722551 1.849656 1.211272 0.872691 -0.013752 -0.490483 0.848518 0.172241 1
600690.XSHG -0.102724 -1.197982 -0.569657 0.112731 -0.419247 -0.785047 0.628862 1.261226 0.307113 0.823273 ... -0.832264 -0.753233 -0.580476 -0.606552 0.095736 0.513175 0.567618 -0.274850 0.173797 1

8618 rows × 68 columns

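2.2 Model training and testing (about 6 h)¶

The first 4 years of data are used for cross-validation to choose the parameters. The method is K-fold cross-validation with the data split into K groups (K=10), as follows:
(1) Randomly pick 1 group as the validation set (10%); the remaining K-1 groups form the training set (90%).
(2) Fit the model on the training set, then feed the validation set to the model to obtain predicted labels.
(3) Compare the predicted labels with the actual labels and compute the evaluation metrics (accuracy, AUC, etc.).
(4) Repeat steps 1-3 n times and average the metrics to obtain the final evaluation.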
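
The splitting in steps (1) to (4) can be sketched with the standard library alone. `k_fold_indices` is a hypothetical helper, and the variant shown is plain K-fold, in which every sample lands in the validation set exactly once; it is an illustration of the idea, not the notebook's own splitting code.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal groups
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

splits = list(k_fold_indices(n_samples=50, k=10))
```

With 50 samples and K=10, each split trains on 45 samples and validates on 5, matching the 90%/10% proportions in the steps above.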

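First, choose the kernel. The parameter settings below are used to compare the effect of different kernels on the model.

Kernel                   Parameters
Linear                   C = 1e-4
Polynomial (degree 3)    C = 0.003, gamma = 0.03
Polynomial (degree 7)    C = 0.03, gamma = 0.01
Gaussian                 C = 1, gamma = 3e-5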

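Because the Gaussian kernel can map data of any dimension into an infinite-dimensional space, it is the more meaningful choice here. Next, determine the model parameters: the penalty coefficient C and gamma are the two most important parameters of an SVM. We want to sweep C and gamma jointly and find the best combination over the whole grid. The most common method for parameter tuning is grid search, demonstrated below with the SVM.
Take C = (0.01, 0.03, 0.1, 0.3, 1, 3, 10) and γ = (1e-4, 3e-4, 1e-3, 3e-3, 0.01, 0.03, 0.1, 0.3, 1).
Test every (C, gamma) pair and record the AUC on the cross-validation set. (about 6 h)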
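
A quick count explains why the full sweep is slow. The sketch below only enumerates the grid with the standard library; it fits no model, and the variable names are illustrative.

```python
from itertools import product

C_grid = [0.01, 0.03, 0.1, 0.3, 1, 3, 10]
gamma_grid = [1e-4, 3e-4, 1e-3, 3e-3, 0.01, 0.03, 0.1, 0.3, 1]
n_folds = 10

# Every (C, gamma) pair is one candidate model
candidates = [{"C": c, "gamma": g} for c, g in product(C_grid, gamma_grid)]
# Each candidate is fitted once per fold
n_fits = len(candidates) * n_folds
```

The full grid has 7 × 9 = 63 candidates, so 10-fold cross-validation means 630 SVM fits; the reduced grid in the code that follows (3 values of C, 2 of gamma) needs only 60.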

np.array(train)
array([[-1.33990756, -0.82349568, -1.55374297, ...,  1.18757404,
         1.22922898,  0.33547193],
       [ 0.25323793, -0.70970837,  0.31457629, ...,  1.1469006 ,
         0.42613943,  2.01604604],
       [-0.2043338 ,  0.84066448,  1.8272637 , ...,  1.55163125,
         1.98512026, -0.41985356],
       ...,
       [ 0.26195801,  0.27766662, -0.12926636, ...,  2.073535  ,
         2.2535879 , -0.27748326],
       [ 0.15687879,  0.00705828,  0.04776438, ..., -0.01375164,
        -0.49048256,  0.84851754],
       [-0.10272446, -1.19798227, -0.56965663, ...,  0.51317528,
         0.56761772, -0.27485043]])
np.array(target)
array([-1, -1, -1, ...,  1,  1,  1])
regressor.fit(np.array(train),np.array(target))
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
target=train_data['label']
train=train_data.copy()
del train['pchg']
del train['label']

regressor = svm.SVC()
# SVM parameter grid
# parameters = {'kernel':['rbf'],'C':[0.01,0.03,0.1,0.3,1,3,10],\
#               'gamma':[1e-4,3e-4,1e-3,3e-3,0.01,0.03,0.1,0.3,1]}
parameters = {'kernel':['rbf'],'C':[1,3,10],\
              'gamma':[0.01,0.03]}
# Build the grid search. scoring: the evaluation metric; cv: number of cross-validation folds
clf = GridSearchCV(regressor,parameters,scoring='roc_auc',cv=10)
clf.fit(np.array(train),np.array(target))

# Print the cross-validation result table
print(clf.cv_results_)
# Print the score of each parameter combination
print(clf.grid_scores_)
# Print the best score
print(clf.best_score_)
# Print the best parameters
print(clf.best_params_)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-127-34c731808881> in <module>()
     14 
     15 # Print the cross-validation result table
---> 16 print(clf.cv_results_)
     17 # Print the score of each parameter combination
     18 print(clf.grid_scores_)

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'
'''
Option 1 (removed in version 0.20):
     grid_search.grid_scores_
Option 2 (the approach that works in version 0.20):
     means = grid_search.cv_results_['mean_test_score']
     params = grid_search.cv_results_['params']
'''
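
The two options in the note above can be wrapped in one small helper, so the same notebook cell runs on either scikit-learn generation. This is only a sketch: grid_search_summary is a hypothetical name, and the two stand-in objects below are not real estimators, they just mimic the attributes named in the note (the old grid_scores_ entries expose parameters and mean_validation_score).

```python
from collections import namedtuple
from types import SimpleNamespace

def grid_search_summary(search):
    """Return [(mean_score, params), ...] from a fitted grid-search object."""
    if hasattr(search, "cv_results_"):
        # Newer API: a dict of parallel lists
        res = search.cv_results_
        return list(zip(res["mean_test_score"], res["params"]))
    # Older API: a list of named tuples
    return [(s.mean_validation_score, s.parameters) for s in search.grid_scores_]

# Stand-ins (assumptions for illustration, not sklearn objects)
OldScore = namedtuple("OldScore", ["parameters", "mean_validation_score"])
old_style = SimpleNamespace(
    grid_scores_=[OldScore({"C": 1, "gamma": 0.03}, 0.553)])
new_style = SimpleNamespace(
    cv_results_={"mean_test_score": [0.553],
                 "params": [{"C": 1, "gamma": 0.03}]})

old_summary = grid_search_summary(old_style)
new_summary = grid_search_summary(new_style)
print(old_summary == new_summary)
```

With a real fitted search one would call grid_search_summary(clf) instead of touching either attribute directly.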
import sklearn
print(sklearn.__version__)
0.18.1
# Print the score of every parameter combination
print(clf.grid_scores_)
# Print the best score
print(clf.best_score_)
# Print the best parameters
print(clf.best_params_)
[mean: 0.54892, std: 0.03498, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.01}, mean: 0.55322, std: 0.03532, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.03}, mean: 0.55224, std: 0.03255, params: {'kernel': 'rbf', 'C': 3, 'gamma': 0.01}, mean: 0.54560, std: 0.03482, params: {'kernel': 'rbf', 'C': 3, 'gamma': 0.03}, mean: 0.55037, std: 0.03197, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.01}, mean: 0.54393, std: 0.03503, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.03}]
0.5532239884430333
{'kernel': 'rbf', 'C': 1, 'gamma': 0.03}
 

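Based on these results we choose C=1, gamma=0.03 (the best_params_ printed above), train on the first 4 years of data, and cross-validate in-sample; the predictions are shown below. The original study reports in-sample training and cross-validation accuracies of 81.9% and 56.2% for the Gaussian-kernel SVM, with AUCs of 0.819 and 0.561; this rerun prints about 81.1% and 55.5% for both accuracy and AUC. (about 5 min)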

# Number of random train/validation splits
m=10
# Features and labels
target=train_data['label']
train=train_data.copy()
del train['pchg']
del train['label']
train.head()

train_score=[]
test_score=[]
train_auc=[]
test_auc=[]


# Build the model
clf = svm.SVC(C=1,gamma=0.03,kernel='rbf')

for i in range(m):
    # Hold out a random 10% of the data as the cross-validation set
    X_train,X_test, y_train, y_test =train_test_split(np.array(train),\
                                                      np.array(target),test_size=0.1)
    # Fit the model
    clf.fit(X_train,y_train)
    # Predict class labels
    train_predict=clf.predict(X_train)
    test_predict=clf.predict(X_test)
    # Accuracy on the in-sample training set
    train_score.append(clf.score(X_train, y_train))
    # Accuracy on the cross-validation set
    test_score.append(clf.score(X_test, y_test))
    # AUC on the in-sample training set (computed from hard class labels)
    train_auc.append(metrics.roc_auc_score(y_train,train_predict)) 
    # AUC on the cross-validation set (computed from hard class labels)
    test_auc.append(metrics.roc_auc_score(y_test,test_predict)) 
    
print('In-sample training accuracy:',np.mean(train_score))
print('Cross-validation accuracy:',np.mean(test_score))
print('In-sample training AUC:',np.mean(train_auc))
print('Cross-validation AUC:',np.mean(test_auc))
In-sample training accuracy: 0.8110624033006705
Cross-validation accuracy: 0.5545243619489559
In-sample training AUC: 0.8110425478617895
Cross-validation AUC: 0.5545502084708616
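
The printed AUC values are almost identical to the accuracies. One likely reason is that the AUC is computed from the output of clf.predict, i.e. from hard -1/+1 labels: in that case the AUC collapses to the average of the true-positive and true-negative rates. The pure-Python sketch below illustrates this with made-up numbers; auc_from_scores and balanced_accuracy are hypothetical helpers, not part of the original notebook.

```python
def auc_from_scores(y_true, scores):
    """Rank-based AUC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y != 1 and p != 1)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Made-up labels and decision values, for illustration only
y_true = [1, 1, 1, -1, -1, -1]
margins = [0.9, 0.2, -0.1, 0.3, -0.4, -0.8]
labels = [1 if m > 0 else -1 for m in margins]

auc_labels = auc_from_scores(y_true, labels)    # AUC from hard labels
auc_margins = auc_from_scores(y_true, margins)  # AUC from continuous scores
print(auc_labels, balanced_accuracy(y_true, labels), auc_margins)
```

If a ranking-based AUC is wanted, one option is to pass clf.decision_function(X_test) rather than clf.predict(X_test) to metrics.roc_auc_score; the figures printed above would then change.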

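Over the last 4 years, factor data are collected at each month end and fed to the model for prediction. The original study reports an out-of-sample accuracy of 55.1% and a mean AUC of 0.55; this rerun prints about 52.2% for both. Note that the model in this step is fitted with C=10, gamma=0.03. The month-by-month curves of out-of-sample accuracy and AUC are plotted below. (about 1 min)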

# Features and labels
train_target=train_data['label']
train_feature=train_data.copy()
del train_feature['pchg']
del train_feature['label']

test_sample_predict={}
test_sample_score=[]
test_sample_auc=[]
test_sample_date=[]

# Build the model
clf = svm.SVC(C=10,gamma=0.03,kernel='rbf')
# Fit on the full in-sample set
clf.fit(np.array(train_feature),np.array(train_target))

for date in dateList[4*12:-1]:
    test_sample_date.append(date)
    # Out-of-sample features and labels for this month
    test_target=test_data[date]['label']
    test_feature=test_data[date].copy()
    del test_feature['pchg']
    del test_feature['label']
    test_target=np.array(test_target)
    test_feature=np.array(test_feature)
    # Predict class labels
    test_predict=clf.predict(test_feature)
    # Store the out-of-sample predictions
    test_sample_predict[date]=test_predict
    # Out-of-sample accuracy
    test_sample_score.append(clf.score(test_feature, test_target))
    # Out-of-sample AUC (computed from hard class labels)
    test_sample_auc.append(metrics.roc_auc_score(test_target,test_predict))
print('Test-set accuracy:',np.mean(test_sample_score))
print('Test-set AUC:',np.mean(test_sample_auc))
Test-set accuracy: 0.522491852886406
Test-set AUC: 0.5225434512013857
xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_score = test_sample_score
# Format the x axis as dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))

plt.plot(xs_date, ys_score,'r')
# Rotate the date labels automatically
plt.gcf().autofmt_xdate() 
# x-axis label
plt.xlabel('date')
# y-axis label
plt.ylabel("test accuracy")
plt.show()
xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_auc = test_sample_auc
# Format the x axis as dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))

plt.plot(xs_date, ys_auc,'r')
# Rotate the date labels automatically
plt.gcf().autofmt_xdate() 
# x-axis label
plt.xlabel('date')
# y-axis label
plt.ylabel("test AUC")
plt.show()

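On each cross-section, we compute the correlation between the Gaussian-kernel SVM's predictions of next-period up/down moves for all stocks (the model fitted above with C=10, γ=0.03) and each factor in the factor pool, to see how the predictions "really" relate to the individual factors, as shown in the figure below. We find that the predictions are only weakly related to trading-type factors such as turnover and technical factors, and more strongly related to fundamental factors.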

factor_predict_corr=pd.DataFrame()
for date in dateList[4*12:-1]:
    test_feature=test_data[date].copy()
    del test_feature['pchg']
    del test_feature['label']
    test_feature['predict']=list(test_sample_predict[date])
    factor_predict_corr[date]=test_feature.corr()['predict']    
factor_predict_corr=factor_predict_corr.iloc[:-1]
# Heatmap of the correlation between the Gaussian-kernel SVM's next-period predictions and the current-period factor values
fig = plt.figure(figsize= (15,10))
ax = fig.add_subplot(111)
sns.set()
ax = sns.heatmap(factor_predict_corr)

2.3 Summary

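This part tests the SVM model. First, the kernels are grouped into four classes as needed: linear, 3rd-degree polynomial, 7th-degree polynomial, and Gaussian. Next, a range of C and gamma values is searched by cross-validation to find the best parameters for the Gaussian-kernel SVM; the grid search above selects C=1, gamma=0.03. The model is then checked on in-sample and out-of-sample data: the original study reports in-sample training and cross-validation accuracies of 81.9% and 56.2%, AUCs of 0.819 and 0.561, and an out-of-sample accuracy of 55.1% with a mean AUC of 0.55, while this rerun prints about 81.1% and 55.5% in-sample and 52.2% out-of-sample. Finally, the correlation between the model's predictions and the factors is evaluated: the predictions are only weakly related to trading-type factors such as turnover and technical factors, and more strongly related to fundamental factors.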

3 Strategy construction (about 36 min)

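Universe: CSI 300 (HS300) constituents, excluding ST stocks, suspended stocks, and stocks listed for less than 3 months
Backtest period: 2014-01-01 to 2018-01-01
Rebalancing: first trading day of each month
SVM model parameters: Gaussian kernel, C=10, γ=0.03
Stock selection:
(1) Build the SVM model on 2010-2014 data
(2) Use the SVM model to predict next-month returns over 2014-2018
(3) Select stocks according to the predicted values
Evaluation: annualized backtest return, Sharpe ratio, maximum drawdown, win rate, etc.
Annualized backtest return: the return an investment would earn over one year. Total returns of backtests with different lengths are not directly comparable, because compounding makes the total return grow with the length of the period; the annualized return puts them on a common scale and measures the model's earning power.
Sharpe ratio: return and risk generally go together, so a model that earns a return also bears the corresponding risk. The Sharpe ratio measures the excess return earned per unit of risk (volatility), so a higher value indicates a better balance between return and risk.
Maximum drawdown: the largest peak-to-trough loss the model suffered over a past period, commonly used to measure risk. In real investing, a large maximum drawdown tends to make investors lose confidence in the model, so keeping it under control is especially important.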
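
The three metrics described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the platform's own calculation: 252 trading days per year (the same convention used later in this post for the annualized excess return), a risk-free rate that defaults to 0, and hypothetical function names.

```python
import math

def annualized_return(total_return, n_days, days_per_year=252):
    """Convert a total return over n_days trading days to a yearly rate."""
    return (1.0 + total_return) ** (days_per_year / n_days) - 1.0

def sharpe_ratio(daily_returns, risk_free_annual=0.0, days_per_year=252):
    """Annualized mean excess daily return divided by its volatility."""
    rf_daily = risk_free_annual / days_per_year  # assumption: flat daily rate
    excess = [r - rf_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    var = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(days_per_year)

def max_drawdown(cum_returns):
    """Largest peak-to-trough loss of a cumulative-return series."""
    peak, worst = 0.0, 0.0
    for r in cum_returns:
        peak = max(peak, r)
        worst = max(worst, 1.0 - (1.0 + r) / (1.0 + peak))
    return worst

# Made-up numbers for illustration
ann = annualized_return(0.21, 504)              # 21% over two years
mdd = max_drawdown([0.0, 0.10, 0.21, 0.089, 0.1979])
sr = sharpe_ratio([0.01, -0.01, 0.02, 0.0])
print(round(ann, 4), round(mdd, 4), round(sr, 3))
```

The drawdown function follows the same running-peak logic as the find_max_drawdown method of the class shown later in this section.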

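3.1 Equal-weight stock selection (about 2 min)

Strategy steps:
(1) Each month, sort the stocks by predicted value
(2) Split the sorted stocks into N layers (N=5)
(3) Backtest each layer separately with equal weights within the layer, giving 5 backtest curves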
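
The steps above can be sketched as follows. This is a minimal stand-alone illustration, not the backtest strategy itself: split_into_layers and equal_weights are hypothetical helpers, and the stock codes and scores are made up.

```python
def split_into_layers(predictions, n_layers=5):
    """predictions: dict stock -> predicted value.
    Returns a list of stock lists; layer 0 holds the highest predictions."""
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    size = len(ranked) / n_layers
    return [ranked[int(round(i * size)):int(round((i + 1) * size))]
            for i in range(n_layers)]

def equal_weights(stocks):
    """Give every stock in a layer the same weight."""
    return {s: 1.0 / len(stocks) for s in stocks}

# Ten made-up stocks with strictly decreasing scores
preds = {"S%02d" % i: 1.0 - 0.1 * i for i in range(10)}
layers = split_into_layers(preds, n_layers=5)
top_weights = equal_weights(layers[0])
print(layers[0], layers[4], top_weights)
```

Each of the 5 layers would then be backtested as its own portfolio and rebalanced monthly.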

# Import the required packages
import datetime
import warnings
from math import log
import numpy as np 
import pandas as pd
import time
from jqdata import *
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import copy
import pickle

class parameter_analysis(object):
    
    # Attributes used throughout the class
    def __init__(self, algorithm_id=None):
        self.algorithm_id = algorithm_id            # strategy id
        
        self.params_df = pd.DataFrame()             # all candidate parameter values; column names are the tuned variable names, i.e. g.XXXX in the strategy
        self.results = {}                           # backtest returns; key is the row index of params_df, value is a list of per-day result dicts
        self.evaluations = {}                       # backtest metrics; key is the row index of params_df, value is a dataframe
        self.backtest_ids = {}                      # backtest ids
        
        # id of an additional benchmark backtest; leave as '' to use the benchmark set in the strategy
        self.benchmark_id = ''                      
        
        self.benchmark_returns = []                 # returns of the additional benchmark
        self.returns = {}                           # all returns
        self.excess_returns = {}                    # excess returns
        self.log_returns = {}                       # log of returns
        self.log_excess_returns = {}                # log of excess returns
        self.dates = []                             # all dates covered by the backtest
        self.excess_max_drawdown = {}               # maximum drawdown of excess returns
        self.excess_annual_return = {}              # annualized excess return
        self.evaluations_df = pd.DataFrame()        # all backtest metrics except daily returns
    # Queue up and run the multi-parameter backtests
    def run_backtest(self,                          #
                     algorithm_id=None,             # strategy id
                     running_max=10,                # maximum number of backtests running at once
                     start_date='2016-01-01',       # backtest start date
                     end_date='2016-04-30',         # backtest end date
                     frequency='day',               # backtest frequency
                     initial_cash='1000000',        # initial cash
                     param_names=[],                # names of the variables being tuned
                     param_values=[]                # candidate values for each variable
                     ):
        # Fall back to the strategy id given to the class when none is passed here
        if algorithm_id is None: algorithm_id=self.algorithm_id
        
        # Build every parameter combination and load them into a df
        # A list of tuples, one tuple per combination of candidate values
        param_combinations = list(itertools.product(*param_values))
        # One column per tuned variable, one row per combination
        to_run_df = pd.DataFrame(param_combinations)
        # Rename the columns to the tuned variable names
        to_run_df.columns = param_names
        to_run_df['backtestID']=''
        to_run_df['state']='waiting'
        to_run_df['times']=0
        

        # Record the start time
        start = time.time()
        # Finished backtests
        finished_backtests = {}
        # Running backtests
        running_backtests = {}
        failed_backtests={}
        running_count=0

        # Total number of backtests, equal to the number of parameter combinations
        total_backtest_num = len(param_combinations)
        # Returns of each backtest
        all_results = {}
        # Metrics of each backtest
        all_evaluations = {}
        
        # Header printed when the run starts
        print('[done|running|waiting||failed]:' ) 
        # Loop while any backtest is still waiting or running
        while len(to_run_df[(to_run_df.state=='waiting') | (to_run_df.state=='running')].index)>0:
            # Print the number of finished, running, waiting and failed backtests
            print('[%s|%s|%s||%s].' % (len(finished_backtests), 
                                   len(running_backtests), 
                                   (total_backtest_num-len(finished_backtests)-len(running_backtests)- len(failed_backtests)),
                                    len(failed_backtests)  
                                      ))
            # Start waiting backtests in the free slots
            for index in (to_run_df[to_run_df.state=='waiting'].index):
                # Skip when every slot is already taken
                if running_count>=running_max:
                    continue
                # Turn this row of the parameter df into a dict: key is the column name, value is the cell value
                params = to_run_df.loc[index,param_names].to_dict()
                # Create the backtest and keep its id; the tuned parameters are passed through extras
                backtest = create_backtest(algorithm_id = algorithm_id,
                                           start_date = start_date, 
                                           end_date = end_date, 
                                           frequency = frequency, 
                                           initial_cash = initial_cash, 
                                           extras = params, 
                                           # Name the backtest after the parameter values it uses
                                           name =str( params).replace('{','').replace('}','').replace('\'','')
                                           )
                # Record the backtest id and mark the row as running
                to_run_df.at[index,'backtestID'] = backtest
                to_run_df.at[index,'state']='running'
                to_run_df.at[index,'times']=to_run_df.at[index,'times']+1
                running_count=running_count+1

            # Collect backtest results
            failed = []
            finished = []
            # Check every running backtest; index is its row in to_run_df

            for index in to_run_df[to_run_df.state=='running'].index:
                # Fetch the backtest object by its stored id
                bt = get_backtest(to_run_df.at[index,'backtestID'])
                # Success and failure are only reported once the backtest has ended; anything else means it is still running
                status = bt.get_status()
                # The backtest failed
                if status in [ 'failed','canceled','deleted']:
                    # Remember the row of the failed backtest
                    failed.append(index)
                # The backtest succeeded
                elif status == 'done':
                    # Remember the row of the successful backtest
                    finished.append(index)
                    # Returns: key is the row in to_run_df, value is a list of dicts,
                    # each holding the date, the daily return and the benchmark return
                    all_results[index] = bt.get_results()
                    # Metrics: key is the row in to_run_df, value is the risk metrics of that backtest
                    all_evaluations[index] = bt.get_risk()
            # Retry failed backtests up to 3 times, then mark them as failed
            for index in failed:
                if to_run_df.at[index,'times']<3:
                    to_run_df.at[index,'state']='waiting'
                else:
                    to_run_df.at[index,'state']='failed'

            # Mark successful backtests as done
            for index in finished:
                to_run_df.at[index,'state']='done'
            
            running_count=len(to_run_df[to_run_df.state=='running'].index)
            running_backtests=to_run_df[to_run_df.state=='running']['backtestID'].to_dict()
            finished_backtests=to_run_df[to_run_df.state=='done']['backtestID'].to_dict()
            failed_backtests=to_run_df[to_run_df.state=='failed']['backtestID'].to_dict()
            
            # Report timing whenever a full batch of backtests has finished
            if len(finished_backtests) != 0 and len(finished_backtests) % running_max == 0 :
                # Current time
                middle = time.time()
                # Estimate the remaining time, assuming every backtest takes equally long
                remain_time = (middle - start) * (total_backtest_num - len(finished_backtests)) / len(finished_backtests)
                # Print elapsed and remaining time
                print('[elapsed %s h, remaining %s h, please keep the browser open].' % (str(round((middle - start) / 60.0 / 60.0,3)), 
                                          str(round(remain_time / 60.0 / 60.0,3))))
            # Poll again in 5 seconds
            

            time.sleep(5) 
        # Record the end time
        end = time.time() 
        print('')
        print('[Backtests finished] total time: %s s (%s h).' % (str(int(end-start)), 
                                           str(round((end-start)/60.0/60.0,2))))
        # Store the results on the instance
        self.params_df = to_run_df.loc[:,param_names]
        self.results = all_results
        self.evaluations = all_evaluations
        self.backtest_ids = finished_backtests

        
    # Maximum drawdown
    def find_max_drawdown(self, returns):
        # Largest drawdown seen so far
        result = 0
        # Highest cumulative return seen so far
        historical_return = 0
        # Walk through every date
        for i in range(len(returns)):
            # Update the running peak
            historical_return = max(historical_return, returns[i])
            # Drawdown relative to the running peak
            drawdown = 1-(returns[i] + 1) / (historical_return + 1)
            # Keep the largest drawdown
            result = max(drawdown, result)
        # Return the maximum drawdown
        return result

    # Log returns, excess returns over the (new) benchmark, and maximum drawdown relative to it
    def organize_backtest_results(self, benchmark_id=None):
        # No id given for a new benchmark backtest
        if benchmark_id is None:
            # Use the default benchmark returns, i.e. the benchmark set in the strategy
            self.benchmark_returns = [x['benchmark_returns'] for x in self.results[0]]
        # A new benchmark was given
        else:
            # Use the returns of the new benchmark backtest
            self.benchmark_returns = [x['returns'] for x in get_backtest(benchmark_id).get_results()]
        # Backtest dates are taken from the first result
        self.dates = [x['time'] for x in self.results[0]]
        
        # For each backtest (key = its position among the candidates), convert
        # {key:{u'benchmark_returns': 0.022480100091729405,
        #           u'returns': 0.03184566700000002,
        #           u'time': u'2006-02-14'}} into
        # {key: []}, where the list holds the return for each date
        for key in self.results.keys():
            self.returns[key] = [x['returns'] for x in self.results[key]]
        # Excess return over the benchmark (or the new benchmark)
        for key in self.results.keys():
            self.excess_returns[key] = [(x+1)/(y+1)-1 for (x,y) in zip(self.returns[key], self.benchmark_returns)]
        # Log returns
        for key in self.results.keys():
            self.log_returns[key] = [log(x+1) for x in self.returns[key]]
        # Log excess returns
        for key in self.results.keys():
            self.log_excess_returns[key] = [log(x+1) for x in self.excess_returns[key]]
        # Maximum drawdown of the excess returns
        for key in self.results.keys():
            self.excess_max_drawdown[key] = self.find_max_drawdown(self.excess_returns[key])
        # Annualized excess return
        for key in self.results.keys():
            self.excess_annual_return[key] = (self.excess_returns[key][-1]+1)**(252./float(len(self.dates)))-1
        # Join the parameter combinations with the corresponding metrics
        self.evaluations_df = pd.concat([self.params_df, pd.DataFrame(self.evaluations).T], axis=1)

    # Produce the final analysis data: run the queued backtests, then organize the results
    def get_backtest_data(self,
                          algorithm_id=None,                         # strategy id
                          benchmark_id=None,                         # id of the new benchmark backtest
                          file_name='results1.pkl',                   # name of the pickle file for the results
                          running_max=10,                            # maximum number of backtests running at once
                          start_date='2006-01-01',                   # backtest start date
                          end_date='2016-11-30',                     # backtest end date
                          frequency='day',                           # backtest frequency
                          initial_cash='1000000',                    # initial cash
                          param_names=[],                            # variables to test
                          param_values=[]                            # candidate values for each variable
                          ):
        # Run the queued backtests with these arguments
        self.run_backtest(algorithm_id=algorithm_id,
                          running_max=running_max,
                          start_date=start_date,
                          end_date=end_date,
                          frequency=frequency,
                          initial_cash=initial_cash,
                          param_names=param_names,
                          param_values=param_values
                          )
        # Add log returns, excess returns and related metrics
        self.organize_backtest_results(benchmark_id)
        # Collect everything in one dict
        results = {'returns':self.returns,
                   'excess_returns':self.excess_returns,
                   'log_returns':self.log_returns,
                   'log_excess_returns':self.log_excess_returns,
                   'dates':self.dates,
                   'benchmark_returns':self.benchmark_returns,
                   'evaluations':self.evaluations,
                   'params_df':self.params_df,
                   'backtest_ids':self.backtest_ids,
                   'excess_max_drawdown':self.excess_max_drawdown,
                   'excess_annual_return':self.excess_annual_return,
                   'evaluations_df':self.evaluations_df}
        # Save to a pickle file
        with open(file_name, 'wb') as pickle_file:
            pickle.dump(results, pickle_file)

    # Load a saved pickle file and restore the corresponding attributes
    def read_backtest_data(self, file_name='results.pkl'):
        with open(file_name, 'rb') as pickle_file:
            results = pickle.load(pickle_file)
        self.returns = results['returns']
        self.excess_returns = results['excess_returns']
        self.log_returns = results['log_returns']
        self.log_excess_returns = results['log_excess_returns']
        self.dates = results['dates']
        self.benchmark_returns = results['benchmark_returns']
        self.evaluations = results['evaluations']
        self.params_df = results['params_df']
        self.backtest_ids = results['backtest_ids']
        self.excess_max_drawdown = results['excess_max_drawdown']
        self.excess_annual_return = results['excess_annual_return']
        self.evaluations_df = results['evaluations_df']
        
    # Line chart of returns
    def plot_returns(self):
        # figsize sets the figure width and height in inches
        fig = plt.figure(figsize=(20,8))
        ax = fig.add_subplot(111)
        # Plot each backtest
        for key in self.returns.keys():
            ax.plot(range(len(self.returns[key])), self.returns[key], label=key)
        # Benchmark curve and its label
        ax.plot(range(len(self.benchmark_returns)), self.benchmark_returns, label='benchmark', c='k', linestyle='--') 
        ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
        plt.xticks(ticks, [self.dates[i] for i in ticks])
        # Legend
        ax.legend(loc = 2, fontsize = 10)
        # y-axis label
        ax.set_ylabel('returns',fontsize=20)
        # Format the y tick labels as percentages
        ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
        # Title
        ax.set_title("Strategy's performances with different parameters", fontsize=21)
        plt.xlim(0, len(self.returns[0]))
    
    # Long-short portfolio chart
    def plot_long_short(self):
        # figsize sets the figure width and height in inches
        fig = plt.figure(figsize=(20,8))
        ax = fig.add_subplot(111)
        # Net values of the top (0) and bottom (4) portfolios, starting from 1
        a1 = [i+1 for i in self.returns[0]]
        a2 = [i+1 for i in self.returns[4]]
        a1.insert(0,1)   
        a2.insert(0,1)
        # Daily long-short return: half the difference of the two daily returns
        b = []
        for i in range(len(a1)-1):
            b.append((a1[i+1]/a1[i]-a2[i+1]/a2[i])/2)
        # Compound the daily returns into a net-value curve
        c = []
        c.append(1)
        for i in range(len(b)):
            c.append(c[i]*(1+b[i]))
        ax.plot(range(len(c)), c)
        ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
        plt.xticks(ticks, [self.dates[i] for i in ticks])
        # Legend
        ax.legend(loc = 2, fontsize = 10)
        # Title
        ax.set_title("Strategy's long_short performances",fontsize=20)
        # x-axis range
        plt.xlim(0, len(c))     
        
    # Returns by calendar year, for ranking the portfolios
    def get_profit_year(self):
        profit_year = {}
        for key in self.returns.keys():
            temp = []
            date_year = []
            for i in range(len(self.dates)-1):
                if self.dates[i][:4] != self.dates[i+1][:4]:
                    temp.append(self.returns[key][i])
                    date_year.append(self.dates[i][:4])
            temp.append(self.returns[key][-1])
            date_year.append(self.dates[-1][:4]) 
            temp1 = []
            temp1.append(temp[0])
            for i in range(len(temp)-1):
                temp1.append((temp[i+1]+1)/(temp[i]+1)-1)
            profit_year[key] = temp1
        result = pd.DataFrame(index = list(self.returns.keys()), columns = date_year)
        for key in self.returns.keys():
            result.loc[key,:] = profit_year[key]
        return result
            
    # Excess-return chart
    def plot_excess_returns(self):
        
        # figsize sets the figure width and height in inches
        fig = plt.figure(figsize=(20,8))
        ax = fig.add_subplot(111)
        # Plot each backtest
        for key in self.returns.keys():
            ax.plot(range(len(self.excess_returns[key])), self.excess_returns[key], label=key)
        # Benchmark curve (zero line) and its label
        ax.plot(range(len(self.benchmark_returns)), [0]*len(self.benchmark_returns), label='benchmark', c='k', linestyle='--')
        ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
        plt.xticks(ticks, [self.dates[i] for i in ticks])
        # Legend
        ax.legend(loc = 2, fontsize = 10)
        # y-axis label
        ax.set_ylabel('excess returns',fontsize=20)
        # Format the y tick labels as percentages
        ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
        # Title
        ax.set_title("Strategy's performances with different parameters", fontsize=21)
        plt.xlim(0, len(self.excess_returns[0]))
        
    # Log-return chart
    def plot_log_returns(self):
        # figsize sets the figure width and height in inches
        fig = plt.figure(figsize=(20,8))
        ax = fig.add_subplot(111)
        # Plot each backtest
        for key in self.returns.keys():
            ax.plot(range(len(self.log_returns[key])), self.log_returns[key], label=key)
        # Benchmark curve and its label
        ax.plot(range(len(self.benchmark_returns)), [log(x+1) for x in self.benchmark_returns], label='benchmark', c='k', linestyle='--')
        ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
        plt.xticks(ticks, [self.dates[i] for i in ticks])
        # Legend
        ax.legend(loc = 2, fontsize = 10)
        # y-axis label
        ax.set_ylabel('log returns',fontsize=20)
        # Title
        ax.set_title("Strategy's performances with different parameters", fontsize=21)
        plt.xlim(0, len(self.log_returns[0]))
    
    # Log excess-return chart
    def plot_log_excess_returns(self):
        # figsize sets the figure width and height in inches
        fig = plt.figure(figsize=(20,8))
        ax = fig.add_subplot(111)
        # Plot each backtest
        for key in self.returns.keys():
            ax.plot(range(len(self.log_excess_returns[key])), self.log_excess_returns[key], label=key)
        # Benchmark curve (zero line) and its label
        ax.plot(range(len(self.benchmark_returns)), [0]*len(self.benchmark_returns), label='benchmark', c='k', linestyle='--')
        ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
        plt.xticks(ticks, [self.dates[i] for i in ticks])
        # Legend
        ax.legend(loc = 2, fontsize = 10)
        # y-axis label
        ax.set_ylabel('log excess returns',fontsize=20)
        # Title
        ax.set_title("Strategy's performances with different parameters", fontsize=21)
        plt.xlim(0, len(self.log_excess_returns[0]))

        
    # Four main backtest metrics: total return, maximum drawdown, Sharpe ratio and volatility
    def get_eval4_bar(self, sort_by=[]): 
        
        sorted_params = self.params_df
        for by in sort_by:
            sorted_params = sorted_params.sort_values(by)
        indices = sorted_params.index
        
        fig = plt.figure(figsize=(20,7))

        # Subplot position
        ax1 = fig.add_subplot(221)
        # x axis: quantile group; y axis: metric
        ax1.bar(range(len(indices)), 
                [self.evaluations[x]['algorithm_return'] for x in indices], 0.6, label = 'Algorithm_return')
        plt.xticks([x+0.3 for x in range(len(indices))], indices)
        # Legend
        ax1.legend(loc='best',fontsize=15)
        # y-axis label
        ax1.set_ylabel('Algorithm_return', fontsize=15)
        # Format the y tick labels as percentages
        ax1.set_yticklabels([str(x*100)+'% 'for x in ax1.get_yticks()])
        # Title
        ax1.set_title("Strategy's of Algorithm_return performances of different quantile", fontsize=15)
        # x-axis range
        plt.xlim(0, len(indices))

        # Subplot position
        ax2 = fig.add_subplot(224)
        # x axis: quantile group; y axis: metric
        ax2.bar(range(len(indices)), 
                [self.evaluations[x]['max_drawdown'] for x in indices], 0.6, label = 'Max_drawdown')
        plt.xticks([x+0.3 for x in range(len(indices))], indices)
        # Legend
        ax2.legend(loc='best',fontsize=15)
        # y-axis label
        ax2.set_ylabel('Max_drawdown', fontsize=15)
        # Format the y tick labels as percentages
        ax2.set_yticklabels([str(x*100)+'% 'for x in ax2.get_yticks()])
        # Title
        ax2.set_title("Strategy's of Max_drawdown performances of different quantile", fontsize=15)
        # x-axis range
        plt.xlim(0, len(indices))
        # Subplot position
        ax3 = fig.add_subplot(223)
        # x axis: quantile group; y axis: metric
        ax3.bar(range(len(indices)),
                [self.evaluations[x]['sharpe'] for x in indices], 0.6, label = 'Sharpe')
        plt.xticks([x+0.3 for x in range(len(indices))], indices)
        # Legend
        ax3.legend(loc='best',fontsize=15)
        # y-axis label
        ax3.set_ylabel('Sharpe', fontsize=15)
        # Format the y tick labels as percentages
        ax3.set_yticklabels([str(x*100)+'% 'for x in ax3.get_yticks()])
        # Title
        ax3.set_title("Strategy's of Sharpe performances of different quantile", fontsize=15)
        # x-axis range
        plt.xlim(0, len(indices))

        # Subplot position
        ax4 = fig.add_subplot(222)
        # x axis: quantile group; y axis: metric
        ax4.bar(range(len(indices)), 
                [self.evaluations[x]['algorithm_volatility'] for x in indices], 0.6, label = 'Algorithm_volatility')
        plt.xticks([x+0.3 for x in range(len(indices))], indices)
        # Legend
        ax4.legend(loc='best',fontsize=15)
        # y-axis label
        ax4.set_ylabel('Algorithm_volatility', fontsize=15)
        # Format the y tick labels as percentages
        ax4.set_yticklabels([str(x*100)+'% 'for x in ax4.get_yticks()])
        # Title
        ax4.set_title("Strategy's of Algorithm_volatility performances of different quantile", fontsize=15)
        # x-axis range
        plt.xlim(0, len(indices))
        
    # Annualized return and maximum drawdown, drawn in two colors (positive and negative)
    def get_eval(self, sort_by=[]):

        sorted_params = self.params_df
        for by in sort_by:
            sorted_params = sorted_params.sort_values(by)
        indices = sorted_params.index
        
        # Figure size
        fig = plt.figure(figsize = (20, 8))
        # Single subplot
        ax = fig.add_subplot(111)
        # Bars for the maximum drawdown (drawn as negative values)
        ax.bar([x+0.3 for x in range(len(indices))],
               [-self.evaluations[x]['max_drawdown'] for x in indices], color = '#32CD32',  
                     width = 0.6, label = 'Max_drawdown', zorder=10)
        # Bars for the annualized return
        ax.bar([x for x in range(len(indices))],
               [self.evaluations[x]['annual_algo_return'] for x in indices], color = 'r', 
                     width = 0.6, label = 'Annual_return')
        plt.xticks([x+0.3 for x in range(len(indices))], indices)
        # Legend
        ax.legend(loc='best',fontsize=15)
        # Zero line
        plt.plot([0, len(indices)], [0, 0], c='k', 
                 linestyle='--', label='zero')
        # Legend (refreshed to include the zero line)
        ax.legend(loc='best',fontsize=15)
        # y-axis label
        ax.set_ylabel('Max_drawdown', fontsize=15)
        # Format the y tick labels as percentages
        ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
        # Title
        ax.set_title("Strategy's performances of different quantile", fontsize=15)
        # x-axis range
        plt.xlim(0, len(indices))

    # Annualized return and maximum drawdown of the excess returns,
    # measured against the (new) benchmark
    def get_excess_eval(self, sort_by=[]):

        sorted_params = self.params_df
        for by in sort_by:
            sorted_params = sorted_params.sort_values(by)
        indices = sorted_params.index
        
        # Figure size
        fig = plt.figure(figsize = (20, 8))
        # Single subplot
        ax = fig.add_subplot(111)
        # Bars for the maximum drawdown of excess returns (drawn as negative values)
        ax.bar([x+0.3 for x in range(len(indices))],
               [-self.excess_max_drawdown[x] for x in indices], color = '#32CD32',  
                     width = 0.6, label = 'Excess_max_drawdown')
        # Bars for the annualized excess return
        ax.bar([x for x in range(len(indices))],
               [self.excess_annual_return[x] for x in indices], color = 'r', 
                     width = 0.6, label = 'Excess_annual_return')
        plt.xticks([x+0.3 for x in range(len(indices))], indices)
        # Legend
        ax.legend(loc='best',fontsize=15)
        # Zero line
        plt.plot([0, len(indices)], [0, 0], c='k', 
                 linestyle='--', label='zero')
        # Legend (refreshed to include the zero line)
        ax.legend(loc='best',fontsize=15)
        # y-axis label
        ax.set_ylabel('Max_drawdown', fontsize=15)
        # Format the y tick labels as percentages
        ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
        # Title
        ax.set_title("Strategy's performances of different quantile", fontsize=15)
        # x-axis range
        plt.xlim(0, len(indices))

def group_backtest(start_date,end_date):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis('78c119654ae4be5678a2e92ac6565b3f')
    pa.get_backtest_data(file_name = 'results_1.pkl',
                          running_max = 5,
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['factor', 'quantile'],# variable names, i.e. the g.xxxx variables in the strategy
                          param_values = [['svm'], tuple(zip(range(0,50,10), range(10,51,10)))]            
                          )
    
start_date = '2014-01-01' 
end_date = '2018-01-01' 
group_backtest(start_date,end_date)
[done|running|waiting||failed]:
[0|0|5||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[0|5|0||0].
[1|4|0||0].
[2|3|0||0].
[2|3|0||0].
[3|2|0||0].
[elapsed 0.051 h, remaining 0.0 h, please keep the browser open].

[Backtests finished] total time: 188 s (0.05 h).

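3.1.1 Return metrics of the layered backtests

To analyze the model's earning power in detail, we treat the model's prediction as a single factor and test it the way the effectiveness of a single factor is tested: (1) sort the stocks by predicted value in descending order; (2) split the stocks evenly into 5 groups, forming 5 portfolios. The return metrics of each portfolio are shown in the table below.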

pa = parameter_analysis()
pa.read_backtest_data('results_1.pkl')
pa.evaluations_df
factor quantile __version algorithm_return algorithm_volatility alpha annual_algo_return annual_bm_return avg_position_days avg_trade_return ... max_drawdown_period max_leverage period_label profit_loss_ratio sharpe sortino trading_days treasury_return win_count win_ratio
0 svm (0, 10) 101 0.663323 0.269375 -0.0138488 0.139055 0.150562 NaN NaN ... [2015-06-12, 2016-01-28] 0 2017-12 NaN 0.367722 0.407257 977 0.159671 NaN NaN
1 svm (10, 20) 101 0.63249 0.266582 -0.0188011 0.133614 0.150562 135.74 0.0164623 ... [2015-06-12, 2016-01-28] 0 2017-12 1.30284 0.351166 0.394109 977 0.159671 1393 0.527851
2 svm (20, 30) 101 0.509376 0.262645 -0.0391229 0.111096 0.150562 NaN NaN ... [2015-06-12, 2016-01-28] 0 2017-12 NaN 0.270693 0.304905 977 0.159671 NaN NaN
3 svm (30, 40) 101 0.136907 0.272248 -0.121327 0.0333779 0.150562 NaN NaN ... [2015-06-12, 2016-01-28] 0 2017-12 NaN -0.024324 -0.0269605 977 0.159671 NaN NaN
4 svm (40, 50) 101 0.0971357 0.261152 -0.124857 0.0240049 0.150562 NaN NaN ... [2015-06-12, 2016-01-28] 0 2017-12 NaN -0.0612482 -0.0683193 977 0.159671 NaN NaN

5 rows × 28 columns
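
The ranking-and-slicing step described in 3.1.1 can be illustrated with a minimal sketch. The function `quantile_band`, the ticker names and the scores are made up for illustration; the strategy code running on the platform may slice differently.

```python
def quantile_band(scores, low_pct, high_pct):
    """Return the stocks whose rank by predicted score (descending)
    falls in the [low_pct, high_pct) percent band."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    start = n * low_pct // 100
    stop = n * high_pct // 100
    return ranked[start:stop]

# Hypothetical predicted scores for 20 made-up tickers (S00 scores highest).
scores = {"S%02d" % i: 1.0 - 0.05 * i for i in range(20)}

bands = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
for low, high in bands:
    print((low, high), quantile_band(scores, low, high))
```

With 20 stocks each 10% band holds two names, so portfolio 1 holds the two highest-scored stocks and portfolio 5 holds the ninth and tenth.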

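3.1.2 Net value of the layered backtest

To compare the 5 portfolios more intuitively, we plot the net-value curves of the 5 portfolios together with the HS300 benchmark, as shown in the figure below. Portfolio 1 clearly outperforms portfolio 5, which is consistent with the single-factor effectiveness test and indicates that the model is effective.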

pa.plot_returns()

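3.1.3 Net value of the long-short portfolio

The net-value curves of the layered backtest show that every portfolio is quite volatile, so the strategy carries considerable risk; we therefore build a long-short portfolio. It goes long portfolio 1 and short portfolio 5 (rebalanced monthly). For ease of calculation, its daily return is defined as (daily return of portfolio 1 - daily return of portfolio 5) / 2, and the net-value curve is built from these daily returns. As the figure shows, the long-short curve is visibly less volatile than any single portfolio and delivers steadier returns, so risk is better controlled.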

pa.plot_long_short()
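
The long-short daily return defined above can be turned into a net-value curve as in the sketch below. Compounding the daily returns from a starting value of 1.0 is an assumption (the `plot_long_short` method may accumulate returns differently), and the return series are made up.

```python
def long_short_nav(returns_top, returns_bottom):
    """Net value of a portfolio that is long the top group and short the
    bottom group, with daily return (r_top - r_bottom) / 2."""
    nav = [1.0]
    for r_top, r_bottom in zip(returns_top, returns_bottom):
        daily = (r_top - r_bottom) / 2
        nav.append(nav[-1] * (1 + daily))  # assumed: daily compounding
    return nav

# Made-up daily returns of portfolio 1 and portfolio 5 over three days.
top = [0.02, -0.01, 0.03]
bottom = [0.01, -0.02, 0.01]
nav = long_short_nav(top, bottom)
print(nav)
```

Note that on the second day both portfolios lose money, yet the long-short portfolio still gains because portfolio 1 loses less than portfolio 5.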

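3.1.4 Returns by year (5 layers)

To examine the stability of the model, we look at the return of each portfolio in each year. As the table shows (rows 0 to 4 correspond to portfolios 1 to 5), portfolio 1 ranks first in 2015 and 2016 and third in 2014 and 2017, while portfolio 5 ranks in the bottom two in every year.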

pa.get_profit_year()
   2014      2015       2016        2017
0  0.410576  0.295119   -0.0946605  0.00567782
1  0.473467  0.180543   -0.134506   0.0843375
2  0.481693  0.262112   -0.205667   0.0161062
3  0.374386  0.0572377  -0.195955   -0.026887
4  0.306113  0.0628772  -0.210157   0.000589785
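
A per-year table like the one above can be derived from daily returns by compounding within each calendar year. The sketch below assumes that definition (the actual `get_profit_year` implementation may differ) and uses made-up dates and returns.

```python
from collections import defaultdict

def yearly_returns(daily):
    """daily: iterable of ('YYYY-MM-DD', daily_return) pairs.
    Returns {year: compounded return within that year}."""
    growth = defaultdict(lambda: 1.0)
    for date, r in daily:
        growth[date[:4]] *= 1 + r
    return {year: g - 1 for year, g in sorted(growth.items())}

# Made-up daily returns spanning a year boundary.
daily = [("2014-12-30", 0.01), ("2014-12-31", 0.02), ("2015-01-05", -0.01)]
by_year = yearly_returns(daily)
print(by_year)
```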

3.1.5 Backtest analysis tables for the Gaussian-kernel SVM strategy portfolios

pa.get_eval4_bar()
pa.get_eval()
pa.get_excess_eval()

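3.2 Industry-neutral stock selection (about 20 min)

Strategy steps:
(1) Assign every stock to its Shenwan (SW) level-1 industry
(2) Within each industry, rank the stocks by the monthly prediction
(3) Within each industry, split the stocks evenly into N layers (N = 5)
(4) Assign weights to the stocks in each layered portfolio
Stock weighting method:
Within each level-1 industry, all stocks are ranked by factor value and split into N layered portfolios. As shown in the figure, the yellow blocks represent the initial weights of the stocks within each industry; these may or may not be equal (we simply use equal weights in our tests). Layers are formed by cutting the cumulative weight of the industry's stocks into N equal parts. For example, in industry 1 of the figure the 5 stocks have equal initial weights (say 0.2 each). If we want 3 layers, layer 1 is cut off at a cumulative weight of 1/3, so it contains stock 1 and stock 2 in the ratio 0.2 : (1/3 - 0.2) = 3:2. By the same reasoning, layer 2 contains stocks 2, 3 and 4 in the ratio (0.4 - 1/3) : 0.2 : (2/3 - 0.6) = 1:3:1, and layer 3 contains stocks 4 and 5 in the ratio 2:3. This procedure gives the relative weights of the stocks within each level-1 industry; the weights across industries match those of the benchmark (we use the CSI 300, i.e. HS300), which is what makes the portfolio industry neutral.
Figure: stock weight allocation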
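
The cumulative-weight cut described above can be checked with a short sketch. The function `layer_weights` is a hypothetical helper written for this illustration, not the strategy's own code; it uses exact fractions so the ratios from the worked example come out without rounding.

```python
from fractions import Fraction

def layer_weights(initial_weights, n_layers):
    """Split stocks (already sorted by factor value) into n_layers by cutting
    the cumulative weight into equal parts; a stock may straddle two layers.
    Returns one {stock_number: weight_in_layer} dict per layer."""
    total = sum(initial_weights)
    size = total / n_layers
    layers = []
    for k in range(n_layers):
        lo, hi = k * size, (k + 1) * size
        layer = {}
        start = Fraction(0)  # cumulative weight before the current stock
        for i, w in enumerate(initial_weights):
            end = start + w
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                layer[i + 1] = overlap  # 1-based stock number, as in the text
            start = end
        layers.append(layer)
    return layers

# Five stocks with equal initial weight 0.2, split into 3 layers.
layers = layer_weights([Fraction(1, 5)] * 5, 3)
for k, layer in enumerate(layers, start=1):
    print(k, {stock: str(w) for stock, w in layer.items()})
```

Layer 1 comes out as 1/5 and 2/15 (3:2), layer 2 as 1/15, 1/5 and 1/15 (1:3:1), and layer 3 as 2/15 and 1/5 (2:3), matching the ratios in the example.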

### Note: the num values must be passed in as strings
def group_backtest(start_date,end_date,num):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_2.pkl',
                          running_max = 10,
                          algorithm_id = '794c29655a9b316c82b0cf87340631cf',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['num'],
                          param_values = [num]                     
                          )
    
start_date = '2014-01-01' 
end_date = '2018-01-01' 
num = ['1','2','3','4','5']
group_backtest(start_date,end_date,num)

[Backtest complete] Total time: 545 s (0.15 h).

def group_backtest(start_date,end_date):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_2.pkl',
                          running_max = 10,
                          algorithm_id = 'aaece09901876fda726e8768b1d34e11',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['factor', 'quantile'],
                          param_values = [['svm'], tuple(zip(range(0,50,10), range(10,51,10)))]                     
                          )
    
start_date = '2014-01-01' 
end_date = '2018-01-01' 
group_backtest(start_date,end_date)

[Backtest complete] Total time: 876 s (0.24 h).

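3.2.1 Return metrics of the layered backtest

To analyze the model's return-generating ability in detail, we treat the model's prediction as a single factor and test it the way a single factor is tested for effectiveness: (1) sort the stocks by predicted value in descending order; (2) split the stocks into 5 equal parts, each forming one portfolio. The return metrics of each portfolio are shown in the table below.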

pa = parameter_analysis()
pa.read_backtest_data('results_2.pkl')
pa.evaluations_df
   num  __version  algorithm_return  algorithm_volatility  alpha       annual_algo_return  annual_bm_return  avg_position_days  avg_trade_return  benchmark_return  ...  max_drawdown_period       max_leverage  period_label  profit_loss_ratio  sharpe     sortino    trading_days  treasury_return  win_count  win_ratio
0  1    101        0.738469          0.255861              0.00123646  0.152007            0.150562          NaN                NaN               0.729959          ...  [2015-06-12, 2016-01-28]  0             2017-12       NaN                0.437766   0.492744   977           0.159671         NaN        NaN
1  2    101        0.455817          0.258532              -0.0514965  0.100871            0.150562          186.156            0.0156864         0.729959          ...  [2015-06-12, 2016-01-28]  0             2017-12       1.24784            0.23545    0.261717   977           0.159671         1948       0.533114
2  3    101        0.633183          0.264279              -0.0211756  0.133737            0.150562          179.59             0.0144208         0.729959          ...  [2015-06-12, 2016-01-28]  0             2017-12       1.32176            0.35469    0.403786   977           0.159671         1921       0.521303
3  4    101        0.212724          0.265138              -0.104247   0.0505905           0.150562          176.945            0.00921821        0.729959          ...  [2015-06-12, 2016-02-29]  0             2017-12       1.14759            0.0399431  0.0445347  977           0.159671         1854       0.505728
4  5    101        0.237787          0.252249              -0.0927294  0.0561041           0.150562          163.021            0.00871081        0.729959          ...  [2015-06-12, 2017-05-23]  0             2017-12       1.15374            0.0638424  0.0732078  977           0.159671         1588       0.499528

5 rows × 27 columns

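3.2.2 Net value of the layered backtest

To compare the 5 portfolios more intuitively, we plot the net-value curves of the 5 portfolios together with the HS300 benchmark, as shown in the figure below. Portfolio 1 clearly outperforms portfolio 5, which is consistent with the single-factor effectiveness test and indicates that the model is effective.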

pa.plot_returns()

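3.2.3 Net value of the long-short portfolio

The net-value curves of the layered backtest show that every portfolio is quite volatile, so the strategy carries considerable risk; we therefore build a long-short portfolio. It goes long portfolio 1 and short portfolio 5 (rebalanced monthly). For ease of calculation, its daily return is defined as (daily return of portfolio 1 - daily return of portfolio 5) / 2, and the net-value curve is built from these daily returns. As the figure shows, the long-short curve is visibly less volatile than any single portfolio and delivers steadier returns, so risk is better controlled.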

pa.plot_long_short()

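3.2.4 Returns by year (5 layers)

To examine the stability of the model, we look at the return of each portfolio in each year. As the table shows (rows 0 to 4 correspond to portfolios 1 to 5), portfolio 1 ranks first in 2014 and 2016, second in 2015 and third in 2017, while portfolio 5 ranks in the bottom two in 2015, 2016 and 2017 (it is third in 2014).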

pa.get_profit_year()
   2014      2015        2016        2017
0  0.629711  0.124005    -0.0915092  0.0457919
1  0.490134  0.0585526   -0.143853   0.0839715
2  0.549766  0.169902    -0.164742   0.080689
3  0.481532  0.00275753  -0.178317   -0.00719422
4  0.511561  0.0472044   -0.216468   -0.00130045

3.2.5 Backtest analysis tables for the Gaussian-kernel SVM strategy portfolios

pa.get_eval4_bar()
pa.get_eval()
pa.get_excess_eval()

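3.3 SVM models with different kernels (equal weight) (about 12 min)

Strategy steps:
(1) Each month, rank the stocks by predicted value
(2) After ranking, trade the top-ranked stocks
Models:
(1) Different kernel functions give different SVM models; the parameter settings are listed in the table below
(2) Number of stocks: we trade the top 5%, 10%, 15% and 20% of stocks and analyze the model's returns for each portfolio size.

No.  Kernel               Parameters
0    Linear               C = 1e-4
1    Degree-3 polynomial  C = 0.003, gamma = 0.03
2    Degree-7 polynomial  C = 0.03, gamma = 0.01
3    Gaussian (RBF)       C = 1, gamma = 3e-5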
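
For reference, the three kernel families in the table can be written out directly. The sketch below uses the standard definitions (the polynomial form with `coef0 = 0` follows scikit-learn's default); the two factor vectors are made up, and `C` does not appear because it is the penalty on misclassification rather than part of the kernel.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return dot(x, y)

def poly_kernel(x, y, degree, gamma, coef0=0.0):
    # K(x, y) = (gamma * <x, y> + coef0) ** degree
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two made-up factor vectors, evaluated with the gamma values from the table.
x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))
print(poly_kernel(x, y, degree=3, gamma=0.03))
print(poly_kernel(x, y, degree=7, gamma=0.01))
print(rbf_kernel(x, y, gamma=3e-5))
```

With a gamma as small as 3e-5 the Gaussian kernel value stays very close to 1 for nearby points, so the decision surface is smooth.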

3.3.1 Strategy construction

Below we build the four models for each portfolio size (5%, 10%, 15%, 20%) and save the results to the files 'results_3.pkl', 'results_4.pkl', 'results_5.pkl' and 'results_6.pkl'.

def group_backtest(start_date,end_date,para):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_3.pkl',
                          running_max = 10,
                          algorithm_id = '6ae8347a85bfea853a0c5c53035d02b8',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['para'],
                          param_values = [para]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

# para must be reset before the loop, otherwise entries from earlier runs pile up
para = []
kernel = ['linear', 'poly', 'poly', 'rbf']
C = [1e-4, 0.003, 0.03, 1]
degree = [3, 3, 7, 3]
gamma = ['auto', 0.03, 0.01, 3e-5]
num = [0.05]*4
for i in range(len(kernel)):
    temp = {}
    temp['kernel'] = kernel[i]
    temp['C'] = C[i]
    temp['degree'] = degree[i]
    temp['gamma'] = gamma[i]
    temp['num'] = num[i]
    para.append(temp)
group_backtest(start_date,end_date,para)

In the recorded run, para had not been reset and still held 8 entries from earlier runs, so 12 backtests were queued instead of 4. The run was stopped by hand (KeyboardInterrupt), and results_3.pkl (the 5% case) was not regenerated in this session.

def group_backtest(start_date,end_date,para):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_4.pkl',
                          running_max = 10,
                          algorithm_id = '6ae8347a85bfea853a0c5c53035d02b8',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['para'],
                          param_values = [para]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

para = []
kernel = ['linear', 'poly', 'poly', 'rbf']
C = [1e-4, 0.003, 0.03, 1]
degree = [3, 3, 7, 3]
gamma = ['auto', 0.03, 0.01, 3e-5]
num = [0.1]*4
for i in range(len(kernel)):
    temp = {}
    temp['kernel'] = kernel[i]
    temp['C'] = C[i]
    temp['degree'] = degree[i]
    temp['gamma'] = gamma[i]
    temp['num'] = num[i]
    para.append(temp)
group_backtest(start_date,end_date,para)

[Backtest complete] Total time: 139 s (0.04 h).

def group_backtest(start_date,end_date,para):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_5.pkl',
                          running_max = 10,
                          algorithm_id = '6ae8347a85bfea853a0c5c53035d02b8',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['para'],
                          param_values = [para]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

para = []
kernel = ['linear', 'poly', 'poly', 'rbf']
C = [1e-4, 0.003, 0.03, 1]
degree = [3, 3, 7, 3]
gamma = ['auto', 0.03, 0.01, 3e-5]
num = [0.15]*4
for i in range(len(kernel)):
    temp = {}
    temp['kernel'] = kernel[i]
    temp['C'] = C[i]
    temp['degree'] = degree[i]
    temp['gamma'] = gamma[i]
    temp['num'] = num[i]
    para.append(temp)
group_backtest(start_date,end_date,para)

[Backtest complete] Total time: 230 s (0.06 h).

def group_backtest(start_date,end_date,para):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_6.pkl',
                          running_max = 10,
                          algorithm_id = '6ae8347a85bfea853a0c5c53035d02b8',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['para'],
                          param_values = [para]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

para = []
kernel = ['linear', 'poly', 'poly', 'rbf']
C = [1e-4, 0.003, 0.03, 1]
degree = [3, 3, 7, 3]
gamma = ['auto', 0.03, 0.01, 3e-5]
num = [0.2]*4
for i in range(len(kernel)):
    temp = {}
    temp['kernel'] = kernel[i]
    temp['C'] = C[i]
    temp['degree'] = degree[i]
    temp['gamma'] = gamma[i]
    temp['num'] = num[i]
    para.append(temp)
group_backtest(start_date,end_date,para)

[Backtest complete] Total time: 256 s (0.07 h).

3.3.2 Annualized return of the different kernel models (by number of stocks)

files = ['results_4.pkl','results_5.pkl','results_6.pkl']
num = ['10%', '15%', '20%']
result = pd.DataFrame(index = ['Linear kernel', 'Degree-3 polynomial kernel', 'Degree-7 polynomial kernel', 'Gaussian kernel'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['annual_algo_return'])
result
                            10%       15%       20%
Linear kernel               0.193814  0.185778  0.185456
Degree-3 polynomial kernel  0.106208  0.139193  0.159541
Degree-7 polynomial kernel  0.098947  0.114701  0.109966
Gaussian kernel             0.190774  0.186430  0.185224

The horizontal axis (columns) is the share of all stocks that is selected; the vertical axis (rows) is the kernel function.

3.3.3 Sharpe ratio of the different kernel models (by number of stocks)

files = ['results_4.pkl','results_5.pkl','results_6.pkl']
num = ['10%', '15%', '20%']
result = pd.DataFrame(index = ['Linear kernel', 'Degree-3 polynomial kernel', 'Degree-7 polynomial kernel', 'Gaussian kernel'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['sharpe'])
result
                            10%       15%       20%
Linear kernel               0.601520  0.570711  0.566373
Degree-3 polynomial kernel  0.265304  0.398270  0.473421
Degree-7 polynomial kernel  0.250818  0.312082  0.286605
Gaussian kernel             0.588754  0.573654  0.565425

3.3.4 Maximum drawdown of the different kernel models (by number of stocks)

files = ['results_4.pkl','results_5.pkl','results_6.pkl']
num = ['10%', '15%', '20%']
result = pd.DataFrame(index = ['Linear kernel', 'Degree-3 polynomial kernel', 'Degree-7 polynomial kernel', 'Gaussian kernel'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['max_drawdown'])
result
                            10%       15%       20%
Linear kernel               0.419444  0.433982  0.442570
Degree-3 polynomial kernel  0.484936  0.451700  0.454899
Degree-7 polynomial kernel  0.484793  0.478805  0.485196
Gaussian kernel             0.419937  0.433977  0.443329

3.3.5 Information ratio of the different kernel models (by number of stocks)

files = ['results_4.pkl','results_5.pkl','results_6.pkl']
num = ['10%', '15%', '20%']
result = pd.DataFrame(index = ['Linear kernel', 'Degree-3 polynomial kernel', 'Degree-7 polynomial kernel', 'Gaussian kernel'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['information'])
result
                            10%        15%        20%
Linear kernel               0.451676   0.412560   0.421594
Degree-3 polynomial kernel  -0.464094  -0.128740  0.108245
Degree-7 polynomial kernel  -0.534275  -0.396297  -0.479084
Gaussian kernel             0.418434   0.419867   0.418307

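Judging by annualized return, Sharpe ratio, maximum drawdown and information ratio, the linear and Gaussian kernels clearly perform better than the polynomial kernels. For both of them the highest return is obtained when 10% of the stocks are selected, and the 5% portfolio (not included in the tables above) gave the lowest return. Holding fewer stocks therefore does not automatically mean higher returns: with too few stocks, risk cannot be diversified away, the model bears more risk, and returns fall.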
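
The four metrics compared above can be sketched in a few lines. The conventions used here, 250 trading days per year and a 4% risk-free rate, are assumptions; they were chosen because they agree with row 0 of the 3.1.1 table to about four decimal places (cumulative return 0.663323 over 977 days against an annualized 0.139055, and a Sharpe ratio of 0.367722). The information ratio shown is one common textbook variant and may not match the platform's exact definition.

```python
import math

TRADING_DAYS = 250  # assumed trading days per year
RISK_FREE = 0.04    # assumed annual risk-free rate

def annualized_return(total_return, n_days):
    """Convert a cumulative return over n_days trading days to a yearly rate."""
    return (1 + total_return) ** (TRADING_DAYS / n_days) - 1

def sharpe_ratio(annual_return, annual_volatility):
    """Excess annualized return per unit of annualized volatility."""
    return (annual_return - RISK_FREE) / annual_volatility

def max_drawdown(nav):
    """Largest peak-to-trough loss of a net-value series, as a fraction."""
    peak, worst = nav[0], 0.0
    for value in nav:
        peak = max(peak, value)
        worst = max(worst, 1 - value / peak)
    return worst

def information_ratio(strategy_daily, benchmark_daily):
    """Annualized mean active return divided by annualized tracking error."""
    active = [s - b for s, b in zip(strategy_daily, benchmark_daily)]
    mean = sum(active) / len(active)
    var = sum((a - mean) ** 2 for a in active) / (len(active) - 1)
    return mean * TRADING_DAYS / math.sqrt(var * TRADING_DAYS)

# Figures taken from row 0 of the 3.1.1 table.
print(round(annualized_return(0.663323, 977), 6))
print(round(sharpe_ratio(0.139055, 0.269375), 6))
# Made-up net-value series: the peak is 1.2 and the trough after it is 0.9.
print(max_drawdown([1.0, 1.2, 0.9, 1.1]))
```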

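3.4 SVR model and SVM model (equal weight) (about 10 min)

Strategy steps:
(1) Each month, rank the stocks by predicted value
(2) After ranking, trade the top-ranked stocks
Models:
(1) Two support vector models are compared, a classifier (SVC) and a regressor (SVR); the parameter settings are listed in the table below
(2) Number of stocks: we trade the top 5%, 10%, 15% and 20% of stocks and analyze the model's returns for each portfolio size.

No.  Model  Parameters
0    SVC    C = 10, gamma = 0.01, kernel = 'rbf'
1    SVR    C = 10, gamma = 0.01, kernel = 'rbf'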

3.4.1 Strategy construction

Below we build the two models (SVC and SVR) for each portfolio size (5%, 10%, 15%, 20%) and save the results to the files 'results_7.pkl', 'results_8.pkl', 'results_9.pkl' and 'results_10.pkl'.

def group_backtest(start_date,end_date,types,num):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_7.pkl',
                          running_max = 10,
                          algorithm_id = '8890ee93c2a304917b3ab7a4daab4c44',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['types','num'],
                          param_values = [types, num]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

para = []
types = ['SVC', 'SVR']
num = [0.05]
group_backtest(start_date,end_date,types,num)

[Backtest complete] Total time: 144 s (0.04 h).

def group_backtest(start_date,end_date,types,num):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_8.pkl',
                          running_max = 10,
                          algorithm_id = '8890ee93c2a304917b3ab7a4daab4c44',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['types','num'],
                          param_values = [types, num]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

para = []
types = ['SVC', 'SVR']
num = [0.1]
group_backtest(start_date,end_date,types,num)

[Backtest complete] Total time: 124 s (0.03 h).

def group_backtest(start_date,end_date,types,num):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_9.pkl',
                          running_max = 10,
                          algorithm_id = '8890ee93c2a304917b3ab7a4daab4c44',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['types','num'],
                          param_values = [types, num]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

para = []
types = ['SVC', 'SVR']
num = [0.15]
group_backtest(start_date,end_date,types,num)

[Backtest complete] Total time: 183 s (0.05 h).

def group_backtest(start_date,end_date,types,num):
    warnings.filterwarnings("ignore")
    pa = parameter_analysis()
    pa.get_backtest_data(file_name = 'results_10.pkl',
                          running_max = 10,
                          algorithm_id = '8890ee93c2a304917b3ab7a4daab4c44',
                          start_date=start_date,
                          end_date=end_date,
                          frequency = 'day',
                          initial_cash = '10000000',
                          param_names = ['types','num'],
                          param_values = [types, num]                     
                          )
start_date = '2014-01-01' 
end_date = '2018-01-01' 

para = []
types = ['SVC', 'SVR']
num = [0.2]
group_backtest(start_date,end_date,types,num)

[Backtest complete] Total time: 191 s (0.05 h).

3.4.2 Annualized return of the SVR and SVM models (by number of stocks)

files = ['results_7.pkl','results_8.pkl','results_9.pkl','results_10.pkl']
num = ['5%', '10%', '15%', '20%']
result = pd.DataFrame(index = ['SVC', 'SVR'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['annual_algo_return'])
result
     5%        10%       15%       20%
SVC  0.096107  0.126097  0.148496  0.139055
SVR  0.085703  0.125346  0.117887  0.137773

3.4.3 Sharpe ratio of the SVR and SVM models (by number of stocks)

files = ['results_7.pkl','results_8.pkl','results_9.pkl','results_10.pkl']
num = ['5%', '10%', '15%', '20%']
result = pd.DataFrame(index = ['SVC', 'SVR'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['sharpe'])
result
     5%        10%       15%       20%
SVC  0.203402  0.310074  0.403821  0.367722
SVR  0.166291  0.311041  0.291229  0.366917

3.4.4 Maximum drawdown of the SVR and SVM models (by number of stocks)

files = ['results_7.pkl','results_8.pkl','results_9.pkl','results_10.pkl']
num = ['5%', '10%', '15%', '20%']
result = pd.DataFrame(index = ['SVC', 'SVR'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['max_drawdown'])
result
     5%        10%       15%       20%
SVC  0.456234  0.496830  0.464156  0.477222
SVR  0.466013  0.476128  0.473968  0.473292

3.4.5 Information ratio of the SVR and SVM models (by number of stocks)

files = ['results_7.pkl','results_8.pkl','results_9.pkl','results_10.pkl']
num = ['5%', '10%', '15%', '20%']
result = pd.DataFrame(index = ['SVC', 'SVR'])
pa = parameter_analysis()
for i in range(len(num)):
    pa.read_backtest_data(files[i])
    result[num[i]] = list(pa.evaluations_df['information'])
result
     5%         10%        15%        20%
SVC  -0.471369  -0.246785  -0.022625  -0.133438
SVR  -0.594407  -0.267845  -0.369962  -0.153159

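Judging by annualized return, Sharpe ratio, maximum drawdown and information ratio, the SVC model clearly delivers better returns than the SVR model. SVC earns its highest return when 15% of the stocks are selected: the annualized return reaches 14.85%, and its Sharpe ratio (0.404) and information ratio (-0.023) are the highest of all the models. On the risk side, its maximum drawdown (46.4%) is the second smallest in the table, just behind SVC at 5% (45.6%). This configuration therefore combines the strongest returns with good risk control. Next, with the number of stocks fixed at 15%, we plot the return curves of the Gaussian-kernel SVM (SVC) model and the SVR model to compare them more intuitively.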

3.4.6 Return curves of the Gaussian-kernel SVM and SVR models (15% of stocks)

pa = parameter_analysis()
pa.read_backtest_data('results_9.pkl')
pa.plot_returns()

3.4.7 Excess-return curves of the Gaussian-kernel SVM and SVR models (15% of stocks)

pa.plot_excess_returns()

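3.5 Summary

This part uses the HS300 constituents as the stock pool, trains the different models on data from 2010-2014, backtests the strategies over 2014-2018, and evaluates each model by the return and risk-control ability of its strategy. First, we built a Gaussian-kernel SVM model and ran layered backtests with both equal-weight selection and an industry-neutral strategy; the results show that the Gaussian-kernel model is effective. Next, we backtested the different kernels with different numbers of stocks; the linear and Gaussian kernels earn more, and the highest return is obtained when 10% of the stocks are selected. Finally, we compared the SVC and SVR models; the SVC model performs better.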

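Conclusion

We have tested support vector machines systematically and used SVM models to build CSI 300 (HS300) stock-selection strategies. Our preliminary conclusions are as follows:
1. SVM has decent predictive power. With the factors and next-period returns from 2010-2014 as the in-sample set and the data from 2014-2018 as the out-of-sample test set, the Gaussian-kernel SVM reaches a cross-validation accuracy of 56.2% and an AUC of 0.561 on the HS300 stock-selection model; on the out-of-sample test set the average accuracy is 55.1% and the average AUC is 0.550.
2. Using the HS300 as the stock pool, we built stock-selection strategies with the Gaussian-kernel SVM model, both an equal-weight strategy and an industry-neutral strategy within the HS300 constituents. Overall, the Gaussian-kernel SVM portfolios are clearly separated by layer in return, Sharpe ratio and maximum drawdown, so the model is effective.
3. It follows that for a machine-learning model with a fixed training set, the choice of training set is critical: it should cover as many market regimes as possible. The training period must not be too short and should span at least one economic cycle, which is why we chose 2010-2014. However, abrupt style shifts such as those in December 2014 and the first half of 2017 are rare in history and did not occur within the training period, so larger drawdowns from a fixed-training-set model were to be expected.
4. We compared the predictive power of the Gaussian-kernel support vector classifier (SVC) and support vector regression (SVR). In most cases the Gaussian-kernel SVC earns more than the SVR model. A possible reason is that binarizing the raw returns into positive and negative examples loses some information but also removes much of the noise in the returns, so the model captures the patterns in the data more accurately.
5. We compared the predictive power of SVMs with different kernels. In our backtests the Gaussian kernel earns more than the polynomial kernels and about the same as the linear kernel (section 3.3). As the most widely used kernel, its advantage is that it makes few prior assumptions about the raw data. A linear-kernel SVM, by contrast, assumes that the raw data are linearly separable, in other words that factors and returns are linearly related.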
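
The binarization mentioned in conclusion 4 can be illustrated with a minimal sketch. The labelling rule used here (above or below the cross-sectional median) is an assumption made for illustration, since the exact rule is not stated in this part of the text; the stock codes and returns are made up.

```python
from statistics import median

def binarize_by_median(returns):
    """Label each stock +1 if its next-period return is above the
    cross-sectional median, otherwise -1 (hypothetical labelling rule)."""
    mid = median(returns.values())
    return {code: 1 if r > mid else -1 for code, r in returns.items()}

# Made-up next-period returns for four stocks.
next_returns = {"A": 0.051, "B": 0.049, "C": -0.020, "D": -0.022}
labels = binarize_by_median(next_returns)
print(labels)
```

Stocks A and B receive the same label although their returns differ slightly, which is the sense in which binarization discards small, possibly noisy differences that a regression target would keep.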

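All replies

  • 狭义可亲

    Congratulations on getting it to run. I have been reproducing this report recently too, but I am stuck at 3.1.1. When is the results_1.pkl file that the original post reads generated?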
