Multi-Factor Model Analysis Based on Random Forest and XGBoost

Author: fjsdjfs, 2019-06-19 18:35. Source: FX168财经网 (People channel)

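Overview

The 《人工智能选股系列研报》 (AI stock-selection research report series) by Lin Xiaoming of Huatai Securities is a classic that I have long treated as a textbook and consulted repeatedly; its one drawback is that it comes without source code. Partly for lack of ability and partly out of laziness, I waited until the quant group of Teacher Yuan at Xi'an Jiaotong University (西安交大元老师量化小组) shared the code for 《支持向量机模型（SVM）在多因子选股模型领域的应用》 (Application of the support vector machine (SVM) model to multi-factor stock selection), and then followed its pattern to port random forest and XGBoost to a multi-factor model. My thanks to them.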

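Bagging and boosting are both ensemble methods for classifiers. Bagging builds its members in parallel: it first trains a number of small classifiers, then averages their outputs to obtain the final model. Because averaging lowers variance, bagging tends to outperform a single classifier. Boosting builds its members sequentially: each classifier is added in turn and tries to reduce the bias of the combined model. Boosting is generally regarded as accurate and fairly robust. This article uses random forest and XGBoost as representatives of the two approaches to analyze and compare multi-factor models. Both extend the decision-tree algorithm; the former is based on bagging and the latter on boosting. The study covers: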

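How the number of features affects the predictive power of random forest and XGBoost
How the parameters (model complexity) of random forest and XGBoost affect predictive power
The predictive power of random forest and XGBoost
The feature importances of random forest and XGBoost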

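I have studied random forest in a multi-factor setting once before; see 《基于机器学习的多因子选股策略》 (Multi-factor stock selection strategy based on machine learning), which was my first use of it. Unlike that article, this one uses SelectPercentile for feature selection and Pipeline for cross-validation, and the model classifies up/down moves (the earlier article ran a regression on market capitalization).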

Conclusions

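The number of features affects the two models differently: random forest's predictive power rises with the number of features, whereas XGBoost's is unstable. This study uses feature-selection percentiles of 80% and 30% respectively.
For random forest, predictive power rises with n_estimators; as max_depth rises, AUC rises but the F1 score falls. For XGBoost, predictive power falls as subsample rises, although training is faster; as max_depth rises, predictive power falls and training is slower.
The two models have similar predictive power, but XGBoost has an overwhelming advantage in training time. At comparable predictive power XGBoost is also the simpler model: in this study every XGBoost tree has depth 1, whereas trees in the random forest model are generally 4 or deeper, making it far more complex than the XGBoost model.
Technical factors rank highest in model importance, followed by valuation factors. Random forest uses most of the factors, whereas XGBoost uses only some of them.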

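References

Lin Xiaoming. 《人工智能选股之Boosting模型》 (AI stock selection: Boosting models)
Lin Xiaoming. 《人工智能选股之随机森林模型》 (AI stock selection: random forest models)
Xi'an Jiaotong University, Teacher Yuan's quant group (西安交大元老师量化小组). 《支持向量机模型（SVM）在多因子选股模型领域的应用》 (Application of the support vector machine (SVM) model to multi-factor stock selection)
Song Tianlong. 《Python数据分析与数据化运营》 (Python data analysis and data-driven operations)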

Research


1. Overview

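Bagging and boosting are both ensemble methods for classifiers. Bagging builds its members in parallel: it first trains a number of small classifiers, then averages their outputs to obtain the final model. Because averaging lowers variance, bagging tends to outperform a single classifier. Boosting builds its members sequentially: each classifier is added in turn and tries to reduce the bias of the combined model. Boosting is generally regarded as accurate and fairly robust. This notebook uses RandomForest and XGBoost as representatives of the two approaches to analyze and compare traditional multi-factor models. Both extend the decision-tree algorithm; the former is based on bagging and the latter on boosting. The study covers:
(1) An evaluation of how the number of features affects the predictive power of RandomForest and XGBoost
(2) An evaluation of how the parameters (model complexity) of RandomForest and XGBoost affect predictive power
(3) An evaluation of the predictive power of RandomForest and XGBoost
(4) The feature importances of RandomForest and XGBoost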
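To make the bagging versus boosting contrast concrete, here is a minimal sketch on synthetic data. It is not the notebook's model: scikit-learn's GradientBoostingClassifier stands in for XGBoost so that only scikit-learn is needed, and the data set, tree counts and depths are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data (illustrative only, not stock data).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)

# Bagging-style ensemble: trees grown independently, predictions averaged.
bagging_model = RandomForestClassifier(n_estimators=50, max_depth=4,
                                       random_state=0)

# Boosting-style ensemble: depth-1 trees (stumps) added one after another,
# each fitted to the errors left by the trees before it.
boosting_model = GradientBoostingClassifier(n_estimators=50, max_depth=1,
                                            random_state=0)

# 5-fold cross-validated accuracy for each ensemble.
bagging_acc = cross_val_score(bagging_model, X, y, cv=5).mean()
boosting_acc = cross_val_score(boosting_model, X, y, cv=5).mean()
print("random forest accuracy:     %.3f" % bagging_acc)
print("gradient boosting accuracy: %.3f" % boosting_acc)
```

The two scores on toy data say nothing about stock selection; the point is only the structural difference between independently grown trees and sequentially added stumps.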

import pandas as pd
import numpy as np
import math
import jqdata
import time
import datetime
from jqfactor import standardlize
from jqfactor import winsorize_med
from jqfactor import get_factor_values
from jqfactor import neutralize
from sklearn.model_selection import StratifiedKFold, cross_val_score  # cross-validation utilities
from sklearn.feature_selection import SelectPercentile, f_classif  # feature-selection utilities
from sklearn.pipeline import Pipeline  # Pipeline class
from sklearn.metrics import accuracy_score  # accuracy metric
from sklearn.metrics import roc_auc_score
from jqlib.technical_analysis import *
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns

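2. Data acquisition and preprocessing

2.1 Stock universe

Constituents of the CSI All Share Index (中证全指). ST stocks are removed, as are stocks suspended on the trading day following each cross-section date and stocks listed for less than three months. Each stock is treated as one sample.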

# Drop stocks listed fewer than n days before beginDate
def delete_stop(stocks, beginDate, n):
    stockList = []
    beginDate = datetime.datetime.strptime(beginDate, "%Y-%m-%d")
    for stock in stocks:
        start_date = get_security_info(stock).start_date
        if start_date < (beginDate - datetime.timedelta(days=n)).date():
            stockList.append(stock)
    return stockList

# Drop ST stocks
def delete_st(stocks, begin_date):
    st_data = get_extras('is_st', stocks, count=1, end_date=begin_date)
    # One row per date (count=1), so take the first row positionally
    stockList = [stock for stock in stocks if not st_data[stock].iloc[0]]
    return stockList

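2.2 Time range

The five years from 1 January 2014 to 31 December 2018. The first four years (48 months) form the training set and the last year (12 months) forms the test set.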
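As a quick check of the split sizes, the following sketch enumerates the 60 monthly periods with pandas and slices them 48/12. The variable names are illustrative and are not taken from the notebook, apart from train_length.

```python
import pandas as pd

# Monthly periods covering the five-year sample described above.
months = pd.period_range("2014-01", "2018-12", freq="M")

train_length = 48                      # first 48 months for training
train_months = months[:train_length]
test_months = months[train_length:]    # remaining 12 months for testing

print(len(months), train_months[0], train_months[-1],
      test_months[0], test_months[-1])
```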

# Sample at monthly frequency
peroid = 'M'
# Sample window (training + test sets cover 2014-1-31 to 2018-12-31)
start_date = '2014-02-01'
end_date = '2019-01-31'
# Training-set length in months
train_length = 48
# JoinQuant level-1 industries
industry_code = ['HY001', 'HY002', 'HY003', 'HY004', 'HY005', 'HY006',
                 'HY007', 'HY008', 'HY009', 'HY010', 'HY011']
# Stock universe: CSI All Share Index constituents
securities_list = delete_stop(get_index_stocks('000985.XSHG'), start_date, 90)
securities_list = delete_st(securities_list, start_date)

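2.3 Feature and label extraction

On the last trading day of each calendar month, factor exposures are computed as the raw features of each sample, and the next-period return is computed as its label.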
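The sketch below shows one way to find the last trading day of each month and the return to the next cross-section date, using a toy price series. It assumes a plain business-day calendar (no exchange holidays) and a single hypothetical stock; the notebook itself pulls prices from JoinQuant's get_price.

```python
import numpy as np
import pandas as pd

# Toy daily closes for one stock on business days (no holiday calendar).
days = pd.bdate_range("2018-01-01", "2018-04-30")
close = pd.Series(np.linspace(10.0, 12.0, len(days)), index=days)

# Last available trading day within each calendar month.
month_key = close.index.to_period("M")
last_days = close.index.to_series().groupby(month_key).max()

# Close on each cross-section date, then the return to the next one.
month_end_close = close.loc[pd.DatetimeIndex(last_days)]
next_return = month_end_close.shift(-1) / month_end_close - 1

print(last_days)
print(next_return)
```

In this toy calendar the March cross-section falls on 2018-03-30 rather than the calendar month end, because 31 March 2018 is a Saturday. The final date has no following period, so its label is NaN and would be dropped.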

jqfactors_list = ['current_ratio', 'net_profit_to_total_operate_revenue_ttm',
                  'gross_income_ratio', 'roe_ttm', 'roa_ttm',
                  'total_asset_turnover_rate',
                  'net_operating_cash_flow_coverage',
                  'net_operate_cash_flow_ttm', 'net_profit_ttm',
                  'cash_to_current_liability', 'operating_revenue_growth_rate',
                  'non_recurring_gain_loss',
                  'operating_revenue_ttm', 'net_profit_growth_rate']

def get_jq_factor(date):
    factor_data = get_factor_values(securities=securities_list,
                                    factors=jqfactors_list,
                                    count=1,
                                    end_date=date)
    df_jq_factor = pd.DataFrame(index=securities_list)
    for i in factor_data.keys():
        df_jq_factor[i] = factor_data[i].iloc[0, :]
    return df_jq_factor

q = query(
    valuation.code,
    valuation.market_cap,  # market capitalization
    valuation.circulating_market_cap,
    valuation.pe_ratio,  # P/E ratio (TTM)
    valuation.pb_ratio,  # P/B ratio
    valuation.pcf_ratio,  # CFP
    valuation.ps_ratio,  # PS
    balance.total_assets,
    balance.total_liability,
    balance.development_expenditure,  # RD
    balance.dividend_payable,
    balance.fixed_assets,
    balance.total_non_current_liability,
    income.operating_profit,
    income.total_profit,  # OPTP
    # indicator.net_profit_to_total_revenue,  # net profit / total operating revenue
    indicator.inc_revenue_year_on_year,  # operating revenue growth (year on year)
    indicator.inc_net_profit_year_on_year,  # net profit growth (year on year)
    indicator.roe,
    indicator.roa,
    indicator.gross_profit_margin  # gross profit margin (GPM)
).filter(
    valuation.code.in_(securities_list)
)

# Get the list of period-end dates for a given frequency: 'W', 'M' or 'Q'
def get_period_date(peroid, start_date, end_date):
    # Resampling frequency: 'W' weekly, 'M' monthly, 'Q' quarterly,
    # '5min' five minutes, '12D' twelve days
    stock_data = get_price('000001.XSHE', start_date, end_date, 'daily', fields=['close'])
    # Record the last trading day in each period
    stock_data['date'] = stock_data.index
    # Resample: each period takes the values from its last trading day
    period_stock_data = stock_data.resample(peroid).last()
    date = period_stock_data.index
    pydate_array = date.to_pydatetime()
    date_only_array = np.vectorize(lambda s: s.strftime('%Y-%m-%d'))(pydate_array)
    date_only_series = pd.Series(date_only_array)
    # Prepend the day before start_date as the first cross-section date
    start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    start_date = start_date - datetime.timedelta(days=1)
    start_date = start_date.strftime("%Y-%m-%d")
    date_list = date_only_series.values.tolist()
    date_list.insert(0, start_date)
    return date_list

def initialize_df(df, date):
    # Net assets
    df['net_assets'] = df['total_assets'] - df['total_liability']
    df_new = pd.DataFrame(index=securities_list)
    # Valuation factors
    df_new['EP'] = df['pe_ratio'].apply(lambda x: 1/x)
    df_new['BP'] = df['pb_ratio'].apply(lambda x: 1/x)
    df_new['SP'] = df['ps_ratio'].apply(lambda x: 1/x)
    df_new['DP'] = df['dividend_payable']/(df['market_cap']*100000000)
    df_new['RD'] = df['development_expenditure']/(df['market_cap']*100000000)
    df_new['CFP'] = df['pcf_ratio'].apply(lambda x: 1/x)
    # Leverage factors
    # Log of circulating market cap
    df_new['CMV'] = np.log(df['circulating_market_cap'])
    # Total assets / net assets
    df_new['financial_leverage'] = df['total_assets']/df['net_assets']
    # Non-current liabilities / net assets
    df_new['debtequityratio'] = df['total_non_current_liability']/df['net_assets']
    # Cash ratio = (cash + marketable securities) / current liabilities
    df_new['cashratio'] = df['cash_to_current_liability']
    # Current ratio = current assets / current liabilities * 100%
    df_new['currentratio'] = df['current_ratio']
    # Financial quality factors
    # Net profit / total operating revenue
    df_new['NI'] = df['net_profit_to_total_operate_revenue_ttm']
    df_new['GPM'] = df['gross_income_ratio']
    df_new['ROE'] = df['roe_ttm']
    df_new['ROA'] = df['roa_ttm']
    df_new['asset_turnover'] = df['total_asset_turnover_rate']
    df_new['net_operating_cash_flow'] = df['net_operating_cash_flow_coverage']
    # Growth factors
    df_new['Sales_G_q'] = df['operating_revenue_growth_rate']
    df_new['Profit_G_q'] = df['net_profit_growth_rate']
    # Technical indicators
    df_new['RSI'] = pd.Series(RSI(securities_list, date, N1=20))
    dif, dea, macd = MACD(securities_list, date, SHORT=10, LONG=30, MID=15)
    df_new['DIF'] = pd.Series(dif)
    df_new['DEA'] = pd.Series(dea)
    df_new['MACD'] = pd.Series(macd)
    return df_new

dateList = get_period_date(peroid,start_date, end_date)

# Factors from the jqdata factor library
df_jq_factor = {}
# Factors from financial-statement data
df_q_factor = {}
# Raw factor set before preprocessing, keyed by date
df_factor_pre_train = {}
for date in dateList:
    df_jq_factor = get_jq_factor(date)
    df_q_factor = get_fundamentals(q, date=date)
    df_q_factor.index = df_q_factor['code']
    # Merge into one wide table
    df_factor_pre_train[date] = pd.concat([df_q_factor, df_jq_factor], axis=1)
    # Build the factor columns
    df_factor_pre_train[date] = initialize_df(df_factor_pre_train[date], date)

C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\ipykernel_launcher.py:10: SADeprecationWarning: Compiled objects now compile within the constructor.
# Remove the CWD from sys.path while we load stuff.

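2.4 Feature preprocessing

Preprocessing consists of median-based winsorization, missing-value imputation, industry and market-cap neutralization, and standardization.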
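The four steps can be sketched with plain pandas and NumPy on a toy cross-section. This is an approximation under stated assumptions, not the jqfactor implementation: winsorization is assumed to clip at the median plus or minus 5 times the median absolute deviation, and neutralization is assumed to be the residual of an ordinary least squares regression on industry dummies and log market cap. All function names and data here are hypothetical.

```python
import numpy as np
import pandas as pd

def winsorize_mad(s, scale=5):
    """Clip values to median +/- scale * MAD (median absolute deviation)."""
    med = s.median()
    mad = (s - med).abs().median()
    return s.clip(lower=med - scale * mad, upper=med + scale * mad)

def fill_with_group_mean(s, groups):
    """Fill NaN with the group (industry) mean, then with the overall mean."""
    filled = s.fillna(s.groupby(groups).transform("mean"))
    return filled.fillna(s.mean())

def neutralize_ols(s, log_cap, groups):
    """Residual of an OLS regression on industry dummies and log market cap."""
    X = pd.get_dummies(groups).astype(float)
    X["log_cap"] = log_cap
    beta, *_ = np.linalg.lstsq(X.values, s.values, rcond=None)
    return pd.Series(s.values - X.values @ beta, index=s.index)

def standardize(s):
    """Z-score: zero mean, unit standard deviation."""
    return (s - s.mean()) / s.std()

# Toy cross-section: one factor, two industries, one outlier, one missing value.
df = pd.DataFrame({
    "factor": [1.0, 2.0, 3.0, 100.0, np.nan, 4.0, 5.0, 6.0],
    "industry": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "log_cap": [2.0, 2.5, 3.0, 3.5, 2.2, 2.8, 3.1, 3.9],
})

f = winsorize_mad(df["factor"])
f = fill_with_group_mean(f, df["industry"])
f = neutralize_ols(f, df["log_cap"], df["industry"])
f = standardize(f)
print(f.round(3))
```

After these steps the factor has no missing values, zero mean and unit standard deviation, and is uncorrelated with log market cap in this toy sample.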

# Look up the industries that contain a given stock
def get_industry_name(i_Constituent_Stocks, value):
    return [k for k, v in i_Constituent_Stocks.items() if value in v]

# Missing-value handling
def replace_nan_indu(factor_data, stockList, industry_code, date):
    # Replace NaN with the industry mean; any NaN that remains is replaced
    # with the mean over all stocks
    i_Constituent_Stocks = {}
    data_temp = pd.DataFrame(index=industry_code, columns=factor_data.columns)
    for i in industry_code:
        temp = get_industry_stocks(i, date)
        i_Constituent_Stocks[i] = list(set(temp).intersection(set(stockList)))
        data_temp.loc[i] = factor_data.loc[i_Constituent_Stocks[i], :].mean()
    for factor in data_temp.columns:
        # An industry with no value gets the mean across all industries
        null_industry = list(data_temp.loc[pd.isnull(data_temp[factor]), factor].keys())
        for i in null_industry:
            data_temp.loc[i, factor] = data_temp[factor].mean()
        null_stock = list(factor_data.loc[pd.isnull(factor_data[factor]), factor].keys())
        for i in null_stock:
            industry = get_industry_name(i_Constituent_Stocks, i)
            if industry:
                factor_data.loc[i, factor] = data_temp.loc[industry[0], factor]
            else:
                factor_data.loc[i, factor] = factor_data[factor].mean()
    return factor_data

# Data preprocessing
def data_preprocessing(factor_data, stockList, industry_code, date):
    # Winsorize around the median
    factor_data = winsorize_med(factor_data, scale=5, inf2nan=False, axis=0)
    # Fill missing values
    factor_data = replace_nan_indu(factor_data, stockList, industry_code, date)
    # Neutralize
    factor_data = neutralize(factor_data, date=date, axis=0)
    # Standardize
    factor_data = standardlize(factor_data, axis=0)
    return factor_data

# Factor set after preprocessing, keyed by date
df_factor_train = {}

for date in dateList:
    df_factor_train[date] = data_preprocessing(df_factor_pre_train[date], securities_list, industry_code, date)

2.5 Training and cross-validation sets

# Training data
train_data = pd.DataFrame()
for date in dateList[:train_length]:
    traindf = df_factor_train[date]
    # stockList = list(traindf.index)
    # Fetch prices and compute the return to the next cross-section date
    data_close = get_price(securities_list, date, dateList[dateList.index(date)+1], '1d', 'close')['close']
    traindf['pchg'] = data_close.iloc[-1]/data_close.iloc[0]-1
    # Drop rows with missing values
    traindf = traindf.dropna()
    # traindf = traindf.sort(columns='pchg')
    traindf = traindf.sort_values(by=['pchg'], ascending=False)
    # Keep the top and bottom 30% of stocks and drop the noisy middle,
    # i.e. take the 0-30% and 70%-100% slices
    traindf = pd.concat([traindf.iloc[:int(len(traindf['pchg'])/10*3), :],
                         traindf.iloc[int(len(traindf['pchg'])/10*7):, :]])
    # Label 1 if the return is above the mean of the retained sample, else -1
    # (intended outcome: top 30% labeled 1, bottom 30% labeled -1)
    traindf['label'] = list(traindf['pchg'].apply(lambda x: 1 if x > np.mean(list(traindf['pchg'])) else -1))
    if train_data.empty:
        train_data = traindf
    else:
        train_data = pd.concat([train_data, traindf])

C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\IPython\core\interactiveshell.py:3267: FutureWarning: 
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.

  exec(code_obj, self.user_global_ns, self.user_ns)
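The notebook code assigns labels by comparing each return with the mean of the retained sample, which matches the stated intent (top 30% labeled 1, bottom 30% labeled -1) only when that mean falls between the two groups. The sketch below is a possible alternative that labels by group membership directly; the tickers and returns are made up for illustration.

```python
import pandas as pd

# Toy next-period returns for ten hypothetical stocks.
pchg = pd.Series(
    [0.12, 0.08, 0.05, 0.03, 0.01, -0.01, -0.02, -0.04, -0.07, -0.10],
    index=["s%02d" % i for i in range(10)],
)

ranked = pchg.sort_values(ascending=False)
n = len(ranked)
top = ranked.iloc[: int(n / 10 * 3)]       # best 30% by return
bottom = ranked.iloc[int(n / 10 * 7):]     # worst 30% by return

# Label by group membership rather than by comparing with the sample mean.
labels = pd.concat([
    pd.Series(1, index=top.index),
    pd.Series(-1, index=bottom.index),
])
print(labels)
```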

# Test data, keyed by date
test_data = {}
for date in dateList[train_length:-1]:
    testdf = df_factor_train[date]
    stockList = list(testdf.index)
    # Fetch prices and compute the return to the next cross-section date
    data_close = get_price(stockList, date, dateList[dateList.index(date)+1], '1d', 'close')['close']
    testdf['pchg'] = data_close.iloc[-1]/data_close.iloc[0]-1
    # Drop rows with missing values
    testdf = testdf.dropna()
    testdf = testdf.sort_values(by=['pchg'], ascending=False)
    # Keep the top and bottom 30% of stocks and drop the noisy middle,
    # i.e. take the 0-30% and 70%-100% slices
    # (the original sliced the top 30% using len(traindf); testdf is the intended frame)
    testdf = pd.concat([testdf.iloc[:int(len(testdf['pchg'])/10*3), :],
                        testdf.iloc[int(len(testdf['pchg'])/10*7):, :]])
    testdf['label'] = list(testdf['pchg'].apply(lambda x: 1 if x > np.mean(list(testdf['pchg'])) else -1))
    test_data[date] = testdf

y_train = train_data['label']  # split off the target y
X_train = train_data.copy()
del X_train['pchg']
del X_train['label']

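3. Model construction and in-sample training

The Pipeline class combines feature selection and model construction into a single model_pipe object. Cross-validating that object yields scores under different parameter settings, which then guide the parameter choices for the final model.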
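A minimal sketch of cross-validating such a pipeline, on synthetic data rather than the factor matrix. The StratifiedKFold splitter and the "roc_auc" scoring choice are assumptions for illustration; the step names mirror the notebook's, and the classifier settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the factor matrix and the up/down labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

pipe = Pipeline(steps=[
    ("ANOVA", SelectPercentile(f_classif, percentile=50)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Feature selection is refitted inside every fold, so the validation fold
# never influences which features are kept.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print("mean AUC: %.3f (std %.3f)" % (auc_scores.mean(), auc_scores.std()))
```

Wrapping the selector in the pipeline is the design point: selecting features on the full training set before cross-validation would let validation data leak into the selection step.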

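3.1 Feature selection

This article uses SelectPercentile(f_classif, percentile) for feature selection: f_classif sets the scoring function used to rank features, and percentile sets the share of features to keep.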
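To see what the two arguments do, the sketch below builds ten synthetic features of which only one depends on the label, then keeps the top 10%. The data and the 1.5 shift are assumptions chosen so the informative column is easy to spot.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)
n = 400
y = rng.choice([-1, 1], size=n)

# Ten features: column 0 shifts with the label, the other nine are pure noise.
X = rng.normal(size=(n, 10))
X[:, 0] += 1.5 * y

selector = SelectPercentile(f_classif, percentile=10)  # keep the top 10%
X_selected = selector.fit_transform(X, y)

print("ANOVA F-scores:", selector.scores_.round(1))
print("kept columns:  ", np.flatnonzero(selector.get_support()))
```

f_classif scores each feature with a one-way ANOVA F-test against the class labels, so the feature whose mean differs most between classes ranks first.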

transform = SelectPercentile(f_classif)  # keep the top-scoring percentile of features under f_classif (percentile set below)
# model = RandomForestClassifier()
model = XGBClassifier()
model_pipe = Pipeline(steps=[('ANOVA', transform), ('model', model)])  # pipeline combining feature selection and the classifier

# Choose the best feature percentile
# #############################################################################
# Plot the cross-validation score as a function of percentile of features
score_means = list()
score_stds = list()
percentiles = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
for percentile in percentiles:
    model_pipe.set_params(ANOVA__percentile=percentile)
    # Compute the cross-validation score using all available CPU cores (n_jobs=-1)
    this_scores = cross_val_score(model_pipe, X_train, y_train, cv=5, n_jobs=-1)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())
plt.errorbar(percentiles, score_means, np.array(score_stds))
plt.title('Performance of the model-Anova varying the percentile of features selected')
plt.xlabel('Percentile')
plt.ylabel('Prediction rate')
plt.axis('tight')
plt.show()

C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

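The results above show that the random forest's accuracy rises as the percentage of selected features rises, whereas XGBoost performs best at a feature percentage of 30%. Taking efficiency into account as well, the feature percentages chosen for RandomForest and XGBoost are 80% and 30% respectively.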
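The percentage scan described above can be sketched as follows. This is a minimal illustration on synthetic data, not the author's notebook code: `X_demo`, `y_demo`, the helper `scan_percentiles` and the candidate percentages are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the factor matrix (assumption): 400 samples, 20 features, 5 informative.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=5,
                                     n_redundant=0, random_state=0)


def scan_percentiles(model, X, y, percentiles, n_splits=5):
    """Return {percentile: mean cross-validated AUC} for a selection + model pipeline."""
    cv = StratifiedKFold(n_splits)
    results = {}
    for pct in percentiles:
        pipe = Pipeline(steps=[
            ('ANOVA', SelectPercentile(f_classif, percentile=pct)),  # keep the top pct% of features
            ('model', model),
        ])
        scores = cross_val_score(pipe, X, y, scoring='roc_auc', cv=cv)
        results[pct] = scores.mean()
    return results


auc_by_pct = scan_percentiles(
    RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0),
    X_demo, y_demo, percentiles=[30, 50, 80, 100])
best_pct = max(auc_by_pct, key=auc_by_pct.get)  # percentage with the highest mean AUC

for pct, auc in auc_by_pct.items():
    print('percentile=%3d  mean AUC=%.4f' % (pct, auc))
print('best percentile:', best_pct)
```

Because the selector sits inside the pipeline, it is refit on each training fold, so the validation fold never influences which features are kept.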

3.2 Cross-validation parameter tuning

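Once the feature percentage has been chosen, the model is cross-validated over different parameter values (n_estimators, max_depth). This article uses StratifiedKFold to split the training set into training and validation folds. StratifiedKFold takes the class labels into account when splitting, so each fold keeps roughly the same label proportions, instead of splitting purely at random. After cross-validation, the parameter set with the highest validation AUC (or f1-score) is taken as the model's optimal parameters.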
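To make the stratification concrete, here is a small sketch on hypothetical labels (70 "down" and 30 "up"; the arrays are made up for the illustration). Each validation fold should contain the same share of "up" labels as the full sample.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 70 "down" (0) and 30 "up" (1).
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # the feature values do not affect the split

skf = StratifiedKFold(n_splits=5)
fold_up_ratio = []
for train_idx, valid_idx in skf.split(X_demo, y_demo):
    # share of "up" labels in this validation fold
    fold_up_ratio.append(y_demo[valid_idx].mean())

print(fold_up_ratio)
```

A plain KFold without shuffling would put only "down" labels in the first folds of this sorted sample, which is exactly what stratification avoids.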

transform = SelectPercentile(f_classif, percentile=80)  # use f_classif to keep the top ?% most significant features
#model = RandomForestClassifier()
model = XGBClassifier()
model_pipe = Pipeline(steps=[('ANOVA', transform), ('model', model)])  # build a "pipeline" object from the feature selection and the classifier

cv = StratifiedKFold(5)  # number of cross-validation folds
# RandomForest n_estimators
#parameters = [50, 100, 300, 500, 800]  # list of parameter values to try
# XGB/RandomForest max_depth
parameters = [3, 4, 5, 6, 7, 8]
# XGB subsample
#parameters = [0.6, 0.7, 0.8, 0.9, 1]
#score_methods = ['roc_auc', 'accuracy', 'precision', 'recall', 'f1']  # cross-validation metrics
score_methods = ['roc_auc', 'f1']  # cross-validation metrics
#mean_list = list()  # would hold the metric means for each parameter value
#std_list = list()  # would hold the metric standard deviations for each parameter value
for parameter in parameters:  # loop over the parameter values
    t1 = time.time()  # training start time
    score_list = list()  # per-fold scores for each metric
    print('set parameters: %s' % parameter)  # print the parameter value in use
    for score_method in score_methods:  # loop over the metrics
        #model_pipe.set_params(model__n_estimators=parameter)  # set the classifier parameter through the pipeline
        model_pipe.set_params(model__max_depth=parameter)  # set the classifier parameter through the pipeline
        #model_pipe.set_params(model__subsample=parameter)  # set the classifier parameter through the pipeline
        score_tmp = cross_val_score(model_pipe, X_train, y_train, scoring=score_method, cv=cv, n_jobs=-1)  # cross-validated score for this metric
        score_list.append(score_tmp)  # store the per-fold scores
    score_matrix = pd.DataFrame(np.array(score_list), index=score_methods)  # per-fold scores as a matrix
    score_mean = score_matrix.mean(axis=1).rename('mean')  # mean of each metric
    score_std = score_matrix.std(axis=1).rename('std')  # standard deviation of each metric
    score_pd = pd.concat([score_matrix, score_mean, score_std], axis=1)  # combine per-fold scores, mean and std
    #mean_list.append(score_mean)  # append the metric means for this parameter value
    #std_list.append(score_std)  # append the metric standard deviations for this parameter value
    print(score_pd.round(4))  # print the scores for this parameter value, rounded to 4 decimals
    print('-' * 60)
    t2 = time.time()  # end time for this parameter value
    tt = t2 - t1  # elapsed time
    print('time: %s' % str(tt))  # print the elapsed time

set parameters: 3
              0       1       2       3       4    mean     std
roc_auc  0.5769  0.5861  0.5746  0.5960  0.4826  0.5632  0.0459
f1       0.5830  0.5944  0.5951  0.5883  0.5219  0.5765  0.0310

time: 190.25440001487732
set parameters: 4
              0       1       2       3       4    mean     std
roc_auc  0.5814  0.5856  0.5809  0.5974  0.4829  0.5657  0.0467
f1       0.5786  0.5881  0.5847  0.5838  0.5122  0.5695  0.0322

time: 239.38179993629456
set parameters: 5
              0       1       2       3       4    mean     std
roc_auc  0.5850  0.5859  0.5839  0.5979  0.4829  0.5671  0.0474
f1       0.5773  0.5829  0.5742  0.5794  0.5011  0.5630  0.0348

time: 286.11240005493164

3.3 Model construction

Using the best models found by the cross-validation above, the random forest and XGBoost ensemble models are each trained on the training set.

transform.fit(X_train, y_train)  # fit the feature selector on the training set
X_train_final = transform.transform(X_train)  # keep only the significant features
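SelectPercentile's transform returns a plain array by default, so the factor names are lost after selection. One way to recover them is the selector's `get_support()` mask, sketched below on toy data; the factor names and `X_demo`/`y_demo` are hypothetical and only `momentum` carries signal here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)
# Hypothetical factor names (assumption); only 'momentum' drives the label in this toy data.
X_demo = pd.DataFrame(rng.normal(size=(200, 5)),
                      columns=['momentum', 'pe_ratio', 'turnover', 'roe', 'size'])
y_demo = (X_demo['momentum'] + 0.1 * rng.normal(size=200) > 0).astype(int)

selector = SelectPercentile(f_classif, percentile=40)  # keep the top 40%, i.e. 2 of 5 columns
selector.fit(X_demo, y_demo)

# Boolean mask over the input columns -> names of the retained factors
selected_names = X_demo.columns[selector.get_support()]
X_demo_final = pd.DataFrame(selector.transform(X_demo),
                            index=X_demo.index, columns=selected_names)
print(list(selected_names))
```

Keeping the selected data in a labelled DataFrame makes later steps, such as a feature-importance chart, easier to read.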

3.3.1 Random forest

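(1) Number of trees (n_estimators): the number of weak learners (decision trees), set by the parameter n_estimators. In general, too few trees cannot exploit the advantage of an ensemble and tend to underfit; too many trees increase the computational cost, and beyond a certain count adding more trees brings little further improvement, so a moderate value is needed. Table 1 shows how AUC and f1-score vary with the number of trees: increasing the number of trees improves the predictive performance of the RandomForest model. Given the diminishing marginal gain, and weighing training time against the improvement, the number of trees is set to 500.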

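(2) Tree depth (max_depth): max_depth limits the complexity of each tree. See the table for the performance of the max_depth parameter. As max_depth increases, AUC is unstable, the f1-score trends upward, and training slows down. Weighing training time against the improvement, max_depth=4 is chosen.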

model = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=0)
model.fit(X_train_final, y_train)  # train the model

time: 8.404559850692749

3.3.2 XGBoost

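(1) subsample: the fraction of the training set that takes part in model training, with a value between 0 and 1; it is effective against overfitting. See the output in section 3.2 for the performance of the subsample parameter. As subsample increases, the f1-score trends downward and training gets faster. Weighing training time against the improvement, subsample=0.9 is chosen.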

(2) max_depth: see the table for the performance of the max_depth parameter. As max_depth increases, AUC and f1-score trend downward and training takes longer. max_depth=1 is chosen.

model = XGBClassifier(max_depth=1, subsample=0.9, random_state=0)
model.fit(X_train_final, y_train)  # train the model

time: 8.404559850692749
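The difference in model complexity between depth-4 forest trees and depth-1 boosted stumps can be illustrated by counting tree nodes. The sketch below runs on synthetic data and uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBClassifier, which is an assumption of this example and not the model used in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data (assumption): 300 samples, 10 features, 4 informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X_demo, y_demo)

# Stand-in for XGBClassifier(max_depth=1, subsample=0.9): boosted decision stumps.
stumps = GradientBoostingClassifier(n_estimators=50, max_depth=1, subsample=0.9,
                                    random_state=0)
stumps.fit(X_demo, y_demo)

# Total number of nodes (splits plus leaves) across all trees of each ensemble
forest_nodes = sum(tree.tree_.node_count for tree in forest.estimators_)
stump_nodes = sum(tree.tree_.node_count for tree in stumps.estimators_.ravel())

print('random forest nodes:', forest_nodes)
print('boosted stump nodes:', stump_nodes)
```

A depth-1 tree has at most three nodes (one split and two leaves), so with the same number of trees the boosted ensemble is much smaller than a forest of depth-4 trees.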

3.4 Out-of-sample testing

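# correlation between the test-set features and the predictions
# (uses test_sample_predict, which is filled by the prediction loop in this section)
factor_predict_corr = pd.DataFrame()
for date in dateList[train_length:-1]:
    test_feature = test_data[date].copy()
    del test_feature['pchg']
    del test_feature['label']
    test_feature['predict'] = list(test_sample_predict[date])
    factor_predict_corr[date] = test_feature.corr()['predict']
factor_predict_corr = factor_predict_corr.iloc[:-1]  # drop the 'predict' row itself
# heatmap of the correlation between the model's prediction of next-period direction and the current-period factor values
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111)
sns.set()
ax = sns.heatmap(factor_predict_corr)
#fig.savefig('C:\\Users\\Administrator\\Desktop\\corr.png')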
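A per-date table of factor-to-prediction correlations can also be built with pandas' `corrwith`, which avoids computing the full correlation matrix. The sketch below uses made-up factor names, dates and a toy prediction rule, all of which are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
dates = ['2018-01-31', '2018-02-28']  # hypothetical test dates

corr_by_date = {}
for date in dates:
    # Hypothetical cross-section of 100 stocks with three factors
    features = pd.DataFrame(rng.normal(size=(100, 3)),
                            columns=['factor_a', 'factor_b', 'factor_c'])
    # Toy "prediction": up (1) whenever factor_a is positive
    predict = (features['factor_a'] > 0).astype(int)
    # Correlation of every factor column with the prediction
    corr_by_date[date] = features.corrwith(predict)

# Rows are factors, columns are dates: the layout a heatmap expects
factor_predict_corr = pd.DataFrame(corr_by_date)
print(factor_predict_corr.round(3))
```

In this toy setup only `factor_a` should show a strong correlation, since the prediction is derived from it.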

test_sample_predict = {}
test_sample_score = []
test_sample_accuracy = []
test_sample_roc_auc = []
test_sample_date = []
for date in dateList[train_length:-1]:
    # out-of-sample features and labels
    y_test = test_data[date]['label']
    X_test = test_data[date].copy()
    del X_test['pchg']
    del X_test['label']
    # predict on the new data
    X_test_final = transform.transform(X_test)  # apply the feature selection
    # predicted labels and predicted probabilities
    y_pred_tmp = model.predict(X_test_final)
    y_pred = pd.DataFrame(y_pred_tmp, columns=['label_predict'])  # predicted labels
    y_pred_proba = pd.DataFrame(model.predict_proba(X_test_final), columns=['pro1', 'pro2'])  # predicted probabilities
    # merge the predicted labels, the probabilities and the original X
    y_pred.set_index(X_test.index, inplace=True)
    y_pred_proba.set_index(X_test.index, inplace=True)
    predict_pd = pd.concat((X_test, y_pred, y_pred_proba), axis=1)
    print('Predict date:')
    print(date)
    print('AUC:')
    print(roc_auc_score(y_test, y_pred))  # AUC computed from the predicted labels
    print('Accuracy:')
    print(accuracy_score(y_test, y_pred))  # accuracy of the predicted labels
    print('-' * 60)
    ## kept for the statistics and plots below
    test_sample_date.append(date)
    # out-of-sample predictions
    test_sample_predict[date] = y_pred_tmp
    # out-of-sample accuracy
    test_sample_accuracy.append(accuracy_score(y_test, y_pred))
    # out-of-sample AUC
    test_sample_roc_auc.append(roc_auc_score(y_test, y_pred))
print('AUC mean info')
print(np.mean(test_sample_roc_auc))
print('-' * 60)
print('ACCURACY mean info')
print(np.mean(test_sample_accuracy))
print('-' * 60)

Predict date:
2018-01-31
AUC:
0.4300806362225671
Accuracy:
0.4306306306306306

Predict date:
2018-02-28
AUC:
0.5622402183551319
Accuracy:
0.5567567567567567

Predict date:
2018-03-31
AUC:
0.5045652294391487
Accuracy:
0.5045045045045045

Predict date:
2018-04-30
AUC:
0.5007100698293061
Accuracy:
0.5063063063063064

Predict date:
2018-05-31
AUC:
0.48390739303923747
Accuracy:
0.4810810810810811

Predict date:
2018-06-30
AUC:
0.5917825870095323
Accuracy:
0.5918918918918918

Predict date:
2018-07-31
AUC:
0.47668892706716914
Accuracy:
0.47927927927927927

Predict date:
2018-08-31
AUC:
0.5283543005985369
Accuracy:
0.527027027027027

Predict date:
2018-09-30
AUC:
0.5334079195300377
Accuracy:
0.527927927927928

Predict date:
2018-10-31
AUC:
0.5999535856794502
Accuracy:
0.5990990990990991

Predict date:
2018-11-30
AUC:
0.5002424628685436
Accuracy:
0.4990990990990991

Predict date:
2018-12-31
AUC:
0.48053715916648193
Accuracy:
0.4810810810810811

AUC mean info
0.5160392074004285

ACCURACY mean info
0.5153903903903904
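The AUC and accuracy figures above track each other closely. A likely reason is that the AUC is computed from hard 0/1 predictions: in that case roc_auc_score reduces to balanced accuracy (the mean of the true-positive and true-negative rates). The sketch below checks this on synthetic data and compares it with an AUC computed from probabilities; the arrays are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=500)
# Hypothetical class-1 probabilities that are weakly informative about y_true
proba_up = np.clip(0.5 + 0.1 * (y_true - 0.5) + 0.2 * rng.normal(size=500), 0, 1)
y_label = (proba_up > 0.5).astype(int)  # hard labels, as returned by predict()

auc_from_labels = roc_auc_score(y_true, y_label)   # ROC curve has a single interior point
auc_from_proba = roc_auc_score(y_true, proba_up)   # uses the full ranking of the scores
balanced_acc = balanced_accuracy_score(y_true, y_label)

print('AUC from labels       :', round(auc_from_labels, 4))
print('balanced accuracy     :', round(balanced_acc, 4))
print('AUC from probabilities:', round(auc_from_proba, 4))
```

If a ranking-based AUC is wanted, one option is to pass the class-1 column of `predict_proba` to `roc_auc_score` instead of the predicted labels.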

4. Model evaluation

4.1 Predictive power

xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_auc = test_sample_roc_auc
# configure the x axis
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(xs_date, ys_auc, 'r')
# rotate the date labels automatically
plt.gcf().autofmt_xdate()
# x-axis label
plt.xlabel('date')
# y-axis label
plt.ylabel("test AUC")
plt.show()

xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_score = test_sample_accuracy
# configure the x axis
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(xs_date, ys_score, 'r')
# rotate the date labels automatically
plt.gcf().autofmt_xdate()
# x-axis label
plt.xlabel('date')
# y-axis label
plt.ylabel("test accuracy")
plt.show()

f = plt.figure(figsize=(15, 10))
sns.set(style="whitegrid")
data1 = pd.DataFrame(ys_auc, xs_date, columns=['AUC'])  # column names as a list, not a set
data2 = pd.DataFrame(ys_score, xs_date, columns=['accuracy'])
data = pd.concat([data1, data2], axis=1, sort=False)  # place the two series side by side, aligned by date
sns.lineplot(data=data, palette="tab10", linewidth=2.5)

4.2 Feature importance

#model = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=0)
model = XGBClassifier(max_depth=1, subsample=0.9, random_state=0)
model.fit(X_train, y_train)  # train the model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

#%matplotlib inline
fig = plt.figure(figsize=(15, 10))
n_features = X_train.shape[1]
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), X_train.columns)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
#plt.savefig('C:\\Users\\Administrator\\Desktop\\rr-featureimport.png')
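For reading the importances as numbers and not only from the bar chart, they can be wrapped in a labelled pandas Series and sorted. This is a minimal sketch on toy data; the factor names, `X_demo`, `y_demo` and `model_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Hypothetical factors (assumption); the label depends on 'momentum' only.
X_demo = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=['momentum', 'pe_ratio', 'turnover', 'roe'])
y_demo = (X_demo['momentum'] > 0).astype(int)

model_demo = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
model_demo.fit(X_demo, y_demo)

# Label each importance with its factor name and rank from most to least important
importance = pd.Series(model_demo.feature_importances_,
                       index=X_demo.columns).sort_values(ascending=False)
used = importance[importance > 0]  # factors that the model actually split on

print(importance.round(4))
print('factors used: %d of %d' % (len(used), len(importance)))
```

Counting the factors with non-zero importance gives a quick check of how many factors a model actually uses, which is the comparison drawn between the two models in the conclusions.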

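Disclaimer: this article was published by an author hosted on FX168财经网人物频道 and does not represent the views of FX168财经网. The views in it are for reference only; investing carries risk, so enter the market with caution.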
