
Multi-Factor Model Analysis Based on Random Forest and XGBoost

Posted by 量化客 on June 19 at 18:35

Overview

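The "Artificial Intelligence Stock Selection" research report series (《人工智能选股系列研报》) by Mr. Lin Xiaoming (林晓明) of Huatai Securities is a classic. I have long used it as a textbook and come back to it repeatedly; the one regret is that it comes without source code. Partly for lack of ability and partly out of laziness, I waited until the Xi'an Jiaotong University quant group led by Teacher Yuan (西安交大元老师量化小组) shared the code for 《支持向量机模型（SVM）在多因子选股模型领域的应用》 (Applying Support Vector Machine (SVM) Models to Multi-Factor Stock Selection), and then followed that template to port random forest and XGBoost to the multi-factor model. My thanks to them.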

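Bagging and Boosting are both ensemble methods for classifiers. Bagging builds its ensemble in parallel: it first trains a number of small classifiers and then averages their outputs to obtain the final model. Because averaging lowers variance, Bagging usually performs better than a single classifier. Boosting builds its ensemble sequentially: classifiers are added one after another, and each tries to reduce the bias of the final model. Boosting generally achieves high accuracy and good robustness. This post uses random forest and XGBoost as representatives of the two approaches to analyze and compare multi-factor models. Both extend the decision tree algorithm; the former is based on Bagging and the latter on Boosting. The analysis covers: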

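The effect of the number of features on the predictive power of random forest and XGBoost
The effect of the hyperparameters (model complexity) of random forest and XGBoost on predictive power
The predictive power of random forest and XGBoost
Feature importance in random forest and XGBoost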

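I have applied random forest to a multi-factor model once before, in 《基于机器学习的多因子选股策略》 (A Machine-Learning-Based Multi-Factor Stock Selection Strategy), which was my first use of it. Unlike that post, this one selects features with SelectPercentile, cross-validates through a Pipeline, and predicts the direction of returns as a classification problem (the earlier post predicted market capitalization as a regression problem).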

Conclusions

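The number of features affects the two models differently: the predictive power of random forest rises as features are added, while that of XGBoost is unstable. This post uses 80% and 30% as the feature-selection percentiles for random forest and XGBoost respectively.
For random forest, predictive power rises with n_estimators; as max_depth increases, AUC rises but the F1-score falls. For XGBoost, as subsample increases, predictive power falls but training is faster; as max_depth increases, predictive power falls and training is slower.
The two models have similar predictive power, but XGBoost has an overwhelming advantage in training time. At comparable predictive power, the XGBoost model is also the simpler one: every tree in this post's XGBoost model has depth 1, whereas the trees in the random forest generally have depth 4 or more, which makes that model far more complex.
Technical factors rank highest in feature importance, followed by valuation factors. Random forest uses most of the factors, while XGBoost uses only some of them.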
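Because the conclusions compare models by AUC and F1-score, a minimal sketch of how the two metrics are computed may help. The function names and the toy scores below are illustrative assumptions and do not come from the original notebook.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def f1_score_at(labels, scores, threshold):
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y != 1)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (+1 / -1) and predicted probabilities of the +1 class.
labels = [1, 1, 1, -1, -1, -1]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = auc_score(labels, scores)               # depends only on the ranking
f1_half = f1_score_at(labels, scores, 0.5)    # depends on the cut-off
f1_low = f1_score_at(labels, scores, 0.35)
```

AUC depends only on how the scores rank the samples, while F1 also depends on the decision threshold. That difference is one way a parameter change can raise AUC and lower F1 at the same time.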

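References

林晓明 (Lin Xiaoming). 《人工智能选股之Boosting模型》 (AI Stock Selection: Boosting Models)
林晓明 (Lin Xiaoming). 《人工智能选股之随机森林模型》 (AI Stock Selection: Random Forest Models)
西安交大元老师量化小组 (Xi'an Jiaotong University quant group led by Teacher Yuan). 《支持向量机模型（SVM）在多因子选股模型领域的应用》 (Applying Support Vector Machine (SVM) Models to Multi-Factor Stock Selection)
宋天龙 (Song Tianlong). 《Python数据分析与数据化运营》 (Python Data Analysis and Data-Driven Operations)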


1. Overview

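Bagging and Boosting are both ensemble methods for classifiers. Bagging builds its ensemble in parallel: it first trains a number of small classifiers and then averages their outputs to obtain the final model. Because averaging lowers variance, Bagging usually performs better than a single classifier. Boosting builds its ensemble sequentially: classifiers are added one after another, and each tries to reduce the bias of the final model. Boosting generally achieves high accuracy and good robustness. This post uses RandomForest and XGBoost as representatives of the two approaches to analyze and compare the traditional multi-factor model. Both extend the decision tree algorithm; the former is based on Bagging and the latter on Boosting. The analysis covers:
(1) The effect of the number of features on the predictive power of RandomForest and XGBoost
(2) The effect of the hyperparameters (model complexity) of RandomForest and XGBoost on predictive power
(3) The predictive power of RandomForest and XGBoost
(4) Feature importance in RandomForest and XGBoost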
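The parallel-versus-sequential distinction can be sketched with depth-1 trees (decision stumps) in plain Python. This is a toy illustration under my own assumptions: it uses AdaBoost as a stand-in for the boosting idea, not XGBoost's gradient boosting, and all function names and the 1-D data are hypothetical.

```python
import math
import random


def stump_predict(stump, x):
    """A decision stump (depth-1 tree): polarity if x > threshold, else -polarity."""
    threshold, polarity = stump
    return polarity if x > threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (stump, weighted_error) for the stump with the lowest weighted error."""
    best_stump, best_err = None, float("inf")
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            stump = (threshold, polarity)
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(stump, x) != y)
            if err < best_err:
                best_stump, best_err = stump, err
    return best_stump, best_err


def bagging_fit(xs, ys, n_estimators, rng):
    """Bagging: fit each stump independently on a bootstrap resample."""
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stump, _ = fit_stump([xs[i] for i in idx], [ys[i] for i in idx], [1.0 / n] * n)
        stumps.append(stump)
    return stumps


def bagging_predict(stumps, x):
    """Combine the stumps by an unweighted majority vote."""
    return 1 if sum(stump_predict(s, x) for s in stumps) >= 0 else -1


def adaboost_fit(xs, ys, n_rounds):
    """Boosting (AdaBoost): fit stumps in sequence, upweighting earlier mistakes."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Combine the stumps by a vote weighted by each stump's alpha."""
    return 1 if sum(a * stump_predict(s, x) for a, s in ensemble) >= 0 else -1


def accuracy(predict, xs, ys):
    return sum(1 for x, y in zip(xs, ys) if predict(x) == y) / len(xs)


# Toy 1-D data: the positive class is an interval, which no single stump can fit.
xs = list(range(10))
ys = [1 if 3 <= x <= 6 else -1 for x in xs]

single, _ = fit_stump(xs, ys, [0.1] * 10)
bagged = bagging_fit(xs, ys, n_estimators=25, rng=random.Random(0))
boosted = adaboost_fit(xs, ys, n_rounds=3)

acc_single = accuracy(lambda x: stump_predict(single, x), xs, ys)
acc_bagged = accuracy(lambda x: bagging_predict(bagged, x), xs, ys)
acc_boosted = accuracy(lambda x: adaboost_predict(boosted, x), xs, ys)
```

On this toy data the best single stump classifies 7 of 10 points correctly, while three boosted stumps classify all 10. That is the bias reduction the paragraph attributes to Boosting; the bagged stumps are each fit independently, so they could be trained in parallel.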

import pandas as pd
import numpy as np
import math
import jqdata
import time
import datetime
from jqfactor import standardlize
from jqfactor import winsorize_med
from jqfactor import get_factor_values
from jqfactor import neutralize
from sklearn.model_selection import StratifiedKFold, cross_val_score  # cross-validation utilities
from sklearn.feature_selection import SelectPercentile, f_classif  # feature-selection utilities
from sklearn.pipeline import Pipeline  # Pipeline class
from sklearn.metrics import accuracy_score  # accuracy metric
from sklearn.metrics import roc_auc_score
from jqlib.technical_analysis import *
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns

2. Data Acquisition and Preprocessing

2.1 Stock Universe

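Constituents of the CSI All Share Index (中证全指). ST stocks, stocks suspended on the trading day after each cross-section date, and stocks listed for less than 3 months are excluded. Each stock is treated as one sample.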

# Drop stocks listed fewer than n days before beginDate
def delete_stop(stocks, beginDate, n):
    stockList = []
    beginDate = datetime.datetime.strptime(beginDate, "%Y-%m-%d")
    for stock in stocks:
        start_date = get_security_info(stock).start_date
        if start_date < (beginDate - datetime.timedelta(days=n)).date():
            stockList.append(stock)
    return stockList

# Drop ST stocks
def delete_st(stocks, begin_date):
    st_data = get_extras('is_st', stocks, count=1, end_date=begin_date)
    stockList = [stock for stock in stocks if not st_data[stock][0]]
    return stockList

2.2 Time Period

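The five years from January 1, 2014 to December 31, 2018. The first four years (48 months) form the training set and the final year (12 months) forms the test set.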
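A minimal sketch of the chronological 48/12 split, assuming calendar month-ends stand in for the last trading day of each month (the notebook itself derives its dates from price data). The helper `month_ends` is hypothetical.

```python
import calendar


def month_ends(start_year, start_month, n_months):
    """Return n_months consecutive calendar month-end dates as 'YYYY-MM-DD' strings."""
    dates = []
    year, month = start_year, start_month
    for _ in range(n_months):
        last_day = calendar.monthrange(year, month)[1]
        dates.append("%04d-%02d-%02d" % (year, month, last_day))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return dates


all_dates = month_ends(2014, 1, 60)       # 2014-01 through 2018-12
train_length = 48                         # same value as in the notebook
train_dates = all_dates[:train_length]    # first four years
test_dates = all_dates[train_length:]     # final year
```

Splitting by time rather than at random keeps every test month strictly after every training month, so the model is never evaluated on a period it has already seen.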

# Sample at monthly frequency
peroid = 'M'
# Sample period (the training and test sets together span 2014-1-31 to 2018-12-31)
start_date = '2014-02-01'
end_date = '2019-01-31'
# Training-set length in periods (months)
train_length = 48
# JoinQuant level-1 industries
industry_code = ['HY001', 'HY002', 'HY003', 'HY004', 'HY005', 'HY006',
                 'HY007', 'HY008', 'HY009', 'HY010', 'HY011']
# Stock universe: constituents of the CSI All Share Index
securities_list = delete_stop(get_index_stocks('000985.XSHG'), start_date, 90)
securities_list = delete_st(securities_list, start_date)

2.3 Feature and Label Extraction

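On the last trading day of each calendar month, factor exposures are computed as each sample's raw features, and the next-period return is computed as its label.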
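The two steps in this sentence, picking the last trading day of each month and computing the return to the next period end, can be sketched in plain Python. The trading days and prices below are made up for illustration, and both helper names are hypothetical.

```python
from itertools import groupby


def last_trading_day_per_month(trading_days):
    """trading_days: sorted 'YYYY-MM-DD' strings. Return the last one in each month."""
    return [list(days)[-1] for _, days in groupby(trading_days, key=lambda d: d[:7])]


def next_period_returns(closes, period_ends):
    """closes: {date: price}. Return {date: return from that date to the next period end}."""
    returns = {}
    for current, following in zip(period_ends, period_ends[1:]):
        returns[current] = closes[following] / closes[current] - 1
    return returns


# Hypothetical trading calendar and month-end closes for one stock.
trading_days = ["2018-01-30", "2018-01-31", "2018-02-27", "2018-02-28",
                "2018-03-29", "2018-03-30"]
closes = {"2018-01-31": 10.0, "2018-02-28": 11.0, "2018-03-30": 9.9}

period_ends = last_trading_day_per_month(trading_days)
labels = next_period_returns(closes, period_ends)
```

The last period end has no following date, so it gets no label; features observed at a period end are always paired with a return that is realized afterwards.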

jqfactors_list = ['current_ratio', 'net_profit_to_total_operate_revenue_ttm', 'gross_income_ratio',
                  'roe_ttm', 'roa_ttm', 'total_asset_turnover_rate',
                  'net_operating_cash_flow_coverage', 'net_operate_cash_flow_ttm', 'net_profit_ttm',
                  'cash_to_current_liability', 'operating_revenue_growth_rate', 'non_recurring_gain_loss',
                  'operating_revenue_ttm', 'net_profit_growth_rate']

def get_jq_factor(date):
    factor_data = get_factor_values(securities=securities_list,
                                    factors=jqfactors_list,
                                    count=1,
                                    end_date=date)
    df_jq_factor = pd.DataFrame(index=securities_list)
    for i in factor_data.keys():
        df_jq_factor[i] = factor_data[i].iloc[0, :]
    return df_jq_factor

q = query(valuation.code,
          valuation.market_cap,  # market capitalization
          valuation.circulating_market_cap,
          valuation.pe_ratio,  # price-to-earnings ratio (TTM)
          valuation.pb_ratio,  # price-to-book ratio (TTM)
          valuation.pcf_ratio,  # CFP
          valuation.ps_ratio,  # PS
          balance.total_assets,
          balance.total_liability,
          balance.development_expenditure,  # RD
          balance.dividend_payable,
          balance.fixed_assets,
          balance.total_non_current_liability,
          income.operating_profit,
          income.total_profit,  # OPTP
          # indicator.net_profit_to_total_revenue,  # net profit / total operating revenue
          indicator.inc_revenue_year_on_year,  # operating revenue growth (year on year)
          indicator.inc_net_profit_year_on_year,  # net profit growth (year on year)
          indicator.roe,
          indicator.roa,
          indicator.gross_profit_margin  # gross profit margin (GPM)
          ).filter(
              valuation.code.in_(securities_list))

# Build the list of period-end dates for a given frequency: 'W', 'M' or 'Q'
def get_period_date(peroid, start_date, end_date):
    # peroid is the resampling frequency: 'W' weekly, 'M' monthly, 'Q' quarterly,
    # '5min' five minutes, '12D' twelve days
    stock_data = get_price('000001.XSHE', start_date, end_date, 'daily', fields=['close'])
    # Record the trading date of each row
    stock_data['date'] = stock_data.index
    # Resample: each period takes the values of its last trading day
    period_stock_data = stock_data.resample(peroid).last()
    date = period_stock_data.index
    pydate_array = date.to_pydatetime()
    date_only_array = np.vectorize(lambda s: s.strftime('%Y-%m-%d'))(pydate_array)
    date_only_series = pd.Series(date_only_array)
    # Prepend the day before start_date as the first boundary
    start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    start_date = start_date - datetime.timedelta(days=1)
    start_date = start_date.strftime("%Y-%m-%d")
    date_list = date_only_series.values.tolist()
    date_list.insert(0, start_date)
    return date_list

def initialize_df(df, date):
    # Net assets
    df['net_assets'] = df['total_assets'] - df['total_liability']
    df_new = pd.DataFrame(index=securities_list)
    # Valuation factors
    df_new['EP'] = df['pe_ratio'].apply(lambda x: 1/x)
    df_new['BP'] = df['pb_ratio'].apply(lambda x: 1/x)
    df_new['SP'] = df['ps_ratio'].apply(lambda x: 1/x)
    df_new['DP'] = df['dividend_payable'] / (df['market_cap'] * 100000000)
    df_new['RD'] = df['development_expenditure'] / (df['market_cap'] * 100000000)
    df_new['CFP'] = df['pcf_ratio'].apply(lambda x: 1/x)
    # Leverage factors
    # Log of circulating market cap
    df_new['CMV'] = np.log(df['circulating_market_cap'])
    # Total assets / net assets
    df_new['financial_leverage'] = df['total_assets'] / df['net_assets']
    # Non-current liabilities / net assets
    df_new['debtequityratio'] = df['total_non_current_liability'] / df['net_assets']
    # Cash ratio = (cash + marketable securities) / current liabilities
    df_new['cashratio'] = df['cash_to_current_liability']
    # Current ratio = current assets / current liabilities * 100%
    df_new['currentratio'] = df['current_ratio']
    # Financial-quality factors
    # Net profit / total operating revenue
    df_new['NI'] = df['net_profit_to_total_operate_revenue_ttm']
    df_new['GPM'] = df['gross_income_ratio']
    df_new['ROE'] = df['roe_ttm']
    df_new['ROA'] = df['roa_ttm']
    df_new['asset_turnover'] = df['total_asset_turnover_rate']
    df_new['net_operating_cash_flow'] = df['net_operating_cash_flow_coverage']
    # Growth factors
    df_new['Sales_G_q'] = df['operating_revenue_growth_rate']
    df_new['Profit_G_q'] = df['net_profit_growth_rate']
    # Technical indicators
    df_new['RSI'] = pd.Series(RSI(securities_list, date, N1=20))
    dif, dea, macd = MACD(securities_list, date, SHORT=10, LONG=30, MID=15)
    df_new['DIF'] = pd.Series(dif)
    df_new['DEA'] = pd.Series(dea)
    df_new['MACD'] = pd.Series(macd)
    return df_new

dateList = get_period_date(peroid,start_date, end_date)

# Factors from the JoinQuant factor library
df_jq_factor = {}
# Factors from financial-statement data
df_q_factor = {}
# Raw factors before preprocessing, keyed by date
df_factor_pre_train = {}
for date in dateList:
    df_jq_factor = get_jq_factor(date)
    df_q_factor = get_fundamentals(q, date=date)
    df_q_factor.index = df_q_factor['code']
    # Merge into one wide table
    df_factor_pre_train[date] = pd.concat([df_q_factor, df_jq_factor], axis=1)
    # Derive the factor columns
    df_factor_pre_train[date] = initialize_df(df_factor_pre_train[date], date)


2.4 Feature Preprocessing

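Preprocessing consists of median-based winsorization, missing-value imputation, industry and market-cap neutralization, and standardization.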
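A minimal sketch of the first and last steps, winsorization and standardization, in plain Python. It assumes that median-based winsorization clips values to the median plus or minus `scale` times the median absolute deviation (MAD), and that standardization uses the population standard deviation; whether jqfactor's `winsorize_med` and `standardlize` follow exactly these conventions is not verified here. Both function names are hypothetical.

```python
import statistics


def winsorize_mad(values, scale=5):
    """Clip values to median +/- scale * MAD (median absolute deviation)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    low, high = med - scale * mad, med + scale * mad
    return [min(max(v, low), high) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical factor values for five stocks, with one extreme outlier.
raw = [1.0, 2.0, 3.0, 4.0, 100.0]
clipped = winsorize_mad(raw, scale=5)   # median 3, MAD 1, so bounds are [-2, 8]
z = standardize(clipped)
```

Clipping before standardizing matters: left unclipped, the outlier of 100 would dominate both the mean and the standard deviation and compress every other stock's score toward the same value.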

# Look up the industries a stock belongs to
def get_industry_name(i_Constituent_Stocks, value):
    return [k for k, v in i_Constituent_Stocks.items() if value in v]

# Missing-value imputation
def replace_nan_indu(factor_data, stockList, industry_code, date):
    # Fill NaN with the industry mean; if that is still NaN, fall back to the mean over all stocks
    i_Constituent_Stocks = {}
    data_temp = pd.DataFrame(index=industry_code, columns=factor_data.columns)
    for i in industry_code:
        temp = get_industry_stocks(i, date)
        i_Constituent_Stocks[i] = list(set(temp).intersection(set(stockList)))
        data_temp.loc[i] = factor_data.loc[i_Constituent_Stocks[i], :].mean()
    for factor in data_temp.columns:
        # An industry with no valid mean takes the mean across all industries
        null_industry = list(data_temp.loc[pd.isnull(data_temp[factor]), factor].keys())
        for i in null_industry:
            data_temp.loc[i, factor] = data_temp[factor].mean()
        null_stock = list(factor_data.loc[pd.isnull(factor_data[factor]), factor].keys())
        for i in null_stock:
            industry = get_industry_name(i_Constituent_Stocks, i)
            if industry:
                factor_data.loc[i, factor] = data_temp.loc[industry[0], factor]
            else:
                factor_data.loc[i, factor] = factor_data[factor].mean()
    return factor_data

# Data preprocessing
def data_preprocessing(factor_data, stockList, industry_code, date):
    # Winsorize
    factor_data = winsorize_med(factor_data, scale=5, inf2nan=False, axis=0)
    # Impute missing values
    factor_data = replace_nan_indu(factor_data, stockList, industry_code, date)
    # Neutralize
    factor_data = neutralize(factor_data, date=date, axis=0)
    # Standardize
    factor_data = standardlize(factor_data, axis=0)
    return factor_data
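The neutralization step can be illustrated as taking the residual of a regression of the factor on the exposure to be removed. The sketch below regresses on a single exposure (log market cap) with an intercept and omits the industry dummies that full industry and market-cap neutralization would include; it is a simplified assumption about the idea, not a reproduction of jqfactor's `neutralize`. The function name and data are hypothetical.

```python
def neutralize_on(factor, exposure):
    """Return residuals of an OLS regression of factor on one exposure (with intercept)."""
    n = len(factor)
    mean_x = sum(exposure) / n
    mean_y = sum(factor) / n
    var_x = sum((x - mean_x) ** 2 for x in exposure)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(exposure, factor))
    beta = cov_xy / var_x
    alpha = mean_y - beta * mean_x
    return [y - (alpha + beta * x) for x, y in zip(exposure, factor)]


log_cap = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical log market caps
factor = [2.1, 3.9, 6.2, 7.8, 10.0]      # hypothetical raw factor values
residual = neutralize_on(factor, log_cap)
```

The residual is uncorrelated with the exposure by construction, so whatever the model later learns from the neutralized factor cannot simply be a proxy for company size.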

# Preprocessed factors, keyed by date
df_factor_train = {}

for date in dateList:
    df_factor_train[date] = data_preprocessing(df_factor_pre_train[date], securities_list, industry_code, date)

2.5 Training and Cross-Validation Sets

# Training-set data
train_data = pd.DataFrame()
for date in dateList[:train_length]:
    traindf = df_factor_train[date]
    # Fetch close prices from this date to the next period end to compute the return
    data_close = get_price(securities_list, date, dateList[dateList.index(date)+1], '1d', 'close')['close']
    traindf['pchg'] = data_close.iloc[-1] / data_close.iloc[0] - 1
    # Drop rows with missing values
    traindf = traindf.dropna()
    traindf = traindf.sort_values(by=['pchg'], ascending=False)
    # Keep the top 30% and bottom 30% of stocks by return and drop the noisy middle
    traindf = pd.concat([traindf.iloc[:int(len(traindf) / 10 * 3), :],
                         traindf.iloc[int(len(traindf) / 10 * 7):, :]])
    # Label 1 if the return is above the mean of the retained sample, otherwise -1
    mean_pchg = traindf['pchg'].mean()
    traindf['label'] = traindf['pchg'].apply(lambda x: 1 if x > mean_pchg else -1)
    if train_data.empty:
        train_data = traindf
    else:
        train_data = pd.concat([train_data, traindf])


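# Test-set data
test_data = {}
for date in dateList[train_length:-1]:
    testdf = df_factor_train[date]
    stockList = list(testdf.index)
    # Fetch close prices from this date to the next period end to compute the return
    data_close = get_price(stockList, date, dateList[dateList.index(date)+1], '1d', 'close')['close']
    testdf['pchg'] = data_close.iloc[-1] / data_close.iloc[0] - 1
    # Drop rows with missing values
    testdf = testdf.dropna()
    testdf = testdf.sort_values(by=['pchg'], ascending=False)
    # Keep the top 30% and bottom 30% of stocks by return and drop the noisy middle
    # (the original sliced the top 30% using len(traindf); testdf is the intended frame)
    testdf = pd.concat([testdf.iloc[:int(len(testdf) / 10 * 3), :],
                        testdf.iloc[int(len(testdf) / 10 * 7):, :]])
    # Label 1 if the return is above the mean of the retained sample, otherwise -1
    mean_pchg = testdf['pchg'].mean()
    testdf['label'] = testdf['pchg'].apply(lambda x: 1 if x > mean_pchg else -1)
    test_data[date] = testdf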

y_train = train_data['label']  # split off the labels
X_train = train_data.copy()
del X_train['pchg']
del X_train['label']
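The labelling rule used when building these sets (keep the best and worst 30% of stocks by next-period return, discard the middle) can be sketched in plain Python. This simplified version labels by rank directly, whereas the notebook compares each retained return with the mean of the retained sample; the function name and the data are hypothetical.

```python
def label_extremes(returns, keep_pct=30):
    """returns: {stock: next-period return}. Keep top/bottom keep_pct percent, label +1 / -1."""
    ranked = sorted(returns, key=returns.get, reverse=True)
    k = len(ranked) * keep_pct // 100
    labels = {stock: 1 for stock in ranked[:k]}
    labels.update({stock: -1 for stock in ranked[len(ranked) - k:]})
    return labels


# Ten hypothetical stocks with returns from -5% to +4%.
returns = {"S%d" % i: (i - 5) / 100 for i in range(10)}
labels = label_extremes(returns)
```

Dropping the middle 40% removes the stocks whose returns are closest to the cross-sectional average, where the up/down label is most likely to be noise.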

3. Model Construction and In-Sample Training

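A Pipeline combines feature selection and model fitting into a single model_pipe object. Cross-validating that object under different parameter settings yields scores that guide the choice of the final model's parameters.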
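To show why the whole pipeline, rather than the classifier alone, is the object being cross-validated, here is a plain-Python sketch of stratified k-fold splitting. The round-robin fold assignment is my simplification and not how scikit-learn's StratifiedKFold orders its folds; `fit_and_score` is a hypothetical callback standing in for fitting the pipeline on the training indices and scoring it on the held-out indices.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so every fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def cross_validate(labels, n_splits, fit_and_score):
    """Call fit_and_score(train_idx, test_idx) once per fold and collect the results."""
    folds = stratified_folds(labels, n_splits)
    results = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        results.append(fit_and_score(train_idx, test_idx))
    return results


labels = [1] * 10 + [-1] * 10
folds = stratified_folds(labels, 5)
# Placeholder callback: report the split sizes instead of fitting a model.
sizes = cross_validate(labels, 5, lambda train_idx, test_idx: (len(train_idx), len(test_idx)))
```

Any step that learns from data, including feature selection, belongs inside `fit_and_score` and should see only `train_idx`. Selecting features on the full sample first would let information from the held-out fold leak into the score.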

3.1 Feature Selection

This post selects features with SelectPercentile(f_classif, percentile), where f_classif is the scoring function used to rank the features and percentile is the percentage of features to keep.
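As an illustration of what this selection step does, the sketch below computes a one-way ANOVA F-statistic per feature, which is the kind of score f_classif produces, and keeps the top-scoring share. The tie handling and rounding are my own simplifications and may differ from scikit-learn's SelectPercentile; the feature names and values are hypothetical.

```python
def anova_f(feature, labels):
    """One-way ANOVA F-statistic of one feature across the label groups."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(feature), len(groups)
    grand_mean = sum(feature) / n
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))


def select_percentile(columns, labels, percentile):
    """Keep the top `percentile` percent of features by F-score (at least one)."""
    scores = {name: anova_f(values, labels) for name, values in columns.items()}
    n_keep = max(1, len(scores) * percentile // 100)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep], scores


labels = [1, 1, 1, -1, -1, -1]
columns = {
    "momentum": [2.0, 2.1, 1.9, -2.0, -1.9, -2.1],   # class means far apart
    "noise": [0.1, -0.2, 0.3, 0.2, -0.1, 0.0],       # class means nearly equal
}
kept, scores = select_percentile(columns, labels, percentile=50)
```

A feature scores high when its mean differs between the up and down classes by much more than it varies within each class, so the score captures linear separation only; a factor with a purely non-linear relationship to returns can score low and be dropped.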

transform = SelectPercentile(f_classif)  # keep the top ?% of features ranked by f_classif score
# model = RandomForestClassifier()
model = XGBClassifier()
model_pipe = Pipeline(steps=[('ANOVA', transform), ('model', model)])  # "pipeline" object chaining feature selection and the classifier

# Choose the best feature percentile
# #############################################################################
# Plot the cross-validation score as a function of percentile of features
score_means = list()
score_stds = list()
percentiles = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
for percentile in percentiles:
    model_pipe.set_params(ANOVA__percentile=percentile)
    # Compute the cross-validation score using all available CPU cores (n_jobs=-1)
    this_scores = cross_val_score(model_pipe, X_train, y_train, cv=5, n_jobs=-1)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())
plt.errorbar(percentiles, score_means, np.array(score_stds))
plt.title('Performance of the model-Anova varying the percentile of features selected')
plt.xlabel('Percentile')
plt.ylabel('Prediction rate')
plt.axis('tight')
plt.show()

C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
C:\JoinQuant-Desktop-Py3\Python\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

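The results above show that the random forest's accuracy rises as the feature percentile rises, whereas XGBoost performs best at a feature percentile of 30%. Taking efficiency into account as well, the feature percentiles chosen for RandomForest and XGBoost are 80% and 30%, respectively.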

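3.2 Cross-validation parameter tuning

Once the feature percentile has been chosen, the model is cross-validated over different parameter values (n_estimators, max_depth). This post uses StratifiedKFold to split the training set into training and validation folds. StratifiedKFold splits according to the class labels, so each fold keeps roughly the same label proportions instead of being a purely random selection. After cross-validation, the parameter set with the highest validation AUC (or f1-score) is taken as the model's optimal parameters.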
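As a rough illustration of why stratification matters, the sketch below compares the share of positive labels in each validation fold under StratifiedKFold and under plain KFold. The labels (`y_demo`, 30 "up" and 70 "down" samples) and the helper `positive_share_per_fold` are made up for this example and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 30 "up" (1) followed by 70 "down" (0) samples
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split


def positive_share_per_fold(splitter):
    """Return the share of label 1 in each validation fold."""
    return [float(y_demo[val_idx].mean())
            for _, val_idx in splitter.split(X_demo, y_demo)]


stratified_shares = positive_share_per_fold(StratifiedKFold(n_splits=5))
plain_shares = positive_share_per_fold(KFold(n_splits=5))

print('StratifiedKFold:', stratified_shares)  # every fold keeps the 30% share
print('KFold          :', plain_shares)       # shares depend on sample order
```

With unshuffled KFold the folds simply follow the sample order, so some folds can contain a single class, which makes metrics such as AUC undefined or misleading for that fold.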

transform = SelectPercentile(f_classif, percentile=80)  # keep the given top percentile of features ranked by the f_classif score
# model = RandomForestClassifier()
model = XGBClassifier()
model_pipe = Pipeline(steps=[('ANOVA', transform), ('model', model)])  # "pipeline" object made of feature selection plus the classifier

cv = StratifiedKFold(5)  # number of cross-validation folds
# RandomForest n_estimators
# parameters = [50, 100, 300, 500, 800]  # list of parameter values to try
# XGB/RandomForest max_depth
parameters = [3, 4, 5, 6, 7, 8]
# XGB subsample
# parameters = [0.6, 0.7, 0.8, 0.9, 1]
# score_methods = ['roc_auc', 'accuracy', 'precision', 'recall', 'f1']  # cross-validation metrics
score_methods = ['roc_auc', 'f1']  # cross-validation metrics
# mean_list = list()  # would hold the mean of each metric for every parameter value
# std_list = list()  # would hold the standard deviation of each metric for every parameter value
for parameter in parameters:  # loop over the parameter values
    t1 = time.time()  # start time for this parameter value
    score_list = list()  # per-fold scores for each metric
    print('set parameters: %s' % parameter)  # print the parameter value in use
    for score_method in score_methods:  # loop over the metrics
        # model_pipe.set_params(model__n_estimators=parameter)  # set the classifier parameter through the pipeline
        model_pipe.set_params(model__max_depth=parameter)  # set the classifier parameter through the pipeline
        # model_pipe.set_params(model__subsample=parameter)  # set the classifier parameter through the pipeline
        score_tmp = cross_val_score(model_pipe, X_train, y_train, scoring=score_method, cv=cv, n_jobs=-1)  # cross-validated score for this metric
        score_list.append(score_tmp)  # store the per-fold scores
    score_matrix = pd.DataFrame(np.array(score_list), index=score_methods)  # per-fold scores as a table
    score_mean = score_matrix.mean(axis=1).rename('mean')  # mean of each metric
    score_std = score_matrix.std(axis=1).rename('std')  # standard deviation of each metric
    score_pd = pd.concat([score_matrix, score_mean, score_std], axis=1)  # per-fold scores plus mean and std
    # mean_list.append(score_mean)  # collect the metric means for this parameter value
    # std_list.append(score_std)  # collect the metric standard deviations for this parameter value
    print(score_pd.round(4))  # print the scores rounded to 4 decimals
    print('-' * 60)
    t2 = time.time()  # end time for this parameter value
    tt = t2 - t1  # elapsed time
    print('time: %s' % str(tt))  # print the elapsed time

set parameters: 3
              0       1       2       3       4    mean     std
roc_auc  0.5769  0.5861  0.5746  0.5960  0.4826  0.5632  0.0459
f1       0.5830  0.5944  0.5951  0.5883  0.5219  0.5765  0.0310

time: 190.25440001487732
set parameters: 4
              0       1       2       3       4    mean     std
roc_auc  0.5814  0.5856  0.5809  0.5974  0.4829  0.5657  0.0467
f1       0.5786  0.5881  0.5847  0.5838  0.5122  0.5695  0.0322

time: 239.38179993629456
set parameters: 5
              0       1       2       3       4    mean     std
roc_auc  0.5850  0.5859  0.5839  0.5979  0.4829  0.5671  0.0474
f1       0.5773  0.5829  0.5742  0.5794  0.5011  0.5630  0.0348

time: 286.11240005493164
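To make the selection rule concrete, the sketch below picks the parameter value with the highest mean score per metric. The numbers are the mean columns copied from the printed results for max_depth = 3, 4, 5; the names `cv_means` and `best_by_metric` are introduced here for illustration only.

```python
import pandas as pd

# Mean cross-validation scores copied from the printed results (max_depth = 3, 4, 5)
cv_means = pd.DataFrame(
    {'roc_auc': [0.5632, 0.5657, 0.5671], 'f1': [0.5765, 0.5695, 0.5630]},
    index=pd.Index([3, 4, 5], name='max_depth'),
)

# For each metric, the max_depth value with the highest mean score
best_by_metric = cv_means.idxmax()

print(cv_means)
print(best_by_metric)
```

On these three values the two metrics disagree: mean AUC is highest at max_depth=5 and mean f1 at max_depth=3, and the differences are small next to the per-fold standard deviations of roughly 0.03 to 0.05 printed above.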

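3.3 Model construction

Using the best parameters found by the cross-validation above, the random forest and XGBoost ensemble models are each trained on the training set.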

transform.fit(X_train, y_train)  # fit the feature selector on the training data
X_train_final = transform.transform(X_train)  # keep only the selected (significant) features
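Since the transformed training data no longer carries column names, it can help to check which factors survived the selection. The following is a minimal sketch on a made-up factor table (`X_demo`, with hypothetical columns `signal_a`, `signal_b`, `noise_a`, `noise_b`, `noise_c`); it uses `get_support()` of the fitted selector to recover the kept column names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)
n = 200
y_demo = rng.randint(0, 2, size=n)

# Hypothetical factor table: two columns depend on the label, three are pure noise
X_demo = pd.DataFrame({
    'signal_a': y_demo + rng.normal(scale=0.5, size=n),
    'signal_b': -y_demo + rng.normal(scale=0.5, size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
    'noise_c': rng.normal(size=n),
})

selector = SelectPercentile(f_classif, percentile=40)  # top 40% = 2 of 5 columns
selector.fit(X_demo, y_demo)

kept_columns = list(X_demo.columns[selector.get_support()])  # boolean mask -> names
X_demo_final = selector.transform(X_demo)

print(kept_columns, X_demo_final.shape)
```

The same mask could be applied to the real `X_train.columns` after `transform.fit(X_train, y_train)`, assuming `X_train` is a DataFrame.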

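3.3.1 Random forest

(1) Number of trees (n_estimators). The number of weak learners (decision trees) is set by the n_estimators parameter. In general, with too few trees the ensemble cannot show its advantage and tends to underfit; with too many, the computation grows, and beyond a certain count additional trees improve the model very little, so a moderate value is needed. Table 1 shows how AUC and f1-score change with the number of trees. According to Table 1, increasing the number of trees improves the predictive performance of the RandomForest model. Given the diminishing marginal gain, and weighing training time against the improvement, the number of trees is set to 500.

(2) Tree depth (max_depth). max_depth limits the complexity of each tree. The table shows the performance for different max_depth values. As max_depth increases, AUC is unstable, f1-score tends to rise, and training becomes slower. Weighing training time against the improvement, max_depth=4 is chosen.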

model = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=0)
model.fit(X_train_final, y_train)  # train the model

time: 8.404559850692749
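The diminishing return from adding trees can also be checked quickly with the out-of-bag (OOB) score, which avoids a full cross-validation run. This is a different check from the cross-validated AUC and f1-score used in this post, and the data here is synthetic (`make_classification` standing in for `X_train_final`, `y_train`), so the numbers are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0)

oob_scores = {}
for n_trees in [50, 100, 200, 400]:
    rf = RandomForestClassifier(
        n_estimators=n_trees, max_depth=4, oob_score=True, random_state=0)
    rf.fit(X_demo, y_demo)
    # Accuracy on the samples each tree did not see during bootstrapping
    oob_scores[n_trees] = rf.oob_score_

for n_trees, score in oob_scores.items():
    print('n_estimators=%d  OOB accuracy=%.4f' % (n_trees, score))
```

If the OOB accuracy stops moving between consecutive settings, more trees mostly add training time.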

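3.3.2 XGBoost

(1) subsample. subsample is the fraction of the training samples used to train the model, with a value between 0 and 1, and it helps prevent overfitting. See the results in 3.2 for the performance of the subsample parameter. As subsample increases, f1-score tends to fall and training becomes faster. Weighing training time against the improvement, subsample=0.9 is chosen.

(2) max_depth. The table shows the performance for different max_depth values. As max_depth increases, AUC and f1-score tend to fall and training takes longer. max_depth=1 is chosen.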

model = XGBClassifier(max_depth=1, subsample=0.9, random_state=0)
model.fit(X_train_final, y_train)  # train the model

time: 8.404559850692749
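To give a feel for how much smaller a boosted ensemble of depth-1 trees is than a forest of depth-4 trees, the sketch below counts tree nodes in both. It uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBClassifier (a different implementation of gradient boosting, swapped in so the example needs only scikit-learn), 100 boosting rounds as an assumed setting, and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for the real training data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0)

# Boosting with depth-1 trees ("stumps"); subsample=0.9 mirrors the setting above
gb = GradientBoostingClassifier(
    n_estimators=100, max_depth=1, subsample=0.9, random_state=0)
gb.fit(X_demo, y_demo)

# Random forest with the settings chosen in 3.3.1
rf = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=0)
rf.fit(X_demo, y_demo)

# gb.estimators_ is a 2-D array of regression trees, rf.estimators_ a list
gb_trees = list(gb.estimators_.ravel())
rf_trees = list(rf.estimators_)

gb_depth = max(t.tree_.max_depth for t in gb_trees)
rf_depth = max(t.tree_.max_depth for t in rf_trees)
gb_nodes = sum(t.tree_.node_count for t in gb_trees)
rf_nodes = sum(t.tree_.node_count for t in rf_trees)

print('boosting: max depth %d, total nodes %d' % (gb_depth, gb_nodes))
print('forest  : max depth %d, total nodes %d' % (rf_depth, rf_nodes))
```

A stump has at most three nodes (one split, two leaves), while a depth-4 tree can have up to 31, so the forest is larger both per tree and in total.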

3.4 Out-of-sample test

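# Correlation between the test-set features and the prediction
# (needs test_sample_predict, which is filled by the out-of-sample prediction loop)
factor_predict_corr = pd.DataFrame()
for date in dateList[train_length:-1]:
    test_feature = test_data[date].copy()
    del test_feature['pchg']
    del test_feature['label']
    test_feature['predict'] = list(test_sample_predict[date])
    factor_predict_corr[date] = test_feature.corr()['predict']
factor_predict_corr = factor_predict_corr.iloc[:-1]  # drop the 'predict' row (correlation with itself)
# Heatmap of the correlation between the model's next-period up/down prediction and the current-period factor values
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111)
sns.set()
ax = sns.heatmap(factor_predict_corr)
# fig.savefig('C:\\Users\\Administrator\\Desktop\\corr.png')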

test_sample_predict = {}
test_sample_score = []
test_sample_accuracy = []
test_sample_roc_auc = []
test_sample_date = []
for date in dateList[train_length:-1]:
    # out-of-sample features and labels
    y_test = test_data[date]['label']
    X_test = test_data[date].copy()
    del X_test['pchg']
    del X_test['label']
    # predict on the new data
    X_test_final = transform.transform(X_test)  # apply the fitted feature selection
    # predicted labels and predicted probabilities
    y_pred_tmp = model.predict(X_test_final)
    y_pred = pd.DataFrame(y_pred_tmp, columns=['label_predict'])  # predicted labels
    y_pred_proba = pd.DataFrame(model.predict_proba(X_test_final), columns=['pro1', 'pro2'])  # predicted class probabilities
    # combine the predicted labels, the probabilities and the original features
    y_pred.set_index(X_test.index, inplace=True)
    y_pred_proba.set_index(X_test.index, inplace=True)
    predict_pd = pd.concat((X_test, y_pred, y_pred_proba), axis=1)
    print('Predict date:')
    print(date)
    print('AUC:')
    print(roc_auc_score(y_test, y_pred))  # AUC computed from the predicted labels
    print('Accuracy:')
    print(accuracy_score(y_test, y_pred))  # share of correctly predicted labels
    print('-' * 60)
    # kept for the statistics and plots that follow
    test_sample_date.append(date)
    # out-of-sample predictions
    test_sample_predict[date] = y_pred_tmp
    # out-of-sample accuracy
    test_sample_accuracy.append(accuracy_score(y_test, y_pred))
    # out-of-sample AUC
    test_sample_roc_auc.append(roc_auc_score(y_test, y_pred))
print('AUC mean info')
print(np.mean(test_sample_roc_auc))
print('-' * 60)
print('ACCURACY mean info')
print(np.mean(test_sample_accuracy))
print('-' * 60)

Predict date:
2018-01-31
AUC:
0.4300806362225671
Accuracy:
0.4306306306306306

Predict date:
2018-02-28
AUC:
0.5622402183551319
Accuracy:
0.5567567567567567

Predict date:
2018-03-31
AUC:
0.5045652294391487
Accuracy:
0.5045045045045045

Predict date:
2018-04-30
AUC:
0.5007100698293061
Accuracy:
0.5063063063063064

Predict date:
2018-05-31
AUC:
0.48390739303923747
Accuracy:
0.4810810810810811

Predict date:
2018-06-30
AUC:
0.5917825870095323
Accuracy:
0.5918918918918918

Predict date:
2018-07-31
AUC:
0.47668892706716914
Accuracy:
0.47927927927927927

Predict date:
2018-08-31
AUC:
0.5283543005985369
Accuracy:
0.527027027027027

Predict date:
2018-09-30
AUC:
0.5334079195300377
Accuracy:
0.527927927927928

Predict date:
2018-10-31
AUC:
0.5999535856794502
Accuracy:
0.5990990990990991

Predict date:
2018-11-30
AUC:
0.5002424628685436
Accuracy:
0.4990990990990991

Predict date:
2018-12-31
AUC:
0.48053715916648193
Accuracy:
0.4810810810810811

AUC mean info
0.5160392074004285

ACCURACY mean info
0.5153903903903904
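The monthly AUC values printed above sit very close to the accuracy values. One likely reason is that roc_auc_score receives the predicted labels rather than the predicted probabilities; for binary hard labels the AUC reduces to the balanced accuracy, (TPR + TNR) / 2. The sketch below illustrates this on synthetic data with a random forest; all names in it (`X_demo`, `clf`, `proba_up` and so on) are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one month of out-of-sample data
X_demo, y_demo = make_classification(
    n_samples=800, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

clf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
clf.fit(X_tr, y_tr)

labels = clf.predict(X_te)                # hard 0/1 predictions
proba_up = clf.predict_proba(X_te)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_proba = roc_auc_score(y_te, proba_up)
balanced_acc = balanced_accuracy_score(y_te, labels)

print('AUC from labels       : %.4f' % auc_from_labels)
print('balanced accuracy     : %.4f' % balanced_acc)
print('AUC from probabilities: %.4f' % auc_from_proba)
```

Passing `predict_proba(...)[:, 1]` uses the full ranking of the samples and is the more common way to compute AUC; whether it would change the conclusions here is not something the printed results can tell.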

4. Model evaluation

4.1 Predictive power

xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_auc = test_sample_roc_auc
# format the x-axis ticks as dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(xs_date, ys_auc, 'r')
# rotate the date labels automatically
plt.gcf().autofmt_xdate()
# x-axis label
plt.xlabel('date')
# y-axis label
plt.ylabel("test AUC")
plt.show()

xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_score = test_sample_accuracy
# format the x-axis ticks as dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(xs_date, ys_score, 'r')
# rotate the date labels automatically
plt.gcf().autofmt_xdate()
# x-axis label
plt.xlabel('date')
# y-axis label
plt.ylabel("test accuracy")
plt.show()

f = plt.figure(figsize=(15, 10))
sns.set(style="whitegrid")
data1 = pd.DataFrame(ys_auc, xs_date, columns=['AUC'])
data2 = pd.DataFrame(ys_score, xs_date, columns=['accuracy'])
data = pd.concat([data1, data2], axis=1)  # one row per date, one column per metric
sns.lineplot(data=data, palette="tab10", linewidth=2.5)
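Alongside the plots, a small table can summarize the twelve monthly results. The sketch below re-enters the printed out-of-sample AUC and accuracy values from section 3.4, rounded to four decimals; the names `monthly` and `summary` are introduced here for illustration.

```python
import pandas as pd

# Monthly out-of-sample scores copied from the printed results (4 decimals)
monthly = pd.DataFrame(
    {
        'AUC': [0.4301, 0.5622, 0.5046, 0.5007, 0.4839, 0.5918,
                0.4767, 0.5284, 0.5334, 0.6000, 0.5002, 0.4805],
        'accuracy': [0.4306, 0.5568, 0.5045, 0.5063, 0.4811, 0.5919,
                     0.4793, 0.5270, 0.5279, 0.5991, 0.4991, 0.4811],
    },
    index=pd.to_datetime([
        '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
        '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
        '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31',
    ]),
)

summary = pd.DataFrame({
    'mean': monthly.mean(),
    'std': monthly.std(),
    'min': monthly.min(),
    'max': monthly.max(),
    'months_above_0.5': (monthly > 0.5).sum(),
})
print(summary.round(4))
```

On these figures the AUC is above 0.5 in 8 of 12 months and the accuracy in 7 of 12, with means of about 0.516 and 0.515.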

4.2 Feature importance

# model = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=0)
model = XGBClassifier(max_depth=1, subsample=0.9, random_state=0)
model.fit(X_train, y_train)  # train the model on all features

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

# %matplotlib inline
fig = plt.figure(figsize=(15, 10))
n_features = X_train.shape[1]
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), X_train.columns)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
# plt.savefig('C:\\Users\\Administrator\\Desktop\\rr-featureimport.png')
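A sorted table is often easier to read than a bar chart with many factors, and it also shows how many factors the model actually uses. Below is a minimal sketch on a made-up three-factor table (`factor_signal`, `factor_noise_1`, `factor_noise_2` are hypothetical names) with a random forest; the same two lines at the end would apply to `model.feature_importances_` and `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 500
y_demo = rng.randint(0, 2, size=n)

# Hypothetical factors: one related to the label, two pure noise
X_demo = pd.DataFrame({
    'factor_signal': y_demo + rng.normal(scale=0.8, size=n),
    'factor_noise_1': rng.normal(size=n),
    'factor_noise_2': rng.normal(size=n),
})

rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
rf.fit(X_demo, y_demo)

# Attach factor names to the importances and rank them
importance = pd.Series(rf.feature_importances_, index=X_demo.columns)
importance = importance.sort_values(ascending=False)
n_used = int((importance > 0).sum())  # factors with non-zero importance

print(importance.round(4))
print('factors used:', n_used)
```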
