Automatic Model Selection and Hyperparameter Tuning for Machine-Learning Multi-Factor Models

Posted by 汇市风云榜 on July 10, 19:51

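Overview¶

Artificial intelligence has an almost magical pull in the quant world.
How many of us ended up as programmers while flying the quant banner?

RF, DDPG, ML, RL, DL, LSTM, HMM, CNN, RNN, GAN, DQN, PPO, NLP, XGBoost ...
There is an endless supply of new fields waiting to be explored.
But does mastering them mean a quant strategy will make steady profits?
No. The two do not seem to be directly related.

Today I want to introduce a very handy tool, TPOT. You do not need to master every machine-learning algorithm up front:
throw the data in first,
let it select the model automatically,
let it tune the hyperparameters automatically,
and dig deeper only when a result looks promising.

The data is pulled locally from JoinQuant with jqdatasdk,
the factor processing follows Lin Xuan's post 《基于随机森林和XGBoost的多因子模型分析》,
model selection and tuning are handed over to the module,
and then you simply wait for the results and the automatically generated code.

This gives multi-factor and machine-learning strategies a simple starting point:
no wrestling with models and parameters at first,
just throw the factors in
and look at the results before anything else.

But there are no shortcuts around the fundamentals.
Sooner or later,
every debt comes due.

My deepest respect goes to the experts who have truly mastered these machine-learning models and their tuning!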

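References¶

Lin Xuan. 《基于随机森林和XGBoost的多因子模型分析》 (Multi-factor model analysis based on random forest and XGBoost)
Lin Xiaoming. 《人工智能选股之Boosting模型》 (AI stock selection: Boosting models)
Lin Xiaoming. 《人工智能选股之随机森林模型》 (AI stock selection: random forest models)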

1. Import the required modules¶

# Import jqdatasdk
import jqdatasdk as jq
from jqdatasdk import *
from jqdatasdk.technical_analysis import *
# Log in with a JQData account (placeholder credentials)
jq.auth('18888888888', 'xxx')

import datetime  # used by delete_stop and get_period_date below
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
from tpot import TPOTClassifier

import warnings
warnings.filterwarnings("ignore")
auth success

2. Data acquisition and preprocessing¶

2.1 Building the stock list¶

# Drop stocks that had been listed for fewer than n days as of beginDate
def delete_stop(stocks,beginDate,n):
    stockList=[]
    beginDate = datetime.datetime.strptime(beginDate, "%Y-%m-%d")
    for stock in stocks:
        start_date=get_security_info(stock).start_date
        if start_date<(beginDate-datetime.timedelta(days=n)).date():
            stockList.append(stock)
    return stockList

# Drop ST stocks
def delete_st(stocks,begin_date):
    st_data=get_extras('is_st',stocks, count = 1,end_date=begin_date)
    # .iloc[0] reads the single returned row by position
    stockList = [stock for stock in stocks if not st_data[stock].iloc[0]]
    return stockList

# Sample at monthly frequency
peroid = 'M'
# Sample window (training + test sets span 2014-01-31 to 2018-12-31)
start_date = '2014-02-01'
end_date = '2019-01-31'
# Training-set length, in months
train_length = 48
# JoinQuant level-1 industry codes
industry_code = ['HY001', 'HY002', 'HY003', 'HY004', 'HY005', 'HY006', 'HY007', 'HY008', 'HY009', 'HY010', 'HY011']

# Stock universe: constituents of the CSI All Share Index (000985.XSHG)
securities_list = delete_stop(get_index_stocks('000985.XSHG'),start_date,90)
securities_list = delete_st(securities_list,start_date)
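The two filters above boil down to a date comparison and a flag lookup. The following is a minimal offline sketch of that logic, assuming hypothetical listing dates and ST flags held in plain dicts rather than fetched with get_security_info and get_extras; the stock codes and values are invented for illustration.

```python
import datetime

def filter_by_listing_age(listing_dates, begin_date, n):
    """Keep codes listed more than n days before begin_date ('YYYY-MM-DD')."""
    begin = datetime.datetime.strptime(begin_date, "%Y-%m-%d").date()
    cutoff = begin - datetime.timedelta(days=n)
    return [code for code, listed in listing_dates.items() if listed < cutoff]

def filter_out_st(codes, st_flags):
    """Drop codes flagged as ST; codes missing from st_flags are kept."""
    return [code for code in codes if not st_flags.get(code, False)]

# Hypothetical sample data, for illustration only
listing_dates = {
    "AAA": datetime.date(2010, 5, 1),
    "BBB": datetime.date(2013, 12, 20),  # listed fewer than 90 days before 2014-02-01
    "CCC": datetime.date(2012, 3, 15),
}
st_flags = {"AAA": False, "BBB": False, "CCC": True}

universe = filter_by_listing_age(listing_dates, "2014-02-01", 90)
universe = filter_out_st(universe, st_flags)
print(universe)
```

Here BBB is dropped for being too recently listed and CCC for carrying the ST flag, which mirrors the order of the two calls in the cell above.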

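2.2 Extracting features and labels¶

On the last trading day of each calendar month, the factor exposures are computed and used as the raw features of each sample; the return over the following period is computed and used as the sample's label.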
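To make the sampling rule concrete, here is a small sketch in plain Python: it picks the last trading day of each month from a trading calendar and computes the next-period return that serves as the label. The calendar and prices are hypothetical, and the helper names are made up for this illustration; the notebook itself derives the dates with pandas resampling.

```python
def month_end_trading_days(trading_days):
    """Return the last trading day of each month from sorted 'YYYY-MM-DD' strings."""
    last_by_month = {}
    for day in trading_days:
        last_by_month[day[:7]] = day  # later days in a month overwrite earlier ones
    return list(last_by_month.values())

def next_period_returns(closes, period_ends):
    """Label for the period starting at t: close[t+1] / close[t] - 1."""
    return {
        start: closes[end] / closes[start] - 1
        for start, end in zip(period_ends[:-1], period_ends[1:])
    }

# Hypothetical trading calendar and closing prices for one stock
closes = {
    "2014-01-29": 10.0, "2014-01-30": 10.2,
    "2014-02-27": 10.8, "2014-02-28": 11.22,
    "2014-03-28": 10.5, "2014-03-31": 10.098,
}
period_ends = month_end_trading_days(sorted(closes))
labels = next_period_returns(closes, period_ends)
print(period_ends)
print(labels)
```

The last period end has no label because its following period is not yet observed, which is why the notebook later iterates over the date list without its final entry.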

jqfactors_list = ['current_ratio',
                  'net_profit_to_total_operate_revenue_ttm',
                  'gross_income_ratio',
                  'roe_ttm',
                  'roa_ttm',
                  'total_asset_turnover_rate',
                  'net_operating_cash_flow_coverage',
                  'net_operate_cash_flow_ttm',
                  'net_profit_ttm',
                  'cash_to_current_liability',
                  'operating_revenue_growth_rate',
                  'non_recurring_gain_loss',
                  'operating_revenue_ttm',
                  'net_profit_growth_rate']

def get_jq_factor(date):
    factor_data = get_factor_values(securities=securities_list, \
                                    factors=jqfactors_list, \
                                    count=1, \
                                    end_date=date)
    df_jq_factor=pd.DataFrame(index=securities_list)
    
    for i in factor_data.keys():
        df_jq_factor[i]=factor_data[i].iloc[0,:]
    
    return df_jq_factor

q = query(valuation.code, 
      valuation.market_cap, # total market cap
      valuation.circulating_market_cap,
      valuation.pe_ratio, # P/E ratio (TTM)
      valuation.pb_ratio, # P/B ratio
      valuation.pcf_ratio, # P/CF ratio (inverted to CFP below)
      valuation.ps_ratio, # P/S ratio (inverted to SP below)
      balance.total_assets,
      balance.total_liability,
      balance.development_expenditure, # RD
      balance.dividend_payable,
      balance.fixed_assets,  
      balance.total_non_current_liability,
      income.operating_profit,
      income.total_profit, # OPTP
      indicator.net_profit_to_total_revenue, # net profit / total operating revenue
      indicator.inc_revenue_year_on_year,  # operating revenue growth (year on year)
      indicator.inc_net_profit_year_on_year, # net profit growth (year on year)
      indicator.roe,
      indicator.roa,
      indicator.gross_profit_margin # gross profit margin (GPM)
    ).filter(
      valuation.code.in_(securities_list)
    )

# Build the list of period-end dates for a given frequency ('W', 'M' or 'Q')
def get_period_date(peroid,start_date, end_date):
    # peroid is a pandas resample rule: 'W' weekly, 'M' monthly, 'Q' quarterly, '5min' five minutes, '12D' twelve days
    stock_data = get_price('000001.XSHE',start_date,end_date,'daily',fields=['close'])
    # Record each trading day alongside its close
    stock_data['date']=stock_data.index
    # Resample: each period takes the values of its last trading day
    # (the resulting index holds the period-end labels)
    period_stock_data=stock_data.resample(peroid).last()
    date = period_stock_data.index
    pydate_array = date.to_pydatetime()
    date_only_array = np.vectorize(lambda s: s.strftime('%Y-%m-%d'))(pydate_array )
    date_only_series = pd.Series(date_only_array)
    # Prepend the day before start_date so that the first period has a starting date
    start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    start_date = start_date-datetime.timedelta(days=1)
    start_date = start_date.strftime("%Y-%m-%d")
    date_list = date_only_series.values.tolist()
    date_list.insert(0,start_date)
    return date_list

def initialize_df(df,date):
    
    # Net assets
    df['net_assets']=df['total_assets']-df['total_liability']
    df_new = pd.DataFrame(index=securities_list)
        
    # Valuation factors
    df_new['EP'] = df['pe_ratio'].apply(lambda x: 1/x)
    df_new['BP'] = df['pb_ratio'].apply(lambda x: 1/x)
    df_new['SP'] = df['ps_ratio'].apply(lambda x: 1/x)
    df_new['DP'] = df['dividend_payable']/(df['market_cap']*100000000)
    df_new['RD'] = df['development_expenditure']/(df['market_cap']*100000000)
    df_new['CFP'] = df['pcf_ratio'].apply(lambda x: 1/x)
    
    # Leverage factors
    # Log of circulating market cap
    df_new['CMV'] = np.log(df['circulating_market_cap'])
    # Total assets / net assets
    df_new['financial_leverage']=df['total_assets']/df['net_assets']
    # Non-current liabilities / net assets
    df_new['debtequityratio']=df['total_non_current_liability']/df['net_assets']
    # Cash ratio = (cash + marketable securities) / current liabilities
    df_new['cashratio']=df['cash_to_current_liability']
    # Current ratio = current assets / current liabilities * 100%
    df_new['currentratio']=df['current_ratio']
    
    # Financial-quality factors
    # Net profit / total operating revenue
    df_new['NI'] = df['net_profit_to_total_operate_revenue_ttm']
    df_new['GPM'] = df['gross_income_ratio']
    df_new['ROE'] = df['roe_ttm']
    df_new['ROA'] = df['roa_ttm']
    df_new['asset_turnover'] = df['total_asset_turnover_rate']
    df_new['net_operating_cash_flow'] = df['net_operating_cash_flow_coverage']
    
    # Growth factors
    df_new['Sales_G_q'] = df['operating_revenue_growth_rate']
    df_new['Profit_G_q'] = df['net_profit_growth_rate']
    
    # Technical indicators
    df_new['RSI']=pd.Series(RSI(securities_list, date, N1=20))    
    dif,dea,macd=MACD(securities_list, date, SHORT = 10, LONG = 30, MID = 15)
    df_new['DIF']=pd.Series(dif)
    df_new['DEA']=pd.Series(dea)
    df_new['MACD']=pd.Series(macd)    
    
    return df_new
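The valuation factors in initialize_df are reciprocals of price multiples, and the size factor is a logarithm. Below is a minimal sketch of that arithmetic on a single hypothetical row of fundamentals. The numbers are invented, and market_cap is assumed to be quoted in units of 100 million yuan, which is what the multiplication by 100000000 in the function suggests.

```python
import math

def valuation_factors(row):
    """Turn price multiples into yield-style factors, mirroring initialize_df."""
    cap_yuan = row["market_cap"] * 100000000  # assumed: market_cap is in 1e8 yuan
    return {
        "EP": 1 / row["pe_ratio"],   # earnings yield
        "BP": 1 / row["pb_ratio"],   # book-to-price
        "SP": 1 / row["ps_ratio"],   # sales-to-price
        "DP": row["dividend_payable"] / cap_yuan,
        "CMV": math.log(row["circulating_market_cap"]),
    }

# One hypothetical stock, values for illustration only
row = {
    "pe_ratio": 20.0,
    "pb_ratio": 2.0,
    "ps_ratio": 4.0,
    "market_cap": 50.0,
    "circulating_market_cap": 30.0,
    "dividend_payable": 1.0e8,
}
factors = valuation_factors(row)
for name, value in factors.items():
    print(name, round(value, 4))
```

Inverting the multiples keeps loss-making or negative-book companies on a continuous scale (a negative P/E simply becomes a negative EP), whereas the raw multiple jumps across zero.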

dateList = get_period_date(peroid,start_date, end_date)
# JQData factors
df_jq_factor = {}
# Financial-statement factors
df_q_factor = {}
# Raw factors before preprocessing, keyed by date
df_factor_pre_train = {}

for date in dateList:
    df_jq_factor = get_jq_factor(date)
    df_q_factor = get_fundamentals(q, date = date)
    df_q_factor.index = df_q_factor['code']
    # Merge into one wide table
    df_factor_pre_train[date] = pd.concat([df_q_factor,df_jq_factor],axis=1)
    # Build the factor columns
    df_factor_pre_train[date] = initialize_df(df_factor_pre_train[date],date)
    
# df_factor_pre_train
print("Factor retrieval complete")
Factor retrieval complete

2.3 Feature preprocessing¶

from preprocess import winsorize_med, standardlize

# Look up the industry (or industries) that a stock belongs to
def get_industry_name(i_Constituent_Stocks, value):
    return [k for k, v in i_Constituent_Stocks.items() if value in v]

# Missing-value handling
def replace_nan_indu(factor_data,stockList,industry_code,date):
    # Replace NaN with the industry mean; where that is still NaN, fall back to the mean over all stocks
    i_Constituent_Stocks={}
    data_temp=pd.DataFrame(index=industry_code,columns=factor_data.columns)
    for i in industry_code:
        temp = get_industry_stocks(i, date)
        i_Constituent_Stocks[i] = list(set(temp).intersection(set(stockList)))
        # Column-wise mean of each factor within the industry
        data_temp.loc[i]=factor_data.loc[i_Constituent_Stocks[i],:].mean()
    for factor in data_temp.columns:
        # An industry with no value gets the mean over all industries
        null_industry=list(data_temp.loc[pd.isnull(data_temp[factor]),factor].keys())
        for i in null_industry:
            data_temp.loc[i,factor]=data_temp[factor].mean()
        null_stock=list(factor_data.loc[pd.isnull(factor_data[factor]),factor].keys())
        for i in null_stock:
            industry=get_industry_name(i_Constituent_Stocks, i)
            if industry:
                factor_data.loc[i,factor]=data_temp.loc[industry[0],factor] 
            else:
                factor_data.loc[i,factor]=factor_data[factor].mean()
    return factor_data

# Data preprocessing
def data_preprocessing(factor_data,stockList,industry_code,date):
    # Winsorize (clip extreme values)
    factor_data = winsorize_med(factor_data, scale=5, inf2nan=False,axis=0)
    # Fill missing values
    factor_data = replace_nan_indu(factor_data,stockList,industry_code,date)
    # Standardize
    factor_data = standardlize(factor_data,axis=0)
    
    return factor_data

# Preprocessed factors, keyed by date
df_factor_train = {}
for date in dateList:
    df_factor_train[date] = data_preprocessing(df_factor_pre_train[date],securities_list,industry_code,date)

# df_factor_train
print("Preprocessing complete")
Preprocessing complete
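The three preprocessing steps (winsorize, fill missing values by industry, standardize) can be sketched in plain Python. The preprocess module is not shown in the post, so this is an illustration under assumptions rather than its actual implementation: winsorize_med is assumed to clip to median ± scale × MAD, and standardlize is assumed to be a z-score with the sample standard deviation. The function names and data below are hypothetical.

```python
import statistics

def winsorize_mad(values, scale=5):
    """Clip to median +/- scale * MAD; None marks a missing value and is kept."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)
    lo, hi = med - scale * mad, med + scale * mad
    return [None if v is None else min(max(v, lo), hi) for v in values]

def fill_with_group_mean(values, groups):
    """Replace None with the mean of its group, else with the overall mean."""
    overall = statistics.mean(v for v in values if v is not None)
    by_group = {}
    for v, g in zip(values, groups):
        if v is not None:
            by_group.setdefault(g, []).append(v)
    means = {g: statistics.mean(vs) for g, vs in by_group.items()}
    return [means.get(g, overall) if v is None else v for v, g in zip(values, groups)]

def standardize(values):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical values of one factor for six stocks in two industries
raw = [1.0, 2.0, 3.0, None, 4.0, 100.0]
industry = ["HY001", "HY001", "HY001", "HY002", "HY002", "HY002"]

clipped = winsorize_mad(raw, scale=5)             # 100.0 is pulled in to 8.0
filled = fill_with_group_mean(clipped, industry)  # the gap becomes the HY002 mean
z = standardize(filled)
print(clipped)
print(filled)
print([round(v, 3) for v in z])
```

Clipping before filling matters: had the outlier 100.0 been left in place, the industry mean used to fill the gap would have been dominated by it.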

2.4 Assembling the training and cross-validation (test) sets¶

# Training-set data
train_data=pd.DataFrame()
for date in dateList[:train_length]:
    traindf=df_factor_train[date]
    # Next-period return: close at the next period end relative to close at this one
    data_close=get_price(securities_list,date,dateList[dateList.index(date)+1],'1d','close')['close']
    traindf['pchg']=data_close.iloc[-1]/data_close.iloc[0]-1
    # Drop rows with missing values
    traindf=traindf.dropna()   
    traindf=traindf.sort_values(by=['pchg'],ascending=False)
    # Keep the top 30% and bottom 30% of stocks by return; drop the noisy middle
    n=len(traindf)
    traindf=pd.concat([traindf.iloc[:int(n/10*3),:],traindf.iloc[int(n/10*7):,:]])
    # Label 1 if the return is above the mean of the retained sample, else -1
    # (roughly: top 30% -> 1, bottom 30% -> -1)
    mean_pchg=traindf['pchg'].mean()
    traindf['label']=list(traindf['pchg'].apply(lambda x:1 if x>mean_pchg else -1))    
    if train_data.empty:
        train_data=traindf
    else:
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        train_data=pd.concat([train_data,traindf])
        
y_train = train_data['label']  # split off y
X_train = train_data.copy()
del X_train['pchg']
del X_train['label']

X_train.head()
EP BP SP DP RD CFP CMV financial_leverage debtequityratio cashratio ... ROE ROA asset_turnover net_operating_cash_flow Sales_G_q Profit_G_q RSI DIF DEA MACD
300108.XSHE 0.046590 -0.109484 -0.771399 0.264119 -0.409877 0.421198 -1.712199 -0.940980 -0.921826 2.119890 ... -0.046318 0.394639 -0.682894 1.539315 -0.386404 -0.073648 -0.545303 -0.504069 -0.151364 -1.053205
000559.XSHE 0.383724 -0.327549 0.602881 -1.376721 0.545899 0.725168 1.100124 0.607757 1.288451 0.350109 ... 0.528103 0.116228 0.653226 0.617936 0.078269 0.319672 1.556052 0.334164 -0.022371 0.972231
000554.XSHE -0.631735 -0.189958 2.042445 0.124055 -0.571686 0.959594 -0.621134 -0.824866 -0.955347 2.119890 ... -0.694140 -0.613927 2.506178 2.041713 -0.645521 -0.204595 -0.582885 -0.494008 -0.518227 -0.092269
300130.XSHE 0.047470 0.507570 -0.571056 -0.287581 0.556346 0.143464 -1.200021 -0.960912 -0.934052 -0.759563 ... -0.293261 0.034059 -0.544145 -0.461490 -0.283673 -0.408260 0.434319 0.039622 0.287785 -0.643231
002602.XSHE -0.249923 -0.029377 -0.517098 -0.129449 0.545899 -0.603862 -0.600886 -0.862890 -0.156952 -1.589675 ... -0.362780 -0.126073 -0.161691 -0.432899 0.210254 -0.324262 3.319635 1.222322 0.547610 2.005529

5 rows × 23 columns
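The labeling rule used for the training set, keeping only the extremes of the return distribution, can be shown on a toy cross-section. This sketch is in plain Python with invented returns, and it labels by rank position rather than by comparing each return with the mean of the retained sample as the cell above does. The two rules agree whenever that mean lies between the lowest return in the top group and the highest return in the bottom group.

```python
def label_extremes(returns, fraction=0.3):
    """Keep the top and bottom `fraction` of stocks by return; label them +1 / -1."""
    ranked = sorted(returns.items(), key=lambda kv: kv[1], reverse=True)
    k = int(len(ranked) * fraction)
    labels = {code: 1 for code, _ in ranked[:k]}
    labels.update({code: -1 for code, _ in ranked[len(ranked) - k:]})
    return labels

# Hypothetical next-period returns for ten stocks
returns = {
    "A": 0.12, "B": 0.08, "C": 0.05, "D": 0.02, "E": 0.01,
    "F": 0.00, "G": -0.01, "H": -0.03, "I": -0.06, "J": -0.10,
}
labels = label_extremes(returns)
print(labels)
```

The middle 40% (D through G here) is discarded on the reasoning given in the code comments: returns near the cross-sectional median carry more noise than signal for a classifier.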

# Test-set data
test_data={}
for date in dateList[train_length:-1]:
    testdf=df_factor_train[date]
    stockList=list(testdf.index)
    # Next-period return
    data_close=get_price(stockList,date,dateList[dateList.index(date)+1],'1d','close')['close']
    testdf['pchg']=data_close.iloc[-1]/data_close.iloc[0]-1
    # Drop rows with missing values
    testdf=testdf.dropna()   
    testdf=testdf.sort_values(by=['pchg'],ascending=False)
    # Keep the top 30% and bottom 30% of stocks by return; drop the noisy middle
    # (both slice sizes are computed from testdf itself)
    n=len(testdf)
    testdf=pd.concat([testdf.iloc[:int(n/10*3),:],testdf.iloc[int(n/10*7):,:]])
    mean_pchg=testdf['pchg'].mean()
    testdf['label']=list(testdf['pchg'].apply(lambda x:1 if x>mean_pchg else -1)) 
    test_data[date]=testdf
    
for date in dateList[train_length:-1]:
    # Out-of-sample features and labels.
    # Each pass overwrites X_test and y_test, so only the last test month is kept.
    y_test=test_data[date]['label']
    X_test=test_data[date].copy()
    del X_test['pchg']
    del X_test['label']    
    
X_test.head()
EP BP SP DP RD CFP CMV financial_leverage debtequityratio cashratio ... ROE ROA asset_turnover net_operating_cash_flow Sales_G_q Profit_G_q RSI DIF DEA MACD
600218.XSHG -0.528988 1.485604 1.999518 4.675741e-15 0.159923 1.038654 -1.149412 -0.215476 -0.830614 -0.810007 ... -0.750746 -0.741413 0.987693 1.915607 -0.141174 -0.633958 -0.732835 -0.040049 0.025952 -0.149700
002565.XSHE 0.102230 0.434094 -0.002859 0.000000e+00 -0.418662 0.830718 -0.698243 -0.543017 -0.187345 -0.045700 ... -0.175412 -0.010769 -0.012221 1.409316 -0.490614 0.636262 -0.853811 0.290978 0.092568 0.438235
000785.XSHE 2.520602 0.157565 2.241279 0.000000e+00 0.237643 2.172042 -1.161639 -0.123736 -0.312482 0.524894 ... 2.530745 2.462733 2.693604 -2.027147 -0.480705 2.121975 -0.451280 -0.347794 -0.403809 -0.039053
300118.XSHE 0.849169 2.351807 2.037908 -4.675741e-15 0.421778 -2.250338 -0.281541 0.092110 -0.139734 -0.461784 ... -0.207077 -0.355076 0.100855 -0.187423 -0.372111 -0.370404 -0.505329 -0.068107 -0.128541 0.066421
600452.XSHG 0.614460 -0.861525 -0.193630 -4.675741e-15 -1.195453 -0.354157 -0.373111 1.987370 2.057384 -0.351079 ... 1.591976 0.272001 -0.298072 0.138400 -0.129196 0.157179 0.194636 -0.236748 0.297221 -1.054480

5 rows × 23 columns
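Because the loop above overwrites X_test and y_test on every pass, only the final test month ends up being scored. One possible alternative is to score each month separately and then average. The sketch below shows the idea in plain Python with a stand-in model and invented data; sign_rule and the sample values are hypothetical, and with the notebook's objects one would call the fitted model's predict on each month's frame instead.

```python
def accuracy(predict, features, labels):
    """Fraction of samples where predict(feature_row) equals the label."""
    hits = sum(1 for x, y in zip(features, labels) if predict(x) == y)
    return hits / len(labels)

def monthly_accuracy(predict, test_data):
    """Score every test month separately instead of only the last one."""
    return {date: accuracy(predict, X, y) for date, (X, y) in test_data.items()}

# Stand-in model (hypothetical): predict +1 when the first feature is positive
def sign_rule(row):
    return 1 if row[0] > 0 else -1

# Hypothetical test months: date -> (feature rows, labels)
test_data = {
    "2018-01-31": ([[0.5], [-0.2], [0.1], [-0.7]], [1, -1, -1, -1]),
    "2018-02-28": ([[0.3], [0.4], [-0.1], [-0.9]], [1, -1, 1, -1]),
}
scores = monthly_accuracy(sign_rule, test_data)
mean_score = sum(scores.values()) / len(scores)
print(scores)
print(mean_score)
```

A per-month breakdown also shows how stable the accuracy is over time, which a single number for one month cannot.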

3. Automatic selection of the machine-learning model and parameters¶

# X_train, X_test, y_train, y_test = X_train, X_test, y_train, y_test

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_factor_classifier.py')
Generation 1 - Current best internal CV score: 0.5464587765396075
Generation 2 - Current best internal CV score: 0.5470016538692403
Generation 3 - Current best internal CV score: 0.5470016538692403
Generation 4 - Current best internal CV score: 0.5472036120795265
Generation 5 - Current best internal CV score: 0.5472036120795265

Best pipeline: XGBClassifier(GaussianNB(input_matrix), learning_rate=0.01, max_depth=9, min_child_weight=17, n_estimators=100, nthread=1, subsample=0.2)
0.5139664804469274
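The generations=5 and population_size=20 arguments describe an evolutionary search: a population of candidates is scored, the better ones survive, and mutated copies fill the next generation. The toy sketch below illustrates that loop over two hyperparameters. It is not TPOT's implementation (TPOT evolves whole pipelines and scores them by cross-validation); the fitness function here is made up and merely stands in for a CV score.

```python
import random

def evolve(fitness, sample, mutate, generations=5, population_size=20, seed=42):
    """Toy evolutionary search: keep the best half, refill with mutated copies."""
    rng = random.Random(seed)
    population = [sample(rng) for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness), history

# Made-up fitness standing in for a cross-validation score
def fitness(params):
    return -abs(params["max_depth"] - 9) - 100 * abs(params["learning_rate"] - 0.01)

def sample(rng):
    return {"max_depth": rng.randint(1, 12),
            "learning_rate": rng.choice([0.001, 0.01, 0.1, 0.5])}

def mutate(params, rng):
    child = dict(params)
    child["max_depth"] = min(12, max(1, child["max_depth"] + rng.choice([-1, 1])))
    return child

best, history = evolve(fitness, sample, mutate)
print(best)
print(history)
```

Because the survivors are carried over unchanged, the best score can never decrease from one generation to the next, which matches the shape of the "Current best internal CV score" log above.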
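# tpot_factor_classifier.py
# Note: this is a plain XGBClassifier. It omits the GaussianNB stacking step of the
# best pipeline reported above and uses max_depth=10 rather than 9.

training_features, testing_features, training_target, testing_target = X_train, X_test, y_train, y_test

exported_pipeline = XGBClassifier(learning_rate=0.01, max_depth=10, min_child_weight=17, n_estimators=100, nthread=1, subsample=0.2)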

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

accuracy = exported_pipeline.score(testing_features, testing_target)
print("accuracy:%s"%accuracy)
accuracy:0.5121042830540037
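An accuracy of about 0.51 on a two-class problem is close to a coin flip, so it helps to check how far it sits from 0.5 in standard-error terms. The sketch below does that arithmetic under a simple binomial model. The post does not state the test-set size; n = 537 is only an inference, being the smallest size consistent with the two reported accuracies (275/537 and 276/537), and should be replaced by len(y_test) when the data is at hand.

```python
import math

def accuracy_z_score(acc, n, baseline=0.5):
    """z-score of an observed accuracy against a baseline under a binomial model."""
    standard_error = math.sqrt(baseline * (1 - baseline) / n)
    return (acc - baseline) / standard_error

n = 537  # assumed test-set size, inferred from the reported fractions
z = accuracy_z_score(275 / n, n)
print(round(z, 2))
```

Under these assumptions the z-score is well below 2, so this single month's accuracy is not distinguishable from chance on its own.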
#%matplotlib inline
fig = plt.figure(figsize= (20, 12))

n_features = training_features.shape[1]
plt.barh(range(n_features),exported_pipeline.feature_importances_,align='center')
plt.yticks(np.arange(n_features),training_features.columns)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
[Figure: horizontal bar chart with one bar per factor; x-axis "Feature importance", y-axis "Feature"]
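With 23 factors, the chart is easier to read when the bars are sorted. A small sketch of ranking features by importance follows; the names and scores are hypothetical and are not the values from the figure. With the fitted model one would pass training_features.columns and exported_pipeline.feature_importances_ instead.

```python
def rank_features(names, importances, top=5):
    """Pair feature names with importances and return the `top` largest."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]

# Hypothetical importances, for illustration only
names = ["EP", "BP", "CMV", "RSI", "MACD"]
importances = [0.10, 0.25, 0.30, 0.20, 0.15]

for name, score in rank_features(names, importances, top=3):
    print(f"{name}: {score:.2f}")
```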

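4. Summary¶

Following the factor-processing approach in Lin Xuan's 《基于随机森林和XGBoost的多因子模型分析》,
this post uses the TPOT module, which is based on a genetic algorithm, to select the machine-learning model and its parameters automatically.

It offers a simple route for those just starting out with machine-learning multi-factor strategies:
there is no need to spend energy mastering every model and algorithm at the outset;
get a working strategy built first.

But there are no shortcuts around the fundamentals.
Sooner or later,
every debt comes due.

Let's keep each other going.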
