Predicting Single-Stock Price Direction with XGBoost

Posted by 谎言梦 on September 10 at 21:58

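This case study is for learning and reference only. If you use it for real investing, you do so at your own risk.

Case Study

Research goal: train an XGBoost model on the historical daily factor data of a single stock and predict whether the stock rises or falls on the next trading day.

Stock: Kweichow Moutai (600519)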

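Training set: features are daily factor data from January 2007 to December 2016; the label is the direction (up or down) of the next trading day.

Test set: features are daily factor data from January 2017 to December 2018; the label is the direction (up or down) of the next trading day.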

#[1]
# Import the required libraries
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
from time import time
from jqfactor import get_factor_values
from jqfactor import get_all_factors

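Fetching data from the JoinQuant factor library

Getting factor values

get_factor_values(securities, factors, start_date, end_date, count)

Parameters

securities: the stock universe, either a single stock (string) or a list of stocks

factors: factor names, either a single factor (string) or a list of factors

start_date: start date, a string or datetime object; mutually exclusive with count

end_date: end date, a string or datetime object; can be combined with either start_date or count

count: number of trading days up to and including end_date; mutually exclusive with start_date

Returns

A dict whose keys are factor names and whose values are pandas.DataFrame objects.

Each DataFrame is indexed by date, has one column per stock code, and holds the factor values.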
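Since the return value is one DataFrame per factor, a single-stock model needs those frames pivoted into one table with a column per factor. The sketch below illustrates that reshaping on a mock dict that only imitates the documented return shape. `mock_result`, `single_stock_features`, the second stock code and all numbers are invented for illustration; nothing here calls the JoinQuant API.

```python
import pandas as pd

# Mock of the documented return shape: {factor name: DataFrame(dates x stock codes)}.
# The numbers are invented for illustration only.
dates = pd.to_datetime(['2017-01-03', '2017-01-04', '2017-01-05'])
codes = ['600519.XSHG', '000001.XSHE']
mock_result = {
    'ROC6': pd.DataFrame([[1.0, 0.5], [1.2, 0.4], [0.9, 0.6]],
                         index=dates, columns=codes),
    'BIAS5': pd.DataFrame([[0.1, -0.2], [0.3, -0.1], [-0.4, 0.0]],
                          index=dates, columns=codes),
}


def single_stock_features(factor_dict, code):
    """Collect one stock's column from every factor frame into a dates x factors table."""
    return pd.DataFrame({name: frame[code] for name, frame in factor_dict.items()})


features = single_stock_features(mock_result, '600519.XSHG')
print(features)
```

The result has one row per date and one column per factor, which is the layout the data-retrieval cell in this post builds with its `for x in factors` loop.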

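Getting all factors

get_all_factors()

Parameters: none

Returns: a pandas.DataFrame with the following columns

factor: factor code

factor_intro: factor description

category: factor category name

category_intro: factor category description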

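Factor selection

Because the target is the next-day direction of a single stock, the factors chosen lean toward the short term.

Sentiment factors

VROC12, TVMA6, VR, DAVOL5, ARBR

Risk factors

Variance20, Skewness20, Kurtosis20, sharpe_ratio_20

Momentum factors

BBIC, BIAS5, CCI10, ROC6, single_day_VPT

Technical factors

MAC5, EMA5, MACDC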

#[2]
# Data retrieval
# Fetch the factor data (features)
train = pd.DataFrame()
test = pd.DataFrame()
factors = ['VROC12', 'TVMA6', 'VR', 'DAVOL5', 'ARBR',
           'Variance20', 'Skewness20', 'Kurtosis20', 'sharpe_ratio_20',
           'BBIC', 'BIAS5', 'CCI10', 'ROC6', 'single_day_VPT',
           'MAC5', 'EMA5', 'MACDC']
# 2017-01-02 falls in the New Year holiday and 2017-01-03 is a Tuesday;
# one extra day is fetched so the last training row still has a next-day close.
get_train_data = get_factor_values(normalize_code('600519'), factors, '2007-01-01', '2017-01-03')
get_test_data = get_factor_values(normalize_code('600519'), factors, '2017-01-03', '2019-01-02')
for x in factors:
    train[x] = get_train_data[x]['600519.XSHG']
    test[x] = get_test_data[x]['600519.XSHG']

# Fetch the price data used for the labels
# The price history runs one trading day past the feature window and is later turned
# into a direction label: next close minus current close, 1 if positive, 0 otherwise.
# Watch out for trading halts in the history!
# Volume is 0 on a halted day, so price and volume are fetched together and
# halted days are filtered out by volume.
# skip_paused: whether to skip non-trading dates (halts, dates before listing or after
# delisting). When it is False (the default), halted days are filled with the pre-halt
# data (see the paused attribute of SecurityUnitData) and dates before listing or after
# delisting are NaN. When it is True, only one stock can be requested at a time.
train_p = get_price('600519.XSHG', start_date='2007-01-01', end_date='2017-01-03',
                    frequency='daily', fields=['close', 'volume'], skip_paused=False, fq='pre')
test_p = get_price('600519.XSHG', start_date='2017-01-03', end_date='2019-01-02',
                   frequency='daily', fields=['close', 'volume'], skip_paused=False, fq='pre')

# Join features and prices on the date index
train = pd.concat([train, train_p], axis=1)
test = pd.concat([test, test_p], axis=1)

# Drop the rows with zero volume (halted days)
train = train[train.volume != 0]
test = test[test.volume != 0]
# print(train.shape[0])  # 2420 rows
# print(test.shape[0])   # 488 rows

#[3]
# Data preprocessing
# Build the label from the closing price: 1 if the next trading day's close is higher
# than today's, 0 otherwise.
# Labels are built from 'close'. (Slicing 'volume' here, as the posted code did, would
# label volume changes instead, and the outputs shown later in the post came from that run.)
def add_label(df):
    next_close = df['close'].shift(-1)
    # The last row has no next trading day, so it is dropped
    # (2419 training rows and 487 test rows remain).
    labelled = df.iloc[:-1].copy()
    labelled['y'] = (next_close.iloc[:-1] > labelled['close']).astype(int)
    # Drop the close and volume columns and the date index; only factors and y remain.
    return labelled.drop(columns=['close', 'volume']).reset_index(drop=True)

train = add_label(train)
test = add_label(test)
# print(train.head().append(train.tail()))
# print(test.head().append(test.tail()))
# The data is now fully prepared.
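The labelling step is easy to get wrong by one row, so here is a small worked example on invented bars. It is a sketch of the same idea (filter halted days by zero volume, then compare each close with the next traded day's close), not the post's exact code; the prices, volumes and dates are made up.

```python
import pandas as pd

# Toy daily bars; prices and volumes are invented. The third day is a halt:
# volume is 0 and the close is carried over from the previous day.
bars = pd.DataFrame(
    {'close': [10.0, 10.5, 10.5, 10.2, 10.8],
     'volume': [1000, 1200, 0, 900, 1500]},
    index=pd.to_datetime(['2017-01-03', '2017-01-04', '2017-01-05',
                          '2017-01-06', '2017-01-09']),
)

traded = bars[bars['volume'] != 0]        # drop halted days first
next_close = traded['close'].shift(-1)    # close of the next traded day
labelled = traded.iloc[:-1].copy()        # the last day has no next day
labelled['y'] = (next_close.iloc[:-1] > labelled['close']).astype(int)
print(labelled)
```

The halted day disappears before labelling, so 2017-01-04 is compared with 2017-01-06 rather than with the carried-over close of 2017-01-05.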

#[4]
# Model building
target = 'y'
IDcol = 'id'

# Use xgb.cv to find the best n_estimators
def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain,
                          num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds,
                          verbose_eval=False)
        # With early stopping, the number of rows in cvresult is the number of rounds kept
        alg.set_params(n_estimators=cvresult.shape[0])
        print(cvresult.shape[0])
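The modelfit helper relies on early stopping to choose n_estimators: boosting continues until the cross-validated AUC has not improved for early_stopping_rounds rounds, and the round count at the best score is kept. The following is a simplified stand-in for that rule in plain Python, not XGBoost's implementation. `best_round_count` and the AUC values are hypothetical.

```python
def best_round_count(scores, patience):
    """Return how many boosting rounds to keep under a simple early-stopping rule.

    scores[i] is the validation score (higher is better) after round i + 1.
    Training stops once `patience` consecutive rounds fail to beat the best score.
    """
    best_score = float('-inf')
    best_index = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_index = score, i
        elif i - best_index >= patience:
            break
    return best_index + 1


# Invented AUC values: the score peaks at round 4 and never improves afterwards.
aucs = [0.60, 0.63, 0.65, 0.66, 0.655, 0.66, 0.65, 0.64]
print(best_round_count(aucs, patience=3))
```

A larger patience tolerates longer plateaus before stopping, at the cost of more rounds being trained and then discarded.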

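The parameter tuning here is simplified; for a detailed tuning procedure, see the link at the end of the post.

The main parameters to tune are learning_rate, n_estimators, max_depth, min_child_weight, gamma, subsample and colsample_bytree.

Of these, learning_rate, n_estimators, max_depth and min_child_weight have the largest effect.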

# 1. Fix the learning rate at 0.1 first and tune n_estimators
start = time()
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=100,
    max_depth=3,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    seed=27)
modelfit(xgb1, train, predictors)
end = time()
time_elapsed = end - start
print('Training is end')
print('Training time is {} h.'.format(time_elapsed / 3600))

14
Training is end
Training time is 0.00031509545114305287 h.

# 2. Tune max_depth and min_child_weight
start = time()
param_test1 = {
    'max_depth': range(3, 10, 1),
    'min_child_weight': range(1, 5, 1)}
# iid=False matches the scikit-learn version used here; the iid argument was
# removed in scikit-learn 0.24, so drop it on newer versions.
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=14, max_depth=6,
                            min_child_weight=1, gamma=0, subsample=0.8,
                            colsample_bytree=0.8, objective='binary:logistic',
                            scale_pos_weight=1, nthread=4, seed=27),
    param_grid=param_test1, scoring='roc_auc', iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
print(gsearch1.best_params_, gsearch1.best_score_)
end = time()
time_elapsed = end - start
print('Training is end')
print('Training time is {} h.'.format(time_elapsed / 3600))

{'max_depth': 3, 'min_child_weight': 3} 0.6764277924684329
Training is end
Training time is 0.014299993382559882 h.
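To see how much work this grid search does, the parameter grid can be enumerated by hand. The sketch below uses only the standard library and the same two ranges; it illustrates the search space and does not reproduce how GridSearchCV schedules its fits.

```python
from itertools import product

param_grid = {
    'max_depth': range(3, 10, 1),        # 3, 4, ..., 9
    'min_child_weight': range(1, 5, 1),  # 1, 2, 3, 4
}
cv_folds = 5

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # number of parameter combinations
print(len(candidates) * cv_folds)  # model fits during cross-validation
print(candidates[0], candidates[-1])
```

There are 7 x 4 = 28 combinations, so 5-fold cross-validation trains 140 models, plus one final refit on the full training set under GridSearchCV's default refit=True. Adding a third parameter multiplies the count again, which is why parameters are usually tuned a few at a time.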

# 3. Prediction
# The classifier is named model so that it does not shadow the imported xgb module
model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=14,
    max_depth=3,
    min_child_weight=3,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    eval_metric='auc',
    seed=27)
# Train
model.fit(train[predictors], train[target])
# Predict
test['y_predict'] = model.predict(test[predictors])

/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:

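# Compute precision
# For a long-only strategy, precision is the metric that matters most
# precision = days that actually rose / days predicted to rise
precision = test[test.y_predict == 1][['y', 'y_predict']]
print(precision.sum())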

y            151.0
y_predict    237.0
dtype: float64

151/237 = 0.63713, so about 63.7% of the days predicted to rise actually rose.

As the numbers show, the result looks very good!
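The precision figure can be reproduced from the two reported counts alone. The sketch below defines precision in plain Python and rebuilds a label and prediction list that match those counts. `precision` is a hypothetical helper, and the lists are reconstructed from the totals, not taken from the actual test set.

```python
def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are truly positive."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted:
        raise ValueError('no positive predictions')
    return sum(1 for t in predicted if t == positive) / len(predicted)


# Rebuild the reported counts: 237 days predicted up, 151 of them actually up.
# Days predicted down are left out because they do not affect precision.
y_pred = [1] * 237
y_true = [1] * 151 + [0] * (237 - 151)
print(round(precision(y_true, y_pred), 5))
```

When reading this number, a useful reference point is the share of up days in the whole test set, which the post does not report: precision only shows skill to the extent that it exceeds that base rate.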

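Research directions worth exploring:

1. Study different stocks and combine them into a portfolio

2. Adjust the factors and the time horizons of the features and labels

3. Combine several strategies built on different horizons

4. Try multi-class classification

5. Build an ensemble with other models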
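As an illustration of idea 4, one possible multi-class label splits next-day returns into down, flat and up using a symmetric band around zero. This is only a sketch of the labelling step: `three_class_label`, the 0.5% threshold and the prices are hypothetical, and training on such labels would also need a multi-class objective such as XGBoost's multi:softmax.

```python
def three_class_label(ret, threshold=0.005):
    """Map a next-day return to 0 (down), 1 (flat) or 2 (up).

    `threshold` is a hypothetical cut-off, not a value from the post.
    """
    if ret > threshold:
        return 2
    if ret < -threshold:
        return 0
    return 1


closes = [100.0, 101.0, 100.9, 99.0, 99.2]  # invented prices
returns = [b / a - 1 for a, b in zip(closes, closes[1:])]
labels = [three_class_label(r) for r in returns]
print(labels)
```

The width of the flat band controls the class balance, so it would need to be chosen with the stock's typical daily volatility in mind.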

XGBoost parameter tuning guide

Some of the code in that guide needs to be modified because of library version updates.
