
Testing Factor Independence with the VIF Method

Posted by 技术性调整 on May 10 at 06:21, replies (1)

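Stock-selection strategies built on factor models are among the most widely used quantitative models in the equity market. In live trading, however, factor models often perform worse than expected. Possible causes include biased factor selection and market style rotation. Another significant cause is that the chosen factors may be highly multicollinear, which badly distorts the model's ability to explain stock prices and the market.
To avoid this pitfall from the very start of factor screening, this post introduces the VIF (variance inflation factor) test for checking linear dependence among factors, so that investors can pick more independent factors and improve the explanatory power of the factor model.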
I. Method
The VIF method is not hard to compute. The most widely used linear regression technique is ordinary least squares (OLS):
1.png
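One statistic for a regression's explanatory power is R^2 (the coefficient of determination), which measures how much of the dependent variable the explanatory variables account for. To test for linear dependence among factors, regress a single factor on the remaining factors by OLS. A small R^2 means the other factors explain little of this factor, so its linear dependence on them is weak.
Note: covariance is not used to measure correlation here because it describes only two variables at a time and is hard to apply when a factor depends on several others jointly.
VIF is computed as: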
VIF=1/(1-R^2)
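It follows that the higher the VIF, the more strongly the factor being tested is linearly related to the other factors. With an in-sample OLS fit, R^2 lies between 0 and 1, so VIF is never below 1.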
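To make the definition concrete, here is a minimal sketch on synthetic data (not the JoinQuant factors). Each column is regressed on the remaining columns by OLS, the in-sample R^2 is taken, and VIF = 1/(1 - R^2) follows. For orientation, R^2 = 0.5 gives VIF = 2 and R^2 = 0.9 gives VIF = 10. The helper name `vif_table` and the toy columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """Return the in-sample VIF of every column of df (illustrative helper)."""
    vifs = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        others = df.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept column so R^2 is measured around the mean of y.
        X = np.column_stack([np.ones(len(df)), others])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + 0.1 * rng.normal(size=500),  # nearly a copy of "a"
})
vifs = vif_table(toy)
print(vifs)
```

In this toy frame the near-duplicate columns "a" and "c" should show large VIFs, while the independent column "b" should stay close to 1.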
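II. Test in practice
Data source: the research module of the JoinQuant (聚宽) quantitative platform
Factors: EPS (earnings per share), ROE (return on equity), market_cap (market capitalization), pb (price-to-book ratio), 'net_profit_ratio', 'gross_income_ratio', 'quick_ratio', 'current_ratio' (the last four come from the JoinQuant factor library)
Sample period: 2012-03-04 to 2018-07-04
Sampling frequency: every two months
1. Fetch the data (for brevity, only the factor data of the first ten stocks on 2012-03-04 is shown)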
2.png
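2. Missing-value check (for brevity, only the check for 2013-03-04 is shown)
A return value of 0 means there are no missing values; any other number is the count of missing values.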
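A minimal sketch of such a per-column check on a toy frame (the values are made up for illustration). `isnull().sum()` counts the NaNs in each column. By contrast, masking a frame with its own null mask and then calling `count()` always reports 0, because the only cells the mask keeps are themselves NaN, so that form cannot detect gaps.

```python
import numpy as np
import pandas as pd

# Toy data with one missing EPS value (illustrative, not JoinQuant data).
toy = pd.DataFrame({
    "eps": [0.10, np.nan, 0.25],
    "roe": [5.0, 6.1, 7.2],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()

# Masking by the null mask keeps only NaN cells, so count() is always 0.
masked_count = toy[toy.isnull()].count()

print(missing_per_column)
print(masked_count)
```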
4.png
3. A preliminary look at the pairwise linear correlation between the factors (figure; for brevity, only the 2012-03-04 cross-section is shown)
3.png
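4. For each date, regress each factor on the other seven, compute its VIF, and plot the resulting time series.
In the stacked area chart, the vertical thickness of each layer is the VIF of the factor with that color, which makes it easier to see how the factors' VIFs move within the same period. The bar chart below it shows the actual VIF of each factor at each date.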
5.png
5. Compute the mean VIF of each factor over the sample period and compare them (figure)
6.png
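6. quick_ratio has a comparatively low mean VIF over the sample period. This suggests that when building a factor model from the other seven factors, adding quick_ratio is worth considering to improve the model's explanatory power.
III. Summary
The VIF procedure is: regress one factor on the remaining factors, take the R^2, compute the VIF, drop the factor with a high VIF and keep those with low VIFs, then repeat until the remaining factor set has low mutual correlation, which improves the model's explanatory power.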
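The elimination loop described above can be sketched as follows, on synthetic data rather than the JoinQuant factors. The helper names `max_vif` and `prune_by_vif` and the cutoff of 10 (a common rule of thumb, not a value from this post) are assumptions. The sketch relies on the identity that the in-sample VIF of factor j equals the j-th diagonal entry of the inverse of the factors' correlation matrix.

```python
import numpy as np
import pandas as pd

def max_vif(df):
    """Return (column name, VIF) of the column with the largest in-sample VIF."""
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    vifs = pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)
    return vifs.idxmax(), float(vifs.max())

def prune_by_vif(df, threshold=10.0):
    """Drop the highest-VIF column until every remaining VIF is <= threshold."""
    kept = df.copy()
    while kept.shape[1] > 1:
        worst, value = max_vif(kept)
        if value <= threshold:
            break
        kept = kept.drop(columns=worst)
    return list(kept.columns)

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=500),  # nearly collinear with x1
    "x3": rng.normal(size=500),
    "x4": rng.normal(size=500),
})
kept = prune_by_vif(toy)
print(kept)
```

Removing one factor at a time and recomputing matters because dropping a single member of a collinear pair usually brings the VIF of its partner back down, so both should not be discarded in one pass.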

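# VIF test on fundamental factors
# Factors: EPS (earnings per share), ROE (return on equity), market_cap (market capitalization), pb (price-to-book ratio),
# plus net_profit_ratio, gross_income_ratio, quick_ratio and current_ratio from the JoinQuant factor library
# Cross-sections are sampled every two months

# Import the required packages (jqlib, jqdata and jqfactor are only available on the JoinQuant platform)
from jqlib.alpha191 import *
from jqdata import *
from jqfactor import get_factor_values
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import seaborn as sns
# Cross-section dates to test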
section_time_list =\
    ['2012-03-4','2012-05-4','2012-07-4','2012-09-4','2012-11-4',
     '2013-01-4','2013-03-4','2013-05-4','2013-07-4','2013-09-4','2013-11-4',
     '2014-01-4','2014-03-4','2014-05-4','2014-07-4','2014-09-4','2014-11-4',
     '2015-01-4','2015-03-4','2015-05-4','2015-07-4','2015-09-4','2015-11-4',
     '2016-01-4','2016-03-4','2016-05-4','2016-07-4','2016-09-4','2016-11-4',
     '2017-01-4','2017-03-4','2017-05-4','2017-07-4','2017-09-4','2017-11-4',
     '2018-01-4','2018-03-4','2018-05-4','2018-07-4']
# Factor lists
factor_list=['market_cap','roe','pb_ratio','eps','net_profit_ratio','gross_income_ratio','quick_ratio','current_ratio']
jq_factors_list=['net_profit_ratio','gross_income_ratio','quick_ratio','current_ratio']
# Fetch the factor data for one cross-section date
def get_data(date):
    # Constituents of the CSI 500 index (000905.XSHG) on that date
    stock_list=get_index_stocks('000905.XSHG',date)
    q=query(valuation.market_cap,
        indicator.eps,
        indicator.roe,
        valuation.pb_ratio).filter(valuation.code.in_(stock_list))
    basic_data=get_fundamentals(q,date=date)
    for jq_factor in jq_factors_list:
        jq_factor_data=get_factor_values(securities=stock_list,factors=[jq_factor],count=1,end_date=date)
        # Missing factor values are filled with 0
        jq_factor_data_T=jq_factor_data[jq_factor].fillna(0).T
        # Positional assignment: assumes get_fundamentals returns rows in the same order as stock_list
        basic_data[jq_factor]=np.array(jq_factor_data_T).ravel()
    return basic_data

get_data(section_time_list[0])
market_cap eps roe pb_ratio net_profit_ratio gross_income_ratio quick_ratio current_ratio
0 54.00 0.1450 5.47 1.93 0.215392 0.457954 0.144542 1.531034
1 60.83 0.2800 6.62 4.51 0.022461 0.089590 0.852220 1.074051
2 41.07 0.0221 1.49 2.69 0.085047 0.329022 0.457541 2.785999
3 88.15 0.0100 0.33 1.98 0.188118 0.452052 0.303899 1.793942
4 39.34 0.1394 1.38 1.56 0.200929 0.540953 0.719650 2.017241
5 49.10 0.0904 4.02 3.72 0.055783 0.118679 0.883676 1.085571
6 72.80 -0.0070 -0.29 2.31 0.007663 0.062202 0.827424 1.249140
7 48.09 0.0136 1.13 6.07 0.016489 0.170078 0.796266 0.938501
8 75.15 0.0617 2.01 1.61 1.219728 0.639925 19.300528 19.301698
9 78.26 0.1040 2.74 1.14 0.355288 0.467369 4.024287 4.027870
10 37.72 0.1050 1.31 1.23 0.044867 0.198108 0.322511 1.277412
11 43.09 0.0762 2.37 2.53 0.145118 0.263430 2.353097 3.899223
12 32.49 0.0020 0.07 1.35 0.010466 0.069785 0.676482 1.012771
13 50.18 0.1700 4.55 2.48 0.768213 0.377087 1.260983 2.276709
14 40.20 0.0277 1.32 1.57 0.101427 0.384387 1.145158 1.957869
15 73.73 0.0543 0.83 2.92 0.062170 0.292490 1.225126 1.756879
16 27.53 -0.0030 -0.18 2.24 0.120120 0.062046 0.668394 0.908931
17 44.89 0.0500 1.93 2.88 0.010377 0.233992 0.542034 1.004627
18 80.54 0.1809 4.20 3.48 0.068372 0.170332 1.203481 1.396956
19 55.20 -0.0600 -2.09 3.52 -0.120436 0.019324 1.104876 1.191519
20 33.43 0.0430 2.14 2.21 0.070713 0.522932 0.522797 0.737109
21 110.53 0.0800 1.11 0.82 0.055009 0.173988 0.617493 0.941973
22 80.09 0.1100 2.78 3.93 0.038092 0.201108 0.400259 0.508944
23 64.72 -0.0170 -1.23 5.25 0.060796 0.388021 6.874510 6.882813
24 85.62 0.1468 10.15 7.14 0.177485 0.442212 0.572834 1.467243
25 45.08 -0.0128 -0.69 4.35 0.012347 0.065120 0.696403 0.808753
26 35.68 0.0200 1.33 2.26 0.035291 0.281098 0.470726 8.291899
27 65.83 0.2000 2.20 2.32 0.133983 0.550317 1.602611 1.929152
28 40.86 0.1537 4.43 1.64 0.249397 0.453249 1.164449 2.391665
29 80.77 0.1846 9.27 3.50 0.263965 0.474153 0.133982 1.611620
... ... ... ... ... ... ... ... ...
470 39.43 0.0330 1.49 2.16 0.061089 0.269733 0.720025 1.989341
471 87.20 0.0600 2.38 2.44 0.183927 0.441672 0.768582 0.797425
472 35.74 -0.1236 -13.61 8.51 -0.122628 0.111468 0.509977 0.642962
473 82.14 0.1300 4.31 3.98 0.292805 0.498387 1.873542 2.021262
474 58.38 0.0470 0.63 2.01 0.046705 0.255282 0.808122 1.116555
475 76.41 0.1713 2.03 2.15 0.164176 0.266511 1.546993 1.893138
476 59.44 -0.4400 -12.34 4.15 -0.011689 0.054945 0.411461 0.867553
477 46.63 0.0007 0.01 1.03 0.024242 0.177050 0.499658 1.156041
478 31.38 0.1142 1.86 1.02 0.044399 0.172218 0.861439 1.256691
479 55.21 0.0400 1.22 1.35 0.081401 0.284569 1.934426 2.444061
480 47.21 0.1203 5.00 3.59 0.082528 0.352909 1.028499 1.317065
481 88.81 -0.1500 -3.38 3.81 -0.024403 0.238980 0.403675 0.610682
482 52.62 0.1000 2.84 4.56 0.105951 0.384703 3.026115 3.600359
483 38.81 0.0800 3.40 3.50 0.087083 0.251744 0.419133 0.429512
484 77.49 0.1031 2.65 1.68 0.157187 0.333460 0.617765 0.622123
485 83.30 0.0110 0.45 4.91 0.049168 0.197041 0.617641 1.382637
486 102.77 0.0207 0.99 1.95 0.009979 0.052245 0.363577 0.972469
487 83.68 0.1300 2.04 2.53 0.000000 0.000000 1.688232 1.864534
488 58.94 0.1802 2.53 2.08 0.000000 0.000000 1.256081 1.886795
489 125.07 0.1204 4.19 1.34 0.378189 0.543521 1.075514 1.094820
490 70.76 0.0955 1.53 2.61 0.114261 0.458238 3.892662 4.777018
491 62.44 0.1600 2.39 2.88 0.000000 0.000000 6.972776 7.606473
492 93.20 0.1400 2.35 3.42 0.000000 0.000000 2.283927 2.784575
493 133.95 0.2200 3.35 3.37 0.000000 0.000000 3.874019 4.729325
494 104.38 0.0200 0.92 1.09 0.101727 0.446680 0.491672 2.258437
495 86.46 0.1550 2.83 2.35 0.141038 0.266123 2.175106 2.394436
496 82.40 0.1100 2.15 1.73 0.046260 0.170945 1.259560 1.521701
497 109.75 0.1400 3.46 2.93 0.128177 0.344216 2.001598 2.330632
498 127.03 0.0400 1.49 1.01 0.194723 0.378254 1.800274 1.822746
499 48.87 0.0300 1.03 2.99 0.087382 0.245859 1.577228 2.232452

500 rows × 8 columns

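# Missing-value check: a nonzero count means that column has missing values
def check_data(date):
    checking_data = get_data(date)
    # isnull().sum() counts the NaNs per column; masking with isnull() and calling count() would always give 0
    print(checking_data.isnull().sum())
for date in section_time_list:
    check_data(date)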
market_cap            0
eps                   0
roe                   0
pb_ratio              0
net_profit_ratio      0
gross_income_ratio    0
quick_ratio           0
current_ratio         0
dtype: int64
(the same all-zero output repeats for each of the remaining 38 dates)
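# Print the pairwise correlations between factors and draw regression scatter plots with a 95% confidence band
# (for performance reasons, only market_cap against the other factors on the 2012-03-04 cross-section is shown)
def plot_corr_liner(date,aim_factor):
    bas_date=get_data(date)
    copy_list=factor_list.copy()
    copy_list.remove(aim_factor)
    # `height` is called `size` in seaborn versions before 0.9
    grid=sns.pairplot(bas_date, x_vars=copy_list, y_vars=[aim_factor], height=7, aspect=0.8, kind='reg')
    # Save the pairplot itself; calling plt.figure() first would save a new, empty figure
    grid.savefig("pairplot.jpg")
    plt.show()
    print(bas_date.corr())
plot_corr_liner(section_time_list[0],'market_cap')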
                    market_cap       eps       roe  pb_ratio  \
market_cap            1.000000  0.382850  0.304767  0.126085   
eps                   0.382850  1.000000  0.767290 -0.041757   
roe                   0.304767  0.767290  1.000000 -0.062851   
pb_ratio              0.126085 -0.041757 -0.062851  1.000000   
net_profit_ratio      0.134832  0.406943  0.401854 -0.043624   
gross_income_ratio    0.126780  0.287163  0.392501  0.020595   
quick_ratio           0.013875  0.057032  0.016200  0.009170   
current_ratio        -0.001858  0.066956  0.028383  0.009244   

                    net_profit_ratio  gross_income_ratio  quick_ratio  \
market_cap                  0.134832            0.126780     0.013875   
eps                         0.406943            0.287163     0.057032   
roe                         0.401854            0.392501     0.016200   
pb_ratio                   -0.043624            0.020595     0.009170   
net_profit_ratio            1.000000            0.535282     0.249716   
gross_income_ratio          0.535282            1.000000     0.258236   
quick_ratio                 0.249716            0.258236     1.000000   
current_ratio               0.259855            0.297282     0.981574   

                    current_ratio  
market_cap              -0.001858  
eps                      0.066956  
roe                      0.028383  
pb_ratio                 0.009244  
net_profit_ratio         0.259855  
gross_income_ratio       0.297282  
quick_ratio              0.981574  
current_ratio            1.000000  
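As a cross-check, the in-sample VIFs for this single cross-section can be read off the correlation matrix printed above, using the identity that the VIF of factor j is the j-th diagonal entry of the inverse correlation matrix. This is a sketch under stated assumptions: the numbers are copied from the printed output (rounded to six decimals) and describe 2012-03-04 only, and the variable names are illustrative. Since quick_ratio and current_ratio have a pairwise correlation of 0.981574, each must have an in-sample VIF of at least 1/(1 - 0.981574^2), which is about 27.

```python
import numpy as np
import pandas as pd

names = ["market_cap", "eps", "roe", "pb_ratio", "net_profit_ratio",
         "gross_income_ratio", "quick_ratio", "current_ratio"]

# Correlation matrix copied from the printed output for 2012-03-04.
corr = np.array([
    [ 1.000000,  0.382850,  0.304767,  0.126085,  0.134832,  0.126780,  0.013875, -0.001858],
    [ 0.382850,  1.000000,  0.767290, -0.041757,  0.406943,  0.287163,  0.057032,  0.066956],
    [ 0.304767,  0.767290,  1.000000, -0.062851,  0.401854,  0.392501,  0.016200,  0.028383],
    [ 0.126085, -0.041757, -0.062851,  1.000000, -0.043624,  0.020595,  0.009170,  0.009244],
    [ 0.134832,  0.406943,  0.401854, -0.043624,  1.000000,  0.535282,  0.249716,  0.259855],
    [ 0.126780,  0.287163,  0.392501,  0.020595,  0.535282,  1.000000,  0.258236,  0.297282],
    [ 0.013875,  0.057032,  0.016200,  0.009170,  0.249716,  0.258236,  1.000000,  0.981574],
    [-0.001858,  0.066956,  0.028383,  0.009244,  0.259855,  0.297282,  0.981574,  1.000000],
])

# In-sample VIF of each factor = diagonal of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=names)
print(vif.round(2))
```

These values are not directly comparable with a VIF built from an out-of-sample R^2 on a held-out half of the data, which is a different quantity and can fall below 1.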
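# Run a multiple linear regression of each factor on all the others to measure its linear dependence on them
def sqr_R_check(date,aim_factor):
    # Prepare the data: split into target and features
    test_data=get_data(date)
    test_data_y=test_data[aim_factor]
    copy_list=factor_list.copy()
    copy_list.remove(aim_factor)
    test_data_x=test_data[copy_list]
    X_train,X_test,Y_train,Y_test = train_test_split(test_data_x,test_data_y,train_size=0.5)
    # Fit the regression model
    model = LinearRegression()
    model.fit(X_train,Y_train)
    # R^2 on the held-out half. Unlike the in-sample R^2 of the textbook VIF, this score can be
    # negative, in which case 1/(1-R^2) falls below 1
    score = model.score(X_test,Y_test)
    return date,aim_factor,score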
# Create the dictionary that collects the results
factor_analist_dict={}
for date in section_time_list:
    factor_analist_dict[date]={}
# Fill in the VIF of every factor at every date
for date in section_time_list:
    for factor in factor_list:
        current_date,which_factor,factor_score=sqr_R_check(date,factor)
        factor_analist_dict[current_date][which_factor]=1/(1-factor_score)
# Tabulate the VIF of each factor by date and plot it as a stacked area chart and a bar chart
VIF_df=pd.DataFrame(factor_analist_dict)
VIF_analist=VIF_df.T
VIF_analist.plot(kind='area')
VIF_analist.plot(kind='bar')
# Compute the mean VIF of each factor, print it and plot it
factor_mean={}
for factor in factor_list:
    factor_mean[factor]=np.mean(VIF_analist[factor])
print(factor_mean)
factor_name=list(factor_mean.keys())
factor_num=list(factor_mean.values())
positions=range(len(factor_name))
plt.bar(positions,factor_num,color='lightsteelblue')
plt.plot(positions,factor_num,color='coral',marker='o')
plt.xticks(positions,factor_name)
plt.xlabel('Factor')
plt.ylabel('Mean VIF')
plt.ylim((0.9, 1.05))
plt.show()
{'market_cap': 0.9791344476471046, 'roe': 0.9845565423454138, 'pb_ratio': 0.9880541689858433, 'eps': 0.9838680494450781, 'net_profit_ratio': 0.9620307170525094, 'gross_income_ratio': 0.9613276189415978, 'quick_ratio': 0.9445548497217996, 'current_ratio': 0.9628903353358331}
 
 
 
 
 
 
 
