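Genetic algorithms are well worth learning for quantitative work. They can be applied to feature selection, stock selection, and the optimization of non-convex functions. They tend to converge reasonably quickly and, with suitable parameter settings, are less prone to getting stuck in local optima.
Genetic algorithms resemble neural networks in some respects: both are inspired by phenomena observed in the real world, and both improve through a large number of iterations.
This post implements a genetic algorithm in code that you can take and use directly. To adapt it to a different application, only the fitness function needs to change; the rest of the code can stay as it is.
Most of the code in the first part only fetches the data and does basic preprocessing, so there is no need to read it closely. The algorithm itself is implemented in the final part.
The final result is the number of times each feature remains in the population after the iterations finish. The higher the count, the stronger that "gene" and the better the corresponding feature.
I have not plotted the results or compared them with other algorithms; that is left to interested readers.
Feature selection is an essential tool in quantitative work, and the genetic algorithm is only one of several feature selection methods.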
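01 What is a genetic algorithm?
1.1 Formal definition of the genetic algorithm
A genetic algorithm (GA) is a computational model of biological evolution that simulates the natural selection and genetic mechanisms of Darwin's theory of evolution. It searches for an optimal solution by simulating the process of natural evolution.
Its main characteristics are as follows. It operates directly on structured objects, with no requirement that the objective be differentiable or continuous. It has implicit parallelism and good global search ability. It uses a probabilistic search method that needs no fixed rules: it automatically acquires and guides the search space and adaptively adjusts the search direction.
A genetic algorithm works on all the individuals in a population and uses randomization to guide an efficient search of an encoded parameter space. Selection, crossover, and mutation are its genetic operations. Five elements form its core: parameter encoding, initialization of the population, design of the fitness function, design of the genetic operations, and setting of the control parameters.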
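1.2 How a genetic algorithm runs (following Baidu Baike)
A genetic algorithm starts from a population that represents a set of potential solutions to the problem. A population consists of a certain number of individuals encoded by genes. Each individual is in fact an entity whose characteristics are carried by a chromosome.
The chromosome, the main carrier of genetic material, is a collection of genes. Its internal representation (the genotype) is a particular combination of genes, and this combination determines the outward traits of the individual (the phenotype). For example, black hair is determined by a certain combination of genes in the chromosome that controls that trait. The first step is therefore encoding, that is, mapping from phenotype to genotype. Because imitating real genetic coding is very complex, it is usually simplified, for example to binary encoding.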
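As a small illustration of binary encoding in the feature-selection setting of this post, each gene can be read as an on/off flag for one feature. The helper names `encode` and `decode` and the three-feature list are hypothetical, chosen only for this sketch; they are not part of the code later in the post.

```python
# Sketch: binary encoding of a feature subset (all names are illustrative only).
FEATURES = ["RSI6", "OSC", "WR"]  # three factor names borrowed from this post


def encode(selected, features=FEATURES):
    """Phenotype -> genotype: a list of 0/1 flags, one gene per feature."""
    return [1 if name in selected else 0 for name in features]


def decode(chromosome, features=FEATURES):
    """Genotype -> phenotype: the names of the features whose gene is 1."""
    return [name for name, gene in zip(features, chromosome) if gene == 1]


chromosome = encode({"RSI6", "WR"})
print(chromosome)          # [1, 0, 1]
print(decode(chromosome))  # ['RSI6', 'WR']
```

With this encoding, crossover and mutation only ever manipulate lists of 0s and 1s, and decoding the best chromosome at the end gives the selected feature names.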
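Once the initial population has been generated, it evolves generation by generation according to the principle of survival of the fittest, producing better and better approximate solutions. In each generation, individuals are selected according to their fitness in the problem domain, and the genetic operators borrowed from natural genetics, crossover and mutation, are applied to produce a population that represents a new set of solutions.
As in natural evolution, this process makes later generations better adapted to the environment than earlier ones. The best individual in the final generation, after decoding, can be taken as an approximate optimal solution to the problem.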
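The loop described above (select by fitness, cross over, mutate, repeat) can be sketched in a few lines of plain Python. This is a toy example under stated assumptions, not the implementation used later in this post: the fitness function simply counts the 1-genes, and the function names and default parameters are made up for illustration.

```python
import random


def fitness(chromosome):
    """Toy fitness (an assumption for this sketch): the number of 1-genes."""
    return sum(chromosome)


def select(population, rng):
    """Roulette-wheel selection: sample with probability proportional to fitness."""
    weights = [fitness(c) + 1e-9 for c in population]  # epsilon avoids all-zero weights
    return [list(c) for c in rng.choices(population, weights=weights, k=len(population))]


def crossover(parent, mate, rate, rng):
    """Uniform crossover: with probability `rate`, take a random subset of genes from mate."""
    if rng.random() >= rate:
        return list(parent)
    return [m if rng.random() < 0.5 else p for p, m in zip(parent, mate)]


def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def run_ga(n_genes=20, pop_size=50, cross_rate=0.8, mutation_rate=0.01,
           n_generations=60, seed=0):
    rng = random.Random(seed)
    # Initial population: random binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(n_generations):
        parents = select(population, rng)
        population = [
            mutate(crossover(p, rng.choice(parents), cross_rate, rng), mutation_rate, rng)
            for p in parents
        ]
    # "Decoding" is trivial here: the best chromosome is the answer.
    return max(population, key=fitness)


best = run_ga()
print(fitness(best), best)
```

Swapping in a different `fitness` function is the only change needed to point the same loop at another problem.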
1.3 Genetic algorithm process diagram
import datetime
import gc

import numpy as np
import pandas as pd
import lightgbm as lgb
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest, SelectPercentile, SelectFromModel, chi2, f_classif, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVC, LinearSVC, LinearSVR, SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from jqdata import *
from jqlib.technical_analysis import *
start_date = '2016-01-01'
end_date = '2018-11-01'
trade_days = get_trade_days(start_date=start_date, end_date=end_date).tolist()
date = trade_days[0]
lookback = 5  # length of the LSTM time axis
stocks = '000905.XSHG'  # CSI 500 index
# Prepare the technical factor data
def get_factors_one_stock(stocks, date):
    '''
    Get one day of factor data for a single security.
    input:
        stocks: a single security code
        date: the date
    output:
        df: DataFrame holding one day's value of each factor
    '''
    if not isinstance(date, str):
        date = datetime.datetime.strftime(date, '%Y-%m-%d')
    price = get_price(stocks, end_date=date, count=1)
    price.index = [date]
    accer = ACCER(stocks, check_date=date, N=5)
    accer_df = pd.DataFrame(list(accer.values()), columns=['ACCER'])
    # ADTM: dynamic buying and selling momentum
    adtm, maadtm = ADTM(stocks, date, N=23, M=8)
    adtm_df = pd.DataFrame(list(adtm.values()), columns=['ADTM'])
    maadtm_df = pd.DataFrame(list(maadtm.values()), columns=['MAADTM'])
    # ATR: true range
    mtr, atr = ATR(stocks, date, timeperiod=14)
    mtr_df = pd.DataFrame(list(mtr.values()), columns=['MTR'])
    atr_df = pd.DataFrame(list(atr.values()), columns=['ATR'])
    # BIAS: deviation rate
    bias, bias_ma = BIAS_QL(stocks, date, M=6)
    bias_df = pd.DataFrame(list(bias.values()), columns=['BIAS'])
    # CCI: commodity channel index
    cci = CCI(stocks, date, N=14)
    cci_df = pd.DataFrame(list(cci.values()), columns=['CCI'])
    # DKX: bull-bear line
    dkx, madkx = DKX(stocks, date, M=10)
    dkx_df = pd.DataFrame(list(dkx.values()), columns=['DKX'])
    # SKDJ: stochastic indicator
    k, d = SKDJ(stocks, date, N=9, M=3)
    k_df = pd.DataFrame(list(k.values()), columns=['KBJ'])
    # CYE: market trend
    cye, _ = CYE(stocks, date)
    cye_df = pd.DataFrame(list(cye.values()), columns=['CYE'])
    # MFI: money flow index
    mfi = MFI(stocks, date, timeperiod=14)
    mfi_df = pd.DataFrame(list(mfi.values()), columns=['MFI'])
    # MTM: momentum line
    mtm = MTM(stocks, date, timeperiod=14)
    mtm_df = pd.DataFrame(list(mtm.values()), columns=['MTM'])
    # EMV: ease of movement
    emv, maemv = EMV(stocks, date, N=14, M=9)
    emv_df = pd.DataFrame(list(emv.values()), columns=['EMV'])
    maemv_df = pd.DataFrame(list(maemv.values()), columns=['MAEMV'])
    # ROC: rate of change
    roc = ROC(stocks, date, timeperiod=12)
    roc_df = pd.DataFrame(list(roc.values()), columns=['ROC'])
    # RSI: relative strength index
    rsi = RSI(stocks, date, N1=6)
    rsi_df = pd.DataFrame(list(rsi.values()), columns=['RSI'])
    # MARSI: moving average of the relative strength index
    rsi10, rsi6 = MARSI(stocks, date, M1=10, M2=6)
    rsi10_df = pd.DataFrame(list(rsi10.values()), columns=['RSI10'])
    rsi6_df = pd.DataFrame(list(rsi6.values()), columns=['RSI6'])
    # OSC: oscillator (rate-of-change line)
    osc, maosc = OSC(stocks, date, N=20, M=6)
    osc_df = pd.DataFrame(list(osc.values()), columns=['OSC'])
    maosc_df = pd.DataFrame(list(maosc.values()), columns=['MAOSC'])
    # UDL: gravity line
    udl, maudl = UDL(stocks, date, N1=3, N2=5, N3=10, N4=20, M=6)
    udl_df = pd.DataFrame(list(udl.values()), columns=['UDL'])
    maudl_df = pd.DataFrame(list(maudl.values()), columns=['MAUDL'])
    wr, mawr = WR(stocks, date, N=10, N1=6)
    wr_df = pd.DataFrame(list(wr.values()), columns=['WR'])
    mawr_df = pd.DataFrame(list(mawr.values()), columns=['MAWR'])
    # FSL: watershed line
    fsl, mafsl = FSL(stocks, date)
    fsl_df = pd.DataFrame(list(fsl.values()), columns=['FSL'])
    mafsl_df = pd.DataFrame(list(mafsl.values()), columns=['MAFSL'])
    # Trend indicators
    cho, macho = CHO(stocks, date, N1=10, N2=20, M=6)
    cho_df = pd.DataFrame(list(cho.values()), columns=['CHO'])
    macho_df = pd.DataFrame(list(macho.values()), columns=['MACHO'])
    dif, difma = DMA(stocks, date, N1=10, N2=50, M=10)
    dif_df = pd.DataFrame(list(dif.values()), columns=['DIF'])
    difma_df = pd.DataFrame(list(difma.values()), columns=['DIFMA'])
    # Energy indicators
    # BRAR: relative strength
    br, ar = BRAR(stocks, date, N=26)
    br_df = pd.DataFrame(list(br.values()), columns=['BR'])
    ar_df = pd.DataFrame(list(ar.values()), columns=['AR'])
    cr, M1, M2, M3, M4 = CR(stocks, date, N=26, M1=10, M2=20, M3=40, M4=62)
    cr_df = pd.DataFrame(list(cr.values()), columns=['CR'])
    mass, mamass = MASS(stocks, date, N1=9, N2=25, M=6)
    mass_df = pd.DataFrame(list(mass.values()), columns=['MASS'])
    mamass_df = pd.DataFrame(list(mamass.values()), columns=['MAMASS'])
    # Volume indicators
    amo, amo1, amo2 = AMO(stocks, date, M1=5, M2=10)
    amo_df = pd.DataFrame(list(amo.values()), columns=['AMO'])
    amo1_df = pd.DataFrame(list(amo1.values()), columns=['AMO1'])
    amo2_df = pd.DataFrame(list(amo2.values()), columns=['AMO2'])
    # Put all factors side by side in a single row, then prepend the price columns
    df = pd.concat([accer_df, adtm_df, maadtm_df, mtr_df, atr_df, bias_df, cci_df, dkx_df, k_df,
                    cye_df, mfi_df, mtm_df, emv_df, roc_df, rsi_df, rsi10_df, rsi6_df, osc_df,
                    udl_df, maudl_df, wr_df, mawr_df, fsl_df, mafsl_df, cho_df, macho_df, dif_df,
                    difma_df, cr_df, mass_df, mamass_df, amo_df, amo1_df, amo2_df, br_df, ar_df],
                   axis=1)
    df.index = [date]
    df = pd.concat([price, df], axis=1)
    return df
def get_data_from_date(start_date, end_date, stocks):
    '''Get the factor data for every trading day in the date range.'''
    trade_date = get_trade_days(start_date=start_date, end_date=end_date)
    # one single-row DataFrame per trading day, stacked in date order
    frames = [get_factors_one_stock(stocks, date) for date in trade_date]
    return pd.concat(frames)

data = get_data_from_date(start_date, end_date, stocks)
def get_day_profit(stocks, end_date, start_date=None, count=-1, pre_num=1):
    '''
    Get the daily return as a binary label.
    input:
        stocks: list or Series, security codes
        start_date: start date
        end_date: end date
        count: alternative to start_date, number of bars to look back
        pre_num: int, number of days over which the return is computed
    output:
        profit: DataFrame indexed by date; 1 if the return is greater than 0, otherwise 0
    '''
    if count == -1:
        price = get_price(stocks, start_date, end_date, fields=['close'])['close']
    else:
        price = get_price(stocks, end_date=end_date, count=count, fields=['close'])['close']
    profit = price.pct_change(periods=pre_num).dropna()
    profit[profit > 0] = 1
    profit[profit < 0] = 0
    profit = profit.to_frame()
    profit.columns = ['profit_dis']
    return profit

profit_dis = get_day_profit(stocks, start_date=start_date, end_date=end_date)
def get_day_profit_data(stocks, end_date, start_date=None, count=-1, pre_num=1):
    '''
    Get the daily return.
    input:
        stocks: list or Series, security codes
        start_date: start date
        end_date: end date
        count: alternative to start_date, number of bars to look back
        pre_num: int, number of days over which the return is computed
    output:
        profit: DataFrame indexed by date, values are the returns
    '''
    if count == -1:
        price = get_price(stocks, start_date, end_date, fields=['close'])['close']
    else:
        price = get_price(stocks, end_date=end_date, count=count, fields=['close'])['close']
    profit = price.pct_change(periods=pre_num).dropna()
    profit = profit.to_frame()
    profit.columns = ['profit']
    return profit

profit = get_day_profit_data(stocks, start_date=start_date, end_date=end_date)
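The return and label logic in the two functions above can be tried without the JoinQuant data API. The sketch below uses a short, made-up closing-price series (the prices and dates are assumptions for illustration only) and shows what `pct_change` and the binarization step produce.

```python
import pandas as pd

# Hypothetical closing prices, used only for illustration.
close = pd.Series(
    [100.0, 102.0, 101.0, 101.0, 104.0],
    index=pd.to_datetime(
        ["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07", "2016-01-08"]
    ),
)

# One-day simple return: close[t] / close[t-1] - 1.
# The first row has no previous close, so it is NaN and is dropped.
ret = close.pct_change(periods=1).dropna()

# Binary label: 1 for a positive return, 0 otherwise (a zero return stays 0).
label = (ret > 0).astype(int)

print(pd.DataFrame({"profit": ret, "profit_dis": label}))
```

Note that the return on a given date uses that date's close, so it describes the same day as the factor row it is later joined with, not the following day.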
# convert the timestamp index to 'YYYY-MM-DD' strings so it aligns with the factor data
profit.index = [ind.strftime('%Y-%m-%d') for ind in profit.index]
data_concat = pd.concat([data, profit], axis=1).dropna(axis=1, how='all').dropna()
length = len(data_concat.columns)
data_x = data_concat.iloc[:, -20:length-1]  # the last 19 factor columns
data_y = data_concat.iloc[:, -1]  # the daily return
def winsorize_and_standarlize(data, qrange=[0.05, 0.95], axis=0):
    # clip values outside the given quantiles, then standardize to zero mean and unit std
    if isinstance(data, pd.DataFrame):
        if axis == 0:
            q_down = data.quantile(qrange[0])
            q_up = data.quantile(qrange[1])
            col = data.columns
            for n in col:
                data[n][data[n] > q_up[n]] = q_up[n]
                data[n][data[n] < q_down[n]] = q_down[n]
            data = (data - data.mean()) / data.std()
            data = data.fillna(0)
        else:
            # transpose, process column by column, then transpose back
            data = data.stack()
            data = data.unstack(0)
            q_down = data.quantile(qrange[0])
            q_up = data.quantile(qrange[1])
            col = data.columns
            for n in col:
                data[n][data[n] > q_up[n]] = q_up[n]
                data[n][data[n] < q_down[n]] = q_down[n]
            data = (data - data.mean()) / data.std()
            data = data.stack().unstack(0)
            data = data.fillna(0)
    elif isinstance(data, pd.Series):
        q_down = data.quantile(qrange[0])
        q_up = data.quantile(qrange[1])
        data[data > q_up] = q_up
        data[data < q_down] = q_down
        data = (data - data.mean()) / data.std()
    return data
data_x_ws = winsorize_and_standarlize(data_x, qrange=[0.05, 0.95], axis=0)
/opt/conda/envs/python3new/lib/python3.6/site-packages/pandas/core/generic.py:5233: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
/opt/conda/envs/python3new/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
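The SettingWithCopyWarning above most likely comes from the chained assignments of the form `data[n][mask] = value` being applied to a slice of another DataFrame. A possible alternative, sketched here on synthetic data, clips with `DataFrame.clip` and returns a new object instead of writing into the input. The function name `winsorize_and_standardize_clip` and the demo data are assumptions for illustration; only the column-wise (`axis=0`) case is covered.

```python
import numpy as np
import pandas as pd


def winsorize_and_standardize_clip(df, qrange=(0.05, 0.95)):
    """Clip each column to its [q_low, q_high] quantiles, then z-score it.

    Works on new objects, so the caller's DataFrame is left untouched.
    """
    q_low = df.quantile(qrange[0])
    q_high = df.quantile(qrange[1])
    # axis=1 aligns the per-column thresholds with the DataFrame's columns
    clipped = df.clip(lower=q_low, upper=q_high, axis=1)
    standardized = (clipped - clipped.mean()) / clipped.std()
    return standardized.fillna(0)  # constant columns give 0/0 = NaN; map them to 0


rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])  # synthetic data
demo.iloc[0, 0] = 50.0  # inject an outlier
result = winsorize_and_standardize_clip(demo)
print(result.describe().loc[["mean", "std", "min", "max"]])
```

After clipping, the injected outlier no longer dominates column `a`, and every column ends up with mean 0 and standard deviation 1.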
data_x_ws.head()
Date | RSI10 | RSI6 | OSC | UDL | MAUDL | WR | MAWR | CHO | MACHO | DIF | DIFMA | CR | MASS | MAMASS | AMO | AMO1 | AMO2 | BR | AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2016-01-05 | 0.240408 | -0.818806 | -2.180019 | 1.138539 | 1.146303 | 1.146993 | 1.121545 | 0.688089 | 1.535792 | 0.620108 | 1.202686 | -0.900730 | -0.509484 | -1.040983 | 1.996295 | 1.873522 | 1.842088 | -1.024723 | -0.736332 |
2016-01-06 | -0.071387 | -0.948186 | -2.180019 | 1.138539 | 1.146303 | 0.700084 | 0.598374 | 0.120644 | 1.238894 | 0.276379 | 1.127550 | -0.364645 | -0.250806 | -0.862669 | 1.996295 | 1.873522 | 1.842088 | -0.568865 | -0.219512 |
2016-01-07 | -0.427502 | -1.330653 | -2.180019 | 1.138539 | 1.146303 | 1.655010 | 1.678230 | -0.411413 | 0.867082 | -0.265048 | 0.987504 | -0.906148 | 0.068983 | -0.639812 | -1.609895 | 1.355824 | 1.842088 | -1.241708 | -0.928860 |
2016-01-08 | -0.724765 | -1.713604 | -2.180019 | 1.138539 | 1.146303 | 1.126899 | 1.093218 | -0.892163 | 0.398542 | -0.752588 | 0.785831 | -1.016249 | 0.438663 | -0.358436 | 1.996295 | 1.416721 | 1.842088 | -0.914497 | -1.026373 |
2016-01-11 | -1.099147 | -1.811362 | -2.180019 | 1.138539 | 1.146303 | 1.655010 | 1.678230 | -1.792080 | -0.200601 | -1.420329 | 0.498514 | -1.260926 | 0.777603 | -0.034617 | 1.996295 | 1.581907 | 1.842088 | -1.438331 | -1.337636 |
class GAFeatureSelection(object):
    def __init__(self, DNA_size, pop_size, cross_rate, mutation_rate, n_generations):
        self.DNA_size = DNA_size  # number of genes, one per candidate feature
        self.pop_size = pop_size  # population size
        self.cross_rate = cross_rate  # crossover rate
        self.mutation_rate = mutation_rate  # mutation rate
        self.n_generations = n_generations  # number of generations, i.e. iterations
        # each row is one individual: a random 0/1 mask over the features
        self.pop = np.random.randint(2, size=(pop_size, DNA_size))

    def r2_for_data(self, x, y):
        # in-sample R2 of a linear regression of y on x
        model = LinearRegression()
        model.fit(x, y)
        y_pre = model.predict(x)
        R2 = r2_score(y, y_pre)
        return R2

    def get_R2_fitness(self, pop, x, y):
        length = len(pop)
        r2_list = []
        for i in range(length):
            select_x = x * pop[i]  # zero out the columns whose gene is 0
            R2 = self.r2_for_data(select_x, y)
            r2_list.append(R2)
        # shift so the worst individual has fitness 0, then normalize to probabilities
        r2_new = np.array(r2_list) - min(r2_list)
        # if every R2 is identical, the sum is 0 and this division produces NaN
        prob = r2_new / sum(r2_new)
        return prob

    def select(self, pop, fitness):
        # roulette-wheel selection: resample the population in proportion to fitness
        idx = np.random.choice(np.arange(self.pop_size), size=self.pop_size, replace=True, p=fitness)
        return pop[idx]

    def crossover(self, parent, pop):
        if np.random.rand() < self.cross_rate:
            i_ = np.random.randint(0, self.pop_size, size=1)  # pick a random mate
            cross_points = np.random.randint(0, 2, size=self.DNA_size).astype(bool)
            parent[cross_points] = pop[i_, cross_points]  # copy the mate's genes at the crossover points
        return parent

    def mutate(self, child):
        # flip each gene independently with probability mutation_rate
        for point in range(self.DNA_size):
            if np.random.rand() < self.mutation_rate:
                child[point] = 1 if child[point] == 0 else 0
        return child

    def genetic_algorithm(self, x, y):
        for _ in range(self.n_generations):
            fitness = self.get_R2_fitness(self.pop, x, y)
            self.pop = self.select(self.pop, fitness)
            pop_copy = self.pop.copy()
            for parent in self.pop:
                child = self.crossover(parent, pop_copy)
                child = self.mutate(child)
                parent[:] = child  # parent is replaced by its child
        # count how many individuals in the final population keep each feature
        df = pd.DataFrame(self.pop, columns=x.columns)
        df_sum = df.sum().sort_values(ascending=False)
        return df_sum
DNA_size = len(data_x_ws.columns)
pop_size = 100
cross_rate = 0.8
mutation_rate = 0.003
n_generations = 100
ga = GAFeatureSelection(DNA_size, pop_size, cross_rate, mutation_rate, n_generations)
res = ga.genetic_algorithm(data_x_ws, data_y)
print(res)
/opt/conda/envs/python3new/lib/python3.6/site-packages/ipykernel_launcher.py:25: RuntimeWarning: invalid value encountered in true_divide
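The RuntimeWarning above is consistent with the fitness normalization dividing zero by zero: once every individual in the population has the same R2 (for example after the population has converged to a single chromosome), the shifted scores sum to 0 and the probabilities become NaN. One possible guard, sketched below, falls back to a uniform distribution in that case. The function name `scores_to_probabilities` and the `eps` threshold are assumptions for illustration and are not part of the class above.

```python
import numpy as np


def scores_to_probabilities(scores, eps=1e-12):
    """Map raw fitness scores to selection probabilities that sum to 1.

    Scores are shifted so the minimum becomes 0. If every score is equal
    (the total is ~0), return a uniform distribution instead of 0/0.
    """
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.min()
    total = shifted.sum()
    if total <= eps:
        return np.full(len(scores), 1.0 / len(scores))
    return shifted / total


print(scores_to_probabilities([0.10, 0.10, 0.10]))  # uniform fallback
print(scores_to_probabilities([0.10, 0.20, 0.40]))
```

The result can be passed as the `p` argument of `np.random.choice`, which expects non-negative probabilities that sum to 1. Note that, as in the original normalization, the weakest individual still receives probability 0 whenever the scores differ.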