数据是最宝贵的资源,首先感谢joinquant团队提供的优质数据服务.
这篇帖子主要介绍项目数据处理方面的思路及如何使用jqdata数据来做扩展.
获取数据,目前无非两种方式,爬取和购买.
而无论哪种方式,在本地化方面都需要做不少工作,这个工作的好坏从某种程度上决定了后面整个分析和交易系统的成败.
根据自己的经验,本地化在数据结构上至少要做到:
1.数据schema的稳定性
整个金融市场的属性是稳定的,你需要关注的东西是稳定的,比如k线,比如财务报表,比如资金流.2.schema字段的确定性
举个栗子,资产负债表里面无形资产字段,joinquant为intangible_assets,eastmoney为Intangibleasset,如果你直接依赖provider提供的字段来进行后续的计算,将有极大的风险,一家provider不可用,将使整个程序不可用。因此必须将相应的provider的数据转为自己定义的格式。3.多provider支持
前面的两个稳定性决定了多provider支持的可行,并为数据的完整性提供了保障,当某家provider数据不能满足需求时,可以多provider互补。当然,还可以利用多provider对数据进行交叉验证。
将各provider提供(或者自己爬取)的数据变成符合data schema的数据需要做好以下几点:
1.初始化要抓取的标的
可抓取单标的来调试,然后抓取全量标的2.能够从上次抓取的地方接着抓
减少不必要的请求,增量抓取3.封装常用的请求方式
对时间序列数据的请求,无非start,end,size,time list的组合4.能够自动去重
5.能够设置抓取速率
6.提供抓取完成的回调函数
方便数据校验和多provider数据补全
流程图如下:
下面介绍zvt项目的数据结构及如何添加jqdata数据支持
1. 数据结构
相关概念及关系
1.1 provider
代表数据提供商,比如joinquant,eastmoney,sina,netease
1.2 store category
数据的逻辑分类,物理上代表一个db,其下面一般有多个data schema,schema间可能有关系
1.3 data schema
数据的结构描述,物理上代表一个table
逻辑视图
物理视图
一般来说,data schema是稳定的,有些数据需要多个provider来一起生成,这时也认为数据只属于某个provider;某类数据有多个provider时,可以相互验证,api上只需要指定相应的provider即可
2. 如何添加provider支持
下面以joinquant来举个栗子
2.1 添加joinquant provider
代码
class Provider(enum.Enum):EASTMONEY = 'eastmoney'SINA = 'sina'NETEASE = 'netease'EXCHANGE = 'exchange'JOINQUANT = 'joinquant'
2.2 添加store category
代码
目前定义的类别,需要扩展的在其基础上添加
class StoreCategory(enum.Enum):meta = 'meta'#个股日线行情数据stock_day_kdata = 'stock_day_kdata'index_day_kdata = 'index_day_kdata'finance = 'finance'dividend_financing = 'dividend_financing'holder = 'holder'trading = 'trading'money_flow = 'money_flow'macro = 'macro'business = 'business'category_map_db = { StoreCategory.meta: MetaBase,#个股日线行情数据StoreCategory.stock_day_kdata: StockDayKdataBase, StoreCategory.index_day_kdata: IndexDayKdataBase, StoreCategory.finance: FinanceBase, StoreCategory.dividend_financing: DividendFinancingBase, StoreCategory.holder: HolderBase, StoreCategory.trading: TradingBase, StoreCategory.money_flow: MoneyFlowBase, StoreCategory.macro: MacroBase, StoreCategory.business: BusinessBase, }
2.3 添加data schema
代码
个股日线行情数据结构
class StockDayKdata(StockDayKdataBase):__tablename__ = 'stock_day_kdata'id = Column(String(length=128), primary_key=True)provider = Column(String(length=32))timestamp = Column(DateTime)security_id = Column(String(length=128))code = Column(String(length=32))name = Column(String(length=32)) # level = Column(Enum(TradingLevel, values_callable=enum_value))level = Column(String(length=32))open = Column(Float)hfq_open = Column(Float)qfq_open = Column(Float)close = Column(Float)hfq_close = Column(Float)qfq_close = Column(Float)high = Column(Float)hfq_high = Column(Float)qfq_high = Column(Float)low = Column(Float)hfq_low = Column(Float)qfq_low = Column(Float)volume = Column(Float)turnover = Column(Float)change_pct = Column(Float)turnover_rate = Column(Float)factor = Column(Float)
2.4 关联provider和相应的store category
代码
支持其他数据时,在此扩展
provider_map_category = { Provider.JOINQUANT:[StoreCategory.stock_day_kdata], Provider.JOINQUANT.value:[StoreCategory.stock_day_kdata] }
2.5 实现相应的recorder
代码
核心代码
#将聚宽数据转换为标准zvt数据class MyApiWrapper(ApiWrapper):def request(self, url=None, method='get', param=None, path_fields=None): security_item = param['security_item'] start_timestamp = param['start_timestamp']# 不复权df = get_price(to_jq_security_id(security_item), start_date=to_time_str(start_timestamp), end_date=now_time_str(), frequency='daily', fields=['open', 'close', 'low', 'high', 'volume', 'money'], skip_paused=True, fq=None) df.index.name = 'timestamp'df.reset_index(inplace=True) df['name'] = security_item.name df.rename(columns={'money': 'turnover'}, inplace=True) df['timestamp'] = pd.to_datetime(df['timestamp']) df['provider'] = Provider.JOINQUANT.value df['level'] = param['level']return df.to_dict(orient='records')#补全复权数据def on_finish(self, security_item): kdatas = get_kdata(security_id=security_item.id, level=self.level.value, order=StockDayKdata.timestamp.asc(), return_type='domain', session=self.session, filters=[StockDayKdata.hfq_close.is_(None), StockDayKdata.timestamp >= to_pd_timestamp('2005-01-01')])if kdatas:start = kdatas[0].timestampend = kdatas[-1].timestamp# get hfq from joinquantdf = get_price(to_jq_security_id(security_item), start_date=to_time_str(start), end_date=now_time_str(), frequency='daily', fields=['factor', 'open', 'close', 'low', 'high'], skip_paused=True, fq='post')if df is not None and not df.empty:# fill hfq datafor kdata in kdatas:if kdata.timestamp in df.index:kdata.hfq_open = df.loc[kdata.timestamp, 'open'] kdata.hfq_close = df.loc[kdata.timestamp, 'close'] kdata.hfq_high = df.loc[kdata.timestamp, 'high'] kdata.hfq_low = df.loc[kdata.timestamp, 'low'] kdata.factor = df.loc[kdata.timestamp, 'factor']self.session.commit() latest_factor = df.factor[-1]# factor not change yet, no need to reset the qfq pastif latest_factor == self.current_factors.get(security_item.id): sql = 'UPDATE stock_day_kdata SET qfq_close=hfq_close/{},qfq_high=hfq_high/{}, qfq_open= hfq_open/{}, qfq_low= hfq_low/{} where ' \ 'security_id=\'{}\' and level=\'{}\' and (qfq_close isnull or qfq_high isnull or qfq_low isnull or qfq_open isnull)'.format( latest_factor, latest_factor, latest_factor, latest_factor, security_item.id, self.level.value)else:sql = 'UPDATE stock_day_kdata SET qfq_close=hfq_close/{},qfq_high=hfq_high/{}, qfq_open= hfq_open/{}, qfq_low= hfq_low/{} where ' \ 'security_id=\'{}\' and level=\'{}\''.format(latest_factor, latest_factor, latest_factor, latest_factor, security_item.id, self.level.value)self.logger.info(sql)self.session.execute(sql)self.session.commit()# TODO:use netease provider to get turnover_rateself.logger.info('use netease provider to get turnover_rate')
这里留了一个练习,由于聚宽的数据没有提供换手率和当日涨跌幅,可以通过其他数据源补全或者自己计算的方式来完成.
网易的数据没有复权信息,通过聚宽的factor来补全,同理,可以用网易的换手率,涨跌幅数据来补全聚宽数据.
参考代码
2.6 运行recorder
在settings设置自己的jqdata账户和密码
jqdata目前免费使用一年,注册地址如下
https://www.joinquant.com/default/index/sdk?f=home&m=banner
if __name__ == '__main__': init_process_log('jq_china_stock_day_kdata.log') ChinaStockDayKdataRecorder(level=TradingLevel.LEVEL_1DAY, codes=['300027']).run()
这里codes填写需要抓取的标的,如果不设置codes就是全市场抓取。
3. 获得的能力
添加一种数据源后,天然就获得相应的api,factor,selector和trader的能力,这里展示使用聚宽的数据的能力
In [1]: from zvt.api.technical import * In [2]: from zvt.api.domain import * In [3]: df1=get_kdata(security_id='stock_sz_300027', provider='joinquant',start_timestamp='2019-01-01',limit=10) In [4]: df1 id provider timestamp security_id code name level open hfq_open qfq_open close hfq_close qfq_close high hfq_high qfq_high low hfq_low qfq_low volume turnover change_pct turnover_rate factor0 stock_sz_300027_2019-01-02 joinquant 2019-01-02 stock_sz_300027 300027 华谊兄弟 1d 4.54 68.58 4.539918 4.40 66.47 4.400238 4.58 69.19 4.580299 4.35 65.71 4.349927 29554330.0 1.306117e 08 None None 15.1061 stock_sz_300027_2019-01-03 joinquant 2019-01-03 stock_sz_300027 300027 华谊兄弟 1d 4.40 66.47 4.400238 4.42 66.77 4.420098 4.45 67.22 4.449887 4.36 65.86 4.359857 15981569.0 7.052363e 07 None None 15.1062 stock_sz_300027_2019-01-04 joinquant 2019-01-04 stock_sz_300027 300027 华谊兄弟 1d 4.36 65.86 4.359857 4.52 68.28 4.520058 4.54 68.58 4.539918 4.33 65.41 4.330068 17103081.0 7.657399e 07 None None 15.1063 stock_sz_300027_2019-01-07 joinquant 2019-01-07 stock_sz_300027 300027 华谊兄弟 1d 4.54 68.58 4.539918 4.59 69.34 4.590229 4.63 69.94 4.629948 4.48 67.67 4.479677 16163938.0 7.383168e 07 None None 15.1064 stock_sz_300027_2019-01-08 joinquant 2019-01-08 stock_sz_300027 300027 华谊兄弟 1d 4.59 69.34 4.590229 4.60 69.49 4.600159 4.66 70.39 4.659738 4.56 68.88 4.559778 10908603.0 5.034655e 07 None None 15.1065 stock_sz_300027_2019-01-09 joinquant 2019-01-09 stock_sz_300027 300027 华谊兄弟 1d 4.63 69.94 4.629948 4.58 69.19 4.580299 4.73 71.45 4.729909 4.58 69.19 4.580299 16901976.0 7.881876e 07 None None 15.1066 stock_sz_300027_2019-01-10 joinquant 2019-01-10 stock_sz_300027 300027 华谊兄弟 1d 4.63 69.94 4.629948 4.61 69.64 4.610089 4.76 71.90 4.759698 4.59 69.34 4.590229 20855469.0 9.717176e 07 None None 15.1067 stock_sz_300027_2019-01-11 joinquant 2019-01-11 stock_sz_300027 300027 华谊兄弟 1d 4.60 69.49 4.600159 4.67 70.55 4.670330 4.67 70.55 4.670330 4.56 68.88 4.559778 13216260.0 6.089670e 07 None None 15.1068 stock_sz_300027_2019-01-14 joinquant 2019-01-14 stock_sz_300027 300027 华谊兄弟 1d 4.63 69.94 4.629948 4.57 69.03 4.569707 4.65 70.24 4.649808 4.55 68.73 4.549848 12421993.0 5.705187e 07 None None 15.1069 stock_sz_300027_2019-01-15 joinquant 2019-01-15 stock_sz_300027 300027 华谊兄弟 1d 4.56 68.88 4.559778 4.64 70.09 4.639878 4.66 70.39 4.659738 4.54 68.58 4.539918 14403671.0 6.637258e 07 None None 15.106#跟网易的数据比较In [24]: df2=get_kdata(security_id='stock_sz_300027', provider='netease',start_timestamp='2019-01-01',limit=10) In [25]: df2 Out[25]: id provider timestamp security_id code name level open hfq_open qfq_open close hfq_close qfq_close high hfq_high qfq_high low hfq_low qfq_low volume turnover change_pct turnover_rate factor0 stock_sz_300027_2019-01-02 netease 2019-01-02 stock_sz_300027 300027 华谊兄弟 1d 4.54 68.58 4.539918 4.40 66.47 4.400238 4.58 69.19 4.580299 4.35 65.71 4.349927 29554330.0 1.306117e 08 -6.1834 1.0652 15.1061 stock_sz_300027_2019-01-03 netease 2019-01-03 stock_sz_300027 300027 华谊兄弟 1d 4.40 66.47 4.400238 4.42 66.77 4.420098 4.45 67.22 4.449887 4.36 65.86 4.359857 15981569.0 7.052363e 07 0.4545 0.5760 15.1062 stock_sz_300027_2019-01-04 netease 2019-01-04 stock_sz_300027 300027 华谊兄弟 1d 4.36 65.86 4.359857 4.52 68.28 4.520058 4.54 68.58 4.539918 4.33 65.41 4.330068 17103081.0 7.657399e 07 2.2624 0.6164 15.1063 stock_sz_300027_2019-01-07 netease 2019-01-07 stock_sz_300027 300027 华谊兄弟 1d 4.54 68.58 4.539918 4.59 69.34 4.590229 4.63 69.94 4.629948 4.48 67.67 4.479677 16163938.0 7.383168e 07 1.5487 0.5826 15.1064 stock_sz_300027_2019-01-08 netease 2019-01-08 stock_sz_300027 300027 华谊兄弟 1d 4.59 69.34 4.590229 4.60 69.49 4.600159 4.66 70.39 4.659738 4.56 68.88 4.559778 10908603.0 5.034655e 07 0.2179 0.3932 15.1065 stock_sz_300027_2019-01-09 netease 2019-01-09 stock_sz_300027 300027 华谊兄弟 1d 4.63 69.94 4.629948 4.58 69.19 4.580299 4.73 71.45 4.729909 4.58 69.19 4.580299 16901976.0 7.881876e 07 -0.4348 0.6092 15.1066 stock_sz_300027_2019-01-10 netease 2019-01-10 stock_sz_300027 300027 华谊兄弟 1d 4.63 69.94 4.629948 4.61 69.64 4.610089 4.76 71.90 4.759698 4.59 69.34 4.590229 20855469.0 9.717176e 07 0.6550 0.7517 15.1067 stock_sz_300027_2019-01-11 netease 2019-01-11 stock_sz_300027 300027 华谊兄弟 1d 4.60 69.49 4.600159 4.67 70.55 4.670330 4.67 70.55 4.670330 4.56 68.88 4.559778 13216260.0 6.089670e 07 1.3015 0.4763 15.1068 stock_sz_300027_2019-01-14 netease 2019-01-14 stock_sz_300027 300027 华谊兄弟 1d 4.63 69.94 4.629948 4.57 69.03 4.569707 4.65 70.24 4.649808 4.55 68.73 4.549848 12421993.0 5.705187e 07 -2.1413 0.4477 15.1069 stock_sz_300027_2019-01-15 netease 2019-01-15 stock_sz_300027 300027 华谊兄弟 1d 4.56 68.88 4.559778 4.64 70.09 4.639878 4.66 70.39 4.659738 4.54 68.58 4.539918 14403671.0 6.637258e 07 1.5317 0.5191 15.106
比较两家数据
In [26]: df1.loc[:,['open','close','high','low','volume']]-df2.loc[:,['open','close','high','low','volume']] Out[26]: open close high low volume0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.03 0.0 0.0 0.0 0.0 0.04 0.0 0.0 0.0 0.0 0.05 0.0 0.0 0.0 0.0 0.06 0.0 0.0 0.0 0.0 0.07 0.0 0.0 0.0 0.0 0.08 0.0 0.0 0.0 0.0 0.09 0.0 0.0 0.0 0.0 0.0
嗯,两家的数据是一致的,数据的准确性得到进一步的确认.